Datacentre Incident
Incident Report for FireText
Postmortem

Incident Report for November 15th 2016

On Tuesday, Nov 15, starting at 08:38 GMT, FireText suffered serious outages across both our primary and secondary data centres. We take the reliability of our systems very seriously; it’s our number one priority. We’ve written this post-mortem to give you full disclosure of what happened, what went wrong, what we did right, and what we’re doing to ensure this never happens again.

We also want to apologise for the outage. It shouldn’t have lasted as long as it did. Or at all.

Background

The core FireText application and infrastructure are hosted in two geographically separate locations, here in the UK. The systems are designed to provide backup across each of these locations.

The Outage

At 08:33, our monitoring tools detected an outage at our primary data centre. On investigation, this was found to be an HV mains failure at the primary site, affecting both feeds, and is believed to be a local supply issue from the grid. The on-site backup generators started automatically, as per their planned operating procedure, but then went into fault. Power was eventually restored to some servers in the data centre, and an automated script slowly powered up the majority of the remaining servers.

During this time, a procedural switchover to our secondary site was interrupted by a separate networking issue at that data centre, which our provider did not resolve in its usual quick time. We therefore took the decision to start deploying the affected services with a separate provider and data centre, knowing full well that this could take some time to complete. With no concrete ETA for the resolution of either incident, we believe it was the right decision. The secondary site’s network connectivity was restored some hours later, but our services faulted when we resumed the switchover, leading to only a partial restoration of our customer-facing services. By this point we had, of course, also started spinning up our services at the additional data centre.

We continuously requested an ETA for power restoration from the primary data centre. We understand this was difficult to provide, but without any confirmed ETA we continued with the deployment to the separate provider and data centre.

We experienced a number of minor issues deploying the affected services onto new instances from scratch, and these caused significant delays. At approximately 16:00, we received a call to say that our servers would likely be powered up at any moment. With this new information, we moved some resources to preparing for and monitoring the services at our primary location as they came back to life. This unfortunately took longer than any of us anticipated, and in hindsight we could have better used that time to finish spinning up the new instances and restoring backups.

At 18:25, power started to be restored to the servers in the primary data centre. At this point, all hands moved to ensuring that every service started correctly. It took approximately 25 minutes for all instances to regain power and for us to confirm everything was working, before the initial backlog and queue of messages was cleared by 19:18.

Summary

Most importantly, we would once again like to extend our sincere apologies for this incident and the inconvenience caused, and to thank you for your patience. As a team, we have learnt a number of lessons from this incident. They include, but are not limited to: the risks of relying on third parties for backup and resilience; the need to introduce standby instances at other locations; the necessity of deeper communication with our providers; the decisions we took during this process; and preparing for and overcoming issues so that new instances can be provisioned promptly and seamlessly across multiple environments.

I am proud of how quickly the team responded to changing circumstances and of the rational decisions taken. Those decisions gave us a fallback solution that would likely have been available very shortly after power was in fact restored. The entire team worked flat out and at full capacity, and for that we are extremely grateful to each individual. The time it took to have this third solution in place will be reviewed internally, and we’ll be making sure that every playbook is fully prepared for a variety of hosting solutions.

We will also be holding a full review with our hosting company. Our hosting provider has been extremely robust for many years, with very high accreditation for security and redundancy. We still believe this to be the case, and a great deal will be learnt from this incident by them, by their wider customer base, and by us. Although we take full responsibility for providing a continued service to our customers, we will be looking for input from our hosting provider: quite simply, the hardware redundancies in place should not have made an event of this nature possible.

Our focus over the coming days, weeks and months is to review our disaster procedures and further build on our resilience with additional redundancies. As ever, we hope to restore confidence for all affected through continued availability.

The FireText Team

Posted Nov 16, 2016 - 15:09 GMT

Resolved
We're marking this as resolved. However, we'll continue to monitor over the coming hours, days and weeks. We want to, again, extend a sincere apology for this incident today and the inconvenience it may have caused. We'll conduct our own post-mortem, as well as one with the data centre, and post an update in the coming days.
Posted Nov 15, 2016 - 20:18 GMT
Monitoring
All services are operational and being monitored closely. The backlog of outbound messages has been processed. The queue of inbound messages and Delivery Receipts is being processed in accordance with the network retry patterns. We would like to extend a sincere apology for this incident today, and we'll be following this with a full post-mortem in the coming days.
Posted Nov 15, 2016 - 19:19 GMT
Update
The primary data centre has been restored. Access to the App, Email to SMS and API has been restored and is under review. All message attempts are now being queued and will be processed in turn. All API attempts made during the outage will have received a 503 error and as such will not have been queued. Inbound messages have also been queued at network level and will be processed as required.
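
If your integration does not already retry requests that were rejected with a 503 during the outage, the minimal sketch below shows one way to resend them safely. It is illustrative only: the endpoint URL, parameter names and API key are placeholders rather than our actual API, so please check your own integration and our API documentation for the correct values.

```python
import time
import requests

# Placeholder endpoint and key for illustration only -- not the real FireText API.
SEND_URL = "https://sms.example.com/send"
API_KEY = "your-api-key"

def send_sms(to, message, max_retries=5, backoff_seconds=30):
    """Send an SMS, retrying when the service returns 503.

    A 503 response means the request was rejected and NOT queued,
    so resending the same message later is safe.
    """
    payload = {"apiKey": API_KEY, "to": to, "message": message}
    for attempt in range(1, max_retries + 1):
        response = requests.post(SEND_URL, data=payload, timeout=10)
        if response.status_code == 503:
            # Service unavailable: wait with a simple linear backoff, then retry.
            time.sleep(backoff_seconds * attempt)
            continue
        response.raise_for_status()  # raise on any other HTTP error
        return response.text         # success: return the provider's response body
    raise RuntimeError(f"Message to {to} not sent after {max_retries} attempts")
```

Because a 503 means the message was never queued, retrying the same request will not result in duplicate sends.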

Further updates will follow shortly. Thank you for your patience.
Posted Nov 15, 2016 - 18:59 GMT
Update
Progress has indeed been significant in the last hour. One team is resolving some minor challenges with the deployment of an additional server. The secondary data centre is partially restored, and we are being told that the primary data centre should be restored within the next 90-120 minutes. Of course, we'll keep you updated on any improvements before then.
Posted Nov 15, 2016 - 18:31 GMT
Update
Progress continues on restoring the affected data centres. We are in the final stages of configuring services via an additional provider. We hope to make significant progress in the coming hour. We will of course keep you updated. Thanks again for your patience.
Posted Nov 15, 2016 - 17:19 GMT
Update
Work is taking place simultaneously on restoring the third-party primary and secondary data centres, as well as on an alternative with a different provider. We do not currently expect the API or App to be functional before 5pm today. We are, however, doing everything we possibly can. We thank you for your continued patience.
Posted Nov 15, 2016 - 16:09 GMT
Update
Progress to restore services across our data centres is ongoing. Thank you for your continued patience.
Posted Nov 15, 2016 - 15:13 GMT
Update
We are making significant progress across both sites simultaneously. Thank you for your continued patience.
Posted Nov 15, 2016 - 14:16 GMT
Update
We are still working closely with our provider to restore services across our data centres. Thank you for your continued patience.
Posted Nov 15, 2016 - 13:13 GMT
Update
We are still working with our provider to restore access to all services. Thank you for your patience.
Posted Nov 15, 2016 - 12:05 GMT
Update
Services have been partially restored. We hope to have the remaining services, namely the App and API, fully restored shortly. Thank you again for your patience :)
Posted Nov 15, 2016 - 11:04 GMT
Update
We are experiencing an outage at both our primary and secondary data centres. We are working closely with our provider to get this resolved as soon as possible. Thank you for your patience :)
Posted Nov 15, 2016 - 10:22 GMT
Update
Our data centre provider has reported a power failure at our primary location. All scheduled messages will be sent once service is restored. All inbound messages and delivery receipts will also be processed after this time. API requests during this period will not be successful. Thank you for your patience :)
Posted Nov 15, 2016 - 09:25 GMT
Identified
Our monitoring systems indicate that our services are currently unreachable. We’re working with our data centre to rectify this as soon as possible.
Posted Nov 15, 2016 - 08:54 GMT