On Tuesday, Nov 15, starting at 08:38 GMT, FireText suffered serious outages across both our primary and secondary data centres. We take the reliability of our systems very seriously; it’s our number one priority. We’ve written this post-mortem to give you full disclosure on what happened, what went wrong, what we did right and what we’re doing to ensure this never happens again.
We also want to apologise for the outage. It shouldn’t have lasted as long as it did, or happened at all.
The core FireText application and infrastructure is hosted in two geographically separate locations, here in the UK. The systems are designed to provide backup across both of these locations.
At 08:33, our monitoring tools detected an outage at our primary data centre. On investigation, this was caused by an HV mains failure at the primary site, on both feeds. It is believed to be a local supply issue from the grid. The backup generators on site started automatically, as per their planned operating procedure, but went into a fault state. Power was eventually restored to some servers in the data centre, and an automated script slowly powered up the majority of the remaining servers.
During this time, a procedural switchover to our secondary site was interrupted by a separate data centre networking issue, which our provider did not resolve as quickly as usual. With no concrete ETA for the resolution of either incident, we took the decision to start deploying the affected services onto a separate provider and data centre, knowing full well that this might take some time to complete. We believe it was the right decision. The secondary site’s network connectivity was restored some hours later, but our services faulted when we resumed the switchover, leading to only a partial restoration of our front-facing services.
We continually requested an ETA for power restoration from the primary data centre. We understand this was difficult to provide, but without a confirmed ETA we continued deploying to the separate provider and data centre.
We experienced a number of minor issues deploying the affected services onto new instances from scratch, and these caused significant delays. At approximately 16:00, we received a call to say that our servers would likely be powered up at any moment. This new information meant we moved some resources to preparing for and monitoring the services at our primary location as they came back to life. This unfortunately took longer than any of us anticipated, and in hindsight we could have better used that time to finish spinning up the new instances and restoring backups.
At 18:25, power started to be restored to the servers in the primary data centre. At this point, all hands turned to ensuring every service started correctly. It took approximately 25 minutes for all instances to regain power; we then verified everything was working before clearing the initial backlog and queue of messages by 19:18.
Most importantly, we would once again like to extend our sincere apologies for this incident and the inconvenience caused, and to thank you for your patience. As a team, we have learnt a number of lessons from this incident. They include, but are not limited to: our reliance on third parties for backup and resilience; the need to introduce standby instances at other locations; the necessity of deeper communication with our providers; the decisions we took during the incident; and preparing for, and overcoming issues with, prompt and seamless provisioning of new instances across multiple environments.
I am proud of how quickly the team responded to changing circumstances and of the rational decisions taken. These decisions gave us a fallback solution that would likely have been available very shortly after power was restored. The entire team was flat out and working at full capacity, and for that we are extremely grateful to each individual. The time taken to bring this third solution online will be reviewed internally, and we’ll be making sure that every playbook is fully prepared for a variety of hosting solutions.

We will also be holding a full review with our hosting company. Our hosting provider has been extremely robust for many years, with high accreditation for security and redundancy. We still believe this to be the case, and much will be learnt from this incident by them, their wider customer base, and us. Although we take full responsibility for providing a continued service to our customers, we will be looking for input from our hosting provider: quite simply, the hardware redundancies should not have made an event of this nature possible. Our focus over the coming days, weeks and months is to review our disaster procedures and further build on our resilience with additional redundancies. As before, we hope to restore confidence for all affected through continued availability.
The FireText Team