Network Connectivity Issue

Incident Report for FireText

Postmortem

At 16:21, we were alerted to a failure of services at our primary data centre, caused by a disk failure on the server cluster. Both we and a number of customers saw intermittent connections to the services, indicating that the system was attempting to come back online. This continued until 16:31, when we decided to move to our secondary data centre.

By 16:42, a number of our services were up and running on the secondary data centre. However, owing to the high load at the time and a DNS update taking a few moments to propagate, a handful of customers experienced timeouts and connection issues. Customers who experienced timeouts from 16:21 onwards, primarily via the API, may not have successfully submitted messages for sending.

By 17:08, both the API and the app services were restored, albeit with degraded performance.

Shortly afterwards, we added resources to the secondary data centre so that all customers could successfully submit messages.

A backlog of messages built up during this time. We also received confirmation that the server cluster at the primary data centre was back in operation. Unfortunately, although live, it was not functioning as expected, which delayed the processing of backlogged messages.

The majority of messages that were successfully queued during this time had been cleared by 20:45. The backlog for the remaining customers was cleared by 22:40.

We were pleased with the performance of the secondary data centre and with the changes made since the previous incident (we had been incident-free for over 2 years before this).

The secondary data centre was undergoing tests and upgrades following the incident in November (in isolation from the primary data centre). These were progressing extremely well and were due to complete today (13th). During this window we did have resources in place, albeit at a reduced size, so that should the primary data centre fail, we would remain operational, but with lower performance than normal. We were able to pull in resources to bring completion forward to the 12th, so that the secondary site could provide full support. As always, both sites are continually monitored.

Posted Jan 13, 2017 - 13:36 GMT

Resolved

This incident has been resolved.

Posted Jan 13, 2017 - 03:24 GMT

Monitoring

The backlog of queued messages is now cleared. Message sending via the API and Web App is now processing as normal. We will continue to monitor all services closely.

Posted Jan 12, 2017 - 22:50 GMT

Update

The backlog of message sending is almost clear and is now only affecting a small number of customers. Although we believe we are operating as expected, we are reviewing all services.

Posted Jan 12, 2017 - 22:13 GMT

Update

The team encountered an issue with processing the backlog, and some customers are still experiencing delays with messages being cleared. The system is still operating with degraded performance. We appreciate your ongoing patience.

Posted Jan 12, 2017 - 19:52 GMT

Update

The team are currently working to restore full operational performance and to clear all queued messages. Messages submitted during this time will be added to the backlog for sending. We appreciate your ongoing patience.

Posted Jan 12, 2017 - 18:44 GMT

Update

We are still experiencing degraded performance of our services. The team are currently working to bring back full operational performance. Messages sent during this time are still being added to the backlog, which continues to reduce. We appreciate your patience.

Posted Jan 12, 2017 - 17:54 GMT

Identified

Services have been restored and are currently running with degraded performance. Scheduled messages are being queued, and we are currently clearing the backlog of messages.

Posted Jan 12, 2017 - 17:10 GMT

Investigating

We're getting reports that FireText is unreachable externally, and are working with our provider to identify the issue.

Posted Jan 12, 2017 - 16:31 GMT