At 16:21, we were alerted to a failure of services at our primary data centre. This was due to a disk failure on the server cluster. We saw intermittent connections to the services, both for ourselves and for a number of customers, indicating that the system was attempting to come back online. This continued until 16:31, at which point we decided to move to our secondary data centre.
By 16:42, a number of our services were successfully up and running in the secondary data centre. However, with high load at the time, a handful of customers experienced timeouts and connection issues while a DNS update took a few moments to propagate. Customers who experienced timeouts during this period (from 16:21 onwards), primarily via the API, may not have successfully submitted messages for sending.
By 17:08, both the API and the app services were restored, although with degraded performance.
We were then able to add resources to the secondary data centre, which allowed all customers to submit messages successfully shortly afterwards.
A backlog of messages built up during this time. We also had confirmation that the cluster of servers in the primary data centre was back in operation. Unfortunately, although live, the servers were not functioning as expected, which delayed the processing of backlogged messages.
The majority of messages that were successfully queued for sending during this time were cleared by 20:45. The backlog for the remaining customers was cleared by 22:40.
We were pleased with the performance of the secondary data centre and with the changes made since the previous incident (we had been incident-free for over two years before this).
The secondary data centre was undergoing tests and upgrades following the incident in November (in isolation from the primary data centre). These were progressing extremely well and were due to complete today (13th). During this window we did have resource at the secondary site, albeit at a reduced size, so that should the primary data centre fail, we would remain operational, but at lower performance than normal. We were able to pull in resources to bring completion forward to the 12th, so that the secondary site could provide full support. As always, both sites are continually monitored.