At 13:43 on 09/04/2020 one of our database servers suffered a critical fault, resulting in requests failing to be fulfilled. Automated attempts to recover and re-spawn the instance were unsuccessful.
At 13:47 the team began manual mitigation of the affected traffic by isolating the affected database.
At 13:57 the affected app and API services were restored, and messages were being accepted again. While the changes propagated, a small subset of requests may still have seen timeouts amongst the successes; these had tailed off completely by 14:12.
A small backlog built up as services restarted and processes resumed, resulting in delays for some customers.
Whilst the failed instance was being restored and re-synchronised, app users may have experienced delays in message reports appearing, although messages were being sent and delivered as normal.
By 14:57 the failed instance was back online, and all queues and delays had cleared.
We would once again like to apologise for this incident and the inconvenience caused, and to thank you for your patience. We take every incident extremely seriously: we review, evaluate and learn from issues in full in order to mitigate against them in the future. For this incident, that review includes the auto-failover mechanism and how we can minimise the time taken during a manual failover. The team were alerted and in position within seconds of the fault. They have since focused on the root cause of the critical fault identified in this database, and are confident a solution has been applied and tested.