From the second of August 09:00 (UTC) until the third of August 15:00 (UTC), almost none of the scheduled Swydo reports were sent automatically. We were able to send them all with a delay, but we would like to explain what happened and how we’ll prevent this in the future.
On Saturday all monthly reports were sent as usual. On Sunday, though, our Redis database provider (which holds our queue with all the reports that need to be sent) carried out some maintenance. We knew that this was coming, but it shouldn’t have affected our application. But it did. They changed their implementation in such a way that our queue could no longer handle the scheduled reports. The reports were queued but not executed, so no reports were sent.
What did we do to resolve it?
Our top priority was getting those reports out as soon as possible so that our customers would still receive them. However, that took longer than we expected. After some serious debugging, and wondering how things could go wrong without us releasing a new version, we found out that the issue was caused by the maintenance at our database provider. After many support calls with our provider that did not produce a clear solution, we decided to start a new queue database at a different provider. We quickly found one that worked properly and got everything up and running again.
What did we do to prevent further incidents?
Database maintenance is perfectly normal, and usually it causes us no issues. This time, however, our provider drastically changed their implementation, which caused this failure. We moved to a different provider, which we think will be more consistent and clear in its communication. We also changed our own implementation so that we can quickly switch to another Redis database in the future. This will help us migrate quickly in case of emergencies.
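To illustrate what "quickly switch to another Redis database" can look like, here is a minimal sketch, not our actual implementation. It assumes the queue reads its connection URL from configuration, and the environment variable names (`QUEUE_REDIS_URL`, `QUEUE_REDIS_FALLBACK_URL`, `QUEUE_USE_FALLBACK`) are hypothetical.

```python
import os

# Hypothetical variable names, chosen for illustration only.
PRIMARY_VAR = "QUEUE_REDIS_URL"
FALLBACK_VAR = "QUEUE_REDIS_FALLBACK_URL"
SWITCH_VAR = "QUEUE_USE_FALLBACK"


def queue_redis_url(env=None):
    """Return the Redis URL the queue should connect to.

    The fallback URL is used when the switch variable is set to a
    truthy value; otherwise the primary URL is used.
    """
    env = os.environ if env is None else env
    use_fallback = env.get(SWITCH_VAR, "").lower() in ("1", "true", "yes")
    name = FALLBACK_VAR if use_fallback else PRIMARY_VAR
    url = env.get(name)
    if not url:
        raise RuntimeError(f"{name} is not set")
    return url
```

With a setup like this, moving the queue to the second provider is a configuration change plus a restart of the workers, not a code release.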
Where did we go wrong?
It’s easy to point at the database provider, but that doesn’t mean we couldn’t have prevented this, or at least reacted sooner to the issues. We didn’t have any queue monitoring in place to notify us when reports entered the queue successfully but were never sent. We should have had that. When schedules fail to execute, we should simply be notified about it.
What will we do to prevent further incidents?
- Monitor the queue to get notified in time when tasks aren’t picked up
- Have a second Redis provider to quickly switch to in case of serious Redis issues
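The first point could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our production monitor: it assumes we can obtain the enqueue time of every job still waiting in the queue, and the function names, the 15-minute threshold and the `notify` callback are all hypothetical.

```python
import time


def stalled_jobs(enqueued_at, now=None, max_wait_seconds=900):
    """Return the ids of jobs that have waited longer than max_wait_seconds.

    enqueued_at maps a job id (string) to the Unix timestamp at which
    the job entered the queue.
    """
    now = time.time() if now is None else now
    return sorted(
        job_id for job_id, queued in enqueued_at.items()
        if now - queued > max_wait_seconds
    )


def check_queue(enqueued_at, notify, now=None, max_wait_seconds=900):
    """Call notify(message) once if any job was not picked up in time."""
    stalled = stalled_jobs(enqueued_at, now, max_wait_seconds)
    if stalled:
        notify(
            f"{len(stalled)} queued job(s) not picked up within "
            f"{max_wait_seconds}s: {', '.join(stalled)}"
        )
    return stalled
```

Run on a schedule, a check like this would have alerted us as soon as reports were queued but not executed, because it looks at how long jobs wait and not only at whether they entered the queue.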
We are very sorry for the inconvenience that this caused to our customers and we will continue to improve our software to reduce these incidents to a minimum.
Michiel ter Reehorst