On Tuesday, April 28, 2020, from 12:03 PM ET until 12:25 PM ET, our giving day platform experienced a partial degradation of service affecting roughly 15% of traffic. The cause was the failure of one of our backend API servers in our backup cloud provider environment; at the time, it was one of five servers receiving traffic from the load balancer. Our primary cloud provider environment experienced no issues and continued to auto-scale and perform as expected.
How We Resolved the Issue
The failing backup server self-recovered around 12:08 PM ET after restarting itself, and the load balancer directed less traffic to it. From that point, new requests reaching the recovered server processed successfully. As a precaution, at 12:15 PM ET the GiveGab engineering team removed the impacted server from rotation entirely by pointing all traffic at our primary cloud provider, which was operating with a 100% success rate.
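We haven't published our load balancer configuration, but the behavior described above, skipping a backend that fails health checks and then removing it from rotation entirely, can be sketched with a simple round-robin pool. This is an illustration only, with hypothetical names, not our actual setup:

```python
# Hypothetical sketch of health-check-based rotation (not GiveGab's
# actual configuration): the balancer skips backends whose recent
# health checks fail, and an operator can remove one entirely.
from itertools import cycle


class LoadBalancerPool:
    def __init__(self, backends):
        self.backends = list(backends)   # e.g. API server hostnames
        self.healthy = set(backends)     # backends passing health checks
        self._rr = cycle(self.backends)  # round-robin order

    def report_health(self, backend, passed):
        """Record the latest health-check result for a backend."""
        if passed:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def remove_from_rotation(self, backend):
        """Operator action: stop sending the backend any traffic."""
        self.backends.remove(backend)
        self.healthy.discard(backend)
        self._rr = cycle(self.backends)

    def next_backend(self):
        """Pick the next healthy backend in round-robin order."""
        for _ in range(len(self.backends)):
            candidate = next(self._rr)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


pool = LoadBalancerPool(["api-1", "api-2", "api-3", "api-4", "api-5"])
pool.report_health("api-3", passed=False)  # failing server gets no new traffic
pool.remove_from_rotation("api-3")         # operator removes it entirely
assert "api-3" not in [pool.next_backend() for _ in range(20)]
```

In practice this logic lives inside the load balancer itself; the point is only that failed health checks reduce traffic automatically, while removal from rotation is an explicit operator step.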
Lingering issues loading certain content on some pages continued until 12:25 PM ET, because error responses served during the incident had been sporadically cached. Around 12:20 PM ET, we reset the cache for our platforms, and by 12:25 PM ET all systems were fully recovered.
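To illustrate why the reset helped: once an error response is written into a shared cache, readers keep receiving that stale error even after the backend recovers, until the entry is cleared. A minimal in-memory sketch (hypothetical names, not our actual caching layer):

```python
# Hypothetical in-memory response cache, for illustration only:
# cached error responses persist past backend recovery until cleared.
class ResponseCache:
    def __init__(self):
        self._store = {}

    def get_or_fetch(self, key, fetch):
        """Return the cached response, fetching and caching on a miss."""
        if key not in self._store:
            self._store[key] = fetch()
        return self._store[key]

    def reset(self):
        """Clear every cached entry (the cache reset step)."""
        self._store.clear()


cache = ResponseCache()
# A failing backend's error page gets cached for the leaderboard.
cache.get_or_fetch("/leaderboard", lambda: "503 Service Unavailable")
# The backend has recovered, but the cached error is still served.
assert cache.get_or_fetch("/leaderboard", lambda: "200 OK") == "503 Service Unavailable"
# Resetting the cache lets the healthy response be fetched and stored.
cache.reset()
assert cache.get_or_fetch("/leaderboard", lambda: "200 OK") == "200 OK"
```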
Impact on Users
During this time, while most pages loaded without issue, approximately 15% of users experienced slow response times or were unable to view certain pieces of content, such as leaderboards, the prizes listing page, some nonprofit pages, and portions of their GiveGab admin dashboard. Additionally, a small portion of users (~2%) were unable to donate for about 5 minutes, between 12:03 PM and 12:08 PM ET, if their requests were routed to the failing server.
We continue to monitor all of our systems and have seen no further issues; auto-scaling and all other systems are operating as expected. We have also added monitoring that will automatically alert us sooner if similar issues begin to appear. We do not anticipate any further issues for the remainder of the event.
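The specifics of the new monitoring aren't detailed here; as a hedged illustration, alerting sooner on this kind of failure typically means tracking each backend's error rate over a short sliding window and firing when it crosses a threshold. The window size and threshold below are hypothetical:

```python
# Hypothetical error-rate alert sketch (not GiveGab's actual
# monitoring): keep recent request outcomes per backend in a sliding
# window and alert when the error rate exceeds a threshold.
from collections import deque


class ErrorRateAlert:
    def __init__(self, window=100, threshold=0.10):
        self.window = window        # number of recent requests kept
        self.threshold = threshold  # alert above this error fraction
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        """Record one request outcome (True = success)."""
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        # Require a half-full window so one early failure
        # doesn't page anyone.
        return (len(self.outcomes) >= self.window // 2
                and self.error_rate() > self.threshold)


alert = ErrorRateAlert(window=100, threshold=0.10)
for _ in range(80):
    alert.record(True)
for _ in range(20):
    alert.record(False)      # a backend starts failing
assert alert.should_alert()  # 20% error rate over the window
```

A short window like this would have flagged the failing server within its first minutes of errors rather than waiting for user reports.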