Degraded performance of GiveGab API and Giving Day Platform
Incident Report for GiveGab
Postmortem

What Happened

On Tuesday, April 28, 2020, from 12:03 PM ET until 12:25 PM ET, our giving day platform experienced a partial degradation of service affecting roughly 15% of traffic. This was the result of one of our backend API servers failing within our backup cloud provider environment. Our primary cloud provider environment experienced no issues and continued to auto-scale and perform as expected. The failing server was one of five servers behind our load balancer at the time.
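For readers curious about the mechanics, the sketch below models a health-checked pool of backends split across two cloud providers, where a backend that stops answering its health check is skipped by the balancer. It is purely illustrative: the host names, health endpoint, and selection policy are hypothetical placeholders, not our production configuration.

    # Illustrative sketch only -- not GiveGab's production configuration.
    # Models a pool of API backends split across a primary and a backup
    # cloud provider; a backend that fails its health check is skipped.
    import random
    import urllib.request

    BACKENDS = [
        # Host names are hypothetical placeholders.
        {"host": "api-primary-1.example.com", "provider": "primary", "healthy": True},
        {"host": "api-primary-2.example.com", "provider": "primary", "healthy": True},
        {"host": "api-primary-3.example.com", "provider": "primary", "healthy": True},
        {"host": "api-backup-1.example.com",  "provider": "backup",  "healthy": True},
        {"host": "api-backup-2.example.com",  "provider": "backup",  "healthy": True},
    ]

    def check_health(backend, timeout=2):
        """Mark a backend unhealthy if its health endpoint stops responding."""
        try:
            urllib.request.urlopen("https://%s/health" % backend["host"], timeout=timeout)
            backend["healthy"] = True
        except Exception:
            backend["healthy"] = False

    def pick_backend():
        """Route a request to a random healthy backend, as a simple balancer would."""
        healthy = [b for b in BACKENDS if b["healthy"]]
        return random.choice(healthy) if healthy else None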

How We Resolved the Issue

The failing backup server self-recovered around 12:08 PM ET when it restarted and the load balancer began directing less traffic toward it. From that point, new requests reaching the recovered server processed successfully. To be safe, at 12:15 PM ET the GiveGab engineering team removed the impacted server from rotation entirely by pointing all traffic at our primary cloud provider, which was serving requests with a 100% success rate.
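Continuing the same hypothetical model above (again, a sketch rather than our actual tooling), pulling the backup provider out of rotation amounts to marking all of its backends ineligible so the balancer only selects servers in the primary environment:

    # Illustrative sketch only, continuing the hypothetical pool above.
    def remove_provider_from_rotation(backends, provider):
        """Stop routing any new traffic to backends in the given provider."""
        for backend in backends:
            if backend["provider"] == provider:
                backend["healthy"] = False  # the balancer will now skip these

    # Point all traffic at the primary cloud provider:
    # remove_provider_from_rotation(BACKENDS, "backup")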

Lingering issues loading certain content on some pages continued until 12:25 PM ET, because some error responses from the earlier failures had been sporadically cached. Around 12:20 PM ET, we reset the cache for our platforms, and by 12:25 PM ET all systems were fully recovered.
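The cache reset itself is conceptually simple. A minimal sketch, assuming a Redis-style cache (the key prefix and connection details are hypothetical, not our actual setup), would look like this:

    # Illustrative sketch only -- assumes a Redis-style cache; the key
    # prefix and connection details are hypothetical.
    import redis

    def reset_page_cache(prefix="page-cache:"):
        """Delete cached page fragments so stale error responses stop being served."""
        client = redis.Redis(host="localhost", port=6379)
        for key in client.scan_iter(match=prefix + "*"):
            client.delete(key)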

Impact on Users

During this time, most pages loaded without issue, but approximately 15% of users experienced slow response times or were unable to view certain pieces of content, such as leaderboards, the prizes listing page, some nonprofit pages, and portions of the GiveGab admin dashboard. Additionally, a small portion of users (~2%) who happened to be routed to the failing server were unable to donate for about 5 minutes, from 12:03 PM until 12:08 PM ET.

Summary

We continue to monitor all of our systems and have seen no further issues. Our auto-scaling is operating as expected. We have added additional monitoring that will automatically alert us sooner if similar issues begin to appear. We do not anticipate any further issues for the remainder of the event.
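As a rough illustration of the kind of check this new monitoring performs (the threshold and the alerting hook below are hypothetical, not our actual configuration):

    # Illustrative sketch only -- threshold and alerting hook are hypothetical.
    def should_alert(error_count, request_count, threshold=0.05):
        """Alert if more than 5% of recent requests returned errors."""
        if request_count == 0:
            return False
        return (error_count / request_count) > threshold

    # Evaluated periodically against recent load balancer metrics, e.g.:
    # if should_alert(errors_last_minute, requests_last_minute):
    #     page_on_call_engineer()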

Posted Apr 29, 2020 - 09:48 EDT

Resolved
From 12:03 PM ET to approximately 12:25 PM ET, one of our redundant API servers running in our backup cloud provider experienced CPU over-utilization, which resulted in page load errors for roughly 6% of requests to our API and our giving day sites.

At 12:15 PM ET our load balancer stopped routing most traffic to this server, we fully pulled the server out of rotation, and service was restored for 100% of traffic by 12:25 PM ET.
Posted Apr 28, 2020 - 12:03 EDT