A Postmortem on Outdated Data and Slow Performance

May 12, 2024

A gif of people reacting to an outage

Issue Summary

Duration: The web application experienced intermittent performance issues for 24 hours, from May 9, 2024, 11:00 EAT to May 10, 2024, 11:00 EAT. Spikes recurred throughout that window.

Impact: Users encountered slow loading times and occasional web app outages. During the incident, the services were often unreachable, and slow when they did respond. An estimated 24% of users were affected.

Root Cause: An aggressive caching strategy, combined with a backend data corruption issue, caused outdated and incorrect data to be served to users.

Timeline

Root Cause & Resolution

The problem arose from two issues that compounded each other. An overly aggressive caching strategy held on to old data for too long, while a data corruption problem on the backend kept producing wrong information that was then cached as well. The fix had two parts: a more granular caching strategy with shorter cache durations, so stale entries expire sooner, and a repair of the database followed by data integrity checks to resolve the backend corruption.
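To make the resolution concrete, here is a minimal sketch of the two ideas working together: entries expire after a short time-to-live (TTL), and a record is only cached if it passes an integrity check. This is an illustration, not the code used in the incident. The names `TTLCache`, `is_valid`, `fetch_with_cache`, the 30-second default, and the "record must have an `id`" rule are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after a per-entry TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so expiry can be tested without sleeping.
        self._clock = clock
        # key -> (expires_at, value)
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it so the caller refetches from the backend.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def is_valid(record: Any) -> bool:
    """Hypothetical integrity check; a real one depends on the data schema."""
    return isinstance(record, dict) and record.get("id") is not None


def fetch_with_cache(
    cache: TTLCache,
    key: str,
    load: Callable[[str], Any],
    ttl_seconds: float = 30.0,  # assumed value; tune per data type
) -> Any:
    """Return a cached record, or load, validate, and cache a fresh one."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    record = load(key)
    if not is_valid(record):
        # Never cache (or serve) a record that fails the integrity check.
        raise ValueError(f"backend returned an invalid record for {key!r}")
    cache.set(key, record, ttl_seconds)
    return record
```

With this shape, a corrupted backend response fails loudly instead of being stored, and even a bad entry that slips through is only served until its TTL runs out. Different kinds of data can be given different `ttl_seconds` values, which is one way to read "more granular" here.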

Corrective & Preventative Measures