Table of contents
- Issue Summary
- Timeline
- Root Cause & Resolution
- Corrective & Preventative Measures
Issue Summary
Duration: The web application experienced intermittent performance issues for 24 hours, from May 9, 2024, 11:00 EAT to May 10, 2024, 11:00 EAT, with intermittent spikes throughout this window.
Impact: Users encountered slow loading times and occasional web app outages: the services were frequently unreachable and, when reachable, often slow to respond. An estimated 24% of users were affected.
Root Cause: An aggressive caching strategy, combined with a data corruption issue, resulted in outdated and incorrect data being delivered to the users.
Timeline
- May 9, 2024 11:00 EAT: Onset of slow response times; initial user reports indicate service unavailability and occasional web app crashes.
- May 9, 2024 13:15 EAT: The engineers started an initial investigation focusing on assumptions about network latency and server performance issues.
- Engineers attempted to optimize server resources and troubleshoot network connectivity.
- Misleading investigation: extensive time was spent on server and network diagnostics rather than on the caching layer.
- May 9, 2024 16:28 EAT: With the issue persisting, the team turned its attention to the caching mechanisms.
- May 9, 2024 17:00 EAT: With no significant findings, the incident was escalated to the DevOps team, which owns the infrastructure including the caching systems.
- May 10, 2024 9:10 EAT: Further investigation revealed inconsistencies in the cached data and a large volume of outdated entries.
- May 10, 2024 9:52 EAT: Root cause identified: a data corruption issue in the backend caused invalid data to be aggressively cached, so inconsistent and outdated data was served to users.
- May 10, 2024 11:00 EAT: The issue was resolved by implementing a tiered caching strategy and data validation checks.
Root Cause & Resolution
The incident stemmed from two interacting issues: an overly aggressive caching strategy that retained stale data for too long, and a backend data corruption bug that kept producing invalid records, which were then cached and served to users. The resolution had two parts: a more granular, tiered caching strategy with shorter cache durations per data type, and repair of the backend corruption through database repair and data integrity checks.
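The tiered strategy mentioned above can be sketched as a cache that assigns a time-to-live per data type, so volatile data expires quickly while stable data is cached longer. This is a minimal illustration only: the report does not specify the stack, and the data types and TTL values below are hypothetical.

```python
import time

# Hypothetical per-type TTLs; the categories and durations are illustrative,
# not taken from the incident report.
TTL_SECONDS = {
    "static": 3600,   # rarely-changing assets: long TTL
    "profile": 300,   # user-profile data: medium TTL
    "live": 15,       # frequently-updated data: short TTL
}

class TieredCache:
    """In-memory cache with per-data-type expiration (tiered TTLs)."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, data_type):
        # Unknown types fall back to a short TTL so stale data cannot linger.
        ttl = TTL_SECONDS.get(data_type, 60)
        self._store[key] = (value, time.time() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            # Expired: evict and report a miss rather than serve stale data.
            del self._store[key]
            return None
        return value
```

A production cache (e.g. Redis with per-key TTLs) would serve the same purpose; the point is that expiration is chosen by data type rather than one aggressive global setting.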
Corrective & Preventative Measures
- Introduce data validation checks at the cache layer to identify and prevent invalid data from being cached.
- Implement a tiered caching strategy with different cache expiration times depending on the type of data.
- Schedule regular data integrity checks and database maintenance procedures.
- Enhance monitoring of cache behaviour and performance metrics.
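The first measure, validation at the cache layer, amounts to checking each record's integrity before it is written to the cache, so corrupt backend output can never be served from cache. The sketch below assumes a hypothetical record schema (`id` and `payload` fields); real checks would mirror the actual backend schema.

```python
def is_valid(record):
    """Minimal integrity check; hypothetical schema for illustration."""
    return (
        isinstance(record, dict)
        and isinstance(record.get("id"), int)
        and isinstance(record.get("payload"), str)
        and record["payload"] != ""
    )

def cache_set(cache, key, record):
    """Write-through guard: cache the record only if it passes validation.

    Returns True if the record was cached, False if it was rejected.
    """
    if not is_valid(record):
        # Corrupt data is dropped here instead of being aggressively cached.
        return False
    cache[key] = record
    return True
```

Rejections at this layer should also be counted and alerted on, which ties into the monitoring measure above: a spike in rejected writes is an early signal of backend corruption.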