A Postmortem on Outdated Data and Slow Performance

May 12, 2024

A gif of people reacting to an outage

Table of contents

  • Issue Summary
  • Timeline
  • Root Cause & Resolution
  • Corrective & Preventative Measures

Issue Summary

Duration: The web application experienced intermittent performance issues for 24 hours, from May 9, 2024, 11:00 EAT to May 10, 2024, 11:00 EAT. Intermittent latency spikes were observed throughout this window.

Impact: Users encountered slow loading times and occasional web app outages. For much of the incident the services were either unavailable or noticeably degraded. An estimated 24% of users were affected.

Root Cause: An overly aggressive caching strategy, combined with a data corruption issue in the backend, caused outdated and incorrect data to be served to users.

Timeline

  • May 9, 2024, from 11:00 EAT: Slow response times throughout the day, with initial user reports indicating service unavailability and occasional web app crashes.
  • May 9, 2024 13:15 EAT: Engineers began an initial investigation, focusing on assumed network latency and server performance issues.
    • Efforts were made to optimize server resources and troubleshoot network connectivity.
    • Misleading investigation: extensive time was spent on the server and network layers, which turned out not to be the cause.
  • May 9, 2024 16:28 EAT: With the issue persisting, the team decided to dive into the caching mechanisms.
  • May 9, 2024 17:00 EAT: With no significant findings, the incident was escalated to the DevOps team, which handles infrastructure including the caching systems.
  • May 10, 2024 9:10 EAT: Further investigation revealed inconsistencies in the cached data and a large volume of outdated information.
  • May 10, 2024 9:52 EAT: Root cause identified: a data corruption issue in the backend caused invalid data to be aggressively cached, resulting in inconsistencies being delivered to users.
  • May 10, 2024 11:00 EAT: The issue was resolved by implementing a tiered caching strategy and data validation checks.

Root Cause & Resolution

The problem arose from a combination of issues. An overly aggressive caching strategy retained stale data for too long, and a data corruption issue on the backend kept producing invalid records that were then cached and served to users. The resolution involved introducing a tiered caching strategy with shorter cache durations for frequently changing data, repairing the corrupted database records, and adding data integrity checks so that invalid data is no longer cached.
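
To illustrate the tiered approach, the sketch below shows how cache entries could be given different expiry times by data category, with short TTLs for frequently changing data. This is a minimal sketch assuming a Redis cache via the redis-py client; the tier names, TTL values, and helper functions are hypothetical and only meant to show the shape of the fix, not our exact implementation.

```python
import json
import redis

# Illustrative TTLs per data tier (the values are assumptions, not production settings).
CACHE_TTLS = {
    "static": 60 * 60,   # rarely-changing data: 1 hour
    "dynamic": 60 * 5,   # frequently-updated data: 5 minutes
    "volatile": 30,      # near-real-time data: 30 seconds
}

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_set(tier: str, key: str, value: dict) -> None:
    """Store a value with an expiry chosen by its data tier."""
    ttl = CACHE_TTLS.get(tier, CACHE_TTLS["volatile"])  # unknown tiers fall back to the shortest TTL
    cache.set(f"{tier}:{key}", json.dumps(value), ex=ttl)

def cache_get(tier: str, key: str):
    """Return the cached value, or None if it has expired or was never set."""
    raw = cache.get(f"{tier}:{key}")
    return json.loads(raw) if raw is not None else None
```

With per-tier expiry like this, frequently changing data ages out of the cache quickly, so even if a bad value slips in, it cannot linger for hours the way it did during the incident.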

Corrective & Preventative Measures

  • Introduce data validation checks at the cache layer to identify and prevent invalid data from being cached (see the sketch after this list).
  • Implement a tiered caching strategy with different cache expiration times depending on the type of data.
  • Schedule regular data integrity checks and database maintenance procedures.
  • Enhance monitoring of cache behaviour and performance metrics.
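
Below is a minimal sketch of what the cache-layer validation could look like. It assumes each record carries a SHA-256 checksum of its payload; the field names and validation logic are hypothetical, and cache_set refers to the helper from the earlier tiered-caching sketch.

```python
import hashlib
import json
import logging

REQUIRED_FIELDS = {"id", "updated_at", "payload", "checksum"}  # hypothetical record schema

def is_valid(record: dict) -> bool:
    """Reject records that are missing fields or whose payload fails the checksum."""
    if not REQUIRED_FIELDS.issubset(record):
        return False
    expected = record["checksum"]
    actual = hashlib.sha256(
        json.dumps(record["payload"], sort_keys=True).encode()
    ).hexdigest()
    return expected == actual

def cache_if_valid(tier: str, key: str, record: dict) -> bool:
    """Cache the record only if it passes validation; otherwise log and drop it."""
    if not is_valid(record):
        logging.warning("Refusing to cache invalid record for key %s", key)
        return False
    cache_set(tier, key, record)  # cache_set from the earlier tiered-caching sketch
    return True
```

A check like this keeps corrupted backend output from ever reaching the cache, and the warning log gives the monitoring in the last bullet something concrete to alert on.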