We want to provide you with a transparent overview of the recent system downtime incident, its resolution, and the preventive measures we've implemented to avoid similar occurrences in the future.
Incident Timeline:
Friday 16 February 2024
8:30 AM: System outage detected; began monitoring the server and metrics.
8:45 AM: Rolled back to the previous day's release.
9:00 AM: Issue persisted; continued monitoring and gathering system information. Identified a problem with token handling. One of our developers disabled token refreshes.
10:00 AM: Restored the system to the state of the previous day, undoing all rollbacks.
11:00 AM: Developers continued debugging and investigating potential workarounds or solutions for the token issue.
12:00 PM - 1:00 PM: Developers set up the production environment locally for debugging.
3:15 PM - 5:00 PM: Additional logging deployed within the code.
6:00 PM: Developer identified a discrepancy in database connections within the verification server. Adjusted the database host and deployed changes.
6:30 PM - 7:30 PM: Most functionality restored, except for the count feature. Developer discovered that tokens were not being sent along with requests.
Saturday 17 February 2024
Sunday 18 February 2024
Root Cause Analysis:
The system downtime was primarily caused by excessive token refreshing, which overloaded the verification server. This resulted in a backlog of requests across multiple servers, including the verification, GPT, and total count servers.
Actions Taken:
Disabled all token refresh functionality except for a single main refresh process. As a result, users may see a loading screen when returning to Watermelon in the browser.
Reconnected the verification server to the main database.
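For technically minded readers, the idea behind keeping a single refresh process can be illustrated with a minimal sketch. This is not our production code: the class name `SingleFlightTokenRefresher`, the `fetch_token` callable, and the 300-second lifetime are all hypothetical, chosen only to show how concurrent callers can share one refresh instead of each contacting the verification server.

```python
import threading
import time


class SingleFlightTokenRefresher:
    """Hypothetical helper: at most one refresh runs at a time, and the
    resulting token is reused by every caller until it expires."""

    def __init__(self, fetch_token, ttl_seconds=300.0, clock=time.monotonic):
        # fetch_token is an assumed callable that asks the verification
        # server for a new token; ttl_seconds is an assumed token lifetime.
        self._fetch_token = fetch_token
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._token = None
        self._expires_at = 0.0

    def get_token(self):
        # The lock is held during the fetch, so callers that arrive while a
        # refresh is in progress wait and then reuse the fresh token rather
        # than triggering refreshes of their own.
        with self._lock:
            if self._token is None or self._clock() >= self._expires_at:
                self._token = self._fetch_token()
                self._expires_at = self._clock() + self._ttl
            return self._token
```

With this pattern, a burst of simultaneous requests produces one call to the verification server instead of one call per request. The trade-off is that callers may wait briefly while the shared refresh completes, which is consistent with the loading screen mentioned above.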
Preventive Measures:
To prevent similar incidents in the future, we have:
We apologize for any inconvenience this downtime may have caused and assure you that we are committed to maintaining the reliability and performance of our services.
Thank you for your understanding and continued support.