Affected
Operational from 7:30 AM to 11:55 AM
- Postmortem
We want to provide you with a transparent overview of the recent system downtime incident, its resolution, and the preventive measures we've implemented to avoid similar occurrences in the future.
Incident Timeline:
Friday 16 February 2024
8:30 AM: System outage detected; monitoring server and metrics.
Database CPU utilization peaked at 100%.
Overload observed on the API-GPT and token verification servers.
8:45 AM: Rolled back to the previous day's release.
9:00 AM: Issue persisted; continued monitoring and gathering system information. Identified a problem with token handling. One of our developers disabled token refreshes.
10:00 AM: Restored the system to the state of the previous day, undoing all rollbacks.
11:00 AM: Developers continued debugging and investigating potential workarounds or solutions for the token issue.
12:00 PM - 1:00 PM: Developers set up the production environment locally for debugging.
3:15 PM - 5:00 PM: Additional logging deployed within the code.
6:00 PM: Developer identified a discrepancy in database connections within the verification server. Adjusted the database host and deployed changes.
6:30 PM - 7:30 PM: Most functionalities restored, except for the count feature. Developer discovered that tokens were not being sent along with requests.
Saturday 17 February 2024
10:00 AM: A solution for the count was deployed to production. Watermelon was fully up and running again.
10:00 AM - 6:00 PM: Developers continued to monitor the situation.
Sunday 18 February 2024
9:00 AM - 6:00 PM: Developers continued to monitor the situation. Everything continued to work as expected.
Root Cause Analysis:
The system downtime was primarily caused by excessive token refreshing, leading to overload on the verification server. This resulted in a backlog of requests across multiple servers, including the verification, GPT, and total count servers.
Actions Taken:
Disabled all token refresh functionality and retained a single main refresh process. This may cause people to see a loading screen when returning to Watermelon in the browser.
Reconnected the verification server to the main database.
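To illustrate the idea of keeping a single refresh process, here is a minimal sketch in Python. It is not our production code: the class name `SingleFlightRefresher` and the `fetch_token` callable (standing in for a request to the verification server) are hypothetical. The sketch assumes that many parts of an application may ask for a fresh token at the same moment, and it lets them all share one in-flight refresh instead of each sending its own request.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple


class SingleFlightRefresher:
    """Coalesce concurrent token refreshes into one upstream call."""

    def __init__(self, fetch_token: Callable[[], Awaitable[str]]) -> None:
        # Hypothetical stand-in for the request to the verification server.
        self._fetch_token = fetch_token
        self._in_flight: Optional["asyncio.Future[str]"] = None

    async def refresh(self) -> str:
        # Start a new refresh only when none is currently running;
        # otherwise wait for the one that is already in flight.
        if self._in_flight is None or self._in_flight.done():
            self._in_flight = asyncio.ensure_future(self._fetch_token())
        return await self._in_flight


async def demo() -> Tuple[int, List[str]]:
    calls = 0

    async def fake_fetch() -> str:
        # Simulated verification-server call; counts how often it is hit.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"token-{calls}"

    refresher = SingleFlightRefresher(fake_fetch)
    # Fifty callers request a refresh at the same time.
    tokens = await asyncio.gather(*(refresher.refresh() for _ in range(50)))
    return calls, list(tokens)


if __name__ == "__main__":
    calls, tokens = asyncio.run(demo())
    print(f"upstream calls: {calls}, distinct tokens: {len(set(tokens))}")
```

In this sketch, fifty simultaneous callers result in one simulated request to the server. The trade-off matches the one described above: callers wait for the shared refresh to finish, which a user may experience as a brief loading screen.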
Preventive Measures:
To prevent similar incidents in the future, we have:
Improved the token refresh functionality. This decreases the load on the server.
We apologize for any inconvenience this downtime may have caused and assure you that we are committed to maintaining the reliability and performance of our services.
Thank you for your understanding and continued support.
- Resolved
This incident has been resolved.
- Update
We've implemented a fix and are currently monitoring the situation. The count in conversations is up and running again as well. If you have a problem with logging in, please clear your cache and try again.
- Monitoring
We've implemented a fix and are currently monitoring the situation. Everything is running again except the count in conversations.
- Update
We have found a solution for the problem and are actively working on implementing it.
- Identified
We have identified this issue and are working on a solution.
- Update
We are continuing to investigate this issue.
- Investigating
We are currently investigating this incident.
Conversations are not loading, and the Pulse overview is not showing any chatbots. The Pulse and legacy chatbots continue to work on the selected channels, and once the outage has been resolved, the conversations will be visible again.