Outage of Conversations and Pulse overview

Postmortem

20 Feb 2024 at 17:16

We want to provide you with a transparent overview of the recent system downtime incident, its resolution, and the preventive measures we've implemented to avoid similar occurrences in the future.

Incident Timeline:

Friday 16 February 2024

8:30 AM: System outage detected; began monitoring servers and metrics.
- Database CPU utilization peaked at 100%.
- Overload observed on the API-GPT and token verification servers.
8:45 AM: Rolled back to the previous day's release.
9:00 AM: Issue persisted; continued monitoring and gathering system information. Identified a problem with token handling. One of our developers disabled token refreshes.
10:00 AM: Restored the system to the state of the previous day, undoing all rollbacks.
11:00 AM: Developers continued debugging and investigating potential workarounds or solutions for the token issue.
12:00 PM - 1:00 PM: Developers set up the production environment locally for debugging.
3:15 PM - 5:00 PM: Additional logging deployed within the code.
6:00 PM: Developer identified a discrepancy in database connections within the verification server. Adjusted the database host and deployed changes.
6:30 PM - 7:30 PM: Most functionalities restored, except for the count feature. Developer discovered that tokens were not being sent along with requests.

Saturday 17 February 2024

10:00 AM: A fix for the count feature was deployed to production. Watermelon was fully up and running again.
10:00 AM - 6:00 PM: Developers continued to monitor the situation.

Sunday 18 February 2024

9:00 AM - 6:00 PM: Developers continued to monitor the situation. Everything continued to work as expected.

Root Cause Analysis:

The system downtime was primarily caused by excessive token refreshing, leading to overload on the verification server. This resulted in a backlog of requests across multiple servers, including the verification, GPT, and total count servers.

Actions Taken:

Disabled all token refresh functionality except for a single main refresh process. As a result, users may see a loading screen when they return to Watermelon in the browser.
Reconnected the verification server to the main database.
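
To illustrate the idea of keeping a single refresh process, here is a minimal sketch in Python. It is not Watermelon's actual implementation: the class name `SingleFlightTokenCache` and the `fetch_token` callable are hypothetical, and the sketch assumes a token comes with a lifetime in seconds. The point is only that concurrent callers share one refresh instead of each sending their own request to the verification server.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class SingleFlightTokenCache:
    """Hypothetical cache that lets only one caller refresh the token at a time."""

    def __init__(self,
                 fetch_token: Callable[[], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # fetch_token is an assumed callable returning (token, lifetime_seconds).
        self._fetch_token = fetch_token
        self._clock = clock
        self._lock = threading.Lock()
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # All callers go through the same lock, so while one caller is
        # refreshing, the others wait and then reuse the token it stored.
        with self._lock:
            if self._token is None or self._clock() >= self._expires_at:
                token, lifetime = self._fetch_token()
                self._token = token
                self._expires_at = self._clock() + lifetime
            return self._token
```

With this shape, twenty requests arriving at the same moment with an expired token trigger one call to `fetch_token` rather than twenty. The trade-off matches the one described above: callers may wait briefly while the single refresh completes.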

Preventive Measures:

To prevent similar incidents in the future, we have:

Improved the token refresh functionality. This will decrease the load on the server.
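
This report does not describe how the refresh functionality is being changed. As one possible approach, the sketch below spreads refreshes over a window of the token lifetime and backs off after failures, so that clients neither refresh more often than needed nor retry in lockstep. The function names, the 70% to 90% window, and the backoff limits are all assumptions made for illustration.

```python
import random


def next_refresh_delay(lifetime_seconds: float,
                       rng: random.Random,
                       min_fraction: float = 0.7,
                       max_fraction: float = 0.9) -> float:
    """Pick a randomized delay before the next refresh, within a window of the lifetime."""
    if lifetime_seconds <= 0:
        raise ValueError("lifetime_seconds must be positive")
    if not 0 < min_fraction <= max_fraction < 1:
        raise ValueError("fractions must satisfy 0 < min <= max < 1")
    # Randomizing within the window keeps many clients from refreshing together.
    return lifetime_seconds * rng.uniform(min_fraction, max_fraction)


def backoff_delay(attempt: int,
                  rng: random.Random,
                  base_seconds: float = 1.0,
                  cap_seconds: float = 60.0) -> float:
    """Exponential backoff with full jitter for a failed refresh (attempt is 0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # The ceiling doubles with each failed attempt until it reaches the cap.
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

For a token that lives one hour, `next_refresh_delay(3600, random.Random())` returns a delay between 2520 and 3240 seconds, so each client refreshes roughly once per lifetime instead of on every page load.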

We apologize for any inconvenience this downtime may have caused and assure you that we are committed to maintaining the reliability and performance of our services.

Thank you for your understanding and continued support.

Resolved

20 Feb 2024 at 11:55

This incident has been resolved.

Update

17 Feb 2024 at 10:00

We've implemented a fix and are currently monitoring the situation. The count in conversations is up and running again as well. If you have a problem with logging in, please clear your cache and try again.

Monitoring

16 Feb 2024 at 17:17

We've implemented a fix and are currently monitoring the situation. Everything is running again except the count in conversations.

Update

16 Feb 2024 at 13:33

We have found a solution for the problem and are actively working on implementing it.

Identified

16 Feb 2024 at 11:34

We have identified this issue and are working on a solution.

Update

16 Feb 2024 at 8:33

We are continuing to investigate this issue.

Investigating

16 Feb 2024 at 7:30

We are currently investigating this incident.

The Conversations page is not loading conversations, and the Pulse overview is not showing any chatbots. Pulse and legacy chatbots continue to work on the selected channels. Once the outage has been resolved, the conversations will be visible again.
