Watermelon - Outage of Conversations and Pulse overview – Incident details


Outage of Conversations and Pulse overview

Resolved
Operational
Started 10 months ago
Lasted 4 days

Affected

Watermelon

Chatbot engine

Messaging API

Operational from 7:30 AM to 11:55 AM

Website Widget

Operational from 7:30 AM to 11:55 AM

Facebook Messenger

Operational from 7:30 AM to 11:55 AM

WhatsApp

Operational from 7:30 AM to 11:55 AM

Updates
  • Postmortem

    We want to provide you with a transparent overview of the recent system downtime incident, its resolution, and the preventive measures we've implemented to avoid similar occurrences in the future.

    Incident Timeline:

    Friday 16 February 2024

    • 8:30 AM: System outage detected; began monitoring servers and metrics.

      • Database CPU utilization peaked at 100%.

      • Overload observed on the API-GPT and token verification servers.

    • 8:45 AM: Rolled back to the previous day's release.

    • 9:00 AM: Issue persisted; continued monitoring and gathering system information. Identified a problem with token handling. One of our developers disabled token refreshes.

    • 10:00 AM: Restored the system to the state of the previous day, undoing all rollbacks.

    • 11:00 AM: Developers continued debugging and investigating potential workarounds or solutions for the token issue.

    • 12:00 PM - 1:00 PM: Developers set up the production environment locally for debugging.

    • 3:15 PM - 5:00 PM: Additional logging deployed within the code.

    • 6:00 PM: A developer identified a discrepancy in the database connections of the verification server, adjusted the database host, and deployed the change.

    • 6:30 PM - 7:30 PM: Most functionality restored, except for the count feature. A developer discovered that tokens were not being sent along with requests.

    Saturday 17 February 2024

    • 10:00 AM: A solution for the count feature was deployed to production. Watermelon was fully up and running again.

    • 10:00 AM - 6:00 PM: Developers continued to monitor the situation.

    Sunday 18 February 2024

    • 9:00 AM - 6:00 PM: Developers continued to monitor the situation. Everything continued to work as expected.

    Root Cause Analysis:

    The system downtime was primarily caused by excessive token refreshing, leading to overload on the verification server. This resulted in a backlog of requests across multiple servers, including the verification, GPT, and total count servers.

    Actions Taken:

    1. Disabled all token refresh functionality and retained a single main refresh process. This may cause people to see a loading screen when returning to Watermelon in the browser.

    2. Reconnected the verification server to the main database.
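
    To illustrate the idea of retaining a single refresh process, here is a minimal sketch in Python. It is not Watermelon's actual code: the class name `SingleFlightRefresher` and the `fetch_token` callable are hypothetical, and the sketch simply assumes that concurrent callers can share the result of one in-flight refresh instead of each sending their own request to the verification server.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple


class SingleFlightRefresher:
    """Allow at most one token refresh request to be in flight at a time."""

    def __init__(self, fetch_token: Callable[[], Awaitable[str]]) -> None:
        # fetch_token is a hypothetical stand-in for the call that asks the
        # verification server for a new token.
        self._fetch_token = fetch_token
        self._in_flight: Optional["asyncio.Future[str]"] = None

    async def refresh(self) -> str:
        # Start a new refresh only if none is currently running. Callers that
        # arrive while one is running await the same future and share its result.
        if self._in_flight is None or self._in_flight.done():
            self._in_flight = asyncio.ensure_future(self._fetch_token())
        return await self._in_flight


async def demo() -> Tuple[List[str], int]:
    calls = 0

    async def fake_fetch() -> str:
        # Simulated verification-server call; counts how often it is invoked.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"token-{calls}"

    refresher = SingleFlightRefresher(fake_fetch)
    # Fifty concurrent callers all ask for a refresh at the same moment.
    tokens = await asyncio.gather(*(refresher.refresh() for _ in range(50)))
    return list(tokens), calls
```

    In this sketch, fifty simultaneous refresh requests result in a single call to the simulated server. A failed refresh is not cached: the next caller starts a fresh attempt.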

    Preventive Measures:

    To prevent similar incidents in the future, we have:

    • Improved the token refresh functionality. This will decrease the load on the server.
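
    The postmortem does not describe how the refresh functionality changes. As one possible approach, and purely as an assumption for illustration, a client could refresh a token only once, at a randomized point late in its lifetime, so that many clients do not refresh at the same instant. The function names and the 70% to 90% window below are hypothetical.

```python
import random
from typing import Optional


def next_refresh_time(
    issued_at: float,
    lifetime_s: float,
    rng: Optional[random.Random] = None,
    min_fraction: float = 0.7,
    max_fraction: float = 0.9,
) -> float:
    """Pick a refresh time at a random point within a late window of the token lifetime.

    The 0.7 to 0.9 window is an illustrative assumption, not a value from the incident report.
    """
    rng = rng or random.Random()
    fraction = rng.uniform(min_fraction, max_fraction)
    return issued_at + lifetime_s * fraction


def needs_refresh(now: float, refresh_at: float) -> bool:
    """Return True once the scheduled refresh time has been reached."""
    return now >= refresh_at
```

    Spreading refreshes over a window like this is a common way to avoid synchronized bursts of requests; whether it matches the change actually deployed is not stated in the report.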

    We apologize for any inconvenience this downtime may have caused and assure you that we are committed to maintaining the reliability and performance of our services.

    Thank you for your understanding and continued support.

  • Resolved

    This incident has been resolved.

  • Update

    We've implemented a fix and are currently monitoring the situation. The count in conversations is up and running again as well. If you have problems logging in, please clear your cache and try again.

  • Monitoring

    We've implemented a fix and are currently monitoring the situation. Everything is running again except the count in conversations.

  • Update

    We have found a solution for the problem and are actively working on implementing it.

  • Identified

    We have identified this issue and are working on a solution.

  • Update

    We are continuing to investigate this issue.

  • Investigating

    We are currently investigating this incident.

    Conversations is not loading any conversations, and the Pulse overview is not showing any chatbots. The Pulse and legacy chatbots continue to work on the selected channels, and once the outage has been resolved, the conversations will be visible again.