Webhook Processing Disruption on April 12th, 2024
Incident Report for Thread
Postmortem

We rewrote how we handle PSA webhooks to make them faster and more reliable. Last week, on April 11th, 2024, we began gradually rolling the new implementation out to partners in production, and we continued the rollout on April 12th, 2024.

On April 12th, around 11am EST, we identified that some messages weren't being processed, even though our metrics showed them as received and processed as expected by our application. We didn't immediately find anything that would affect webhook processing, but because messages were being missed, we decided to revert the migration for all partners right away. By roughly 1pm EST, all partners had been fully migrated back and their workspaces were working properly.

We created a sync job to recover the missed data from the PSA and ran it for all partners. However, there is a chance that some items are still missing on our end.

After a deep dive into the issue, we identified that the problem was caused by a cached webhook handler instance that was set up for the wrong type of webhook event. For example, when a ticket webhook event was received, the cached handler could be one for a schedule event, causing our application to not process the event correctly. This happened randomly, so not all webhook calls were affected.
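
As a simplified illustration of this class of bug, the sketch below contrasts a cache that reuses one handler for every event with a cache keyed by event type. It is not our production code: the names WebhookHandler, BuggyHandlerCache, KeyedHandlerCache, and process are hypothetical, and the payload shape is an assumption. In this sketch the failure depends on which event type arrives first, which is one way such a problem can look random from the outside.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WebhookHandler:
    """Hypothetical handler bound to a single webhook event type."""
    event_type: str

    def handle(self, payload: dict) -> str:
        # A handler only knows how to process its own event type, so a
        # mismatched event is silently skipped instead of raising an error.
        if payload["type"] != self.event_type:
            return "skipped"
        return f"processed {self.event_type}"


class BuggyHandlerCache:
    """Caches the first handler created and reuses it for every event."""

    def __init__(self) -> None:
        self._cached: Optional[WebhookHandler] = None

    def get(self, event_type: str) -> WebhookHandler:
        if self._cached is None:
            self._cached = WebhookHandler(event_type)
        # Bug: the cached handler may belong to a different event type.
        return self._cached


class KeyedHandlerCache:
    """Caches one handler per event type, so types can never be mixed up."""

    def __init__(self) -> None:
        self._handlers: Dict[str, WebhookHandler] = {}

    def get(self, event_type: str) -> WebhookHandler:
        if event_type not in self._handlers:
            self._handlers[event_type] = WebhookHandler(event_type)
        return self._handlers[event_type]


def process(cache, payload: dict) -> str:
    """Look up a handler for the payload's event type and run it."""
    return cache.get(payload["type"]).handle(payload)
```

With the buggy cache, a schedule event followed by a ticket event leaves the ticket event skipped even though it was received; with the keyed cache, both are processed. An automated test that sends different event types through the same cache in sequence would catch this kind of regression.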

We released a fix and improved automated test coverage on that part of the application.

We deeply apologize for the incident, and we understand how much this can impact our partners' operations. We believe the changes we're making to webhook handling are a massive step forward in processing speed and stability, and our engineering team will work hard to ensure this kind of problem doesn't occur again.

Posted Apr 15, 2024 - 18:55 UTC

Resolved
On Friday, April 12th, 2024, we experienced an incident that affected some of our partners. This incident disrupted webhook processing, resulting in the omission of some webhooks. We promptly addressed the issue, minimizing the impact, but unfortunately, some updates were irretrievably lost.

We have identified the root cause of the problem and have already implemented a fix for it. We apologize for the inconvenience caused by this incident, and our engineering team is diligently working on enhancing the affected part of the application.
Posted Apr 12, 2024 - 14:00 UTC