[IBM-Aspera] - [Aoc API] - Service Disruption
Incident Report for IBM-Aspera Service Status
Postmortem

After a thorough retrospective, we have identified the root cause of this incident. The short story is that a certificate used on the origin server expired, causing requests to fail. The long story is one of human error, a lack of automation, monitoring problems, and a process gap that allowed the issue to remain open far longer than it should have.

Our maintenance window was on February 24. Leading up to that window, the certificate in question had already been identified as needing rotation. The certificate is used in two places: on AWS CloudFront and on the API gateway on the origin server, and both needed the new certificate in place. The AWS CloudFront certificate was rotated and verified first, followed by the API gateway server. However, the process to verify the new certificate on the API gateway was incomplete.

One issue that had not been identified was that the private key used to generate the certificate differed from the one installed on the API gateway. The certificate was installed and the API gateway appeared to be working, but the wrong verification process was followed. The mismatch would have been more apparent had some additional steps been taken, such as fully stopping and starting the API gateway pods. The current process does not deem this necessary, because the API gateway automatically detects changes to certificates. This is the first process that has been identified as needing a fix. The team will also review the implementation of some pre-existing automation to ensure certificate rotation happens for this API gateway certificate (and on CloudFront) without requiring human intervention.
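A mismatch of this kind can be caught before a rotation is declared complete by comparing the public key embedded in the certificate against the one derived from the private key. The sketch below is an illustration, not our actual tooling: the file paths are hypothetical, and it shells out to the `openssl` CLI for the parsing.

```python
# Sketch: detect a certificate/private-key mismatch before declaring a
# rotation verified. File paths are hypothetical; requires the openssl CLI.
import subprocess


def cert_public_key(cert_path: str) -> bytes:
    """Extract the public key (PEM) embedded in an X.509 certificate."""
    return subprocess.run(
        ["openssl", "x509", "-noout", "-pubkey", "-in", cert_path],
        check=True, capture_output=True,
    ).stdout


def key_public_key(key_path: str) -> bytes:
    """Derive the public key (PEM) from a private key."""
    return subprocess.run(
        ["openssl", "pkey", "-pubout", "-in", key_path],
        check=True, capture_output=True,
    ).stdout


def cert_matches_key(cert_path: str, key_path: str) -> bool:
    """A rotation should only be declared verified when this returns True."""
    return cert_public_key(cert_path) == key_public_key(key_path)
```

Running a check like this as a mandatory step after installing the new certificate would have flagged the key mismatch immediately, independent of whether the API gateway had actually reloaded the files.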

The second failure was in monitoring. We run a multitude of synthetic checks continuously against all customer-facing API servers to ensure alerts are generated if they go down. One monitor did trigger at 16:00 PST, but the alert did not stay open and auto-resolved. This led to a slower response than usual and made it harder for the on-call engineer to understand what was going on and how to resolve it.
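One common way to keep an alert like this from flapping closed is to require a run of consecutive successful probes before it may auto-resolve. The following is a minimal sketch of that idea (the class name and threshold are our illustration, not the actual monitoring code):

```python
class SyntheticAlert:
    """Alert state machine for a synthetic check: opens on the first
    failed probe and stays open until a run of consecutive successful
    probes clears it, so a single flapping success cannot auto-resolve
    an ongoing outage."""

    def __init__(self, resolve_after: int = 3):
        self.resolve_after = resolve_after  # consecutive successes needed
        self.open = False
        self._ok_streak = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return whether the alert is open."""
        if probe_ok:
            self._ok_streak += 1
            if self.open and self._ok_streak >= self.resolve_after:
                self.open = False  # sustained recovery: auto-resolve
        else:
            self._ok_streak = 0
            self.open = True  # any failure (re)opens the alert
        return self.open
```

With a rule like this, the 16:00 PST alert would have remained open through intermittent successful probes until the certificate issue was actually fixed.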

Our incident response guide will be updated to ensure that this type of issue is well known to on-call engineers and that the steps needed to quickly identify and resolve the problem are documented. Our monitoring checks will be reviewed to see whether alerts generated during this type of issue can be kept open instead of auto-resolving. Furthermore, we will work to remove the human element from the certificate rotation process, to ensure this never happens again.
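Removing the human element typically starts with automated expiry monitoring: checking the certificate a server actually presents, rather than the one an engineer believes was installed. As a minimal sketch under our own assumptions (the host and threshold below are illustrative, not production configuration), this can be done with the Python standard library:

```python
# Sketch: automated certificate-expiry check using only the standard
# library. Host and threshold values are illustrative.
import datetime
import socket
import ssl


def days_until(not_after: str) -> float:
    """Days until an OpenSSL-style 'notAfter' timestamp, e.g.
    'Mar  1 12:00:00 2030 GMT'. Negative means already expired."""
    expires = datetime.datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires - datetime.datetime.utcnow()).total_seconds() / 86400.0


def cert_days_remaining(host: str, port: int = 443) -> float:
    """Connect to a live endpoint and report how many days remain
    before the certificate it presents expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return days_until(tls.getpeercert()["notAfter"])
```

A scheduled job that alerts (or triggers rotation) when `cert_days_remaining("api.ibmaspera.com")` drops below, say, 30 days would catch this class of failure without relying on anyone remembering a maintenance window.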

We apologize for any inconvenience this incident may have caused your organization.

Posted Feb 27, 2024 - 14:47 PST

Resolved
This incident has been resolved.
Posted Feb 25, 2024 - 18:24 PST
Investigating
Our engineering team is investigating an issue affecting the AoC API.
Posted Feb 25, 2024 - 16:00 PST
This incident affected: IBM-Aspera API Services (api.ibmaspera.com).