[IBM-Aspera] - [Aoc API] - Service Disruption - Degradation in automation job throughput
Incident Report for IBM-Aspera Service Status
Postmortem

This incident has been traced to an issue with a managed database solution. The managed database runs in active-passive mode, with regular failovers between members of the cluster for events such as updates, configuration changes, or crashes. These failovers happen without direct management by the IBM Aspera SRE team. A failover was observed at Feb 2, 10:00 PST. After the failover, the newly active instance did not meet performance expectations: queries took too long and reached the timeout limits configured for the Automation App sub-component that uses the database.

The timed-out database queries caused other internal processes to crash. As a result, some workflows became stuck in an “executing” state because update messages could not be received, and for some organizations the sub-component could not manage more than 1 workflow-instance at a time. Some queues therefore did not progress at all, because the single workflow-instance that was allowed to run was stuck and could not progress without intervention.

To resolve the situation, the IBM Aspera SRE team forced a failover on the managed database, cancelled the stuck workflow-instances, and temporarily increased the allowed number of concurrent workflow-instances so that queues that had fallen behind could catch up. Configuration changes for the managed database are being considered, and additional monitoring will be put in place for the symptoms of the issue.
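
As an illustration of the kind of symptom check described above, the following is a minimal sketch only, not the Automation App's actual monitoring. The record shape (`WorkflowInstance`), the function names, the 30-minute silence threshold, and the default concurrency limit of 1 are all assumptions made for this example. It flags workflow-instances that have sat in the “executing” state without an update for too long, and lists the organizations whose allowed concurrency slots are all occupied by such instances.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkflowInstance:
    # Hypothetical record shape; not the Automation App's real schema.
    instance_id: str
    organization: str
    state: str
    last_update: datetime


def find_stuck_instances(instances, now, max_silence=timedelta(minutes=30)):
    """Return instances in the "executing" state with no update for longer than max_silence."""
    return [
        inst for inst in instances
        if inst.state == "executing" and now - inst.last_update > max_silence
    ]


def blocked_organizations(instances, now, concurrency_limit=1,
                          max_silence=timedelta(minutes=30)):
    """Return organizations whose stuck instances fill every allowed concurrency slot."""
    stuck_per_org = {}
    for inst in find_stuck_instances(instances, now, max_silence):
        stuck_per_org[inst.organization] = stuck_per_org.get(inst.organization, 0) + 1
    # An organization's queue cannot progress once stuck instances use up its limit.
    return sorted(org for org, count in stuck_per_org.items() if count >= concurrency_limit)
```

A check along these lines could run on a schedule and raise an alert whenever `blocked_organizations` returns a non-empty list. The appropriate threshold would depend on how long legitimate workflows normally run.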

Posted Feb 03, 2023 - 14:18 PST

Resolved
This incident has been resolved.
Posted Feb 03, 2023 - 13:01 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 03, 2023 - 12:20 PST
Investigating
Our engineering team is investigating an issue affecting the Automation App. Customers may notice that they can't run more than 1 workflow-instance in parallel. We are actively investigating this issue.
Posted Feb 03, 2023 - 11:02 PST
This incident affected: IBM-Aspera API Services (api.ibmaspera.com).