This has been traced to an issue with a managed database solution. The managed database runs in active-passive mode, with regular failovers between cluster members triggered by events such as updates, configuration changes, or crashes. These failovers happen without direct management by the IBM Aspera SRE team. A failover occurred on Feb 2 at 10:00 PST. After the failover, the newly active instance did not meet performance expectations: queries took too long and hit the timeout limits configured for the Automation App sub-component that uses the database. The timed-out database queries caused other internal processes to crash, which had two effects. First, some workflows became stuck in an “executing” state because update messages could not be received. Second, for some organizations the sub-component could not manage more than one workflow-instance at a time. As a result, some queues did not progress at all, because the single workflow-instance that was allowed to run was stuck and could not progress without intervention.
To resolve the situation, the IBM Aspera SRE team forced a failover on the managed database, cancelled the stuck workflow-instances, and temporarily increased the allowed number of concurrent workflow-instances so that queues that had fallen behind could catch up. Configuration changes for the managed database are being considered, and additional monitoring will be put in place for the symptoms of the issue.
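
As an illustration of the kind of symptom monitoring mentioned above, the following is a minimal sketch of a check for workflow-instances that have sat in the “executing” state without an update for too long. It is not the monitoring the IBM Aspera SRE team has implemented or plans to implement. The `WorkflowInstance` record, its field names, the function names, and the 30-minute threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkflowInstance:
    # Hypothetical record shape; the field names are assumptions for illustration.
    instance_id: str
    organization: str
    state: str
    last_update: datetime


def find_stuck_instances(instances, now, max_silence=timedelta(minutes=30)):
    """Return instances in the "executing" state whose last update is older than max_silence."""
    return [
        inst
        for inst in instances
        if inst.state == "executing" and now - inst.last_update > max_silence
    ]


def organizations_with_stuck_instances(instances, now, max_silence=timedelta(minutes=30)):
    """Return the sorted names of organizations that have at least one stuck instance."""
    stuck = find_stuck_instances(instances, now, max_silence)
    return sorted({inst.organization for inst in stuck})
```

A check like this could run on a schedule and raise an alert when the returned list is non-empty. Grouping by organization matters because, when only one workflow-instance may run at a time, a single stuck instance is enough to halt that organization's queue.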