[IBM-Aspera] - [AoC API] - Service Disruption - Activity And Automation App
Incident Report for IBM-Aspera Service Status
Postmortem

The issue was caused by a database problem. In short, our Activity App database, which is a Cassandra cluster, crossed a threshold (more than 2 billion cells) at which individual nodes in the cluster started to be killed for running out of memory (OOM-killed). We addressed this yesterday by upgrading to the latest patch release of the Cassandra version we are on, which raised this limit to 4 billion cells. This morning, however, we continued to see some Cassandra nodes being OOM-killed. We increased memory and CPU limits, but upon further investigation found that changing the disk access mode from mmap to mmap_index_only appears to be the configuration change we needed. With this change, larger SSTable files are no longer loaded into memory, which has stabilized the cluster.
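
For readers who run their own Cassandra clusters, the disk access mode change described above can be sketched as a small edit to the node configuration. This is only an illustration, not the exact procedure we followed: it assumes the mode is set through a top-level `disk_access_mode` key in `cassandra.yaml`, and the helper name `set_disk_access_mode` is hypothetical.

```python
import re

# Assumption: the mode lives in a top-level `disk_access_mode` key of cassandra.yaml.
_KEY = re.compile(r"^[ \t]*disk_access_mode[ \t]*:.*$", re.MULTILINE)


def set_disk_access_mode(yaml_text: str, mode: str = "mmap_index_only") -> str:
    """Return the config text with disk_access_mode set to `mode`.

    Replaces the first existing `disk_access_mode:` line, or appends one
    if the key is not present. Hypothetical helper for illustration only.
    """
    line = f"disk_access_mode: {mode}"
    if _KEY.search(yaml_text):
        return _KEY.sub(line, yaml_text, count=1)
    # Key absent: append it, adding a newline separator only when needed.
    sep = "" if yaml_text.endswith("\n") or not yaml_text else "\n"
    return f"{yaml_text}{sep}{line}\n"


if __name__ == "__main__":
    before = "cluster_name: example\ndisk_access_mode: mmap\n"
    print(set_disk_access_mode(before))
```

A change like this would typically be followed by a restart of each node, one at a time, so that the new mode takes effect without taking the whole cluster offline.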

We have run our current Cassandra configuration unchanged for a couple of years. We are investigating further to see whether additional improvements can be made.

Posted Nov 16, 2023 - 16:10 PST

Resolved
The issue seems to be resolved.
Posted Nov 16, 2023 - 15:59 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 16, 2023 - 15:00 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 16, 2023 - 13:31 PST
Investigating
Our engineering team is investigating an issue affecting the AoC Automation and Activity Apps. Our backend database seems to have run into another issue, and we are investigating the root cause.
Posted Nov 16, 2023 - 11:27 PST
This incident affected: IBM-Aspera API Services (api.ibmaspera.com).