The issue was caused by a database problem. To make a long story short, our Activity App database, which is a Cassandra cluster, reached a certain threshold (> 2 billion cells) that started to cause individual nodes in the cluster to be Out of Memory (OOM) killed. We resolved this issue yesterday by upgrading to the latest patch release of the version of Cassandra we are on, which raised this limit to 4 billion cells. This morning, however, we continued to observe some Cassandra nodes being OOM killed. We increased memory and CPU limits, but upon further investigation found that changing the disk access mode to mmap_index_only seems to be the config change that we needed. This change makes it so larger SSTable files are not loaded into memory, which has stabilized the cluster.
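For reference, Cassandra reads this setting from cassandra.yaml under the key disk_access_mode. The fragment below is only a sketch of what such a change looks like, not a copy of our actual configuration; the file path in the comment is an assumption and varies by install. With mmap_index_only, only the SSTable index files are memory-mapped, and the data files are read through standard buffered I/O.

```yaml
# cassandra.yaml (path assumed; commonly /etc/cassandra/cassandra.yaml)
# Accepted values include: auto, mmap, mmap_index_only, standard.
# mmap_index_only memory-maps only index files, not SSTable data files.
disk_access_mode: mmap_index_only
```

A change like this only takes effect after a node restart, so it is typically rolled out one node at a time.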
We have run our current configuration of Cassandra unchanged for a couple of years. We are investigating further to see whether we can make additional improvements.