[IBM-Aspera] - [IBM: WDC] - Service Disruption
Incident Report for IBM-Aspera Service Status
Postmortem

Incident Description
During this incident, impacted IBM Cloud Database (ICD) customers in the Washington DC Region were unable to connect to
database instances with connection strings ending in blrrvkdw0thh68l98t20.databases.appdomain.cloud or
blrrvkdw0thh68l98t20.private.databases.appdomain.cloud.

This incident also impacted other IBM Cloud Services including:

  • Containers-Kubernetes - impacted customers would have seen delays in provisioning workers for new or existing clusters,
    and in replacing, reloading, or deleting existing workers. Kubernetes workloads using previously provisioned
    infrastructure resources were unaffected.
  • Watson Conversations - impacted customers experienced intermittent RC5xx errors when using the Watson Assistant
    service.

On 28 April 2023 at 03:28 UTC, IBM Cloud Specialists were alerted by monitoring to potential issues with database instances in
Washington DC. Initial troubleshooting found that none of the front-end pods that direct traffic to the databases were running. This became the focus of the investigation.
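
The condition described above (front-end pods present, but none of them running) can be expressed as a simple check. The following is a minimal sketch, not the monitoring IBM used: the pod names, the dictionary keys, and the function name are hypothetical, and the only outside fact relied on is that Kubernetes reports a pod phase such as "Pending" or "Running".

```python
from collections import Counter

# In Kubernetes, only pods in the "Running" phase can serve traffic.
SERVING_PHASE = "Running"


def summarize_front_end(pods):
    """Count pods by phase and flag the case where none can serve traffic.

    `pods` is a list of dicts with the hypothetical keys "name" and "phase".
    """
    phases = Counter(pod["phase"] for pod in pods)
    return {
        "by_phase": dict(phases),
        "serving": phases[SERVING_PHASE],
        # A total outage means pods exist but none of them is running.
        "total_outage": len(pods) > 0 and phases[SERVING_PHASE] == 0,
    }


# Illustrative data only; these pod names are made up.
pods = [
    {"name": "front-end-0", "phase": "Pending"},
    {"name": "front-end-1", "phase": "Pending"},
    {"name": "front-end-2", "phase": "Pending"},
]
report = summarize_front_end(pods)
print(report)
```

A check of this kind separates a partial degradation (some pods serving) from the state seen in this incident, where no database connections were possible.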

Root Cause Information

Root cause investigation determined that a pre-deployment step to create new worker pools was missed before a planned worker maintenance. As a result, the front-end pods could not be scheduled.
To prevent a recurrence of this type of issue, IBM Cloud Specialists will update procedures to add
safeguards against a worker pool not being created.
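
One form such a safeguard could take is a pre-flight check that blocks the maintenance when the replacement worker pool is missing or has too little capacity. This is a sketch under stated assumptions rather than IBM's actual procedure: the pool names, the `preflight_check` function, and the ready-worker threshold are all hypothetical.

```python
def preflight_check(worker_pools, required_pool, min_ready_workers=1):
    """Return a list of problems that should block the maintenance.

    `worker_pools` maps a pool name to its number of ready workers
    (a hypothetical data shape chosen for this sketch).
    """
    problems = []
    if required_pool not in worker_pools:
        problems.append(f"worker pool '{required_pool}' has not been created")
    elif worker_pools[required_pool] < min_ready_workers:
        problems.append(
            f"worker pool '{required_pool}' has {worker_pools[required_pool]} "
            f"ready workers, need at least {min_ready_workers}"
        )
    return problems


# Hypothetical state matching the root cause: only the old pool exists.
pools = {"pool-old": 6}
problems = preflight_check(pools, "pool-new", min_ready_workers=3)

if problems:
    print("Maintenance blocked:")
    for problem in problems:
        print(" -", problem)
else:
    print("Pre-flight check passed")
```

Returning a list of problems instead of raising on the first one lets the operator see every blocking condition in a single run.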

Timeline

This timeline only reflects the alerting, troubleshooting, and mitigation of the top-level service degradation or disruption.
Additional dependent services or unique customer environments are not reflected here, and these might have experienced
impact outside the times listed.

The following sequence of events occurred:

  1. 27 April 2023 20:26 PDT - Planned maintenance applied a routine patch via automation, which caused the front-end pods to restart. Incident started.
  2. 27 April 2023 20:28 PDT - IBM Cloud Specialists were alerted by monitoring to potential issues with database instances in Washington DC.
  3. 27 April 2023 20:41 PDT - The planned maintenance was completed.
  4. 27 April 2023 21:45 PDT - IBM Cloud Specialists engaged SMEs.
  5. 27 April 2023 22:31 PDT - IBM Cloud Specialists identified an issue creating new workers and attempted to create the missing worker pool to mitigate the issue, but were unable to get any workers to provision.
  6. 27 April 2023 23:35 PDT - SMEs determined that the billing ledger database was not reachable, preventing provisioning and deprovisioning of all clusters, workers, dedicated hosts, and Satellite locations.
  7. 28 April 2023 00:10 PDT - Troubleshooting determined that all front-end pods on the ICD cluster were pending (not running). No database connections were possible in this state.
  8. 28 April 2023 00:39 PDT - IBM Cloud Specialists re-enabled an older disabled worker pool to allow the pods to schedule and run.
  9. 28 April 2023 01:06 PDT - All alerts for this issue were resolved. ICD incident ended.
  10. 28 April 2023 02:00 PDT - Due to the extended database disruption, all ATS nodes had entered an error state. Aspera on Cloud (AoC) DevOps engineers were also receiving a range of alerts and were working on troubleshooting. Despite the ICD issue resolution, database connection errors were still appearing because auto-connections
  11. 28 April 2023 05:51 PDT - AoC engineers fully resolved all issues caused by the extended database outage.

Service Restoration

IBM Cloud Specialists re-enabled an older worker pool, allowing the pods to be scheduled. This mitigated the impact and ended the incident on 28 April 2023 at 08:06 UTC.
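
The 08:06 restoration time lines up with the timeline's 01:06 PDT resolution entry when it is read as UTC. The short sketch below shows the conversion. It assumes a fixed UTC-7 offset for Pacific Daylight Time, and the helper name `pdt_to_utc` is introduced here for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Pacific Daylight Time is UTC-7; a fixed offset avoids needing tz data.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(year, month, day, hour, minute):
    """Convert a PDT wall-clock time to an aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=PDT)
    return local.astimezone(timezone.utc)


# Timeline entry: all ICD alerts resolved at 01:06 PDT on 28 April 2023.
resolved_utc = pdt_to_utc(2023, 4, 28, 1, 6)
print(resolved_utc.isoformat())
```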

Completed and Future Actions
The IBM Cloud team has analyzed this incident for areas of improvement, including issue detection, identification and future
mitigation. The following specific improvements were identified:

  • Description: Update procedures to provide additional protection/checks from a worker node pool not being created
    Due Date: 30 June 2023

Posted May 10, 2023 - 13:48 PDT

Resolved
This incident has been resolved.
Posted Apr 28, 2023 - 05:51 PDT

Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 28, 2023 - 05:44 PDT

Identified
The issue has been identified and a fix is being implemented.
Posted Apr 28, 2023 - 05:23 PDT

Investigating
We have been alerted to a service disruption affecting: ATS IBM Cloud Washington D.C. Our engineers are currently investigating the incident and will provide updates when more information is available.
Posted Apr 28, 2023 - 00:15 PDT

This incident affected: IBM Cloud Transfer Clusters (Washington D.C. (wdc)).