Incident Description
During this incident, impacted IBM Cloud Database (ICD) customers in the Washington DC Region were unable to connect to
database instances with connection strings ending in blrrvkdw0thh68l98t20.databases.appdomain.cloud or
blrrvkdw0thh68l98t20.private.databases.appdomain.cloud.
This incident also impacted other IBM Cloud Services including:
- Containers-Kubernetes - impacted customers would have seen delays in provisioning workers for new or existing clusters,
and delays in replacing, reloading, or deleting existing workers of clusters. Kubernetes workloads using previously provisioned
infrastructure resources were unaffected.
- Watson Conversations - impacted customers experienced intermittent RC5xx errors when using the Watson Assistant
service.
On 28 April 2023 at 03:28 UTC, IBM Cloud Specialists were alerted by monitoring to potential issues with database instances in
Washington DC. Initial troubleshooting found that none of the front-end pods that direct traffic to the databases were running. This became the focus of the investigation.
Root Cause Information
Root cause investigation determined that a pre-deployment step to create new worker pools was missed before a planned worker maintenance. As a result, the front-end pods could not be scheduled once the maintenance restarted them.
To prevent a recurrence of this type of issue, IBM Cloud Specialists will update procedures to add safeguards against
a worker node pool not being created.
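The kind of safeguard described above can be sketched as a pre-maintenance check. This is a minimal illustration only, not the procedure IBM uses: the `WorkerPool` type, the `preflight_check` function, and the pool names are hypothetical, and the check works on a plain in-memory snapshot rather than any real IBM Cloud or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPool:
    """Hypothetical snapshot of a worker pool; not an IBM Cloud API type."""
    name: str
    enabled: bool
    ready_workers: int


def schedulable_capacity(pools, exclude=()):
    """Count ready workers in enabled pools, skipping pools about to be drained."""
    return sum(
        p.ready_workers
        for p in pools
        if p.enabled and p.name not in exclude
    )


def preflight_check(pools, pools_under_maintenance, required_workers):
    """Return (ok, reason); ok is False if maintenance would leave pods unschedulable."""
    remaining = schedulable_capacity(pools, exclude=pools_under_maintenance)
    if remaining < required_workers:
        return False, (
            f"only {remaining} ready worker(s) outside the maintenance set; "
            f"{required_workers} required"
        )
    return True, "ok"


# Assumed scenario: the replacement pool was never created, and the only
# other pool is an older one that has been disabled.
pools = [
    WorkerPool("current", enabled=True, ready_workers=3),
    WorkerPool("older", enabled=False, ready_workers=3),
]
ok, reason = preflight_check(pools, {"current"}, required_workers=3)
print(ok, reason)
```

In this assumed scenario the check fails before maintenance starts, because no enabled pool outside the maintenance set has ready workers. Re-enabling the older pool in the snapshot makes the same check pass.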
Timeline
This timeline only reflects the alerting, troubleshooting, and mitigation of the top-level service degradation or disruption.
Additional dependent services or unique customer environments are not reflected here, and these might have experienced
impact outside the times listed.
The following sequence of events occurred:
- 27 April 2023 08:26 PM PDT - Planned maintenance was performed to apply a routine patch via automation, which caused the front-end pods to restart. Incident started.
- 27 April 2023 08:28 PM PDT - IBM Cloud Specialists were alerted by monitoring to potential issues with database instances in Washington DC.
- 27 April 2023 08:41 PM PDT - The planned maintenance was completed.
- 27 April 2023 09:45 PM PDT - IBM Cloud Specialists engaged SMEs.
- 27 April 2023 10:31 PM PDT - IBM Cloud Specialists identified an issue creating new workers and attempted to create the missing worker pool to mitigate it, but were unable to get any workers to provision.
- 27 April 2023 11:35 PM PDT - SMEs determined that the billing ledger database was not reachable, which prevented provisioning and deprovisioning of all clusters, workers, dedicated hosts and Satellite locations.
- 28 April 2023 12:10 AM PDT - Troubleshooting determined that all front-end pods on the ICD cluster were in a non-running (pending) state. No database connections were possible in this state.
- 28 April 2023 12:39 AM PDT - IBM Cloud Specialists re-enabled an older disabled worker pool to allow the pods to schedule and run.
- 28 April 2023 01:06 AM PDT - All alerts for this issue were resolved. ICD Incident ended.
- 28 April 2023 02:00 AM PDT - Due to the extended database disruption, all ATS nodes had entered an error state. Aspera on Cloud (AoC) DevOps engineers were also receiving a variety of alerts and were troubleshooting them. Despite the ICD issue resolution, database connection errors were still appearing because auto-connections
- 28 April 2023 05:51 AM PDT - AoC engineers fully resolved all issues caused by the extended database outage.
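The timeline uses PDT, while the alert time (03:28) and restoration time (08:06) quoted elsewhere in this report line up with UTC. The sketch below shows the conversion under two assumptions: the timeline entries are 12-hour clock times running from the evening of 27 April into the morning of 28 April, and PDT is a fixed offset of UTC-7. The helper name `pdt_to_utc` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset avoids depending on a time zone database.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(stamp):
    """Parse a '27 April 2023 08:28 PM' style PDT timestamp and return it in UTC."""
    local = datetime.strptime(stamp, "%d %B %Y %I:%M %p").replace(tzinfo=PDT)
    return local.astimezone(timezone.utc)


start = pdt_to_utc("27 April 2023 08:26 PM")  # incident started
alert = pdt_to_utc("27 April 2023 08:28 PM")  # monitoring alert
end = pdt_to_utc("28 April 2023 01:06 AM")    # ICD incident ended

print(alert.strftime("%d %B %Y %H:%M UTC"))   # 28 April 2023 03:28 UTC
print(end.strftime("%d %B %Y %H:%M UTC"))     # 28 April 2023 08:06 UTC
print(end - start)                            # 4:40:00
```

Under these assumptions the ICD portion of the incident lasted 4 hours 40 minutes, from maintenance start to alert resolution.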
Service Restoration
IBM Cloud Specialists re-enabled an older worker pool, which allowed the front-end pods to be scheduled. This mitigated the impact and ended the incident on 28 April 2023 at 08:06 UTC.
Completed and Future Actions
The IBM Cloud team has analyzed this incident for areas of improvement, including issue detection, identification and future
mitigation. The following specific improvements were identified:
| Description | Due Date |
| --- | --- |
| Update procedures to provide additional protection/checks from a worker node pool not being created | 30 June 2023 |