This page describes how to resolve issues with AlloyDB Omni for Kubernetes when you encounter the following situations:
- You have a crashlooping database cluster in Kubernetes, and you know the name and namespace of the database, its internal instance, and the pod.
- You need to access a file on the database pod, but the crash loop prevents access.
A crashlooping database pod is a pod that repeatedly starts, crashes, and is then automatically restarted by Kubernetes. This cycle continues indefinitely until the underlying issue is resolved. The database becomes unavailable and unstable during this process.
Before you begin
Before you follow the troubleshooting guidance in this document, try these solutions:
- If high availability (HA) is available, trigger a failover. With HA, a synchronous standby instance is ready to replace the primary instance.
- Recover from a backup, if one is available. This approach is automated and less error prone.
Cannot access files on a crashlooping database pod
Description: A database pod in your Kubernetes cluster enters a
CrashLoopBackOff state. This state prevents you from using kubectl exec to
access the database container's file system for debugging.
Recommended fix: To resolve this issue, temporarily stop the crash loop by
patching the pod's StatefulSet to override the container's command. Before
you proceed, consider simpler alternatives, such as triggering a failover or
recovering from a backup, if available.
To access the pod, follow these steps:
Pause all related user-facing AlloyDB Omni resources to prevent them from interfering with the recovery process.
List all AlloyDB Omni Kubernetes custom resource definitions:
kubectl get crd | grep alloydbomni.dbadminReview the list of resources for your database cluster. Keep a list of the resources that you pause so that you can unpause them later. At a minimum, you must pause the
dbclusters.alloydbomni.dbadmin.googresource.Pause each resource separately by adding the
ctrlfwk.internal.gdc.goog/pauseannotation.kubectl annotate RESOURCE_TYPE -n NAMESPACE RESOURCE_NAME ctrlfwk.internal.gdc.goog/pause=trueReplace the following:
RESOURCE_TYPE: the type of the resource to pause, for example,dbclusters.alloydbomni.dbadmin.goog.NAMESPACE: the namespace of the database cluster.RESOURCE_NAME: the name of the resource to pause.
Pause the internal instance that manages the crashlooping pod.
Find the name of the internal instance.
kubectl get instances.alloydbomni.internal.dbadmin.goog -n NAMESPACEReplace the following:
NAMESPACE: the name of the namespace that you're interested in.
Pause the instance. The
ctrlfwk.internal.gdc.goog/pause=trueannotation tells the AlloyDB Omni controller to pause its management of the resource. This prevents the controller from automatically reconciling or changing the resource while you're manually performing troubleshooting steps, such as patching the underlyingStatefulSet.kubectl annotate instances.alloydbomni.internal.dbadmin.goog -n NAMESPACE INSTANCE_NAME ctrlfwk.internal.gdc.goog/pause=trueReplace the following:
NAMESPACE: the name of the namespace that you're interested in.INSTANCE_NAME: the name of the instance that you found.
Prevent the pod from crashlooping.
Patch the pod's
StatefulSetto override the main container's command. This change stops the application from starting and crashing, which lets you access the pod.Confirm the name of the crashlooping pod.
kubectl get pod -n NAMESPACEReplace
NAMESPACE, which is the name of the namespace that you're interested in.Find the name of the
StatefulSetthat owns the pod.kubectl get pod -n NAMESPACE POD_NAME -ojsonpath='{.metadata.ownerReferences[0].name}'Replace the following:
NAMESPACE: the name of the namespace that you're interested in.POD_NAME: the name of the crashlooping pod.
Patch the
StatefulSetto replace the database container's command withsleep infinity:kubectl patch statefulset STATEFULSET_NAME -n NAMESPACE --type='strategic' -p ' { "spec": { "template": { "spec": { "containers": [ { "name": "database", "command": ["sleep"], "args": ["infinity"], "livenessProbe": null, "startupProbe": null } ] } } } } 'Replace the following:
NAMESPACE: the name of the namespace that you're interested in.STATEFULSET_NAME: the name of theStatefulSet.
Delete the pod to apply the patch.
kubectl delete pod -n NAMESPACE POD_NAMEReplace the following:
NAMESPACE: the name of the namespace that you're interested in.POD_NAME: the name of the crashlooping pod.
Access the database in the pod.
kubectl exec -ti -n NAMESPACE POD_NAME -c database -- bashReplace the following:
NAMESPACE: the name of the namespace that you're interested in.POD_NAME: the name of the crashlooping pod.
After the new pod starts, it runs the
sleepcommand instead of the database application. You can now access the pod's shell.Unpause all components.
After investigating, restore the original state by removing the
pauseannotations. You unpause the resources in the reverse order of pausing them; for example, unpause the instance before you unpause the DBCluster. This action reconciles the resources and restarts the database pod with its original command.For each resource that you paused, remove the annotation by appending a hyphen to the annotation name. The following example removes the annotation from an instance:
kubectl annotate instances.alloydbomni.internal.dbadmin.goog -n NAMESPACE INSTANCE_NAME ctrlfwk.internal.gdc.goog/pause-Replace the following:
NAMESPACE: the name of the namespace that you're interested in.INSTANCE_NAME: the name of the instance that you found in the previous step.
Repeat this process for all other resources that you paused in the first step.