Troubleshoot a crashlooping database pod in AlloyDB Omni for Kubernetes

Select a documentation version:

This page describes how to resolve issues with AlloyDB Omni for Kubernetes when you encounter the following situations:

  • You have a crashlooping database cluster in Kubernetes, and you know the name and namespace of the database, its internal instance, and the pod.
  • You need to access a file on the database pod, but the crash loop prevents access.

A crashlooping database pod is a pod that repeatedly starts, crashes, and is then automatically restarted by Kubernetes. This cycle continues indefinitely until the underlying issue is resolved. The database becomes unavailable and unstable during this process.

Before you begin

Before you follow the troubleshooting guidance in this document, try these solutions:

  • If high availability (HA) is available, trigger a failover. With HA, a synchronous standby instance is ready to replace the primary instance.
  • Recover from a backup, if one is available. This approach is automated and less error prone.

Cannot access files on a crashlooping database pod

Description:A database pod in your Kubernetes cluster enters a CrashLoopBackOff state. This state prevents you from using kubectl exec to access the database container's file system for debugging.

Recommended fix:To resolve this issue, temporarily stop the crash loop by patching the pod's StatefulSet to override the container's command. Before you proceed, consider simpler alternatives, such as triggering a failover or recovering from a backup, if available.

To access the pod, follow these steps:

  1. Pause all related user-facing AlloyDB Omni resources to prevent them from interfering with the recovery process.

    1. List all AlloyDB Omni Kubernetes custom resource definitions:

       kubectl  
      get  
      crd  
       | 
        
      grep  
      alloydbomni.dbadmin 
      
    2. Review the list of resources for your database cluster. Keep a list of the resources that you pause so that you can unpause them later. At a minimum, you must pause the dbclusters.alloydbomni.dbadmin.goog resource.

    3. Pause each resource separately by adding the ctrlfwk.internal.gdc.goog/pause annotation.

       kubectl  
      annotate  
       RESOURCE_TYPE 
        
      -n  
       NAMESPACE 
        
       RESOURCE_NAME 
        
      ctrlfwk.internal.gdc.goog/pause = 
       true 
       
      

      Replace the following:

      • RESOURCE_TYPE : the type of the resource to pause, for example, dbclusters.alloydbomni.dbadmin.goog .
      • NAMESPACE : the namespace of the database cluster.
      • RESOURCE_NAME : the name of the resource to pause.
  2. Pause the internal instance that manages the crashlooping pod.

    1. Find the name of the internal instance.

       kubectl  
      get  
      instances.alloydbomni.internal.dbadmin.goog  
      -n  
       NAMESPACE 
       
      

      Replace the following:

      • NAMESPACE : the name of the namespace that you're interested in.
    2. Pause the instance. The ctrlfwk.internal.gdc.goog/pause=true annotation tells the AlloyDB Omni controller to pause its management of the resource. This prevents the controller from automatically reconciling or changing the resource while you're manually performing troubleshooting steps, such as patching the underlying StatefulSet .

       kubectl  
      annotate  
      instances.alloydbomni.internal.dbadmin.goog  
      -n  
       NAMESPACE 
        
       INSTANCE_NAME 
        
      ctrlfwk.internal.gdc.goog/pause = 
       true 
       
      

      Replace the following:

      • NAMESPACE : the name of the namespace that you're interested in.
      • INSTANCE_NAME : the name of the instance that you found.
  3. Prevent the pod from crashlooping.

    Patch the pod's StatefulSet to override the main container's command. This change stops the application from starting and crashing, which lets you access the pod.

    1. Confirm the name of the crashlooping pod.

       kubectl  
      get  
      pod  
      -n  
       NAMESPACE 
       
      

      Replace NAMESPACE , which is the name of the namespace that you're interested in.

    2. Find the name of the StatefulSet that owns the pod.

       kubectl  
      get  
      pod  
      -n  
       NAMESPACE 
        
       POD_NAME 
        
      -ojsonpath = 
       '{.metadata.ownerReferences[0].name}' 
       
      

      Replace the following:

      • NAMESPACE : the name of the namespace that you're interested in.
      • POD_NAME : the name of the crashlooping pod.
    3. Patch the StatefulSet to replace the database container's command with sleep infinity :

       kubectl  
      patch  
      statefulset  
       STATEFULSET_NAME 
        
      -n  
       NAMESPACE 
        
      --type = 
       'strategic' 
        
      -p  
       ' 
       { 
       "spec": { 
       "template": { 
       "spec": { 
       "containers": [ 
       { 
       "name": "database", 
       "command": ["sleep"], 
       "args": ["infinity"], 
       "livenessProbe": null, 
       "startupProbe": null 
       } 
       ] 
       } 
       } 
       } 
       } 
       ' 
       
      

      Replace the following:

      • NAMESPACE : the name of the namespace that you're interested in.
      • STATEFULSET_NAME : the name of the StatefulSet .
    4. Delete the pod to apply the patch.

       kubectl  
      delete  
      pod  
      -n  
       NAMESPACE 
        
       POD_NAME 
       
      

      Replace the following:

      • NAMESPACE : the name of the namespace that you're interested in.
      • POD_NAME : the name of the crashlooping pod.
  4. Access the database in the pod.

     kubectl  
     exec 
      
    -ti  
    -n  
     NAMESPACE 
      
     POD_NAME 
      
    -c  
    database  
    --  
    bash 
    

    Replace the following:

    • NAMESPACE : the name of the namespace that you're interested in.
    • POD_NAME : the name of the crashlooping pod.

    After the new pod starts, it runs the sleep command instead of the database application. You can now access the pod's shell.

  5. Unpause all components.

    After investigating, restore the original state by removing the pause annotations. You unpause the resources in the reverse order of pausing them; for example, unpause the instance before you unpause the DBCluster. This action reconciles the resources and restarts the database pod with its original command.

    1. For each resource that you paused, remove the annotation by appending a hyphen to the annotation name. The following example removes the annotation from an instance:

       kubectl  
      annotate  
      instances.alloydbomni.internal.dbadmin.goog  
      -n  
       NAMESPACE 
        
       INSTANCE_NAME 
        
      ctrlfwk.internal.gdc.goog/pause- 
      

      Replace the following:

      • NAMESPACE : the name of the namespace that you're interested in.
      • INSTANCE_NAME : the name of the instance that you found in the previous step.
    2. Repeat this process for all other resources that you paused in the first step.

Design a Mobile Site
View Site in Mobile | Classic
Share by: