This document lists troubleshooting documents for common issues that you might
encounter when using Google Kubernetes Engine (GKE). Whether you're diagnosing
workload errors like ImagePullBackOff and CrashLoopBackOff, debugging
cluster autoscaling behavior, resolving PersistentVolume issues, or
troubleshooting node registration problems, the documents listed here can help.
If you're new to troubleshooting in GKE, start with Introduction to troubleshooting.
To diagnose and resolve issues you encounter, see the documents in the following sections:
- Cluster setup
- Autoscaling
- Storage
- Cluster security
- Workloads
- Cluster management
- Monitoring
- 4xx errors
- Known issues
To troubleshoot GKE networking, see Troubleshoot GKE networking in the GKE networking documentation.
This document is for Admins and architects, Security specialists, Networking specialists, or Storage specialists who troubleshoot GKE configurations. To learn more about GKE roles, see Common GKE user roles and tasks.
Introduction to troubleshooting
| Topic | Description | 
|---|---|
| Introduction to GKE troubleshooting | Get started troubleshooting GKE by learning about the overall process and fundamental concepts. | 
| Review service health and incidents | Learn how to check the health of GKE and related Google Cloud services to exclude platform issues. | 
| Assess cluster and workload health in the Google Cloud console | Learn how to use the Google Cloud console to investigate and resolve GKE issues. | 
| Investigate a cluster's state with kubectl | Explore common kubectl commands and techniques for diagnosing problems in your clusters and workloads. | 
| Conduct historical analysis with Cloud Logging | Understand how to effectively use Cloud Logging to find root causes of issues in GKE. | 
| Perform proactive monitoring with Cloud Monitoring | Utilize Cloud Monitoring dashboards and metrics to identify, diagnose, and resolve GKE issues. | 
| Accelerate diagnosis with Gemini Cloud Assist | Discover how Gemini can assist in diagnosing and resolving GKE problems. | 
| Put it all together: Example troubleshooting scenario | Follow a step-by-step example of troubleshooting a common scenario in GKE. | 
Cluster setup
| Topic | Description | 
|---|---|
| Cluster creation | Resolve issues with creating clusters. | 
| Autopilot clusters | Diagnose and troubleshoot GKE Autopilot clusters, including cluster creation, namespace deletion, scaling, and workload issues. | 
| Kubectl command-line tool | Troubleshoot the kubectl command-line tool in GKE, including issues with authentication and authorization. This page also includes advice on how to troubleshoot the Konnectivity proxy to check if it's causing the kubectl logs, attach, exec, or port-forward commands to stop responding. | 
| Standard node pools | Troubleshoot GKE Standard node pools, including issues with node pool creation, best-effort provisioning, corrupted instance metadata, and migrating workloads to new node pools. | 
| Node registration | Troubleshoot issues that occur when adding nodes to your GKE Standard cluster, such as node registration failures and missing prerequisites for successful node registration. | 
| Container runtime | Troubleshoot container runtimes in GKE, including issues with containerd and dockershim, and private registries. | 
Autoscaling
| Topic | Description | 
|---|---|
| Cluster autoscaler not scaling down | Diagnose and resolve common reasons your cluster isn't removing underutilized nodes. Learn how to check for issues like restrictive PodDisruptionBudgets, Pods with local storage, or specific annotations (for example, "cluster-autoscaler.kubernetes.io/safe-to-evict": "false") that prevent node eviction. | 
| Cluster autoscaler not scaling up | Learn why the cluster autoscaler isn't adding new nodes to meet demand. Check for unschedulable Pods, verify that you haven't hit cluster or node pool size limits, and identify potential resource quota or regional VM availability issues. | 
| Horizontal Pod autoscaling | Troubleshoot problems with the Horizontal Pod Autoscaler not scaling your application's Pod replicas. Resolve common issues, such as misconfigured HorizontalPodAutoscaler objects or problems with the metrics pipeline. | 
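
As an illustration of the annotation check mentioned in the scale-down row above, the following is a minimal Python sketch, not an official GKE tool. It assumes you have already loaded Pod objects as dictionaries, for example from the `items` array of `kubectl get pods -A -o json` output. The function names and the sample data are hypothetical; only the annotation key and value come from this document.

```python
from typing import Any, Dict, List

# Annotation key quoted in the autoscaling table; the value "false" blocks eviction.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def blocks_scale_down(pod: Dict[str, Any]) -> bool:
    """Return True if the Pod carries the safe-to-evict annotation set to "false"."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    return str(annotations.get(SAFE_TO_EVICT, "")).lower() == "false"


def pods_blocking_scale_down(pods: List[Dict[str, Any]]) -> List[str]:
    """Return "namespace/name" for every Pod that opts out of eviction."""
    blocked = []
    for pod in pods:
        if blocks_scale_down(pod):
            meta = pod["metadata"]
            namespace = meta.get("namespace", "default")
            name = meta.get("name", "<unnamed>")
            blocked.append(f"{namespace}/{name}")
    return blocked


# Hypothetical sample data shaped like Pod objects in kubectl JSON output.
sample_pods = [
    {"metadata": {"name": "cache-0", "namespace": "prod",
                  "annotations": {SAFE_TO_EVICT: "false"}}},
    {"metadata": {"name": "web-1", "namespace": "prod",
                  "annotations": {SAFE_TO_EVICT: "true"}}},
    {"metadata": {"name": "job-2", "namespace": "batch"}},
]

print(pods_blocking_scale_down(sample_pods))
```

This sketch covers only the annotation case. PodDisruptionBudgets and Pods with local storage, which the same row mentions, need separate checks that are described in the linked troubleshooting document.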
Storage
| Topic | Description | 
|---|---|
| Storage | Troubleshoot storage, including issues with regional persistent disks, disk performance, and volume expansion. | 
Cluster security
| Topic | Description | 
|---|---|
| Authentication | Troubleshoot authentication in GKE, including issues with RBAC, Workload Identity Federation for GKE, and the GKE metadata server. | 
| Service accounts | Troubleshoot service accounts, including restoring the default service account and enabling the Compute Engine default service account. | 
| Application-layer secrets | Troubleshoot issues that can occur when configuring application-layer secrets encryption, including failed updates and errors where you're unable to use a Cloud KMS key or where the Cloud KMS key version was destroyed. | 
Cluster's root Certificate Authority expiring soon
| Topic | Description | 
|---|---|
| Root Certificate Authority (CA) expiring | If your cluster's root Certificate Authority (CA) is expiring soon, learn how to perform a credential rotation to prevent normal cluster operations from being interrupted. | 
Workloads
| Topic | Description | 
|---|---|
| Deployed workloads | Troubleshoot errors for workloads running in a GKE cluster, including PodUnschedulable. Read the PodUnschedulable section for advice on errors like MatchNodeSelector and "Does not have minimum availability". | 
| Image pulls | Troubleshoot image pulls. Learn what causes statuses like ImagePullBackOff and ErrImagePull and how to resolve these statuses by fixing common issues like authentication and network connectivity. | 
| CrashLoopBackOff events | Troubleshoot CrashLoopBackOff events in GKE. Diagnose issues like resource exhaustion, app misconfigurations, and liveness probe failures. | 
| OOM events | Troubleshoot Kubernetes Out of Memory (OOM) events. Identify causes, distinguish event types, and apply effective solutions for both container- and node-level OOM kills. | 
| Arm workloads | Troubleshoot issues with Arm workloads, including Pods on Arm nodes crashing. | 
| TPUs | Troubleshoot TPUs, including issues with quota, node auto-provisioning, workload configuration, and scheduling. | 
| GPUs | Troubleshoot GPUs, including issues with GPU driver installation, device plugin errors, and container images. | 
Cluster management
| Topic | Description | 
|---|---|
| Cluster upgrades | Troubleshoot and resolve GKE cluster and node upgrade issues, including long or incomplete upgrades, unexpected auto-upgrades, failures, and post-upgrade problems. | 
| Webhooks | Understand how to troubleshoot and ensure the stability of your cluster control plane when using admission webhooks. | 
| Namespace stuck in the Terminating state | Troubleshoot issues with namespaces stuck in the Terminating state by identifying and removing the unhealthy components that are blocking deletion. | 
| Concurrent operations | Troubleshoot concurrent operations by learning how to identify these errors and resolve them by waiting for operations to complete. | 
Monitoring
| Topic | Description | 
|---|---|
| System metrics | Troubleshoot system metrics not appearing in Cloud Monitoring. | 
| Monitoring dashboards | Troubleshoot monitoring dashboards, including issues with enabling monitoring, missing Kubernetes resources, and permissions. | 
| Logging | Troubleshoot logging, including issues with enabling logging, missing logs, and quotas. | 
4xx errors
| Topic | Description | 
|---|---|
| 4xx errors | Troubleshoot some of the 400, 401, 403, and 404 errors that you might encounter when using GKE. This page also includes information on how to troubleshoot "missing edit permissions on account" errors. | 
Known issues
| Topic | Description | 
|---|---|
| Known issues | Identify and resolve known issues that might affect your use of GKE. | 
What's next
- If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
  - Opening a support case by contacting Cloud Customer Care.
  - Getting support from the community by asking questions on Stack Overflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
  - Opening bugs or feature requests by using the public issue tracker.

