Assess cluster and workload health in the Google Cloud console

Autopilot Standard

When you need to quickly check the health of your Google Kubernetes Engine (GKE) clusters and workloads, it can be hard to know where to start. Visualizing the health of your clusters and workloads in the Google Cloud console helps you quickly assess the state of your environment. Cluster health refers to the health of the underlying GKE infrastructure like nodes and networking, while workload health refers to the status and performance of your apps running on the cluster.

Use this page to learn how to navigate the Kubernetes clusters and workloads pages to get a high-level overview, identify potential issues (like nodes under resource pressure or failing Pods), and drill down into specific resources for more details.

This information is important for Platform admins and operators who are responsible for maintaining cluster stability and need to perform quick health assessments and resource checks. It's also essential for Application developers who need to understand the runtime status of their deployments and investigate failures. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

To provide a complete picture of your app's health, the Google Cloud console also gives you access to powerful logging and monitoring tools, letting you investigate the root cause of past failures and proactively prevent future ones. For more information about these tools, see Conduct historical analysis with Cloud Logging and Perform proactive monitoring with Cloud Monitoring.

Find cluster issues

The Kubernetes clusters page provides you with an overview of the health of your clusters. To identify problems with any of your clusters, start on this page.

To get started, in the Google Cloud console, go to the Kubernetes clusters page.

Here are some examples of how you can use this page for troubleshooting:

For advice on improving the health of your cluster, your upgrade strategy, and cost optimization, click View recommendations.
To identify unhealthy clusters, review the Status column. Any cluster that doesn't have a green checkmark needs attention.
To see potential issues, review the Notifications column. Click any notification messages for more information.
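
The same kind of status triage can be approximated outside the console. The following is a minimal sketch, not an official tool. It assumes that you saved the output of `gcloud container clusters list --format=json` and it treats any cluster whose `status` field isn't `RUNNING` as needing a closer look. The cluster names are hypothetical, and the mapping between this field and the green checkmark in the console is an assumption.

```python
import json
from typing import List, Tuple

# Hypothetical sample shaped like `gcloud container clusters list --format=json`
# output, trimmed to the two fields that this sketch reads.
SAMPLE_CLUSTERS = """
[
  {"name": "prod-a", "status": "RUNNING"},
  {"name": "staging-b", "status": "DEGRADED"},
  {"name": "dev-c", "status": "RECONCILING"}
]
"""


def clusters_needing_attention(clusters_json: str) -> List[Tuple[str, str]]:
    """Return (name, status) pairs for clusters that aren't RUNNING.

    Assumption: anything other than RUNNING deserves a look. A status such as
    RECONCILING can be normal during an upgrade, so treat the result as a
    prompt to investigate, not as proof of a failure.
    """
    clusters = json.loads(clusters_json)
    return [
        (cluster["name"], cluster.get("status", "STATUS_UNSPECIFIED"))
        for cluster in clusters
        if cluster.get("status") != "RUNNING"
    ]


if __name__ == "__main__":
    for name, status in clusters_needing_attention(SAMPLE_CLUSTERS):
        print(f"{name}: {status}")
```

For each cluster that the sketch flags, the Notifications column in the console remains the better source for the reason behind the status.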

Investigate a specific cluster

After you discover a problem with a cluster, explore the cluster's Details page for in-depth information that helps you troubleshoot your cluster and understand its configuration.

To go to a cluster's Details page, do the following:

Go to the Kubernetes clusters page.

Review the Name column and click the name of the cluster that you want to investigate.

Here are some examples of how to use the cluster Details page to troubleshoot your cluster:

For general health checks, try the following options:
- To view cluster-level dashboards, go to the Observability tab. By default, GKE enables Cloud Monitoring when you create a cluster. When Cloud Monitoring is enabled, GKE automatically sets up the dashboards on this page. Here are some of the views you might find most useful for troubleshooting:
  - Overview: view a high-level summary of your cluster's health, resource utilization, and key events. This dashboard helps you quickly assess the overall state of your cluster and identify potential issues.
  - Traffic metrics: view node-based networking metrics for insights into the traffic between your Kubernetes workloads.
  - Workload state: view the state of Deployments, Pods, and containers. Identify failing or unhealthy instances, and detect resource constraints.
  - Control plane: view the control plane's health and performance. This dashboard lets you monitor key metrics of components such as kube-apiserver and etcd, identify performance bottlenecks, and detect component failures.
    
    Tip: These dashboards can serve as a starting point for creating customized dashboards. You can copy a predefined dashboard and then customize it to include specific metrics, filters, or visualizations relevant to your needs.
- To view recent app errors, go to the App errors tab. The information on this tab can help you prioritize and resolve errors by showing the number of occurrences, when an error first appeared, and when it last happened.
  
  To investigate an error further, click the error message to view a detailed error report, including links to relevant logs.
If you're troubleshooting issues after a recent upgrade or change, check the Cluster basics section in the cluster Details tab. Confirm that the version listed in the Version field is what you expect. For further investigation, click Show upgrade history in the Upgrades section.
If you're using a Standard cluster and your Pods are stuck in a Pending state, or you suspect that nodes are overloaded, check the Nodes tab. The Nodes tab isn't available for Autopilot clusters because GKE manages nodes for you.
- In the Node Pools section, check that autoscaling is configured correctly and that the machine type is appropriate for your workloads.
- In the Nodes section, look for any node with a status other than Ready. A NotReady status indicates a problem with the node itself, such as resource pressure or an issue with the kubelet (the kubelet is the agent that runs on each node to manage containers).
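
The node check described above can also be sketched in code. The following example is an illustration under stated assumptions, not part of the console: it assumes that you saved the output of `kubectl get nodes -o json`, and it reads the standard node condition types (`Ready`, `MemoryPressure`, `DiskPressure`, and `PIDPressure`). The node names and the `summarize_nodes` helper are hypothetical.

```python
import json
from typing import Dict, List

# Condition types that signal resource pressure when their status is "True".
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")

# Hypothetical sample shaped like `kubectl get nodes -o json` output,
# trimmed to the fields that this sketch reads.
SAMPLE_NODES = json.dumps({
    "items": [
        {"metadata": {"name": "node-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-2"},
         "status": {"conditions": [
             {"type": "Ready", "status": "False"},
             {"type": "MemoryPressure", "status": "True"}]}},
        {"metadata": {"name": "node-3"},
         "status": {"conditions": [
             {"type": "Ready", "status": "True"},
             {"type": "DiskPressure", "status": "True"}]}},
    ]
})


def summarize_nodes(node_list_json: str) -> Dict[str, List[str]]:
    """Map each problematic node name to a list of detected issues."""
    problems: Dict[str, List[str]] = {}
    for node in json.loads(node_list_json).get("items", []):
        name = node["metadata"]["name"]
        conditions = {
            c["type"]: c["status"]
            for c in node.get("status", {}).get("conditions", [])
        }
        issues: List[str] = []
        # A Ready condition of "False" or "Unknown" (or a missing one)
        # is reported here as NotReady.
        if conditions.get("Ready") != "True":
            issues.append("NotReady")
        issues.extend(t for t in PRESSURE_TYPES if conditions.get(t) == "True")
        if issues:
            problems[name] = issues
    return problems


if __name__ == "__main__":
    for node_name, node_issues in summarize_nodes(SAMPLE_NODES).items():
        print(f"{node_name}: {', '.join(node_issues)}")
```

A node can report a pressure condition while it is still Ready, which is why the sketch checks the pressure conditions separately from the Ready condition.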

Find workload issues

When you suspect that there's a problem with a specific app, like a failed Deployment, go to the Workloads page in the Google Cloud console. This page provides a centralized view of all of the apps that run within your clusters.

To get started, in the Google Cloud console, go to the Workloads page.

Here are some examples of how you can use this page for troubleshooting:

To identify unhealthy workloads, review the Status column. Any workload that doesn't have a green checkmark needs attention.
If an app is unresponsive, review the Pods column. For example, a status like 1/3 means only one of three app replicas is running, indicating a problem.
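
To make the arithmetic behind a value like 1/3 explicit, here is a small sketch that parses the ratio and reports how many replicas are missing. The helper names are hypothetical, and the sketch assumes that the value always has the form `running/desired`.

```python
from typing import Tuple


def parse_pods_column(value: str) -> Tuple[int, int]:
    """Split a value such as "1/3" into (running, desired) replica counts."""
    running_text, separator, desired_text = value.strip().partition("/")
    if not separator:
        raise ValueError(f"expected 'running/desired', got {value!r}")
    running, desired = int(running_text), int(desired_text)
    if running < 0 or desired < 0:
        raise ValueError(f"replica counts can't be negative: {value!r}")
    return running, desired


def missing_replicas(value: str) -> int:
    """Return how many desired replicas aren't running (never negative)."""
    running, desired = parse_pods_column(value)
    return max(desired - running, 0)


if __name__ == "__main__":
    for sample in ("3/3", "1/3", "0/2"):
        print(sample, "->", missing_replicas(sample), "missing")
```

A result greater than zero corresponds to the situation described above, where fewer replicas are running than the workload requests.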

Investigate a specific workload

After you identify a problematic workload from the overview, explore the workload Details page to begin to isolate the root cause.

To go to a workload's Details page, do the following:

Go to the Workloads page.

View the Name column and click the name of the workload that you want to investigate.

Here are some examples of how to use the workload Details page to troubleshoot your workloads:

To check the workload's configuration, use the workload Overview and Details tabs. You can use this information to verify details such as whether the correct container image tag was deployed, or to check the workload's resource requests and limits.

Note: Depending on your workload type, you might not have an Overview tab. For example, StatefulSets have only a Details tab. However, both tabs help you review your configuration.
To find the name of a specific crashing Pod, go to the Managed Pods section. This section lists all the Pods controlled by the workload, along with their statuses. You might need a Pod's name for kubectl commands.
To see a history of recent changes to a workload, go to the Revision history tab. If you notice performance issues after a new Deployment, then use this section to identify which revision is active. You can then compare the configurations of the current revision with previous ones to pinpoint the source of the problem. If this tab isn't visible, the workload is either a type that doesn't use revisions or it hasn't yet had any updates.
If a Deployment seems to have failed, go to the Events tab. This page is often the most valuable source of information because it shows Kubernetes-level events.
To look at your app's logs, click the Logs tab. This page helps you understand what's happening inside your cluster. Look here for error messages and stack traces that can help you diagnose issues.
To confirm exactly what was deployed, view the YAML tab. This page shows the live YAML manifest for the workload as it exists on the cluster. This information is useful for finding any discrepancies from your source-controlled manifests. If you're viewing a single Pod's YAML manifest, this tab also shows you the status of the Pod, which provides insights about Pod-level failures.
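
One way to look for discrepancies between the live manifest and your source-controlled copy is a plain text diff. The following sketch uses Python's standard `difflib` module. The two manifests are hypothetical, trimmed examples, and the sketch assumes that you copied the live manifest from the YAML tab into a string or file. A real live manifest also contains fields that the cluster populates, such as `status` and `metadata.resourceVersion`, so expect extra diff lines that don't indicate a problem.

```python
import difflib
from typing import List

# Hypothetical, trimmed manifests used only for illustration.
SOURCE_MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: web
        image: example/web:1.4.2
"""

LIVE_MANIFEST = SOURCE_MANIFEST.replace("example/web:1.4.2", "example/web:1.4.1")


def manifest_diff(source_text: str, live_text: str) -> List[str]:
    """Return unified-diff lines; the list is empty when the texts match."""
    return list(
        difflib.unified_diff(
            source_text.splitlines(),
            live_text.splitlines(),
            fromfile="source-controlled",
            tofile="live",
            lineterm="",
        )
    )


if __name__ == "__main__":
    for line in manifest_diff(SOURCE_MANIFEST, LIVE_MANIFEST):
        print(line)
```

In this example, the diff would show that the image tag in the live manifest differs from the tag in source control, which is the kind of discrepancy that the YAML tab helps you confirm.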

What's next

Read Investigate a cluster's state with kubectl (the next page in this series).
See these concepts applied in the example troubleshooting scenario.
For advice about resolving specific problems, review GKE's troubleshooting guides.
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on Stack Overflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.