Monitor cluster performance with prebuilt dashboards

This document explains how to monitor cluster resource usage in Cluster Director by using pre-configured Cloud Monitoring dashboards.

Cluster Director provides built-in Monitoring dashboards that let you view pre-configured telemetry data and track the performance of the resources that your cluster uses. Use these dashboards to observe the health and resource usage of your cluster.

To learn more about Monitoring dashboards, see Dashboards overview.

Before you begin

When you access and use the Google Cloud console, you don't need to authenticate. You can automatically use Google Cloud services and APIs.

Required roles

To get the permissions that you need to view clusters, ask your administrator to grant you the following IAM roles on the project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

View Monitoring dashboards

To view the prebuilt Monitoring dashboards for a cluster, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. In the navigation menu, click Clusters. The Clusters page appears.

  3. In the Clusters table, in the Name column, click the name of the cluster whose details you want to view. A page that shows the cluster details appears, and the Details tab is selected.

  4. Click the Observability tab. A pane that shows the available prebuilt Monitoring dashboards for your cluster appears.

Available Monitoring dashboards

The prebuilt Monitoring dashboards that you can view in a cluster show real-time and historical data for the following resource categories:

  • CPU: you can monitor vCPU usage across all nodes in your cluster. This information can help you improve the performance of your workloads or save costs by deleting unused resources.

  • Memory: you can track memory usage to help ensure that your nodes have sufficient memory for your jobs and prevent out-of-memory errors.

  • Network: you can observe network traffic patterns, including bandwidth and throughput. This information is useful for diagnosing and troubleshooting communication bottlenecks in distributed workloads.

  • GPU: for clusters with GPU nodes, you can monitor GPU usage and GPU memory usage. This information helps you verify that your resources are efficiently used.

  • Health: you can view high-level health metrics for the components in your cluster, giving you a quick overview of the operational status of your nodes.
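
If you want to look at similar data outside the console, the following minimal sketch shows one way to build a request URL for the Cloud Monitoring API timeSeries.list method. It is an illustration under stated assumptions, not the method that the prebuilt dashboards use: the metric types listed are standard Compute Engine metrics, and whether the dashboards chart exactly these metrics is an assumption. The function name build_time_series_url, the METRIC_TYPES mapping, and the project ID my-project are hypothetical. The sketch only builds the URL; sending the request requires an authenticated client.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumed mapping from a dashboard category to a Cloud Monitoring metric type.
# These are standard Compute Engine metric types; that the prebuilt dashboards
# chart exactly these metrics is an assumption made for illustration.
METRIC_TYPES = {
    "cpu": "compute.googleapis.com/instance/cpu/utilization",
    "network_in": "compute.googleapis.com/instance/network/received_bytes_count",
    "network_out": "compute.googleapis.com/instance/network/sent_bytes_count",
}


def build_time_series_url(project_id, category, end_time, lookback_minutes=60):
    """Return a timeSeries.list URL for one category over a lookback window.

    end_time must be a timezone-aware datetime in UTC, because the timestamps
    are formatted with a literal "Z" suffix.
    """
    start_time = end_time - timedelta(minutes=lookback_minutes)
    params = {
        "filter": f'metric.type = "{METRIC_TYPES[category]}"',
        "interval.startTime": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "interval.endTime": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    base = f"https://monitoring.googleapis.com/v3/projects/{project_id}/timeSeries"
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    print(build_time_series_url("my-project", "cpu", datetime.now(timezone.utc)))
```

Memory and GPU metrics are left out of the mapping because they are typically reported by an agent on the node rather than by Compute Engine itself, so the right metric type depends on how your cluster nodes are configured.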

What's next
