This page explains how to use Cloud Logging and Cloud Monitoring, as well
as Prometheus and Grafana, for logging and monitoring of your Google Distributed Cloud implementation. For a summary of the
configuration options available, see Logging and monitoring overview.
Using Cloud Logging and Cloud Monitoring
The following sections explain how to use Cloud Logging and
Cloud Monitoring with Google Distributed Cloud clusters.
Monitored resources
Monitored resources are how Google represents resources such as clusters,
nodes, Pods, and containers. To learn more, refer to Cloud Monitoring's Monitored resource types documentation.
To query for logs and metrics, you'll need to know at least these resource
labels:
project_id: Project ID of the cluster's logging-monitoring project.
You provided this value in the stackdriver.projectID field of your cluster configuration file.
location: A Google Cloud region where you want to store
Cloud Logging logs and Cloud Monitoring metrics. It's a good
idea to choose a region that is near your on-premises data center. You provided this
value during installation in the stackdriver.clusterLocation field of your cluster configuration file.
cluster_name: Cluster name that you chose when you created the cluster.
You can retrieve the cluster_name value for either the admin or the user
cluster by inspecting the Stackdriver custom resource:
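For example, a command like the following prints the Stackdriver custom resource, including the cluster name. This is a sketch; it assumes the custom resource is named stackdriver in the kube-system namespace, and that KUBECONFIG is the path to your cluster's kubeconfig file:
kubectl --kubeconfig=KUBECONFIG -n kube-system get stackdriver stackdriver -o yaml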
You can access logs using the Logs Explorer in the Google Cloud console. For example, to access a container's logs:
Open the Logs Explorer in the Google Cloud console for your project.
Find logs for a container by:
Clicking on the top-left log catalog drop-down box and
selecting Kubernetes Container.
Selecting the cluster name, then the namespace, and then a container
from the hierarchy.
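As a sketch, a Logs Explorer query that narrows results by the resource labels described earlier might look like the following; the values shown are placeholders, not values from your clusters:
resource.type="k8s_container"
resource.labels.project_id="PROJECT_ID"
resource.labels.location="us-west1"
resource.labels.cluster_name="my-user-cluster"
resource.labels.namespace_name="default"
resource.labels.container_name="my-container"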
Creating dashboards to monitor cluster health
Google Distributed Cloud clusters are, by default, configured to monitor
system and container metrics. After you create a cluster (admin or user), a
best practice is to create the following dashboards with
Cloud Monitoring to let your Google Distributed Cloud operations
team monitor cluster health: control plane status, Pod status, node status, and VM health status.
The dashboards are automatically created during admin cluster installation
if Cloud Monitoring is enabled.
This section describes how to create these dashboards. For more information
about the dashboard creation process described in the following sections, see Managing dashboards by API.
Prerequisites
Your Google Account must have the following permissions to create dashboards:
monitoring.dashboards.create
monitoring.dashboards.delete
monitoring.dashboards.update
You'll have these permissions if your account has one of the following
roles. You can check your permissions in the Google Cloud console:
monitoring.dashboardEditor
monitoring.editor
Project editor
Project owner
In addition, to use gcloud (the gcloud CLI) to create dashboards, your Google Account must have the serviceusage.services.use permission.
Your account will have this permission if it has one of the following roles:
roles/serviceusage.serviceUsageConsumer
roles/serviceusage.serviceUsageAdmin
roles/owner
roles/editor
Project editor
Project owner
Create a control plane status dashboard
The Google Distributed Cloud control plane consists of the API server,
scheduler, controller manager, and etcd. To monitor the status of the
control plane, create a dashboard that monitors the state of these components.
Select Resources > Dashboards and
view the dashboard named GKE on-prem control plane status. The control plane status of each user
cluster is collected from separate namespaces within the admin cluster.
The namespace_name field is the user cluster name.
A service level objective (SLO) threshold of 0.999 is set in each chart.
When you create a cluster, Google Distributed Cloud automatically creates a Stackdriver custom resource. You can edit the spec in the custom resource to override the default values for CPU and memory
requests and limits for a Stackdriver component, and you can separately override the default storage size and storage class.
Override default values for requests and limits for CPU and memory
To override these defaults, do the following:
Open your Stackdriver custom resource in a command line editor:
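A command sketch, assuming the custom resource is named stackdriver in the kube-system namespace and KUBECONFIG points to your cluster's kubeconfig file:
kubectl --kubeconfig=KUBECONFIG -n kube-system edit stackdriver stackdriver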
Note that the resourceAttrOverride field overrides all existing default
limits and requests for the component you specify. An example file looks
like the following:
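The following is a minimal sketch of what the spec might contain; the component key and the CPU and memory values are illustrative assumptions, not recommended settings:
spec:
  resourceAttrOverride:
    stackdriver-prometheus-k8s/prometheus-server:
      limits:
        cpu: 500m
        memory: 3000Mi
      requests:
        cpu: 300m
        memory: 2500Mi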
Add the storageSizeOverride field under the spec section. You can use the component stackdriver-prometheus-k8s or stackdriver-prometheus-app. The section takes this format:
storageSizeOverride:
  STATEFULSET_NAME: SIZE
This example uses the statefulset stackdriver-prometheus-k8s and size 120Gi.
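As a sketch, the spec section might then look like this (the size shown is only an example):
spec:
  storageSizeOverride:
    stackdriver-prometheus-k8s: 120Gi
To apply the change, open the custom resource for editing; the following command sketch assumes the resource is named stackdriver in the kube-system namespace:
kubectl --kubeconfig=KUBECONFIG -n kube-system edit stackdriver stackdriver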
where KUBECONFIG is the path to your kubeconfig file for the cluster. This can be either an admin cluster or user cluster.
Add the storageClassName field under the spec section:
storageClassName: STORAGECLASS_NAME
Note that the storageClassName field overrides the existing default storage class, and applies to all logging and monitoring components that claim persistent volumes. An example file looks like the following:
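A minimal sketch, assuming a storage class named my-storage-class already exists in the cluster:
spec:
  storageClassName: my-storage-class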
Metadata is used indirectly via metrics. When you filter for metrics in
Monitoring Metrics Explorer, you see options to filter metrics by metadata.systemLabels and metadata.userLabels. System labels are labels such
as node name and Service name for Pods. User labels are labels assigned to Pods
in the Kubernetes YAML files in the "metadata" section of the Pod specification.
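For example, a Pod manifest like the following sketch sets user labels in its metadata section; the label keys and values are illustrative only:
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: my-app
    environment: production
spec:
  containers:
  - name: example-container
    image: nginx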
Default Cloud Monitoring quota limits
Google Distributed Cloud monitoring has a default limit of 6000 API calls per minute for
each project. If you exceed this limit, your metrics may not be displayed.
If you need a higher monitoring limit, request a quota adjustment.
Known issue: Cloud Monitoring error condition
(Issue ID 159761921)
Under certain conditions, the default Cloud Monitoring pod,
deployed by default in each new cluster, can become unresponsive.
When clusters are upgraded, for example, storage data can become
corrupted when pods in statefulset/prometheus-stackdriver-k8s are restarted.
Specifically, the monitoring pod stackdriver-prometheus-k8s-0 can be
caught in a loop when corrupted data prevents prometheus-stackdriver-sidecar from writing to the cluster storage PersistentVolume.
You can manually diagnose and recover from the error by following the steps below.
Diagnosing the Cloud Monitoring failure
When the monitoring pod has failed, the logs will report the following:
{"log":"level=warn ts=2020-04-08T22:15:44.557Z caller=queue_manager.go:534 component=queue_manager msg=\"Unrecoverable error sending samples to remote storage\" err=\"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0-114]; Unknown metric: kubernetes.io/anthos/scheduler_pending_pods: timeSeries[196-198]\"\n","stream":"stderr","time":"2020-04-08T22:15:44.558246866Z"}
{"log":"level=error ts=2020-04-08T22:15:44.663Z caller=main.go:603 err=\"corruption after 29032448 bytes: unexpected non-zero byte in padded page\"\n","stream":"stderr","time":"2020-04-08T22:15:44.663707748Z"}
{"log":"level=info ts=2020-04-08T22:15:44.663Z caller=main.go:605 msg=\"See you next time!\"\n","stream":"stderr","time":"2020-04-08T22:15:44.664000941Z"}
Recovering from the Cloud Monitoring error
To recover Cloud Monitoring manually:
Stop cluster monitoring. Scale down the stackdriver operator to prevent monitoring reconciliation:
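A command sketch, assuming the operator runs as a Deployment named stackdriver-operator in the kube-system namespace and [CLUSTER_KUBECONFIG] is the path to the cluster's kubeconfig file:
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system scale deployment stackdriver-operator --replicas=0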
The following sections explain how to use Prometheus and Grafana with
Google Distributed Cloud clusters.
Enabling Prometheus and Grafana
Starting in Google Distributed Cloud version 1.2, you can choose whether to
enable or disable Prometheus and Grafana. In new user clusters, Prometheus and
Grafana are disabled by default.
Your user cluster has a Monitoring object named monitoring-sample. Open the
object for editing:
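A command sketch, assuming the object lives in the kube-system namespace of the user cluster:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n kube-system edit monitoring monitoring-sample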
In user clusters, Prometheus and Grafana get automatically disabled during
upgrade. However, the configuration and metrics data are not lost.
To work around this issue, after the upgrade, open monitoring-sample for
editing and set enablePrometheus to true.
Accessing monitoring metrics from Grafana dashboards
Grafana displays metrics gathered from your clusters. To view these metrics, you
need to access Grafana's dashboards:
Get the name of the Grafana Pod running in a user cluster's kube-system namespace:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n kube-system get pods
where [USER_CLUSTER_KUBECONFIG] is the user cluster's kubeconfig
file.
The Grafana Pod has an HTTP server listening on TCP localhost port 3000.
Forward a local port to port 3000 in the Pod, so that you can view Grafana's
dashboards from a web browser.
For example, suppose the name of the Pod is grafana-0. To forward port
50000 to port 3000 in the Pod, enter this command:
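A command sketch, assuming grafana-0 runs in the user cluster's kube-system namespace:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n kube-system port-forward grafana-0 50000:3000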
From a web browser, navigate to http://localhost:50000.
On the login page, enter admin for both the username and password.
If login is successful, you will see a prompt to change the password. After you have changed the default password, the user cluster's Grafana Home Dashboard should load.
To access other dashboards, click theHomedrop-down menu in the
top-left corner of the page.
Prometheus Alertmanager collects alerts from the Prometheus server. You can view
these alerts in a Grafana dashboard. To view the alerts, you need to access the
dashboard:
The container in the alertmanager-0 Pod listens on TCP port 9093. Forward a
local port to port 9093 in the Pod:
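For example, to forward local port 50001 (the port used in the next step) to port 9093, a command sketch assuming alertmanager-0 runs in the user cluster's kube-system namespace:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n kube-system port-forward alertmanager-0 50001:9093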
From a web browser, navigate to http://localhost:50001.
Changing Prometheus Alertmanager configuration
You can change Prometheus Alertmanager's default configuration by editing your
user cluster's monitoring.yaml file. You should do this if you want to direct
alerts to a specific destination, rather than keep them in the dashboard. You
can learn how to configure Alertmanager in Prometheus' Configuration documentation.
To change the Alertmanager configuration, perform the following steps:
Make a copy of the user cluster's monitoring.yaml manifest file:
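The same command shown later on this page for resizing Prometheus Server resources applies here:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n kube-system get monitoring monitoring-sample -o yaml > monitoring.yaml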
The default monitoring configuration supports up to five nodes. For larger
clusters, you can adjust the Prometheus Server resources. The recommendation
is 50m cores of CPU and 500Mi of memory per cluster node. Make sure that your
cluster contains two nodes, each with sufficient resources to fit Prometheus.
For more information, refer toResizing a user cluster.
To change Prometheus Server resources, perform the following steps:
Make a copy of the user cluster's monitoring.yaml manifest file:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n kube-system get monitoring monitoring-sample -o yaml > monitoring.yaml
To override resources, make changes to the fields under spec.resourceOverride. When you're finished, save the changed manifest.
Example:
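The following is a sketch of a resourceOverride entry; the component name and the CPU and memory values are illustrative assumptions sized for a larger cluster, not verified recommendations:
spec:
  resourceOverride:
  - component: Prometheus
    resources:
      requests:
        cpu: 400m
        memory: 4000Mi
      limits:
        cpu: 400m
        memory: 4000Mi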
You've deployed an application that exposes a metric, verified that the metric
is exposed, and verified that Prometheus scrapes the metric. Now you can add the
application-level metric to a custom Grafana dashboard.
To create a Grafana dashboard, perform the following steps:
To verify that Prometheus is scraping the metric, navigate to
http://localhost:50003/targets, which should take you to the prometheus-0 Pod under the prometheus-io-pods target group.
To view metrics in Prometheus, navigate to http://localhost:50003/graph.
From the search field, enter foo, then click Execute. The page should
display the metric.