Monitored resources are how Google represents resources such as clusters,
nodes, Pods, and containers. To learn more, refer to Cloud Monitoring's Monitored resource types documentation.
To query for logs and metrics, you'll need to know at least these resource
labels:
project_id: Project ID of the cluster's logging-monitoring
project.
You provided this value in the stackdriver.projectID field of your cluster
configuration file.
location: A Google Cloud region where you want to route and store your
Cloud Monitoring metrics. You specify the region during installation in
the stackdriver.clusterLocation field of your cluster configuration file. We recommend choosing a region
that's near your on-premises data center.
You specify the Cloud Logging logs routing and storage location in the
Log Router configuration. For more information about logs routing, see Routing and storage overview.
cluster_name: Cluster name that you chose when you created the cluster.
You can retrieve the cluster_name value for either the admin cluster or a user
cluster by inspecting the Stackdriver custom resource:
CLUSTER_KUBECONFIG is the path to the kubeconfig file of the admin
cluster or user cluster whose cluster name you want to retrieve.
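As a sketch, you can read the clusterName field from the Stackdriver custom resource with a command like the following. The resource name and namespace (stackdriver in kube-system) match the defaults described in this document, but verify them in your cluster:

```shell
# Print the clusterName field of the Stackdriver custom resource.
# CLUSTER_KUBECONFIG is the kubeconfig of the admin or user cluster.
kubectl --kubeconfig CLUSTER_KUBECONFIG get stackdriver stackdriver \
    --namespace kube-system \
    --output jsonpath='{.spec.clusterName}'
```

The command requires a live cluster, so adapt the resource name if your installation differs.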
Logs and metrics routing
The Stackdriver log forwarder (stackdriver-log-forwarder) sends logs from each
node machine to Cloud Logging. Similarly, the GKE metrics agent
(gke-metrics-agent) sends metrics from each node machine to
Cloud Monitoring. Before the logs and metrics are sent, the Stackdriver
Operator (stackdriver-operator) attaches the value from the clusterLocation field in the stackdriver custom resource to each log entry and metric.
Additionally, the logs and metrics are
associated with the Google Cloud project
specified in the stackdriver custom resource spec (spec.projectID).
All metrics and log entries sent by the Stackdriver agents are routed to a global
ingestion endpoint. From there, the data is forwarded to the closest reachable
regional Google Cloud endpoint to ensure the reliability of the data transport.
After the global endpoint receives the metric or log entry, what happens next is
service-dependent:
How logs routing is configured: when the logging endpoint receives a log
message, Cloud Logging passes the message through the Log Router. The
sinks and filters in the Log Router configuration determine how to route
the message. You can route log entries to destinations like regional
Logging buckets, which store the log entry, or to
Pub/Sub. For more information about how log routing works and how to
configure it, see Routing and storage
overview.
Neither the clusterLocation field in the stackdriver custom resource nor
the clusterOperations.location field in the Cluster spec is considered in
this routing process. For logs, clusterLocation is used only to label log
entries, which can be helpful for filtering in Logs Explorer.
How metrics routing is configured: when the metrics endpoint receives a
metric entry, the entry is automatically routed to storage in the location
specified by the metric. The location in the metric comes from the clusterLocation field in the stackdriver custom resource.
Plan your configuration: when you configure Cloud Logging and
Cloud Monitoring, configure the Log Router and specify a clusterLocation with locations that best support your needs.
For example, if you want logs and metrics to go to the same location, set clusterLocation to the same Google Cloud region that
the Log Router is using for your Google Cloud project.
Update your configuration when needed: you can change the destination
settings for logs and metrics at any time to meet business requirements,
such as disaster recovery plans. Changes to the Log Router configuration in
Google Cloud take effect quickly. The location and projectID fields in
the clusterOperations section of the Cluster resource are immutable, so
they can't be updated after you create your cluster. We don't recommend that
you change values in the stackdriver resource directly. This resource is
reverted to the original cluster creation state whenever a cluster
operation, such as an upgrade, triggers a reconciliation.
The stackdriver resource gets values for the clusterLocation and projectID
fields from the stackdriver.clusterLocation and stackdriver.projectID fields in the clusterOperations section of the Cluster resource at cluster
creation time.
Use Cloud Logging
You don't have to take any action to enable Cloud Logging for a cluster.
However, you must specify the Google Cloud project where you want to view logs. In
the cluster configuration file, you specify the Google Cloud project in
the stackdriver section.
You can access logs using the Logs Explorer in the Google Cloud console. For example, to access a container's logs:
Open the Logs Explorer in the Google Cloud console for your project.
Find logs for a container by:
Clicking the top-left log catalog drop-down box and
selecting Kubernetes Container.
Selecting the cluster name, then the namespace, and then a container
from the hierarchy.
View logs for controllers in the bootstrap cluster
In the Google Cloud console, go to the Logs Explorer page.
Use Cloud Monitoring
You don't have to take any action to enable Cloud Monitoring for a cluster.
However, you must specify the Google Cloud project where you want to view metrics.
In the cluster configuration file, you specify the Google Cloud project in
the stackdriver section.
You can choose from over 1,500 metrics by using Metrics Explorer.
To access Metrics Explorer, do the following:
In the Google Cloud console, select Monitoring.
You can also view metrics in dashboards in the Google Cloud console. For information
about creating dashboards and viewing metrics, see Creating dashboards.
View fleet-level monitoring data
For an overall view of your fleet's resource utilization based on
Cloud Monitoring data, including your Google Distributed Cloud clusters, you can use
the Google Kubernetes Engine overview in the Google Cloud console. See Manage clusters from the Google Cloud console to find out more.
Default Cloud Monitoring quota limits
Google Distributed Cloud monitoring has a default limit of 6,000 API calls per minute for
each project. If you exceed this limit, your metrics might not be displayed.
If you need a higher monitoring limit, request one through the Google Cloud console.
Configure the Stackdriver custom resource
When you create a cluster, Google Distributed Cloud automatically creates a Stackdriver custom resource. You can edit the spec in the custom resource to override the default values for CPU and memory
requests and limits for a Stackdriver component, and you can separately override the default storage size and storage class.
Override the default CPU and memory requests and limits for a Stackdriver component
Clusters with high Pod density incur higher logging and monitoring
overhead. In extreme cases, Stackdriver components can report utilization close to the CPU
and memory limits, or might even be subject to constant restarts due to
resource limits. In this case, to override the default values for CPU and memory
requests and limits for a Stackdriver component, use the following steps:
Run the following command to open your Stackdriver custom resource in a
command-line editor:
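The command itself isn't reproduced above. A typical sketch, assuming the Stackdriver custom resource is named stackdriver in the kube-system namespace as described earlier in this document, is:

```shell
# Open the Stackdriver custom resource in your default command-line editor.
# KUBECONFIG is the path to your admin or user cluster kubeconfig file.
kubectl --kubeconfig KUBECONFIG --namespace kube-system \
    edit stackdriver stackdriver
```

This requires access to a running cluster; verify the resource name and namespace in your installation.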
Note that the resourceAttrOverride section overrides all existing default
limits and requests for the component that you specify. The following components
are supported by resourceAttrOverride:
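As an illustration only, a resourceAttrOverride entry in the Stackdriver spec might look like the following sketch. The component key and resource values here are assumptions for illustration; use a component from the supported list and values sized for your cluster:

```yaml
spec:
  resourceAttrOverride:
    # Hypothetical override for the log forwarder's container.
    stackdriver-log-forwarder/stackdriver-log-forwarder:
      limits:
        cpu: 1200m
        memory: 1.5Gi
      requests:
        cpu: 600m
        memory: 600Mi
```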
By default, the metrics agents running in the cluster collect and report an
optimized set of container, kubelet, and kube-state-metrics metrics to Google Cloud Observability (formerly Stackdriver). If you
require additional metrics, we recommend that you find a replacement
from the list of Google Distributed Cloud metrics.
Here are some examples of replacements you might use:
Add the storageSizeOverride field under the spec section. You can use the component stackdriver-prometheus-k8s or stackdriver-prometheus-app. The section takes this format:
storageSizeOverride:
  STATEFULSET_NAME: SIZE
This example uses the statefulset stackdriver-prometheus-k8s and size 120Gi.
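A minimal sketch of that example, with indentation assumed to follow the format shown above:

```yaml
spec:
  storageSizeOverride:
    stackdriver-prometheus-k8s: 120Gi
```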
where KUBECONFIG is the path to your kubeconfig file for the cluster. This can be either an admin cluster or a user cluster.
Add the storageClassName field under the spec section:
storageClassName: STORAGECLASS_NAME
Note that the storageClassName field overrides the existing default storage class, and applies to all logging and monitoring components with claimed persistent volumes. An example file looks like the following:
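The example file isn't reproduced above. The following is an illustrative sketch only; the API version, project, cluster, and storage class names are placeholders and assumptions, not values from your cluster:

```yaml
apiVersion: addons.gke.io/v1alpha1  # assumed API group for the Stackdriver addon
kind: Stackdriver
metadata:
  name: stackdriver
  namespace: kube-system
spec:
  projectID: my-project                # placeholder
  clusterName: my-cluster              # placeholder
  clusterLocation: us-west1            # placeholder
  storageClassName: my-storage-class   # overrides the default storage class
```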
Google Cloud Managed Service for Prometheus is part of Cloud Monitoring and is available
as an option for system components. The benefits of Managed Service for Prometheus
include the following:
You can continue to use your existing Prometheus-based monitoring without
altering your alerts and Grafana dashboards.
If you use both GKE and Google Distributed Cloud, you can use the same Prometheus Query Language (PromQL) for metrics on all your clusters. You can also use the PromQL tab in Metrics Explorer in the Google Cloud console.
Enable and disable Managed Service for Prometheus
Starting with Google Distributed Cloud release 1.30.0-gke.1930,
Managed Service for Prometheus is always enabled. In earlier versions, you
can edit the Stackdriver resource, stackdriver, to enable or disable
Managed Service for Prometheus. To disable Managed Service for Prometheus
for cluster versions prior to 1.30.0-gke.1930, set spec.featureGates.enableGMPForSystemMetrics in the stackdriver resource to false.
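As a sketch, for versions before 1.30.0-gke.1930 only, the relevant part of the stackdriver resource would look like this after the edit:

```yaml
spec:
  featureGates:
    enableGMPForSystemMetrics: false
```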
View metric data
When enableGMPForSystemMetrics is set to true, metrics for the following
components have a different format for how they are stored and queried in
Cloud Monitoring:
Configure Grafana dashboards with Managed Service for Prometheus
To use Grafana with metrics data from Managed Service for Prometheus, follow the steps in Query using Grafana to authenticate and configure a Grafana data source to query data from Managed Service for Prometheus.
A set of sample Grafana dashboards is provided in the anthos-samples repository on GitHub. To install the sample dashboards, do the following:
Download the sample JSON files:
git clone https://github.com/GoogleCloudPlatform/anthos-samples.git
cd anthos-samples/gmp-grafana-dashboards
If your Grafana data source was created with a name different from Managed Service for Prometheus, change the datasource field in all the JSON files:
sed -i "s/Managed Service for Prometheus/[DATASOURCE_NAME]/g" ./*.json
Replace [DATASOURCE_NAME] with the name of the data source in your Grafana that points to the Prometheus frontend service.
Access the Grafana UI from your browser, and select + Import from the Dashboards menu.
Either upload the JSON file, or copy and paste the file content, and then select Load. After the file content is successfully loaded, select Import. Optionally, you can also change the dashboard name and UID before importing.
The imported dashboard loads successfully if your Google Distributed Cloud cluster and the data source are configured correctly. For example, the following screenshot shows the dashboard configured by cluster-capacity.json.
Additional resources
For more information about Managed Service for Prometheus, see the following:
In user clusters, Prometheus and Grafana are automatically disabled during
upgrades. However, the configuration and metrics data are not lost.
To work around this issue, after the upgrade, open monitoring-sample for
editing and set enablePrometheus to true.
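A sketch of that edit, assuming monitoring-sample is a Monitoring resource in the kube-system namespace; verify the resource type and namespace in your cluster before running it:

```shell
# Open the monitoring-sample object for editing, then set
# enablePrometheus: true in its spec.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    --namespace kube-system edit monitoring monitoring-sample
```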
Access monitoring metrics from Grafana dashboards
Grafana displays metrics gathered from your clusters. To view these metrics, you
need to access Grafana's dashboards:
Get the name of the Grafana Pod running in a user cluster's kube-system namespace:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n kube-system get pods
where [USER_CLUSTER_KUBECONFIG] is the user cluster's kubeconfig
file.
The Grafana Pod has an HTTP server listening on TCP localhost port 3000.
Forward a local port to port 3000 in the Pod, so that you can view Grafana's
dashboards from a web browser.
For example, suppose the name of the Pod is grafana-0. To forward port
50000 to port 3000 in the Pod, enter this command:
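The command itself isn't shown above; with the names used in this example, it would be:

```shell
# Forward local port 50000 to port 3000 in the grafana-0 Pod.
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] \
    --namespace kube-system port-forward grafana-0 50000:3000
```

The command blocks while the forwarding session is active; press Ctrl+C to stop it.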
From a web browser, navigate tohttp://localhost:50000.
On the login page, enter admin for both the username and the password.
If login is successful, you are prompted to change the password. After you have changed the default password, the user cluster's Grafana Home Dashboard should load.
To access other dashboards, click the Home drop-down menu in the
top-left corner of the page.
Prometheus Alertmanager collects alerts from the Prometheus server. You can view
these alerts in a Grafana dashboard. To view the alerts, you need to access the
dashboard:
The container in the alertmanager-0 Pod listens on TCP port 9093. Forward a
local port to port 9093 in the Pod:
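For example, to match the URL in the next step, forward local port 50001:

```shell
# Forward local port 50001 to port 9093 in the alertmanager-0 Pod.
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] \
    --namespace kube-system port-forward alertmanager-0 50001:9093
```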
From a web browser, navigate tohttp://localhost:50001.
Change Prometheus Alertmanager configuration
You can change Prometheus Alertmanager's default configuration by editing your
user cluster's monitoring.yaml file. You should do this if you want to direct
alerts to a specific destination rather than keep them in the dashboard. You
can learn how to configure Alertmanager in Prometheus' Configuration documentation.
To change the Alertmanager configuration, perform the following steps:
Make a copy of the user cluster's monitoring.yaml manifest file:
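The copy command isn't shown here; a minimal sketch, assuming monitoring.yaml is in your working directory, is:

```shell
# Keep a backup so you can restore the original configuration if needed.
cp monitoring.yaml monitoring.yaml.bak
```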
You've deployed an application that exposes a metric, verified that the metric
is exposed, and verified that Prometheus scrapes the metric. Now you can add the
application-level metric to a custom Grafana dashboard.
To create a Grafana dashboard, perform the following steps:
From the Home Dashboard, click the Home drop-down menu in the
top-left corner of the page.
From the right-side menu, click New dashboard.
From the New panel section, click Graph. An empty graph dashboard
appears.
Click Panel title, then click Edit. The bottom Graph panel opens
to the Metrics tab.
From the Data Source drop-down menu, select user. Click Add
query, and enter foo in the search field.
Click the Back to dashboard button in the top-right corner of the
screen. Your dashboard is displayed.
To save the dashboard, click Save dashboard in the top-right corner
of the screen. Choose a name for the dashboard, then click Save.
Disabling Prometheus and Grafana
Starting from version 1.16, Prometheus and Grafana are no longer controlled by
the enablePrometheus field in the monitoring-sample object.
See Using Prometheus and Grafana for details.
Example: Adding application-level metrics to a Grafana dashboard
The following sections walk you through adding metrics for an application. In
this section, you complete the following tasks:
Deploy an example application that exposes a metric called foo.
Verify that Prometheus exposes and scrapes the metric.
Create a custom Grafana dashboard.
Deploy the example application
The example application runs in a single Pod. The Pod's container exposes a
metric, foo, with a constant value of 40.
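The application manifest isn't reproduced here. The following is a hypothetical sketch of such a Pod; the Pod name, image, and annotations are all assumptions for illustration, following the common prometheus.io scrape-annotation convention rather than this document's actual example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-metric-app        # hypothetical name
  annotations:
    prometheus.io/scrape: "true"  # common convention for Pod scrape targets
    prometheus.io/port: "8080"
spec:
  containers:
  - name: app
    # Placeholder image that serves "foo 40" in Prometheus text format
    # on /metrics at port 8080.
    image: example.com/metric-app:latest
    ports:
    - containerPort: 8080
```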
To verify that Prometheus is scraping the metric, navigate to
http://localhost:50003/targets, which should take you to the prometheus-0 Pod under the prometheus-io-pods target group.
To view metrics in Prometheus, navigate to http://localhost:50003/graph.
In the search field, enter foo, then click Execute. The page should
display the metric.
Known issue: Cloud Monitoring error condition
(Issue ID 159761921)
Under certain conditions, the default Cloud Monitoring Pod,
deployed by default in each new cluster, can become unresponsive.
When clusters are upgraded, for example, storage data can become
corrupted when Pods in statefulset/prometheus-stackdriver-k8s are restarted.
Specifically, the monitoring Pod stackdriver-prometheus-k8s-0 can be
caught in a loop when corrupted data prevents prometheus-stackdriver-sidecar from writing to the cluster storage PersistentVolume.
You can manually diagnose and recover from the error by following the steps below.
Diagnose the Cloud Monitoring failure
When the monitoring Pod has failed, the logs report the following:
{"log":"level=warn ts=2020-04-08T22:15:44.557Z caller=queue_manager.go:534 component=queue_manager msg=\"Unrecoverable error sending samples to remote storage\" err=\"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0-114]; Unknown metric: kubernetes.io/anthos/scheduler_pending_pods: timeSeries[196-198]\"\n","stream":"stderr","time":"2020-04-08T22:15:44.558246866Z"}
{"log":"level=error ts=2020-04-08T22:15:44.663Z caller=main.go:603 err=\"corruption after 29032448 bytes: unexpected non-zero byte in padded page\"\n","stream":"stderr","time":"2020-04-08T22:15:44.663707748Z"}
{"log":"level=info ts=2020-04-08T22:15:44.663Z caller=main.go:605 msg=\"See you next time!\"\n","stream":"stderr","time":"2020-04-08T22:15:44.664000941Z"}
Recover from the Cloud Monitoring error
To recover Cloud Monitoring manually:
Stop cluster monitoring. Scale down the stackdriver operator to prevent monitoring reconciliation:
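A sketch of that step, assuming the operator runs as a Deployment named stackdriver-operator in the kube-system namespace; verify the Deployment name in your cluster first:

```shell
# Scale the Stackdriver operator to zero replicas so that it does not
# revert manual changes while you repair the monitoring StatefulSet.
kubectl --kubeconfig KUBECONFIG --namespace kube-system \
    scale deployment stackdriver-operator --replicas=0
```

Remember to scale the operator back up after the recovery is complete.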
Last updated 2026-04-17 UTC.