Google Distributed Cloud (software only) for bare metal supports multiple options for cluster logging and monitoring, including cloud-based managed services, open source tools, and validated compatibility with third-party commercial solutions. This page explains these options and provides basic guidance on selecting the right solution for your environment.
This page is for Admins, architects, and Operators who want to monitor the health of deployed applications or services, such as for service level objective (SLO) compliance. To learn more about common roles and example tasks that are referenced in Google Cloud content, see Common GKE user roles and tasks.
Options for Google Distributed Cloud
You have several logging and monitoring options for your cluster:
- Cloud Logging and Cloud Monitoring, enabled by default on bare metal system components.
- Prometheus and Grafana, available from the Cloud Marketplace.
- Validated configurations with third-party solutions.
Cloud Logging and Cloud Monitoring
Google Cloud Observability is the built-in observability solution for Google Cloud. It offers a fully managed logging solution, metrics collection, monitoring, dashboarding, and alerting. Cloud Monitoring monitors Google Distributed Cloud clusters in much the same way as it monitors cloud-based GKE clusters.
Cloud Logging and Cloud Monitoring are enabled by default when you create clusters with the required service accounts and Identity and Access Management (IAM) roles. You can't disable Cloud Logging and Cloud Monitoring. For more information about service accounts and the required roles, see Configure service accounts.
The agents can be configured to change the following:
- Scope of logging and monitoring, from only system components (the default) to both system components and applications.
- Level of metrics collected, from only an optimized set of metrics (the default) to all metrics.
For more information, see Configuring Stackdriver agents for Google Distributed Cloud in this document.
Logging and Monitoring provide a single, easy-to-configure, powerful cloud-based observability solution. We highly recommend Logging and Monitoring when running workloads on Google Distributed Cloud. For applications with components running on Google Distributed Cloud and standard on-premises infrastructure, you might consider other solutions for an end-to-end view of those applications.
- For details about architecture, configuration, and what data is replicated to your Google Cloud project by default, see How Logging and Monitoring for Google Distributed Cloud works.
- For more information about Logging, see the Cloud Logging documentation.
- For more information about Monitoring, see the Cloud Monitoring documentation.
- To learn how to view and use Cloud Monitoring resource utilization metrics from Google Distributed Cloud at fleet level, see Use the Google Kubernetes Engine overview.
Prometheus and Grafana
Prometheus and Grafana are two popular open source monitoring products available in the Cloud Marketplace:
- Prometheus collects application and system metrics.
- Alertmanager handles sending alerts with several different alerting mechanisms.
- Grafana is a dashboarding tool.
We recommend that you use Google Cloud Managed Service for Prometheus, which is based on Cloud Monitoring, for all your monitoring needs. With Google Cloud Managed Service for Prometheus, you can monitor system components without charge. Google Cloud Managed Service for Prometheus is also compatible with Grafana. However, if you prefer a purely local monitoring system, you can install Prometheus and Grafana in your clusters.
If you installed Prometheus locally and want to collect metrics from system components, you need to give permission to your local Prometheus instance to access the metrics endpoints of system components:
- Bind the service account for your Prometheus instance to the predefined gke-metrics-agent ClusterRole, and use the service account token as the credential to scrape metrics from the following system components:
  - kube-apiserver
  - kube-scheduler
  - kube-controller-manager
  - kubelet
  - node-exporter
- Use the client key and cert stored in the kube-system/stackdriver-prometheus-etcd-scrape secret to authenticate the metric scrape from etcd.
- Create a NetworkPolicy to allow access from your namespace to kube-state-metrics.
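The first step above, binding a service account to the predefined gke-metrics-agent ClusterRole, could be sketched as follows. This is a minimal illustration that builds a standard Kubernetes ClusterRoleBinding manifest as JSON. The service account name `prometheus`, the namespace `monitoring`, and the binding name are hypothetical placeholders, not values from this page.

```python
import json


def build_cluster_role_binding(service_account: str, namespace: str) -> dict:
    """Return a ClusterRoleBinding manifest that binds a ServiceAccount
    to the predefined gke-metrics-agent ClusterRole."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        # Hypothetical binding name; choose any name unique in your cluster.
        "metadata": {"name": f"{service_account}-gke-metrics-agent"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "gke-metrics-agent",
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }


if __name__ == "__main__":
    # "prometheus" and "monitoring" are assumed example values.
    manifest = build_cluster_role_binding("prometheus", "monitoring")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output can be saved to a file and applied with kubectl apply -f. Review the generated binding against your own naming and namespace conventions before you apply it.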
Third-party solutions
Google has worked with several third-party logging and monitoring solution providers to help their products work well with Google Distributed Cloud. These include Datadog, Elastic, and Splunk. Additional validated third parties will be added in the future.
The following solution guides are available for using third-party solutions with Google Distributed Cloud:
- Monitoring Google Distributed Cloud with the Elastic Stack
- Collect logs on Google Distributed Cloud with Splunk Connect
How Logging and Monitoring for Google Distributed Cloud works
Cloud Logging and Cloud Monitoring are installed and activated in each cluster when you create a new admin or user cluster.
The Stackdriver agents include several components on each cluster:
- Stackdriver Operator (stackdriver-operator-*). Manages the lifecycle for all other Stackdriver agents deployed onto the cluster.
- Stackdriver Custom Resource. A resource that is automatically created as part of the Google Distributed Cloud installation process.
- GKE Metrics Agent (gke-metrics-agent-*). An OpenTelemetry Collector based DaemonSet that scrapes metrics on each node and sends them to Cloud Monitoring. A node-exporter DaemonSet and a kube-state-metrics deployment are also included to provide more metrics about the cluster.
- Stackdriver Log Forwarder (stackdriver-log-forwarder-*). A Fluent Bit DaemonSet that forwards logs from each machine to Cloud Logging. The Log Forwarder buffers the log entries on the node locally and re-sends them for up to 4 hours. If the buffer gets full or if the Log Forwarder can't reach the Cloud Logging API for more than 4 hours, logs are dropped.
- Metadata Agent (stackdriver-metadata-agent-). A deployment that sends metadata for Kubernetes resources such as pods, deployments, or nodes to the Config Monitoring for Ops API. This addition of metadata lets you query your metrics data by deployment name, node name, or even Kubernetes service name.
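The buffering behavior described for the Log Forwarder can be illustrated with a small toy model. This is not the Fluent Bit implementation. The entry-count capacity, the choice to drop the newest entry when the buffer is full, and all class and method names are assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_RETRY_SECONDS = 4 * 60 * 60  # retry window described above: 4 hours


@dataclass
class LogBuffer:
    """Toy model of a node-local log buffer (hypothetical, not Fluent Bit)."""

    capacity: int  # assumed limit, counted in entries
    entries: deque = field(default_factory=deque)  # (timestamp, message) pairs
    dropped: int = 0

    def add(self, timestamp: float, message: str) -> None:
        """Buffer an entry; assume the new entry is dropped when full."""
        if len(self.entries) >= self.capacity:
            self.dropped += 1
            return
        self.entries.append((timestamp, message))

    def expire(self, now: float) -> None:
        """Drop entries that have waited longer than the retry window."""
        while self.entries and now - self.entries[0][0] > MAX_RETRY_SECONDS:
            self.entries.popleft()
            self.dropped += 1

    def flush(self, now: float, api_reachable: bool) -> list:
        """Expire stale entries, then send the rest if the API is reachable."""
        self.expire(now)
        if not api_reachable:
            return []
        sent = [message for _, message in self.entries]
        self.entries.clear()
        return sent
```

In this model, an entry is lost in exactly the two cases the description names: the buffer is full when the entry arrives, or the entry stays unsent for more than 4 hours.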
You can see the agents installed by Stackdriver by running the following command:
kubectl -n kube-system get pods -l "managed-by=stackdriver"
The output of this command is similar to the following:
kube-system gke-metrics-agent-4th8r 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-8lt4s 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-dhxld 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-lbkl2 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-pblfk 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-qfwft 1/1 Running 1 (40h ago) 40h
kube-system kube-state-metrics-9948b86dd-6chhh 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-5s4pg 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-d9gwv 1/1 Running 2 (40h ago) 40h
kube-system node-exporter-fhbql 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-gzf8t 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-tsrpp 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-xzww7 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-8lwxh 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-f7cgf 1/1 Running 2 (40h ago) 40h
kube-system stackdriver-log-forwarder-fl5gf 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-q5lq8 1/1 Running 2 (40h ago) 40h
kube-system stackdriver-log-forwarder-www4b 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-xqgjc 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-metadata-agent-cluster-level-5bb5b6d6bc-z9rx7 1/1 Running 1 (40h ago) 40h
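To check output like this at a glance, a short script can count pods per agent and flag any pod that is not running. The sketch below assumes the column layout shown in the sample output (namespace, pod name, ready count, status). The function name and the "other" bucket are hypothetical.

```python
from collections import Counter

# Component name prefixes taken from the agent list in this section.
COMPONENT_PREFIXES = (
    "gke-metrics-agent",
    "kube-state-metrics",
    "node-exporter",
    "stackdriver-log-forwarder",
    "stackdriver-metadata-agent",
    "stackdriver-operator",
)


def summarize_pods(output: str) -> dict:
    """Count pods per component and list pods whose status is not Running."""
    counts = Counter()
    not_running = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank or short lines and a header row, if present.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        name, status = fields[1], fields[3]
        component = next(
            (p for p in COMPONENT_PREFIXES if name.startswith(p)), "other"
        )
        counts[component] += 1
        if status != "Running":
            not_running.append(name)
    return {"counts": dict(counts), "not_running": not_running}
```

If your kubectl output omits the namespace column, adjust the field indexes accordingly.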
Cloud Monitoring metrics
For a list of metrics collected by Cloud Monitoring, see View Google Distributed Cloud metrics.
Configuring Stackdriver agents for Google Distributed Cloud
The Stackdriver agents installed with Google Distributed Cloud collect data about system components to help maintain your clusters and troubleshoot issues with them. The following sections describe Stackdriver configuration and operating modes.
System Components Only (Default Mode)
Upon installation, Stackdriver agents are configured by default to collect logs and metrics, including performance details (for example, CPU and memory utilization), and similar metadata, for Google-provided system components. These include all workloads in the admin cluster, and for user clusters, workloads in the kube-system, gke-system, gke-connect, istio-system, and config-management-system namespaces.
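The default scope can be expressed as a simple rule. The following sketch encodes the namespaces listed above. The function name and the "admin" and "user" cluster type strings are hypothetical and exist only for illustration.

```python
# Namespaces collected by default in user clusters, as listed above.
SYSTEM_NAMESPACES = frozenset(
    {
        "kube-system",
        "gke-system",
        "gke-connect",
        "istio-system",
        "config-management-system",
    }
)


def collected_by_default(namespace: str, cluster_type: str) -> bool:
    """Return True if workloads in the namespace are in the default scope."""
    if cluster_type == "admin":
        # All workloads in the admin cluster are covered.
        return True
    if cluster_type == "user":
        return namespace in SYSTEM_NAMESPACES
    raise ValueError(f"unknown cluster type: {cluster_type!r}")
```

For example, `collected_by_default("default", "user")` returns `False`, which reflects that application namespaces in user clusters are not covered until application logging and monitoring are enabled.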
System Components and Applications
To enable application logging and monitoring on top of the default mode, follow the steps in Enable application logging and monitoring.
Optimized metrics (Default metrics)
By default, the kube-state-metrics deployments running in the cluster collect and report an optimized set of kube metrics to Google Cloud Observability (formerly Stackdriver). Fewer resources are needed to collect this optimized set of metrics, which improves overall performance and scalability.
Excluded kube metrics
The following kube metrics are excluded from the optimized metrics:
- kube_certificatesigningrequest_cert_length
- kube_certificatesigningrequest_condition
- kube_certificatesigningrequest_created
- kube_certificatesigningrequest_labels
- kube_configmap_annotations
- kube_configmap_info
- kube_configmap_labels
- kube_configmap_metadata_resource_version
- kube_daemonset_annotations
- kube_daemonset_created
- kube_daemonset_labels
- kube_daemonset_metadata_generation
- kube_daemonset_status_observed_generation
- kube_deployment_annotations
- kube_deployment_created
- kube_deployment_labels
- kube_deployment_spec_paused
- kube_deployment_spec_strategy_rollingupdate_max_surge
- kube_deployment_spec_strategy_rollingupdate_max_unavailable
- kube_deployment_status_condition
- kube_deployment_status_replicas_ready
- kube_endpoint_annotations
- kube_endpoint_created
- kube_endpoint_info
- kube_endpoint_labels
- kube_endpoint_ports
- kube_horizontalpodautoscaler_annotations
- kube_horizontalpodautoscaler_info
- kube_horizontalpodautoscaler_labels
- kube_horizontalpodautoscaler_metadata_generation
- kube_horizontalpodautoscaler_status_condition
- kube_job_annotations
- kube_job_complete
- kube_job_created
- kube_job_info
- kube_job_labels
- kube_job_owner
- kube_job_spec_completions
- kube_job_spec_parallelism
- kube_job_status_completion_time
- kube_job_status_start_time
- kube_job_status_succeeded
- kube_lease_owner
- kube_lease_renew_time
- kube_limitrange
- kube_limitrange_created
- kube_mutatingwebhookconfiguration_info
- kube_namespace_labels
- kube_networkpolicy_annotations
- kube_networkpolicy_labels
- kube_networkpolicy_spec_egress_rules
- kube_networkpolicy_spec_ingress_rules
- kube_node_annotations
- kube_node_role
- kube_persistentvolume_annotations
- kube_persistentvolume_labels
- kube_persistentvolumeclaim_access_mode
- kube_persistentvolumeclaim_annotations
- kube_persistentvolumeclaim_labels
- kube_pod_annotations
- kube_pod_completion_time
- kube_pod_container_resource_limits
- kube_pod_container_resource_requests
- kube_pod_container_state_started
- kube_pod_created
- kube_pod_init_container_info
- kube_pod_init_container_resource_limits
- kube_pod_init_container_resource_requests
- kube_pod_init_container_status_last_terminated_reason
- kube_pod_init_container_status_ready
- kube_pod_init_container_status_restarts_total
- kube_pod_init_container_status_running
- kube_pod_init_container_status_terminated
- kube_pod_init_container_status_terminated_reason
- kube_pod_init_container_status_waiting
- kube_pod_init_container_status_waiting_reason
- kube_pod_labels
- kube_pod_owner
- kube_pod_restart_policy
- kube_pod_spec_volumes_persistentvolumeclaims_readonly
- kube_pod_start_time
- kube_poddisruptionbudget_annotations
- kube_poddisruptionbudget_created
- kube_poddisruptionbudget_labels
- kube_poddisruptionbudget_status_expected_pods
- kube_poddisruptionbudget_status_observed_generation
- kube_poddisruptionbudget_status_pod_disruptions_allowed
- kube_replicaset_annotations
- kube_replicaset_created
- kube_replicaset_labels
- kube_replicaset_metadata_generation
- kube_replicaset_owner
- kube_replicaset_status_observed_generation
- kube_resourcequota_created
- kube_secret_annotations
- kube_secret_info
- kube_secret_labels
- kube_secret_metadata_resource_version
- kube_secret_type
- kube_service_annotations
- kube_service_created
- kube_service_info
- kube_service_labels
- kube_service_spec_type
- kube_statefulset_annotations
- kube_statefulset_created
- kube_statefulset_labels
- kube_statefulset_status_current_revision
- kube_statefulset_status_update_revision
- kube_storageclass_annotations
- kube_storageclass_created
- kube_storageclass_info
- kube_storageclass_labels
- kube_validatingwebhookconfiguration_info
- kube_validatingwebhookconfiguration_metadata_resource_version
- kube_volumeattachment_created
- kube_volumeattachment_info
- kube_volumeattachment_labels
- kube_volumeattachment_spec_source_persistentvolume
- kube_volumeattachment_status_attached
- kube_volumeattachment_status_attachment_metadata
The complete set of Google Distributed Cloud metrics is documented in View Anthos metrics.
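If your dashboards or alerts reference kube metrics, you might want to check whether any of them are in the excluded list. The following sketch scans query expressions for metric names that start with kube_ and reports those that are excluded. The `EXCLUDED_SAMPLE` set holds only a few entries from the list above for brevity, and the function name and example expressions are hypothetical.

```python
import re

# A small sample of the excluded metrics listed above; extend it with the
# full list before relying on the result.
EXCLUDED_SAMPLE = frozenset(
    {
        "kube_deployment_status_replicas_ready",
        "kube_pod_container_resource_limits",
        "kube_pod_container_resource_requests",
        "kube_pod_labels",
        "kube_job_complete",
    }
)

KUBE_METRIC_PATTERN = re.compile(r"\bkube_[a-zA-Z0-9_]+\b")


def find_excluded_metrics(expressions, excluded=EXCLUDED_SAMPLE) -> list:
    """Return the sorted excluded metric names referenced in the expressions."""
    found = set()
    for expression in expressions:
        for name in KUBE_METRIC_PATTERN.findall(expression):
            if name in excluded:
                found.add(name)
    return sorted(found)
```

Any name that this check reports isn't available while the cluster uses the optimized set of metrics.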