This page shows how to create alerting policies for Google Distributed Cloud clusters.
Before you begin
You must have the following permissions to create alerting policies:
- monitoring.alertPolicies.create
- monitoring.alertPolicies.delete
- monitoring.alertPolicies.update
You have these permissions if you have any one of the following roles:
- monitoring.alertPolicyEditor
- monitoring.editor
- Project Editor
- Project Owner
To check your roles, go to the IAM page in the Google Cloud console.
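If you prefer the command line, the check can be sketched as follows. This is a minimal sketch, not an official procedure: has_alerting_role is a hypothetical helper, PROJECT_ID and USER_EMAIL are placeholders, and the role IDs roles/editor and roles/owner are assumed to correspond to Project Editor and Project Owner. It only recognizes the four roles listed above, so a custom role that grants the same permissions is not detected.

```shell
# Hypothetical helper: reads one role ID per line on stdin and succeeds
# if any line is one of the four roles listed above.
has_alerting_role() {
  grep -qxE 'roles/(monitoring\.alertPolicyEditor|monitoring\.editor|editor|owner)'
}

# Example usage (PROJECT_ID and USER_EMAIL are placeholders):
#   gcloud projects get-iam-policy PROJECT_ID \
#     --flatten="bindings[].members" \
#     --filter="bindings.members:user:USER_EMAIL" \
#     --format="value(bindings.role)" | has_alerting_role \
#     && echo "You can create alerting policies."
```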
Creating a policy: Anthos on baremetal cluster API server unavailable
In this exercise, you create an alerting policy for Kubernetes API servers of clusters. With this policy in place, you can arrange to be notified whenever the API server of a cluster is unavailable.
- Download the policy configuration file: apiserver-unavailable.json
- Create the policy:
  gcloud alpha monitoring policies create --policy-from-file=POLICY_CONFIG
  Replace POLICY_CONFIG with the path of the configuration file you just downloaded.
- View your alerting policies:
Console
- In the Google Cloud console, go to the Monitoring page.
- On the left, select Alerting.
- Under Policies, you can see a list of your alerting policies.
  In the list, select Anthos on baremetal cluster API server unavailable (critical) to see details about your new policy. Under Conditions, you can see a description of the policy. For example:
  Policy violates when ANY condition is met
  Anthos on baremetal cluster API server uptime is absent
  Anthos on baremetal cluster API server uptime is less than 99.99% per minute
gcloud
gcloud alpha monitoring policies list
The output shows detailed information about the policy. For example:
    combiner: OR
    conditions:
    - conditionAbsent:
        aggregations:
        - alignmentPeriod: 60s
          crossSeriesReducer: REDUCE_MEAN
          groupByFields:
          - resource.label.project_id
          - resource.label.location
          - resource.label.cluster_name
          - resource.label.namespace_name
          - resource.label.container_name
          - resource.label.pod_name
          perSeriesAligner: ALIGN_MAX
        duration: 300s
        filter: resource.type = "k8s_container" AND resource.labels.namespace_name = "kube-system" AND metric.type = "kubernetes.io/anthos/container/uptime" AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
        trigger:
          count: 1
      displayName: Anthos on baremetal cluster API server uptime is absent
      name: projects/…/alertPolicies/12404845535868002666/conditions/12404845535868003603
    - conditionThreshold:
        aggregations:
        - alignmentPeriod: 120s
          crossSeriesReducer: REDUCE_MEAN
          groupByFields:
          - resource.label.project_id
          - resource.label.location
          - resource.label.cluster_name
          - resource.label.namespace_name
          - resource.label.container_name
          - resource.label.pod_name
          perSeriesAligner: ALIGN_MAX
        comparison: COMPARISON_LT
        duration: 300s
        filter: resource.type = "k8s_container" AND resource.labels.namespace_name = "kube-system" AND metric.type = "kubernetes.io/anthos/container/uptime" AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
        thresholdValue: 119.0
        trigger:
          count: 1
      displayName: Anthos on baremetal cluster API server uptime is less than 99.99% per minute
      name: projects/…/alertPolicies/12404845535868002666/conditions/12404845535868004540
    creationRecord:
      mutateTime: …
      mutatedBy: …
    displayName: Anthos on baremetal cluster API server unavailable (critical)
    enabled: true
    mutationRecord:
      mutateTime: …
      mutatedBy: …
    name: projects/…/alertPolicies/12404845535868002666
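The name field at the end of the gcloud output is the policy's full resource name, and its last path segment is the policy ID. If you want to look up a single policy later, a small helper can extract that ID. This is a sketch under assumptions: policy_id is a hypothetical function name, and the project in the usage comment is a placeholder.

```shell
# Hypothetical helper: print the last path segment of a policy resource
# name, for example the ID after "alertPolicies/".
policy_id() {
  printf '%s\n' "${1##*/}"
}

# Example usage (my-project is a placeholder project ID):
#   id=$(policy_id "projects/my-project/alertPolicies/12404845535868002666")
#   gcloud alpha monitoring policies describe "$id"
```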
Creating additional alerting policies
This section provides descriptions and configuration files for a set of recommended alerting policies.
To create a policy, follow the same steps that you used in the preceding exercise:
- To download the configuration file, click the link in the right column.
- To create the policy, run gcloud alpha monitoring policies create.
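If you download several configuration files into one directory, the two steps above can be repeated in a loop. The following is a minimal sketch, not part of the product tooling: create_policies and the GCLOUD override variable are hypothetical names introduced for illustration, and the directory path is whatever location you saved the files to.

```shell
# Hypothetical helper: create one alerting policy per JSON file in a directory.
# GCLOUD is an assumed override; set GCLOUD=echo to print the arguments
# instead of calling gcloud (a dry run).
create_policies() {
  dir="$1"
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue  # skip the literal pattern when nothing matches
    "${GCLOUD:-gcloud}" alpha monitoring policies create \
      --policy-from-file="$f" || return 1
  done
}

# Example usage (./policies is a placeholder directory):
#   GCLOUD=echo create_policies ./policies   # dry run
#   create_policies ./policies               # create the policies
```

The loop stops at the first failure so that a malformed file does not go unnoticed among later successes.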
Control plane components availability
Alert name | Description | Alerting policy definition in Cloud Monitoring |
---|---|---|
Anthos on baremetal cluster API server unavailable (critical) | API server is not up or uptime is less than 99.99% per minute | apiserver-unavailable.json |
Anthos on baremetal cluster scheduler unavailable (critical) | Scheduler is not up or uptime is less than 99.99% per minute | scheduler-unavailable.json |
Anthos on baremetal controller manager unavailable (critical) | Controller manager has disappeared from metrics target discovery | controller-manager-unavailable.json |
Kubernetes system
Alert name | Description | Alerting policy definition in Cloud Monitoring |
---|---|---|
Anthos on baremetal pod crash looping (critical) | Pod is in a crash loop status | pod-crash-looping.json |
Anthos on baremetal pod not ready for more than one hour (critical) | Pod is in a non-ready state for more than one hour | pod-not-ready-1h.json |
Anthos on baremetal persistent volume high usage (critical) | Claimed persistent volume is expected to fill up | persistent-volume-usage-high.json |
Anthos on baremetal node not ready for more than one hour (critical) | Node is in a non-ready state for more than one hour | node-not-ready-1h.json |
Anthos on baremetal node cpu usage exceeds 80 percent (critical) | Node CPU usage is over 80% | node-cpu-usage-high.json |
Anthos on baremetal node memory usage exceeds 80 percent (critical) | Node memory usage is over 80% | node-memory-usage-high.json |
Anthos on baremetal node disk usage exceeds 80 percent (critical) | Node disk usage is over 80% | node-disk-usage-high.json |
Kubernetes performance
Alert name | Description | Alerting policy definition in Cloud Monitoring |
---|---|---|
Anthos on baremetal API server error count ratio exceeds 10 percent (critical) | API server is returning errors for more than 10% of requests | api-server-error-ratio-10-percent.json |
Anthos on baremetal API server error count ratio exceeds 5 percent (warning) | API server is returning errors for more than 5% of requests | api-server-error-ratio-5-percent.json |
Anthos on baremetal etcd leader changes too frequently (critical) | The etcd leader changes too frequently | etcd-leader-changes-too-frequent.json |
Anthos on baremetal etcd proposals failed too frequently (critical) | The etcd proposals are failing too frequently | etcd-proposals-failed-too-frequent.json |
Anthos on baremetal etcd server is not in quorum (critical) | The etcd server is not in quorum | etcd-server-not-in-quorum.json |
Getting notified
After you create an alerting policy, you can define one or more notification channels for the policy. There are several kinds of notification channels. For example, you can be notified by email, a Slack channel, or a mobile app. You can choose the channels that suit your needs.
For instructions about how to configure notification channels, see Managing notification channels.
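Once a notification channel exists, attaching it to a policy from the command line might look like the sketch below. Treat it as an illustration under assumptions rather than the documented procedure: attach_channel and the GCLOUD override are hypothetical names, the policy and channel values are placeholders, and you should confirm the --add-notification-channels flag against the gcloud reference for your installed version.

```shell
# Hypothetical helper: attach an existing notification channel to a policy.
# GCLOUD is an assumed override; set GCLOUD=echo to print the arguments
# instead of calling gcloud (a dry run).
attach_channel() {
  policy="$1"   # policy ID or full resource name
  channel="$2"  # for example projects/PROJECT_ID/notificationChannels/CHANNEL_ID
  "${GCLOUD:-gcloud}" alpha monitoring policies update "$policy" \
    --add-notification-channels="$channel"
}

# Example usage (POLICY_ID, PROJECT_ID, and CHANNEL_ID are placeholders):
#   gcloud beta monitoring channels list --format="value(name,displayName)"
#   attach_channel POLICY_ID projects/PROJECT_ID/notificationChannels/CHANNEL_ID
```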