Version 1.10. This version is no longer supported. For information about how to upgrade to version 1.11, see Upgrading Anthos on bare metal in the 1.11 documentation. For more information about supported and unsupported versions, see the Version history page in the latest documentation.

Creating alerting policies

This page shows how to create alerting policies for Google Distributed Cloud clusters.

Before you begin

You must have the following permissions to create alerting policies:

monitoring.alertPolicies.create
monitoring.alertPolicies.delete
monitoring.alertPolicies.update

You have these permissions if you have any one of the following roles:

monitoring.alertPolicyEditor
monitoring.editor
Project Editor
Project Owner

To check your roles, go to the IAM page in the Google Cloud console.
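If you would rather check roles from a script than in the console, the following is a minimal Python sketch. It assumes you have exported the project's IAM policy as JSON (for example, with `gcloud projects get-iam-policy PROJECT_ID --format=json`) and that the four roles listed above map to the role IDs shown in the code. The function name and the sample members are hypothetical.

```python
import json

# Role IDs assumed to correspond to the four roles listed above.
QUALIFYING_ROLES = {
    "roles/monitoring.alertPolicyEditor",
    "roles/monitoring.editor",
    "roles/editor",
    "roles/owner",
}


def qualifying_roles_for(policy: dict, member: str) -> set:
    """Return the qualifying roles bound directly to `member` in an IAM policy."""
    found = set()
    for binding in policy.get("bindings", []):
        if binding.get("role") in QUALIFYING_ROLES and member in binding.get("members", []):
            found.add(binding["role"])
    return found


# Hypothetical policy, shaped like the JSON that get-iam-policy returns.
sample_policy = json.loads("""
{
  "bindings": [
    {"role": "roles/monitoring.editor", "members": ["user:alice@example.com"]},
    {"role": "roles/viewer", "members": ["user:alice@example.com", "user:bob@example.com"]}
  ]
}
""")

print(qualifying_roles_for(sample_policy, "user:alice@example.com"))
print(qualifying_roles_for(sample_policy, "user:bob@example.com"))
```

This sketch only sees roles bound directly to the member in the project policy. Roles granted through a group, or inherited from a folder or organization, would not appear, so treat an empty result as a prompt to check the IAM page rather than as proof that the permissions are missing.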

Creating a policy: Anthos on baremetal cluster API server unavailable

In this exercise, you create an alerting policy for Kubernetes API servers of clusters. With this policy in place, you can arrange to be notified whenever the API server of a cluster is unavailable.

Download the policy configuration file: apiserver-unavailable.json

Create the policy:

    gcloud alpha monitoring policies create --policy-from-file=POLICY_CONFIG

Replace POLICY_CONFIG with the path of the configuration file you just downloaded.
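Before you run the create command, it can help to confirm that the downloaded file is valid JSON. The following Python sketch does that and then builds the command shown above without running it. The list of expected top-level fields is an assumption based on the example policy output on this page, and the function names and demo file are hypothetical.

```python
import json
import os
import shlex
import tempfile

# Top-level fields this sketch expects, based on the example policy output on this page.
EXPECTED_FIELDS = ("displayName", "combiner", "conditions")


def missing_fields(config_path: str) -> list:
    """Return the expected top-level fields that are absent from the policy file."""
    with open(config_path, encoding="utf-8") as f:
        policy = json.load(f)  # raises ValueError if the file is not valid JSON
    return [field for field in EXPECTED_FIELDS if field not in policy]


def create_command(config_path: str) -> list:
    """Build the argv for the create command shown above; this does not run it."""
    return ["gcloud", "alpha", "monitoring", "policies", "create",
            f"--policy-from-file={config_path}"]


# Demo with a hypothetical, minimal policy file (not the downloaded one).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as demo:
    json.dump({"displayName": "demo", "combiner": "OR", "conditions": []}, demo)
    demo_path = demo.name

demo_missing = missing_fields(demo_path)
os.remove(demo_path)

print(demo_missing)
print(shlex.join(create_command("apiserver-unavailable.json")))
```

To actually create the policy from Python, you could pass the returned list to `subprocess.run(argv, check=True)`.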

View your alerting policies:

Console

In the Google Cloud console, go to the Monitoring page.
On the left, select Alerting.
Under Policies, you can see a list of your alerting policies.

In the list, select Anthos on baremetal cluster API server unavailable (critical) to see details about your new policy. Under Conditions, you can see a description of the policy. For example:
```
Policy violates when ANY condition is met
Anthos on baremetal cluster API server uptime is absent
Anthos on baremetal cluster API server uptime is less than 99.99% per minute
```

gcloud

    gcloud alpha monitoring policies list

The output shows detailed information about the policy. For example:

    combiner: OR
    conditions:
    - conditionAbsent:
        aggregations:
        - alignmentPeriod: 60s
          crossSeriesReducer: REDUCE_MEAN
          groupByFields:
          - resource.label.project_id
          - resource.label.location
          - resource.label.cluster_name
          - resource.label.namespace_name
          - resource.label.container_name
          - resource.label.pod_name
          perSeriesAligner: ALIGN_MAX
        duration: 300s
        filter: resource.type = "k8s_container" AND resource.labels.namespace_name = "kube-system" AND metric.type = "kubernetes.io/anthos/container/uptime" AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
        trigger:
          count: 1
      displayName: Anthos on baremetal cluster API server uptime is absent
      name: projects/…/alertPolicies/12404845535868002666/conditions/12404845535868003603
    - conditionThreshold:
        aggregations:
        - alignmentPeriod: 120s
          crossSeriesReducer: REDUCE_MEAN
          groupByFields:
          - resource.label.project_id
          - resource.label.location
          - resource.label.cluster_name
          - resource.label.namespace_name
          - resource.label.container_name
          - resource.label.pod_name
          perSeriesAligner: ALIGN_MAX
        comparison: COMPARISON_LT
        duration: 300s
        filter: resource.type = "k8s_container" AND resource.labels.namespace_name = "kube-system" AND metric.type = "kubernetes.io/anthos/container/uptime" AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
        thresholdValue: 119.0
        trigger:
          count: 1
      displayName: Anthos on baremetal cluster API server uptime is less than 99.99% per minute
      name: projects/…/alertPolicies/12404845535868002666/conditions/12404845535868004540
    creationRecord:
      mutateTime: …
      mutatedBy: …
    displayName: Anthos on baremetal cluster API server unavailable (critical)
    enabled: true
    mutationRecord:
      mutateTime: …
      mutatedBy: …
    name: projects/…/alertPolicies/12404845535868002666
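
When a project has many policies, the full listing can be hard to scan. The following Python sketch reduces it to one row per condition. It assumes you ran the list command with gcloud's global `--format=json` flag and that the JSON uses the same field names as the example output above. The function name is hypothetical, and the sample data is abridged from that example.

```python
import json


def summarize(policies: list) -> list:
    """Return (policy name, condition name, condition kind) for each condition."""
    rows = []
    for policy in policies:
        for condition in policy.get("conditions", []):
            if "conditionAbsent" in condition:
                kind = "absent"
            elif "conditionThreshold" in condition:
                kind = "threshold"
            else:
                kind = "other"
            rows.append((policy.get("displayName"), condition.get("displayName"), kind))
    return rows


# Abridged sample, shaped like the example output on this page.
sample = json.loads("""
[
  {
    "displayName": "Anthos on baremetal cluster API server unavailable (critical)",
    "enabled": true,
    "combiner": "OR",
    "conditions": [
      {
        "displayName": "Anthos on baremetal cluster API server uptime is absent",
        "conditionAbsent": {"duration": "300s"}
      },
      {
        "displayName": "Anthos on baremetal cluster API server uptime is less than 99.99% per minute",
        "conditionThreshold": {"comparison": "COMPARISON_LT", "thresholdValue": 119.0}
      }
    ]
  }
]
""")

for row in summarize(sample):
    print(row)
```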

Creating additional alerting policies

This section provides descriptions and configuration files for a set of recommended alerting policies.

To create a policy, follow the same steps that you used in the preceding exercise:

To download the configuration file, click the link in the right column of the relevant table below.
To create the policy, run gcloud alpha monitoring policies create with the --policy-from-file flag set to the path of the downloaded file.
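
If you download several configuration files into one directory, the two steps above can be sketched as a loop. This Python example builds one create command per `.json` file and prints it without running anything. The function name is hypothetical, and the demo uses empty placeholder files named after two of the policies on this page rather than the real downloads.

```python
import pathlib
import tempfile


def create_commands(config_dir: str) -> list:
    """Build one create command (argv list) per *.json file, sorted by file name."""
    return [
        ["gcloud", "alpha", "monitoring", "policies", "create",
         f"--policy-from-file={path}"]
        for path in sorted(pathlib.Path(config_dir).glob("*.json"))
    ]


# Demo with placeholder files; real policy files would come from the tables below.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("scheduler-unavailable.json", "pod-crash-looping.json"):
        (pathlib.Path(demo_dir) / name).write_text("{}", encoding="utf-8")
    demo_commands = create_commands(demo_dir)

for argv in demo_commands:
    print(" ".join(argv))
```

Printing the commands first gives you a chance to review them. To run them, you could pass each list to `subprocess.run(argv, check=True)`.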

Control plane components availability

Alert name	Description	Alerting policy definition in Cloud Monitoring
Anthos on baremetal cluster API server unavailable (critical)	API server is not up or uptime is less than 99.99% per minute	apiserver-unavailable.json
Anthos on baremetal cluster scheduler unavailable (critical)	Scheduler is not up or uptime is less than 99.99% per minute	scheduler-unavailable.json
Anthos on baremetal controller manager unavailable (critical)	Controller manager has disappeared from metrics target discovery	controller-manager-unavailable.json

Kubernetes system

Alert name	Description	Alerting policy definition in Cloud Monitoring
Anthos on baremetal pod crash looping (critical)	Pod is in a crash loop status	pod-crash-looping.json
Anthos on baremetal pod not ready for more than one hour (critical)	Pod is in a non-ready state for more than one hour	pod-not-ready-1h.json
Anthos on baremetal persistent volume high usage (critical)	Claimed persistent volume is expected to fill up	persistent-volume-usage-high.json
Anthos on baremetal node not ready for more than one hour (critical)	Node is in a non-ready state for more than one hour	node-not-ready-1h.json
Anthos on baremetal node cpu usage exceeds 80 percent (critical)	Node CPU usage is over 80%	node-cpu-usage-high.json
Anthos on baremetal node memory usage exceeds 80 percent (critical)	Node memory usage is over 80%	node-memory-usage-high.json
Anthos on baremetal node disk usage exceeds 80 percent (critical)	Node disk usage is over 80%	node-disk-usage-high.json

Kubernetes performance

Alert name	Description	Alerting policy definition in Cloud Monitoring
Anthos on baremetal API server error count ratio exceeds 10 percent (critical)	API server is returning errors for more than 10% of requests	api-server-error-ratio-10-percent.json
Anthos on baremetal API server error count ratio exceeds 5 percent (warning)	API server is returning errors for more than 5% of requests	api-server-error-ratio-5-percent.json
Anthos on baremetal etcd leader changes too frequently (critical)	The `etcd` leader changes too frequently	etcd-leader-changes-too-frequent.json
Anthos on baremetal etcd proposals failed too frequently (critical)	The `etcd` proposals are failing too frequently	etcd-proposals-failed-too-frequent.json
Anthos on baremetal etcd server is not in quorum (critical)	The `etcd` server is not in quorum	etcd-server-not-in-quorum.json

Getting notified

After you create an alerting policy, you can define one or more notification channels for the policy. There are several kinds of notification channels. For example, you can be notified by email, a Slack channel, or a mobile app. You can choose the channels that suit your needs.

For instructions about how to configure notification channels, see Managing notification channels.
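
One way to attach channels is to list them in the policy definition itself. The following Python sketch assumes the policy file follows the Cloud Monitoring AlertPolicy JSON shape, in which a `notificationChannels` field holds channel resource names of the form `projects/PROJECT_ID/notificationChannels/CHANNEL_ID`. `PROJECT_ID` and `CHANNEL_ID` are placeholders, and the function name is hypothetical.

```python
import copy


def with_channels(policy: dict, channel_names: list) -> dict:
    """Return a copy of the policy with the channels appended, without duplicates."""
    updated = copy.deepcopy(policy)
    channels = updated.setdefault("notificationChannels", [])
    for name in channel_names:
        if name not in channels:
            channels.append(name)
    return updated


# Placeholder values; substitute a real channel resource name before use.
policy = {"displayName": "Anthos on baremetal cluster API server unavailable (critical)"}
channel = "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"

result = with_channels(policy, [channel, channel])
print(result["notificationChannels"])
```

The function returns a copy so that the original policy dictionary stays unchanged, which makes it easier to compare the two before you write the result back to a file.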