Creating alerting policies

This page shows how to create alerting policies for Google Distributed Cloud clusters.

Before you begin

You must have the following permissions to create alerting policies:

  • monitoring.alertPolicies.create
  • monitoring.alertPolicies.delete
  • monitoring.alertPolicies.update

You have these permissions if you have any one of the following roles:

  • monitoring.alertPolicyEditor
  • monitoring.editor
  • Project Editor
  • Project Owner

To check your roles, go to the IAM page in the Google Cloud console.
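
The same check can be done from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated; `list_member_roles` is a hypothetical helper name, and the project ID and email address are placeholders you supply.

```shell
# Hypothetical helper: print the IAM roles bound to one member in a project.
# Usage: list_member_roles PROJECT_ID alice@example.com
list_member_roles() {
  project_id="$1"
  member_email="$2"
  # Flatten the policy so each (role, member) pair is one row, keep the rows
  # whose member matches the email, and print only the role column.
  gcloud projects get-iam-policy "$project_id" \
    --flatten="bindings[].members" \
    --filter="bindings.members:${member_email}" \
    --format="value(bindings.role)"
}
```

Roles are printed with a `roles/` prefix, so look for entries such as `roles/monitoring.alertPolicyEditor`, `roles/monitoring.editor`, `roles/editor`, or `roles/owner`.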

Creating a policy: Anthos on baremetal cluster API server unavailable

In this exercise, you create an alerting policy for Kubernetes API servers of clusters. With this policy in place, you can arrange to be notified whenever the API server of a cluster is unavailable.

  1. Download the policy configuration file: apiserver-unavailable.json

  2. Create the policy:

      gcloud alpha monitoring policies create --policy-from-file=POLICY_CONFIG

    Replace POLICY_CONFIG with the path of the configuration file you just downloaded.

  3. View your alerting policies:

    Console

    1. In the Google Cloud console, go to the Monitoring page.

      Go to Monitoring

    2. On the left, select Alerting.

    3. Under Policies, you can see a list of your alerting policies.

      In the list, select Anthos on baremetal cluster API server unavailable (critical) to see details about your new policy. Under Conditions, you can see a description of the policy. For example:

        Policy violates when ANY condition is met
        Anthos on baremetal cluster API server uptime is absent
        Anthos on baremetal cluster API server uptime is less than 99.99% per minute

    gcloud

      gcloud alpha monitoring policies list

    The output shows detailed information about the policy. For example:

      combiner: OR
      conditions:
      - conditionAbsent:
          aggregations:
          - alignmentPeriod: 60s
            crossSeriesReducer: REDUCE_MEAN
            groupByFields:
            - resource.label.project_id
            - resource.label.location
            - resource.label.cluster_name
            - resource.label.namespace_name
            - resource.label.container_name
            - resource.label.pod_name
            perSeriesAligner: ALIGN_MAX
          duration: 300s
          filter: resource.type = "k8s_container" AND resource.labels.namespace_name = "kube-system" AND metric.type = "kubernetes.io/anthos/container/uptime" AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
          trigger:
            count: 1
        displayName: Anthos on baremetal cluster API server uptime is absent
        name: projects/…/alertPolicies/12404845535868002666/conditions/12404845535868003603
      - conditionThreshold:
          aggregations:
          - alignmentPeriod: 120s
            crossSeriesReducer: REDUCE_MEAN
            groupByFields:
            - resource.label.project_id
            - resource.label.location
            - resource.label.cluster_name
            - resource.label.namespace_name
            - resource.label.container_name
            - resource.label.pod_name
            perSeriesAligner: ALIGN_MAX
          comparison: COMPARISON_LT
          duration: 300s
          filter: resource.type = "k8s_container" AND resource.labels.namespace_name = "kube-system" AND metric.type = "kubernetes.io/anthos/container/uptime" AND resource.label."container_name"=monitoring.regex.full_match("kube-apiserver")
          thresholdValue: 119.0
          trigger:
            count: 1
        displayName: Anthos on baremetal cluster API server uptime is less than 99.99% per minute
        name: projects/…/alertPolicies/12404845535868002666/conditions/12404845535868004540
      creationRecord:
        mutateTime:
        mutatedBy:
      displayName: Anthos on baremetal cluster API server unavailable (critical)
      enabled: true
      mutationRecord:
        mutateTime:
        mutatedBy:
      name: projects/…/alertPolicies/12404845535868002666

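When a project has many policies, the full listing can be hard to scan. The following is a minimal sketch of one way to look up a single policy by its display name, assuming the gcloud CLI is installed and authenticated; `find_policy_by_display_name` is a hypothetical helper name, and the filtering is done locally with awk rather than by the service.

```shell
# Hypothetical helper: print the resource name of every alerting policy
# whose displayName matches the argument exactly.
find_policy_by_display_name() {
  display_name="$1"
  # value(...) prints the selected fields separated by tabs, one policy per line.
  gcloud alpha monitoring policies list --format="value(name,displayName)" |
    awk -F '\t' -v want="$display_name" '$2 == want { print $1 }'
}

# Example (not run here):
#   find_policy_by_display_name "Anthos on baremetal cluster API server unavailable (critical)"
```

The helper prints nothing if no policy has that display name, which makes it usable as a quick existence check in scripts.
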
Creating additional alerting policies

This section provides descriptions and configuration files for a set of recommended alerting policies.

To create a policy, follow the same steps that you used in the preceding exercise:

  1. To download the configuration file, click the link in the right column.

  2. To create the policy, run gcloud alpha monitoring policies create with --policy-from-file set to the path of the downloaded file.
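
If you download several of the configuration files listed below, the two steps above can be repeated in a loop. This is a minimal sketch, assuming the files were saved to a single local directory; `create_policies_from_dir` is a hypothetical helper name and the directory path is a placeholder.

```shell
# Hypothetical helper: create one alerting policy per *.json file in a directory.
# Usage: create_policies_from_dir ./alert-policies
create_policies_from_dir() {
  config_dir="$1"
  for config in "$config_dir"/*.json; do
    # Skip the unexpanded pattern when the directory has no JSON files.
    [ -e "$config" ] || continue
    echo "Creating policy from ${config}" >&2
    # Stop at the first failure so a rejected file is not silently skipped.
    gcloud alpha monitoring policies create --policy-from-file="$config" || return 1
  done
}
```

Running the helper twice creates duplicate policies, because each create call produces a new policy regardless of display name.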

Control plane components availability

Alert name | Description | Alerting policy definition in Cloud Monitoring
Anthos on baremetal cluster API server unavailable (critical) | API server is not up or uptime is less than 99.99% per minute | apiserver-unavailable.json
Anthos on baremetal cluster scheduler unavailable (critical) | Scheduler is not up or uptime is less than 99.99% per minute | scheduler-unavailable.json
Anthos on baremetal controller manager unavailable (critical) | Controller manager has disappeared from metrics target discovery | controller-manager-unavailable.json

Kubernetes system

Alert name | Description | Alerting policy definition in Cloud Monitoring
Anthos on baremetal pod crash looping (critical) | Pod is in a crash loop status | pod-crash-looping.json
Anthos on baremetal pod not ready for more than one hour (critical) | Pod is in a non-ready state for more than one hour | pod-not-ready-1h.json
Anthos on baremetal persistent volume high usage (critical) | Claimed persistent volume is expected to fill up | persistent-volume-usage-high.json
Anthos on baremetal node not ready for more than one hour (critical) | Node is in a non-ready state for more than one hour | node-not-ready-1h.json
Anthos on baremetal node cpu usage exceeds 80 percent (critical) | Node cpu usage is over 80% | node-cpu-usage-high.json
Anthos on baremetal node memory usage exceeds 80 percent (critical) | Node memory usage is over 80% | node-memory-usage-high.json
Anthos on baremetal node disk usage exceeds 80 percent (critical) | Node disk usage is over 80% | node-disk-usage-high.json

Kubernetes performance

Alert name | Description | Alerting policy definition in Cloud Monitoring
Anthos on baremetal API server error count ratio exceeds 10 percent (critical) | API server is returning errors for more than 10% of requests | api-server-error-ratio-10-percent.json
Anthos on baremetal API server error count ratio exceeds 5 percent (warning) | API server is returning errors for more than 5% of requests | api-server-error-ratio-5-percent.json
Anthos on baremetal etcd leader changes too frequently (critical) | The etcd leader changes too frequently | etcd-leader-changes-too-frequent.json
Anthos on baremetal etcd proposals failed too frequently (critical) | The etcd proposals are failing too frequently | etcd-proposals-failed-too-frequent.json
Anthos on baremetal etcd server is not in quorum (critical) | The etcd server is not in quorum | etcd-server-not-in-quorum.json

Getting notified

After you create an alerting policy, you can define one or more notification channels for the policy. There are several kinds of notification channels. For example, you can be notified by email, a Slack channel, or a mobile app. You can choose the channels that suit your needs.

For instructions about how to configure notification channels, see Managing notification channels.
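
As an illustration, an email channel can also be created and attached from the command line. This is a minimal sketch, assuming the gcloud CLI with its beta and alpha components is installed and authenticated; `create_email_channel` and `attach_channel` are hypothetical helper names, and the display name, email address, and resource names are placeholders. The sketch also assumes that the create command returns the new channel so that `--format="value(name)"` prints its resource name.

```shell
# Hypothetical helper: create an email notification channel and print
# the resource name that the command returns.
create_email_channel() {
  channel_display_name="$1"
  email="$2"
  gcloud beta monitoring channels create \
    --display-name="$channel_display_name" \
    --type=email \
    --channel-labels="email_address=${email}" \
    --format="value(name)"
}

# Hypothetical helper: attach an existing channel to an existing policy.
# Both arguments are full resource names (projects/.../alertPolicies/...
# and projects/.../notificationChannels/...).
attach_channel() {
  policy_name="$1"
  channel_name="$2"
  gcloud alpha monitoring policies update "$policy_name" \
    --add-notification-channels="$channel_name"
}
```

A typical sequence is to capture the output of `create_email_channel` in a variable and pass it, together with the policy's `name` value from the listing, to `attach_channel`.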
