Back up and restore clusters

This page describes how to back up and restore clusters created with Google Distributed Cloud. These instructions apply to all cluster types supported by Google Distributed Cloud.

Back up a cluster

The backup process has two parts. First, a snapshot is made from the etcd store. Then, the related PKI certificates are saved to a tar file. The etcd store is the Kubernetes backing store for all cluster data and contains all the Kubernetes objects and custom objects required to manage cluster state. The PKI certificates are used for authentication over TLS. This data is backed up from the cluster's control plane or from one of the control planes for a high-availability (HA) deployment.

We recommend that you back up your clusters regularly to keep your snapshot data relatively current. How often you back up depends on how frequently significant changes occur in your clusters.

Make a snapshot of the etcd store

In Google Distributed Cloud, a pod named etcd-CONTROL_PLANE_NAME in the kube-system namespace runs the etcd instance for that control plane. To back up the cluster's etcd store, perform the following steps from your admin workstation:

  1. Use kubectl get po to identify the etcd Pod.

      kubectl --kubeconfig CLUSTER_KUBECONFIG get po -n kube-system \
          -l 'component=etcd,tier=control-plane'

    The response includes the etcd Pod name and its status.
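
     The output is similar to the following; the Pod name reflects your control plane node name, and the restart count and age vary by cluster:

      NAME                      READY   STATUS    RESTARTS   AGE
      etcd-CONTROL_PLANE_NAME   1/1     Running   0          22h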

  2. Use kubectl describe pod to see the containers running in the etcd pod, including the etcd container.

      kubectl --kubeconfig CLUSTER_KUBECONFIG describe pod ETCD_POD_NAME -n kube-system
    
  3. Run a Bash shell in the etcd container:

      kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
          ETCD_POD_NAME --container etcd --namespace kube-system \
          -- bin/sh
    
  4. From the shell within the etcd container, use etcdctl (version 3 of the API) to save a snapshot, snapshot.db, of the etcd store.

      ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/peer.crt \
          --key=/etc/kubernetes/pki/etcd/peer.key \
          snapshot save /tmp/snapshotDATESTAMP.db
    

    Replace DATESTAMP with the current date to prevent overwriting any subsequent snapshots.
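
     For example, if the date utility is available in the etcd container, you can let the shell embed today's date for you. This is only an illustrative variant of the command above, not a required form:

      # Example only: produces a name such as /tmp/snapshot20240131.db
      ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/peer.crt \
          --key=/etc/kubernetes/pki/etcd/peer.key \
          snapshot save /tmp/snapshot$(date +%Y%m%d).db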

  5. Exit from the shell in the container and run the following command to copy the snapshot file to the admin workstation.

      kubectl --kubeconfig CLUSTER_KUBECONFIG cp \
          kube-system/ETCD_POD_NAME:/tmp/snapshot.db \
          --container etcd snapshot.db
    
  6. Copy the etcdctl binary from the etcd pod so that it can be used during the restore process.

      kubectl --kubeconfig CLUSTER_KUBECONFIG cp \
          kube-system/ETCD_POD_NAME:/usr/local/bin/etcdctl \
          --container etcd etcdctl
    
  7. Store the snapshot file and the etcdctl binary in a location that is outside of the cluster and is not dependent on the cluster's operation.
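
     For example, you might copy both files from the admin workstation to a separate backup host. The host name and directory below are placeholders for illustration, not values defined by Google Distributed Cloud:

      # Example only: BACKUP_HOST and /backups/ are placeholder values
      scp snapshot.db etcdctl backup-user@BACKUP_HOST:/backups/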

Archive the PKI certificates

The certificates to be backed up are located in the /etc/kubernetes/pki directory of the control plane. The PKI certificates, together with the etcd store snapshot.db file, are needed to recover a cluster in the event the control plane goes down completely. The following steps create a tar file containing the PKI certificates.

  1. Use ssh to connect to the cluster's control plane as root.

      ssh root@CONTROL_PLANE_NAME
     
    
  2. From the control plane, create a tar file, certs_backup.tar.gz, with the contents of the /etc/kubernetes/pki directory.

      tar -czvf certs_backup.tar.gz -C /etc/kubernetes/pki .
    

    Creating the tar file from within the control plane preserves all the certificate file permissions.
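
     If you want to confirm what was archived, you can optionally list the contents and permissions of the tar file before copying it off the control plane:

      # Optional check: list the archived certificate files and their permissions
      tar -tzvf certs_backup.tar.gz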

  3. Exit the control plane and, from the workstation, copy the tar file containing the certificates to a preferred location on the workstation.

      sudo scp root@CONTROL_PLANE_NAME:certs_backup.tar.gz BACKUP_PATH
     
    

Restore a cluster

Restoring a cluster from a backup is a last resort and should be used when a cluster has failed catastrophically and cannot be returned to service any other way. For example, the etcd data is corrupted or the etcd Pod is in a crash loop.

The cluster restore process has two parts. First, the PKI certificates are restored on the control plane. Then, the etcd store data is restored.

Restore PKI certificates

Assuming you have backed up PKI certificates as described in Archive the PKI certificates, the following steps describe how to restore the certificates from the tar file to a control plane.

  1. Copy the PKI certificates tar file, certs_backup.tar.gz, from the workstation to the cluster control plane.

      sudo scp -r BACKUP_PATH/certs_backup.tar.gz root@CONTROL_PLANE_NAME:~/
    
  2. Use ssh to connect to the cluster's control plane as root.

      ssh root@CONTROL_PLANE_NAME
     
    
  3. From the control plane, extract the contents of the tar file to the /etc/kubernetes/pki directory.

      tar -xzvf certs_backup.tar.gz -C /etc/kubernetes/pki/
    
  4. Exit the control plane.

Restore the etcd store

The process for restoring the etcd store depends on whether the cluster is running in high-availability (HA) mode and, if so, whether quorum has been preserved. Use the following guidance to restore the etcd store for a given cluster failure situation:

  • If the failed cluster is not running in HA mode, restore the etcd store on the control plane with the following steps.

  • If the cluster is running in HA mode and quorum is preserved, do nothing. As long as quorum is preserved, you don't need to restore failed clusters.

  • If the cluster is running in HA mode and quorum is lost, repeat the following steps to restore the etcd store for each failed member.

Follow these steps from the workstation to remove and restore the etcd store on a control plane for a failed cluster:

  1. Create a /backup directory in the root directory of the control plane.

      ssh root@CONTROL_PLANE_NAME "mkdir /backup"
     
    

    This step is not strictly required, but we recommend it. The following steps assume you have created a /backup directory.

  2. Copy the etcd snapshot file, snapshot.db, and the etcdctl binary from the workstation to the backup directory on the cluster control plane.

      sudo scp snapshot.db root@CONTROL_PLANE_NAME:/backup
      sudo scp etcdctl root@CONTROL_PLANE_NAME:/backup
    
  3. Use SSH to connect to the control plane node:

      ssh root@CONTROL_PLANE_NAME
     
    
  4. Stop the etcd and kube-apiserver static pods by moving their manifest files out of the /etc/kubernetes/manifests directory and into the /backup directory.

      sudo mv /etc/kubernetes/manifests/etcd.yaml /backup/etcd.yaml
      sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /backup/kube-apiserver.yaml
    
  5. Remove the etcd data directory.

      rm -rf /var/lib/etcd/
    
  6. Run etcdctl snapshot restore using the saved binary.

      sudo chmod +x /backup/etcdctl
      sudo ETCDCTL_API=3 /backup/etcdctl \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/server.crt \
          --key=/etc/kubernetes/pki/etcd/server.key \
          --data-dir=/var/lib/etcd \
          --name=CONTROL_PLANE_NAME \
          --initial-advertise-peer-urls=https://CONTROL_PLANE_IP:2380 \
          --initial-cluster=CONTROL_PLANE_NAME=https://CONTROL_PLANE_IP:2380 \
          snapshot restore /backup/snapshot.db
    

     The entries for --name, --initial-advertise-peer-urls, and --initial-cluster can be found in the etcd.yaml manifest file that was moved to the /backup directory.
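
     If you need to look up these values, one optional convenience is to search the saved manifest for the corresponding etcd flags:

      # Example only: print the relevant etcd flags from the saved manifest
      grep -E -- '--(name|initial-advertise-peer-urls|initial-cluster)=' /backup/etcd.yaml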

  7. Ensure that /var/lib/etcd was recreated and that a new member is created in /var/lib/etcd/member.
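
     For example, a quick check is to list the restored member directory; the exact contents vary, but it typically contains snap and wal subdirectories:

      # Optional check: confirm that the restore created the member directory
      ls -l /var/lib/etcd/member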

  8. Change the owner of the /var/lib/etcd/member directory to 2003. Starting with Google Distributed Cloud release 1.10.0, the etcd container runs as a non-root user with a UID and GID of 2003.

      sudo chown -R 2003:2003 /var/lib/etcd
    
  9. Move the etcd and kube-apiserver manifests back to the /etc/kubernetes/manifests directory so that the static pods can restart.

      sudo mv /backup/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
      sudo mv /backup/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
    
  10. Run a Bash shell in the etcd container:

      kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
          ETCD_POD_NAME --container etcd --namespace kube-system \
          -- bin/sh
    
  11. Use etcdctl to confirm the added member is working properly.

      ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
          --key=/etc/kubernetes/pki/etcd/peer.key \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --endpoints=CONTROL_PLANE_IP:2379 \
          endpoint health
    

     If you are restoring multiple failed members, then after all failed members have been restored, run the command with the control plane IP addresses of all restored members in the --endpoints field.

    For example:

      ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
          --key=/etc/kubernetes/pki/etcd/peer.key \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --endpoints=10.200.0.3:2379,10.200.0.4:2379,10.200.0.5:2379 \
          endpoint health
    

     If the health check succeeds for each endpoint, your cluster should be working properly.
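
     Healthy endpoints report output similar to the following; the reported times vary:

      10.200.0.3:2379 is healthy: successfully committed proposal: took = 2.544538ms
      10.200.0.4:2379 is healthy: successfully committed proposal: took = 3.243243ms
      10.200.0.5:2379 is healthy: successfully committed proposal: took = 2.857280ms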
