This document describes how to replace a failed etcd replica in a high availability (HA) user cluster for Google Distributed Cloud.
The instructions given here apply to an HA user cluster that uses kubeception; that is, a user cluster that does not have Controlplane V2 enabled. If you need to replace an etcd replica in a user cluster that has Controlplane V2 enabled, contact Cloud Customer Care.
Before you begin
-
Make sure the admin cluster is working correctly.
-
Make sure the other two etcd members in the user cluster are working correctly. If more than one etcd member has failed, see Recovery from etcd data corruption or loss.
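One way to check the second prerequisite is to count the kube-etcd Pods that report as ready. The following is a minimal sketch, not part of the official procedure: `count_ready_etcd` is a hypothetical helper that reads the default `kubectl get pods` table on standard input, and it assumes the etcd Pods are named kube-etcd-N, as in the examples on this page.

```shell
#!/bin/sh
# Hypothetical helper: count kube-etcd Pods that are Running with every
# container ready. Reads the default `kubectl get pods` table
# (NAME READY STATUS ...) on stdin and prints the count.
count_ready_etcd() {
  awk '$1 ~ /^kube-etcd-[0-9]+$/ && $3 == "Running" {
         split($2, r, "/")          # READY column, for example "1/1"
         if (r[1] == r[2]) n++
       }
       END { print n + 0 }'
}

# Example usage (placeholders as defined on this page):
#   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME get pods | count_ready_etcd
```

With one replica down, a result of 2 is consistent with the prerequisite. A lower number suggests that more than one member has failed.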
Replacing a failed etcd replica
-
Back up a copy of the etcd PodDisruptionBudget (PDB) so you can restore it later.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME get pdb kube-etcd-pdb -o yaml > PATH_TO_PDB_FILE
Where:
-
ADMIN_CLUSTER_KUBECONFIG is the path to the kubeconfig file for the admin cluster.
-
USER_CLUSTER_NAME is the name of the user cluster that contains the failed etcd replica.
-
PATH_TO_PDB_FILE is the path where you want to save the etcd PDB file, for instance /tmp/etcdpdb.yaml.
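Because the next step deletes the PDB, it can be worth confirming that the backup file was actually written first. The following is a sketch under stated assumptions: `pdb_backup_ok` is a hypothetical helper, and it only checks that the file is non-empty and contains a top-level `kind: PodDisruptionBudget` line, which is what `kubectl get pdb NAME -o yaml` emits for a single object.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file exists, is non-empty,
# and looks like a PodDisruptionBudget manifest.
pdb_backup_ok() {
  [ -s "$1" ] && grep -q '^kind: PodDisruptionBudget' "$1"
}

# Example usage (PATH_TO_PDB_FILE as defined above):
#   pdb_backup_ok PATH_TO_PDB_FILE || echo "PDB backup looks wrong; do not delete the PDB yet"
```

This does not validate the manifest. It only guards against an empty or misdirected redirect.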
-
Delete the etcd PodDisruptionBudget (PDB).
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME delete pdb kube-etcd-pdb
-
Run the following command to open the kube-etcd StatefulSet in your text editor:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME edit statefulset kube-etcd
Change the value of the --initial-cluster-state flag to existing.
containers:
  - name: kube-etcd
    ...
    args:
      - --initial-cluster-state=existing
      ...
-
Drain the failed etcd replica node.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG drain NODE_NAME --ignore-daemonsets --delete-local-data
Where NODE_NAME is the name of the failed etcd replica node.
-
Create a new shell in the container of one of the working kube-etcd pods.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it \
  KUBE_ETCD_POD --container kube-etcd --namespace USER_CLUSTER_NAME \
  -- /bin/sh
Where KUBE_ETCD_POD is the name of the working kube-etcd pod. For example, kube-etcd-0.
From this new shell, run the following commands:
-
Remove the failed etcd replica node from the etcd cluster.
First, list all the members of the etcd cluster:
etcdctl member list -w table
The output shows all the member IDs. Determine the member ID of the failed replica.
Next, remove the failed replica:
export ETCDCTL_CACERT=/etcd.local.config/certificates/etcdCA.crt
export ETCDCTL_CERT=/etcd.local.config/certificates/etcd.crt
export ETCDCTL_KEY=/etcd.local.config/certificates/etcd.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
etcdctl member remove MEMBER_ID
Where MEMBER_ID is the hex member ID of the failed etcd replica pod.
-
Add a new member with the same name and peer URL as the failed replica node.
etcdctl member add MEMBER_NAME --peer-urls=https://MEMBER_NAME.kube-etcd:2380
Where MEMBER_NAME is the identifier of the failed kube-etcd replica node. For example, kube-etcd-1 or kube-etcd-2.
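The member ID lookup and the peer URL used in the two etcdctl steps above can be scripted. The following is a minimal sketch, not part of the official procedure: `member_id_for` and `peer_url_for` are hypothetical helpers, the first assumes the standard column layout of `etcdctl member list -w table` (ID, STATUS, NAME, ...), and both assume that awk is available wherever you run them.

```shell
#!/bin/sh
# Hypothetical helper: print the ID of the member with the given name.
# Reads `etcdctl member list -w table` output on stdin.
member_id_for() {
  awk -F'[|]' -v name="$1" '
    { gsub(/ /, "", $2); gsub(/ /, "", $4) }   # trim padding in ID and NAME cells
    $4 == name { print $2 }'
}

# Hypothetical helper: build the peer URL in the form used on this page.
peer_url_for() {
  printf 'https://%s.kube-etcd:2380\n' "$1"
}

# Example usage from the shell opened in the working kube-etcd pod:
#   MEMBER_ID=$(etcdctl member list -w table | member_id_for kube-etcd-1)
#   etcdctl member remove "$MEMBER_ID"
#   etcdctl member add kube-etcd-1 --peer-urls="$(peer_url_for kube-etcd-1)"
```

If `member_id_for` prints nothing, the name did not match any row, and it is safer to read the table by hand than to run `member remove` with an empty ID.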
-
Follow steps 1-3 of Deploying the utility Pods to create a utility Pod in the admin cluster. This Pod is used to access the PersistentVolume (PV) of the failed etcd member in the user cluster.
-
Clean up the etcd data directory from within the utility Pod.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it -n USER_CLUSTER_NAME etcd-utility-MEMBER_NUMBER -- /bin/bash -c 'rm -rf /var/lib/etcd/*'
-
Delete the utility Pod.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG delete pod -n USER_CLUSTER_NAME etcd-utility-MEMBER_NUMBER
-
Uncordon the failed node.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG uncordon NODE_NAME
-
Open the kube-etcd StatefulSet in your text editor.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME edit statefulset kube-etcd
Change the value of the --initial-cluster-state flag back to new.
containers:
  - name: kube-etcd
    ...
    args:
      - --initial-cluster-state=new
      ...
-
Restore the etcd PDB that you backed up in step 1 and deleted in step 2.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG apply -f PATH_TO_PDB_FILE
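After the PDB is restored, you might want to confirm that the replacement member joined the etcd cluster. The following is a sketch, not part of the official procedure: `count_started_members` is a hypothetical helper that reads `etcdctl member list -w table` output on standard input and assumes the STATUS column reports `started` for members that have joined.

```shell
#!/bin/sh
# Hypothetical helper: count members whose STATUS column is "started".
# Reads `etcdctl member list -w table` output on stdin.
count_started_members() {
  awk -F'[|]' '
    { gsub(/ /, "", $3) }            # trim padding in the STATUS cell
    $3 == "started" { n++ }
    END { print n + 0 }'
}

# Example usage from a shell in a working kube-etcd pod, with the
# ETCDCTL_* variables exported as in the earlier step:
#   etcdctl member list -w table | count_started_members
```

For a three-replica HA user cluster, a count of 3 indicates that every member, including the replaced one, has started.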

