Where `KUBE_ETCD_POD` is the name of the working kube-etcd pod. For example, `kube-etcd-0`.

From this new shell, run the following commands:
Remove the failed etcd replica node from the etcd cluster.

```
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key member remove MEMBER_ID
```

Where `MEMBER_ID` is the ID of the failed etcd replica node. To get the ID, run the following command:
```
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key member list -w fields
```

The preceding command displays all the members of the etcd cluster. The output is similar to the following:
    "ClusterID" : 6963206042588294154
    "MemberID" : 4645269864592341793
    "Revision" : 0
    "RaftTerm" : 15
    "ID" : 2279696924967222455
    "Name" : "kube-etcd-2"
    "PeerURL" : "https://kube-etcd-2.kube-etcd:2380"
    "ClientURL" : "https://kube-etcd-2.kube-etcd:2379"
    "IsLearner" : false

    "ID" : 3728561467092418843
    "Name" : "kube-etcd-1"
    "PeerURL" : "https://kube-etcd-1.kube-etcd:2380"
    "ClientURL" : "https://kube-etcd-1.kube-etcd:2379"
    "IsLearner" : false

    "ID" : 4645269864592341793
    "Name" : "kube-etcd-0"
    "PeerURL" : "https://kube-etcd-0.kube-etcd:2380"
    "ClientURL" : "https://kube-etcd-0.kube-etcd:2379"
    "IsLearner" : false
The `MemberID` in the preceding output is the member ID of the working kube-etcd Pod. Next, get the `ID` of the failed etcd replica node. In the preceding example, `kube-etcd-0` has an `ID` of `4645269864592341793`, `kube-etcd-1` has an `ID` of `3728561467092418843`, and `kube-etcd-2` has an `ID` of `2279696924967222455`.
After you have the member `ID`, convert it from decimal to hexadecimal, because the `member remove` command accepts a hex member `ID`, while `member list` returns a decimal one. You can use `printf` to do the conversion. In this example, for `kube-etcd-2` the command is:

```
printf '%x\n' 2279696924967222455
```

The output of the preceding command is the `MEMBER_ID` you need to use for the `member remove` command.
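The lookup and conversion above can be combined into one step. The sketch below is a hypothetical helper, not part of the product tooling: it reads `member list -w fields` output on stdin, finds the decimal `ID` for a given member name, and prints it in hex. The decimal extraction is done in `awk` as a string (no arithmetic) and the conversion in the shell's `printf`, which handles 64-bit integers; the demo uses the `kube-etcd-2` entry from the example listing.

```shell
# Hypothetical helper: print the hex member ID for a member name, given
# `etcdctl member list -w fields` output on stdin.
member_hex_id() {
  # awk passes the decimal ID through as a string, so the 64-bit value is
  # not rounded; the shell printf then converts it to hex.
  dec=$(awk -v want="\"$1\"" '
    $1 == "\"ID\""                 { id = $3 }
    $1 == "\"Name\"" && $3 == want { print id; exit }
  ')
  printf '%x\n' "$dec"
}

# Demo with the kube-etcd-2 entry from the example output:
member_hex_id kube-etcd-2 <<'EOF'
"ID" : 2279696924967222455
"Name" : "kube-etcd-2"
EOF
# prints 1fa31c466038bcb7
```

The same function works unchanged when piped the full three-member listing, because it stops at the first matching `Name` line.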
Add a new member with the same name and peer URL as the failed replica node.

```
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key member add MEMBER_NAME --peer-urls=https://MEMBER_NAME.kube-etcd:2380
```

Where `MEMBER_NAME` is the identifier of the failed kube-etcd replica node. For example, `kube-etcd-1` or `kube-etcd-2`.
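Note that the peer URL is derived mechanically from the member name: each replica is addressed through the `kube-etcd` headless service on peer port 2380. A hypothetical convenience wrapper (not part of the product tooling) makes the pattern explicit:

```shell
# Hypothetical wrapper: build the etcdctl arguments for re-adding a member,
# following the NAME.kube-etcd:2380 peer-URL naming used above.
member_add_args() {
  echo "member add $1 --peer-urls=https://$1.kube-etcd:2380"
}

member_add_args kube-etcd-1
# prints: member add kube-etcd-1 --peer-urls=https://kube-etcd-1.kube-etcd:2380
```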
Follow steps 1-3 of Deploying the utility Pods to create a utility Pod in the admin cluster. This Pod is used to access the PersistentVolume (PV) of the failed etcd member in the user cluster.
Clean up the etcd data directory from within the utility Pod.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[[["\u003cp\u003eThis guide provides steps to replace a failed etcd replica in a high availability user cluster for Google Distributed Cloud, ensuring the admin cluster and the other two etcd members are working correctly before starting.\u003c/p\u003e\n"],["\u003cp\u003eThe process involves backing up and then deleting the etcd PodDisruptionBudget (PDB), editing the kube-etcd StatefulSet to set the \u003ccode\u003e--initial-cluster-state\u003c/code\u003e flag to \u003ccode\u003eexisting\u003c/code\u003e, and draining the failed etcd replica node.\u003c/p\u003e\n"],["\u003cp\u003eA new shell in a working kube-etcd pod is used to remove the failed replica from the etcd cluster and add a new member with the same name and peer URL.\u003c/p\u003e\n"],["\u003cp\u003eThe failed etcd replica's data directory is cleaned, the failed node is uncordoned, the kube-etcd StatefulSet is updated to set \u003ccode\u003e--initial-cluster-state\u003c/code\u003e to \u003ccode\u003enew\u003c/code\u003e, and the etcd PDB is restored.\u003c/p\u003e\n"],["\u003cp\u003eThe utility Pod is necessary to clean up the etcd data directory from the failed etcd member, and then is deleted after the task is completed.\u003c/p\u003e\n"]]],[],null,["# Replacing a failed etcd replica\n\nThis page describes how to replace a failed etcd replica in a high availability\n(HA) user cluster for Google Distributed Cloud.\n\nBefore you begin\n----------------\n\n- Make sure the admin cluster is working correctly.\n\n- Make sure the other two etcd members in the user cluster 
are working\n correctly. If more than one etcd member has failed, see [Recovery from etcd\n data corruption or loss](/anthos/clusters/docs/on-prem/1.6/concepts/high-availability-disaster-recovery#recovery_from_etcd_data_corruption_or_loss).\n\nReplacing a failed etcd replica\n-------------------------------\n\n1. Back up a copy of the etcd PodDisruptionBudget (PDB) so you can restore it\n later.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME get pdb kube-etcd-pdb -o yaml \u003e /path/to/etcdpdb.yaml\n ```\n\n Where:\n - \u003cvar translate=\"no\"\u003eADMIN_CLUSTER_KUBECONFIG\u003c/var\u003e is the path to the\n kubeconfig file for the admin cluster.\n\n - \u003cvar translate=\"no\"\u003eUSER_CLUSTER_NAME\u003c/var\u003e is the name of the user cluster\n that contains the failed etcd replica.\n\n2. Delete the etcd PodDisruptionBudget (PDB).\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME delete pdb kube-etcd-pdb\n ```\n3. Run the following command to open the kube-etcd [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) in your text editor:\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME edit statefulset kube-etcd\n ```\n\n Change the value of the `--initial-cluster-state` flag to `existing`. \n\n ```\n containers:\n - name: kube-etcd\n ...\n args:\n - --initial-cluster-state=existing\n ...\n \n ```\n4. Drain the failed etcd replica node.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG drain NODE_NAME --ignore-daemonsets --delete-local-data\n ```\n\n Where \u003cvar translate=\"no\"\u003eNODE_NAME\u003c/var\u003e is the name of the failed etcd replica node.\n5. 
Create a new shell in the container of one of the working kube-etcd pods.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it \\\n KUBE_ETCD_POD --container kube-etcd --namespace USER_CLUSTER_NAME \\\n -- bin/sh\n ```\n\n Where \u003cvar translate=\"no\"\u003eKUBE_ETCD_POD\u003c/var\u003e is the name of the working\n kube-etcd pod. For example, `kube-etcd-0`.\n\n From this new shell, run the following commands:\n 1. Remove the failed etcd replica node from the etcd cluster.\n\n ```\n ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key --endpoints=https://127.0.0.1:2379 member remove MEMBER_ID\n ```\n\n Where \u003cvar translate=\"no\"\u003eMEMBER_ID\u003c/var\u003e is the ID of the failed etcd replica\n node. To get the ID, run the following command: \n\n ```\n ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key member list -w fields\n ```\n\n The previous command displays all the members of the etcd cluster. 
The output is similar to the following: \n\n sh-5.0# ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcd.key member list -w fields\n\n \"ClusterID\" : 6963206042588294154\n \"MemberID\" : 4645269864592341793\n \"Revision\" : 0\n \"RaftTerm\" : 15\n \"ID\" : 2279696924967222455\n \"Name\" : \"kube-etcd-2\"\n \"PeerURL\" : \"https://kube-etcd-2.kube-etcd:2380\"\n \"ClientURL\" : \"https://kube-etcd-2.kube-etcd:2379\"\n \"IsLearner\" : false\n\n \"ID\" : 3728561467092418843\n \"Name\" : \"kube-etcd-1\"\n \"PeerURL\" : \"https://kube-etcd-1.kube-etcd:2380\"\n \"ClientURL\" : \"https://kube-etcd-1.kube-etcd:2379\"\n \"IsLearner\" : false\n\n \"ID\" : 4645269864592341793\n \"Name\" : \"kube-etcd-0\"\n \"PeerURL\" : \"https://kube-etcd-0.kube-etcd:2380\"\n \"ClientURL\" : \"https://kube-etcd-0.kube-etcd:2379\"\n \"IsLearner\" : false\n\n sh-5.0#\n\n The `MemberID` in the preceding output is the member ID of the working kube-etcd Pod. Next, get the `ID` of the failed etcd replica node. In the preceding example `kube-etcd-0` has an `ID` of `4645269864592341793`, `kube-etcd-1` has an `ID` of `3728561467092418843` and `kube-etcd-2` has a `ID` of `2279696924967222455`.\n\n After you have the member `ID`, convert it from decimal to hex, because the `member remove` command accepts a hex member `ID`, while `member list` returns a decimal. You can use `printf` to do the conversion. In this example for `kube-etcd-2` it will be:\n\n ```\n printf '%x\\n' 2279696924967222455\n ```\n\n The output of the preceding command is the \u003cvar translate=\"no\"\u003eMEMBER_ID\u003c/var\u003e you need to use for the `member remove` command.\n 2. 
Add a new member with the same name and peer URL as the failed replica node.\n\n ```\n ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key member add MEMBER_NAME --peer-urls=https://MEMBER_NAME.kube-etcd:2380\n ```\n\n Where \u003cvar translate=\"no\"\u003eMEMBER_NAME\u003c/var\u003e is the identifier of the failed kube-etcd replica node.\n For example, `kube-etcd-1` or `kube-etcd2`.\n6. Follow steps 1-3 of [Deploying the utility Pods](/anthos/clusters/docs/on-prem/1.6/how-to/backing-up#deploy_utility_pods) to create a utility Pod in\n the admin cluster. This Pod is used to access the PersistentVolume (PV) of the\n failed etcd member in the user cluster.\n\n7. Clean up the etcd data directory from within the utility Pod.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it -n USER_CLUSTER_NAME etcd-utility-MEMBER_NUMBER -- bash -c 'rm -rf /var/lib/etcd/*'\n ```\n8. Delete the utility Pod.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG delete pod -n USER_CLUSTER_NAME etcd-utility-MEMBER_NUMBER\n ```\n9. Uncordon the failed node.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG uncordon NODE_NAME\n ```\n10. Open the kube-etcd StatefulSet in your text editor.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME edit statefulset kube-etcd\n ```\n\n Change the value of the `--initial-cluster-state` flag to `new`. \n\n ```\n containers:\n - name: kube-etcd\n ...\n args:\n - --initial-cluster-state=new\n ...\n \n ```\n11. Restore the etcd PDB which was deleted in step 1.\n\n ```\n kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG apply -f /path/to/etcdpdb.yaml\n ```"]]