Version 1.8. This version is no longer supported. For information about how to upgrade to version 1.9, see Upgrading Anthos on bare metal in the 1.9 documentation. For more information about supported and unsupported versions, see the Version history page in the latest documentation.

Force-removing broken nodes in Google Distributed Cloud

When a node is broken and needs to be removed from a cluster for repair or replacement, you can force its removal from the cluster.

Force-removing worker nodes

In Google Distributed Cloud, you can add an annotation to mark a node for force removal.

After you remove the node from its parent node pool, run the following command to annotate the corresponding failing machine with the baremetal.cluster.gke.io/force-remove annotation. The value of the annotation does not matter:

kubectl --kubeconfig ADMIN_KUBECONFIG -n CLUSTER_NAMESPACE \
  annotate machine 10.200.0.8 baremetal.cluster.gke.io/force-remove=true

Google Distributed Cloud removes the node successfully.
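
If you need to annotate more than one machine, the command can be generated rather than typed by hand. The following is a minimal sketch and not part of the official procedure: the function name force_remove_cmd is hypothetical, and it only prints the command so that you can review it before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): print the annotate command
# for one machine so it can be reviewed before it is run.
# Arguments: admin kubeconfig path, cluster namespace, machine name (node IP).
force_remove_cmd() {
  if [ "$#" -ne 3 ]; then
    echo "usage: force_remove_cmd ADMIN_KUBECONFIG CLUSTER_NAMESPACE MACHINE" >&2
    return 1
  fi
  printf 'kubectl --kubeconfig %s -n %s annotate machine %s baremetal.cluster.gke.io/force-remove=true\n' \
    "$1" "$2" "$3"
}

# Example (prints the command only):
#   force_remove_cmd ADMIN_KUBECONFIG CLUSTER_NAMESPACE 10.200.0.8
```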

Force-removing control plane nodes

Force-removing a control plane node is similar to performing a kubeadm reset on control plane nodes, and requires additional steps.

To force-remove a control plane node from the node pools, take the following actions against the cluster that contains the failing control plane node:

Remove the failing etcd member that runs on the failing node from the etcd cluster.
Update the ClusterStatus in the kubeadm-config config map to remove the corresponding apiEndpoint.

Removing a failing `etcd` member

To remove the failing control plane node, first run etcdctl on one of the remaining healthy etcd pods. For more general information on this operation, see this Kubernetes documentation.

In the following procedure, CLUSTER_KUBECONFIG is the path to the kubeconfig file of the cluster.

Look up the etcd pod with the following command:

kubectl --kubeconfig CLUSTER_KUBECONFIG get \
  pod -n kube-system -l component=etcd -o wide

The command returns a list of etcd pods similar to the following. For this example, assume that node 10.200.0.8 is inaccessible and unrecoverable:

NAME                READY   STATUS    RESTARTS   AGE     IP           NODE
etcd-357b68f4ecf0   1/1     Running   0          9m2s    10.200.0.6   357b68f4ecf0
etcd-7d7c21db88b3   1/1     Running   0          33m     10.200.0.7   7d7c21db88b3
etcd-b049141e0802   1/1     Running   0          8m22s   10.200.0.8   b049141e0802
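
Choosing a healthy pod from this listing can be scripted. The following is a minimal sketch under stated assumptions: the function name healthy_etcd_pods is hypothetical, and it assumes the column order shown in this example (NAME, READY, STATUS, RESTARTS, AGE, IP, NODE) with a single-field RESTARTS value. It reads the listing on standard input and prints the names of Running pods that are not on the failing IP.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pod ... -o wide` output on stdin and
# print the names of Running etcd pods whose IP differs from the failing IP.
# Assumes the column order NAME READY STATUS RESTARTS AGE IP NODE.
healthy_etcd_pods() {
  failing_ip="$1"
  awk -v bad="$failing_ip" '
    NR == 1 { next }                            # skip the header row
    $3 == "Running" && $6 != bad { print $1 }   # field 3 is STATUS, field 6 is IP
  '
}

# Example:
#   kubectl --kubeconfig CLUSTER_KUBECONFIG get pod -n kube-system \
#     -l component=etcd -o wide | healthy_etcd_pods 10.200.0.8
```

A pod can still report Running while its node is unreachable, which is why the sketch also filters on the IP of the failing node instead of relying on STATUS alone.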

Exec into one of the remaining healthy etcd pods:

kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it -n \
  kube-system etcd-357b68f4ecf0 -- /bin/sh

Look up the current members to find the ID of the failing member:

etcdctl --endpoints=https://10.200.0.6:2379,https://10.200.0.7:2379 --key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt  member list

This command returns, for example:

23da9c3f2594532a, started, 7d7c21db88b3, https://10.200.0.6:2380, https://10.200.0.6:2379, false
772c1a54956b7f51, started, 357b68f4ecf0, https://10.200.0.7:2380, https://10.200.0.7:2379, false
f64f66ad8d3e7960, started, b049141e0802, https://10.200.0.8:2380, https://10.200.0.8:2379, false
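
The member ID is the first field of the line whose peer URL contains the failing IP. As a minimal sketch, assuming the comma-separated format shown in this example (ID, status, name, peer URL, client URL, learner flag) and peer port 2380, the lookup could be scripted as follows. The function name etcd_member_id is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `etcdctl member list` output on stdin and print
# the ID of the member whose peer URL is https://FAILING_IP:2380.
# Assumes fields are separated by ", " as in the example output.
etcd_member_id() {
  failing_ip="$1"
  awk -F', ' -v peer="https://${failing_ip}:2380" '
    $4 == peer { print $1 }   # field 4 is the peer URL, field 1 is the member ID
  '
}

# Example, run inside the healthy etcd pod:
#   etcdctl --endpoints=... --key=... --cacert=... --cert=... member list \
#     | etcd_member_id 10.200.0.8
```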

Remove the failing member:

etcdctl --endpoints=https://10.200.0.6:2379,https://10.200.0.7:2379 --key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt \
 member remove f64f66ad8d3e7960

Updating `ClusterStatus` and removing the failing `apiEndpoint`

In the following procedure, CLUSTER_KUBECONFIG is the path to the kubeconfig file of the cluster.

Look up the ClusterStatus section inside the kubeadm-config config map:

kubectl --kubeconfig CLUSTER_KUBECONFIG describe configmap -n \
  kube-system kubeadm-config

The command returns results similar to those shown below:

...
ClusterStatus:
----
apiEndpoints:
  7d7c21db88b3:
    advertiseAddress: 10.200.0.6
    bindPort: 6444
  357b68f4ecf0:
    advertiseAddress: 10.200.0.7
    bindPort: 6444
  b049141e0802:
    advertiseAddress: 10.200.0.8
    bindPort: 6444
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterStatus
...

Edit the config map to remove the section that contains the failing IP (this example shows the results of removing 10.200.0.8 using the kubectl edit command):

kubectl --kubeconfig CLUSTER_KUBECONFIG edit configmap \
  -n kube-system kubeadm-config

After editing, the config map looks similar to the following:

...
ClusterStatus: |
  apiEndpoints:
    7d7c21db88b3:
      advertiseAddress: 10.200.0.6
      bindPort: 6444
    357b68f4ecf0:
      advertiseAddress: 10.200.0.7
      bindPort: 6444
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterStatus
...

When you save the edited config map, the failing node is removed from the cluster.
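
To confirm that the edit took effect, you can check that the failing IP no longer appears as an advertiseAddress. The following is a minimal sketch, not part of the official procedure: the function name endpoint_removed is hypothetical, and it assumes the ClusterStatus text is supplied on standard input with one `advertiseAddress: IP` entry per line.

```shell
#!/bin/sh
# Hypothetical helper: read ClusterStatus text on stdin and succeed (exit 0)
# only if no advertiseAddress entry equals the given IP.
endpoint_removed() {
  failing_ip="$1"
  awk -v ip="$failing_ip" '
    BEGIN { found = 0 }
    $1 == "advertiseAddress:" && $2 == ip { found = 1 }
    END { exit found }        # exit status 1 if the IP is still present
  '
}

# Example:
#   kubectl --kubeconfig CLUSTER_KUBECONFIG get configmap -n kube-system \
#     kubeadm-config -o jsonpath='{.data.ClusterStatus}' \
#     | endpoint_removed 10.200.0.8 && echo "apiEndpoint removed"
```

The sketch compares whole fields instead of using a substring match, so an address such as 10.200.0.80 is not mistaken for 10.200.0.8.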
