Known issues

This document describes known issues for version 1.8 of Google Distributed Cloud.

/var/log/audit/ filling up disk space

Category

OS

Identified Versions

1.8.0+, 1.9.0+, 1.10.0+, 1.11.0+, 1.12.0+, 1.13.0+

Symptoms

/var/log/audit/ is filled with audit logs. You can check the disk usage by running sudo du -h -d 1 /var/log/audit.

Cause

Since Anthos v1.8, the Ubuntu image is hardened with the CIS Level2 Benchmark. One of the compliance rules, 4.1.2.2 Ensure audit logs are not automatically deleted, enforces the auditd setting max_log_file_action = keep_logs. As a result, all audit logs are kept on disk and are never rotated out.

Workaround

Admin workstation

For the admin workstation, you can manually change the auditd settings to rotate the logs automatically, and then restart the auditd service:

 sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/g' /etc/audit/auditd.conf
sed -i 's/num_logs = .*/num_logs = 250/g' /etc/audit/auditd.conf
systemctl restart auditd 

With these settings, auditd rotates its logs automatically once it has generated more than 250 files (each 8 MB in size).
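As a rough check on what these rotation settings imply for disk usage, the following sketch multiplies num_logs by max_log_file as read from an auditd.conf file. The function name auditd_log_budget_mb is hypothetical, and the sketch assumes max_log_file is expressed in megabytes, as the 8M figure above suggests.

```shell
# Sketch only: print the approximate upper bound, in MB, of disk space
# that rotated audit logs can occupy (num_logs * max_log_file).
# Assumes "key = value" lines as in /etc/audit/auditd.conf.
auditd_log_budget_mb() {
  conf="${1:-/etc/audit/auditd.conf}"
  awk -F'=' '
    # Strip whitespace around the key and the value.
    { gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $2) }
    $1 == "num_logs"     { n = $2 }
    $1 == "max_log_file" { s = $2 }
    # Fail if either setting is missing from the file.
    END { if (n == "" || s == "") exit 1; print n * s }
  ' "$conf"
}
```

For a file that contains num_logs = 250 and max_log_file = 8, the function prints 2000, that is, about 2 GB of audit logs at most.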

Cluster nodes

For cluster nodes, apply the following DaemonSet to your cluster to prevent potential issues:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: change-auditd-log-action
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: change-auditd-log-action
  template:
    metadata:
      labels:
        app: change-auditd-log-action
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: update-audit-rule
        image: ubuntu
        command: ["chroot", "/host", "bash", "-c"]
        args:
        - |
          while true; do
            if $(grep -q "max_log_file_action = keep_logs" /etc/audit/auditd.conf); then
              echo "updating auditd max_log_file_action to rotate with a max of 250 files"
              sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/g' /etc/audit/auditd.conf
              sed -i 's/num_logs = .*/num_logs = 250/g' /etc/audit/auditd.conf
              echo "restarting auditd"
              systemctl restart auditd
            else
              echo "auditd setting is expected, skip update"
            fi
            sleep 600
          done
        volumeMounts:
        - name: host
          mountPath: /host
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
 

Note that making this auditd config change violates CIS Level2 rule 4.1.2.2 Ensure audit logs are not automatically deleted.

User cluster upgrade/update fails due to 'failed to register user cluster'

Category

Upgrade, Update

Identified Versions

1.7.0+, 1.8.0+

Symptoms

If a gkectl command times out in any of the following cases, running gkectl diagnose cluster afterward reports a "failed to register user cluster" error:

  1. Upgrading user clusters with GKE connect enabled to 1.8 versions.
  2. Running gkectl update cluster on 1.8 user clusters with GKE connect enabled.
  3. Running gkectl update cluster to enable GKE connect on 1.8 user clusters.
$ gkectl diagnose cluster --kubeconfig kubeconfig --cluster-name foo-cluster
…
Unhealthy Resources:
OnPremUserCluster foo-cluster: not ready: ready condition is not true: ClusterCreateOrUpdate: failed to register user cluster "foo-cluster": failed to register cluster: ...
...

Note that the functionality of GKE connect should not be affected. In other words, if GKE connect was functional before the command, it should remain functional.

Cause

The Connect Agent version 20210514-00-00 used in 1.8 versions is out of support.

Workaround

Please contact Google support to mitigate the issue.

systemd-timesyncd not running after reboot on Ubuntu Node

Category

OS

Identified Versions

1.7.1-1.7.5, 1.8.0-1.8.4, 1.9.0+

Symptoms

systemctl status systemd-timesyncd should show that the service is dead:

   
  
systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendor preset: enabled)
Active: inactive (dead)
 

This can cause the node clock to drift out of sync.
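To check this symptom from a script rather than by eye, the following sketch extracts the state from the Active: line of systemctl status output read from standard input. The function name unit_active_state is hypothetical; it only parses text in the format shown above.

```shell
# Sketch only: read `systemctl status UNIT` output on stdin and print
# the word that follows "Active:" (for example "active" or "inactive").
# Exits non-zero if no "Active:" line is present.
unit_active_state() {
  awk '
    $1 == "Active:" { print $2; found = 1; exit }
    END { if (!found) exit 1 }
  '
}
```

On a node, this could be used as systemctl status systemd-timesyncd | unit_active_state. The built-in systemctl is-active systemd-timesyncd reports the same state directly and is usually the simpler choice.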

Cause

chrony was incorrectly installed on the Ubuntu OS image, and it conflicts with systemd-timesyncd: every time the Ubuntu VM is rebooted, systemd-timesyncd becomes inactive and chrony becomes active. However, systemd-timesyncd should be the default NTP client for the VM.

Workaround

Option 1: Manually run systemctl restart systemd-timesyncd every time the VM is rebooted.

Option 2: Deploy the following DaemonSet so that systemd-timesyncd is always restarted if it's dead.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ensure-systemd-timesyncd
spec:
  selector:
    matchLabels:
      name: ensure-systemd-timesyncd
  template:
    metadata:
      labels:
        name: ensure-systemd-timesyncd
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: ensure-systemd-timesyncd
        # Use your preferred image.
        image: ubuntu
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            echo $(date -u)
            echo "Checking systemd-timesyncd status..."
            chroot /host systemctl status systemd-timesyncd
            if (( $? != 0 )) ; then
              echo "Restarting systemd-timesyncd..."
              chroot /host systemctl start systemd-timesyncd
            else
              echo "systemd-timesyncd is running."
            fi;
            sleep 60
          done
        volumeMounts:
        - name: host
          mountPath: /host
        resources:
          requests:
            memory: "10Mi"
            cpu: "10m"
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /

ClientConfig custom resource

gkectl update reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.

gkectl check-config validation fails: can't find F5 BIG-IP partitions

Symptoms

Validation fails because F5 BIG-IP partitions can't be found, even though they exist.

Potential causes

An issue with the F5 BIG-IP API can cause validation to fail.

Resolution

Try running gkectl check-config again.

Disruption for workloads with PodDisruptionBudgets

Upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs). See https://kubernetes.io/docs/concepts/workloads/pods/disruptions/

Nodes fail to complete their upgrade process

If you have PodDisruptionBudget objects configured that are unable to allow any additional disruptions, node upgrades might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler to allow the node to drain while still respecting the PodDisruptionBudget configuration.

To see all PodDisruptionBudget objects that do not allow any disruptions:

kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'

User cluster installation failed because of cert-manager/ca-injector's leader election issue in Anthos 1.8.2 and 1.8.3

You might see an installation failure caused by cert-manager-cainjector crashlooping when the apiserver or etcd is slow. The following command:

kubectl logs --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployments/cert-manager-cainjector

might produce logs like the following:

I0923 16:19:27.911174 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition
E0923 16:19:27.911110 1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded
I0923 16:19:27.911593 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition
E0923 16:19:27.911629 1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"
 

Run the following commands to mitigate the problem.

First, scale down the monitoring-operator so it will not revert the changes to the cert-manager-cainjector Deployment.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0

Second, patch the cert-manager-cainjector Deployment to disable leader election. This is safe because only one replica is running, and leader election is not required for a single replica.

# Ensure that we run only 1 cainjector replica, even during rolling updates.
kubectl patch --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployment cert-manager-cainjector --type=strategic --patch '
spec:
  strategy:
    rollingUpdate:
      maxSurge: 0
'
# Add a command line flag for cainjector: `--leader-elect=false`
kubectl patch --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployment cert-manager-cainjector --type=json --patch '[
    {
        "op": "add",
        "path": "/spec/template/spec/containers/0/args/-",
        "value": "--leader-elect=false"
    }
]'

Keep monitoring-operator replicas at 0 as a mitigation until the installation is finished. Otherwise it will revert the change.

After the installation is finished and the cluster is up and running, turn on the monitoring-operator for day-2 operations:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=1

Note that after upgrading to 1.8.4 and above (or 1.9.1 and above, if upgrading to 1.9), these steps will no longer be necessary since Anthos will disable leader-election for cainjector. Until then, if you face this issue during each upgrade, it will be necessary to perform the same mitigation steps again.

Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.

Admin cluster certificate renewal process

  1. Make sure that OpenSSL is installed on the admin workstation before you begin.

  2. Set the KUBECONFIG variable:

    KUBECONFIG=ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG
    

    Replace ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG with the absolute path to the admin cluster kubeconfig file.

  3. Get the IP address and SSH keys for the admin master node:

    kubectl --kubeconfig "${KUBECONFIG}" get secrets -n kube-system sshkeys \
    -o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
    ~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key
    
    export MASTER_NODE_IP=$(kubectl --kubeconfig "${KUBECONFIG}" get nodes -o \
    jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
    --selector='node-role.kubernetes.io/master')
  4. Check if the certificates are expired:

    ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
    "sudo kubeadm alpha certs check-expiration"

    If the certificates are expired, you must renew them before upgrading the admin cluster.

  5. Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.

    • Back up the admin cluster kubeconfig file:

      ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
      "sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
      vi "${KUBECONFIG}"
  • Replace client-certificate-data and client-key-data in kubeconfig with client-certificate-data and client-key-data in the new_admin.conf file that you created.

  • Back up old certificates:

    This is an optional, but recommended, step.

    # ssh into admin master if you didn't in the previous step
    ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
    
    # on admin master
    sudo tar -czvf backup.tar.gz /etc/kubernetes
    logout
    
    # on worker node
    sudo scp -i ~/.ssh/admin-cluster.key \
    ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .
  • Renew the certificates with kubeadm:

    # ssh into admin master
     ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
     # on admin master
     sudo kubeadm alpha certs renew all
  • Restart static Pods running on the admin master node:

    # on admin master
      cd /etc/kubernetes
      sudo mkdir tempdir
      sudo mv manifests/*.yaml tempdir/
      sleep 5
      echo "remove pods"
      # ensure kubelet detect those change remove those pods
      # wait until the result of this command is empty
      sudo docker ps | grep kube-apiserver
    
      # ensure kubelet start those pods again
      echo "start pods again"
      sudo mv tempdir/*.yaml manifests/
      sleep 30
      # ensure kubelet start those pods again
      # should show some results
      sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd
    
      # clean up
      sudo rm -rf tempdir
    
      logout
  • Renew the certificates of admin cluster worker nodes

    Check node certificates expiration date

    kubectl get nodes -o wide
        # find the oldest node, fill NODE_IP with the internal ip of that node
        ssh -i ~/.ssh/admin-cluster.key ubuntu@"${NODE_IP}"
        openssl x509 -enddate -noout -in /var/lib/kubelet/pki/kubelet-client-current.pem
        logout

    If the certificate is about to expire, renew node certificates by manual node repair .

  • You must validate the renewed certificates, and validate the certificate of kube-apiserver.

    • Check certificates expiration:

      ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
      "sudo kubeadm alpha certs check-expiration"
    • Check certificate of kube-apiserver:

      # Get the IP address of kube-apiserver
      cat $KUBECONFIG | grep server
      # Get the current kube-apiserver certificate
      openssl s_client -showcerts -connect : \
      | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
      > current-kube-apiserver.crt
      # check expiration date of this cert
      openssl x509 -in current-kube-apiserver.crt -noout -enddate
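The openssl x509 -enddate commands in this procedure print a line such as notAfter=Jan  1 00:00:00 2030 GMT. As a sketch of how that line could be turned into a number of remaining days, the following function does the date arithmetic. The name days_until_expiry is hypothetical, and the sketch assumes GNU date (for the -d option), which is what Ubuntu provides.

```shell
# Sketch only: convert an `openssl x509 -enddate -noout` line into the
# number of whole days left before the certificate expires.
#   $1: the "notAfter=..." line printed by openssl
#   $2: optional current time as a Unix epoch (defaults to now)
# Assumes GNU date. The result is negative once the certificate has
# been expired for at least a day.
days_until_expiry() {
  enddate_line="$1"
  now="${2:-$(date -u +%s)}"
  # Drop the "notAfter=" prefix and parse the remaining timestamp.
  expiry=$(date -u -d "${enddate_line#notAfter=}" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}
```

For example, on a node this could be called as days_until_expiry "$(openssl x509 -enddate -noout -in /var/lib/kubelet/pki/kubelet-client-current.pem)". What counts as "about to expire" is left to you; the function only reports the number of days.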
    /etc/cron.daily/aide script uses up all space in /run, causing a crashloop in Pods

    Starting from Google Distributed Cloud 1.7.2, the Ubuntu OS images are hardened with the CIS L1 Server Benchmark. As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled, which satisfies the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked".

    The script uses /run/aide as a temporary directory to save its cron logs, and over time it could use up all the space in /run. See /etc/cron.daily/aide script uses all space in /run for a workaround.

    If you see one or more Pods crashlooping on a node, run df -h /run on the node. If the command output shows 100% space usage, then you are likely experiencing this issue.

    This issue is fixed in version 1.8.1. For the 1.7.2 and 1.8.0 versions, you can resolve this issue manually with either of the following two workarounds:

    1. Periodically remove the log files at /run/aide/cron.daily.old* (recommended).
    2. Follow the steps mentioned in /etc/cron.daily/aide script uses all space in /run. (Note: this workaround could potentially affect the node compliance state.)
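The first workaround can be sketched as a small cleanup function. The name prune_aide_cron_logs is hypothetical, and the directory argument exists only so that the function can be tried against a scratch directory; how often to run it is up to you.

```shell
# Sketch only: delete rotated aide cron logs (cron.daily.old*) from the
# given directory, which defaults to /run/aide. Other files are kept.
prune_aide_cron_logs() {
  dir="${1:-/run/aide}"
  # The glob is left unquoted so the shell expands it; with -f, rm does
  # not complain when nothing matches.
  rm -f -- "$dir"/cron.daily.old*
}
```

On a node the function needs to run as root so that it can read and modify /run/aide. Afterward, df -h /run should show the reclaimed space.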

    Upgrading Seesaw load balancer with version 1.8.0

    If you use the gkectl upgrade loadbalancer command to update some parameters of the Seesaw load balancer in version 1.8.0, the command does not work in either DHCP or IPAM mode. If your setup includes this configuration, do not upgrade to version 1.8.0; upgrade to version 1.8.1 or later instead.

    You might experience this issue if you are using one of the following versions of Google Distributed Cloud.

    • 1.7.2-gke.2
    • 1.7.3-gke.2
    • 1.8.0-gke.21
    • 1.8.0-gke.24
    • 1.8.0-gke.25
    • 1.8.1-gke.7
    • 1.8.2-gke.8

    You might get the following error when you attempt to SSH into your Anthos VMs, including the admin workstation, cluster nodes, and Seesaw nodes:

    WARNING: Your password has expired.

    This error occurs because the ubuntu user password on the VMs has expired. You must manually reset the user password's expiration time to a large value before logging into the VMs.

    Prevention of password expiry error

    If you are running the affected versions listed above, and the user password hasn't expired yet, you should extend the expiration time before seeing the SSH error.

    Run the following command on each Anthos VM:

    sudo chage -M 99999 ubuntu
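To confirm that the change took effect, you can inspect the output of chage -l. The following sketch extracts the maximum password age from that output; the function name max_password_age_days is hypothetical, and it assumes the English field label that chage prints by default.

```shell
# Sketch only: read `chage -l USER` output on stdin and print the value
# of "Maximum number of days between password change".
# Exits non-zero if the field is not present.
max_password_age_days() {
  awk -F':' '
    /^Maximum number of days between password change/ {
      gsub(/[ \t]/, "", $2); print $2; found = 1; exit
    }
    END { if (!found) exit 1 }
  '
}
```

On a VM, sudo chage -l ubuntu | max_password_age_days should print 99999 after the command above has been run.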

    Mitigation of password expiry error

    If the user password has already expired and you can't log in to the VMs to extend the expiration time, perform the following mitigation steps for each component.

    Admin workstation

    Use a temporary VM to perform the following steps. You can create an admin workstation using the 1.7.1-gke.4 version to use as the temporary VM.

    1. Ensure the temporary VM and the admin workstation are in a power off state.

    2. Attach the boot disk of the admin workstation to the temporary VM. The boot disk is the one with the label "Hard disk 1".

    3. Mount the boot disk inside the VM by running these commands. Substitute your own boot disk identifier for /dev/sdc1.

      sudo mkdir -p /mnt/boot-disk
      sudo mount /dev/sdc1 /mnt/boot-disk
    4. Set the ubuntu user expiration date to a large value such as 99999 days.

      sudo chroot /mnt/boot-disk chage -M 99999 ubuntu
    5. Shut down the temporary VM.

    6. Power on the admin workstation. You should now be able to SSH as usual.

    7. As cleanup, delete the temporary VM.

    Admin cluster control plane VM

    Follow the instructions to recreate the admin cluster control plane VM.

    Admin cluster addon VMs

    Run the following command from the admin workstation to recreate the VM:

    kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG patch machinedeployment gke-admin-node --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"

    After you run this command, wait for the admin cluster addon VMs to finish recreation and to be ready before you continue with the next steps.

    User cluster control plane VMs

    Run the following command from the admin workstation to recreate the VMs:

    usermaster=`kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG get machinedeployments -l set=user-master -o name` && kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG patch $usermaster --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"

    After you run this command, wait for the user cluster control plane VMs to finish recreation and to be ready before you continue with the next steps.

    User cluster worker VMs

    Run the following command from the admin workstation to recreate the VMs.

    for md in `kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG get machinedeployments -l set=node -o name`; do kubectl patch --kubeconfig=USER_CLUSTER_KUBECONFIG $md --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"; done

    Seesaw VMs

    Run the following commands from the admin workstation to recreate the Seesaw VMs. There will be some downtime. If HA is enabled for the load balancer, the maximum down time is two seconds.

    gkectl upgrade loadbalancer --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG --admin-cluster --no-diff
    gkectl upgrade loadbalancer --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG --no-diff

    Restarting or upgrading vCenter for versions lower than 7.0U2

    If vCenter, for versions lower than 7.0U2, is restarted (after an upgrade or otherwise), the network name in the VM information from vCenter is incorrect, which puts the machine in an Unavailable state. This eventually leads to the nodes being auto-repaired, which creates new nodes.

    Related govmomi bug: https://github.com/vmware/govmomi/issues/2552

    This workaround is provided by VMware support:

    1. The issue is fixed in vCenter versions 7.0U2 and above.
    
    2. For lower versions:
    Right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.

    gkectl create-config admin and gkectl create-config cluster panic

    In versions 1.8.0-1.8.3, the gkectl create-config admin/cluster command panics with the message panic: invalid version: "latest".

    As a workaround, use gkectl create-config admin/cluster --gke-on-prem-version=DESIRED_CLUSTER_VERSION. Replace DESIRED_CLUSTER_VERSION with the desired version, such as 1.8.2-gke.8.

    Creating/upgrading admin cluster timeout

    This issue affects 1.8.0-1.8.3.

    Your admin cluster creation or admin cluster upgrade might time out with the following error:

     Error getting kubeconfig: error running remote command 'sudo cat /etc/kubernetes/admin.conf': error: Process exited with status 1, stderr: 'cat: /etc/kubernetes/admin.conf: No such file or directory 
    

    In addition, the log at nodes/ADMIN_MASTER_NODE/files/var/log/startup.log in the external cluster snapshot ends with this message:

    [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
     
    

    This error happens when the network is slow between the admin control-plane VM and the container registry. Make sure to inspect your network or proxy setup to reduce the latency and increase the bandwidth.

    SSH connection closed by remote host

    For Google Distributed Cloud version 1.7.2 and above, the Ubuntu OS images are hardened with the CIS L1 Server Benchmark. To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:

    ClientAliveInterval 300
    ClientAliveCountMax 0

    The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the SSH connection might be disconnected even if your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:

    Connection to [IP] closed by remote host.
    Connection to [IP] closed.

    As a workaround, you can either:

    • Use nohup to prevent your command being terminated on SSH disconnection,

      nohup gkectl upgrade admin --config admin-cluster.yaml --kubeconfig kubeconfig
    • Update the sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value less than 3.

      sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' /etc/ssh/sshd_config
      sudo systemctl restart sshd

      Make sure you reconnect your ssh session.
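Before or after applying the second workaround, you can check which value is configured. The following sketch reads ClientAliveCountMax from an sshd_config file; the function name client_alive_count_max is hypothetical. It returns the first uncommented occurrence, on the assumption that sshd uses the first value it obtains for a keyword, and it does not follow Include or Match directives.

```shell
# Sketch only: print the first uncommented ClientAliveCountMax value
# found in the given sshd_config (default /etc/ssh/sshd_config).
# Keywords are matched case-insensitively; commented lines are skipped
# because their first field starts with "#".
client_alive_count_max() {
  conf="${1:-/etc/ssh/sshd_config}"
  awk '
    tolower($1) == "clientalivecountmax" { print $2; found = 1; exit }
    END { if (!found) exit 1 }
  ' "$conf"
}
```

A printed value of 0 indicates the setting that triggers the disconnects described above. For the effective configuration including defaults, sudo sshd -T is an alternative.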

    Conflict with cert-manager when upgrading to version 1.8.2 or above

    If you have your own cert-manager installation with Google Distributed Cloud, you might experience a failure when you attempt to upgrade to versions 1.8.2 or above. This is a result of a conflict between your version of cert-manager, which is likely installed in the cert-manager namespace, and the monitoring-operator version.

    If you try to install another copy of cert-manager after upgrading to Google Distributed Cloud version 1.8.2 or above, the installation might fail due to a conflict with the existing one managed by monitoring-operator.

    The metrics-ca cluster issuer, which control-plane and observability components rely on for creation and rotation of cert secrets, requires a metrics-ca cert secret to be stored in the cluster resource namespace. This namespace is kube-system for the monitoring-operator installation, and likely to be cert-manager for your installation.

    If you have experienced an installation failure, follow these steps to upgrade successfully to version 1.8.2 or later:

    Avoid conflicts during upgrade

    1. Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up.

    2. Perform the upgrade.

    3. Follow the instructions in the next sections to restore your own cert-manager.

    Restore your own cert-manager in user clusters

    • Scale the monitoring-operator deployment to 0.

      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0
    • Scale the cert-manager deployments managed by monitoring-operator to 0.

      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0
    • Reinstall cert-manager .

    • Restore your customized resources if you have them.

    • Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from kube-system to the cluster resource namespace of your installed cert-manager. This namespace is cert-manager if you use the upstream default cert-manager installation, but it depends on your installation.

      relevant_fields='
      {
      apiVersion: .apiVersion,
      kind: .kind,
      metadata: {
      name: .metadata.name,
      namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
      },
      spec: .spec
      }
      '
      f1=$(mktemp)
      f2=$(mktemp)
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get issuer -n kube-system metrics-pki.cluster.local -o json | jq "${relevant_fields}" > $f1
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get certificate -n kube-system metrics-ca -o json | jq "${relevant_fields}" > $f2
      kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
      kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2

    Restore your own cert-manager in admin clusters

    In general, you shouldn't need to reinstall cert-manager in admin clusters because admin clusters only run Google Distributed Cloud control plane workloads. In the rare cases that you also need to install your own cert-manager in admin clusters, follow these instructions to avoid conflicts. Note that if you are an Apigee customer and you only need cert-manager for Apigee, you do not need to run the admin cluster commands.

    • Scale the monitoring-operator deployment to 0.

      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0
    • Scale the cert-manager deployments managed by monitoring-operator to 0.

      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0
    • Reinstall your own cert-manager. Restore your customized resources if you have them.

    • Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from kube-system to the cluster resource namespace of your installed cert-manager. This namespace is cert-manager if you use the upstream default cert-manager installation, but it depends on your installation.

      relevant_fields='
      {
      apiVersion: .apiVersion,
      kind: .kind,
      metadata: {
      name: .metadata.name,
      namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
      },
      spec: .spec
      }
      '
      f3=$(mktemp)
      f4=$(mktemp)
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get issuer -n kube-system metrics-pki.cluster.local -o json | jq "${relevant_fields}" > $f3
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get certificate -n kube-system metrics-ca -o json | jq "${relevant_fields}" > $f4
      kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
      kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4

    False positives in docker, containerd, and runc vulnerability scanning

    The docker, containerd, and runc in the Ubuntu OS images shipped with Google Distributed Cloud are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes are qualified by Google Distributed Cloud before each release.

    However, the special versions are unknown to the Ubuntu CVE Tracker, which various CVE scanning tools use as their vulnerability feed. Therefore, you will see false positives in docker, containerd, and runc vulnerability scanning results.

    For example, you might see the following false positives from your CVE scanning results. These CVEs are already fixed in the latest patch versions of Google Distributed Cloud.

    Refer to the release notes for any CVE fixes.

    Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73 .

    /etc/cron.daily/aide CPU and memory spike issue

    Starting from Google Distributed Cloud version 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark .

    As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled, which satisfies the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked".

    The cron job runs daily at 6:25 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.

    If the spikes are affecting your workload, you can disable the daily cron job:

    sudo chmod -x /etc/cron.daily/aide

    Cisco ACI doesn't work with Direct Server Return (DSR)

    Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning. A possible workaround is to disable IP learning by adding the Seesaw IP address as a L4-L7 Virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).

    You can configure the L4-L7 Virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning results in IP endpoint flapping between different locations in the Cisco ACI fabric.

    If your logging-monitoring service account bearer token is larger than 512 KB, it can break the Seesaw load balancer logs. To fix this issue, upgrade to version 1.9 or later.

    Connectivity issues between Pods due to anetd daemons in software deadlock

    Clusters with enableDataplaneV2 set to true can experience connectivity issues between Pods due to anetd daemons (running as a Daemonset) entering a software deadlock. While in this state, anetd daemons will see stale nodes (previously deleted nodes) as peers and miss newly added nodes as new peers.

    If you have experienced this issue, complete the following steps to restart the anetd daemons to refresh the peer nodes, and connectivity should be restored.

    1. Find all anetd daemons in the cluster:

      kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG -n kube-system get pods -o wide | grep anetd
    2. Check whether anetd daemons currently see stale peers:

      kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG -n kube-system exec -it ANETD_XYZ -- cilium-health status

      Replace ANETD_XYZ with the name of an anetd Pod.

    3. Restart all affected Pods:

      kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG -n kube-system delete pod ANETD_XYZ
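When many nodes are affected, restarting the Pods one by one is tedious. As a sketch of how the Pod names could be collected from the listing in step 1, the following function filters kubectl get pods output read from standard input. The function name anetd_pod_names is hypothetical, and the anetd- name prefix is an assumption based on the grep in step 1.

```shell
# Sketch only: read `kubectl -n kube-system get pods` output on stdin
# and print the names (first column) of Pods whose name starts with
# "anetd-". The header row and other Pods are skipped.
anetd_pod_names() {
  awk '$1 ~ /^anetd-/ { print $1 }'
}
```

The resulting names can then be passed, one at a time or through xargs, to the delete command from step 3. Review the list before deleting, because each deletion briefly interrupts networking on that node.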
      

    gkectl diagnose checking certificates failure

    If your workstation does not have access to the user cluster worker nodes, gkectl diagnose reports the following failures. It is safe to ignore them.

     Checking user cluster certificates...FAILURE
        Reason: 3 user cluster certificates error(s).
        Unhealthy Resources:
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out 
    