This document describes known issues for version 1.7 of
Google Distributed Cloud.
User cluster upgrade fails due to 'failed to register to GCP'
Category
Upgrade
Identified Versions
1.7.0+, 1.8.0+
Symptoms
When upgrading user clusters to 1.7 versions, the command gkectl upgrade cluster fails with error messages similar to the following:
$ gkectl upgrade cluster --kubeconfig kubeconfig --config user-cluster.yaml
…
Upgrading to bundle version: "1.7.1-gke.4"
…
Exit with error:
failed to register to GCP, gcloud output: , error: error running command
'gcloud alpha container hub memberships register foo-cluster --kubeconfig kubeconfig --context cluster --version 20210129-01-00 --enable-workload-identity --has-private-issuer --verbosity=error --quiet': error: exit status 1, stderr: 'Waiting for membership to be created...
The errors indicate that the user cluster upgrade is mostly complete, except that the Connect Agent has not been upgraded. However, the functionality of GKE Connect should not be affected.
Cause
The Connect Agent version 20210129-01-00 used in 1.7 versions is out of support.
Workaround
Contact Google support to mitigate the issue.
systemd-timesyncd not running after reboot on Ubuntu Node
Category
OS
Identified Versions
1.7.1-1.7.5, 1.8.0-1.8.4, 1.9.0+
Symptoms
Running systemctl status systemd-timesyncd shows that the service is dead.
Cause
chrony was incorrectly installed on the Ubuntu OS image, and there is a conflict between chrony and systemd-timesyncd: systemd-timesyncd becomes inactive and chrony becomes active every time the Ubuntu VM is rebooted. However, systemd-timesyncd should be the default NTP client for the VM.
Workaround
Option 1: Manually restart systemd-timesyncd (for example, systemctl restart systemd-timesyncd) every time the VM is rebooted.
Option 2: Deploy the following DaemonSet so that systemd-timesyncd is always restarted if it is dead.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ensure-systemd-timesyncd
spec:
  selector:
    matchLabels:
      name: ensure-systemd-timesyncd
  template:
    metadata:
      labels:
        name: ensure-systemd-timesyncd
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: ensure-systemd-timesyncd
        # Use your preferred image.
        image: ubuntu
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            echo $(date -u)
            echo "Checking systemd-timesyncd status..."
            chroot /host systemctl status systemd-timesyncd
            if (( $? != 0 )) ; then
              echo "Restarting systemd-timesyncd..."
              chroot /host systemctl start systemd-timesyncd
            else
              echo "systemd-timesyncd is running."
            fi;
            sleep 60
          done
        volumeMounts:
        - name: host
          mountPath: /host
        resources:
          requests:
            memory: "10Mi"
            cpu: "10m"
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /

ClientConfig custom resource
gkectl update reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.

kubectl describe CSINode and gkectl diagnose snapshot
kubectl describe CSINode and gkectl diagnose snapshot sometimes fail due to the OSS Kubernetes issue (https://github.com/kubernetes/kubectl/issues/848) on dereferencing nil pointer fields.

OIDC and the CA certificate
The OIDC provider doesn't use the common CA by default. You must explicitly supply the CA certificate.
Upgrading the admin cluster from 1.5 to 1.6.0 breaks 1.5 user clusters that use an OIDC provider and have no value for authentication.oidc.capath in the user cluster configuration file (/anthos/clusters/docs/on-prem/1.7/how-to/user-cluster-configuration-file).
To work around this issue, run the following script:
USER_CLUSTER_KUBECONFIG=YOUR_USER_CLUSTER_KUBECONFIG
IDENTITY_PROVIDER=YOUR_OIDC_PROVIDER_ADDRESS

openssl s_client -showcerts -verify 5 -connect $IDENTITY_PROVIDER:443 < /dev/null | awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/{ if(/BEGIN CERTIFICATE/){i++}; out="tmpcert"i".pem"; print >out}'

ROOT_CA_ISSUED_CERT=$(ls tmpcert*.pem | tail -1)
ROOT_CA_CERT="/etc/ssl/certs/$(openssl x509 -in $ROOT_CA_ISSUED_CERT -noout -issuer_hash).0"
cat tmpcert*.pem $ROOT_CA_CERT > certchain.pem
CERT=$(echo $(base64 certchain.pem) | sed 's\ \\g')
rm tmpcert1.pem tmpcert2.pem

kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG patch clientconfig default -n kube-public --type json -p "[{ \"op\":\"replace\", \"path\":\"/spec/authentication/0/oidc/certificateAuthorityData\", \"value\":\"${CERT}\"}]"
Replace the following:
YOUR_OIDC_PROVIDER_ADDRESS: The address of your OIDC provider.
YOUR_USER_CLUSTER_KUBECONFIG: The path of your user cluster kubeconfig file.
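After running the script, you may want to confirm the change and keep a backup of the ClientConfig resource, as recommended above. A minimal sketch, assuming USER_CLUSTER_KUBECONFIG is still set as in the script:
# Save a backup copy of the ClientConfig resource and confirm that certificateAuthorityData was updated.
kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG get clientconfig default -n kube-public -o yaml > clientconfig-backup.yaml
grep certificateAuthorityData clientconfig-backup.yaml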
gkectl check-config validation fails: can't find F5 BIG-IP partitions
Symptoms
Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
Potential causes
An issue with the F5 BIG-IP API can cause validation to fail.
Resolution
Try running gkectl check-config again.

Disruption for workloads with PodDisruptionBudgets
Upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) (PDBs).

Nodes fail to complete their upgrade process
If you have PodDisruptionBudget objects configured that are unable to allow any additional disruptions, node upgrades might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler to allow the node to drain while still respecting the PodDisruptionBudget configuration.
To see all PodDisruptionBudget objects that do not allow any disruptions:
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
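If one of these PodDisruptionBudgets belongs to a workload you control, you can temporarily scale the workload up so that the PDB allows a disruption during the node drain. A minimal sketch, assuming a hypothetical Deployment named my-app in namespace my-namespace:
# Temporarily add a replica so that the PDB can tolerate one disruption during the drain.
kubectl scale deployment my-app -n my-namespace --replicas=3
# Scale back down after the upgrade completes.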
Log Forwarder makes an excessive number of OAuth 2.0 requests
With Google Distributed Cloud version 1.7.1, you might experience issues with the Log Forwarder consuming memory by making excessive OAuth 2.0 requests. Here is a workaround, in which you downgrade the stackdriver-operator version, clean up the disk, and restart the Log Forwarder.
Step 0: Download images to your private registry if appropriate
If you use a private registry, follow these steps to download these images to your private registry before proceeding. Omit this step if you do not use a private registry.
Replace PRIVATE_REGISTRY_HOST with the hostname or IP address of your private Docker registry.
Logs and metrics are not sent to the project specified by stackdriver.projectID
In Google Distributed Cloud 1.7, logs are sent to the parent project of the service account specified in the stackdriver.serviceAccountKeyPath field of your cluster configuration file. The value of stackdriver.projectID is ignored. This issue will be fixed in an upcoming release.
As a workaround, view logs in the parent project of your logging-monitoring service account.
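If you are not sure which project that is, you can read it from the service account key file, because service account JSON key files include a project_id field. A minimal sketch, using a hypothetical key file path; use the path from stackdriver.serviceAccountKeyPath in your cluster configuration file:
# Print the parent project of the logging-monitoring service account.
grep '"project_id"' /path/to/logging-monitoring-key.json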
Renewal of certificates might be required before an admin cluster upgrade
Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.
Admin cluster certificate renewal process
Make sure that OpenSSL is installed on the admin workstation before you begin.
Get the IP address and SSH keys for the admin master node:
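A sketch of this step, assuming the admin node SSH key is stored in the sshkeys secret in the kube-system namespace of the admin cluster under the vsphere_tmp key (both names are assumptions here), and that [ADMIN_CLUSTER_KUBECONFIG] is the path to your admin cluster kubeconfig file:
# Extract the SSH private key used for the admin cluster nodes (assumed secret name and data key).
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secret sshkeys -n kube-system \
  -o jsonpath='{.data.vsphere_tmp}' | base64 -d > ~/.ssh/admin-cluster.key
chmod 600 ~/.ssh/admin-cluster.key
# List the admin cluster nodes and their IP addresses, then note the admin master node IP.
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -o wide
# Export the admin master node IP for the steps that follow (replace the placeholder).
MASTER_NODE_IP=ADMIN_MASTER_NODE_IP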
Replace client-certificate-data and client-key-data in kubeconfig with client-certificate-data and client-key-data in the new_admin.conf file that you created.
Back up old certificates:
This is an optional, but recommended, step.
# ssh into admin master if you didn't in the previous step
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
# on admin master
sudo tar -czvf backup.tar.gz /etc/kubernetes
logout
# on worker node
sudo scp -i ~/.ssh/admin-cluster.key \
ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .
Renew the certificates with kubeadm:
# ssh into admin master
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
# on admin master
sudo kubeadm alpha certs renew all
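Optionally, before restarting the static Pods, you can confirm the new expiration dates. This is a sketch that assumes the same kubeadm alpha command group used above is available on the node:
# on admin master
sudo kubeadm alpha certs check-expiration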
Restart static Pods running on the admin master node:
# on admin master
cd /etc/kubernetes
sudo mkdir tempdir
sudo mv manifests/*.yaml tempdir/
sleep 5
echo "remove pods"
# ensure kubelet detects the change and removes those pods
# wait until the output of this command is empty
sudo docker ps | grep kube-apiserver
# move the manifests back so that kubelet starts the pods again
echo "start pods again"
sudo mv tempdir/*.yaml manifests/
sleep 30
# ensure kubelet has restarted those pods
# the output should show some results
sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd
# clean up
sudo rm -rf tempdir
logout
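After the static Pods are running again, a quick sanity check from the admin workstation confirms that the API server is serving the renewed certificates. A minimal sketch, assuming [ADMIN_CLUSTER_KUBECONFIG] is the path to your admin cluster kubeconfig file:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes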
Renew the certificates of admin cluster worker nodes
Check the node certificate expiration dates
kubectl get nodes -o wide
# find the oldest node, fill NODE_IP with the internal ip of that node
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${NODE_IP}"
openssl x509 -enddate -noout -in /var/lib/kubelet/pki/kubelet-client-current.pem
logout
If the certificate is about to expire, renew node certificates by manual node repair.
You must validate the renewed certificates, and validate the certificate of kube-apiserver.
# Get the IP address of kube-apiserver
cat [ADMIN_CLUSTER_KUBECONFIG] | grep server
# Get the current kube-apiserver certificate
openssl s_client -showcerts -connect [APISERVER_IP]:[APISERVER_PORT] | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > current-kube-apiserver.crt
# check expiration date of this cert
openssl x509 -in current-kube-apiserver.crt -noout -enddate
/etc/cron.daily/aide script uses up all space in /run, causing a crashloop in Pods
Starting from Google Distributed Cloud 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled, ensuring that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.
If you see one or more Pods crashlooping on a node, run df -h /run on the node. If the command output shows 100% space usage, you are likely experiencing this issue.
We anticipate a fix in a future release. Meanwhile, you can resolve this issue with either of the following two workarounds:
Periodically remove the log files at /run/aide/cron.daily.old* (recommended); see the sketch after this list.
Follow the steps mentioned in the external link above. (Note: this workaround could affect the node's compliance state.)
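For the recommended option, a minimal sketch of the cleanup command, run on the affected node (you can also schedule it yourself, for example with a cron entry of your own):
# Remove the accumulated aide log files to free space in /run.
sudo rm -f /run/aide/cron.daily.old*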
Using Google Distributed Cloud with Anthos Service Mesh version 1.7 or later
If you use Google Distributed Cloud with Anthos Service Mesh version 1.7 or later, and you want to upgrade to Google Distributed Cloud version 1.6.0-1.6.3 or Google Distributed Cloud version 1.7.0-1.7.2, you must remove the bundle.gke.io/component-name and bundle.gke.io/component-version labels from the following Custom Resource Definitions (CRDs):
destinationrules.networking.istio.io
envoyfilters.networking.istio.io
serviceentries.networking.istio.io
virtualservices.networking.istio.io
Update the CRD destinationrules.networking.istio.io in your user cluster as follows, and repeat for each of the other CRDs listed above:
Remove the bundle.gke.io/component-version and bundle.gke.io/component-name labels from the CRD.
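As a sketch, assuming USER_CLUSTER_KUBECONFIG is the path to your user cluster kubeconfig file, you can remove the labels from all four CRDs with kubectl label (a trailing - on a label key removes that label):
# Remove the bundle labels from the Istio CRDs in the user cluster.
for crd in destinationrules.networking.istio.io \
           envoyfilters.networking.istio.io \
           serviceentries.networking.istio.io \
           virtualservices.networking.istio.io; do
  kubectl --kubeconfig USER_CLUSTER_KUBECONFIG label crd "$crd" \
    bundle.gke.io/component-name- bundle.gke.io/component-version-
done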
Alternatively, you can wait for the 1.6.4 and 1.7.3 releases, and then upgrade to 1.6.4 or 1.7.3 directly.
Cannot log in to admin workstation due to password expiry issue
You might experience this issue if you are using one of the following versions of Google Distributed Cloud.
1.7.2-gke.2
1.7.3-gke.2
1.8.0-gke.21
1.8.0-gke.24
1.8.0-gke.25
1.8.1-gke.7
1.8.2-gke.8
You might get the following error when you attempt to SSH into your Anthos VMs, including the admin workstation, cluster nodes, and Seesaw nodes:
WARNING: Your password has expired.
This error occurs because the ubuntu user password on the VMs has expired. You must manually reset the user password's expiration time to a large value before logging into the VMs.
Prevention of password expiry error
If you are running the affected versions listed above, and the user password hasn't expired yet, you should extend the expiration time before seeing the SSH error.
Run the following command on each Anthos VM:
sudo chage -M 99999 ubuntu
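Optionally, you can confirm the new password aging policy with chage -l, which lists the current settings for the user:
sudo chage -l ubuntu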
Mitigation of password expiry error
If the user password has already expired and you can't log in to the VMs to extend the expiration time, perform the following mitigation steps for each component.
Admin workstation
Use a temporary VM to perform the following steps. You can create an admin workstation using the 1.7.1-gke.4 version to use as the temporary VM.
Ensure the temporary VM and the admin workstation are in a power off state.
Attach the boot disk of the admin workstation to the temporary VM. The boot disk is the one with the label "Hard disk 1".
Mount the boot disk inside the VM by running these commands. Substitute your own boot disk identifier for /dev/sdc1.
sudo mkdir -p /mnt/boot-disk
sudo mount /dev/sdc1 /mnt/boot-disk
Set the ubuntu user expiration date to a large value such as 99999 days.
sudo chroot /mnt/boot-disk chage -M 99999 ubuntu
Shut down the temporary VM.
Power on the admin workstation. You should now be able to SSH as usual.
After you run this command, wait for the user cluster control plane VMs to finish recreation and to be ready before you continue with the next steps.
User cluster worker VMs
Run the following command from the admin workstation to recreate the VMs.
for md in `kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG get machinedeployments -l set=node -o name`; do kubectl patch --kubeconfig=USER_CLUSTER_KUBECONFIG $md --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {\"kubectl.kubernetes.io/restartedAt\": \"version1\"}}]"; done
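To watch the recreation, a simple sketch that checks the worker nodes from the admin workstation; the nodes should return to the Ready state as the VMs are recreated:
kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG get nodes -o wide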
Seesaw VMs
Run the following commands from the admin workstation to recreate the Seesaw VMs. There will be some downtime. If HA is enabled for the load balancer, the maximum downtime is two seconds.
Restarting or upgrading vCenter for versions lower than 7.0U2
If vCenter, for versions lower than 7.0U2, is restarted after an upgrade or otherwise, the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired to create new ones.
1. The issue is fixed in vCenter versions 7.0U2 and above.
2. For lower versions: right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's port group.
SSH connection closed by remote host
For Google Distributed Cloud version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:
ClientAliveInterval 300
ClientAliveCountMax 0
The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the SSH connection might be disconnected even when your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:
Connection to [IP] closed by remote host.
Connection to [IP] closed.
As a workaround, you can do either of the following:
Use nohup to prevent your command from being terminated on SSH disconnection (see the sketch after this list), or adjust the ClientAliveInterval and ClientAliveCountMax values in /etc/ssh/sshd_config so that active sessions are not terminated prematurely.
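A minimal sketch of the nohup approach, using a hypothetical long-running command as a stand-in for your own:
# Keep the command running even if the SSH session is closed by the remote host.
nohup ./my-long-running-command > my-long-running-command.log 2>&1 &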
False positives in docker, containerd, and runc vulnerability scanning
The docker, containerd, and runc packages in the Ubuntu OS images shipped with Google Distributed Cloud are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes are qualified by Google Distributed Cloud before each release.
However, the special versions are unknown to the Ubuntu CVE Tracker, which various CVE scanning tools use as their vulnerability feed. Therefore, you will see false positives in docker, containerd, and runc vulnerability scanning results.
For example, you might see the following false positives from your CVE scanning
results. These CVEs are already fixed in the latest patch versions of Google Distributed Cloud.
/etc/cron.daily/aide cron job causes CPU and memory usage spikes
Starting from Google Distributed Cloud version 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled, ensuring that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.
The cron job runs daily at 6:25 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.
If the spikes are affecting your workload, you can disable the daily cron job:
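One way to do this is a sketch that relies on run-parts skipping non-executable scripts; removing the execute bit keeps the script on disk but prevents the daily run:
sudo chmod -x /etc/cron.daily/aide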
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[[["\u003cp\u003eUser cluster upgrades to version 1.7 may fail due to an unsupported Connect Agent version, however, the GKE connect functionality should remain unaffected, and users should contact Google support for resolution.\u003c/p\u003e\n"],["\u003cp\u003eUbuntu node reboots in versions 1.7.1-1.7.5, 1.8.0-1.8.4, and 1.9.0+ can lead to \u003ccode\u003esystemd-timesyncd\u003c/code\u003e becoming inactive, potentially causing time synchronization issues, but can be resolved by manually restarting the service or deploying a provided DaemonSet.\u003c/p\u003e\n"],["\u003cp\u003eManual changes to the ClientConfig custom resource can be reverted by \u003ccode\u003egkectl update\u003c/code\u003e, so it is strongly advised to back up this resource after making manual adjustments.\u003c/p\u003e\n"],["\u003cp\u003eIn Google Distributed Cloud version 1.7.1, the Log Forwarder may consume excessive memory by making too many OAuth 2.0 requests, which is mitigated by downgrading the \u003ccode\u003estackdriver-operator\u003c/code\u003e version, cleaning up the disk buffer, and restarting the Log Forwarder (this issue has been fixed in version 1.7.2).\u003c/p\u003e\n"],["\u003cp\u003eFor Google Distributed Cloud version 1.7.2 and above, SSH connections might be unexpectedly terminated even when the client is active due to the CIS compliance settings, which can be worked around by using \u003ccode\u003enohup\u003c/code\u003e or updating the \u003ccode\u003esshd_config\u003c/code\u003e file.\u003c/p\u003e\n"]]],[],null,[]]