This document describes known issues for version 1.7 of Google Distributed Cloud.
## User cluster upgrade fails due to 'failed to register to GCP'
Category
Upgrade
Identified Versions
1.7.0+, 1.8.0+
Symptoms
When upgrading user clusters to 1.7 versions, the `gkectl upgrade cluster` command fails with error messages similar to the following:

```
$ gkectl upgrade cluster --kubeconfig kubeconfig --config user-cluster.yaml
…
Upgrading to bundle version: "1.7.1-gke.4"
…
Exit with error:
failed to register to GCP, gcloud output: , error: error running command
'gcloud alpha container hub memberships register foo-cluster --kubeconfig kubeconfig --context cluster --version 20210129-01-00 --enable-workload-identity --has-private-issuer --verbosity=error --quiet': error: exit status 1, stderr: 'Waiting for membership to be created...
```
The errors indicate that the user cluster upgrade has mostly completed, except that the Connect Agent has not been upgraded. However, the functionality of GKE Connect should not be affected.
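To confirm which Connect Agent version is running, you can inspect the container images in the `gke-connect` namespace. This is a diagnostic sketch; the exact Pod names vary by installation:

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pods -n gke-connect \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```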
Cause
The Connect Agent version 20210129-01-00 used in 1.7 versions is out of support.
Workaround
Please contact Google support to mitigate the issue.
## systemd-timesyncd not running after reboot on Ubuntu Node
Category
OS
Identified Versions
1.7.1-1.7.5, 1.8.0-1.8.4, 1.9.0+
Symptoms
`systemctl status systemd-timesyncd` should show that the service is dead:

```
● systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
```
This can cause time synchronization issues.
Cause
chrony
was incorrectly installed on Ubuntu OS image, and there's conflict
between chrony
and systemd-timesyncd
, where systemd-timesyncd
would become
inactive and chrony
become active everytime Ubuntu VM got rebooted. However, systemd-timesyncd
should be the default ntp client for the VM.
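To confirm the conflict on an affected node, you can check the state of both services. This is a diagnostic sketch; on some Ubuntu versions the chrony unit is named `chronyd`:

```
systemctl is-active systemd-timesyncd chrony
```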
Workaround
Option 1: Manually run restart systemd-timesyncd
every time when VM got rebooted.
Option 2: Deploy the following Daemonset so that systemd-timesyncd
will always
be restarted if it's dead.
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ensure-systemd-timesyncd
spec:
  selector:
    matchLabels:
      name: ensure-systemd-timesyncd
  template:
    metadata:
      labels:
        name: ensure-systemd-timesyncd
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: ensure-systemd-timesyncd
        # Use your preferred image.
        image: ubuntu
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            echo $(date -u)
            echo "Checking systemd-timesyncd status..."
            chroot /host systemctl status systemd-timesyncd
            if (( $? != 0 )) ; then
              echo "Restarting systemd-timesyncd..."
              chroot /host systemctl start systemd-timesyncd
            else
              echo "systemd-timesyncd is running."
            fi;
            sleep 60
          done
        volumeMounts:
        - name: host
          mountPath: /host
        resources:
          requests:
            memory: "10Mi"
            cpu: "10m"
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
```
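To use this workaround, save the manifest to a file and apply it with `kubectl`. The filename below is only an example:

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG apply -f ensure-systemd-timesyncd.yaml
```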
## ClientConfig custom resource
`gkectl update` reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.
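A minimal backup sketch, assuming the default ClientConfig in the `kube-public` namespace (the same object patched by the OIDC workaround later on this page):

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get clientconfig default -n kube-public \
  -o yaml > clientconfig-backup.yaml
```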
## `kubectl describe CSINode` and `gkectl diagnose snapshot`
`kubectl describe CSINode` and `gkectl diagnose snapshot` sometimes fail due to the [OSS Kubernetes issue](https://github.com/kubernetes/kubectl/issues/848){:.external} on dereferencing nil pointer fields.
## OIDC and the CA certificate
The OIDC provider doesn't use the common CA by default. You must explicitly
supply the CA certificate.
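To check whether your provider's certificate chain verifies against the system CA bundle, you can run a quick probe like the following. This is a sketch that assumes the provider serves on port 443; a verification error indicates that you need to supply the CA certificate explicitly:

```
openssl s_client -connect YOUR_OIDC_PROVIDER_ADDRESS:443 -verify_return_error < /dev/null
```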
Upgrading the admin cluster from 1.5 to 1.6.0 breaks 1.5 user clusters that use an OIDC provider and have no value for `authentication.oidc.capath` in the [user cluster configuration file](/anthos/clusters/docs/on-prem/1.7/how-to/user-cluster-configuration-file).

To work around this issue, run the following script:
<section><pre class="devsite-click-to-copy">
USER_CLUSTER_KUBECONFIG=<var class="edit">YOUR_USER_CLUSTER_KUBECONFIG</var>
IDENTITY_PROVIDER=<var class="edit">YOUR_OIDC_PROVIDER_ADDRESS</var>
openssl s_client -showcerts -verify 5 -connect $IDENTITY_PROVIDER:443 < /dev/null | awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/{ if(/BEGIN CERTIFICATE/){i++}; out="tmpcert"i".pem"; print >out}'
ROOT_CA_ISSUED_CERT=$(ls tmpcert*.pem | tail -1)
ROOT_CA_CERT="/etc/ssl/certs/$(openssl x509 -in $ROOT_CA_ISSUED_CERT -noout -issuer_hash).0"
cat tmpcert*.pem $ROOT_CA_CERT > certchain.pem
CERT=$(echo $(base64 certchain.pem) | sed 's\ \\g')
rm tmpcert1.pem tmpcert2.pem
kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG patch clientconfig default -n kube-public --type json -p "[{ \"op\": \"replace\", \"path\": \"/spec/authentication/0/oidc/certificateAuthorityData\", \"value\":\"${CERT}\"}]"
</pre></section>
Replace the following:

* <var>YOUR_OIDC_PROVIDER_ADDRESS</var>: The address of your OIDC provider.
* <var>YOUR_USER_CLUSTER_KUBECONFIG</var>: The path of your user cluster kubeconfig file.
## `gkectl check-config` validation fails: can't find F5 BIG-IP partitions
<dl>
<dt>Symptoms</dt>
<dd><p>Validation fails because F5 BIG-IP partitions can't be found, even though they exist.</p></dd>
<dt>Potential causes</dt>
<dd><p>An issue with the F5 BIG-IP API can cause validation to fail.</p></dd>
<dt>Resolution</dt>
<dd><p>Try running <code>gkectl check-config</code> again.</p></dd>
</dl>
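For example, re-run the validation against your configuration file (a sketch; substitute the paths used in your environment):

```
gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config user-cluster.yaml
```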
## Disruption for workloads with PodDisruptionBudgets {:#workloads_pdbs_disruption}
Upgrading clusters can cause disruption or downtime for workloads that use [PodDisruptionBudgets](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/){:.external} (PDBs).
## Nodes fail to complete their upgrade process
If you have `PodDisruptionBudget` objects configured that are unable to allow any additional disruptions, nodes might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the `Deployment` or `HorizontalPodAutoscaler` to allow the node to drain while still respecting the `PodDisruptionBudget` configuration.
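For example, if a hypothetical Deployment has two replicas and its PDB requires `minAvailable: 2`, scaling to three replicas allows one Pod to be evicted during the drain:

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG scale deployment DEPLOYMENT_NAME \
  --namespace NAMESPACE --replicas=3
```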
To see all `PodDisruptionBudget` objects that do not allow any disruptions:

```
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
```
## Log Forwarder makes an excessive number of OAuth 2.0 requests

With Google Distributed Cloud version 1.7.1, you might experience issues with Log Forwarder consuming memory by making excessive OAuth 2.0 requests. Here is a workaround, in which you downgrade the `stackdriver-operator` version, clean up the disk, and restart Log Forwarder.
Step 0: Download images to your private registry if appropriate

If you use a private registry, follow these steps to download these images to your private registry before proceeding. Omit this step if you do not use a private registry.

Replace PRIVATE_REGISTRY_HOST with the hostname or IP address of your private Docker registry.

stackdriver-operator

```
docker pull gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440
docker tag gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440 \
  PRIVATE_REGISTRY_HOST/stackdriver-operator:v0.0.440
docker push PRIVATE_REGISTRY_HOST/stackdriver-operator:v0.0.440
```

fluent-bit

```
docker pull gcr.io/gke-on-prem-release/fluent-bit:v1.6.10-gke.3
docker tag gcr.io/gke-on-prem-release/fluent-bit:v1.6.10-gke.3 \
  PRIVATE_REGISTRY_HOST/fluent-bit:v1.6.10-gke.3
docker push PRIVATE_REGISTRY_HOST/fluent-bit:v1.6.10-gke.3
```

prometheus

```
docker pull gcr.io/gke-on-prem-release/prometheus:2.18.1-gke.0
docker tag gcr.io/gke-on-prem-release/prometheus:2.18.1-gke.0 \
  PRIVATE_REGISTRY_HOST/prometheus:2.18.1-gke.0
docker push PRIVATE_REGISTRY_HOST/prometheus:2.18.1-gke.0
```
Step 1: Downgrade the stackdriver-operator version

- Run the following command to downgrade your version of stackdriver-operator:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system patch deployment stackdriver-operator -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"stackdriver-operator","image":"gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440"}]}}}}'
```
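You can optionally confirm that the downgrade took effect by reading the image back from the Deployment (a verification sketch):

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system get deployment stackdriver-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```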
Step 2: Clean up the disk buffer for Log Forwarder

- Deploy the following DaemonSet in the cluster to clean up the buffer:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit-cleanup
namespace: kube-system
spec:
selector:
matchLabels:
app: fluent-bit-cleanup
template:
metadata:
labels:
app: fluent-bit-cleanup
spec:
containers:
- name: fluent-bit-cleanup
image: debian:10-slim
command: ["bash", "-c"]
args:
- |
rm -rf /var/log/fluent-bit-buffers/
echo "Fluent Bit local buffer is cleaned up."
sleep 3600
volumeMounts:
- name: varlog
mountPath: /var/log
securityContext:
privileged: true
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
- key: node-role.kubernetes.io/master
effect: NoSchedule
- key: node-role.gke.io/observability
effect: NoSchedule
volumes:
- name: varlog
hostPath:
path: /var/log
```

- Verify that the disk buffer is cleaned up:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] logs -n kube-system -l app=fluent-bit-cleanup | grep "cleaned up" | wc -l
```

The output shows the number of nodes in the cluster.

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system get pods -l app=fluent-bit-cleanup --no-headers | wc -l
```

The output also shows the number of nodes in the cluster; the two counts should match before you proceed.
- Delete the cleanup DaemonSet:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system delete ds fluent-bit-cleanup
```
Step 3: Restart Log Forwarder

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system rollout restart ds/stackdriver-log-forwarder
```
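You can watch the restart complete with the standard rollout status command:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system rollout status ds/stackdriver-log-forwarder
```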
## Logs and metrics are not sent to project specified by stackdriver.projectID

In Google Distributed Cloud 1.7, logs are sent to the parent project of the service account specified in the `stackdriver.serviceAccountKeyPath` field of your cluster configuration file. The value of `stackdriver.projectID` is ignored. This issue will be fixed in an upcoming release.
As a workaround, view logs in the parent project of your logging-monitoring service account.
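To identify that parent project, you can read the `project_id` field from the key file referenced by `stackdriver.serviceAccountKeyPath` (a sketch; `SERVICE_ACCOUNT_KEY_FILE` is a placeholder for that path):

```
grep '"project_id"' SERVICE_ACCOUNT_KEY_FILE
```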
## Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.
Admin cluster certificate renewal process

- Make sure that OpenSSL is installed on the admin workstation before you begin.

- Get the IP address and SSH keys for the admin master node:

```
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secrets -n kube-system sshkeys \
  -o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
  ~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key

export MASTER_NODE_IP=$(kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -o \
  jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
  --selector='node-role.kubernetes.io/master')
```
- Check if the certificates are expired:

```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo kubeadm alpha certs check-expiration"
```

If the certificates are expired, you must renew them before upgrading the admin cluster.
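In the kubeadm version used by these releases, the renewal counterpart of the check command above is `kubeadm alpha certs renew all`. Treat the following as a sketch and confirm the exact renewal procedure with Google support before running it on a production control plane:

```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo kubeadm alpha certs renew all"
```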
- Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.

- Back up the admin cluster kubeconfig file:

```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
vi [ADMIN_CLUSTER_KUBECONFIG]
```