/var/log/audit/ is filled with audit logs. You can check the disk
usage by running sudo du -h -d 1 /var/log/audit.
Cause
Since Anthos v1.8, the Ubuntu image is hardened with the CIS Level 2 Benchmark. One
of the compliance rules, 4.1.2.2 Ensure audit logs are not automatically deleted,
enforces the auditd setting max_log_file_action = keep_logs. This results in all the
audit records being kept on the disk.
Workaround
Admin workstation
For the admin workstation, you can manually change the auditd settings to rotate the
logs automatically, and then restart the auditd service:
sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/g' /etc/audit/auditd.conf
sed -i 's/num_logs = .*/num_logs = 250/g' /etc/audit/auditd.conf
systemctl restart auditd
With these settings, auditd automatically rotates a log file once it reaches its maximum
size (8 MB) and keeps at most 250 rotated files on disk.
Cluster nodes
For cluster nodes, apply the following DaemonSet to your cluster to prevent potential issues:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: change-auditd-log-action
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: change-auditd-log-action
  template:
    metadata:
      labels:
        app: change-auditd-log-action
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: update-audit-rule
        image: ubuntu
        command: ["chroot", "/host", "bash", "-c"]
        args:
        - |
          while true; do
            if $(grep -q "max_log_file_action = keep_logs" /etc/audit/auditd.conf); then
              echo "updating auditd max_log_file_action to rotate with a max of 250 files"
              sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/g' /etc/audit/auditd.conf
              sed -i 's/num_logs = .*/num_logs = 250/g' /etc/audit/auditd.conf
              echo "restarting auditd"
              systemctl restart auditd
            else
              echo "auditd setting is expected, skip update"
            fi
            sleep 600
          done
        volumeMounts:
        - name: host
          mountPath: /host
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
Note that making this auditd config change violates the CIS Level 2 rule 4.1.2.2 Ensure audit logs are not automatically deleted.
systemd-timesyncd not running after reboot on Ubuntu Node
Category
OS
Identified Versions
1.7.1-1.7.5, 1.8.0-1.8.4, 1.9.0+
Symptoms
Running systemctl status systemd-timesyncd shows that the service is dead:
chrony was incorrectly installed on the Ubuntu OS image, and there is a conflict
between chrony and systemd-timesyncd: systemd-timesyncd becomes
inactive and chrony becomes active every time the Ubuntu VM is rebooted. However, systemd-timesyncd should be the default NTP client for the VM.
Workaround
Option 1: Manually restart systemd-timesyncd (for example, with systemctl restart systemd-timesyncd) every time the VM is rebooted.
Option 2: Deploy the following DaemonSet so that systemd-timesyncd is always
restarted if it is dead.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ensure-systemd-timesyncd
spec:
  selector:
    matchLabels:
      name: ensure-systemd-timesyncd
  template:
    metadata:
      labels:
        name: ensure-systemd-timesyncd
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: ensure-systemd-timesyncd
        # Use your preferred image.
        image: ubuntu
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            echo $(date -u)
            echo "Checking systemd-timesyncd status..."
            chroot /host systemctl status systemd-timesyncd
            if (( $? != 0 )); then
              echo "Restarting systemd-timesyncd..."
              chroot /host systemctl start systemd-timesyncd
            else
              echo "systemd-timesyncd is running."
            fi;
            sleep 60
          done
        volumeMounts:
        - name: host
          mountPath: /host
        resources:
          requests:
            memory: "10Mi"
            cpu: "10m"
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
ClientConfig custom resource
gkectl update reverts any manual changes that you have made to the ClientConfig
custom resource. We strongly recommend that you back up the ClientConfig
resource after every manual change.
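A minimal backup sketch, assuming the ClientConfig resource is named default and lives in the kube-public namespace (the typical location; adjust for your installation):
# Save a copy of the ClientConfig custom resource from the user cluster.
kubectl get clientconfig default -n kube-public \
    --kubeconfig USER_CLUSTER_KUBECONFIG -o yaml > clientconfig-backup.yaml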
Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
Potential causes
An issue with the F5 BIG-IP API can cause validation to fail.
Resolution
Try running gkectl check-config again.
Disruption for workloads with PodDisruptionBudgets
Upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).
Nodes fail to complete their upgrade process
If you have PodDisruptionBudget objects configured that cannot allow any
additional disruptions, nodes might fail to upgrade to the
control plane version after repeated attempts. To prevent this failure, we
recommend that you scale up the Deployment or HorizontalPodAutoscaler to
allow the node to drain while still respecting the PodDisruptionBudget configuration (see the sketch after the command below).
To see all PodDisruptionBudget objects that do not allow any disruptions:
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
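As a hedged sketch, assuming the blocking PodDisruptionBudget covers a Deployment named my-app (a hypothetical name) in the default namespace, you could temporarily add a replica so that the node drain can proceed:
# Temporarily increase the replica count so the PDB allows at least one disruption.
kubectl scale deployment my-app -n default --replicas=3
Scale the Deployment back down after the node upgrade completes.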
User cluster installation failed because of cert-manager/ca-injector's leader election issue in Anthos 1.9.0
You might see an installation failure due to cert-manager-cainjector being in a crash loop when the apiserver or etcd is slow.
First, scale the monitoring-operator Deployment to zero replicas so that it does not revert the changes that follow (a sketch is shown below). Second, patch the cert-manager-cainjector Deployment to disable leader election; this is safe because only one replica is running, and leader election is not required for a single replica.
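A minimal sketch of the scale-down step, assuming monitoring-operator runs as a Deployment in the kube-system namespace of the user cluster (adjust the namespace if your installation differs):
# Scale monitoring-operator to zero so it does not revert the cainjector patch.
kubectl scale deployment monitoring-operator -n kube-system \
    --kubeconfig USER_CLUSTER_KUBECONFIG --replicas=0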
# Ensure that we run only 1 cainjector replica, even during rolling updates.
kubectl patch --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployment cert-manager-cainjector --type=strategic --patch '
spec:
  strategy:
    rollingUpdate:
      maxSurge: 0
'
# Add a command line flag for cainjector: `--leader-elect=false`
kubectl patch --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployment cert-manager-cainjector --type=json --patch '[
{
"op": "add",
"path": "/spec/template/spec/containers/0/args/-",
"value": "--leader-elect=false"
}
]'
Keep the monitoring-operator replicas at 0 as a mitigation until the installation is finished; otherwise it will revert the change.
After the installation is finished and the cluster is up and running, turn the monitoring-operator back on for day-2 operations:
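A hedged sketch, assuming the same Deployment and namespace as in the scale-down step above:
# Restore monitoring-operator once the installation has completed.
kubectl scale deployment monitoring-operator -n kube-system \
    --kubeconfig USER_CLUSTER_KUBECONFIG --replicas=1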
After upgrading to 1.9.1 or above, these steps are no longer necessary because Anthos disables leader election for cainjector.
Renewal of certificates might be required before an admin cluster upgrade
Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.
Admin cluster certificate renewal process
Make sure that OpenSSL is installed on the admin workstation before you begin.
Set the KUBECONFIG variable:
KUBECONFIG=ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG
Replace ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG with the absolute path to the admin cluster kubeconfig file.
Get the IP address and SSH keys for the admin master node:
Replace client-certificate-data and client-key-data in the kubeconfig with client-certificate-data and client-key-data in the new_admin.conf file that you created.
Back up old certificates:
This is an optional, but recommended, step.
# ssh into admin master if you didn't in the previous step
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
# on admin master
sudo tar -czvf backup.tar.gz /etc/kubernetes
logout
# on the admin workstation
sudo scp -i ~/.ssh/admin-cluster.key \
ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .
Renew the certificates with kubeadm:
# ssh into admin master
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
# on admin master
sudo kubeadm alpha certs renew all
Restart static Pods running on the admin master node:
# on admin master
cd /etc/kubernetes
sudo mkdir tempdir
sudo mv manifests/*.yaml tempdir/
sleep 5
echo "remove pods"
# ensure that the kubelet detects those changes and removes those pods
# wait until the result of this command is empty
sudo docker ps | grep kube-apiserver
# ensure that the kubelet starts those pods again
echo "start pods again"
sudo mv tempdir/*.yaml manifests/
sleep 30
# ensure that the kubelet has started those pods again
# should show some results
sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd
# clean up
sudo rm -rf tempdir
logout
Validate the renewed certificates, including the certificate of kube-apiserver.
# Get the IP address of kube-apiserver
cat $KUBECONFIG | grep server
# Get the current kube-apiserver certificate. Replace APISERVER_ADDRESS with the
# host:port from the server field above (omit the https:// prefix).
openssl s_client -showcerts -connect APISERVER_ADDRESS | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > current-kube-apiserver.crt
# check expiration date of this cert
openssl x509 -in current-kube-apiserver.crt -noout -enddate
Restarting or upgrading vCenter for versions lower than 7.0U2
If vCenter, for versions lower than 7.0U2, is restarted after an upgrade or otherwise,
the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired and new ones created.
1. The issue is fixed in vCenter versions 7.0U2 and above.
2. For lower versions:
Right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the
VM's portgroup.
SSH connection closed by remote host
For Google Distributed Cloud version 1.7.2 and above, the Ubuntu OS images are hardened with the CIS L1 Server Benchmark.
To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:
ClientAliveInterval 300
ClientAliveCountMax 0
The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes
unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the connection might be dropped
even if your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:
Connection to [IP] closed by remote host.
Connection to [IP] closed.
As a workaround, you can either:
Use nohup to prevent your command being terminated on SSH disconnection,
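A minimal nohup sketch, using a hypothetical long-running command as an example:
# Keep the command running even if the SSH connection is dropped.
nohup ./long-running-command > command.log 2>&1 &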
Conflict with cert-manager when upgrading to version 1.9.0 or 1.9.1
If you have your own cert-manager installation with Google Distributed Cloud, you might experience a failure when you attempt to upgrade to version 1.9.0 or 1.9.1. This is a result of a conflict between your version of cert-manager, which is likely installed in the cert-manager namespace, and the version managed by monitoring-operator.
If you try to install another copy of cert-manager after upgrading to Google Distributed Cloud version 1.9.0 or 1.9.1, the installation might fail due to a conflict with the existing one managed by monitoring-operator.
The metrics-ca cluster issuer, which control-plane and observability components rely on for creation and rotation of cert secrets, requires a metrics-ca cert secret to be stored in the cluster resource namespace. This namespace is kube-system for the monitoring-operator installation, and likely to be cert-manager for your installation.
If you have experienced an installation failure, follow these steps to upgrade successfully to version 1.9.0 or 1.9.1:
Avoid conflicts during upgrade
Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up.
Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from kube-system to the cluster resource namespace of your installed cert-manager (see the sketch below). Your installed cert-manager namespace is cert-manager if you are using the upstream default cert-manager installation, but that depends on your installation.
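A hedged sketch of the copy step, assuming your cert-manager lives in the cert-manager namespace; the export/clean-up workflow below is an illustration, not a prescribed procedure:
# Export the two resources from kube-system.
kubectl get certificate metrics-ca -n kube-system -o yaml > metrics-ca.yaml
kubectl get issuer metrics-pki.cluster.local -n kube-system -o yaml > metrics-pki-issuer.yaml
# Edit both files: set metadata.namespace to cert-manager and remove
# server-generated fields (resourceVersion, uid, creationTimestamp, status).
# Then re-create the resources in the new namespace.
kubectl apply -f metrics-ca.yaml -f metrics-pki-issuer.yaml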
In general, you shouldn't need to re-install cert-manager in admin clusters because admin clusters only run Google Distributed Cloud control plane workloads. In the rare cases where you also need to install your own cert-manager in admin clusters, follow these instructions to avoid conflicts. Note that if you are an Apigee customer and you only need cert-manager for Apigee, you do not need to run the admin cluster commands.
Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from kube-system to the cluster resource namespace of your installed cert-manager. Your installed cert-manager namespace is cert-manager if you are using the upstream default cert-manager installation, but that depends on your installation.
Conflict with cert-manager when upgrading to version 1.9.2 or above
In releases 1.9.2 and above, monitoring-operator installs cert-manager in the cert-manager namespace. If for certain reasons you need to install your own cert-manager, follow these instructions to avoid conflicts:
Avoid conflicts during upgrade
Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up.
You can skip this step if you are using the upstream default cert-manager installation, or you are sure your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
In general, you shouldn't need to re-install cert-manager in admin clusters because admin clusters only run Google Distributed Cloud control plane workloads. In the rare cases where you also need to install your own cert-manager in admin clusters, follow these instructions to avoid conflicts. Note that if you are an Apigee customer and you only need cert-manager for Apigee, you do not need to run the admin cluster commands.
You can skip this step if you are using the upstream default cert-manager installation, or you are sure your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
False positives in docker, containerd, and runc vulnerability scanning
The docker, containerd, and runc packages in the Ubuntu OS images shipped with
Google Distributed Cloud are pinned to special versions using Ubuntu PPA.
This ensures that any container runtime changes are qualified by Google Distributed Cloud before each release.
However, the special versions are unknown to the Ubuntu CVE Tracker,
which is used as the vulnerability feed by various CVE scanning tools. Therefore, you will see false positives in docker, containerd, and runc
vulnerability scanning results.
For example, you might see the following false positives from your CVE scanning
results. These CVEs are already fixed in the latest patch versions of Google Distributed Cloud.
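To compare the scanner findings against what is actually installed on a node, a hedged sketch (the package names are assumptions and may differ by image release):
# Show installed versions of the pinned container runtime packages.
dpkg-query -W -f='${Package} ${Version}\n' docker.io containerd runc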
Unhealthy konnectivity server Pods when using the Seesaw or manual mode load balancer
If you are using Seesaw or the manual mode load balancer, you might notice the konnectivity server Pods are unhealthy. This happens because Seesaw does not support reusing an IP address across a service. For manual mode, creating a load balancer service does not automatically provision the service on your load balancer.
SSH tunneling is enabled in version 1.9 clusters. Thus, even if the konnectivity server is not healthy, you can still use the SSH tunnel, and connectivity to and within the cluster is not affected. Therefore, you do not need to be concerned about these unhealthy Pods.
If you plan to upgrade from version 1.9.0 to 1.9.x, it is recommended that you delete the unhealthy konnectivity server Deployments before upgrading.
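A hedged sketch of the delete, assuming the Deployment is named konnectivity-server in the kube-system namespace of the user cluster:
# Remove the unhealthy konnectivity server Deployment before upgrading.
kubectl delete deployment konnectivity-server -n kube-system \
    --kubeconfig USER_CLUSTER_KUBECONFIG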
Starting from Google Distributed Cloud version 1.7.2, the Ubuntu OS images are hardened with the CIS L1 Server Benchmark.
As a result, the cron script /etc/cron.daily/aide has been installed to schedule an aide check, which
ensures that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.
The cron job runs daily at 6:25 AM UTC. Depending on the number of files on the filesystem,
you may experience CPU and memory usage spikes around that time that are caused by thisaideprocess.
If the spikes are affecting your workload, you can disable the daily cron job:
`sudo chmod -x /etc/cron.daily/aide`.
Load balancers and NSX-T stateful distributed firewall rules interact unpredictably
When you deploy Google Distributed Cloud version 1.9 or later with the Seesaw bundled load balancer in an environment that uses NSX-T stateful distributed firewall rules, stackdriver-operator might fail to create the gke-metrics-agent-conf ConfigMap and cause gke-connect-agent Pods to be in a crash loop.
The underlying issue is that the stateful NSX-T distributed firewall rules terminate the connection from a client to the user cluster API server through the Seesaw load balancer, because Seesaw uses asymmetric connection flows. The integration issues with NSX-T distributed firewall rules affect all Google Distributed Cloud releases that use Seesaw. You might see similar connection problems on your own applications when they create large Kubernetes objects whose sizes are bigger than 32K. Follow these instructions to disable NSX-T distributed firewall rules, or to use stateless distributed firewall rules for Seesaw VMs.
If your clusters use a manual load balancer, follow these instructions to configure your load balancer to reset client connections when it detects a backend node failure. Without this configuration, clients of the Kubernetes API server might stop responding for several minutes when a server instance goes down.
Failure to register admin cluster during creation
If you create an admin cluster for version 1.9.x or 1.10.0, and the admin cluster fails to register with the provided gkeConnect spec during its creation, you will get the following error.
Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH
You will still be able to use this admin cluster, but you will get the following error if you later attempt to upgrade the admin cluster to version 1.10.y.
failed to migrate to first admin trust chain: failed to parse current version "":
invalid version: "" failed to migrate to first admin trust chain: failed to parse
current version "": invalid version: ""
If this error occurs, follow these steps to fix the cluster registration issue. After you do this fix, you can then upgrade your admin cluster.
Provide govc, the command line interface to vSphere, some variables declaring elements of your vCenter Server and vSphere environment.
export GOVC_URL=https://VCENTER_SERVER_ADDRESS
export GOVC_USERNAME=VCENTER_SERVER_USERNAME
export GOVC_PASSWORD=VCENTER_SERVER_PASSWORD
export GOVC_DATASTORE=VSPHERE_DATASTORE
export GOVC_DATACENTER=VSPHERE_DATACENTER
export GOVC_INSECURE=true
# DATA_DISK_NAME should not include the suffix ".vmdk"
export DATA_DISK_NAME=DATA_DISK_NAME
Replace the following:
VCENTER_SERVER_ADDRESS is your vCenter Server's IP address or hostname.
VCENTER_SERVER_USERNAME is the username of an account that holds the
Administrator role or equivalent privileges in vCenter Server.
VCENTER_SERVER_PASSWORD is the vCenter Server account's password.
VSPHERE_DATASTORE is the name of the datastore you've configured in your vSphere
environment.
VSPHERE_DATACENTER is the name of the datacenter you've configured in your
vSphere environment.
# Find out the gkeOnPremVersion
export KUBECONFIG=ADMIN_CLUSTER_KUBECONFIG
ADMIN_CLUSTER_NAME=$(kubectl get onpremadmincluster -n kube-system --no-headers | awk '{ print $1 }')
GKE_ON_PREM_VERSION=$(kubectl get onpremadmincluster -n kube-system $ADMIN_CLUSTER_NAME -o=jsonpath='{.spec.gkeOnPremVersion}')
# Replace the gkeOnPremVersion in temp-checkpoint.yaml
sed -i "s/gkeonpremversion: \"\"/gkeonpremversion: \"$GKE_ON_PREM_VERSION\"/" temp-checkpoint.yaml
# The steps below are only needed when upgrading from 1.9.x to 1.10.x clusters.
# Find out the provider ID of the admin control-plane VM
ADMIN_CONTROL_PLANE_MACHINE_NAME=$(kubectl get machines --no-headers | grep master | awk '{ print $1 }')
ADMIN_CONTROL_PLANE_PROVIDER_ID=$(kubectl get machines $ADMIN_CONTROL_PLANE_MACHINE_NAME -o=jsonpath='{.spec.providerID}' | sed 's/\//\\\//g')
# Fill in the providerID field in temp-checkpoint.yaml
sed -i "s/providerid: null/providerid: \"$ADMIN_CONTROL_PLANE_PROVIDER_ID\"/" temp-checkpoint.yaml
Replace ADMIN_CLUSTER_KUBECONFIG with the path of your admin cluster kubeconfig file.
Generate a new checksum.
Change the last line of the checkpoint file to
checksum:$NEW_CHECKSUM
Replace NEW_CHECKSUM with the output of the following command:
If you have experienced this issue with an existing cluster, you can do one of the following:
Disable Anthos Identity Service (AIS). If you disable AIS, that will not remove the deployed AIS binary or remove AIS ClientConfig. To disable AIS, run this command:
image: gcr.io/gke-on-prem-release/gke-metrics-agent:1.1.0-anthos.8 # use 1.1.0-anthos.8
imagePullPolicy: IfNotPresent
name: gke-metrics-agent
Cisco ACI doesn't work with Direct Server Return (DSR)
Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning. A possible workaround is to disable IP learning by adding the Seesaw IP address as a L4-L7 Virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).
You can configure the L4-L7 Virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning will result in IP endpoint flapping between different locations in the Cisco ACI fabric.
gkectl diagnose checking certificates failure
If your workstation does not have access to user cluster worker nodes, you will see the following failures when running gkectl diagnose. It is safe to ignore them.
Checking user cluster certificates...FAILURE
Reason: 3 user cluster certificates error(s).
Unhealthy Resources:
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[[["\u003cp\u003e\u003ccode\u003e/var/log/audit/\u003c/code\u003e may fill up disk space due to CIS Level 2 Benchmark compliance, which can be resolved by adjusting the \u003ccode\u003eauditd\u003c/code\u003e settings to rotate logs automatically, either manually on the admin workstation or via a DaemonSet for cluster nodes.\u003c/p\u003e\n"],["\u003cp\u003eThe \u003ccode\u003esystemd-timesyncd\u003c/code\u003e service may not run after a reboot on Ubuntu nodes, due to a conflict with \u003ccode\u003echrony\u003c/code\u003e, requiring manual restarts or deployment of a DaemonSet to ensure the service's continuous operation.\u003c/p\u003e\n"],["\u003cp\u003eUpgrading clusters may cause disruption for workloads with PodDisruptionBudgets (PDBs), and node upgrades might fail if PDBs do not allow any additional disruptions, so scaling up deployments or horizontal pod autoscalers is recommended.\u003c/p\u003e\n"],["\u003cp\u003eInstallation failures due to \u003ccode\u003ecert-manager-cainjector\u003c/code\u003e leader election issues in Anthos 1.9.0 can be mitigated by temporarily scaling down \u003ccode\u003emonitoring-operator\u003c/code\u003e, patching \u003ccode\u003ecert-manager-cainjector\u003c/code\u003e to disable leader election, and then restoring \u003ccode\u003emonitoring-operator\u003c/code\u003e after installation, while upgrading to 1.9.1 or later would prevent this issue.\u003c/p\u003e\n"],["\u003cp\u003eAdmin cluster certificates may expire and should be checked and renewed before upgrading, which includes backing up the \u003ccode\u003eadmin.conf\u003c/code\u003e file, renewing certificates with \u003ccode\u003ekubeadm\u003c/code\u003e, and restarting static Pods, and there are specific steps on how to handle conflicts with \u003ccode\u003ecert-manager\u003c/code\u003e when upgrading to version 1.9.0, 1.9.1, or later.\u003c/p\u003e\n"]]],[],null,[]]