This document describes known issues for version 1.7 of Google Distributed Cloud.
## User cluster upgrade fails due to 'failed to register to GCP'
Category
Upgrade
Identified Versions
1.7.0+, 1.8.0+
Symptoms
When upgrading user clusters to 1.7 versions, the `gkectl upgrade cluster` command fails with error messages similar to the following:

```
$ gkectl upgrade cluster --kubeconfig kubeconfig --config user-cluster.yaml
…
Upgrading to bundle version: "1.7.1-gke.4"
…
Exit with error:
failed to register to GCP, gcloud output: , error: error running command
'gcloud alpha container hub memberships register foo-cluster --kubeconfig kubeconfig --context cluster --version 20210129-01-00 --enable-workload-identity --has-private-issuer --verbosity=error --quiet': error: exit status 1, stderr: 'Waiting for membership to be created...
```
The errors indicate that the user cluster upgrade has mostly completed, except that the Connect Agent has not been upgraded. However, the functionality of GKE Connect should not be affected.
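To confirm which Connect Agent version is running, you can inspect the container images in the `gke-connect` namespace. This is a diagnostic sketch; the exact Pod names vary by installation:

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pods -n gke-connect \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```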
Cause
The Connect Agent version 20210129-01-00 used in 1.7 versions is out of support.
Workaround
Please contact Google support to mitigate the issue.
## systemd-timesyncd not running after reboot on Ubuntu Node
Category
OS
Identified Versions
1.7.1-1.7.5, 1.8.0-1.8.4, 1.9.0+
Symptoms
`systemctl status systemd-timesyncd` should show that the service is dead:

```
● systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
```
This can cause time synchronization issues.
Cause
chrony
was incorrectly installed on Ubuntu OS image, and there's conflict
between chrony
and systemd-timesyncd
, where systemd-timesyncd
would become
inactive and chrony
become active everytime Ubuntu VM got rebooted. However, systemd-timesyncd
should be the default ntp client for the VM.
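To confirm the conflict on an affected node, you can check the state of both services. This is a diagnostic sketch; on some Ubuntu versions the chrony unit is named `chronyd`:

```
systemctl is-active systemd-timesyncd chrony
```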
Workaround
Option 1: Manually run restart systemd-timesyncd
every time when VM got rebooted.
Option 2: Deploy the following Daemonset so that systemd-timesyncd
will always
be restarted if it's dead.
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ensure-systemd-timesyncd
spec:
  selector:
    matchLabels:
      name: ensure-systemd-timesyncd
  template:
    metadata:
      labels:
        name: ensure-systemd-timesyncd
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: ensure-systemd-timesyncd
        # Use your preferred image.
        image: ubuntu
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            echo $(date -u)
            echo "Checking systemd-timesyncd status..."
            chroot /host systemctl status systemd-timesyncd
            if (( $? != 0 )) ; then
              echo "Restarting systemd-timesyncd..."
              chroot /host systemctl start systemd-timesyncd
            else
              echo "systemd-timesyncd is running."
            fi;
            sleep 60
          done
        volumeMounts:
        - name: host
          mountPath: /host
        resources:
          requests:
            memory: "10Mi"
            cpu: "10m"
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
```
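To use this workaround, save the manifest to a file and apply it with `kubectl`. The filename below is only an example:

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG apply -f ensure-systemd-timesyncd.yaml
```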
## ClientConfig custom resource
`gkectl update` reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.
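A minimal backup sketch, assuming the default ClientConfig in the `kube-public` namespace (the same object patched by the OIDC workaround later on this page):

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get clientconfig default -n kube-public \
  -o yaml > clientconfig-backup.yaml
```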
## `kubectl describe CSINode` and `gkectl diagnose snapshot`
`kubectl describe CSINode` and `gkectl diagnose snapshot` sometimes fail due to the [OSS Kubernetes issue](https://github.com/kubernetes/kubectl/issues/848){:.external} on dereferencing nil pointer fields.
## OIDC and the CA certificate
The OIDC provider doesn't use the common CA by default. You must explicitly
supply the CA certificate.
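To check whether your provider's certificate chain verifies against the system CA bundle, you can run a quick probe like the following. This is a sketch that assumes the provider serves on port 443; a verification error indicates that you need to supply the CA certificate explicitly:

```
openssl s_client -connect YOUR_OIDC_PROVIDER_ADDRESS:443 -verify_return_error < /dev/null
```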
Upgrading the admin cluster from 1.5 to 1.6.0 breaks 1.5 user clusters that use an OIDC provider and have no value for `authentication.oidc.capath` in the [user cluster configuration file](/anthos/clusters/docs/on-prem/1.7/how-to/user-cluster-configuration-file).

To work around this issue, run the following script:
<section><pre class="devsite-click-to-copy">
USER_CLUSTER_KUBECONFIG=<var class="edit">YOUR_USER_CLUSTER_KUBECONFIG</var>
IDENTITY_PROVIDER=<var class="edit">YOUR_OIDC_PROVIDER_ADDRESS</var>
openssl s_client -showcerts -verify 5 -connect $IDENTITY_PROVIDER:443 < /dev/null | awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/{ if(/BEGIN CERTIFICATE/){i++}; out="tmpcert"i".pem"; print >out}'
ROOT_CA_ISSUED_CERT=$(ls tmpcert*.pem | tail -1)
ROOT_CA_CERT="/etc/ssl/certs/$(openssl x509 -in $ROOT_CA_ISSUED_CERT -noout -issuer_hash).0"
cat tmpcert*.pem $ROOT_CA_CERT > certchain.pem
CERT=$(echo $(base64 certchain.pem) | sed 's\ \\g')
rm tmpcert1.pem tmpcert2.pem
kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG patch clientconfig default -n kube-public --type json -p "[{ \"op\": \"replace\", \"path\": \"/spec/authentication/0/oidc/certificateAuthorityData\", \"value\":\"${CERT}\"}]"
</pre></section>
Replace the following:

* <var>YOUR_OIDC_PROVIDER_ADDRESS</var>: The address of your OIDC provider.
* <var>YOUR_USER_CLUSTER_KUBECONFIG</var>: The path of your user cluster kubeconfig file.
## `gkectl check-config` validation fails: can't find F5 BIG-IP partitions
<dl>
<dt>Symptoms</dt>
<dd><p>Validation fails because F5 BIG-IP partitions can't be found, even though they exist.</p></dd>
<dt>Potential causes</dt>
<dd><p>An issue with the F5 BIG-IP API can cause validation to fail.</p></dd>
<dt>Resolution</dt>
<dd><p>Try running <code>gkectl check-config</code> again.</p></dd>
</dl>
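For example, re-run the validation against your configuration file (a sketch; substitute the paths used in your environment):

```
gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config user-cluster.yaml
```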
## Disruption for workloads with PodDisruptionBudgets {:#workloads_pdbs_disruption}
Upgrading clusters can cause disruption or downtime for workloads that use [PodDisruptionBudgets](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/){:.external} (PDBs).
## Nodes fail to complete their upgrade process
If you have `PodDisruptionBudget` objects configured that are unable to allow any additional disruptions, nodes might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the `Deployment` or `HorizontalPodAutoscaler` to allow the node to drain while still respecting the `PodDisruptionBudget` configuration.
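For example, if a hypothetical Deployment has two replicas and its PDB requires `minAvailable: 2`, scaling to three replicas allows one Pod to be evicted during the drain:

```
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG scale deployment DEPLOYMENT_NAME \
  --namespace NAMESPACE --replicas=3
```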
To see all `PodDisruptionBudget` objects that do not allow any disruptions:

```
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
```
## Log Forwarder makes an excessive number of OAuth 2.0 requests

With Google Distributed Cloud version 1.7.1, you might experience issues with Log Forwarder consuming memory by making excessive OAuth 2.0 requests. Here is a workaround, in which you downgrade the `stackdriver-operator` version, clean up the disk, and restart Log Forwarder.
Step 0: Download images to your private registry if appropriate

If you use a private registry, follow these steps to download these images to your private registry before proceeding. Omit this step if you do not use a private registry.

Replace PRIVATE_REGISTRY_HOST with the hostname or IP address of your private Docker registry.

stackdriver-operator

```
docker pull gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440
docker tag gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440 \
  PRIVATE_REGISTRY_HOST/stackdriver-operator:v0.0.440
docker push PRIVATE_REGISTRY_HOST/stackdriver-operator:v0.0.440
```

fluent-bit

```
docker pull gcr.io/gke-on-prem-release/fluent-bit:v1.6.10-gke.3
docker tag gcr.io/gke-on-prem-release/fluent-bit:v1.6.10-gke.3 \
  PRIVATE_REGISTRY_HOST/fluent-bit:v1.6.10-gke.3
docker push PRIVATE_REGISTRY_HOST/fluent-bit:v1.6.10-gke.3
```

prometheus

```
docker pull gcr.io/gke-on-prem-release/prometheus:2.18.1-gke.0
docker tag gcr.io/gke-on-prem-release/prometheus:2.18.1-gke.0 \
  PRIVATE_REGISTRY_HOST/prometheus:2.18.1-gke.0
docker push PRIVATE_REGISTRY_HOST/prometheus:2.18.1-gke.0
```
Step 1: Downgrade the stackdriver-operator version

- Run the following command to downgrade your version of stackdriver-operator:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system patch deployment stackdriver-operator -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"stackdriver-operator","image":"gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440"}]}}}}'
```
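You can optionally confirm that the downgrade took effect by reading the image back from the Deployment (a verification sketch):

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system get deployment stackdriver-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```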
Step 2: Clean up the disk buffer for Log Forwarder

- Deploy the following DaemonSet in the cluster to clean up the buffer:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit-cleanup
namespace: kube-system
spec:
selector:
matchLabels:
app: fluent-bit-cleanup
template:
metadata:
labels:
app: fluent-bit-cleanup
spec:
containers:
- name: fluent-bit-cleanup
image: debian:10-slim
command: ["bash", "-c"]
args:
- |
rm -rf /var/log/fluent-bit-buffers/
echo "Fluent Bit local buffer is cleaned up."
sleep 3600
volumeMounts:
- name: varlog
mountPath: /var/log
securityContext:
privileged: true
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
- key: node-role.kubernetes.io/master
effect: NoSchedule
- key: node-role.gke.io/observability
effect: NoSchedule
volumes:
- name: varlog
hostPath:
path: /var/log
```

- Verify that the disk buffer is cleaned up:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] logs -n kube-system -l app=fluent-bit-cleanup | grep "cleaned up" | wc -l
```

The output shows the number of nodes in the cluster.

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system get pods -l app=fluent-bit-cleanup --no-headers | wc -l
```

The output also shows the number of nodes in the cluster; the two counts should match before you proceed.
- Delete the cleanup DaemonSet:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system delete ds fluent-bit-cleanup
```
Step 3: Restart Log Forwarder

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system rollout restart ds/stackdriver-log-forwarder
```
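You can watch the restart complete with the standard rollout status command:

```
kubectl --kubeconfig [CLUSTER_KUBECONFIG] -n kube-system rollout status ds/stackdriver-log-forwarder
```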
## Logs and metrics are not sent to project specified by stackdriver.projectID

In Google Distributed Cloud 1.7, logs are sent to the parent project of the service account specified in the `stackdriver.serviceAccountKeyPath` field of your cluster configuration file. The value of `stackdriver.projectID` is ignored. This issue will be fixed in an upcoming release.
As a workaround, view logs in the parent project of your logging-monitoring service account.
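To identify that parent project, you can read the `project_id` field from the key file referenced by `stackdriver.serviceAccountKeyPath` (a sketch; `SERVICE_ACCOUNT_KEY_FILE` is a placeholder for that path):

```
grep '"project_id"' SERVICE_ACCOUNT_KEY_FILE
```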
## Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.
Admin cluster certificate renewal process

- Make sure that OpenSSL is installed on the admin workstation before you begin.

- Get the IP address and SSH keys for the admin master node:

```
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secrets -n kube-system sshkeys \
  -o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
  ~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key

export MASTER_NODE_IP=$(kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -o \
  jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
  --selector='node-role.kubernetes.io/master')
```
- Check if the certificates are expired:

```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo kubeadm alpha certs check-expiration"
```

If the certificates are expired, you must renew them before upgrading the admin cluster.
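In the kubeadm version used by these releases, the renewal counterpart of the check command above is `kubeadm alpha certs renew all`. Treat the following as a sketch and confirm the exact renewal procedure with Google support before running it on a production control plane:

```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo kubeadm alpha certs renew all"
```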
- Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.

- Back up the admin cluster kubeconfig file:

```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
vi [ADMIN_CLUSTER_KUBECONFIG]
```