This page lists known issues for GKE networking. This page is for Admins and architects who manage the lifecycle of the underlying technology infrastructure, and respond to alerts and pages when service level objectives (SLOs) aren't met or applications fail.
Pod IP address leak on nodes with GKE Dataplane V2
Clusters with GKE Dataplane V2 enabled might experience Pod IP address exhaustion on nodes. This issue is caused by a container runtime bug that can leak allocated IP addresses when Pods hit transient CNI errors during creation.
The issue is triggered when the GKE cluster node is upgraded to or created with one of the following GKE versions:
- 1.33 and later
- 1.32 and later
- 1.31.2-gke.1115000 or later
- 1.30.8-gke.1051001 or later
- 1.29.10-gke.1059000 or later
- 1.28.15-gke.1024000 or later
When this issue occurs, new Pods that are scheduled on the affected node fail to start and return an error message similar to the following:
failed to assign an IP address to container
Workaround:
To mitigate this issue, apply the following DaemonSet to the cluster to clean up the leaked IP address resources:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cleanup-ipam-dir
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: cleanup-ipam
  template:
    metadata:
      labels:
        name: cleanup-ipam
    spec:
      hostNetwork: true
      securityContext:
        runAsUser: 0
        runAsGroup: 0
        seccompProfile:
          type: RuntimeDefault
      automountServiceAccountToken: false
      containers:
      - name: cleanup-ipam
        image: gcr.io/gke-networking-test-images/ubuntu-test:2022@sha256:6cfbdf42ccaa85ec93146263b6e4c60ebae78951bd732469bca303e7ebddd85e
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            for hash in $(find /hostipam -iregex '/hostipam/[0-9].*' -mmin +10 -exec head -n1 {} \; ); do
              hash="${hash%%[[:space:]]}"
              if [ -z $(ctr -n k8s.io c ls | grep $hash | awk '{print $1}') ]; then
                grep -ilr $hash /hostipam
              fi
            done | xargs -r rm
            echo "Done cleaning up /var/lib/cni/networks/gke-pod-network at $(date)"
            sleep 120s
          done
        volumeMounts:
        - name: host-ipam
          mountPath: /hostipam
        - name: host-ctr
          mountPath: /run/containerd
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
      volumes:
      - name: host-ipam
        hostPath:
          path: /var/lib/cni/networks/gke-pod-network
      - name: host-ctr
        hostPath:
          path: /run/containerd
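To deploy the cleanup DaemonSet, save the manifest to a file (the filename cleanup-ipam-dir.yaml is only an example) and apply it with kubectl:
kubectl apply -f cleanup-ipam-dir.yaml
You can then confirm that a cleanup Pod is running on each node:
kubectl get pods -n kube-system -l name=cleanup-ipam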
- 1.33.1-gke.1107000 and later
Ingress and Service load balancer outages on clusters with a legacy network
An incompatibility with legacy networks causes the backends of a GKE-managed load balancer deployed by using Ingress or Service to be detached. This results in the load balancer having no active backends, which causes all incoming requests to those load balancers to be dropped.
The issue impacts GKE clusters that use a legacy network and are on version 1.31 or later.
To identify GKE clusters with a legacy network, run the following command:
gcloud container clusters describe CLUSTER_NAME --location=LOCATION --format="value(subnetwork)"
For a cluster with a legacy network, this command returns empty output.
Workaround:
Because legacy networks have been deprecated for some time, the preferred solution is to migrate your legacy network to a VPC network by converting the legacy network that contains your GKE clusters. If you can't perform this migration at this time, reach out to Cloud Customer Care.
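As a sketch, and assuming your legacy network is named legacy-network, the conversion to a custom mode VPC network is started with a command like the following; review the full conversion procedure before running it against a network that serves production clusters:
gcloud compute networks update legacy-network --switch-to-custom-subnet-mode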
- 1.30.10-gke.1070000 and later
- 1.31.5-gke.1068000 and later
- 1.32.1-gke.1002000 and later
Newly created nodes are not added to the layer 4 internal load balancers
Google Cloud load balancers that are created for internal LoadBalancer Services might miss newly created nodes in the backend instance group.
The issue is most visible on a cluster that was scaled to zero nodes and then scaled back up to one or more nodes.
Workarounds:
- Turn on GKE subsetting and recreate the Service.
Note : GKE subsetting can't be turned off after it is turned on.
- Create another internal LoadBalancer Service. When it syncs, the instance group is also fixed for the affected Service. You can remove the new Service after the sync.
- Add and then remove the node.kubernetes.io/exclude-from-external-load-balancers label from one of the nodes (see the example after this list).
- Add a node to the cluster. You can remove the node after the Service starts functioning.
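A minimal sketch of the label-based workaround, assuming a node named NODE_NAME; the trailing hyphen in the second command removes the label:
kubectl label node NODE_NAME node.kubernetes.io/exclude-from-external-load-balancers=true
kubectl label node NODE_NAME node.kubernetes.io/exclude-from-external-load-balancers-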
- 1.31.7-gke.1158000 and later
- 1.32.3-gke.1499000 and later
Gateway API issues due to storedVersions removed from CRD status
The Kube-Addon-Manager in GKE incorrectly removes the v1alpha2 storedVersion from the status of Gateway API CRDs like gateway, httpRoute, gatewayClass, and referenceGrant. This problem occurs even when the cluster still has instances of those CRDs stored in the v1alpha2 format. If the GKE cluster version is upgraded without the storedVersions, the Gateway API calls fail. The failed calls might also break controllers that implement Gateway API.
Your cluster may be at risk if it meets all of the following conditions:
- The Gateway API is enabled on your cluster.
- You have, at any point in the past, installed a v1alpha2 version of a Gateway API CRD.
- Your cluster has been running on an affected GKE version.
Workaround:
The recommended workaround is to delay cluster upgrades until the issue is resolved.
Alternatively, if you need to upgrade the cluster version, you must update the storage version for all affected Gateway API CRDs to v1beta1. The following example updates the gatewayClass CRD:
- Check for the presence of the v1alpha2 storage version:
kubectl get crd gatewayclasses.gateway.networking.k8s.io -ojsonpath="{.status.storedVersions}"
- Adjust the storage version to v1beta1 by running the following on all GatewayClass resources present on the cluster:
kubectl annotate gatewayclass gateway-class-name bump-storage-version="yes"
- Remove the v1alpha2 storage version and set the storage version to v1beta1:
kubectl patch customresourcedefinitions gatewayclasses.gateway.networking.k8s.io --subresource='status' --type='merge' -p '{"status":{"storedVersions":["v1beta1"]}}'
- Perform the upgrade as usual.
- 1.32.3-gke.1170000 and later
New Pods failing to initialize and stuck in ContainerCreating
New Pods fail to be created and are stuck in the ContainerCreating state. When this issue happens, the service container logs the following:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "[sandbox-ID]": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
The issue impacts GKE clusters on versions from 1.32 up to, but not including, 1.32.3-gke.1170000 that were created at GKE version 1.31 or 1.32. The root cause is that an in-memory data structure that maintained the collection of allocated Cilium identities was not correctly synchronized with the Kubernetes API server state.
To confirm the GKE version that was used to create the cluster, query the initialClusterVersion field by using the following command:
gcloud container clusters describe [cluster_name] --location [location] --format='value(initialClusterVersion)'
If the GKE cluster has logging enabled, the cilium-agent container logs the unable to resolve identity: timed out waiting for cilium-operator to allocate CiliumIdentity for key message, which you can find in Logs Explorer by using the following query:
resource.type="k8s_container" resource.labels.container_name="cilium-agent"
Workaround:
A temporary mitigation is to restart the control plane by upgrading it to the same version that it is already running:
gcloud container clusters upgrade [cluster_name] --location [location] --cluster-version=[version] --master
NEG Controller stops managing endpoints when port removed from Service
When the NEG controller is configured to create a Standalone NEG for a Service and one of the configured ports is later removed from the Service, the NEG controller eventually stops managing endpoints for the NEG. In addition to Services where the user creates a Standalone NEG annotation, this also affects Services that are referenced by GKE Gateway, MCI, and GKE Multi Cluster Gateway.
Workaround:
When you remove a port from a Service that has a Standalone NEG annotation, you must also update the annotation to remove the port in question, as shown in the following example.
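A minimal sketch of this update, using hypothetical Service and port names: if the Service originally exposed ports 80 and 443 through the Standalone NEG annotation and port 443 is removed from the Service, the exposed_ports entry for 443 must be removed from the annotation as well:
apiVersion: v1
kind: Service
metadata:
  name: my-service          # hypothetical name
  annotations:
    # Before the change: cloud.google.com/neg: '{"exposed_ports": {"80":{},"443":{}}}'
    # After removing port 443 from the Service, drop it from the annotation too:
    cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
spec:
  selector:
    app: my-app             # hypothetical selector
  ports:
  - name: http
    port: 80
    targetPort: 8080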
Gateway TLS configuration error
We've identified an issue with configuring TLS for Gateways in clusters running GKE version 1.28.4-gke.1083000. This affects TLS configurations using either an SSLCertificate or a CertificateMap. If you're upgrading a cluster with existing Gateways, updates made to the Gateway will fail. For new Gateways, the load balancers won't be provisioned. This issue will be fixed in an upcoming GKE 1.28 patch version.
- 1.26.13-gke.1052000 and later
- 1.27.10-gke.1055000 and later
- 1.28.6-gke.1095000 and later
- 1.29.1-gke.1016000 and later
Intermittent connection establishment failures
Clusters on control plane versions 1.26.6-gke.1900 and later might encounter intermittent connection establishment failures.
The chance of failure is low, and the issue doesn't affect all clusters. The failures should stop completely a few days after the symptom onset.
- 1.27.11-gke.1118000 or later
- 1.28.7-gke.1100000 or later
- 1.29.2-gke.1217000 or later
DNS resolution issues with Container-Optimized OS
Workloads running on GKE clusters with Container-Optimized OS-based nodes might experience DNS resolution issues.
Network Policy drops a connection due to incorrect connection tracking lookup
For clusters with GKE Dataplane V2 enabled, when a client Pod connects to itself using a Service or the virtual IP address of an internal passthrough Network Load Balancer, the reply packet is not identified as a part of an existing connection due to incorrect conntrack lookup in the dataplane. This means that a Network Policy that restricts ingress traffic for the Pod is incorrectly enforced on the packet.
The impact of this issue depends on the number of configured Pods for the Service. For example, if the Service has 1 backend Pod, the connection always fails. If the Service has 2 backend Pods, the connection fails 50% of the time.
Workaround:
You can mitigate this issue by configuring the port and containerPort in the Service manifest to be the same value, as shown in the following example.
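A minimal sketch of this configuration, with hypothetical names: the Service port and the Pod's containerPort are both set to 8080, so no port translation occurs on the return path:
apiVersion: v1
kind: Service
metadata:
  name: example-service     # hypothetical name
spec:
  selector:
    app: example-app
  ports:
  - name: http
    port: 8080              # same value as the containerPort below
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app         # hypothetical name
  labels:
    app: example-app
spec:
  containers:
  - name: app
    image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
    ports:
    - containerPort: 8080   # matches the Service port above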
- 1.28.3-gke.1090000 or later
- 1.27.11-gke.1097000 or later
Packet drops for hairpin connection flows
For clusters with GKE Dataplane V2 enabled, when a Pod creates a TCP connection to itself using a Service, such that the Pod is both the source and destination of the connection, GKE Dataplane V2 eBPF connection tracking incorrectly tracks the connection states, leading to leaked conntrack entries.
When a connection tuple (protocol, source/destination IP, and source/destination port) has been leaked, new connections using the same connection tuple might result in return packets being dropped.
Workaround:
Use one of the following workarounds:
- Enable TCP reuse (keep-alives) for an application running in a Pod that might communicate with itself using a Service. This prevents the TCP FIN flag from being issued and avoids leaking the conntrack entry.
- When using short-lived connections, expose the Service using a proxy load balancer, such as Gateway (see the sketch after this list). This results in the destination of the connection request being set to the load balancer IP address, preventing GKE Dataplane V2 from performing SNAT to the loopback IP address.
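A sketch of the proxy-based approach under the following assumptions: the cluster uses the regional internal Application Load Balancer GatewayClass (gke-l7-rilb), and the resource and Service names are hypothetical:
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: internal-http        # hypothetical name
spec:
  gatewayClassName: gke-l7-rilb
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: example-route        # hypothetical name
spec:
  parentRefs:
  - name: internal-http
  rules:
  - backendRefs:
    - name: example-service  # hypothetical backend Service
      port: 8080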
Device typed network in GKE multi-network fails with long network names
Cluster creation fails with the following error:
error starting very-long-string-that-exceeds-character-limit-gpu-nic0 device plugin endpoint: listen unix /var/lib/kubelet/plugins_registry/networking.gke.io.networks_very-long-string-that-exceeds-character-limit-gpu-nic0.sock: bind: invalid argument
Workaround:
Limit the length of device-typed network object names to 41 characters or less. The full path of each UNIX domain socket includes the corresponding network name, and Linux has a limitation on socket path lengths (under 107 bytes). After accounting for the directory, filename prefix, and the .sock extension, the network name is limited to a maximum of 41 characters.
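The 41-character figure follows from the fixed parts of the socket path shown in the error message, assuming the socket is always created under the kubelet plugin registry directory:
/var/lib/kubelet/plugins_registry/   34 characters
networking.gke.io.networks_          27 characters
.sock                                 5 characters
fixed total                          66 characters
107 - 66 = 41 characters left for the network name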
- 1.30.4-gke.1282000 or later
- 1.29.8-gke.1157000 or later
- 1.28.13-gke.1078000 or later
- 1.27.16-gke.1342000 or later
Connectivity issues for hostPort Pods after control plane upgrade
Clusters with network policy enabled might experience connectivity issues with hostPort Pods. Additionally, newly created Pods might take an additional 30 to 60 seconds to be ready.
The issue is triggered when the GKE control plane of a cluster is upgraded to one of the following GKE versions:
- 1.30 to 1.30.4-gke.1281999
- 1.29.1-gke.1545000 to 1.29.8-gke.1156999
- 1.28.7-gke.1042000 to 1.28.13-gke.1077999
- 1.27.12-gke.1107000 to 1.27.16-gke.1341999
Workaround:
Upgrade or recreate the nodes immediately after the GKE control plane upgrade, for example by upgrading the node pools as shown in the following command.
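A sketch of the node upgrade, with hypothetical cluster, location, and node pool names; run it for each node pool after the control plane upgrade completes:
gcloud container clusters upgrade CLUSTER_NAME --location LOCATION --node-pool NODE_POOL_NAME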
- 1.32.1-gke.1729000 or later
- 1.31.6-gke.1020000 or later
Broken UDP traffic between Pods that run on the same node
Clusters with intra-node visibility enabled might experience broken UDP traffic between Pods that run on the same node.
The issue is triggered when the GKE cluster node is upgraded to or created with one of the following GKE versions:
- 1.32.1-gke.1729000 or later
- 1.31.6-gke.1020000 or later
The impacted path is Pod-to-Pod UDP traffic on the same node through hostPort or a Service.
Resolution:
Upgrade the cluster to one of the following fixed versions:
- 1.32.3-gke.1927000 or later
- 1.31.7-gke.1390000 or later
Calico Pods not healthy on clusters with fewer than 3 total nodes and insufficient vCPU
The calico-typha and calico-node Pods can't be scheduled on clusters that meet all of the following conditions: fewer than 3 nodes total, each node having 1 or fewer allocatable vCPUs, and network policy enabled. This is due to insufficient CPU resources.
Workarounds:
- Scale up to a minimum of 3 node pools, each with 1 node that has 1 allocatable vCPU.
- Resize a single node pool to a minimum of 3 nodes, each with 1 allocatable vCPU (see the command after this list).
- Use a machine type with at least 2 allocatable vCPUs on a node pool with a single node.
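A sketch of the resize workaround, with hypothetical cluster, location, and node pool names:
gcloud container clusters resize CLUSTER_NAME --location LOCATION --node-pool NODE_POOL_NAME --num-nodes 3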