Using GKE Dataplane V2

This page explains how to enable and troubleshoot GKE Dataplane V2 for Google Kubernetes Engine (GKE) clusters.

New Autopilot clusters have GKE Dataplane V2 enabled in versions 1.22.7-gke.1500 and later and versions 1.23.4-gke.1500 and later. If you're experiencing issues with using GKE Dataplane V2, skip to Troubleshooting .

Creating a GKE cluster with GKE Dataplane V2

You can enable GKE Dataplane V2 when you create new clusters with GKE version 1.20.6-gke.700 and later by using the gcloud CLI or the GKE API. You can also enable GKE Dataplane V2 in Preview when you create new clusters with GKE version 1.17.9 and later

Console

To create a new cluster with GKE Dataplane V2, perform the following tasks:

In the Google Cloud console, go to the Create a Kubernetes cluster page.

Go to Create a Kubernetes cluster
In the Networking section, select the Enable Dataplane V2checkbox. The Enable Kubernetes Network Policy option is disabled when you select Enable Dataplane V2 because network policy enforcement is built into GKE Dataplane V2.
Click Create.

gcloud

To create a new cluster with GKE Dataplane V2, use the following command:

 gcloud  
container  
clusters  
create  
 CLUSTER_NAME 
  
 \ 
  
--enable-dataplane-v2  
 \ 
  
--enable-ip-alias  
 \ 
  
--release-channel  
 CHANNEL_NAME 
  
 \ 
  
--location  
 COMPUTE_LOCATION

Replace the following:

CLUSTER_NAME : the name of your new cluster.
CHANNEL_NAME : a release channel that includes GKE version 1.20.6-gke.700 or later. If you prefer not to use a release channel, you can also use the --cluster-version flag instead of --release-channel , specifying version 1.20.6-gke.700 or later.
COMPUTE_LOCATION : the Compute Engine location for the new cluster.

API

To create a new cluster with GKE Dataplane V2, specify the datapathProvider field in the networkConfig object in your cluster create request .

The following JSON snippet shows the configuration needed to enable GKE Dataplane V2:

  "cluster" 
 :{ 
  
 "initialClusterVersion" 
 : 
 " VERSION 
" 
 , 
  
 "ipAllocationPolicy" 
 :{ 
  
 "useIpAliases" 
 : 
 true 
  
 }, 
  
 "networkConfig" 
 :{ 
  
 "datapathProvider" 
 : 
 "ADVANCED_DATAPATH" 
  
 }, 
  
 "releaseChannel" 
 :{ 
  
 "channel" 
 : 
 " CHANNEL_NAME 
" 
  
 } 
 }

Replace the following:

VERSION : your cluster version, which must be GKE 1.20.6-gke.700 or later.
CHANNEL_NAME : a release channel that includes GKE version 1.20.6-gke.700 or later.

Troubleshooting issues with GKE Dataplane V2

This section shows you how to investigate and resolve issues with GKE Dataplane V2.

Confirm that GKE Dataplane V2 is enabled:
```
 kubectl  
-n  
kube-system  
get  
pods  
-l  
k8s-app = 
cilium  
-o  
wide 
```
If GKE Dataplane V2 is running, the output includes Pods with the prefix anetd- . anetd is the networking controller for GKE Dataplane V2.

If the issue is with services or network policy enforcement, check the anetd Pod logs. Use the following log selectors in Cloud Logging:

 resource.type = 
 "k8s_container" 
labels. "k8s-pod/k8s-app" 
 = 
 "cilium" 
resource.labels.cluster_name = 
 " CLUSTER_NAME 
"

If Pod creation is failing, check the kubelet logs for clues. Use the following log selectors in Cloud Logging:
```
 resource.type = 
 "k8s_node" 
 log_name 
 = 
~ ".*/logs/kubelet" 
resource.labels.cluster_name = 
 " CLUSTER_NAME 
" 
 
```
Replace CLUSTER_NAME with the name of the cluster, or remove it entirely to see logs for all clusters.
If the anetd Pods are not running, examine the cilium-config ConfigMap for any modifications. Avoid altering existing fields within this ConfigMap, because such changes can destabilize the cluster and disrupt anetd . The ConfigMap gets patched back to the default state only if new fields are added to it. Any changes to existing fields are not patched back, and we recommend not changing or customizing the ConfigMap.

Known issues

When you use GKE Dataplane V2, you might encounter the following known issues.

Connection timeouts for not-ready Pods

When a Pod is not ready, connections to the associated Service can time out. This is the expected behavior for GKE Dataplane V2, and it differs from kube-proxy, which can return a faster connection refused error.

Identity-Relevant Label filtering for Cilium Identity doesn't take effect and Pods are stuck in ContainerCreating state

Affected versions: 1.34, 1.35

In GKE Dataplane V2 clusters, emergency use of Identity-Relevant Label filtering via kube-system/cilium-config-emergency-override ConfigMap is not correctly applied at the affected versions.

This approach limits which Pod labels are used for Cilium Identity generation.

When other mechanisms of preventing/removing high cardinality label key/values from Pods is not available (such as when labels are applied by a tool or framework), Identity-Relevant Label filtering can be used to exclude the label keys from Cilium Identity calculation. For more information about configuring these rules, see Identity-Relevant Labels in the Cilium documentation.

For the affected GKE versions, Cilium identities created by the operator continue to include the excluded labels.

Symptoms

Pods with labels that should be filtered for Cilium Identity generation might fail to start and get stuck in the ContainerCreating state. Pod events might show timeout errors:

 {"level":"warning", "msg":"Error changing endpoint identity", "error":"unable to resolve identity: timed out waiting for cilium-operator to allocate CiliumIdentity for key ...;, error: exponential backoff cancelled via context: context canceled", "k8sPodName":"...", "subsys":"endpoint"}

Instead of sharing identities based on filtered labels, Pods with unique label values continue to generate unique Cilium Identities. This can lead to a sharp increase of identities, potentially exhausting available Cilium Identities (up to a limit of 65,536) and causing scalability issues.

Fixed versions

To fix this issue, upgrade your cluster to one of the following GKE versions:

1.34.6-gke.1307000 or later
1.35.2-gke.1962000 or later

Workaround

As a workaround, apply the label filtering rules to the data.labels field in the main cilium-config ConfigMap and remove them from cilium-config-emergency-override . This situation persists through control plane operations, such as upgrades, because GKE preserves user modifications to fields it does not manage within the cilium-config ConfigMap.

Remove the labels key from the data section of the cilium-config-emergency-override ConfigMap if it exists.

Edit the cilium-config ConfigMap by adding or modifying the labels key in the data section. For example, to prevent labels named uuid from being used for identity generation:

  apiVersion 
 : 
  
 v1 
 kind 
 : 
  
 ConfigMap 
 metadata 
 : 
  
 name 
 : 
  
 cilium-config 
  
 namespace 
 : 
  
 kube-system 
 data 
 : 
  
 # ... other existing keys 
  
 labels 
 : 
  
 "!uuid" 
  
 # ... other existing keys

Restart the anet-operator on the control plane by upgrading the control plane to the same version it is running. This forces the operator to restart and reload its configuration:

 gcloud  
container  
clusters  
upgrade  
 CLUSTER_NAME 
  
 \ 
  
--location  
 CLUSTER_LOCATION 
  
 \ 
  
--project  
 PROJECT_ID 
  
 \ 
  
--cluster-version  
 $( 
gcloud  
container  
clusters  
describe  
 CLUSTER_NAME 
  
--location  
 CLUSTER_LOCATION 
  
--project  
 PROJECT_ID 
  
--format = 
 "value(currentMasterVersion)" 
 ) 
  
 \ 
  
--master

After the control plane restarts, restart the anetd DaemonSet to ensure node agents also pick up any required changes:
```
 kubectl  
rollout  
restart  
daemonset  
anetd  
-n  
kube-system 
```

Intermittent connectivity issues related to `NodePort` range conflicts in GKE Dataplane V2 clusters

In GKE Dataplane V2 clusters, intermittent connectivity problems can occur for masqueraded traffic or with ephemeral port usage. These problems are due to the potential port conflicts with the reserved NodePort range and typically happen in the following scenarios:

Custom ip-masq-agent :If you use a custom ip-masq-agent (version 2.10 or later), where the cluster has NodePort or Load Balancer services, you might observe intermittent connectivity issues due to their conflict with the NodePort range. Since version 2.10 and later, ip-masq-agent has the --random-fully argument implemented internally by default. To mitigate this, explicitly set --random-fully=false (applicable since version 2.11) under arguments in your ip-masq-agent configuration. For configuration details, see Configuring an IP masquerade agent in Standard clusters .
Ephemeral port range overlap:If the ephemeral port range that's defined by net.ipv4.ip_local_port_range on your GKE nodes overlaps with the NodePort range (30000-32767), it can also trigger connectivity issues. To prevent this problem, ensure that these two ranges don't overlap.

Review your ip-masq-agent configuration and ephemeral port range settings to ensure they don't conflict with the NodePort range. If you encounter intermittent connectivity issues, consider these potential causes and adjust your configuration accordingly.

Connectivity issues with `hostPort` in GKE Dataplane V2 clusters

Affected GKE versions: 1.29 and later

In clusters that use GKE Dataplane V2, you might encounter connectivity failures when traffic targets a node's IP:Port where port is the hostPort defined on the Pod. These issues arise in two primary scenarios:

Nodes with hostPort behind a passthrough Network Load Balancer:

hostPort ties a Pod to a specific node's port, and a passthrough Network Load Balancer distributes traffic across all nodes. When you expose Pods to the internet using hostPort and a passthrough Network Load Balancer, the load balancer might send traffic to a node where the Pod isn't running, causing connection failures. This is due to a known limitation in GKE Dataplane V2 where passthrough Network Load Balancer traffic is not consistently forwarded to hostPort Pods.

Workaround:When exposing hostPort s of a Pod on the node with a passthrough Network Load Balancer, specify the internal or external IP address of the Network Load Balancer in the Pod's hostIP field.
```
 ports:
- containerPort: 62000
  hostPort: 62000
  protocol: TCP
  hostIP: 35.232.62.64
- containerPort: 60000
  hostPort: 60000
  protocol: TCP
  hostIP: 35.232.62.64
  # Assuming 35.232.62.64 is the external IP address of a passthrough Network Load Balancer. 
```
hostPort conflict with reserved NodePort range:

If a Pod's hostPort conflicts with the reserved NodePort range (30000-32767), Cilium might fail to forward traffic to the Pod. This behavior has been observed in cluster versions 1.29 and later as Cilium now manages hostPort capabilites, replacing the previous Portmap method. This is an expected behavior for Cilium and is mentioned in their public documentation.

We don't plan to fix these limitations in later versions. The root cause of these issues is related to Cilium's behavior and outside the direct control of GKE.

Recommendation:We recommend that you migrate to NodePort Services instead of hostPort for improved reliability. NodePort Services provide similar capabilities.

Network Policy port ranges don't take effect

If you specify an endPort field in a Network Policy on a cluster that has GKE Dataplane V2 enabled, it won't take effect.

Kubernetes Network Policy API lets you specify a range of ports where the Network Policy is enforced. This API is supported in clusters with Calico Network Policy but is not supported in clusters with GKE Dataplane V2.

You can verify the behavior of your NetworkPolicy objects by reading them back after writing them to the API server. If the object still contains the endPort field, the feature is enforced. If the endPort field is missing, the feature is not enforced. In all cases, the object stored in the API server is the source of truth for the Network Policy.

For more information see KEP-2079: Network Policy to support Port Ranges .

Fixed versions

To fix this issue, upgrade your cluster to GKE versions 1.32 or later

Network Policy drops a connection due to incorrect connection tracking lookup

When a client Pod connects to itself using a Service or the virtual IP address of an internal passthrough Network Load Balancer, the reply packet is not identified as a part of an existing connection due to incorrect conntrack lookup in the dataplane. This means that a Network Policy that restricts ingress traffic for the Pod is incorrectly enforced on the packet.

The impact of this issue depends on the number of configured Pods for the Service. For example, if the Service has 1 backend Pod, the connection always fails. If the Service has 2 backend Pods, the connection fails 50% of the time.

Fixed versions

To fix this issue, upgrade your cluster to one of the following GKE versions:

1.28.3-gke.1090000 or later.

Workarounds

You can mitigate this issue by configuring the port and containerPort in the Service manifest to be the same value.

Packet drops for hairpin connection flows

When a Pod creates a TCP connection to itself using a Service, such that the Pod is both the source and destination of the connection, GKE Dataplane V2 eBPF connection tracking incorrectly tracks the connection states, leading to leaked conntrack entries.

When a connection tuple (protocol, source/destination IP, and source/destination port) has been leaked, new connections using the same connection tuple might result in return packets being dropped.

Fixed versions

To fix this issue, upgrade your cluster to one of the following GKE versions:

1.28.3-gke.1090000 or later
1.27.11-gke.1097000 or later

Workarounds

Use one of the following workarounds:

Enable TCP reuse (keep-alives) for applications running in Pods that might communicate with itself using a Service. This prevents the TCP FIN flag from being issued and avoid leaking the conntrack entry.
When using short-lived connections, expose the Pod using a proxy load balancer, such as Gateway , to expose the Service. This results in the destination of the connection request being set to the load balancer IP address, preventing GKE Dataplane V2 from performing SNAT to the loopback IP address.

Upgrade of GKE control plane causes `anetd` Pod deadlock

When you upgrade a GKE cluster that has GKE Dataplane V2 (advanced datapath) enabled from version 1.27 to 1.28, you might encounter a deadlock situation. Workloads might experience disruptions due to the inability to terminate old Pods or schedule necessary components like anetd .

Cause

The cluster upgrade process increases the resource requirement for the GKE Dataplane V2 components. This increase might lead to resource contention, which disrupts communication between the Cilium Container Network Interface (CNI) plugin and the Cilium daemon.

Symptoms

You might see the following symptoms:

anetd Pods remain stuck in a Pending state.
Workload Pods get stuck in a Terminating state.
Errors indicating Cilium communication failures, such as failed to connect to Cilium daemon .

Errors during network resource cleanup for Pod sandboxes, for example:

 1rpc error: code = Unknown desc = failed to destroy network for sandbox "[sandbox_id]": plugin type="cilium-cni" failed (delete): unable to connect to Cilium daemon... connection refused

Workaround

Standard clusters: To resolve the issue and allow the anetd Pod to be scheduled, temporarily increase the allocatable resources on the affected node.

To identify the affected node and to check its allocatable CPU and memory, run the following command:

 kubectl  
get  
nodes  
 $NODE_NAME 
  
-o  
json  
 | 
  
jq  
 '.status.allocatable | {cpu, memory}'

To temporarily increase the allocatable CPU and memory, run the following command:

 kubectl  
patch  
node  
 $NODE_NAME 
  
-p  
 '{"status":{"allocatable":{"cpu": CPU_VALUE 
, "memory": MEMORY_VALUE 
}}}'

Autopilot clusters: To resolve the deadlock issue on Autopilot clusters, free up resources by force deleting the affected Pod:

 kubectl  
delete  
pod  
 POD_NAME 
  
-n  
 NAMESPACE 
  
--grace-period = 
 0 
  
--force

Replace the following:

POD_NAME : the name of the Pod.
NAMESPACE : the namespace of the Pod.

After you increase the allocatable resources on the node and when the upgrade from GKE version 1.27 to 1.28 completes, the anetd Pod runs on the newer version.

Nodes in `NodeNotReady` state due to missing `containerID` error

When clusters are upgraded to GKE version 1.35.1-gke.1616000 and later, nodes might immediately enter a NodeNotReady state if both GKE Dataplane V2 and Cloud Service Mesh are enabled.

Cause

Starting with GKE version 1.35.1-gke.1616000, GKE Dataplane V2 clusters use CNI version 1.1.0 in their CNI configuration files. This change requires downstream CNI plugins, such as Google Managed Istio, to also support CNI version 1.1.0. Because of a delay in the Managed Istio rollout, some clusters have not yet received the compatible version (1.23), leading to the initialization failure.

Symptoms

Affected nodes immediately show as NodeNotReady . The following error message appears in the containerd logs:

 NetworkPluginNotReady message:Network plugin returns error: missing containerID

Workaround

To mitigate the issue, downgrade the impacted cluster to a GKE version earlier than 1.35.1-gke.1616000.

Custom eBPF programs interference

GKE uses eBPF programs to manage networking for GKE Dataplane V2. If you deploy custom eBPF programs on GKE-managed node network interfaces, these programs can interfere with GKE-managed eBPF programs and cause networking issues.

GKE doesn't support custom eBPF programs attached to the following network interfaces:

eth*
ens4
lo
cilium*
gke*
veth*

The presence of custom eBPF programs on these interfaces can interfere with the GKE Dataplane V2 anetd agent-installed programs, which can disrupt cluster networking. We recommend that you remove any custom eBPF programs or workloads that inject such programs from your cluster.

Discover custom eBPF programs

To discover custom eBPF programs running on cluster nodes, you can create a DaemonSet configured with the hostNetwork: true setting, that uses bpftool to query such eBPF programs:

  apiVersion 
 : 
  
 apps/v1 
 kind 
 : 
  
 DaemonSet 
 metadata 
 : 
  
 name 
 : 
  
 bpftool-logger 
  
 labels 
 : 
  
 app 
 : 
  
 bpftool-logger 
 spec 
 : 
  
 selector 
 : 
  
 matchLabels 
 : 
  
 app 
 : 
  
 bpftool-logger 
  
 template 
 : 
  
 metadata 
 : 
  
 labels 
 : 
  
 app 
 : 
  
 bpftool-logger 
  
 spec 
 : 
  
 hostPID 
 : 
  
 true 
  
 hostNetwork 
 : 
  
 true 
  
 containers 
 : 
  
 - 
  
 name 
 : 
  
 bpftool 
  
 image 
 : 
  
 ubuntu:22.04 
  
 securityContext 
 : 
  
 privileged 
 : 
  
 true 
  
 env 
 : 
  
 - 
  
 name 
 : 
  
 NODE_NAME 
  
 valueFrom 
 : 
  
 fieldRef 
 : 
  
 fieldPath 
 : 
  
 spec.nodeName 
  
 command 
 : 
  
 - 
  
 /bin/bash 
  
 - 
  
 -c 
  
 - 
  
 | 
  
 echo "Installing dependencies..." 
  
 apt-get update -y > /dev/null 2>&1 
  
 apt-get install -y curl tar > /dev/null 2>&1 
  
 echo "Downloading and setting up bpftool..." 
  
 curl -sL https://github.com/libbpf/bpftool/releases/download/v7.7.0/bpftool-v7.7.0-amd64.tar.gz | tar xz 
  
 chmod +x bpftool 
  
 mv bpftool /usr/local/bin/ 
  
 echo "========== $(date) | Node: ${NODE_NAME} ==========" 
  
 bpftool net | grep -E '^(eth|ens4|lo|cilium|gke|veth)' | grep -v ' cil_' 
  
 sleep infinity

Save the manifest as ebpf-discovery.yaml and apply the DaemonSet:
```
 kubectl  
apply  
-f  
ebpf-discovery.yaml 
```

Wait for the Pods to be running:

 kubectl  
rollout  
status  
ds/bpftool-logger

Check the logs from the Pods to discover eBPF programs:

 kubectl  
logs  
-l  
 app 
 = 
bpftool-logger

When you have finished, delete the DaemonSet:

 kubectl  
delete  
-f  
ebpf-discovery.yaml

What's next

Learn how to use network policy logging .
Learn how to control communication between Pods and Services using network policies .
Learn more about GKE Dataplane V2 .
Learn more about GKE Dataplane V2 observability .
Learn how to configure GKE Dataplane V2 observability .

Using GKE Dataplane V2 Stay organized with collections Save and categorize content based on your preferences.

Creating a GKE cluster with GKE Dataplane V2

Console

gcloud

API

Troubleshooting issues with GKE Dataplane V2

Known issues

Connection timeouts for not-ready Pods

Identity-Relevant Label filtering for Cilium Identity doesn't take effect and Pods are stuck in ContainerCreating state

Intermittent connectivity issues related to NodePort range conflicts in GKE Dataplane V2 clusters

Connectivity issues with hostPort in GKE Dataplane V2 clusters

Network Policy port ranges don't take effect

Fixed versions

Network Policy drops a connection due to incorrect connection tracking lookup

Fixed versions

Workarounds

Packet drops for hairpin connection flows

Fixed versions

Workarounds

Upgrade of GKE control plane causes anetd Pod deadlock

Cause

Symptoms

Workaround

Nodes in NodeNotReady state due to missing containerID error

Cause

Symptoms

Workaround

Custom eBPF programs interference

Discover custom eBPF programs

What's next

Using GKE Dataplane V2

Intermittent connectivity issues related to `NodePort` range conflicts in GKE Dataplane V2 clusters

Connectivity issues with `hostPort` in GKE Dataplane V2 clusters

Upgrade of GKE control plane causes `anetd` Pod deadlock

Nodes in `NodeNotReady` state due to missing `containerID` error