Resolving traffic management issues in Cloud Service Mesh
This section explains common Cloud Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.
API server connection errors in istiod logs
istiod cannot contact the API server if you see errors similar to the following:
```
error k8s.io/client-go@v0.18.0/tools/cache/reflector.go:125: Failed to watch *crd.IstioSomeCustomResource: …dial tcp 10.43.240.1:443: connect: connection refused
```
You can use the regular expression string `/error.*cannot list resource/` to find this error in the logs.
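As a quick sanity check, the snippet below applies that regular expression to a made-up log line; the line itself is illustrative, and real messages must come from your istiod logs (for example, via `kubectl logs` against the istiod deployment):

```shell
# Illustrative log line only -- real messages come from the istiod logs.
line='error k8s.io/client-go/tools/cache/reflector.go:125: ... cannot list resource "secrets"'

# The same regular expression suggested above:
echo "$line" | grep -E 'error.*cannot list resource' && echo "matched"
```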
This error is usually transient; by the time you retrieve the proxy logs with `kubectl`, the issue might already be resolved. The error is usually caused by events that make the API server temporarily unavailable, such as when an API server that is not in a high-availability configuration restarts for an upgrade or autoscaling change.
The `istio-init` container crashes
This problem can occur when the pod's iptables rules are not applied to the pod network namespace. This can be caused by:
- An incomplete istio-cni installation
- Insufficient workload pod permissions (a missing `CAP_NET_ADMIN` capability)
If you use the Istio CNI plugin, verify that you followed the instructions completely. Verify that the `istio-cni-node` container is ready, and check its logs. If the problem persists, connect to the host node over SSH, search the node logs for `nsenter` commands, and check whether any errors are present.
If you don't use the Istio CNI plugin, verify that the workload pod has the `CAP_NET_ADMIN` capability, which is set automatically by the sidecar injector.
Connection refused after pod starts
When a Pod starts and gets `connection refused` while trying to connect to an endpoint, the problem might be that the application container started before the `istio-proxy` container. In this case, the application container sends the request to `istio-proxy`, but the connection is refused because `istio-proxy` isn't listening on the port yet.
In this case, you can:
- Modify your application's startup code to poll the `istio-proxy` health endpoint until it returns a 200 response. The `istio-proxy` health endpoint is `http://localhost:15020/healthz/ready`.
- Add a retry mechanism to your application workload.
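The first option can be sketched as a small entrypoint wrapper. This is a minimal example, not an official snippet: the function name, the attempt cap, and the final `exec` of your application binary are placeholders; only the health endpoint URL comes from the text above.

```shell
#!/bin/sh
# wait_for_proxy polls a readiness URL until it returns success, up to a
# maximum number of attempts. Defaults to the istio-proxy health endpoint.
wait_for_proxy() {
  url="${1:-http://localhost:15020/healthz/ready}"
  max="${2:-30}"
  i=0
  while [ "$i" -lt "$max" ]; do
    if curl -fsS -o /dev/null "$url"; then
      return 0          # sidecar is ready
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1              # gave up waiting
}

# Example container entrypoint (my-app is a placeholder binary):
# wait_for_proxy && exec /usr/local/bin/my-app
```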
Listing gateways returns empty
Symptom: When you list Gateways using `kubectl get gateway --all-namespaces` after successfully creating a Cloud Service Mesh Gateway, the command returns `No resources found`.
This problem can happen on GKE 1.20 and later because the GKE Gateway controller automatically installs the GKE Gateway `networking.x-k8s.io/v1alpha1` resource in clusters. To work around the issue:
- Check whether there are multiple gateway custom resources in the cluster:

  ```
  kubectl api-resources | grep gateway
  ```

  Example output:

  ```
  gateways         gw    networking.istio.io/v1beta1    true     Gateway
  gatewayclasses   gc    networking.x-k8s.io/v1alpha1   false    GatewayClass
  gateways         gtw   networking.x-k8s.io/v1alpha1   true     Gateway
  ```
- If the list shows entries other than Gateways with the apiVersion `networking.istio.io/v1beta1`, use the full resource name or a distinguishable short name in the `kubectl` command. For example, run `kubectl get gw` or `kubectl get gateways.networking.istio.io` instead of `kubectl get gateway` to make sure Istio Gateways are listed.
For more information on this issue, see Kubernetes Gateways and Istio Gateways.
Envoy proxy hanging on initialization
If debug logs indicate that the Envoy proxy is hanging during initialization, you can use the following command to identify what is blocking the process:
```
kubectl exec -it POD_NAME -n NAMESPACE -c istio-proxy -- /usr/local/bin/pilot-agent request POST /init_dump
```
Troubleshooting 5xx HTTP response errors
You may encounter 5xx HTTP response errors when accessing applications through the Istio Ingress Gateway. Follow these steps to diagnose and resolve the issue.
- Identify the control plane
- Verify potential misconfigurations
- Verify pod discovery
- Analyze access logs
Identify the control plane
Determine the control plane version and configuration because different versions may influence diagnostic processes. To verify the state of the managed control plane, run the following command:
```
gcloud container fleet mesh describe --project PROJECT_ID
```
An `ACTIVE` state indicates the managed control plane is running normally.
Verify potential misconfigurations
Common misconfigurations can lead to routing failures:
- Namespace Mismatch: The `VirtualService` must be in the same namespace as the backend GKE service.
- Gateway Reference Mismatch: The `VirtualService` must explicitly reference the correct `Gateway`. For example, if the `VirtualService` references `istio-system/istio-ingressgateway` but the gateway is in the `default` namespace, traffic won't route correctly.
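For reference, a correctly wired configuration might look like the following sketch, where the `Gateway` named `my-gateway`, the host, and the Service `my-svc` are all hypothetical names, and everything lives in the `default` namespace:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-routes
  namespace: default            # same namespace as the backend Service
spec:
  hosts:
  - "app.example.com"
  gateways:
  - default/my-gateway          # <namespace>/<name> of the Gateway resource
  http:
  - route:
    - destination:
        host: my-svc            # backend Service in the same namespace
        port:
          number: 80
```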
Verify pod discovery
Ensure the application pod is correctly discovered by the mesh:
```
istioctl proxy-config endpoints POD_NAME
```
Analyze access logs
Enable access logs to determine whether errors originate from the backend application or the proxy. Key fields include `RESPONSE_FLAGS`, `UPSTREAM_LOCAL_ADDRESS`, and `RESPONSE_CODE`.
If logs indicate upstream errors, perform a direct `curl` test to the GKE service from a pod in the same namespace. If the `curl` request returns the same 5xx error, the issue originates from the backend application itself.
Troubleshooting SSL certificate issues
SSL certificate errors at the Istio Ingress Gateway can be caused by expired certificates, protocol mismatches, or incorrect secret configurations.
Identify SSL errors
Review the logs from the Istio Ingress Gateway pod:
```
kubectl logs -l app=istio-ingressgateway -n GATEWAY_NAMESPACE
```
Verify certificates and keys
Ensure the certificate and private key are valid and match by comparing MD5 hashes of their moduli:

```
openssl x509 -noout -modulus -in CERTIFICATE.CRT | openssl md5
openssl rsa -noout -modulus -in PRIVATE_KEY | openssl md5
```
Confirm the certificate is not expired and that the Common Name (CN) or Subject Alternative Name (SAN) matches the domain.
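The checks above can be exercised end to end on a throwaway self-signed pair; the file paths and the `example.com` subject below are placeholders:

```shell
# Generate a disposable key and self-signed certificate to demonstrate the checks.
openssl req -x509 -newkey rsa:2048 -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 1 -nodes -subj "/CN=example.com" 2>/dev/null

# The moduli of a matching certificate and key hash to the same value.
cert_md5=$(openssl x509 -noout -modulus -in /tmp/demo.crt | openssl md5)
key_md5=$(openssl rsa -noout -modulus -in /tmp/demo.key | openssl md5)
[ "$cert_md5" = "$key_md5" ] && echo "certificate and key match"

# Inspect the expiry date and subject (CN) in one pass.
openssl x509 -noout -enddate -subject -in /tmp/demo.crt
```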
Check TLS protocol and secret configuration
Verify the TLS version and cipher suites in the `Gateway` CRD. Ensure the Kubernetes secret containing the certificate and key is in the same namespace as the Ingress Gateway and that the `credentialName` matches the secret name.
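A sketch of a `Gateway` server block with these pieces in place follows; the gateway name, host, and secret name are hypothetical:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
  namespace: istio-system        # same namespace as the ingress gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "app.example.com"
    tls:
      mode: SIMPLE
      credentialName: my-tls-secret   # must match the Kubernetes secret name
      minProtocolVersion: TLSV1_2     # pin the minimum TLS version
```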
Troubleshooting intermittent timeouts to external endpoints
Intermittent timeouts may occur when requests are made to external endpoints (such as a third-party network load balancer (NLB)) that resolve to multiple IP addresses.
Potential cause: ServiceEntry resolution
If `spec.resolution` is set to `DNS`, Istio uses strict DNS resolution, which load balances across all resolved IP addresses. Some NLBs don't support this.
Resolution
To resolve this, set the resolution on the `ServiceEntry` to `DNS_ROUND_ROBIN`.
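A minimal `ServiceEntry` with this setting might look like the sketch below; the host name and port are placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-nlb
spec:
  hosts:
  - nlb.example.com
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS_ROUND_ROBIN    # use one resolved address per connection
```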
Troubleshooting uneven traffic distribution
If traffic is not distributed evenly across application pods, check the following:
- Pod Health: Ensure all application pods were healthy during the imbalance.
- Load Balancing Algorithm: Review the `DestinationRule` configuration for the `loadBalancer` setting (for example, `consistentHash` or `ROUND_ROBIN`).
- Locality Load Balancing: Verify whether `localityLbSetting` is enabled. Note that this is not supported with the `TRAFFIC_DIRECTOR` control plane.
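As an illustration, a `DestinationRule` selecting an explicit algorithm could look like this sketch, with the host name as a placeholder:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-svc-lb
spec:
  host: my-svc.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
      # Alternatively, use hash-based affinity instead of `simple`:
      # consistentHash:
      #   httpHeaderName: x-user-id
```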
Troubleshooting service propagation and quota issues
If new services or networking configurations are not being reflected in the mesh, it may be due to resource quotas being reached in the fleet project.
Symptoms
- Networking configurations (such as `VirtualService` or `DestinationRule`) are not being pushed to sidecar proxies.
- New services appear "invisible" to the mesh despite being correctly defined.
Steps to identify and resolve
- Check resource quotas: Verify whether the fleet project has reached its quota for `BackendService` resources by checking the "Global internal traffic director backend services" quota in the fleet project. Cloud Service Mesh typically creates one `BackendService` per Kubernetes service port.
- Review scale limitations: Ensure that your configuration remains within the supported scale limitations for Cloud Service Mesh.
- Increase quotas: If quotas have been reached, request an increase for the affected resource in the fleet project to restore normal service propagation.
Troubleshooting VirtualService evaluation
If a `VirtualService` is not behaving as expected, remember that routes are evaluated in the order they are listed. When you have multiple VirtualServices for the same host, their routes are merged, and routes from older VirtualServices are placed before routes from newer ones in the merged list. This reinforces the first-match rule.
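For example, in the following sketch (Bookinfo-style names, purely illustrative) the specific `/v2/` match must appear before the catch-all route, or the catch-all would shadow it:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-routes
spec:
  hosts:
  - reviews
  http:
  - match:                 # more specific rule listed first
    - uri:
        prefix: /v2/
    route:
    - destination:
        host: reviews
        subset: v2
  - route:                 # catch-all evaluated last
    - destination:
        host: reviews
        subset: v1
```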

