Configure GKE for ML Diagnostics
If you are using Google Kubernetes Engine (GKE) for your ML workload, use this guide to configure your GKE cluster and install the required GKE artifacts.
The configuration of your workload depends on whether you use on-demand profiling or programmatic profiling.
- On-demand profiling: Requires you to install connection-operator.
- Programmatic profiling: Requires you to install injection-webhook, label the workload, and use the ML Diagnostics SDK.
If you are using a GKE version later than 1.35.0-gke.3065000, you can set up your GKE cluster for ML Diagnostics with a single gcloud CLI command. For more information, see Set up with gcloud CLI.
For GKE versions prior to 1.35.0-gke.3065000, you need to manually configure the GKE cluster to install the cert-manager, injection-webhook, and connection-operator artifacts. For more information, see Manual installation.
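Because the setup path depends on how the cluster version compares with 1.35.0-gke.3065000, a small version comparison can make the choice explicit in a script. The following is a minimal sketch and not part of the ML Diagnostics tooling: the function names gke_version_gt and mldiag_setup_path are hypothetical, and the code assumes version strings of the form MAJOR.MINOR.PATCH-gke.BUILD. This guide does not state which path applies to exactly 1.35.0-gke.3065000, so the sketch treats that version as "manual" as an assumption.

```shell
#!/bin/sh
# Hypothetical helpers: neither function ships with gcloud or ML Diagnostics.

# Succeeds (exit status 0) only if version $1 is strictly later than $2.
# Assumes both look like MAJOR.MINOR.PATCH-gke.BUILD, e.g. 1.35.0-gke.3065000.
gke_version_gt() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    # Turn "1.35.0-gke.3065000" into "1.35.0.3065000" so every field is numeric.
    sub(/-gke\./, ".", a); sub(/-gke\./, ".", b)
    split(a, x, "."); split(b, y, ".")
    for (i = 1; i <= 4; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 1  # equal versions do not count as later
  }'
}

# Prints "gcloud" or "manual" for the given cluster version.
# Assumption: the boundary version itself is reported as "manual".
mldiag_setup_path() {
  if gke_version_gt "$1" "1.35.0-gke.3065000"; then
    echo "gcloud"
  else
    echo "manual"
  fi
}

# Example:
#   mldiag_setup_path "1.35.0-gke.3070000"
```

One way to obtain the version to pass in is gcloud container clusters describe CLUSTER_NAME --location=LOCATION --format="value(currentMasterVersion)", where LOCATION is a placeholder for your cluster's region or zone.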
Set up with gcloud CLI
For GKE versions later than 1.35.0-gke.3065000, use one of the following gcloud CLI commands to deploy the required ML Diagnostics components (both connection-operator and injection-webhook) into your GKE cluster.
For new GKE clusters:

    gcloud beta container clusters create CLUSTER_NAME --enable-managed-mldiagnostics

For existing GKE clusters:

    gcloud beta container clusters update CLUSTER_NAME --enable-managed-mldiagnostics

To disable ML Diagnostics, use the following:

    gcloud beta container clusters update CLUSTER_NAME --no-enable-managed-mldiagnostics

You can also enable ML Diagnostics through the Google Cloud console:
- For new GKE clusters, go to Feature Manager > Managed Machine Learning Diagnostics.
- For existing GKE clusters, go to Clusters, select the name of your cluster, go to Edit, and edit Managed Machine Learning Diagnostics under Features.
For more information on gcloud CLI commands to set up a GKE cluster for ML Diagnostics, refer to the enable-managed-mldiagnostics flag in the reference pages for the gcloud beta container clusters create and gcloud beta container clusters update commands.
Manual installation
For GKE versions prior to 1.35.0-gke.3065000, you need to manually configure the GKE cluster to install the following:
- cert-manager: A prerequisite for the injection-webhook.
- injection-webhook: Provides the SDK with the required metadata. It supports common ML Kubernetes workloads, like JobSet, RayJob, and LeaderWorkerSet.
- connection-operator: For on-demand profiling on GKE. When connection-operator is deployed along with injection-webhook into the GKE cluster and you trigger an on-demand capture, it initiates profiling requests to the target pods that are running profiling servers.
For more information on setting up for Google Kubernetes Engine, see Configure Google Kubernetes Engine cluster.
Cert-manager
cert-manager acts as the certificate controller for your cluster, ensuring that your applications are secure and that your certificates never unintentionally expire.
Use Helm to install the following:

    helm repo add jetstack https://charts.jetstack.io
    helm repo update
    helm install \
      cert-manager jetstack/cert-manager \
      --namespace cert-manager \
      --create-namespace \
      --version v1.13.0 \
      --set installCRDs=true \
      --set global.leaderElection.namespace=cert-manager \
      --timeout 10m

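Since injection-webhook depends on cert-manager, it can help to confirm that cert-manager is running before moving on. The following is a minimal sketch using the standard kubectl wait command; the helper name wait_for_deployments and its default timeout of 300s are assumptions made for illustration, not part of this guide's required steps.

```shell
#!/bin/sh
# Hypothetical helper: blocks until every Deployment in a namespace reports
# the Available condition, or fails once the timeout expires.
#   $1 = namespace
#   $2 = timeout (optional, defaults to 300s)
wait_for_deployments() {
  kubectl wait --namespace "$1" \
    --for=condition=Available deployment --all \
    --timeout="${2:-300s}"
}

# Example (run after the helm install command has finished):
#   wait_for_deployments cert-manager 300s
```

The same helper could be pointed at the gke-mldiagnostics namespace after the later Helm installs, assuming those charts create Deployments, which this guide does not confirm.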
Injection-webhook
injection-webhook passes metadata into the SDK. Use helm upgrade --install to install for the first time or upgrade an existing installation.
Use Helm to install the following:

    helm upgrade --install mldiagnostics-injection-webhook \
      --namespace=gke-mldiagnostics \
      --create-namespace \
      --version 0.25.0 \
      oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-injection-webhook

Connection-operator
connection-operator enables on-demand profiling on GKE. Use the following table to find the correct mldiagnostics-connection-operator version:
| JAX Version | Helm Chart Version |
|---|---|
| 0.8.x | 0.24.0 |
| 0.9.x+ | 0.24.0+ |
Use Helm to install the required version.
For JAX 0.8.x:

    helm upgrade --install mldiagnostics-connection-operator \
      --namespace=gke-mldiagnostics \
      --create-namespace \
      --version 0.24.0 \
      oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-connection-operator \
      --set 'mldiagnosticsConnectionOperator.controller.args={--metrics-bind-address=:8443,--health-probe-bind-address=:8081,--sidecar-timeout=65m,--disable-hostname-override}'

For JAX 0.9.x+:

    helm upgrade --install mldiagnostics-connection-operator \
      --namespace=gke-mldiagnostics \
      --create-namespace \
      --version 0.24.0 \
      oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-connection-operator

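The two connection-operator commands differ only in the extra --set value passed for JAX 0.8.x. If you script the installation, that choice can be derived from the JAX version. The following is a minimal sketch: the function name connection_operator_extra_args is hypothetical, and rejecting versions older than 0.8 is an assumption, because this guide does not cover them.

```shell
#!/bin/sh
# Hypothetical helper: prints the --set value used in the JAX 0.8.x command,
# prints nothing for JAX 0.9.x and later, and fails for anything else.
#   $1 = JAX version, e.g. 0.8.2
connection_operator_extra_args() {
  case "$1" in
    0.8.*)
      echo 'mldiagnosticsConnectionOperator.controller.args={--metrics-bind-address=:8443,--health-probe-bind-address=:8081,--sidecar-timeout=65m,--disable-hostname-override}'
      ;;
    0.9.*|0.[1-9][0-9].*|[1-9]*.*)
      # JAX 0.9.x and later: the command in this guide passes no extra value.
      ;;
    *)
      echo "JAX version $1 is not covered by the version table" >&2
      return 1
      ;;
  esac
}

# Example:
#   extra=$(connection_operator_extra_args "$JAX_VERSION") || exit 1
#   If "$extra" is non-empty, append:  --set "$extra"  to the helm command.
```

One way to obtain the installed JAX version is python -c "import jax; print(jax.__version__)".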
Label workload
For programmatic profiling, you need to trigger the injection-webhook to inject metadata into pods. Label either the workload or its namespace with managed-mldiagnostics-gke=true before deploying the workload:
- Label a workload. Label a JobSet, LWS, or RayJob workload, which enables the webhook for that specific workload. The following is an example for a JobSet workload:

      apiVersion: jobset.x-k8s.io/v1alpha2
      kind: JobSet
      metadata:
        name: single-host-tpu-v3-jobset2
        namespace: default
        labels:
          managed-mldiagnostics-gke: "true"

- Label a namespace. This enables the webhook for all JobSet, LWS, and RayJob workloads within that namespace.

      kubectl create namespace ai-workloads
      kubectl label namespace ai-workloads managed-mldiagnostics-gke=true
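Because the label must be in place before the workload is deployed, a quick pre-deploy check on the manifest can catch a missing label early. The following is a minimal sketch under stated assumptions: the function name has_mldiag_label is hypothetical, and the check is a plain text match for the double-quoted form shown in the JobSet example, so it does not verify that the line actually sits under metadata.labels.

```shell
#!/bin/sh
# Hypothetical pre-deploy check: succeeds if the manifest file in $1 contains
# an indented line of the form  managed-mldiagnostics-gke: "true".
# Text match only; it does not parse the YAML structure.
has_mldiag_label() {
  grep -Eq '^[[:space:]]+managed-mldiagnostics-gke:[[:space:]]*"true"[[:space:]]*$' "$1"
}

# Example (jobset.yaml is a placeholder file name):
#   has_mldiag_label jobset.yaml && kubectl apply -f jobset.yaml
```

For a namespace that is already labeled, kubectl get namespace ai-workloads --show-labels displays the labels currently applied.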

