Configure GKE for ML Diagnostics

If you are using Google Kubernetes Engine (GKE) for your ML workload, use this guide to configure your GKE cluster and install the required GKE artifacts.

The configuration of your workload depends on whether you use on-demand profiling or programmatic profiling.

  • On-demand profiling: Requires you to install connection-operator.
  • Programmatic profiling: Requires you to install injection-webhook, label the workload, and use the ML Diagnostics SDK.

If you are using a version of GKE that is later than 1.35.0-gke.3065000, you can set up your GKE cluster for ML Diagnostics with a single gcloud CLI command. For more information, see Set up with gcloud CLI.

For GKE versions prior to 1.35.0-gke.3065000, you need to manually configure the GKE cluster to install the cert-manager, injection-webhook, and connection-operator artifacts. For more information, see Manual installation.
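The version check above can be sketched as a small shell helper. This is only an illustration: the function names are hypothetical, it assumes a `sort` that supports version ordering (`-V`, as in GNU coreutils), and because this guide does not say which path applies at exactly 1.35.0-gke.3065000, the sketch treats that version as requiring manual installation.

```shell
#!/bin/sh
# Threshold version taken from this guide.
MIN_VERSION="1.35.0-gke.3065000"

# Succeeds when $1 is strictly later than $2 under sort's version ordering.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Prints "gcloud" or "manual" for a given GKE version string.
choose_setup_path() {
  if version_gt "$1" "$MIN_VERSION"; then
    echo "gcloud"
  else
    echo "manual"
  fi
}

# To read the version of a real cluster, something like the following
# should work (CLUSTER_NAME and LOCATION are placeholders):
#   gcloud container clusters describe CLUSTER_NAME --location LOCATION \
#     --format="value(currentMasterVersion)"
choose_setup_path "1.35.1-gke.1000000"
```

Pass the version string reported for your cluster to `choose_setup_path` to see which of the two sections below applies.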

Set up with gcloud CLI

For GKE versions later than 1.35.0-gke.3065000, use one of the following gcloud CLI commands to deploy the required ML Diagnostics components (both connection-operator and injection-webhook) into your GKE cluster.

For new GKE clusters:

    gcloud beta container clusters create CLUSTER_NAME \
      --enable-managed-mldiagnostics

For existing GKE clusters:

    gcloud beta container clusters update CLUSTER_NAME \
      --enable-managed-mldiagnostics

To disable ML Diagnostics, use the following:

    gcloud beta container clusters update CLUSTER_NAME \
      --no-enable-managed-mldiagnostics

You can also enable ML Diagnostics through the Google Cloud console instead of the gcloud CLI:

  • For new GKE clusters, go to Feature Manager > Managed Machine Learning Diagnostics.

    Go to GKE Managed Machine Learning Diagnostics

  • For existing GKE clusters, go to Clusters, select the name of your cluster, go to Edit, and edit Managed Machine Learning Diagnostics under Features.

For more information on the gcloud CLI commands that set up a GKE cluster for ML Diagnostics, refer to the --enable-managed-mldiagnostics flag in the reference pages for gcloud beta container clusters create and gcloud beta container clusters update.

Manual installation

For GKE versions prior to 1.35.0-gke.3065000, you need to manually configure the GKE cluster to install the following:

  • cert-manager: A prerequisite for the injection-webhook.
  • injection-webhook: Provides the SDK with the required metadata. It supports common ML Kubernetes workloads, like JobSet, RayJob, and LeaderWorkerSet.
  • connection-operator: Enables on-demand profiling on GKE. When deployed alongside injection-webhook, it sends profiling requests to the target pods that are running profiling servers whenever you trigger an on-demand capture.

For more information on setting up for Google Kubernetes Engine, see Configure Google Kubernetes Engine cluster.

Cert-manager

cert-manager acts as the certificate controller for your cluster, ensuring that your applications are secure and that your certificates never unintentionally expire.

Use Helm to install cert-manager:

    helm repo add jetstack https://charts.jetstack.io
    helm repo update

    helm install \
      cert-manager jetstack/cert-manager \
      --namespace cert-manager \
      --create-namespace \
      --version v1.13.0 \
      --set installCRDs=true \
      --set global.leaderElection.namespace=cert-manager \
      --timeout 10m

Injection-webhook

injection-webhook passes metadata into the SDK. Use helm upgrade --install to install for the first time or upgrade an existing installation.

Run the following Helm command:

    helm upgrade --install mldiagnostics-injection-webhook \
      --namespace=gke-mldiagnostics \
      --create-namespace \
      --version 0.25.0 \
      oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-injection-webhook

Connection-operator

connection-operator enables on-demand profiling on GKE. Use the following table to find the correct mldiagnostics-connection-operator version:

    JAX version    Helm chart version
    0.8.x          0.24.0
    0.9.x+         0.24.0+

Use Helm to install the required version.

For JAX 0.8.x:

    helm upgrade --install mldiagnostics-connection-operator \
      --namespace=gke-mldiagnostics \
      --create-namespace \
      --version 0.24.0 \
      oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-connection-operator \
      --set 'mldiagnosticsConnectionOperator.controller.args={--metrics-bind-address=:8443,--health-probe-bind-address=:8081,--sidecar-timeout=65m,--disable-hostname-override}'
 

For JAX 0.9.x+:

    helm upgrade --install mldiagnostics-connection-operator \
      --namespace=gke-mldiagnostics \
      --create-namespace \
      --version 0.24.0 \
      oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-connection-operator
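The only difference between the two commands above is the extra --set flag for JAX 0.8.x. As a rough sketch, that choice could be scripted as shown below. The function names are hypothetical, the script only prints the command rather than running Helm, and it assumes that "0.9.x+" means JAX 0.9 and any later release.

```shell
#!/bin/sh
# Chart location and controller args are copied from the commands in this guide.
CHART="oci://us-docker.pkg.dev/ai-on-gke/mldiagnostics-webhook-and-operator-helm/mldiagnostics-connection-operator"
LEGACY_ARGS='mldiagnosticsConnectionOperator.controller.args={--metrics-bind-address=:8443,--health-probe-bind-address=:8081,--sidecar-timeout=65m,--disable-hostname-override}'

# Classifies a JAX version string as "0.8.x", "0.9.x+", or "unsupported".
jax_series() {
  case "$1" in
    0.8|0.8.*) echo "0.8.x" ;;
    0.9|0.9.*|0.[1-9][0-9]|0.[1-9][0-9].*|[1-9]*) echo "0.9.x+" ;;
    *) echo "unsupported" ;;
  esac
}

# Prints (does not run) the Helm command this guide gives for that JAX series.
print_operator_install() {
  base="helm upgrade --install mldiagnostics-connection-operator --namespace=gke-mldiagnostics --create-namespace --version 0.24.0 $CHART"
  case "$(jax_series "$1")" in
    "0.8.x")  echo "$base --set '$LEGACY_ARGS'" ;;
    "0.9.x+") echo "$base" ;;
    *) echo "No chart mapping in this guide for JAX $1" >&2; return 1 ;;
  esac
}

print_operator_install "0.8.2"
```

Printing the command first makes it easy to review before running it against a cluster.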

Label workload

For programmatic profiling, you need to trigger the injection-webhook to inject metadata into pods. Label either the workload or its namespace with managed-mldiagnostics-gke=true before deploying the workload:

  • Label a workload. Label a JobSet, LWS (LeaderWorkerSet), or RayJob workload to enable the webhook for that specific workload. The following is an example for a JobSet workload:

      apiVersion: jobset.x-k8s.io/v1alpha2
      kind: JobSet
      metadata:
        name: single-host-tpu-v3-jobset2
        namespace: default
        labels:
          managed-mldiagnostics-gke: "true"
     
    
  • Label a namespace. This enables the webhook for all JobSet, LWS, and RayJob workloads within that namespace.

      kubectl create namespace ai-workloads
      kubectl label namespace ai-workloads managed-mldiagnostics-gke=true
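Because the label must be in place before the workload is deployed, a quick pre-deploy check on the manifest can catch a missing label. The following is a minimal sketch with a hypothetical helper name. It is a plain text match, so it does not confirm that the label sits under the top-level metadata.labels, and it does not cover the case where the namespace carries the label instead.

```shell
#!/bin/sh
# Succeeds when the manifest file contains the label line used in this guide.
has_mldiagnostics_label() {
  grep -Eq '^[[:space:]]*managed-mldiagnostics-gke:[[:space:]]*"true"[[:space:]]*$' "$1"
}

# Demo manifest mirroring the JobSet example above, written to a temp file.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: single-host-tpu-v3-jobset2
  namespace: default
  labels:
    managed-mldiagnostics-gke: "true"
EOF

if has_mldiagnostics_label "$manifest"; then
  echo "label present"
else
  echo "label missing"
fi
rm -f "$manifest"
```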
     
    