Troubleshoot GPU health check failures

Before starting a job on a compute node in your cluster, Slurm runs a prolog script named a_chs_gpu_health_check_hcs.sh to quickly check the node's GPU health. If this check fails, then Slurm drains the node, and Cluster Director sets the node state to Unhealthy. To identify unhealthy nodes in your cluster, view the cluster topology.

This document explains how to troubleshoot a node's GPU health check failure and return the node to service. If the health check failed due to a false positive, or if your workload can tolerate the hardware issue that the health check detected, then you can bypass the health check for a specific job.

Resolve a GPU health check failure

To resolve a compute node's health check failure and return the node to service, complete the following steps:

  1. Identify the cause of the failure

  2. Resolve the error

  3. Run a manual health check

  4. Undrain the node

Identify the cause of the failure

To identify the cause of a GPU health check failure in a compute node, review the node's log files by using the Logs Explorer:

  1. In the Google Cloud console, go to the Logs Explorer page.

    Go to Logs Explorer

  2. In the Query pane, enter the following query. If you can't see the Query pane, then click the Show query toggle to the on position.

     SEARCH("`/var/log/slurm/chs_health_check.log`")
     resource.labels.instance_id="NODE_NAME"

    Replace NODE_NAME with the name of the compute node.

  3. To run the query, click Run query. The Query results pane displays error messages from nvidia-smi and DCGM diagnostics. These logs show a history of diagnostic results from the prolog health check script, which runs each time Slurm assigns a job to a node. The error messages describe critical failures, including asynchronous driver events such as XID errors. An error can occur while no job is running on the node, but the script only detects it when a job tries to start.
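
If you prefer to inspect the log file directly on the node instead of using the Logs Explorer, you can extract XID codes from the log text with standard shell tools. The following is a minimal sketch; the sample log lines are hypothetical, and the exact format of entries in /var/log/slurm/chs_health_check.log may differ:

```shell
# Hypothetical excerpt of health check log output; the real format of
# /var/log/slurm/chs_health_check.log entries may differ.
log='NVRM: Xid (PCI:0000:00:04): 79, GPU has fallen off the bus
prolog health check passed'

# Keep only the numeric XID code from lines that report one.
xids=$(printf '%s\n' "$log" | grep -o 'Xid ([^)]*): [0-9]*' | grep -o '[0-9]*$')
echo "$xids"
```

On a node, you would replace the sample text with real log content, for example from `sudo grep Xid /var/log/slurm/chs_health_check.log`.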

Resolve the error

After you identify the cause of the GPU health check failure in the previous section, choose one of the following resolution methods based on the error:

If you can't resolve the error, then contact your account team or support.

Resolve a hardware error

If the log files show an XID error that requires a node recreation, then recreate the node by completing the following steps:

  1. If you haven't already, then connect to a login node in your cluster.

  2. Recreate the node:

     sudo scontrol update nodename=NODE_NAME state=POWER_DOWN_ASAP reason="Recreate a faulty compute node."

Resolve a temporary issue

If the GPU health check failure indicates a temporary issue, such as a transient power spike or cosmic ray-induced bit flip, then do the following:

  1. If you haven't already, then connect to the compute node.

  2. Stop all tasks that use the node's GPUs:

     sudo systemctl stop nvidia-persistenced.service nvidia-dcgm.service google-cloud-ops-agent-opentelemetry-collector.service

  3. Reset the GPUs:

     sudo nvidia-smi -r

  4. Restart the tasks:

     sudo systemctl start nvidia-persistenced.service nvidia-dcgm.service google-cloud-ops-agent-opentelemetry-collector.service

Run a manual health check

After you attempt to fix a GPU health check failure, as described in the previous section, run a manual health check to verify the compute node's status. If the a_chs_gpu_health_check_hcs.sh script runs without errors, then you've resolved the issue.

To manually run a health check on a compute node, complete the following steps:

  1. If you haven't already, then connect to the compute node.

  2. Run the a_chs_gpu_health_check_hcs.sh prolog script:

     sudo /slurm/custom_scripts/prolog.d/a_chs_gpu_health_check_hcs.sh
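
Because Slurm treats a nonzero exit code from a prolog script as a failure, you can wrap the manual run to make the result explicit. This is only a sketch: the HEALTH_CHECK variable defaults to the no-op command true so that the snippet is self-contained; on a compute node, you would point it at the prolog script path from this document and run it with sudo:

```shell
# HEALTH_CHECK defaults to the no-op `true` so this sketch runs anywhere;
# on a compute node, set it to the prolog script path from this document.
HEALTH_CHECK=${HEALTH_CHECK:-true}

# A prolog script signals success with exit code 0.
if "$HEALTH_CHECK"; then
  result="health check passed"
else
  result="health check failed with exit code $?"
fi
echo "$result"
```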

Undrain the node

If you resolved the issue in the previous sections, and you want to return the compute node to service and let it accept new jobs, then undrain the node:

  1. If you haven't already, then connect to a login node in your cluster.

  2. Undrain the node:

     sudo scontrol update nodename=NODE_NAME state=RESUME

Before starting new jobs on the node, Slurm runs a GPU health check.
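
To verify that the node has left the drain state, you can check its state from a login node with sinfo. The sketch below parses a simulated sinfo line so that it runs anywhere; on a real login node, the commented command would produce equivalent output, and the state names (such as idle, drained, or draining) come from Slurm:

```shell
# On a login node you would run something like:
#   sinfo --nodes=NODE_NAME --format='%N %T' --noheader
# The output is simulated here so the sketch is self-contained.
sinfo_output='node-0 idle'

# Split the "name state" pair and inspect the state field.
state=${sinfo_output#* }
case "$state" in
  drained|draining) echo "node is still drained" ;;
  *) echo "node returned to service (state: $state)" ;;
esac
```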
