Report faulty host

If you notice an issue on an A4X Max, A4X, A4, or A3 Ultra instance that you can't resolve otherwise—such as slower performance within a cluster or consistently high GPU temperatures—then you can report its host as faulty. When you report a host as faulty, Compute Engine automatically repairs the compute instance by running host maintenance. For A4 and A3 Ultra instances, Compute Engine attempts to migrate the instance to a different host when maintenance starts, if you have unused reserved capacity or capacity is available in the instance's zone. Reporting a host as faulty helps you minimize downtime for your workload.

This document explains how to report and repair faulty host instances that are part of a Slurm cluster or other instance-based clusters. To report faulty hosts in a Google Kubernetes Engine (GKE) cluster, see Report faulty hosts through GKE.

Limitations

When you report a faulty host, the following limitations apply:

You can only report a faulty host if the compute instance that runs on the host meets all of the following conditions:
- The compute instance is running.
- The compute instance uses an A4X Max, A4X, A4, or A3 Ultra machine type.
- The compute instance uses the reservation-bound provisioning model.
  
  Note: If a running A4X Max, A4X, A4, or A3 Ultra instance uses a different provisioning model, but you still want to report its host as faulty, then contact your account team.
Google Cloud makes best-effort attempts to fulfill all your report faulty host requests. However, due to capacity constraints or rate limits, a request might not always be fulfilled.
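
The conditions above can be checked before you send a report. The following is a minimal sketch, not an official API: it assumes a dictionary shaped like the instance resource (with `status`, `machineType`, and `scheduling.provisioningModel` fields), and it assumes the machine type name prefixes and the `RESERVATION_BOUND` value shown in the code. Verify those values against your own instances.

```python
# Hypothetical pre-check that mirrors the documented conditions.
# The machine type prefixes below are assumptions for A4X Max, A4X, A4, and A3 Ultra.
SUPPORTED_PREFIXES = ("a4x-maxgpu", "a4x-highgpu", "a4-highgpu", "a3-ultragpu")


def ineligibility_reasons(instance: dict) -> list[str]:
    """Return why a host report would be rejected; an empty list means eligible."""
    reasons = []
    if instance.get("status") != "RUNNING":
        reasons.append("instance is not running")
    # machineType may be a full URL, so keep only the last path segment.
    machine_type = instance.get("machineType", "").rsplit("/", 1)[-1]
    if not machine_type.startswith(SUPPORTED_PREFIXES):
        reasons.append("machine type is not supported")
    # RESERVATION_BOUND is the assumed value for the reservation-bound model.
    model = instance.get("scheduling", {}).get("provisioningModel")
    if model != "RESERVATION_BOUND":
        reasons.append("provisioning model is not reservation-bound")
    return reasons
```

An instance that fails only the provisioning model check is the case where the note above suggests contacting your account team.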

Before you begin

Select the tab for how you plan to use the samples on this page:

Console

When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

gcloud

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

REST

To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.

Install the Google Cloud CLI. After installation, initialize the Google Cloud CLI by running the following command:

gcloud init

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

For more information, see Authenticate for using REST in the Google Cloud authentication documentation.

Required roles

To get the permissions that you need to report a faulty host, ask your administrator to grant you the following IAM roles:

Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) on the compute instance or the project
To view the state of a faulty host report operation by using Cloud Logging: Logs Viewer (roles/logging.viewer) on the project

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to report a faulty host. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to report a faulty host:

To create a faulty host report: compute.instances.update on the instance
To view a list of operations by using Logging: logging.operations.list on the project
To view the details of an operation by using Logging: logging.operations.get on the project
To view a list of operations in Compute Engine: compute.zoneOperations.list on the project
To view the details of an operation in Compute Engine: compute.zoneOperations.get on the project

You might also be able to get these permissions with custom roles or other predefined roles.

Understand the faulty host report process

After you report a faulty host for a compute instance, the time when the compute instance restarts varies based on the reservation operational mode that is specified in the reservation that the compute instance uses. To verify the reservation operational mode for a reservation, view the reservationOperationalMode field in the reservation. The following table summarizes the faulty host process for the two available reservation operational modes: all capacity mode and managed mode.

| | All capacity mode (ALL_CAPACITY) | Managed mode (HIGHLY_AVAILABLE_CAPACITY) |
| --- | --- | --- |
| Supported machine types | A4X Max and A4X | A4 and A3 Ultra |
| Faulty host report API rate limiting | No rate limits apply. | Calls to the API may be rate-limited. |

Faulty host report process

When you report a faulty host for a compute instance that runs in the all capacity mode, the following occurs:

Report the faulty host: The instance remains in the RUNNING state throughout the report faulty host operation, which usually takes 10-12 minutes to complete. To review the operation state, see Review report faulty host operations in this document.
Repair the host: After the report faulty host operation completes, the host repair operation starts within a minute.

When the host repair operation starts, the instance stops and its state changes depending on the automatic restart (automaticRestart) setting that is specified for the instance:
- If automatic restart is enabled for the instance, the instance state changes to REPAIRING. The instance automatically restarts when its host is healthy unless you stop the instance before then.
- If automatic restart is disabled for the instance, the instance state changes to TERMINATED. You need to manually restart the instance after its host is healthy.
Repairing the faulty host can take 3-14 days, or even longer at times.
Restart the instance: After the host repair operation completes (usually 3-14 days), one of the following occurs:
- If the instance is in the REPAIRING state and the resources are available when the repair completes, then Compute Engine automatically restarts the instance on the repaired host.
- Otherwise, if the instance is in the TERMINATED state or if resources aren't available when the repair completes, then the instance state stays in or changes to TERMINATED. You must manually restart the instance when you want it to run. However, restarting the instance might fail if resources aren't available when you restart the instance; for example, this can happen if other instances are already using the repaired host.

When you report a faulty host for a compute instance that runs in the managed mode, the following occurs:

Report the faulty host: The instance remains in the RUNNING state throughout the report faulty host operation, which usually takes 10-12 minutes to complete. To review the operation state, see Review report faulty host operations in this document.
Start repairing the host: After the report faulty host operation completes, the host repair operation starts within a minute.

When the host repair operation starts, the instance stops and its state changes depending on the automatic restart (automaticRestart) setting that is specified for the instance:
- If automatic restart is enabled for the instance, the instance state changes to REPAIRING. The instance automatically restarts when its host is healthy unless you stop the instance before then.
- If automatic restart is disabled for the instance, the instance state changes to TERMINATED. You need to manually restart the instance after its host is healthy.
Repairing the faulty host can take 3-14 days, or even longer at times.
Migrate and restart the instance: After the host repair operation starts (usually 10-12 minutes), Compute Engine attempts to reserve one more host to replace your reported faulty host in your reserved capacity. If Compute Engine finds a healthy host, either by replacing the faulty host or by finding a matching healthy host in your reserved capacity, then Compute Engine migrates the instance to that host. The instance then restarts in one of the following ways:
- If the instance is in the REPAIRING state and resources are available before or when the repair completes, then Compute Engine automatically restarts the instance on a healthy host.
- Otherwise, if the instance is in the TERMINATED state or if resources aren't available before or when the repair completes, then the instance state stays in or changes to TERMINATED. You must manually restart the instance when you want it to run. However, restarting the instance might fail if resources aren't available when you restart the instance; for example, this can happen if other instances are already using the repaired host.
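
The state transitions described for both modes can be summarized as a small model. This is a sketch for reasoning about the process, not Compute Engine behavior you can call: the function names are hypothetical, and treating a successful automatic restart as a return to RUNNING is an assumption. The two modes differ in timing (after the repair completes versus before or when it completes), which this model does not capture.

```python
def state_when_repair_starts(automatic_restart: bool) -> str:
    """State of the instance once the host repair operation starts."""
    # With automatic restart enabled the instance waits in REPAIRING;
    # otherwise it is stopped and stays TERMINATED until restarted manually.
    return "REPAIRING" if automatic_restart else "TERMINATED"


def state_after_restart_attempt(state: str, resources_available: bool) -> str:
    """State after the repair completes or a healthy host is found."""
    # Only a REPAIRING instance with available resources restarts on its own.
    # RUNNING as the resulting state is an assumption of this sketch.
    if state == "REPAIRING" and resources_available:
        return "RUNNING"
    # In every other case a manual restart is required.
    return "TERMINATED"


if __name__ == "__main__":
    for auto in (True, False):
        for available in (True, False):
            start = state_when_repair_starts(auto)
            end = state_after_restart_attempt(start, available)
            print(f"automaticRestart={auto}, resources={available}: {start} -> {end}")
```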

Report a faulty host

To report a faulty host, complete the following steps:

Review the host on which your compute instance runs.

For instructions, see View topology of a compute instance.
Optional: Back up Local SSD data. When the instance stops, Compute Engine automatically discards the data of any Local SSD disks that are attached to the instance. You can't recover Local SSD data after Compute Engine discards it.

For instructions on how to preserve Local SSD data, see Local SSD data backup.
Report the faulty host. To report a faulty host, select one of the following options. The host repair operation starts within a minute after the report faulty host operation completes. If the instance becomes unresponsive after you start the faulty host report operation, then wait at least 15 minutes before you restart the instance.
gcloud

To report a faulty host, use the following gcloud compute instances report-host-as-faulty command:
```
gcloud compute instances report-host-as-faulty INSTANCE_NAME \
    --async \
    --disruption-schedule=IMMEDIATE \
    --fault-reasons=behavior=FAULT_REASON,description=DESCRIPTION \
    --zone=ZONE
```
Replace the following:
- INSTANCE_NAME: the name of the compute instance.
- FAULT_REASON: a list of host issues that your instance encountered, separated by commas, for example ISSUE_1,ISSUE_2. You can specify the following values:
  - PERFORMANCE: the GPUs that are attached to the instance have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.
  - SILENT_DATA_CORRUPTION: you see data corruption in your instance, but the instance keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.
  - UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.
  - BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your instance is.
- DESCRIPTION: a description of the issue that is affecting your instance, such as XID information or suspected performance problems.
- ZONE: the zone where the instance exists.
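
If you script the command, it can help to validate the fault reason and quote the description before the shell sees it. The following sketch assembles the argument list for a single fault reason using only the flags shown above. The function name is hypothetical, and rejecting commas in the description is a conservative assumption, made because the flag value is itself a comma-separated list of key=value pairs.

```python
import shlex

# Values documented for the behavior key.
ALLOWED_BEHAVIORS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_command(instance_name: str, zone: str,
                         behavior: str, description: str) -> list[str]:
    """Build the argv for a single-reason report-host-as-faulty call."""
    if behavior not in ALLOWED_BEHAVIORS:
        raise ValueError(f"unsupported behavior: {behavior}")
    if "," in description:
        # Assumption: avoid commas so the key=value pairs stay unambiguous.
        raise ValueError("description must not contain commas")
    return [
        "gcloud", "compute", "instances", "report-host-as-faulty", instance_name,
        "--async",
        "--disruption-schedule=IMMEDIATE",
        f"--fault-reasons=behavior={behavior},description={description}",
        f"--zone={zone}",
    ]


if __name__ == "__main__":
    argv = build_report_command("example-instance", "us-central1-a",
                                "PERFORMANCE", "GPU 3 slower than peers")
    # shlex.join quotes arguments that contain spaces for safe pasting.
    print(shlex.join(argv))
```

Passing the list directly to `subprocess.run` avoids shell quoting altogether; the printed form is only for review.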
REST

To report a faulty host, make the following POST request to the instances.reportHostAsFaulty method.

When you report a faulty host, you can specify multiple fault reasons at once. For example, to specify two fault reasons, make a request as follows:
```
POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/INSTANCE_NAME/reportHostAsFaulty

{
  "disruptionSchedule": "IMMEDIATE",
  "faultReasons": [
    {
      "behavior": "FAULT_REASON_1",
      "description": "DESCRIPTION_1"
    },
    {
      "behavior": "FAULT_REASON_2",
      "description": "DESCRIPTION_2"
    }
  ]
}
```
Replace the following:
- PROJECT_ID: the ID of the project where the instance exists.
- ZONE: the zone where the instance exists.
- INSTANCE_NAME: the name of the compute instance.
- FAULT_REASON_1 and FAULT_REASON_2: each host issue that your instance encountered. You can specify the following values:
  - PERFORMANCE: the GPUs that are attached to the instance have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.
  - SILENT_DATA_CORRUPTION: you see data corruption in your instance, but the instance keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.
  - UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.
  - BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your instance is.
- DESCRIPTION_1 and DESCRIPTION_2: a description for each host issue that you specified, such as XID information or suspected performance problems.
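
Generating the request body programmatically avoids hand-editing JSON when the number of fault reasons varies. The following sketch builds the URL and body in the shape shown above. The function name and the example values are hypothetical, and the sketch does not send the request or handle authentication.

```python
import json


def build_report_request(project_id: str, zone: str, instance_name: str,
                         fault_reasons: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the POST URL and JSON body for a reportHostAsFaulty request.

    fault_reasons is a list of (behavior, description) pairs.
    """
    url = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project_id}/zones/{zone}/instances/{instance_name}"
        "/reportHostAsFaulty"
    )
    body = {
        "disruptionSchedule": "IMMEDIATE",
        "faultReasons": [
            {"behavior": behavior, "description": description}
            for behavior, description in fault_reasons
        ],
    }
    # json.dumps escapes quotes and other special characters in descriptions.
    return url, json.dumps(body, indent=2)


if __name__ == "__main__":
    url, body = build_report_request(
        "example-project", "us-central1-a", "example-instance",
        [("PERFORMANCE", "GPU 3 slower than peers"),
         ("UNRECOVERABLE_GPU_ERROR", "XID reported in kernel log")],
    )
    print("POST", url)
    print(body)
```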

Review report faulty host operations

After you report a faulty host, Compute Engine starts a series of operations to mark the host as faulty and prepare the host for repair. Specifically, during a report faulty host operation, the following process happens:

Mark the host as faulty. Compute Engine creates the report faulty host operation. The report faulty host operation then creates a sequence of sub-operations. These sub-operations mark the underlying host as faulty.
Prepare the host for repairs. After all sub-operations complete, the report faulty host operation starts. Compute Engine stops the instance and starts the repair faulty host operation. Based on the reservation operational mode that is specified in the reservation that the instance uses, and if healthy hosts are available, Compute Engine either keeps the instance stopped or attempts to automatically migrate and restart the instance.
Report completion and repair the host. Compute Engine completes the report faulty host operation, and the host repair operation runs.

To track the status of the report faulty host (compute.instances.reportHostAsFaulty) operations in your project, select one of the following options. For more information about other operations that you can use to track repairs, migration, and automatic restart, see Maintenance and restart behaviors and Monitor and plan for a host maintenance event in the Compute Engine documentation.

Console (Instance operations)

In the Google Cloud console, go to the Operations page.

In the table that appears, locate the instance that you reported.
In the row that contains the instance, in the Status column, you can see the status of the report faulty host operation. When the operation completes, the value is Done.
Optional: To verify if Compute Engine has restarted the instance, view the details of the instance.

Console (Instance logs)

In the Google Cloud console, go to the Logs Explorer page.

Verify that the Show query toggle is set to the on position.

In the query editor, enter the following query:

resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.reportHostAsFaulty"

Click Run query. The Query results pane displays the query results.

gcloud

To view the status of the report faulty host operations in your project, use the gcloud compute operations list command with the --filter flag set to operationType:reportHostAsFaulty:
```
gcloud compute operations list --filter="operationType:reportHostAsFaulty"
```
If you want to view the details of a specific faulty host operation, then use the gcloud compute operations describe command:
```
gcloud compute operations describe OPERATION_NAME \
    --zone="ZONE"
```
Replace the following:
- OPERATION_NAME: the name of the operation.
- ZONE: the zone where the operation exists.

REST

To view the status of the report faulty host operations in your project, make a GET request to the zoneOperations.list method. In the request URL, include the filter query parameter set to items.operationType:reportHostAsFaulty.

GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/operations?filter=items.operationType:reportHostAsFaulty

Replace the following:

PROJECT_ID: the ID of the project where the operations exist.
ZONE: the zone where the operations exist.
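
When you build this URL in code, let a library encode the query string so that the filter is attached with `?` and special characters are escaped. The following sketch uses only the Python standard library. The function name and the example project and zone are hypothetical, and the default filter value is the one given above.

```python
from urllib.parse import urlencode


def build_operations_list_url(
    project_id: str,
    zone: str,
    op_filter: str = "items.operationType:reportHostAsFaulty",
) -> str:
    """Return the zoneOperations.list URL with an encoded filter parameter."""
    base = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project_id}/zones/{zone}/operations"
    )
    # urlencode percent-encodes reserved characters such as ':' in the value.
    return f"{base}?{urlencode({'filter': op_filter})}"


if __name__ == "__main__":
    print(build_operations_list_url("example-project", "us-central1-a"))
```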

What's next?

If you encounter issues when reporting a faulty host, then see Troubleshoot faulty host API.
