Monitor instances and operations

Cloud Monitoring automatically collects and stores information about your Managed Lustre instance.

This document provides a detailed overview of the metrics available for monitoring Managed Lustre on Google Cloud. These metrics help you understand the performance, capacity, and health of your Managed Lustre file systems, so you can identify bottlenecks, troubleshoot issues, and optimize resource utilization.

You can use these metrics in Cloud Monitoring to create custom dashboards, set up alerts, and gain deeper insights into your Managed Lustre instance's behavior.

Cloud Monitoring is automatically enabled for Managed Lustre. There's no charge for the collection of data or to view metrics in the Google Cloud console. API calls may incur charges; see Cloud Monitoring pricing for pricing details.

Required IAM roles

The following roles are required:

Monitoring Viewer (roles/monitoring.viewer), or equivalent permissions, to view metrics in Cloud Monitoring.
Monitoring Editor (roles/monitoring.editor), or equivalent permissions, to configure alerts.

Learn how to grant an IAM role.

View metrics

Cloud Monitoring metrics are available from two locations in the Google Cloud console:

The Managed Lustre instance details page displays available metrics. In addition to the metrics listed on this page, it computes the bandwidth of bytes copied and the rate of objects copied.
The Cloud Monitoring page provides multiple chart options and customizations.

View metrics on the instance details page

To view a specific instance's metrics:

Go to the Instances page in the Google Cloud console.

Go to Instances
Click the instance whose metrics you want to view. The Instance details page appears.
Click the Monitoring tab. The default dashboard is displayed.

View metrics in Cloud Monitoring

To view Managed Lustre metrics in Cloud Monitoring, do the following:

Go to the Metrics Explorer page in the Google Cloud console.

Go to Monitoring: Metrics Explorer
Follow the instructions in Create charts with Metrics Explorer to select and display your metrics.

Set up alerts

You can configure alerting policies in Cloud Monitoring to notify you when your Managed Lustre file system meets specific conditions, such as exceeding storage capacity or throughput limits.

Prerequisites

To create alerting policies, you must have the Monitoring Editor (roles/monitoring.editor) IAM role on the project.

Create an alerting policy

To set up an alert, define a condition using a metric or a PromQL query and configure notification channels.

In the Google Cloud console, go to the Alerting page.

Go to Monitoring: Alerting
Click + Create policy.
Select Builder and select your metric, or choose Code editor to enter a query with PromQL. In the metric picker, Managed Lustre metrics fall under the Lustre instance and Lustre location resources.
Configure your trigger logic and define your notification channels and notification settings.
Click Create policy.

For more information about creating triggers and other options, see the Cloud Monitoring documentation on alerting policies.

Example: Create a storage capacity alert

The following example demonstrates how to create an alert that triggers when your Managed Lustre instance exceeds 80% of its provisioned capacity.

In the Google Cloud console, go to the Alerting page.

Go to Monitoring: Alerting
Click + Create policy.
Select Code editor.

In the Query Editor, paste the following PromQL query:

(
  sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes)
  -
  sum by (instance_id, location) (lustre_googleapis_com:instance_available_bytes)
)
/
sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes)
> 0.8

This query calculates the usage ratio for each instance, grouped by instance_id and location: (Total - Available) / Total. The threshold 0.8 means the condition is met when used space exceeds 80% of the provisioned capacity. To alert at 90%, change this value to 0.9.

Click Run Query to verify the syntax and view a chart of the current usage ratio.
Click Next and configure the trigger to Any time series violates.

Click Next. In the Documentation section, add recommended actions for resolving the capacity issue. For example:

  ## Action Required: Lustre Capacity Warning
  The Managed Lustre instance is exceeding 80% capacity usage.
  **Metric:** Usage Ratio > 0.8
  **Severity:** Warning
  **Recommended Actions:**
  1. Check the instance details in the Google Cloud console.
  2. Verify if this is expected data growth or a runaway process.
  3. If valid, consider expanding the storage capacity of the instance or deleting old data to free up space.
  4. Failure to address this may result in "No Space Left on Device" errors for client applications.
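
The arithmetic behind the capacity query in this example can be checked by hand. The following Python sketch mirrors the (Total - Available) / Total calculation and the threshold comparison. The function names and the 100 TiB instance are hypothetical, chosen only for illustration; they are not part of any Managed Lustre or Cloud Monitoring API.

```python
def usage_ratio(capacity_bytes: int, available_bytes: int) -> float:
    """Return (capacity - available) / capacity, the ratio the query computes."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return (capacity_bytes - available_bytes) / capacity_bytes


def exceeds_threshold(capacity_bytes: int, available_bytes: int,
                      threshold: float = 0.8) -> bool:
    """Mirror the `> 0.8` comparison at the end of the alert query."""
    return usage_ratio(capacity_bytes, available_bytes) > threshold


# Hypothetical instance: 100 TiB provisioned, 15 TiB still available.
TIB = 1024 ** 4
print(usage_ratio(100 * TIB, 15 * TIB))             # 0.85
print(exceeds_threshold(100 * TIB, 15 * TIB))       # True
print(exceeds_threshold(100 * TIB, 15 * TIB, 0.9))  # False
```

With these sample values the 80% alert would fire, while a 90% alert would not.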

Create an alerting policy with gcloud

You can create alerting policies using the Google Cloud CLI. Note that you must edit the alert in the Google Cloud console later to enable specific notification channels.

The following example creates an 80% capacity alert using gcloud:

gcloud monitoring policies create \
  --policy-from-file=/dev/stdin <<EOF
{
  "displayName": "Lustre High Capacity Usage (>80%)",
  "severity": "WARNING",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Capacity Usage Ratio > 0.8",
      "conditionPrometheusQueryLanguage": {
        "query": "(sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes) - sum by (instance_id, location) (lustre_googleapis_com:instance_available_bytes)) / sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes) > 0.8",
        "duration": "300s",
        "evaluationInterval": "60s",
        "alertRule": "AlwaysOn"
      }
    }
  ],
  "documentation": {
    "content": "Action Required: The Managed Lustre instance is exceeding 80% capacity usage. Please verify if storage expansion is required.",
    "mimeType": "text/markdown"
  }
}
EOF
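
If you manage several thresholds, it can be convenient to generate the policy JSON rather than edit it by hand. The following Python sketch builds the same policy structure shown above with a configurable threshold and prints it to standard output. The function name build_capacity_policy is hypothetical; the JSON fields and the query are copied from the example above.

```python
import json


def build_capacity_policy(threshold: float = 0.8) -> dict:
    """Build the capacity alerting policy from the example, for a given threshold."""
    percent = round(threshold * 100)
    query = (
        "(sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes)"
        " - sum by (instance_id, location) (lustre_googleapis_com:instance_available_bytes))"
        " / sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes)"
        f" > {threshold}"
    )
    return {
        "displayName": f"Lustre High Capacity Usage (>{percent}%)",
        "severity": "WARNING",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"Capacity Usage Ratio > {threshold}",
                "conditionPrometheusQueryLanguage": {
                    "query": query,
                    "duration": "300s",
                    "evaluationInterval": "60s",
                    "alertRule": "AlwaysOn",
                },
            }
        ],
        "documentation": {
            "content": (
                f"Action Required: The Managed Lustre instance is exceeding "
                f"{percent}% capacity usage. Please verify if storage expansion "
                "is required."
            ),
            "mimeType": "text/markdown",
        },
    }


if __name__ == "__main__":
    # Print the policy so it can be piped to gcloud through /dev/stdin.
    print(json.dumps(build_capacity_policy(0.8), indent=2))
```

Assuming the script is saved as build_policy.py (a hypothetical file name), its output can be piped to the same command used above: python3 build_policy.py | gcloud monitoring policies create --policy-from-file=/dev/stdin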

Metric details

Managed Lustre metrics are attached to the following monitored resource types:

lustre.googleapis.com/Instance
lustre.googleapis.com/Job
lustre.googleapis.com/QuotaEntity

Data is sampled every 60 seconds. After sampling, data may not be visible for up to 180 seconds.

Storage capacity metrics

Metrics related to the storage space available and provisioned on your Lustre file system.

For metric labels, the value of target uses the format <fsname>-<TYPE><HEXA>, where <HEXA> is the zero-based index of the target in hexadecimal. For example, if your file system name is filesys, the 43rd OST is filesys-OST002a, and the 4th MDT is filesys-MDT0003.
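
The naming rule can be sketched in Python as follows. This is an illustration only: the helper names are hypothetical, and the four-digit, lowercase, zero-padded hexadecimal width is an assumption inferred from the two examples above.

```python
import re

# Assumption: <HEXA> is four lowercase hex digits, as in OST002a and MDT0003.
_TARGET_RE = re.compile(r"^(?P<fsname>.+)-(?P<type>[A-Z]+)(?P<hexa>[0-9a-f]{4})$")


def format_target(fsname: str, target_type: str, index: int) -> str:
    """Build a target label value from a zero-based target index."""
    if index < 0:
        raise ValueError("index must be zero or greater")
    return f"{fsname}-{target_type}{index:04x}"


def parse_target(target: str) -> tuple[str, str, int]:
    """Split a target label value into (fsname, type, zero-based index)."""
    match = _TARGET_RE.match(target)
    if match is None:
        raise ValueError(f"unrecognized target name: {target!r}")
    return match["fsname"], match["type"], int(match["hexa"], 16)


# The 43rd OST has zero-based index 42, which is 2a in hexadecimal.
print(format_target("filesys", "OST", 42))  # filesys-OST002a
print(format_target("filesys", "MDT", 3))   # filesys-MDT0003
print(parse_target("filesys-OST002a"))      # ('filesys', 'OST', 42)
```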

Storage capacity metrics are attached to the lustre.googleapis.com/Instance resource.

Metric	Description	Details
`available_bytes`	The number of bytes of storage space for a given Object Storage Target (OST) or Metadata Target (MDT) that is available to non-root users.	Display Name: Available bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: bytes. Labels: `component`: the target type (`ost`, `mdt`, or `mgt`); `target`: the name of the target.
`capacity_bytes`	The number of bytes provisioned for the given target. The total cluster usable data or metadata space for an instance can be obtained by adding the capacity of all targets for a given type of target.	Display Name: Capacity bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: bytes. Labels: `component`: the target type (`ost`, `mdt`, or `mgt`); `target`: the name of the target.
`free_bytes`	The number of bytes of storage space for a given OST or MDT that is available to root users.	Display Name: Free bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: bytes. Labels: `component`: the target type (`ost`, `mdt`, or `mgt`); `target`: the name of the target.
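
As the capacity_bytes description notes, the total usable space for a target type is the sum of the capacities of all targets of that type. The following Python sketch shows that aggregation over a list of samples. The sample dictionaries and their values are made up for illustration and do not represent the format returned by any Cloud Monitoring API.

```python
from collections import defaultdict


def total_by_component(samples: list[dict]) -> dict[str, int]:
    """Sum a per-target gauge (such as capacity_bytes) for each target type."""
    totals: dict[str, int] = defaultdict(int)
    for sample in samples:
        totals[sample["component"]] += sample["value"]
    return dict(totals)


# Hypothetical capacity_bytes samples: two OSTs and one MDT.
TIB = 1024 ** 4
capacity_samples = [
    {"component": "ost", "target": "filesys-OST0000", "value": 8 * TIB},
    {"component": "ost", "target": "filesys-OST0001", "value": 8 * TIB},
    {"component": "mdt", "target": "filesys-MDT0000", "value": 1 * TIB},
]

totals = total_by_component(capacity_samples)
print(totals["ost"] // TIB)  # 16
print(totals["mdt"] // TIB)  # 1
```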

Inode (object) metrics

Metrics related to the number of inodes (objects) available and the maximum capacity.

Inode metrics are attached to the lustre.googleapis.com/Instance resource.

Metric	Description	Details
`inodes_free`	The number of inodes (objects) available on the given target.	Display Name: Free inodes. Metric Kind: GAUGE. Value Type: INT64. Unit: inodes. Labels: `component`: the target type; `target`: the name of the target.
`inodes_maximum`	The maximum number of inodes (objects) the target can hold.	Display Name: Maximum inodes. Metric Kind: GAUGE. Value Type: INT64. Unit: inodes. Labels: `component`: the target type; `target`: the name of the target.

I/O performance metrics

Metrics providing insight into data transfer rates and operation latency.

I/O performance metrics are attached to the lustre.googleapis.com/Instance resource.

Metric	Description	Details
`io_time_milliseconds_total`	The number of read or write operations whose latency is within the bucketed latency ranges.	Display Name: Operation latency. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `component`: the target type; `operation`: the operation type; `size`: the bucketed latency range (for example, 512 includes the count of operations that took between 512 and 1024 milliseconds); `target`: the name of the target.
`read_bytes_total`	The number of data bytes read from the given OST.	Display Name: Data read bytes. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: bytes. Labels: `component`: the target type, always `ost`; `operation`: the operation type, `read`; `target`: the name of the target.
`read_samples_total`	The number of read operations performed on the given OST.	Display Name: Data read operations. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `component`: the target type, always `ost`; `operation`: the operation type, `read`; `target`: the name of the target.
`write_bytes_total`	The number of data bytes written to the given OST.	Display Name: Data write bytes. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: bytes. Labels: `component`: the target type, always `ost`; `operation`: the operation type, `write`; `target`: the name of the target.
`write_samples_total`	The number of write operations performed on the given OST.	Display Name: Data write operations. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `component`: the target type, always `ost`; `operation`: the operation type, `write`; `target`: the name of the target.
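
Because the byte and operation metrics above are CUMULATIVE counters, throughput and average I/O size come from the difference between two samples, not from a single value. The following Python sketch shows the calculation under simple assumptions: the function names and sample values are hypothetical, and a decreasing counter is treated as a reset and skipped.

```python
from typing import Optional


def rate_per_second(prev_value: int, curr_value: int,
                    interval_seconds: float) -> Optional[float]:
    """Return the per-second rate between two cumulative samples.

    Returns None when the counter decreased, which this sketch treats as a reset.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        return None
    return delta / interval_seconds


def average_io_size(bytes_delta: int, operations_delta: int) -> float:
    """Return the mean bytes per operation over an interval (0.0 if idle)."""
    if operations_delta == 0:
        return 0.0
    return bytes_delta / operations_delta


# Hypothetical read_bytes_total and read_samples_total samples taken 60 seconds apart.
prev_bytes, curr_bytes = 1_000_000_000, 7_291_456_000
prev_ops, curr_ops = 500, 6_500

print(rate_per_second(prev_bytes, curr_bytes, 60))                  # 104857600.0
print(average_io_size(curr_bytes - prev_bytes, curr_ops - prev_ops))  # 1048576.0
```

In this example the OST served 100 MiB/s of reads with an average read size of 1 MiB.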

Client connection metrics

Metrics specifically for understanding client connectivity.

Client connection metrics are attached to the lustre.googleapis.com/Instance resource.

Metric	Description	Details
`connected_clients`	The number of clients currently connected to the given MDT.	Display Name: Connected clients. Metric Kind: GAUGE. Value Type: INT64. Unit: clients. Labels: `component`: the target type, always `mdt`; `target`: the name of the MDT.

File system quota metrics

File system quota metrics allow you to monitor storage and inode consumption for specific users, groups, and projects. Use these metrics to track current usage against the soft and hard limits configured on your file system.

File system quota metrics are associated with the lustre.googleapis.com/QuotaEntity monitored resource.

Metric	Description	Details
`used_bytes`	The total number of bytes currently consumed by the user, group, or project.	Display Name: Quota used bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: Bytes. Labels: `accounting_type`: one of `user`, `group`, or `project`; `id`: the numeric ID of the user, group, or project; `target`: the name of the Lustre target device.
`soft_limit_bytes`	The storage consumption threshold that triggers a grace period. If usage remains above this limit after the grace period expires, this becomes an enforced hard limit.	Display Name: Quota soft limit bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: Bytes. Labels: `accounting_type`: one of `user`, `group`, or `project`; `id`: the numeric ID of the user, group, or project; `target`: the name of the Lustre target device.
`hard_limit_bytes`	The maximum storage usage allowed for the user, group, or project. Writes exceeding this limit are denied.	Display Name: Quota hard limit bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: Bytes. Labels: `accounting_type`: one of `user`, `group`, or `project`; `id`: the numeric ID of the user, group, or project; `target`: the name of the Lustre target device.
`used_inodes`	The total number of inodes (file records) currently consumed by the user, group, or project.	Display Name: Quota used inodes. Metric Kind: GAUGE. Value Type: INT64. Unit: Count. Labels: `accounting_type`: one of `user`, `group`, or `project`; `id`: the numeric ID of the user, group, or project; `target`: the name of the Lustre target device.
`soft_limit_inodes`	The inode consumption threshold that triggers a grace period. If usage remains above this limit after the grace period expires, this becomes an enforced hard limit.	Display Name: Quota soft limit inodes. Metric Kind: GAUGE. Value Type: INT64. Unit: Count. Labels: `accounting_type`: one of `user`, `group`, or `project`; `id`: the numeric ID of the user, group, or project; `target`: the name of the Lustre target device.
`hard_limit_inodes`	The maximum number of inodes allowed for the user, group, or project. File creation exceeding this limit is denied.	Display Name: Quota hard limit inodes. Metric Kind: GAUGE. Value Type: INT64. Unit: Count. Labels: `accounting_type`: one of `user`, `group`, or `project`; `id`: the numeric ID of the user, group, or project; `target`: the name of the Lustre target device.
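
The relationship between usage and the two limits can be sketched as a small classification function. This Python example is illustrative only: the function name and status strings are hypothetical, and it assumes that a limit value of 0 means no limit is configured, which this page does not state. The same logic applies to the inode metrics.

```python
def quota_status(used: int, soft_limit: int, hard_limit: int) -> str:
    """Classify usage against soft and hard quota limits.

    Assumption: a limit of 0 means that limit is not configured.
    """
    if hard_limit > 0 and used >= hard_limit:
        return "hard_limit_reached"  # further writes or creates would be denied
    if soft_limit > 0 and used > soft_limit:
        return "over_soft_limit"     # the grace period applies
    return "ok"


# Hypothetical user quota: 800 GiB soft limit, 1 TiB hard limit.
GIB = 1024 ** 3
print(quota_status(500 * GIB, 800 * GIB, 1024 * GIB))   # ok
print(quota_status(900 * GIB, 800 * GIB, 1024 * GIB))   # over_soft_limit
print(quota_status(1024 * GIB, 800 * GIB, 1024 * GIB))  # hard_limit_reached
```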

Jobstats metrics

Metrics providing read, write, and metadata statistics per JobID, as configured on the client.

To collect these metrics, use lctl to configure the jobid_var parameter on your Lustre clients. For more information, see Lustre Jobstats.

To configure the client to report a specific identifier (for example, procname_uid), use the lctl set_param jobid_var command:

lctl set_param jobid_var=procname_uid

Jobstats metrics are attached to the lustre.googleapis.com/Job resource.

Metric	Description	Details
`read_bytes_total`	The total number of bytes read by the job.	Display Name: Data read bytes by job. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: Bytes. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`write_bytes_total`	The total number of bytes written by the job.	Display Name: Data write bytes by job. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: Bytes. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`metadata_operations_total`	The total number of metadata operations performed by the job.	Display Name: Metadata operations by job. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`read_samples_total`	The total number of read operations performed by the job.	Display Name: Data read operations by job. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`write_samples_total`	The total number of write operations performed by the job.	Display Name: Data write operations by job. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`read_maximum_size_bytes`	The maximum size in bytes of read operations by the job.	Display Name: Data read maximum size by job. Metric Kind: GAUGE. Value Type: INT64. Unit: Bytes. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`read_minimum_size_bytes`	The minimum size in bytes of read operations by the job.	Display Name: Data read minimum size by job. Metric Kind: GAUGE. Value Type: INT64. Unit: Bytes. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`write_maximum_size_bytes`	The maximum size in bytes of write operations by the job.	Display Name: Data write maximum size by job. Metric Kind: GAUGE. Value Type: INT64. Unit: Bytes. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.
`write_minimum_size_bytes`	The minimum size in bytes of write operations by the job.	Display Name: Data write minimum size by job. Metric Kind: GAUGE. Value Type: INT64. Unit: Bytes. Labels: `job_id`: the JobID sent by the client; `component`: the target type; `target`: the name of the target; `instance_id`: the ID of the Managed Lustre instance.