This document describes the tools and files you can use to monitor and troubleshoot Serverless for Apache Spark batch workloads.
Troubleshoot workloads from the Google Cloud console
When a batch job fails or performs poorly, a recommended first step is to open its Batch details page from the Batches page in the Google Cloud console.
Use the Summary tab: your troubleshooting hub
The Summary tab, which is selected by default when the Batch details page opens, displays critical metrics and filtered logs to help you make a quick initial assessment of batch health. After this initial assessment, you can perform a deeper analysis using the more specialized tools linked on the Batch details page, such as the Spark UI, the Logs Explorer, and Gemini Cloud Assist.
Batch metric highlights
The Summary tab on the Batch details page includes charts that display important batch workload metric values. The metric charts populate after the batch workload completes, and offer a visual indication of potential issues such as resource contention, data skew, or memory pressure.
The following table lists the Spark workload metrics displayed on the Batch details page in the Google Cloud console, and describes how metric values can provide insight into workload status and performance.
Job logs
The Batch details page includes a Job logs section that lists warnings and errors filtered from the job (batch workload) logs. This feature allows for quick identification of critical issues without needing to manually parse through extensive log files. You can select a log Severity (for example, Error) from the drop-down menu and add a text Filter to narrow down the results. To perform a more in-depth analysis, click the View in Logs Explorer icon to open the selected batch logs in the Logs Explorer.

Example: The Logs Explorer opens after choosing Errors from the Severity selector on the Batch details page in the Google Cloud console.
Spark UI
The Spark UI collects Apache Spark execution details from Serverless for Apache Spark batch workloads. There is no charge for the Spark UI feature, which is enabled by default.
Data collected by the Spark UI feature is retained for 90 days. You can use this web interface to monitor and debug Spark workloads without having to create a Persistent History Server .
Required Identity and Access Management permissions and roles
The following permissions are required to use the Spark UI feature with batch workloads.
- Data collection permission: dataproc.batches.sparkApplicationWrite. This permission must be granted to the service account that runs batch workloads. This permission is included in the Dataproc Worker role, which is automatically granted to the Compute Engine default service account that Serverless for Apache Spark uses by default (see Serverless for Apache Spark service account). However, if you specify a custom service account for your batch workload, you must add the dataproc.batches.sparkApplicationWrite permission to that service account, typically by granting the service account the Dataproc Worker role (see the sample command after this list).
- Spark UI access permission: dataproc.batches.sparkApplicationRead. This permission must be granted to a user to access the Spark UI in the Google Cloud console. This permission is included in the Dataproc Viewer, Dataproc Editor, and Dataproc Administrator roles. To open the Spark UI in the Google Cloud console, you must have one of these roles or a custom role that includes this permission.
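For example, to grant the Dataproc Worker role to a custom service account that runs your batch workloads, you can use a gcloud CLI command similar to the following sketch; the project ID and service account name are placeholders:

# Grant the Dataproc Worker role (which includes
# dataproc.batches.sparkApplicationWrite) to a custom batch service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:CUSTOM_SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"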
Open the Spark UI
The Spark UI page is available in the Google Cloud console for batch workloads.
- Go to the Serverless for Apache Spark Batches page.
- Click a Batch ID to open the Batch details page.
- Click View Spark UI in the top menu.
The View Spark UI button is disabled in the following cases:
- If a required permission isn't granted
- If you clear the Enable Spark UI checkbox on the Batch details page
- If you set the spark.dataproc.appContext.enabled property to false when you submit a batch workload (see the example after this list)
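The following sketch shows how the last case might look when submitting a batch workload with the gcloud CLI; the Cloud Storage path is a placeholder and the property is the one named above:

# Submit a PySpark batch with Spark UI data collection disabled.
gcloud dataproc batches submit pyspark gs://BUCKET/my_job.py \
    --region=REGION \
    --properties=spark.dataproc.appContext.enabled=false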
AI-powered Investigations with Gemini Cloud Assist (Preview)
Overview
The Gemini Cloud Assist Investigations preview feature uses Gemini advanced capabilities to assist in troubleshooting Serverless for Apache Spark batch workloads. This feature analyzes failed and slow-running workloads to identify root causes and recommend fixes. It creates a persistent analysis that you can review, save, and share with Google Cloud support to facilitate collaboration and accelerate issue resolution.
Features
Use this feature to create investigations from the Google Cloud console:
- Add a natural language context description to an issue before creating an investigation.
- Analyze failed and slow batch workloads.
- Get insights into issue root causes with recommended fixes.
- Create Google Cloud support cases with the full investigation context attached.
Before you begin
To get started with the Investigations feature, enable the Gemini Cloud Assist API in your Google Cloud project.
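If you prefer the command line, the API can be enabled with the gcloud CLI. The following sketch assumes the Gemini Cloud Assist API is exposed as the cloudaicompanion.googleapis.com service; confirm the exact service name in the Gemini Cloud Assist documentation before running it:

# Enable the Gemini Cloud Assist API (service name assumed).
gcloud services enable cloudaicompanion.googleapis.com --project=PROJECT_ID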
Start an investigation
To start an investigation, do one of the following:
- Option 1: In the Google Cloud console, go to the Batches list page. For any batch with a Failed status, an INVESTIGATE button appears in the Insights by Gemini column. Click the button to start an investigation.
- Option 2: Open the Batch details page of the batch workload to investigate. For both Succeeded and Failed batch workloads, in the Health overview section of the Summary tab, an INVESTIGATE button appears in the Insights by Gemini panel. Click the button to start an investigation.
The investigation button text indicates the status of the investigation:
- INVESTIGATE: No investigation has been run for this batch. Click the button to start an investigation.
- VIEW INVESTIGATION: An investigation has been completed. Click the button to view the results.
- INVESTIGATING: An investigation is in progress.
Interpret investigation results
Once an investigation is complete, the Investigation details page opens. This page contains the full Gemini analysis, which is organized into the following sections:
- Issue: A collapsed section containing auto-populated details of the batch workload being investigated.
- Relevant Observations: A collapsed section that lists key data points and anomalies that Gemini found during its analysis of logs and metrics.
- Hypotheses: This is the primary section, which is expanded by default. It presents a list of potential root causes for the observed issue. Each hypothesis includes:
  - Overview: A description of the possible cause, such as "High Shuffle Write Time and Potential Task Skew."
  - Recommended Fixes: A list of actionable steps to address the potential issue.
Take action
After reviewing the hypotheses and recommendations:
- Apply one or more of the suggested fixes to the job configuration or code, and then rerun the job.
- Provide feedback on the helpfulness of the investigation by clicking the thumbs-up or thumbs-down icons at the top of the panel.
Review and escalate investigations
To review the results of a previously run investigation, click the investigation name on the Cloud Assist Investigations page to open the Investigation details page.
If further assistance is needed, you can open a Google Cloud support case. This process provides the support engineer with the complete context of the previously performed investigation, including the observations and hypotheses generated by Gemini. This context sharing significantly reduces the back-and-forth communication required with the support team, and leads to faster case resolution.
To create a support case from an investigation:
In the Investigation details page, click Request support.
Preview status and pricing
There is no charge for Gemini Cloud Assist investigations during public preview. Charges will apply to the feature when it becomes generally available (GA).
For more information about pricing after general availability, see Gemini Cloud Assist Pricing.
Ask Gemini Preview (Retiring on 22 September 2025)
The Ask Gemini preview feature provided one-click access to insights on the Batches and Batch details pages in the Google Cloud console through an Ask Gemini button. This function generated a summary of errors, anomalies, and potential performance improvements based on workload logs and metrics.
After the Ask Gemini preview retires on September 22, 2025, users can continue to obtain AI-powered assistance using the Gemini Cloud Assist Investigations feature.
Important: To ensure uninterrupted AI-powered troubleshooting assistance, enable Gemini Cloud Assist Investigations before September 22, 2025.
Serverless for Apache Spark logs
Logging is enabled by default in Serverless for Apache Spark, and workload logs persist after a workload finishes. Serverless for Apache Spark collects workload logs in Cloud Logging.
You can access Serverless for Apache Spark logs under the Cloud Dataproc Batch resource in the Logs Explorer.
Query Serverless for Apache Spark logs
The Logs Explorer in the Google Cloud console provides a query pane to help you build a query to examine batch workload logs. Here are steps you can follow to build the query:
- Verify that your current project is selected. You can click Refine scope Project to select a different project.
- Define a batch logs query.
  - Use filter menus to filter for a batch workload.
    - Under All resources, select the Cloud Dataproc Batch resource.
    - In the Select resource panel, select the batch LOCATION, then the BATCH ID. These batch parameters are listed on the Dataproc Batches page in the Google Cloud console.
    - Click Apply.
  - Under Select log names, enter dataproc.googleapis.com in the Search log names box to limit the log types to query. Select one or more of the listed log file names.
- Use the query editor to filter for VM-specific logs.
  - Specify the resource type and VM resource name, as shown in the following example:
    resource.type="cloud_dataproc_batch" labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCH_UUID-VM_SUFFIX"
  - BATCH_UUID: The batch UUID is listed on the Batch details page in the Google Cloud console, which opens when you click the Batch ID on the Batches page. The batch logs also list the batch UUID in the VM resource name (for example, in a batch driver.log).
Click Run query.
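As an alternative to the console, you can run a roughly equivalent query with the gcloud CLI. The following sketch reads recent error-level entries for a batch workload; the severity clause is optional and the placeholders match the conventions used in the console query:

# Read error-level log entries for a specific batch workload.
gcloud logging read \
    'resource.type="cloud_dataproc_batch"
     resource.labels.location="REGION"
     resource.labels.batch_id="BATCH_ID"
     severity>=ERROR' \
    --project=PROJECT_ID \
    --limit=50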
Serverless for Apache Spark log types and sample queries
The following list describes different Serverless for Apache Spark log types and provides sample Logs Explorer queries for each log type.
- dataproc.googleapis.com/output: This log file contains batch workload output. Serverless for Apache Spark streams batch output to the output namespace, and sets the filename to JOB_ID.driver.log.
  Sample Logs Explorer query for output logs:
  resource.type="cloud_dataproc_batch" resource.labels.location="REGION" resource.labels.batch_id="BATCH_ID" logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Foutput"
- dataproc.googleapis.com/spark: The spark namespace aggregates Spark logs for daemons and executors running on Dataproc cluster master and worker VMs. Each log entry includes a master, worker, or executor component label to identify the log source, as follows:
  - executor: Logs from user-code executors. Typically, these are distributed logs.
  - master: Logs from the Spark standalone resource manager master, which are similar to Dataproc on Compute Engine YARN ResourceManager logs.
  - worker: Logs from the Spark standalone resource manager worker, which are similar to Dataproc on Compute Engine YARN NodeManager logs.
  Sample Logs Explorer query for all logs in the spark namespace:
  resource.type="cloud_dataproc_batch" resource.labels.location="REGION" resource.labels.batch_id="BATCH_ID" logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fspark"
  Sample Logs Explorer query for Spark standalone component logs in the spark namespace:
  resource.type="cloud_dataproc_batch" resource.labels.location="REGION" resource.labels.batch_id="BATCH_ID" logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fspark" jsonPayload.component="COMPONENT"
- dataproc.googleapis.com/startup: The startup namespace includes the batch (cluster) startup logs. Any initialization script logs are included. Components are identified by label, for example:
  startup-script[855]: ... activate-component-spark[3050]: ... enable spark-worker
  Sample Logs Explorer query for startup logs generated by a specified VM:
  resource.type="cloud_dataproc_batch" resource.labels.location="REGION" resource.labels.batch_id="BATCH_ID" logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fstartup" labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCH_UUID-VM_SUFFIX"
- dataproc.googleapis.com/agent: The agent namespace aggregates Dataproc agent logs. Each log entry includes a filename label that identifies the log source.
  Sample Logs Explorer query for agent logs generated by a specified worker VM:
  resource.type="cloud_dataproc_batch" resource.labels.location="REGION" resource.labels.batch_id="BATCH_ID" logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fagent" labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCH_UUID-wWORKER_#"
- dataproc.googleapis.com/autoscaler: The autoscaler namespace aggregates Serverless for Apache Spark autoscaler logs.
  Sample Logs Explorer query for autoscaler logs generated by a specified worker VM:
  resource.type="cloud_dataproc_batch" resource.labels.location="REGION" resource.labels.batch_id="BATCH_ID" logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fautoscaler" labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCH_UUID-wWORKER_#"
For more information, see Dataproc logs.
For information on Serverless for Apache Spark audit logs, see Dataproc audit logging.
Workload metrics
Serverless for Apache Spark provides batch and Spark metrics that you can view from the Metrics Explorer or the Batch details page in the Google Cloud console.
Batch metrics
Dataproc batch resource metrics provide insight into batch resources, such as the number of batch executors. Batch metrics are prefixed with dataproc.googleapis.com/batch.
Spark metrics
By default, Serverless for Apache Spark enables the collection of available Spark metrics, unless you use Spark metrics collection properties to disable or override the collection of one or more Spark metrics.
Available Spark metrics include Spark driver and executor metrics, and system metrics. Available Spark metrics are prefixed with custom.googleapis.com/.
Set up metric alerts
You can create Dataproc metric alerts to receive notice of workload issues.
Create charts
You can create charts that visualize workload metrics by using the Metrics Explorer in the Google Cloud console. For example, you can create a chart to display disk:bytes_used, and then filter by batch_id.
Cloud Monitoring
Monitoring uses workload metadata and metrics to provide insights into the health and performance of Serverless for Apache Spark workloads. Workload metrics include Spark metrics, batch metrics, and operation metrics.
You can use Cloud Monitoring in the Google Cloud console to explore metrics, add charts, create dashboards, and create alerts.
Create dashboards
You can create a dashboard to monitor workloads using metrics from multiple projects and different Google Cloud products. For more information, see Create and manage custom dashboards .
Persistent History Server
Serverless for Apache Spark creates the compute resources that are needed to run a workload, runs the workload on those resources, and then deletes the resources when the workload finishes. Workload metrics and events don't persist after a workload completes. However, you can use a Persistent History Server (PHS) to retain workload application history (event logs) in Cloud Storage.
To use a PHS with a batch workload, do the following:
- Specify your PHS when you submit a workload (see the example commands after this list).
- Use the Component Gateway to connect to the PHS to view application details, scheduler stages, task-level details, and environment and executor information.
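The following commands sketch the first step under stated assumptions: the cluster name, bucket, and job file are placeholders, and the Spark event log directory property follows the pattern described in the Persistent History Server documentation.

# One time: create a single-node PHS cluster that reads Spark event logs
# from a Cloud Storage bucket.
gcloud dataproc clusters create PHS_CLUSTER_NAME \
    --region=REGION \
    --single-node \
    --enable-component-gateway \
    --properties='spark:spark.history.fs.logDirectory=gs://BUCKET/*/spark-job-history'

# Submit a batch workload that specifies the PHS.
gcloud dataproc batches submit pyspark gs://BUCKET/my_job.py \
    --region=REGION \
    --history-server-cluster=projects/PROJECT_ID/regions/REGION/clusters/PHS_CLUSTER_NAME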
Autotuning
- Enable autotuning for Serverless for Apache Spark: You can enable autotuning for Serverless for Apache Spark when you submit each recurring Spark batch workload using the Google Cloud console, gcloud CLI, or the Dataproc API.
Console
Perform the following steps to enable autotuning on each recurring Spark batch workload:
- In the Google Cloud console, go to the Dataproc Batches page.
- To create a batch workload, click Create.
- In the Container section, fill in the Cohort name, which identifies the batch as one of a series of recurring workloads. Gemini-assisted analysis is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
- Fill in other sections of the Create batch page as needed, then click Submit. For more information, see Submit a batch workload.
gcloud
Run the following gcloud CLI gcloud dataproc batches submit command locally in a terminal window or in Cloud Shell to enable autotuning on each recurring Spark batch workload:

gcloud dataproc batches submit COMMAND \
    --region=REGION \
    --cohort=COHORT \
    other arguments ...
Replace the following:
- COMMAND: the Spark workload type, such as Spark, PySpark, Spark-Sql, or Spark-R.
- REGION: the region where your workload will run.
- COHORT: the cohort name, which identifies the batch as one of a series of recurring workloads. Gemini-assisted analysis is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query (a complete sample command follows this list).
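For example, a scheduled daily TPC-H workload might be submitted with a command like the following; the Cloud Storage path is a placeholder:

gcloud dataproc batches submit pyspark gs://BUCKET/tpch_query1.py \
    --region=us-central1 \
    --cohort=TPCH-Query1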
API
Include the RuntimeConfig.cohort name in a batches.create request to enable autotuning on each recurring Spark batch workload. Autotuning is applied to the second and subsequent workloads submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
Example:
...
runtimeConfig:
  cohort: TPCH-Query1
...
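For reference, a minimal batches.create request that sets the cohort might look like the following curl sketch; the PySpark main file URI is a placeholder and other batch fields are omitted:

# Create a batch with a cohort name using the Dataproc REST API.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "pysparkBatch": {
            "mainPythonFileUri": "gs://BUCKET/tpch_query1.py"
          },
          "runtimeConfig": {
            "cohort": "TPCH-Query1"
          }
        }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"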