Create a Managed Service for Apache Spark-enabled instance

This page describes how to create a Managed Service for Apache Spark-enabled Vertex AI Workbench instance. It also describes the benefits of the Managed Service for Apache Spark JupyterLab extension and provides an overview of how to use the extension with Managed Service for Apache Spark and Managed Service for Apache Spark on Compute Engine.

Overview of the Managed Service for Apache Spark JupyterLab extension

As of version M113, Vertex AI Workbench instances have the Managed Service for Apache Spark JupyterLab extension preinstalled.

The Managed Service for Apache Spark JupyterLab extension provides two ways to run Apache Spark notebook jobs: on a Managed Service for Apache Spark cluster, or serverless with Managed Service for Apache Spark.

  • Managed Service for Apache Spark clusters include a rich set of features with control over the infrastructure that Spark runs on. You choose the size and configuration of your Spark cluster, allowing for customization and control over your environment. This approach is ideal for complex workloads, long-running jobs, and fine-grained resource management.
  • Managed Service for Apache Spark eliminates infrastructure concerns. You submit your Spark jobs, and Google handles the provisioning, scaling, and optimization of resources behind the scenes. This serverless approach offers a cost-efficient option for data science and ML workloads.

With both options, you can use Spark for data processing and analysis. The choice between Managed Service for Apache Spark clusters and Managed Service for Apache Spark depends on your specific workload requirements, required level of control, and resource usage patterns.

Benefits of using Managed Service for Apache Spark for data science and ML workloads include:

  • No cluster management: You don't need to worry about provisioning, configuring, or managing Spark clusters. This saves you time and resources.
  • Autoscaling: Managed Service for Apache Spark automatically scales up and down based on the workload, so you only pay for the resources you use.
  • High performance: Managed Service for Apache Spark is optimized for performance and takes advantage of Google Cloud's infrastructure.
  • Integration with other Google Cloud technologies: Managed Service for Apache Spark integrates with other Google Cloud products, such as BigQuery and Dataplex Universal Catalog.

For more information, see the Managed Service for Apache Spark documentation.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Enable the Cloud Resource Manager, Managed Service for Apache Spark, and Notebooks APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

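You can also enable the required APIs from the gcloud CLI. The following is a sketch, assuming the Managed Service for Apache Spark API is served under the dataproc.googleapis.com endpoint (verify the service names for your project):

```shell
# Enable the Cloud Resource Manager, Managed Service for Apache Spark,
# and Notebooks APIs in the current project.
# Assumption: the Spark service is exposed as dataproc.googleapis.com.
gcloud services enable \
    cloudresourcemanager.googleapis.com \
    dataproc.googleapis.com \
    notebooks.googleapis.com
```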

Required roles

To ensure that the service account has the necessary permissions to run a notebook file on a Managed Service for Apache Spark cluster or a serverless Managed Service for Apache Spark session, ask your administrator to grant the following IAM roles to the service account:

  • Dataproc Worker (roles/dataproc.worker) on your project
  • Dataproc Editor (roles/dataproc.editor) on the cluster, for the dataproc.clusters.use permission

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run a notebook file on a Managed Service for Apache Spark cluster or a serverless Managed Service for Apache Spark session. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run a notebook file on a Managed Service for Apache Spark cluster or a serverless Managed Service for Apache Spark session:

  • dataproc.agents.create
  • dataproc.agents.delete
  • dataproc.agents.get
  • dataproc.agents.update
  • dataproc.tasks.lease
  • dataproc.tasks.listInvalidatedLeases
  • dataproc.tasks.reportStatus
  • dataproc.clusters.use

Your administrator might also be able to give the service account these permissions with custom roles or other predefined roles.
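One way for an administrator to grant these roles from the command line is sketched below; PROJECT_ID, SA_EMAIL, CLUSTER_NAME, and REGION are hypothetical placeholders, and the cluster-level binding is only needed if you use a cluster rather than serverless sessions:

```shell
# Hypothetical placeholders; substitute your own values.
PROJECT_ID=example-project
SA_EMAIL="sa-name@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant Dataproc Worker at the project level.
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/dataproc.worker"

# Grant Dataproc Editor on a specific cluster (for dataproc.clusters.use).
gcloud dataproc clusters add-iam-policy-binding CLUSTER_NAME \
    --region=REGION \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/dataproc.editor"
```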

Create an instance with Managed Service for Apache Spark enabled

To create a Vertex AI Workbench instance with Managed Service for Apache Spark enabled, do the following:

  1. In the Google Cloud console, go to the Instances page.

    Go to Instances

  2. Click Create new.

  3. In the New instance dialog, click Advanced options.

  4. In the Create instance dialog, in the Details section, make sure Enable Dataproc Serverless Interactive Sessions is selected.

  5. Make sure Workbench type is set to Instance.

  6. In the Environment section, make sure you use the latest version or a version numbered M113 or higher.

  7. Click Create.

    Vertex AI Workbench creates an instance and automatically starts it. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link.

Open JupyterLab

Next to your instance's name, click Open JupyterLab.

The JupyterLab Launcher tab opens in your browser. By default, it contains sections for Managed Service for Apache Spark Notebooks and Managed Service for Apache Spark Jobs and Sessions. If there are Jupyter-ready clusters in the selected project and region, the tab also contains a section called Managed Service for Apache Spark Cluster Notebooks.

Use the extension with Managed Service for Apache Spark

Managed Service for Apache Spark runtime templates that are in the same region and project as your Vertex AI Workbench instance appear in the Managed Service for Apache Spark Notebooks section of the JupyterLab Launcher tab.

To create a runtime template, see Create a Managed Service for Apache Spark runtime template.

To open a new serverless Spark notebook, click a runtime template. It takes about a minute for the remote Spark kernel to start. After the kernel starts, you can start coding.

Use the extension with Managed Service for Apache Spark on Compute Engine

If you created a Managed Service for Apache Spark on Compute Engine Jupyter cluster, the Launcher tab has a Managed Service for Apache Spark Cluster Notebooks section.

Four cards appear for each Jupyter-ready Managed Service for Apache Spark cluster that you have access to in that region and project.

To change the region and project, do the following:

  1. Select Settings > Cloud Managed Service for Apache Spark Settings.

  2. On the Setup Config tab, under Project Info, change the Project ID and Region, and then click Save.

    These changes don't take effect until you restart JupyterLab.

  3. To restart JupyterLab, select File > Shut Down, and then click Open JupyterLab on the Vertex AI Workbench instances page.

To create a new notebook, click a card. After the remote kernel on the Managed Service for Apache Spark cluster starts, you can start writing your code and then run it on your cluster.

Manage Managed Service for Apache Spark on an instance using the gcloud CLI and the API

This section describes ways to manage Managed Service for Apache Spark on a Vertex AI Workbench instance.

Change the region of your Managed Service for Apache Spark cluster

Your Vertex AI Workbench instance's default kernels, such as Python and TensorFlow, are local kernels that run in the instance's VM. On a Managed Service for Apache Spark-enabled Vertex AI Workbench instance, your notebook runs on a Managed Service for Apache Spark cluster through a remote kernel. The remote kernel runs on a service outside of your instance's VM, which lets you access any Managed Service for Apache Spark cluster within the same project.

By default, Vertex AI Workbench uses Managed Service for Apache Spark clusters in the same region as your instance. You can change the Managed Service for Apache Spark region as long as the Component Gateway and the optional Jupyter component are enabled on the cluster.

Test access

The Managed Service for Apache Spark JupyterLab extension is enabled by default for Vertex AI Workbench instances. To test access to Managed Service for Apache Spark, you can check access to your instance's remote kernels by sending the following curl request to the kernels.googleusercontent.com domain:

curl --verbose \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://PROJECT_ID-dot-REGION.kernels.googleusercontent.com/api/kernelspecs | jq .

If the curl command fails, make sure that:

  1. Your DNS entries are configured correctly.

  2. A cluster is available in the same project (if one doesn't exist, create it).

  3. Your cluster has both the Component Gateway and the optional Jupyter component enabled.
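You can check the last two conditions from the command line. This sketch assumes the gcloud CLI is installed and uses CLUSTER_NAME and REGION as placeholders:

```shell
# List clusters in the project and region (confirms one exists).
gcloud dataproc clusters list --region=REGION

# Inspect a cluster's configuration. Component Gateway is enabled when
# endpointConfig.enableHttpPortAccess is true, and JUPYTER should appear
# under softwareConfig.optionalComponents.
gcloud dataproc clusters describe CLUSTER_NAME --region=REGION \
    --format="yaml(config.endpointConfig.enableHttpPortAccess,config.softwareConfig.optionalComponents)"
```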

Turn off Managed Service for Apache Spark

Vertex AI Workbench instances are created with Managed Service for Apache Spark enabled by default. You can create a Vertex AI Workbench instance with Managed Service for Apache Spark turned off by setting the disable-mixer metadata key to true.

gcloud workbench instances create INSTANCE_NAME \
    --metadata=disable-mixer=true

Enable Managed Service for Apache Spark

You can enable Managed Service for Apache Spark on a stopped Vertex AI Workbench instance by updating the metadata value.

gcloud workbench instances update INSTANCE_NAME \
    --metadata=disable-mixer=false

Manage Managed Service for Apache Spark using Terraform

For Vertex AI Workbench instances managed with Terraform, Managed Service for Apache Spark is controlled by the disable-mixer key in the metadata field. To turn on Managed Service for Apache Spark, set the disable-mixer metadata key to false. To turn it off, set the disable-mixer metadata key to true.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands .

 resource "google_workbench_instance" "default" {
  name     = "workbench-instance-example"
  location = "us-central1-a"

  gce_setup {
    machine_type = "n1-standard-1"
    vm_image {
      project = "cloud-notebooks-managed"
      family  = "workbench-instances"
    }
    metadata = {
      disable-mixer = "false"
    }
  }
} 

Troubleshoot

To diagnose and resolve issues related to creating a Managed Service for Apache Spark-enabled instance, see Troubleshooting Vertex AI Workbench.
