Use the Dataproc JupyterLab plugin for serverless batch and interactive notebook sessions

Dataproc Serverless limitations and considerations

  • Spark jobs are executed with the service account identity, not the submitting user's identity.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Enable the Dataproc API.

    Enable the API

  4. Install the Google Cloud CLI.
  5. To initialize the gcloud CLI, run the following command:

    gcloud init

Install the Dataproc JupyterLab plugin

You can install and use the Dataproc JupyterLab plugin on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance.

To install the plugin, follow these steps:

  1. Make sure that Python 3.8+ is installed on your machine. You can download and install Python from python.org/downloads.

    1. Verify the Python 3.8+ installation.

       python3 --version 
      
  2. Install JupyterLab 3.6.3+ on your machine.

     pip3 install --upgrade jupyterlab 
    
    1. Verify the JupyterLab 3.6.3+ installation.

       pip3 show jupyterlab 
      
  3. Install the Dataproc JupyterLab plugin.

     pip3 install dataproc-jupyter-plugin 
    
    1. If your JupyterLab version is earlier than 4.0.0, enable the plugin extension.

       jupyter server extension enable dataproc_jupyter_plugin 
      
  4. Start JupyterLab.

     jupyter lab 
    
    1. The JupyterLab Launcher page opens in your browser. It contains a Dataproc Jobs and Sessions section. It can also contain Dataproc Serverless Notebooks and Dataproc Cluster Notebooks sections if you have access to Dataproc serverless notebooks or Dataproc clusters with the Jupyter optional component running in your project.

    2. By default, your Dataproc Serverless for Spark Interactive session runs in the project and region you set when you ran gcloud init in Before you begin. You can change the project and region settings for your sessions from the JupyterLab Settings > Dataproc Settings page.
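
The version requirements in the installation steps can also be checked from Python. The following is a minimal sketch and is not part of the plugin: the helper names parse_version and needs_manual_enable are hypothetical, and the parsing assumes plain dotted version strings such as 3.6.3 or 4.0.0 (a pre-release suffix is ignored).

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a 3-tuple."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # Missing components (for example "4.1") are treated as 0.
    return tuple(int(part) for part in match.groups("0"))


def needs_manual_enable(jupyterlab_version):
    """True when JupyterLab is earlier than 4.0.0, so the extension must be enabled by hand."""
    return parse_version(jupyterlab_version) < (4, 0, 0)


def check_environment():
    """Print whether this interpreter and the installed JupyterLab meet the stated minimums."""
    print("Python 3.8+:", sys.version_info >= (3, 8))
    try:
        lab = metadata.version("jupyterlab")
    except metadata.PackageNotFoundError:
        print("JupyterLab is not installed")
        return
    print("JupyterLab 3.6.3+:", parse_version(lab) >= (3, 6, 3))
    print("Manual 'jupyter server extension enable' needed:", needs_manual_enable(lab))


if __name__ == "__main__":
    check_environment()
```

Comparing tuples of integers avoids the string-comparison trap in which "3.10" sorts before "3.8".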

Create a Dataproc Serverless runtime template

Dataproc Serverless runtime templates (also called session templates) contain configuration settings for executing Spark code in a session. You can create and manage runtime templates using JupyterLab or the gcloud CLI.

JupyterLab

  1. Click the New runtime template card in the Dataproc Serverless Notebooks section on the JupyterLab Launcher page.

  2. Fill in the Runtime template form.

  3. Specify a Display name and Description, and then input or confirm the other settings.

    Notes:

    • Network Configuration: The subnetwork must have Private Google Access enabled and must allow subnet communication on all ports (see Dataproc Serverless for Spark network configuration).

      If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:

      • Enable it for Private Google Access, or
      • Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.
    • Metastore: To use a Dataproc Metastore service in your sessions, select the metastore project ID, region, and service.

    • Max idle time: The maximum notebook idle time before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).

    • Max session time: The maximum lifetime of a session before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).

    • PHS: You can select an available Persistent Spark History Server to allow you to access session logs during and after sessions.

    • Spark properties: Click Add Property for each property to set for your serverless Spark sessions. See Spark properties for a listing of supported and unsupported Spark properties, including Spark runtime, resource, and autoscaling properties.

    • Labels: Click Add Label for each label to set on your serverless Spark sessions.

  4. Click Save.

  5. Open and reload the JupyterLab Launcher page to view the saved notebook template card.

  6. View your runtime templates from the Settings > Dataproc Settings page.

    • You can delete a template from the Action menu for the template.
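
The Max idle time and Max session time limits (10 minutes to 336 hours) can be checked before a value is entered. The sketch below is illustrative only: the helper name to_duration_string is hypothetical, and the seconds-with-s-suffix output assumes the same format as the idleTtl and ttl values (for example 3600s) used in the gcloud YAML templates in this document.

```python
from datetime import timedelta

# Allowed range stated for Max idle time and Max session time.
MIN_TTL = timedelta(minutes=10)
MAX_TTL = timedelta(hours=336)  # 14 days


def to_duration_string(value):
    """Validate a timedelta against the allowed range and format it as '<seconds>s'."""
    if not MIN_TTL <= value <= MAX_TTL:
        raise ValueError(
            f"{value} is outside the allowed range {MIN_TTL} to {MAX_TTL}"
        )
    return f"{int(value.total_seconds())}s"


if __name__ == "__main__":
    print(to_duration_string(timedelta(hours=1)))  # idle time of one hour
    print(to_duration_string(timedelta(hours=4)))  # session lifetime of four hours
```

Using timedelta keeps the unit explicit, so a value typed in hours is not mistaken for seconds.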

gcloud

  1. Create a YAML file with your runtime template configuration.

    Simple YAML

    environmentConfig:
      executionConfig:
        networkUri: default
    jupyterSession:
      kernel: PYTHON
      displayName: Team A
    labels:
      purpose: testing
    description: Team A Development Environment

    Complex YAML

    environmentConfig:
      executionConfig:
        serviceAccount: sa1
        # Choose either networkUri or subnetworkUri
        networkUri: default
        subnetworkUri: subnet
        networkTags:
         - tag1
        kmsKey: key1
        idleTtl: 3600s
        ttl: 14400s
        stagingBucket: staging-bucket
      peripheralsConfig:
        metastoreService: projects/my-project-id/locations/us-central1/services/my-metastore-id
        sparkHistoryServerConfig:
          dataprocCluster: projects/my-project-id/regions/us-central1/clusters/my-cluster-id
    jupyterSession:
      kernel: PYTHON
      displayName: Team A
    labels:
      purpose: testing
    runtimeConfig:
      version: "1.1"
      containerImage: gcr.io/my-project-id/my-image:1.0.1
      properties:
        "p1": "v1"
    description: Team A Development Environment

    If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:

    • Enable it for Private Google Access, or
    • Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.
  2. Create a session (runtime) template from your YAML file by running the following gcloud beta dataproc session-templates import command locally or in Cloud Shell:

    gcloud beta dataproc session-templates import TEMPLATE_ID \
        --source=YAML_FILE \
        --project=PROJECT_ID \
        --location=REGION
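
If you script template creation, the import command above can be assembled as an argument list rather than a shell string. This is a minimal sketch: the function name build_import_command and the sample values are hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def build_import_command(template_id, yaml_file, project_id, region):
    """Return the gcloud argument list for importing a session template."""
    return [
        "gcloud", "beta", "dataproc", "session-templates", "import", template_id,
        f"--source={yaml_file}",
        f"--project={project_id}",
        f"--location={region}",
    ]


if __name__ == "__main__":
    # Sample values are placeholders, not real resources.
    cmd = build_import_command("team-a", "template.yaml", "my-project-id", "us-central1")
    # Print a shell-safe rendering for review before running it.
    print(shlex.join(cmd))
```

To execute the command, pass the list to subprocess.run(cmd, check=True). Passing a list avoids shell quoting problems when a file name contains spaces.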
    

Launch and manage notebooks

After installing the Dataproc JupyterLab plugin, you can click template cards on the JupyterLab Launcher page to launch a Jupyter notebook on Dataproc Serverless or on a Dataproc on Compute Engine cluster.

Launch a Jupyter notebook on Dataproc Serverless

The Dataproc Serverless Notebooks section on the JupyterLab Launcher page displays notebook template cards that map to Dataproc Serverless runtime templates (see Create a Dataproc Serverless runtime template).

  1. Click a card to create a Dataproc Serverless session and launch a notebook. When session creation is complete and the notebook kernel is ready to use, the kernel status changes from Unknown to Idle.

  2. Write and test notebook code.

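    1. Copy and paste the following PySpark Pi estimation code into a PySpark notebook cell, then press Shift+Return to run the code.

      import random

      def inside(p):
          # p is the element passed in by filter(); it is not used.
          # Pick a random point in the unit square and test whether it
          # falls inside the quarter circle of radius 1.
          x, y = random.random(), random.random()
          return x*x + y*y < 1

      # sc is the SparkContext that this code expects the PySpark kernel to provide.
      count = sc.parallelize(range(0, 10000)).filter(inside).count()
      print("Pi is roughly %f" % (4.0 * count / 10000))

      The cell prints a line that begins with "Pi is roughly", followed by the estimate.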

  3. After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.

    • If you don't terminate the session, Dataproc terminates the session when the session idle timer expires. You can configure the session idle time in the runtime template configuration. The default session idle time is one hour.
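
The Pi estimation in step 2 relies on a simple fact: a random point in the unit square falls inside the quarter circle of radius 1 with probability pi/4, so four times the observed fraction approximates pi. The following plain-Python sketch shows the same calculation without Spark, which can help when checking the logic locally. The function name estimate_pi and its seed parameter are hypothetical and do not appear in the notebook code.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi by sampling points in the unit square."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        # Same test as inside() in the notebook code.
        if x * x + y * y < 1:
            hits += 1
    return 4.0 * hits / samples


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi(10000))
```

The Spark version distributes the same per-point test across executors with parallelize and filter, then counts the hits.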

Launch a notebook on a Dataproc on Compute Engine cluster

If you created a Dataproc on Compute Engine Jupyter cluster, the JupyterLab Launcher page contains a Dataproc Cluster Notebook section with pre-installed kernel cards.

To launch a Jupyter notebook on your Dataproc on Compute Engine cluster:

  1. Click a card in the Dataproc Cluster Notebook section.

  2. When the kernel status changes from Unknown to Idle, you can start writing and executing notebook code.

  3. After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.

Manage input and output files in Cloud Storage

Analyzing exploratory data and building ML models often involves file-based inputs and outputs. Dataproc Serverless accesses these files on Cloud Storage.

  • To access the Cloud Storage browser, click the Cloud Storage browser icon in the JupyterLab Launcher page sidebar, then double-click a folder to view its contents.

  • You can click Jupyter-supported file types to open and edit them. When you save changes to the files, they are written to Cloud Storage.

  • To create a new Cloud Storage folder, click the new folder icon, then enter the name of the folder.

  • To upload files into a Cloud Storage bucket or a folder, click the upload icon, then select the files to upload.

Develop Spark notebook code

After installing the Dataproc JupyterLab plugin, you can launch Jupyter notebooks from the JupyterLab Launcher page to develop application code.

PySpark and Python code development

Dataproc Serverless and Dataproc on Compute Engine clusters support PySpark kernels. Dataproc on Compute Engine also supports Python kernels.

SQL code development

Click the PySpark kernel card in the Dataproc Serverless Notebooks or Dataproc Cluster Notebook section of the JupyterLab Launcher page to open a PySpark notebook to write and execute SQL code.

Spark SQL magic: Since the PySpark kernel that launches Dataproc Serverless Notebooks is preloaded with Spark SQL magic, instead of using spark.sql('SQL STATEMENT').show() to wrap your SQL statement, you can type %%sparksql magic at the top of a cell, then type your SQL statement in the cell.

BigQuery SQL: The BigQuery Spark connector allows your notebook code to load data from BigQuery tables, perform analysis in Spark, and then write the results to a BigQuery table.

The Dataproc Serverless 2.1 runtime includes the BigQuery Spark connector. If you use the Dataproc Serverless 2.0 or earlier runtime to launch Dataproc Serverless notebooks, you can install Spark BigQuery Connector by adding the following Spark property to your Dataproc Serverless runtime template:

spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar
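
The load, analyze, and write flow that the connector enables might look like the following in a PySpark notebook. This is a hedged sketch rather than the documented method: the function name word_counts_to_bigquery is hypothetical, the table names and bucket are placeholders, the word and word_count columns are an assumption about the source table's schema, and the write uses the connector's temporaryGcsBucket option, which assumes you have a Cloud Storage bucket available for staging.

```python
def word_counts_to_bigquery(spark, source_table, dest_table, temp_bucket):
    """Read a BigQuery table, aggregate it in Spark, and write the result back.

    spark is the SparkSession available in the notebook; the column names
    below are assumptions about the source table's schema.
    """
    # Load the source table through the BigQuery Spark connector.
    words = spark.read.format("bigquery").option("table", source_table).load()

    # Perform the analysis in Spark: total occurrences per word.
    counts = words.groupBy("word").sum("word_count")

    # Write the result to BigQuery, staging data in a Cloud Storage bucket.
    (
        counts.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)
        .mode("overwrite")
        .save(dest_table)
    )
    return counts
```

In a notebook cell, you would call the function with the session object, for example word_counts_to_bigquery(spark, "SOURCE_TABLE", "DATASET.TABLE", "BUCKET"), replacing the placeholders with your own resources.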

Scala code development

Dataproc on Compute Engine clusters created with image versions 2.0 and later include Apache Toree, a Scala kernel for the Jupyter Notebook platform that provides interactive access to Spark.

  • Click the Apache Toree card in the Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a notebook for Scala code development.

If a Dataproc Metastore (DPMS) instance is attached to a Dataproc Serverless runtime template or a Dataproc on Compute Engine cluster, the DPMS instance schema is displayed in the JupyterLab Metadata Explorer when a notebook is opened. DPMS is a fully managed and horizontally scalable Hive Metastore (HMS) service on Google Cloud.

To view HMS metadata, open the JupyterLab Metadata Explorer by clicking its icon in the sidebar.

You can search for a database, table, or column in the Metadata Explorer. Click a database, table, or column name to view the associated metadata.

Deploy your code

After installing the Dataproc JupyterLab plugin, you can use JupyterLab to:

  • Execute your notebook code on the Dataproc Serverless infrastructure.

  • Submit batch jobs to the Dataproc Serverless infrastructure or to your Dataproc on Compute Engine cluster.

Run notebook code on Dataproc Serverless

  • Click the Run icon or press the Shift-Return keys to run code in a notebook cell.

  • Use the Run menu to run code in one or more notebook cells.

Submit a batch job to Dataproc Serverless

  • Click the Serverless card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

  • Click the Batch tab, then click Create Batch and fill in the Batch Info fields.

  • Click Submit to submit the job.

Submit a batch job to a Dataproc on Compute Engine cluster

  • Click the Clusters card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

  • Click the Jobs tab, then click Submit Job.

  • Select a Cluster, then fill in the Job fields.

  • Click Submit to submit the job.

View and manage resources

After installing the Dataproc JupyterLab plugin, you can view and manage Dataproc Serverless and Dataproc on Compute Engine resources from the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

Click the Dataproc Jobs and Sessions section to show the Clusters and Serverless cards.

To view and manage Dataproc Serverless sessions:

  1. Click the Serverless card.
  2. Click the Sessions tab, then click a session ID to open the Session details page, where you can view session properties, view Google Cloud logs in Logs Explorer, and terminate the session. Note: A unique Dataproc Serverless session is created to launch each Dataproc Serverless notebook.

To view and manage Dataproc Serverless batches:

  1. Click the Batches tab to view the list of Dataproc Serverless batches in the current project and region. Click a batch ID to view batch details.

To view and manage Dataproc on Compute Engine clusters:

  1. Click the Clusters card. The Clusters tab is selected to list active Dataproc on Compute Engine clusters in the current project and region. You can click icons in the Actions column to start, stop, or restart a cluster. Click a cluster name to view cluster details.

To view and manage Dataproc on Compute Engine jobs:

  1. Click the Jobs card to view the list of jobs in the current project. Click a job ID to view job details. You can click icons in the Actions column to clone, stop, or delete a job.