The Dataproc Spark Connect client is a wrapper of the Apache Spark Connect client. It lets applications communicate with a remote Serverless for Apache Spark session using the Spark Connect protocol. This document shows you how to install, configure, and use the client.
Before you begin
- Ensure you have the Identity and Access Management roles that contain the permissions needed to manage interactive sessions and session templates.
- If you run the client outside of Google Cloud, provide authentication credentials. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file, as shown in the sketch after this list.
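If you set the variable from Python (for example, at the top of a notebook) rather than exporting it in your shell, set it before you create the Spark session. The following is a minimal sketch; the key file path is a placeholder:

import os

# Placeholder path; point this at your own service account key file.
# Set the variable before creating the Spark session so the client
# libraries can pick it up when they load credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"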
Install or uninstall the client
You can install or uninstall the dataproc-spark-connect package using pip.
Install
To install the latest version of the client, run the following command:
pip install -U dataproc-spark-connect
Uninstall
To uninstall the client, run the following command:
pip uninstall dataproc-spark-connect
Configure the client
Specify the project and region for your session. You can set these values using environment variables or by using the builder API in your code.
Environment variables
Set the GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION environment variables.
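As a minimal sketch, you can also set these variables from Python before creating the session; the project ID and region below are example values:

import os

# Example values; substitute your own project ID and region.
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# With both variables set, the builder needs no explicit project or location.
spark = DataprocSparkSession.builder.getOrCreate()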
Builder API
Use the .projectId() and .location() methods.
spark = (
    DataprocSparkSession.builder
    .projectId("my-project")
    .location("us-central1")
    .getOrCreate()
)
Start a Spark session
To start a Spark session, add the required imports to your PySpark application
or notebook, then call the DataprocSparkSession.builder.getOrCreate()
API.
- Import the DataprocSparkSession class.
- Call the getOrCreate() method to start the session.

  from google.cloud.dataproc_spark_connect import DataprocSparkSession

  spark = DataprocSparkSession.builder.getOrCreate()
Configure Spark properties
To configure Spark properties, chain one or more .config()
methods to the
builder.
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = (
    DataprocSparkSession.builder
    .config('spark.executor.memory', '48g')
    .config('spark.executor.cores', '8')
    .getOrCreate()
)
Use advanced configuration
For advanced configuration, use the Session
class to customize settings such
as the subnetwork or runtime version.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = 'SUBNET'
session_config.runtime_config.version = '3.0'

spark = (
    DataprocSparkSession.builder
    .projectId('my-project')
    .location('us-central1')
    .dataprocSessionConfig(session_config)
    .getOrCreate()
)
Reuse a named session
Named sessions let you share a single Spark session across multiple notebooks while avoiding repeated session startup-time delays.
- In your first notebook, create a session with a custom ID.

  from google.cloud.dataproc_spark_connect import DataprocSparkSession

  session_id = 'my-ml-pipeline-session'
  spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
  df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
  df.show()

- In another notebook, reuse the session by specifying the same session ID.

  from google.cloud.dataproc_spark_connect import DataprocSparkSession

  session_id = 'my-ml-pipeline-session'
  spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
  df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
  df.show()
Session IDs must be 4-63 characters long, start with a lowercase letter, and contain only lowercase letters, numbers, and hyphens. The ID cannot end with a hyphen. A session in a TERMINATED state cannot be reused.
Use Spark SQL magic commands
The package supports the sparksql-magic library, which lets you execute Spark SQL queries in Jupyter notebooks. Magic commands are an optional feature.
- Install the required dependencies.

  pip install IPython sparksql-magic

- Load the magic extension.

  %load_ext sparksql_magic

- Optional: configure default settings.

  %config SparkSql.limit=20

- Execute SQL queries.

  %%sparksql
  SELECT * FROM your_table
To use advanced options, add flags to the %%sparksql
command. For example, to
cache results and create a view, run the following command:
%%sparksql --cache --view result_view df
SELECT * FROM your_table WHERE condition = true
The following options are available:
- --cache or -c: caches the DataFrame.
- --eager or -e: caches with eager loading.
- --view VIEW or -v VIEW: creates a temporary view.
- --limit N or -l N: overrides the default row display limit.
- variable_name: stores the result in a variable (see the sketch after this list).
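Because the positional variable_name argument stores the query result, a later Python cell can work with it as an ordinary DataFrame. A minimal sketch, assuming the df variable from the %%sparksql example above:

# Assumes a previous %%sparksql cell stored its result in the variable `df`.
df.printSchema()  # inspect the schema of the query result
df.show(5)        # display the first five rows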
What's next
- Learn more about Dataproc sessions.

