Connect to Apache Spark

As a BigQuery administrator, you can create a connection to enable data analysts to run stored procedures for Apache Spark.

Before you begin

Location considerations

When you choose a location for your data, consider the following:

Multi-regions

You must specify Google Cloud resources located in the same large geographic area:

  • A connection in the BigQuery US multi-region can reference a Spark History Server or a Dataproc Metastore in any single region in the US geographic area, such as us-central1, us-east4, or us-west2.

  • A connection in the BigQuery EU multi-region can reference a Spark History Server or a Dataproc Metastore in member states of the European Union, such as europe-north1 or europe-west3.

Single regions

A connection in a single region can only reference Google Cloud resources in the same region. For example, a connection in the single region us-east4 can only reference a Spark History Server or a Dataproc Metastore in us-east4.

Create connections

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, click Add data.

    The Add data dialog opens.

  3. In the Filter By pane, in the Data Source Type section, select Databases.

    Alternatively, in the Search for data sources field, you can enter Spark.

  4. In the Featured data sources section, click Apache Spark.

  5. Click the Apache Spark: BigQuery Federation solution card.

  6. In the External data source pane, enter the following information:

    • In the Connection type list, select Apache Spark.

    • In the Connection ID field, enter a name for your connection, for example, spark_connection.

    • In the Data location list, select a region.

    You can create a connection in regions and multi-regions that support BigQuery. For more information, see Location considerations.

  7. Click Create connection.

  8. Click Go to connection.

  9. In the Connection info pane, copy the service account ID for use in a later step.

bq

  1. In a command-line environment, use the bq mk command to create a connection:

    bq mk --connection --connection_type='SPARK' \
        --properties=PROPERTIES \
        --project_id=PROJECT_ID \
        --location=LOCATION \
        CONNECTION_ID

    Replace the following:

    • PROPERTIES: a key-value pair to provide connection-specific parameters in JSON format

      For example:

      --properties='{
        "metastoreServiceConfig": {"metastoreService": "METASTORE_SERVICE_NAME"},
        "sparkHistoryServerConfig": {"dataprocCluster": "DATAPROC_CLUSTER_NAME"}
      }'

      Replace the following:

      • METASTORE_SERVICE_NAME: the name of your Dataproc Metastore service
      • DATAPROC_CLUSTER_NAME: the name of the Dataproc cluster that hosts your Spark History Server

    • PROJECT_ID: your Google Cloud project ID

    • LOCATION: the location where you want to store the connection, for example, US

    • CONNECTION_ID: the connection ID, for example, myconnection

      When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/.../locations/.../connections/myconnection
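
    For example, with hypothetical names (project my-project, a Dataproc Metastore my-metastore in us-central1, and a Persistent History Server cluster my-phs; the resource-name formats shown follow standard Dataproc conventions and are assumptions, not values from this guide), the assembled command might look like the following:

    bq mk --connection --connection_type='SPARK' \
        --properties='{"metastoreServiceConfig": {"metastoreService": "projects/my-project/locations/us-central1/services/my-metastore"}, "sparkHistoryServerConfig": {"dataprocCluster": "projects/my-project/regions/us-central1/clusters/my-phs"}}' \
        --project_id=my-project \
        --location=US \
        spark_connection

    Note that the connection location (US) and the referenced resources (us-central1) satisfy the multi-region rule described in Location considerations.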

  2. Retrieve and copy the service account ID because you need it in another step:

    bq show --location=LOCATION \
        --connection PROJECT_ID.LOCATION.CONNECTION_ID

    The output is similar to the following:

    Connection myproject.us.myconnection

               name             type                          properties
    --------------------------- ------- -------------------------------------------------------------------
    myproject.us.myconnection   SPARK   {"serviceAccountId": "bqserver@example.iam.gserviceaccount.com"}
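
    If you script this step, you can request JSON output from bq and extract the ID with a tool such as jq. This is a sketch only; the .spark.serviceAccountId field path is an assumption and may differ:

    # Hypothetical extraction of the service account ID (field path assumed)
    bq show --format=json --location=LOCATION \
        --connection PROJECT_ID.LOCATION.CONNECTION_ID | jq -r '.spark.serviceAccountId'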

For information about how to manage connections, see Manage connections.

Grant access to the service account

To let a stored procedure for Apache Spark access your Google Cloud resources, you must grant the necessary IAM permissions to the service account that's associated with the stored procedure's connection. Alternatively, you can use a custom service account for data access.

  • To read and write data from and to BigQuery, you need to give the service account the following IAM permissions:

    • bigquery.tables.* on your BigQuery tables
    • bigquery.readsessions.* on your project

    The roles/bigquery.admin IAM role includes the permissions that the service account needs in order to read and write data from and to BigQuery.

  • To read and write data from and to Cloud Storage, you need to give the service account the storage.objects.* permission on your Cloud Storage objects.

    The roles/storage.objectAdmin IAM role includes the permissions that the service account needs in order to read and write data from and to Cloud Storage.

  • If you specify Dataproc Metastore when you create a connection, then for BigQuery to retrieve details about the metastore configuration, you need to give the service account the metastore.services.get permission on your Dataproc Metastore.

    The predefined roles/metastore.metadataViewer role includes the permission that the service account needs in order to retrieve details about the metastore configuration.

    You also need to grant the service account the roles/storage.objectAdmin role on the Cloud Storage bucket so that your stored procedure can access the Hive warehouse directory of your Dataproc Metastore (hive.metastore.warehouse.dir). If your stored procedure performs operations on the metastore, you might need to grant additional permissions. For more information about IAM roles and permissions in Dataproc Metastore, see Dataproc Metastore predefined roles and permissions.

  • If you specify a Dataproc Persistent History Server when you create a connection, then you need to grant the service account the following roles:

    • The roles/dataproc.viewer role, which contains the dataproc.clusters.get permission, on your Dataproc Persistent History Server.
    • The roles/storage.objectAdmin role on the Cloud Storage bucket that you specify for the spark:spark.history.fs.logDirectory property when you create the Dataproc Persistent History Server.

    For more information, see Dataproc Persistent History Server and Dataproc roles and permissions.
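
As a minimal sketch of these grants, assuming hypothetical names (project my-project, bucket gs://my-spark-bucket, metastore my-metastore in us-central1, and the connection's service account bqserver@example.iam.gserviceaccount.com from the earlier output), the gcloud commands might look like the following:

# Grant BigQuery read and write access at the project level.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:bqserver@example.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"

# Grant object read and write access on the Cloud Storage bucket.
gcloud storage buckets add-iam-policy-binding gs://my-spark-bucket \
    --member="serviceAccount:bqserver@example.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

# If the connection specifies a Dataproc Metastore, allow reading its configuration.
gcloud metastore services add-iam-policy-binding my-metastore \
    --location=us-central1 \
    --member="serviceAccount:bqserver@example.iam.gserviceaccount.com" \
    --role="roles/metastore.metadataViewer"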

You can grant the following roles to let users query data and manage connections:

  • roles/bigquery.connectionUser: enables users to use connections to connect with external data sources and run queries on them.

  • roles/bigquery.connectionAdmin: enables users to manage connections.

For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.
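
For example, here is a hedged sketch of granting the connection user role at the project level, with hypothetical project and group names (the following sections show how to share an individual connection instead):

gcloud projects add-iam-policy-binding my-project \
    --member="group:example-analyst-group@google.com" \
    --role="roles/bigquery.connectionUser"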

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

    Connections are listed in your project, in a group called External connections.

  2. In the Explorer pane, click your project name > External connections > connection.

  3. In the Details pane, click Share to share a connection. Then do the following:

    1. In the Connection permissions dialog, share the connection with other principals by adding or editing principals.

    2. Click Save.

bq

You cannot share a connection by using the bq command-line tool. To share a connection, use the Google Cloud console or the BigQuery Connections API.

API

Use the projects.locations.connections.setIamPolicy method in the BigQuery Connections REST API reference section, and supply an instance of the policy resource.
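
As a minimal sketch of that call, assuming gcloud credentials and placeholder resource values, a curl request might look like this:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://bigqueryconnection.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/connections/CONNECTION_ID:setIamPolicy" \
    -d '{
      "policy": {
        "bindings": [
          {
            "role": "roles/bigquery.connectionUser",
            "members": ["group:example-analyst-group@google.com"]
          }
        ]
      }
    }'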

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
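
For local development, one common way to set up Application Default Credentials is the gcloud CLI:

gcloud auth application-default login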

import com.google.api.resourcenames.ResourceName;
import com.google.cloud.bigquery.connection.v1.ConnectionName;
import com.google.cloud.bigqueryconnection.v1.ConnectionServiceClient;
import com.google.iam.v1.Binding;
import com.google.iam.v1.Policy;
import com.google.iam.v1.SetIamPolicyRequest;
import java.io.IOException;

// Sample to share connections
public class ShareConnection {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String location = "MY_LOCATION";
    String connectionId = "MY_CONNECTION_ID";
    shareConnection(projectId, location, connectionId);
  }

  static void shareConnection(String projectId, String location, String connectionId)
      throws IOException {
    try (ConnectionServiceClient client = ConnectionServiceClient.create()) {
      ResourceName resource = ConnectionName.of(projectId, location, connectionId);
      Binding binding =
          Binding.newBuilder()
              .addMembers("group:example-analyst-group@google.com")
              .setRole("roles/bigquery.connectionUser")
              .build();
      Policy policy = Policy.newBuilder().addBindings(binding).build();
      SetIamPolicyRequest request =
          SetIamPolicyRequest.newBuilder()
              .setResource(resource.toString())
              .setPolicy(policy)
              .build();
      client.setIamPolicy(request);
      System.out.println("Connection shared successfully");
    }
  }
}
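
Note that this sample builds a new Policy that contains only one binding, and setIamPolicy replaces the connection's existing policy. Code that must preserve existing grants typically calls getIamPolicy first, adds the binding to the returned policy, and then calls setIamPolicy with the merged result.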
 

What's next
