Dataproc optional Delta Lake component

You can install additional components like Delta Lake when you create a Dataproc cluster using the Optional components feature. This page describes how to install the Delta Lake component on a Dataproc cluster.

When installed on a Dataproc cluster, the Delta Lake component installs Delta Lake libraries and configures Spark and Hive in the cluster to work with Delta Lake.

Compatible Dataproc image versions

You can install the Delta Lake component on Dataproc clusters created with image version 2.2.46 and later.

See Supported Dataproc versions for the Delta Lake component version included in Dataproc image releases.

When you create a Dataproc cluster with the Delta Lake component enabled, the following Spark properties are configured to work with Delta Lake.

Config file                          Property                         Default value
/etc/spark/conf/spark-defaults.conf  spark.sql.extensions             io.delta.sql.DeltaSparkSessionExtension
/etc/spark/conf/spark-defaults.conf  spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog
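
To confirm these settings on a running cluster, you can check them from an SSH session on the cluster's master node. The following one-liner is a minimal sketch that assumes the spark-sql CLI included on Dataproc images:

 # Print the configured value of spark.sql.extensions.
 spark-sql -e "SET spark.sql.extensions;"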

Install the component

Install the component when you create a Dataproc cluster using the Google Cloud console, Google Cloud CLI, or the Dataproc API.

Console

  1. In the Google Cloud console, go to the Dataproc Create a cluster page.

    Go to Create a cluster

    The Set up cluster panel is selected.

  2. In the Components section, under Optional components, select Delta Lake and other optional components to install on your cluster.

gcloud CLI

To create a Dataproc cluster that includes the Delta Lake component, use the gcloud dataproc clusters create command with the --optional-components flag.

gcloud dataproc clusters create CLUSTER_NAME \
    --optional-components=DELTA \
    --region=REGION \
    ... other flags

Notes:

  • CLUSTER_NAME: Specify the name of the cluster.
  • REGION: Specify a Compute Engine region where the cluster will be located.
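
For example, the following command creates a cluster with the Delta Lake component enabled. The cluster name, region, and image version shown here are illustrative placeholders, not required values:

 gcloud dataproc clusters create my-delta-cluster \
     --optional-components=DELTA \
     --region=us-central1 \
     --image-version=2.2-debian12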

REST API

The Delta Lake component can be specified through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
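
As a minimal sketch, the following request enables the component through the REST API; PROJECT_ID and REGION are placeholders, and the request body shows only the softwareConfig fields relevant to Delta Lake:

 curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -d '{
           "clusterName": "my-delta-cluster",
           "config": {
             "softwareConfig": {
               "optionalComponents": ["DELTA"]
             }
           }
         }' \
     "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters"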

Usage examples

This section provides data read and write examples using Delta Lake tables.

Delta Lake table

Write to a Delta Lake table

You can use a Spark DataFrame to write data to a Delta Lake table. The following examples create a DataFrame with sample data, create a my_delta_table Delta Lake table in Cloud Storage, and then write the data to the Delta Lake table.

PySpark

 # Create a DataFrame with sample data.
 data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

 # Create a Delta Lake table in Cloud Storage.
 spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
   id integer,
   name string)
 USING delta
 LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")

 # Write the DataFrame to the Delta Lake table in Cloud Storage.
 data.writeTo("my_delta_table").append()

Scala

 // Create a DataFrame with sample data.
 val data = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

 // Create a Delta Lake table in Cloud Storage.
 spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
   id integer,
   name string)
 USING delta
 LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")

 // Write the DataFrame to the Delta Lake table in Cloud Storage.
 data.write.format("delta").mode("append").saveAsTable("my_delta_table")

Spark SQL

 CREATE TABLE IF NOT EXISTS my_delta_table (
   id integer,
   name string)
 USING delta
 LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

 INSERT INTO my_delta_table VALUES (1, 'Alice'), (2, 'Bob');

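To run these write snippets on the cluster, you can open an interactive pyspark, spark-shell, or spark-sql session from an SSH connection to the master node, or submit them as a Dataproc job. The following sketch submits a hypothetical PySpark file containing the snippet above; the file name, cluster name, and region are placeholders:

 # Submit a hypothetical PySpark script containing the write example.
 gcloud dataproc jobs submit pyspark write_delta.py \
     --cluster=my-delta-cluster \
     --region=us-central1
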
Read from a Delta Lake table

The following examples read the my_delta_table table and display its contents.

PySpark

 # Read the Delta Lake table into a DataFrame.
 df = spark.table("my_delta_table")

 # Display the data.
 df.show()

Scala

  // Read the Delta Lake table into a DataFrame. 
 val 
  
 df 
  
 = 
  
 spark 
 . 
 table 
 ( 
 "my_delta_table" 
 ) 
 // Display the data. 
 df 
 . 
 show 
 () 
 

Spark SQL

 SELECT * FROM my_delta_table;

Hive with Delta Lake

Write to a Delta Lake table in Hive

The Dataproc Delta Lake optional component is pre-configured to work with Hive external tables.

For more information, see the Hive connector.

Run the examples in a beeline client.

 beeline -u jdbc:hive2://

Create a Spark Delta Lake table

The Delta Lake table must be created using Spark before a Hive external table can reference it. Run the following statements in a Spark SQL session (for example, the spark-sql shell), not in beeline, because they use Spark SQL syntax.

 CREATE TABLE IF NOT EXISTS my_delta_table (
   id integer,
   name string)
 USING delta
 LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

 INSERT INTO my_delta_table VALUES (1, 'Alice'), (2, 'Bob');

Create a Hive external table

 SET hive.input.format=io.delta.hive.HiveInputFormat;
 SET hive.tez.input.format=io.delta.hive.HiveInputFormat;

 CREATE EXTERNAL TABLE deltaTable(id INT, name STRING)
 STORED BY 'io.delta.hive.DeltaStorageHandler'
 LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

Notes:

  • The io.delta.hive.DeltaStorageHandler class implements the Hive data source APIs. It can load a Delta table and extract its metadata. If the table schema in the CREATE TABLE statement is not consistent with the underlying Delta Lake metadata, an error is thrown.

Read from a Delta Lake table in Hive

To read data from a Delta table, use a SELECT statement:

 SELECT * FROM deltaTable;

Drop a Delta Lake table

To drop a Delta table, use the DROP TABLE statement. Because deltaTable is an external table, this removes only the table definition from Hive; the underlying Delta Lake files in Cloud Storage are not deleted:

 DROP TABLE deltaTable;