By default, Cloud Data Fusion encrypts customer content at rest. Cloud Data Fusion handles encryption for you without any additional actions on your part. This option is called Google default encryption.
If you want to control your encryption keys, then you can use customer-managed encryption keys (CMEKs) in Cloud KMS with CMEK-integrated services including Cloud Data Fusion. Using Cloud KMS keys gives you control over their protection level, location, rotation schedule, usage and access permissions, and cryptographic boundaries. Using Cloud KMS also lets you track key usage, view audit logs, and control key lifecycles. Instead of Google owning and managing the symmetric key encryption keys (KEKs) that protect your data, you control and manage these keys in Cloud KMS.
After you set up your resources with CMEKs, the experience of accessing your Cloud Data Fusion resources is similar to using Google default encryption. For more information about your encryption options, see Customer-managed encryption keys (CMEK).
Cloud Data Fusion supports Cloud KMS key usage tracking for the Instance resource.
CMEK lets you control the data that's written to Google internal resources in tenant projects and data written by Cloud Data Fusion pipelines, including the following:
- Pipeline logs and metadata
- Dataproc cluster metadata
- Various Cloud Storage, BigQuery, Pub/Sub, and Spanner data sinks, actions, and sources
Cloud Data Fusion resources
For a list of Cloud Data Fusion plugins that support CMEK, see the supported plugins.
Cloud Data Fusion supports CMEK for Dataproc clusters. Cloud Data Fusion creates a temporary Dataproc cluster for use in the pipeline, and then deletes the cluster when the pipeline completes. CMEK protects the cluster metadata written to the following:
- Persistent disks (PD) attached to cluster VMs
- Job driver output and other metadata written to the auto-created or user-created Dataproc staging bucket
Set up CMEK
Create a Cloud KMS key
Create a Cloud KMS key in the Google Cloud project that contains the Cloud Data Fusion instance or in a separate user project. The Cloud KMS key ring location must match the region where you create the instance. A multi-region or global region key isn't allowed at the instance level because Cloud Data Fusion is always associated with a particular region.
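As a minimal sketch of this step, the following shell function creates a key ring and a key in the same region as the instance, so the locations match by construction. The function name, the `run` dry-run helper, and the sample values are hypothetical. By default the commands are only printed; executing them assumes an installed and authenticated gcloud CLI.

```shell
# Dry-run helper (hypothetical): prints commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create a key ring and key in the instance's region.
# $1 = instance region, $2 = key ring name, $3 = key name
create_cmek_key() {
  # A global key is not allowed at the instance level, so reject it early.
  if [ "$1" = "global" ]; then
    echo "error: the key location must be a single region, not global" >&2
    return 1
  fi
  run gcloud kms keyrings create "$2" --location="$1"
  run gcloud kms keys create "$3" --keyring="$2" --location="$1" --purpose=encryption
}

# Sample values for illustration only.
create_cmek_key us-east1 my-key-ring my-key
```

If the key lives in a separate project, the gcloud `--project` flag can be added to both commands.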
Get the resource name for the key
REST API
The resource name of the key that you created has the following format:
projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
Replace the following:
- PROJECT_ID: the project that contains the Cloud KMS key, which is either the customer project that hosts the Cloud Data Fusion instance or a separate user project
- REGION: the Google Cloud region of the key ring, which must match the region of the instance, for example, us-east1
- KEY_RING_NAME: the name of the key ring that groups the cryptographic keys together
- KEY_NAME: the Cloud KMS key name
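The format above can be assembled from its four parts with a small helper. This is a sketch only; the function name and the sample values are hypothetical.

```shell
# Assemble a Cloud KMS key resource name from its parts.
# $1 = project ID, $2 = region, $3 = key ring name, $4 = key name
kms_key_resource_name() {
  printf 'projects/%s/locations/%s/keyRings/%s/cryptoKeys/%s\n' "$1" "$2" "$3" "$4"
}

# Sample values for illustration only.
kms_key_resource_name my-project us-east1 my-key-ring my-key
```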
Console
- Go to the Key management page.
- Next to your key, click More.
- Select Copy Resource Name to copy the resource name to the clipboard.
Update your project's service accounts to use the key
To set up your project's service accounts to use your key:
- Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role (roles/cloudkms.cryptoKeyEncrypterDecrypter) to the Cloud Data Fusion service agent (see Granting roles to a service account for specific resources). This account is in the following format:
  service-PROJECT_NUMBER@gcp-sa-datafusion.iam.gserviceaccount.com
  Granting the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Data Fusion service agent enables Cloud Data Fusion to use CMEK to encrypt any customer data stored in tenant projects.
- Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent (see Assigning a Cloud KMS key to a Cloud Storage service account). This account, which by default is granted the Compute Engine Service Agent role, is of the form:
  service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com
  Granting the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent enables Cloud Data Fusion to use CMEK to encrypt persistent disk (PD) metadata written by the Dataproc cluster running in your pipeline.
- Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service agent (see Assigning a Cloud KMS key to a Cloud Storage service agent). This service agent is of the form:
  service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com
  Granting this role to the Cloud Storage service agent enables Cloud Data Fusion to use CMEK to encrypt the Cloud Storage bucket that stores and caches pipeline information, the data written to the Dataproc cluster staging bucket, and any other Cloud Storage buckets in your project that your pipeline uses.
- Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataproc service agent. This service agent is of the form:
  service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com
- Optional: If your pipeline uses BigQuery resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the BigQuery service account (see Grant encryption and decryption permission). This account is of the form:
  bq-PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com
- Optional: If your pipeline uses Pub/Sub resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Pub/Sub service account (see Using customer-managed encryption keys). This account is of the form:
  service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com
- Optional: If your pipeline uses Spanner resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Spanner service account. This account is of the form:
  service-PROJECT_NUMBER@gcp-sa-spanner.iam.gserviceaccount.com
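The four required grants above could be scripted roughly as follows. This is a sketch under stated assumptions: the function names, the `run` dry-run helper, and the sample values are hypothetical, and only the required service agents are covered. By default the gcloud commands are printed rather than executed.

```shell
# Dry-run helper (hypothetical): prints commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print the four required service agent addresses for a project number.
# $1 = project number
required_service_agents() {
  printf 'service-%s@gcp-sa-datafusion.iam.gserviceaccount.com\n' "$1"
  printf 'service-%s@compute-system.iam.gserviceaccount.com\n' "$1"
  printf 'service-%s@gs-project-accounts.iam.gserviceaccount.com\n' "$1"
  printf 'service-%s@dataproc-accounts.iam.gserviceaccount.com\n' "$1"
}

# Grant the Encrypter/Decrypter role on one key to each required agent.
# $1 = project number, $2 = key name, $3 = key ring name, $4 = region
grant_cmek_role() {
  for agent in $(required_service_agents "$1"); do
    run gcloud kms keys add-iam-policy-binding "$2" \
      --keyring="$3" --location="$4" \
      --member="serviceAccount:$agent" \
      --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"
  done
}

# Sample values for illustration only.
grant_cmek_role 123456789012 my-key my-key-ring us-east1
```

The optional BigQuery, Pub/Sub, and Spanner accounts can be granted the same role with the same command, using the account forms listed above.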
Create a Cloud Data Fusion instance with CMEK
CMEK is available in all editions of Cloud Data Fusion version 6.5.0 and later.
REST API
- To create an instance with a customer-managed encryption key, set the following environment variables:
  export PROJECT=PROJECT_ID
  export LOCATION=REGION
  export INSTANCE=INSTANCE_ID
  export DATA_FUSION_API_NAME=datafusion.googleapis.com
  export KEY=KEY_NAME
  Replace the following:
  - PROJECT_ID: the customer project that hosts the Cloud Data Fusion instance
  - REGION: the Google Cloud region for the instance, which must match the location of the key ring, for example, us-east1
  - INSTANCE_ID: the name of the Cloud Data Fusion instance
  - KEY_NAME: the full resource name of the CMEK key
- Run the following command to create a Cloud Data Fusion instance:
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances?instance_id=$INSTANCE" \
    -d '{"description": "CMEK-enabled CDF instance created through REST.", "type": "BASIC", "cryptoKeyConfig": {"key_reference": "'"$KEY"'"}}'
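Because the request embeds several shell variables, it can help to assemble the URL and JSON body separately and inspect them before sending. The following sketch does that. The function names and all variable values are hypothetical placeholders, while the URL shape and body fields are taken from the command above.

```shell
# Sample values for illustration only; substitute your own.
PROJECT="my-project"
LOCATION="us-east1"
INSTANCE="my-instance"
DATA_FUSION_API_NAME="datafusion.googleapis.com"
KEY="projects/my-project/locations/us-east1/keyRings/my-key-ring/cryptoKeys/my-key"

# Build the request URL from the environment variables.
instance_create_url() {
  printf 'https://%s/v1/projects/%s/locations/%s/instances?instance_id=%s\n' \
    "$DATA_FUSION_API_NAME" "$PROJECT" "$LOCATION" "$INSTANCE"
}

# Build the JSON body. printf substitutes the key, so no shell variable
# has to be expanded inside a single-quoted string.
instance_create_body() {
  printf '{"description": "CMEK-enabled CDF instance created through REST.", "type": "BASIC", "cryptoKeyConfig": {"key_reference": "%s"}}\n' "$KEY"
}

# Print both for inspection.
instance_create_url
instance_create_body
```

Once the printed values look right, they can be passed to curl as `"$(instance_create_url)"` and `-d "$(instance_create_body)"`.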
Console
- Go to the Cloud Data Fusion page.
- Click Instances, and then click Create an instance.
- In the Advanced options, select Use a customer-managed encryption key (CMEK).
- In the Select a customer-managed key field, select the resource name for the key.
- After you enter all of the instance details, click Create. When the instance is ready to use, it appears on the Instances page.
Check if CMEK is enabled on an instance
Console
View the instance details:
- In the Google Cloud console, go to the Cloud Data Fusion page.
- Click Instances, and then click the instance's name to go to the Instance details page.
If CMEK is enabled, the Encryption key field is shown as Available.
If CMEK is disabled, the Encryption key field is shown as Not available.
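Outside the console, a similar check could be scripted against a JSON description of the instance. The sketch below assumes that such a description includes a `cryptoKeyConfig` block when CMEK is set, mirroring the field sent at creation time. The function name and the sample JSON are hypothetical, and the check is a plain text match rather than a JSON parse.

```shell
# Reads an instance description (JSON) on stdin and reports whether it
# contains a cryptoKeyConfig block. Assumption: the block is present only
# when the instance was created with a customer-managed key.
cmek_status() {
  if grep -q '"cryptoKeyConfig"'; then
    echo "CMEK enabled"
  else
    echo "CMEK not enabled"
  fi
}

# Abbreviated sample input standing in for a real description.
printf '%s\n' '{"name": "my-instance", "cryptoKeyConfig": {"keyReference": "projects/p/locations/us-east1/keyRings/r/cryptoKeys/k"}}' | cmek_status
```

One way to obtain a description to pipe in, assuming the v1 endpoint used for instance creation also serves GET requests for a single instance, is to request `https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances/$INSTANCE` with the same Authorization header.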
Use CMEK with supported plugins
When you set the encryption key name, use the following form:
projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
The following table describes the behavior of the key in the Cloud Data Fusion plugins that support CMEK.
Cloud Storage Copy
Cloud Storage Move
Cloud Storage Done File Marker
Use CMEK with Dataproc cluster metadata
The pre-created compute profiles use the CMEK key provided during instance creation to encrypt the Persistent Disk (PD) and the staging bucket metadata written by the Dataproc cluster running in your pipeline. To use a different key, do one of the following:
- Recommended: Create a new Dataproc compute profile (Enterprise edition only).
- Edit an existing Dataproc compute profile (Developer, Basic, or Enterprise editions).
Console
- Open the Cloud Data Fusion instance:
  - In the Google Cloud console, go to the Cloud Data Fusion page.
  - To open the instance in the Cloud Data Fusion Studio, click Instances, and then click View instance.
- Click System Admin > Configuration.
- Click the System Compute Profiles drop-down.
- Click Create New Profile, and select Dataproc.
- Enter a Profile label, Profile name, and Description.
- By default, Dataproc creates staging and temp buckets whenever Cloud Data Fusion creates an ephemeral cluster. Cloud Data Fusion supports passing the Dataproc staging bucket as an argument in the compute profile. To encrypt the staging bucket, create a CMEK-enabled bucket and pass it as an argument to Dataproc in the compute profile.
- By default, Cloud Data Fusion auto-creates a Cloud Storage bucket to stage dependencies used by Dataproc. If you prefer to use a Cloud Storage bucket that already exists in your project, follow these steps:
  - In the General Settings section, enter your existing Cloud Storage bucket in the Cloud Storage Bucket field.
- Get the resource ID of your Cloud KMS key. In the General Settings section, enter your resource ID in the Encryption Key Name field.
- Click Create.
- If more than one profile is listed in the System Compute Profiles section of the Configuration tab, make the new Dataproc profile the default profile by holding the pointer over the profile name field and clicking the star that appears.
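The CMEK-enabled staging bucket mentioned in the steps above could be created ahead of time with the gcloud CLI. This is a sketch only: the function name, the `run` dry-run helper, and the sample values are hypothetical, and the command is printed rather than executed unless `DRY_RUN=0` is set.

```shell
# Dry-run helper (hypothetical): prints commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create a bucket whose default encryption key is the given Cloud KMS key.
# $1 = bucket name, $2 = region, $3 = full key resource name
create_cmek_staging_bucket() {
  run gcloud storage buckets create "gs://$1" \
    --location="$2" \
    --default-encryption-key="$3"
}

# Sample values for illustration only.
create_cmek_staging_bucket my-staging-bucket us-east1 \
  projects/my-project/locations/us-east1/keyRings/my-key-ring/cryptoKeys/my-key
```

The bucket name can then be entered in the compute profile as described in the steps above. The Cloud Storage service agent needs the Encrypter/Decrypter role on the key for writes to the bucket to succeed.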
Use CMEK with other resources
The provided CMEK key is set to the system preference during Cloud Data Fusion instance creation. It is used to encrypt data written to newly created resources by pipeline sinks such as Cloud Storage, BigQuery, Pub/Sub, or Spanner sinks.
This key only applies to newly created resources. If the resource already exists before pipeline execution, you should manually apply the CMEK key to those existing resources.
You can change the CMEK key by doing one of the following:
- Use a runtime argument.
- Set a Cloud Data Fusion system preference.
Runtime argument
- On the Cloud Data Fusion Pipeline Studio page, click the drop-down arrow to the right of the Run button.
- In the Name field, enter gcp.cmek.key.name.
- In the Value field, enter your key's resource ID.
- Click Save.
The runtime argument you set here applies only to runs of the current pipeline.
Preference
- In the Cloud Data Fusion UI, click SYSTEM ADMIN.
- Click the Configuration tab.
- Click the System Preferences drop-down.
- Click Edit System Preferences.
- In the Key field, enter gcp.cmek.key.name.
- In the Value field, enter your key's resource ID.
- Click Save & Close.