The Dataproc provisioner in Cloud Data Fusion calls the Dataproc API to create and delete clusters in your Google Cloud projects. You can configure the clusters in the provisioner's settings.
For more information about compatibility between Cloud Data Fusion versions and Dataproc versions, see Version compatibility.
Properties
The service account key provided to the provisioner must have permission to access the Dataproc and Compute Engine APIs. Because your account key is sensitive, we recommend that you provide the account key using Secure Storage.
After you create the secure key, you can add it to a namespace or a system compute profile. For a namespace compute profile, click the shield and select the secure key. For a system compute profile, enter the name of the key in the Secure Account Key field.
The number of master nodes in the cluster. These nodes contain the YARN Resource Manager, HDFS NameNode, and all drivers. Must be set to 1 or 3.
Default is 1.
The type of master machine to use. Select one of the following machine types:
- n1
- n2
- n2d
- e2
In Cloud Data Fusion version 6.7.2 and later, the default is e2.
In version 6.7.1, the default is n2.
In version 6.7.0 and earlier, the default is n1.
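The version rule above can be sketched as a small lookup. This is only an illustration of the three cases stated here; the function name `default_machine_type` is hypothetical and is not part of Cloud Data Fusion.

```python
def default_machine_type(version: str) -> str:
    """Return the default machine series for a Cloud Data Fusion version.

    Hypothetical helper: it only encodes the three cases listed above.
    """
    # Turn "6.7.1" into (6, 7, 1); pad short versions such as "6.7" with zeros.
    parts = tuple(int(p) for p in version.split("."))
    parts = (parts + (0, 0, 0))[:3]

    if parts >= (6, 7, 2):
        return "e2"  # 6.7.2 and later
    if parts == (6, 7, 1):
        return "n2"  # 6.7.1 only
    return "n1"      # 6.7.0 and earlier


if __name__ == "__main__":
    for v in ("6.5.1", "6.7.0", "6.7.1", "6.7.2", "6.9.2"):
        print(v, default_machine_type(v))
```

Comparing version tuples rather than strings avoids misordering versions such as 6.10.0 and 6.7.2.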
Number of virtual cores allocated to a master node.
Default is 2.
The amount of memory, in gigabytes, allocated to a master node.
Default is 8 GB.
Disk size, in gigabytes, allocated to a master node.
Default is 1000 GB.
Type of boot disk for a master node:
- Standard Persistent Disk
- SSD Persistent Disk
Default is Standard Persistent Disk.
The type of worker machine to use. Select one of the following machine types:
- n1
- n2
- n2d
- e2
In Cloud Data Fusion version 6.7.2 and later, the default is e2.
In version 6.7.1, the default is n2.
In version 6.7.0 and earlier, the default is n1.
Number of virtual cores allocated to a worker node.
Default is 2.
The amount of memory, in gigabytes, allocated to a worker node.
Default is 8 GB.
Disk size, in gigabytes, allocated to a worker node.
Default is 1000 GB.
Type of boot disk for a worker node:
- Standard Persistent Disk
- SSD Persistent Disk
Default is Standard Persistent Disk.
Worker nodes contain a YARN NodeManager and an HDFS DataNode.
Default is 2.
Path for the autoscaling policy ID or the resource URI.
For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see When to use autoscaling and Autoscale Dataproc clusters.
Enables Secure Boot on the Dataproc VMs.
Default is False.
Enables virtual Trusted Platform Module (vTPM) on the Dataproc VMs.
Default is False.
Enables virtual Integrity Monitoring on the Dataproc VMs.
Default is False.
Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark history files in Dataproc.
This property was introduced in Cloud Data Fusion version 6.9.2.
The OAuth 2.0 scopes that you might need to request to access Google APIs, depending on the level of access you need. Google Cloud Platform Scope is always included.
This property was introduced in Cloud Data Fusion version 6.9.2.
Labels to organize the Dataproc clusters and jobs being created.
You can label each resource and then filter the resources by labels. Information about labels is forwarded to the billing system, so you can break down your billing charges by label.
Configure Dataproc to delete a cluster if it's idle longer than the specified number of minutes. Clusters are normally deleted directly after a run ends, but deletion can fail in rare situations. For more information, see Troubleshoot deleting clusters.
Default is 30 minutes.
Whether to skip cluster deletion at the end of a run. If you skip deletion, you must delete the clusters manually. Use this setting only when debugging a failed run.
Default is False.
Enable the Stackdriver logging integration.
Default is True.
Enable the Stackdriver monitoring integration.
Default is True.
Enable the component gateway to access the cluster's interfaces, such as the YARN ResourceManager and Spark HistoryServer.
Default is False.
When the system is running on Google Cloud in the same network as the cluster, it normally uses the internal IP address when communicating with the cluster. To always use the external IP address, set this value to True.
Default is False.
The number of seconds to wait after creating a cluster to begin polling to see if the cluster has been created.
Default is 60 seconds.
Polling settings control how often cluster status is polled when creating and deleting clusters. If you have many pipelines scheduled to run at the same time, you may want to change these settings.
Maximum amount of random jitter, in seconds, to add to the delay when creating a cluster. You can use this property to prevent many simultaneous API calls in Google Cloud when you have a lot of pipelines that are scheduled to run at the exact same time.
Default is 20 seconds.
The number of seconds to wait after deleting a cluster to begin polling to see if the cluster has been deleted.
Default is 30 seconds.
The number of seconds to wait between polls for cluster status.
Default is 2 seconds.
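To see how the create delay, jitter, and poll interval combine, the following sketch computes when status polls would happen after a cluster is created. It assumes the jitter is drawn uniformly between 0 and the configured maximum, which this page does not state; the function names are hypothetical and the provisioner's actual implementation may differ.

```python
import random


def first_poll_delay(create_delay=60, create_jitter=20, rng=random):
    """Seconds to wait after cluster creation before the first status poll.

    Assumption: jitter is sampled uniformly from [0, create_jitter].
    """
    return create_delay + rng.uniform(0, create_jitter)


def poll_times(first_delay, poll_interval=2, polls=5):
    """Offsets, in seconds, of the first few polls after cluster creation."""
    return [first_delay + i * poll_interval for i in range(polls)]


if __name__ == "__main__":
    # With the defaults, each pipeline starts polling somewhere between
    # 60 and 80 seconds after it requests a cluster.
    delay = first_poll_delay()
    print([round(t, 1) for t in poll_times(delay)])
```

Because each run picks its own jitter, pipelines scheduled for the same moment spread their first polls over the jitter window instead of calling the API at the same instant.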
Dataproc profile web interface properties mapped to JSON properties
Dataproc profile UI property name | Dataproc profile JSON property name |
---|---|
Profile label | name |
Profile name | label |
Description | description |
Project ID | projectId |
Creator service account key | accountKey |
Region | region |
Zone | zone |
Network | network |
Network host project ID | networkHostProjectId |
Subnet | subnet |
Runner service account | serviceAccount |
Number of masters | masterNumNodes |
Master machine type | masterMachineType |
Master cores | masterCPUs |
Master memory (GB) | masterMemoryMB |
Master disk size (GB) | masterDiskGB |
Master disk type | masterDiskType |
Number of primary workers | workerNumNodes |
Number of secondary workers | secondaryWorkerNumNodes |
Worker machine type | workerMachineType |
Worker cores | workerCPUs |
Worker memory (GB) | workerMemoryMB |
Worker disk size (GB) | workerDiskGB |
Worker disk type | workerDiskType |
Metadata | clusterMetaData |
Network tags | networkTags |
Enable Secure Boot | secureBootEnabled |
Enable vTPM | vTpmEnabled |
Enable Integrity Monitoring | integrityMonitoringEnabled |
Image version | imageVersion |
Custom image URI | customImageUri |
Cloud Storage bucket | gcsBucket |
Encryption key name | encryptionKeyName |
Autoscaling policy | autoScalingPolicy |
Initialization actions | initActions |
Cluster properties | clusterProperties |
Labels | clusterLabels |
Max idle time | idleTTL |
Skip cluster delete | skipDelete |
Enable Stackdriver Logging Integration | stackdriverLoggingEnabled |
Enable Stackdriver Monitoring Integration | stackdriverMonitoringEnabled |
Enable Component Gateway | componentGatewayEnabled |
Prefer external IP | preferExternalIP |
Create poll delay | pollCreateDelay |
Create poll jitter | pollCreateJitter |
Delete poll delay | pollDeleteDelay |
Poll interval | pollInterval |
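As an illustration of how the mapping can be used, the following sketch translates settings keyed by web interface names into their JSON property names. It covers only a subset of the rows in the table; the helper name `to_json_properties` and the sample values are hypothetical, and the sketch does not show the full layout of a compute profile file, which this page does not describe.

```python
import json

# Subset of the table above: web interface name -> JSON property name.
UI_TO_JSON = {
    "Project ID": "projectId",
    "Region": "region",
    "Number of masters": "masterNumNodes",
    "Master machine type": "masterMachineType",
    "Number of primary workers": "workerNumNodes",
    "Max idle time": "idleTTL",
    "Skip cluster delete": "skipDelete",
    "Create poll delay": "pollCreateDelay",
    "Create poll jitter": "pollCreateJitter",
    "Poll interval": "pollInterval",
}


def to_json_properties(ui_settings):
    """Rename UI-keyed settings to their JSON property names.

    Hypothetical helper: raises KeyError for names missing from UI_TO_JSON.
    """
    unknown = sorted(set(ui_settings) - set(UI_TO_JSON))
    if unknown:
        raise KeyError(f"No JSON mapping for: {', '.join(unknown)}")
    return {UI_TO_JSON[name]: value for name, value in ui_settings.items()}


if __name__ == "__main__":
    # Sample values only; "my-project" is a placeholder project ID.
    settings = {
        "Project ID": "my-project",
        "Number of masters": 1,
        "Number of primary workers": 2,
        "Max idle time": 30,
    }
    print(json.dumps(to_json_properties(settings), indent=2))
```

Rejecting unknown names rather than passing them through makes a mistyped property name fail early instead of being silently ignored.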
Best practices
When you create a static cluster for your pipelines, refer to the cluster configuration best practices.
What's next
- Learn more about managing compute profiles.