This document describes how to create a Dataproc zero-scale cluster.
Dataproc zero-scale clusters provide a cost-effective way to use Dataproc clusters. Unlike standard Dataproc clusters that require at least two primary workers, Dataproc zero-scale clusters use only secondary workers that can be scaled down to zero.
Dataproc zero-scale clusters are ideal for use as long-running clusters that experience idle periods, such as a cluster that hosts a Jupyter notebook. They provide improved resource utilization through the use of zero-scale autoscaling policies.
Characteristics and limitations
A Dataproc zero-scale cluster shares similarities with a standard cluster, but has the following unique characteristics and limitations:
- Requires image version 2.2.53 or later.
- Supports only secondary workers, not primary workers.
- Includes services such as YARN, but doesn't support the HDFS file system.
- To use Cloud Storage as the default file system, set the core:fs.defaultFS cluster property to a Cloud Storage bucket location (gs://BUCKET_NAME); see the example after this list.
- If you disable a component during cluster creation, also disable HDFS.
- Can't be converted to or from a standard cluster.
- Requires an autoscaling policy for ZERO_SCALE cluster types.
- Requires selecting flexible VMs as the machine type.
- Doesn't support the Oozie component.
- Can't be created from the Google Cloud console.
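For example, because a zero-scale cluster uses Cloud Storage rather than HDFS as the default file system, you might first create a bucket to reference in the core:fs.defaultFS property. The bucket name below (my-dataproc-default-fs) is illustrative; substitute your own bucket name and location.

gcloud storage buckets create gs://my-dataproc-default-fs \
    --location=REGION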
Optional: Configure an autoscaling policy
You can configure an autoscaling policy to define secondary worker scaling for a zero-scale cluster. When doing so, note the following:
- Set the cluster type to ZERO_SCALE.
- Configure the autoscaling policy for the secondary worker config only.
For more information, see Create an autoscaling policy.
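As a reference point, the following is a minimal sketch of a zero-scale autoscaling policy definition and the command that imports it. The file name (zero-scale-policy.yaml), the policy ID (zero-scale-policy), the instance bounds, and the YARN tuning values are illustrative assumptions, and the exact field that sets the ZERO_SCALE cluster type should be verified against the autoscaling policy reference.

# zero-scale-policy.yaml (illustrative values)
# Cluster type field name is an assumption; scaling is configured for
# secondary workers only, per the guidance above.
clusterType: ZERO_SCALE
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 0s

gcloud dataproc autoscaling-policies import zero-scale-policy \
    --source=zero-scale-policy.yaml \
    --region=REGION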
Create a Dataproc zero-scale cluster
Create a zero-scale cluster using the gcloud CLI or the Dataproc API.
gcloud
Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --cluster-type=zero-scale \
    --autoscaling-policy=AUTOSCALING_POLICY \
    --properties=core:fs.defaultFS=gs://BUCKET_NAME \
    --secondary-worker-machine-types="type=MACHINE_TYPE1[,type=MACHINE_TYPE2...][,rank=RANK]" \
    ...other args
Replace the following:
- CLUSTER_NAME: name of the Dataproc zero-scale cluster.
- REGION: an available Compute Engine region.
- AUTOSCALING_POLICY: the ID or resource URI of the autoscaling policy.
- BUCKET_NAME: name of your Cloud Storage bucket.
- MACHINE_TYPE: a specific Compute Engine machine type, such as n1-standard-4 or e2-standard-8.
- RANK: defines the priority of a list of machine types.
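As a concrete illustration, the following invocation uses hypothetical values (my-zero-scale-cluster, us-central1, my-zero-scale-policy, my-dataproc-default-fs); substitute your own names and choose flexible VM machine types that are available in your region.

gcloud dataproc clusters create my-zero-scale-cluster \
    --region=us-central1 \
    --cluster-type=zero-scale \
    --autoscaling-policy=my-zero-scale-policy \
    --properties=core:fs.defaultFS=gs://my-dataproc-default-fs \
    --secondary-worker-machine-types="type=n1-standard-4,type=e2-standard-8"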
REST
Create a zero-scale cluster using a Dataproc REST API clusters.create request; an example request body follows the list below.
- Set ClusterConfig.ClusterType for the secondaryWorkerConfig to ZERO_SCALE.
- Set the AutoscalingConfig.policyUri with the ZERO_SCALE autoscaling policy ID.
- Add the core:fs.defaultFS:gs://BUCKET_NAME SoftwareConfig.property. Replace BUCKET_NAME with the name of your Cloud Storage bucket.
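As an illustration only, the following is a minimal sketch of a clusters.create request with placeholder values (PROJECT_ID, REGION, CLUSTER_NAME, MACHINE_TYPE, AUTOSCALING_POLICY, BUCKET_NAME). The exact placement of the ZERO_SCALE cluster type field is an assumption; confirm it against the ClusterConfig reference before use.

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters

{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "clusterType": "ZERO_SCALE",
    "secondaryWorkerConfig": {
      "machineTypeUri": "MACHINE_TYPE"
    },
    "autoscalingConfig": {
      "policyUri": "projects/PROJECT_ID/regions/REGION/autoscalingPolicies/AUTOSCALING_POLICY"
    },
    "softwareConfig": {
      "properties": {
        "core:fs.defaultFS": "gs://BUCKET_NAME"
      }
    }
  }
}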
What's next
- Learn more about Dataproc autoscaling.

