Create a Dataproc zero-scale cluster

This document describes how to create a Dataproc zero-scale cluster.

Dataproc zero-scale clusters provide a cost-effective way to use Dataproc clusters. Unlike standard Dataproc clusters that require at least two primary workers, Dataproc zero-scale clusters use only secondary workers that can be scaled down to zero.

Dataproc zero-scale clusters are ideal for use as long-running clusters that experience idle periods, such as a cluster that hosts a Jupyter notebook. They provide improved resource utilization through the use of zero-scale autoscaling policies.

Characteristics and limitations

A Dataproc zero-scale cluster shares similarities with a standard cluster, but has the following unique characteristics and limitations:

  • Requires image version 2.2.53 or later.
  • Supports only secondary workers, not primary workers.
  • Includes services such as YARN, but doesn't support the HDFS file system.

    • To use Cloud Storage as the default file system, set the core:fs.defaultFS cluster property to a Cloud Storage bucket location (gs://BUCKET_NAME).
    • If you disable a component during cluster creation, also disable HDFS.
  • Can't be converted to or from a standard cluster.

  • Requires an autoscaling policy for ZERO_SCALE cluster types.

  • Requires selecting flexible VMs as the machine type.

  • Doesn't support the Oozie component.

  • Can't be created from the Google Cloud console.

Optional: Configure an autoscaling policy

You can configure an autoscaling policy to define secondary worker scaling for a zero-scale cluster. When doing so, note the following:

  • Set the cluster type to ZERO_SCALE.
  • Configure the autoscaling policy for the secondary worker config only.

For more information, see Create an autoscaling policy.
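
As a rough sketch of what such a policy might look like, the following script writes a policy file and prints the command that would import it. The policy ID, region, file name, and scaling values are hypothetical placeholders. The clusterType field name is an assumption based on the cluster type setting described above, so check it against the AutoscalingPolicy reference before relying on it.

```shell
set -eu

# Hypothetical names; replace with your own.
POLICY_ID="zero-scale-policy"
REGION="us-central1"
POLICY_FILE="zero-scale-policy.yaml"

# Write the policy. Only secondaryWorkerConfig is set (no workerConfig),
# and minInstances is 0 so the cluster can scale down to zero.
cat > "${POLICY_FILE}" <<'EOF'
clusterType: ZERO_SCALE   # assumed field name
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10
EOF

# Print (rather than run) the import command so the sketch has no side effects.
echo "gcloud dataproc autoscaling-policies import ${POLICY_ID} --source=${POLICY_FILE} --region=${REGION}"
```

Run the printed command after reviewing the file to create the policy in your project.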

Create a Dataproc zero-scale cluster

Create a zero-scale cluster using the gcloud CLI or the Dataproc API.

gcloud

Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --cluster-type=zero-scale \
        --autoscaling-policy=AUTOSCALING_POLICY \
        --properties=core:fs.defaultFS=gs://BUCKET_NAME \
        --secondary-worker-machine-types="type=MACHINE_TYPE1[,type=MACHINE_TYPE2...][,rank=RANK]"
        ...other args

Replace the following:

  • CLUSTER_NAME: name of the Dataproc zero-scale cluster.
  • REGION: an available Compute Engine region.
  • AUTOSCALING_POLICY: the ID or resource URI of the autoscaling policy.
  • BUCKET_NAME: name of your Cloud Storage bucket.
  • MACHINE_TYPE1, MACHINE_TYPE2: specific Compute Engine machine types, such as n1-standard-4 or e2-standard-8.
  • RANK: defines the priority of a list of machine types.
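
To illustrate how the placeholders fit together, the following sketch assembles the command with example values and prints it for review instead of running it. All values (cluster name, region, policy ID, bucket, and the rank of 0) are hypothetical and chosen only for illustration.

```shell
set -eu

# Hypothetical values; replace each with your own.
CLUSTER_NAME="my-zero-scale-cluster"
REGION="us-central1"
AUTOSCALING_POLICY="zero-scale-policy"
BUCKET_NAME="my-bucket"

# Two candidate machine types that share one rank.
MACHINE_TYPES="type=n1-standard-4,type=e2-standard-8,rank=0"

# Assemble the command as a single-line string and print it for review.
CMD="gcloud dataproc clusters create ${CLUSTER_NAME} \
--region=${REGION} \
--cluster-type=zero-scale \
--autoscaling-policy=${AUTOSCALING_POLICY} \
--properties=core:fs.defaultFS=gs://${BUCKET_NAME} \
--secondary-worker-machine-types=${MACHINE_TYPES}"

echo "${CMD}"
```

Printing the command first makes it easy to confirm each substituted value before creating a billable resource.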

REST

Create a zero-scale cluster using a Dataproc REST API clusters.create request.
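
As an illustration of what such a request might contain, the following sketch writes a request body that mirrors the gcloud flags above and prints the target URL. The project ID, names, and machine types are hypothetical. The JSON field names (in particular clusterType and instanceFlexibilityPolicy) are assumptions based on the Cluster resource of the Dataproc v1 API and should be checked against the API reference.

```shell
set -eu

# Hypothetical values; replace with your own.
PROJECT_ID="my-project"
REGION="us-central1"
REQUEST_FILE="request.json"

# Request body: zero-scale cluster type, autoscaling policy, Cloud Storage
# as the default file system, and flexible secondary worker machine types.
cat > "${REQUEST_FILE}" <<EOF
{
  "clusterName": "my-zero-scale-cluster",
  "config": {
    "clusterType": "ZERO_SCALE",
    "autoscalingConfig": {
      "policyUri": "projects/${PROJECT_ID}/regions/${REGION}/autoscalingPolicies/zero-scale-policy"
    },
    "softwareConfig": {
      "properties": { "core:fs.defaultFS": "gs://my-bucket" }
    },
    "secondaryWorkerConfig": {
      "instanceFlexibilityPolicy": {
        "instanceSelectionList": [
          { "machineTypes": ["n1-standard-4", "e2-standard-8"], "rank": 0 }
        ]
      }
    }
  }
}
EOF

URL="https://dataproc.googleapis.com/v1/projects/${PROJECT_ID}/regions/${REGION}/clusters"
echo "POST ${URL}"

# To send the request, uncomment the following lines:
# curl -X POST "${URL}" \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d @"${REQUEST_FILE}"
```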

What's next
