Create a cluster

How to create a Dataproc cluster

Requirements:

  • Name: The cluster name must start with a lowercase letter, followed by up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen.

  • Cluster region: You must specify a Compute Engine region for the cluster, such as us-east1 or europe-west1, to isolate cluster resources, such as VM instances and cluster metadata stored in Cloud Storage, within the region.

    • See Regional endpoints for more information on regional endpoints.
    • See Available regions & zones for information on selecting a region. You can also run the gcloud compute regions list command to display a listing of available regions.
  • Connectivity: The Compute Engine virtual machine instances (VMs) in a Dataproc cluster, consisting of master and worker VMs, require full internal IP networking cross-connectivity. The default VPC network provides this connectivity (see Dataproc Cluster Network Configuration).
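
The naming rule above can be checked locally before a create request is sent. The following is a minimal sketch in Python. The function name is_valid_cluster_name is hypothetical, and the regular expression is derived only from the rule as stated in this list, not from an official validator.

```python
import re

# Derived from the rule above: one lowercase letter, then up to 51 more
# characters (lowercase letters, digits, hyphens), with no trailing hyphen.
# Maximum total length is therefore 52 characters.
_CLUSTER_NAME_RE = re.compile(r"[a-z](?:[a-z0-9-]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described above."""
    return _CLUSTER_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("my-cluster", "My-Cluster", "cluster-", "1cluster"):
        print(candidate, is_valid_cluster_name(candidate))
```

Running this kind of check first surfaces a naming mistake locally instead of as an error from the service.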

gcloud

To create a Dataproc cluster on the command line, run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION

The command creates a cluster with default Dataproc service settings for your master and worker virtual machine instances, disk sizes and types, network type, region and zone where your cluster is deployed, and other cluster settings. See the gcloud dataproc clusters create command for information on using command line flags to customize cluster settings.
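
For scripted use, the same command can be assembled as an argument list and passed to a subprocess. This is a rough sketch that assumes the gcloud CLI is installed and authenticated. The function names and the dry_run parameter are hypothetical, and only the command and flag shown above are used.

```python
import shlex
import subprocess
from typing import List


def build_create_command(cluster_name: str, region: str) -> List[str]:
    """Build the argv for `gcloud dataproc clusters create` with default settings."""
    return [
        "gcloud", "dataproc", "clusters", "create",
        cluster_name,
        f"--region={region}",
    ]


def create_cluster_with_gcloud(cluster_name: str, region: str, dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or execute it with the local gcloud CLI."""
    command = build_create_command(cluster_name, region)
    if dry_run:
        # Hypothetical safety default: show the command without creating resources.
        print(shlex.join(command))
    else:
        # check=True raises CalledProcessError if gcloud exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    create_cluster_with_gcloud("my-cluster", "us-east1")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a value contains unexpected characters.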

Create a cluster with a YAML file

  1. Run the following gcloud command to export the configuration of an existing Dataproc cluster into a cluster.yaml file.
    gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
        --region=REGION \
        --destination=cluster.yaml
    
  2. Create a new cluster by importing the YAML file configuration.
    gcloud dataproc clusters import NEW_CLUSTER_NAME \
        --region=REGION \
        --source=cluster.yaml
    

Note: During the export operation, cluster-specific fields (such as the cluster name), output-only fields, and automatically applied labels are filtered out. These fields are not allowed in the imported YAML file used to create a cluster.
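
The two steps above can be chained in a script. The sketch below builds both commands as argument lists and, when asked, runs them in order. It assumes the gcloud CLI is installed and authenticated. The function names and the run parameter are hypothetical, and only the commands and flags shown in the steps are used.

```python
import subprocess
from typing import List, Tuple


def build_clone_commands(
    existing_cluster: str, new_cluster: str, region: str, yaml_path: str = "cluster.yaml"
) -> Tuple[List[str], List[str]]:
    """Return the (export, import) gcloud argument lists for copying a cluster configuration."""
    export_cmd = [
        "gcloud", "dataproc", "clusters", "export", existing_cluster,
        f"--region={region}",
        f"--destination={yaml_path}",
    ]
    import_cmd = [
        "gcloud", "dataproc", "clusters", "import", new_cluster,
        f"--region={region}",
        f"--source={yaml_path}",
    ]
    return export_cmd, import_cmd


def clone_cluster(existing_cluster: str, new_cluster: str, region: str, run: bool = False) -> None:
    """Export the existing cluster's configuration, then import it under a new name."""
    for command in build_clone_commands(existing_cluster, new_cluster, region):
        print(" ".join(command))
        if run:
            # Stop at the first failure so the import never reads a missing or stale file.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    clone_cluster("existing-cluster", "new-cluster", "us-east1")
```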

REST

This section shows how to create a cluster with required values and the default configuration (1 master, 2 workers).

Before using any of the request data, make the following replacements:

  • CLUSTER_NAME: The cluster name.
  • PROJECT: The Google Cloud project ID.
  • REGION: An available Compute Engine region where the cluster will be created.
  • ZONE: An optional zone within the selected region where the cluster will be created.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body:

{
  "project_id": "PROJECT",
  "cluster_name": "CLUSTER_NAME",
  "config": {
    "master_config": {
      "num_instances": 1,
      "machine_type_uri": "n1-standard-2",
      "image_uri": ""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config": {
      "num_instances": 2,
      "machine_type_uri": "n1-standard-2",
      "image_uri": ""
    },
    "gce_cluster_config": {
      "zone_uri": "ZONE"
    }
  }
}
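
As an illustration, the URL and request body above can be assembled with the Python standard library. This is a sketch under stated assumptions, not an official client: the function name is hypothetical, the empty-valued fields from the body above are left out for brevity, and the access_token argument is a placeholder for an OAuth 2.0 access token that you obtain separately (for example, with gcloud auth print-access-token).

```python
import json
import urllib.request


def build_create_cluster_request(
    project: str,
    region: str,
    cluster_name: str,
    zone: str = "",
    access_token: str = "ACCESS_TOKEN",  # placeholder; supply a real token before sending
) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a default cluster."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project}/regions/{region}/clusters"
    )
    body = {
        "project_id": project,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            "gce_cluster_config": {"zone_uri": zone},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_cluster_request("my-project", "us-east1", "my-cluster")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

To send the request, pass the returned object to urllib.request.urlopen once a valid token is in place.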

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
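
A caller typically reads only a few fields from this response. The sketch below pulls out the operation name, cluster name, state, and warnings from a parsed response. The function name summarize_operation is hypothetical, and the field names are taken only from the sample response above.

```python
from typing import Any, Dict


def summarize_operation(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the fields of interest from a parsed create-cluster response."""
    metadata = response.get("metadata", {})
    status = metadata.get("status", {})
    return {
        "operation": response.get("name"),
        "cluster": metadata.get("clusterName"),
        "state": status.get("state"),
        # Copy the list so callers can modify it without touching the response.
        "warnings": list(metadata.get("warnings", [])),
    }


if __name__ == "__main__":
    sample = {
        "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
        "metadata": {
            "clusterName": "CLUSTER_NAME",
            "status": {"state": "PENDING"},
            "warnings": [],
        },
    }
    print(summarize_operation(sample))
```

Using .get with defaults keeps the helper from raising if a field such as warnings is absent from a particular response.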

Console

Open the Dataproc Create a cluster page in the Google Cloud console in your browser, then click Create in the Cluster on Compute Engine row on the Create a Dataproc cluster on Compute Engine page. The Set up cluster panel is selected, with fields filled in with default values. You can select each panel and confirm or change the default values to customize your cluster.

Click Create to create the cluster. The cluster name appears on the Clusters page, and its status is updated to Running after the cluster is provisioned. Click the cluster name to open the cluster details page, where you can examine jobs, instances, and configuration settings for your cluster and connect to web interfaces running on your cluster.

Go

  1. Install the client library.
  2. Set up application default credentials.
  3. Run the code.
    import (
        "context"
        "fmt"
        "io"

        dataproc "cloud.google.com/go/dataproc/apiv1"
        "cloud.google.com/go/dataproc/apiv1/dataprocpb"
        "google.golang.org/api/option"
    )

    func createCluster(w io.Writer, projectID, region, clusterName string) error {
        // projectID := "your-project-id"
        // region := "us-central1"
        // clusterName := "your-cluster"
        ctx := context.Background()

        // Create the cluster client.
        endpoint := region + "-dataproc.googleapis.com:443"
        clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
        if err != nil {
            return fmt.Errorf("dataproc.NewClusterControllerClient: %w", err)
        }
        defer clusterClient.Close()

        // Create the cluster config.
        req := &dataprocpb.CreateClusterRequest{
            ProjectId: projectID,
            Region:    region,
            Cluster: &dataprocpb.Cluster{
                ProjectId:   projectID,
                ClusterName: clusterName,
                Config: &dataprocpb.ClusterConfig{
                    MasterConfig: &dataprocpb.InstanceGroupConfig{
                        NumInstances:   1,
                        MachineTypeUri: "n1-standard-2",
                    },
                    WorkerConfig: &dataprocpb.InstanceGroupConfig{
                        NumInstances:   2,
                        MachineTypeUri: "n1-standard-2",
                    },
                },
            },
        }

        // Create the cluster.
        op, err := clusterClient.CreateCluster(ctx, req)
        if err != nil {
            return fmt.Errorf("CreateCluster: %w", err)
        }

        resp, err := op.Wait(ctx)
        if err != nil {
            return fmt.Errorf("CreateCluster.Wait: %w", err)
        }

        // Output a success message.
        fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
        return nil
    }

Java

  1. Install the client library.
  2. Set up application default credentials.
  3. Run the code.
    import com.google.api.gax.longrunning.OperationFuture;
    import com.google.cloud.dataproc.v1.Cluster;
    import com.google.cloud.dataproc.v1.ClusterConfig;
    import com.google.cloud.dataproc.v1.ClusterControllerClient;
    import com.google.cloud.dataproc.v1.ClusterControllerSettings;
    import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
    import com.google.cloud.dataproc.v1.InstanceGroupConfig;
    import java.io.IOException;
    import java.util.concurrent.ExecutionException;

    public class CreateCluster {

      public static void createCluster() throws IOException, InterruptedException {
        // TODO(developer): Replace these variables before running the sample.
        String projectId = "your-project-id";
        String region = "your-project-region";
        String clusterName = "your-cluster-name";
        createCluster(projectId, region, clusterName);
      }

      public static void createCluster(String projectId, String region, String clusterName)
          throws IOException, InterruptedException {
        String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

        // Configure the settings for the cluster controller client.
        ClusterControllerSettings clusterControllerSettings =
            ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

        // Create a cluster controller client with the configured settings. The client only needs to be
        // created once and can be reused for multiple requests. Using a try-with-resources
        // closes the client, but this can also be done manually with the .close() method.
        try (ClusterControllerClient clusterControllerClient =
            ClusterControllerClient.create(clusterControllerSettings)) {
          // Configure the settings for our cluster.
          InstanceGroupConfig masterConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-2")
                  .setNumInstances(1)
                  .build();
          InstanceGroupConfig workerConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-2")
                  .setNumInstances(2)
                  .build();
          ClusterConfig clusterConfig =
              ClusterConfig.newBuilder()
                  .setMasterConfig(masterConfig)
                  .setWorkerConfig(workerConfig)
                  .build();

          // Create the cluster object with the desired cluster config.
          Cluster cluster =
              Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();

          // Create the Cloud Dataproc cluster.
          OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
              clusterControllerClient.createClusterAsync(projectId, region, cluster);
          Cluster response = createClusterAsyncRequest.get();

          // Print out a success message.
          System.out.printf("Cluster created successfully: %s", response.getClusterName());

        } catch (ExecutionException e) {
          System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
        }
      }
    }

Node.js

  1. Install the client library.
  2. Set up application default credentials.
  3. Run the code.
    const dataproc = require('@google-cloud/dataproc');

    // TODO(developer): Uncomment and set the following variables
    // projectId = 'YOUR_PROJECT_ID'
    // region = 'YOUR_CLUSTER_REGION'
    // clusterName = 'YOUR_CLUSTER_NAME'

    // Create a client with the endpoint set to the desired cluster region
    const client = new dataproc.v1.ClusterControllerClient({
      apiEndpoint: `${region}-dataproc.googleapis.com`,
      projectId: projectId,
    });

    async function createCluster() {
      // Create the cluster config
      const request = {
        projectId: projectId,
        region: region,
        cluster: {
          clusterName: clusterName,
          config: {
            masterConfig: {
              numInstances: 1,
              machineTypeUri: 'n1-standard-2',
            },
            workerConfig: {
              numInstances: 2,
              machineTypeUri: 'n1-standard-2',
            },
          },
        },
      };

      // Create the cluster
      const [operation] = await client.createCluster(request);
      const [response] = await operation.promise();

      // Output a success message
      console.log(`Cluster created successfully: ${response.clusterName}`);
    }

Python

  1. Install the client library.
  2. Set up application default credentials.
  3. Run the code.
    from google.cloud import dataproc_v1 as dataproc


    def create_cluster(project_id, region, cluster_name):
        """This sample walks a user through creating a Cloud Dataproc cluster
        using the Python client library.

        Args:
            project_id (string): Project to use for creating resources.
            region (string): Region where the resources should live.
            cluster_name (string): Name to use for creating a cluster.
        """

        # Create a client with the endpoint set to the desired cluster region.
        cluster_client = dataproc.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )

        # Create the cluster config.
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            },
        }

        # Create the cluster.
        operation = cluster_client.create_cluster(
            request={"project_id": project_id, "region": region, "cluster": cluster}
        )
        result = operation.result()

        # Output a success message.
        print(f"Cluster created successfully: {result.cluster_name}")