Flexible VMs is a Dataproc feature that lets you specify prioritized lists of VM types for Dataproc secondary workers when you create a Dataproc cluster.
Why use flexible VMs
Previously, if a VM type was unavailable when you submitted a cluster creation request, the request failed, and you needed to update your request, script, or code to specify a "next-best" VM type. This re-request process could involve multiple iterations until you specified a VM type that was available.
The Dataproc Flexible VM feature helps your cluster creation request succeed by selecting secondary worker VM types from your ranked VM lists, and then searching for zones within your specified cluster region with availability of the listed VM types.
Terminology
-  VM type: The family, memory capacity, and number of CPU cores of a VM instance. Dataproc supports the use of predefined and custom VM types.
-  Secondary workers: Secondary workers don't store data. They function only as processing nodes. You can use secondary workers to scale compute without scaling storage. 
Limitations and considerations
-  Flexible VMs are available in Dataproc on Compute Engine image versions 2.0.74+, 2.1.22+, and later image versions.
-  You can specify flexible VMs for secondary workers only. 
-  You can specify up to five ranked VM type lists, with up to 10 VM types in a list. For more information, see How to request flexible VMs.
-  The creation of a cluster with flexible VMs requires the use of Dataproc autozone placement, which allows Dataproc to choose the zone that has the capacity to fulfill your VM type requests.
-  If your cluster creation request includes an autoscaling policy, flexible VMs can be from different VM families, but they must have the same amount of memory and core count.
-  When provisioning flexible VMs, Dataproc consumes "any matching" available reservations, but not "specific" reservations (see Consume reserved instances). Within a rank, machine types that match reservations are selected first, followed by VM types with the largest number of CPUs.
-  Dataproc applies Google Cloud quotas to flexible VM provisioning. 
-  Although you can specify different CPU-to-memory ratios for primary and secondary worker VM types in a cluster, this can lead to performance degradation because the smallest CPU-to-memory ratio is used as the smallest container unit.
-  If you update a cluster that was created using flexible VMs, Dataproc selects and adds workers from the flexible VM lists that you provided when you created your cluster. 
Request flexible VMs
You can specify flexible VMs when you create a Dataproc cluster using the Google Cloud console, Google Cloud CLI, or Dataproc API.
- You can specify up to five ranked VM type lists, with up to 10 VM types in a list. Lists with the lowest rank number have the highest priority. By default, flexible VM lists have a rank of 0. Within a list, Dataproc prioritizes VM types with unused reservations, followed by the largest VM sizes. VM types within a list with the same CPU count are treated equally.
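The ordering rules above can be illustrated with a small sketch. This is not Dataproc's implementation; it only restates the documented preference order as a sort key. The `Candidate` class, the `has_reservation` flag, and the `vcpus` helper are hypothetical names, and the helper assumes predefined machine type names that end in the vCPU count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    machine_type: str      # for example "n2-standard-8"
    rank: int              # lower rank number means higher priority
    has_reservation: bool  # hypothetical: an unused matching reservation exists


def vcpus(machine_type: str) -> int:
    """Assumption: predefined type names end in the vCPU count."""
    return int(machine_type.rsplit("-", 1)[1])


def preference_order(candidates):
    """Sort by rank, then reservation match, then larger CPU count.

    sorted() is stable, so types that tie on every key keep their input
    order, which mirrors "treated equally" only loosely.
    """
    return sorted(
        candidates,
        key=lambda c: (c.rank, not c.has_reservation, -vcpus(c.machine_type)),
    )


candidates = [
    Candidate("e2-standard-8", rank=1, has_reservation=False),
    Candidate("n2-standard-4", rank=0, has_reservation=False),
    Candidate("n2-standard-8", rank=0, has_reservation=False),
    Candidate("t2d-standard-16", rank=1, has_reservation=False),
    Candidate("e2-standard-4", rank=1, has_reservation=True),
]
ordered = [c.machine_type for c in preference_order(candidates)]
print(ordered)
```

In this sketch, both rank 0 types come before any rank 1 type, and within rank 1 the type with a matching reservation comes before larger types without one.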
Console
To create a cluster with secondary worker flexible VMs:
-  Open the Dataproc Create a cluster on Compute Engine page in the Google Cloud console. 
-  The Set up cluster panel is selected with fields filled in with default values. You can change the suggested name and the cluster region, and make other changes. Make sure that Any is selected as the cluster Zone to allow Dataproc autozone placement to choose the zone that has the best availability of the VM types specified in your flexible VM lists.
-  Select the Configure nodes panel. In the Secondary worker nodes section, specify the number and preemptibility of secondary workers.
   - Click Add a secondary worker for each rank of secondary workers, specifying one or more machine types to include in each rank.
-  After confirming and specifying cluster details in the cluster create panels, click Create. 
gcloud
Use the gcloud dataproc clusters create command with multiple --secondary-worker-machine-types flags to specify ranked flexible VM lists for Dataproc secondary workers.
The default flexible VM secondary worker type is Spot, which is a preemptible type.
In the following gcloud CLI example, Dataproc attempts to provision secondary workers with n2-standard-8 VMs first (rank 0). If n2-standard-8 machines are not available, Dataproc attempts to provision secondary workers with either e2-standard-8 or t2d-standard-8 VMs (rank 1).
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --zone="" \
    --master-machine-type=n1-standard-8 \
    --worker-machine-type=n1-standard-8 \
    --num-workers=4 \
    --num-secondary-workers=4 \
    --secondary-worker-type=non-preemptible \
    --secondary-worker-machine-types="type=n2-standard-8,rank=0" \
    --secondary-worker-machine-types="type=e2-standard-8,type=t2d-standard-8,rank=1"
Notes:
-  --zone="": The Flexible VM feature requires Dataproc autozone placement to allow Dataproc to choose the zone that has your VM types available for use. Passing an empty value ("") to the --zone flag overrides any zone selection specified in your default gcloud config list.
-  Dataproc generates component role properties based on machine cores and memory. You can override these system-generated properties with the --properties flag, using the following syntax:
   --properties="ROLE:MACHINE_TYPE:COMPONENT_PREFIX:COMPONENT_PROPERTY=VALUE"
   The secondary_worker role is the only supported role. In the following example, the --properties flag changes the number of cores of e2-standard-8 machines assigned to secondary worker nodes from 8 to 6:
   --properties="secondary_worker:e2-standard-8:yarn:yarn.nodemanager.resource.cpu-vcores=6"
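If you generate the command from a script, the repeated flag can be assembled from a mapping of rank to machine types. The following is a minimal sketch that only formats strings in the shape shown in the example above; the function name `secondary_worker_flags` and its input format are hypothetical, and the sketch does not call the gcloud CLI.

```python
def secondary_worker_flags(ranked_types):
    """Build one --secondary-worker-machine-types flag per rank.

    ranked_types maps a rank number to a list of machine type names
    (a hypothetical input format chosen for this sketch).
    """
    flags = []
    for rank in sorted(ranked_types):
        types = ",".join(f"type={name}" for name in ranked_types[rank])
        flags.append(f'--secondary-worker-machine-types="{types},rank={rank}"')
    return flags


flags = secondary_worker_flags(
    {
        1: ["e2-standard-8", "t2d-standard-8"],
        0: ["n2-standard-8"],
    }
)
for flag in flags:
    print(flag)
```

The output lines match the two --secondary-worker-machine-types flags in the example command, in rank order.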
API
Use the instanceFlexibilityPolicy.instanceSelectionList field as part of a Dataproc API clusters.create request to specify a ranked list of machineTypes for secondary workers.
Example:
The following JSON snippet from a Dataproc clusters.create request body specifies secondary worker machine types for rank 0 and rank 1.
"config": {
  "secondaryWorkerConfig": {
    "instanceFlexibilityPolicy": {
      "instanceSelectionList": [
        {
          "machineTypes": [
            "n1-standard-4",
            "n2-standard-4"
          ],
          "rank": 0
        },
        {
          "machineTypes": [
            "e2-standard-4",
            "n2d-standard-4"
          ],
          "rank": 1
        }
      ]
    }
  }
} 
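A request body fragment like the one above can also be built programmatically. The following sketch assembles the same JSON structure and checks the documented limits (up to five lists, up to 10 VM types per list) on the client side before sending anything. The function name `flexibility_policy` and the rank-to-list input format are hypothetical; the sketch does not call the Dataproc API, and the service performs its own validation.

```python
import json

MAX_LISTS = 5            # documented limit on ranked lists
MAX_TYPES_PER_LIST = 10  # documented limit on VM types per list


def flexibility_policy(ranked_types):
    """Return a clusters.create body fragment for secondary workers.

    ranked_types maps a rank number to a list of machine type names
    (a hypothetical input format chosen for this sketch).
    """
    if len(ranked_types) > MAX_LISTS:
        raise ValueError(f"at most {MAX_LISTS} ranked lists are allowed")
    selection = []
    for rank in sorted(ranked_types):
        types = list(ranked_types[rank])
        if not types or len(types) > MAX_TYPES_PER_LIST:
            raise ValueError(
                f"rank {rank} needs 1 to {MAX_TYPES_PER_LIST} machine types"
            )
        selection.append({"machineTypes": types, "rank": rank})
    return {
        "config": {
            "secondaryWorkerConfig": {
                "instanceFlexibilityPolicy": {"instanceSelectionList": selection}
            }
        }
    }


body = flexibility_policy(
    {
        0: ["n1-standard-4", "n2-standard-4"],
        1: ["e2-standard-4", "n2d-standard-4"],
    }
)
print(json.dumps(body, indent=2))
```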
Use cluster properties to customize component roles: Dataproc generates component role properties based on VM cores and memory. You can override these system-generated properties by adding SoftwareConfig.properties to your clusters.create request, using the following key=value syntax:
ROLE:MACHINE_TYPE:COMPONENT_PREFIX:COMPONENT_PROPERTY=VALUE
The secondary_worker role is the only supported role.
In the following example, the properties field changes the number of cores assigned to the secondary worker node of an e2-standard-8 VM from 8 to 6:
"secondary_worker:e2-standard-8:yarn:yarn.nodemanager.resource.cpu-vcores=6"
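The override syntax can be composed with a small helper. The following is a sketch under stated assumptions: the names `role_property` and `as_properties_entry` are hypothetical, and splitting the string at the first "=" to form a map entry is an assumption about how the key=value form corresponds to the SoftwareConfig.properties map, not something this page confirms.

```python
SUPPORTED_ROLES = {"secondary_worker"}  # the only role the text lists as supported


def role_property(role, machine_type, component_prefix, component_property, value):
    """Format ROLE:MACHINE_TYPE:COMPONENT_PREFIX:COMPONENT_PROPERTY=VALUE."""
    if role not in SUPPORTED_ROLES:
        raise ValueError(f"unsupported role: {role}")
    return f"{role}:{machine_type}:{component_prefix}:{component_property}={value}"


def as_properties_entry(entry):
    """Assumption: the text before the first '=' is the map key."""
    key, _, value = entry.partition("=")
    return {key: value}


entry = role_property(
    "secondary_worker",
    "e2-standard-8",
    "yarn",
    "yarn.nodemanager.resource.cpu-vcores",
    6,
)
print(entry)
print(as_properties_entry(entry))
```

The first printed line reproduces the example string above; the second shows the same override as a single key and value pair.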

