Dataproc services

This page lists the services that run on Dataproc cluster nodes, grouped by cluster type and node type, along with the image versions on which each service runs.

All nodes

The following services run on all nodes in a cluster.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| All nodes | Dataproc agent | all | Receives jobs from Dataproc and launches job drivers |
| All nodes | Fluentd | all | Collects and pushes logs to Logging |

Standard clusters

The following services run on standard clusters.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| Master | HDFS NameNode | all | Manages the HDFS filesystem |
| Master | HDFS Secondary NameNode | all | Checkpoints the NameNode |
| Master | MapReduce JobHistory Server | all | Serves MapReduce application history information |
| Master | YARN ResourceManager | all | Schedules and manages YARN applications |
| Master | YARN Timeline Server | 1.3+ | Serves YARN application history information |
| Master | Hive Metastore | all | Manages Hive table metadata. By default, uses the local MariaDB (image versions < 1.5) or MySQL (image versions 1.5+) database on the master node as the Hive table metadata store. Using the default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of preference: (1) Dataproc Metastore, (2) a Cloud SQL instance. |
| Master | HiveServer2 | all | Serves queries against Hive received from clients (primarily beeline shell queries) |
| Master | MariaDB | < 1.5 | A relational database used as the default underlying database for the Hive metastore in Dataproc < 1.5 images |
| Master | MySQL | 1.5+ | A relational database used as the default underlying database for the Hive metastore in Dataproc 1.5+ images |
| Master | NFS | < 1.3 | NFS is the Network File System. |
| Master | Spark History Server | all | Serves Spark application history information |
| All workers | YARN NodeManager | all | Launches and manages YARN containers |
| Primary workers only | HDFS DataNode | all | Stores HDFS blocks |
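
The Image versions column uses three forms: `all`, a minimum version such as `1.3+`, and an upper bound such as `< 1.5`. The following is a minimal Python sketch of how such a cell could be checked against a cluster's image version. The function names `parse_version` and `image_matches` are hypothetical and are not part of any Dataproc API; the sketch assumes image versions start with `major.minor` and compares only those two components.

```python
def parse_version(text):
    """Turn '1.5' or '2.0.27-debian10' into a comparable (major, minor) tuple."""
    # Drop any OS suffix after '-', then keep only the first two components.
    major, minor = text.strip().split("-")[0].split(".")[:2]
    return int(major), int(minor)


def image_matches(constraint, image_version):
    """Return True if image_version satisfies an 'Image versions' cell."""
    constraint = constraint.strip()
    version = parse_version(image_version)
    if constraint == "all":
        return True
    if constraint.endswith("+"):
        # "1.3+" means that version or any later one.
        return version >= parse_version(constraint[:-1])
    if constraint.startswith("<"):
        # "< 1.5" means strictly earlier than 1.5.
        return version < parse_version(constraint[1:])
    raise ValueError(f"unrecognized constraint: {constraint!r}")
```

For example, `image_matches("< 1.5", "1.4")` is true, which corresponds to an image where the default Hive metastore database is MariaDB, while `image_matches("1.5+", "1.4")` is false.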

HA clusters

In Dataproc High Availability (HA) clusters, different services run on different master nodes, as shown below. HA cluster worker node services are the same as those listed for standard clusters.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| All masters | HDFS JournalNode | all | A quorum of journal nodes maintains an edit log of HDFS namespace modifications. If a failover occurs, the Standby NameNode reads the edit log and takes control from the Active NameNode. |
| All masters | YARN ResourceManager | all | Schedules and manages YARN applications |
| All masters | Hive Metastore | all | Manages Hive table metadata. By default, uses the local MariaDB (image versions < 1.5) or MySQL (image versions 1.5+) database on the master node as the Hive table metadata store. Using the default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of preference: (1) Dataproc Metastore, (2) a Cloud SQL instance. |
| All masters | HiveServer2 | all | Serves queries against Hive received from clients (primarily beeline shell queries) |
| All masters | ZooKeeper | all | A ZooKeeper quorum is used for distributed coordination. In High Availability (HA) clusters, it is used for leader election of the HDFS NameNodes and YARN ResourceManagers. |
| Masters 0 and 1 only | HDFS NameNode | all | Manages the HDFS filesystem |
| Masters 0 and 1 only | ZKFC | all | ZKFC is the ZKFailoverController process, which runs alongside the HDFS NameNode. It monitors the health of the NameNode and manages leader election through ZooKeeper in the event of a failover. |
| Master 0 only | MapReduce JobHistory Server | all | Serves MapReduce application history information |
| Master 0 only | YARN Timeline Server | 1.3+ | Serves YARN application history information |
| Master 0 only | MariaDB | < 1.5 | A relational database used as the default underlying database for the Hive metastore in Dataproc < 1.5 images |
| Master 0 only | MySQL | 1.5+ | A relational database used as the default underlying database for the Hive metastore in Dataproc 1.5+ images |
| Master 0 only | NFS | < 1.3 | NFS is the Network File System. |
| Master 0 only | Spark History Server | all | Serves Spark application history information |
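
The HA placement described above (all masters, masters 0 and 1 only, master 0 only) can be expressed as a small lookup. This is an illustrative Python sketch only: the service names are the conventional Hadoop daemon names rather than identifiers taken from this page, the helper `services_on_master` is hypothetical, the sketch assumes an HA cluster with three masters indexed 0 to 2, and it lists only the services that run on all image versions.

```python
# Services grouped by which HA masters run them (image version "all" only).
# Names are conventional Hadoop daemon names, used here as an assumption.
ALL_MASTERS = [
    "HDFS JournalNode",
    "YARN ResourceManager",
    "Hive Metastore",
    "HiveServer2",
    "ZooKeeper",
]
MASTERS_0_AND_1 = ["HDFS NameNode", "ZKFC"]
MASTER_0_ONLY = ["MapReduce JobHistory Server", "Spark History Server"]


def services_on_master(index):
    """Return the services expected on the HA master with the given index."""
    if index not in (0, 1, 2):
        raise ValueError("this sketch assumes masters are indexed 0, 1, 2")
    services = list(ALL_MASTERS)
    if index in (0, 1):
        services += MASTERS_0_AND_1
    if index == 0:
        services += MASTER_0_ONLY
    return services
```

Under these assumptions, `services_on_master(2)` returns only the services shared by all masters, which reflects that the third master takes part in the JournalNode and ZooKeeper quorums but does not host a NameNode.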