"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Managed Service for Apache Spark services

This page lists the services that run on Managed Service for Apache Spark cluster nodes and the image versions that include each service.

All nodes

The following services run on all nodes in a cluster.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| All nodes | google-dataproc-agent | all | Receives jobs from Managed Service for Apache Spark and launches job drivers |
| All nodes | google-fluentd | all | Collects and pushes logs to Logging |
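To confirm that these services are running on a node, one option is to query the service manager. The following is a minimal sketch, assuming the services are registered as systemd units under the same names as in the table above (this page does not state that, so treat it as an assumption). The helper names `systemctl_is_active` and `service_states` are hypothetical and only for illustration.

```python
import subprocess
from typing import Callable, Dict, Iterable

# Services listed on this page as running on every node.
# Assumption: each one is a systemd unit with the same name.
ALL_NODE_SERVICES = ("google-dataproc-agent", "google-fluentd")


def systemctl_is_active(unit: str) -> str:
    """Return the state string printed by `systemctl is-active <unit>`."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code only means the unit is not active
    )
    return result.stdout.strip() or "unknown"


def service_states(
    units: Iterable[str] = ALL_NODE_SERVICES,
    probe: Callable[[str], str] = systemctl_is_active,
) -> Dict[str, str]:
    """Map each unit name to the state reported by the probe function."""
    return {unit: probe(unit) for unit in units}
```

Calling `service_states()` on a cluster node returns a dictionary keyed by service name. The `probe` parameter exists so that the lookup can be replaced with a stub when the code runs somewhere without systemd.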

Standard clusters

The following services run on standard clusters.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| Master | hadoop-hdfs-namenode | all | Manages the HDFS filesystem |
| Master | hadoop-hdfs-secondarynamenode | all | Checkpoints the NameNode |
| Master | hadoop-mapreduce-historyserver | all | Serves MapReduce application history information |
| Master | hadoop-yarn-resourcemanager | all | Schedules and manages YARN applications |
| Master | hadoop-yarn-timelineserver | 1.3+ | Serves YARN application history information |
| Master | hive-metastore | all | Manages Hive table metadata. By default, uses the local mariadb (image versions < 1.5) or mysql (image versions 1.5+) database on the master node as the Hive table metadata store. Using the default database is not recommended because these databases are tied to the cluster's lifecycle. Instead, use either of the following as the Hive metastore database (in recommendation order): |
| Master | hive-server2 | all | Serves queries received from clients (primarily beeline shell queries) against Hive |
| Master | mariadb | < 1.5 | A relational database used as the default underlying database for Hive metastore in Managed Service for Apache Spark < 1.5 images |
| Master | mysql | 1.5+ | A relational database used as the default underlying database for Hive metastore in Managed Service for Apache Spark 1.5+ images |
| Master | nfs-kernel-server | < 1.3 | NFS is the Network File System. |
| Master | spark-history-server | all | Serves Spark application history information |
| All Workers | hadoop-yarn-nodemanager | all | Launches and manages YARN containers |
| Primary Workers only | hadoop-hdfs-datanode | all | Stores HDFS blocks |
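The image-version column uses three forms: `all`, a minimum version such as `1.3+`, and an upper bound such as `< 1.5`. The following is a minimal sketch of how those constraints could be evaluated to list the services expected for a node type on a given image. It assumes image versions are written as `major.minor` (for example `1.4` or `2.1`). The names `STANDARD_CLUSTER_SERVICES`, `parse_version`, `matches`, and `services_for` are hypothetical and are not part of any product API.

```python
from typing import List, Tuple

# (node type, service, image-version constraint), copied from the table above.
STANDARD_CLUSTER_SERVICES: List[Tuple[str, str, str]] = [
    ("Master", "hadoop-hdfs-namenode", "all"),
    ("Master", "hadoop-hdfs-secondarynamenode", "all"),
    ("Master", "hadoop-mapreduce-historyserver", "all"),
    ("Master", "hadoop-yarn-resourcemanager", "all"),
    ("Master", "hadoop-yarn-timelineserver", "1.3+"),
    ("Master", "hive-metastore", "all"),
    ("Master", "hive-server2", "all"),
    ("Master", "mariadb", "< 1.5"),
    ("Master", "mysql", "1.5+"),
    ("Master", "nfs-kernel-server", "< 1.3"),
    ("Master", "spark-history-server", "all"),
    ("All Workers", "hadoop-yarn-nodemanager", "all"),
    ("Primary Workers only", "hadoop-hdfs-datanode", "all"),
]


def parse_version(version: str) -> Tuple[int, int]:
    """Parse a 'major.minor' string into a comparable tuple (assumed format)."""
    major, minor = version.strip().split(".")[:2]
    return int(major), int(minor)


def matches(constraint: str, image_version: str) -> bool:
    """Check an image version against 'all', 'X.Y+', or '< X.Y'."""
    constraint = constraint.strip()
    if constraint == "all":
        return True
    current = parse_version(image_version)
    if constraint.endswith("+"):
        return current >= parse_version(constraint[:-1])
    if constraint.startswith("<"):
        return current < parse_version(constraint[1:])
    raise ValueError(f"Unrecognized constraint: {constraint!r}")


def services_for(node_type: str, image_version: str) -> List[str]:
    """List services in rows whose node type and version constraint match."""
    return [
        service
        for row_type, service, constraint in STANDARD_CLUSTER_SERVICES
        if row_type == node_type and matches(constraint, image_version)
    ]
```

For example, `services_for("Master", "1.4")` includes `mariadb` and `hadoop-yarn-timelineserver` but not `mysql` or `nfs-kernel-server`. Because the lookup matches the node-type label exactly, the full set for a primary worker is the union of the "All Workers" and "Primary Workers only" rows.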

HA Clusters

In Managed Service for Apache Spark High Availability (HA) clusters, different services run on different master nodes, as shown below. HA cluster worker node services are the same as those listed for standard clusters.

Node type

Service

Image versions

Description

All masters

hadoop-hdfs-journalnode

all

A quorum of journal nodes maintains an edit log of HDFS namespace modifications. If a failover occurs, the Standby NameNode reads the edit log and takes control from the Active NameNode.

hadoop-yarn-resourcemanager

all

Schedules and manages YARN applications

hive-metastore

all

Manages Hive table metadata

hive-server2

all

Serves queries received from clients (primarily beeline shell queries) against Hive

zookeeper-server

all

A ZooKeeper quorum is used for distributed coordination. In High Availability (HA) clusters, it is used for leader election among HDFS NameNodes and YARN ResourceManagers.

Masters 0 and 1 only

hadoop-hdfs-namenode

all

Manages the HDFS filesystem

hadoop-hdfs-zkfc

all

ZKFC is the ZKFailoverController process, which runs with the HDFS NameNode. It monitors the health of the NameNode, and manages leader election via ZooKeeper in the event of a failover.

Master 0 only

hadoop-mapreduce-historyserver

all

Serves MapReduce application history information

hadoop-yarn-timelineserver

1.3+

Serves YARN application history information

mariadb

< 1.5

A relational database used as the default underlying database for Hive metastore in Managed Service for Apache Spark < 1.5 images

mysql

1.5+