This page describes the sequence of steps involved in the submission, execution, and completion of a Dataproc job. It also discusses job throttling and debugging.
Dataproc jobs flow
- User submits job to Dataproc.
  - JobStatus.State is marked as PENDING.
- Job waits to be acquired by the dataproc agent.
  - If the job is acquired, JobStatus.State is marked as RUNNING.
  - If the job is not acquired due to agent failure, Compute Engine network failure, or other cause, the job is marked ERROR.
- Once a job is acquired by the agent, the agent verifies that there are sufficient resources available on the Dataproc cluster's master node to start the driver.
  - If sufficient resources are not available, the job is delayed (throttled). JobStatus.Substate shows the job as QUEUED, and Job.JobStatus.details provides information on the cause of the delay.
- If sufficient resources are available, the dataproc agent starts the job driver process.
  - At this stage, there are typically one or more applications running in Apache Hadoop YARN. However, YARN applications may not start until the driver finishes scanning Cloud Storage directories or performing other start-up job tasks.
- The dataproc agent periodically sends updates to Dataproc on job progress, cluster metrics, and YARN applications associated with the job (see Job monitoring and debugging; a minimal status-polling sketch follows this list).
- YARN application(s) complete.
  - The job continues to be reported as RUNNING while the driver performs any job completion tasks, such as materializing collections.
  - An unhandled or uncaught failure in the main thread can leave the driver in a zombie state (marked as RUNNING without information as to the cause of the failure).
- Driver exits. The dataproc agent reports completion to Dataproc.
  - Dataproc reports the job as DONE.
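
The lifecycle above can be observed from a client by polling the job's status. The following is a minimal sketch using the google-cloud-dataproc Python client; the project ID, region, cluster name, and Cloud Storage path are placeholder values, and the PySpark job type is just one example of a submittable job.

```python
# A minimal sketch that submits a job and polls JobStatus.State through
# the lifecycle described above. The project ID, region, cluster name,
# and Cloud Storage path below are placeholders.
import time

from google.cloud import dataproc_v1

project_id = "my-project"     # placeholder
region = "us-central1"        # placeholder
cluster_name = "my-cluster"   # placeholder

# The job controller requires the regional endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Submit a PySpark job. The job enters the PENDING state.
job = client.submit_job(
    project_id=project_id,
    region=region,
    job={
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},
    },
)
job_id = job.reference.job_id

terminal_states = {
    dataproc_v1.JobStatus.State.DONE,
    dataproc_v1.JobStatus.State.ERROR,
    dataproc_v1.JobStatus.State.CANCELLED,
}

while True:
    job = client.get_job(project_id=project_id, region=region, job_id=job_id)
    status = job.status
    # While the job is throttled, JobStatus.Substate is QUEUED and
    # status.details explains the cause of the delay.
    if status.substate == dataproc_v1.JobStatus.Substate.QUEUED:
        print(f"Job queued: {status.details}")
    print(f"JobStatus.State: {status.state.name}")
    if status.state in terminal_states:
        break
    time.sleep(10)
```

Note that, as described above, a zombie driver is still reported as RUNNING, so a polling loop like this one cannot distinguish a healthy long-running driver from a hung one.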
Job concurrency
You can configure the maximum number of concurrent Dataproc jobs with the dataproc:dataproc.scheduler.max-concurrent-jobs cluster property when you create a cluster. If this property value is not set, the upper limit on concurrent jobs is calculated as max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5). masterMemoryMb is determined by the master VM's machine type. masterMemoryMbPerJob is 1024 by default, but is configurable at cluster creation with the dataproc:dataproc.scheduler.driver-size-mb cluster property.
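
To make the formula concrete, here is a small sketch that evaluates it. Treating the division as integer division and using 15360 MB (the memory of an n1-standard-4 master) are assumptions for illustration; the formula itself is from this page.

```python
# Evaluates the default concurrent-job limit described above.
# Integer division is an assumption made for this illustration.

def default_max_concurrent_jobs(master_memory_mb: int,
                                master_memory_mb_per_job: int = 1024) -> int:
    """Upper limit on concurrent jobs when
    dataproc:dataproc.scheduler.max-concurrent-jobs is not set."""
    return max((master_memory_mb - 3584) // master_memory_mb_per_job, 5)

# Assumed example: a 15360 MB (15 GB) master.
# max((15360 - 3584) // 1024, 5) = max(11, 5) = 11 concurrent jobs.
print(default_max_concurrent_jobs(15360))        # 11

# Raising dataproc:dataproc.scheduler.driver-size-mb to 2048 lowers it:
# max((15360 - 3584) // 2048, 5) = max(5, 5) = 5
print(default_max_concurrent_jobs(15360, 2048))  # 5
```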

