Customize your Spark job runtime environment with Docker on YARN

The Dataproc Docker on YARN feature lets you create and use a Docker image to customize your Spark job runtime environment. The image can include customizations to Java, Python, and R dependencies, and to your job jar.

Limitations

This feature is not available or supported with:

  • Dataproc image versions earlier than 2.0.49 (not available in 1.5 images). A quick version check sketch follows this list.
  • MapReduce jobs (it is supported only for Spark jobs).
  • Spark client mode (it is supported only with Spark cluster mode).
  • Kerberos clusters: cluster creation fails if you create a cluster with Docker on YARN and Kerberos enabled.
  • Customizations of the JDK, Hadoop, and Spark: the host JDK, Hadoop, and Spark are used, not your customizations.
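
If you are unsure whether an existing cluster meets the image version requirement, you can check its Dataproc image version before enabling the feature. The following is a minimal sketch; CLUSTER_NAME and REGION are placeholders for your own values.

# Print the Dataproc image version of an existing cluster.
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION \
    --format="value(config.softwareConfig.imageVersion)"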

Create a Docker image

The first step in customizing your Spark environment is to build a Docker image.

Dockerfile

You can use the following Dockerfile as an example, making changes and additions to meet your needs.

FROM debian:10-slim

# Suppress interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Required: Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Optional: Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY *.jar "${SPARK_EXTRA_JARS_DIR}"

# Optional: Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PYSPARK_DRIVER_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}

COPY Miniconda3-py39_4.10.3-Linux-x86_64.sh .
RUN bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Optional: Install Conda packages.
#
# The following packages are installed in the default image. It is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      fastavro \
      fastparquet \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-bigtable \
      google-cloud-container \
      google-cloud-datacatalog \
      google-cloud-dataproc \
      google-cloud-datastore \
      google-cloud-language \
      google-cloud-logging \
      google-cloud-monitoring \
      google-cloud-pubsub \
      google-cloud-redis \
      google-cloud-spanner \
      google-cloud-speech \
      google-cloud-storage \
      google-cloud-texttospeech \
      google-cloud-translate \
      google-cloud-vision \
      koalas \
      matplotlib \
      nltk \
      numba \
      numpy \
      openblas \
      orc \
      pandas \
      pyarrow \
      pysal \
      pytables \
      python \
      regex \
      requests \
      rtree \
      scikit-image \
      scikit-learn \
      scipy \
      seaborn \
      sqlalchemy \
      sympy \
      virtualenv

# Optional: Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Required: Create the 'yarn_docker_user' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 yarn_docker_user
RUN useradd -u 1099 -g 1099 -d /home/yarn_docker_user -m yarn_docker_user
USER yarn_docker_user

Build and push the image

The following commands build and push the example Docker image; change them to match your customizations.

# Increase the version number when there is a change to avoid referencing
# a cached older image. Avoid reusing the version number, including the default
# `latest` version.
IMAGE=gcr.io/my-project/my-image:1.0.1

# Download the BigQuery connector.
gcloud storage cp \
    gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .

# Download the Miniconda3 installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh

# Python module example:
cat >test_util.py <<EOF
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
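Before pushing, you can optionally sanity-check the image locally. The following is a minimal sketch that assumes the Miniconda path, Python module path, and yarn_docker_user settings from the example Dockerfile above.

# Confirm the container runs as the required UID/GID (1099).
docker run --rm "${IMAGE}" id

# Confirm the custom Python, Conda packages, and extra module are usable.
docker run --rm "${IMAGE}" /opt/miniconda3/bin/python -c \
    "import pandas, test_util; test_util.hello('local test')"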

Create a Dataproc cluster

After creating a Docker image that customizes your Spark environment, create a Dataproc cluster that will use your Docker image when running Spark jobs.

gcloud

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=DP_IMAGE \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=DOCKER_IMAGE \
    other flags

Replace the following:

  • CLUSTER_NAME : The cluster name.
  • REGION : The cluster region.
  • DP_IMAGE : The Dataproc image version, which must be 2.0.49 or later ( --image-version=2.0 uses a qualified minor version later than 2.0.49 ).
  • --optional-components=DOCKER : Enables the Docker component on the cluster.
  • --properties flag:
    • dataproc:yarn.docker.enable=true : Required property to enable the Dataproc Docker on YARN feature.
    • dataproc:yarn.docker.image : Optional property that you can add to specify your DOCKER_IMAGE using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag} .

      Example:

      dataproc:yarn.docker.image=gcr.io/project-id/image:1.0.1

      Requirement: You must host your Docker image on Container Registry or Artifact Registry (Dataproc cannot fetch containers from other registries).

      Recommendation: Add this property when you create your cluster to cache your Docker image and avoid YARN timeouts later when you submit a job that uses the image. A complete example command is shown after this list.

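For reference, a complete cluster creation command with the placeholders filled in might look like the following; the cluster name, region, and image URI are hypothetical values.

gcloud dataproc clusters create my-docker-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=gcr.io/my-project/my-image:1.0.1
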
When dataproc:yarn.docker.enable is set to true, Dataproc updates Hadoop and Spark configurations to enable the Docker on YARN feature in the cluster. For example, spark.submit.deployMode is set to cluster, and spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS and spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS are set to mount directories from the host into the container.
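
If you want to confirm these settings on a running cluster, one informal option is to inspect the generated Spark defaults on the master node. This is only a sketch: it assumes the standard Dataproc configuration path /etc/spark/conf/spark-defaults.conf, a master VM named CLUSTER_NAME-m (the usual default), and ZONE as a placeholder for the master VM's zone.

# Show the deploy mode and Docker mount properties generated by Dataproc.
gcloud compute ssh CLUSTER_NAME-m \
    --zone=ZONE \
    --command="grep -E 'spark.submit.deployMode|DOCKER_MOUNTS' /etc/spark/conf/spark-defaults.conf"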

Submit a Spark job to the cluster

After creating a Dataproc cluster, submit a Spark job to the cluster that uses your Docker image. The example in this section submits a PySpark job to the cluster.

Set job properties:

# Set the Docker image URI (for example, gcr.io/my-project/my-image:1.0.1).
IMAGE=gcr.io/my-project/my-image:1.0.1

# Required: Use `#` as the delimiter for properties to avoid conflicts.
JOB_PROPERTIES='^#^'

# Required: Set Spark properties with the Docker image.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"

# Optional: Add custom jars to the Spark classpath. Don't set these properties
# if there are no customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.driver.extraClassPath=/opt/spark/jars/*"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executor.extraClassPath=/opt/spark/jars/*"

# Optional: Set a custom PySpark Python path only if there are customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.python=/opt/miniconda3/bin/python"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.driver.python=/opt/miniconda3/bin/python"

# Optional: Set a custom Python module path only if there are customizations.
# Since the `PYTHONPATH` environment variable defined in the Dockerfile is
# overridden by Spark, it must be set as a job property.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.PYTHONPATH=/opt/python/packages"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.PYTHONPATH=/opt/python/packages"
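To sanity-check the assembled string before submitting, you can print the properties one per line; this is just an optional convenience step.

# Print the assembled properties (the leading ^#^ selects `#` as the delimiter).
echo "${JOB_PROPERTIES}" | tr '#' '\n'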


gcloud

Submit the job to the cluster.

gcloud dataproc jobs submit pyspark PYFILE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=${JOB_PROPERTIES}

Replace the following:

  • PYFILE : The file path to your PySpark job file. It can be a local file path or the URI of a file in Cloud Storage (gs://BUCKET_NAME/PYSPARK_FILENAME).
  • CLUSTER_NAME : The cluster name.
  • REGION : The cluster region.
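
For example, the following sketch creates a small PySpark file that exercises the test_util module baked into the example image, then submits it with the job properties set earlier. The file name, cluster name, and region are hypothetical values.

# Hypothetical PySpark job that uses the custom Python module from the image.
cat >spark_docker_test.py <<EOF
from pyspark.sql import SparkSession
import test_util

spark = SparkSession.builder.appName("docker-on-yarn-test").getOrCreate()
test_util.hello("docker-on-yarn")
print(spark.range(100).count())
spark.stop()
EOF

gcloud dataproc jobs submit pyspark spark_docker_test.py \
    --cluster=my-docker-cluster \
    --region=us-central1 \
    --properties="${JOB_PROPERTIES}"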