The Dataproc Docker on YARN feature lets you create and use a Docker image to customize your Spark job runtime environment. The image can include customizations to Java, Python, and R dependencies, and to your job jar.
Limitations
This feature is not available or supported with:
- Dataproc image versions prior to 2.0.49 (not available in 1.5 images)
- MapReduce jobs (only Spark jobs are supported)
- Spark client mode (only Spark cluster mode is supported)
- Kerberos clusters: cluster creation fails if you create a cluster with Docker on YARN and Kerberos enabled.
- Customizations of JDK, Hadoop, and Spark: the host JDK, Hadoop, and Spark are used, not your customizations.
Create a Docker image
The first step to customize your Spark environment is building a Docker image.
Dockerfile
You can use the following Dockerfile as an example, making changes and additions to meet your needs.
FROM debian:10-slim

# Suppress interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Required: Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Optional: Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY *.jar "${SPARK_EXTRA_JARS_DIR}"

# Optional: Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PYSPARK_DRIVER_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}

COPY Miniconda3-py39_4.10.3-Linux-x86_64.sh .
RUN bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Optional: Install Conda packages.
#
# The following packages are installed in the default image. It is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      fastavro \
      fastparquet \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-bigtable \
      google-cloud-container \
      google-cloud-datacatalog \
      google-cloud-dataproc \
      google-cloud-datastore \
      google-cloud-language \
      google-cloud-logging \
      google-cloud-monitoring \
      google-cloud-pubsub \
      google-cloud-redis \
      google-cloud-spanner \
      google-cloud-speech \
      google-cloud-storage \
      google-cloud-texttospeech \
      google-cloud-translate \
      google-cloud-vision \
      koalas \
      matplotlib \
      nltk \
      numba \
      numpy \
      openblas \
      orc \
      pandas \
      pyarrow \
      pysal \
      pytables \
      python \
      regex \
      requests \
      rtree \
      scikit-image \
      scikit-learn \
      scipy \
      seaborn \
      sqlalchemy \
      sympy \
      virtualenv

# Optional: Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Required: Create the 'yarn_docker_user' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 yarn_docker_user
RUN useradd -u 1099 -g 1099 -d /home/yarn_docker_user -m yarn_docker_user

USER yarn_docker_user
 
Build and push the image
The following commands build and push the example Docker image; you can modify them according to your customizations.
# Increase the version number when there is a change to avoid referencing
# a cached older image. Avoid reusing the version number, including the default
# `latest` version.
IMAGE=gcr.io/my-project/my-image:1.0.1

# Download the BigQuery connector.
gcloud storage cp \
  gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .

# Download the Miniconda3 installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh

# Python module example:
cat >test_util.py <<EOF
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
 
 
Create a Dataproc cluster
After creating a Docker image that customizes your Spark environment, create a Dataproc cluster that will use your Docker image when running Spark jobs.
gcloud
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=DP_IMAGE \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=DOCKER_IMAGE \
    other flags
Replace the following:
- CLUSTER_NAME: The cluster name.
- REGION: The cluster region.
- DP_IMAGE: The Dataproc image version, which must be 2.0.49 or later (--image-version=2.0 will use a qualified minor version later than 2.0.49).
- --optional-components=DOCKER: Enables the Docker component on the cluster.
- --properties flag:
  - dataproc:yarn.docker.enable=true: Required property to enable the Dataproc Docker on YARN feature.
  - dataproc:yarn.docker.image: Optional property that you can add to specify your DOCKER_IMAGE using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}.
    Example: dataproc:yarn.docker.image=gcr.io/project-id/image:1.0.1
    Requirement: You must host your Docker image on Container Registry or Artifact Registry (Dataproc cannot fetch containers from other registries).
    Recommendation: Add this property when you create your cluster to cache your Docker image and avoid YARN timeouts later when you submit a job that uses the image.
 
- When dataproc:yarn.docker.enable is set to true, Dataproc updates Hadoop and Spark configurations to enable the Docker on YARN feature in the cluster. For example, spark.submit.deployMode is set to cluster, and spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS and spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS are set to mount directories from the host into the container. A filled-in example of the cluster creation command is shown below.
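For example, with illustrative values (the cluster name, region, and image URI are placeholders, not defaults), the command might look like this:

gcloud dataproc clusters create my-docker-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=gcr.io/my-project/my-image:1.0.1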
Submit a Spark job to the cluster
After creating a Dataproc cluster, submit a Spark job to the cluster that uses your Docker image. The example in this section submits a PySpark job to the cluster.
Set job properties:
# Set the Docker image URI (for example, gcr.io/my-project/my-image:1.0.1).
IMAGE=gcr.io/my-project/my-image:1.0.1

# Required: Use `#` as the delimiter for properties to avoid conflicts.
JOB_PROPERTIES='^#^'

# Required: Set Spark properties with the Docker image.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"

# Optional: Add custom jars to Spark classpath. Don't set these properties if
# there are no customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.driver.extraClassPath=/opt/spark/jars/*"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executor.extraClassPath=/opt/spark/jars/*"

# Optional: Set custom PySpark Python path only if there are customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.python=/opt/miniconda3/bin/python"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.driver.python=/opt/miniconda3/bin/python"

# Optional: Set custom Python module path only if there are customizations.
# Since the `PYTHONPATH` environment variable defined in the Dockerfile is
# overridden by Spark, it must be set as a job property.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.PYTHONPATH=/opt/python/packages"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.PYTHONPATH=/opt/python/packages"
 
 
Notes:
- See Launching Applications Using Docker Containers for information on related properties.
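For reference, the PYFILE you submit in the next step can be a small script that exercises the image customizations. The following is a minimal sketch; the file name spark_docker_check.py is illustrative, and test_util is the module added in the example Dockerfile.

# Illustrative PySpark job that uses the custom Python environment and module path.
cat >spark_docker_check.py <<'EOF'
from pyspark.sql import SparkSession

import test_util  # resolved through spark.*.PYTHONPATH=/opt/python/packages

spark = SparkSession.builder.appName("docker-on-yarn-check").getOrCreate()
test_util.hello("Docker on YARN")
print(spark.sparkContext.parallelize(range(10)).sum())  # expect 45
spark.stop()
EOF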
gcloud
Submit the job to the cluster.
gcloud dataproc jobs submit pyspark PYFILE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=${JOB_PROPERTIES}
Replace the following (an example with the placeholders filled in follows the list):
- PYFILE: The file path to your PySpark job file. It can be a local file path or the URI of the file in Cloud Storage (gs://BUCKET_NAME/PySpark filename).
- CLUSTER_NAME: The cluster name.
- REGION: The cluster region.
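For example, with the illustrative names used earlier on this page, the submit command might look like this:

gcloud dataproc jobs submit pyspark spark_docker_check.py \
    --cluster=my-docker-cluster \
    --region=us-central1 \
    --properties=${JOB_PROPERTIES}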

