You can specify a custom container image to use with Dataproc on GKE. Your custom container image must use one of the Dataproc on GKE base Spark images.
Use a custom container image
To use a Dataproc on GKE custom container image, set the spark.kubernetes.container.image property when you create a Dataproc on GKE virtual cluster or submit a Spark job to the cluster.
- gcloud CLI cluster creation example:

gcloud dataproc clusters gke create "${DP_CLUSTER}" \
    --properties=spark:spark.kubernetes.container.image=custom-image \
    ... other args ...

- gcloud CLI job submit example:

gcloud dataproc jobs submit spark \
    --properties=spark.kubernetes.container.image=custom-image \
    ... other args ...
Custom container image requirements and settings
Base images
You can use Docker tools to build a customized Docker image based upon one of the published Dataproc on GKE base Spark images.
Container user
Dataproc on GKE runs Spark containers as the Linux spark user with a UID of 1099 and a GID of 1099. Use this UID and GID for filesystem permissions.
For example, if you add a jar file at /opt/spark/jars/my-lib.jar in the image as a workload dependency, you must give the spark user read permission to the file.
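For example, shell commands like the following, placed in a Dockerfile RUN step (or replaced by COPY --chown=spark:spark), are one way to grant that access. This is a sketch; my-lib.jar is the hypothetical jar from the sentence above.

# Sketch: make an added jar readable by the spark user (UID 1099, GID 1099).
# my-lib.jar is a hypothetical dependency used only for illustration.
chown spark:spark /opt/spark/jars/my-lib.jar
chmod 0644 /opt/spark/jars/my-lib.jar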
Components
- Java: The JAVA_HOME environment variable points to the location of the Java installation. The current default value is /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64, which is subject to change (see the Dataproc release notes for updated information).
  - If you customize the Java environment, make sure that JAVA_HOME is set to the correct location and PATH includes the path to binaries.
- Python: Dataproc on GKE base Spark images have Miniconda3 installed at /opt/conda. CONDA_HOME points to this location, ${CONDA_HOME}/bin is included in PATH, and PYSPARK_PYTHON is set to ${CONDA_HOME}/python.
  - If you customize Conda, make sure that CONDA_HOME points to the Conda home directory, ${CONDA_HOME}/bin is included in PATH, and PYSPARK_PYTHON is set to ${CONDA_HOME}/python.
  - You can install, remove, and update packages in the default base environment, or create a new environment, but it is strongly recommended that the environment include all packages installed in the base environment of the base container image.
  - If you add Python modules, such as a Python script with utility functions, to the container image, include the module directories in PYTHONPATH.
- Spark: Spark is installed in /usr/lib/spark, and SPARK_HOME points to this location. Spark cannot be customized; if it is changed, the container image will be rejected or fail to operate correctly.
  - Jobs: You can customize Spark job dependencies. SPARK_EXTRA_CLASSPATH defines the extra classpath for Spark JVM processes. Recommendation: put jars under /opt/spark/jars, and set SPARK_EXTRA_CLASSPATH to /opt/spark/jars/*. If you embed the job jar in the image, the recommended directory is /opt/spark/job. When you submit the job, you can reference it with a local path, for example, file:///opt/spark/job/my-spark-job.jar (see the job submit sketch after this list).
- Cloud Storage connector: The Cloud Storage connector is installed at /usr/lib/spark/jars.
- Utilities: The procps and tini utility packages are required to run Spark. These utilities are included in the base Spark images, so custom images do not need to re-install them.
- Entrypoint: Dataproc on GKE ignores any changes made to the ENTRYPOINT and CMD primitives in the container image.
- Initialization scripts: You can add an optional initialization script at /opt/init-script.sh. An initialization script can download files from Cloud Storage, start a proxy within the container, call other scripts, and perform other startup tasks. The entrypoint script calls the initialization script with all command-line args ($@) before starting the Spark driver, Spark executor, and other processes. The initialization script can select the type of Spark process based on the first arg ($1): possible values include spark-submit for driver containers and executor for executor containers (see the example init script after this list).
- Configs: Spark configs are located under /etc/spark/conf. The SPARK_CONF_DIR environment variable points to this location. Don't customize Spark configs in the container image. Instead, submit any properties via the Dataproc on GKE API (as in the job submit sketch after this list) for the following reasons:
  - Some properties, such as executor memory size, are determined at runtime, not at container image build time; they must be injected by Dataproc on GKE.
  - Dataproc on GKE places restrictions on the properties supplied by users.
  Dataproc on GKE mounts configs from a configMap into /etc/spark/conf in the container, overriding settings embedded in the image.
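The following is a minimal sketch of an init script for /opt/init-script.sh, assuming you only need to branch on the process type passed in $1; the commands inside each branch are placeholders for your own startup logic.

#!/bin/bash
# Sketch of an optional init script. The entrypoint script calls it with all
# command-line args, so $1 identifies the Spark process being started.

ROLE="$1"

case "${ROLE}" in
  spark-submit)
    # Driver container: driver-only setup, such as downloading files from
    # Cloud Storage or starting a proxy, goes here.
    echo "driver init" > /tmp/init-script.out
    ;;
  executor)
    # Executor container: executor-only setup goes here.
    echo "executor init" > /tmp/init-script.out
    ;;
  *)
    # Other processes: no special setup in this sketch.
    ;;
esac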
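As a sketch of how an embedded job jar and runtime properties come together at submit time, the following hypothetical command references the jar at file:///opt/spark/job/my-spark-job.jar and injects properties through the API rather than the image; the main class, memory size, and custom-image value are placeholders.

# Placeholder values: com.example.MySparkJob, 4g, and custom-image.
gcloud dataproc jobs submit spark \
    --cluster="${DP_CLUSTER}" \
    --region="${REGION}" \
    --class=com.example.MySparkJob \
    --jars=file:///opt/spark/job/my-spark-job.jar \
    --properties=spark.executor.memory=4g,spark.kubernetes.container.image=custom-image \
    ... other args ...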
 
Base Spark images
Dataproc supports the following base Spark container images:
- Spark 3.5: ${REGION}-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.2
Sample custom container image build
Sample Dockerfile
FROM us-central1-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.0:latest

# Change to root temporarily so that it has permissions to create dirs and copy
# files.
USER root

# Add a BigQuery connector jar.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
    && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
COPY --chown=spark:spark \
    spark-bigquery-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}"

# Install Cloud Storage client Conda package.
RUN "${CONDA_HOME}/bin/conda" install google-cloud-storage

# Add a custom Python file.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Add an init script.
COPY --chown=spark:spark init-script.sh /opt/init-script.sh

# (Optional) Set user back to `spark`.
USER spark
Build the container image
Run the following commands in the Dockerfile directory:
- Set the image (example: us-central1-docker.pkg.dev/my-project/spark/spark-test-image:latest) and change to the build directory.

IMAGE=custom container image
BUILD_DIR=$(mktemp -d)
cd "${BUILD_DIR}"
- Download the BigQuery connector.

gcloud storage cp \
    gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .
- Create a Python example file.

cat >test_util.py <<'EOF'
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF
- Create an example init script.

cat >init-script.sh <<EOF
echo "hello world" >/tmp/init-script.out
EOF
- Build and push the image.

docker build -t "${IMAGE}" . && docker push "${IMAGE}"
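Optionally, before using the image with Dataproc on GKE, you can spot-check it locally. This sketch overrides the container entrypoint (in case the base image defines one) to list the files added by the Dockerfile above.

# Optional local sanity checks of the custom image contents.
docker run --rm --entrypoint ls "${IMAGE}" -l /opt/spark/jars/
docker run --rm --entrypoint ls "${IMAGE}" -l /opt/python/packages/
docker run --rm --entrypoint ls "${IMAGE}" -l /opt/init-script.sh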

