Notes:
-  Version 2.3 is a lightweight image that contains only core components, which reduces exposure to Common Vulnerabilities and Exposures (CVEs). To meet stricter security compliance requirements, use image version 2.3 or later when you create a Dataproc cluster.
-  If you choose to install optional components when you create a Dataproc cluster with a 2.3 image, they are downloaded and installed during cluster creation, which can increase cluster startup time. To avoid this delay, create a custom image with the optional components pre-installed by running generate_custom_image.py with the --optional-components flag.
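
As a rough illustration of the custom-image approach, the following Python sketch assembles a generate_custom_image.py command line. The image name, version string, script path, zone, and bucket are hypothetical placeholders, and the comma-separated value format for --optional-components is an assumption. Check the script's --help output for the exact syntax before relying on it.

```python
import shlex


def build_custom_image_command(image_name, dataproc_version, customization_script,
                               zone, gcs_bucket, optional_components):
    """Return the argument list for a generate_custom_image.py run.

    Assumption: --optional-components accepts a comma-separated list of
    component names. All values passed in are caller-supplied placeholders.
    """
    return [
        "python3", "generate_custom_image.py",
        "--image-name", image_name,
        "--dataproc-version", dataproc_version,
        "--customization-script", customization_script,
        "--zone", zone,
        "--gcs-bucket", gcs_bucket,
        "--optional-components", ",".join(optional_components),
    ]


command = build_custom_image_command(
    image_name="my-custom-2-3-image",            # hypothetical image name
    dataproc_version="2.3.6-debian12",           # hypothetical sub-minor version
    customization_script="customize.sh",         # hypothetical script path
    zone="us-central1-a",                        # hypothetical zone
    gcs_bucket="gs://my-bucket",                 # hypothetical bucket
    optional_components=["DOCKER", "ZEPPELIN"],  # assumed component spelling
)

# Print a shell-quoted command for review; nothing is executed here.
print(shlex.join(command))
```

Building the argument list first, rather than running the script directly, makes it easy to review or log the exact command before creating the image.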
Notes
-  The following optional components are supported in non-arm 2.3 images:
   - Apache Flink
   - Apache Hive WebHCat
   - Apache Hudi
   - Apache Iceberg
   - Apache Pig
   - Delta Lake
   - Docker
   - JupyterLab Notebook
   - Ranger
   - Solr
   - Trino
   - Zeppelin notebook
   - Zookeeper
 
-  2.3.x-*-arm images don't support initialization actions. In addition to the pre-installed components, they support only the following optional components; the other 2.3 optional components aren't supported:
   - Apache Hive WebHCat
   - Docker
   - Zeppelin notebook
   - Zookeeper (installed in high availability clusters; optional component in other clusters)
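
The two support lists above can be encoded as a small lookup to catch unsupported requests before cluster creation. This is a minimal sketch: the function name is hypothetical, and the strings are the display names from the lists above, not necessarily the identifiers the Dataproc API or gcloud CLI expects.

```python
# Optional components listed for non-arm 2.3 images.
NON_ARM_OPTIONAL = frozenset({
    "Apache Flink", "Apache Hive WebHCat", "Apache Hudi", "Apache Iceberg",
    "Apache Pig", "Delta Lake", "Docker", "JupyterLab Notebook", "Ranger",
    "Solr", "Trino", "Zeppelin notebook", "Zookeeper",
})

# Optional components listed for 2.3.x-*-arm images.
ARM_OPTIONAL = frozenset({
    "Apache Hive WebHCat", "Docker", "Zeppelin notebook", "Zookeeper",
})


def unsupported_components(requested, arm_image):
    """Return the requested components that the image variant does not list."""
    supported = ARM_OPTIONAL if arm_image else NON_ARM_OPTIONAL
    return sorted(set(requested) - supported)


print(unsupported_components(["Docker", "Trino"], arm_image=True))   # ['Trino']
print(unsupported_components(["Docker", "Trino"], arm_image=False))  # []
```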
 
-  yarn.nodemanager.recovery.enabled and HDFS Audit Logging are enabled by default in 2.3 images.
-  micromamba is installed as part of the Python installation, instead of the conda installed in previous image versions.
-  Docker and Zeppelin installation issues:
   - Installation fails if the cluster has no public internet access. As a workaround, create a cluster that uses a custom image with the optional components pre-installed. To do this, run generate_custom_image.py with the --optional-components flag.
   - Installation can fail if the cluster is pinned to an older sub-minor image version: packages are installed on demand from public OSS repositories, and a package might not be available upstream to support the installation. As a workaround, create a cluster that uses a custom image with the optional components pre-installed. To do this, run generate_custom_image.py with the --optional-components flag.
 
-  The default resource calculator for YARN has been changed from DefaultResourceCalculator to DominantResourceCalculator, which uses the dominant-resource concept to determine resource allocation across resources such as memory and CPU. This change affects the Autoscaler, which scales based on the cluster's dominant resource usage.
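
To illustrate the dominant-resource concept, the following sketch compares a memory-only usage share with a dominant share computed across memory and vcores. It is a simplified model with hypothetical names and cluster sizes, not YARN's or the Autoscaler's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


def memory_only_share(used, total):
    """Usage share when only memory is considered."""
    return used.memory_mb / total.memory_mb


def dominant_share(used, total):
    """Usage share of whichever resource is most heavily used."""
    return max(used.memory_mb / total.memory_mb, used.vcores / total.vcores)


# Hypothetical cluster: 100 GiB of memory and 32 vcores in total.
cluster = Resource(memory_mb=102400, vcores=32)
# A CPU-heavy workload: 25% of memory but 75% of vcores in use.
used = Resource(memory_mb=25600, vcores=24)

print(memory_only_share(used, cluster))  # 0.25
print(dominant_share(used, cluster))     # 0.75
```

In this example a memory-only view reports the cluster as 25% used, while the dominant-resource view reports 75%, so CPU-bound workloads can look much busier under the new calculator.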
Image version 2.3 machine learning (ML) components
The Dataproc 2.3-ml-ubuntu 
image extends the 2.3 base image
with ML-specific software. It supports 2.3 image optional components and other
2.3 features, and adds the component versions listed in the following sections.
GPU-specific libraries
For Dataproc jobs that use GPU VMs,
the following NVIDIA driver and libraries are available in the 2.3-ml-ubuntu 
image. You can use them to accomplish the following
tasks:
- Accelerate Spark batch workloads with the NVIDIA Spark Rapids library
- Train machine learning workloads
- Run distributed batch inference using Spark
| Package Name | Version | 
|---|---|
| Spark Rapids | 25.04.0 | 
| NVIDIA Driver | Ubuntu 22.04 LTS Accelerated with NVIDIA driver version 570 | 
| CUDA | 12.6.3 | 
| cublas | 12.6.4 | 
| cusolver | 11.7.1 | 
| cupti | 12.6.80 | 
| cusparse | 12.5.4 | 
| cuDNN | 9.10.1 | 
| NCCL | 2.27.5 | 
XGBoost libraries
The following Maven package versions are available in the 2.3-ml-ubuntu
image to let you use XGBoost with Spark in Java or Scala.
| Group ID | Package Name | Version | 
|---|---|---|
| ml.dmlc | xgboost4j-gpu_2.12 | 2.1.1 |
| ml.dmlc | xgboost4j-spark-gpu_2.12 | 2.1.1 |
Python libraries
The 2.3-ml-ubuntu 
image contains the following libraries, which support different
stages in the ML lifecycle.
| Package | Version | 
|---|---|
| accelerate | 1.8.1 | 
| conda | 23.11.0 | 
| cookiecutter | 2.5.0 | 
| curl | 8.12.1 | 
| cython | 3.0.12 | 
| dask | 2023.12.1 | 
| datasets | 3.6.0 | 
| deepspeed | 0.17.2 | 
| delta-spark | 3.2.0 | 
| evaluate | 0.4.5 | 
| fastavro | 1.9.7 | 
| fastparquet | 2023.10.1 | 
| fiona | 1.10.0 | 
| gateway-provisioners[yarn] | 0.4.0 | 
| gcsfs | 2023.12.2.post1 | 
| google-auth-oauthlib | 1.2.2 | 
| google-cloud-aiplatform | 1.88.0 | 
| google-cloud-bigquery[pandas] | 3.31.0 | 
| google-cloud-bigquery-storage | 2.30.0 | 
| google-cloud-bigtable | 2.30.1 | 
| google-cloud-container | 2.56.1 | 
| google-cloud-datacatalog | 3.26.1 | 
| google-cloud-dataproc | 5.18.1 | 
| google-cloud-datastore | 2.21.0 | 
| google-cloud-language | 2.17.2 | 
| google-cloud-logging | 3.11.4 | 
| google-cloud-monitoring | 2.27.2 | 
| google-cloud-pubsub | 2.29.1 | 
| google-cloud-redis | 2.18.1 | 
| google-cloud-spanner | 3.53.0 | 
| google-cloud-speech | 2.32.0 | 
| google-cloud-storage | 2.19.0 | 
| google-cloud-texttospeech | 2.25.1 | 
| google-cloud-translate | 3.20.3 | 
| google-cloud-vision | 3.10.2 | 
| huggingface_hub | 0.33.1 | 
| httplib2 | 0.22.0 | 
| ipyparallel | 8.6.1 | 
| ipython-sql | 0.3.9 | 
| ipywidgets | 8.1.7 | 
| jupyter_contrib_nbextensions | 0.7.0 | 
| jupyter_http_over_ws | 0.0.8 | 
| jupyter_kernel_gateway | 2.5.2 | 
| jupyter_server | 1.24.0 | 
| jupyterhub | 4.1.6 | 
| jupyterlab | 3.6.8 | 
| jupyterlab-git | 0.44.0 | 
| jupyterlab_widgets | 3.0.15 | 
| koalas | 0.22.0 | 
| langchain | 0.3.26 | 
| lightgbm | 4.6.0 | 
| markdown | 3.5.2 | 
| matplotlib | 3.8.4 | 
| mlflow | 3.1.1 | 
| nbconvert | 7.14.2 | 
| nbdime | 3.2.1 | 
| nltk | 3.9.1 | 
| notebook | 6.5.7 | 
| numba | 0.58.1 | 
| numpy | 1.26.4 | 
| oauth2client | 4.1.3 | 
| onnx | 1.17.0 | 
| openblas | 0.3.25 | 
| opencv | 4.11.0 | 
| orc | 2.1.1 | 
| pandas | 2.1.4 | 
| pandas-profiling | 3.0.0 | 
| papermill | 2.4.0 | 
| pyarrow | 16.1.0 | 
| pydot | 2.0.0 | 
| pyhive | 0.7.0 | 
| pynvml | 12.0.0 | 
| pysal | 23.7 | 
| pytables | 3.9.2 | 
| python | 3.11 | 
| regex | 2023.12.25 | 
| requests | 2.32.2 | 
| requests-kerberos | 0.12.0 | 
| rtree | 1.1.0 | 
| scikit-image | 0.22.0 | 
| scikit-learn | 1.5.2 | 
| scipy | 1.11.4 | 
| seaborn | 0.13.2 | 
| sentence-transformers | 5.0.0 | 
| setuptools | 79.0.1 | 
| shap | 0.48.0 | 
| shapely | 2.1.1 | 
| spacy | 3.8.7 | 
| spark-tensorflow-distributor | 1.0.0 | 
| spyder | 5.5.6 | 
| sqlalchemy | 2.0.41 | 
| sympy | 1.13.3 | 
| tensorflow | 2.18.0 | 
| tokenizers | 0.21.4.dev0 | 
| toree | 0.5.0 | 
| torch | 2.6.0 | 
| torch-model-archiver | 0.11.1 | 
| torcheval | 0.0.7 | 
| tornado | 6.4.2 | 
| torchvision | 0.21.0 | 
| traitlets | 5.14.3 | 
| transformers | 4.53.1 | 
| uritemplate | 4.1.1 | 
| virtualenv | 20.26.6 | 
| wordcloud | 1.9.4 | 
| xgboost | 2.1.4 | 
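
To confirm that a running environment matches the versions in this table, a check along the following lines might help. It is a minimal sketch: the function name is hypothetical, only three table entries are included, and some table names (for example, opencv or pytables) may not match the distribution names that importlib.metadata expects.

```python
from importlib import metadata

# A small sample of the versions listed in the table above.
EXPECTED = {
    "numpy": "1.26.4",
    "pandas": "2.1.4",
    "scikit-learn": "1.5.2",
}


def find_mismatches(expected, lookup=metadata.version):
    """Map each mismatched package to a (wanted, found) pair.

    found is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

The lookup parameter exists so that the check can be exercised against a stand-in function in environments where the packages aren't installed.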
R libraries
The following R library versions are included in the 2.3-ml-ubuntu
image.
| Package Name | Version | 
|---|---|
| r-ggplot2 | 3.4.4 | 
| r-irkernel | 1.3.2 | 
| r-rcurl | 1.98-1.16 | 
| r-recommended | 4.3 | 

