Dataproc integrates with Apache Hadoop and the Hadoop Distributed
File System (HDFS). The following features and considerations can be important
when selecting compute and data storage options for Dataproc
clusters and jobs:
HDFS with Cloud Storage:
Dataproc uses the
Hadoop Distributed File System (HDFS) for storage. Additionally,
Dataproc automatically installs the HDFS-compatible Cloud Storage connector,
which enables the use of Cloud Storage
in parallel with HDFS. You can move data into and out of a cluster by
uploading it to, or downloading it from, HDFS or Cloud Storage.
VM disks:
By default, when no local SSDs are provided, HDFS data and intermediate
shuffle data are stored on VM boot disks, which are Persistent Disks.
If you use local SSDs,
HDFS data and intermediate shuffle data are stored on the SSDs.
Persistent disk (PD) size and type affect performance and VM size, whether you use HDFS or Cloud Storage
for data storage.
VM boot disks are deleted when the cluster is deleted.
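
Because the Cloud Storage connector is HDFS-compatible, the same Hadoop file system commands can address either store by URI scheme. The following Python sketch illustrates that idea by building a `hadoop fs -cp` argument list for a copy between the two. It assumes `hdfs://` URIs for cluster HDFS and `gs://` URIs for Cloud Storage; the helper names `storage_scheme` and `build_copy_command`, the bucket `example-bucket`, and all paths are hypothetical and chosen only for illustration.

```python
from typing import List
from urllib.parse import urlparse

# Schemes assumed for this sketch: hdfs:// for cluster HDFS and gs:// for
# Cloud Storage accessed through the Cloud Storage connector.
SUPPORTED_SCHEMES = {"hdfs", "gs"}


def storage_scheme(uri: str) -> str:
    """Return the URI scheme, raising ValueError if it is not hdfs or gs."""
    scheme = urlparse(uri).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported storage URI: {uri!r}")
    return scheme


def build_copy_command(src: str, dst: str) -> List[str]:
    """Build a `hadoop fs -cp` argument list after validating both URIs."""
    storage_scheme(src)
    storage_scheme(dst)
    return ["hadoop", "fs", "-cp", src, dst]


if __name__ == "__main__":
    # Hypothetical paths; replace with a real bucket and HDFS directory.
    command = build_copy_command(
        "gs://example-bucket/input/data.csv",
        "hdfs:///user/example/input/",
    )
    print(" ".join(command))
```

The sketch only constructs the command; running it requires a cluster node where the `hadoop` client is available. Data that must outlive the cluster would typically be copied to a `gs://` destination, since VM boot disks are deleted with the cluster.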
Last updated 2025-09-04 UTC.