Storage Insights datasets

The Storage Insights datasets feature helps you understand, organize, and manage your data at scale. You can choose an organization, or one or more projects or folders, containing the buckets and objects whose metadata you want to index. A queryable metadata index for the included buckets and objects is made available as a BigQuery linked dataset.

If you want insights for your Cloud Storage resources exported to BigQuery, use Storage Insights datasets. These insights can help you with data exploration, cost optimization, security enforcement, and governance implementation. Storage Insights datasets are available only with a Storage Intelligence subscription.

Overview

A Storage Insights dataset is a rolling snapshot of metadata for all the buckets and objects in one or more specified source projects within an organization. The information provided by datasets lets you better understand and routinely audit your Cloud Storage data.

To create a dataset, you first create a dataset configuration in a project. You can choose an organization, or one or more projects or folders, containing the buckets and objects whose metadata you want to view. The dataset configuration generates datasets daily. Both dataset configurations and datasets are resources stored within Cloud Storage.

To view a dataset, you must first link the dataset to BigQuery.

Dataset configuration properties

When you create a dataset configuration, you set the following properties of the dataset. After you configure the dataset, it can take up to 48 hours for the first data to appear as a linked dataset in BigQuery. Any newly added objects or buckets are included in the next daily snapshot.

  • Name: a name that's used to reference the dataset. Names are used as the identifier of dataset configurations and cannot be changed after the configuration is created. The name can contain up to 128 characters, using letters, numbers, and underscores, and must begin with a letter.

  • Description (optional): a description of the dataset. You can edit the description at any time.

  • Dataset scope: a required field that specifies an organization, projects, or folders containing the buckets and objects for which you want metadata. You can specify projects or folders individually or as a CSV file, with each project or folder number on a separate line. You can specify up to 10,000 projects or folders in one dataset configuration. Datasets are configured for the specified dataset scope. Only one dataset scope can be specified for each dataset configuration. You can update the dataset scope when editing the dataset configuration.

  • Bucket filters (optional): filters used to include and exclude specific buckets from the dataset by bucket name or by regions.

  • Retention period: the number of days that the dataset captures and retains data for, including the creation date of the dataset. Datasets update with metadata every 24 hours and can retain data for up to 90 days. Data captured outside of the retention window is automatically deleted. For example, suppose you have a dataset that was created on October 1, 2023 with a retention window set to 30. On October 30, the dataset will reflect the past 30 days of data, from October 1 to October 30. On October 31, the dataset will reflect the data from October 2 to October 31. You can modify the retention window at any time.

  • Location: a location to store the dataset and its data. For example, us-central1. The location must be supported by BigQuery. We recommend that you select the location of your BigQuery tables, if you have any.

  • Service agent type: either a configuration-scoped service agent or a project-scoped service agent.

    Creating a dataset configuration provisions a service agent for you. To generate datasets, the service agent must be granted the required permissions to read data from Cloud Storage buckets.

    A project-scoped service agent can access and write datasets that are generated from all the dataset configurations in the project. For example, if you have multiple dataset configurations within a project, then you only need to grant the required permissions to the project-scoped service agent once for it to be able to read and write datasets for all the dataset configurations within the project. For more information about the permissions required to read and write datasets, see Permissions. When a dataset configuration is deleted, the project-scoped service agent is not deleted.

    A configuration-scoped service agent can only access and write the dataset generated by the particular dataset configuration. This means if you have multiple dataset configurations, you'll need to grant required permissions to each configuration-scoped service agent. When a dataset configuration is deleted, the configuration-scoped service agent is deleted.
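The constraints described in the properties above (name format, scope size, and retention window) can be checked client-side before calling the API. The following is a minimal sketch; the helper and its error strings are hypothetical and are not part of any Google Cloud client library.

```python
import re

# Hypothetical pre-flight checks mirroring the documented constraints.
# These helpers are illustrative only.

NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,127}$")  # starts with a letter, <=128 chars
MAX_SCOPE_ENTRIES = 10_000   # projects or folders per dataset configuration
MAX_RETENTION_DAYS = 90      # datasets can retain data for up to 90 days

def validate_dataset_config(name: str, scope_entries: list[str],
                            retention_days: int) -> list[str]:
    """Return a list of constraint violations (empty if the config looks valid)."""
    errors = []
    if not NAME_RE.match(name):
        errors.append("name must start with a letter and use up to 128 "
                      "letters, numbers, or underscores")
    if not 0 < len(scope_entries) <= MAX_SCOPE_ENTRIES:
        errors.append("scope must list 1 to 10,000 projects or folders")
    if not 0 < retention_days <= MAX_RETENTION_DAYS:
        errors.append("retention period must be between 1 and 90 days")
    return errors

print(validate_dataset_config("insights_prod", ["projects/123"], 30))  # []
print(validate_dataset_config("9bad-name", [], 120))  # three violations
```

Running these checks locally avoids a round trip to the API for configurations that would be rejected anyway.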

Link the dataset to BigQuery after creating a dataset configuration. Linking a dataset to BigQuery creates a linked dataset in BigQuery for querying. You can link or unlink the dataset at any point.

For more information about the properties you set when creating or updating a dataset configuration, see the DatasetConfigs resource in the JSON API documentation.

Supported locations

The following BigQuery locations are supported for creating linked datasets:

  • EU
  • US
  • asia-southeast1
  • europe-west1
  • us-central1
  • us-east1
  • us-east4

Dataset schema of metadata

The following metadata fields are included in datasets. For more information about BigQuery column modes, see Modes . The column modes determine how BigQuery stores and queries the data.

The snapshotTime field stores the time of the bucket metadata snapshot refresh in RFC 3339 format .
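Because snapshotTime is an RFC 3339 timestamp, it can be parsed with the Python standard library. The example timestamp below is illustrative, not taken from a real dataset.

```python
from datetime import datetime, timezone

def parse_snapshot_time(raw: str) -> datetime:
    """Parse an RFC 3339 timestamp such as the values in the snapshotTime column.

    datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    so normalize it to "+00:00" for portability.
    """
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))

ts = parse_snapshot_time("2024-05-01T06:00:00Z")
print(ts)  # 2024-05-01 06:00:00+00:00
```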

Bucket metadata

Unless otherwise noted, you can find more detailed descriptions of the following bucket metadata fields by referring to the Buckets resource representation for the JSON API.

Metadata field | Mode | Type
snapshotTime | NULLABLE | TIMESTAMP
name | NULLABLE | STRING
location | NULLABLE | STRING
project | NULLABLE | INTEGER
storageClass | NULLABLE | STRING
versioning | NULLABLE | BOOLEAN
lifecycle | NULLABLE | BOOLEAN
metageneration | NULLABLE | INTEGER
timeCreated | NULLABLE | TIMESTAMP
public 1 | NULLABLE | RECORD
public.bucketPolicyOnly | NULLABLE | BOOLEAN
public.publicAccessPrevention | NULLABLE | STRING
iamConfiguration | NULLABLE | RECORD
iamConfiguration.uniformBucketLevelAccess | NULLABLE | RECORD
iamConfiguration.uniformBucketLevelAccess.enabled | NULLABLE | BOOLEAN
iamConfiguration.publicAccessPrevention | NULLABLE | STRING
autoclass | NULLABLE | RECORD
autoclass.enabled | NULLABLE | BOOLEAN
autoclass.toggleTime | NULLABLE | TIMESTAMP
softDeletePolicy | NULLABLE | RECORD
softDeletePolicy.effectiveTime | NULLABLE | DATETIME
softDeletePolicy.retentionDurationSeconds | NULLABLE | INTEGER
tags 2 | NULLABLE | RECORD
tags.lastUpdatedTime | NULLABLE | TIMESTAMP
tags.tagMap | REPEATED | RECORD
tags.tagMap.key | NULLABLE | STRING
tags.tagMap.value | NULLABLE | STRING
resourceTags 3 | REPEATED | RECORD
resourceTags.key | NULLABLE | STRING
resourceTags.value | NULLABLE | STRING
labels | REPEATED | RECORD
labels.key | NULLABLE | STRING
labels.value | NULLABLE | STRING

1 This field is deprecated; use iamConfiguration instead.

2 This field is deprecated; use resourceTags instead.

3 The bucket's tags . For more information, see Cloud Resource Manager API .

Object metadata

Unless otherwise noted, you can find more detailed descriptions of the following object metadata fields by referring to the Objects resource representation for the JSON API.

Metadata field | Mode | Type
snapshotTime | NULLABLE | TIMESTAMP
bucket | NULLABLE | STRING
location | NULLABLE | STRING
componentCount | NULLABLE | INTEGER
contentDisposition | NULLABLE | STRING
contentEncoding | NULLABLE | STRING
contentLanguage | NULLABLE | STRING
contentType | NULLABLE | STRING
crc32c | NULLABLE | INTEGER
customTime | NULLABLE | TIMESTAMP
etag | NULLABLE | STRING
eventBasedHold | NULLABLE | BOOLEAN
generation | NULLABLE | INTEGER
md5Hash | NULLABLE | STRING
metageneration | NULLABLE | INTEGER
name | NULLABLE | STRING
size | NULLABLE | INTEGER
storageClass | NULLABLE | STRING
temporaryHold | NULLABLE | BOOLEAN
timeCreated | NULLABLE | TIMESTAMP
timeDeleted | NULLABLE | TIMESTAMP
updated | NULLABLE | TIMESTAMP
timeStorageClassUpdated | NULLABLE | TIMESTAMP
retentionExpirationTime | NULLABLE | TIMESTAMP
softDeleteTime | NULLABLE | DATETIME
hardDeleteTime | NULLABLE | DATETIME
metadata.key | NULLABLE | STRING
metadata.value | NULLABLE | STRING

Latest bucket and object metadata snapshot

The linked dataset exposes the latest snapshot of the bucket and object metadata through the following dedicated views:

  • The bucket_attributes_latest_snapshot_view provides the latest metadata for your Cloud Storage buckets. Its structure matches the Bucket metadata schema .

  • The object_attributes_latest_snapshot_view provides the latest metadata for your Cloud Storage objects. Its structure matches the Object metadata schema .
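As a sketch of how these views are typically consumed, the following builds a query for the largest objects in the latest snapshot. The project ID ("my-project") and linked dataset name ("storage_insights") are placeholder assumptions; substitute your own identifiers.

```python
# Build a query against the latest object metadata snapshot view.
# "my-project" and "storage_insights" are hypothetical placeholders.

def largest_objects_query(project: str, dataset: str, limit: int = 10) -> str:
    """Return SQL listing the largest objects in the latest snapshot."""
    return f"""
        SELECT bucket, name, size, storageClass
        FROM `{project}.{dataset}.object_attributes_latest_snapshot_view`
        ORDER BY size DESC
        LIMIT {limit}
    """

sql = largest_objects_query("my-project", "storage_insights")
print(sql)

# To run the query, uncomment the following (requires google-cloud-bigquery):
# from google.cloud import bigquery
# client = bigquery.Client(project="my-project")
# for row in client.query(sql).result():
#     print(row.bucket, row.name, row.size)
```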

Project metadata

The project metadata is exposed as a view named project_attributes_view in the linked dataset:

Metadata field | Mode | Type
snapshotTime | NULLABLE | TIMESTAMP
name | NULLABLE | STRING
id | NULLABLE | STRING
number | NULLABLE | INTEGER

Dataset schema for events and errors

In the linked dataset, you can also view the snapshot processing events and errors in the events_view and error_attributes_view views. To learn how to troubleshoot the snapshot processing errors, see Troubleshoot dataset errors .

Events log

You can view event logs in the events_view view in the linked dataset:

Column name | Mode | Type | Description
manifest.snapshotTime | NULLABLE | TIMESTAMP | The time, in RFC 3339 format, at which the snapshot of the events was refreshed.
manifest.viewName | NULLABLE | STRING | The name of the view that was refreshed.
manifest.location | NULLABLE | STRING | The source location of the data that was refreshed.
globalManifest.snapshotTime | NULLABLE | TIMESTAMP | The time, in RFC 3339 format, at which the snapshot of the events was refreshed.
eventTime | NULLABLE | STRING | The time at which the event occurred.
eventCode | NULLABLE | STRING | The event code associated with the corresponding entry. Event code 1 indicates that the manifest.viewName view was refreshed with all entries for the source location manifest.location within the snapshot manifest.snapshotTime. Event code 2 indicates that the dataset was refreshed with bucket and object entries for all source locations within the snapshot globalManifest.snapshotTime.

Error codes

You can view error codes in the error_attributes_view view in the linked dataset:

Column name | Mode | Type | Description
errorCode | NULLABLE | INTEGER | The error code associated with this entry. For a list of valid values and how to resolve them, see Troubleshoot dataset errors.
errorSource | NULLABLE | STRING | The source of the error. Valid value: CONFIGURATION_PREPROCESSING.
errorTime | NULLABLE | TIMESTAMP | The time at which the error occurred.
sourceGcsLocation | NULLABLE | STRING | The source Cloud Storage location of the error. For projects, this field is null because projects are locationless.
bucketErrorRecord.bucketName | NULLABLE | STRING | The name of the bucket involved in the error. You can use this information to debug a bucket error.
bucketErrorRecord.serviceAccount | NULLABLE | STRING | The service account that needs permission to ingest objects from the bucket. You can use this information to debug a bucket error.
projectErrorRecord.projectNumber | NULLABLE | INTEGER | The number of the project involved in the error. You can use this information to debug a project error.
projectErrorRecord.organizationName | NULLABLE | STRING | The name of the organization that the project must belong to in order to be processed. A value of 0 indicates that the dataset is not in the organization. You can use this information to debug a project error.

Troubleshoot dataset errors

To learn how to troubleshoot the snapshot processing errors that are logged into the error_attributes_view view in the linked dataset, see the following table:

Error code | Error case | Error message | Troubleshooting
1 | Source project doesn't belong to the organization | Source project projectErrorRecord.projectNumber doesn't belong to the organization projectErrorRecord.organizationName. | Add source project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions about how to migrate a project between organizations, see Migrate projects between organizations.
2 | Bucket authorization error | Permission denied for ingesting objects for bucket bucketErrorRecord.bucketName. | Give service account bucketErrorRecord.serviceAccount Identity and Access Management (IAM) permissions to allow ingestion of objects for bucket bucketErrorRecord.bucketName. For more information, see Grant required permissions to the service agent.
3 | Destination project doesn't belong to the organization | Destination project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName. | Add destination project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions about how to migrate a project between organizations, see Migrate projects between organizations.
4 | Source project doesn't have Storage Intelligence configured | Source project projectErrorRecord.projectNumber doesn't have Storage Intelligence configured. | Configure Storage Intelligence for the source project projectErrorRecord.projectNumber. For more information, see Configure and manage Storage Intelligence.
5 | Bucket doesn't have Storage Intelligence configured | Bucket bucketErrorRecord.bucketName doesn't have Storage Intelligence configured. | Configure Storage Intelligence for the bucket bucketErrorRecord.bucketName. For more information, see Configure and manage Storage Intelligence.

Considerations

Consider the following for dataset configurations:

  • When you rename a folder in a bucket with hierarchical namespace enabled, the names of the objects in that folder are updated. When these renamed objects are ingested, their metadata snapshots are treated as new entries in the linked dataset.

  • Datasets are supported only in these BigQuery locations.
