About data insights for unstructured data

Data insights for unstructured data in Knowledge Catalog transforms dark data, meaning unstructured files such as PDFs, into structured, queryable assets. Standard discovery tools are limited to file-level metadata such as size and type. Data insights for unstructured data instead uses Vertex AI to analyze file contents and automatically extracts the business context required to ground AI agents and power advanced analytics.

This automation eliminates the need for manual document parsing and custom ETL code, letting you discover, classify, and use data that was previously inaccessible.

Automated discovery of unstructured data

A discovery scan automatically locates your unstructured files in Cloud Storage and catalogs them into one or more BigLake object tables in BigQuery for analysis. It serves as the entry point for data insights for unstructured data. The system automatically registers the resulting BigLake object tables as entries in Knowledge Catalog. When a discovery scan creates multiple tables, each entry has its own insights tab. You can then open an entry to explore its generated data insights.

When you run a discovery scan with data insights for unstructured data enabled, the system performs these actions:

  1. Identifies and groups files. Automatically identifies and organizes unstructured files in Cloud Storage into BigLake object tables. These object tables are read-only tables that provide a structured interface to your unstructured data.

  2. Generates data insights for unstructured data. Uses Vertex AI to analyze the actual content within the files to understand their meaning and structure. This includes entity inference, which uses generative AI to extract specific attributes, for example, Company, Product, or Serial Number, from the file content. It also includes relationship extraction, which identifies how these entities connect, for example, Component is_part_of Product, to create a semantic graph.

  3. Generates schemas and graph profiles. Provides an AI-suggested relational schema and a graph profile aspect, which is a Knowledge Catalog metadata aspect that contains the inferred schemas for the entities and relationships.

  4. Enriches metadata. Automatically populates Knowledge Catalog with AI-generated metadata. This makes the data searchable and ready for extraction.

Instead of manually designing database schemas, you can perform data extraction using one-click SQL or pipeline orchestration. This process materializes inferred entities and relationships into structured formats, such as tables or views.
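To make the idea of materializing inferred entities and relationships more concrete, the following is a minimal Python sketch. It does not call Knowledge Catalog, Vertex AI, or BigQuery. The `Entity` and `Relationship` classes, the `materialize` function, the field names, and the sample values are all hypothetical and only illustrate how an inferred graph might flatten into one table of nodes and one table of edges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One inferred entity, for example a Product or a Component (hypothetical shape)."""
    entity_type: str   # label such as "Product"
    key: str           # identifier extracted from the file content
    source_uri: str    # Cloud Storage object the entity was found in


@dataclass(frozen=True)
class Relationship:
    """A directed edge between two entities, such as Component is_part_of Product."""
    source_key: str
    relation: str
    target_key: str
    source_uri: str


def materialize(entities, relationships):
    """Flatten entities and relationships into row dictionaries.

    Returns a list of node rows and a list of edge rows. Edges that
    reference a key with no matching entity are dropped.
    """
    known = {e.key for e in entities}
    node_rows = [
        {"type": e.entity_type, "key": e.key, "source_uri": e.source_uri}
        for e in entities
    ]
    edge_rows = [
        {"src": r.source_key, "relation": r.relation,
         "dst": r.target_key, "source_uri": r.source_uri}
        for r in relationships
        if r.source_key in known and r.target_key in known
    ]
    return node_rows, edge_rows


# Made-up sample values, for illustration only.
entities = [
    Entity("Product", "P-100", "gs://example-bucket/manual.pdf"),
    Entity("Component", "C-7", "gs://example-bucket/manual.pdf"),
]
relationships = [
    Relationship("C-7", "is_part_of", "P-100", "gs://example-bucket/manual.pdf"),
    # C-9 has no matching entity, so this edge is dropped.
    Relationship("C-9", "is_part_of", "P-100", "gs://example-bucket/manual.pdf"),
]

node_rows, edge_rows = materialize(entities, relationships)
print(len(node_rows), "node rows,", len(edge_rows), "edge rows")
```

Keeping `source_uri` on every row is a design choice in this sketch: it preserves a link from each structured row back to the file it came from.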

Use cases

You can use data insights for unstructured data for various purposes, including the following:

  • Automated ETL pipeline generation. Automate data extraction from Cloud Storage to BigQuery by replacing custom parsers with automated schema suggestion and one-click deployment that materializes data into BigQuery tables, views, or semantic graphs.

    For example, a financial services company can automatically extract invoice details, vendor names, and contract terms from thousands of PDF invoices, materializing them directly into BigQuery for immediate spend analytics without writing custom parsing code.

  • Content classification and validation. Automatically group dark data into searchable assets enriched with AI-generated metadata, which lets data stewards perform human-in-the-loop validation and monitoring of extracted entities at scale.

    For example, a legal or compliance department can automatically classify large repositories of historical contracts and extract key entities. This lets data stewards validate the metadata before using it for critical regulatory reporting.

  • AI agent grounding. Ground Retrieval-Augmented Generation (RAG) agents with verified graphs. These graphs provide a clear traceability chain that connects raw files to structured business logic, which reduces hallucination and lets AI agents navigate multi-table joins unambiguously.

    For example, a manufacturing company can extract equipment relationships from maintenance logs. When a technician asks a conversational AI agent, "Which regions are affected by the silicone recall?", the agent uses the verified relationship graph to provide an accurate answer with a clear traceability chain back to the original maintenance logs.
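The traceability chain in the grounding use case can be pictured as a graph traversal in which every edge remembers the file it was extracted from. The following Python sketch is illustrative only: the `EDGES` data, the relation names, the `trace_affected` function, and the `gs://example-bucket/...` paths are invented for this example and are not produced by any Google Cloud API.

```python
from collections import deque

# Hypothetical verified edges: (source, relation, target, source file).
# Edges point in the direction in which a recall would propagate.
EDGES = [
    ("Silicone", "used_in", "Seal-A", "gs://example-bucket/log-001.pdf"),
    ("Seal-A", "is_part_of", "Pump-X", "gs://example-bucket/log-002.pdf"),
    ("Pump-X", "deployed_in", "Region-North", "gs://example-bucket/log-003.pdf"),
    ("Pump-Y", "deployed_in", "Region-South", "gs://example-bucket/log-004.pdf"),
]


def trace_affected(start, edges, is_target):
    """Breadth-first search from `start` along directed edges.

    Returns a dict that maps each reachable target node to the list of
    edges (including source files) that leads to it from `start`.
    """
    adjacency = {}
    for src, relation, dst, uri in edges:
        adjacency.setdefault(src, []).append((relation, dst, uri))

    chains = {start: []}          # node -> chain of edges from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for relation, dst, uri in adjacency.get(node, []):
            if dst not in chains:  # first visit gives a shortest chain
                chains[dst] = chains[node] + [(node, relation, dst, uri)]
                queue.append(dst)
    return {node: chain for node, chain in chains.items() if is_target(node)}


affected = trace_affected("Silicone", EDGES, lambda n: n.startswith("Region-"))
for region, chain in affected.items():
    print(region)
    for src, relation, dst, uri in chain:
        print(f"  {src} {relation} {dst}  (from {uri})")
```

In this sketch, an agent that answers from `affected` can cite the file behind each hop, and a region with no chain from the recalled material is not reported.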

Limitations

Review the following limitations before using data insights for unstructured data:

  • Supported formats. Discovery scans automatically identify and group various unstructured file types into BigQuery object tables, but data insights for unstructured data is optimized only for PDF files.

  • Locations. Data insights for unstructured data is available only in locations that support Vertex AI Gemini 2.5 Pro models. For a list of supported regions, see the Supported regions section in Gemini 2.5 Pro.

Pricing

During the Preview phase, data insights for unstructured data is available for experimentation and testing at no additional charge for semantic inference capabilities. However, you remain responsible for the costs of underlying resources and services consumed during the process.

Preview period

  • Semantic inference. There is no charge for using Vertex AI to extract semantic information and infer graph profiles during discovery scans throughout the preview period.

  • Underlying resource costs. Standard charges apply for the resources required to store and process your data:

    • Knowledge Catalog.

      • Discovery scans are billed based on Knowledge Catalog Premium processing SKUs (DCU hours) for the scanning and grouping of unstructured data. For more information, see Knowledge Catalog pricing.

      • AI-generated metadata, including graph profiles, incurs standard Knowledge Catalog storage charges.

    • BigQuery.

      • If you use the pipeline extraction method, standard charges for Dataform execution and BigQuery jobs apply.

      • If you use the SQL method, standard BigQuery ML charges and BigQuery job charges apply.

      • Any data materialized into BigQuery, including object tables, inferred metadata, and extracted entities, incurs standard BigQuery storage and query charges. For more information, see BigQuery pricing.

General Availability (GA)

Official billing for data insights for unstructured data commences upon General Availability (GA).

Quotas

Standard DataScan resource and API quotas apply to each individual discovery job. A separate quota governs semantic inference volume: semantic inference executions on BigQuery object tables are limited to one per project per day.

Because data insights for unstructured data relies on a discovery scan, the limits on the number of tables that a discovery scan supports also apply. For more information, see BigQuery quotas and limits.

What's next
