# Generate video embeddings by using the ML.GENERATE_EMBEDDING function

This document shows you how to create a BigQuery ML remote model that references a Vertex AI embedding foundation model. You then use that model with the ML.GENERATE_EMBEDDING function to create video embeddings by using data from a BigQuery object table.
## Required roles

- To create a connection, you need membership in the following Identity and Access Management (IAM) role:
  - `roles/bigquery.connectionAdmin`
- To grant permissions to the connection's service account, you need the following permission:
  - `resourcemanager.projects.setIamPolicy`
- To create the model by using BigQuery ML, you need the following IAM permissions:
  - `bigquery.jobs.create`
  - `bigquery.models.create`
  - `bigquery.models.getData`
  - `bigquery.models.updateData`
  - `bigquery.models.updateMetadata`
- To run inference, you need the following permissions:
  - `bigquery.tables.getData` on the table
  - `bigquery.models.getData` on the model
  - `bigquery.jobs.create`
## Before you begin

- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the BigQuery, BigQuery Connection, and Vertex AI APIs.
## Create a dataset

Create a BigQuery dataset to store your ML model:

- In the Google Cloud console, go to the BigQuery page.
- In the Explorer pane, click your project name.
- Click View actions > Create dataset.
- On the Create dataset page, do the following:
  - For Dataset ID, enter `bqml_tutorial`.
  - For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the `US` multi-region. For simplicity, store your dataset in the same location.
  - Leave the remaining default settings as they are, and click Create dataset.
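If you prefer SQL, you can create the same dataset in the query editor with DDL along these lines (the dataset name and location match the console steps above):

```sql
-- Create the tutorial dataset in the US multi-region.
CREATE SCHEMA IF NOT EXISTS `bqml_tutorial`
OPTIONS (location = 'US');
```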
## Create a connection

Create a Cloud resource connection and get the connection's service account. Create the connection in the same location as the dataset that you created in the previous step.

Select one of the following options:

### Console

- Go to the BigQuery page.
- To create a connection, click Add, and then click Connections to external data sources.
- In the Connection type list, select Vertex AI remote models, remote functions and BigLake (Cloud Resource).
- In the Connection ID field, enter a name for your connection.
- Click Create connection.
- Click Go to connection.
- In the Connection info pane, copy the service account ID for use in a later step.
### bq

- In a command-line environment, create a connection:

  ```sh
  bq mk --connection --location=REGION --project_id=PROJECT_ID \
      --connection_type=CLOUD_RESOURCE CONNECTION_ID
  ```

  The `--project_id` parameter overrides the default project.

  Replace the following:

  - `REGION`: your connection region
  - `PROJECT_ID`: your Google Cloud project ID
  - `CONNECTION_ID`: an ID for your connection

  When you create a connection resource, BigQuery creates a unique system service account and associates it with the connection.

  Troubleshooting: If you get the following connection error, update the Google Cloud SDK:

  ```
  Flags parsing error: flag --connection_type=CLOUD_RESOURCE: value should be one of...
  ```

- Retrieve and copy the service account ID for use in a later step:

  ```sh
  bq show --connection PROJECT_ID.REGION.CONNECTION_ID
  ```

  The output is similar to the following:

  ```
  name                       properties
  1234.REGION.CONNECTION_ID  {"serviceAccountId": "connection-1234-9u56h9@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}
  ```
### Terraform

Append the following section to your main.tf file:

```tf
## This creates a cloud resource connection.
## Note: The cloud resource nested object has only one output-only field: serviceAccountId.
resource "google_bigquery_connection" "connection" {
  connection_id = "CONNECTION_ID"
  project       = "PROJECT_ID"
  location      = "REGION"
  cloud_resource {}
}
```

Replace the following:

- `CONNECTION_ID`: an ID for your connection
- `PROJECT_ID`: your Google Cloud project ID
- `REGION`: your connection region
## Give the service account access

Grant the connection's service account the Vertex AI User role.

- If you plan to specify the endpoint as a URL when you create the remote model, for example `endpoint = 'https://us-central1-aiplatform.googleapis.com/v1/projects/myproject/locations/us-central1/publishers/google/models/text-embedding-004'`, grant this role in the same project that you specify in the URL.
- If you plan to specify the endpoint by using the model name when you create the remote model, for example `endpoint = 'text-embedding-004'`, grant this role in the same project where you plan to create the remote model.

Granting the role in a different project results in the error `bqcx-1234567890-xxxx@gcp-sa-bigquery-condel.iam.gserviceaccount.com does not have the permission to access resource`.
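To make the two cases concrete, here are sketches of the two endpoint forms in a remote model's OPTIONS clause. The project, region, dataset, connection, and model names are illustrative placeholders, not values from this tutorial:

```sql
-- Endpoint as a full URL: grant the Vertex AI User role in `myproject`,
-- the project named in the URL.
CREATE OR REPLACE MODEL `mydataset.url_endpoint_model`
  REMOTE WITH CONNECTION `myproject.us-central1.myconnection`
  OPTIONS (ENDPOINT = 'https://us-central1-aiplatform.googleapis.com/v1/projects/myproject/locations/us-central1/publishers/google/models/text-embedding-004');

-- Endpoint as a model name: grant the role in the project where
-- this remote model itself is created.
CREATE OR REPLACE MODEL `mydataset.name_endpoint_model`
  REMOTE WITH CONNECTION `myproject.us-central1.myconnection`
  OPTIONS (ENDPOINT = 'text-embedding-004');
```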
To grant the role, follow these steps:
### Console

- Go to the IAM & Admin page.
- Click Grant access. The Add principals dialog opens.
- In the New principals field, enter the service account ID that you copied earlier.
- In the Select a role field, select Vertex AI, and then select Vertex AI User.
- Click Save.
### gcloud

Use the `gcloud projects add-iam-policy-binding` command:

```sh
gcloud projects add-iam-policy-binding 'PROJECT_NUMBER' \
    --member='serviceAccount:MEMBER' \
    --role='roles/aiplatform.user' \
    --condition=None
```

Replace the following:

- `PROJECT_NUMBER`: your project number
- `MEMBER`: the service account ID that you copied earlier
## Create an object table

Create an object table that stores video content. The object table makes it possible to analyze the videos without moving them from Cloud Storage.

The Cloud Storage bucket used by the object table should be in the same project where you plan to create the model and call the ML.GENERATE_EMBEDDING function. If you want to call the ML.GENERATE_EMBEDDING function in a different project than the one that contains the Cloud Storage bucket used by the object table, you must grant the Storage Admin role at the bucket level to the service-A@gcp-sa-aiplatform.iam.gserviceaccount.com service account.
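As a sketch, an object table over videos in a Cloud Storage bucket can be created with DDL like the following. The table name, bucket path, and connection here are illustrative; reuse the connection that you created earlier:

```sql
-- Object table over video files in Cloud Storage.
-- `object_metadata = 'SIMPLE'` marks this as an object table rather than
-- an external table over structured data.
CREATE EXTERNAL TABLE `mydataset.videos`
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://mybucket/videos/*.mp4']
);
```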
## Create a model

- In the Google Cloud console, go to the BigQuery page.
- Using the SQL editor, create a remote model:

  ```sql
  CREATE OR REPLACE MODEL `PROJECT_ID.DATASET_ID.MODEL_NAME`
    REMOTE WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
    OPTIONS (ENDPOINT = 'ENDPOINT');
  ```

  Replace the following:

  - `PROJECT_ID`: your project ID
  - `DATASET_ID`: the ID of the dataset to contain the model
  - `MODEL_NAME`: the name of the model
  - `REGION`: the region used by the connection
  - `CONNECTION_ID`: the ID of your BigQuery connection. When you view the connection details in the Google Cloud console, this is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example `projects/myproject/locations/connection_location/connections/myconnection`.
  - `ENDPOINT`: the embedding LLM to use, in this case `multimodalembedding@001`.
## Generate video embeddings

Generate video embeddings with the ML.GENERATE_EMBEDDING function by using video data from an object table:

```sql
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `PROJECT_ID.DATASET_ID.MODEL_NAME`,
  TABLE PROJECT_ID.DATASET_ID.TABLE_NAME,
  STRUCT(
    FLATTEN_JSON AS flatten_json_output,
    START_SECOND AS start_second,
    END_SECOND AS end_second,
    INTERVAL_SECONDS AS interval_seconds
  )
);
```

Replace the following:

- `PROJECT_ID`: your project ID.
- `DATASET_ID`: the ID of the dataset that contains the model.
- `MODEL_NAME`: the name of the remote model over a `multimodalembedding@001` model.
- `TABLE_NAME`: the name of the object table that contains the videos to embed.
- `FLATTEN_JSON`: a `BOOL` value that indicates whether to parse the embedding into a separate column. The default value is `TRUE`.
- `START_SECOND`: a `FLOAT64` value that specifies the second in the video at which to start the embedding. The default value is `0`. This value must be positive and less than the `end_second` value.
- `END_SECOND`: a `FLOAT64` value that specifies the second in the video at which to end the embedding. The default value is `120`. This value must be positive and greater than the `start_second` value.
- `INTERVAL_SECONDS`: a `FLOAT64` value that specifies the interval to use when creating embeddings. For example, if you set `start_second = 0`, `end_second = 120`, and `interval_seconds = 10`, then the video is split into twelve 10-second segments (`[0, 10), [10, 20), [20, 30)...`) and embeddings are generated for each segment. This value must be greater than `4` and less than `120`. The default value is `16`.
## Example

The following example shows how to create embeddings for the videos in the `videos` object table. Embeddings are created for each 5-second interval between the 10-second and 40-second marks in each video.

```sql
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `mydataset.embedding_model`,
  TABLE `mydataset.videos`,
  STRUCT(
    TRUE AS flatten_json_output,
    10 AS start_second,
    40 AS end_second,
    5 AS interval_seconds
  )
);
```
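When `flatten_json_output` is `TRUE`, the function returns the embedding and a per-row status as separate columns alongside the object table's columns. As a sketch (the output column names here follow the ML.GENERATE_EMBEDDING output schema as the author understands it; verify them against the function reference for your BigQuery version), you can inspect results and keep only the rows that succeeded:

```sql
SELECT
  uri,                             -- Cloud Storage URI of the source video
  ml_generate_embedding_result,    -- the embedding for the segment
  ml_generate_embedding_status     -- empty on success, error message otherwise
FROM ML.GENERATE_EMBEDDING(
  MODEL `mydataset.embedding_model`,
  TABLE `mydataset.videos`,
  STRUCT(TRUE AS flatten_json_output)
)
WHERE ml_generate_embedding_status = '';
```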