Configure and use entity resolution in BigQuery

This document shows entity resolution end users (hereafter referred to as end users) and identity providers how to implement entity resolution.

End users can use this document to connect with an identity provider and use the provider's service to match records. Identity providers can use this document to set up and configure services to share with end users on the Google Cloud Marketplace.

Workflow for end users

The following sections show end users how to configure entity resolution in BigQuery. For a visual representation of the complete setup, see the architecture for entity resolution.

Before you begin

  1. Contact and establish a relationship with an identity provider. BigQuery supports entity resolution with LiveRamp and TransUnion.
  2. Acquire the following items from the identity provider:
    • Service account credentials
    • Remote function signature
  3. Create two datasets in your project:
    • Input dataset
    • Output dataset
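For example, you can create the two datasets with GoogleSQL DDL; the project and dataset names below are placeholders:

```sql
-- Dataset names and location are examples; choose your own.
CREATE SCHEMA `my-project.entity_resolution_input`
  OPTIONS (location = 'US');

CREATE SCHEMA `my-project.entity_resolution_output`
  OPTIONS (location = 'US');
```

You can also create datasets in the Google Cloud console or with the bq command-line tool.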

Required roles

To get the permissions that you need to run entity resolution jobs, ask your administrator to grant you the following IAM roles:

  • For the identity provider's service account to read the input dataset and write to the output dataset:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Translate or resolve entities

For specific identity provider instructions, refer to the following sections.

LiveRamp

Prerequisites

Setup

The following steps are required when you use LiveRamp Embedded Identity for the first time. After setup is complete, only the input table and metadata table need to be modified between runs.

Create an input table

Create a table in the input dataset. Populate the table with RampIDs, target domains, and target types. For details and examples, see Input Table Columns and Descriptions.
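As a sketch, the input table could look like the following. The column names here are illustrative, not LiveRamp's; take the real names from Input Table Columns and Descriptions.

```sql
-- Hypothetical input table; column names are placeholders.
CREATE TABLE `my-project.entity_resolution_input.liveramp_input` (
  rampid        STRING,  -- RampID encoded in your domain
  target_domain STRING,  -- domain to transcode the RampID into
  target_type   STRING   -- identifier type to transcode into
);
```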

Create a metadata table

The metadata table controls the execution of LiveRamp Embedded Identity on BigQuery. Create a metadata table in the input dataset and populate it with client IDs, execution modes, target domains, and target types. For details and examples, see Metadata Table Columns and Descriptions.
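A minimal sketch of such a table, with illustrative column names (use the names from Metadata Table Columns and Descriptions):

```sql
-- Hypothetical metadata table; column names are placeholders.
CREATE TABLE `my-project.entity_resolution_input.liveramp_metadata` (
  client_id      STRING,
  execution_mode STRING,
  target_domain  STRING,
  target_type    STRING
);
```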

Grant the LiveRamp Google Cloud service account access to view and process data in your input dataset. For details and examples, see Share Tables and Datasets with LiveRamp.
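Assuming a placeholder service account address, the grants can be expressed with GoogleSQL DCL statements:

```sql
-- The service account address is a placeholder; use the address
-- that LiveRamp provides.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.entity_resolution_input`
TO "serviceAccount:liveramp@example-project.iam.gserviceaccount.com";

GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.entity_resolution_output`
TO "serviceAccount:liveramp@example-project.iam.gserviceaccount.com";
```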

Run an embedded identity job

To run an embedded identity job with LiveRamp in BigQuery, do the following:

  1. Confirm that all RampIDs that were encoded in your domain are in your input table.
  2. Confirm that your metadata table is still accurate before you run the job.
  3. Contact LiveRampIdentitySupport@liveramp.com with a job process request. Include the project ID, dataset ID, and table ID (if applicable) for your input table, metadata table, and output dataset. For more information, see Notify LiveRamp to Initiate Transcoding.

Results are generally delivered to your output dataset within three business days.

LiveRamp support

For support issues, contact LiveRamp Identity Support.

LiveRamp billing

LiveRamp handles billing for entity resolution.

TransUnion

Prerequisites

  • Contact TransUnion Cloud Support to execute an agreement to access the service. Provide your Google Cloud project ID, input data types, use case, and data volume.
  • TransUnion Cloud Support enables the service for your Google Cloud project and shares a detailed implementation guide that includes available output data.

Setup

The following steps are required when you use TransUnion's TruAudience Identity Resolution and Enrichment service in your BigQuery environment.

Create an external connection

Create a connection to an external data source of the Vertex AI remote models, remote functions and BigLake (Cloud Resource) type. You use this connection to trigger the identity resolution service hosted in the TransUnion Google Cloud account from your Google Cloud account.

Copy the connection ID and service account ID and share these identifiers with the TransUnion customer delivery team.

Create a remote function

Create a remote function that interacts with the service orchestrator endpoint that is hosted on the TransUnion Google Cloud project to pass the necessary metadata (including schema mappings) to the TransUnion service. Use the connection ID from the external connection that you created and the TransUnion-hosted cloud function endpoint shared by the TransUnion customer delivery team.
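A sketch of the DDL, using the function name referenced later in this document; the connection ID and endpoint URL are placeholders for the values that the TransUnion customer delivery team shares with you:

```sql
-- Connection ID and endpoint are placeholders.
CREATE OR REPLACE FUNCTION
  `<project_id>.<dataset_id>.remote_call_TransUnion_er`(payload STRING)
RETURNS STRING
REMOTE WITH CONNECTION `<project_id>.us.transunion-connection`
OPTIONS (
  endpoint = 'https://<region>-<transunion_project>.cloudfunctions.net/<orchestrator_function>'
);
```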

Create an input table

Create a table in the input dataset. TransUnion supports name, postal address, email, phone, date of birth, IPv4 address, and device IDs as inputs. Follow the formatting guidelines in the implementation guide that TransUnion shared with you.
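As an illustrative sketch only, an input table covering those identifier types could look like the following; the exact column names and formats come from the TransUnion implementation guide:

```sql
-- Hypothetical input table; column names are placeholders.
CREATE TABLE `<project_id>.<dataset_id>.TransUnion_er_input` (
  uniqueid  STRING,  -- your record identifier
  firstname STRING,
  lastname  STRING,
  address   STRING,  -- postal address
  email     STRING,
  phone     STRING,
  dob       STRING,  -- date of birth
  ipaddress STRING,  -- IPv4 address
  deviceid  STRING
);
```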

Create a metadata table

Create a metadata table that stores the configuration required by the identity resolution service to process data, including schema mappings. For details and examples, refer to the implementation guide that TransUnion shared with you.
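The service invocation procedure in this document aggregates rows of `STRUCT(config_id, key, value)` from this table, so a minimal schema consistent with that procedure is:

```sql
-- Minimal metadata table schema; comments describe assumed usage.
CREATE TABLE `<project_id>.<dataset_id>.TransUnion_er_metadata` (
  config_id STRING,  -- identifies a batch configuration, for example "1"
  key       STRING,  -- configuration key, such as a schema mapping entry
  value     STRING   -- configuration value
);
```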

Create a job status table

Create a table that will receive updates about the processing of an input batch. You can query this table to trigger other downstream processes in your pipeline. The possible job statuses are as follows: RUNNING, COMPLETED, or ERROR.
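A sketch of such a table, with illustrative column names (follow the schema in the TransUnion implementation guide):

```sql
-- Hypothetical job status table; column names are placeholders.
CREATE TABLE `<project_id>.<dataset_id>.TransUnion_job_status` (
  batchid     STRING,
  status      STRING,    -- RUNNING, COMPLETED, or ERROR
  lastupdated TIMESTAMP
);
```

You can then poll this table, for example by selecting the status for a given batch ID, to trigger downstream steps once the status is COMPLETED.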

Create the service invocation

Create the following procedure to call the TransUnion identity resolution service. The procedure collects the metadata, packages it, and passes it to the invocation Cloud Function endpoint hosted by TransUnion.

```sql
-- create service invocation procedure
CREATE OR REPLACE PROCEDURE `<project_id>.<dataset_id>.TransUnion_get_identities`(
  metadata_table STRING,
  config_id STRING
)
BEGIN
  DECLARE sql_query STRING;
  DECLARE json_result STRING;
  DECLARE base64_result STRING;

  SET sql_query =
    '''select to_json_string(array_agg(struct(config_id,key,value))) from `'''
    || metadata_table
    || '''` where config_id="'''
    || config_id
    || '''"''';

  EXECUTE IMMEDIATE sql_query INTO json_result;

  SET base64_result = (SELECT TO_BASE64(CAST(json_result AS BYTES)));

  SELECT `<project_id>.<dataset_id>.remote_call_TransUnion_er`(base64_result);
END;
```
 
Create the matching output table

Run the following SQL script to create the matching output table. This is the standard output of the application, which includes match flags, scores, persistent individual IDs, and household IDs.

```sql
-- create output table
CREATE TABLE `<project_id>.<dataset_id>.TransUnion_identity_output` (
  batchid              STRING,
  uniqueid             STRING,
  ekey                 STRING,
  hhid                 STRING,
  collaborationid      STRING,
  firstnamematch       STRING,
  lastnamematch        STRING,
  addressmatches       STRING,
  addresslinkagescores STRING,
  phonematches         STRING,
  phonelinkagescores   STRING,
  emailmatches         STRING,
  emaillinkagescores   STRING,
  dobmatches           STRING,
  doblinkagescore      STRING,
  ipmatches            STRING,
  iplinkagescore       STRING,
  devicematches        STRING,
  devicelinkagescore   STRING,
  lastprocessed        STRING
);
```

Follow the implementation guide that TransUnion shared with you to map your input schema to the application schema. This metadata also configures the generation of collaboration IDs, which are shareable non-persistent identifiers that can be used in data clean rooms.

Grant read and write access

Obtain the service account ID of the Apache Spark connection from the TransUnion customer delivery team and grant it read and write access to the dataset that contains the input and output tables. We recommend granting the service account the BigQuery Data Editor role on the dataset.
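Assuming a placeholder service account address, the recommended grant can be expressed with a GoogleSQL DCL statement:

```sql
-- The service account address is a placeholder; use the Apache Spark
-- connection service account ID that TransUnion provides.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `<project_id>.<dataset_id>`
TO "serviceAccount:tu-spark@transunion-example.iam.gserviceaccount.com";
```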

Invoke the application

You can invoke the application from within your environment by running the following script.

```sql
call `<project_id>.<dataset_id>.TransUnion_get_identities`(
  "<project_id>.<dataset_id>.TransUnion_er_metadata", "1"
);  -- using metadata table, and 1 = config_id for the batch run
```

Support

For technical issues, contact TransUnion Cloud Support.

Billing and usage

TransUnion tracks usage of the application and uses it for billing purposes. Active customers can contact their TransUnion delivery representative for more information.

Workflow for identity providers

The following sections show identity providers how to configure entity resolution in BigQuery. For a visual representation of the complete setup, see the architecture for entity resolution.

Before you begin

  1. Create a Cloud Run job or a Cloud Run function to integrate with the remote function. Either option is suitable for this purpose.
  2. Note the name of the service account that's associated with the Cloud Run job or function:

    1. In the Google Cloud console, go to the Cloud Functions page.

      Go to Cloud Functions

    2. Click the function's name, and then click the Details tab.

    3. In the General Information pane, find and note the service account name for the remote function.

  3. Create a remote function.

  4. Collect end-user principals from the end user.

Required roles

To get the permissions that you need to run entity resolution jobs, ask your administrator to grant you the following IAM roles:

  • For the service account that's associated with your function to read from and write to the associated datasets and launch jobs:
  • For the end-user principal to see and connect to the remote function:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Modify and share the following remote interface code with the end user. The end user needs this code to start the entity resolution job.

```sql
`PARTNER_PROJECT_ID.DATASET_ID`.match(LIST_OF_PARAMETERS)
```
 

Replace LIST_OF_PARAMETERS with the list of parameters that are passed to the remote function.

You can optionally provide job metadata by using a separate remote function or by writing a new status table in the user's output dataset. Examples of metadata include job statuses and metrics.

Billing for identity providers

To streamline customer billing and onboarding, we recommend that you integrate your entity resolution service with the Google Cloud Marketplace. This lets you set up a pricing model based on entity resolution job usage, with Google handling the billing for you. For more information, see Offering software as a service (SaaS) products.
