Evaluate result and quality

Preview

This product is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms . You can process personal data for this product as outlined in the Cloud Data Processing Addendum , subject to the obligations and restrictions described in the agreement under which you access Google Cloud. Pre-GA products are available "as is" and might have limited support. For more information, see the launch stage descriptions .

Understand the result

Enterprise Knowledge Graph writes results into a new BigQuery table for every job. This is a snapshot of the data at the time the job is executed. By default, every job generates a random cluster_id for each entity cluster. However, if you want to keep the ID stable among different job runs, use the previous BigQuery result table advanced option.

Examine the result table

Output Schema

Field name	Type	Description
cluster_id	STRING	This cluster ID is a private knowledge graph machine ID (MID) assigned to this cluster of records. It can be used to uniquely identify the record in your dataset. You can use the Previous BigQuery tablein the Advanced Options to keep this `cluster_id` stable and consistent across multiple runs.
source_name	STRING	The source name specified in the input configuration, to help you join dataset together.
source_key	STRING	The unique key in your source table, to help you join dataset together.
confidence	FLOAT	Confidence score that determines how strongly these records belong to this cluster.
assignment_age	INTEGER	Used internally for cluster_id (MID) stabilization across different jobs.
cloud_kg_mid	STRING	The Google Cloud Knowledge Graph linked entity MID. You could use this MID as your permanent ID or look up additional details from Cloud Knowledge Graph API.

Use SQL to join the dataset together

Enterprise Knowledge Graph outputs grouped entities by cluster ID. The simplest way to view the result is by using the cluster ID to "group by" your result. The following example performs a quick sanity check by joining the output table with the original table.

  # 
  
 get 
  
 all 
  
 entity 
  
 clusters 
 SELECT 
  
 distinct 
  
 ( 
 cluster_id 
 ) 
  
 FROM 
  
 ` 
 ekg 
 - 
 test 
 . 
< dataset 
> . 
 clusters_9425187210682344597 
 ` 
  
 order 
  
 by 
  
 cluster_id 
  
 LIMIT 
  
 1000 
 ; 
  
 # 
  
 join 
  
 data 
  
 with 
  
 original 
  
 table 
 SELECT 
  
 confidence 
 , 
  
 RS 
 ., 
  
 SRC 
 . 
  
 FROM 
  
 ` 
 ekg 
 - 
 test 
 . 
< dataset 
> . 
 clusters_9425187210682344597 
 ` 
  
 as 
  
 RS 
  
 join 
  
 ` 
 ekg 
 - 
 api 
 - 
 test 
 . 
 demo 
 . 
 organization 
 ` 
  
 as 
  
 SRC 
 on 
  
 RS 
 . 
 source_key 
  
 = 
  
 SRC 
 . 
 source_key 
  
 where 
  
 cluster_id 
  
 = 
  
 "r-02b72jsgrbws18" 
 ;

This entity cluster represents two different records that belong to the same cluster. This same cluster_id signals that these two records should be joined and merged.

Use SQL to join the results

Measure success

Pair-wise

Precision: Ratio of distinct entities incorrectly identified as similar false positives (easier to detect by manual inspection).
Recall: Ratio of similar entities that aren't identified as false negatives or harder to detect.

Cluster V-measure

Cluster V-measure: (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness) where beta=1.
Cluster Homogeneity: Ratio of clusters that have entities belonging to the same entity.
Cluster Completeness: Ratio of clusters in which all entities belonging to the same entity are placed into the same cluster.