Input data format and structure

To build a new index or update an existing index, provide vectors to Vector Search in the format and structure described in the following sections.

Prerequisites

Store your input data in a Cloud Storage bucket in your Google Cloud project.

Input data files should be organized as follows:

  • Each batch of input data files should be under a single Cloud Storage directory.
  • Data files should be placed directly under batch_root and named with one of the following suffixes: .csv, .json, or .avro.
  • There is a limit of 5000 objects (files) in the batch root directory.
  • Each data file is interpreted as a set of records. The record format is determined by the filename suffix. The requirements for each format are described in Data file formats.
  • Each record should have an id, a feature vector, and any optional fields supported by Vertex AI Feature Store, such as restricts and crowding.
  • A subdirectory named delete may be present. Each file directly under batch_root/delete is treated as a text file of record IDs, with one ID per line.
  • No other subdirectories are allowed.
  • Transcoding of gzip-compressed files isn't supported as input data.
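
The layout rules above can be checked before uploading. The following is a minimal sketch, not an official validator: the function name check_batch_layout is hypothetical, paths are assumed to be given relative to batch_root, and the 5000-object limit is assumed to count only files placed directly under batch_root.

```python
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".csv", ".json", ".avro"}
MAX_OBJECTS_IN_BATCH_ROOT = 5000  # limit stated in this section


def check_batch_layout(object_paths):
    """Return a list of layout problems for paths relative to batch_root."""
    problems = []
    root_files = 0
    for raw in object_paths:
        path = PurePosixPath(raw)
        parts = path.parts
        if len(parts) == 1:
            # A data file placed directly under batch_root.
            root_files += 1
            if path.suffix not in ALLOWED_SUFFIXES:
                problems.append(f"{raw}: unsupported suffix")
        elif len(parts) == 2 and parts[0] == "delete":
            # Files directly under delete/ are plain-text lists of IDs.
            continue
        else:
            problems.append(f"{raw}: only the delete subdirectory is allowed")
    if root_files > MAX_OBJECTS_IN_BATCH_ROOT:
        problems.append(f"more than {MAX_OBJECTS_IN_BATCH_ROOT} files in batch_root")
    return problems


print(check_batch_layout(["feature_file_1.csv", "delete/delete_file.txt"]))
```

An empty list means that no problem was found by these particular checks. The sketch does not inspect file contents or detect gzip compression.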

Input data processing

  • All records from all data files, including those under delete, make up a single batch of input.
  • The relative ordering of records within a data file is not important.
  • A single ID should appear only once in a batch. If an ID is duplicated, the duplicates count as one vector.
  • An ID cannot appear in both a regular data file and a delete data file.
  • Every ID listed in a data file under delete is removed from the next index version.
  • Records from regular data files are included in the next index version, overwriting any value for the same ID in an older index version.
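
The processing rules above can be pictured as a set of upserts and deletes applied to the previous index version. The following is an illustrative model only, not how Vector Search is implemented: apply_batch is a hypothetical name, and the index is modeled as a plain dictionary from ID to vector. This section doesn't say which record wins when an ID is duplicated, so the last-one-wins behavior below is just a side effect of using a dictionary.

```python
def apply_batch(current_index, upserts, delete_ids):
    """Model one batch: return the next index version as a new dict.

    current_index: dict mapping ID -> vector (the older index version)
    upserts: list of (ID, vector) pairs from regular data files
    delete_ids: iterable of IDs from files under delete/
    """
    delete_ids = set(delete_ids)
    upsert_ids = {record_id for record_id, _ in upserts}

    # An ID cannot appear in both a regular data file and a delete file.
    conflict = upsert_ids & delete_ids
    if conflict:
        raise ValueError(f"IDs in both data and delete files: {sorted(conflict)}")

    next_index = dict(current_index)  # leave the older version untouched
    for record_id in delete_ids:
        next_index.pop(record_id, None)
    for record_id, vector in upserts:
        # Duplicate IDs collapse into a single entry (one vector count).
        next_index[record_id] = vector
    return next_index


old = {"1": [1, 1, 1], "2": [2, 2, 2]}
print(apply_batch(old, [("2", [9, 9, 9]), ("3", [3, 3, 3])], ["1"]))
```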

The following are examples of dense, sparse, and hybrid embeddings:

  • Dense embeddings:

      {"id": "1", "embedding": [1, 1, 1]}
      {"id": "2", "embedding": [2, 2, 2]}
     
    
  • Sparse embeddings:

      {"id": "3", "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1, 4]}}
      {"id": "4", "sparse_embedding": {"values": [-0.4, 0.2, -1.3], "dimensions": [10, 20, 20]}}
     
    
  • Hybrid embeddings:

      {"id": "5", "embedding": [5, 5, -5], "sparse_embedding": {"values": [0.1], "dimensions": [500]}}
      {"id": "6", "embedding": [6, 7, -8.1], "sparse_embedding": {"values": [0.1, -0.2], "dimensions": [40, 901]}}
     
    

The following is an example of a valid input data file organization:

  batch_root/
    feature_file_1.csv
    feature_file_2.csv
    delete/
      delete_file.txt
 

The feature_file_1.csv and feature_file_2.csv files contain records in CSV format. The delete_file.txt file contains a list of record IDs to be deleted from the next index version.

Data file formats

JSON

  • Encode the JSON file using UTF-8.
  • Each line of the JSON file is interpreted as a separate JSON object.
  • Each record must contain an id field to specify the ID of the vector.
  • Each record must contain at least one of embedding or sparse_embedding.
  • The embedding field is an array of N floating point numbers that represents the feature vector, where N is the dimension of the feature vector that was configured when the index was created. This field can be used for dense embeddings only.
    • The length of embedding must equal configs.dimensions, which is specified at index creation time. configs.dimensions applies only to embedding, not to sparse_embedding.
  • The sparse_embedding field is an object with values and dimensions fields. The values field is a list of floating point numbers that represents the feature vector, and the dimensions field is a list of integers that give the dimension in which each corresponding value is located. For example, a sparse embedding that looks like [0,0.1,0,0,0.2] can be represented as "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1,4]}. This field can be used for sparse embeddings only.
    • sparse_embedding.values and sparse_embedding.dimensions must have the same length. They don't need to match configs.dimensions, which is specified at index creation time and doesn't apply to sparse_embedding.
  • An optional restricts field can be included that specifies an array of TokenNamespace objects. For each object:
    • Specify a namespace field that is the TokenNamespace.namespace.
    • An optional allow field can be set to an array of strings, which is the list of TokenNamespace.string_tokens.
    • An optional deny field can be set to an array of strings, which is the list of TokenNamespace.string_blacklist_tokens.
  • The value of the crowding_tag field, if present, must be a string.
  • An optional numeric_restricts field can be included that specifies an array of NumericRestrictNamespace objects. For each object:
    • Specify a namespace field that is the NumericRestrictNamespace.namespace.
    • Specify one of the value fields value_int, value_float, or value_double.
    • Don't include a field named op. That field is only for queries.
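
The JSON rules above can be sketched in code. The following example is a rough illustration rather than an official validator: the helper names to_sparse, check_record, and to_json_lines are hypothetical, and dense_dim stands in for the configs.dimensions value chosen at index creation time. It converts a mostly-zero vector to the values/dimensions form, checks a record against a few of the rules, and writes one JSON object per line.

```python
import json


def to_sparse(dense):
    """Convert a mostly-zero list into the values/dimensions form."""
    dims = [i for i, value in enumerate(dense) if value != 0]
    return {"values": [dense[i] for i in dims], "dimensions": dims}


def check_record(record, dense_dim):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if "id" not in record:
        problems.append("missing id")
    dense = record.get("embedding")
    sparse = record.get("sparse_embedding")
    if dense is None and sparse is None:
        problems.append("need at least one of embedding or sparse_embedding")
    if dense is not None and len(dense) != dense_dim:
        problems.append(f"embedding has {len(dense)} values, expected {dense_dim}")
    if sparse is not None:
        if len(sparse.get("values", [])) != len(sparse.get("dimensions", [])):
            problems.append("sparse values and dimensions differ in length")
    if "crowding_tag" in record and not isinstance(record["crowding_tag"], str):
        problems.append("crowding_tag must be a string")
    for restrict in record.get("numeric_restricts", []):
        if "op" in restrict:
            problems.append("numeric_restricts must not contain op")
    return problems


def to_json_lines(records):
    """Serialize records with one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


record = {"id": "3", "sparse_embedding": to_sparse([0, 0.1, 0, 0, 0.2])}
print(check_record(record, dense_dim=3))
print(to_json_lines([record]), end="")
```

Python strings are written as UTF-8 when the output file is opened with encoding="utf-8", which matches the encoding requirement above.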

Avro

  • Use a valid Avro file.
  • To represent a sparse-only datapoint, provide a sparse embedding in the sparse_embedding field and enter an empty list in the embedding field.
  • Make records that conform to the following schema:

      {
        "type": "record",
        "name": "FeatureVector",
        "fields": [
          {"name": "id", "type": "string"},
          {"name": "embedding", "type": {"type": "array", "items": "float"}},
          {
            "name": "sparse_embedding",
            "type": [
              "null",
              {
                "type": "record",
                "name": "sparse_embedding",
                "fields": [
                  {"name": "values", "type": {"type": "array", "items": "float"}},
                  {"name": "dimensions", "type": {"type": "array", "items": "long"}}
                ]
              }
            ]
          },
          {
            "name": "restricts",
            "type": [
              "null",
              {
                "type": "array",
                "items": {
                  "type": "record",
                  "name": "Restrict",
                  "fields": [
                    {"name": "namespace", "type": "string"},
                    {"name": "allow", "type": ["null", {"type": "array", "items": "string"}]},
                    {"name": "deny", "type": ["null", {"type": "array", "items": "string"}]}
                  ]
                }
              }
            ]
          },
          {
            "name": "numeric_restricts",
            "type": [
              "null",
              {
                "type": "array",
                "items": {
                  "name": "NumericRestrict",
                  "type": "record",
                  "fields": [
                    {"name": "namespace", "type": "string"},
                    {"name": "value_int", "type": ["null", "int"], "default": null},
                    {"name": "value_float", "type": ["null", "float"], "default": null},
                    {"name": "value_double", "type": ["null", "double"], "default": null}
                  ]
                }
              }
            ],
            "default": null
          },
          {"name": "crowding_tag", "type": ["null", "string"]}
        ]
      }
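
As a small illustration of the schema, the sketch below builds plain Python dictionaries that name every top-level field of FeatureVector. The helper make_avro_record is hypothetical. It follows the sparse-only rule above by using an empty list for embedding, and it sets unused nullable fields to None explicitly, on the assumption that a writer may require every field that has no default in the schema.

```python
def make_avro_record(record_id, embedding=None, sparse=None,
                     restricts=None, numeric_restricts=None, crowding_tag=None):
    """Build a dict with every top-level field of the FeatureVector schema."""
    if embedding is None and sparse is None:
        raise ValueError("provide a dense embedding, a sparse embedding, or both")
    return {
        "id": record_id,
        # A sparse-only datapoint carries an empty list in the embedding field.
        "embedding": list(embedding) if embedding is not None else [],
        "sparse_embedding": sparse,            # None maps to the "null" branch
        "restricts": restricts,
        "numeric_restricts": numeric_restricts,
        "crowding_tag": crowding_tag,
    }


sparse_only = make_avro_record(
    "3", sparse={"values": [0.1, 0.2], "dimensions": [1, 4]}
)
print(sparse_only["embedding"])  # the empty list required for sparse-only data
```

Serializing these dictionaries into an Avro file requires an Avro library, which is not shown here.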
     
    

CSV

  • Format: ID,N feature vector values,Any number of dimension:value sparse values,name=value lists
  • Encode the CSV file using UTF-8.
  • Each line of the CSV must contain exactly one record.
  • The first value in each line must be the vector ID, which must be a valid UTF-8 string.
  • Following the ID, at least one of dense embedding or sparse embedding must be specified.
  • For a dense embedding, the next N values represent the feature vector, where N is the dimension of the feature vector that was configured when the index was created.
  • For a sparse embedding, any number of dimension:value pairs can be specified, in which value is parsed as a float and dimension is parsed as a long.
  • For a hybrid embedding that has both dense and sparse embeddings, dense embeddings must be specified before sparse embeddings.
  • Feature vector values must be floating point literals as defined in the Java language spec.
  • Additional values may be in the form name=value .
  • The name crowding_tag is interpreted as the crowding tag and may only appear once in the record.
  • All other name=value pairs are interpreted as token namespace restricts. The same name may be repeated if there are multiple values in a namespace.

    For example, color=red,color=blue represents this TokenNamespace:

      {
        "namespace": "color",
        "string_tokens": ["red", "blue"]
      }
    
  • If value starts with !, the rest of the string is interpreted as an excluded value.

    For example, color=!red represents this TokenNamespace:

      {
        "namespace": "color",
        "string_blacklist_tokens": ["red"]
      }
    
  • #name=numericValue pairs with a number type suffix are interpreted as numeric namespace restricts. The number type suffix is i for int, f for float, and d for double. The same name shouldn't be repeated, because each namespace should have a single associated value.

    For example, #size=3i represents this NumericRestrictNamespace:

      {
        "namespace": "size",
        "value_int": 3
      }

    #ratio=0.1f represents this NumericRestrictNamespace:

      {
        "namespace": "ratio",
        "value_float": 0.1
      }

    #weight=0.3d represents this NumericRestrictNamespace:

      {
        "namespace": "weight",
        "value_double": 0.3
      }
    
  • The following example is a datapoint with id: "6", embedding: [7, -8.1], sparse_embedding: {values: [0.1, -0.2, 0.5], dimensions: [40, 901, 1111]}, crowding tag test, token allowlist of color: red, blue, token denylist of color: purple, and numeric restrict of ratio with float 0.1:

      6,7,-8.1,40:0.1,901:-0.2,1111:0.5,crowding_tag=test,color=red,color=blue,color=!purple,#ratio=0.1f
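
The CSV rules above can be combined into a small formatter. This is a sketch under stated assumptions, not a reference implementation: format_csv_record and its parameters are hypothetical names, numeric restricts are passed as (value, suffix) pairs and written with the # prefix described above, and no escaping is attempted for values that contain commas or equals signs.

```python
NUMERIC_SUFFIXES = {"i", "f", "d"}  # int, float, double


def format_csv_record(record_id, embedding=(), sparse=None, crowding_tag=None,
                      allow=None, deny=None, numeric=None):
    """Format one datapoint as a single CSV line."""
    if not embedding and not sparse:
        raise ValueError("at least one of dense or sparse embedding is required")
    fields = [str(record_id)]
    # Dense values must come before sparse dimension:value pairs.
    fields += [str(value) for value in embedding]
    if sparse:
        pairs = zip(sparse["dimensions"], sparse["values"])
        fields += [f"{dim}:{value}" for dim, value in pairs]
    if crowding_tag is not None:
        fields.append(f"crowding_tag={crowding_tag}")
    # Allowed tokens: the same name is repeated once per token.
    for namespace, tokens in (allow or {}).items():
        fields += [f"{namespace}={token}" for token in tokens]
    # Denied tokens are marked with a leading "!".
    for namespace, tokens in (deny or {}).items():
        fields += [f"{namespace}=!{token}" for token in tokens]
    # Numeric restricts: "#name=value" followed by a type suffix.
    for namespace, (value, suffix) in (numeric or {}).items():
        if suffix not in NUMERIC_SUFFIXES:
            raise ValueError(f"unknown numeric suffix: {suffix}")
        fields.append(f"#{namespace}={value}{suffix}")
    return ",".join(fields)


print(format_csv_record(
    "6", [7, -8.1],
    sparse={"values": [0.1, -0.2, 0.5], "dimensions": [40, 901, 1111]},
    crowding_tag="test",
    allow={"color": ["red", "blue"]},
    deny={"color": ["purple"]},
    numeric={"ratio": (0.1, "f")},
))
```

Because numeric is a dictionary keyed by namespace, each numeric namespace can carry only one value, which mirrors the rule that the same #name shouldn't be repeated.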
    

