Create and manage RUM index

This document shows you how to create the RUM extension and create indexes to optimize full-text search in AlloyDB for PostgreSQL. It provides examples for common use cases, including ranking, phrase searching, and sorting by timestamp.

Before you begin

To create the RUM extension, you must have the alloydb_superuser role. For more information about granting roles, see Add an IAM user or service account to a cluster .

Create the RUM extension

You must create the RUM extension once per database.

Connect to your AlloyDB database using psql or another client. For more information, see Connect to a cluster instance .

Run the following SQL command to create the extension:

  CREATE 
  
 EXTENSION 
  
 IF 
  
 NOT 
  
 EXISTS 
  
 rum 
 ;

Create a RUM index

To optimize full-text search queries, create a RUM index on your data. RUM offers several operator classes for different use cases.

Types of RUM operator classes

The following table summarizes the different RUM operator classes and their primary use cases.

Operator Class	Main Use Case	Limitations
`rum_tsvector_ops`	Standard full-text search with ranking and phrase search.	N/A
`rum_tsvector_hash_ops`	Smaller index and faster updates for full-text search.	Does not support prefix searching.
`rum_tsvector_addon_ops`	Full-text search sorted by another column.	N/A
`rum_anyarray_ops`	Searching within array columns.	N/A
`rum_<TYPE>_ops`	Indexing scalar types for distance-based queries.	N/A
`rum_tsvector_hash_addon_ops`	Hash-based full-text search sorted by another column.	Does not support prefix matching.
`rum_tsquery_ops`	Indexing stored `tsquery` values for reverse search.	N/A
`rum_anyarray_addon_ops`	Array search sorted by another column.	N/A

Index for basic full-text search

Use the rum_tsvector_ops operator class for standard text search that requires fast ranking and phrase search capabilities. This operator class stores the position of each lexeme in the index. The following example creates a table named documents with a content column.

Create a table named documents :

  CREATE 
  
 TABLE 
  
 documents 
  
 ( 
  
 id 
  
 SERIAL 
  
 PRIMARY 
  
 KEY 
 , 
  
 title 
  
 TEXT 
  
 NOT 
  
 NULL 
 , 
  
 content 
  
 TEXT 
  
 NOT 
  
 NULL 
 , 
  
 published_at 
  
 TIMESTAMP 
  
 WITH 
  
 TIME 
  
 ZONE 
  
 DEFAULT 
  
 NOW 
 () 
 );

Populate the documents table with sample data:

  INSERT 
  
 INTO 
  
 documents 
  
 ( 
 title 
 , 
  
 content 
 ) 
  
 VALUES 
  
 ( 
 'Title' 
 , 
  
 'This search engine is working as intended' 
 );

Add a generated tsvector column to your table. This column automatically stores the processed text and improves query performance:

  ALTER 
  
 TABLE 
  
 documents 
 ADD 
  
 COLUMN 
  
 search_vector 
  
 tsvector 
 GENERATED 
  
 ALWAYS 
  
 AS 
  
 ( 
 to_tsvector 
 ( 
 'english' 
 , 
  
 content 
 )) 
  
 STORED 
 ;

Create the RUM index on the new search_vector column:

  CREATE 
  
 INDEX 
  
 idx_docs_rum 
 ON 
  
 documents 
 USING 
  
 rum 
  
 ( 
 search_vector 
  
 rum_tsvector_ops 
 );

Query the table using the index. The <=> operator computes the relevance score, or distance, between the document and the query directly from the index, enabling fast sorting:

  SELECT 
  
 title 
 , 
  
 content 
 FROM 
  
 documents 
 WHERE 
  
 search_vector 
  
 @@ 
  
 to_tsquery 
 ( 
 'english' 
 , 
  
 'search <-> engine' 
 ) 
 ORDER 
  
 BY 
  
 search_vector 
  
< = 
>  
 to_tsquery 
 ( 
 'english' 
 , 
  
 'search <-> engine' 
 );

Populate documents table with more data:

  INSERT 
  
 INTO 
  
 documents 
  
 ( 
 title 
 , 
  
 content 
 ) 
  
 VALUES 
  
 ( 
 'Title1' 
 , 
  
 'English is my primary language.' 
 ); 
 INSERT 
  
 INTO 
  
 documents 
  
 ( 
 title 
 , 
  
 content 
 ) 
  
 VALUES 
  
 ( 
 'Title2' 
 , 
  
 'Google has a great engineering culture' 
 );

Run a prefix search query. This finds documents containing words that start with eng , such as engineer or english :

  SELECT 
  
 title 
 , 
  
 content 
 FROM 
  
 documents 
 WHERE 
  
 search_vector 
  
 @@ 
  
 to_tsquery 
 ( 
 'english' 
 , 
  
 'eng:*' 
 );

Index for optimized hash search

Use the rum_tsvector_hash_ops operator class to reduce index size and improve update speeds. This class stores a hash of each lexeme instead of the full lexeme. This approach results in a smaller index but does not support prefix searching. The following example assumes that you have a table named documents with a search_vector column.

Create the RUM index using the hash operator class:

  CREATE 
  
 INDEX 
  
 idx_docs_rum_hash 
 ON 
  
 documents 
 USING 
  
 rum 
  
 ( 
 search_vector 
  
 rum_tsvector_hash_ops 
 );

Populate documents table with more data:

  INSERT 
  
 INTO 
  
 documents 
  
 ( 
 title 
 , 
  
 content 
 ) 
  
 VALUES 
  
 ( 
 'Title3' 
 , 
  
 'That person was driving incredibly fast, however the routing was not very efficient' 
 );

Run a standard match query:

  SELECT 
  
 * 
  
 FROM 
  
 documents 
  
 WHERE 
  
 search_vector 
  
 @@ 
  
 to_tsquery 
 ( 
 'english' 
 , 
  
 'fast & efficient' 
 );

Index for search sorted by timestamp

Use the rum_tsvector_addon_ops operator class to optimize queries that filter by text and sort by another field, such as a timestamp. This pattern stores the additional field's value directly in the index, which avoids a slow sort operation after the search. The following example assumes that you have a table named documents with a search_vector column and a published_at column.

Create an index that includes the published_at timestamp:

  CREATE 
  
 INDEX 
  
 idx_docs_rum_timestamp 
 ON 
  
 documents 
 USING 
  
 rum 
  
 ( 
 search_vector 
  
 rum_tsvector_addon_ops 
 , 
  
 published_at 
 ) 
 WITH 
  
 ( 
 attach 
  
 = 
  
 'published_at' 
 , 
  
 to 
  
 = 
  
 'search_vector' 
 );

Run a query that finds documents containing the word engine and sorts them by publication date. The index handles both the search and the sort efficiently:

  SELECT 
  
 title 
 , 
  
 published_at 
 FROM 
  
 documents 
 WHERE 
  
 search_vector 
  
 @@ 
  
 to_tsquery 
 ( 
 'english' 
 , 
  
 'engine' 
 ) 
 ORDER 
  
 BY 
  
 published_at 
  
 DESC 
 ;

Index for array search

Use the rum_anyarray_ops operator class to index array columns, such as a list of tags. This lets you efficiently query for arrays that overlap ( && ), contain ( @> ), or are contained by ( <@ ) other arrays. The following example adds a tags column to the documents table.

Add a tags column and populate it with data:

  ALTER 
  
 TABLE 
  
 documents 
  
 ADD 
  
 COLUMN 
  
 tags 
  
 TEXT 
 []; 
 INSERT 
  
 INTO 
  
 documents 
  
 ( 
 title 
 , 
  
 content 
 , 
  
 tags 
 ) 
  
 VALUES 
  
 ( 
  
 'Title4' 
 , 
  
 'Sample Text' 
 , 
  
 ARRAY 
 [ 
 'ai' 
 , 
  
 'ml' 
 ] 
  
 );

Create the RUM index on a TEXT[] column named tags :

  CREATE 
  
 INDEX 
  
 idx_tags_rum 
 ON 
  
 documents 
 USING 
  
 rum 
  
 ( 
 tags 
  
 rum_anyarray_ops 
 );

Run a query to find documents that have either ai or ml in their tags:

  SELECT 
  
 * 
  
 FROM 
  
 documents 
  
 WHERE 
  
 tags 
 && 
 '{"ai", "ml"}' 
 ;

Index for scalar types

Use the rum_<TYPE>_ops operator classes to index columns that contain continuous values, such as integers, timestamps, or floating-point numbers. These operator classes let you use the <=> operator to efficiently calculate the distance between values. The following example assumes that you have a table named documents .

Add a generic integer column, such as rating , to the documents table:

  ALTER 
  
 TABLE 
  
 documents 
 ADD 
  
 COLUMN 
  
 rating 
  
 INT 
 ; 
 UPDATE 
  
 documents 
  
 SET 
  
 rating 
  
 = 
  
 floor 
 ( 
 random 
 () 
  
 * 
  
 5 
  
 + 
  
 1 
 );

Create a RUM index on the rating column:

  CREATE 
  
 INDEX 
  
 idx_rating_rum 
 ON 
  
 documents 
 USING 
  
 rum 
  
 ( 
 rating 
  
 rum_int4_ops 
 );

Run a query to find documents with a rating closest to the value 5:

  SELECT 
  
 title 
 , 
  
 rating 
 FROM 
  
 documents 
 ORDER 
  
 BY 
  
 rating 
  
< = 
>  
 5 
 ;

Index for optimized hash search sorted by timestamp

Use the rum_tsvector_hash_addon_ops operator class to combine the benefits of a hash index with the sorting capabilities of an addon index. This class stores a hash of each lexeme along with the value of an additional column. This configuration supports efficient sorting by the additional column but does not support prefix matching. The following example assumes that you have a table named documents with a search_vector column and a published_at timestamp column.

Create a RUM index that uses the hash operator class and includes the published_at timestamp:

  CREATE 
  
 INDEX 
  
 idx_docs_rum_hash_timestamp 
 ON 
  
 documents 
 USING 
  
 rum 
  
 ( 
 search_vector 
  
 rum_tsvector_hash_addon_ops 
 , 
  
 published_at 
 ) 
 WITH 
  
 ( 
 attach 
  
 = 
  
 'published_at' 
 , 
  
 to 
  
 = 
  
 'search_vector' 
 );

Run a query that finds documents containing engine and sorts them by publication date:

  SELECT 
  
 title 
 , 
  
 published_at 
 FROM 
  
 documents 
 WHERE 
  
 search_vector 
  
 @@ 
  
 to_tsquery 
 ( 
 'english' 
 , 
  
 'engine' 
 ) 
 ORDER 
  
 BY 
  
 published_at 
  
 DESC 
 ;

Index for stored queries

Use the rum_tsquery_ops operator class to index tsquery values. This lets you perform "reverse search," identifying which stored queries match a given input document. The following example creates a table named queries .

Create a table to store queries:

  CREATE 
  
 TABLE 
  
 queries 
  
 ( 
 query_text 
  
 tsquery 
 ); 
 INSERT 
  
 INTO 
  
 queries 
  
 ( 
 query_text 
 ) 
  
 VALUES 
  
 ( 
 plainto_tsquery 
 ( 
 'AlloyDB is fast!' 
 ));

Create a RUM index on the query_text column:

  CREATE 
  
 INDEX 
  
 idx_queries_rum 
 ON 
  
 queries 
 USING 
  
 rum 
  
 ( 
 query_text 
  
 rum_tsquery_ops 
 );

Run a query to find stored queries that match a document:

  SELECT 
  
 * 
 FROM 
  
 queries 
 WHERE 
  
 to_tsvector 
 ( 
 'english' 
 , 
  
 'AlloyDB is fast' 
 ) 
  
 @@ 
  
 query_text 
 ;

Index for array search sorted by timestamp

Use the rum_anyarray_addon_ops operator class to index array columns along with an additional column for sorting. The following example assumes that you have a table named documents with a tags column and a published_at timestamp column.

Create a RUM index on the tags column that includes the published_at timestamp:

  CREATE 
  
 INDEX 
  
 idx_tags_rum_timestamp 
 ON 
  
 documents 
 USING 
  
 rum 
  
 ( 
 tags 
  
 rum_anyarray_addon_ops 
 , 
  
 published_at 
 ) 
 WITH 
  
 ( 
 attach 
  
 = 
  
 'published_at' 
 , 
  
 to 
  
 = 
  
 'tags' 
 );

Run a query to find documents with the ai tag, sorted by publication date:

  SELECT 
  
 title 
 , 
  
 published_at 
 FROM 
  
 documents 
 WHERE 
  
 tags 
  
 @ 
>  
 '{"ai"}' 
 ORDER 
  
 BY 
  
 published_at 
  
 DESC 
 ;

What's next

Learn about Full-text search .
Learn how to Run a hybrid vector similarity search .