Tune connector settings

The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. Knowing how to tune these settings can greatly streamline the indexing of data. This guide lists several issues that can surface during indexing and the settings used to resolve them.

Indexing throughput is low for FullTraversalConnector

The following table lists configuration settings to improve throughput for a FullTraversalConnector:

Setting	Description	Default	Configuration change to try
`traverse.partitionSize`	The number of `ApiOperation()` items processed as a batch before the SDK fetches additional `ApiOperation()` items. The SDK waits for the current partition to be processed before fetching more items. This setting depends on the amount of memory available. Smaller partition sizes, such as 50 or 100, require less memory but make the SDK wait more often.	50	If you have a lot of memory available, try increasing `partitionSize` to 1000 or more.
`batch.batchSize`	The number of requests batched together. At the end of each partition, the SDK waits for all batched requests from that partition to be processed. Larger batches require a longer wait.	10	Try lowering the batch size.
`batch.maxActiveBatches`	The number of batches allowed to execute concurrently.	20	If you lower `batchSize`, raise `maxActiveBatches` according to this formula: `maxActiveBatches = (partitionSize / batchSize) + 50`. For example, if your `partitionSize` is 1000 and your `batchSize` is 5, your `maxActiveBatches` should be 250. The extra 50 is a buffer for retry requests. This increase allows the connector to batch all requests without blocking.
`traverse.threadPoolSize`	The number of threads the connector creates for parallel processing. A single iterator fetches operations (typically `RepositoryDoc` objects) serially, but the API calls are processed in parallel using `threadPoolSize` threads. Each thread processes one item at a time. With the default of 50, at most 50 items are processed simultaneously, and each item takes approximately 4 seconds to process (including the indexing request).	50	Try increasing `threadPoolSize` by a multiple of 10.
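
As a rough illustration of how the settings in the table fit together, the following properties fragment applies the table's own example values (a `partitionSize` of 1000, a `batchSize` of 5, and the resulting `maxActiveBatches` of 250). The `threadPoolSize` value of 100 is an assumed starting point chosen only for illustration. Treat the fragment as a sketch to adapt to your available memory and repository, not as recommended values.

```properties
# Larger partitions need more memory but reduce waiting between fetches.
traverse.partitionSize = 1000

# Smaller batches shorten the wait at the end of each partition.
batch.batchSize = 5

# (partitionSize / batchSize) + 50 = (1000 / 5) + 50 = 250
# The extra 50 is a buffer for retry requests.
batch.maxActiveBatches = 250

# Assumed value for illustration only; the default is 50.
traverse.threadPoolSize = 100
```

If you change `partitionSize` or `batchSize` later, recompute `maxActiveBatches` with the same formula so that the connector can still batch all requests without blocking.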

Finally, consider using the setRequestMode() method to change the API request mode (either ASYNCHRONOUS or SYNCHRONOUS).

For additional information on configuration file parameters, refer to Google-supplied configuration parameters.

Indexing throughput is low for ListTraversalConnector

By default, a connector that implements the ListTraversalConnector uses a single traverser to index your items. To increase indexing throughput, you can create multiple traversers, each with its own configuration that focuses on specific item statuses (NEW_ITEM, MODIFIED, and so on). The following table lists configuration settings to improve throughput:

Setting	Description	Default	Configuration change to try
`repository.traversers = t1, t2, t3, ...`	Creates one or more individual traversers, where t1, t2, t3, ... is the unique name of each. Each named traverser has its own set of settings, which are identified using the traverser's unique name, such as `traversers.t1.hostload` and `traversers.t2.hostload`.	One traverser	Use this setting to add additional traversers.
`traversers.t1.hostload = n`	Identifies the number of threads, n, to use to simultaneously index items.		Experiment with tuning n based on how much load you want to put on your repository. Start with values of 10 or above.
`schedule.pollQueueIntervalSecs = s`	Identifies the number of seconds, s, to wait before re-polling. The content connector continues to poll items as long as the API returns items in the poll response. When the poll response is empty, the connector waits for s seconds before trying again. This setting is only used by the ListingConnector.		Try lowering to 1.
`traverser.t1.pollRequest.statuses = status1, status2, …`	Specifies the statuses, status1, status2, …, of the items to index. For example, setting status1 to NEW_ITEM and status2 to MODIFIED instructs traverser t1 to index only items with those statuses.	One traverser checks for all statuses	Experiment with having different traversers poll for different statuses.
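
To make the multiple-traverser setup concrete, here is a minimal sketch of a configuration that splits polling between two traversers by item status. The traverser names `t1` and `t2`, the host load of 10, and the assignment of one status to each traverser are assumptions for illustration. The key prefixes are copied exactly as they appear in the table above (`traversers.` for host load, `traverser.` for poll statuses), so verify them against the SDK's configuration parameter reference before relying on them.

```properties
# Two named traversers (hypothetical names).
repository.traversers = t1, t2

# Threads per traverser; the table suggests starting at 10 or above.
traversers.t1.hostload = 10
traversers.t2.hostload = 10

# t1 polls only new items; t2 polls only modified items.
traverser.t1.pollRequest.statuses = NEW_ITEM
traverser.t2.pollRequest.statuses = MODIFIED

# Wait 1 second before re-polling after an empty poll response.
schedule.pollQueueIntervalSecs = 1
```

Separating statuses this way lets you adjust the host load for each kind of item independently, which makes it easier to see which traverser is limiting throughput.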