The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. Knowing how to tune these settings can greatly streamline the indexing of data. This guide lists several issues that can surface during indexing and the settings used to resolve them.
Indexing throughput is low for FullTraversalConnector
The following table lists configuration settings to improve throughput for a FullTraversalConnector:
Setting | Description | Default | Configuration change to try |
---|---|---|---|
partitionSize | The number of ApiOperation() to be processed in batches before fetching additional ApiOperation(). The SDK waits for the current partition to be processed before fetching additional items. This setting depends on the amount of memory available. Smaller partition sizes, such as 50 or 100, require less memory but more waiting on behalf of the SDK. | 50 | If you have a lot of memory available, try increasing partitionSize to 1000 or more. |
batchSize | The number of requests batched together. At the end of partitioning, the SDK waits for all batched requests from the partition to be processed. Larger batches require a longer wait. | 10 | Try lowering batchSize. |
maxActiveBatches | The number of batches allowed to execute concurrently. | 20 | If you lower batchSize, raise maxActiveBatches according to this formula: maxActiveBatches = (partitionSize / batchSize) + 50. For example, if your partitionSize is 1000 and your batchSize is 5, your maxActiveBatches should be 250. The extra 50 is a buffer for retry requests. This increase allows the connector to batch all requests without blocking. |
threadPoolSize | The number of threads the connector creates to allow for parallel processing. A single iterator fetches operations (typically RepositoryDoc objects) serially, but the API calls are processed in parallel using threadPoolSize threads. Each thread processes one item at a time. With the default of 50, at most 50 items are processed simultaneously, and it takes approximately 4 seconds to process an individual item (including the indexing request). | 50 | Try increasing threadPoolSize by a multiple of 10. |
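The sizing rule for maxActiveBatches and the per-item processing time quoted in the table can be checked with quick arithmetic. The following is a minimal sketch and not part of the SDK: the class BatchSizing and its method names are hypothetical, and rounding up a partial final batch is an assumption the table does not state.

```java
// Hypothetical helper for the tuning arithmetic described in the table.
// Nothing here calls the Cloud Search SDK.
final class BatchSizing {
    // Buffer for retry requests, taken from the table's formula.
    static final int RETRY_BUFFER = 50;

    // maxActiveBatches = (partitionSize / batchSize) + 50
    static int maxActiveBatches(int partitionSize, int batchSize) {
        if (partitionSize <= 0 || batchSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Assumption: round up so a partial final batch still gets a slot.
        int batchesPerPartition = (partitionSize + batchSize - 1) / batchSize;
        return batchesPerPartition + RETRY_BUFFER;
    }

    // Rough upper bound on throughput: each thread handles one item at a time.
    static double maxItemsPerSecond(int threadPoolSize, double secondsPerItem) {
        if (threadPoolSize <= 0 || secondsPerItem <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        return threadPoolSize / secondsPerItem;
    }

    public static void main(String[] args) {
        System.out.println(maxActiveBatches(1000, 5));  // prints 250
        System.out.println(maxItemsPerSecond(50, 4.0)); // prints 12.5
    }
}
```

For the table's example (partitionSize of 1000, batchSize of 5) the helper returns 250. With the default 50 threads and roughly 4 seconds per item, the estimate tops out at about 12.5 items per second, which suggests why raising threadPoolSize can help.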
Finally, consider using the setRequestMode() method to change the API request mode (either ASYNCHRONOUS or SYNCHRONOUS).
For additional information on configuration file parameters, refer to Google-supplied configuration parameters.
Indexing throughput is low for ListTraversalConnector
By default, a connector that implements the ListTraversalConnector uses a single traverser to index your items. To increase indexing throughput, you can create multiple traversers, each with its own configuration focusing on specific item statuses (NEW_ITEM, MODIFIED, and so on). The following table lists configuration settings to improve throughput:
Setting | Description |
---|---|
repository.traversers = t1, t2, t3, ... | Each named traverser has its own settings, such as traversers.t1.hostload and traversers.t2.hostload. |
traversers.t1.hostload = n | |
schedule.pollQueueIntervalSecs = s | |
traverser.t1.pollRequest.statuses = status1, status2, … | For example, setting status1 to NEW_ITEM and status2 to MODIFIED instructs traverser t1 to index only items with those statuses. |

For additional information on configuration file parameters, refer to Google-supplied configuration parameters.
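To make the per-traverser naming scheme concrete, here is a small sketch that loads an example configuration with the standard java.util.Properties class and looks up each traverser's statuses. The class TraverserConfigSketch, the two traverser names, and the status assignments are illustrative assumptions; the code only shows how the keys compose and does not call the SDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: shows how per-traverser keys are built from the
// names listed in repository.traversers. Values are illustrative only.
final class TraverserConfigSketch {
    static final String CONFIG =
        "repository.traversers = t1, t2\n"
      + "traverser.t1.pollRequest.statuses = NEW_ITEM\n"
      + "traverser.t2.pollRequest.statuses = MODIFIED\n";

    // Parses properties-file text into a Properties object.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits the comma-separated traverser list into trimmed names.
    static List<String> traverserNames(Properties props) {
        List<String> names = new ArrayList<>();
        for (String raw : props.getProperty("repository.traversers", "").split(",")) {
            String name = raw.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Properties props = load(CONFIG);
        for (String name : traverserNames(props)) {
            String key = "traverser." + name + ".pollRequest.statuses";
            System.out.println(name + " polls: " + props.getProperty(key));
        }
    }
}
```

In this layout, t1 handles only new items while t2 handles only modified ones, so the two traversers do not compete for the same queue entries.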
SDK timeouts or interrupts while uploading large files
If you experience SDK timeouts or interrupts while uploading large files, specify a larger timeout using traverser.timeout=s (where s is the number of seconds). This value identifies how long worker threads have to process an item. The default timeout in the SDK is 60 seconds for traverser threads. Additionally, if individual API requests time out, use the following parameters to increase request timeout values:
Request timeout parameter | Description | Default |
---|---|---|
| | Connect timeout for indexing API requests. | 120 seconds. |
| | Read timeout for indexing API requests. | 120 seconds. |