If your data store uses basic website search, the freshness of your store's index mirrors the freshness that's available in Google Search.
If advanced website indexing is enabled in your data store, the web pages in your data store are refreshed in the following ways:
- Automatic refresh
- Manual refresh
- Sitemap-based refresh
This page describes automatic and manual refresh. To understand and implement sitemap-based refresh, see Index and refresh according to sitemap.
Before you begin
If you use a robots.txt file on your website, update it. For more information, see how to prepare your website's robots.txt file.
Automatic refresh
Vertex AI Search performs automatic refresh as follows:
- After you create a data store, it generates an initial index for the included pages.
- After the initial indexing, it indexes any newly discovered pages and recrawls existing pages on a best-effort basis.
- It regularly refreshes data stores that receive a query rate of at least 50 queries per 30 days.
Manual refresh
If you want to refresh specific web pages in a data store with Advanced website indexing turned on, you can call the recrawlUris method. Use the uris field to specify each web page that you want to crawl. The recrawlUris method is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. If the recrawlUris method times out, you can call the method again, specifying the web pages that remain to be crawled. You can poll the operations.get method to monitor the status of your recrawl operation.
Limits on recrawling
There are limits to how often you can crawl web pages and how many web pages that you can crawl at a time:
- Calls per day. The maximum number of calls to the recrawlUris method allowed is 20 per day, per project.
- Web pages per call. The maximum number of uris values that you can specify in a call to the recrawlUris method is 10,000.
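To work within these limits when you have a large site, you can split your URI list into batches before calling the method. The following is a minimal sketch (not part of any official client library; `plan_recrawl_batches` is a hypothetical helper) that chunks a URI list under the per-call limit and flags requests that would exceed the daily call quota:

```python
# Illustrative sketch: batch a URI list under the recrawlUris limits
# described above. Not an official API; the constants mirror the
# documented quotas.
MAX_URIS_PER_CALL = 10_000   # maximum uris values per recrawlUris call
MAX_CALLS_PER_DAY = 20       # maximum recrawlUris calls per project per day

def plan_recrawl_batches(uris):
    """Return batches of at most MAX_URIS_PER_CALL URIs each.

    Raises ValueError if crawling the list would need more calls than
    the daily quota allows.
    """
    batches = [uris[i:i + MAX_URIS_PER_CALL]
               for i in range(0, len(uris), MAX_URIS_PER_CALL)]
    if len(batches) > MAX_CALLS_PER_DAY:
        raise ValueError(
            f"{len(uris)} URIs would need {len(batches)} calls, which "
            f"exceeds the {MAX_CALLS_PER_DAY}-calls-per-day limit.")
    return batches
```

Each returned batch can then be passed as the uris field of a separate recrawlUris call.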
Recrawl the web pages in your data store
You can manually crawl specific web pages in a data store that has Advanced website indexing turned on.
REST
To use the command line to crawl specific web pages in your data store, follow these steps:
- Find your data store ID. If you already have your data store ID, skip to the next step.
  - In the Google Cloud console, go to the AI Applications page and, in the navigation menu, click Data Stores.
  - Click the name of your data store.
  - On the Data page for your data store, get the data store ID.
- Call the recrawlUris method, using the uris field to specify each web page that you want to crawl. Each uri represents a single page, even if it contains asterisks (*). Wildcard patterns are not supported.

  ```
  curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Goog-User-Project: PROJECT_ID" \
  "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
  -d '{
    "uris": [URIS]
  }'
  ```

  Replace the following:
  - PROJECT_ID: the ID of your Google Cloud project.
  - DATA_STORE_ID: the ID of the Vertex AI Search data store.
  - URIS: the list of web pages that you want to crawl, for example, "https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3".

  The output is similar to the following:

  ```
  {
    "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
    "metadata": {
      "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata"
    }
  }
  ```

- Save the name value as input for the operations.get operation when monitoring the status of your recrawl operation.
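If you script this call instead of using curl, the endpoint URL and request body can be assembled as shown in the following sketch. This mirrors the curl command above; `build_recrawl_request` is a hypothetical helper, and PROJECT_ID and DATA_STORE_ID remain placeholders you supply:

```python
import json

# Sketch: build the same recrawlUris endpoint URL and JSON body that
# the curl command in this procedure sends.
def build_recrawl_request(project_id, data_store_id, uris):
    url = (
        "https://discoveryengine.googleapis.com/v1alpha/"
        f"projects/{project_id}/locations/global/collections/default_collection/"
        f"dataStores/{data_store_id}/siteSearchEngine:recrawlUris"
    )
    body = json.dumps({"uris": uris})
    return url, body
```

Send the resulting URL and body with any HTTP client, attaching the same Authorization and X-Goog-User-Project headers shown in the curl example.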
Monitor the status of your recrawl operation
The recrawlUris method, which you use to crawl web pages in a data store, is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. You can monitor the status of this long-running operation by polling the operations.get method, specifying the name value returned by the recrawlUris method. Continue polling until the response indicates either that all of your web pages are crawled, or that the operation timed out before all of your web pages were crawled. If recrawlUris times out, you can call it again, specifying the web pages that were not crawled.
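The polling loop described above can be sketched as follows. This is an illustrative outline, not an official client: `get_operation` is a hypothetical callable standing in for an operations.get call, so you can inject any function that returns the operation JSON as a dict:

```python
import time

# Sketch of the polling loop: call operations.get (via the injected
# get_operation callable) until the operation reports done.
def poll_until_settled(get_operation, interval_seconds=300):
    while True:
        op = get_operation()
        if op.get("done", False):
            # Either all pages were crawled, or the operation timed out
            # and the response lists failedUris to recrawl.
            return op
        time.sleep(interval_seconds)
```

The default interval of five minutes matches how often the operation metadata is updated.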
REST
To use the command line to monitor the status of a recrawl operation, follow these steps:
- Find your data store ID. If you already have your data store ID, skip to the next step.
  - In the Google Cloud console, go to the AI Applications page and, in the navigation menu, click Data Stores.
  - Click the name of your data store.
  - On the Data page for your data store, get the data store ID.
- Poll the operations.get method.

  ```
  curl -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Goog-User-Project: PROJECT_ID" \
  "https://discoveryengine.googleapis.com/v1alpha/OPERATION_NAME"
  ```

  Replace the following:
  - PROJECT_ID: the ID of your Google Cloud project.
  - OPERATION_NAME: the operation name, found in the name field returned in your call to the recrawlUris method in Recrawl the web pages in your data store. You can also get the operation name by listing long-running operations.
- Evaluate each response.

  - If a response indicates that there are pending URIs and the recrawl operation is not done, your web pages are still being crawled. Continue polling.

    Example

    ```
    {
      "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
        "createTime": "2023-09-05T22:07:28.690950Z",
        "updateTime": "2023-09-05T22:22:10.978843Z",
        "validUrisCount": 4000,
        "successCount": 2215,
        "pendingCount": 1785
      },
      "done": false,
      "response": {
        "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse"
      }
    }
    ```

    The response fields can be described as follows:
    - createTime: indicates the time that the long-running operation started.
    - updateTime: indicates the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
    - validUrisCount: indicates that you specified 4,000 valid URIs in your call to the recrawlUris method.
    - successCount: indicates that 2,215 URIs were successfully crawled.
    - pendingCount: indicates that 1,785 URIs have not yet been crawled.
    - done: a value of false indicates that the recrawl operation is still in progress.
  - If a response indicates that there are no pending URIs (no pendingCount field is returned) and the recrawl operation is done, then your web pages are crawled. Stop polling; you can quit this procedure.

    Example

    ```
    {
      "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
        "createTime": "2023-09-05T22:07:28.690950Z",
        "updateTime": "2023-09-05T22:37:11.367998Z",
        "validUrisCount": 4000,
        "successCount": 4000
      },
      "done": true,
      "response": {
        "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse"
      }
    }
    ```

    The response fields can be described as follows:
    - createTime: indicates the time that the long-running operation started.
    - updateTime: indicates the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
    - validUrisCount: indicates that you specified 4,000 valid URIs in your call to the recrawlUris method.
    - successCount: indicates that 4,000 URIs were successfully crawled.
    - done: a value of true indicates that the recrawl operation is done.
  - If a response indicates that there are pending URIs and the recrawl operation is done, then the recrawl operation timed out (after 24 hours) before all of your web pages were crawled. Start again at Recrawl the web pages in your data store. Use the failedUris values in the operations.get response for the values in the uris field in your new call to the recrawlUris method.

    Example

    ```
    {
      "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-8765432109876543210",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
        "createTime": "2023-09-05T22:07:28.690950Z",
        "updateTime": "2023-09-06T22:09:10.613751Z",
        "validUrisCount": 10000,
        "successCount": 9988,
        "pendingCount": 12
      },
      "done": true,
      "response": {
        "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse",
        "failedUris": [
          "https://example.com/page-9989",
          "https://example.com/page-9990",
          "https://example.com/page-9991",
          "https://example.com/page-9992",
          "https://example.com/page-9993",
          "https://example.com/page-9994",
          "https://example.com/page-9995",
          "https://example.com/page-9996",
          "https://example.com/page-9997",
          "https://example.com/page-9998",
          "https://example.com/page-9999",
          "https://example.com/page-10000"
        ],
        "failureSamples": [
          { "uri": "https://example.com/page-9989",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9990",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9991",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9992",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9993",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9994",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9995",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9996",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9997",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] },
          { "uri": "https://example.com/page-9998",
            "failureReasons": [
              { "corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." },
              { "corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours." } ] }
        ]
      }
    }
    ```

    The response fields can be described as follows:
    - createTime: indicates the time that the long-running operation started.
    - updateTime: indicates the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
    - validUrisCount: indicates that you specified 10,000 valid URIs in your call to the recrawlUris method.
    - successCount: indicates that 9,988 URIs were successfully crawled.
    - pendingCount: indicates that 12 URIs have not yet been crawled.
    - done: a value of true indicates that the recrawl operation is done.
    - failedUris: a list of URIs that were not crawled before the recrawl operation timed out.
    - failureSamples: information about URIs that failed to crawl. At most ten failureSamples array values are returned, even if more than ten URIs failed to crawl.
    - errorMessage: the reason that a URI failed to crawl, by corpusType. For more information, see Error messages.
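The three response cases above can be summarized in a small helper. The following is an illustrative sketch (`classify_recrawl_response` is a hypothetical function, not part of any Google client library) that maps an operations.get response dict to the action you should take:

```python
# Sketch: interpret an operations.get response for a recrawl operation,
# following the three cases described above.
def classify_recrawl_response(op):
    metadata = op.get("metadata", {})
    pending = metadata.get("pendingCount", 0)  # field absent when all crawled
    if not op.get("done", False):
        return "keep-polling"                  # pages still being crawled
    if pending == 0:
        return "complete"                      # all pages crawled; stop polling
    # done with URIs still pending: the operation timed out after 24 hours.
    failed = op.get("response", {}).get("failedUris", [])
    return ("timed-out", failed)               # recrawl these URIs again
```

For a timed-out operation, the returned failedUris list is exactly what you pass back as the uris field in a new recrawlUris call.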
Timely refresh
Google recommends that you perform manual refresh on your new and updated pages to ensure that you have the latest index.
Error messages
When you are monitoring the status of your recrawl operation, if the recrawl operation times out while you are polling the operations.get method, operations.get returns error messages for web pages that were not crawled. The following table lists the error messages, whether the error is transient (a temporary error that resolves itself), and the actions that you can take before retrying the recrawlUris method. You can retry all transient errors immediately. Nontransient errors can be retried after you implement the remedy.
| Error message | Is it a transient error? | Action before retrying recrawl |
|---|---|---|
| Page was crawled but was not indexed by Vertex AI Search within 24 hours | Yes | Use the failedUris values in the operations.get response for the values in the uris field when you call the recrawlUris method. |
| Crawling was blocked by the site's robots.txt | No | Unblock the URI in your website's robots.txt file, ensure that the Googlebot user agent is permitted to crawl the website, and retry recrawl. For more information, see How to write and submit a robots.txt file. If you cannot access the robots.txt file, contact the domain owner. |
| Page is unreachable | No | Check the URI that you specified when you call the recrawlUris method. Ensure that you provide the literal URI and not a URI pattern. |
| Crawling timed out | Yes | Use the failedUris values in the operations.get response for the values in the uris field when you call the recrawlUris method. |
| Page was rejected by Google crawler | Yes | Use the failedUris values in the operations.get response for the values in the uris field when you call the recrawlUris method. |
| URL could not be followed by Google crawler | No | If there are multiple redirects, use the URI from the last redirect and retry. |
| Page was not found (404) | No | Check the URI that you specified when you call the recrawlUris method. Ensure that you provide the literal URI and not a URI pattern. Any page that responds with a 4xx error code is removed from the index. |
| Page requires authentication | No | Advanced website indexing doesn't support crawling web pages that require authentication. |
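If you automate retries, the transient/nontransient distinction in the table above can be encoded directly. The following is a minimal sketch under the assumption that you match the table's error strings verbatim (`can_retry_immediately` is a hypothetical helper):

```python
# Sketch: the error messages the table above marks as transient, which
# can be retried immediately without any remedy.
TRANSIENT_ERRORS = {
    "Page was crawled but was not indexed by Vertex AI Search within 24 hours",
    "Crawling timed out",
    "Page was rejected by Google crawler",
}

def can_retry_immediately(error_message):
    """True if the table marks this error as transient."""
    return error_message in TRANSIENT_ERRORS
```

Nontransient errors should instead trigger the corresponding remedy from the table before you retry.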
How deleted pages are handled
When a page is deleted, Google recommends that you manually refresh the deleted URLs.
When your website data store is crawled during either an automatic or a manual refresh, if a web page responds with a 4xx client error code or a 5xx server error code, the unresponsive web page is removed from the index.

