Deploy an Apache Nutch Indexer Plugin

You can set up Google Cloud Search to serve web content to your users by deploying the Cloud Search indexer plugin for Apache Nutch, an open source web crawler.

When you start the web crawl, Apache Nutch crawls the web and uses the indexer plugin to upload original binary (or text) versions of document content to the Google Cloud Search API. The Cloud Search API indexes the content and serves the results to your users.

Important considerations

Before you deploy the indexer plugin, be aware of the following considerations.

System requirements

Operating system

Linux only:

Ubuntu
Red Hat Enterprise Linux 5.0
SUSE Enterprise Linux 10 (64 bit)

Software

Apache Nutch version 1.15. The indexer plugin software includes this version of Nutch.
Java JRE 1.8 installed on the computer that will run the indexer plugin

Apache Tika document types

Apache Tika 1.18 supported document formats

Deploy the indexer plugin

These steps describe how to install the indexer plugin and configure its components to crawl URLs and return results to Cloud Search.

Prerequisites

Before you deploy the indexer plugin, gather the information required to connect Cloud Search and the data source:

Google Workspace private key (which contains the service account ID). For information on obtaining a private key, go to Configure access to the Cloud Search API.
Google Workspace data source ID. For information on obtaining a data source ID, go to Add a data source to search.

Step 1: Build and install the plugin software and Apache Nutch

Clone the indexer plugin repository from GitHub.

    $ git clone https://github.com/google-cloudsearch/apache-nutch-indexer-plugin.git
    $ cd apache-nutch-indexer-plugin

Check out the version of the indexer plugin you want:
```
$ git checkout tags/v1-0.0.5
```
Build the indexer plugin.
```
$ mvn package
```
To skip tests when building the plugin, use mvn package -DskipTests.
Download Apache Nutch 1.15 and follow the Apache Nutch installation instructions.
Extract target/google-cloudsearch-apache-nutch-indexer-plugin-v1.0.0.5.zip to a folder. Copy the plugins/indexer-google-cloudsearch folder to the Apache Nutch plugins folder (apache-nutch-1.15/plugins).
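The copy step above can be sketched as a small shell function. This is a minimal sketch, not part of the plugin: the function name install_plugin and both of its arguments are hypothetical, and it assumes the extracted folder contains plugins/indexer-google-cloudsearch directly, as the step describes.

```shell
# Hypothetical helper: copy the extracted plugin folder into a Nutch install.
#   $1 = folder where you extracted the plugin zip (assumed layout: plugins/indexer-google-cloudsearch)
#   $2 = Apache Nutch directory, for example apache-nutch-1.15
install_plugin() {
  extracted_dir=$1
  nutch_dir=$2
  src="$extracted_dir/plugins/indexer-google-cloudsearch"

  # Stop early if the extracted folder does not have the expected layout.
  if [ ! -d "$src" ]; then
    echo "plugin folder not found: $src" >&2
    return 1
  fi

  mkdir -p "$nutch_dir/plugins"
  cp -R "$src" "$nutch_dir/plugins/"
}

# Example call (paths are placeholders):
#   install_plugin ./extracted-plugin ./apache-nutch-1.15
```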

Step 2: Configure the indexer plugin

To configure the plugin, create a file named plugin-configuration.properties. The configuration file must specify the following parameters to access the Cloud Search data source.

Setting	Parameter
Data source ID	`api.sourceId = 1234567890abcdef` Required. The Cloud Search source ID that the Google Workspace administrator set up for the indexer plugin.
Service account	`api.serviceAccountPrivateKeyFile = ./PrivateKey.json` Required. The Cloud Search service account key file that the Google Workspace administrator created for indexer plugin accessibility.

The following example shows a sample configuration file:

    # data source access
    api.sourceId = 1234567890abcdef
    api.serviceAccountPrivateKeyFile = ./PrivateKey.json

The configuration file can also contain parameters that control plugin behavior, such as how the plugin pushes data into the Cloud Search API, and how it populates metadata and structured data. For descriptions of these parameters, see Google-supplied connector parameters.
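As a quick sanity check, the following sketch writes the sample configuration file and confirms that both required parameters are present. The CONFIG_DIR variable and the check loop are assumptions made for illustration (a temporary directory is used when CONFIG_DIR is unset); the two parameter names and sample values come from the table above.

```shell
# Assumption: CONFIG_DIR is where you keep the plugin configuration.
# A temporary directory is used when it is unset so the sketch runs anywhere.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"
CONFIG_FILE="$CONFIG_DIR/plugin-configuration.properties"

# Write the sample configuration (replace the placeholder values with your own).
cat > "$CONFIG_FILE" <<'EOF'
# data source access
api.sourceId = 1234567890abcdef
api.serviceAccountPrivateKeyFile = ./PrivateKey.json
EOF

# Report any required parameter that is missing from the file.
# The dots in the key names act as regex wildcards, which is permissive but harmless here.
missing=0
for key in api.sourceId api.serviceAccountPrivateKeyFile; do
  if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$CONFIG_FILE"; then
    echo "missing required parameter: $key" >&2
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "required parameters present"
```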

Step 3: Configure Apache Nutch

Open conf/nutch-site.xml and add the following parameters:
Setting	Parameter
Plugin includes	`plugin.includes = text` Required. List of plugins to use. This must include at least index-basic, index-more, and indexer-google-cloudsearch. conf/nutch-default.xml provides a default value, but you must manually add indexer-google-cloudsearch to it.
Metatags names	`metatags.names = text` Optional. Comma-separated list of tags that map to properties in the corresponding data source schema. To learn more, see Nutch-parse metatags.

The following example shows the required modification to nutch-site.xml:
```
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more|metadata)|query-(basic|site|url|lang)|indexer-google-cloudsearch|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
```
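Because the plugin.includes value is a regular expression rather than a plain list, a text search for "index-basic" would not find it inside index-(basic|more|metadata). The sketch below tests each required plugin ID against the whole expression instead. It assumes that Nutch requires the expression to match the full plugin ID, and it uses grep -E as a stand-in for Java regular expressions, which behaves the same for the simple syntax used here. The function name includes_plugin is hypothetical.

```shell
# The plugin.includes value from the example above.
PLUGIN_INCLUDES='protocol-(http|httpclient)|urlfilter-regex|index-(basic|more|metadata)|query-(basic|site|url|lang)|indexer-google-cloudsearch|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)'

# Succeeds when the whole plugin ID matches the plugin.includes expression.
includes_plugin() {
  printf '%s\n' "$1" | grep -Eqx -- "$PLUGIN_INCLUDES"
}

# Check the three plugins that the table above lists as required.
for plugin in index-basic index-more indexer-google-cloudsearch; do
  if includes_plugin "$plugin"; then
    echo "ok: $plugin"
  else
    echo "MISSING: $plugin"
  fi
done
```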
Open conf/index-writers.xml and add the following section:
```
<writer id="indexer_google_cloud_search_1" class="org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter">
  <parameters>
    <param name="gcs.config.file" value="path/to/sdk-configuration.properties"/>
  </parameters>
  <mapping>
    <copy />
    <rename />
    <remove />
  </mapping>
</writer>
```
The <writer> section contains the following parameters:
Setting	Parameter
Path to Cloud Search configuration file	`gcs.config.file = path` Required. The full (absolute) path to the Cloud Search configuration file.
Upload format	`gcs.uploadFormat = text` Optional. The format the plugin uses to push document content to the Cloud Search API. Valid values are raw (pushes original, unconverted content) and text (pushes extracted textual content). The default is raw.

Step 4: Configure web crawl

Before you start a web crawl, configure it to only include information that your organization wants to make available. For more information, see the Nutch tutorial.

Set up start URLs.

Start URLs control where the web crawler begins crawling your content. The crawler must be able to reach all content you want to include by following the links.

To set up start URLs:
1. Change to the Nutch installation directory:
```
$ cd ~/nutch/apache-nutch-X.Y/
```
2. Create a directory for URLs:
```
$ mkdir urls
```
3. In the urls directory, create a file named seed.txt and list one URL per line.
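The three steps above can be sketched as a single script. The NUTCH_HOME variable is an assumption (a temporary directory is used when it is unset so the sketch runs anywhere), and the seed URL is only an example taken from the filter rules later in this step.

```shell
# Assumption: NUTCH_HOME points at your Nutch installation directory.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"

# Create the directory for start URLs.
mkdir -p "$NUTCH_HOME/urls"

# Write the seed file, one URL per line (example URL only).
cat > "$NUTCH_HOME/urls/seed.txt" <<'EOF'
https://support.google.com/gsa/
EOF

# Show how many start URLs the file contains.
echo "start URLs: $(grep -c . "$NUTCH_HOME/urls/seed.txt")"
```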

Set up follow and do-not-follow rules.

Follow URL rules control which URLs the crawler indexes. Do-not-follow rules exclude URLs from being crawled.

To set up these rules:

Change to the Nutch installation directory.
Edit conf/regex-urlfilter.txt:
```
$ nano conf/regex-urlfilter.txt
```

Enter regular expressions with a "+" prefix for URL patterns to follow or a "-" prefix for URL patterns to skip:

 # skip file extensions
-\.(gif|GIF|jpg|JPG|png|PNG|ico)

# skip protocols (file: ftp: and mailto:)
-^(file|ftp|mailto):

# allow urls starting with https://support.google.com/gsa/
+^https://support.google.com/gsa/

# accept anything else
#+.
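To preview how rules like these treat a given URL before you crawl, the following sketch approximates the filter in shell. It is not Nutch's implementation: it assumes the behavior described in the comments of Nutch's stock regex-urlfilter.txt (the first matching rule decides, and a URL that matches no rule is ignored), and it uses grep -E in place of Java regular expressions, which is equivalent only for simple patterns such as the ones above. The function name url_decision is hypothetical.

```shell
# The active rules from the example above, one per line.
RULES='-\.(gif|GIF|jpg|JPG|png|PNG|ico)
-^(file|ftp|mailto):
+^https://support.google.com/gsa/'

# Print "accept" or "reject" for a URL.
# The first rule whose pattern is found in the URL decides; no match means reject.
url_decision() {
  url=$1
  while IFS= read -r rule; do
    case $rule in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    sign=${rule%"${rule#?}"}        # first character: "+" or "-"
    pattern=${rule#?}               # the rest of the line is the regular expression
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      [ "$sign" = "+" ] && echo accept || echo reject
      return
    fi
  done <<EOF
$RULES
EOF
  echo reject
}

url_decision "https://support.google.com/gsa/answer/123"   # accept
url_decision "https://support.google.com/gsa/logo.png"     # reject (file extension rule)
url_decision "ftp://example.com/file"                      # reject (protocol rule)
url_decision "https://example.com/"                        # reject (no rule matches)
```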

Edit the crawl script.

If the gcs.uploadFormat parameter is missing or set to "raw," you must add -addBinaryContent -base64 arguments to the nutch index command. These arguments tell the Nutch Indexer module to include binary content in Base64.
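The rule in the previous paragraph can be sketched as a check against conf/index-writers.xml. This is a rough illustration only: the function names are hypothetical, and the sed expression assumes the parameter is written on a single line as name="gcs.uploadFormat" followed by value="...", which may not match how your file is formatted.

```shell
# Print the configured gcs.uploadFormat, or "raw" when the parameter is absent.
#   $1 = path to index-writers.xml
upload_format() {
  format=$(sed -n 's/.*name="gcs\.uploadFormat"[[:space:]]*value="\([^"]*\)".*/\1/p' "$1" | head -n 1)
  echo "${format:-raw}"
}

# Print the extra arguments the nutch index command needs for that format.
index_extra_args() {
  case $(upload_format "$1") in
    raw) echo "-addBinaryContent -base64" ;;   # raw or missing: include binary content in Base64
    *)   echo "" ;;                            # text: no extra arguments
  esac
}

# Example call (path is a placeholder):
#   index_extra_args conf/index-writers.xml
```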

Open the crawl script in apache-nutch-1.15/bin .

Add the options as shown in this example:

   
    if $INDEXFLAG; then
        echo "Indexing $SEGMENT to index"
        __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -addBinaryContent -base64 -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT

        echo "Cleaning up index if possible"
        __bin_nutch clean $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb
    else
        echo "Skipping indexing ..."
    fi

Step 5: Start a web crawl and content upload

After you set up the indexer plugin, you can run it in local mode. Use scripts from ./bin to execute a crawling job.

The following example assumes components are in the local directory. Run Nutch from the apache-nutch-1.15 directory:

    $ bin/crawl -i -s urls/ crawl-test/ 5

Crawl logs are available in the terminal or the logs/ directory. To direct logging output, edit conf/log4j.properties.
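Before starting a crawl, it can help to confirm that the pieces set up in the earlier steps are in place. The following is a minimal sketch of such a check; the function name preflight is hypothetical, and the three paths it tests are the ones this guide uses (bin/crawl, plugins/indexer-google-cloudsearch, and urls/seed.txt).

```shell
# Hypothetical pre-crawl check. $1 = Apache Nutch directory, for example apache-nutch-1.15.
# Returns 0 when everything is in place and 1 otherwise.
preflight() {
  nutch_dir=$1
  rc=0

  # The crawl script from Step 5 must exist and be executable.
  [ -x "$nutch_dir/bin/crawl" ] || { echo "bin/crawl not found or not executable" >&2; rc=1; }

  # The plugin folder copied in Step 1 must be in the Nutch plugins folder.
  [ -d "$nutch_dir/plugins/indexer-google-cloudsearch" ] || { echo "indexer plugin folder is missing" >&2; rc=1; }

  # The seed file from Step 4 must exist and be non-empty.
  [ -s "$nutch_dir/urls/seed.txt" ] || { echo "urls/seed.txt is missing or empty" >&2; rc=1; }

  return $rc
}

# Example use from the Nutch directory:
#   preflight . && bin/crawl -i -s urls/ crawl-test/ 5
```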