The Harvest Indexer

Edition , for Harvest version

Simon Wilkinson


Introduction

Harvest is a program for producing searchable indexes of documents published on the World Wide Web. It can index documents in a wide variety of file formats (HTML, PDF, PostScript and RTF, to name but a few), fetched over a range of different protocols (HTTP, FTP, NNTP). It is designed to support hierarchical indexing, allowing a number of sites to contribute to one central index.

The ability to perform hierarchical indexing comes from Harvest's separation into two separate systems -- the Gatherer and the Broker. The Gatherer produces a summarised database of all of the objects on the server(s) it is assigned to gather from. The Broker then fetches the database, indexes it with a search tool, and makes it available for user queries. The Broker can fetch databases from multiple Gatherers, and from other Brokers, so it is possible to build up a complex, hierarchical tree of indexing systems.

The Harvest Gatherer and Broker both use a data format called SOIF to store and exchange their content summaries. It's not necessary to know much about SOIF to be able to use Harvest, although a small amount of knowledge doesn't hurt!

Harvest is a powerful package, whose more esoteric features require some documentation, hence the size of this manual! However, setting it up to do simple things is very straightforward. If you just want to index a single web server, then you need only look at Chapter 2 and the start of Chapter 3. If you want to do something more complicated, then you should read Chapters 2 and 3, and dip into the rest.

The Gatherer

The Gatherer is itself made up of a number of subcomponents. There are two packages involved in generating the database -- the enumerator and the summariser (essence). A third component, the gather daemon, makes the generated database available for Brokers to collect.

The enumerator fetches documents, extracts a list of all of the links that they contain, and then passes the document on to essence (the summariser) to create a summary of its contents. This content summary is then added, by essence, to its database, and the enumerator continues this process for all of the links it found in the document.

The content summary is in a data exchange format called SOIF (Summary Object Interchange Format), which is designed to allow the fast and simple exchange of large numbers of content summaries between sites.

The gather daemon makes this content summary available to Brokers, in either a plaintext or a compressed format. The daemon supports incremental updates, removing the need to transfer the entire database every time.

The Broker

The Broker provides the search interface to the information collected by the Gatherer. Again, it is made up of a number of components - a daemon, an indexing system, and a few CGIs.

The daemon fetches data from gather daemons and from other Brokers, storing it in a directory structure on the local disk. It uses a separate, customisable indexing engine to index and search the data, and provides an interface to allow users to carry out these searches. The CGIs provide the interface between the web query form and the broker daemon.

SOIF

SOIF is the data representation that Harvest uses to store all of its content summaries. It is beyond this document's remit to fully describe SOIF, but a basic knowledge of its features and facilities may help when working with Harvest.

SOIF represents data as a series of attribute-value pairs, where the attribute describes the meaning of the value. For instance, this document might have the following attributes and values:

Title: The Harvest Indexer
Description: A user manual for the Harvest Indexer

A SOIF record also includes a template type, which defines the meaning of the attributes (and need not concern us further), and a URL, which identifies the document being described. Each value is additionally prefixed with a byte count, so that arbitrary binary data can be stored.
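
Putting this together, a complete SOIF record describing this manual might look something like the following (a sketch only -- the template type, URL and byte counts shown here are purely illustrative):

@FILE { http://www.example.org/harvest/manual.html
Title{19}:	The Harvest Indexer
Description{37}:	A user manual for the Harvest Indexer
}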

Installing Harvest

Harvest is distributed in source code form only. This means that in order to run it you will need access to a C compiler (gcc is preferred) and a number of other compilation tools. Whilst users may choose to make binary releases available, these binary releases are not supported in any way.

Required Programs

Harvest needs a number of programs to be installed on the system in order for it to run. Whilst these programs are not a "standard" part of all Unix systems, many systems will already have them installed.

The following are required to build the Harvest binaries

The following are required in order to run Harvest

Harvest will currently work with both Perl 4 and Perl 5, however it is likely that future versions will take advantage of some of the advanced features of Perl 5.

A list of the platforms on which Harvest is known to compile, along with the compilers used, is available as part of the FAQ. If you get Harvest to work on a system not listed there, please mail us.

Getting the code

The most recent Harvest code is available from ftp://ftp.tardis.ed.ac.uk/pub/harvest or one of its mirror sites (as listed in the FAQ). Two digit version numbers (e.g. 1.6) are stable releases; three digit versions (e.g. 1.6.1) are less well tested releases, which may contain bug fixes or new features. At the time of writing the current stable version is

The code is distributed as gzipped tar archives. To unpack one into your current working directory, use

gzip -dc harvest--src.tar.gz | tar -xf - 

Configuring the system

By default the Harvest software will be installed into /usr/local/harvest, and the object files created during the compilation process will be stored in the same directory as the source. Both of these can be changed by varying the invocation of the configure command.

Harvest currently uses the path it is installed into as the location for all of its binaries (in bin and lib), data files (in brokers and gatherers) and cgi-bins (in cgi-bin). This will probably be changed in a later release.

If you are happy with the defaults, then you can simply configure Harvest by typing

./configure

and move on to the next stage.

Installing into a different directory

You can control which directory Harvest is installed into with the --prefix option to configure. For instance, to install into /home/joe/harvest you would use

./configure --prefix=/home/joe/harvest

Building in a different directory

If you want to build your object files in a different tree from your source files (for instance, if you're building Harvest for more than one architecture), then change directory to the directory you want as the root of your object file tree, and run the configure script from the directory you unpacked the source into by giving its full path.

For instance, if you want to build the code in /home/harvest/builds, but unpacked the source in /home/harvest/src/ then you would do

cd /home/harvest/builds
/home/harvest/src/configure --prefix=/usr/local/harvest

Note that you will need a make that supports VPATH (such as GNU make) to use this feature.

Other configure options

The Harvest configure utility supports many of the other standard options available in GNU configure utilities - these are documented in ????. However note that the options to change where certain files are installed (bindir, libdir, etc.) are not currently supported.

Installing the code

Once the code has been configured, you can start the make process by typing

make all

Providing all goes well, you can then install the binaries into the location you chose when you configured the system by typing

make install

If either of these two processes fails, please file a bug report as detailed in section ????

Make targets

The above two targets are all that are needed in normal use; however, the Harvest makefiles also support the following targets, which may be needed in some cases.

all
Build the entire set of binaries
install
Install the entire set of binaries
clean
Remove all files that were created by make all
distclean
Remove all non-distribution files, including those created by running configure

Configuring your web server

The query interface to the broker is produced by means of a number of web pages and a perl script that is run by the web server as a CGI. In order for this to function correctly, some configuration of the web server is currently required.

Harvest expects to be able to run the CGIs which are installed in cgi-bin from the URL http://your.web.server/Harvest/cgi-bin and to be able to access the pages in the root of the Harvest installation at http://your.web.server/Harvest/. The means of achieving this vary from web server to web server. Instructions for some web servers are included below; more examples would be appreciated.

CERN httpd v3.0

You must add an Exec (for the CGIs) and a Pass entry to your server configuration file. If your installation root is /usr/local/harvest then you would need to add the following to your config file

Exec /Harvest/cgi-bin/* /usr/local/harvest/cgi-bin/*
Pass /Harvest/*         /usr/local/harvest/*

Apache or NCSA httpd

You must add an Alias and a ScriptAlias entry. The ordering of these is important; they must appear in the order shown below. Assuming that your installation root is /usr/local/harvest you would add

ScriptAlias /Harvest/cgi-bin/ /usr/local/harvest/cgi-bin/
Alias       /Harvest/         /usr/local/harvest/

Getting Started with Harvest

Having installed Harvest on your system, the next step is to set up an indexing system. It's recommended that you build a system to index your own web server before experimenting with more long-range indexing; this gives you more control over, and feedback on, what's happening. The section below provides a tutorial on this, along with suggestions for fault finding.

Configuring the Gatherer and Broker

As mentioned earlier, Harvest comprises two separate components, the Broker and the Gatherer. Both of these are necessary to produce a search engine, even for a single server.

Typing 'RunHarvest' starts an interactive script that automates this configuration process. It will prompt you with a series of questions, most of which will probably already be set to the correct values by default.

You will be prompted for the following

On which host does your WWW server run?

Give the host part of your web server's name. For example, if you normally access your web server as http://www.dcs.ed.ac.uk/ then give the host name as www.dcs.ed.ac.uk. This defaults to the name of the machine on which you are working.

On which port does your WWW server run?

The port is the section after the colon in the host part of a URL. For instance, with a URL of http://www.dcs.ed.ac.uk:8080/ the port would be 8080. If there is no port given in the URL, for example http://www.dcs.ed.ac.uk/, then just accept the default of 80.

Select a standard configuration

For the purposes of this tutorial, select 1 - Index your entire WWW site.

Enter a short description of this Harvest server

Enter a short (not more than 1 line) description of what you are indexing. This description is used in various automatically generated HTML pages, and in some of the summarised information produced.

Enter a one word description of this Harvest server

This is used as the name of the directories in which the gatherer and broker configuration, log, and data files are stored. The directories themselves will be located in /usr/local/harvest/brokers and /usr/local/harvest/gatherers (assuming that the installation root was /usr/local/harvest).

Where do you want to install the gatherer?

This is the path to the directory in which all of the gatherer's files will be stored. As described above, this is derived from the installation root and the one word description of the server. It should be on a filesystem with sufficient free space to store the data generated by the gatherer run.

On which port should the gatherer run?

The gather daemon makes its summarised content available over a TCP/IP port, running by default as a standalone daemon. The port number you select should be higher than 1024, and not used by any other daemon on your machine.

Where do you want to install the broker?

This is the path to the broker's data directory. It should be on a filesystem with enough free space to store both the data being indexed, and the index itself.

On which port should the broker run?

The broker interacts with the outside world, including the CGI query script, by means of a TCP/IP port. The port number chosen should be unused by all other daemons on your machine.

Enter a password for the Broker's administrative commands

The broker provides a web interface to some of its administrative functions. This password is used to protect those functions from unauthorised use. Don't use a password which needs to remain secure, as, due to the nature of web-based interfaces, it is possible that this password may be disclosed in use.

Would you like to edit the Gatherer's workload specification?

Answer no; the workload will have automatically been set to index the first 250 documents found on your web server.

Running Harvest

Harvest will now run for the first time. Providing all goes well, both the Gatherer and the Broker should run successfully, and the RunHarvest script will finish with a URL at which you can test the results.

The Gatherer will have indexed the first 250 HTML documents it found on your web site. If you want to change this, or to change the manner in which the results are displayed or indexed, continue reading.

If it didn't work, then the section on Troubleshooting should help. If you can't find your answer there, then try asking on the comp.infosystems.harvest newsgroup.

Configuring the Gatherer

Having run the 'RunHarvest' command you will have a directory for your gatherer which will already be populated with the relevant files. If you want to do things by hand, without using the RunHarvest script, then please see ??? for details on how to create all of the relevant config files from scratch.

The gatherer can be configured in a number of different places. Although these settings are gradually migrating into a central configuration file, there are still some things that cannot be controlled from there. As most things are controlled from the configuration file, however, it seems like a good place to start.

The Gatherer configuration file

The gatherer configuration will be saved into a file called name.cf, where name is the single word name you gave to your gatherer previously. This file has three main sections: the first contains a set of global configuration directives, the second a list of RootNodes, and the third a set of LeafNodes.

A LeafNode URL is one which is simply fetched and summarised; none of the links within the document are followed.

A RootNode URL is fetched and summarised, and then the URLs referenced by that RootNode are fetched and so on ...
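
To give a feel for the overall layout, a minimal configuration file might look something like the following sketch (the directive values, host name and workload are purely illustrative; the directives and RootNode options are described in the sections below):

Gatherer-Name:  An index of our example WWW site
Gatherer-Port:  8500
Top-Directory:  /usr/local/harvest/gatherers/example

<RootNode>
http://www.example.org/ URL=250 Depth=0
</RootNode>

<LeafNode>
http://www.example.org/standalone.html
</LeafNode>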

Configuration directives

These directives control the behaviour of the entire gatherer. They are specified at the top of the configuration file in the format

Directive: Value

Access-Delay

The default delay between network accesses. This delay is timed from the completion of one request to the beginning of the next, and can be overridden with the RootNode specific configuration options detailed below.

Data-Directory

The directory where all of the gatherer's working and production data files are stored. By default this is Top-Directory/data

Debug-Options

Debug flags to pass into the various programs that comprise the gatherer. These are discussed in more detail in the Troubleshooting chapter.

Essence-Options

Additional options to pass to the essence program which carries out the process of summarising the gathered data. The available options are...

FIXME: Fill this bit in

Gatherer-Options

Options to pass to the gatherer programs.

The only valid option at present is --savespace which instructs the programs to make as little use as possible of temporary disk space whilst producing the databases.

Gatherer-Host

This is the name of the host on which the gatherer is running. It is automatically obtained from 'hostname' if this attribute is not present. The contents are included in the content summary of every object.

Gatherer-Name

This contains a description of the gatherer. It normally defaults to the one line description entered during the RunHarvest process. It is included in the content summary of every object summarised.

Gatherer-Port

The port on which the gatherd is running. This doesn't affect the actual gatherd port, just the information that is included within the content summary.

HTTP-Basic-Auth

HTTP authentication control.

It is possible to gather from web sites where data is protected using a Username / Password combination, using the HTTP Basic authentication standard. To do this, use the following

HTTP-Basic-Auth: realm username password

where realm is the protection realm as defined by the web server, username is the user name within that realm, and password is the password for that username.
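
For example, the following (with an entirely illustrative realm, username and password) would allow gathering from a server area protected under the realm "Staff-Intranet":

HTTP-Basic-Auth: Staff-Intranet webindex s3cretpw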

Note that doing this will make password protected documents available within the gatherer and broker without any protection.

FTP-Auth

FTP authentication.

Normally, FTP servers are gathered from using the anonymous user. This directive allows particular FTP servers to be accessed with a username and password instead.

The syntax is

FTP-Auth: host username password 

where host is the ftp server host, as it appears in URLs accessing that server.

Note that doing this will make previously password protected information available with no protection in the gatherer and broker.

HTTP-Proxy

A proxy to use for all HTTP URL requests, specified in the form host:port. For example - squid.dcs.ed.ac.uk:3128

Keep-Cache

Harvest maintains a cache of the URLs it has accessed, so that multiple requests for the same URL don't result in multiple accesses to the remote server. If this directive is set to yes then the cache won't be deleted after each run. However, Harvest currently doesn't do any form of validity checking on the objects stored in this disk cache, so keeping it between runs can result in old data being gathered.

Lib-Directory

The directory where some additional configuration files can be found. By default this is set to Top-Directory/lib

Local-Mapping

Local Mapping allows URLs to be mapped to files on the local filesystem, allowing for faster gathering. However, Local Mapping will bypass any server-based access restrictions, and ignore any server-side include directives, so it should be used with care.

Local-Mapping directives have the form

Local-Mapping: urlprefix fileprefix

This causes any URL beginning with urlprefix to be rewritten as a file beginning with fileprefix. The original URL is retained as the identifier for the content summary. If the Local Mapping fails for any reason, then the URL is gathered directly.

A limited form of wildcarding is supported. The * symbol in the URL prefix will match against one or more characters, and the matched string can be substituted into the fileprefix by including a * there.

For example, take the following directive

Local-Mapping: http://www.tardis.ed.ac.uk/~*/* /public/*/pages/*

This will map the URL http://www.tardis.ed.ac.uk/~sxw/index.html to the file /public/sxw/pages/index.html

Locale

The locale directive is used to set the LOCALE environment variable to provide internationalisation support. See the internationalisation chapter for more details.

Log-File

The file to use as the gatherer's log file. By default this is the file log.gatherer in the gatherer's directory.

Maintainer

The email address of the person running the Harvest system. This is sent in the HTTP From: header, and can be used by remote sites to contact the Harvest administrator in cases of trouble.

Errorlog-File

The name of the file to use as the error log. By default this is set to the file log.error in the gatherer's directory.

Post-Summarizing

Post summarizing allows the content summaries generated within the gathering process to be altered.

This tag specifies a file which contains Post-Summarizing directives; the format of this file is described in detail below.

Refresh-Rate

This is the frequency at which documents should be refreshed. It is specified in seconds.

Time-To-Live

This is the default maximum life of a document from the time it is gathered. Documents which are not revisited within their Time-To-Live will have their content summaries deleted. The value is given in seconds.
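
For example, the following settings (illustrative values only) would ask for documents to be refreshed weekly (7 x 86400 seconds) and expired if not revisited within thirty days:

Refresh-Rate: 604800
Time-To-Live: 2592000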

Top-Directory

This specifies the directory in which all of the Gatherer's files are contained. It is normally the directory which contains the configuration file.

User-Agent

This is the User-Agent value that is included in the HTTP headers transmitted by the gatherer to remote servers, and the value that is used for checking for access permission against robots.txt files.

Working-Directory

The directory which contains the temporary disk cache used by the gatherer, and is used as a scratch space during the gathering process. By default this is Top-Directory/tmp

The RootNode section

This section of the configuration file specifies the core of the gatherer's workload. It consists of a pair of <RootNode> and </RootNode> delimiters, and between them a list of RootNodes and their options. A RootNode is a URL at which the gathering process starts; the gatherer then follows the links leading off from that RootNode, according to the options specified.

A RootNode line has the format

url [option [option [option ... ]]]

url may be either a URL, or a program name preceded by a pipe (|) symbol. In the latter case the program should produce a list of URLs on stdout when run, and the gatherer will use each of those URLs, with the options specified, as if it had been given as an individual RootNode (for example, |outputurls.pl would run the outputurls.pl program).
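
A minimal sketch of such a program in Perl is shown below; the script name, perl path and URLs are purely illustrative:

#!/usr/local/bin/perl
# outputurls.pl -- hypothetical RootNode generator.
# Prints one URL per line on stdout for the gatherer to use.
print "http://www.example.org/index.html\n";
print "http://www.example.org/docs/\n";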

Options are given as attribute=value pairs. The following options are supported

Access

Specify the access protocols permitted for this RootNode. The value should be a comma-separated list of permitted protocols. This defaults to being just HTTP.

For example

Access=HTTP,FTP

Delay

The default delay between accesses. The delay is measured from the completion of one request to the start of the next, and defaults to being 1 second. Running a gatherer with a low delay against a remote server is considered very antisocial.

Depth

The maximum depth to go to whilst traversing the site. Depth=0 makes the depth unlimited, Depth=1 gives only the URL listed, Depth=2 gives the listed URL and any URLs that it contains, and so on.

Enumeration

FIXME: What does Enumeration do?

Host

The Host attribute takes the form

Host=max[,host-filter] 

Where max is the maximum number of hosts to visit (defaults to 1), and host-filter is the name of a file containing a set of regular expressions to control the hosts visited. The format of this file is discussed in more detail below.

It is not currently possible to make the maximum number of hosts unlimited.

Search

This defines the search technique to be used. Spiders can work in either a breadth-first or depth-first fashion. Previously Harvest only offered a breadth-first search; however, this can be very memory-inefficient, although it does provide a better summary of the server if only a few URLs are being gathered. Depth-first search uses significantly less memory, although it can result in poor summaries of servers if only a small number of URLs are being gathered and the Depth control is set to unlimited.

The Search attribute takes either Depth or Breadth as its value, that is

Search=(Depth|Breadth)

URL

The URL attribute has the form

URL=max[,url-filter]

Where max is the maximum number of URLs to gather (defaults to 250), and url-filter is the name of a file containing a set of regular expressions limiting the URLs fetched. The format of this file is described below. It is not possible to remove the limit on the number of URLs to gather.

The default url-filter is contained in /usr/local/harvest/lib/gatherer/URL-filter-default
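
Putting these options together, a RootNode section might look something like the following (the hosts, limits and delays are purely illustrative):

<RootNode>
http://www.example.org/ URL=500 Depth=3 Delay=5
ftp://ftp.example.org/pub/ Access=HTTP,FTP Search=Breadth
</RootNode>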

LeafNodes

The LeafNode section contains a list of URLs to add to the index, in the form

<LeafNode>
url
...
</LeafNode>

As with RootNodes, url may be either a URL, or a program name preceded by a pipe (|) symbol.

The Post-Summarizing rules file

Warning! Post summarizing rules are a fairly complex and daunting part of Harvest. They give the system great power, but in many cases are unnecessary. Feel free to skip this section!

Post summarizing rules are used to make changes to an attribute, or a selection of attributes, in a piece of SOIF.

A post summarizing rule file may be located anywhere in the filesystem, with the path to its location being supplied in the Gatherer configuration file, as noted above.

The file itself has a format similar to a Makefile, with a set of conditions on one line, followed by the actions on the lines below, indented with a tab.

That is

conditions
<tab> action
<tab> action
...

Postsummarizing conditions

Conditions are evaluated on individual SOIF attributes; if the conditions are satisfied then the actions below will be run.

A condition consists of three parts: an attribute name, an operator, and a string. Operators can be string equality or regular expression matches, with the corresponding negated operators also available.

equality
attribute == 'string'
regular expression
attribute ~ /string/
inequality
attribute != 'string'
negated regular expression
attribute !~ /string/

Note that strings for equality matches must be enclosed in single quotes, and those for regular expressions in /'s.

Multiple conditions may be joined together with the AND (&&) and OR (||) operators. The expression is evaluated from left to right; there is no way of grouping conditions at present.

Post summarizing actions

If the condition is satisfied then an action will be carried out on the object. This action may do one of a number of things.

  1. It may override the value of an existing attribute, or create a new one. The syntax to do this is
    attribute = "value"
    
  2. It may delete the entire object. The syntax is
    delete()
    
  3. It can pipe the value of an attribute through a program, and reassign the attribute to what the program returns
    attribute | program
    
  4. It can pipe the value of a number of attributes to a program, and reassign all of these attributes with what the program returns. In this case, the values that are being selected are encapsulated as a SOIF object, and the program is expected to return a SOIF object as well. Techniques for doing this are discussed in the Manipulating SOIF chapter later in this manual.
    attribute1,attribute2,attribute3 ... ! program
    

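As an illustration, the following (entirely hypothetical) rules file sets a default Title on objects which have an empty one, and deletes any object whose Description matches a regular expression; the attribute names, strings and patterns are illustrative only, and each action is indented with a tab:

Title == ''
	Title = "Untitled document"

Description ~ /draft/ || Description ~ /internal use only/
	delete()
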
FIXME: Should include a bit about Jochen's #env code in here somewhere

The Host-Filter rules file

The URL-Filter rules file

Configuring the Broker

Starting from scratch

Advanced Topics

Manipulating SOIF

Several parts of Harvest provide an opportunity to manipulate SOIF streams or objects. However, it is not immediately clear how to go about this manipulation. Whatever happens, the manipulation is going to require some programming on the part of the user, be it in Perl or C. If you have no interest in doing such programming, I'd recommend not going any further into this chapter.

It is beyond the scope of this book to discuss actual programming techniques, however I will document some of the code available within Harvest for accessing SOIF objects, and provide pointers to other sources.

C

Harvest provides a C library for accessing SOIF objects, as part of its template library (source in src/common/template). This C library is used extensively throughout the Harvest code, and there are plenty of examples to use.

FIXME: Flesh out this section with an example of use, and pointers to more information

Perl

Perl 4

Harvest provides a set of Perl 4 functions to manipulate SOIF records in src/common/template/soif.pl. These map SOIF's attribute/value model onto Perl hashes, making reading and writing SOIF relatively simple.

To read a record from stdin, the following will suffice

($template,$url,%SOIF)=&soif'parse();

This creates an associative array (or hash) called %SOIF which contains all of the attribute-value pairs from the original SOIF record. These can now be manipulated as a standard Perl variable. When manipulation is complete, the record may be written back out using

&soif'print($template,$url,%SOIF);
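
Putting these together, a minimal sketch of a complete filter (assuming soif.pl has been copied or linked somewhere on your Perl library path, and with an illustrative attribute and value) might be:

#!/usr/local/bin/perl
# Minimal SOIF filter sketch: read a record from stdin, modify it,
# and write it back to stdout.
require 'soif.pl';

($template,$url,%SOIF)=&soif'parse();

# Add or overwrite an attribute in the record (hypothetical value).
$SOIF{'Description'} = 'Rewritten by our post-summarising filter';

&soif'print($template,$url,%SOIF);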

Perl 5

Harvest doesn't ship with any modules specifically for Perl 5, although the Perl 4 ones should work. However, CPAN now houses a set of modules written by Dave Beckett which include one for handling SOIF. The module is called Metadata::SOIF and is part of the Metadata.tar.gz package. Detailed usage descriptions are contained within the package and module themselves.

This is the recommended technique for parsing SOIF records, both for speed and for ease of use.

