Harvest is a program for producing searchable indexes of documents published on the World Wide Web. It can index documents produced in a wide variety of file formats (HTML, PDF, PostScript, and RTF, to name but a few) and fetched by a range of different protocols (HTTP, FTP, NNTP). It is designed to support hierarchical indexing, allowing a number of sites to contribute to one central index.
The ability to perform hierarchical indexing comes from Harvest's separation into two separate systems -- the Gatherer and the Broker. The Gatherer produces a summarised database of all of the objects on the server(s) it is assigned to gather from. The Broker then fetches the database, indexes it with a search tool, and makes it available for user queries. The Broker can fetch databases from multiple Gatherers, and from other Brokers, so it is possible to build up a complex, hierarchical tree of indexing systems.
The Harvest Gatherer and Broker both use a data structure called SOIF to store their content summaries and to communicate with each other. It's not necessary to know much about SOIF to be able to use Harvest, although a small amount of knowledge doesn't hurt!
Harvest is a powerful package, whose more esoteric features require some documentation, hence the size of this manual! However, setting it up to do simple things is very straightforward. If you just want to index a single web server, you need only look at Chapter 2 and the start of Chapter 3. If you want to do something more complicated, you should read Chapters 2 and 3, and dip into the rest as needed.
The Gatherer is itself made up of a number of subcomponents. Two packages are involved in generating the database -- the enumerator and the summariser (essence). A third component, the gather daemon, makes the generated database available for Brokers to collect.
The enumerator fetches documents, extracts a list of all of the links that they contain, and then passes the document on to essence (the summariser) to create a summary of its contents. This content summary is then added, by essence, to its database, and the enumerator continues this process for all of the links it found in the document.
The content summary is in a data exchange format called SOIF (Summary Object Interchange Format), which is designed to allow the fast and simple exchange of large numbers of content summaries between sites.
The gather daemon makes these content summaries available to Brokers, in either plaintext or compressed format. The daemon supports incremental updates, removing the need to transfer the entire database every time.
The Broker provides the search interface to the information collected by the Gatherer. Again, it is made up of a number of components -- a daemon, an indexing system, and a few CGIs.
The daemon fetches data from gather daemons and from other Brokers, storing it in a directory structure on the local disk. It uses a separate, customisable indexing engine to index and search the data, and provides an interface to allow users to carry out these searches. The CGIs provide an interface between the web form used to perform the query and the broker daemon.
SOIF is the data representation that Harvest uses to store all of its content summaries. It is beyond this document's remit to describe SOIF fully, but a basic knowledge of its features and facilities may help when working with Harvest.
SOIF represents data as a series of attribute/value pairs, where the attribute describes the meaning of the value. For instance, this document might have the following attributes and values:
Title: The Harvest Indexer
Description: A user manual for the Harvest Indexer
A SOIF record also includes a template type, which defines the meaning of the attributes and need not concern us further, and a URL, which identifies the document being described. Each value is also stored with a byte count, so SOIF is able to hold arbitrary binary data.
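Putting these pieces together, a complete SOIF record might look something like the sketch below. The template type (FILE), the URL, and the attribute names are purely illustrative; the number in braces after each attribute is the byte count of the value that follows.

@FILE { http://www.dcs.ed.ac.uk/harvest/manual.html
Title{19}: The Harvest Indexer
Description{37}: A user manual for the Harvest Indexer
}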
Harvest is distributed in source code form only. This means that in order to run it you will need access to a C compiler (gcc is preferred) and a number of other compilation tools. Whilst users may choose to make binary releases available, these binary releases are not supported in any way.
Harvest needs a number of programs to be installed on the system in order for it to run. Whilst these programs are not a "standard" part of all Unix systems, many systems will already have them installed.
The following are required to build the Harvest binaries
The following are required in order to run Harvest
Harvest will currently work with both Perl 4 and Perl 5; however, it is likely that future versions will take advantage of some of the more advanced features of Perl 5.
A list of the platforms on which Harvest is known to compile, along with the compilers used, is available as part of the FAQ. If you get Harvest to work on a system not listed there, please mail us.
The most recent Harvest code is available from ftp://ftp.tardis.ed.ac.uk/pub/harvest or one of its mirror sites (as listed in the FAQ). Two-digit version numbers (e.g. 1.6) are stable releases; three-digit versions (e.g. 1.6.1) are less well tested releases, which may contain bug fixes or new features. At the time of writing the current stable version is
The code is distributed as gzip'd tar archives. To unpack an archive into your current working directory, you should use
gzip -dc harvest--src.tar.gz | tar -xf -
By default the Harvest software will be installed into /usr/local/harvest, and the object files created during the compilation process will be stored in the same directory as the source. Both of these can be changed by varying the invocation of the configure command.
Harvest currently uses the path that it is installed into as the location for all of its binaries (in bin and lib), data files (in brokers and gatherers) and CGI programs (in cgi-bin). This will probably change in a later release.
If you are happy with the defaults, then you can simply configure Harvest by typing
./configure
and move on to the next stage.
You can control which directory Harvest is installed into with the --prefix option to configure. For instance, to install into /home/joe/harvest you would use

configure --prefix=/home/joe/harvest
If you want to build your object files in a different tree from your source files (for instance, if you are building Harvest for more than one architecture), change directory to the directory you want as the root of your object file tree, and then run the configure script that lives in the directory you unpacked the source into, as shown below.
For instance, if you want to build the code in /home/harvest/builds, but unpacked the source in /home/harvest/src/, then you would do

cd /home/harvest/builds
/home/harvest/src/configure --prefix=/usr/local/harvest
Note that you will need a make that supports VPATH (such as GNU make) to use this feature.
The Harvest configure utility supports many of the other standard options available in GNU configure utilities; these are documented in ????. However, note that the options to change where certain files are installed (bindir, libdir, etc.) are not currently supported.
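If you are unsure which options your copy of the script accepts, configure can list them itself (this is standard behaviour for GNU-style configure scripts, so the exact output depends on the release you have):

./configure --help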
Once the code has been configured, you can start the make process by typing
make all
Providing all goes well, you can then install the binaries into the location you chose when you configured the system by typing
make install
If either of these two processes fails, please file a bug report as detailed in section ????.
The above two targets are all that are needed in normal use; however, the Harvest makefiles also support the following targets, which may be needed in some cases.
all -- builds the Harvest binaries
install -- installs the binaries into the location chosen at configure time
clean -- removes the files created by make all
distclean -- removes the files created by configure
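For example, to throw away an existing build together with its configuration and start again from scratch, you might use a sequence along these lines:

make distclean
./configure --prefix=/usr/local/harvest
make all
make install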
The query interface to the Broker is produced by means of a number of web pages and a Perl script that is run by the web server as a CGI. In order for this to function correctly, some configuration of the web server is currently required.
Harvest expects to be able to run the CGIs installed in cgi-bin from the URL http://your.web.server/Harvest/cgi-bin, and to be able to access the pages in the root of the Harvest installation at http://your.web.server/Harvest/. The means of achieving this vary from web server to web server. Instructions for some web servers are included below; more examples would be appreciated.
You must add an Exec entry (for the CGIs) and a Pass entry to your server configuration file. If your installation root is /usr/local/harvest then you would need to add the following to your config file:

Exec /Harvest/cgi-bin/* /usr/local/harvest/cgi-bin/*
Pass /Harvest/* /usr/local/harvest/*
You must add an Alias and a ScriptAlias entry. The ordering of these is important; they must be in the order shown below. Assuming that your installation root is /usr/local/harvest, you would add:

ScriptAlias /Harvest/cgi-bin/ /usr/local/harvest/cgi-bin/
Alias /Harvest/ /usr/local/harvest/
Having installed Harvest on your system, the next step is to set up an indexing system. It's recommended that you build a system to index your own web server before experimenting with more long-range indexing. This means that you will have more control over, and feedback on, what's happening. The section below provides a tutorial on this, and suggestions for fault finding.
As mentioned earlier, Harvest is composed of two separate components, the Broker and the Gatherer. Both of these are necessary to produce a search engine for a single server.
Typing 'RunHarvest' starts a script which automates this configuration process. It will prompt you with a series of questions, most of which will probably already be set to the correct values by default.
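Assuming the default installation root, and assuming that RunHarvest has been installed alongside the other Harvest binaries in bin, you would start it with something like:

/usr/local/harvest/bin/RunHarvest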
You will be prompted for the following
Give the host part of your web server's name. For example, if you normally access your web server as http://www.dcs.ed.ac.uk/ then give the host name as www.dcs.ed.ac.uk. This defaults to the name of the machine on which you are working.
The port is the section after the colon in the host part of a URL. For instance, with a URL of http://www.dcs.ed.ac.uk:8080/ the port would be 8080. If there is no port given in the URL, for example http://www.dcs.ed.ac.uk/, then just accept the default of 80.
For the purposes of this tutorial, select 1 - Index your entire WWW site.
Enter a short (not more than 1 line) description of what you are indexing. This description is used in various automatically generated HTML pages, and in some of the summarised information produced.
This is used as a directory name for the directories in which the gatherer and broker configuration, log, and data files are stored. The directories themselves will be located in /usr/local/harvest/brokers and /usr/local/harvest/gatherers (assuming that the installation root was /usr/local/harvest).
This is the path to the directory in which all of the gatherer's files will be stored. As described above, this is derived from the installation root and the one-word name you gave above. It should be on a filesystem with sufficient free space to store the data generated by the gatherer run.
The gather daemon makes its summarized content available over a TCP/IP port, running by default as a standalone daemon. The port number you select should be higher than 1024, and unused by all other daemons on your machine.
This is the path to the broker's data directory. It should be on a filesystem with enough free space to store both the data being indexed, and the index itself.
The broker interacts with the outside world, including the CGI query script, by means of a TCP/IP port. The port number chosen should be unused by all other daemons on your machine.
The broker provides a web interface to some of its administrative functions. This password is used to protect those functions from unauthorised use. Don't use a password which needs to remain secure: due to the nature of web-based interfaces, it is possible that this password may be disclosed in use.
Answer no - the workload will have automatically been set to index the first 250 documents it finds on your web server.
Harvest will now run for the first time. Providing all goes well, both the Gatherer and the Broker should be run successfully, and the RunHarvest script will finish with a URL at which you can test the results.
The Gatherer will have indexed the first 250 HTML documents it found on your web site. If you want to change this, or to change the manner in which the results are displayed or indexed, continue reading.
If it didn't work, then the section on Troubleshooting should help. If you can't find your answer there, then try asking on the comp.infosystems.harvest newsgroup.
Having run the 'RunHarvest' command you will have a directory for your gatherer which will already be populated with the relevant files. If you want to do things by hand, without using the RunHarvest script, then please see ??? for details on how to create all of the relevant config files from scratch.
The gatherer can be configured in a number of different places. Although these are all gradually migrating into a central configuration file, there are still some things that cannot be controlled from it. As most things are controlled from the configuration file, however, it seems like a good place to start.
The gatherer configuration will be saved into a file called name.cf, where name is the single-word name you gave to your gatherer previously. This file has three real sections: the first contains a set of global configuration directives, the second a list of RootNodes, and the third a set of LeafNodes.
A LeafNode URL is one which is simply fetched and summarised, none of the links within the document are followed.
A RootNode URL is fetched and summarised, and then the URLs referenced by that RootNode are fetched and so on ...
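To give a feel for the overall shape of the file, a minimal name.cf might look something like the sketch below. The directory and host names are purely illustrative, and a file generated by RunHarvest will contain rather more directives; the individual entries are explained in the sections that follow.

Top-Directory: /usr/local/harvest/gatherers/mysite

<RootNode>
http://www.dcs.ed.ac.uk/ URL=250
</RootNode>

<LeafNode>
http://www.dcs.ed.ac.uk/contact.html
</LeafNode>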
These directives control the behaviour of the entire gatherer. They are specified at the top of the configuration file in the format

Directive: Value
A default delay to be taken between network accesses. This delay is timed from the completion of one request to the beginning of the next, and can be overridden with the RootNode-specific configuration options detailed below.
The directory where all of the gatherer's working and production data files are stored. By default this is Top-Directory/data.
Debug flags to pass into the various programs that comprise the gatherer. These are discussed in more detail in the Troubleshooting chapter.
Additional options to pass to the essence program which carries out the process of summarising the gathered data. The available options are...
FIXME: Fill this bit in
Options to pass to the gatherer programs.
The only valid option at present is --savespace, which instructs the programs to make as little use as possible of temporary disk space whilst producing the databases.
This is the name of the host on which the gatherer is running. It is automatically obtained from 'hostname' if this attribute is not present. The contents are included in the content summary of every object.
This contains a description of the gatherer. It normally defaults to the one-line description entered during the RunHarvest process. It is included in the content summary of every object summarised.
The port on which the gatherd is running. This doesn't affect the actual gatherd port, just the information that is included within the content summary.
HTTP authentication control.
It is possible to gather from web sites where data is protected using a Username / Password combination, using the HTTP Basic authentication standard. To do this, use the following
HTTP-Basic-Auth: realm username password
where realm is the protection realm as defined by the web server, username is the user name within that realm, and password is the password for that username.
Note that doing this will make password protected documents available within the gatherer and broker without any protection.
FTP authentication.
Normally FTP servers are gathered from by using the anonymous user. This allows certain FTP servers to be accessed using usernames and passwords.
The syntax is
FTP-Auth: host username password
where host is the FTP server host, as it appears in URLs accessing that server.
Note that doing this will make previously password protected information available with no protection in the gatherer and broker.
A proxy to use for all HTTP URL requests, specified in the form host:port. For example: squid.dcs.ed.ac.uk:3128
Harvest maintains a cache of URLs accessed, so that multiple requests for the same URL don't result in multiple accesses to the remote server. If this directive is set to yes then the cache won't be deleted after each run. However, Harvest currently doesn't do any form of validity checking on the objects stored in this disk cache, so keeping it between runs can result in the gathering of stale data.
The directory where some additional configuration files can be found. By default this is set to Top-Directory/lib.
Local Mapping allows URLs to be mapped to files on the local filesystem, allowing for faster gathering. However, Local Mapping will bypass any server based access restrictions, and ignore any server side include directives, so should be used with care.
Local-Mapping directives have the form
Local-Mapping: urlprefix fileprefix
This causes any URLs beginning with urlprefix to be rewritten as files beginning with fileprefix. The original URL is retained as the identifier for the content summary. If the Local Mapping fails for any reason, then the URL is gathered directly.
A limited form of wildcarding is supported. The * symbol in the URL prefix will match against one or more characters. The matched string can be included in the fileprefix by including a *. For example, take the following directive:
Local-Mapping: http://www.tardis.ed.ac.uk/~*/* /public/*/pages/*
This will map the URL http://www.tardis.ed.ac.uk/~sxw/index.html to the file /public/sxw/pages/index.html.
The locale directive is used to set the LOCALE environment variable to provide internationalisation support. See the internationalisation chapter for more details.
The file to use as the gatherer's log file. By default this is the file log.gatherer in the gatherer's directory.
The email address of the person running the Harvest system. This is sent in the HTTP From: header, and can be used by remote sites to contact the Harvest administrator in cases of trouble.
The name of the file to use as the error log. By default this is set to the file log.error in the gatherer's directory.
Post summarizing allows the content summaries generated within the gathering process to be altered.
This tag specifies a file which contains Post-Summarizing directives, the format of this file is described in detail below.
This is the frequency at which documents should be refreshed. It is specified in seconds.
This is the default maximum life of a document from the time it is gathered. Documents which are not revisited within their Time-To-Live will have their content summaries deleted. The value is given in seconds.
This specifies the directory in which all of the Gatherer's files are contained. It is normally the directory which contains the configuration file.
This is the User-Agent value that is included in the HTTP headers transmitted by the gatherer to remote servers, and the value that is used for checking for access permission against robots.txt files.
The directory which contains the temporary disk cache used by the gatherer, and which is used as scratch space during the gathering process. By default this is Top-Directory/tmp.
This section of the configuration file specifies the core of the gatherer's workload. It consists of a set of <RootNode> and </RootNode> delimiters, and between them a list of RootNodes and their options.
A root node is a URL at which the gathering process starts; the gatherer will then follow the links leading off from that RootNode, according to the options specified.
A RootNode line has the format
url [option [option [option ... ]]]
url may be either a URL, or a program name preceded by a pipe (|) symbol. In the latter case the program should produce a list of URLs on stdout when run; the gatherer will then use each of these URLs, with the options specified, as if it had been given as an individual root node. (For example, |outputurls.pl would run the outputurls.pl program.)
Options are given as attribute=value pairs. The following options are supported:
Specify the access protocols permitted for this RootNode. The value should be a comma-separated list of permitted protocols. This defaults to being just HTTP. For example:
Access=HTTP,FTP
The default delay between accesses. The delay is measured from the completion of one request to the start of the next, and defaults to being 1 second. Running a gatherer with a low delay against a remote server is considered very antisocial.
The maximum depth to go to whilst traversing the site. Depth=0 makes the depth unlimited, Depth=1 gives only the URL listed, Depth=2 gives the listed URL and any URLs that it contains, and so on.
FIXME: What does Enumeration do?
The Host attribute takes the form
Host=max[,host-filter]
where max is the maximum number of hosts to visit (defaults to 1), and host-filter is the name of a file containing a set of regular expressions to control the hosts visited. The format of this file is discussed in more detail below.
It is not currently possible to make the maximum number of hosts unlimited.
This defines the search technique to be used. Spiders can work in either a breadth-first or a depth-first fashion. Previously Harvest only offered breadth-first search; however, this can be very memory-inefficient, although it does provide a better summary of the server if only a few URLs are being gathered. Depth-first search uses significantly less memory, although it can result in poor summaries of servers if only a small number of URLs are being gathered and the Depth control is set to unlimited.
The Search attribute takes either Depth or Breadth as its value, that is
Search=(Depth|Breadth)
The URL attribute has the form
URL=max[,url-filter]
where max is the maximum number of URLs to gather (defaults to 250), and url-filter is the name of a file containing a set of regular expressions limiting the URLs fetched. The format of this file is described below. It is not possible to remove the limit on the number of URLs to gather.
The default url-filter is contained in /usr/local/harvest/lib/gatherer/URL-filter-default.
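Putting these options together, a RootNode section might look something like the following sketch (the host name, limits, and choice of options are purely illustrative):

<RootNode>
http://www.dcs.ed.ac.uk/ Access=HTTP URL=500 Depth=0 Search=Breadth
</RootNode>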
The LeafNode section contains a list of URLs to add to the index, in the form
<LeafNode>
url ...
</LeafNode>
As with RootNodes, url may be either a URL, or a program name preceded by a pipe (|) symbol.
Warning! Post-summarizing rules are a fairly complex and daunting part of Harvest. They give the system great power, but in many cases are unnecessary. Feel free to skip this section!
Post summarizing rules are used to make changes to an attribute, or a selection of attributes, in a piece of SOIF.
A post-summarizing rule file may be located anywhere in the filesystem, with the path to its location being supplied in the Gatherer configuration file, as noted above.
The file itself has a format similar to a Makefile, with a set of conditions on one line, followed by the actions on the lines below, indented with a tab.
That is
conditions
<tab> action
<tab> action
...
Conditions are evaluated on individual SOIF attributes; if the conditions are satisfied then the actions below will be run.
A condition consists of three parts: an attribute name, an operator, and a string. Operators can be string equality or regular expression matches, with the corresponding negated operators also available.
attribute == 'string'
attribute ~ /string/
attribute != 'string'
attribute !~ /string/
Note that strings for equality matches must be enclosed in single quotes, and those for regular expressions in /'s.
Multiple conditions may be joined together with the AND (&&) and OR (||) operators. The expression is evaluated from left to right; there is no way of grouping conditions at present.
If the condition is satisfied then an action will be carried out on the object. This action may do one of a number of things.
attribute = "value"
delete()
attribute | program
attribute1,attribute2,attribute3 ... ! program
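As a sketch of how conditions and actions fit together, the hypothetical rule file below deletes the content summaries of objects whose URL mentions a test area, and rewrites empty-looking titles. The attribute names (url, title) and the patterns are purely illustrative, and each action line must be indented with a tab.

url ~ /\/test-pages\//
	delete()
title == 'No Title' || title == ''
	title = "Untitled document"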
FIXME: Should include a bit about Jochen's #env code in here somewhere
Several parts of Harvest provide an opportunity to manipulate SOIF streams or objects. However, it is not immediately obvious how to go about this manipulation. Whatever happens, the manipulation is going to require some programming on the part of the user, be it in Perl or C. If you have no interest in doing such programming, I'd recommend not going any further into this chapter.
It is beyond the scope of this book to discuss actual programming techniques, however I will document some of the code available within Harvest for accessing SOIF objects, and provide pointers to other sources.
Harvest provides a C library for accessing SOIF objects, as part of its template library (source in src/common/template). This C library is used extensively throughout the Harvest code, and there are plenty of examples to use.
FIXME: Flesh out this section with an example of use, and pointers to more information
Harvest provides a set of Perl 4 functions to manipulate SOIF records in src/common/template/soif.pl. These map SOIF's attribute/value model onto Perl hashes, making reading and writing SOIF relatively simple.
To read a record from stdin, the following will suffice
($template,$url,%SOIF)=&soif'parse();
This creates an associative array (or hash) called %SOIF which contains all of the attribute/value pairs from the original SOIF record. These can now be manipulated as a standard Perl variable. When manipulation is complete, the record may be written back out using
&soif'print($template,$url,%SOIF);
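For instance, a minimal filter built from these two calls might look like the sketch below. It reads a single SOIF record from stdin, overwrites one attribute, and writes the record back to stdout. The attribute name used is purely illustrative, and the script assumes that soif.pl can be found on the Perl library path.

#!/usr/bin/perl
# Sketch of a SOIF filter using the Perl 4 functions in soif.pl
require 'soif.pl';

# Read one SOIF record from stdin
($template,$url,%SOIF) = &soif'parse();

# Overwrite (or add) an attribute -- 'Description' is just an example
$SOIF{'Description'} = 'Rewritten by a post-processing script';

# Write the modified record back out to stdout
&soif'print($template,$url,%SOIF);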
Harvest doesn't ship with any modules specifically for Perl 5, although the Perl 4 ones should work. However, CPAN now houses a set of modules written by Dave Beckett which include one for handling SOIF. The module is called Metadata::SOIF and is part of the Metadata.tar.gz package. Detailed usage descriptions are contained within the package and module themselves.
This is the recommended technique for both speed and ease of use for parsing SOIF records.