htDig Install

Development on htDig is no longer ongoing, apparently. However, the 3.1.6 version of htDig, which is still available from http://www.htdig.org, is a classic, with good performance and few bugs, that just keeps running and running. So, if you want to install this version, download its tarball.
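
A fetch might look something like this (the exact download path under www.htdig.org is a guess on my part and may have moved):

     wget http://www.htdig.org/files/htdig-3.1.6.tar.gz

Then extract the tarball: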

     tar -xvzf htdig-3.1.6.tar.gz

Change to the build directory and configure the build process:

     cd htdig-3.1.6
     ./configure --prefix=/usr/local/htdig --with-cgi-bin-dir=/var/www/cgi-bin \
                 --with-image-dir=/var/www/html/htdig \
                 --with-search-dir=/usr/local/htdig/template

If you get some b.s. message about how C++ is required and you should consider installing the libstdc++ library, ignore it and run the following configure command instead:

     CPPFLAGS="-Wno-deprecated" ./configure --prefix=/usr/local/htdig \
                 --with-cgi-bin-dir=/var/www/cgi-bin \
                 --with-image-dir=/var/www/html/htdig \
                 --with-search-dir=/usr/local/htdig/template

The problem is that the compiler winkies have decided that fstream.h and its ilk should be replaced by something "better" so they whack out a deprecation warning when you use it. The htDig guys used this header file as a proxy for C++ and, when the autoconf macro that tests for the presence of fstream.h sees the warning, it thinks the header file wasn't found. It's all b.s. We don't give a damn about what the compiler winkies think and the autoconf macro is broken. So, don't worry about a thing.
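
If you're curious about what the macro actually trips over, here's a minimal sketch of the same test (assuming g++ and a libstdc++ vintage that still ships the backward-compatibility fstream.h header):

     printf '#include <fstream.h>\nint main() { return 0; }\n' > conftest.cc
     g++ -c conftest.cc                   # whacks out the "deprecated header" warning
     g++ -Wno-deprecated -c conftest.cc   # same compile, no warning
     rm -f conftest.cc conftest.o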

Once you get ./configure to fly, make htDig with:

     CXXFLAGS="-Wno-deprecated" make -e

Incidentally, in the same vein as fstream being broken, above, it would appear that changes to ifstream in later versions of the C++ library can cause htnotify to loop forever (and chew up a whole bunch of CPU cycles). In a nutshell, whoever wrote htnotify included code that can call the ifstream functions with an empty string for the filename. In earlier versions of the C++ library, this simply resulted in bad() returning a non-zero value, which caused htnotify to go on about its business.

However, with the new C++ library, bad() no longer returns a non-zero value and htnotify tries to read from the file with no name. The read never sets eof(), either, so it loops forever. Nice work, guys (admittedly, the htnotify code is bogus but not returning EOF on a file that was never opened is bogus too).

So, if you plan to use htnotify, you should fix the code therein by applying this patch:

     --- htnotify.cc.orig 2002-01-31 18:47:00.000000000 -0500
     +++ htnotify.cc 2009-02-02 20:10:56.000000000 -0500
     @@ -185,7 +185,7 @@
          // define default preamble text - blank string
          preambleText = "";
 
     -    if (prefixfile != NULL)
     +    if ((prefixfile != NULL) && (*prefixfile != '\0'))
          {
              ifstream    in(prefixfile);
              char        buffer[1024];
     @@ -212,7 +212,7 @@
          postambleText << "    http://www.htdig.org/meta.html\n\n";
          postambleText << "Cheers!\n\nht://Dig Notification Service\n";
 
     -    if (suffixfile != NULL)
     +    if ((suffixfile != NULL) && (*suffixfile != '\0'))
          {
              ifstream    in(suffixfile);
              char        buffer[1024];
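
One way to apply it, assuming the diff above was saved to a file named htnotify.patch (a made-up name) and that htnotify.cc lives in the htnotify subdirectory of the source tree:

     cd htnotify
     patch < htnotify.patch
     cd ..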

Once the patch is applied, rebuild htnotify with:

     CXXFLAGS="-Wno-deprecated" make -e

Then, install htDig as super-duper user:

     su
     make install

A single copy of htDig can be installed and shared by several Web sites on a single Web server. If you'd like all of the Web site config files to be retained in the htDig "conf" directory, it requires a kludge but it works. If you want all of the Web sites to use a single config file in the shared htDig "conf" directory, the same kludge will work in that case too. Or, you can use a separate copy of the config file in each of the Web sites' "db" directories. It's your choice.

Set up a "db" directory under the Web site's top level directory. Copy the rundig script from the htDig "bin" directory to the Web site's "db" directory. Hack it to point to the site's database directory and the htDig common directories. Here are the values to hack:

     DBDIR=/var/www/BSMDev/db
     COMMONDIR=/usr/local/htdig/common
     BINDIR=/usr/local/htdig/bin
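
For the BSMDev site used as the example throughout, setting that up might look like this:

     mkdir -p /var/www/BSMDev/db
     cp /usr/local/htdig/bin/rundig /var/www/BSMDev/db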

If you only want a single, shared config file, skip this step and proceed to hacking the config file. If you want separate config files for each of the Web sites, but all in the common htDig "conf" directory, copy htdig.conf to a file named for each of the Web sites:

     cp /usr/local/htdig/conf/htdig.conf /usr/local/htdig/conf/BSMDev.conf

If you want a separate copy of the config file for each Web site, in the "db" directory specific to that Web site, copy htdig.conf there:

     cp /usr/local/htdig/conf/htdig.conf /var/www/BSMDev/db

Note that you should then make a symlink from the htDig "conf" directory to the actual location of the config file, just as a pointer to remind yourself that hacking config files in the "conf" directory actually is hacking a file in the Web site's "db" directory:

     cd /usr/local/htdig/conf
     ln -s /var/www/BSMDev/db/htdig.conf BSMDev.conf

Hack the config file, wherever it is, to configure the site or sites in question. Typical parameters to hack are:

     database_dir:          /var/www/BSMDev/db
     start_url:             http://www.bsmdevelopment.com/
     local_urls:            http://www.bsmdevelopment.com/=/var/www/BSMDev/html/
     local_urls_only:       true
     local_default_doc:     index.html welcome.html
     limit_urls_to:         ${start_url}
     maintainer:            ewilde@bsmdevelopment.com

You can change any of the other options, if you want.

If you are using a config file in the htDig "conf" directory (either a single file or multiple files), now comes the clever bit (sure, whatever). Make a hard link from the Web site's "db" directory to the config file in the htDig common config directory:

     ln /usr/local/htdig/conf/BSMDev.conf /var/www/BSMDev/db/htdig.conf

or

     ln /usr/local/htdig/conf/htdig.conf /var/www/BSMDev/db/htdig.conf

You must do this, because of the way that htDig processes the config files. If a hard link isn't used, htDig will not work in this shared mode.

If your config file is actually in the Web site's "db" directory and there is a soft link from the htDig "conf" directory, you need do nothing at this point because htDig will use the config file in the "db" directory.

Create specific versions of SearchWrapper.html, SearchSyntax.html and SearchNoMatch.html for the site in question. These files are invoked by htDig as part of the search process. They should be put in the Web site's "db" directory.
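
If you want something to start from, one approach (an assumption on my part -- stock copies of the search templates are installed in the htDig "common" directory as wrapper.html, syntax.html and nomatch.html) is to copy the stock templates under the new names and then hack them up:

     cp /usr/local/htdig/common/wrapper.html /var/www/BSMDev/db/SearchWrapper.html
     cp /usr/local/htdig/common/syntax.html  /var/www/BSMDev/db/SearchSyntax.html
     cp /usr/local/htdig/common/nomatch.html /var/www/BSMDev/db/SearchNoMatch.html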

Copy the htsearch program from the htDig build directory to the top level (shared) cgi-bin directory (if it isn't already there):

     cp htsearch/htsearch /var/www/cgi-bin

In the Web site's cgi-bin directory, make a symbolic link to htsearch in the shared cgi-bin directory:

     ln -s ../../cgi-bin/htsearch /var/www/BSMDev/cgi-bin/htsearch

In the Web site's html directory, make a symbolic link to the installation directory where the htDig icons were installed (the image directory):

     ln -s /var/www/html/htdig /var/www/BSMDev/html/htdig
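
With the config file, the rundig copy and the links in place, a first index can be built by hand by running the site's rundig copy (this is the same thing the reindex script, below, does automatically):

     /var/www/BSMDev/db/rundig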

Now we come to the problem of htfuzzy, the program that creates indexes for different "fuzzy" search algorithms. These indexes can then be used by the htsearch program to do fuzzy matches against search terms that are entered by the user.

If you wish the "fuzzy" match algorithms to work, you must build the databases that drive them, using htfuzzy. If you are going to use the "endings" algorithm, you must get the affix rules and language dictionary for the language of your choice (htDig comes bundled with affix rules and a simple dictionary for English) from the ispell Web page (http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html) and then run htfuzzy, which will build the databases in the common directory.

And, therein lies the problem. If you wish to use multiple languages for your different Web sites, or even better, multiple languages within a single Web site, you will be faced with coming up with some scheme for switching the common directory around on the fly. Good luck.

Fortunately, if you are just interested in English, you can do something like this (in this case for the "endings" and "synonyms" algorithms):

     cd /usr/local/htdig/common
     /usr/local/htdig/bin/htfuzzy -c /usr/local/htdig/conf/htdig.conf \
       endings synonyms

If you'll be adding or removing pages on your Web site(s) on a regular basis, you may want to install the following script (reindex) somewhere in your common scripts directory (e.g. /var/www/Scripts/reindex):

     #!/bin/sh
     #
     # Shell script (run by cron) to check whether any Web pages in this directory
     # have changed and reindex them for searching, if so.
     #
     # This script takes one argument, the name of the Web directory under
     # /var/www that is to be indexed (e.g. MyDir).
     #
     #
     # Check to see whether any pages in the Web directory tree that we are given
     # are newer than the last indexed date.
     #
     ArePagesNew()
     {
     #
     # See if the timestamp file exists.  If not, we are all done.  If so, we
     # must look at the Web directory tree.
     #
     if [ -f /var/www/$1/db/previous_index ]
     then
         #
         # Run a find command that traverses the Web directory tree, looking for
         # any HTML files that are newer.
         #
         /usr/bin/find /var/www/$1/html -name \*\.html -type f \
             -newer /var/www/$1/db/previous_index \
             -exec touch -f /var/www/$1/db/current_index \;
         #
         # Compare the current stamp file with the previous stamp file.  If their
         # times are different, there's been a change.
         #
         if [ /var/www/$1/db/current_index -nt /var/www/$1/db/previous_index ]
         then
             return 0  #New
         else
             return 1  #The same
         fi
     else
         touch -f /var/www/$1/db/current_index
         return 0  #New
     fi
     }
     #
     # If there is a reference directory, build an index page that points to all
     # of the reference material.  This material is loaded dynamically and is
     # never directly linked to.  Thus, the crawler will never find it unless
     # there is a pointer to it.  The page we create is secretly linked to by the
     # top level index of this directory so that the crawler can find and index
     # all of the reference pages.
     #
     LinkRefPages()
     {
     #
     # See if the reference directory exists.  If not, we are all done.  If so,
     # we must look at the directory tree for reference documents.  Currently
     # they are all documents that start with:
     #
     #      DocIdx_
     #      Inst_
     #      PR_
     #      Samp_
     #      Tech_
     #
     if ! test -d /var/www/$1/html/Reference; then return 0; fi
     #
     # Run a find command that traverses the reference directory, looking for any
     # HTML files that match the pattern.
     #
     /usr/bin/find /var/www/$1/html/Reference -name DocIdx_\*\.html -type f \
         -exec echo \<br\>\<a href=\{\}\>@\{\}@\</a\> \
         >/var/www/$1/html/Reference/refindex_1.html \;
     /usr/bin/find /var/www/$1/html/Reference -name Inst_\*\.html -type f \
         -exec echo \<br\>\<a href=\{\}\>@\{\}@\</a\> \
         >>/var/www/$1/html/Reference/refindex_1.html \;
     /usr/bin/find /var/www/$1/html/Reference -name PR_\*\.html -type f \
         -exec echo \<br\>\<a href=\{\}\>@\{\}@\</a\> \
         >>/var/www/$1/html/Reference/refindex_1.html \;
     /usr/bin/find /var/www/$1/html/Reference -name Samp_\*\.html -type f \
         -exec echo \<br\>\<a href=\{\}\>@\{\}@\</a\> \
         >>/var/www/$1/html/Reference/refindex_1.html \;
     /usr/bin/find /var/www/$1/html/Reference -name Tech_\*\.html -type f \
         -exec echo \<br\>\<a href=\{\}\>@\{\}@\</a\> \
         >>/var/www/$1/html/Reference/refindex_1.html \;
     #
     # If we didn't find any files, we're all done.
     #
     if ! test -s /var/www/$1/html/Reference/refindex_1.html; then return 0; fi
     #
     # Adjust the index entries to be human/robot readable.
     #
     sed "s/@\/var\/www\/$1\/html\/Reference\///" \
         /var/www/$1/html/Reference/refindex_1.html | sed "s/.html@//" \
         | sed "s/\/var\/www\/$1\/html\/Reference/./" \
         >/var/www/$1/html/Reference/refindex_2.html
     #
     # Start the file out with the requisite HTML.
     #
     echo \<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\> \
         >/var/www/$1/html/Reference/refindex.html
     echo \<html\>\<head\> >>/var/www/$1/html/Reference/refindex.html
     echo \<meta name=\"robots\" content=\"All\"\> \
         >>/var/www/$1/html/Reference/refindex.html
     echo \</head\>\<body\> >>/var/www/$1/html/Reference/refindex.html
     #
     # Include the index we generated.
     #
     cat /var/www/$1/html/Reference/refindex_2.html \
         >>/var/www/$1/html/Reference/refindex.html
     #
     # Finish off the HTML page.
     #
     echo \</body\>\</html\> >>/var/www/$1/html/Reference/refindex.html
     #
     # Clean up and make the index visible.
     #
     rm -f /var/www/$1/html/Reference/refindex_1.html \
         /var/www/$1/html/Reference/refindex_2.html
     chgrp webmin /var/www/$1/html/Reference/refindex.html
     return 0
     }
     #
     # Check if any pages are newer than the last index time and reindex, if so.
     #
     if (ArePagesNew $1); then
         LinkRefPages $1
         /var/www/$1/db/rundig
         touch -f /var/www/$1/db/previous_index
     fi
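
Assuming the script was saved as /var/www/Scripts/reindex, as suggested above, make it executable and give it a manual run before handing it to cron:

     chmod +x /var/www/Scripts/reindex
     /var/www/Scripts/reindex BSMDev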

Add entries to the system crontab (e.g. /etc/crontab; note the extra user name field) to run the reindex script, and htnotify, on a nightly basis:

     # Reindex any Web pages that have changed since yesterday.  Then, send
     # email notification of any expired pages that are found.
     15 2 * * * root /var/www/Scripts/reindex BSMDev
     15 3 * * * root /usr/local/htdig/bin/htnotify -c /var/www/BSMDev/db/htdig.conf
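
Given the infinite-loop patch earlier, it's also worth running htnotify once by hand (this just reuses the command from the crontab entry) to make sure it finishes cleanly:

     /usr/local/htdig/bin/htnotify -c /var/www/BSMDev/db/htdig.conf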

The reindex script will index all of the pages on the Web site, whenever a change is made, and build the htDig indexes. It will also build a page, for use by Web search engines, that references, through links, all dynamically linked pages found on the site (i.e. those pages that appear on the site but are not specifically referenced directly by other pages on the site). This page can be pushed to the Web site's home directory (or elsewhere) so that a Web crawler from a search engine will find all of the dynamically linked pages. The page simply references all of the pages in a list and is never meant to actually be seen by users; a secret, invisible link to it should be placed in one of the Web site's top level pages.

To add the secret, invisible link, put some HTML that resembles this on one of your site's pages (usually the reference directory's index):

     <!-- Secret, hidden link to the generated reference index -->
     <span style="visibility: hidden;">
       <a href="refindex.html">invisible</a>
     </span>

If you wish to index PDF files, as well as HTML files, you need to install Xpdf and doc2html.

Download the latest Xpdf tar file from http://www.foolabs.com/xpdf/ and untar it in the top level source directory:

     tar -xvzf xpdf-3.02.tar.gz

It will create a new directory for that version of Xpdf. Switch to that directory and build it:

     cd xpdf-3.02
     ./configure
     make

Then, as super duper user, install it:

     su
     make install
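
The pdf2html.pl script installed below hands the actual PDF conversion off to Xpdf's command-line utilities (pdftotext and, in the versions I've seen, pdfinfo -- an assumption worth verifying against your copy), so check that make install put them on the path:

     which pdftotext pdfinfo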

Download the latest doc2html file from http://www.htdig.org. Unfortunately, there is no tar file, just a zippity-do-dah file. So, you'll have to un-zip it using some kind of magic (or a kludge).
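
The magic is just unzip; assuming the archive name matches the doc2html_31 directory referenced below, something like:

     unzip doc2html_31.zip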

The files in the zip archive are just Perl script files so there's no compiling involved. To implement them, copy them to the proper locations:

     su
     cp /rpm/htDig/doc2html_31/doc2html.pl /usr/local/bin
     chmod go+x /usr/local/bin/doc2html.pl
     cp /rpm/htDig/doc2html_31/pdf2html.pl /usr/local/bin
     chmod go+x /usr/local/bin/pdf2html.pl

Edit these two files per the instructions in DETAILS:

  1. Change the first line in each file to #!/usr/bin/perl -w (see the sketch below)
  2. Point the variables at /usr/local/bin/...
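
If it helps, the first change can be scripted (a sketch using GNU sed's in-place edit; the variable edits in step 2 still need a hand edit):

     sed -i '1s|^#!.*|#!/usr/bin/perl -w|' /usr/local/bin/doc2html.pl \
         /usr/local/bin/pdf2html.pl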