Web Indexer

Introduction

Welcome to Web Indexer. This program indexes your website, or a group of web pages, and produces an HTML document that lists the HTML files with a description of each one. You can tell the program to ignore certain files or directories: to skip a single file, set the description of that HTML page to "IGNOREINDEX"; to skip ALL files in a directory, put a file called ".ignore_index" in that directory.
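
The directory rule can be pictured with a short sketch. It is written in Python purely for illustration; the indexer itself is a Perl program and this is not its actual code. The function name indexable_files and the choice to treat only .html/.htm files as pages are assumptions. The README also does not say whether subdirectories of an ignored directory are skipped, so the sketch still visits them.

```python
import os

def indexable_files(root_dir):
    """Yield HTML file paths under root_dir, skipping the files of any
    directory that contains a ".ignore_index" marker file."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subdirectories in a stable order
        if ".ignore_index" in filenames:
            # Assumption: only this directory's own files are skipped;
            # its subdirectories are still walked.
            continue
        for name in sorted(filenames):
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)
```

Calling list(indexable_files("/usr/home/dion/www")) would then give the candidate pages before any per-file "IGNOREINDEX" check is applied.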

Here is an example... using my pages

Contents of Web Indexer

README : file explaining the usage of the web indexer
web_index.pl : The Indexer program
web_index.conf : A sample default configuration for the program
Recurse.pm : Module to process files recursively through dirs
images/ : a directory holding graphics needed for the graphics version
web_index2.0.zip or web_index2.0.tar.gz : The full package (Save this link to get the file... there is no ftp available)

How does it get the description?

You tell the program which method to use to find the description in each HTML document. Each of the three current methods requires you to add one of the following pieces of HTML code to your page (preferably in the <HEAD> area), where [description] is a description of the page.

1. <WINDEX "[description]">
2. <META NAME="description" CONTENT="[description]">
3. <TITLE>[description]</TITLE>
4. Smart checking: look for 2; if it isn't found, look for 3.

1. = only used by this web indexer (#2 is preferable)
2. = HTML 3.0 compliant tag, which is used not only by this program but also by other search engines and web spiders
3. = the standard HTML tag
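
As a rough illustration of the lookup order, here is a minimal sketch in Python. The indexer itself is written in Perl, so this is not the program's own code, and the regular expressions and the name get_description are assumptions. An empty method stands for smart checking: try the META tag first, then fall back to the TITLE.

```python
import re

# One pattern per method letter; matching is case-insensitive.
PATTERNS = {
    "w": re.compile(r'<WINDEX\s+"([^"]*)"\s*>', re.IGNORECASE),
    "m": re.compile(r'<META\s+NAME="description"\s+CONTENT="([^"]*)"\s*>',
                    re.IGNORECASE),
    "t": re.compile(r"<TITLE>(.*?)</TITLE>", re.IGNORECASE | re.DOTALL),
}

def get_description(html, method=""):
    """Return the description found by method "w", "m" or "t".

    An empty method means smart checking: try "m", then "t".
    Returns None when no tag matches.
    """
    order = [method] if method else ["m", "t"]
    for key in order:
        match = PATTERNS[key].search(html)
        if match:
            # Collapse line breaks and repeated spaces inside the text.
            return " ".join(match.group(1).split())
    return None
```

A caller could then compare the returned text with "IGNOREINDEX" to decide whether the page should be left out of the index.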

Configuration of the program

There are two ways to run this program: using command line arguments (e.g. web_index.pl -d /usr/home/dion ... ) or via the configuration file (e.g. see web_index.conf).

    COMMAND LINE ARGS:
    -----------------
    h           =       Show the usage help
    w           =       Get the description via <WINDEX "description here">
    m           =       Get the description via
                        <META NAME="description" CONTENT="description here">
    t           =       Get the description via <TITLE>Grab this part</TITLE>
    i           =       If the description = IGNOREINDEX then ignore the file
    I           =       If a ".ignore_index" file is in a directory, ignore
                        all files in the directory and move on to the next
    T           =       Text output ONLY (not using the graphics)
    c           =       Read configuration from web_index.conf
    d [dir]     =       Start indexing from [dir]
    u [url]     =       Set the base URL to http://[url]
    C [file]    =       Read configuration from [file]
    o [file]    =       HTML filename to output to

    EXAMPLE: To set up an index of my web pages (starting at dion) I would run
    % web_index.pl -iI -d /usr/home/dion/www -u /dion

    which would produce the HTML file /dion/web_index.html
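
To show how a bundled command line such as "-iI -d ... -u ..." breaks down into individual settings, here is a small sketch using Python's standard getopt module. It mirrors the option letters listed in this README, but it is only an illustration: web_index.pl is a Perl script and its real argument handling is not shown here. The names OPTSTRING and parse_args are made up for this example.

```python
import getopt

# Option letters from this README; a trailing ":" marks a flag that
# takes a value ([dir], [url] or [file]).
OPTSTRING = "hwmtiITcd:u:C:o:"

def parse_args(argv):
    """Map each flag letter to its value, or to True for plain switches."""
    opts, _rest = getopt.getopt(argv, OPTSTRING)
    return {flag.lstrip("-"): (value if value else True)
            for flag, value in opts}

settings = parse_args(["-iI", "-d", "/usr/home/dion/www", "-u", "/dion"])
```

With the example command line, settings holds True for both "i" and "I", plus the directory under "d" and the base URL under "u".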

    USING THE CONFIGURATION FILE (Default: web_index.conf)
    ------------------------------------------------------

    If you wish to use the default web_index.conf then you would call
    % web_index.pl -c
    If you want to specify a different filename then use
    % web_index.pl -C /path/to/file.conf

    The config file and the different variables you can set:

    1. -> ROOT_INDEX_DIRECTORY: /usr/home/dion/www
       Set the directory for the indexer to start looking through to
       compile its Site Index
   
    2. -> ROOT_URL: /dion
       Set the base URL for the index (basically the URL which points to
       the ROOT_INDEX_DIRECTORY)

    3. -> IMAGES_URL: /images
       Set the relative URL where the images are stored (e.g. if you have
       your images at /images you would have the above setting)

    4. -> OUTPUT_FILE: /usr/home/dion/www/windex.html
       Set the HTML doc which will have the Index in it

    5. -> GET_DESCRIPTION_FROM: w or m or t
       Here you select the method for the program to get the [description]
       w =  description via <WINDEX "[description]">
       m =  description via <META NAME="description" CONTENT="[description]">
       t =  description via <TITLE>[description]</TITLE>
       
       If you leave it blank it will try to use "m", and if it doesn't get
       a match it will try "t"

    6. -> IGNORE_DIRECTORIES: yes
       If this is set to "yes", then any directory in which you place a
       file called ".ignore_index" is skipped: the program ignores all
       files in it and moves on to the next.

    7. -> IGNORE_FILES: yes
       If this is set to "yes", then any file whose description is
       "IGNOREINDEX" will be ignored (e.g.
       <META NAME="description" CONTENT="IGNOREINDEX"> if you are using
       the "m" method)

    8. -> TEXT_OUTPUT_ONLY: yes
       If this is set to "yes", then it will not print out any of the nice
       images that are in the "images/" directory. Personally I like the
       images :)

    9. -> HTML_HEADER
          [put html here]
	  END_HTML_HEADER

       All the HTML between the two tags HTML_HEADER and END_HTML_HEADER
       will be printed at the top of the HTML output file (OUTPUT_FILE)

   10. -> HTML_FOOTER
	  [put html here]
	  END_HTML_FOOTER

       All the HTML between the two tags HTML_FOOTER and END_HTML_FOOTER
       will be printed at the bottom of the HTML output file (OUTPUT_FILE)
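
To make the layout of the configuration file concrete, the sketch below reads "NAME: value" lines and the two HTML blocks into a dictionary. It is a Python illustration only, not the Perl code that web_index.pl uses. The exact file syntax (one setting per line, block markers on lines of their own) is an assumption based on the descriptions in this README, and parse_config is a hypothetical name.

```python
def parse_config(text):
    """Parse "NAME: value" lines and HTML_HEADER/HTML_FOOTER blocks."""
    settings = {}
    block_name = None   # name of the HTML block being collected, if any
    block_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if block_name is not None:
            if stripped == "END_" + block_name:
                # Closing marker reached: store the collected HTML.
                settings[block_name] = "\n".join(block_lines)
                block_name, block_lines = None, []
            else:
                block_lines.append(line)
        elif stripped in ("HTML_HEADER", "HTML_FOOTER"):
            block_name = stripped
        elif ":" in stripped:
            key, value = stripped.split(":", 1)
            settings[key.strip()] = value.strip()
    return settings
```

A blank GET_DESCRIPTION_FROM comes back as an empty string, which a caller could treat as the "try m, then t" case described in item 5.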

 Configuration of web_search.cgi



   Now to set up the Web Search part of the package. It is also simple.

   1. Place the program somewhere the web server can reach it over http://,
      e.g. in /cgi-bin or in your web directory.

   2. Now make sure that web_index.pl has its FORM ACTION pointing to the cgi

      

   3. Edit the web_search.cgi itself and change the following:  

      $image_dir = "images";

      so that it points to where the "images/" directory is

   4. Change the PrintHeader and PrintFooter functions to customise the HTML
      as you want, and change the following, which hold the default search
      settings:

      <INPUT TYPE="hidden" NAME="IGNORE" VALUE="yes">
      <INPUT TYPE="hidden" NAME="boolean" VALUE="OR">
      <INPUT TYPE="hidden" NAME="case" VALUE="Insensitive">

   5. Edit the web_search.html file.
      Change all the <A HREF> and <IMG> tags to point to your images etc.
      Change the FORM ACTION so that it again points to web_search.cgi. Now
      set the following variables:

      <INPUT TYPE="hidden" NAME="DOC_ROOT" VALUE="/usr/home/dion/www">
      <INPUT TYPE="hidden" NAME="URL_ROOT" VALUE="/dion">
      <INPUT TYPE="hidden" NAME="IGNORE" VALUE="yes">

      These are the same as for web_index.pl

   6. Celebrate, you are done :)
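
The default search settings from step 4 (boolean "OR", case "Insensitive") can be pictured with the following sketch. It is a guess at the intended behaviour written in Python, not code taken from web_search.cgi. The function name matches and the alternative values "AND" and "Sensitive" are assumptions that do not appear in this README.

```python
def matches(text, keywords, boolean="OR", case="Insensitive"):
    """Return True if text satisfies the keyword search.

    boolean="OR" needs any keyword to appear; any other value
    (assumed here to be "AND") needs all of them.
    """
    if case == "Insensitive":
        text = text.lower()
        keywords = [word.lower() for word in keywords]
    hits = [word in text for word in keywords]
    return any(hits) if boolean == "OR" else all(hits)
```

For example, matches("Dion's Perl Page", ["perl", "java"]) succeeds under the defaults because one keyword is present, while the same call with boolean="AND" does not.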


 * ------------------------------------------------------------------------ *
 *     If you have any questions or comments contact Dion Almaer            *
 * ------------------------------------------------------------------------ *
 *          Email Address     |  dion@almaer.com                            *
 *          WWW Page          |  /dion                *
 * ------------------------------------------------------------------------ *
 *   -=<  M E M B E R    S E R V I C E S    I N T E R N A T I O N A L  >=-  *
 * ------------------------------------------------------------------------ *