An article published today provides a fix for the Verity spidering utility, vspider, when used with ColdFusion MX 7. The article supplies additional style files that help populate a collection's metadata fields, such as Title, URL, and Size. It also includes an example of using vspider on Windows.

ColdFusion MX 7: Additional files for using Verity Spider

Note that when running vspider on Unix or Linux, the LD_LIBRARY_PATH environment variable must also be set to include the location of the core Verity binary files. It's often useful to create a script that sets up all the vspider commands, and in that script you can set LD_LIBRARY_PATH to include the Verity bin directory.

An example of running vspider without first adjusting LD_LIBRARY_PATH follows. Note that it fails with a missing dependency even though the dependency library is in the same directory.

bash-2.03# pwd

bash-2.03# ./vspider
./vspider: fatal: open failed: No such file or directory

bash-2.03# ls -l
-rwxrwxr-x 1 nobody other 3893632 Sep 24 2004

bash-2.03# ldd vspider
=> (file not found)
=> (file not found)
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
=> /usr/lib/
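
As a rough sketch of the fix, the library path can be extended before vspider is called. The directory below is the Solaris bin directory used in this post, and add_lib_dir is a hypothetical helper name of my own, not something that ships with Verity; adjust both for your install.

```shell
#!/bin/sh
# Sketch only: VERITY_BIN is the bin directory used elsewhere in this post
# (an assumption about your install; for Linux, _ilnx21 replaces _ssol26).
VERITY_BIN=/opt/coldfusionmx7/verity/k2/_ssol26/bin

# add_lib_dir DIR: print LD_LIBRARY_PATH with DIR appended, skipping the
# append when DIR is already listed (hypothetical helper, not part of Verity)
add_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) printf '%s\n' "$LD_LIBRARY_PATH" ;;
        *) printf '%s\n' "${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
}

LD_LIBRARY_PATH=$(add_lib_dir "$VERITY_BIN"); export LD_LIBRARY_PATH

# With the path set, rerunning ldd should no longer report "file not found":
# ldd "$VERITY_BIN/vspider"
```

Checking for an existing entry keeps the variable from growing each time the script is sourced in the same session.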

An example script for running vspider against localhost is shown below. It restricts the spidering to the web documents under /vspider_target/, a test directory containing a mixture of files with various extensions and content. Note the use of the -start and -include switches to contain the spidering activity. I personally like to append the results to an output file (>> out.txt) as a convenient record of events. Note also how LD_LIBRARY_PATH is set to contain the Verity bin directory.
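CFVERITY=/opt/coldfusionmx7/verity;export CFVERITY
PATH=$PATH:$CFVERITY/k2/_ssol26/bin;export PATH
# add the Verity bin directory to the library search path
LD_LIBRARY_PATH=$CFVERITY/k2/_ssol26/bin${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH};export LD_LIBRARY_PATH
# set the collection name as an input argument and pass to the COLL variable
COLL=$1
# web server port to spider; 8501 is assumed here from the output shown in this post
CFMXPORT=8501
# the vspider command is one logical line, continued here with a backslash
vspider -style $CFVERITY/Data/stylesets/ColdFusionVspider -collection $CFVERITY/collections/$COLL \
-include "*/vspider_target*" -start "http://localhost:$CFMXPORT/vspider_target/" >> out.txt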

For Linux, use _ilnx21 in place of _ssol26.

The script is made executable while logged in as root with chmod u+x, and then executed:
bash-2.03# ./ newSpiderTest1

The contents of the output file are shown here. Observe the Inserting line for each file, and the summary at the end:
bash-2.03# cat out.txt
vspider - Verity, Inc. Version K5.5.0 (_ssol26, Sep 24 2004)
2005/03/31 14:53:52 Info: [vspider] (ind006000) Message database loaded from [/opt/coldfusionmx7/verity/k2/common/ind.msg].
2005/03/31 14:53:55 Info: [vspider] (ind006001) License loaded from [/opt/coldfusionmx7/verity/k2/common/runtime.lic].
2005/03/31 14:53:55 Info: [vspider] (ind005005) Licensed for local spidering.
2005/03/31 14:53:55 Info: [vspider] (ind005008) Not licensed for remote spidering.
2005/03/31 14:53:55 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/].
2005/03/31 14:53:55 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/CodeSweeper.log].
2005/03/31 14:53:55 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/bar.cfm].
2005/03/31 14:53:55 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/baz.cfm].
2005/03/31 14:53:55 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/foo.cfm].
2005/03/31 14:53:56 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/foo.htm].
2005/03/31 14:53:56 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/foo.html].
2005/03/31 14:53:56 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/foo.pdf].
2005/03/31 14:53:56 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/foo.doc].
2005/03/31 14:53:56 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/foo.txt].
2005/03/31 14:53:56 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/qux.cfm].
2005/03/31 14:53:56 Progress: [vspider] (ind031000) Inserting [http://localhost:8501/vspider_target/search.cfm].
2005/03/31 14:54:03 Progress: [vspider] (ind002115) Optimizing VDK collection [/opt/coldfusionmx7/verity/collections/newSpiderTest1].
Progress: [vspider] (ind010020) Vspider summary: Submitted 12 documents for insert,0 documents for deletion,0 documents for update;
Progress: [vspider] (ind010021) Vspider summary: Indexed 12 documents, Deleted 0 documents, 0 bad documents;
Progress: [vspider] (ind010022) Vspider summary: Skipped 1 keys, including 0 duplicate documents rejected;
Progress: [vspider] (ind010023) Vspider summary: Failed to fetch 0 keys.
vspider done
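
Since the output file accumulates a record of every run, it can be handy to pull just the counts out of the summary lines. The following is a minimal sketch; the function names are hypothetical, and the message IDs and wording are taken only from the output above, so other vspider versions may word them differently.

```shell
#!/bin/sh
# Sketch: extract counts from the vspider summary lines in an output file.
# Function names are hypothetical. The IDs ind010021 and ind010023 and the
# surrounding wording are copied from the sample output in this post.

# print the "Indexed N documents" count from the most recent run in FILE
indexed_count() {
    sed -n 's/.*(ind010021).*Indexed \([0-9][0-9]*\) documents.*/\1/p' "$1" | tail -n 1
}

# print the "N bad documents" count from the most recent run in FILE
bad_count() {
    sed -n 's/.*(ind010021).* \([0-9][0-9]*\) bad documents.*/\1/p' "$1" | tail -n 1
}

# print the "Failed to fetch N keys" count from the most recent run in FILE
failed_count() {
    sed -n 's/.*(ind010023).*Failed to fetch \([0-9][0-9]*\) keys.*/\1/p' "$1" | tail -n 1
}

# Example use against out.txt, only if the file is present
if [ -f out.txt ]; then
    echo "indexed=$(indexed_count out.txt) bad=$(bad_count out.txt) failed=$(failed_count out.txt)"
fi
```

Because the script appends with >>, each function keeps only the last matching line, which corresponds to the latest run.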

vspider creates the collection if it does not already exist. Looking at the collection files it created, you'll see they are owned by the user and group root/other. This differs from collections created and indexed through the ColdFusion Administrator alone, which have the user and group set to the ColdFusion runtime user, in this case nobody/nobody for the bookclub collection:
bash-2.03# ls -l /opt/coldfusionmx7/verity/collections/
total 14
drwxr-xr-x 12 nobody nobody 512 Mar 30 13:54 bookclub
-rwxrwxr-x 1 nobody other 0 Jun 5 2003 empty.txt
drwxr-xr-x 12 root other 512 Mar 31 14:53 newSpiderTest1

The permissions here should generally be OK, but if you find any problems in the CFAdmin, use chown -R on the collection directory to set the owner to the ColdFusion runtime user.
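
Before changing ownership, it may help to see which files are actually affected. This is a sketch using find; not_owned_by is a hypothetical helper name, and the nobody user and collection path in the comments are simply the ones from the listing above.

```shell
#!/bin/sh
# Sketch: list entries under a collection directory that are not owned by
# the expected user. not_owned_by is a hypothetical helper name.

# not_owned_by DIR USER: print every path under DIR whose owner is not USER
not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Example, using the runtime user and path from this post (assumptions for
# your system). Run the chown as root only if the listing shows mismatches:
# not_owned_by /opt/coldfusionmx7/verity/collections/newSpiderTest1 nobody
# chown -R nobody:nobody /opt/coldfusionmx7/verity/collections/newSpiderTest1
```

An empty listing means every file already belongs to the runtime user and no chown is needed.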

The vspider utility does not update the ColdFusion configuration files, so ColdFusion is not yet aware of the new collection. Following along with the Technote instructions, enter the CFAdmin and add a collection with the same name. The collection directory will not be overwritten or touched, but doing this creates a corresponding entry in the config file neo-verity.xml so that your application can reference the collection by name. Per the Technote instructions, do not run any other operations on the collection from the CFAdmin, such as indexing, repairing, or purging.

The vspider collection is now searchable via CFSEARCH from your application.

A word of caution: Do not turn on Directory Browsing in your web server when getting started with vspider. If you don't restrict vspider's search scope properly with the -start and -include options, you could easily end up having vspider index and run every single document on your web server. Since I work in tech support, I have a multitude of files that do destructive operations like deleting files or deleting records from a table, or annoying operations like sending out lots of email to myself. I made this mistake early on and ended up causing all kinds of havoc on my system, and I also managed to spam myself pretty good.
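
One cheap safeguard is to sanity-check the scope in the wrapper script before vspider is ever called. The sketch below uses ordinary shell wildcard matching as a stand-in for vspider's own -include matching, which is an assumption on my part and may differ in its details; url_in_scope is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: refuse to start unless the start URL falls inside the include
# pattern. Shell glob matching only approximates vspider's -include
# matching (assumption). url_in_scope is a hypothetical helper name.

# url_in_scope PATTERN URL: succeed if URL matches the wildcard PATTERN
url_in_scope() {
    case "$2" in
        $1) return 0 ;;   # unquoted so the shell treats it as a pattern
        *)  return 1 ;;
    esac
}

# Example guard for the wrapper script:
# url_in_scope "*/vspider_target*" "$START_URL" || { echo "start URL out of scope" >&2; exit 1; }
```

This only catches a mismatched -start and -include pair; it does nothing about links that the spider discovers later, so the -include pattern itself still needs to be tight.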