Adobe ColdFusion 8

Processing options

The following sections describe the Verity Spider processing options.

-abspath

Type

File system only

Generates absolute paths for files. Use this option when the document locations are not going to change, but the collection might be moved around.

When you index a web server's contents through the file system, use the -prefixmap option with the -abspath option to map the absolute file paths to URLs.

See also -prefixmap.

-detectdupfile

Type

File system only

Enables checksum-based detection of duplicates when indexing file systems.

By default, a document checksum is not computed on indexed files. When you use the -detectdupfile option, a checksum is computed using the CRC-32 algorithm. The checksum, combined with the document size, is used to determine whether the document is a duplicate.
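The following Python sketch illustrates the general idea of pairing a CRC-32 checksum with the document size to detect duplicates. It is not Verity Spider's implementation; the function names and sample documents are hypothetical and serve only to show the technique.

```python
import zlib


def fingerprint(data: bytes) -> tuple[int, int]:
    """Return (CRC-32 checksum, size in bytes) for a document's contents."""
    return zlib.crc32(data), len(data)


def find_duplicates(documents: dict[str, bytes]) -> dict[str, str]:
    """Map each duplicate name to the first document with the same fingerprint."""
    seen: dict[tuple[int, int], str] = {}
    duplicates: dict[str, str] = {}
    for name, data in documents.items():
        key = fingerprint(data)
        if key in seen:
            # Same checksum and same size: treat as a duplicate.
            duplicates[name] = seen[key]
        else:
            seen[key] = name
    return duplicates


# Hypothetical sample documents.
docs = {
    "a.txt": b"hello world",
    "copy_of_a.txt": b"hello world",
    "b.txt": b"something else",
}
print(find_duplicates(docs))
```

Comparing the size as well as the checksum reduces the chance that two different documents with the same CRC-32 value are treated as duplicates.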

-indexers

Syntax

-indexers num_indexers

Specifies the maximum number of indexing threads to run on a collection.

The default value is 2. Increasing the value for the -indexers option requires additional CPU and memory resources.

See also

-maxindmem.

-license

Syntax

-license path_and_filename

Specifies the license file to use.

By default, the ind.lic file in the verity_root/platform/bin directory is used, where platform represents the platform directory.

-maxindmem

Syntax

-maxindmem kilobytes

Specifies the maximum amount of memory, in kilobytes, used by each indexing thread. Specify the number of threads with the -indexers option.

By default, each indexing thread uses as much memory as is available from the system.

-maxnumdoc

Syntax

-maxnumdoc num_docs

Specifies the maximum number of documents to download or submit for indexing. The value for num_docs does not necessarily correspond to the number of documents indexed. The following factors affect the actual number:

  • Whether the value of num_docs falls within a block of documents dictated by the -submitsize option. If it does, the entire block of documents must be processed.
  • Whether the retrieved documents are actually indexed. Documents that are invalid or corrupt are not indexed.
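The effect of the first factor can be sketched as simple arithmetic. The following Python example assumes that documents are always processed in whole blocks of the -submitsize value (default 128) and that enough documents are available to fill the last block. The function name is hypothetical, and the result is an estimate, not a guarantee of Verity Spider's behavior.

```python
import math


def docs_processed(num_docs: int, submit_size: int = 128) -> int:
    """Round num_docs up to a whole number of submit blocks."""
    if num_docs <= 0 or submit_size <= 0:
        raise ValueError("num_docs and submit_size must be positive")
    # A partially filled block is still processed in full.
    return math.ceil(num_docs / submit_size) * submit_size


print(docs_processed(200))       # 200 falls inside the second block of 128
print(docs_processed(256))       # exactly two blocks
print(docs_processed(100, 50))   # exactly two blocks of 50
```

Under these assumptions, a -maxnumdoc value of 200 with the default submit size results in 256 documents being processed.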

-mimemap

Syntax

-mimemap path_and_filename

Specifies a control file (simple ASCII text) that maps file extensions to MIME-types. This lets you make custom associations and override defaults.

The following is the format for the control file:

#file_ext_no_dot    mime-type
abc                 application/word
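To make the format concrete, the following Python sketch reads a control file of this shape into a dictionary. It is an illustration only, not the parser that Verity Spider uses. It assumes that lines beginning with # are comments and that each remaining line contains exactly two whitespace-separated fields; the function name is hypothetical.

```python
def parse_mimemap(text: str) -> dict[str, str]:
    """Parse 'extension mime-type' lines into a dictionary."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"expected 2 fields, got {len(parts)}: {line!r}")
        extension, mime_type = parts
        mapping[extension] = mime_type
    return mapping


control_file = """\
#file_ext_no_dot    mime-type
abc                 application/word
"""
print(parse_mimemap(control_file))
```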

-nocache

Type

Web crawling only

Used with the -noindex or -nosubmit options, this option disables the caching of files during website indexing. This has the effect of decreasing the demands on your disk space.

Normally, Verity Spider downloads URLs, then writes them to a bulk insert file and downloads the documents themselves. During indexing, once the number of documents specified by the -submitsize option has been reached, the cached files are indexed and then deleted. If you use the -noindex option, the bulk insert file is submitted but not processed by Verity Spider, so the cached documents are not deleted until indexing occurs. Indexing is then usually performed by mkvdk or collsvc, or you can use Verity Spider again with the -processbif option.

By using the -nocache option in conjunction with the -noindex or -nosubmit option, you avoid storing files locally. Files are downloaded only when indexing actually occurs.

See also

-noindex.

-nodupdetect

Type

Web crawling only

Disables checksum-based detection of duplicates when indexing websites. URL-based duplicate detection is still performed.

By default, a document checksum is computed based on the CRC-32 algorithm. The checksum combined with the document size is used to determine if the document is a duplicate.

See also

-followdup.

-noindex

Specifies that Verity Spider gathers document locations without indexing them. The document locations are stored in a bulk insert file (BIF), which is then submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). The BIF will be processed by the next indexing process run for the collection, whether it is Verity Spider, mkvdk, or collection servicers (collsvc).

Do not try to start Verity Spider and another process at the same time. You must allow Verity Spider time to generate enough work for the secondary indexing process. If you are using mkvdk, you can run it in persistent mode to ensure it will act upon work generated by Verity Spider.

Note: When you execute an indexing job for a collection and you use the -noindex option, the persistent store for the collection is not updated.

See also

-nocache and -nosubmit.

For more information on the mkvdk utility, see Using the mkvdk utility.

-nosubmit

Specifies that Verity Spider gathers document locations without submitting them. The document locations are stored in a bulk insert file (BIF), which is not submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). You can also use Verity Spider again with the -processbif option. With an indexing process other than Verity Spider, you must specify the name and path for the BIF, because the collection has no record of it.

-persist

Syntax

-persist num_seconds

Enables the Verity Spider to run in persistent mode, checking for updates every num_seconds seconds until it is stopped.

While Verity Spider is running in persistent mode, there is no optimization. After Verity Spider is taken out of persistent mode, you need to perform optimization on the collection. For more information about using the mkvdk utility, see Using the mkvdk utility.

Note: Do not run more than one Verity Spider process in persistent mode. As the Verity Spider is a resource-intensive process, only run it in persistent mode with an interval of less than one day. For time intervals greater than twelve hours, use some form of scheduling. Some examples are cron jobs for UNIX, and the AT command for Windows server.

-preferred

Type

Web crawling only

Syntax

-preferred exp_1 [exp_n] ...

Specifies a list of hosts or domains that are preferred when retrieving documents for viewing. You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters. To use regular expressions, also specify the -regexp option. Use this option when you leave duplicate detection enabled and do not specify the -nodupdetect option.

When indexing, you might encounter a nonpreferred host first. In that case, documents are parsed and followed and stored as candidates. When duplicates are encountered on another server, which is preferred, the duplicate documents from the nonpreferred server are skipped. When documents are requested for viewing, they will be retrieved from the preferred server.

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).
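As an illustration of the UNIX quoting rule, the following Python sketch builds the -preferred portion of a command line and wraps each wildcard expression in single quotation marks by using the standard shlex.quote function. The host names are hypothetical, and the result is only a fragment; a real indexing job needs other options that are not shown here.

```python
import shlex


def preferred_args(expressions: list[str]) -> str:
    """Build a '-preferred' fragment with each expression quoted for a UNIX shell."""
    return "-preferred " + " ".join(shlex.quote(e) for e in expressions)


# Hypothetical host expressions containing wildcard characters.
print(preferred_args(["*.example.com", "docs??.example.org"]))
```

Without the quotation marks, the shell could expand the asterisk or question mark against file names in the current directory before Verity Spider receives the argument.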

See also

-regexp.

-prefixmap

Syntax

-prefixmap path_and_filename

Specifies a control file (simple ASCII text) that maps file system paths to web aliases.

In conjunction with the -abspath option, this option is typically used to create a URL field that is the web equivalent of a file system path. File system indexing is faster than web crawling over the network. If you use the -prefixmap option to replace the file system path with the web URL, relative hyperlinks in the HTML pages are kept intact when returned in Verity search results.

The following is the format for the control file:

src_field src_prefix dest_field dest_prefix

If you use backslashes, you must double them so that they are properly escaped; for example:

C:\\test\\docs\\path

For example, to map the filepath /usr/pub/docs to http://web/~verity, use the following:

vdkvgwkey /usr/pub URL http://web/~verity
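The following Python sketch shows one way to read such a mapping line. It assumes that a rule replaces src_prefix at the start of the source field's value with dest_prefix and stores the result in the destination field. This is an interpretation of the format for illustration, not Verity Spider's implementation. The class and function names and the sample document path are hypothetical, and backslash escaping is not handled.

```python
from typing import NamedTuple


class PrefixRule(NamedTuple):
    src_field: str
    src_prefix: str
    dest_field: str
    dest_prefix: str


def parse_rule(line: str) -> PrefixRule:
    """Split one control-file line into its four fields."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return PrefixRule(*parts)


def apply_rule(rule: PrefixRule, fields: dict[str, str]) -> dict[str, str]:
    """Return a copy of fields with dest_field set when src_field starts with src_prefix."""
    result = dict(fields)
    value = fields.get(rule.src_field, "")
    if value.startswith(rule.src_prefix):
        # Assumed semantics: swap the leading prefix, keep the rest of the path.
        result[rule.dest_field] = rule.dest_prefix + value[len(rule.src_prefix):]
    return result


rule = parse_rule("vdkvgwkey /usr/pub URL http://web/~verity")
doc = apply_rule(rule, {"vdkvgwkey": "/usr/pub/docs/index.html"})
print(doc["URL"])
```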

See also

-abspath.

-processbif

Syntax

-processbif 'command_string !*'

Specifies a command string in which you can call a program or script that operates on BIFs generated by Verity Spider.

Due to the use of special characters, which represent the bulk insert file (BIF), you must run Verity Spider with a command file using the -cmdfile option.

For example, if you want to use a script called fix_bif to add customized information to BIF files, use the following command:

vspider -cmdfile filename

Where filename is the text-only command file that contains the following (along with any other necessary options):

-processbif 'fix_bif !*'

Your command file will include other options as well.

-regexp

Specifies the use of regular expressions rather than the default wildcard expressions for the following options: -exclude, -indexclude, -include, -indinclude, -skip, -indskip, -preferred, and -nofollow.

Wildcard expressions allow the use of the asterisk (*) for text strings, and the question mark (?) for single characters, as the following table shows:

Wildcard expression    Text string
a*t                    although, attitude, audit
a?t                    ant, art
file?.htm              files.htm, file1.htm, filer.htm
name?.*                names.txt, named.blank, names.ext

Regular expressions allow for more powerful and flexible matching of alphanumeric strings; for example, to match "ab11" or "ab34" but not "abcd" or "ab11cd," you could use the following regular expression:

^ab[0-9][0-9]$

The full extent to which regular expressions can be employed is beyond the scope of this description. For more information on regular expressions, refer to a book devoted to the subject.
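The regular expression shown above can be checked quickly with Python's re module. This is a sketch for experimenting with the pattern only; the regular expression dialect that Verity Spider supports might differ from Python's in other respects.

```python
import re

# The pattern from the example: "ab" followed by exactly two digits.
pattern = re.compile(r"^ab[0-9][0-9]$")

candidates = ["ab11", "ab34", "abcd", "ab11cd"]
matches = [s for s in candidates if pattern.search(s)]
print(matches)
```

The ^ and $ anchors are what exclude "ab11cd": without them, the pattern would match any string that contains "ab" followed by two digits.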

-submitsize

Syntax

-submitsize num_documents

Specifies the number of documents submitted for indexing at one time. The default value is 128. The upper limit is 64,000.

Note: Although larger values mean more efficient processing by the indexer, smaller values allow more parallelism on multi-CPU systems. In the event of a halt during indexing, a smaller value means fewer documents will be lost.

If a halt occurs during indexing, the chunk of documents specified by the -submitsize option is lost because there is no transactional rollback for indexing and the documents are no longer in the queue for indexing. When you rerun the indexing task, Verity Spider can only continue with URLs and documents that are enqueued.

-temp

Syntax

-temp path

Specifies the directory for temporary files (disk cache). By default, the temp directory is under the job directory (optionally specified with the -jobpath option).

If you do not specify a value for this option, Verity Spider creates a /spider/temp directory within the collection. For multiple-collection tasks, the first collection specified is used.

Note: Make sure the location you specify contains enough disk space to handle the documents that are downloaded and held before indexing. The documents are deleted from the hard disk after they are indexed.

See also

-jobpath, for specifying the location of all indexing job directories and files, one of which is the temp directory.