The following sections describe the Verity Spider processing options.
File system only
-abspath
Generates absolute paths for files. Use this option when the document locations are not going to change, but the collection might be moved around.
When you index a web server's contents through the file system, use the -prefixmap option with the -abspath option to map the absolute file paths to URLs.
See also -prefixmap.
File system only
-detectdupfile
Enables checksum-based detection of duplicates when indexing file systems.
By default, a document checksum is not computed on indexed files. When you use the -detectdupfile option, a checksum is computed based on the CRC-32 algorithm. The checksum combined with the document size is used to determine whether the document is a duplicate.
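The checksum-plus-size comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Verity Spider's implementation: the function names `fingerprint` and `find_duplicates` are hypothetical, and the sketch assumes the whole document fits in memory.

```python
import zlib


def fingerprint(data: bytes) -> tuple[int, int]:
    """Return (CRC-32 checksum, size in bytes) for a document's content."""
    return zlib.crc32(data) & 0xFFFFFFFF, len(data)


def find_duplicates(documents: dict[str, bytes]) -> dict[str, str]:
    """Map each duplicate document name to the first document seen
    with the same (checksum, size) fingerprint."""
    seen: dict[tuple[int, int], str] = {}
    duplicates: dict[str, str] = {}
    for name, data in documents.items():
        key = fingerprint(data)
        if key in seen:
            duplicates[name] = seen[key]
        else:
            seen[key] = name
    return duplicates


docs = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}
print(find_duplicates(docs))
```

Pairing the checksum with the size makes an accidental match less likely than a CRC-32 comparison alone, because two different documents must collide on both values.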
-indexers num_indexers
Specifies the maximum number of indexing threads to run on a collection.
The default value is 2. Increasing the value for the -indexers option requires additional CPU and memory resources.
-license path_and_filename
Specifies the license file to use.
By default, the ind.lic file is used, from the verity_root/platform/bin directory, where platform represents the platform directory.
-maxindmem kilobytes
Specifies the maximum amount of memory, in kilobytes, used by each indexing thread. Specify the number of threads with the -indexers option.
By default, each indexing thread uses as much memory as is available from the system.
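Because -maxindmem applies to each indexing thread, the ceiling for all indexing threads together is the per-thread limit multiplied by the -indexers value. The following Python sketch shows that arithmetic; the function name and the sample values are hypothetical, and the result bounds only the indexing threads, not the memory used by the whole Verity Spider process.

```python
def indexing_memory_ceiling_kb(num_indexers: int, maxindmem_kb: int) -> int:
    """Upper bound on memory used by all indexing threads, in kilobytes."""
    if num_indexers < 1 or maxindmem_kb < 1:
        raise ValueError("both values must be positive")
    return num_indexers * maxindmem_kb


# Hypothetical settings: -indexers 4 -maxindmem 65536 (64 MB per thread)
total_kb = indexing_memory_ceiling_kb(4, 65536)
print(f"{total_kb} KB = {total_kb // 1024} MB")
```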
-maxnumdoc num_docs
Specifies the maximum number of documents to download or submit for indexing. The value for num_docs does not necessarily correspond to the number of documents indexed. The following factors affect the actual number:
-mimemap path_and_filename
Specifies a control file (simple ASCII text) that maps file extensions to MIME-types. This lets you make custom associations and override defaults.
The following is the format for the control file:
#file_ext_no_dot mime-type
abc application/word
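To illustrate the control file format, the following Python sketch reads such a file into an extension-to-MIME-type table. It rests on two assumptions that the description above only suggests: that lines starting with # are comments, and that the two fields are separated by whitespace. The function name `parse_mimemap` is hypothetical.

```python
def parse_mimemap(text: str) -> dict[str, str]:
    """Parse 'extension mime-type' lines into a dictionary."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'extension mime-type', got: {line!r}")
        extension, mime_type = parts
        mapping[extension] = mime_type
    return mapping


sample = "#file_ext_no_dot mime-type\nabc application/word\n"
print(parse_mimemap(sample))
```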
Web crawling only
-nocache
Used with the -noindex or -nosubmit options, this option disables the caching of files during website indexing. This has the effect of decreasing the demands on your disk space.
Normally, Verity Spider downloads URLs, writes them to a bulk insert file, and downloads the documents themselves. Once the number of documents set by the -submitsize option has been reached, indexing occurs, and the cached files are indexed and then deleted. If you use the -noindex option, the bulk insert file is submitted but not processed by Verity Spider, so the documents are not deleted until a separate indexing process runs. That process is usually mkvdk or collsvc, or you can run Verity Spider again with the -processbif option.
By using the -nocache option in conjunction with the -noindex or -nosubmit option, you avoid storing files locally. Files are downloaded only when indexing actually occurs.
Web crawling only
-nodupdetect
Disables checksum-based detection of duplicates when indexing websites. URL-based duplicate detection is still performed.
By default, a document checksum is computed based on the CRC-32 algorithm. The checksum combined with the document size is used to determine if the document is a duplicate.
-noindex
Specifies that Verity Spider gathers document locations without indexing them. The document locations are stored in a bulk insert file (BIF), which is then submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). The BIF is processed by the next indexing process run for the collection, whether that is Verity Spider, mkvdk, or collsvc.
Do not try to start Verity Spider and another process at the same time. You must allow Verity Spider time to generate enough work for the secondary indexing process. If you are using mkvdk, you can run it in persistent mode to ensure it will act upon work generated by Verity Spider.
For more information on the mkvdk utility, see Using the mkvdk utility.
-nosubmit
Specifies that Verity Spider gathers document locations without submitting them. The document locations are stored in a bulk insert file (BIF), which is not submitted to the collection. This option is typically used in conjunction with a separate indexing process, such as mkvdk or collection servicers (collsvc). You can also use Verity Spider again with the -processbif option. With an indexing process other than Verity Spider, you must specify the name and path for the BIF, because the collection has no record of it.
-persist num_seconds
Enables Verity Spider to run in persistent mode, checking for updates every num_seconds seconds until it is stopped.
While Verity Spider is running in persistent mode, there is no optimization. After Verity Spider is taken out of persistent mode, you need to perform optimization on the collection. For more information about using the mkvdk utility, see Using the mkvdk utility.
Web crawling only
-preferred exp_1 [exp_n] ...
Specifies a list of hosts or domains that are preferred when retrieving documents for viewing. You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters. To use regular expressions, also specify the -regexp option. Use this option when you leave duplicate detection enabled and do not specify the -nodupdetect option.
When indexing, you might encounter a nonpreferred host first. In that case, documents are parsed and followed and stored as candidates. When duplicates are encountered on another server, which is preferred, the duplicate documents from the nonpreferred server are skipped. When documents are requested for viewing, they will be retrieved from the preferred server.
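The candidate-and-replace behavior described above can be sketched as follows. This is a simplified model rather than Verity Spider's actual logic: the function names are hypothetical, each document is represented by an opaque fingerprint string, and Python's `fnmatchcase` stands in for the wildcard matching (it also accepts bracket expressions, which the option description does not mention).

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_preferred(url: str, patterns: list[str]) -> bool:
    """True if the URL's host matches any preferred-host wildcard."""
    host = urlsplit(url).hostname or ""
    return any(fnmatchcase(host, pattern) for pattern in patterns)


def choose_sources(crawl, patterns):
    """crawl: iterable of (url, fingerprint) pairs in crawl order.
    Returns {fingerprint: url}, favoring preferred hosts for duplicates."""
    chosen = {}
    for url, fp in crawl:
        current = chosen.get(fp)
        if current is None:
            # First sighting: keep it as a candidate, preferred or not.
            chosen[fp] = url
        elif is_preferred(url, patterns) and not is_preferred(current, patterns):
            # A duplicate on a preferred host replaces the earlier candidate.
            chosen[fp] = url
    return chosen


crawl = [
    ("http://mirror.example.org/a.html", "fp1"),
    ("http://www.example.com/a.html", "fp1"),
    ("http://mirror.example.org/b.html", "fp2"),
]
print(choose_sources(crawl, ["*.example.com"]))
```

In this sketch, a.html is served from www.example.com even though the mirror was crawled first, while b.html stays on the mirror because no preferred duplicate was found.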
In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).
-prefixmap path_and_filename
Specifies a control file (simple ASCII text) that maps file system paths to web aliases.
In conjunction with the -abspath option, this option is typically used to create a URL field that is the web equivalent of a file system path. File system indexing is faster than web crawling over the network. If you use the -prefixmap option to replace the file system path with the web URL, relative hyperlinks in the HTML pages are kept intact when returned in Verity search results.
The following is the format for the control file:
src_field src_prefix dest_field dest_prefix
If you use backslashes, you must double them so that they are properly escaped; for example:
C:\\test\\docs\\path
For example, to map the filepath /usr/pub/docs to http://web/~verity, use the following:
vdkvgwkey /usr/pub/docs URL http://web/~verity
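As a rough illustration of what a prefix mapping does, the following Python sketch parses one control file line in the four-field format and applies it to a document's fields. The names `PrefixRule`, `parse_rule`, and `apply_rule` are hypothetical, whitespace-separated fields are an assumption, and the sketch does not handle the backslash doubling mentioned above.

```python
from typing import NamedTuple


class PrefixRule(NamedTuple):
    src_field: str
    src_prefix: str
    dest_field: str
    dest_prefix: str


def parse_rule(line: str) -> PrefixRule:
    """Parse 'src_field src_prefix dest_field dest_prefix'."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected four fields, got: {line!r}")
    return PrefixRule(*parts)


def apply_rule(rule: PrefixRule, fields: dict[str, str]) -> dict[str, str]:
    """Return a copy of fields with dest_field filled in when the
    source field value starts with src_prefix."""
    result = dict(fields)
    value = fields.get(rule.src_field, "")
    if value.startswith(rule.src_prefix):
        result[rule.dest_field] = rule.dest_prefix + value[len(rule.src_prefix):]
    return result


# Maps the file path /usr/pub/docs to http://web/~verity
rule = parse_rule("vdkvgwkey /usr/pub/docs URL http://web/~verity")
doc = {"vdkvgwkey": "/usr/pub/docs/guide/index.html"}
print(apply_rule(rule, doc)["URL"])
```

Only the prefix is swapped, so the remainder of the path (here /guide/index.html) carries over unchanged, which is what keeps relative hyperlinks intact.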
-processbif 'command_string !*'
Specifies a command string in which you can call a program or script that operates on BIFs generated by Verity Spider.
Due to the use of special characters, which represent the bulk insert file (BIF), you must run Verity Spider with a command file using the -cmdfile option.
For example, if you want to use a script called fix_bif to add customized information to BIF files, use the following command:
vspider -cmdfile filename
Where filename is the text-only command file that contains the following (along with any other necessary options):
-processbif 'fix_bif !*'
Your command file will include other options as well.
-regexp
Specifies the use of regular expressions rather than the default wildcard expressions for the following options: -exclude, -indexclude, -include, -indinclude, -skip, -indskip, -preferred, and -nofollow.
Wildcard expressions allow the use of the asterisk (*) for text strings, and the question mark (?) for single characters, as the following table shows:
Wildcard expression | Text string |
---|---|
a*t | although, attitude, audit |
a?t | ant, art |
file?.htm | files.htm, file1.htm, filer.htm |
name?.* | names.txt, named.blank, names.ext |
Regular expressions allow for more powerful and flexible matching of alphanumeric strings; for example, to match "ab11" or "ab34" but not "abcd" or "ab11cd," you could use the following regular expression:
^ab[0-9][0-9]$
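The expression can be checked quickly with Python's `re` module. This is only a convenience for experimenting; the regular expression dialect that Verity Spider accepts may differ from Python's in other constructs.

```python
import re

# ^ and $ anchor the match to the whole string; [0-9] matches one digit.
pattern = re.compile(r"^ab[0-9][0-9]$")

candidates = ["ab11", "ab34", "abcd", "ab11cd"]
matches = [s for s in candidates if pattern.search(s)]
print(matches)  # ['ab11', 'ab34']
```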
The full extent to which regular expressions can be employed is beyond the scope of this description. For more information on regular expressions, refer to a book devoted to the subject.
-submitsize num_documents
Specifies the number of documents submitted for indexing at one time. The default value is 128. The upper limit is 64,000.
If a halt occurs during indexing, the chunk of documents specified by the -submitsize option is lost because there is no transactional rollback for indexing and the documents are no longer in the queue for indexing. When you rerun the indexing task, Verity Spider can only continue with URLs and documents that are enqueued.
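The following Python sketch models how a queue is consumed in chunks of the -submitsize value, which is why a halt costs at most one chunk. It is a conceptual model only: the function name `batches` is hypothetical, and the lower bound of 1 is an assumption (only the default of 128 and the upper limit of 64,000 come from the description above).

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(queue: Iterable[str], submit_size: int = 128) -> Iterator[list[str]]:
    """Yield successive chunks of at most submit_size documents."""
    if not 1 <= submit_size <= 64000:
        raise ValueError("submit_size must be between 1 and 64000")
    it = iter(queue)
    while True:
        # Items taken here leave the queue; if processing stops before
        # the chunk is indexed, they are not available on the next run.
        chunk = list(islice(it, submit_size))
        if not chunk:
            return
        yield chunk


urls = [f"doc{i}" for i in range(300)]
sizes = [len(chunk) for chunk in batches(urls)]
print(sizes)  # [128, 128, 44]
```

A smaller -submitsize value reduces the number of documents at risk in a halt, at the cost of more frequent submissions.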
-temp path
Specifies the directory for temporary files (disk cache). By default, the temp directory is under the job directory (optionally specified with the -jobpath option).
If you do not specify a value for this option, Verity Spider creates a /spider/temp directory within the collection. For multiple-collection tasks, the first collection specified is used.
See also -jobpath, which specifies the location of all indexing job directories and files, one of which is the temp directory.