Web crawling only
-host name_1 [name_n] ...
Limits indexing to the specified host or hosts. You must use only complete text strings for hosts. You cannot use wildcard expressions.
You can list multiple hosts by separating each one with a single space. URLs not on the specified host(s) are not downloaded or parsed.
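For example, a command like the following (the collection path, style path, and host names are placeholders) restricts a crawl to two hosts:
vspider -collection C:\collections\docs -style C:\styles\default -start http://docs.example.com -host docs.example.com www.example.com
A link from docs.example.com to, say, images.example.com would be skipped.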
Web crawling only
-https
Lets you index SSL-enabled (HTTPS) websites.
Web crawling only
-jumps num_jumps
Specifies the maximum number of levels an indexing job can go from the starting URL. Specify a number between 0 and 254.
The default value is unlimited. If you see extremely large numbers of documents in a collection where you do not expect them, consider experimenting with this option, in conjunction with the Content options, to pare down your collection.
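For example, the following hypothetical command indexes the starting page, the pages it links to, and the pages those link to, but goes no deeper:
vspider -collection C:\collections\docs -style C:\styles\default -start http://www.example.com -jumps 2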
-nodocrobo
Specifies to ignore ROBOT META tag directives.
In HTML 3.0 and earlier, robot directives could be given only in the robots.txt file under the root directory of a website. In HTML 4.0, every document can include robot directives embedded in META tags. Use this option to ignore them. Use this option with discretion.
Web crawling only
-nofollow "exp"
Specifies that Verity Spider does not follow any URLs that match the exp expression. If you do not specify an exp value for the -nofollow option, Verity Spider assumes a value of "*", in which case no URLs are followed.
You can use wildcard expressions, where the asterisk (*) matches any string of text and the question mark (?) matches any single character. Always enclose the exp value in double quotation marks to ensure that it is interpreted correctly.
If you use backslashes, you must double them so that they are properly escaped; for example:
C:\\test\\docs\\path
To use regular expressions, also specify the -regexp option.
Earlier versions of Verity Spider did not allow the use of an expression. This meant that for each starting point URL, only the first document would be indexed. With the addition of the expression functionality, you can now selectively skip URLs, even within documents.
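For example, the following hypothetical command skips every URL that contains /archive/ anywhere in its path (collection and style paths are placeholders):
vspider -collection C:\collections\docs -style C:\styles\default -start http://www.example.com -nofollow "*/archive/*"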
Web crawling only
-norobo
Specifies to ignore any robots.txt files encountered. The robots.txt file is used on many websites to specify which parts of the site indexers should avoid. The default is to honor any robots.txt files.
If you are re-indexing a site and the robots.txt file has changed, Verity Spider deletes documents that have been newly disallowed by the robots.txt file.
Use this option with discretion and extreme care, especially in conjunction with the -cgiok option.
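For example, the following hypothetical command crawls the starting host without consulting its robots.txt file:
vspider -collection C:\collections\docs -style C:\styles\default -start http://www.example.com -norobo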
-pathlen num_pathsegments
Limits indexing to the specified number of path segments in the URL or file system path. The path length is determined as follows:
For the following URL, the path length is four; the segments are comics, fun, funny, and world.html (the host name and port do not count):
http://www.spider:80/comics/fun/funny/world.html
For the following file system path, the path length is three; the segments are files, docs, and datasheets (the drive letter does not count):
C:\files\docs\datasheets
The default value is 100 path segments.
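For example, with the following hypothetical command, http://www.example.com/a/b/c/page.html would be indexed (four path segments) but http://www.example.com/a/b/c/d/page.html would be skipped (five):
vspider -collection C:\collections\docs -style C:\styles\default -start http://www.example.com -pathlen 4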
-refreshtime timeunits
Specifies that Verity Spider does not refresh any document that has been indexed within the specified time period.
The following is the syntax for timeunits:
n day n hour n min n sec
Where n is a positive integer. You must include the spaces. Because only the first three letters of each time unit are parsed, you can use either the singular or plural form of each word.
If you specify the following:
-refreshtime 1 day 6 hours
Only those documents that were last indexed at least 30 hours and 1 second ago are refreshed.
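For example, in a nightly update job you might refresh only documents that have not been indexed in the past week (collection and style paths and the URL are placeholders):
vspider -collection C:\collections\docs -style C:\styles\default -start http://www.example.com -refreshtime 7 days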
Web crawling only
-reparse
Forces parsing of all HTML documents already in the collection. You must specify a starting point with the -start option when you use the -reparse option.
Use the -reparse option when you want to include paths and documents that were previously skipped because of exclusion or inclusion criteria. Remember to change the criteria, or there will be little for Verity Spider to do. This is easy to overlook when you use the -cmdfile option.
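For example, after loosening the exclusion criteria for a hypothetical collection, the following command reparses the HTML documents already in it:
vspider -collection C:\collections\docs -style C:\styles\default -start http://www.example.com -reparse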
-unlimited
Specifies that no limits are placed on Verity Spider if neither the -host nor the -domain option is specified. The default is to limit the crawl to the host of the first starting point listed.
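For example, the following hypothetical command lets the crawl follow links off the starting host, because neither -host nor -domain is given:
vspider -collection C:\collections\docs -style C:\styles\default -start http://www.example.com -unlimited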
-virtualhost name_1 [name_n] ...
Specifies that DNS lookups are avoided for the hosts listed. This lets you index by alias, such as when multiple web servers are running on the same host. You must use only complete text strings for hosts; you cannot use wildcard expressions.
Normally, when Verity Spider resolves host names, it uses DNS lookups to convert the names to canonical names, of which there can be only one per computer. This allows duplicate documents to be detected so that results are not diluted. With multiple aliased hosts, however, documents can be referred to by more than one alias and yet remain distinct because of the different alias names, so this resolution is not desirable.
You can have both marketing.verity.com and sales.verity.com running on the same host. Each alias has a different document root, although document names such as index.htm can occur for both. With the -virtualhost option, both server aliases can be indexed as distinct sites. Without the -virtualhost option, they would both be resolved to the same host name, and only the first document encountered from any duplicate pair would be indexed.
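For example, the following hypothetical command indexes the two aliases from the scenario above as distinct sites (collection and style paths are placeholders):
vspider -collection C:\collections\verity -style C:\styles\default -start http://marketing.verity.com http://sales.verity.com -virtualhost marketing.verity.com sales.verity.com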