Adobe ColdFusion 8

-host

Type

Web crawling only

Syntax

-host name_1 [name_n] ...

Limits indexing to the specified host or hosts. You must use only complete text strings for hosts. You cannot use wildcard expressions.

You can list multiple hosts by separating each one with a single space. URLs not on the specified host(s) are not downloaded or parsed.
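The exact-match behavior described above can be illustrated with a small sketch. This is not Verity Spider code; the function name `host_allowed`, the example host names, and the case-insensitive comparison are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def host_allowed(url, allowed_hosts):
    """Return True if the URL's host exactly matches one of the allowed hosts."""
    # urlsplit().hostname is already lowercased and excludes the port.
    host = urlsplit(url).hostname or ""
    # Complete strings only: an entry such as "*.example.com" is compared
    # literally and therefore never matches a real host (assumed behavior).
    return host in {h.lower() for h in allowed_hosts}


allowed = ["www.example.com", "docs.example.com"]
print(host_allowed("http://www.example.com/index.html", allowed))
print(host_allowed("http://other.example.com/index.html", allowed))
```

In this sketch, a URL whose host is not in the list is simply rejected, which mirrors the statement that such URLs are neither downloaded nor parsed.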

-https

Type

Web crawling only

Lets you index SSL-enabled websites.

Note: You must have the Verity SSL Option Pack installed to use the -https option. The Verity SSL Option Pack is a Verity Spider add-on available separately from a Verity salesperson.

-jumps

Type

Web crawling only

Syntax

-jumps num_jumps

Specifies the maximum number of levels an indexing job can go from the starting URL. Specify a number between 0 and 254.

The default value is unlimited. If you see extremely large numbers of documents in a collection where you do not expect them, consider experimenting with this option, in conjunction with the Content options, to pare down your collection.
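As a rough illustration of how a jump limit bounds a crawl, the following sketch walks a link graph breadth-first and stops following links once the limit is reached. The `links` dictionary, the function name `crawl`, and the reading that a limit of 0 visits only the starting URL are assumptions for illustration, not documented Verity Spider internals.

```python
from collections import deque


def crawl(start, links, max_jumps=None):
    """Breadth-first walk of a link graph, at most max_jumps levels from start.

    links is a hypothetical dict mapping each URL to the URLs it links to.
    max_jumps=None stands in for the unlimited default.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_jumps is not None and depth >= max_jumps:
            continue  # limit reached: do not follow links from this document
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


site = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl("a", site, max_jumps=2))
```

Lowering the limit shrinks the set of visited documents, which is why the option can help pare down an unexpectedly large collection.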

-nodocrobo

Directs Verity Spider to ignore robots META tag directives.

In HTML 3.0 and earlier, robot directives could only be given as the file robots.txt under the root directory of a website. In HTML 4.0, every document can have robot directives embedded in the META field. Use this option to ignore them. Use this option with discretion.

-nofollow

Type

Web crawling only

Syntax

-nofollow "exp"

Specifies that Verity Spider does not follow any URLs that match the exp expression. If you do not specify an exp value for the -nofollow option, Verity Spider assumes a value of "*", so no URLs are followed.

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters. Always encapsulate the exp values in double-quotation marks to ensure that they are properly interpreted.
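The wildcard rules above can be sketched by translating an expression into a regular expression. This is only an illustration: the function names are hypothetical, and matching the expression against the entire URL is an assumption rather than documented behavior.

```python
import re


def wildcard_to_regex(exp):
    """Translate an expression that uses * and ? into a compiled regex."""
    parts = []
    for ch in exp:
        if ch == "*":
            parts.append(".*")  # asterisk: any text string, including empty
        elif ch == "?":
            parts.append(".")   # question mark: exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def should_skip(url, exp="*"):
    """Return True if the URL matches exp; the default "*" matches everything."""
    return wildcard_to_regex(exp).fullmatch(url) is not None


print(should_skip("http://www.example.com/cgi-bin/search.cgi", "*.cgi"))
print(should_skip("http://www.example.com/index.html", "*.cgi"))
```

With the default expression, every URL matches, which corresponds to the case where no URLs are followed.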

If you use backslashes, you must double them so that they are properly escaped; for example:

C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Earlier versions of Verity Spider did not allow the use of an expression. This meant that for each starting point URL, only the first document would be indexed. With the addition of the expression functionality, you can now selectively skip URLs, even within documents.

See also

-regexp

-norobo

Type

Web crawling only

Directs Verity Spider to ignore any robots.txt files it encounters. Many websites use the robots.txt file to specify which parts of the site indexers should avoid. The default is to honor any robots.txt files.

If you are re-indexing a site and the robots.txt file has changed, Verity Spider deletes documents that have been newly disallowed by the robots.txt file.
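To show what honoring or ignoring a robots.txt file means in practice, the following sketch uses Python's standard `urllib.robotparser` module. The robots.txt content, the agent name `example-spider`, and the `honor_robots` flag are hypothetical stand-ins; the sketch does not reproduce how Verity Spider itself evaluates these files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_index(url, honor_robots=True, agent="example-spider"):
    """Return True if the URL may be indexed under the given robots policy."""
    if not honor_robots:
        return True  # corresponds to ignoring robots.txt entirely
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(may_index("http://www.example.com/private/report.html"))
print(may_index("http://www.example.com/private/report.html", honor_robots=False))
```

The second call returns True only because the exclusion rules are bypassed, which is the situation the caution about this option addresses.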

Use this option with discretion and extreme care, especially in conjunction with the -cgiok option.

See also

-nodocrobo

-pathlen

Syntax

-pathlen num_pathsegments

Limits indexing to the specified number of path segments in the URL or file system path. The path length is determined as follows:

  • The host name and drive letter are not included; for example, neither www.spider.com:80/ nor C:\ would be included in determining the path length.
  • All elements following the host name are included.
  • The actual filename, if present, is included; for example, /world.html would be included in determining the path length.
  • Any directory paths between the host and the actual filename are included.

Example

For the following URL, the path length would be four:

http://www.spider:80/comics/fun/funny/world.html
                     <-1--> <2> <-3-> <---4---->

For the following file system path, the path length would be three:

C:\files\docs\datasheets
   <-1-> <2-> <---3---->

The default value is 100 path segments.
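The counting rules above can be checked with a short sketch. The function name `path_length` is hypothetical, and the handling of drive letters and separators is a simplified assumption that covers only the forms shown in the examples.

```python
from urllib.parse import urlsplit


def path_length(location):
    """Count path segments, excluding the host name and port or drive letter."""
    if "://" in location:
        # URL: the host and port live in the network location, not the path.
        path = urlsplit(location).path
    else:
        # File system path: drop a leading drive letter such as "C:".
        path = location[2:] if location[1:2] == ":" else location
        path = path.replace("\\", "/")
    # Directories and the filename, if present, each count as one segment.
    return len([segment for segment in path.split("/") if segment])


print(path_length("http://www.spider:80/comics/fun/funny/world.html"))  # 4
print(path_length(r"C:\files\docs\datasheets"))  # 3
```

A document would then fall outside the limit when its count exceeds the num_pathsegments value.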

-refreshtime

Syntax

-refreshtime timeunits

Specifies that documents indexed within the period given by the timeunits value are not refreshed.

The following is the syntax for timeunits:

n day n hour n min n sec

Where n is a positive integer. You must include spaces. Because only the first three letters of each time unit are parsed, you can use the singular or plural form of the word.

If you specify the following:

-refreshtime 1 day 6 hours

Only those documents that were last indexed at least 30 hours and 1 second ago are refreshed.
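The arithmetic behind this example can be sketched with a small parser that reads only the first three letters of each unit. The function name `parse_timeunits`, the error handling, and the case-insensitive comparison are assumptions for illustration; they are not taken from Verity Spider.

```python
from datetime import timedelta

# Seconds per unit, keyed by the first three letters of the unit name.
UNIT_SECONDS = {"day": 86400, "hou": 3600, "min": 60, "sec": 1}


def parse_timeunits(text):
    """Convert a string such as "1 day 6 hours" into a timedelta."""
    tokens = text.split()
    if not tokens or len(tokens) % 2 != 0:
        raise ValueError("expected space-separated pairs of number and unit")
    total = 0
    for number, unit in zip(tokens[0::2], tokens[1::2]):
        key = unit[:3].lower()  # "hour" and "hours" both become "hou"
        if key not in UNIT_SECONDS:
            raise ValueError(f"unknown time unit: {unit}")
        count = int(number)
        if count <= 0:
            raise ValueError("n must be a positive integer")
        total += count * UNIT_SECONDS[key]
    return timedelta(seconds=total)


window = parse_timeunits("1 day 6 hours")
print(window.total_seconds() / 3600)  # 30.0
```

Documents whose last indexed time falls inside this 30-hour window would be left alone, and older ones would be refreshed.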

Note: This option is valid only with the -refresh option. When you use vsdb -recreate, the last indexed date is cleared.

-reparse

Type

Web crawling only

Forces parsing of all HTML documents already in the collection. You must specify a starting point with the -start option when you use the -reparse option.

You can use the -reparse option when you want to include paths and documents that were previously skipped due to exclusion or inclusion criteria. Remember to change the criteria, or there will be little for Verity Spider to do. This can be easy to overlook when you are using the -cmdfile option.

-unlimited

Specifies that no limits are placed on Verity Spider if neither the -host nor the -domain option is specified. The default is to limit based on the host of the first starting point listed.

-virtualhost

Syntax

-virtualhost name_1 [name_n] ...

Specifies that DNS lookups are avoided for the hosts listed. This lets you index by alias, such as when multiple web servers are running on the same host. You must use only complete text strings for hosts. You cannot use wildcard expressions. You can use regular expressions.

Normally, when Verity Spider resolves host names, it uses DNS lookups to convert the names to canonical names, of which there can be only one per computer. This allows for the detection of duplicate documents, to prevent results from being diluted. In the case of multiple aliased hosts, however, duplication is not a barrier as documents can be referred to by more than one alias and yet remain distinct because of the different alias names.

Example

You can have both marketing.verity.com and sales.verity.com running on the same host. Each alias has a different document root, although document names such as index.htm can occur for both. With the -virtualhost option, both server aliases can be indexed as distinct sites. Without the -virtualhost option, they would both be resolved to the same host name, and only the first document encountered from any duplicate pair would be indexed.
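The effect described in this example can be modeled with a toy sketch. Everything here is hypothetical: the `CANONICAL` table stands in for a DNS lookup, `host1.verity.com` is an invented canonical name, and the key format is illustrative rather than the actual Verity Spider document key.

```python
# Hypothetical alias-to-canonical-name table standing in for DNS lookups.
CANONICAL = {
    "marketing.verity.com": "host1.verity.com",
    "sales.verity.com": "host1.verity.com",
}


def document_key(host, path, virtual_hosts=()):
    """Build a key used to detect duplicate documents."""
    if host not in virtual_hosts:
        # Normal case: resolve the alias to its single canonical name.
        host = CANONICAL.get(host, host)
    return f"http://{host}{path}"


aliases = ["marketing.verity.com", "sales.verity.com"]

# Without virtual hosts, both index.htm documents share one key.
print({document_key(h, "/index.htm") for h in aliases})

# With virtual hosts, each alias keeps its own key.
print({document_key(h, "/index.htm", virtual_hosts=aliases) for h in aliases})
```

In the first case the second document would look like a duplicate of the first; in the second case both remain distinct.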

Note: If you are using Netscape Enterprise Server, and you have specified only the host name as a virtual host, Verity Spider cannot index the virtual host site. This is because Verity Spider always adds the domain name to the document key.