The following sections describe the Verity Spider networking options.
Web crawling only
-agentname string
Specifies the value for the agent name field that is part of the HTTP request. Since web servers can be configured to return different versions of the same page depending on the requesting agent, you can use the -agentname option to impersonate a browser client.
Use double-quotation marks if the name contains a space. Use the -cmdfile option if the agent name you want to use contains forbidden characters, such as slashes or backslashes.
-connections num_connections
Specifies the maximum number of simultaneous socket connections to make to websites for indexing. Each connection implies a separate thread.
The default value is 6.
Web crawling only
-delay num_milliseconds
Specifies the minimum time between HTTP requests, in milliseconds. The default value is 0 milliseconds (no delay).
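The effect of a minimum delay between requests can be illustrated with a short Python sketch. This is not Verity Spider's implementation: the class name RequestPacer and its parameters are hypothetical, and the sketch assumes a single global delay rather than one per host.

```python
import time


class RequestPacer:
    """Enforce a minimum interval, in milliseconds, between requests."""

    def __init__(self, delay_ms=0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_ms / 1000.0  # convert milliseconds to seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least the configured delay has passed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() before each request makes consecutive requests at least delay_ms apart. With the default of 0, wait() never sleeps.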
Web crawling only
-header string
Specifies an HTTP header to add to the request; for example:
-header "Referer: http://www.verity.com/"
Verity Spider sends some predefined headers, such as Accept and User-Agent, by default. Special headers are sometimes necessary to correctly index a site.
For example, earlier versions of Verity Spider did not support the Host header, which is needed for Virtual Host indexing. Also, a Proxy-authentication header was needed to pass a username and password to a proxy server. In the current version of Verity Spider, the Host header is supported by default, and the -proxyauth option is available for proxy server authentication. Therefore, the -header option is maintained only for backwards compatibility and possible future enhancements.
-hostcache num_hostnames
Specifies the number of host names to cache to avoid DNS lookups. Without this option, the host cache continues to grow.
The default value is 256.
Web crawling only
-noflowctrl
Disables network flow control, that is, round-robin indexing of websites.
By default, Verity Spider uses round-robin indexing of websites to avoid overwhelming a web server and to improve indexing performance. Verity Spider connects to each web server in a round-robin manner, using up to the value for the -connections option. This means that one URL is fetched from each web server, in turn.
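The round-robin order described above can be illustrated with a minimal Python sketch. It is not Verity Spider's implementation: the function name round_robin_order is hypothetical, and the sketch ignores the -connections limit and shows only the ordering of one URL per web server in turn.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def round_robin_order(urls):
    """Return urls reordered so that one URL is taken from each host in turn."""
    queues = OrderedDict()  # host -> queue of pending URLs, in first-seen order
    for url in urls:
        host = urlsplit(url).netloc
        queues.setdefault(host, deque()).append(url)

    ordered = []
    while queues:
        # Visit every host once per pass, dropping hosts that run out of URLs.
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

Interleaving hosts this way means that no single web server receives a burst of consecutive requests while other servers sit idle.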
Web crawling only
-noproxy name_1 [name_n] ...
Used in conjunction with the -proxy option, the -noproxy option specifies that Verity Spider directly access the hosts whose names match those specified. By default, when you specify the -proxy option, Verity Spider first tries to access every host with the proxy information. To improve performance, use the -noproxy option for the hosts you know can be accessed without a proxy host. For the name variable, you can use the asterisk (*) wildcard for text strings; for example:
'*.verity.com'
You cannot use the question mark (?) wildcard, and the -regexp option does not enable regular expressions for this option.
In Windows, include double-quotation marks around the argument to protect the asterisk special character (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).
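The asterisk-only matching described for the -noproxy option can be sketched in Python as follows. This is an illustration, not Verity Spider's code: the function name matches_noproxy is hypothetical, and case-insensitive comparison is an assumption made here because host names are not case-sensitive.

```python
import re


def matches_noproxy(host, patterns):
    """Return True if host matches any pattern; only '*' is a wildcard."""
    for pattern in patterns:
        # Escape everything literally, then let each '*' match any run of text.
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.fullmatch(regex, host, flags=re.IGNORECASE):
            return True
    return False
```

Because every character other than the asterisk is escaped, a question mark or any regular-expression syntax in a pattern is treated as literal text, which mirrors the restriction stated above.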
Web crawling only
-proxy proxyhost:port
Specifies the host and port of the proxy server.
See also -proxyauth for proxy servers that require authentication, and -noproxy for hosts that you know are accessible without having to go through a proxy server.
Web crawling only
-proxyauth login:password
Specifies login information for proxy server connections that require authorization to get outside the firewall. Use this option in conjunction with the -proxy option.
Web crawling only
-retry num_retries
Specifies the number of times that Verity Spider should attempt to access a URL. Use the -retry option when it is likely that an unstable network connection will give false rejections.
The default value is 4.
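The retry behavior can be pictured with a small Python sketch. It is a simplified illustration under stated assumptions, not Verity Spider's implementation: the function name fetch_with_retry and the fetch callable are hypothetical, and the sketch treats the option value as the total number of attempts.

```python
def fetch_with_retry(fetch, url, num_retries=4):
    """Call fetch(url) up to num_retries times; return the first success."""
    if num_retries < 1:
        raise ValueError("num_retries must be at least 1")
    last_error = None
    for _ in range(num_retries):
        try:
            return fetch(url)
        except OSError as error:  # network-level failures such as resets
            last_error = error
    # Every attempt failed, so report the most recent failure.
    raise last_error
```

A URL that fails once or twice because of a transient network problem is still retrieved, while a URL that fails on every attempt is rejected.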
Web crawling only
-timeout num_seconds
Specifies the time period, in seconds, that Verity Spider should wait before timing out on a network connection and on accessing data. The data access time-out is automatically twice the value you specify for the network connection time-out.
The default value for the network connection time-out is 30 seconds, so the default value for the data access time-out is 60 seconds.
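The relationship between the two time-out values can be written out as a short Python sketch. The function name effective_timeouts and the dictionary keys are hypothetical and used only for illustration.

```python
def effective_timeouts(timeout_seconds=30):
    """Return both time-outs, in seconds, implied by a -timeout value."""
    return {
        "connection": timeout_seconds,
        # The data access time-out is always twice the connection time-out.
        "data_access": 2 * timeout_seconds,
    }
```

For example, specifying -timeout 45 would imply a 45-second network connection time-out and a 90-second data access time-out.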