BasicCrawlerOptions
Properties
handleRequestFunction
Type: HandleRequest
User-provided function that performs the logic of the crawler. It is called for each URL to crawl.
The function receives the following object as an argument:
{
request: Request,
session: Session,
crawler: BasicCrawler,
}
where the Request instance represents the URL to crawl.
The function must return a promise, which is then awaited by the crawler.
If the function throws an exception, the crawler will try to re-crawl the request later, up to maxRequestRetries times. If all the retries
fail, the crawler calls the function provided in the handleFailedRequestFunction option. To make this work, you should always let your
function throw exceptions rather than catch them. The exceptions are logged to the request using the
Request.pushErrorMessage() function.
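The retry behavior described above can be pictured with a small, self-contained model. This is a sketch only, not the crawler's actual implementation: `processWithRetries` is a hypothetical helper, the request is a plain object standing in for a Request instance, and the `errorMessages` array stands in for what Request.pushErrorMessage() records. It assumes one initial attempt followed by up to maxRequestRetries retries.

```javascript
// Simplified model of the retry flow; NOT the real BasicCrawler internals.
// `processWithRetries` is a hypothetical helper used only for illustration.
async function processWithRetries(request, options) {
    const {
        handleRequestFunction,
        handleFailedRequestFunction,
        maxRequestRetries = 3,
    } = options;

    request.errorMessages = request.errorMessages || [];
    let lastError;

    // Assumption: one initial attempt plus up to `maxRequestRetries` retries.
    for (let attempt = 0; attempt <= maxRequestRetries; attempt++) {
        try {
            await handleRequestFunction({ request });
            return true; // handled successfully, no further retries
        } catch (error) {
            lastError = error;
            // Stand-in for Request.pushErrorMessage().
            request.errorMessages.push(error.message);
        }
    }

    // All attempts failed: hand the request and the last error to the failure handler.
    await handleFailedRequestFunction({ request, error: lastError });
    return false;
}
```

Note that the model only works because the handler lets its exceptions propagate. A handler that catches its own errors would look successful to the loop and would never be retried.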
requestList
Type: RequestList
Static list of URLs to be processed. Either requestList or requestQueue option must be provided (or both).
requestQueue
Type: RequestQueue
Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. Either requestList or requestQueue option must be
provided (or both).
handleRequestTimeoutSecs
Type: number = 60
Timeout in which the function passed as handleRequestFunction needs to finish, in seconds.
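As an illustration of what such a timeout means for an async handler, the sketch below races a promise against a timer. `withTimeout` is a hypothetical helper written for this example, and the error message is invented. How the crawler enforces the limit internally is not shown here.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `timeoutSecs` seconds.
function withTimeout(promise, timeoutSecs) {
    let timer;
    const timeout = new Promise((resolve, reject) => {
        timer = setTimeout(
            () => reject(new Error(`Handler timed out after ${timeoutSecs} seconds.`)),
            timeoutSecs * 1000,
        );
    });
    // Whichever settles first wins; the timer is always cleared afterwards.
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In this model a handler that exceeds the limit is treated like one that threw, so the failure would go through the same retry path as any other exception.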
handleFailedRequestFunction
Type: HandleFailedRequest
A function to handle requests that failed more than maxRequestRetries times.
The function receives the following object as an argument:
{
request: Request,
error: Error,
session: Session,
crawler: BasicCrawler,
}
where the Request instance corresponds to the failed request, and the Error instance represents the last error thrown during
processing of the request.
See source code for the default implementation of this function.
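A minimal sketch of a custom failure handler is shown below. It assumes only the `request` and `error` fields of the argument object described above and the `url` property of the request. The `failedRequests` array is a hypothetical in-memory store used for illustration.

```javascript
// Hypothetical in-memory record of permanently failed requests.
const failedRequests = [];

// Called once a request has exhausted all of its retries.
async function handleFailedRequestFunction({ request, error }) {
    failedRequests.push({
        url: request.url,
        message: error.message, // the last error thrown while processing the request
    });
}
```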
maxRequestRetries
Type: number = 3
Indicates how many times the request is retried if
BasicCrawlerOptions.handleRequestFunction fails.
maxRequestsPerCrawl
Type: number
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
autoscaledPoolOptions
Type: AutoscaledPoolOptions
Custom options passed to the underlying AutoscaledPool constructor. Note that the runTaskFunction and
isTaskReadyFunction options are provided by BasicCrawler and cannot be overridden. However, you can provide a custom implementation of
isFinishedFunction.
minConcurrency
Type: number = 1
Sets the minimum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option.
WARNING: If you set this value too high with respect to the available system memory and CPU, your crawler will run extremely slowly or crash. If you're not sure, just keep the default value and the concurrency will scale up automatically.
maxConcurrency
Type: number = 1000
Sets the maximum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option.
useSessionPool
Type: boolean = true
If true, the basic crawler initializes a SessionPool with the corresponding sessionPoolOptions. The session instance will then be
available in the handleRequestFunction.
sessionPoolOptions
Type: SessionPoolOptions
The configuration options for SessionPool to use.
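To show how the properties above fit together, here is a sketch of an options object built as a plain JavaScript value. The handler bodies, the `visitedUrls` array, and the chosen numbers are illustrative assumptions. The requestList or requestQueue entry is left out because creating one requires the SDK, so this object is incomplete until one of them is added.

```javascript
// Hypothetical store used by the example handler.
const visitedUrls = [];

const options = {
    // requestList or requestQueue must be added here before the options are usable.

    // Called for each URL; exceptions are left uncaught so the crawler can retry.
    handleRequestFunction: async ({ request }) => {
        visitedUrls.push(request.url);
    },

    // Called when a request has failed more than maxRequestRetries times.
    handleFailedRequestFunction: async ({ request, error }) => {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },

    handleRequestTimeoutSecs: 60, // documented default
    maxRequestRetries: 3,         // documented default
    maxRequestsPerCrawl: 100,     // illustrative cap to guard against runaway crawls
    minConcurrency: 1,            // documented default
    maxConcurrency: 10,           // illustrative value, well below the default of 1000
    useSessionPool: true,         // documented default
};

// With the SDK installed and a request source added, the object would be
// passed to the crawler constructor, for example: new Apify.BasicCrawler(options)
```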