BasicCrawler
Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.
BasicCrawler is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If you want a
crawler that already facilitates this functionality, please consider using CheerioCrawler,
PuppeteerCrawler or PlaywrightCrawler.
BasicCrawler invokes the user-provided BasicCrawlerOptions.handleRequestFunction for
each Request object, which represents a single URL to crawl. The Request objects are fed from the
RequestList or the RequestQueue instances provided by the
BasicCrawlerOptions.requestList or
BasicCrawlerOptions.requestQueue constructor options, respectively.
If both BasicCrawlerOptions.requestList and
BasicCrawlerOptions.requestQueue options are used, the instance first processes URLs from the
RequestList and automatically enqueues all of them to RequestQueue before it starts their
processing. This ensures that a single URL is not crawled multiple times.
The crawler finishes when there are no more Request objects to crawl.
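As a rough sketch, a crawler combining both sources and enqueueing further URLs for recursive crawling could look like this (the enqueued URL is only a placeholder for links you would extract from the page):

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();
const requestQueue = await Apify.openRequestQueue();

const crawler = new Apify.BasicCrawler({
    requestList,
    requestQueue,
    handleRequestFunction: async ({ request }) => {
        // Download and parse the page here, then enqueue discovered links.
        // The URL below is a placeholder for links extracted from the page.
        await requestQueue.addRequest({ url: 'http://www.example.com/page-2' });
    },
});
await crawler.run();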
New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
AutoscaledPool class. All AutoscaledPool configuration options can be passed to the
autoscaledPoolOptions parameter of the BasicCrawler constructor. For user convenience, the minConcurrency and maxConcurrency
AutoscaledPool options are available directly in the BasicCrawler constructor.
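For instance, the concurrency limits might be set directly on the constructor, with any other AutoscaledPool option passed through autoscaledPoolOptions (the numbers and the desiredConcurrencyRatio option below are only illustrative):

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();

const crawler = new Apify.BasicCrawler({
    requestList,
    // Convenience shortcuts for the corresponding AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option goes through autoscaledPoolOptions.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9,
    },
    handleRequestFunction: async ({ request }) => {
        // Process the request here.
    },
});
await crawler.run();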
Example usage:
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [
        { url: 'http://www.example.com/page-1' },
        { url: 'http://www.example.com/page-2' },
    ],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => {
        // 'request' contains an instance of the Request class
        // Here we simply fetch the HTML of the page and store it to a dataset
        const { body } = await Apify.utils.requestAsBrowser(request);
        await Apify.pushData({
            url: request.url,
            html: body,
        });
    },
});

await crawler.run();
Properties
stats
Type: Statistics
Contains statistics about the current run.
requestList
Type: RequestList
A reference to the underlying RequestList class that manages the crawler's Requests. Only available if
used by the crawler.
requestQueue
Type: RequestQueue
A reference to the underlying RequestQueue class that manages the crawler's Requests. Only available if
used by the crawler.
sessionPool
Type: SessionPool
A reference to the underlying SessionPool class that manages the crawler's Sessions. Only available if
used by the crawler.
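A minimal sketch of enabling the pool, assuming the useSessionPool and sessionPoolOptions constructor options from BasicCrawlerOptions (the maxPoolSize value is illustrative):

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();

const crawler = new Apify.BasicCrawler({
    requestList,
    // useSessionPool and sessionPoolOptions come from BasicCrawlerOptions;
    // the maxPoolSize value is only illustrative.
    useSessionPool: true,
    sessionPoolOptions: { maxPoolSize: 25 },
    handleRequestFunction: async ({ request, session }) => {
        console.log(`Crawling ${request.url} with session ${session.id}`);
    },
});
await crawler.run();
// Once the run has started, crawler.sessionPool references the underlying SessionPool.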
autoscaledPool
Type: AutoscaledPool
A reference to the underlying AutoscaledPool class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the BasicCrawler.run() function. You can use it to change the concurrency settings on the
fly, to pause the crawler by calling AutoscaledPool.pause() or to abort it by calling
AutoscaledPool.abort().
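As an illustrative sketch (the timeout value is arbitrary), the pool can be reached from outside the crawler once the run has started:

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();

const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => {
        // Process the request here.
    },
});

// crawler.autoscaledPool is only set once run() has started, so check for it
// before use. The 60-second timeout is an arbitrary illustrative value.
setTimeout(() => {
    if (crawler.autoscaledPool) crawler.autoscaledPool.abort();
}, 60 * 1000);

await crawler.run();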
new BasicCrawler(options)
Parameters:
options: BasicCrawlerOptions - All BasicCrawler parameters are passed via an options object.
Internal:
basicCrawler.optionsShape
basicCrawler.log
basicCrawler.sessionPoolOptions
basicCrawler.run()
Runs the crawler. Returns a promise that gets resolved once all the requests are processed.
Returns:
Promise<void>