BasicCrawler
Provides a simple framework for parallel crawling of web pages. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs, enabling recursive crawling of websites.
BasicCrawler is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If you want a
crawler that already facilitates this functionality, please consider using CheerioCrawler,
PuppeteerCrawler or PlaywrightCrawler.
BasicCrawler invokes the user-provided BasicCrawlerOptions.handleRequestFunction for
each Request object, which represents a single URL to crawl. The Request objects are fed from the
RequestList or the RequestQueue instances provided by the
BasicCrawlerOptions.requestList or
BasicCrawlerOptions.requestQueue constructor options, respectively.
If both BasicCrawlerOptions.requestList and
BasicCrawlerOptions.requestQueue options are used, the instance first processes URLs from the
RequestList and automatically enqueues all of them to RequestQueue before it starts their
processing. This ensures that a single URL is not crawled multiple times.
The crawler finishes when there are no more Request objects to crawl.
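As a rough sketch, a crawler combining both sources and enqueueing further URLs for recursive crawling could look like this (the enqueued URL is only a placeholder for links you would extract from the page):

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();
const requestQueue = await Apify.openRequestQueue();

const crawler = new Apify.BasicCrawler({
    requestList,
    requestQueue,
    handleRequestFunction: async ({ request }) => {
        // Download and parse the page here, then enqueue discovered links.
        // The URL below is a placeholder for links extracted from the page.
        await requestQueue.addRequest({ url: 'http://www.example.com/page-2' });
    },
});
await crawler.run();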
New requests are only dispatched when there is enough free CPU and memory available, using the functionality provided by the
AutoscaledPool class. All AutoscaledPool configuration options can be passed to the
autoscaledPoolOptions parameter of the BasicCrawler constructor. For user convenience, the minConcurrency and maxConcurrency
AutoscaledPool options are available directly in the BasicCrawler constructor.
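For instance, the concurrency limits might be set directly on the constructor, with any other AutoscaledPool option passed through autoscaledPoolOptions (the numbers and the desiredConcurrencyRatio option below are only illustrative):

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();

const crawler = new Apify.BasicCrawler({
    requestList,
    // Convenience shortcuts for the corresponding AutoscaledPool options.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Any other AutoscaledPool option goes through autoscaledPoolOptions.
    autoscaledPoolOptions: {
        desiredConcurrencyRatio: 0.9,
    },
    handleRequestFunction: async ({ request }) => {
        // Process the request here.
    },
});
await crawler.run();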
Example usage:
// Prepare a list of URLs to crawl
const requestList = new Apify.RequestList({
    sources: [
        { url: 'http://www.example.com/page-1' },
        { url: 'http://www.example.com/page-2' },
    ],
});
await requestList.initialize();

// Crawl the URLs
const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => {
        // 'request' contains an instance of the Request class
        // Here we simply fetch the HTML of the page and store it to a dataset
        const { body } = await Apify.utils.requestAsBrowser(request);
        await Apify.pushData({
            url: request.url,
            html: body,
        });
    },
});

await crawler.run();
Properties
stats
Type: Statistics
Contains statistics about the current run.
requestList
Type: RequestList
A reference to the underlying RequestList class that manages the crawler's Requests. Only available if
used by the crawler.
requestQueue
Type: RequestQueue
A reference to the underlying RequestQueue class that manages the crawler's Requests. Only available if
used by the crawler.
sessionPool
Type: SessionPool
A reference to the underlying SessionPool class that manages the crawler's Sessions. Only available if
used by the crawler.
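A minimal sketch of enabling the pool, assuming the useSessionPool and sessionPoolOptions constructor options from BasicCrawlerOptions (the maxPoolSize value is illustrative):

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();

const crawler = new Apify.BasicCrawler({
    requestList,
    // useSessionPool and sessionPoolOptions come from BasicCrawlerOptions;
    // the maxPoolSize value is only illustrative.
    useSessionPool: true,
    sessionPoolOptions: { maxPoolSize: 25 },
    handleRequestFunction: async ({ request, session }) => {
        console.log(`Crawling ${request.url} with session ${session.id}`);
    },
});
await crawler.run();
// Once the run has started, crawler.sessionPool references the underlying SessionPool.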
autoscaledPool
Type: AutoscaledPool
A reference to the underlying AutoscaledPool class that manages the concurrency of the crawler. Note that this property is
only initialized after calling the BasicCrawler.run() function. You can use it to change the concurrency settings on the
fly, to pause the crawler by calling AutoscaledPool.pause() or to abort it by calling
AutoscaledPool.abort().
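As an illustrative sketch (the timeout value is arbitrary), the pool can be reached from outside the crawler once the run has started:

const requestList = new Apify.RequestList({
    sources: [{ url: 'http://www.example.com/' }],
});
await requestList.initialize();

const crawler = new Apify.BasicCrawler({
    requestList,
    handleRequestFunction: async ({ request }) => {
        // Process the request here.
    },
});

// crawler.autoscaledPool is only set once run() has started, so check for it
// before use. The 60-second timeout is an arbitrary illustrative value.
setTimeout(() => {
    if (crawler.autoscaledPool) crawler.autoscaledPool.abort();
}, 60 * 1000);

await crawler.run();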
new BasicCrawler(options)
Parameters:
options: BasicCrawlerOptions - All BasicCrawler parameters are passed via an options object.
Internal:
basicCrawler.optionsShape
basicCrawler.log
basicCrawler.sessionPoolOptions
basicCrawler.run()
Runs the crawler. Returns a promise that gets resolved once all the requests are processed.
Returns:
Promise<void>