HTML::EmailExtractor - Scraping email addresses from website pages

Overview of the scraper

HTML::EmailExtractor collects email addresses from specified pages. It can navigate through a site's internal pages down to a specified depth, which allows it to cover all pages of the site while collecting internal and external links. The scraper has built-in means of bypassing CloudFlare protection, and Chrome can be chosen as the engine for scraping emails from pages where data is loaded by scripts. It is capable of reaching speeds of up to 250 requests per minute, which is 15,000 links per hour.
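
Conceptually, the depth-limited crawl resembles the following Python sketch. It is purely illustrative and is not A-Parser's implementation: the email regex, the link extraction, and the internal-link check are all simplified assumptions.

import re
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(url, depth, seen=None, emails=None):
    # Visit `url`, collect email-like strings, then follow internal
    # links until the depth limit is exhausted (cf. Parse to level).
    seen = set() if seen is None else seen
    emails = set() if emails is None else emails
    if url in seen:
        return emails
    seen.add(url)
    html = urlopen(url).read().decode("utf-8", "replace")
    emails.update(EMAIL_RE.findall(html))
    if depth > 0:
        host = urlparse(url).netloc
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)
            if urlparse(link).netloc == host:  # internal links only
                crawl(link, depth - 1, seen, emails)
    return emails

print(crawl("https://a-parser.com/pages/support/", depth=1))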

Use cases for the scraper

Scraping emails from a website with page navigation down to a specified depth
  1. Add the Parse to level option and select the required value (depth limit) from the list.
  2. In the Queries section, check the Unique queries option.
  3. In the Results section, check the Unique per line option.
  4. As a query, specify the link to the website from which you need to scrape emails.
Download example

How to import an example into A-Parser

eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr

Scraping emails from a database of websites with page navigation down to a specified depth
  1. Add the Parse to level option and select the required value (depth limit) from the list.
  2. In the Queries section, check the Unique queries option.
  3. In the Results section, check the Unique per line option.
  4. As queries, specify the links to the websites from which you need to scrape emails, or select File in Queries from and upload a file with the website database.
Download example

How to import an example into A-Parser

eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr

Scraping emails from a database of links
  1. In the Queries section, check the Unique queries option.
  2. In the Results section, check the Unique per line option.
  3. As queries, specify the links from which you need to scrape emails, or select File in Queries from and upload a file with the link database.
Download example

How to import an example into A-Parser

eJxtU01z0zAQ/S+aHmAmOPTAxbc00wwwaV3a9BRyEPE6COuLXSkpePLfWTmOHZfe
tG/fvv1UI4Kkmh4QCAKJfN0I375FLkqoZNRBTISXSIDJvRafV3fLPL81Uunbl4By
Gxwy5UzebCaCBfhJC4dGJqErf511qr3zSe5h5dhZKQ0DvGDrXhpIUaUMkLxZ1Qq9
e5+Fl6Qgy1IF5azUpwypriHrs1W/Y4qngMrumM8mKqAFOsNwgFYkgX/OFa7FVWsL
lolt/LdTjMgDRpgI4moX3DGUvaOSmtijAqDkERQ+lcR4I5ydab2EPeiB1srfRKVL
nuOs4qAvXeDblOI/jWPf4WWqPeABuYZepbVuirshqnRLt+PGreO2tTIqsE1zF23a
zUcGawDfj+0+0YxD6NN0yl12PhUPtmTmsLWZH6BRG6PNjMGts5XaFdwAqhLOzGhX
fI+FnTvjNaS+bNSat0LwOFzIjLo1JGMo8HXwvE0xuuTgnKavT6dSPSq+wE+pQMOT
vMzaSW6l1s+Py0uPGC6KjZ8heMqn08PhkNV/DaWlZhin3+3Z8wMl4Bjy6Mq4DVuw
4bXLOKpZwoxRqSv5IUBNY5hMpqkVEKnUADvHN8yDPG76P9v/7Obtn5s3R76RX/Rw
oqeBJjJjvBniAxD59fEfH7B6cg==

Collected data

Example of collected data

  • Email addresses
  • Total number of addresses on the page
  • Array with all collected pages (used when Use Pages option is enabled)
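
As a hypothetical illustration of these fields for one scraped page (the names and values below are invented for demonstration; the exact variable names depend on the scraper's result options):

page_result = {
    "mails": ["info@example.com", "sales@example.com"],  # email addresses
    "mailcount": 2,                         # total number on the page
    "pages": ["https://example.com/", "https://example.com/contacts"],
                              # filled when the Use Pages option is enabled
}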

Capabilities

  • Multi-page scraping (pagination)
  • Navigation through internal site pages down to a specified depth (option Parse to level) – allows covering all site pages, collecting internal and external links
  • Choice of which links to follow (option Follow links)
  • Limit on page transitions (option Follow links limit)
  • Ability to consider subdomains as internal site pages (see the sketch after this list)
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of site encodings to UTF-8
  • CloudFlare protection bypass
  • Choice of engine (HTTP or Chrome)
  • Supports all the functionality of HTML::LinkExtractor
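
As an illustration of the Follow links and Subdomains are internal behavior, a link can be classified as internal either by an exact host match or by also allowing subdomains of the site's domain. A minimal sketch with simplified logic of my own, not the scraper's code:

from urllib.parse import urlparse

def is_internal(link, site, subdomains_are_internal=False):
    # Exact host match: https://example.com/a is internal for example.com.
    link_host = urlparse(link).netloc.lower()
    site_host = urlparse(site).netloc.lower()
    if link_host == site_host:
        return True
    # With the option enabled, any subdomain also counts as internal,
    # e.g. forum.example.com for example.com.
    return subdomains_are_internal and link_host.endswith("." + site_host)

print(is_internal("https://forum.example.com/x", "https://example.com"))        # False
print(is_internal("https://forum.example.com/x", "https://example.com", True))  # True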

Use cases

  • Email address scraping
  • Displaying the number of email addresses

Queries

As queries, specify links to pages, for example:

https://a-parser.com/pages/support/

Output results examples

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows it to output results in any form, including structured formats such as CSV or JSON.
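
For example, assuming the collected addresses are exposed as a mails array with a mail field (an assumption consistent with the $mailcount variable shown below; check the scraper's result options for the exact names), each address could be printed on its own line with a result format like:

$p1.mails.format('$mail\n')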

Displaying the number of email addresses

Result format:

$mailcount

Example of result:

4

Possible settings

Parameter Name | Default Value | Description
Good status | All | Selects which server response is considered successful. If another response is received during scraping, the request is repeated with a different proxy
Good code RegEx | - | Ability to specify a regular expression to check the response code
Ban Proxy Code RegEx | - | Ability to temporarily ban a proxy (for the Proxy ban time) based on the server response code
Method | GET | Request method
POST body | - | Content sent to the server when using the POST method. Supports the variables $query (URL request), $query.orig (original request), and $pagenum (page number when the Use Pages option is used)
Cookies | - | Ability to specify cookies for the request
User agent | Automatically substituted user-agent of the current Chrome version | User-Agent header sent when requesting pages
Additional headers | - | Ability to specify custom request headers, with support for templating and variables from the request builder
Read only headers | - | Read headers only. In some cases this saves traffic when the content does not need to be processed
Detect charset on content | - | Detect the charset based on the content of the page
Emulate browser headers | - | Emulate browser headers
Max redirects count | 0 | Maximum number of redirects the scraper will follow
Follow common redirects | - | Allows http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit
Max cookies count | 16 | Maximum number of cookies to save
Engine | HTTP (Fast, JavaScript Disabled) | Choice between the HTTP engine (faster, JavaScript disabled) and Chrome (slower, JavaScript enabled)
Chrome Headless | - | If enabled, the browser window is not displayed
Chrome DevTools | - | Allows the use of Chromium debugging tools
Chrome Log Proxy connections | - | If enabled, information about Chrome proxy connections is logged
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded
Use HTTP/2 transport | - | Whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1
Don't verify TLS certs | - | Disables TLS certificate validation
Randomize TLS Fingerprint | - | Allows bypassing site bans based on TLS fingerprint
Bypass CloudFlare | - | Automatic bypass of CloudFlare checks
Bypass CloudFlare with Chrome (Experimental) | - | Bypass CloudFlare checks through Chrome
Bypass CloudFlare with Chrome Max Pages | 20 | Maximum number of pages when bypassing CloudFlare through Chrome
Subdomains are internal | - | Whether to treat subdomains as internal links
Follow links | Internal only | Which links to follow
Follow links limit | 0 | Limit on the number of links to follow, applied to each unique domain
Skip comment blocks | - | Whether to skip comment blocks
Search Cloudflare protected e-mails | - | Whether to scrape Cloudflare-protected email addresses (see the decoding sketch after this table)
Skip non-HTML blocks | - | Do not collect email addresses inside non-HTML blocks (script, style, comments, etc.)
Skip meta tags | - | Do not collect email addresses in meta tags
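
The obfuscation scheme targeted by the Search Cloudflare protected e-mails option is publicly documented behavior of Cloudflare's email protection: the data-cfemail attribute holds a hex string whose first byte is an XOR key for the remaining bytes. A minimal Python sketch of the decoding (illustrative, not the scraper's code):

def decode_cfemail(hex_str):
    # The first byte of the hex string is the XOR key;
    # the remaining bytes are the obfuscated address.
    data = bytes.fromhex(hex_str)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")

# The hex value below encodes info@example.com with key 0x42:
print(decode_cfemail("422b2c242d02273a232f322e276c212d2f"))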