Before scraping, use Website Checker to check the selected websites for reliability, anti-scraping algorithms and software, and expected compute unit consumption. It is available in countries like USA, UK, UAE, Canada, France, Spain, Germany, Australia, Mexico, Singapore, etc.
It is a simple website crawling tool that lets you scrape website data to check performance and blocking using Playwright, Puppeteer, or Cheerio.
What Are the Features of This Website Checker?
The website data scraper has the following features:
Detects common captchas
Gathers status code responses
Allows selection between Playwright/Puppeteer (web browser) and Cheerio (HTTP) scrapers
Stores HTML snapshots and screenshots if you choose Playwright or Puppeteer
Allows selecting among the Playwright browsers: Firefox, Chrome, and Webkit (Safari)
Allows primary browser and proxy server configuration
Allows enqueueing with pseudo-URLs, a link selector, or re-scraping of the start URLs
Handles network errors, timeouts, and other failed states
How to Use a Website Checker?
The primary task of the website checking tool is to check how often a source website blocks scrapers. Enter a start URL, ideally a product or category page. You can add enqueueing with pseudoUrls + linkSelector or set replicateStartUrls; either is a good way to test various proxy servers.
Pick any combination of run options, and the website crawler will spawn a scraper for each combination of proxy and scraping tool, then combine the results into a single dataset.
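How the run options multiply can be sketched as follows. This is only an illustration, assuming one scraper run per pair of scraping tool and proxy group; the function name plan_runs is hypothetical and not part of the tool, and the proxy group names are borrowed from the example input on this page.

```python
from itertools import product


def plan_runs(checkers, proxy_groups):
    """Return one (checker, proxy_group) pair for every combination."""
    return list(product(checkers, proxy_groups))


# Three scraping tools and two proxy groups give six scraper runs.
runs = plan_runs(["cheerio", "puppeteer", "playwright"], ["SHADER", "RESIDENTIAL"])
for checker, group in runs:
    print(f"{checker} via {group}")
```

Because the number of runs is the product of both lists, adding one more proxy group adds one run per selected scraping tool.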
Finally, the website scraping tool gives you the blocking rate statistics. To confirm that the scraper detected the page status correctly, check the page screenshots. Detailed results for each URL are available in the dataset or key-value store.
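A blocking rate is simply the share of checked pages that were blocked. The sketch below shows the arithmetic on made-up records; the field name wasBlocked is a hypothetical placeholder, since the actual output schema is not described here.

```python
def blocking_rate(results):
    """Return the fraction of checked pages that were blocked (0.0 to 1.0)."""
    if not results:
        return 0.0
    blocked = sum(1 for record in results if record["wasBlocked"])
    return blocked / len(results)


# Made-up records; "wasBlocked" is an assumed field name, not the real schema.
sample = [
    {"url": "https://example.com/a", "wasBlocked": False},
    {"url": "https://example.com/b", "wasBlocked": True},
    {"url": "https://example.com/c", "wasBlocked": False},
    {"url": "https://example.com/d", "wasBlocked": False},
]
print(f"Blocking rate: {blocking_rate(sample):.0%}")
```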
Multiple Configurations and URLs
The Website Analysis Tool does not restrict the number of configurations or websites you can check, and it checks every configuration against every website. Set maxConcurrentDomainsChecked to a reasonable value so that each parallel run fits into the total memory: 8 GB for a Playwright/Puppeteer check and 4 GB for a Cheerio check.
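One rough way to pick a value is to divide the available memory by the per-check figure quoted above. The helper below is a sketch under that assumption; max_parallel_checks is a hypothetical name, and the real memory needs of a run may differ.

```python
# Memory per parallel check in GB, as quoted in the paragraph above.
MEMORY_PER_CHECK_GB = {"playwright": 8, "puppeteer": 8, "cheerio": 4}


def max_parallel_checks(total_memory_gb, checker):
    """Return how many parallel checks of one type fit into the memory budget."""
    return total_memory_gb // MEMORY_PER_CHECK_GB[checker]


print(max_parallel_checks(32, "cheerio"))
print(max_parallel_checks(32, "playwright"))
```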
The input of Website Checker
Check out the input page of the website performance analysis tool for more details. There are reasonable defaults in almost every input field.
For detailed output with HTML links, screenshots, and URLs, see https://api.RealdataAPI.com/v2/key-value-stores/zT3zxpd53Wv9m9ukQ/records/DETAILED-OUTPUT?disableRedirect=true
Changelog
Visit the CHANGELOG to see the history of changes.
To run the code examples, you need a RealdataAPI account. Replace <YOUR_API_TOKEN> in the code with your API token.
import { RealdataAPIClient } from 'RealdataAPI-Client';

// Initialize the RealdataAPIClient with API token
const client = new RealdataAPIClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare actor input
const input = {
    "urlsToCheck": [
        {
            "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
        }
    ],
    "proxyConfiguration": {
        "useRealdataAPIProxy": true,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL"
        ]
    },
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000
};

(async () => {
    // Run the actor and wait for it to finish
    const run = await client.actor("lukaskrivka/website-checker").call(input);

    // Fetch and print actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();
from RealdataAPI_client import RealdataAPIClient

# Initialize the RealdataAPIClient with your API token
client = RealdataAPIClient("<YOUR_API_TOKEN>")

# Prepare the actor input
run_input = {
    "urlsToCheck": [{ "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011" }],
    "proxyConfiguration": {
        "useRealdataAPIProxy": True,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL",
        ],
    },
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000,
}

# Run the actor and wait for it to finish
run = client.actor("lukaskrivka/website-checker").call(run_input=run_input)

# Fetch and print actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare actor input
cat > input.json <<'EOF'
{
  "urlsToCheck": [
    {
      "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
    }
  ],
  "proxyConfiguration": {
    "useRealdataAPIProxy": true,
    "RealdataAPIProxyGroups": [
      "SHADER",
      "BUYPROXIES94952",
      "RESIDENTIAL"
    ]
  },
  "repeatChecksOnProvidedUrls": 10,
  "maxNumberOfPagesCheckedPerDomain": 1000
}
EOF

# Run the actor
curl "https://api.RealdataAPI.com/v2/acts/lukaskrivka~website-checker/runs?token=$API_TOKEN" \
  -X POST \
  -d @input.json \
  -H 'Content-Type: application/json'
The urlsToCheck input is a static list of URLs to check for captchas. Enable the Use request queue option to add new links on the fly.
Proxy Server Configuration
proxyConfiguration Optional Object
Specify the proxy servers that help your scraper hide its original IP address and check websites successfully.
Puppeteer
checkers.puppeteer Optional Boolean
Crawl websites with Puppeteer.
Cheerio
checkers.cheerio Optional Boolean
Crawl websites with Cheerio.
Playwright
checkers.playwright Optional Boolean
Crawl websites with Playwright.
Enabled
saveSnapshot Optional Boolean
Stores HTML + screenshots for Playwright/Puppeteer and HTML only for Cheerio.
Enqueue Any Domain URL
enqueueAllOnDomain Optional Boolean
It will enqueue any URL on the domain.
Link Selector
linkSelector Optional String
A CSS selector that says which links on the page to follow and add to the request queue. It only applies when the request queue is enabled. To filter which of these URLs are enqueued, use the pseudo-URLs setting.
The scraper will ignore page links if the link selector is blank.
Pseudo-URLs
pseudoUrls Optional Array
Specifies which of the links found by the link selector should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in [] brackets. This setting only applies when the request queue option is enabled. If you omit pseudo-URLs, the scraper enqueues all URLs found by the link selector.
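To illustrate the idea of regular expressions inside [] brackets, here is a minimal sketch that turns a pseudo-URL into a Python regular expression. The function pseudo_url_to_regex is a hypothetical helper written for this example, not the tool's own implementation, and it assumes the bracketed part does not itself contain a ] character.

```python
import re


def pseudo_url_to_regex(pseudo_url):
    """Compile a pseudo-URL: text inside [] is a regex, the rest is literal."""
    pattern = ""
    # Splitting on a capturing group alternates literal text (even indexes)
    # and bracketed regex fragments (odd indexes).
    for index, part in enumerate(re.split(r"\[(.*?)\]", pseudo_url)):
        pattern += part if index % 2 else re.escape(part)
    return re.compile(f"^{pattern}$")


matcher = pseudo_url_to_regex(r"https://example.com/products/[\d+]")
print(bool(matcher.match("https://example.com/products/123")))
print(bool(matcher.match("https://example.com/products/abc")))
```

Escaping the literal parts keeps characters such as the dot in the domain from being treated as regex wildcards.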
Check Provided URLs Multiple Times
repeatChecksOnProvidedUrls Optional Integer
The scraper will check every provided URL this many times. It is helpful for getting past blocking of the first webpage or for testing the same URL repeatedly.
Maximum Pages Checked For Every Domain
maxNumberOfPagesCheckedPerDomain Optional Integer
The maximum number of web pages the scraper will load per domain. Once the limit is reached, the website checker stops. It is best practice to set this limit so that the checker does not use excessive platform credits and storage. Remember that the scraper may load slightly more pages than the limit.
There is no restriction if you set the limit to zero.
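The zero-means-unlimited rule can be sketched like this. The function limit_reached is a hypothetical helper for illustration only; the tool's internal check may differ.

```python
def limit_reached(pages_checked, max_pages_per_domain):
    """Return True once the per-domain page limit is hit; 0 means no limit."""
    if max_pages_per_domain == 0:
        return False
    return pages_checked >= max_pages_per_domain


print(limit_reached(999, 1000))
print(limit_reached(1000, 1000))
print(limit_reached(5000, 0))
```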
The maximum number of pages the scraper processes in parallel for a single domain. The website scraping tool automatically adjusts the concurrency based on available system resources. The maximum concurrency option lets you set an upper limit to reduce the load on the selected domain.
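An upper limit on parallel page loads is commonly enforced with a semaphore. The sketch below shows that general technique with Python's asyncio, using a short sleep as a stand-in for loading a page; it is not the tool's own code, and check_pages is a hypothetical name.

```python
import asyncio


async def check_pages(urls, max_concurrency):
    """Process all URLs, never more than max_concurrency at once.

    Returns the highest number of pages that were in progress together.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    running = 0
    peak = 0

    async def check(url):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for loading the page
            running -= 1

    await asyncio.gather(*(check(url) for url in urls))
    return peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
peak = asyncio.run(check_pages(urls, max_concurrency=3))
print(f"Peak parallel pages: {peak}")
```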
Count of Maximum Concurrent Domains Checked
maxConcurrentDomainsChecked Optional Integer
The maximum number of domains checked at once. It is relevant when you pass multiple URLs for checking at the same time.
Retire Web Browser Instance After Counting Requests
Sets how often the browser rotates itself, counted in requests. Pick a lower number for more frequent rotation and higher consumption, or a higher number for lower consumption.
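Retiring a browser after a fixed number of requests can be pictured with a simple counter. The class below is a hypothetical model written for this example; it only tracks instance numbers and does not launch a real browser.

```python
class RotatingBrowser:
    """Toy model: a new browser instance replaces the old one every N requests."""

    def __init__(self, retire_after_requests):
        self.retire_after_requests = retire_after_requests
        self.instance_id = 1
        self.requests_served = 0

    def handle_request(self):
        """Return the id of the browser instance that serves this request."""
        if self.requests_served >= self.retire_after_requests:
            # Retire the current instance and start a fresh one.
            self.instance_id += 1
            self.requests_served = 0
        self.requests_served += 1
        return self.instance_id


browser = RotatingBrowser(retire_after_requests=2)
served_by = [browser.handle_request() for _ in range(5)]
print(served_by)
```

With a limit of 2, five requests are spread over three instances, which shows why a lower number means more browser launches.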
Navigation Timeout
navigationTimeoutSecs Optional Integer
The maximum time, in seconds, that a request waits for the page to load. If the website checker fails to load the page within this time, the browser reports an error and the request fails.
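The timeout behaviour can be sketched with asyncio.wait_for, which cancels a task that runs too long. The page loaders here are stand-ins that only sleep, and load_with_timeout is a hypothetical helper, not the tool's API.

```python
import asyncio


async def load_with_timeout(load_page, navigation_timeout_secs):
    """Run load_page() and report a failure if it exceeds the timeout."""
    try:
        await asyncio.wait_for(load_page(), timeout=navigation_timeout_secs)
        return "loaded"
    except asyncio.TimeoutError:
        return "failed: navigation timeout"


async def slow_page():
    await asyncio.sleep(0.2)  # stand-in for a page that loads too slowly


async def fast_page():
    await asyncio.sleep(0)  # stand-in for a page that loads immediately


print(asyncio.run(load_with_timeout(slow_page, 0.05)))
print(asyncio.run(load_with_timeout(fast_page, 0.05)))
```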
Headful Browser
puppeteer.headful Optional Boolean
It only works for the Puppeteer category.
Use Chrome
puppeteer.useChrome Optional Boolean
It only works for the Puppeteer category. Note that Chrome may not work with Puppeteer.
Wait For
puppeteer.waitFor Optional String
It only works for the Puppeteer category. The scraper will wait on every webpage. Provide either a number of milliseconds or a selector.
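Since the same string field accepts either a duration or a selector, one plausible way to tell them apart is to treat an all-digit value as milliseconds. The helper below is a sketch under that assumption; parse_wait_for is a hypothetical name, and the tool may interpret the value differently.

```python
def parse_wait_for(value):
    """Classify a waitFor string as a timeout in ms or a selector."""
    text = value.strip()
    if text.isdigit():
        return ("timeout_ms", int(text))
    return ("selector", text)


print(parse_wait_for("2000"))
print(parse_wait_for("#product-list"))
```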
Memory
puppeteer.memory Optional Integer
The memory must be a power of 2 between 128 and 32768.
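The memory constraint can be checked with a small bit trick: a positive integer is a power of 2 exactly when it has a single bit set. The function is_valid_memory is a hypothetical helper for illustration.

```python
def is_valid_memory(value):
    """Return True if value is a power of 2 between 128 and 32768 inclusive."""
    in_range = 128 <= value <= 32768
    # A power of 2 has one bit set, so clearing the lowest set bit gives 0.
    is_power_of_two = value > 0 and value & (value - 1) == 0
    return in_range and is_power_of_two


for candidate in (128, 3000, 4096, 65536):
    print(candidate, is_valid_memory(candidate))
```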
Chrome
playwright.chrome Optional Boolean
Use Chrome to check.
Firefox
playwright.firefox Optional Boolean
Use Firefox to check.
Safari
playwright.webkit Optional Boolean
Use Webkit (Safari) to check.
Use Chrome In Place Of Chromium
playwright.useChrome Optional Boolean
It only works for the Playwright category. Please remember that Chrome may not work with Playwright.
Headful Browser
playwright.headful Optional Boolean
Whether the web browser runs in headful mode.
Wait For
playwright.waitFor Optional String
It only works for the Playwright category. The scraper will wait on every page. Provide either a number of milliseconds or a selector.
Memory
playwright.memory Optional Integer
The memory must be a power of 2 between 128 and 32768.