Website Checker - Scrape & Analyze Website Data

RealdataAPI / website-checker

Use Website Checker to test selected websites for reliability, anti-scraping measures, and underlying software, and to estimate the expected compute unit consumption before scraping them. It is available in countries such as the USA, UK, UAE, Canada, France, Spain, Germany, Australia, Mexico, and Singapore.

What is a Website Checker?

It is a simple website crawling tool that lets you scrape website data to check a site's performance and blocking behavior using Playwright, Puppeteer, or Cheerio.

What Are the Features of This Website Checker?

The website checker has the following features:

  • Detects common captchas
  • Collects response status codes
  • Lets you choose between the Playwright/Puppeteer (web browser) and Cheerio (plain HTTP) scrapers
  • Stores HTML snapshots and screenshots if you choose Playwright or Puppeteer
  • Lets you select among the Playwright browsers: Firefox, Chrome, and WebKit (Safari)
  • Allows basic browser and proxy server configuration (see the input sketch after this list)
  • Supports enqueueing with pseudo-URLs and a link selector, or re-checking the start URLs
  • Handles network errors, timeouts, and other failed states
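
As a quick illustration of selecting checkers and browsers, here is a minimal input sketch. The field names mirror the sample input at the end of this page; the target URL is a placeholder:

// Minimal input sketch: run the Cheerio (HTTP) and Playwright Firefox
// (browser) checkers against one URL and store HTML + screenshots.
const input = {
    urlsToCheck: [{ url: 'https://www.example.com/' }],  // placeholder URL
    'checkers.cheerio': true,
    'checkers.playwright': true,
    'playwright.firefox': true,
    saveSnapshot: true,
};
// Pass `input` to client.actor(...).call(input) as in the examples below.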

How to Use a Website Checker?

The primary task of the website checker is to find out how often a source website blocks scrapers. Enter a start URL, typically the home page or a first product or category page. You can also enable enqueueing with pseudoUrls + linkSelector, or set repeatChecksOnProvidedUrls to re-check the start URLs; both are good ways to test various proxy servers.
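
As an illustration, here is a hedged sketch of such an input. Most field names come from the input reference below, but the pseudo-URL entry shape ({ purl: ... }) and the example URLs are assumptions, not taken from this page:

// Enqueueing sketch: start from one URL and follow category links.
const input = {
    urlsToCheck: [{ url: 'https://www.example.com/' }],               // placeholder start URL
    linkSelector: 'a[href]',                                          // follow all links...
    pseudoUrls: [{ purl: 'https://www.example.com/category/[.*]' }],  // ...matching this pattern (the purl key is assumed)
    repeatChecksOnProvidedUrls: 5,  // alternatively, just re-check the start URLs
};
// Pass `input` to client.actor(...).call(input) as in the examples below.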

Pick any combination of run options, and the website checker will spawn a scraper for each proxy and scraping-tool combination, then merge the results into a single dataset.

Ultimately, the tool gives you blocking-rate statistics. To verify that the checker detected each page's status accurately, review the page screenshots. You will find a detailed result for each URL in the dataset and key-value store.

Multiple Configurations and URLs

The Website Checker has no restriction on how many configurations and websites it can test; it checks every configuration against every website. For that to work, set maxConcurrentDomainsChecked so that each parallel run fits into the total memory: 8 GB per run for a Playwright/Puppeteer check and 4 GB for a Cheerio check.
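
For example, a minimal sizing sketch (the 8 GB and 4 GB figures come from the paragraph above; the concurrency value is illustrative):

// Rough sizing sketch: total memory scales with concurrency.
const memoryPerRunGB = 8;               // 8 for Playwright/Puppeteer checks, 4 for Cheerio
const maxConcurrentDomainsChecked = 2;  // input field documented below
const totalMemoryGB = memoryPerRunGB * maxConcurrentDomainsChecked;
console.log(`Plan for about ${totalMemoryGB} GB of memory`); // about 16 GB here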

Input of Website Checker

Check out the input page of the website checker for more details. Almost every input field has a reasonable default.

Output Example of Website Checker

Simple output
{ "timeouted": 0, "failedToLoadOther": 9, "accessDenied": 0, "recaptcha": 0, "distilCaptcha": 24, "hCaptcha": 0, "statusCodes": { "200": 3, "401": 2, "403": 5, "405": 24 }, "success": 3, "total": 43 }

Open https://api.RealdataAPI.com/v2/key-value-stores/zT3zxpd53Wv9m9ukQ/records/DETAILED-OUTPUT?disableRedirect=true for the detailed output with HTML links, screenshots, and URLs.

Changelog

Visit the CHANGELOG to see the history of changes.

Industries

Check out how industries are using Website Checker around the world.

E-commerce & Retail

To run the code examples, you need a RealdataAPI account. Replace <YOUR_API_TOKEN> in the code with your API token.

Node.js

import { RealdataAPIClient } from 'RealdataAPI-Client';

// Initialize the RealdataAPIClient with API token
const client = new RealdataAPIClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare actor input
const input = {
    "urlsToCheck": [
        {
            "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
        }
    ],
    "proxyConfiguration": {
        "useRealdataAPIProxy": true,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL"
        ]
    },
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000
};

(async () => {
    // Run the actor and wait for it to finish
    const run = await client.actor("lukaskrivka/website-checker").call(input);

    // Fetch and print actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();

Python

from RealdataAPI_client import RealdataAPIClient

# Initialize the RealdataAPIClient with your API token
client = RealdataAPIClient("<YOUR_API_TOKEN>")

# Prepare the actor input
run_input = {
    "urlsToCheck": [{ "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011" }],
    "proxyConfiguration": {
        "useRealdataAPIProxy": True,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL",
        ],
    },
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000,
}

# Run the actor and wait for it to finish
run = client.actor("lukaskrivka/website-checker").call(run_input=run_input)

# Fetch and print actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

curl

# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare actor input
cat > input.json <<'EOF'
{
  "urlsToCheck": [
    {
      "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
    }
  ],
  "proxyConfiguration": {
    "useRealdataAPIProxy": true,
    "RealdataAPIProxyGroups": [
      "SHADER",
      "BUYPROXIES94952",
      "RESIDENTIAL"
    ]
  },
  "repeatChecksOnProvidedUrls": 10,
  "maxNumberOfPagesCheckedPerDomain": 1000
}
EOF

# Run the actor
curl "https://api.RealdataAPI.com/v2/acts/lukaskrivka~website-checker/runs?token=$API_TOKEN" /
  -X POST /
  -d @input.json /
  -H 'Content-Type: application/json'

URLs To Check

urlsToCheck Required Array

A static list of URLs to check for captchas. To add new URLs on the fly, enable the Use request queue option.

Proxy Server Configuration

proxyConfiguration Optional Object

Specify a proxy server to let the checker hide its real IP address and check websites successfully.

Puppeteer

checkers.puppeteer Optional Boolean

Crawl websites with Puppeteer.

Cheerio

checkers.cheerio Optional Boolean

Crawl websites with Cheerio.

Playwright

checkers.playwright Optional Boolean

Crawl websites with Playwright.

Enabled

saveSnapshot Optional Boolean

It will store HTML + screenshots for Playwright/Puppeteer and HTML for Cheerio.

Enqueue Any Domain URL

enqueueAllOnDomain Optional Boolean

Enqueues every URL found on the same domain.

Link Selector

linkSelector Optional String

A CSS selector that tells the checker which links on the page to follow and add to the request queue. This setting applies only when the request queue is enabled. Use the pseudo-URLs setting to filter the URLs that end up in the queue.

If the link selector is empty, the checker ignores all links on the page.

Pseudo-URLs

pseudoUrls Optional Array

Specifies which of the links found by the link selector should be added to the request queue. A pseudo-URL is a URL that contains regular expressions enclosed in [] brackets, for example https://www.example.com/products/[.*] (a hypothetical pattern). This setting applies only when the request queue is enabled. If no pseudo-URLs are provided, the checker enqueues every URL found by the link selector.

Check Provided URLs Multiple Times

repeatChecksOnProvidedUrls Optional Integer

The checker will visit every provided URL this many times. This is useful to bypass blocking of the first page or to test the same URL repeatedly.

Maximum Pages Checked For Every Domain

maxNumberOfPagesCheckedPerDomain Optional Integer

The maximum number of web pages the checker will load per domain. Once the limit is reached, the checker stops. It is good practice to set this limit so the run loads only a bounded number of pages, preventing excessive usage of platform credits and storage. Note that the checker may load slightly more pages than the limit.

There is no restriction if you set the limit to zero.

Check Maximum Concurrent Pages For Each Domain

maxConcurrentPagesCheckedPerDomain Optional Integer

The maximum number of pages the checker processes in parallel for a single domain. The tool automatically adjusts concurrency based on available system resources; this option sets an upper bound so you can reduce the load on the target domain.

Count of Maximum Concurrent Domains Checked

maxConcurrentDomainsChecked Optional Integer

The maximum number of domains the checker can check at once. This is only relevant when you pass URLs from multiple domains at the same time.

Retire Web Browser Instance After Counting Requests

retireBrowserInstanceAfterRequestCount Optional Integer

Determines how often the browser instance itself rotates, measured in requests. A lower number means more frequent rotation and higher resource consumption; a higher number consumes less.

Navigation Timeout

navigationTimeoutSecs Optional Integer

The maximum duration, in seconds, that a request allows the page to load. If the page fails to load within this time, the browser reports an error and the check counts as a failure.
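
As a hedged illustration of the two tuning options above (the values are examples, mirroring the sample input at the end of this page; the target URL is a placeholder):

// Tuning sketch: rotate the browser every 10 requests and allow each
// page up to 60 seconds to load.
const input = {
    urlsToCheck: [{ url: 'https://www.example.com/' }],  // placeholder URL
    retireBrowserInstanceAfterRequestCount: 10,
    navigationTimeoutSecs: 60,
};
// Pass `input` to client.actor(...).call(input) as in the examples above.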

Headful Browser

puppeteer.headful Optional Boolean

Runs the browser in headful (non-headless) mode. It only works for the Puppeteer checker.

Use Chrome

puppeteer.useChrome Optional Boolean

It only works for the Puppeteer checker. Note that Chrome may not work reliably with Puppeteer.

Wait For

puppeteer.waitFor Optional String

It only works for the Puppeteer checker. Makes the checker wait on every page. Provide either a number of milliseconds or a CSS selector to wait for.

Memory

puppeteer.memory Optional Integer

Choose a memory value between 128 and 32768; it must be a power of 2.
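
A short sketch of the Puppeteer-specific options (the values mirror the sample input at the end of this page; the target URL is a placeholder, and megabytes as the memory unit is an assumption):

// Puppeteer options sketch: wait 2000 ms on every page and give the
// check 4096 memory units (a power of 2 between 128 and 32768).
const input = {
    urlsToCheck: [{ url: 'https://www.example.com/' }],  // placeholder URL
    'checkers.puppeteer': true,
    'puppeteer.waitFor': '2000',  // a number in ms or a CSS selector
    'puppeteer.memory': 4096,
};
// Pass `input` to client.actor(...).call(input) as in the examples above.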

Chrome

playwright.chrome Optional Boolean

Use Chrome to check.

Firefox

playwright.firefox Optional Boolean

Use Firefox to check.

Safari

playwright.webkit Optional Boolean

Use Safari to check.

Use Chrome In Place Of Chromium

playwright.useChrome Optional Boolean

It only works for the Playwright checker. Please remember that Chrome may not work reliably with Playwright.

Headful Browser

playwright.headful Optional Boolean

Determines whether the browser runs in headful (non-headless) mode.

Wait For

playwright.waitFor Optional String

It only works for the Playwright checker. Makes the checker wait on every page. Provide either a number of milliseconds or a CSS selector to wait for.

Memory

playwright.memory Optional Integer

Choose a memory value between 128 and 32768; it must be a power of 2.

{
  "urlsToCheck": [
    {
      "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
    }
  ],
  "proxyConfiguration": {
    "useRealdataAPIProxy": true,
    "RealdataAPIProxyGroups": [
      "SHADER",
      "BUYPROXIES94952",
      "RESIDENTIAL"
    ]
  },
  "checkers.cheerio": true,
  "checkers.puppeteer": true,
  "checkers.playwright": true,
  "saveSnapshot": true,
  "enqueueAllOnDomain": true,
  "pseudoUrls": [],
  "repeatChecksOnProvidedUrls": 10,
  "maxNumberOfPagesCheckedPerDomain": 1000,
  "maxConcurrentPagesCheckedPerDomain": 500,
  "maxConcurrentDomainsChecked": 5,
  "retireBrowserInstanceAfterRequestCount": 10,
  "navigationTimeoutSecs": 60,
  "puppeteer.waitFor": "2000",
  "puppeteer.memory": 4096,
  "playwright.chrome": false,
  "playwright.firefox": true,
  "playwright.waitFor": "2000",
  "playwright.memory": 4096
}