Automation with TensorFlow.js and Puppeteer: An Easy Path to Object Detection on Websites
Overview
This is less a tutorial and more a journey through the "what" and the high-level "how" of building my Puppyteer web crawler, which searches websites for photos of adorable dogs.
You can dive into the code on GitHub.
Dependencies
The project has around 7 dependencies. Here are the key ones:
Headless Chrome Crawler
Headless Chrome Crawler is a Node.js library that you can configure to crawl websites. It differs from many other web crawlers in that it uses Google Chrome as the conduit through which web pages (and their JavaScript) are loaded and executed.
Crawlers based on simple requests to HTML files are usually fast. However, sometimes they may return empty documents, especially when websites are built on modern frontend frameworks like AngularJS, React, and Vue.js.
Powered by Headless Chrome, the crawler provides simple APIs for crawling these dynamic websites, with the following features…
Getting it up and running is easy. In the following code snippet, a scan of Kevin Bacon’s Wikipedia page is executed, printing page titles and information along the way.
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  let crawler = await HCCrawler.launch({
    maxDepth: 2,
    evaluatePage: (() => ({
      title: $('title').text(),
    })),
    onSuccess: (result => console.log(result)),
  });
  await crawler.queue('https://en.wikipedia.org/wiki/Kevin_Bacon');
  await crawler.onIdle();
  await crawler.close();
})();
Our use case for the web crawler was to find all of the images loaded by Chrome while crawling a pet shelter's website. To do this, we implemented a custom crawl, which, among other things, lets you interact with Puppeteer's Page object.
customCrawl: async (page, crawl) => {
  // intercept requests so we can record image URLs as Chrome loads the page
  await page.setRequestInterception(true);

  page.on('request', request => {
    let requestUrl = request.url();

    if (request.resourceType() === 'image' && !imageUrls.has(requestUrl)) {
      imageUrls.add(requestUrl);
      request.abort();
    } else {
      request.continue();
    }
  });

  let result = await crawl();
  result.content = await page.content();
  return result;
}
With access to the Page object, we can use request interception to record the URLs of the images a page loads. Each image URL is saved for classification with TensorFlow.js in the next step.
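The interception logic amounts to a single decision per request. As a minimal sketch, extracted into a plain function for illustration (the helper name is ours, not part of the crawler's or Puppeteer's API), it looks like this:

```javascript
// Record an image URL the first time we see it and signal that the
// request should be aborted; let every other request continue.
// `seenImageUrls` is a Set shared across the whole crawl.
function shouldAbortRequest(seenImageUrls, resourceType, requestUrl) {
  if (resourceType === 'image' && !seenImageUrls.has(requestUrl)) {
    seenImageUrls.add(requestUrl); // remember the URL for classification later
    return true;  // abort: we only need the URL, not the image bytes
  }
  return false;   // scripts, stylesheets, and repeat requests proceed normally
}
```

Aborting the image request saves bandwidth during the crawl; the recorded URL is fetched later, only when the image actually needs to be classified.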
TensorFlow.js
TensorFlow.js is the JavaScript counterpart to the popular machine learning framework TensorFlow. TensorFlow is a framework for building, training, and running machine learning models that perform complex tasks, such as text-to-speech or image recognition. Traditionally, TensorFlow logic is written in Python; TensorFlow.js lets you perform the same machine learning tasks in JavaScript, which means you can load and run models in the browser or on the server side via Node.js.
TensorFlow.js also ships with several pre-built machine learning models, so you don't need a doctorate in machine learning to get started with recognition quickly.
Our implementation takes the URL of an image recorded in the previous step, retrieves the binary data from the web server, and passes it to a pre-built object detection model, coco-ssd.
More about coco-ssd:
An object detection model aimed at localizing and identifying multiple objects in a single image.
This model is a TensorFlow.js port of the COCO-SSD model. For more information about the TensorFlow Object Detection API, please read the README tensorflow/object_detection.
The model detects objects defined in the COCO dataset, which is a dataset for object detection, segmentation, and captioning. Additional information can be found here. The model is capable of detecting 90 classes of objects. (SSD stands for Single Shot MultiBox Detection).
This TensorFlow.js model does not require you to have knowledge of machine learning. It can take input in the form of any browser-based image elements (e.g., img, video, canvas) and return an array of bounding boxes with class names and confidence levels.
An impressive feature of coco-ssd is that it will detect as many objects in an image as possible and generate a bounding box indicating where the object is located in the image. The detect method will return an array of predictions, one for each detected object in the image.
const tf = require('@tensorflow/tfjs');
const tfnode = require('@tensorflow/tfjs-node');
const cocoSsd = require('@tensorflow-models/coco-ssd');
const request = require('request');

// Fetches the raw binary data of an image from a web server.
function getImagePixelData(imageUrl) {
  return new Promise((resolve, reject) => {
    let options = { url: imageUrl, method: 'get', encoding: null };

    request(options, (err, response, buffer) => {
      if (err) {
        reject(err);
      } else {
        resolve(buffer);
      }
    });
  });
}

(async () => {
  let model = await cocoSsd.load({ base: 'mobilenet_v2' });
  let predictions = [];

  try {
    let url = 'https://www.guidedogs.org/wp-content/uploads/2019/11/website-donate-mobile.jpg';
    let imageBuffer = await getImagePixelData(url);

    if (imageBuffer) {
      let input = tfnode.node.decodeImage(imageBuffer);
      predictions = await model.detect(input);
      console.log(predictions);
    }
  } catch (err) {
    console.error(err);
  }
})();
The image at that URL is a photo of a dog. Passing it through the coco-ssd model produces the following output:
[
  {
    bbox: [
      62.60044872760773,
      37.884591430425644,
      405.2848666906357,
      612.7625299990177
    ],
    class: 'dog',
    score: 0.984025239944458
  }
]
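Given that output, deciding whether an image matches the search query is just a filter over class and score. A small sketch (the helper name and the 0.66 threshold are our choices for illustration; the actual project may differ):

```javascript
// Returns true if any prediction matches the requested class name with a
// confidence score at or above the threshold.
function containsObject(predictions, query, minScore = 0.66) {
  return predictions.some(
    prediction => prediction.class === query && prediction.score >= minScore
  );
}
```

Applied to the output above, containsObject(predictions, 'dog') returns true, since the single prediction is a 'dog' with a score of roughly 0.98.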
Get Up & Running
Step 1 | Clone the repository
git clone [email protected]:evanhalley/puppyteer-crawler.git
Step 2 | Install dependencies
cd puppyteer-crawler
npm install
Step 3 | Search a website for photos of dogs:
node . --url=spcawake.org --depth=1 --query=dog
Output
Searching https://spcawake.org for images containing a dog...
The domain for the URL is spcawake.org...
Starting crawl of https://spcawake.org...
Crawled 1 urls and found 25 images...
Classifying 25 images...
████████████████████████████████████████ 100% | ETA: 0s | 25/25
Images that contain a dog
https://spcawake.org/wp-content/uploads/2019/11/Clinic-Banner-2-820x461.jpg
https://spcawake.org/wp-content/uploads/2019/03/Dog-for-website.jpg
https://spcawake.org/wp-content/uploads/2019/03/volunteer-website-pic.jpg
https://spcawake.org/wp-content/uploads/2019/12/Social-Dog-250x250.jpg
https://spcawake.org/wp-content/uploads/2019/12/Alhanna-for-blog-v2-250x250.jpg
Summary
In this article, we explained how to use two libraries to quickly perform a task that can be quite laborious if done manually (depending on the size of the website). Using TensorFlow.js allows you to leverage models that are already created and trained to identify various types of objects. You can even train a model yourself, for example, to detect all images of 1992 Volkswagen GTIs on a classic car website.
Using a web crawler based on Puppeteer ensures the rendering of JavaScript and scanning of URLs resulting from processed JavaScript. This makes collecting data for model input easier and less cumbersome.