Guide to creating a web scraper using Bubble.io and Puppeteer

Integrating code with the no-code tool Bubble.io: sounds contradictory? Take a look at how we developed a scraper using Bubble, Node.js and Ngrok.


The tools we used, in brief:

Bubble.io

  • The whole user experience and interaction of the application are built on Bubble.io. The low-code platform provides visual elements to create the user interface, workflows to handle user input, and a database to store data such as the data scraped from an eCommerce site.

Bubble.io Plugin

  • When we hit Bubble.io’s limits, we can extend it. One way is by developing a plugin. Within a plugin, you can execute custom code or create your own visual elements for the user interface. We’ll be using the API Connector, a plugin provided by Bubble.

Node

  • Node.js is a single-threaded, open-source, cross-platform runtime environment for building fast and scalable server-side and networking applications. It runs on the V8 JavaScript runtime engine, and it uses event-driven, non-blocking I/O architecture, which makes it efficient and suitable for real-time applications.

Puppeteer

  • Puppeteer is a Node library that provides a high-level API to control headless Chrome or Chromium browsers over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.

Express

  • Express is a minimal and flexible Node.js web application framework. It lets you set up middleware to respond to HTTP requests and defines a routing table that maps each HTTP method and URL to an action.

Ngrok

  • Ngrok is a cross-platform application that enables developers to expose a local development server to the internet with minimal effort. The software makes your locally hosted web server reachable under a subdomain of an Ngrok-owned domain, so the local machine needs no public IP address or domain name.

VS Code

  • Visual Studio Code is a streamlined code editor with support for development operations like debugging, task running, and version control. It aims to provide just the tools a developer needs for a quick code-build-debug cycle and leaves more complex workflows to fuller-featured IDEs, such as Visual Studio.

Pre-requisites

Download and install Node.js, VS Code and Ngrok before you start.

Environment setup for the web scraper

  1. Create a folder on the desktop named “scraper”
  2. Open VS Code
  3. Click on File > Open Folder
  4. Locate and select the folder “scraper”
  5. Click on “Select Folder”
  6. Create a file named “index.js”
  7. Open the terminal, type “npm init -y” and press Enter
  8. Install Puppeteer with the command “npm install puppeteer” and press Enter
  9. Install Express with the command “npm install express” and press Enter

Let’s code

  • Open file “index.js”
  • Import the Puppeteer module within the “index.js” file

const puppeteer = require('puppeteer');

  • Import the Express framework within the “index.js” file

const express = require('express');

  • Instantiate the Express app

const app = express();

  • Set our port:

const port = 3000;

The port will be used a bit later, when we tell the app to listen for requests.

  • Define the final selectors

Web Scraper uses CSS selectors to find HTML elements in web pages and extract data from them. When you select an element, Web Scraper makes its best guess at the CSS selector for it, but you can also write the selector yourself and test it by clicking “Element preview”.

const Selectors = {
    name: '.prod-subtitle',
    price: 'span.push-right:nth-child(1) > strong:nth-child(1)'
}

  • Create an empty object that will be filled with the scraped data and later sent to Bubble as JSON

let productDetail = {
    name: '',
    price: ''
}

We need to keep in mind that Puppeteer is a promise-based library: it performs asynchronous calls to the headless Chrome instance under the hood. Let’s keep the code clean by using async/await. For that, we need to define an async function (here, the route handler) and put all the Puppeteer code in there.
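To make the promise-based style concrete, here is a minimal sketch that needs no browser. fakeLaunch is a hypothetical stand-in for an asynchronous call such as puppeteer.launch(); it is not part of Puppeteer and only serves to compare the two styles.

```javascript
// Hypothetical stand-in for an asynchronous call such as puppeteer.launch().
// It resolves after a short delay, the way a real browser launch would.
function fakeLaunch() {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ name: 'fake-browser' }), 10);
  });
}

// Promise-chaining style: the result is only available inside the callback.
function launchWithThen() {
  return fakeLaunch().then((browser) => browser.name);
}

// async/await style: the same logic reads top to bottom like synchronous code.
async function launchWithAwait() {
  const browser = await fakeLaunch();
  return browser.name;
}
```

Both functions return a promise that resolves to the same value; the async/await version simply avoids nesting callbacks as more steps are added.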

  • Define an HTTP GET endpoint to accept requests from the Bubble server

When Bubble hits the endpoint with a GET request, Express returns the JSON object to the Bubble application. The endpoint serves data about a product page, so its path is /product:

app.get('/product', async (req, res) => {

  • Launch the browser

const browser = await puppeteer.launch()

  • Open a new tab

const page = await browser.newPage()

Puppeteer has a newPage() method that creates a new page instance in the browser, and these page instances can do quite a few things. In our route handler, we create a page instance and then use the page.goto() method to navigate to the target site.

  • Navigate to the target site, whose URL arrives in the “url” query parameter of the request

await page.goto(req.query.url)
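Because the target address arrives from the outside as the url query parameter, it can be worth checking it before handing it to page.goto(). The helper below is a sketch and not part of the original script; isHttpUrl is a hypothetical name, and it relies only on the URL class built into Node.js.

```javascript
// Hypothetical helper: returns true only for well-formed http(s) URLs.
function isHttpUrl(value) {
  if (typeof value !== 'string') return false;
  try {
    const parsed = new URL(value); // throws a TypeError on malformed input
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch (err) {
    return false;
  }
}
```

Inside the route handler, a failed check could be answered with something like res.status(400).json({ error: 'Invalid url' }) before any browser is launched.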

  • Save the text scraped via the HTML element’s selector (the product name) into the JSON object

productDetail.name = await page.$eval(Selectors.name, el=>el.textContent)
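textContent returns the element’s text exactly as it appears in the markup, so the value may carry line breaks and indentation. If that happens on your target page, a small clean-up step like the following sketch can help; cleanText is a hypothetical helper name, not something the original script defines.

```javascript
// Hypothetical helper: collapse runs of whitespace and trim both ends.
function cleanText(raw) {
  return String(raw).replace(/\s+/g, ' ').trim();
}

console.log(cleanText('\n   Wireless   Mouse \n')); // prints "Wireless Mouse"
```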

  • Save the text scraped via the HTML element’s selector (the product price) into a temporary variable

let price_ = await page.$eval(Selectors.price, el=>el.textContent)

“price_” is a temporary variable that holds the raw price text until it has been cleaned.

Here we need to clean the price, because on this site the ‘,’ and ‘.’ separators are swapped. The replacement below turns the second comma in the string into a period.

This step is not needed for every site, but it is necessary here.

let t = 0;

price_ = price_.replace(/,/g, match => ++t === 2 ? '.' : match)
productDetail.price = price_;
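To see what this replacement does in isolation, the sketch below wraps it in a function and runs it on made-up price strings. The function name fixSecondComma and the sample values are assumptions for illustration; the real format depends on the target site.

```javascript
// Replace only the second comma in the string with a period.
function fixSecondComma(price) {
  let seen = 0; // counts the commas matched so far
  return price.replace(/,/g, (match) => (++seen === 2 ? '.' : match));
}

console.log(fixSecondComma('1,234,56')); // prints "1,234.56"
console.log(fixSecondComma('12,50'));    // prints "12,50" (only one comma, unchanged)
```

Wrapping the counter in a function also resets it for every call, so a second request does not start counting from the previous one.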

  • Close browser

await browser.close()
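page.$eval() throws when no element matches the selector, and in that case the line that closes the browser is never reached. One way to guard against leftover browser processes is a try/finally wrapper. The sketch below is an illustration under assumptions: withBrowser is a hypothetical helper, and the launch function is passed in so the pattern can be shown without Puppeteer installed.

```javascript
// Hypothetical helper: run `work` with a browser and always close it afterwards.
// `launch` is any async function returning an object with a close() method,
// for example () => puppeteer.launch().
async function withBrowser(launch, work) {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    await browser.close(); // runs on success and on error
  }
}
```

In the route handler this could be called as withBrowser(() => puppeteer.launch(), async (browser) => { ... }), with the scraping steps placed inside the callback.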

  • Send the JSON object to Bubble as the response and close the route handler

res.json(productDetail);

})

  • Start the server so that it listens for requests from clients

app.listen(port, () => {

  console.log(`Example app listening on port ${port}`)

})

  • To run the application, open the terminal, type the command “node index.js” and press Enter

Complete Code

				
const express = require('express')
const puppeteer = require('puppeteer')

const app = express()
const port = 3000

// CSS selectors for the fields we want to scrape
const Selectors = {
    name: '.prod-subtitle',
    price: 'span.push-right:nth-child(1) > strong:nth-child(1)'
}

// Object that is filled with scraped data and returned to Bubble as JSON
let productDetail = {
    name: '',
    price: ''
}

// GET /product?url=<product page URL>
app.get('/product', async (req, res) => {
    // Launch headless Chrome and open a new tab
    const browser = await puppeteer.launch()
    const page = await browser.newPage()

    // Navigate to the page passed in the "url" query parameter
    await page.goto(req.query.url)

    // Read the product name and the raw price text
    productDetail.name = await page.$eval(Selectors.name, el => el.textContent)
    let price_ = await page.$eval(Selectors.price, el => el.textContent)

    // Replace the second comma in the price with a period
    let t = 0
    price_ = price_.replace(/,/g, match => ++t === 2 ? '.' : match)
    productDetail.price = price_

    await browser.close()

    res.json(productDetail)
})

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`)
})

Exposing the script to the internet (Ngrok)

  • Sign up for Ngrok
  • Go to “Your Authtoken”
  • Copy the token
  • Open a terminal and run ngrok authtoken followed by your token (on version 3 of the Ngrok agent, the equivalent command is ngrok config add-authtoken). Keep the token private.

ngrok authtoken <YOUR_AUTHTOKEN>

  • To expose a web server running on your local machine to the internet, type ngrok http [port number]; in this case, the port number is 3000

ngrok http 3000

  • Now that our script is ready and reachable from the internet, let’s design the UI in Bubble and test requests via the Bubble API Connector.

Designing UI on Bubble.io

  • Create a page, and put an Input Field and a Button on the page.
    The user will paste a link from the eCommerce site into the Input Field.
  • Install the plugin named “API Connector”
  • Set the API Name to “Puppeteer Scraper API”
  • Set Authentication to “None or self-handled”
  • Create a call and name it “GET”
  • Set “Use as” to “Action”
  • After pasting the link, the user clicks the “Calculate” button, which triggers a workflow event with the following actions:
    • Get the scraped data from the API
    • Store that data in the database (optional)
    • Display the data in a group on the UI
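The product link the user pastes has to travel inside the url query parameter of the call, so it must be URL-encoded. The following sketch shows how the final request address is put together; buildProductRequestUrl and both example hostnames are hypothetical, and your own Ngrok forwarding address will differ.

```javascript
// Hypothetical helper: build the address for the /product endpoint.
function buildProductRequestUrl(baseUrl, targetUrl) {
  const requestUrl = new URL('/product', baseUrl);
  // searchParams.set() percent-encodes the value for us.
  requestUrl.searchParams.set('url', targetUrl);
  return requestUrl.toString();
}

const example = buildProductRequestUrl(
  'https://example.ngrok.app',          // placeholder forwarding address
  'https://shop.example.com/item?id=1'  // placeholder product link
);
console.log(example);
```

On the server side, Express decodes the parameter again, so req.query.url contains the original product link.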
