Web Scraping

2 Easy Web Extract Techniques

There are many ways to extract data from the web. Some require learning languages like Python, and some require expensive tools, but there are a couple of approaches that can get you up and running quickly with minimal effort.

The first easy web extract technique is using node-fetch to download all the HTML you need, then parsing it with Cheerio, which lets you find the specific data you need using jQuery’s powerful $ function. You can find a tutorial on how to scrape this way in Simple Web Scraping with Cheerio and node-fetch.

However, this method has some limitations, such as: “What if I want to scrape data from a website that requires me to log in?”

That is when we can use a more powerful tool called Puppeteer, which is a headless Chrome browser that lets you programmatically surf the web every time you want to extract data, without ever having to open the Chrome browser yourself. To get started with this method, we also made a guide: Scraping the web with Puppeteer.

There are many more methods you can use, but these two easy web extract techniques are robust and can get you the data you need for your next project.

Cheerio Javascript Web Scraping

Simple Web Scraping with Cheerio and node-fetch

Before we get started explaining how you can use Cheerio.js to scrape the web, it is useful to understand what Cheerio is. It describes itself as a “Fast, flexible, and lean implementation of core jQuery designed specifically for the server.” Put in plain terms, that means it makes it easy to take HTML and play around with it; in our context, to make it easy to find the content we want using selectors.

What makes Cheerio special is how lightweight and easy it is to use on your server, but it is not a headless Chrome browser like Puppeteer, so you won’t be able to do anything that requires user interaction, such as logging in.

To get started with Cheerio you need to know how to run a Node server, which is as simple as opening a terminal window in VS Code and learning a few commands, but once you figure that out you can get started. You also need node-fetch (and Cheerio itself), both of which you can install from npm.
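Assuming you already have Node.js and npm installed, pulling in both packages is a single command:

```shell
# Install the two libraries used in the snippets below.
# node-fetch is pinned to v2 here because v3 is ESM-only
# and the examples in this post use require().
npm install node-fetch@2 cheerio
```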

The first thing you are going to want to do is get some HTML from the website you want to scrape, like this:

const fetch = require('node-fetch');

fetch('https://example.com') // replace with the page you want to scrape
    .then(res => res.text())
    .then(body => console.log(body));

Then once you have the HTML, you want to load it into Cheerio like this:

const cheerio = require('cheerio');
const $ = cheerio.load('<ul id="fruits">...</ul>');

Now the entire HTML is loaded into $, and if you want to find the text inside the fruits element you can simply go:


Web Scraping

Amazon Product Scraping

One of the most popular sites on the web for scraping is Amazon.

Some of my favorites are the Chrome extension Honey, and another Chrome extension, Keepa, an Amazon price tracker.

Both of these extensions are special because they extract web data by scraping the current page while also displaying aggregated results contributed by the app’s many users.

In Honey’s case the value add is analytics and the money saved by scraping coupons, with the added benefit of applying them for you. Keepa is more of a hardcore price analytics tool, letting the user tell whether now is the time to buy because the price is low, or not to buy because the price is unusually high.

This is the power of web scraping combined with the ability to use that data to generate insights.

Amazon is very aware that its product and vendor data is constantly being scraped, from every country, by anyone who wants to make apps for Amazon FBA entrepreneurs. When the scraping is done through browser extensions it is very difficult to stop, as long as no blatant violations, such as copyright infringement, are involved.

If you need Amazon product data, I would suggest finding a company that provides an API, or, if those are too expensive because you need a lot of data points, consider Puppeteer or dive into building your own extension.

Javascript Puppeteer Web Scraping

Scraping the web with Puppeteer

Puppeteer is a headless Chrome browser with a lot of flexibility. You can tell it to browse to a URL, wait until the page is loaded, click the login button, type a username and password, browse to another page, grab a bunch of data, and output it as JSON.

To learn Puppeteer there is one website that is your new best friend:

Remember to back up anything you write often, because if you close the window you will lose your work with no way to get it back.

If you want to learn how to use Puppeteer by building a project to turn Craigslist into an API, Jarrod Overson has a great tutorial over on YouTube:

It is good stuff, but don’t just watch: pause and follow along with the code.

Puppeteer is powerful and is everything you need to get data into JSON format, which can then be saved in a NoSQL database like MongoDB or Firestore for use in your app.

Here are some more learning resources:

Scrape something and let me know what you plan on building below in the comments!