Categories
Cheerio Javascript Web Scraping

Simple Web Scraping with Cheerio and node-fetch

Before we get started explaining how to you can use Cheerio.js to scrape the web it is useful to understand what Cheerio is. It describes itself as a “Fast, flexible, and lean implementation of core jQuery designed specifically for the server.” put in regular speak that means it makes it easy to take HTML and play around with it, in our context to make it easy to find content we want using selectors.

What makes Cheerio special is how easy and lightweight it is to use on your server, but it is not a headless chrome browser like Puppeteer so you won’t be doing anything that requires user interaction such as logging in.

To get started with Cheerio you need to know how to run a node server, which is a simple as opening a terminal window in VS Code and learning a few commands but once you figure that out you can get started. You also need node-fetch which you can find using npm.

The first thing you are going to want to do is get some HTML from the website you want to scrape like this

const fetch = require('node-fetch');

fetch('https://blog.thewebscraping.com/')
    .then(res => res.text())
    .then(body => console.log(body));

Then once you have the HTML you want to load it into Cheerio like this

const cheerio = require('cheerio');
const $ = cheerio.load('<ul id="fruits">...</ul>');

Now the the entire HTML is loaded into $ and if you wanted to find the text inside fruits class you can simple go

$('.fruits').text();

Categories
Javascript Puppeteer Web Scraping

Scraping the web with Puppeteer

Puppeteer is a headless chrome browser that has a lot of flexibility. You can tell it to browse to a url, wait until the page is loaded, click the login button, type a username and password, browse to another page, grab a bunch of data and display it as JSON.

To learn Puppeteer there is one website that is your new best friend: https://try-puppeteer.appspot.com/

try-puppeteer.appspot.com

Remember to backup anything you write often as if you close the window you will lose your work without any way to get it back.

If you want to learn how to use Puppeteer by building a project to turn Craigslist into an API Jarrod Overson has a great tutorial over on youtube:

It is good stuff but don’t just watch, pause and follow along with code

Puppeteer is powerful and is everything you need to get data into JSON format which can then be saved in a NoSQL database like MongoDB or Firestore for use in your app.Here are some more learning resources:

Here are some more learning resources:

Scrape something and let me know what you plan on building below in the comments!