Web Scraping

How To Easily Scrape Web Data Using Simple Scraper

Transcript below

Data is important for any project that you'll be working on, and a common project for beginners that I see is to build a list of no-code websites. Because they spend so much time visiting and learning these sites, it's useful to put all of them into a single page. One example of this is No Code List: it's a list of no-code projects, with categories over here, and a straight list of all the different projects. So I'm going to show you how to take this list and turn it into a spreadsheet, so you can practice doing the same thing with your own project. To do this we're going to use a Chrome extension called Simple Scraper. I've already installed it, so I don't need to go through that, but you just click install to Chrome. Now I'll show you how to use it: click on the Simple Scraper extension.

Click scrape this website, and now you want to add properties. The first property is going to be the name of the website, so we find that by hovering over the name. You don't want to click on it, because if you click it'll take you off the page; instead you hold Shift, and that will highlight the name of the website. When you're finished, you click the tick. The next one we're going to do is the URL, so let's call this one URL: hover over it and hit Shift, because if you actually click it then it'll navigate through. The last thing we'll get is the image, because you want to classy up the web page that you're building, so we'll use an image. And there we go, we've got 291, because as you can see there are a lot of them here. Now: view results.

All right, so we've extracted 291. We've got the image, and if we have a look at it, yep, it looks like an image. Then we've got the name, and the name link, which is going to be useful because it gives you a very quick way to send people to that site. You've got the URL, which would be useful for displaying, and that one is just another link, because two of them were links. Now we go down to download CSV, and once you click on it you'll see you have a table full of data which can be used to build your next project. I don't know what this first one is, so I'm just going to delete it, which is something you'll find with data: you have to clean it up quite often. From here you could import it into Google Sheets or Airtable, and from there you can use an application like Bubble.io, or one of the many on this list, to build your own project. That's all I have for today.
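That cleanup step, deleting junk rows from the exported CSV, can also be scripted. Here is a minimal Node sketch, using made-up CSV content as a stand-in for Simple Scraper's export:

```javascript
// Hypothetical CSV export; in practice, read the downloaded file from disk.
const csv = 'name,url\n,\nNoCodeList,https://example.com';

// Split into rows, then split each row into columns.
const rows = csv.split('\n').map(line => line.split(','));
const header = rows[0];

// Drop rows where every column is empty -- scraped data often needs this.
const cleaned = rows.slice(1).filter(cols => cols.some(c => c.trim() !== ''));

console.log(header);  // [ 'name', 'url' ]
console.log(cleaned); // [ [ 'NoCodeList', 'https://example.com' ] ]
```

Note that this naive split breaks on quoted fields containing commas; for real exports, a CSV-parsing library is safer.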

Web Scraping

2 Easy Web Extract Techniques

There are many ways to extract data from the web: some require learning languages like Python, and some require expensive tools. But there are a couple of ways that can get you up and running quickly with minimal effort.

The first easy web extract technique is using node-fetch to download all the HTML you need, then parsing it with Cheerio, which lets you find the specific data you need using jQuery's powerful $ function. You can find a tutorial in Simple Web Scraping with Cheerio and node-fetch.

However, this method has some limitations, such as: "What if I want to scrape data from a website that requires me to log in?"

That is when we can use a more powerful tool called Puppeteer, which drives a headless Chrome browser and lets you effectively surf the web every time you want to extract data, without ever having to open the Chrome browser yourself. To get started with this method we also made a guide: Scraping the web with Puppeteer.

There are many more methods that you can use, but these two easy web extract techniques are robust and can get you the data you need for your next project.

Cheerio Javascript Web Scraping

Simple Web Scraping with Cheerio and node-fetch

Before we get started explaining how you can use Cheerio.js to scrape the web, it is useful to understand what Cheerio is. It describes itself as a "Fast, flexible, and lean implementation of core jQuery designed specifically for the server." Put in plain terms, that means it makes it easy to take HTML and play around with it; in our context, that means finding the content we want using selectors.

What makes Cheerio special is how easy and lightweight it is to use on your server, but it is not a headless Chrome browser like Puppeteer, so you won't be doing anything that requires user interaction, such as logging in.

To get started with Cheerio you need to know how to run a Node server, which is as simple as opening a terminal window in VS Code and learning a few commands; once you figure that out you can get started. You also need node-fetch, which you can install from npm.
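Assuming you have Node installed, the terminal setup might look like this (the project folder name is arbitrary; both packages come from npm):

```shell
mkdir cheerio-scraper && cd cheerio-scraper
npm init -y                      # create a package.json
npm install node-fetch@2 cheerio # node-fetch v2 keeps the CommonJS require() syntax
```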

The first thing you are going to want to do is get some HTML from the website you want to scrape, like this (the URL here is a placeholder; use the site you want to scrape):

const fetch = require('node-fetch');

fetch('https://example.com')
    .then(res => res.text())
    .then(body => console.log(body));

Then, once you have the HTML, you want to load it into Cheerio like this:

const cheerio = require('cheerio');
const $ = cheerio.load('<ul id="fruits">...</ul>');

Now the entire HTML is loaded into $, and if you want to find the text inside the fruits list you can simply write

$('#fruits').text();


Web Scraping

Amazon Product Scraping

One of the most popular sites on the web for scraping is Amazon.

Some of my favorite Amazon scraping tools are the Chrome extension Honey:

and another Chrome extension, Keepa, an Amazon price tracker:

Both of these extensions are special because they extract web data by scraping the current page while simultaneously displaying aggregated results contributed by many other users of the app.

In Honey's case the value-add is analytics and money saved by scraping coupons, with the added benefit of applying them for you. Keepa is more of a hardcore price-analytics tool, letting the user tell whether now is the time to buy because the price is low, or not to buy because the price is unusually high.

This is the power of web scraping combined with the ability to use that data to generate insights.

Amazon is very aware that its product and vendor data is constantly being web scraped, from every country, by anyone who wants to make apps for Amazon FBA entrepreneurs. When the scraping happens through browser extensions it is very difficult to stop, as long as no blatant violations, such as copyright infringement, are occurring.

If you need Amazon product data, I would suggest finding a company that provides an API; if they are too expensive because you need a lot of data points, consider Puppeteer, or dive into building your own extension.

Javascript Puppeteer Web Scraping

Scraping the web with Puppeteer

Puppeteer is a Node.js library that drives a headless Chrome browser, and it has a lot of flexibility. You can tell it to browse to a URL, wait until the page has loaded, click the login button, type a username and password, browse to another page, grab a bunch of data, and output it as JSON.

To learn Puppeteer there is one website that is your new best friend:

Remember to back up anything you write often, as if you close the window you will lose your work without any way to get it back.

If you want to learn how to use Puppeteer by building a project that turns Craigslist into an API, Jarrod Overson has a great tutorial on YouTube:

It is good stuff, but don't just watch: pause and follow along with the code.

Puppeteer is powerful and is everything you need to get data into JSON format, which can then be saved in a NoSQL database like MongoDB or Firestore for use in your app.

Here are some more learning resources:

Scrape something and let me know what you plan on building below in the comments!