Scrape the Data You Need with Cheerio!

By Potch

March 19, 2019

We live in the age of the API. Data is (for better or worse) a commodity, and it's big business to offer data as a service via APIs. Even still, some useful data just isn't available online via a developer-friendly API.

Are we out of luck? Well, sort of!

Let's say there's a website with some useful data on it. Maybe it's the hours of a restaurant, a list of products, or the latest breaking news. This site doesn't have a traditional API, but then again, what is HTTP? We ask for a page, and get an HTML response. That's practically an XML API right there! We may just have a way to get that data with a program after all.

Finding information in HTML is easy when you're in a browser — the contents of an element are one document.querySelector away, and if you're using jQuery, it's as easy as $, with extra tools to boot! We are off to the races.
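For example, pulling the text of a headline out of a live page is a one-liner either way (the .post-title selector here is made up for illustration):

```js
// In the browser console, grab an element's text content:
const title = document.querySelector('.post-title').textContent;

// Or, if the page happens to have jQuery loaded, the same idea:
const titleText = $('.post-title').text();
```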

One more snag awaits us. Because of the web's strict same-origin policy, we (reasonably, yet irritatingly) can't just fetch the contents of another web page from our page without permission. We know that node (or even curl, for that matter) would have no problem, so we can code up a little server that fetches the page for us. But node doesn't know how to parse HTML, and even if it did, where's my snazzy jQuery for slicing and dicing the results?

Enter cheerio: a jQuery-style API for parsing and querying HTML in node.

📦 cheerio's npm listing
📖 cheerio documentation

Our target is the Glitch Culture Zine. Let's for the moment pretend the site doesn't have an RSS feed (it totally does). I'd love to get a list of the latest posts for display in a widget. Okay, no RSS, but what does the HTML look like?

[Screenshot: the markup in devtools, showing a flat list of posts]

Clean and easy to read! I see a flat container, zine-items, full of zine-item elements. A zine-item looks something like this:
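Here's a rough sketch of the shape of one item; the class names and nesting are guesses based on the screenshot, not the zine's exact markup:

```html
<!-- Hypothetical markup for a single post -->
<div class="zine-item">
  <a class="zine-item-link" href="https://glitch.com/culture/some-post/">
    <h2 class="zine-item-title">Some Post Title</h2>
  </a>
  <ul class="zine-item-categories">
    <li>code</li>
    <li>community</li>
  </ul>
</div>
```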

We've got the URL of the post, the title of the post, and some category metadata. Now that we know what we're looking for, we can deploy cheerio!
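Cheerio gives us a jQuery-flavored API on the server. Here's a tiny taste, using the hypothetical markup sketched above (npm install cheerio first):

```js
const cheerio = require('cheerio');

// Normally this HTML would come from a network request; here it's hard-coded
const $ = cheerio.load(`
  <div class="zine-item">
    <a class="zine-item-link" href="https://glitch.com/culture/some-post/">
      <h2 class="zine-item-title">Some Post Title</h2>
    </a>
  </div>
`);

// The same selector syntax we'd use with jQuery in the browser
console.log($('.zine-item').length);            // 1
console.log($('.zine-item-title').text());      // "Some Post Title"
console.log($('.zine-item-link').attr('href')); // "https://glitch.com/culture/some-post/"
```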

Here's the full working demo. Feel free to click around; we'll break it down below.

The Cheerio Starter App

We're demoing this library by building a tiny app that grabs posts from the Glitch Culture Zine. The app is based on the basic express starter, with the addition of request to assist with fetching data from a URL, and of course cheerio. We're serving a minimal HTML page, with a script that fetches data from our server at the URL /glitch-culture. I'm not going to go too deep into the client side; after fetching the list of posts, we're just making some list items and populating them with data.
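Roughly speaking, the client script looks something like this (the #posts id and the field names are illustrative, not the demo's exact code):

```js
// client.js (sketch): ask our server for the scraped posts, then render them
fetch('/glitch-culture')
  .then((response) => response.json())
  .then((posts) => {
    const list = document.querySelector('#posts');
    posts.forEach((post) => {
      const item = document.createElement('li');
      const link = document.createElement('a');
      link.href = post.url;
      link.textContent = post.title;
      item.appendChild(link);
      list.appendChild(item);
    });
  });
```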

The core of our example is in server.js, specifically the handler for app.get('/glitch-culture'). When the client requests this URL, the following happens:
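The server fetches the zine's HTML with request, loads it into cheerio, pulls a title, URL, and categories out of each zine-item, and sends the whole list back as JSON. Here's a rough sketch of that handler; the selectors, field names, and zine URL are assumptions based on the markup above, so the real server.js will differ in the details:

```js
const express = require('express');
const request = require('request');
const cheerio = require('cheerio');

const app = express();

app.get('/glitch-culture', (req, res) => {
  // 1. Fetch the raw HTML of the page we want to scrape
  request('https://glitch.com/culture/', (error, response, body) => {
    if (error) {
      return res.status(500).json({ error: 'could not fetch the page' });
    }

    // 2. Load the HTML into cheerio so we can query it jQuery-style
    const $ = cheerio.load(body);

    // 3. Turn each .zine-item into a plain object
    const posts = $('.zine-item')
      .map((i, el) => ({
        title: $(el).find('.zine-item-title').text().trim(),
        url: $(el).find('.zine-item-link').attr('href'),
        categories: $(el)
          .find('.zine-item-categories li')
          .map((j, cat) => $(cat).text().trim())
          .get(),
      }))
      .get();

    // 4. Respond with JSON that the client-side script can consume
    res.json(posts);
  });
});

app.listen(process.env.PORT || 3000);
```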

Now What?

The starter app uses cheerio to convert some HTML into JSON, but once you've got the data, the power is yours! Remix the app and build your own page scraper. Why not monitor prices on your favorite online store? Liberate data trapped in HTML tables! Watch for sneaky changes to your senator's Wikipedia page! Scraping is the "first API", and cheerio makes it easy.

One Last Thing

Screen scraping is super useful, but not all website owners love the idea of people grabbing the information off their pages. It's always a good idea to check a website's terms and conditions before running any automated scrapers against their content, to be sure you're not getting yourself on someone's naughty list. Also, remember to be polite! If you're requesting data from a website repeatedly, you could overwhelm their server or slow the site down for other visitors.

Happy scraping!