Common issues I've encountered when scraping

Common issues I've found when scraping content, and how to mitigate them

I have had to scrape web content across many projects and for many different reasons. In this article, I want to touch on some of the common issues I’ve encountered along the way. Think of this as a bit of a “thar be dragons” warning if this is something you’re hoping to do soon.

Web scraping is using a script or service to visit a web page for you and grab some information. This might be because the website doesn’t provide an API, or the API doesn’t include some information you need. In Ruby I’ve written scrapers using Nokogiri to parse the data.

Ideally, web scraping is the last thing you try: the pages you depend on can change without notice, and a scraper costs both development effort to maintain and compute to run.

Disclaimer: if you are thinking about web scraping, make sure you adhere to the website’s terms and conditions. In particular, pay attention to its robots.txt file and honour any rules it sets for bots.
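To make the robots.txt point concrete, here is a rough sketch of checking a path against a robots.txt file before fetching it. The method name `allowed_by_robots?` and the `my-scraper` user agent are made up for illustration. It is a deliberately simplified reading of the format: it only understands User-agent, Allow and Disallow with plain prefix matching, and it merges the rules of every group that matches your user agent. A maintained parser is the safer choice for real use.

```ruby
# Minimal robots.txt check: a sketch, not a full parser.
# Handles User-agent, Allow and Disallow with plain prefix matching.
# Ignores wildcards (* and $ inside paths), Crawl-delay and Sitemap lines.
def allowed_by_robots?(robots_txt, user_agent, path)
  rules = []              # [[:allow or :disallow, prefix], ...]
  applies = false         # does the current group apply to our user agent?
  in_agent_lines = false  # true while reading consecutive User-agent lines

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip   # drop comments and whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?                 # not a "field: value" line

    case field.downcase
    when "user-agent"
      # Consecutive User-agent lines share one group of rules.
      applies = false unless in_agent_lines
      in_agent_lines = true
      applies ||= value == "*" || user_agent.downcase.include?(value.downcase)
    when "allow", "disallow"
      in_agent_lines = false
      # An empty Disallow means "nothing is disallowed", so skip it.
      rules << [field.downcase.to_sym, value] if applies && !value.empty?
    end
  end

  # The longest matching prefix wins; no matching rule means allowed.
  kind, _prefix = rules.select { |_, prefix| path.start_with?(prefix) }
                       .max_by { |_, prefix| prefix.length }
  kind != :disallow
end

robots = <<~TXT
  User-agent: *
  Disallow: /private/
  Allow: /private/press/
TXT

allowed_by_robots?(robots, "my-scraper", "/blog/post")     # allowed
allowed_by_robots?(robots, "my-scraper", "/private/data")  # not allowed
```

Calling this once per URL before you fetch it keeps the check in one place, so it is hard to forget when the scraper grows.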

The website content changes

This is by far the most common issue I have experienced. If you are trying to access the title of a page, you might look for the content of <h1>. If the website you are scraping then changes its titles to use <h2>, your scraper will break.

The best way around this is to be as general as possible. For example, if your scraper looks for something really specific like div > div > header > h1, there is a much greater chance that it will break, for instance when the site removes a single div. A plain h1 selector would survive that change.

It is common to rely on class names to help find content; for example, .title might be the class used to style the title. This is becoming less and less useful as more websites adopt atomic CSS and generated class names, which describe styling rather than content and can change from one build to the next.

If you use a cloud-hosted web scraping platform, many of them offer pre-made templates for the more commonly scraped websites and keep those templates up to date when the sites change.

Business needs change

Sometimes it is trivial to grab one bit of information, but another bit, even if visible when browsing a website, can be much harder.

If the business suddenly needs more information, or the same information presented differently, the scraper can grow in complexity and become unwieldy.

Oftentimes scrapers are made in fairly hacky, “quick-win” kind of ways because they are a means to an end. But if you put time into coding them with the same care you’d give other parts of your application, they will be easier to change to suit new business needs.

Communicate early and clearly that just because it was quick to grab one thing doesn’t mean it is as quick to grab everything on a page.

You get locked out

If you followed my advice at the start of this article, you won’t run afoul of the website’s terms and conditions, but sometimes a website will block you anyway.

If this happens, my advice is to contact the website and see if anything can be done. Don’t try to code your way around it.
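In code, not working around a block can be as simple as stopping the whole run the moment a response looks like one. The sketch below assumes that a 403 or 429 status is the signal, which will not hold for every site, and the names `BlockedError` and `scrape_all` are invented for the example. The HTTP request itself is left to the caller as a block that returns a status and a body.

```ruby
# Raised when a response suggests the site has blocked or rate-limited us.
class BlockedError < StandardError; end

# Treating exactly these statuses as "blocked" is an assumption of this sketch.
BLOCK_STATUSES = [403, 429].freeze

# Fetches each URL via the given block, which must return [status, body].
# Stops the whole run on the first blocked response instead of retrying.
def scrape_all(urls)
  urls.each_with_object({}) do |url, pages|
    status, body = yield(url)
    if BLOCK_STATUSES.include?(status)
      raise BlockedError, "#{url} returned #{status}; stopping the run"
    end
    pages[url] = body
  end
end

# Usage with a stand-in fetcher; a real one would make the HTTP request here.
scrape_all(["/a", "/b"]) { |url| [200, "body of #{url}"] }
```

Letting the error bubble up, rather than rescuing it and rotating user agents or IP addresses, puts a person in the loop, and that person is the one who can contact the website.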
