

Web Scraping in R

Skill Level: Intermediate

4.7+

Updated 04/2024

Learn how to efficiently collect and download data from any website using R.


R, Data Preparation

4 hr

13 videos

45 Exercises

3,600 XP

15,062

Statement of Accomplishment


Course Description

Have you ever come across a website that displays a lot of data, such as statistics, product reviews, or prices, in a format that’s not ready for data analysis? Authorities and other data providers often publish their data in neatly formatted tables, but not all of these sites include a download button. Don’t despair: in this course, you’ll learn how to efficiently collect and download data from any website using R. You’ll learn how to automate the scraping and parsing of Wikipedia using the rvest and httr packages. Through hands-on exercises, you’ll also expand your understanding of HTML and CSS, the building blocks of web pages, as you make your data harvesting workflows less error-prone and more efficient.

Prerequisites

Intermediate R

Introduction to the Tidyverse

1

Introduction to HTML and Web Scraping

In this chapter, you'll be introduced to HyperText Markup Language (HTML), a declarative language used to structure modern websites. Using the rvest package, you'll learn how to query simple HTML elements and scrape your first table.

Introduction to HTML

Read in HTML

Beware of syntax errors!

Navigating HTML

Select all children of a list

Parse hyperlinks into a data frame

Scrape your first table

The right order of table elements

Turn a table into a data frame with html_table()
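
As a rough illustration of the topics in this chapter, the sketch below reads a small HTML document, turns its table into a data frame with html_table(), and parses its hyperlinks into a data frame. The HTML snippet, the example.com URLs, and the variable names are made up for this example and do not come from the course exercises.

```r
library(rvest)

# Hypothetical HTML snippet standing in for a real web page
html <- '<html><body>
  <table>
    <tr><th>Mountain</th><th>Height</th></tr>
    <tr><td>Mount Everest</td><td>8848</td></tr>
    <tr><td>K2</td><td>8611</td></tr>
  </table>
  <ul>
    <li><a href="https://example.com/a">Page A</a></li>
    <li><a href="https://example.com/b">Page B</a></li>
  </ul>
</body></html>'

# read_html() parses a literal HTML string as well as a URL or file path
page <- read_html(html)

# Turn the first <table> element into a data frame (a tibble)
mountains <- page %>%
  html_element("table") %>%
  html_table()

# Parse hyperlinks into a data frame: link text plus the href attribute
links <- page %>% html_elements("a")
link_df <- data.frame(
  text = html_text(links),
  url  = html_attr(links, "href")
)
```

On a real page, the string passed to read_html() would typically be replaced by the page's URL.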

2

Navigation and Selection with CSS

Cascading Style Sheets (CSS) describe how HTML elements are displayed on a web page, including colors, fonts, and general layout. In this chapter, you'll learn why CSS selectors and combinators are a crucial ingredient for web scraping.

Introduction to CSS

Select multiple HTML types

Order CSS selectors by the number of results

CSS classes and IDs

Identify the correct selector types

Leverage the uniqueness of IDs

Select the last child with a pseudo-class

CSS combinators

Select direct descendants with the child combinator

How many elements get returned?

Simply the best!

Not every sibling is the same
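
The selector types named in this chapter can be sketched with rvest as follows. This is a minimal example on invented HTML (the IDs, classes, and variable names are assumptions for illustration), not the course's own exercise code.

```r
library(rvest)

# Hypothetical HTML used only for this illustration
html <- '<html><body>
  <div id="intro">
    <p class="lead">First</p>
    <p>Second</p>
    <blockquote><p>Nested</p></blockquote>
  </div>
  <ul><li>a</li><li>b</li><li>c</li></ul>
</body></html>'
page <- read_html(html)

# Multiple HTML types at once: a comma separates alternative selectors
p_and_li <- page %>% html_elements("p, li")

# Class and ID selectors
lead_text <- page %>% html_elements("p.lead") %>% html_text()
intro     <- page %>% html_elements("#intro")

# Child combinator (>): only <p> elements that are direct children of #intro
children <- page %>% html_elements("#intro > p") %>% html_text()

# Descendant combinator (space): <p> elements at any depth inside #intro
descendants <- page %>% html_elements("#intro p") %>% html_text()

# Adjacent sibling combinator (+): the <p> directly following p.lead
next_sibling <- page %>% html_elements("p.lead + p") %>% html_text()

# Pseudo-class: the last <li> within its parent
last_item <- page %>% html_elements("li:last-child") %>% html_text()
```

Comparing children with descendants shows why the choice of combinator matters: the nested paragraph is matched only by the descendant combinator.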

3

Advanced Selection with XPATH

The CSS selectors you got to know in the last chapter are powerful but have their limitations. For example, they fall short when you want to select nodes based on the properties of their descendants. XPath to the rescue! Using this query language, you can navigate and scrape even the most hideous HTML.

Introduction to XPATH

Find the correct CSS equivalent

Select by class and ID with XPATH

Use predicates to select nodes based on their children

XPATH functions and advanced predicates

Find a more elegant XPATH alternative

Get to know the position() function

Extract nodes based on the number of their children

The XPATH text() function

The shortcomings of html_table() with badly structured tables

Select directly from a parent element with XPATH's text()

Combine extracted data into a data frame

Scrape an element based on its text
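
The XPath features listed in this chapter can be tried out with rvest's xpath argument. The sketch below assumes a small invented document (the cast list, class names, and variable names are hypothetical) and shows one possible query per feature.

```r
library(rvest)

# Hypothetical HTML used only for this illustration
html <- '<html><body>
  <div id="cast">
    <p class="actor">Ada <em>lead</em></p>
    <p class="actor">Bo</p>
  </div>
  <ol><li>one</li><li>two</li><li>three</li></ol>
</body></html>'
page <- read_html(html)

# Select by ID and class using attribute predicates
actors <- page %>%
  html_elements(xpath = '//div[@id = "cast"]/p[@class = "actor"]')

# Predicate on children: only <p> elements that contain an <em> child
with_em <- page %>% html_elements(xpath = "//p[em]")

# position(): the first two list items
first_two <- page %>%
  html_elements(xpath = "//ol/li[position() < 3]") %>%
  html_text()

# count(): elements that have exactly three <li> children
three_items <- page %>%
  html_elements(xpath = "//*[count(li) = 3]") %>%
  html_name()

# text(): the parent's own text nodes, without the text of child elements
own_text <- page %>%
  html_elements(xpath = "//p[em]/text()") %>%
  html_text(trim = TRUE)

# Select an element based on its text
two <- page %>% html_elements(xpath = '//li[text() = "two"]')
```

The predicate //p[em] is the kind of query the chapter description refers to: the paragraph is selected because of what it contains.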

4

Scraping Best Practices

Now that you know how to extract content from web pages, it's time to look behind the curtains. In this final chapter, you’ll learn why HTTP requests are the foundation of every scraping action and how they can be customized to comply with best practices in web scraping.

The nature of HTTP requests

Which of these statements about HTTP is false?

Do it the httr way

Houston, we got a 404!

Telling who you are with custom user agents

Check out your user agent

Add a custom user agent

How to be gentle and slow down your requests

Custom arguments for throttled functions

Apply throttling to a multi-page crawler

Recap: Web Scraping in R
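
The practices named in this chapter (checking the response status, sending a custom user agent, and slowing down requests) might be combined as in the sketch below. The user agent string, the fetch_page() helper, and the example.com URLs are hypothetical. The pause uses base R's Sys.sleep() as a simple stand-in for throttling, which is not necessarily the approach taught in the course.

```r
library(httr)

# Hypothetical identification string; replace with your own contact details
ua <- user_agent("my-scraper/0.1 (contact: me@example.com)")

# Hypothetical helper: fetch one page politely
fetch_page <- function(url, delay = 1) {
  Sys.sleep(delay)                  # pause before each request (throttling)
  response <- GET(url, ua)          # send the custom user agent along
  if (http_error(response)) {       # TRUE for 4xx and 5xx codes such as 404
    warning("Request failed with status ", status_code(response))
    return(NULL)
  }
  content(response, as = "text", encoding = "UTF-8")
}

# Example use for a multi-page crawl (not run here, needs network access):
# urls  <- c("https://example.com/1", "https://example.com/2")
# pages <- lapply(urls, fetch_page, delay = 2)
```

Keeping the delay as a function argument makes it easy to slow the crawler down further for sites that ask for it.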


Earn Statement of Accomplishment

Add this credential to your LinkedIn profile, resume, or CV
Share it on social media and in your performance review

Don’t just take our word for it

4.7 from 98 reviews

5 stars: 79%

4 stars: 17%

3 stars: 3%

2 stars: 0%

1 star: 1%


Geofrey

6 days ago

Very interesting course.

krish

last week

Camilo

2 weeks ago

Everything was so good!

Marten

3 weeks ago

Arjun

4 weeks ago

Faisal

4 weeks ago

Great course with clear explanations and hands-on exercises. I learned a lot about web scraping in R. Highly recommended!


FAQs

Which R packages are used for web scraping in this course?

You will use the rvest package for parsing HTML and extracting data, along with the httr package for making HTTP requests and handling web page responses.

Do I need to know HTML and CSS before starting this course?

No. The course teaches you the HTML and CSS fundamentals needed for web scraping, including how elements are structured and how to use CSS selectors to target data.

What website is used for hands-on scraping practice?

You will practice automating the scraping and parsing of Wikipedia pages, learning to extract tables and other structured content from real web pages.

Will I learn to handle common scraping errors and edge cases?

Yes. The course teaches you techniques to make your data harvesting workflows less error-prone and more efficient when dealing with real-world web content.

What can I do with web scraping skills in a data science workflow?

You can collect data from websites that lack download options, such as price listings, statistics tables, and product reviews, and prepare it for analysis in R.
