+ - 0:00:00
Notes for current slide
Notes for next slide

POL 304: Using Data to Understand Politics and Society

Web-Scraping

Olga Chyzh [www.olgachyzh.com]

1 / 16

Outline

  • What is webscraping?

  • Webscraping using rvest

  • Examples

    • GDP form Wikipedia

    • 2020 US election returns

  • Cleaning the data with tidyverse

2 / 16

What is Webscraping?

  • Extract data from websites

    • Tables

    • Links to other websites

    • Text

3 / 16

Why Webscrape?

  • Because copy-paste is tedious

  • Because it's fast

  • Because you can automate it

  • Because it helps reduce/catch errors

4 / 16

Webscraping: Broad Strokes

  • All websites are written in HTML (mostly)

  • HTML code is messy and difficult to parse manually

  • We will use R to

    • read the HTML (or other) code
    • clean it up to extract the data we need
  • Need only a very rudimentary understanding of HTML

5 / 16

Webscraping with rvest: Step-by-Step Start Guide

Install all tidyverse packages:

# check if you already have it
library(tidyverse)
library(magrittr)
library(rvest)
# if not:
install.packages("tidyverse")
library(tidyverse) # only calls the "core" of tidyverse
6 / 16

Step 1: What Website Are You Scraping?

# character variable containing the url you want to scrape
myurl<-"https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
7 / 16

Step 2: Read HTML into R

  • HTML is HyperText Markup Language.

  • Go to any website, right click, click "View Page Source" to see the HTML

library(rvest)
library(tidyverse)
library(magrittr)
myhtml <- read_html(myurl)
myhtml
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin-vector-legacy mediawiki ltr sitedir-ltr mw-hide-empty-e ...
8 / 16

Step 3: Where in the HTML Code Are Your Data?

  • Need to find your data within the myhtml object.

  • In HTML, all objects, such as tables, paragraphs, hyperlinks, and headings, are inside "tags" that are surrounded by <> symbols

  • Examples of tags:

    • <p> This is a paragraph.</p>
    • <h1> This is a heading. </h1>
    • <a> This is a link. </a>
    • <li> item in a list </li>
    • <table>This is a table. </table>
  • Can use Selector Gadget to find the exact location. Enter vignette("selectorgadget") for an overview.

  • Can also skim through the raw html code looking for possible tags.

  • For more on HTML, check out the W3schools' tutorial

  • You don't need to be an expert in HTML to webscrape with rvest!

9 / 16

Step 4:

Give HTML tags into html_nodes() to extract your data of interest. Once you got the content of what you are looking for, use html_text to extract text, html_table to get a table

mytable<-html_nodes(myhtml, "table") %>% #Gets everything in the element
html_table(fill=TRUE) #Convert to an R table, fill=TRUE is necessary when the website has multiple tables
mytable<-mytable %>% extract2(3) #since the website has multiple tables, we need to extract the 3rd one.
#Or you can combine the operations into a pipe:
mytable<-read_html(myurl) %>% html_nodes("table") %>% html_table(fill=TRUE) %>% extract2(3)
10 / 16

Step 5: Save and Clean the Data

  • You may want to remove all columns except Country and GDP.

    • Use select from tidyverse to select these columns
  • You may want to delete any extra rows

    • Use slice to select the rows you need.
  • You may want to clean up country names by removing any unnecessary symbols (e.g. [])

    • Use mutate and str_extract
  • Finally, we need to convert GDP to a numeric variable

    • Use parse_number
11 / 16

Step 5: Save and Clean the Data

library(stringr)
library(magrittr)
mytable<-read_html(myurl) %>%
html_nodes("table") %>%
html_table(fill=TRUE) %>%
extract2(3) %>% #our table is actually nested within a list element [[]]
select(Country=1, Year=4, GDP=3) %>%
slice(3:214) %>%
mutate( Year=str_remove(Year, ".*\\]"), #remove everything before the ]
GDP=str_remove(GDP, ".*\\]"),GDP=parse_number(GDP), Year=parse_number(Year))
12 / 16

Your Turn (5 min)

13 / 16

Example 2

  • We will scrape the 2020 US Presidential Election returns for the state of Maryland

  • Then we will select county, and the votes for just the two major candidates, remove the total, and convert the votes to numeric values.

myurl<-"https://elections.maryland.gov/elections/2020/results/general/gen_detail_results_2020_4_BOT001-.html"
pres<-read_html(myurl) %>% html_nodes("table") %>% html_table(fill=TRUE) %>% extract2(2) %>%
select(County=Jurisdiction, Biden20=contains("Biden"), Trump20=contains("Trump")) %>%
filter(str_detect(County, "Total", negate=TRUE)) %>%
mutate(Biden20=parse_number(Biden20), Trump20=parse_number(Trump20))
14 / 16

Your Turn (5 min)

  • Follow the same steps to scrape the 2016 US Presidential returns by county for the state of Maryland.

  • Clean up the results

15 / 16

Challenge Yourself

  1. Follow the steps learned in class to scrape the names, ridings, and party of the current Ontario MPPs from https://www.ola.org/en/members/current.

  2. Extract the links to each individual MPP website and use it to get a list of their email addresses.

16 / 16

Outline

  • What is webscraping?

  • Webscraping using rvest

  • Examples

    • GDP form Wikipedia

    • 2020 US election returns

  • Cleaning the data with tidyverse

2 / 16
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow