Data Scraping and Management for Social Scientists with R

Prerequisites (knowledge of topic)
•    Some prior knowledge of R and/or programming is beneficial, but not required

Hardware
•    Bring your own laptop

Software
•    R & RStudio, most recent versions (free downloads)
•    You may want to bring a credit card to create your own cloud accounts (for a database server and certain APIs). These accounts are typically free, but some require a credit card number on file.

Course content
Online platforms such as Yelp, Twitter, Amazon, or Instagram are large-scale, rich, and relevant sources of data. Researchers in the social sciences increasingly tap into these data for field evidence when studying various phenomena.

In this course, you will learn how to find, acquire, store, and manage data from such sources and how to prepare them for subsequent statistical analysis in your own research.

After a short introduction to the relevance of data science skills for the social sciences, we will review R as a programming language and its basic data formats. We will then use R to program simple scrapers that systematically extract data from websites, using the packages rvest, httr, and RSelenium, among others. You will also learn how to read HTML, CSS, JSON, and XML, how to use regular expressions, and how to handle string, text, and image data. To store the data, we will look into relational databases, (My)SQL, and related R packages. Many websites such as Twitter and Yelp offer convenient application programming interfaces (APIs) that facilitate the extraction of data, and we will look into accessing them from R. Finally, we will highlight some options for feature extraction from images and text, which allow us to augment the collected data with meaningful variables for our analysis.
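To give a flavor of the scraping sessions, the core pattern taught with rvest can be sketched as follows. Note that the URL and the CSS selectors in this sketch are hypothetical placeholders, not the structure of any real site:

```r
# Minimal web-scraping sketch with rvest.
# NOTE: the URL and the CSS selectors (".review-text", ".review-rating")
# are hypothetical placeholders for illustration only.
library(rvest)

page <- read_html("https://example.com/reviews")

# Extract the text of each review via a CSS selector
reviews <- page |>
  html_elements(".review-text") |>
  html_text2()

# Extract star ratings stored in an HTML attribute
ratings <- page |>
  html_elements(".review-rating") |>
  html_attr("data-rating") |>
  as.numeric()

# Combine into a data frame ready for storage or analysis
head(data.frame(rating = ratings, text = reviews))
```

In the course we will build up scrapers like this step by step, then extend the same pattern to dynamic pages (RSelenium) and API responses (httr).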

At the end of this course, students should be able to identify valuable online data sources, write basic scrapers, and prepare the collected data for statistical analysis as part of their own research projects.

Throughout the course, students will work on a data-scraping project related to their theses. The projects will be presented on the final day of the course.

All data scraping code and other sources will be made available on https://www.data-scraping.org.

Structure
Preliminary schedule:

Day 1
Intro to data scraping
Define students’ scraping projects
Review of R and introduction to programming with R
Afternoon: R programming exercises

Day 2
The anatomy of the internet and relevant data formats
Intro to web scraping with R (with httr, rvest, RSelenium)
Introduction to APIs
Afternoon: Scraping exercises

Day 3
Relational databases and SQL
Data management with R
Afternoon: Database design and implementation project (with MySQL in the cloud)

Day 4
Scraping examples from Yelp, Crowdspring, Twitter, and Instagram
Scaling up your scraper with parallel code and proxies
Feature extraction examples
Afternoon: Work on your scraping projects

Day 5
Wrap-up of course
Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)

Literature

Mandatory
None; all readings will be provided during the course

Supplementary / voluntary
None; all readings will be provided during the course

Examination part

Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)

Supplementary aids

Individual quiz: Closed book
Presentation of students’ scraping projects: Closed book

Examination content
Lecture slides covering key concepts of R and programming, the anatomy of the internet, relational databases, and scraping (slides will be provided as PDFs the day before classes).
•    Students will need to understand R code when they see it, but they will not be required to write code during the exam.

Literature
None.