Big Data in R: SQL, Spark, NoSQL

Prerequisites (knowledge of topic)
Students should have some data manipulation and visualization skills in R. For example, they should be able to use dplyr to calculate the mean of a numeric variable grouped by a categorical variable. They should be able to use ggplot2 to draw common plots such as histograms and scatter plots.

Basic SQL skills are helpful but not required. For example, being able to write simple SELECT queries is useful.

Hardware
A laptop.

Software

-    A web browser
-    R
-    An R IDE, such as RStudio
-    PostgreSQL
-    Spark
-    MongoDB

A software setup guide will be provided before the course begins.

Course content

Data is “big” if its size affects how you store or manipulate it. The course starts with techniques for finding performance problems in your analyses through benchmarking and profiling. Day two recaps core skills in data manipulation and visualization with dplyr and ggplot2, and covers techniques for visualizing big datasets.

Days three to five cover the scenario where your data is too big to fit on a single machine, and teach techniques for working with SQL databases (PostgreSQL), Spark, and NoSQL databases (MongoDB).

Structure
Day 1 - Working with local files, finding performance problems
•    Fast data I/O with readr, data.table (1 unit)
•    Benchmarking code with microbenchmark (1 unit)
•    Profiling code with profvis (1 unit)
•    Data storage with feather, fst (1 unit)
•    Storing data in the cloud with aws.s3 (1 unit)
Day 2 - Core skills, visualizing big data
•    Tricks for visualizing big datasets with ggplot2 (1 unit)
•    Visualizing big data with trelliscope (2 units)
•    Intermediate dplyr skills, such as joining data frames (2 units)
Day 3 - Using SQL Databases from R
•    Running queries on PostgreSQL databases with dplyr (3 units)
•    Exploring databases with DBI (1 unit)
•    Profiling and optimizing queries (1 unit)
Day 4 - Using Spark from R
•    Running queries on Spark with sparklyr (2 units)
•    Machine learning on Spark with sparklyr (2 units)
•    Working with Parquet files (1 unit)
Day 5 - Using MongoDB from R
•    Working with JSON data with jsonlite (1 unit)
•    Principles of NoSQL (1 unit)
•    Running queries on MongoDB using mongolite (1 unit)
•    Calculating on MongoDB using mongolite (1 unit)

Literature
Mandatory

N/A

Supplementary / voluntary
N/A

Mandatory readings before course start
Before beginning the course, you are expected to be comfortable with the skills and techniques covered in the “Explore” section (chapters 2 to 8) of R for Data Science.

These data manipulation and data visualization skills are also taught in the following DataCamp courses, which may be taken as an alternative to reading R for Data Science.

-    Introduction to the Tidyverse
-    Introduction to Data Visualization with ggplot2
-    Data Manipulation with dplyr in R

Examination part
-    Daily challenges: 5 x 10%
-    Open book exam, to be completed within 3 weeks of course completion: 50%

Supplementary aids
Most of the R packages used in this course have good documentation in their vignettes. You can find these using R's browseVignettes() function, or by browsing CRAN.
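
For example:

```r
# Open the vignette index for a single package...
browseVignettes(package = "dplyr")
# ...or for every installed package
browseVignettes()
```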

Monday's work on benchmarking and profiling is nicely covered in Efficient R Programming, particularly chapters 1, 5, and 7.

Tuesday's work on data visualization is covered on the ggplot2 website, in the ggplot2 book, and on the trelliscope website.

Wednesday's work on accessing SQL databases from R is covered in the RStudio database documentation.

Thursday's work on Spark is covered in the RStudio Spark documentation.

Friday's work on accessing MongoDB from R is covered in the mongolite User Manual.

Examination content
The exam will test the learning objectives for the week. The following list is non-exhaustive and subject to change.

Learning objectives for Monday

You should be able to
-    articulate performance tradeoffs between common local data file formats
-    import and export data in these formats
-    benchmark the performance of R expressions
-    locate performance problems in R code using profiling
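
As a concrete taste of these objectives, here is a minimal sketch of benchmarking and profiling; the CSV file name is a placeholder for any dataset on your disk.

```r
library(microbenchmark)
library(profvis)

# Compare three CSV readers on the same (placeholder) file
microbenchmark(
  base       = read.csv("flights.csv"),
  readr      = readr::read_csv("flights.csv"),
  data.table = data.table::fread("flights.csv"),
  times = 10
)

# Profile a block of code to locate its bottlenecks
profvis({
  dat <- data.frame(x = rnorm(1e6))
  dat$y <- sqrt(abs(dat$x))
  summary(dat)
})
```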

Learning objectives for Tuesday

You should be able to
-    articulate common problems that occur when visualizing large datasets
-    draw common plot types suitable for visualizing large datasets
-    draw trelliscope plots
-    specify cognostics for trelliscope plots
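
For the ggplot2 tricks, a minimal sketch using simulated data: 2D binning and heavy transparency both avoid the overplotting you get from a raw scatter plot of a million points.

```r
library(ggplot2)

big <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Bin the plane and plot counts instead of drawing every point
ggplot(big, aes(x, y)) +
  geom_bin2d(bins = 50)

# Alternatively, near-transparent points let density show through
ggplot(big, aes(x, y)) +
  geom_point(alpha = 0.01)
```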

Learning objectives for Wednesday

You should be able to
-    connect to a SQL database from R
-    run SELECT queries against that database, and retrieve the results
-    create a View in the database
-    analyze the performance of a query
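
A minimal sketch of this workflow, assuming a PostgreSQL server is available; the host, database, user, and table names are placeholders.

```r
library(DBI)
library(dplyr)

# Connection details are placeholders; substitute your own server's values
con <- dbConnect(
  RPostgres::Postgres(),
  host = "localhost", dbname = "airlines",
  user = "analyst", password = Sys.getenv("PGPASSWORD")
)

dbListTables(con)                  # explore the database with DBI

flights <- tbl(con, "flights")     # a lazy reference; no rows are pulled yet
delays <- flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

show_query(delays)                 # inspect the SQL that dplyr generates
collect(delays)                    # run the query and retrieve the results

dbDisconnect(con)
```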

Learning objectives for Thursday

You should be able to
-    import data frames to Spark from R
-    manipulate those data frames in Spark
-    run simple machine learning models in Spark
-    manipulate Parquet files
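
A minimal sketch of these steps against a local Spark instance; the Parquet path is a placeholder.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # a single-machine cluster for illustration

# Import an R data frame into Spark
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Manipulate the Spark data frame with dplyr verbs
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Fit a simple machine learning model on the cluster
ml_linear_regression(cars_tbl, mpg ~ wt + cyl)

# Write the data as Parquet, then read it back
spark_write_parquet(cars_tbl, path = "cars_parquet")
spark_read_parquet(sc, name = "cars_pq", path = "cars_parquet")

spark_disconnect(sc)
```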

Learning objectives for Friday

You should be able to
-    convert between JSON strings and R lists
-    connect to MongoDB from R
-    perform queries on data in MongoDB
-    perform calculations on data in MongoDB
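
A minimal sketch of these objectives; the MongoDB URL, database, and collection names are placeholders for the course server.

```r
library(jsonlite)

# Convert between JSON strings and R objects
json <- toJSON(list(name = "Ada", scores = c(9, 10)), auto_unbox = TRUE)
fromJSON(json)

library(mongolite)

m <- mongo(collection = "flights", db = "airlines",
           url = "mongodb://localhost")     # placeholder connection URL

m$find('{"carrier": "AA"}', limit = 5)      # query documents

# Calculate on the server: count documents per carrier
m$aggregate('[{"$group": {"_id": "$carrier", "n": {"$sum": 1}}}]')
```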