Machine Learning with R - Advanced

Prerequisites (knowledge of topic)
This course is a continuation of Introductory Machine Learning with R and assumes a basic knowledge of at least several machine learning classification methods. Students having equivalent real-world experience (via other ML courses or on-the-job experiences) are also welcome.

Hardware
A laptop computer is required to complete the in-class exercises.

Software
R (https://www.r-project.org/) and R Studio (https://www.rstudio.com/products/rstudio/) are available and no cost and are needed for this course.

Course content
With machine learning, it is often difficult to make the leap from classroom examples to the real-world. Real-world applications often present challenges that require more advanced approaches for preparing, exploring, modeling, and evaluating the data. The goal of this course is to prepare students to independently apply machine learning methods to their own tasks. We will cover the practical techniques that are not often found in textbooks but discovered through hands-on experience. We will practice these techniques by simulating a machine learning competition like those found on Kaggle (https://www.kaggle.com/). The target audience includes students who are interested in applying ML knowledge to more difficult problems and learning more advanced techniques to improve the performance of traditional ML methods.

Structure
The course will be designed to be interactive, with ample time for hands-on practice. Each day will include at least one lecture based on the day’s topic in addition to a hands-on “lab” section to apply the learnings to a competition dataset (or one’s own data).

The tentative schedule is as follows:
Day 1: Handling messy data
Discussion: Typically, 80% of the time spent on ML is for data preparation. Why?
Lecture: Learning to explore data
Lecture: Missing values – imputation and other strategies
Lecture: The R data pipeline – tidyverse
Lab: Getting to know your data

Day 2: Understanding ML performance
Discussion: What makes a successful ML model?
Lecture: Getting beyond accuracy – other performance measures
Lecture: The “no free lunch” theorem
Lecture: Estimating future performance – sampling methods, model selection
Lab: Comparing models on your dataset with ROC curves

Day 3: Improving ML performance
Discussion: What factors keep ML models from perfect prediction?
Lecture: Tuning stock models – automated parameter tuning
Lecture: Meta-learning – ensembles, stacked models
Lab: Machine Learning Competition (Round 1)
 
Day 4: “Big data” problems
Discussion: Is more data always better? Why or why not?
Lecture: The curse of dimensionality – dimensionality reduction, t-SNE
Lecture: Imbalanced datasets – under and over-sampling strategies
Lecture: Improving R’s performance on big data
Lab: Machine Learning Competition (Round 2)

Day 5: Next-generation “Black Box” methods
Discussion: What are the strengths and weaknesses of man versus machine?
Lecture: Deep Learning – Keras, Tensorflow
Lecture: Text embeddings – word2vec
Lecture: Cluster computing – use cases of Hadoop, Spark, etc.
Discussion: Results of ML Competition – winners’ tips and tricks
Lab: Work on your final project

Literature

Mandatory
PDFs with readings will be distributed prior to the start of each class day.

Supplementary / voluntary
None required.

Mandatory readings before course start
Students should have R and R Studio installed on their laptop prior to the 1st class. Be sure that these are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.

Examination part
80% of the course grade will be based on a project and final report (approximately 5-10 pages), to be delivered within 2-3 weeks after the course in R Notebook format. The project is based on a challenging real-world dataset given to all course participants. The project will be graded based on its use of the methods covered in class as well as making appropriate conclusions from the data.

The remaining 20% of the course grade will be based on participation during in-class discussions and performance during the machine learning competitions. The ML competition winner(s) will receive maximum points, while runners-up will receive a fraction of the points based on effort, innovation, and proximity to the winners’ performance. The performance metrics for this competition will be provided prior to the competition.

Supplementary aids
Students may reference any literature as needed when writing the final report.

Examination content
The primary goal of the final project is for students to gain an ability to solve difficult ML tasks. The project should reflect an understanding of the material covered throughout the week, as well as an ability to apply the material in new and innovative ways.

Literature
Not applicable.