Machine Learning with R – Advanced

Prerequisites (knowledge of topic)

This course is a continuation of Introductory Machine Learning with R and assumes a basic knowledge of at least several machine learning classification methods. Students with equivalent real-world experience (from other ML courses or on-the-job work) are also welcome.

 

Hardware

A laptop computer is required to complete the in-class exercises.

 

Software

R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.

 

Course content

With machine learning, it is often difficult to make the leap from classroom examples to the real world. Real-world applications often present challenges that require more advanced approaches for preparing, exploring, modeling, and evaluating the data. The goal of this course is to prepare students to independently apply machine learning methods to their own tasks. We will cover practical techniques that are rarely found in textbooks but are instead discovered through hands-on experience. We will practice these techniques by simulating a machine learning competition like those found on Kaggle (https://www.kaggle.com/). The target audience includes students who are interested in applying ML knowledge to more difficult problems and in learning more advanced techniques to improve the performance of traditional ML methods.

 

Structure

The course is designed to be interactive, with ample time for hands-on practice. Each day will include at least one lecture on the day’s topic, in addition to a hands-on “lab” section in which students apply what they have learned to a competition dataset (or their own data).

 

The tentative schedule is as follows:

 

Day 1: Handling messy data

Discussion: Typically, 80% of the time spent on ML is for data preparation. Why?

Lecture: Learning to explore data

Lecture: Missing values – imputation and other strategies

Lecture: The R data pipeline – tidyverse

Lab: Getting to know your data

 

Day 2: Understanding ML performance

Discussion: What makes a successful ML model?

Lecture: Getting beyond accuracy – other performance measures

Lecture: The “no free lunch” theorem

Lecture: Estimating future performance – sampling methods, model selection

Lab: Comparing models on your dataset with ROC curves

 

Day 3: Improving ML performance

Discussion: What factors keep ML models from perfect prediction?

Lecture: Tuning stock models – automated parameter tuning

Lecture: Meta-learning – ensembles, stacked models

Lab: Machine Learning Competition (Round 1)

 

Day 4: “Big data” problems

Discussion: Is more data always better? Why or why not?

Lecture: The curse of dimensionality – dimensionality reduction, t-SNE

Lecture: Imbalanced datasets – under- and over-sampling strategies

Lecture: Improving R’s performance on big data

Lab: Machine Learning Competition (Round 2)

 

Day 5: Next-generation “Black Box” methods

Discussion: What are the strengths and weaknesses of man versus machine?

Lecture: Deep Learning – Keras, TensorFlow

Lecture: Text embeddings – word2vec

Lecture: Cluster computing – use cases of Hadoop, Spark, etc.

Discussion: Results of ML Competition – winners’ tips and tricks

Lab: Work on your final project

 

Literature

 

Mandatory

Machine Learning with R (2nd ed.) by Brett Lantz (2015). Packt Publishing.

 

Supplementary / voluntary

None required.

 

Mandatory readings before course start

Students should have R and RStudio installed on their laptops prior to the first class. Be sure that both are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
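As a rough way to check the setup described above, the following sketch can be run in the R console. It prints the R version and reports whether a few packages are already installed. The package names are an illustrative assumption drawn from topics in the schedule (tidyverse, Keras), not an official list of course requirements.

```r
# Report which version of R is running.
cat("R version:", R.version.string, "\n")

# Illustrative package names only (an assumption, not an official course list).
pkgs <- c("tidyverse", "keras")

# requireNamespace() returns TRUE if a package is installed and loadable,
# FALSE otherwise; quietly = TRUE suppresses its messages.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# To install anything reported as FALSE, uncomment the next line.
# This needs an internet connection and is the real test that
# external packages can be installed.
# install.packages(names(installed)[!installed])
```

If any package shows FALSE and the install.packages() call fails, it is worth resolving the problem before the first class rather than during the lab.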

 

Examination part

80% of the course grade will be based on a project and final report (approximately 5-10 pages), to be delivered in R Notebook format within 2-3 weeks after the course. The report will be graded on its use of the methods covered in class and on whether it draws appropriate conclusions from the data. The project is intended to evaluate the ability to apply what was learned in class to a real-world topic of one’s own choosing. Students should feel free to use a project related to their career or field of study. For example, students may use this opportunity to advance their dissertation research or complete a task for their job. The exact scoring criteria for this assignment will be provided on the first day of class.

 

The remaining 20% of the course grade will be based on participation during in-class discussions and during the machine learning competitions. The ML competition winner(s) will receive maximum points, while runners-up will receive a fraction of the points based on effort, innovation, and proximity to the winners’ performance. The performance metrics for this competition will be provided prior to the competition.

 

Supplementary aids

Students may reference any literature as needed when writing the final report.

 

Examination content

The final report should illustrate an ability to apply advanced machine learning methods to a topic of the student’s choosing. The student should explain the methods applied and evaluate the machine learning model’s performance for the chosen task.

 

Literature

Not applicable.