Machine Learning with R – Advanced
Prerequisites (knowledge of topic)
This course is a continuation of Introductory Machine Learning with R and assumes a basic knowledge of several machine learning classification methods. Students with equivalent experience (from other ML courses or on the job) are also welcome.
A laptop computer is required to complete the in-class exercises.
R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.
With machine learning, it is often difficult to make the leap from classroom examples to the real world. Real-world applications often present challenges that require more advanced approaches for preparing, exploring, modeling, and evaluating the data. The goal of this course is to prepare students to independently apply machine learning methods to their own tasks. We will cover practical techniques that are not often found in textbooks but are discovered through hands-on experience. We will practice these techniques by simulating a machine learning competition like those found on Kaggle (https://www.kaggle.com/). The target audience includes students who are interested in applying ML knowledge to more difficult problems and in learning more advanced techniques to improve the performance of traditional ML methods.
The course is designed to be interactive, with ample time for hands-on practice. Each day will include at least one lecture on the day’s topic, in addition to a hands-on “lab” section for applying what was learned to a competition dataset (or one’s own data).
The tentative schedule is as follows:
Day 1: Handling messy data
Discussion: Typically, 80% of the time spent on ML goes to data preparation. Why?
Lecture: Learning to explore data
Lecture: Missing values – imputation and other strategies
Lecture: The R data pipeline – tidyverse
Lab: Getting to know your data
Day 2: Understanding ML performance
Discussion: What makes a successful ML model?
Lecture: Getting beyond accuracy – other performance measures
Lecture: The “no free lunch” theorem
Lecture: Estimating future performance – sampling methods, model selection
Lab: Comparing models on your dataset with ROC curves
Day 3: Improving ML performance
Discussion: What factors keep ML models from perfect prediction?
Lecture: Tuning stock models – automated parameter tuning
Lecture: Meta-learning – ensembles, stacked models
Lab: Machine Learning Competition (Round 1)
Day 4: “Big data” problems
Discussion: Is more data always better? Why or why not?
Lecture: The curse of dimensionality – dimensionality reduction, t-SNE
Lecture: Imbalanced datasets – under- and over-sampling strategies
Lecture: Improving R’s performance on big data
Lab: Machine Learning Competition (Round 2)
Day 5: Next-generation “Black Box” methods
Discussion: What are the strengths and weaknesses of man versus machine?
Lecture: Deep Learning – Keras, TensorFlow
Lecture: Text embeddings – word2vec
Lecture: Cluster computing – use cases of Hadoop, Spark, etc.
Discussion: Results of ML Competition – winners’ tips and tricks
Lab: Work on your final project
Machine Learning with R (2nd ed.) by Brett Lantz (2015). Packt Publishing.
Supplementary / voluntary
Mandatory readings before course start
Students should have R and RStudio installed on their laptop prior to the first class. Be sure that both are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
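One way to confirm that packages can be found and installed is a short check run from the R console. The following is a minimal sketch, not an official course script: the helper name missing_packages and the package list (tidyverse, caret) are assumptions chosen for illustration, and the actual packages used in class may differ.

```r
# Hypothetical helper: return the names of packages that are NOT installed.
# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Example package list; an assumption, not the official course list.
course_pkgs <- c("tidyverse", "caret")
to_install <- missing_packages(course_pkgs)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All listed packages are available.")
}
```

If the install step fails, the error message usually indicates whether the cause is a network problem, a missing system library, or an outdated version of R.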
80% of the course grade will be based on a project and final report (approximately 5-10 pages), to be delivered within 2-3 weeks after the course in R Notebook format. It will be graded on its use of the methods covered in class and on whether it draws appropriate conclusions from the data. The project is intended to evaluate the ability to apply in-class learning to a real-world topic of one’s own choosing. Students should feel free to use a project related to their career or field of study. For example, one may use this opportunity to advance dissertation research or complete a task for a job. The exact scoring criteria for this assignment will be provided on the first day of class.
The remaining 20% of the course grade will be based on participation during in-class discussions and during the machine learning competitions. The ML competition winner(s) will receive maximum points, while runners-up will receive a fraction of the points based on effort, innovation, and proximity to the winners’ performance. The performance metrics for this competition will be provided prior to the competition.
Students may reference any literature as needed when writing the final report.
The final report should illustrate an ability to apply advanced machine learning methods to a topic of the student’s choosing. The student should explain the methods applied and evaluate the machine learning model’s performance for the chosen task.