# Foundations of Machine Learning and Regression Methods for Categorical Outcomes

**Prerequisites (knowledge of topic)**

Note that this course is designed for the applied analyst; its focus is on teaching you the tools you need for effective model evaluation, presentation, and interpretation. Support and code for model estimation and post-estimation are provided for R as well as Stata.

At the end of this course, you should have a clear understanding as to which types of models and methods are available to answer different research questions, and also have experience applying a varied toolkit of these models.

**Course content**

Course website: http://www.shawnasmith.net/gserm

*Software & Computing:*

Models for this course are presented in broad strokes; however, a major component of this course is application through model estimation, post-estimation, and interpretation. For pedagogical purposes, I will use Stata 15 in course lectures; however, course support will be provided for both Stata and R. While Stata (like most popular statistical software packages) includes native estimation (and even post-estimation) commands for categorical models, we will also use a set of ado files written for Stata by Scott Long & Jeremy Freese that facilitate the (at times complicated) interpretation of categorical models within Stata. This suite of commands is called SPost. These post-estimation commands can also be emulated in R, although this will require more investigation on the part of the student. A variety of R packages relevant to the course now exist, and we are happy to provide guidance on them when possible. With respect to the machine learning models, we will be making use of several user-written packages in both Stata and R. *If you are taking this course for credit,* **you will need to complete assignments using either Stata and SPost 13 commands or R with appropriate post-estimation commands**.

• Getting Access to Stata/R:

o *Stata:* Access to Stata is available through the GSERM labs. Several versions are also available to purchase at different price points; I am happy to provide guidance as to which would be most appropriate for your needs.

o *R*: R is free. You can download it at https://www.r-project.org/

**RStudio** is a free program that greatly upgrades R’s user interface and can be downloaded at https://www.rstudio.com/

• **Getting Started using Stata:** New to Stata? No worries: this course will catch you up quickly. However, I strongly suggest working through the “Getting Started using Stata” document available on the course website (http://shawnasmith.net/gserm/) prior to Day 1 of class. Feel free to get in touch if you have questions.

o *New to R?* As R has a steeper learning curve, I would not recommend attempting to learn R solely for the purposes of this course. However, I am happy to recommend resources for those of you so inclined:

The two R textbooks recommended under Literature (**Monogan** and **F&W**) provide good introductions for ‘getting started’ with R, as well as lots of in situ code.

Mike Marin (UBC Public Health) has a great series of videos introducing R online at http://www.statslectures.com/index.php/r-stats-videos-tutorials/getting-started-with-r.

• **Downloading Stata packages**: If you will be using Stata on a personal computer, then you will need to install several user-written packages. Here’s the step-by-step:

o *Pre-reqs*: Internet access & administrative privileges

o In Stata, type search {package name} into the command line.

o In the viewer window that appears, click the link for the package

o Follow directions to install

• **Accessing course data and usecda**: Course data will be available for download through the course website. It is also available for use in Stata with the usecda command. usecda is a command written specifically for this course to expedite access to course datasets & examples. It is currently only available for download through Shawna’s GitHub account. To download on any computer:

o Tell Stata where to download the file from by using the following command in the Stata command line or in a do-file:

net from "https://shawnana79.github.io/data"

o Install the program by either: (a) clicking on the blue usecda link that appears in the output following the previous command; or (b) using the command: net install usecda

o Check out the help file by typing help usecda in the Stata command line
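The installation steps above can be collected into a short do-file. This is only a sketch: the `net from`, `net install usecda`, and `help usecda` lines are taken from the steps above, while `spost13_ado` is assumed here as the package name to search for when installing SPost 13 (substitute whichever package you actually need).

```stata
* Setup sketch for a personal computer (run once; needs internet access).

* 1. Find a user-written package, then click its link in the Viewer to install.
*    "spost13_ado" is an assumed package name for the SPost 13 commands.
search spost13_ado

* 2. Point Stata at the course download site (URL from this syllabus).
net from "https://shawnana79.github.io/data"

* 3. Install the course data helper.
net install usecda

* 4. Read the help file before first use.
help usecda
```

Step 1 still requires clicking through the Viewer window, so the do-file is best run interactively rather than in batch mode.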

By nature or by measurement, dependent variables of interest to social and behavioral scientists are frequently categorical. Outcomes that include several ranked or unranked, non-continuous categories (like vote choice, social media platform preference, brand loyalty, and/or condom use) are often of interest, with scientists expressly interested in developing models to *explain* or *classify* variation therein. *Explanatory models* are process-focused, and aim to determine the individual impact of factors that contribute to a particular outcome, often based on *a priori* theory, e.g., “How does social class affect whether an individual voted for the Conservatives in 2019?” *Classification models*, alternatively, are outcome-focused, and aim to identify the set of factors that most accurately classify (or predict) a particular outcome, e.g., “How do the Tories best use information from polls, geography, weather, Twitter feeds, and/or social demographics to predict who voted Conservative in 2019?”

Chances are your research involves a categorical outcome (binary, ordinal, or multinomial), and options thus abound for the modeling approach(es) you might take to address your research question of interest. This course is designed to provide an overview of a number of parametric and non-parametric approaches to exploring your outcome of interest, from both explanatory and classification perspectives.

**Structure**

*N.B.:* The exact content of the course will vary depending on the background & interests of participants. In other words, this schedule is subject to change.

*Topic on Monday:*

• Overview of class; introduction to models; some vocabulary (Pt 1)

**Explanatory Models**

• The 30-Minute Review of linear regression; Identification; Maximum Likelihood Estimation

• Linear probability model; Identification of Pr(y=1); Two philosophies: transformational and latent variable approach for binary outcomes

• Estimation of BRM; Odds ratios

• Using Pr(y=1) to interpret the BRM (pt. 1): tables & plots; discrete change

**Suggested Readings:**

• **Long** Ch. 1

• **HT&J** Ch. 1

• **JWH&T** Ch. 1

• **Long** Ch. 2; **P&X** Ch. 2; **L&F** Ch. 1-2; **F&W** Ch. 1-2 or **Monogan** Ch. 1-2 (R)

• **Long** Ch. 3; **P&X** Ch. 1

**Due:** A1: Math Review

*Topic on Tuesday:*

• Using Pr(y=1) to interpret the BRM (pt. 2): plots; difference at means vs. mean of difference; partial change/margins

• Hypothesis testing; Wald and LR tests; Confidence intervals

• Scalar measures of fit: pseudo-R2, AIC, BIC

• Ordinal variables; a latent variable model

**Suggested Readings:**

• **Long** Ch. 4

• **Long** Ch. 5; **P&X** Ch. 7

*Topic on Wednesday:*

• Estimation of ORM; latent variable interpretations; Pr(y=k)

• Odds ratios; parallel regression assumption and proportional odds

• Multinomial logit as a set of BLMs

• Calculating predicted probabilities; Interpretation using Pr(y=k)

• Odds ratio plots; Discrete change plots

• Tests for the MNLM; IIA

**Suggested Readings:**

• **Long** Ch. 6; **P&X** Ch. 8

*Topic on Thursday:*

**Classification Models**

• Overview of classification models; some vocabulary (Pt. 2)

• Using Pr(y=1) to interpret the BRM (pt. 3): AUC, ROC, penalization

• Strengths & limitations of BRM for classification

• Introduction to partition-based models for classification

• CART models: Testing, evaluating, improving

• Random forests: Testing, evaluating, improving

• Strengths & limitations of CART methods for classification

**Suggested Readings:**

• **JWH&T** Ch. 4.1-3.4, 5; **HT&J** Ch. 4.4

• **JWH&T** Ch. 8; **HT&J** Ch. 9.1-9.3

**Due:** A2: BRM + T&F

*Topic on Friday:*

• Introduction to semi-linear models for classification

• k-NN models: Testing, evaluating, improving

• Strengths & limitations of k-NN methods for classification

• Support Vector Machine (SVM) models: Testing, evaluating, improving

• Strengths & limitations of SVM models for classification

• Course wrap-up/other topics on request, as time allows

**Suggested Readings:**

• **JWH&T** Ch. 4.6; **HT&J** Ch. 2.3; 13.1, 13.3

• **JWH&T** Ch. 9; **HT&J** Ch. 12.1-12.3

*Sunday:*

A3: ORM & MNLM and A4: Classification Methods *due via email to* shawnana@umich.edu; include “**GSERMCat:**” in subject line

The first half of this course will focus on explanatory methods, with an emphasis on **regression methods for categorical outcomes**. Although regression models for categorical outcomes are often conceptualized as extensions of linear regression models (i.e., ‘generalized linear models’), categorical outcomes violate key assumptions of the simple linear regression framework, and thus require both alternative estimation strategies and additional identification assumptions. These assumptions have implications for model interpretation, notably interpreting coefficients, comparing coefficients, testing for significance, and assessing model fit. We will begin with deriving the logit and probit models for use with binary outcomes, and also introduce a variety of post-estimation tools for interpreting effects of predictor variables on binary outcomes. We will then extend these models and methods of interpretation from binary to ordinal outcomes using the ordinal logit and probit models, and multinomial outcomes with the multinomial logit model. Methods for examining model fit and evaluating significance tests will also be discussed.
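As a compact reference for the binary models named above, the logit and probit can be sketched in latent-variable form. The notation ($y^*$, $\mathbf{x}$, $\boldsymbol{\beta}$, $\varepsilon$, $F$) is an assumption made for this sketch and may differ from the lecture notes.

$$
\begin{aligned}
y_i^* &= \mathbf{x}_i\boldsymbol{\beta} + \varepsilon_i,
\qquad
y_i =
\begin{cases}
1 & \text{if } y_i^* > 0 \\
0 & \text{if } y_i^* \le 0
\end{cases} \\[4pt]
\Pr(y_i = 1 \mid \mathbf{x}_i) &= F(\mathbf{x}_i\boldsymbol{\beta}),
\qquad
F(z) =
\begin{cases}
\Lambda(z) = \dfrac{e^{z}}{1 + e^{z}} & \text{(logit)} \\[6pt]
\Phi(z) & \text{(probit)}
\end{cases} \\[4pt]
\text{Logit odds: }\;
\frac{\Pr(y_i = 1 \mid \mathbf{x}_i)}{1 - \Pr(y_i = 1 \mid \mathbf{x}_i)} &= e^{\mathbf{x}_i\boldsymbol{\beta}}
\quad\Longrightarrow\quad
\text{a unit increase in } x_k \text{ multiplies the odds by } e^{\beta_k}.
\end{aligned}
$$

Because $y^*$ is unobserved, its scale must be fixed by assumption: the usual conventions are $\operatorname{Var}(\varepsilon) = \pi^2/3$ for the logit and $\operatorname{Var}(\varepsilon) = 1$ for the probit. This is one reason coefficients cannot be compared naively across the two models.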

In the second half of the course, we will turn our attention to methods for **classification and prediction**. We will begin by re-examining what we’ve already learned (namely, logit models for categorical outcomes) and discussing the strengths and weaknesses of these models for prediction and classification. These familiar models will also be used to introduce the concepts of training a model, evaluating model performance, and improving model performance. As before, our focus will be on the binary case, with extensions for ordinal and multinomial cases. We will then move on to partition-based models, specifically Classification and Regression Tree (CART) models and random forests, followed by semi-linear models, namely k-Nearest Neighbor (k-NN) and Support Vector Machines (SVM). The focus of these approaches will be on classification and prediction for binary outcomes, but extensions for outcomes with multiple categories will also be presented.
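To make "evaluating model performance" concrete for the binary case, one common setup is sketched below. The symbols ($\hat{p}_i$ for a predicted probability, $c$ for a classification cutoff, and the confusion-matrix counts $TP$, $FP$, $TN$, $FN$) are assumptions introduced for this sketch, not notation taken from the course materials.

$$
\begin{aligned}
\hat{y}_i &= \mathbf{1}\!\left[\hat{p}_i \ge c\right] \\[4pt]
\text{TPR}(c) &= \frac{TP}{TP + FN} \quad \text{(sensitivity)},
\qquad
\text{FPR}(c) = \frac{FP}{FP + TN} \quad \text{(1 $-$ specificity)}
\end{aligned}
$$

The ROC curve plots $\text{TPR}(c)$ against $\text{FPR}(c)$ as the cutoff $c$ is varied, and the AUC is the area under that curve: 0.5 corresponds to classification no better than chance, and 1 to perfect separation.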

**Literature**

**Mandatory**

*Lecture Notes for Foundations of Machine Learning and Regression Methods for Categorical Outcomes.* This coursepack contains copies of the overheads for the lectures, data set codebooks, and materials used in the computing lab. It will be provided at the beginning of our first class session. Be sure to bring these notes to all lecture and lab sessions.

• For participants who prefer electronic versions, component parts are also available on the course website.

**Supplementary / voluntary**

*Explanatory models*

Long, J. Scott. 1997. *Regression Models for Categorical and Limited Dependent Variables.* Thousand Oaks, CA: Sage. *Hereafter*: **Long**

Powers, Daniel A. & Yu Xie. 2008. *Statistical Methods for Categorical Data Analysis*. 2nd Edition. Bingley, UK: Emerald Press. *Hereafter*: **P&X**

**For the Stata devotees:** Long, J. Scott & Jeremy Freese. 2014. *Regression Models for Categorical Dependent Variables Using Stata.* 3rd Edition. College Station, TX: Stata Press. *Hereafter*: **L&F**

**Or if you like R:** I’m still searching for my favorite here, but a couple of good ones are:

• Monogan, James E. III. 2015. *Political Analysis Using R.* New York, NY: Springer. *Hereafter*: **Monogan**.

• Fox, John & Sanford Weisberg. 2010. *An R Companion to Applied Regression.* Thousand Oaks, CA: Sage. *Hereafter*: **F&W**.

*Classification models*

Hastie, T., Tibshirani, R., & Friedman, J. 2009. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction.* 2nd Edition. New York: Springer. *Hereafter*: **HT&J**

James, G., Witten, D., Hastie, T., & Tibshirani, R. 2013. *An Introduction to Statistical Learning.* New York: Springer. *Hereafter*: **JWH&T**

**Examination part**

Decentral – examination paper written at home (individual) 100%

Grading

Participants’ overall grades are based on completion of four assignments weighted as follows:

• **A1 Math Review:** 1/6

• **A2 BRM + T&F:** 2/6

• **A3 ORM & MNLM:** 1/6

• **A4 Classification Methods:** 2/6

Literature

None.

**Additional Course Information**

*Getting Help*

I am available to provide feedback or answer questions during lunch breaks & after course hours. Due to the compressed nature of this course (& my desire for you all to digest as much material as possible!), I encourage you to bring up questions or concerns early & often. If you would like to discuss questions or concerns related to the methods presented here for a particular paper or thesis, I would encourage you to make an appointment to meet before or after lecture one day, or during the lunch break.

• I can also be reached by email both during & after this course at shawnana@umich.edu. Ensure a prompt response to your email by prefacing your subject with **“GSERMCat:”**.

**Academic Integrity**

It is not possible for us to have an intellectual community without honor. I expect that you demonstrate respect by recognizing the labor of those who create intellectual products. Academic dishonesty (including cheating and plagiarism) will not be tolerated and will be dealt with according to university policy.