Computational Statistics

Prerequisites (knowledge of topic)

Advanced knowledge of statistics and econometrics (gained, for example, by following the relevant courses in a master's programme in quantitative methods, economics or finance).

 

Hardware

Individual laptop (no particular requirements).

 

Software

Examples and code are shown using the R software (freely downloadable from https://www.r-project.org/).

 

Course Content

Computational Statistics is the area of specialization within statistics that includes statistical visualization and other computationally intensive methods for mining large, nonhomogeneous, multi-dimensional datasets so as to discover knowledge in the data. As in all areas of statistics, probability models are important, and results are qualified by statements of confidence or of probability. An important activity in computational statistics is model building and evaluation.

First, basic multiple linear regression is reviewed. Then, some nonparametric procedures for regression and classification are introduced and explained. In particular, kernel estimators, smoothing splines, classification and regression trees, additive models, projection pursuit and, finally, neural networks will be considered; some of these have a straightforward interpretation, while others are mainly useful for obtaining good predictions.
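To give a flavour of these procedures in practice, a minimal R sketch follows (illustrative only, not part of the official course material). It fits a kernel regression estimator, a smoothing spline and a regression tree to simulated data, using the standard functions ksmooth and smooth.spline from base R and rpart from the CRAN package of the same name; the data-generating process and all tuning choices are purely hypothetical.

# Minimal illustrative sketch: three nonparametric regression fits on simulated data
set.seed(1)
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)            # true signal sin(x) plus noise
dat <- data.frame(x = x, y = y)

# Kernel regression estimator (Nadaraya-Watson with a Gaussian kernel)
fit_kernel <- ksmooth(x, y, kernel = "normal", bandwidth = 1)

# Smoothing spline (penalized regression; smoothing parameter chosen by GCV)
fit_spline <- smooth.spline(x, y)

# Regression tree (CART), assuming the 'rpart' package is installed
library(rpart)
fit_tree <- rpart(y ~ x, data = dat)

# Compare the three fits visually
plot(x, y, col = "grey", main = "Kernel regression, smoothing spline and CART fits")
lines(fit_kernel, col = "red", lwd = 2)
lines(predict(fit_spline, x), col = "blue", lwd = 2)
lines(x, predict(fit_tree, newdata = data.frame(x = x)), col = "darkgreen", lwd = 2)
legend("topright", c("kernel", "smoothing spline", "CART"),
       col = c("red", "blue", "darkgreen"), lwd = 2)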

The main problems arising in computational statistics, such as the curse of dimensionality, will be discussed. Moreover, the quality of a given (complex) model for estimation and prediction is assessed using resampling, bootstrap and cross-validation techniques.
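As an illustration of how such an assessment can be carried out in R, the following minimal sketch (again illustrative only, with simulated data and hypothetical settings such as K = 5 folds and B = 500 bootstrap replications) estimates the prediction error of a simple linear model by 5-fold cross-validation and the standard error of its slope by a nonparametric bootstrap.

# Minimal illustrative sketch: cross-validation and bootstrap on simulated data
set.seed(2)
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# 5-fold cross-validated estimate of the prediction mean squared error
K <- 5
folds <- sample(rep(1:K, length.out = n))   # random fold assignment
cv_mse <- mean(sapply(1:K, function(k) {
  fit <- lm(y ~ x, data = dat[folds != k, ])                 # fit on K-1 folds
  mean((dat$y[folds == k] -
          predict(fit, newdata = dat[folds == k, ]))^2)      # test on held-out fold
}))

# Nonparametric bootstrap: resample the observations with replacement and refit
B <- 500
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))["x"]
})

cv_mse              # cross-validated prediction error
sd(boot_slopes)     # bootstrap standard error of the slope estimate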

Structure

Outline

  1. Overview of supervised learning
    Introductory examples, two simple approaches to prediction, statistical decision theory, local methods in high dimensions, structured regression models, bias-variance tradeoff, multiple testing and use of p-values.
  2. Linear methods for regression
    Multiple regression, analysis of residuals, subset selection and coefficient shrinkage.
  3. Methods for classification
    Bayes classifier, linear regression of an indicator matrix, discriminant analysis, logistic regression.
  4. Nonparametric density estimation and regression
    Histogram, kernel density estimation, kernel regression estimator, local polynomial nonparametric regression estimator, smoothing splines and penalized regression.
  5. Model assessment and selection
    Bias, variance and model complexity, bias-variance decomposition, optimism of the training error rate, AIC and BIC, cross-validation, bootstrap methods.
  6. Flexible regression and classification methods
    Additive models; multivariate adaptive regression splines (MARS); neural networks; projection pursuit regression; classification and regression trees (CART).
  7. Bagging and Boosting
    The bagging algorithm, bagging for trees, subagging, the AdaBoost procedure, steepest descent and gradient boosting (an illustrative R sketch of bagging follows this outline).
  8. Introduction to the idea of a Superlearner
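The following minimal R sketch (illustrative only; it assumes the rpart package is installed and uses simulated data with a hypothetical choice of B = 100 bootstrap trees) shows the basic idea of bagging regression trees from topic 7: grow one tree per bootstrap sample of the data and average their predictions.

# Minimal illustrative sketch: bagging regression trees on simulated data
library(rpart)
set.seed(3)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

B <- 100                                     # number of bootstrap trees
x_grid <- data.frame(x = seq(0, 10, length.out = 100))
preds <- sapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)           # bootstrap sample of the data
  tree <- rpart(y ~ x, data = dat[idx, ])    # grow a tree on the bootstrap sample
  predict(tree, newdata = x_grid)
})
bagged_fit <- rowMeans(preds)                # average the B tree predictions

plot(dat$x, dat$y, col = "grey", main = "Bagged regression trees")
lines(x_grid$x, bagged_fit, col = "red", lwd = 2)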

 

Structure (Chapters refer to the outline above)

Days 1 and 2:  Chapters 1, 2, and 3

Day 3:  Chapter 5

Day 4:  Chapter 4

Days 5 and 6:  Chapters 6, 7, and 8.

 

Literature

 

Mandatory

Audrino, F., Lecture Notes (can be downloaded from Studynet or requested directly from the lecturer).

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics, Springer, New York.

 

Supplementary / voluntary

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

van der Laan, M.J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.

Moreover, references to related published papers will be given during the course.

 

Additional online resources:

 

A complete version of the main reference book can be downloaded online: http://statweb.stanford.edu/~hastie/ElemStatLearn/

Moreover, the R package for the examples in the book is available: https://cran.r-project.org/web/packages/ElemStatLearn/ElemStatLearn.pdf

 

The web page of the book on Targeted Learning: http://www.targetedlearningbook.com/

 

https://stat.ethz.ch/education/semesters/ss2015/CompStat (a largely overlapping Computational Statistics course taught at ETH Zürich)

 

R software information and download: https://www.r-project.org/

 

Online course by Hastie and Tibshirani on Statistical Learning:

Official course at Stanford Online: https://lagunita.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about

Quicker access to the videos: http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/

Website of an introductory book related to the course: http://www-bcf.usc.edu/~gareth/ISL/index.html

 

Mandatory readings before course start

-

 

Examination part

Decentral: 100% group examination paper (term paper). In line with St. Gallen quality standards, an individual examination paper is possible.

 

Supplementary aids

The examination paper consists of the analysis of a data set chosen by the students, applying the methods learned in the lecture.

 

Examination content

The entire content of the lecture, as outlined above.

 

Literature

Audrino, Lecture Notes.