Computational Statistics
Prerequisites (knowledge of topic)
Advanced knowledge of statistics and econometrics (gained, for example, by following the relevant courses in a master's program in quantitative methods, economics, or finance).
Hardware
Individual laptop (no particular requirements).
Software
Examples and code are shown using the R software (freely downloadable from https://www.r-project.org/).
Course Content
Computational Statistics is the area of specialization within statistics that includes statistical visualization and other computationally intensive methods for mining large, non-homogeneous, multidimensional data sets in order to discover knowledge in the data. As in all areas of statistics, probability models are important, and results are qualified by statements of confidence or probability. An important activity in computational statistics is model building and evaluation.
First, basic multiple linear regression is reviewed. Then, some nonparametric procedures for regression and classification are introduced and explained. In particular, kernel estimators, smoothing splines, classification and regression trees, additive models, projection pursuit and, finally, neural nets will be considered; some of these have a straightforward interpretation, while others are useful for obtaining good predictions.
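As a small taste of these methods, the following sketch in base R (data and names are illustrative) contrasts the parametric linear regression benchmark with a nonparametric kernel regression smoother:

```r
# Simulated data: a nonlinear signal with Gaussian noise
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# Parametric benchmark: simple linear regression
fit_lm <- lm(y ~ x)

# Nonparametric alternative: Nadaraya-Watson kernel smoother
fit_ks <- ksmooth(x, y, kernel = "normal", bandwidth = 1)

# The kernel estimator tracks the curvature that the linear fit misses
plot(x, y, col = "grey")
abline(fit_lm, col = "red")
lines(fit_ks, col = "blue")
```

The bandwidth plays the role of the model-complexity parameter: small values give a wiggly, low-bias/high-variance fit, large values approach the linear benchmark.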
The main problems arising in computational statistics, such as the curse of dimensionality, will be discussed. Moreover, the goodness of a given (complex) model for estimation and prediction is analyzed using resampling, bootstrap and cross-validation techniques.
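The resampling ideas can be sketched in a few lines of base R; the data and the 5-fold split below are illustrative assumptions:

```r
set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.5)

# Bootstrap: resample the data with replacement to estimate the
# sampling variability of the slope coefficient
boot_slopes <- replicate(1000, {
  idx <- sample(length(x), replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[2]
})
sd(boot_slopes)  # bootstrap standard error of the slope

# 5-fold cross-validation estimate of prediction error
folds <- sample(rep(1:5, length.out = length(x)))
cv_mse <- mean(sapply(1:5, function(k) {
  fit <- lm(y ~ x, subset = folds != k)
  mean((y[folds == k] - predict(fit, data.frame(x = x[folds == k])))^2)
}))
```

The same two recipes apply unchanged to the more complex models treated later in the course: only the call to `lm` is swapped out.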
Structure
Outline
1. Overview of supervised learning
Introductory examples, two simple approaches to prediction, statistical decision theory, local methods in high dimensions, structured regression models, bias-variance trade-off, multiple testing and use of p-values.
2. Linear methods for regression
Multiple regression, analysis of residuals, subset selection and coefficient shrinkage.
3. Methods for classification
Bayes classifier, linear regression of an indicator matrix, discriminant analysis, logistic regression.
4. Nonparametric density estimation and regression
Histogram, kernel density estimation, kernel regression estimator, local polynomial nonparametric regression estimator, smoothing splines and penalized regression.
5. Model assessment and selection
Bias, variance and model complexity, bias-variance decomposition, optimism of the training error rate, AIC and BIC, cross-validation, bootstrap methods.
6. Flexible regression and classification methods
Additive models; multivariate adaptive regression splines (MARS); neural networks; projection pursuit regression; classification and regression trees (CART).
7. Bagging and Boosting
The bagging algorithm, bagging for trees, subagging, the AdaBoost procedure, steepest descent and gradient boosting.
8. Introduction to the idea of a Superlearner
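A minimal sketch of bagging for regression trees (from the Bagging and Boosting topic above), assuming the `rpart` package that ships with standard R distributions; the simulated data and the choice of 50 trees are illustrative:

```r
# Bagging: average trees grown on bootstrap samples to reduce variance
library(rpart)
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

B <- 50  # number of bootstrap trees
preds <- replicate(B, {
  boot <- dat[sample(n, replace = TRUE), ]  # bootstrap sample
  fit <- rpart(y ~ x, data = boot)          # grow a tree on it
  predict(fit, dat)                         # predict on the full data
})

# The bagged prediction averages the B trees
bagged <- rowMeans(preds)
```

A single tree gives a piecewise-constant, high-variance fit; averaging the bootstrap trees smooths it out, which is the point of the bagging algorithm.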
Structure (Chapters refer to the outline above)
Days 1 and 2: Chapters 1, 2, and 3
Day 3: Chapter 5
Day 4: Chapter 4
Days 5 and 6: Chapters 6, 7, and 8.
Literature
Mandatory
F. Audrino, Lecture Notes (can be downloaded from StudyNet or requested directly from the lecturer).
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, Springer, New York.
Supplementary / voluntary
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
van der Laan, M.J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.
Moreover, references to related published papers will be given during the course.
Additional online resources:
A complete version of the main reference book can be downloaded online: http://statweb.stanford.edu/~hastie/ElemStatLearn/
Moreover, the R package for the examples in the book is available: https://cran.r-project.org/web/packages/ElemStatLearn/ElemStatLearn.pdf
The webpage of the book on Targeted Learning: http://www.targetedlearningbook.com/
https://stat.ethz.ch/education/semesters/ss2015/CompStat (largely overlapping Computational Statistics class taught at ETH Zürich)
R software information and download: https://www.r-project.org/
Online course by Hastie and Tibshirani on Statistical Learning:
Official course at Stanford Online: https://lagunita.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
Quicker access to the videos: http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
Website of an introductory book related to the course: http://www-bcf.usc.edu/~gareth/ISL/index.html
Mandatory readings before course start

Examination part
Decentral: 100% group examination paper (term paper). In line with St. Gallen quality standards, an individual examination paper is also possible.
Supplementary aids
The examination paper consists of the analysis of a data set chosen by the students, involving the methods learned in the lecture.
Examination content
The whole outline of the lecture described above.
Literature
Audrino, Lecture Notes.