Preface

Welcome to Statistical Learning and Machine Learning with R! I started this project during the summer of 2018 while preparing for the Stat 432 course. At that time, our faculty member Dr. David Dalpiaz had decided to move to The Ohio State University (although he later moved back to UIUC). David introduced me to this awesome way of publishing a website on GitHub, which is a very efficient approach for developing courses. Since I have also taught Stat 542 (Statistical Learning) for several years, I figured it could be beneficial to integrate what I have into this existing book by David and use it as the R material for both courses. For Stat 542, the main focus is to learn the numerical optimization behind these learning algorithms and to become familiar with the theoretical background. As you can tell, I am not being very creative with the name, so SMLR it is. You can find the source files of this book on my GitHub.

Target Audience

This book is suitable for students ranging from advanced undergraduates to first- or second-year Ph.D. students who have prior knowledge of statistics, although a student at the master's level will likely benefit most from the material. Previous experience with basic mathematics (mainly linear algebra), statistical modeling (such as linear regression), and R is assumed.

What’s Covered?

This book currently covers the following topics:

Basic Knowledge
- R, RStudio and R Markdown
- Linear regression and linear algebra
- Numerical optimization
Penalized linear models and model selection
Nonlinear and Nonparametric Models
- Splines
- K-nearest neighbors
- Kernel smoothing
Classification models
- Logistic regression
- Discriminant analysis
Machine Learning Models
- Support vector machine
- Kernel ridge regression
- Tree models
- Random forests
- Boosting
Unsupervised Learning
- K-means
- Hierarchical clustering
- PCA
- Self-organizing map
- Spectral clustering
- UMAP

The goal of this book is not only to introduce how to run some of the popular statistical learning models in R, but also to cover the algorithms and programming techniques for solving these models and some of the fundamental statistical theory behind them. For example, for graduate students, these topics will be discussed in more detail:

Optimization
- Lagrangian
- Primal vs. dual
EM and MM algorithms
Bias-variance trade-off in
- Linear regression
- KNN
- Kernel density estimation
Kernel Trick and RKHS
Representer Theorem
- SVM
- Spline

Within each section, the difficulty gradually increases from an undergraduate level to a graduate level.

This book serves as a supplement to An Introduction to Statistical Learning (James et al. 2013) for STAT 432 - Basics of Statistical Learning and to The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie, Tibshirani, and Friedman 2001) for STAT 542 - Statistical Learning at the University of Illinois at Urbana-Champaign.

This book is under active development. Hence, you may encounter errors ranging from typos to broken code, to poorly explained topics. If you do, please let me know! Simply send an email and I will make the changes as soon as possible (rqzhu AT illinois DOT edu). Or, if you know R Markdown and are familiar with GitHub, make a pull request and fix an issue yourself! These contributions will be acknowledged.

Acknowledgements

The initial contents are derived from Dr. David Dalpiaz’s book. My STAT 542 course materials are also inspired by Dr. Feng Liang and Dr. John Marden, who developed earlier versions of this course. I have also incorporated many online resources, which I cannot put into a comprehensive list. If you think I missed some references, please let me know.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.