Statistical Machine Learning with R
2025-09-09
Preface
Welcome to Statistical Machine Learning with R! I started this project during the summer of 2018 while preparing for the Stat 432 course (Basics of Statistical Learning). At that time, our faculty member Dr. David Dalpiaz had decided to move to The Ohio State University (although he moved back to UIUC later on). David introduced me to this awesome way of publishing a website on GitHub, which is a very efficient approach for developing course materials. Since I have also taught Stat 542 (Statistical Learning) for several years, I figured it could be beneficial to integrate what I have into this existing book by David and use it as the R material for both courses. For Stat 542, the main focus is to learn the numerical optimization behind these learning algorithms and to become familiar with the theoretical background. As you can tell, I am not being very creative with the name, so SMLR it is. You can find the source files of this book on my GitHub. Recently, I started a new course, Stat 546 (Machine Learning in Data Science), which focuses more on RKHS, random forests, and reinforcement learning, and I am adding more materials to this book.
Target Audience
This book is suitable for students ranging from advanced undergraduates to first- or second-year Ph.D. students who have prior knowledge of statistics, although a student at the master's level will likely benefit most from the material. Previous experience with basic mathematics (mainly linear algebra), statistical modeling (such as linear regression), and R is assumed.
What’s Covered?
This book currently covers the following topics:
- Basic Knowledge
- R, RStudio, and R Markdown
- Linear regression and linear algebra
- Numerical optimization basics
- Model Selection and Regularization in Linear Models
- Ridge regression
- Lasso
- Spline
- Classification models
- Logistic regression
- Discriminant analysis
- Nonparametric Models with Local Smoothing
- K-nearest neighbor
- Kernel smoothing
- Kernel Methods and RKHS
- Support vector machine
- RKHS
- Kernel ridge regression
- Tree and Ensemble Models
- Tree models
- Random forests
- Boosting
- Unsupervised Learning
- K-means
- Hierarchical clustering
- PCA
- Self-organizing maps
- Spectral clustering
- UMAP
The goal of this book is to introduce not only how to run some of the popular statistical learning models in R, but also the algorithms and programming techniques for solving these models, along with some of the fundamental statistical theory behind them. For example, for graduate students, these topics will be discussed in more detail:
- Optimization
- Lagrangian
- Primal vs. dual
- EM and MM algorithms
- Bias-variance trade-off in
- Linear regression
- KNN
- Kernel density estimation
- Kernel Trick and RKHS
- Representer Theorem
- SVM
- Spline
This book is under active development. Hence, you may encounter errors ranging from typos to broken code to poorly explained topics. If you do, please let me know! Simply send an email to rqzhu@illinois.edu and I will make the changes as soon as possible. Or, if you know R Markdown and are familiar with GitHub, make a pull request and fix an issue yourself! These contributions will be acknowledged.
Acknowledgements
The initial contents are derived from Dr. David Dalpiaz's book. My STAT 542 course materials are also inspired by Dr. Feng Liang and Dr. John Marden, who developed earlier versions of this course. I have also incorporated many online resources, too many to compile into a comprehensive list. If you think I missed some references, please let me know.
License
