Statistical Machine Learning with R
2025-09-09
Preface
Welcome to Statistical Machine Learning with R! I started this project during the summer of 2018 while preparing for the Stat 432 course (Basics of Statistical Learning). At that time, our faculty member Dr. David Dalpiaz had decided to move to The Ohio State University (although he moved back to UIUC later on). David introduced me to this awesome way of publishing a website on GitHub, which is a very efficient approach for developing course materials. Since I have also taught Stat 542 (Statistical Learning) for several years, I figured it could be beneficial to integrate what I have into this existing book by David and use it as the R material for both courses. For Stat 542, the main focus is to learn the numerical optimization behind these learning algorithms and to become familiar with the theoretical background. As you can tell, I am not being very creative with the name, so SMLR it is. You can find the source files of this book on my GitHub. Recently, I started a new course, Stat 546 (Machine Learning in Data Science), which focuses more on RKHS, random forests, and reinforcement learning, and I am adding more materials to this book.
Target Audience
This book is suitable for students ranging from advanced undergraduates to first- or second-year Ph.D. students who have prior knowledge of statistics, although a student at the master's level will likely benefit most from the material. Previous experience with basic mathematics (mainly linear algebra), statistical modeling (such as linear regression), and R is assumed.
What’s Covered?
This book currently covers the following topics:
- Basic Knowledge
- R, RStudio, and R Markdown
- Linear regression and linear algebra
- Numerical optimization basics
- Model Selection and Regularization in Linear Models
- Ridge regression
- Lasso
- Spline
- Classification models
- Logistic regression
- Discriminant analysis
- Nonparametric Models with Local Smoothing
- K-nearest neighbor
- Kernel smoothing
- Kernel Methods and RKHS
- Support vector machine
- RKHS
- Kernel ridge regression
- Tree and Ensemble Models
- Tree models
- Random forests
- Boosting
- Unsupervised Learning
- K-means
- Hierarchical clustering
- PCA
- Self-organizing maps
- Spectral clustering
- UMAP
The goal of this book is to introduce not only how to run some of the popular statistical learning models in R, but also the algorithms and programming techniques for solving these models, along with some of the fundamental statistical theory behind them. For example, for graduate students, these topics will be discussed in more detail:
- Optimization
- Lagrangian
- Primal vs. dual
- EM and MM algorithms
- Bias-variance trade-off in
- Linear regression
- KNN
- Kernel density estimation
- Kernel Trick and RKHS
- Representer Theorem
- SVM
- Spline
This book is under active development. Hence, you may encounter errors ranging from typos to broken code to poorly explained topics. If you do, please let me know! Simply send an email to rqzhu@illinois.edu and I will make the changes as soon as possible. Or, if you know R Markdown and are familiar with GitHub, make a pull request and fix an issue yourself! These contributions will be acknowledged.
Acknowledgements
The initial contents are derived from Dr. David Dalpiaz's book. My STAT 542 course materials are also inspired by Dr. Feng Liang and Dr. John Marden, who developed earlier versions of this course. I have also incorporated many online resources, too many to compile into a comprehensive list. If you think I missed some references, please let me know.
License
