1 Installing and Using R

R is free software that is widely used for statistical computing. You can download R from CRAN (https://cran.r-project.org/). RStudio, available at https://www.rstudio.com/, is an integrated development environment that makes using R easier. You can find many online guides that help you set up these two programs, for example, video tutorials on YouTube. This guide was created using R Markdown, a feature provided by RStudio.

2 Basic Mathematical Operations

We will start with some basic operations. Try typing the following commands into your R console and explore on your own. Most of them are self-explanatory. Lines that start with a # are comments and will not be executed. Lines that start with ## in the following chunks of code display the output of the preceding line. This formatting is a feature of R Markdown, which integrates text and code in a single document.

    # Basic mathematics
    1 + 3
## [1] 4
    3*5
## [1] 15
    3^5
## [1] 243
    exp(2)
## [1] 7.389056
    log(3)
## [1] 1.098612
    log2(3)
## [1] 1.584963
    factorial(5)
## [1] 120

Most of our operations will be performed on data objects, such as vectors and matrices. They can be created within R, loaded from an existing package, or read in from a file (to be discussed later). For now, let's create them in R. Note that common mathematical operations can be performed on data objects, such as matrix multiplication using the %*% operator.

    # vector
    c(1,2,3,4)
## [1] 1 2 3 4
    # matrix 
    matrix(c(1,2,3,4), 2, 2)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
    # create matrix from vectors 
    x = c(1,1,1,0,0,0); y = c(1,0,1,0,1,0)
    cbind(x,y)
##      x y
## [1,] 1 1
## [2,] 1 0
## [3,] 1 1
## [4,] 0 0
## [5,] 0 1
## [6,] 0 0
    # matrix multiplication
    matrix(c(1,2,3,4), 2, 2) %*% t(cbind(x,y))
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    1    4    0    3    0
## [2,]    6    2    6    0    4    0
    # some simple operations 
    x[3]
## [1] 1
    x[2:5]
## [1] 1 1 0 0
    cbind(x,y)[1:2, ]
##      x y
## [1,] 1 1
## [2,] 1 0
    x + 1
## [1] 2 2 2 1 1 1
    length(x)
## [1] 6
    dim(cbind(x,y))
## [1] 6 2
    # A warning will be issued when R detects something wrong. Results may still be produced.
    x + c(1,2,3,4)
## Warning in x + c(1, 2, 3, 4): longer object length is not a multiple of shorter object length
## [1] 2 3 4 4 1 2

3 Random Number Generation from a Distribution

Random number generation is essential for statistical simulation. R provides functions to generate random observations from a given distribution, for example the normal and binomial distributions. The name of a random number generation function usually starts with 'r' followed by the distribution name: to generate normal random variables we use rnorm(), to draw from a t distribution we use rt(), and so on.

    # generate 10 Bernoulli random variables as a vector
    rbinom(n=10, size = 1, prob = 0.5)
##  [1] 1 0 0 1 0 1 0 1 0 0
    # generate 4 normal random variables
    rnorm(n=4, mean = 1, sd = 2)
## [1] 1.489433 3.227115 1.772210 3.448394

We can also evaluate the probability density function (pdf) and cumulative distribution function (cdf) of a distribution. For example, the normal pdf and cdf can be evaluated on a sequence of points and visualized in a plot.

    # generate a sequence of numbers from -2 to 2
    x = seq(-2, 2, by = 0.2)
    # calculate normal pdf on these points
    fx = dnorm(x, mean = 0, sd = 1)
    
    # plot them in a figure
    plot(x, fx, pch = 19)
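The cumulative distribution and quantile functions use the 'p' and 'q' prefixes, respectively. As a small sketch of the same idea for the cdf:

    # calculate the normal cdf on the same points and plot it
    Fx = pnorm(x, mean = 0, sd = 1)
    plot(x, Fx, pch = 19)

    # qnorm() is the inverse of pnorm(), i.e., the quantile function
    # qnorm(0.975) returns 1.959964, the familiar 1.96 critical value
    qnorm(0.975, mean = 0, sd = 1)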

If we need to replicate the results, we can set a random seed:

    # after setting the seed, the random generation will follow a particular sequence
    set.seed(1)
    rnorm(n=4, mean = 1, sd = 2)
## [1] -0.2529076  1.3672866 -0.6712572  4.1905616
    
    # if we don't reset the seed, a new set of random numbers will be generated 
    rnorm(n=4, mean = 1, sd = 2)
## [1]  1.6590155 -0.6409368  1.9748581  2.4766494
    
    # if we reset the seed, we will replicate exactly the same results as the first vector
    set.seed(1)
    rnorm(n=4, mean = 1, sd = 2)
## [1] -0.2529076  1.3672866 -0.6712572  4.1905616

4 Descriptive Statistics of Data

The following statistical functions provide numerical summaries of the data:

    x = rnorm(n=100, mean = 1, sd = 2)
    y = rnorm(n=100, mean = 2, sd = 1)
    sum(x)
## [1] 118.4815
    mean(x)
## [1] 1.184815
    var(x)
## [1] 3.142351
    median(x)
## [1] 1.148906
    quantile(x, c(0.25, 0.5, 0.75))
##       25%       50%       75% 
## 0.0115149 1.1489063 2.2746083
    cor(x, y)
## [1] -0.04261199

For discrete data, we usually use the table() function:

    set.seed(1); n = 1000
    x = rbinom(n, size = 1, prob = 0.75)
    y = rbinom(n, size = 3, prob = c(0.4, 0.3, 0.2, 0.1))
    table(x)
## x
##   0   1 
## 248 752
    
    # this will be a cross table
    table(x, y)
##    y
## x     0   1   2   3
##   0 128  79  34   7
##   1 342 267 125  18

For a mixture of discrete and continuous data (multiple variables), we often use a data.frame:

    # data.frame is a special data structure that can store both factor and numeric variables
    z = runif(n, min = 18, max = 65)
    data = data.frame("Gender" = as.factor(x), "Group" = as.factor(y), "Age" = z)
    levels(data$Gender) = c("male", "female")
    levels(data$Group) = c("patient", "physician", "engineer", "statistician")
    
    # a peek at the top 3 entries of the data
    head(data, 3)
##   Gender     Group      Age
## 1 female physician 58.97484
## 2 female physician 63.45826
## 3 female   patient 58.74506
    
    # a brief summary
    summary(data)
##     Gender             Group          Age       
##  male  :248   patient     :470   Min.   :18.03  
##  female:752   physician   :346   1st Qu.:29.07  
##               engineer    :159   Median :40.51  
##               statistician: 25   Mean   :41.02  
##                                  3rd Qu.:53.43  
##                                  Max.   :64.99

As a real example, we use the classic iris data, which is included with R and can be loaded directly.

    # load the built-in data
    data(iris)
    
    # a peek at the top observations 
    head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

    # number of observations for different species
    table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
    
    # suppose we are interested in the mean of the first 4 variables by species
    aggregate(iris[, 1:4], list(iris$Species), mean)
##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

5 Producing Figures

Data visualization is an important initial step of any analysis. Commonly used techniques include histograms, boxplots, and scatter plots. You can also customize these figures with different colors and shapes to facilitate the visualization. We use the iris data as an example.

    # first, we produce a histogram of the Sepal.Length variable 
    hist(iris$Sepal.Length, xlab = "Sepal Length")


    # a boxplot can be used to compare different variables, or the same variable across groups
    # the following boxplot compares Sepal.Length across species
    # you can easily color the boxes
    boxplot(Sepal.Length ~ Species, data=iris, col=c("indianred", "deepskyblue", "darkorange"))

    
    # a scatter plot is usually used for two variables
    # but we can color the points using a third, categorical variable
    # remember to add a legend for readability
    plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19,
         col = c("indianred","deepskyblue", "darkorange")[iris$Species])
    legend("topright", c("setosa", "versicolor", "virginica"), 
           pch = 19, col = c("indianred","deepskyblue", "darkorange"))

    
    # to incorporate more than two variables, we can either use 3d plots
    # or do pairwise plots for each pair of variables
    
    pairs(iris[,1:4], pch = 19,  cex = 0.75, lower.panel=NULL,
          col = c("indianred","deepskyblue", "darkorange")[iris$Species])

6 Reading in and Saving Data

R can read in data from many different sources, such as .txt and .csv files. For example, read.csv() can be used to import .csv files. Its first argument should be the path to the data file, or just the file name if the file is in the current working directory. R objects, especially matrices and data frames, can be saved into these standard formats using functions such as write.table() and write.csv(). We can also save any R object into an .RData file, which can be loaded later; see the functions save.image() and save().
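As a minimal sketch of this round trip, using the iris data (the file names below are only examples; substitute your own paths):

    # write the iris data to a .csv file and read it back in
    write.csv(iris, file = "iris.csv", row.names = FALSE)
    iris2 = read.csv("iris.csv")
    head(iris2, 3)

    # save an R object into an .RData file, then restore it with load()
    save(iris2, file = "iris.RData")
    load("iris.RData")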

7 R Packages

One of the most important features of R is its massive collection of packages. A package is like an add-on that can be downloaded and installed to provide additional functions and analyses.

    # The MASS package can be used to generate samples from a multivariate normal distribution
    # This package is already included with base R
    library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
    P = 4; N = 200
    V <- 0.5^abs(outer(1:P, 1:P, "-"))
    X = as.matrix(mvrnorm(N, mu=rep(0,P), Sigma=V))
    head(X, 3)
##            [,1]       [,2]       [,3]       [,4]
## [1,]  0.5170973 -0.5084187 -0.4949838 -2.0529670
## [2,]  0.4995059  0.5314248  0.7749723  0.8980893
## [3,] -1.9868469 -0.9742923 -1.1977524  1.8032596

Most packages need to be downloaded and installed with the install.packages() function. After the package is loaded with library(), we can start to use the functions it provides.

    # we demonstrate using a package that can produce 3d images

    # to install the package
    install.packages("scatterplot3d")

Please note that you may not want to install a package every time the R Markdown file is compiled.
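One common pattern is to install only when the package is missing. A small sketch using base R (requireNamespace() checks whether a package is available without attaching it):

    # install the package only if it is not already available
    if (!requireNamespace("scatterplot3d", quietly = TRUE)) {
      install.packages("scatterplot3d")
    }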

    # load the package
    library(scatterplot3d) 

    # now produce a 3d plot
    scatterplot3d(iris[,1:3], pch = 19, 
                  color=c("indianred","deepskyblue", "darkorange")[iris$Species])
    legend("bottom", legend = levels(iris$Species), pch = 19,
           col = c("indianred","deepskyblue", "darkorange"),
      inset = -0.3, xpd = TRUE, horiz = TRUE)

8 Basic Functions for Statistics

For discrete variables, we can construct a frequency table to summarize the data.

    # Summarizes Gender vs. Group
    table(data[, 1:2])
##         Group
## Gender   patient physician engineer statistician
##   male       128        79       34            7
##   female     342       267      125           18

For continuous variables, we can calculate Pearson’s correlation.

    set.seed(1); n = 30
    x = rnorm(n)
    y = x + rnorm(n, sd = 2)
    z = x + rnorm(n, sd = 2)
    
    # Calculate Pearson's correlation and tests
    cor(y, z)
## [1] 0.5810874
    cor.test(y, z)
## 
##  Pearson's product-moment correlation
## 
## data:  y and z
## t = 3.7782, df = 28, p-value = 0.0007592
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2792861 0.7784002
## sample estimates:
##       cor 
## 0.5810874

A t-test is often used to test whether the mean of a continuous variable differs between two groups. For example, we test the Age difference between genders.

    t.test(data$Age ~ data$Gender)
## 
##  Welch Two Sample t-test
## 
## data:  data$Age by data$Gender
## t = -0.92872, df = 407.2, p-value = 0.3536
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
##  -2.963039  1.061637
## sample estimates:
##   mean in group male mean in group female 
##             40.30084             41.25154

The ``Lady Tasting Tea’’ problem is the first statistical hypothesis testing problem, proposed by R. A. Fisher. The description of the original problem can be found . We can code the problem with two discrete random variables, and the Fisher Exact Test can be used. For a demonstration, let’s consider a situation that Lady Bristol gets three of the cups right.

    # Construct the 2 by 2 table as the dataset
    TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                         dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")))
    TeaTasting
##       Truth
## Guess  Milk Tea
##   Milk    3   1
##   Tea     1   3
    
    # In this case, the p-value is not significant
    fisher.test(TeaTasting,  alternative = "greater")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  TeaTasting
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.3135693       Inf
## sample estimates:
## odds ratio 
##   6.408309

    # Later on, we will learn that some alternative tests can be used
    # For example, the chi-square test
    chisq.test(TeaTasting)
## Warning in chisq.test(TeaTasting): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TeaTasting
## X-squared = 0.5, df = 1, p-value = 0.4795

A simple linear regression assumes the underlying model \(Y = \beta_0 + \beta_1 X + \epsilon\). With the observed data, we can estimate the regression coefficients:

    # the lm() function is the most commonly used
    fit = lm(y~x)
    summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0404 -1.0099 -0.4594  1.1506  3.7069 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.2586     0.2964   0.873  0.39032   
## x             1.0838     0.3249   3.336  0.00241 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.617 on 28 degrees of freedom
## Multiple R-squared:  0.2844, Adjusted R-squared:  0.2588 
## F-statistic: 11.13 on 1 and 28 DF,  p-value: 0.00241

9 Writing Your Own Functions

Writing your own R function can be an exciting experience. For example, we can create a function that calculates the inner product of two vectors:

    # define a function that calculates the inner product of two vectors
    myfun <- function(a, b)
    {
      return(a %*% b)
    }

    x1 = rnorm(100)
    x2 = rnorm(100)
    myfun(x1, x2)
##           [,1]
## [1,] -5.537335

If we need the function to return multiple objects, we can create a list. The following example returns both the inner product and the correlation.

    # return multiple objects by wrapping them in a named list
    myfun <- function(a, b)
    {
      return(list("prod" = a %*% b, "cor" = cor(a, b)))
    }

    myfun(x1, x2)
## $prod
##           [,1]
## [1,] -5.537335
## 
## $cor
## [1] -0.05558356

Most functions in R are constructed this way, because a lot of information needs to be stored when fitting a statistical model. You can always peek into a fitted model object and extract the information you need.

    # the output is too long, hence omitted
    str(fit)
    names(fit)
    fit$coefficients
## (Intercept)           x 
##   0.2586414   1.0837726

10 Numerical Optimizations

The coefficients of a linear regression can be obtained in closed form using the formula

\[\widehat \beta = (X'X)^{-1} X'Y.\]
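As a quick check, we can evaluate this closed-form solution directly, reusing the x and y simulated in Section 8 (a small sketch; solve() solves the normal equations):

    # closed-form least squares estimate
    X = cbind(1, x)                 # design matrix with an intercept column
    solve(t(X) %*% X, t(X) %*% y)
    # this should reproduce fit$coefficients shown above: 0.2586414 and 1.0837726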

We can also treat this as an optimization problem by minimizing the \(\ell_2\) loss function:

\[\widehat \beta = \arg\min_\beta \frac{1}{2n}\lVert Y - X\beta \rVert_2^2.\]

This minimization can be carried out with general-purpose numerical optimization routines, such as R's optim() function:

    f <- function(b, X, Y)
    {
      # the l2 loss: sum of squared residuals divided by 2n
      sum((Y - X %*% as.matrix(b))^2) / (2 * length(Y))
    }
    
    # We use an optimization function:
    # the first argument is the initial value for beta
    # the second argument is the objective function
    # the rest of the arguments pass additional information used
    # in calculating the objective function

    solution = optim(rep(0, 2), f, X = cbind(1, x), Y = y)
    solution$par
## [1] 0.2586419 1.0836772

Many machine learning and statistical learning problems ultimately reduce to optimization problems, although these are usually much more difficult than solving a linear regression.

11 Getting Help

To see the reference documentation for a particular function, put a question mark in front of its name:

    # This will open the help document for the function
    ?mvrnorm
    ?save.image
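When you do not know the exact function name, a fuzzy search over the installed documentation can help:

    # search the documentation for a keyword
    help.search("multivariate normal")
    # ?? is a shorthand for help.search()
    ??"linear model"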

Most questions can be answered by searching Google, and Stack Overflow is a good place to find similar questions.