Abstract
This document walks you through some of the basic commands and functions of R. They will be used to perform statistical analysis in this course. This is by no means a comprehensive guide, and there are many other resources available online.
R is free software that is very popular for statistical computing. You can download R from its official website; as of this writing, the latest version is 3.5.1. Another program that makes using R easier is RStudio, which is available from its website. You can find many online guides that help you set up these two programs, for example on YouTube. This guide is created using R Markdown, which is a feature provided by RStudio.
We will start with some basic operations. Try typing the following commands into your R console and start exploring on your own. Most of them are self-explanatory. Lines starting with # are comments and will not be executed. In the code chunks that follow, lines starting with ## display the output of the previous line. This is a feature of R Markdown, which integrates text and code in a single document.
# Basic mathematics
1 + 3
## [1] 4
3*5
## [1] 15
3^5
## [1] 243
exp(2)
## [1] 7.389056
log(3)
## [1] 1.098612
log2(3)
## [1] 1.584963
factorial(5)
## [1] 120
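Results can be stored in objects using the assignment operator <- (or =) and reused later. A quick illustration (the object name a is arbitrary):
# store the result of a computation in an object named a
a <- exp(2)
# type the object's name to print its value
a
## [1] 7.389056
# reuse it in further computations
a + 1
## [1] 8.389056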
Most of our operations will be performed on data objects, such as vectors and matrices. They can be created within R, loaded from an existing package, or read in from a file (to be discussed later). For now, let's create them in R. Note that common mathematical operations can be performed on data objects, such as matrix multiplication using the %*% operator.
# vector
c(1,2,3,4)
## [1] 1 2 3 4
# matrix
matrix(c(1,2,3,4), 2, 2)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# create matrix from vectors
x = c(1,1,1,0,0,0); y = c(1,0,1,0,1,0)
cbind(x,y)
## x y
## [1,] 1 1
## [2,] 1 0
## [3,] 1 1
## [4,] 0 0
## [5,] 0 1
## [6,] 0 0
# matrix multiplication
matrix(c(1,2,3,4), 2, 2) %*% t(cbind(x,y))
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 4 1 4 0 3 0
## [2,] 6 2 6 0 4 0
# some simple operations
x[3]
## [1] 1
x[2:5]
## [1] 1 1 0 0
cbind(x,y)[1:2, ]
## x y
## [1,] 1 1
## [2,] 1 0
x + 1
## [1] 2 2 2 1 1 1
length(x)
## [1] 6
dim(cbind(x,y))
## [1] 6 2
# A warning will be issued when R detects something wrong. Results may still be produced.
x + c(1,2,3,4)
## Warning in x + c(1, 2, 3, 4): longer object length is not a multiple of shorter object length
## [1] 2 3 4 4 1 2
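The warning above comes from R's recycling rule: in element-wise operations, the shorter vector is repeated to match the length of the longer one. When the longer length is an exact multiple of the shorter length, recycling happens silently:
# the length-2 vector is recycled three times, with no warning
x + c(10, 20)
## [1] 11 21 11 20 10 20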
Random number generation is essential for statistical simulation. R provides functions to generate random observations from a given distribution, for example, the normal distribution and the binomial distribution. A random number generation function usually starts with 'r' followed by the distribution name: to generate normal random variables we use rnorm(), for the t distribution it is rt(), and so on.
# generate 10 Bernoulli random variables as a vector
rbinom(n=10, size = 1, prob = 0.5)
## [1] 1 0 0 1 0 1 0 1 0 0
# 4 normally distributed random variables
rnorm(n=4, mean = 1, sd = 2)
## [1] 1.489433 3.227115 1.772210 3.448394
We can also calculate the probability density function (pdf) and cumulative distribution function (cdf) of random variables. For example, the normal pdf can be evaluated on a sequence of points and visualized in a plot.
# generate a sequence of numbers from -2 to 2
x = seq(-2, 2, by = 0.2)
# calculate normal pdf on these points
fx = dnorm(x, mean = 0, sd = 1)
# plot them in a figure
plot(x, fx, pch = 19)
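The same naming convention applies to other quantities: d for the density, p for the cdf, q for the quantile function, and r for random generation. For example, with the standard normal distribution:
# cdf of the standard normal evaluated at 0
pnorm(0)
## [1] 0.5
# the 0.975 quantile of the standard normal
qnorm(0.975)
## [1] 1.959964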
If we need to replicate the results, we can set a random seed:
# after setting the seed, the random generation will follow a particular sequence
set.seed(1)
rnorm(n=4, mean = 1, sd = 2)
## [1] -0.2529076 1.3672866 -0.6712572 4.1905616
# if we don't reset the seed, a new set of random numbers will be generated
rnorm(n=4, mean = 1, sd = 2)
## [1] 1.6590155 -0.6409368 1.9748581 2.4766494
# if we reset the seed, we will replicate exactly the same results as the first vector
set.seed(1)
rnorm(n=4, mean = 1, sd = 2)
## [1] -0.2529076 1.3672866 -0.6712572 4.1905616
R also provides statistical functions that summarize the data:
x = rnorm(n=100, mean = 1, sd = 2)
y = rnorm(n=100, mean = 2, sd = 1)
sum(x)
## [1] 118.4815
mean(x)
## [1] 1.184815
var(x)
## [1] 3.142351
median(x)
## [1] 1.148906
quantile(x, c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 0.0115149 1.1489063 2.2746083
cor(x, y)
## [1] -0.04261199
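Other commonly used summaries include the standard deviation, the range, and the all-in-one summary() function; since x and y above were generated without a seed, their outputs are omitted here:
# standard deviation and range
sd(x)
range(x)
# five-number summary plus the mean
summary(x)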
For discrete data, we usually use the table() function:
set.seed(1); n = 1000
x = rbinom(n, size = 1, prob = 0.75)
# note that the prob vector is recycled across the n observations
y = rbinom(n, size = 3, prob = c(0.4, 0.3, 0.2, 0.1))
table(x)
## x
## 0 1
## 248 752
# this will be a cross table
table(x, y)
## y
## x 0 1 2 3
## 0 128 79 34 7
## 1 342 267 125 18
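Counts can be turned into proportions with the prop.table() function; outputs are omitted here for brevity:
# proportions over the whole table
prop.table(table(x, y))
# proportions within each row of x (margin = 1)
prop.table(table(x, y), margin = 1)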
For a mixture of discrete and continuous data (multiple variables), we often use a data.frame:
# data.frame is a special data structure that can store both factor and numeric variables
z = runif(n, min = 18, max = 65)
data = data.frame("Gender" = as.factor(x), "Group" = as.factor(y), "Age" = z)
levels(data$Gender) = c("male", "female")
levels(data$Group) = c("patient", "physician", "engineer", "statistician")
# a peek at the top 3 entries of the data
head(data, 3)
## Gender Group Age
## 1 female physician 58.97484
## 2 female physician 63.45826
## 3 female patient 58.74506
# a brief summary
summary(data)
## Gender Group Age
## male :248 patient :470 Min. :18.03
## female:752 physician :346 1st Qu.:29.07
## engineer :159 Median :40.51
## statistician: 25 Mean :41.02
## 3rd Qu.:53.43
## Max. :64.99
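Individual variables in a data.frame can be accessed with the $ operator, and rows can be filtered with logical conditions; a short sketch (outputs omitted):
# extract the Age column as a numeric vector and average it
mean(data$Age)
# keep only the rows whose Group is "statistician"
head(data[data$Group == "statistician", ])
# inspect the internal structure of the data.frame
str(data)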
As a real example, we take the classical iris data, which can be loaded directly from R.
# read in the data
data(iris)
# a peek at the top observations
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# number of observations for different species
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
# suppose we are interested in the means of the first 4 variables by species
aggregate(iris[, 1:4], list(iris$Species), mean)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
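For a single variable, the same group means can be obtained with tapply():
# mean Sepal.Length within each species
tapply(iris$Sepal.Length, iris$Species, mean)
##     setosa versicolor  virginica
##      5.006      5.936      6.588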
Data visualization is an important initial step of any analysis. Commonly used techniques include histograms, boxplots, and scatter plots. You can also customize these figures with different colors and shapes to facilitate the visualization. We use the iris data as an example.
# first, we produce a histogram of the Sepal.Length variable
hist(iris$Sepal.Length, xlab = "Sepal Length")
# boxplots can be used to compare different variables or the same variable across groups
# the following boxplot compares Sepal.Length across different species
# you can easily color them
boxplot(Sepal.Length ~ Species, data=iris, col=c("indianred", "deepskyblue", "darkorange"))
# a scatter plot is usually used for two variables
# but we can color the points using a third, categorical variable
# remember to add a legend for readability
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19,
col = c("indianred","deepskyblue", "darkorange")[iris$Species])
legend("topright", c("setosa", "versicolor", "virginica"),
pch = 19, col = c("indianred","deepskyblue", "darkorange"))
# to incorporate more than two variables, we can either use 3d plots
# or do pairwise plots for each pair of variables
pairs(iris[,1:4], pch = 19, cex = 0.75, lower.panel=NULL,
col = c("indianred","deepskyblue", "darkorange")[iris$Species])
R can read in data from many different sources, such as .txt and .csv files. For example, read.csv() can be used to import .csv files. The first argument should be the path to the data file, or just the file name if the current working directory contains the data file. R objects, especially matrices, can be saved into these standard formats using functions such as write.table() and write.csv(). We can also save any object into an .RData file, which can be loaded later; to do this, try the functions save.image() and save().
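As a minimal sketch (the file names here are arbitrary), we can write a data set to a .csv file, read it back, and save objects into an .RData file:
# write the iris data to a csv file in the working directory
write.csv(iris, "iris_copy.csv", row.names = FALSE)
# read it back in as a data.frame
mydata = read.csv("iris_copy.csv")
# save the object to an .RData file, then restore it with load()
save(mydata, file = "mydata.RData")
load("mydata.RData")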
One of the most important features of R is its massive collection of packages. A package is an add-on that can be downloaded and installed to provide additional functions and analyses.
# The MASS package can be used to generate multivariate normal samples
# This package is already included with base R
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
P = 4; N = 200
V <- 0.5^abs(outer(1:P, 1:P, "-"))
X = as.matrix(mvrnorm(N, mu=rep(0,P), Sigma=V))
head(X, 3)
## [,1] [,2] [,3] [,4]
## [1,] 0.5170973 -0.5084187 -0.4949838 -2.0529670
## [2,] 0.4995059 0.5314248 0.7749723 0.8980893
## [3,] -1.9868469 -0.9742923 -1.1977524 1.8032596
Most packages need to be downloaded and installed. To do this, we use the install.packages() function. After the package is loaded into your session using library(), we can start to use the functions within that package.
# we demonstrate using a package that can produce 3d images
# to install the package
install.packages("scatterplot3d")
Please note that you may not want to install a package every time the R Markdown file is compiled.
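One common pattern, sketched below, is to install the package only when it is missing, so that compiling the document does not re-install it every time:
# install scatterplot3d only if it is not already available
if (!requireNamespace("scatterplot3d", quietly = TRUE)) {
  install.packages("scatterplot3d")
}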
# load the package
library(scatterplot3d)
# now produce a 3d plot
scatterplot3d(iris[,1:3], pch = 19,
color=c("indianred","deepskyblue", "darkorange")[iris$Species])
legend("bottom", legend = levels(iris$Species), pch = 19,
col = c("indianred","deepskyblue", "darkorange"),
inset = -0.3, xpd = TRUE, horiz = TRUE)
For discrete variables, we can construct a frequency table to summarize the data.
# Summarizes Gender vs. Group
table(data[, 1:2])
## Group
## Gender patient physician engineer statistician
## male 128 79 34 7
## female 342 267 125 18
For continuous variables, we can calculate Pearson’s correlation.
set.seed(1); n = 30
x = rnorm(n)
y = x + rnorm(n, sd = 2)
z = x + rnorm(n, sd = 2)
# Calculate Pearson's correlation and tests
cor(y, z)
## [1] 0.5810874
cor.test(y, z)
##
## Pearson's product-moment correlation
##
## data: y and z
## t = 3.7782, df = 28, p-value = 0.0007592
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2792861 0.7784002
## sample estimates:
## cor
## 0.5810874
A t-test is often used to test for a difference in a continuous variable between two groups. For example, we test the Age difference between genders.
t.test(data$Age ~ data$Gender)
##
## Welch Two Sample t-test
##
## data: data$Age by data$Gender
## t = -0.92872, df = 407.2, p-value = 0.3536
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
## -2.963039 1.061637
## sample estimates:
## mean in group male mean in group female
## 40.30084 41.25154
The "Lady Tasting Tea" problem is the first statistical hypothesis testing problem, proposed by R. A. Fisher. A description of the original problem can be found online. We can code the problem with two discrete random variables, and Fisher's exact test can be used. For a demonstration, let's consider a situation in which Lady Bristol gets three of the cups right.
# Construct the 2 by 2 table as the dataset
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")))
TeaTasting
## Truth
## Guess Milk Tea
## Milk 3 1
## Tea 1 3
# In this case, the result is not significant
fisher.test(TeaTasting, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: TeaTasting
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
# Later on, we will learn that some alternative tests can be used
# For example, the chi-square test
chisq.test(TeaTasting)
## Warning in chisq.test(TeaTasting): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TeaTasting
## X-squared = 0.5, df = 1, p-value = 0.4795
A simple linear regression assumes the underlying model \(Y = X\beta + \epsilon\). With the observed data, we can estimate the regression coefficients:
# the lm() function is the most commonly used
fit = lm(y~x)
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0404 -1.0099 -0.4594 1.1506 3.7069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2586 0.2964 0.873 0.39032
## x 1.0838 0.3249 3.336 0.00241 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.617 on 28 degrees of freedom
## Multiple R-squared: 0.2844, Adjusted R-squared: 0.2588
## F-statistic: 11.13 on 1 and 28 DF, p-value: 0.00241
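The fitted regression line can be overlaid on a scatter plot of the data using abline():
# scatter plot of the data with the fitted line added
plot(x, y, pch = 19)
abline(fit, col = "darkorange", lwd = 2)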
Writing your own R function can be an exciting experience. For example, we can create a function that calculates the inner product of two vectors:
# define a function that calculates the inner product of two vectors
myfun <- function(a, b)
{
return(a %*% b)
}
x1 = rnorm(100)
x2 = rnorm(100)
myfun(x1, x2)
## [,1]
## [1,] -5.537335
If we need the function to return multiple objects, we can create a list. The following example returns both the inner product and the correlation.
# return multiple objects in a named list
myfun <- function(a, b)
{
return(list("prod" = a %*% b, "cor" = cor(a, b)))
}
myfun(x1, x2)
## $prod
## [,1]
## [1,] -5.537335
##
## $cor
## [1] -0.05558356
Most functions in R are constructed this way, because a lot of information needs to be stored when fitting a statistical model. You can always peek into a fitted model object and extract the information you need.
# the output is too long, hence omitted
str(fit)
names(fit)
fit$coefficients
## (Intercept) x
## 0.2586414 1.0837726
Coefficients in a linear regression can be solved for using the formula
\[\widehat \beta = (X'X)^{-1} X'Y.\]
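As a quick check, this formula can be evaluated directly with matrix operations; solve() computes the matrix inverse. Here X is the design matrix with an intercept column, using the same x and y as in the regression above (output omitted; it agrees with fit$coefficients):
# design matrix with an intercept column
X = cbind(1, x)
# closed-form least squares estimate (X'X)^{-1} X'Y
solve(t(X) %*% X) %*% t(X) %*% y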
We can also treat this as an optimization problem by minimizing the \(\ell_2\) loss function:
\[\widehat \beta = \arg\min \frac{1}{2n}\lVert Y - X\beta \rVert_2^2.\]
This can be done through numerical optimization methods:
# the l2 loss function: (1 / 2n) * || Y - X b ||^2
f <- function(b, X, Y)
{
  0.5 * sum((Y - X %*% as.matrix(b))^2) / length(Y)
}
# We use an optimization function:
# the first argument is the initial value for beta
# the second argument is the objective function
# the rest of the arguments are additional information used
# in calculating the objective function
solution = optim(rep(0, 2), f, X = cbind(1, x), Y = y)
solution$par
## [1] 0.2586419 1.0836772
Many machine learning and statistical learning problems eventually reduce to an optimization problem, although they are often much more difficult than solving a linear regression.
To get the reference for a particular function, put a question mark in front of the function name to see the details:
# This will open the online documentation in a web browser
?mvrnorm
?save.image
And 99% of the time, questions can be answered by searching Google. Stack Overflow is a nice place to find similar questions.