Survival analysis is used to handle time to event data. For example, we may interested in the time until death after developing a cancer; the time until tumor recurrence after a surgery; or the time until a light bulb failure. In these examples, the outcome variable is referred to as a failure time, which we want to model using covariates. If we get to observe them, then certain regression models can be used. However, the difficult comes in when we cannot observe them completely. One of the most commonly encountered mechanisms is called right censoring, meaning that there could be another event (censoring time) happens before the failure time, which would remove the subject out of the study, hence preventing us from observing the failure time. For example,
A patient participanted in a study of survival time after a cancer related surgry. At day 10, the patient drops out for some random uncontrollable reasons, and the outcome of interest (death) is not observed. However, we know that the death time should be larger than 10 days.
In this example above, we did not observe the event of interest, but only the last follow-up time, 10 days. This is called the censoring time in survival analysis. In particular, this is referred to as the right-censoring type There are also other censoring mechanism, however, we will only focus on this type of censoring.
In a right-censoring setting, we denote the event of interest as \(T\), the failure time, and the lost-to-follow-up time as \(C\), the censoring time. We may have two situations for each subject:
The following plot shows how these observations look like in a real study:
set.seed(1)
n = 10
tstart = runif(n)
tend = tstart + rexp(n)
tcensor = rbinom(n, 1, 0.5)
plot(NA, NA, xlim = c(0, max(tend)), ylim = c(0, n+1), xlab = "Time", ylab = "Subject")
for (i in 1:n)
{
segments(x0 = tstart[i], i, x1 = tend[i], y1 = i)
points(tend[i], i, pch = ifelse(tcensor[i] == 1, 4, 1),
cex = 2, lwd = 2, col = ifelse(tcensor[i] == 1, "blue", "red"))
}
legend("topright", c("Failure", "Censor"), pch = c(4, 1), col = c("blue", "red"), cex = 1.3)
To perform the analysis, we need to first align all subjects to the same starting point:
plot(NA, NA, xlim = c(0, max(tend)), ylim = c(0, n+1), xlab = "Time", ylab = "Subject")
for (i in 1:n)
{
segments(x0 = 0, i, x1 = tend[i] - tstart[i], y1 = i)
points(tend[i] - tstart[i], i, pch = ifelse(tcensor[i] == 1, 4, 1),
cex = 2, lwd = 2, col = ifelse(tcensor[i] == 1, "blue", "red"))
}
legend("topright", c("Failure", "Censor"), pch = c(4, 1), col = c("blue", "red"), cex = 1.3)
Hence, in survival analysis, we have two outcomes for each subject:
Y = data.frame("Time" = tend - tstart, "Censor" = tcensor)
head(Y)
In R
, a survival outcome can also be represented using
the Surv()
function from the survival
package,
this will put a +
sign after the censored subjects,
indicating that the actual failure time is larger than the observed.
library(survival)
Surv(Y$Time, Y$Censor)
## [1] 2.0340910+ 1.7987484+ 0.3740457 1.2295621+ 0.5396828+ 0.9565675 0.1470460+ 1.3907351+ 0.7620299 1.2376036
Example: Mayo Clinic Primary Biliary Cholangitis (PBC) Data is one of
the most famous survival datasets (Fleming
and Harrington, 1991). This data is collected from the Mayo Clinic
trial in PBC conducted between 1974 and 1984. A total of 424 PBC
patients, referred to Mayo Clinic during that ten-year interval, met
eligibility criteria for the randomized placebo controlled trial of the
drug D-penicillamine. The first 312 cases in the data set participated
in the randomized trial and contain largely complete data (survival
package documentation).
library(survival)
data(pbc)
pbc$censor = (pbc$status == 2)
pbc = pbc[1:312, ]
head(pbc)
In this dataset, time
is the observed time (fail or
censor) and status
is the status at endpoint. In the
original data, there are three types of end points: 0 for censored, 1
for transplant and 2 for dead. We will treat both 0 and 1 as censored
observations since we are interested in the length of death.
The survival function gives the probability that a patient will survive beyond any specified time.
Mathematically, for any given time \(t\), the survival function is
\[S(t) = P(T > t)\] The most difficult problem in survival analysis is that we do not observe all the failure times \(T\). Be careful that using only the observed time (regardless of the censoring) for the analysis would lead to very biased results. To correctly estimate the survival function, the Kaplan–Meier estimator or the Nelson–Aalen estimator can be used. The following plot shows the estimated survival curve and the standard deviation. By default, this function uses the Nelson–Aalen estimator
plot(Surv(pbc$time, pbc$censor))
The main idea of a survival function estimation is to account for the fact that at each time point, we should estimate the risk of failure by only considering the subjects that are still remained in the study. This involves a new concept called the hazard function which describes the rate of failure at each given time point. Mathematically, the hazard function is defined as
\[h(t) = \frac{f(t)}{S(t)}\] where \(f(t)\) is the density function. The relationship of the three could be visualized with the following normal distribution function. The technique behind the survival function calculation is to estimate the hazard at each given time point, with numerator being a failure event, and the denominator being the count of subjects at risk at the time. The hazard function can then be recovered to calculate the survival function (the detail is beyond the scope of this course).
x = seq(-3, 3, 0.01)
fx = dnorm(x)
Sx = 1 - pnorm(x)
hx = fx / Sx
par(mfrow=c(1,3))
plot(x, fx, type = "l", lwd = 3, main = "Density", cex.main = 3)
plot(x, Sx, type = "l", lwd = 3, main = "Survival", cex.main = 3)
plot(x, hx, type = "l", lwd = 3, main = "Hazard", cex.main = 3)
Read the Salinas-Escudero et al. 2020 paper, Figure 1. For different age groups, what are their chance to survival at 20 days? For patients who entered ICU, what is the median survival time (the time where half of the population fail)?
Similar to the continuous outcome case, where we could use the \(t\) test to test their difference, we also
want to compare if the survivabilities of two groups are different given
a set of samples. For example, let’s test if the two treatment
trt
groups (1 for D-penicillmain, 2 for placebo) are
different. The two survival curves look like the following
pbc[, c("time", "censor", "trt")]
plot(survfit(Surv(time, censor) ~ trt, data = pbc), col = c("red","blue"))
The most popular test to compare the two curves is called the
log-rank test. Essentially, the test is comparing a
cumulative version of the hazard function over the
entire time line. This can be done using the survdiff()
function:
survdiff(Surv(time, censor) ~ trt, data = pbc)
## Call:
## survdiff(formula = Surv(time, censor) ~ trt, data = pbc)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## trt=1 158 65 63.2 0.0502 0.102
## trt=2 154 60 61.8 0.0513 0.102
##
## Chisq= 0.1 on 1 degrees of freedom, p= 0.7
The function provides some summary statistics in addition to the test of difference:
(O-E)^2/E
and (O-E)^2/V
are both
summaries of the deviation of the hazard function from the pooled
samples. This would not be very useful if you are only interested in the
test significance.Chisq
term is the statistic calculated for this
test, followed by its p-value. A detailed step-by-step calculation of
this quantity can be found hereUse the pbc
data to perform the following tasks:
plot(Surv(pbc$time[pbc$sex == "f"], pbc$censor[pbc$sex == "f"]))
survdiff(Surv(time, censor) ~ sex, data = pbc)
The next step is to model the effect of covariates. This is much more difficult than a regression model. In a regression model, we estimate the mean of outcome \(Y\). However, this is not always possible for a survival model since large values of failure time will most likely get censored. In many cases, there can be a maximum follow-up time (truncation) which prevents the failure time form being observed beyond that point. Hence the Cox Propositional Hazard (Cox-PH) model instead focus on the hazard function. The central idea is that the covariate changes would modify the hazard function, which in turn changes the survival function. But this is only up to the time point where we could estimate the hazard, i.e., the time point where we can observe failures.
Mathematically, we would estimate a quantity called baseline hazard, \(\lambda_0(t)\), which represent a certain type of “reference” of all individuals, and then the hazard function of a specific subject with covariate vector \(X\) can be obtained using
\[\begin{align} \lambda(t | X) &= \lambda_0 \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\\ &= \lambda_0 \exp(\beta_1 X_1) \exp(\beta_2 X_2) \cdots \exp(\beta_p X_p) \end{align}\] We need to notice three facts:
To fit a Cox-PH model, we will use the coxph()
function.
Consider using both trt
and age
variables. The
syntax is almost the same as a linear regression.
pbcfit <- coxph(Surv(time, censor) ~ age + trt, data = pbc)
summary(pbcfit)
## Call:
## coxph(formula = Surv(time, censor) ~ age + trt, data = pbc)
##
## n= 312, number of events= 125
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.040632 1.041469 0.008964 4.533 5.83e-06 ***
## trt 0.073604 1.076380 0.182068 0.404 0.686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.041 0.9602 1.0233 1.060
## trt 1.076 0.9290 0.7533 1.538
##
## Concordance= 0.625 (se = 0.026 )
## Likelihood ratio test= 20.67 on 2 df, p=3e-05
## Wald test = 20.71 on 2 df, p=3e-05
## Score (logrank) test = 20.97 on 2 df, p=3e-05
We should pay attention to several quantities:
coef
and the associated p-values. This part
is similar to a linear regression modelexp(coef)
and its 95% confidence interval
lower .95
, upper .95
. The reason that we want
to take the exponential of the coefficient is due to the previously
explained fact that the effect of a covariate is multiplicative, and the
magnitude is \(\exp(\beta X)\). Hence,
if this quantity is 1, then there is no effect to the baseline hazard.
In other words, if the confidence interval excludes 1, then this
variable would present significant effect to the hazard, which also
affects the survival.Concordance
this is the agreement between an observed
response and the predicted link function. However, be aware that we do
not predict the outcome exactly, we only predict it relatively in terms
of their hazard. Hence, if a subject is predicted to have high risk and
is also failing very early, then this is a good model. Hence, the
concordance score is essentially checking a type of rank correlation
(see our correlation slides in Week 4) between the fitted and the
observed.Again, this would be related to the hazard. We can do the same
analysis as the parameter in a logistic regression by setting two groups
age = a
and age = a+1
(we will ignore the
effect of treatment for simplicity):
age = a
group: the hazard is \(\lambda_0 \exp(\beta a)\)age = a + 1
group: the hazard is \(\lambda_0 \exp(\beta (a+1))\)Hence, if we look at the ratio of the corresponding hazards:
\[\frac{\lambda_0 \exp(\beta (a+1))}{\lambda_0 \exp(\beta a)} = \exp(\beta)\] So, the parameter \(\beta\) represents the log of hazard ratio, corresponding to two groups with 1 unit difference of the variable. Large positive \(\beta\) means if the variable value increases, the hazard is increasing, and the subject is expected to fail early.
Use the pbc
data to perform the following tasks:
pbcfit <- coxph(Surv(time, censor) ~ age + trt + sex, data = pbc)
summary(pbcfit)
## Call:
## coxph(formula = Surv(time, censor) ~ age + trt + sex, data = pbc)
##
## n= 312, number of events= 125
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.038820 1.039583 0.008995 4.316 1.59e-05 ***
## trt 0.062647 1.064651 0.181853 0.344 0.730
## sexf -0.337659 0.713438 0.238873 -1.414 0.157
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0396 0.9619 1.0214 1.058
## trt 1.0647 0.9393 0.7454 1.521
## sexf 0.7134 1.4017 0.4467 1.139
##
## Concordance= 0.623 (se = 0.026 )
## Likelihood ratio test= 22.53 on 3 df, p=5e-05
## Wald test = 23.22 on 3 df, p=4e-05
## Score (logrank) test = 23.43 on 3 df, p=3e-05
pbcfit <- coxph(Surv(time, censor) ~ age + trt + sex + age*trt, data = pbc)
summary(pbcfit)
## Call:
## coxph(formula = Surv(time, censor) ~ age + trt + sex + age *
## trt, data = pbc)
##
## n= 312, number of events= 125
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.02090 1.02112 0.02712 0.771 0.441
## trt -0.60273 0.54731 0.96792 -0.623 0.533
## sexf -0.31861 0.72716 0.24046 -1.325 0.185
## age:trt 0.01257 1.01265 0.01794 0.700 0.484
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0211 0.9793 0.9683 1.077
## trt 0.5473 1.8271 0.0821 3.649
## sexf 0.7272 1.3752 0.4539 1.165
## age:trt 1.0126 0.9875 0.9777 1.049
##
## Concordance= 0.625 (se = 0.027 )
## Likelihood ratio test= 23.02 on 4 df, p=1e-04
## Wald test = 23.6 on 4 df, p=1e-04
## Score (logrank) test = 23.82 on 4 df, p=9e-05