Time to Event Data and Censoring

Survival analysis is used to handle time to event data. For example, we may interested in the time until death after developing a cancer; the time until tumor recurrence after a surgery; or the time until a light bulb failure. In these examples, the outcome variable is referred to as a failure time, which we want to model using covariates. If we get to observe them, then certain regression models can be used. However, the difficult comes in when we cannot observe them completely. One of the most commonly encountered mechanisms is called right censoring, meaning that there could be another event (censoring time) happens before the failure time, which would remove the subject out of the study, hence preventing us from observing the failure time. For example,

A patient participanted in a study of survival time after a cancer related surgry. At day 10, the patient drops out for some random uncontrollable reasons, and the outcome of interest (death) is not observed. However, we know that the death time should be larger than 10 days.

In this example above, we did not observe the event of interest, but only the last follow-up time, 10 days. This is called the censoring time in survival analysis. In particular, this is referred to as the right-censoring type There are also other censoring mechanism, however, we will only focus on this type of censoring.

In a right-censoring setting, we denote the event of interest as \(T\), the failure time, and the lost-to-follow-up time as \(C\), the censoring time. We may have two situations for each subject:

  • The failure time \(T\) happens before the censoring time \(C\), hence \(T\) is observed. Define a censoring indicator \(\delta = 1\) for this situation.
  • The censoring time \(C\) happens before the failure time \(T\), hence \(C\) is observed. Define a censoring indicator \(\delta = 0\) for this situation.

The following plot shows how these observations look like in a real study:

  set.seed(1)
  n = 10
  tstart = runif(n)
  tend = tstart + rexp(n)
  tcensor = rbinom(n, 1, 0.5)
  plot(NA, NA, xlim = c(0, max(tend)), ylim = c(0, n+1), xlab = "Time", ylab = "Subject")
  
  for (i in 1:n)
  {
    segments(x0 = tstart[i], i, x1 = tend[i], y1 = i)
    points(tend[i], i, pch = ifelse(tcensor[i] == 1, 4, 1), 
           cex = 2, lwd = 2, col = ifelse(tcensor[i] == 1, "blue", "red"))
  }
  legend("topright", c("Failure", "Censor"), pch = c(4, 1), col = c("blue", "red"), cex = 1.3)

To perform the analysis, we need to first align all subjects to the same starting point:

  plot(NA, NA, xlim = c(0, max(tend)), ylim = c(0, n+1), xlab = "Time", ylab = "Subject")
  
  for (i in 1:n)
  {
    segments(x0 = 0, i, x1 = tend[i] - tstart[i], y1 = i)
    points(tend[i] - tstart[i], i, pch = ifelse(tcensor[i] == 1, 4, 1), 
           cex = 2, lwd = 2, col = ifelse(tcensor[i] == 1, "blue", "red"))
  }
  legend("topright", c("Failure", "Censor"), pch = c(4, 1), col = c("blue", "red"), cex = 1.3)

Hence, in survival analysis, we have two outcomes for each subject:

  • The observed time (regardless of failure or censoring) \(Y = \min(T, C)\)
  • The censoring indicator (0 or 1) \(\delta = I(T \leq C)\)
  Y = data.frame("Time" = tend - tstart, "Censor" = tcensor)
  head(Y)

In R, a survival outcome can also be represented using the Surv() function from the survival package, this will put a + sign after the censored subjects, indicating that the actual failure time is larger than the observed.

  library(survival)
  Surv(Y$Time, Y$Censor)
##  [1] 2.0340910+ 1.7987484+ 0.3740457  1.2295621+ 0.5396828+ 0.9565675  0.1470460+ 1.3907351+ 0.7620299  1.2376036

Estimating the Survival Function

Example: Mayo Clinic Primary Biliary Cholangitis (PBC) Data is one of the most famous survival datasets (Fleming and Harrington, 1991). This data is collected from the Mayo Clinic trial in PBC conducted between 1974 and 1984. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data (survival package documentation).

  library(survival)  
  data(pbc)
  pbc$censor = (pbc$status == 2)
  pbc = pbc[1:312, ]
  head(pbc)

In this dataset, time is the observed time (fail or censor) and status is the status at endpoint. In the original data, there are three types of end points: 0 for censored, 1 for transplant and 2 for dead. We will treat both 0 and 1 as censored observations since we are interested in the length of death.

The survival function gives the probability that a patient will survive beyond any specified time.

Mathematically, for any given time \(t\), the survival function is

\[S(t) = P(T > t)\] The most difficult problem in survival analysis is that we do not observe all the failure times \(T\). Be careful that using only the observed time (regardless of the censoring) for the analysis would lead to very biased results. To correctly estimate the survival function, the Kaplan–Meier estimator or the Nelson–Aalen estimator can be used. The following plot shows the estimated survival curve and the standard deviation. By default, this function uses the Nelson–Aalen estimator

  plot(Surv(pbc$time, pbc$censor))

The main idea of a survival function estimation is to account for the fact that at each time point, we should estimate the risk of failure by only considering the subjects that are still remained in the study. This involves a new concept called the hazard function which describes the rate of failure at each given time point. Mathematically, the hazard function is defined as

\[h(t) = \frac{f(t)}{S(t)}\] where \(f(t)\) is the density function. The relationship of the three could be visualized with the following normal distribution function. The technique behind the survival function calculation is to estimate the hazard at each given time point, with numerator being a failure event, and the denominator being the count of subjects at risk at the time. The hazard function can then be recovered to calculate the survival function (the detail is beyond the scope of this course).

  x = seq(-3, 3, 0.01)
  fx = dnorm(x)
  Sx = 1 - pnorm(x)
  hx = fx / Sx
  par(mfrow=c(1,3))
  plot(x, fx, type = "l", lwd = 3, main = "Density", cex.main = 3)
  plot(x, Sx, type = "l", lwd = 3, main = "Survival", cex.main = 3)
  plot(x, hx, type = "l", lwd = 3, main = "Hazard", cex.main = 3)

Practice Questions

Read the Salinas-Escudero et al. 2020 paper, Figure 1. For different age groups, what are their chance to survival at 20 days? For patients who entered ICU, what is the median survival time (the time where half of the population fail)?

Comparing Two Survial Functions

Similar to the continuous outcome case, where we could use the \(t\) test to test their difference, we also want to compare if the survivabilities of two groups are different given a set of samples. For example, let’s test if the two treatment trt groups (1 for D-penicillmain, 2 for placebo) are different. The two survival curves look like the following

  pbc[, c("time", "censor", "trt")]
  plot(survfit(Surv(time, censor) ~ trt, data = pbc), col = c("red","blue"))

The most popular test to compare the two curves is called the log-rank test. Essentially, the test is comparing a cumulative version of the hazard function over the entire time line. This can be done using the survdiff() function:

  survdiff(Surv(time, censor) ~ trt, data = pbc)
## Call:
## survdiff(formula = Surv(time, censor) ~ trt, data = pbc)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## trt=1 158       65     63.2    0.0502     0.102
## trt=2 154       60     61.8    0.0513     0.102
## 
##  Chisq= 0.1  on 1 degrees of freedom, p= 0.7

The function provides some summary statistics in addition to the test of difference:

  • The number of observations and the number of failures in both groups.
  • The terms (O-E)^2/E and (O-E)^2/V are both summaries of the deviation of the hazard function from the pooled samples. This would not be very useful if you are only interested in the test significance.
  • The Chisq term is the statistic calculated for this test, followed by its p-value. A detailed step-by-step calculation of this quantity can be found here

Practice Questions

Use the pbc data to perform the following tasks:

  • Plot the survival function of female group
  • Test the survival differences between male and female groups
  plot(Surv(pbc$time[pbc$sex == "f"], pbc$censor[pbc$sex == "f"]))
  survdiff(Surv(time, censor) ~ sex, data = pbc)

Cox Propotional Hazard Model

The next step is to model the effect of covariates. This is much more difficult than a regression model. In a regression model, we estimate the mean of outcome \(Y\). However, this is not always possible for a survival model since large values of failure time will most likely get censored. In many cases, there can be a maximum follow-up time (truncation) which prevents the failure time form being observed beyond that point. Hence the Cox Propositional Hazard (Cox-PH) model instead focus on the hazard function. The central idea is that the covariate changes would modify the hazard function, which in turn changes the survival function. But this is only up to the time point where we could estimate the hazard, i.e., the time point where we can observe failures.

Mathematically, we would estimate a quantity called baseline hazard, \(\lambda_0(t)\), which represent a certain type of “reference” of all individuals, and then the hazard function of a specific subject with covariate vector \(X\) can be obtained using

\[\begin{align} \lambda(t | X) &= \lambda_0 \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\\ &= \lambda_0 \exp(\beta_1 X_1) \exp(\beta_2 X_2) \cdots \exp(\beta_p X_p) \end{align}\] We need to notice three facts:

  • There is no intercept term, since the baseline hazard \(\lambda_0(t)\) serves the same purpose
  • Higher link function \(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\) would lead to higher hazard, meaning that the survival function would decrease more rapidly as time progresses.
  • The effect of each covariate is multiplicative to the hazard function

To fit a Cox-PH model, we will use the coxph() function. Consider using both trt and age variables. The syntax is almost the same as a linear regression.

  pbcfit <- coxph(Surv(time, censor) ~ age + trt, data = pbc)
  summary(pbcfit)
## Call:
## coxph(formula = Surv(time, censor) ~ age + trt, data = pbc)
## 
##   n= 312, number of events= 125 
## 
##         coef exp(coef) se(coef)     z Pr(>|z|)    
## age 0.040632  1.041469 0.008964 4.533 5.83e-06 ***
## trt 0.073604  1.076380 0.182068 0.404    0.686    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## age     1.041     0.9602    1.0233     1.060
## trt     1.076     0.9290    0.7533     1.538
## 
## Concordance= 0.625  (se = 0.026 )
## Likelihood ratio test= 20.67  on 2 df,   p=3e-05
## Wald test            = 20.71  on 2 df,   p=3e-05
## Score (logrank) test = 20.97  on 2 df,   p=3e-05

We should pay attention to several quantities:

  • Coefficient coef and the associated p-values. This part is similar to a linear regression model
  • exp(coef) and its 95% confidence interval lower .95, upper .95. The reason that we want to take the exponential of the coefficient is due to the previously explained fact that the effect of a covariate is multiplicative, and the magnitude is \(\exp(\beta X)\). Hence, if this quantity is 1, then there is no effect to the baseline hazard. In other words, if the confidence interval excludes 1, then this variable would present significant effect to the hazard, which also affects the survival.
  • Concordance this is the agreement between an observed response and the predicted link function. However, be aware that we do not predict the outcome exactly, we only predict it relatively in terms of their hazard. Hence, if a subject is predicted to have high risk and is also failing very early, then this is a good model. Hence, the concordance score is essentially checking a type of rank correlation (see our correlation slides in Week 4) between the fitted and the observed.

Interpretation of Coefficients in a Cox-PH

Again, this would be related to the hazard. We can do the same analysis as the parameter in a logistic regression by setting two groups age = a and age = a+1 (we will ignore the effect of treatment for simplicity):

  • age = a group: the hazard is \(\lambda_0 \exp(\beta a)\)
  • age = a + 1 group: the hazard is \(\lambda_0 \exp(\beta (a+1))\)

Hence, if we look at the ratio of the corresponding hazards:

\[\frac{\lambda_0 \exp(\beta (a+1))}{\lambda_0 \exp(\beta a)} = \exp(\beta)\] So, the parameter \(\beta\) represents the log of hazard ratio, corresponding to two groups with 1 unit difference of the variable. Large positive \(\beta\) means if the variable value increases, the hazard is increasing, and the subject is expected to fail early.

Practice Questions

Use the pbc data to perform the following tasks:

  • Include gender variable into the model. Is this variable significant?
  • Test the interaction term between age and treatment, how do you interpret them?
  pbcfit <- coxph(Surv(time, censor) ~ age + trt + sex, data = pbc)
  summary(pbcfit)
## Call:
## coxph(formula = Surv(time, censor) ~ age + trt + sex, data = pbc)
## 
##   n= 312, number of events= 125 
## 
##           coef exp(coef)  se(coef)      z Pr(>|z|)    
## age   0.038820  1.039583  0.008995  4.316 1.59e-05 ***
## trt   0.062647  1.064651  0.181853  0.344    0.730    
## sexf -0.337659  0.713438  0.238873 -1.414    0.157    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## age     1.0396     0.9619    1.0214     1.058
## trt     1.0647     0.9393    0.7454     1.521
## sexf    0.7134     1.4017    0.4467     1.139
## 
## Concordance= 0.623  (se = 0.026 )
## Likelihood ratio test= 22.53  on 3 df,   p=5e-05
## Wald test            = 23.22  on 3 df,   p=4e-05
## Score (logrank) test = 23.43  on 3 df,   p=3e-05
  
  pbcfit <- coxph(Surv(time, censor) ~ age + trt + sex + age*trt, data = pbc)
  summary(pbcfit)
## Call:
## coxph(formula = Surv(time, censor) ~ age + trt + sex + age * 
##     trt, data = pbc)
## 
##   n= 312, number of events= 125 
## 
##             coef exp(coef) se(coef)      z Pr(>|z|)
## age      0.02090   1.02112  0.02712  0.771    0.441
## trt     -0.60273   0.54731  0.96792 -0.623    0.533
## sexf    -0.31861   0.72716  0.24046 -1.325    0.185
## age:trt  0.01257   1.01265  0.01794  0.700    0.484
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## age        1.0211     0.9793    0.9683     1.077
## trt        0.5473     1.8271    0.0821     3.649
## sexf       0.7272     1.3752    0.4539     1.165
## age:trt    1.0126     0.9875    0.9777     1.049
## 
## Concordance= 0.625  (se = 0.027 )
## Likelihood ratio test= 23.02  on 4 df,   p=1e-04
## Wald test            = 23.6  on 4 df,   p=1e-04
## Score (logrank) test = 23.82  on 4 df,   p=9e-05