Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Name your submission file as HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted. Make all of your R code chunks visible for grading. If you use the .Rmd file as a template, be sure to remove this instruction section. Make sure your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else’s. Please note that updating the R version may require you to re-install all of your packages.
The first three questions of this homework are quite simple. We will use three different models to estimate the personalized treatment rule, and we will use the Sepsis dataset in all three questions. Keep in mind that the Sepsis dataset contains a BEST variable, which is the doctor’s best guess for the treatment. You should never use this variable in your model fitting; only use it when evaluating the performance of your model. The data is provided on our course website.
Implement the direct learning approach with linear regression. Since
this is an observational study, you should incorporate propensity score
weighting when estimating the treatment rule. After obtaining the
estimated treatment rule, compare it with the best label
Sepsis$BEST
and report your results.
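For reference, the identity behind the direct learning construction (with \(A \in \{-1, 1\}\) and propensity \(\pi(a \mid x) = P(A = a \mid X = x)\)) is

\[
E\left[\frac{A Y}{\pi(A \mid X)} \,\Big|\, X = x\right] = E[Y \mid X = x, A = 1] - E[Y \mid X = x, A = -1],
\]

so regressing the transformed outcome \(AY/\pi(A \mid X)\) on \(X\) targets the treatment-effect contrast, and the estimated rule is \(\hat d(x) = \mathrm{sign}(\hat f(x))\).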
Answer:
The accuracy of selecting the best treatment is 76.60%.
df = read.csv("Sepsis.csv")
df = df[,-1]
df$THERAPY[df$THERAPY==0] = -1
# estimate propensity score
pscore = glm(as.factor(THERAPY)~.-Health-BEST,data=df,family=binomial)$fitted.values
# fit the regression model
Y = df$Health*df$THERAPY/pscore      # IPW-transformed outcome
X = df[,-c(1,2,ncol(df))]            # drop THERAPY, Health, and BEST
mod = lm(Y~., data=X)
summary(mod)
##
## Call:
## lm(formula = Y ~ ., data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7337 -3.0178 -0.1693 2.8358 12.2930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.633e+00 1.556e+00 1.692 0.0912 .
## PRAPACHE 7.433e-02 3.819e-02 1.947 0.0522 .
## AGE -6.502e-02 1.327e-02 -4.899 1.34e-06 ***
## BLGCS -2.954e-02 6.744e-02 -0.438 0.6616
## ORGANNUM -1.218e-01 2.607e-01 -0.467 0.6406
## BLIL6 1.005e-06 2.999e-06 0.335 0.7376
## BLLPLAT 6.392e-04 1.549e-03 0.413 0.6801
## BLLBILI -8.781e-02 4.811e-02 -1.825 0.0686 .
## BLLCREAT -7.335e-02 4.961e-02 -1.479 0.1399
## TIMFIRST 3.370e-04 2.228e-04 1.513 0.1311
## BLADL 8.030e-02 5.257e-02 1.528 0.1273
## blSOFA 3.033e-03 9.895e-02 0.031 0.9756
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.448 on 458 degrees of freedom
## Multiple R-squared: 0.07563, Adjusted R-squared: 0.05342
## F-statistic: 3.406 on 11 and 458 DF, p-value: 0.000147
pred <- sign(predict(mod, X))
df$BEST[df$BEST==0] = -1
mean(pred == df$BEST)
## [1] 0.7659574
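As a self-contained sanity check (simulated data, not the Sepsis data), the same IPW-transformed regression recovers a known treatment rule. All names and the data-generating model here are hypothetical, chosen only to illustrate the construction:

```r
# simulated check of the direct-learning construction (hypothetical data)
set.seed(1)
n <- 2000
x <- rnorm(n)
p <- plogis(x)                        # true P(A = 1 | X)
a <- ifelse(runif(n) < p, 1, -1)      # confounded treatment in {-1, 1}
ps <- ifelse(a == 1, p, 1 - p)        # propensity of the received arm
y <- 1 + x + a * x + rnorm(n)         # true treatment-effect contrast is 2x
ystar <- y * a / ps                   # IPW-transformed outcome
fit <- lm(ystar ~ x)
# sign of the fitted values should track the sign of the true effect
mean(sign(fitted(fit)) == sign(x))
```

The agreement rate should be close to 1, since the fitted line approximates \(2x\) and its sign matches the sign of the true effect except near \(x = 0\).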
Use the outcome weighted learning via WeightSVM
package
with support vector machine to analyze the Sepsis
dataset.
You may have to consider several modeling decisions and try some of them.
If the performance is not as good as you want, try the
DTRlearn2
package to perform this task again. Note that the
DTRlearn2
package has already incorporated some rough
internal tuning procedures and you only need to try different kernels.
Report your results.
Answer:
We can see that OWL with SVM could only reach about 60% accuracy. This is likely because the dataset is noisy or the classes overlap substantially, which makes the SVM perform relatively poorly in this case.
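For context, outcome weighted learning maximizes the empirical value of a rule \(d\),

\[
\hat V(d) = \frac{1}{n}\sum_{i=1}^n \frac{Y_i \, \mathbf{1}\{A_i = d(X_i)\}}{\pi(A_i \mid X_i)},
\]

which is equivalent to a weighted classification problem with labels \(A_i\) and weights \(Y_i / \pi(A_i \mid X_i)\); the weighted SVM solves a hinge-loss surrogate of this problem. Because the weights must be nonnegative, the outcome is shifted by a constant before fitting.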
library(WeightSVM)
set.seed(1)
# get weights: shift Health by 7 so all weights are positive, then apply IPW
W = (df$Health+7)/pscore
# tuning parameters
gamma_set = exp(seq(log(0.005), log(0.5), length.out =10))
# fit the model
for(i in 1:length(gamma_set)){
  owl.fit = wsvm(THERAPY ~ . - Health - BEST, data = df,
                 kernel = "radial", gamma = gamma_set[i],
                 weight = W)
  print(mean((owl.fit$fitted > 0) == (df$BEST == 1)))
}
## [1] 0.5553191
## [1] 0.5914894
## [1] 0.612766
## [1] 0.5851064
## [1] 0.5723404
## [1] 0.5765957
## [1] 0.5404255
## [1] 0.5042553
## [1] 0.4851064
## [1] 0.4787234
# try a different kernel
W = (df$Health + 7)/pscore
gamma_set = exp(seq(log(0.005), log(0.5), length.out =10))
for(i in 1:length(gamma_set)){
  owl.fit = wsvm(THERAPY ~ . - Health - BEST,
                 data = df,
                 kernel = "polynomial",
                 gamma = gamma_set[i],
                 coef0 = 0.5,
                 weight = W)
  print(mean((owl.fit$fitted > 0) == (df$BEST == 1)))
}
## [1] 0.5446809
## [1] 0.5787234
## [1] 0.5957447
## [1] 0.5808511
## [1] 0.5702128
## [1] 0.5765957
## [1] 0.5446809
## [1] 0.5319149
## [1] 0.5276596
## [1] 0.5212766
# use linear model
fit = glm(as.factor(THERAPY) ~ . - Health - BEST, data = df,
family = "binomial",weights=W)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
mean((fit$fitted.values>=0.5)==(df$BEST==1))
## [1] 0.5765957
# try the DTRlearn2 package
library(DTRlearn2)
X = scale(df[,-c(1,2,ncol(df))])
A = as.numeric(df$THERAPY)
R = df$Health
# fit the OWL model
OWL.fit = owl(H = data.matrix(X), A = A, R = R,
              K = 1, # a single stage of treatment, so K = 1
              kernel = "linear")
OWL.pred = predict(OWL.fit, data.matrix(X), K = 1)
# accuracy of the estimated rule against the BEST label
mean((OWL.pred$treatment[[1]] > 0) == (df$BEST > 0))
## [1] 0.7510638
Use the grf
package to find the optimal treatment rule
for the Sepsis
dataset. Compare the estimated treatment
rule with the best label Sepsis$BEST
and report your
results. Furthermore, report the estimated average treatment effect
(ATE) of the entire population (based on our understanding of the
relationship between the conditional average treatment effect and the
average treatment effect). You may also use the grf
package
documentation.
Answer:
The accuracy of selecting the optimal treatment using generalized random forest is 79.57%.
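The ATE computation relies on the relationship

\[
\text{ATE} = E[Y(1) - Y(0)] = E\big[\tau(X)\big], \qquad \tau(x) = E[Y(1) - Y(0) \mid X = x],
\]

so averaging the forest’s CATE predictions \(\hat\tau(X_i)\) over the sample estimates the ATE. Note that `average_treatment_effect()` in grf uses a doubly robust (AIPW) version of this average, so the two numbers are close but not identical.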
#install.packages('grf')
library(grf)
tau.forest <- causal_forest(df[,-c(1,2,ncol(df))], df$Health, df$THERAPY,mtry=5,num.trees=2000)
mean((tau.forest$predictions > 0) == (df$BEST == 1))
## [1] 0.7914894
# average treatment effect
average_treatment_effect(tau.forest, target.sample = "all")
## estimate std.err
## -0.01767283 0.09432337
# approximately the same value via ATE = E[tau(X)], the average of the CATE predictions
mean(tau.forest$predictions)
## [1] -0.01547598
Suppose that I want to estimate \(E(X^3)\) when \(X\) is a random variable from \(N(1,1)\). However, I do not know how to perform integration, and all I can do is generate random samples and perform numerical approximation. In addition, my machine can only generate random samples from \(N(2,2)\), but I do know the analytic density functions of both distributions. How can I estimate/approximate \(E(X^3)\) numerically using the Radon–Nikodym theorem? Use 5000 samples to make your estimation stable. Validate your answer using simulated data from \(N(1,1)\) or analytic integration.
Hint: if you ask GPT or write your prompt correctly with Copilot, it will give you the answer. But make sure to explain your procedure.
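The importance-sampling identity behind this, with \(p\) the density of \(N(1,1)\) and \(q\) the density of \(N(2,2)\), is

\[
E_p[X^3] = \int x^3 \, p(x)\, dx = \int x^3 \, \frac{p(x)}{q(x)}\, q(x)\, dx = E_q\!\left[X^3 \frac{p(X)}{q(X)}\right] \approx \frac{1}{n}\sum_{i=1}^n x_i^3 \, \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q,
\]

where \(p/q\) is the Radon–Nikodym derivative; the identity is valid because \(q > 0\) wherever \(p > 0\).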
Answer:
We use importance sampling: draw samples from the proposal \(N(2,2)\), weight each sample by the density ratio of \(N(1,1)\) to \(N(2,2)\) (the Radon–Nikodym derivative), and average the weighted values of \(x^3\). This estimator is consistent for \(E(X^3)\) under the target distribution.
# generate samples from N(2,2)
n = 5000
X = rnorm(n, mean = 2, sd = 2)
# compute the Radon-Nikodym derivative
ratio = dnorm(X, mean = 1, sd = 1)/dnorm(X, mean = 2, sd = 2)
# estimate E(X^3)
mean(X^3 * ratio)
## [1] 3.919327
# compare with the simulated data from N(1,1)
mean(rnorm(n, mean = 1, sd = 1)^3)
## [1] 3.871007
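As an additional check not in the original solution, the third moment can also be validated by numerical integration with base R’s `integrate`; the closed form for \(X \sim N(\mu, \sigma^2)\) is \(E(X^3) = \mu^3 + 3\mu\sigma^2\), which equals 4 here.

```r
# analytic validation: integrate x^3 against the N(1,1) density
exact <- integrate(function(x) x^3 * dnorm(x, mean = 1, sd = 1),
                   lower = -Inf, upper = Inf)$value
exact
## [1] 4
```

Both the importance-sampling estimate and the plain Monte Carlo estimate above are consistent with this exact value.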