Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Please name your submission HWx_yourNetID.pdf; for example, HW01_rqzhu.pdf. Please note that this must be a .pdf file: .html files cannot be accepted since they may not be readable in Gradescope. All proofs must be typed in LaTeX format, and all of your R code chunks must be visible for grading. If you use the provided .Rmd file as a template, be sure to remove this instruction section. Use R version \(\geq 4.0.0\); this will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages. In the markdown file, set the seed properly so that the results can be reproduced.
Given a fitted random forest, with each tree as a partition on the space of \(\cal X\), show that the kernel function \(K(x, z)\) induced from the forest is positive definite.
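One standard way to argue this (assuming \(K(x, z)\) is defined as the proportion of trees in which \(x\) and \(z\) fall into the same terminal node): let tree \(b\) partition \(\cal X\) into terminal nodes \(\{A_{b,m}\}_m\) and let \(\phi_b(x) = \big(\mathbf{1}\{x \in A_{b,1}\}, \mathbf{1}\{x \in A_{b,2}\}, \ldots\big)\) be the vector of node-membership indicators. Then
\[ K(x, z) = \frac{1}{B} \sum_{b=1}^{B} \sum_{m} \mathbf{1}\{x \in A_{b,m}\} \, \mathbf{1}\{z \in A_{b,m}\} = \frac{1}{B} \sum_{b=1}^{B} \langle \phi_b(x), \phi_b(z) \rangle, \]
so for any points \(x_1, \ldots, x_n\) and coefficients \(c_1, \ldots, c_n\),
\[ \sum_{i,j} c_i c_j K(x_i, x_j) = \frac{1}{B} \sum_{b=1}^{B} \Big\| \sum_{i} c_i \phi_b(x_i) \Big\|^2 \ge 0, \]
which is exactly the positive (semi-)definiteness of \(K\).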
For this question, our goal is to write a function that creates a kernel based on the idea of a Mondrian tree. This question consists of several parts. The tree is stored as a matrix with one row per node and six columns: node_type (1 for an internal node, -1 for a leaf node), split_dim, cut_loc, left_child, right_child (all NA for leaf nodes), and pred_value (NA until the tree is fitted to data). Here is an example of the tree structure for a tree with 3 internal nodes and 4 leaf nodes. The first row is the root node that splits into nodes 2 and 3, and each is further split into two leaf nodes.
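For illustration, such a 7-row matrix could look like the following (the split dimensions and cut locations are made-up numbers):
# Illustrative only: 3 internal nodes (rows 1-3) and 4 leaf nodes (rows 4-7)
example_tree <- rbind(
  c( 1,  2, 0.40,  2,  3, NA),  # root: split on X2 at 0.40, children are rows 2 and 3
  c( 1,  1, 0.25,  4,  5, NA),  # internal node: split on X1 at 0.25
  c( 1,  1, 0.70,  6,  7, NA),  # internal node: split on X1 at 0.70
  c(-1, NA,   NA, NA, NA, NA),  # leaf
  c(-1, NA,   NA, NA, NA, NA),  # leaf
  c(-1, NA,   NA, NA, NA, NA),  # leaf
  c(-1, NA,   NA, NA, NA, NA)   # leaf
)
colnames(example_tree) <- c("node_type", "split_dim", "cut_loc",
                            "left_child", "right_child", "pred_value")
example_tree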
Write a (while or for) loop that iteratively finds and splits current nodes and adds child nodes to the tree until the budget \(\lambda\) is exhausted. You can use the rexp function to generate exponential random variables for each of the current nodes, and find the one that should be split. After the splitting, update the budget, the tree structure, and the bounding boxes accordingly. Call the function with \(p = 2\) and \(\lambda = 2\), and print the tree structure (restricted to the first 5 rows) and the bounding boxes.
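As a reminder of the Mondrian splitting mechanics used in the answer below: a current node with bounding box \(\prod_{j=1}^{p} [l_j, u_j]\) is assigned an independent split time
\[ E \sim \text{Exp}\Big( \textstyle\sum_{j=1}^{p} (u_j - l_j) \Big), \]
and the node with the smallest split time splits first, provided the accumulated time stays below \(\lambda\). The split dimension \(j\) is then chosen with probability proportional to \(u_j - l_j\), and the cut location is drawn uniformly from \([l_j, u_j]\) in that dimension.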
Here is my output:
## node_type split_dim cut_loc left_child right_child pred_value
## [1,] 1 2 0.1303658 2 3 NA
## [2,] 1 1 0.2542666 4 5 NA
## [3,] 1 1 0.1783499 6 7 NA
## [4,] -1 NA NA NA NA NA
## [5,] -1 NA NA NA NA NA
## [,1] [,2]
## [1,] 0.0000000 0.0000000
## [2,] 0.0000000 0.0000000
## [3,] 0.0000000 0.1303658
## [4,] 0.0000000 0.0000000
## [5,] 0.2542666 0.0000000
## [,1] [,2]
## [1,] 1.0000000 1.0000000
## [2,] 1.0000000 0.1303658
## [3,] 1.0000000 1.0000000
## [4,] 0.2542666 0.1303658
## [5,] 1.0000000 0.1303658
Answer:
set.seed(546)
myMondrianTree <- function(lambda, p) {
# Initialize the tree structure matrix
tree <- matrix(NA, nrow = 1, ncol = 6)
colnames(tree) <- c("node_type", "split_dim", "cut_loc", "left_child", "right_child", "pred_value")
tree[1, ] <- c(-1, NA, NA, NA, NA, NA) # Root node; -1 marks current (unsplit) nodes that are being considered for splitting
# Initialize bounding box matrices
lower_bound <- matrix(0, nrow = 1, ncol = p)
upper_bound <- matrix(1, nrow = 1, ncol = p)
# Initialize variables
total_time <- 0
while (total_time < lambda)
{
# find all current nodes
all_nodes = which(tree[, 1] == -1) # Current nodes that can be split
# calculate the rate for each current node
rates = sapply(all_nodes, function(node) sum(upper_bound[node, ] - lower_bound[node, ]))
# generate exponential random variables for each current node
exp_rvs = rexp(length(all_nodes), rate = rates)
# find the node with the smallest exponential random variable
split_node = all_nodes[which.min(exp_rvs)]
# update the total time
total_time = total_time + min(exp_rvs)
if (total_time >= lambda) {
break
}
# randomly select a dimension to split
split_dim = sample(1:p, 1, prob = upper_bound[split_node, ] - lower_bound[split_node, ])
# randomly select a cut location within the bounding box of the selected dimension
cut_loc = runif(1, lower_bound[split_node, split_dim], upper_bound[split_node, split_dim])
# update the tree structure and node types
tree[split_node, ] <- c(1, split_dim, cut_loc, nrow(tree) + 1, nrow(tree) + 2, NA) # Update the split node
tree <- rbind(tree, c(-1, NA, NA, NA, NA, NA), c(-1, NA, NA, NA, NA, NA)) # Add two new current nodes
# update the bounding boxes
left_upper_bound <- upper_bound[split_node, ]
left_upper_bound[split_dim] <- cut_loc
left_lower_bound <- lower_bound[split_node, ]
right_lower_bound <- lower_bound[split_node, ]
right_lower_bound[split_dim] <- cut_loc
right_upper_bound <- upper_bound[split_node, ]
lower_bound <- rbind(lower_bound, left_lower_bound, right_lower_bound)
upper_bound <- rbind(upper_bound, left_upper_bound, right_upper_bound)
# print the current tree structure for debugging
# cat("\n\n --------- Split ----")
# print(tree)
# print(lower_bound)
# print(upper_bound)
}
rownames(lower_bound) <- NULL
rownames(upper_bound) <- NULL
return(list("tree" = tree,
"lower_bound" = lower_bound,
"upper_bound" = upper_bound))
}
newfit = myMondrianTree(lambda = 2, p = 2)
printrows = min(5, nrow(newfit$tree))
newfit$tree[1:printrows, ]
## node_type split_dim cut_loc left_child right_child pred_value
## [1,] 1 2 0.1303658 2 3 NA
## [2,] 1 1 0.2542666 4 5 NA
## [3,] 1 1 0.1783499 6 7 NA
## [4,] -1 NA NA NA NA NA
## [5,] -1 NA NA NA NA NA
newfit$lower_bound[1:printrows, ]
## [,1] [,2]
## [1,] 0.0000000 0.0000000
## [2,] 0.0000000 0.0000000
## [3,] 0.0000000 0.1303658
## [4,] 0.0000000 0.0000000
## [5,] 0.2542666 0.0000000
newfit$upper_bound[1:printrows, ]
## [,1] [,2]
## [1,] 1.0000000 1.0000000
## [2,] 1.0000000 0.1303658
## [3,] 1.0000000 1.0000000
## [4,] 0.2542666 0.1303658
## [5,] 1.0000000 0.1303658
After constructing the Mondrian tree, you should be able to visualize
the partition on \([0, 1]^2\) for \(p = 2\). To do this, use the
rect() function to plot the terminal node bounding boxes of
all leaf nodes. You can consider assigning a color to each leaf node.
Here is what I have.
In addition, write a prediction function that finds the terminal node for each observation. Generate 300 observations uniformly on \([0, 1]^2\), find the leaf node each observation belongs to, assign a color to each leaf node, and plot the observations using the color of the leaf node they belong to. Here is what I have. After this question, you can discard the bounding box matrices, since they are no longer needed.
Answer:
set.seed(546)
newfit = myMondrianTree(lambda = 4, p = 2)
tree = newfit$tree
# plot all terminal nodes
# initiate empty plot
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
xlab = "X1", ylab = "X2",
main = "Mondrian Tree Partition")
nodecolor = matrix(NA, nrow(tree))
for (i in 1:nrow(tree))
{
if (tree[i, 1] == 1) next # Skip non-leaf nodes
# Get the bounding box for the leaf node
lower <- newfit$lower_bound[i, ]
upper <- newfit$upper_bound[i, ]
# Draw the rectangle for the bounding box
usecolor = rgb(runif(1), runif(1), runif(1))
nodecolor[i] = usecolor
rect(lower[1], lower[2], upper[1], upper[2],
col = usecolor, border = "black")
}
# Generate 300 random data in [0, 1]^2
x <- matrix(runif(300 * 2), ncol = 2)
colnames(x) <- c("X1", "X2")
# Function to find the leaf node for a given x
findLeafNode <- function(x, tree) {
node <- 1 # Start from the root node
while (tree[node, 1] != -1) { # While not a leaf node
split_dim <- tree[node, 2]
cut_loc <- tree[node, 3]
if (x[split_dim] <= cut_loc) {
node <- tree[node, 4] # Go to left child
} else {
node <- tree[node, 5] # Go to right child
}
}
return(node)
}
# Find the leaf node for each point
leaf_nodes <- sapply(1:nrow(x), function(i) findLeafNode(x[i, ], tree))
# initiate empty plot
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
xlab = "X1", ylab = "X2",
main = "Mondrian Tree Partition")
# plot frame
for (i in 1:nrow(tree))
{
if (tree[i, 1] == 1) next # Skip non-leaf nodes
# Get the bounding box for the leaf node
lower <- newfit$lower_bound[i, ]
upper <- newfit$upper_bound[i, ]
# Draw the rectangle for the bounding box
rect(lower[1], lower[2], upper[1], upper[2],
border = "grey")
}
# Plot the points
for (i in 1:nrow(x))
{
usecol = nodecolor[leaf_nodes[i]]
points(x[i, 1], x[i, 2], col = usecol, pch = 19)
}
Now that you have your tree model, let's use it for a regression problem.
Use the following data generating process to generate your training and testing data.
\[ Y = \sin(2 \pi X_1) + 2 X_2^2 + \epsilon, \quad \epsilon \sim N(0, 1) \]
where \(X\) is uniform on \([0, 1]^p\). The functional surface looks like the following:
set.seed(546)
# Generate a grid of points for plotting the true function
grid_size <- 50
x1_seq <- seq(0, 1, length.out = grid_size)
x2_seq <- seq(0, 1, length.out = grid_size)
grid_points <- expand.grid(X1 = x1_seq, X2 = x2_seq)
# True function
true_function <- function(x) {
sin(2 * pi * x[1]) + 2 * x[2]^2
}
# Compute the true function values on the grid
true_values <- apply(grid_points, 1, true_function)
# Reshape for contour plot
true_matrix <- matrix(true_values, nrow = grid_size, ncol = grid_size)
# Plot the true function surface
contour(x1_seq, x2_seq, true_matrix,
main = "True Function Surface",
xlab = "X1", ylab = "X2", col = terrain.colors(10))
Generate 400 training observations and 1000 testing observations. Fit a Mondrian tree with \(\lambda = 4\) on the training data, and make predictions on the testing data. You may experiment with different values of \(\lambda\). Report the Mean Squared Error (MSE) on the testing data.
This is my fitted surface by the Mondrian tree with \(\lambda = 10\):
Answer:
set.seed(546)
# Generate training data
ntrain <- 400
ntest <- 1000
X <- matrix(runif((ntrain + ntest) * 2), ncol = 2)
colnames(X) <- c("X1", "X2")
Y = sin(2 * pi * X[, 1]) + 2 * X[, 2]^2 + rnorm(ntrain + ntest)
train_X <- X[1:ntrain, ]
train_Y <- Y[1:ntrain]
test_X <- X[(ntrain + 1):(ntrain + ntest), ]
test_Y <- Y[(ntrain + 1):(ntrain + ntest)]
# Fit Mondrian tree
lambda <- 10
mondrian_construct <- myMondrianTree(lambda, p = 2)
tree <- mondrian_construct$tree
# a function to fit the tree with training data
mondrian_fit <- function(tree, train_X, train_Y) {
# Find leaf nodes for training data
leaf_nodes <- sapply(1:nrow(train_X), function(i) findLeafNode(train_X[i, ], tree))
# Calculate mean response for each leaf node
leaf_means <- tapply(train_Y, leaf_nodes, mean)
# Update the tree structure with mean responses
for (node in names(leaf_means)) {
tree[as.numeric(node), 6] <- leaf_means[node]
}
return(tree)
}
# Function to make predictions
mondrian_pred <- function(tree, test_X) {
predictions <- sapply(1:nrow(test_X), function(i) {
leaf_node <- findLeafNode(test_X[i, ], tree)
return(tree[leaf_node, 6])
})
return(predictions)
}
# train the model
tree_fit <- mondrian_fit(tree, train_X, train_Y)
test_pred <- mondrian_pred(tree_fit, test_X)
# Calculate Mean Squared Error
mse <- mean((test_pred - test_Y)^2, na.rm = TRUE)
print(paste("Mean Squared Error on Test Data:", round(mse, 4)))
## [1] "Mean Squared Error on Test Data: 1.2669"
# plot the fitted surface of the testing data
grid_size = 30
x1_seq <- seq(0, 1, length.out = grid_size)
x2_seq <- seq(0, 1, length.out = grid_size)
grid_points <- expand.grid(X1 = x1_seq, X2 = x2_seq)
tree_surface <- mondrian_pred(tree_fit, as.matrix(grid_points))
tree_matrix <- matrix(tree_surface, nrow = grid_size, ncol = grid_size)
contour(x1_seq, x2_seq, tree_matrix,
main = "Fitted Function Surface by Mondrian Tree",
xlab = "X1", ylab = "X2", col = terrain.colors(10))
Now you can utilize the previous code to fit a forest of Mondrian
trees. Write a function that fits \(B =
100\) Mondrian trees, and make predictions by averaging the
predictions from all trees. Note that each tree should be trained on a
bootstrap sample of the training data. You may consider saving your
fitted Mondrian trees in a list() and accessing them as
forest[[b]]. Use the same training and testing data as in
Question 4, and report the MSE on the testing data. Experiment with
different values of \(\lambda\) and
report your findings (are they better than a single tree model?).
This is my fitted surface by the Mondrian forest with \(\lambda = 10\):
Answer:
set.seed(546)
# Function to fit a Mondrian forest
MondrianForestFit <- function(B, lambda, train_X, train_Y) {
forest <- vector("list", B)
for (b in 1:B) {
onetree <- myMondrianTree(lambda = lambda, p = ncol(train_X))
# Bootstrap sample
use_indices <- sample(1:nrow(train_X), nrow(train_X), replace = TRUE)
forest[[b]] <- mondrian_fit(onetree$tree, train_X[use_indices, ], train_Y[use_indices])
}
return(forest)
}
# Function to predict using Mondrian forest
MondrianForestPred <- function(forest, test_X) {
B <- length(forest)
# go over all trees and get predictions
all_predictions <- sapply(1:B, function(b) mondrian_pred(forest[[b]], test_X))
# average predictions
forest_predictions <- rowMeans(all_predictions, na.rm = TRUE)
return(forest_predictions)
}
# Fit Mondrian forest
B <- 100
mondrian_forest <- MondrianForestFit(B, lambda, train_X, train_Y)
# Make predictions on testing data
forest_predictions <- MondrianForestPred(mondrian_forest, test_X)
# Calculate Mean Squared Error
mse_forest <- mean((forest_predictions - test_Y)^2, na.rm = TRUE)
print(paste("Mean Squared Error of Mondrian Forest on Test Data:", round(mse_forest, 4)))
## [1] "Mean Squared Error of Mondrian Forest on Test Data: 1.0695"
# plot the fitted surface of the testing data
grid_size = 30
x1_seq <- seq(0, 1, length.out = grid_size)
x2_seq <- seq(0, 1, length.out = grid_size)
grid_points <- expand.grid(X1 = x1_seq, X2 = x2_seq)
forest_surface <- MondrianForestPred(mondrian_forest, as.matrix(grid_points))
forest_matrix <- matrix(forest_surface, nrow = grid_size, ncol = grid_size)
contour(x1_seq, x2_seq, forest_matrix,
main = "Fitted Function Surface by Mondrian Forest",
xlab = "X1", ylab = "X2", col = terrain.colors(10))
The prediction error is 1.0695, which is almost the same as the variance of the random noise (1). It is significantly better than the single tree model, whose test MSE was 1.2669.
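One possible way to run the \(\lambda\) comparison is a small grid search; the sketch below is illustrative and the grid of \(\lambda\) values is arbitrary.
# Illustrative sketch: compare forest test MSE across a few lambda values
lambda_grid <- c(2, 5, 10, 20)
mse_grid <- sapply(lambda_grid, function(lam) {
  forest_lam <- MondrianForestFit(B, lam, train_X, train_Y)
  mean((MondrianForestPred(forest_lam, test_X) - test_Y)^2, na.rm = TRUE)
})
cbind(lambda = lambda_grid, test_mse = mse_grid)
In general, larger \(\lambda\) produces finer partitions, which lowers bias but increases variance, so an intermediate value tends to work best.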
In the previous questions, the Mondrian tree is a non-adaptive model, since the partition does not depend on the data. In this question, we will implement the AdaBoost algorithm with decision stumps as the base learner, which is data adaptive. A stump model partitions the input space into two regions by a cutoff point \(c\), and assigns a prediction in \(\{-1, +1\}\) to each side. We work with simulated one-dimensional data:
set.seed(5)
n <- 200
x <- runif(n)
py <- function(x) sin(4 * pi * x) / 3 + 0.5
y <- (rbinom(n, 1, py(x)) - 0.5) * 2
plot(x, y + 0.1 * runif(n, -1, 1), ylim = c(-1.1, 1.1), pch = 19,
col = ifelse(y == 1, "darkorange", "deepskyblue"), ylab = "y")
lines(sort(x), py(x)[order(x)] - 0.5)
testx <- seq(0, 1, length.out = 1000)
testy <- (rbinom(1000, 1, py(testx)) - 0.5) * 2
Recall that a stump model works this way: it predicts a constant \(c_L \in \{-1, +1\}\) on the region \(\{x \le c\}\) and a constant \(c_R \in \{-1, +1\}\) on \(\{x > c\}\), where the cutoff \(c\) is chosen to make the two sides as pure as possible under the observation weights \(w\), and \(c_L\), \(c_R\) are the weighted majority labels on each side. You should write a function called myStump(x, y, w) that outputs the cutoff point and the left and right predictions. Note that you need to exhaust all possible cutoff points; however, given a set of observed values, only the observed unique \(x\) values matter. Once you finish the stump model algorithm, test your code in the following two scenarios: equal weights on all observations, and weights that up-weight the observations with \(x \ge 0.5\) (as in the answer below).
Answer:
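The stump below scores each candidate cutoff \(c\) by the (negative) weighted Gini impurity of the two resulting regions,
\[ \text{score}(c) = - \frac{ \sum_{i: x_i \le c} w_i \, p_L (1 - p_L) + \sum_{i: x_i > c} w_i \, p_R (1 - p_R) }{ \sum_{i} w_i }, \qquad p_L = \frac{ \sum_{i: x_i \le c} w_i \mathbf{1}\{y_i = 1\} }{ \sum_{i: x_i \le c} w_i }, \]
with \(p_R\) defined analogously on the right region. The cutoff maximizing this score is kept, and each region predicts its weighted majority class.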
# Weighted Gini impurity
gini <- function(y, w) {
p <- weighted.mean(y == 1, w)
p * (1 - p)
}
# Fit a weighted stump
myStump <- function(x, y, w) {
n <- length(x)
scores <- numeric(n - 1)
for (i in 1:(n - 1)) {
idx <- (x <= x[i])
scores[i] <- -(
sum(w[idx]) * gini(y[idx], w[idx]) +
sum(w[!idx]) * gini(y[!idx], w[!idx])
) / sum(w)
}
cutoff <- x[which.max(scores)]
left_p <- weighted.mean(y[x <= cutoff] == 1, w[x <= cutoff])
right_p <- weighted.mean(y[x > cutoff] == 1, w[x > cutoff])
list(
cutoff = cutoff,
left = ifelse(left_p > 0.5, 1, -1),
right = ifelse(right_p > 0.5, 1, -1)
)
}
# Test stump under two weighting schemes
w_equal <- rep(1 / n, n)
myStump(x, y, w_equal)
## $cutoff
## [1] 0.2486734
##
## $left
## [1] 1
##
## $right
## [1] -1
w_biased <- ifelse(x >= 0.5, 2, 0.1)
myStump(x, y, w_biased)
## $cutoff
## [1] 0.8036775
##
## $left
## [1] 1
##
## $right
## [1] -1
Now implement the AdaBoost algorithm with your stump model as the base learner, adding a shrinkage factor \(\delta\), which is commonly used in boosting algorithms. Note that the shrinkage factor essentially works on the \(\alpha\) to achieve a smaller step size. Try several shrinkage factors, plot the exponential loss and the training and testing errors against the number of trees, and comment on your findings. Please note that you may need to adjust the number of trees so that your algorithm works well.
Answer:
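For reference, the updates implemented below are the standard AdaBoost steps with the shrinkage \(\delta\) folded into the step size. At iteration \(k\), with stump \(f_k\),
\[ \varepsilon_k = \sum_i w_i \mathbf{1}\{f_k(x_i) \ne y_i\}, \qquad \alpha_k = \frac{\delta}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}, \qquad w_i \leftarrow \frac{ w_i \exp\big(-\alpha_k \, y_i f_k(x_i)\big) }{ \sum_j w_j \exp\big(-\alpha_k \, y_j f_k(x_j)\big) }, \]
and the final classifier is \(\text{sign}\big(\sum_k \alpha_k f_k(x)\big)\).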
# fit AdaBoost; the argument epsilon plays the role of the shrinkage factor delta
my_adaboost <- function(x, y, ntrees, epsilon = 0.5){
modellist = list()
w = rep(1/length(x), length(x))
for (k in 1:ntrees){
fk = myStump(x, y, w)
fit = ifelse(x <= fk$cutoff, fk$left, fk$right)
e = sum(w*(fit != y))
# cat(paste(" e is ", e, "\n") )
a = 0.5 * log((1-e)/e)
w = w * exp(-epsilon*a*y*fit)
w = w / sum(w)
# apply shrinkage parameter to alpha
fk['alpha'] = a * epsilon
modellist[[k]] = fk
}
return(modellist)
}
# predict by AdaBoost
my_adaboost_pred <- function(fit, K, testx, testy){
pred = rep(0, length(testx))
exp_loss = err = rep(NA, K)
for (k in 1: min(K, length(fit))){
pred_tree = ifelse(testx <= fit[[k]]$cutoff, fit[[k]]$left, fit[[k]]$right)
pred = pred + fit[[k]]$alpha * pred_tree
exp_loss[k] = mean( exp(- testy * pred) )
err[k] = mean(testy != sign(pred))
}
return(list("pred" = pred, "exp_loss" = exp_loss, "err" = err))
}
To fit the AdaBoost model and make predictions:
# fit model and prediction
K = c(50, 100, 200)
delta = c(1, 0.5, 0.1)
pred_test <- list()
pred_train <- list()
for(i in 1:3){
myfit = my_adaboost(x, y, K[i], delta[i])
pred_test[[i]] = my_adaboost_pred(myfit, K[i], testx, testy)
pred_train[[i]] = my_adaboost_pred(myfit, K[i], x, y)
}
par(mfrow = c(2, 3))
# plot the exponential loss and error
for(i in 1:3){
main_text = paste0("Exponential loss and errors, delta = ", delta[i])
plot(pred_train[[i]]$exp_loss, type = "l", ylim = c(0, 1),
main = main_text, ylab = "exponential loss", xlab = "ntrees")
lines(pred_train[[i]]$err, col = 'green')
lines(pred_test[[i]]$err, col = 'red')
legend("right", lty = 1, legend = c("exp loss", "train error", "test error"), col = c("black", "green", "red"))
}
# plot fitted model (last iteration)
for(i in 1:3){
main_text = paste0("Final model, delta = ", delta[i])
plot(x, y + 0.1*runif(n, -1, 1), ylim = c(-1.1, 1.1),
col = ifelse(y == 1, "darkorange", "deepskyblue"), ylab = "y",
main = main_text)
lines(sort(x), py(x)[order(x)] - 0.5)
lines(testx, pred_test[[i]]$pred, col = "red")
lines(testx, ifelse(pred_test[[i]]$pred > 0, 1, -1), col = "blue", lty = 2)
}
Comments on Number of Trees (not required)
Comments on Shrinkage Parameter