Convex Optimization Problems

A convex function \(f: \mathbb{R}^p \rightarrow \mathbb{R}\) satisfies, for any \(\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^p\) and \(\alpha \in [0, 1]\),

\[f\big(\alpha \mathbf{x}_1 + (1-\alpha) \mathbf{x}_2 \big) \leq \alpha f(\mathbf{x}_1) + (1-\alpha) f(\mathbf{x}_2 ).\] Oftentimes, we optimize the objective function over a convex set. There are many examples of convex functions. In the linear regression problem, treating the regression parameters \(\boldsymbol \beta\) as the argument of the objective function, the mean squared error loss \(\frac{1}{2n} \lVert \mathbf{y} - \mathbf{X}\boldsymbol \beta \rVert^2\) is a convex function. Minimizing or maximizing such functions is our goal, although many of the techniques we introduce here are also used to solve non-convex problems in practice.
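
As a quick numerical illustration, we can check the defining inequality for this loss at a few random points. This is only a sketch with simulated data; the names `Xc`, `yc`, and `fmse` are ours and not used elsewhere.

    # numerically check the convexity inequality of the MSE loss (illustration only)
    set.seed(1)
    Xc = matrix(rnorm(20), 10, 2)
    yc = rnorm(10)
    fmse <- function(b) sum((yc - Xc %*% b)^2) / (2 * length(yc))
    
    b1 = rnorm(2)
    b2 = rnorm(2)
    alpha = 0.3
    
    # TRUE for a convex function, at any pair of points
    fmse(alpha * b1 + (1 - alpha) * b2) <= alpha * fmse(b1) + (1 - alpha) * fmse(b2)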

Local and Global Minimizers

Not all problems are convex. For a convex problem, we can usually find a global minimizer, i.e., an \(x^\ast\) such that \(f(x^\ast) \leq f(x)\) for any \(x\) in the feasible set. This can be achieved using the descent algorithms to be introduced. For nonconvex problems, however, we may only hope for a local minimizer, which satisfies \(f(x^\ast) \leq f(x)\) for any \(x\) in a neighborhood of \(x^\ast\). The comparison of these two cases can be demonstrated in the following plots. A descent algorithm can still help us find a local minimizer, except in some very special cases, such as a saddle point. However, these are beyond the scope of this course.
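
As a minimal sketch of this difference (the one-dimensional function and starting values below are our own choices, not part of the data example), a descent-type optimizer started at different points of a nonconvex function can land on different minimizers:

    # a nonconvex function with one local and one global minimizer
    fx <- function(x) x^4 - 2 * x^2 + 0.2 * x
    
    # optim() converges to different minimizers from different starting values
    optim(par = 1, fn = fx, method = "BFGS")$par   # local minimizer, near 0.97
    optim(par = -1, fn = fx, method = "BFGS")$par  # global minimizer, near -1.02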

Gradient Descent

We use linear regression as an example. The objective function for linear regression is:

\[ \ell(\boldsymbol \beta) = \frac{1}{2n}||\mathbf{y} - \mathbf{X} \boldsymbol \beta ||^2, \] with the closed-form solution

\[\widehat{\boldsymbol \beta} = \left(\mathbf{X}^\text{T}\mathbf{X}\right)^{-1} \mathbf{X}^\text{T} \mathbf{y} \]

    library(MASS)
    set.seed(3)
    n = 200
    
    # create some data with linear model
    X = mvrnorm(n, c(0, 0), matrix(c(1,0.7, 0.7, 1), 2,2))
    y = rnorm(n, mean = 2*X[,1] + X[,2])
    
    beta1 <- seq(-1, 4, 0.005)
    beta2 <- seq(-1, 4, 0.005)
    allbeta <- data.matrix(expand.grid(beta1, beta2))
    rss <- matrix(apply(allbeta, 1, function(b, X, y) sum((y - X %*% b)^2), X, y), 
                  length(beta1), length(beta2))
    
    # quantile levels for drawing contour
    quanlvl = c(0.01, 0.025, 0.05, 0.2, 0.5, 0.75)
    
    # plot the contour
    contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
    box()
    
    # the truth
    b = solve(t(X) %*% X) %*% t(X) %*% y
    points(b[1], b[2], pch = 19, col = "blue", cex = 2)
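
As a sanity check (a minimal sketch), the closed-form solution above should agree with R's built-in least squares fit, with the intercept removed since our model does not include one:

    # compare the closed-form solution with lm(); "- 1" drops the intercept
    lm(y ~ X - 1)$coefficients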

We use an optimization approach to solve this problem. By taking the derivative with respect to \(\boldsymbol \beta\), we have

\[ \begin{align} \frac{\partial \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta} = -\frac{1}{n} \sum_{i=1}^n (y_i - x_i^\text{T} \boldsymbol \beta) x_i. \end{align} \]
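
It is good practice to verify an analytic gradient numerically before using it. Here is a minimal sketch based on a central finite-difference approximation; the helper `num_grad` is our own, written only for this check.

  # analytic gradient of the loss at b
  grad <- function(b) - t(X) %*% (y - X %*% b) / length(y)

  # central finite-difference approximation (for checking only)
  num_grad <- function(b, h = 1e-6)
    sapply(1:length(b), function(j) {
      e = rep(0, length(b)); e[j] = h
      (sum((y - X %*% (b + e))^2) - sum((y - X %*% (b - e))^2)) / (4 * h * length(y))
    })

  # the two columns should match closely
  b_test = c(0.5, -0.5)
  cbind(analytic = as.vector(grad(b_test)), numeric = num_grad(b_test))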

To perform the optimization, we will first set an initial beta value, say \(\boldsymbol \beta = \mathbf{0}\) for all entries, then proceed with the update

\[ \boldsymbol \beta^\text{new} = \boldsymbol \beta^\text{old} - \frac{\partial \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta} \times \delta,\]

where \(\delta\) is a step size. Let’s set \(\delta = 0.1\) for now. The following function performs gradient descent.

  # gradient descent function, which also records the path
  mylm_g <- function(x, y, b0 = rep(0, ncol(x)), delta = 0.1, epsilon = 1e-6, maxitr = 5000)
    {
      if (!is.matrix(x)) stop("x must be a matrix")
      # if (!is.vector(y)) stop("y must be a vector")
      if (nrow(x) != length(y)) stop("number of observations different")
      
      # initialize beta values
      b = list()
      b1 = b0
      
      # iterative update
      for (k in 1:maxitr)
      {
        b[[k]] = b1
        b0 = b1
        b1 = b0 + t(x) %*% (y - x %*% b0) * delta / length(y)      
        
        if (max(abs(b0 - b1)) < epsilon)
          break;
      }
  
      if (k == maxitr) cat("maximum iteration reached\n")
      return(list("allb" = b, "beta" = b1))
  }

  # fit the model 
  mybeta = mylm_g(X, y, b0 = c(0, 1))
  # mybeta

  allb = matrix(unlist(mybeta$allb), 2, length(mybeta$allb))
  
  contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
  points(allb[1,], allb[2,], type = "l", col = "red", pch = 2)
  points(b[1], b[2], pch = 19, col = "blue", cex = 2)
  box()

The descent path is very smooth because we chose a very small step size. However, if the step size is too large, we may observe an unstable path, or the algorithm may fail to converge altogether. Let's set \(\delta = 1\).

  # fit the model 
  mybeta = mylm_g(X, y, b0 = c(0, 1), delta = 1)
  # mybeta

  allb = matrix(unlist(mybeta$allb), 2, length(mybeta$allb))
  
  contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
  points(allb[1,], allb[2,], type = "l", col = "red", lwd = 2)
  points(b[1], b[2], pch = 19, col = "blue", cex = 2)
  box()

Coordinate Descent

For coordinate descent, we update one variable at a time. There are two styles of coordinate descent: the Gauss-Seidel style and the Jacobi style. The latter is more friendly for parallel computing. Here we use the Gauss-Seidel style. Note that the update for a single parameter \(\beta_j\) is

\[ \underset{\beta_j}{\text{argmin}} \,\, \frac{1}{2n} ||\mathbf{y} - X_j \beta_j - \mathbf{X}_{(-j)} \boldsymbol \beta_{(-j)} ||^2. \] We can first calculate the partial residual \(\mathbf{r} = \mathbf{y} - \mathbf{X}_{(-j)} \boldsymbol \beta_{(-j)}\), and then obtain the update by the one-dimensional regression of \(\mathbf{r}\) on \(X_j\): \[ \beta_j = \frac{X_j^\text{T} \mathbf{r}}{X_j^\text{T} X_j}. \] Coordinate descent usually does not involve choosing a step size. Please note that for solving the Lasso problem, a soft-thresholding function is additionally applied to this update at each iteration, as sketched below. The coordinate descent implementation that follows it is NOT efficient because there are a lot of wasted calculations; it is only for demonstration purposes.
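
For reference, the soft-thresholding operator mentioned above has a simple closed form. A minimal sketch (with `lambda` denoting the Lasso penalty level; the function name is ours):

  # soft-thresholding operator used in Lasso coordinate descent
  soft_threshold <- function(b, lambda)
    sign(b) * pmax(abs(b) - lambda, 0)

  # e.g., a raw update of 0.8 with penalty 0.3 shrinks to 0.5
  soft_threshold(0.8, 0.3)

The plain (non-penalized) coordinate descent is implemented below.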

  # coordinate descent function, which also records the path
  mylm_c <- function(x, y, b0 = rep(0, ncol(x)), epsilon = 1e-6, maxitr = 5000)
    {
      if (!is.matrix(x)) stop("x must be a matrix")
      # if (!is.vector(y)) stop("y must be a vector")
      if (nrow(x) != length(y)) stop("number of observations different")
      
      # initialize beta values
      b = list()
      b1 = b0
      
      # iterative update
      for (k in 1:maxitr)
      {
        b0 = b1
        
        for (j in 1:ncol(x))
        {
          b[[(k-1)*ncol(x)+j]] = b1
          r = y - x[, -j, drop = FALSE] %*% b1[-j]
          b1[j] = t(r) %*% x[,j] / (t(x[,j, drop = FALSE]) %*% x[,j])
        }
        
        if (max(abs(b0 - b1)) < epsilon)
          break;
      }
  
      if (k == maxitr) cat("maximum iteration reached\n")
      return(list("allb" = b, "beta" = b1))
  }

  # fit the model 
  mybeta = mylm_c(X, y, b0 = c(0, 3))
  # mybeta

  allb = matrix(unlist(mybeta$allb), 2, length(mybeta$allb))
  
  contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
  points(allb[1,], allb[2,], type = "l", col = "red", lwd = 2)
  points(b[1], b[2], pch = 19, col = "blue", cex = 2)
  box()

Stochastic Gradient Descent

For Stochastic Gradient Descent (SGD), we update the parameter based on a single subject. Hence, the gradient is defined as \[ \frac{\partial \ell_i(\boldsymbol \beta)}{\partial \boldsymbol \beta} = - (y_i - x_i^\text{T} \boldsymbol \beta) x_i. \] This can be much faster than gradient descent because we only calculate the gradient based on one observation at each iteration, instead of all \(n\) observations. A decay rate is involved in SGD because if the step size does not decrease to 0, the algorithm cannot converge; classical results require the step sizes \(\delta_k\) to satisfy \(\sum_k \delta_k = \infty\) and \(\sum_k \delta_k^2 < \infty\). One has to tune the decay properly, especially in a large-scale optimization problem.

  # stochastic gradient descent function, which also records the path
  mylm_sgd <- function(x, y, b0 = rep(0, ncol(x)), delta = 0.25, maxitr = 10)
    {
      if (!is.matrix(x)) stop("x must be a matrix")
      # if (!is.vector(y)) stop("y must be a vector")
      if (nrow(x) != length(y)) stop("number of observations different")
      
      # initialize beta values
      b = list()

      counter = 1 
      delta0 = delta
      
      # iterative update
      for (k in 1:maxitr)
      {
        for (i in sample(1:nrow(x)))
        {
          b[[counter]] = b0
          # update based on the gradient of a single subject
          b0 = b0 + (y[i] - sum(x[i, ] * b0)) * x[i, ] * delta
          counter = counter + 1
          
          # learning rate decay
          delta = delta * 0.95
        }
      }
      
      return(list("allb" = b, "beta" = b0))
  }

  # fit the model 
  mybeta = mylm_sgd(X, y, b0 = c(0, 1), maxitr = 3)
  
  # mybeta
  allb = matrix(unlist(mybeta$allb), 2, length(mybeta$allb))
  
  contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
  points(allb[1,], allb[2,], type = "l", col = "red", pch = 2)
  points(b[1], b[2], pch = 19, col = "blue", cex = 2)
  box()

Mini-batch Stochastic Gradient Descent

Instead of using just one observation, we could also consider splitting the data into several small “batches” and using one batch of observations to calculate the gradient at each iteration.

  # mini-batch stochastic gradient descent function, which also records the path
  mylm_sgd_mb <- function(x, y, b0 = rep(0, ncol(x)), delta = 0.5, maxitr = 20)
    {
      if (!is.matrix(x)) stop("x must be a matrix")
      # if (!is.vector(y)) stop("y must be a vector")
      if (nrow(x) != length(y)) stop("number of observations different")
      
      # randomly split the data into batches of (roughly) 10 observations each
      batch = sample(rep(1:floor(nrow(x)/10), length.out = nrow(x)))
    
      # initialize beta values
      b = list()

      counter = 1 
      delta0 = delta
      
      # iterative update
      for (k in 1:maxitr)
      {
        for (i in 1:max(batch)) # loop through batches
        {
          b[[counter]] = b0
          # update based on the gradient of one batch
          b0 = b0 + t(x[batch==i, ]) %*% (y[batch==i] - x[batch==i, ] %*% b0) * delta / sum(batch==i)
          counter = counter + 1
          
          # learning rate decay
          delta = delta * 0.95
        }
      }
      
      return(list("allb" = b, "beta" = b0))
  }

  # fit the model 
  mybeta = mylm_sgd_mb(X, y, b0 = c(0, 1), maxitr = 3)
  # mybeta

  allb = matrix(unlist(mybeta$allb), 2, length(mybeta$allb))
  
  contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
  points(allb[1,], allb[2,], type = "l", col = "red", pch = 2)
  points(b[1], b[2], pch = 19, col = "blue", cex = 2)
  box()

Lagrange Multipliers

Constrained optimization problems appear very frequently. Both Lasso and Ridge regression can be viewed as constrained problems, and the SVM, to be introduced later, is another example. Let's investigate this using a toy example. Suppose we have the optimization problem

\[\text{minimize} \quad f(x, y) = x^2 + y^2\] \[\text{subj. to} \quad g(x, y) = xy - 4 = 0\]

    x <- seq(-5, 5, 0.05)
    y <- seq(-5, 5, 0.05)
    mygrid <- data.matrix(expand.grid(x, y))
    f <- matrix(mygrid[,1]^2 + mygrid[,2]^2, length(x), length(y))

    f2 <- matrix(mygrid[,1]*mygrid[,2], length(x), length(y))
    
    # plot the contour
    contour(x, y, f, levels = c(0.2, 1, 2, 4, 8, 16))
    contour(x, y, f2, levels = 4, add = TRUE, col = "blue", lwd = 2)
    box()

    # the tangent line y = 4 - x at the point (2, 2)
    lines(seq(1, 3, 0.01), 4 - seq(1, 3, 0.01), type = "l", col = "orange", lwd = 2)    
    points(2, 2, col = "red", pch = 19, cex = 2)
    points(-2, -2, col = "red", pch = 19, cex = 2)

The problem itself is very simple: we know the optimizers are the two red dots. But an interesting point of view is to look at the level curves of the objective function. As the level grows (the curves expand), there is a point (a red dot) at which a level curve barely touches the constraint curve (blue line). This must be an optimizer. It also implies that the tangent line (orange line) of the level curve coincides with the tangent line of the constraint at that point. Since the gradient of a function is perpendicular to its level curves, this observation implies that the gradients of the two functions (the objective function and the constraint function) must be parallel, i.e., one is a multiple of the other. Hence,

\[ \begin{align} & \bigtriangledown f = \lambda \bigtriangledown g \\ \\ \Longrightarrow \qquad & \begin{cases} 2x = \lambda y & \text{by taking derivative w.r.t.} \,\, x\\ 2y = \lambda x & \text{by taking derivative w.r.t.} \,\, y\\ xy - 4 = 0 & \text{the constraint itself} \end{cases} \end{align} \] These three equations are easy to solve together. The first two give \(x = y = 0\) or \(\lambda = \pm 2\). The solution \(x = y = 0\) is infeasible because it violates the constraint. Taking \(\lambda = -2\) forces \(x = -y\), which contradicts \(xy = 4\); hence \(\lambda = 2\) and \(x = y\), leading to two feasible solutions: \(x = y = 2\) or \(x = y = -2\). Hence, we now know that there are two solutions.

Now, looking back at the equation \(\bigtriangledown f = \lambda \bigtriangledown g\), this is simply the condition that the derivative of the Lagrangian function vanishes, where the Lagrangian is defined as

\[{\cal L}(x, y, \lambda) = f(x, y) - \lambda g(x, y),\] so solving the constrained problem becomes finding the stationary points of the Lagrangian. Be aware that in some cases, the solutions you find can be maximizers instead of minimizers. Hence, it is necessary to compare all of them and see which one gives the smallest objective value.
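
We can verify the two candidates numerically. A minimal sketch (the function name `fobj` is ours):

    # evaluate the objective at the two candidate solutions
    fobj <- function(v) v[1]^2 + v[2]^2
    fobj(c(2, 2))    # 8
    fobj(c(-2, -2))  # 8, so both candidates are minimizers

    # gradient condition at (2, 2) with lambda = 2:
    # grad f = (2x, 2y) = (4, 4) and lambda * grad g = 2 * (y, x) = (4, 4)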