I am taking Convex Optimization this semester (2021 Spring). These notes are based on my understanding of the course materials provided by Prof. Zhouchen Lin. They only offer concise definitions and formulas, with few proofs and examples.

From this chapter on, we introduce convex optimization algorithms, including descent methods, Newton's method, conjugate direction methods, and majorization minimization. I used to regard gradient descent as merely the simple workhorse of neural networks, so I never paid much attention to it, but it turns out that the underlying theory can be rather rich. It feels like learning the history and construction behind today's neural networks; theory is fascinating in this way.

# Unconstrained Optimization


## Problems

$$\min_\mathbf{x} f(\mathbf{x})$$

where $f:\mathbb{R}^n\to\mathbb{R}$ is convex and twice continuously differentiable. We compute a sequence of points such that $f(\mathbf{x}^{(k)})\to p^*$, and we can check termination by $\|\nabla f(\mathbf{x}^{(k)})\|\le \epsilon$.

  • Fixed-point algorithm

    • Banach Fixed Point Theorem

      Let $(X,d)$ be a non-empty complete metric space with a contraction mapping $T:X\to X$:

      $$d(T(\mathbf{x}),T(\mathbf{y}))\le L\,d(\mathbf{x},\mathbf{y}),\quad 0\le L<1$$

      Then, starting from an arbitrary $\mathbf{x}_0$, the iteration $\mathbf{x}_{n}=T(\mathbf{x}_{n-1})$ converges to the fixed point: $\mathbf{x}_n\to\mathbf{x}^*$.

    Given the theorem, we know that if $\varphi$ satisfies

    $$\|\varphi(\mathbf{x})-\varphi(\mathbf{y})\|\le L\|\mathbf{x}-\mathbf{y}\|,\quad 0\le L<1$$

    then the fixed-point iteration converges to the fixed point of $\varphi$.

  • Fixed point and gradient descent

    If we take $\varphi(\mathbf{x})=\mathbf{x}-\alpha\nabla f(\mathbf{x})$, then the fixed point satisfies $\mathbf{x}=\mathbf{x}-\alpha\nabla f(\mathbf{x})$, i.e. $\nabla f(\mathbf{x}^*)=\mathbf{0}$. (A small sketch of this iteration appears at the end of this section.)

    1. $\mathbf{x}_{k+1}=\mathbf{x}_k-\alpha\nabla f(\mathbf{x}_k)$. This is the update rule of gradient descent.
    2. $\mathbf{x}_k=\mathbf{x}_{k+1}+\alpha\nabla f(\mathbf{x}_{k+1})$. This means $\mathbf{x}_{k+1}=\arg\min_{\mathbf{x}}\alpha f(\mathbf{x})+\frac{1}{2}\|\mathbf{x}-\mathbf{x}_k\|^2$, which is the proximal operator.
  • Strong convexity and implications

    Strong convexity of the objective function implies:

    $$m\mathbf{I}\preceq \nabla^2f(\mathbf{x})\preceq M\mathbf{I} \Rightarrow \operatorname{cond}(\nabla^2f(\mathbf{x}))\le \frac{M}{m}$$

    $$f(\mathbf{y}) \geq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{m}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2}$$

    The second inequality can be used to bound $f(\mathbf{x})-p^{*}$ in terms of $\|\nabla f(\mathbf{x})\|_{2}$. Its right-hand side is a convex quadratic function of $\mathbf{y}$ (for fixed $\mathbf{x}$). Setting the gradient with respect to $\mathbf{y}$ equal to zero, we find that $\tilde{\mathbf{y}}=\mathbf{x}-(1/m)\nabla f(\mathbf{x})$ minimizes the right-hand side. Therefore we have

    $$\begin{aligned} f(\mathbf{y}) & \geq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{m}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2} \\ & \geq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\tilde{\mathbf{y}}-\mathbf{x})+\frac{m}{2}\|\tilde{\mathbf{y}}-\mathbf{x}\|_{2}^{2} \\ &=f(\mathbf{x})-\frac{1}{2 m}\|\nabla f(\mathbf{x})\|_{2}^{2} \end{aligned}$$

    Since this holds for any $\mathbf{y} \in S$, we have

    $$p^{*} \geq f(\mathbf{x})-\frac{1}{2 m}\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    This inequality shows how small $\|\nabla f(\mathbf{x})\|_2$ must be for the point to be nearly optimal:

    $$\|\nabla f(\mathbf{x})\|_2\le (2m\epsilon)^{1/2}\Rightarrow f(\mathbf{x})-p^*\le \epsilon$$

    We can also bound the distance between $\mathbf{x}$ and the optimal point $\mathbf{x}^*$:

    $$\|\mathbf{x}-\mathbf{x}^*\|_2\le \frac{2}{m}\|\nabla f(\mathbf{x})\|_2$$

    Similarly, we have the descent lemma and an upper bound on $p^*$:

    $$f(\mathbf{y}) \leq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{M}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2},\qquad p^{*} \leq f(\mathbf{x})-\frac{1}{2 M}\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    • $M$-smoothness

      An upper bound on $\nabla^2 f(\mathbf{x})$ is equivalent to the definition of smoothness:

      $$\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2\le M\|\mathbf{x}-\mathbf{y}\|_2 \Leftrightarrow \nabla^2f(\mathbf{x})\preceq M\mathbf{I}$$
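To make the fixed-point view above concrete, here is a minimal sketch (my own illustration, not from the course materials) of gradient descent as a fixed-point iteration with the gradient-norm termination test, applied to a strongly convex quadratic:

```python
import numpy as np

def gradient_descent_fixed_point(grad, x0, alpha, eps=1e-8, max_iter=10000):
    """Iterate the fixed-point map phi(x) = x - alpha * grad(x) until ||grad(x)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:   # termination test ||grad f(x)|| <= eps
            break
        x = x - alpha * g              # x <- phi(x)
    return x

# Example: f(x) = 1/2 x^T Q x - b^T x, so grad f(x) = Q x - b and x* solves Q x = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
alpha = 1.0 / np.linalg.eigvalsh(Q).max()   # alpha <= 1/M makes phi a contraction here
x_hat = gradient_descent_fixed_point(lambda x: Q @ x - b, np.zeros(2), alpha)
print(x_hat, np.linalg.solve(Q, b))         # the two should agree
```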

## Descent methods

$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)}+t^{(k)}\Delta\mathbf{x}^{(k)}$$

where $t^{(k)}>0$ is the step size and $\Delta\mathbf{x}^{(k)}$ is the search direction.

We hope that $f(\mathbf{x}^{(k+1)})<f(\mathbf{x}^{(k)})$ unless $\mathbf{x}^{(k)}$ is optimal. By convexity, $\nabla f(\mathbf{x}^{(k)})^\top (\mathbf{y}-\mathbf{x}^{(k)})\ge 0$ implies $f(\mathbf{y})\ge f(\mathbf{x}^{(k)})$, so the search direction must satisfy

$$\langle\nabla f(\mathbf{x}^{(k)}), \Delta\mathbf{x}^{(k)}\rangle=\nabla f(\mathbf{x}^{(k)})^\top \Delta\mathbf{x}^{(k)} < 0$$

The stopping criterion is $\|\nabla f(\mathbf{x})\|_2\le \eta$.

### Step size

  • Exact line search

    $$t = {\arg\min}_{s\ge 0} f(\mathbf{x}+s\Delta\mathbf{x})$$

  • Backtracking line search (inexact)

    Algorithm:

    1. Given a search direction $\Delta \mathbf{x}$ and parameters $\alpha\in(0,1),\beta\in(0,1)$, set $t=1$.

    2. While $f(\mathbf{x}+t\Delta\mathbf{x}) > f(\mathbf{x})+\alpha t\nabla f(\mathbf{x})^\top \Delta \mathbf{x}$, reduce the step size: $t:=\beta t$.

    *(Figure: illustration of backtracking line search.)*

From the illustration we can see that the sufficient-decrease condition holds for $t\in[0, t_0]$. The backtracking search therefore stops either with $t=1$ or with $t\in(\beta t_0, t_0]$, i.e. $t\ge\min\{1,\beta t_0\}$.
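A minimal sketch of this backtracking procedure, assuming `f` and `grad_f` are callables and `dx` is a descent direction (the parameter defaults are my own):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until f(x + t*dx) <= f(x) + alpha * t * grad_f(x)^T dx."""
    t = 1.0
    fx = f(x)
    slope = grad_f(x) @ dx            # <grad f(x), dx>, negative for a descent direction
    while f(x + t * dx) > fx + alpha * t * slope:
        t *= beta                     # reduce the step size geometrically
    return t
```

For gradient descent one would call it with `dx = -grad_f(x)`.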

### Search direction

#### Gradient Descent

$$\Delta\mathbf{x}=-\nabla f(\mathbf{x})$$

  • Convergence analysis: Exact line search

    We use the lighter notation $\mathbf{x}^{+}=\mathbf{x}+t \Delta \mathbf{x}$ for $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)}+t^{(k)}\Delta \mathbf{x}^{(k)}$.

    Assume that $f$ is strongly convex on $S$; then there are positive constants $m$ and $M$ such that $m \mathbf{I} \preceq \nabla^{2} f(\mathbf{x}) \preceq M \mathbf{I},\ \forall \mathbf{x} \in S$.

    Define $\tilde{f}: \mathbb{R} \rightarrow \mathbb{R}$ by $\tilde{f}(t)=f(\mathbf{x}-t \nabla f(\mathbf{x}))$. Taking $\mathbf{y}=\mathbf{x}-t \nabla f(\mathbf{x})$ in the descent lemma, we obtain a quadratic upper bound on $\tilde{f}$:

    $$\tilde{f}(t) \leq f(\mathbf{x})-t\|\nabla f(\mathbf{x})\|_{2}^{2}+\frac{M t^{2}}{2}\|\nabla f(\mathbf{x})\|_{2}^{2} = f(\mathbf{x})+\left(\frac{M}{2}t^2-t\right)\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    Minimizing over $t$ (exact line search on the left, $t=1/M$ on the right-hand side), we have

    $$f(\mathbf{x}^+) \le f(\mathbf{x})-\frac{1}{2M}\|\nabla f(\mathbf{x})\|_{2}^{2},\qquad f(\mathbf{x}^+)-p^* \le f(\mathbf{x})-p^*-\frac{1}{2M}\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    We know that $\|\nabla f(\mathbf{x})\|_{2}^{2}\ge 2m(f(\mathbf{x})-p^*)$, so

    $$f\left(\mathbf{x}^{+}\right)-p^{*} \leq(1-m/M)\left(f(\mathbf{x})-p^{*}\right)$$

    Applying this recursively for $k$ steps, we have

    $$f\left(\mathbf{x}^{(k)}\right)-p^{*} \leq c^{k}\left(f\left(\mathbf{x}^{(0)}\right)-p^{*}\right), \text{ where } c=1-m/M<1.$$

    Thus, the method converges linearly (a numerical sanity check is sketched at the end of this subsection).

    In particular, we must have $f(\mathbf{x}^{(k)})-p^{*} \leq \epsilon$ after at most

    $$\frac{\log \left(\left(f\left(\mathbf{x}^{(0)}\right)-p^{*}\right) / \epsilon\right)}{\log (1 / c)}$$

    iterations of the gradient method with exact line search.

    • Interpretation:

      The numerator is the log of the ratio between the initial gap and the final tolerance $\epsilon$.

      The denominator, $\log(1/c)$, is determined by $M/m$, the bound on the condition number of the Hessian near $\mathbf{x}^*$. When $M/m$ is large, we have

      $$\log (1/c) = -\log (1-m/M)\approx m/M$$

      Hence, the number of iterations needed grows nearly linearly with $M/m$.

  • Convergence analysis: backtracking line search

    We already know that

    $$\tilde{f}(t) \leq f(\mathbf{x})+\left(\frac{M}{2}t^2-t\right)\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    So the exit condition $\tilde{f}(t)\le f(\mathbf{x})-\alpha t\|\nabla f(\mathbf{x})\|_2^2$, $0<\alpha<0.5$, is always satisfied when $0\le t\le 1/M$, since on that interval $\frac{M}{2}t^2-t\le -t/2\le -\alpha t$.

    Hence, the backtracking search terminates either with $t=1$ or with $t\ge \beta/M$, which gives

    $${f}(\mathbf{x}^+) \leq f(\mathbf{x})-\min\left\{\alpha,\frac{\alpha\beta}{M}\right\} \|\nabla f(\mathbf{x})\|_{2}^{2}$$

    Given $\|\nabla f(\mathbf{x})\|_{2}^{2}\ge 2m(f(\mathbf{x})-p^*)$, subtracting $p^*$ from both sides gives

    $${f}(\mathbf{x}^+)-p^* \leq \left(1-\min\left\{2m\alpha,\frac{2m\alpha\beta}{M}\right\}\right)\left(f(\mathbf{x})-p^*\right)$$

    Taking $c=1-\min\{2m\alpha,\frac{2m\alpha\beta}{M}\}<1$, we have

    $$f\left(\mathbf{x}^{(k)}\right)-p^{*} \leq c^{k}\left(f\left(\mathbf{x}^{(0)}\right)-p^{*}\right)$$
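As a sanity check on the linear rate above, this sketch runs gradient descent with exact line search on a strongly convex quadratic (where the exact step has a closed form) and compares the observed contraction of $f(\mathbf{x}^{(k)})-p^*$ against the bound $c=1-m/M$; the test matrix and starting point are my own choices:

```python
import numpy as np

# f(x) = 1/2 x^T Q x - b^T x, strongly convex with m = lambda_min(Q), M = lambda_max(Q)
Q = np.diag([1.0, 10.0])          # condition number M/m = 10
b = np.array([1.0, 1.0])
p_star = -0.5 * b @ np.linalg.solve(Q, b)

x = np.array([5.0, 5.0])
gaps = []
for k in range(20):
    g = Q @ x - b
    t = (g @ g) / (g @ Q @ g)     # exact line search step for a quadratic
    x = x - t * g
    gaps.append(0.5 * x @ Q @ x - b @ x - p_star)

m, M = np.linalg.eigvalsh(Q).min(), np.linalg.eigvalsh(Q).max()
ratios = [gaps[k + 1] / gaps[k] for k in range(10)]
print(max(ratios), 1 - m / M)     # observed contraction factor vs. the bound c = 1 - m/M
```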

#### Steepest descent method

The first-order Taylor approximation of $f(\mathbf{x}+\mathbf{v})$ is

$$f(\mathbf{x}+\mathbf{v}) \approx f(\mathbf{x})+\nabla f(\mathbf{x})^\top \mathbf{v}$$

The second term can be interpreted as the directional derivative of $f$ at $\mathbf{x}$ in the direction $\mathbf{v}$. Our goal is to choose a step direction $\mathbf{v}$ that makes this derivative as negative as possible.

We define a normalized steepest descent direction as

$$\Delta\mathbf{x}_{nsd}={\arg\min}_\mathbf{v}\{\nabla f(\mathbf{x})^\top \mathbf{v}\mid \|\mathbf{v}\|=1 \}$$

We can also define an unnormalized steepest descent direction via the dual norm:

$$\Delta\mathbf{x}_{sd}=\|\nabla f(\mathbf{x})\|_*\,\Delta\mathbf{x}_{nsd}$$

This search direction satisfies

$$\nabla f(\mathbf{x})^\top \Delta\mathbf{x}_{sd}=\|\nabla f(\mathbf{x})\|_*\,\nabla f(\mathbf{x})^\top \Delta\mathbf{x}_{nsd} = -\|\nabla f(\mathbf{x})\|_*^2$$

  • Euclidean norm case

    When $\|\cdot\|$ is the Euclidean norm, $\Delta\mathbf{x}_{nsd} = -\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|_2}$ and $\Delta\mathbf{x}_{sd} = -\nabla f(\mathbf{x})$, which is exactly the gradient descent method.

  • Quadratic norm case

    $$\|\mathbf{z}\|_\mathbf{P}=(\mathbf{z}^\top \mathbf{P}\mathbf{z})^{1/2}=\|\mathbf{P}^{1/2}\mathbf{z}\|_2,\quad\mathbf{P}\in\mathbb{S}_{++}^n,\qquad \Delta\mathbf{x}_{nsd}=-\frac{\mathbf{P}^{-1}\nabla f(\mathbf{x})}{\left(\nabla f(\mathbf{x})^\top\mathbf{P}^{-1}\nabla f(\mathbf{x}) \right)^{1/2}}$$

    The dual norm is $\|\mathbf{z}\|_*=\|\mathbf{P}^{-1/2}\mathbf{z}\|_2$, so the steepest descent step is

    $$\Delta\mathbf{x}_{sd}=-\mathbf{P}^{-1}\nabla f(\mathbf{x})$$

  • Coordinate change

    $\bar{\mathbf{u}} = \mathbf{P}^{1/2}\mathbf{u}$ defines the coordinate change from the quadratic norm to the Euclidean norm: $\|\mathbf{u}\|_\mathbf{P} = \|\bar{\mathbf{u}}\|_2$.

    $$\bar{f}(\bar{\mathbf{u}}) = f(\mathbf{P}^{-1/2}\bar{\mathbf{u}}) = f(\mathbf{u})$$

    Applying the gradient method to $\bar{f}$:

    $$\Delta\bar{\mathbf{x}} = -\nabla \bar{f}(\bar{\mathbf{x}})=-\mathbf{P}^{-1/2}\nabla f(\mathbf{P}^{-1/2}\bar{\mathbf{x}}) = -\mathbf{P}^{-1/2}\nabla f({\mathbf{x}})$$

    Undoing the coordinate change, the steepest descent direction for the quadratic norm $\|\cdot\|_\mathbf{P}$ should be

    $$\Delta{\mathbf{x}} = \mathbf{P}^{-1/2}\Delta\bar{\mathbf{x}} = -\mathbf{P}^{-1}\nabla f({\mathbf{x}})$$

  • $\ell_1$-norm

    Let $i$ be an index for which $|(\nabla f(\mathbf{x}))_i|=\|\nabla f(\mathbf{x})\|_\infty$. Then

    $$\Delta\mathbf{x}_{nsd}=-\operatorname{sign}\left(\frac{\partial f(\mathbf{x})}{\partial x_i}\right)\mathbf{e}_i,\qquad \Delta\mathbf{x}_{sd}=\Delta\mathbf{x}_{nsd}\|\nabla f(\mathbf{x})\|_\infty=-\left(\frac{\partial f(\mathbf{x})}{\partial x_i}\right)\mathbf{e}_i$$

    Hence, the steepest descent direction for the $\ell_1$-norm is always along the coordinate axis in which the decrease in $f$ is greatest, i.e. a coordinate descent step. (A small sketch computing these norm-dependent directions appears at the end of this section.)

  • Choice of norm

    If we already know an approximation $\hat{\mathbf{H}}$ of the Hessian at the optimal point, then we can choose $\mathbf{P} = \hat{\mathbf{H}}$, so that the Hessian of $\bar{f}$ becomes

    $$\nabla^2\bar{f}(\bar{\mathbf{x}}) = \mathbf{P}^{-1/2}\nabla^2 f(\mathbf{P}^{-1/2}\bar{\mathbf{x}})\mathbf{P}^{-1/2}=\hat{\mathbf{H}}^{-1/2}\nabla^2 f(\mathbf{x})\hat{\mathbf{H}}^{-1/2}\approx \mathbf{I}$$

    Thus, the condition number of the transformed problem is likely to be low.

    Geometrically, the ellipsoid $\{\mathbf{x}\mid\|\mathbf{x}\|_\mathbf{P}\le 1 \}$ should approximate the shape of a sublevel set $\{\mathbf{x}\mid f(\mathbf{x})\le L \}$.

    In the following figure, the left ellipsoid is similar to the sublevel set, while the right one is not.

    *(Figure: two ellipsoids compared with the same sublevel set.)*

    If the condition number of a set is small, it means that the set has approximately the same width in all directions, i.e., it is nearly spherical.

    Clearly, the left choice makes the transformed sublevel set closer to a sphere.
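To make the norm-dependent formulas above concrete, here is a minimal sketch (helper names are my own) that computes the unnormalized steepest descent direction $\Delta\mathbf{x}_{sd}$ for the Euclidean, quadratic, and $\ell_1$ norms:

```python
import numpy as np

def sd_euclidean(grad):
    # Euclidean norm: Delta x_sd = -grad f(x)
    return -grad

def sd_quadratic(grad, P):
    # Quadratic norm ||.||_P with P symmetric positive definite: Delta x_sd = -P^{-1} grad f(x)
    return -np.linalg.solve(P, grad)

def sd_l1(grad):
    # l1 norm: move along the coordinate with the largest |partial derivative|
    i = np.argmax(np.abs(grad))
    d = np.zeros_like(grad)
    d[i] = -grad[i]                 # Delta x_sd = -(df/dx_i) e_i
    return d

g = np.array([0.5, -2.0, 1.0])
P = np.diag([1.0, 4.0, 9.0])
print(sd_euclidean(g), sd_quadratic(g, P), sd_l1(g))
```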

## Newton’s method

Motivation. Steepest descent only uses first derivatives; Newton’s method also exploits second derivatives.

### Method

Construct a quadratic approximation of the objective function that matches the first and second derivatives at the current point, then minimize this approximation:

$$f(\mathbf{x}) \approx f\left(\mathbf{x}^{(k)}\right)+\mathbf{g}^{(k)\top}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)^{\top} \mathbf{F}\left(\mathbf{x}^{(k)}\right)\left(\mathbf{x}-\mathbf{x}^{(k)}\right) \triangleq q(\mathbf{x})$$

Here, $\mathbf{g}^{(k)}=\nabla f\left(\mathbf{x}^{(k)}\right)$ and $\mathbf{F}\left(\mathbf{x}^{(k)}\right)=\nabla^2 f\left(\mathbf{x}^{(k)}\right)$ is the Hessian.


Applying the first-order necessary condition (FONC) to $q$ yields

$$\mathbf{0}=\nabla q(\mathbf{x})=\mathbf{g}^{(k)}+\mathbf{F}\left(\mathbf{x}^{(k)}\right)\left(\mathbf{x}-\mathbf{x}^{(k)}\right)$$

If $\mathbf{F}\left(\mathbf{x}^{(k)}\right) \succ 0$, then $q$ achieves its minimum at

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}$$

The above is the recursive update rule of Newton’s method.
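A minimal sketch of this pure Newton iteration; the gradient and Hessian callables are supplied by the caller (the names are my own):

```python
import numpy as np

def newton_method(grad, hess, x0, eps=1e-10, max_iter=100):
    """Pure Newton iteration x <- x - F(x)^{-1} g(x), stopped on the gradient norm."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - np.linalg.solve(hess(x), g)   # solve F(x) d = g rather than inverting F(x)
    return x
```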

### Analysis

Cons:

  1. If $\mathbf{F}\left(\mathbf{x}^{(k)}\right)$ is not positive definite, the algorithm might not head in a direction of decreasing objective value.
  2. Even if it is positive definite, it may happen that $f\left(\mathbf{x}^{(k)}\right)\le f\left(\mathbf{x}^{(k+1)}\right)$, e.g. when the starting point is far away from $\mathbf{x}^*$.

Pros:

  1. When the starting point is near $\mathbf{x}^*$, Newton’s method converges to $\mathbf{x}^*$ with order of convergence at least 2.

    Note that we only need to assume $f\in C^3$, $\nabla f(\mathbf{x}^*)=\mathbf{0}$, and that $\mathbf{F}(\mathbf{x}^*)$ is invertible. So Newton’s method does not necessarily converge to a local minimum.

  2. The descent property ($f\left(\mathbf{x}^{(k)}\right)\ge f\left(\mathbf{x}^{(k+1)}\right)$) holds with a small modification to the algorithm (the damped Newton’s method).

    If $\mathbf{F}\left(\mathbf{x}^{(k)}\right) \succ 0$ and $\mathbf{g}^{(k)}=\nabla f\left(\mathbf{x}^{(k)}\right)\ne \mathbf{0}$, then the direction

    $$\mathbf{d}^{(k)}=-\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}=\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}$$

    is a descent direction, in the sense that $\exists\hat{\alpha}>0$ such that $\forall\alpha\in(0,\hat{\alpha})$,

    $$f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})<f(\mathbf{x}^{(k)})$$

    Proof. Let

    $$\phi(\alpha) =f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})$$

    We need to show that $\exists\hat{\alpha}>0$ such that $\forall\alpha\in(0,\hat{\alpha})$, $\phi(\alpha)<\phi(0)$. Given $\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \succ 0$,

    $$\phi'(\alpha) =\nabla f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})^\top \mathbf{d}^{(k)},\qquad \phi'(0) =\nabla f(\mathbf{x}^{(k)})^\top \mathbf{d}^{(k)} =-\mathbf{g}^{(k)\top}\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}<0$$

    Hence, since $\phi'(0)<0$, there must exist an $\hat{\alpha}>0$ such that $\phi(\alpha)<\phi(0)$ for all $\alpha\in(0,\hat{\alpha})$.

### Damped Newton’s method

Modify the update rule with a step size $\alpha_k$ to ensure the descent property:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\alpha_k\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)},\qquad \alpha_k = \arg\min_{\alpha\ge 0} f\left(\mathbf{x}^{(k)}-\alpha\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}\right)$$

Drawback: computing $\mathbf{F}\left(\mathbf{x}^{(k)}\right)$ and solving $\mathbf{F}\left(\mathbf{x}^{(k)}\right)\mathbf{d}^{(k)}=-\mathbf{g}^{(k)}$ are computationally expensive.

### Levenberg-Marquardt modification

To ensure the descent property even when the Hessian is not positive definite, modify the update rule as:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\alpha_k\left(\mathbf{F}\left(\mathbf{x}^{(k)}\right)+\mu_k\mathbf{I}\right)^{-1}\mathbf{g}^{(k)},\quad\mu_k\ge 0$$

The algorithm is actually locally minimizing

$$f(\mathbf{x})+\frac{\mu_k}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|^2$$

When $\mu_k\to 0$, it approaches the pure Newton’s method; when $\mu_k\to\infty$, it approaches gradient descent with a small step size. In practice, start from a small $\mu_k$ and increase it until the descent property is satisfied.
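A minimal sketch of one Levenberg-Marquardt-modified step following this recipe, with $\alpha_k$ fixed to 1 for brevity and $\mu$ increased until the step actually decreases $f$ (the initial value and growth factor are my own choices):

```python
import numpy as np

def lm_newton_step(f, grad, hess, x, mu=1e-4, grow=10.0, max_tries=20):
    """Return x+ = x - (F(x) + mu I)^{-1} g(x), increasing mu until f decreases."""
    g, F = grad(x), hess(x)
    fx = f(x)
    for _ in range(max_tries):
        x_new = x - np.linalg.solve(F + mu * np.eye(len(x)), g)
        if f(x_new) < fx:           # descent property satisfied
            return x_new, mu
        mu *= grow                  # otherwise make the step more gradient-like
    return x, mu
```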

### Case: Nonlinear least-squares

$$\min_\mathbf{x}\sum_{i=1}^m \left(r_i(\mathbf{x}) \right)^2$$

Here, the $r_i$ are given functions. Define $\mathbf{r} = [r_1,\cdots,r_m]^\top$; then the objective function is $f(\mathbf{x}) = \mathbf{r}(\mathbf{x})^\top\mathbf{r}(\mathbf{x})$. Denote the Jacobian matrix of $\mathbf{r}$ by

$$\mathbf{J}(\mathbf{x})=\left[\begin{array}{ccc} \frac{\partial r_{1}}{\partial x_{1}}(\mathbf{x}) & \cdots & \frac{\partial r_{1}}{\partial x_{n}}(\mathbf{x}) \\ \vdots & & \vdots \\ \frac{\partial r_{m}}{\partial x_{1}}(\mathbf{x}) & \cdots & \frac{\partial r_{m}}{\partial x_{n}}(\mathbf{x}) \end{array}\right]$$

Then we can compute the gradient and Hessian of $f$:

$$(\nabla f(\mathbf{x}))_{j} =\frac{\partial f}{\partial x_{j}}(\mathbf{x}) =2 \sum_{i=1}^{m} r_{i}(\mathbf{x}) \frac{\partial r_{i}}{\partial x_{j}}(\mathbf{x}),\qquad \nabla f(\mathbf{x}) = 2\mathbf{J}(\mathbf{x})^\top \mathbf{r}(\mathbf{x})$$

$$\begin{aligned} \frac{\partial^{2} f}{\partial x_{k} \partial x_{j}}(\mathbf{x}) &=\frac{\partial}{\partial x_{k}}\left(2 \sum_{i=1}^{m} r_{i}(\mathbf{x}) \frac{\partial r_{i}}{\partial x_{j}}(\mathbf{x})\right) \\ &=2 \sum_{i=1}^{m}\left(\frac{\partial r_{i}}{\partial x_{k}}(\mathbf{x}) \frac{\partial r_{i}}{\partial x_{j}}(\mathbf{x})+r_{i}(\mathbf{x}) \frac{\partial^{2} r_{i}}{\partial x_{k} \partial x_{j}}(\mathbf{x})\right) \end{aligned}$$

Letting $\mathbf{S}(\mathbf{x})$ be the matrix whose $(k, j)$-th component is

$$\sum_{i=1}^{m} r_{i}(\mathbf{x}) \frac{\partial^{2} r_{i}}{\partial x_{k} \partial x_{j}}(\mathbf{x})$$

We write the Hessian matrix as

$$\mathbf{F}(\mathbf{x})=2\left(\mathbf{J}(\mathbf{x})^{T} \mathbf{J}(\mathbf{x})+\mathbf{S}(\mathbf{x})\right)$$

Therefore, Newton’s method applied to the nonlinear least-squares problem is given by

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\left(\mathbf{J}(\mathbf{x})^{T} \mathbf{J}(\mathbf{x})+\mathbf{S}(\mathbf{x})\right)^{-1} \mathbf{J}(\mathbf{x})^{T} \mathbf{r}(\mathbf{x})$$

When we ignore the second-derivative term $\mathbf{S}$, we obtain the Gauss-Newton method:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\left(\mathbf{J}(\mathbf{x})^{T} \mathbf{J}(\mathbf{x})\right)^{-1} \mathbf{J}(\mathbf{x})^{T} \mathbf{r}(\mathbf{x})$$
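A minimal Gauss-Newton sketch in the notation above; the residual vector $\mathbf{r}$ and its Jacobian $\mathbf{J}$ are supplied by the caller, and the exponential-fitting example is made up for illustration:

```python
import numpy as np

def gauss_newton(r, J, x0, n_iter=20):
    """Gauss-Newton: x <- x - (J^T J)^{-1} J^T r, ignoring the S(x) term."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        Jx, rx = J(x), r(x)
        x = x - np.linalg.solve(Jx.T @ Jx, Jx.T @ rx)
    return x

# Made-up example: fit y ~ a * exp(b * t) to data, with parameters x = (a, b).
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)
r = lambda x: x[0] * np.exp(x[1] * t) - y
J = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(r, J, x0=[1.0, -1.0]))   # should approach (2.0, -1.5)
```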

## Conjugate direction methods

Motivation. Intermediate between steepest descent and Newton’s method:

  1. Solves quadratics of $n$ variables in $n$ steps.
  2. Requires no Hessian.
  3. Requires no matrix inversion and no storage of an $n\times n$ matrix.

$\mathbf{Q}$-conjugacy: let $\mathbf{Q}$ be a real symmetric $n\times n$ matrix. Directions $\mathbf{d}^{(0)},\mathbf{d}^{(1)},\cdots,\mathbf{d}^{(m)}$ are $\mathbf{Q}$-conjugate if $\forall i\ne j$, $\mathbf{d}^{(i)\top} \mathbf{Q} \mathbf{d}^{(j)}=0$.

Lemma. If $\mathbf{Q}\succ\mathbf{0}$ and the directions $\mathbf{d}^{(0)},\mathbf{d}^{(1)},\cdots,\mathbf{d}^{(k)}$, $k\le n-1$, are nonzero and $\mathbf{Q}$-conjugate, then they are linearly independent.

### 🌟Quadratic problems

$$f(\mathbf{x})=\frac{1}{2} \mathbf{x}^{T} \mathbf{Q} \mathbf{x}-\mathbf{x}^{T} \mathbf{b},\quad \mathbf{Q}=\mathbf{Q}^\top\succ\mathbf{0}$$

Motivation. Given $n$ nonzero $\mathbf{Q}$-conjugate directions (linearly independent by the lemma), we can write

$$\mathbf{x}^*-\mathbf{x}^{(0)}=\alpha_0\mathbf{d}^{(0)}+\cdots +\alpha_{n-1}\mathbf{d}^{(n-1)}$$

Hence, we need to find the $n$ directions and their corresponding ratios $\alpha_k$.

#### Ratio

Given the direction $\mathbf{d}^{(k)}$, we can compute its ratio $\alpha_k$. Multiplying the above equation by $\mathbf{d}^{(k)\top}\mathbf{Q}$, we have

$$\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^*-\mathbf{x}^{(0)})=\alpha_k\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}\ \Rightarrow\ \alpha_k=\frac{\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^*-\mathbf{x}^{(0)})}{\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}}$$

Since

$$\mathbf{x}^{(k)}-\mathbf{x}^{(0)}=\alpha_0\mathbf{d}^{(0)}+\cdots +\alpha_{k-1}\mathbf{d}^{(k-1)},\qquad \mathbf{g}^{(k)} = \mathbf{Q}\mathbf{x}^{(k)}-\mathbf{b},\quad \mathbf{Q}\mathbf{x}^*=\mathbf{b}$$

🌟We have (noting that $\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^{(k)}-\mathbf{x}^{(0)})=0$ by $\mathbf{Q}$-conjugacy)

$$\alpha_k=\frac{\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^*-\mathbf{x}^{(k)})}{\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}}=-\frac{\mathbf{d}^{(k)\top}\mathbf{g}^{(k)}}{\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}}$$

#### Direction

Lemma. By mathematical induction, we have

$$\begin{aligned} \left\langle\mathbf{g}^{(k+1)}, \mathbf{d}^{(i)}\right\rangle=&\left\langle\mathbf{Q} \mathbf{x}^{(k+1)}-\mathbf{b}, \mathbf{d}^{(i)}\right\rangle\\=&\left\langle\mathbf{Q}\left(\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}\right)-\mathbf{b}, \mathbf{d}^{(i)}\right\rangle \\ =&\left\langle\mathbf{g}^{(k)}+\alpha_{k} \mathbf{Q} \mathbf{d}^{(k)}, \mathbf{d}^{(i)}\right\rangle=0\\ \end{aligned}\\ \Rightarrow \mathbf{g}^{(k+1)\top} \mathbf{d}^{(i)}=0,\quad 0\le k\le n-1,\ 0\le i\le k$$

🌟Directions.

$$\left\langle\mathbf{g}^{(k+1)}, \mathbf{g}^{(i)}\right\rangle=\left\langle\mathbf{g}^{(k+1)}, \beta_{i-1} \mathbf{d}^{(i-1)}-\mathbf{d}^{(i)}\right\rangle=0, \quad \forall\, 0 \leq i \leq k$$

Then, by mathematical induction,

$$\begin{aligned} \left\langle\mathbf{d}^{(k+1)}, \mathbf{Q} \mathbf{d}^{(i)}\right\rangle=&\left\langle-\mathbf{g}^{(k+1)}+\beta_{k} \mathbf{d}^{(k)}, \mathbf{Q} \mathbf{d}^{(i)}\right\rangle\\=&-\frac{1}{\alpha_{i}}\left\langle\mathbf{g}^{(k+1)}, \mathbf{Q} \Delta \mathbf{x}^{(i)}\right\rangle \\ =&-\frac{1}{\alpha_{i}}\left\langle\mathbf{g}^{(k+1)}, \Delta \mathbf{g}^{(i)}\right\rangle=0 \end{aligned}$$

#### Algorithm

Given a starting point $\mathbf{x}^{(0)}$ (a sketch in code follows the steps below):

  1. Compute $\mathbf{g}^{(0)} =\nabla f\left(\mathbf{x}^{(0)}\right)=\mathbf{Q} \mathbf{x}^{(0)}-\mathbf{b}$. If $\mathbf{g}^{(0)}=\mathbf{0}$, stop; else set $\mathbf{d}^{(0)}=-\mathbf{g}^{(0)}$.
  2. Take $\alpha_{k} =-\frac{\mathbf{g}^{(k) \top} \mathbf{d}^{(k)}}{\mathbf{d}^{(k) \top} \mathbf{Q} \mathbf{d}^{(k)}}$. Update $\mathbf{x}^{(k+1)} =\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}$.
  3. Compute the gradient $\mathbf{g}^{(k+1)} =\nabla f\left(\mathbf{x}^{(k+1)}\right)=\mathbf{Q} \mathbf{x}^{(k+1)}-\mathbf{b}$. If $\mathbf{g}^{(k+1)}=\mathbf{0}$, stop.
  4. Take $\beta_{k} =\frac{\mathbf{g}^{(k+1) \top} \mathbf{Q}\mathbf{d}^{(k)}}{\mathbf{d}^{(k) \top} \mathbf{Q} \mathbf{d}^{(k)}}$. Get a new $\mathbf{Q}$-conjugate direction $\mathbf{d}^{(k+1)}=-\mathbf{g}^{(k+1)}+\beta_k\mathbf{d}^{(k)}$.
  5. Set $k:=k+1$ and return to step 2.
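A minimal sketch of the algorithm above for the quadratic problem, following the $\alpha_k$ and $\beta_k$ formulas from the steps:

```python
import numpy as np

def conjugate_gradient(Q, b, x0, eps=1e-10):
    """Conjugate gradient for f(x) = 1/2 x^T Q x - b^T x with Q symmetric positive definite."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x - b
    d = -g
    for _ in range(len(b)):                  # at most n steps in exact arithmetic
        if np.linalg.norm(g) <= eps:
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)          # exact step along d
        x = x + alpha * d
        g = Q @ x - b
        beta = (g @ Qd) / (d @ Qd)           # makes d_{k+1} Q-conjugate to d_k
        d = -g + beta * d
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b, np.zeros(2)), np.linalg.solve(Q, b))
```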

### Non-quadratic problems

Motivation. Treat the quadratic function $f(\mathbf{x})=\frac{1}{2} \mathbf{x}^{T} \mathbf{Q} \mathbf{x}-\mathbf{x}^{T} \mathbf{b}$ as a second-order Taylor approximation of a general objective. Then $\mathbf{Q}$ plays the role of the Hessian, which would need to be re-evaluated at each iteration.

To be Hessian-free, modify how $\alpha_k$ and $\beta_k$ are computed:

$\alpha_k=\arg\min_{\alpha>0} f(\mathbf{x}^{(k)}+\alpha\mathbf{d}^{(k)})$ can be computed (approximately) by a line search.

Replace $\mathbf{Q}\mathbf{d}^{(k)}$ in $\beta_k$ with $\frac{\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}}{\alpha_k}$:

$$\begin{aligned} \beta_{k} =&\frac{\mathbf{g}^{(k+1) \top} \mathbf{Q}\mathbf{d}^{(k)}}{\mathbf{d}^{(k) \top} \mathbf{Q} \mathbf{d}^{(k)}}\\ =&\frac{\mathbf{g}^{(k+1) \top} [\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}]}{\mathbf{d}^{(k) \top}[\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}]}\quad\text{(Hestenes-Stiefel formula)}\\ =&\frac{\mathbf{g}^{(k+1) \top} [\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}]}{\mathbf{g}^{(k)\top}\mathbf{g}^{(k)}}\quad\text{(Polak-Ribière formula)}\\ =&\frac{\mathbf{g}^{(k+1) \top} \mathbf{g}^{(k+1)}}{\mathbf{g}^{(k)\top}\mathbf{g}^{(k)}}\quad\text{(Fletcher-Reeves formula)} \end{aligned}$$

The last formula follows from the lemma above (the gradients are mutually orthogonal) and the definition of $\mathbf{d}^{(k)}$. A sketch of the resulting algorithm appears after the notes below.

  1. Non-quadratic problems might not converge in $n$ steps, so reinitialize the direction vector to the negative gradient every few iterations (e.g., every $n$ or $n+1$ iterations).
  2. If the line search for $\alpha$ is inaccurate, the Hestenes-Stiefel formula tends to work better; the best choice of formula depends on the objective function.
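A minimal sketch of nonlinear conjugate gradient with the Fletcher-Reeves choice of $\beta_k$ and a restart every $n$ iterations; the step size is delegated to the `backtracking_line_search` routine sketched in the step-size section (an assumption of this sketch, since the notes leave the line search unspecified):

```python
import numpy as np

def nonlinear_cg(f, grad, x0, n_iter=200, eps=1e-8):
    """Nonlinear CG with Fletcher-Reeves beta and a restart every n iterations."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    g = grad(x)
    d = -g
    for k in range(n_iter):
        if np.linalg.norm(g) <= eps:
            break
        alpha = backtracking_line_search(f, grad, x, d)   # inexact line search
        x = x + alpha * d
        g_new = grad(x)
        if (k + 1) % n == 0:
            d = -g_new                                    # periodic restart
        else:
            beta = (g_new @ g_new) / (g @ g)              # Fletcher-Reeves formula
            d = -g_new + beta * d
        g = g_new
    return x
```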

## Quasi-Newton methods

Motivation. To avoid computing $\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1}$ in the damped Newton’s method, replace it with an approximation $\mathbf{H}_k$:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\alpha_k\mathbf{H}_k\mathbf{g}^{(k)},\qquad \mathbf{H}_k\approx\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1}$$

To ensure a decrease in $f$:

$$\begin{aligned} f\left(\mathbf{x}^{(k+1)}\right) &=f\left(\mathbf{x}^{(k)}\right)+\mathbf{g}^{(k) T}\left(\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right)+o\left(\left\|\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right\|\right) \\ &=f\left(\mathbf{x}^{(k)}\right)-\alpha_k \mathbf{g}^{(k) T} \mathbf{H}_{k} \mathbf{g}^{(k)}+o\left(\alpha_k\left\|\mathbf{H}_{k} \mathbf{g}^{(k)}\right\|\right) \end{aligned}$$

so for small $\alpha_k$ we must have

$$\mathbf{g}^{(k) T} \mathbf{H}_{k} \mathbf{g}^{(k)}>0$$

A simple way to ensure this requirement is to take $\mathbf{H}_k\succ \mathbf{0}$.

### Approximate the inverse Hessian

Suppose that $\mathbf{F}(\mathbf{x})=\mathbf{Q},\ \forall \mathbf{x}$, with $\mathbf{Q}=\mathbf{Q}^\top$. Then

$$\Delta \mathbf{g}^{(k)} = \mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}=\mathbf{Q}\left(\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right) = \mathbf{Q}\Delta \mathbf{x}^{(k)}$$

Therefore, we require the approximation $\mathbf{H}_{k+1}$ to satisfy

$$\mathbf{H}_{k+1}\Delta\mathbf{g}^{(i)}=\Delta\mathbf{x}^{(i)},\quad 0\le i\le k$$

Hence, if $\mathbf{H}_n$ satisfies $\mathbf{H}_n\Delta\mathbf{g}^{(i)}=\Delta\mathbf{x}^{(i)}$ for $0\le i\le n-1$ (with linearly independent $\Delta\mathbf{g}^{(i)}$), then $\mathbf{H}_n=\mathbf{Q}^{-1}$, and the algorithm is guaranteed to solve a problem with a quadratic objective in at most $n+1$ steps.

### Algorithm

$$\begin{aligned} \mathbf{d}^{(k)} &=-\mathbf{H}_{k} \mathbf{g}^{(k)} \\ \alpha_{k} &=\arg \min _{\alpha \geq 0} f\left(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)}\right) \\ \mathbf{x}^{(k+1)} &=\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)} \end{aligned}$$

where the matrices $\mathbf{H}_{0}, \mathbf{H}_{1}, \cdots$ are symmetric. In the quadratic case, they are required to satisfy

$$\mathbf{H}_{k+1} \Delta \mathbf{g}^{(i)}=\Delta \mathbf{x}^{(i)},\quad 0 \leq i \leq k$$

How do we derive an $\mathbf{H}_{k+1}$ from $\mathbf{H}_k$ that satisfies the above equation? We consider three specific updating formulas.

  1. Rank-one correction formula (single-rank symmetric (SRS) algorithm)

    $$\mathbf{H}_{k+1} = \mathbf{H}_{k}+a_k \mathbf{z}^{(k)}\mathbf{z}^{(k)\top}$$

    The correction term satisfies $\operatorname{rank}(\mathbf{z}^{(k)}\mathbf{z}^{(k)\top})=1$.

    $$\begin{array}{c} \mathbf{H}_{k+1} \Delta \mathbf{g}^{(k)}=\left(\mathbf{H}_{k}+a_{k} \mathbf{z}^{(k)} \mathbf{z}^{(k) \top}\right) \Delta \mathbf{g}^{(k)}=\Delta \mathbf{x}^{(k)} \\ \mathbf{z}^{(k)}=\frac{\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}}{a_{k} \mathbf{z}^{(k) \top} \Delta \mathbf{g}^{(k)}} \\ \mathbf{H}_{k+1}=\mathbf{H}_{k}+\frac{\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)^{\top}}{a_{k}\left(\mathbf{z}^{(k) \top} \Delta \mathbf{g}^{(k)}\right)^{2}} \end{array}$$

    We know from the first equation that $\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}=\left(a_{k} \mathbf{z}^{(k) \top} \Delta \mathbf{g}^{(k)}\right) \mathbf{z}^{(k)}$; multiplying it on the left by $\Delta\mathbf{g}^{(k)\top}$ lets us replace the denominator:

    $$\mathbf{H}_{k+1}=\mathbf{H}_{k}+\frac{\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)^{\top}}{\Delta \mathbf{g}^{(k)\top}\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)}$$

    Algorithm. Start with $k=0$ and a real symmetric positive definite $\mathbf{H}_0$.

    1. If $\mathbf{g}^{(k)}=\mathbf{0}$, stop; else set $\mathbf{d}^{(k)}=-\mathbf{H}_k\mathbf{g}^{(k)}$.

    2. Take $\alpha_{k} =\arg\min_{\alpha\ge 0}f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})$. Update $\mathbf{x}^{(k+1)} =\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}$.

    3. Compute the gradient and the next inverse-Hessian approximation, then return to step 1 with $k:=k+1$:

      $$\begin{aligned} \Delta \mathbf{x}^{(k)} &=\alpha_{k} \mathbf{d}^{(k)} \\ \Delta \mathbf{g}^{(k)} &=\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)} \\ \mathbf{H}_{k+1} &=\mathbf{H}_{k}+\frac{\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)^{\top}}{\Delta \mathbf{g}^{(k) \top}\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)} \end{aligned}$$

  2. Davidon-Fletcher-Powell (DFP) Algorithm

    Rank two update

    Algorithm. Start with $k=0$ and a real symmetric positive definite $\mathbf{H}_0$.

    1. If $\mathbf{g}^{(k)}=\mathbf{0}$, stop; else set $\mathbf{d}^{(k)}=-\mathbf{H}_k\mathbf{g}^{(k)}$.

    2. Take $\alpha_{k} =\arg\min_{\alpha\ge 0}f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})$. Update $\mathbf{x}^{(k+1)} =\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}$.

    3. Compute the gradient and the next inverse-Hessian approximation, then return to step 1 with $k:=k+1$:

      $$\begin{aligned} \Delta \mathbf{x}^{(k)} &=\alpha_{k} \mathbf{d}^{(k)} \\ \Delta \mathbf{g}^{(k)} &=\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)} \\ \mathbf{H}_{k+1} &=\mathbf{H}_{k}+\frac{\Delta \mathbf{x}^{(k)} \Delta \mathbf{x}^{(k) T}}{\Delta \mathbf{x}^{(k) T} \Delta \mathbf{g}^{(k)}}-\frac{\left[\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right]\left[\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right]^{T}}{\Delta \mathbf{g}^{(k) T} \mathbf{H}_{k} \Delta \mathbf{g}^{(k)}} \end{aligned}$$

  3. Broyden-Fletcher-Goldfarb-Shanno (BFGS) Algorithm

    Note that so far we approximate $\mathbf{Q}^{-1}$ by $\mathbf{H}_k$; we can also approximate $\mathbf{Q}$ itself by some $\mathbf{B}_k$ and take the inverse at the end:

    $$\mathbf{H}_{k+1} \Delta \mathbf{g}^{(i)}=\Delta \mathbf{x}^{(i)},\ 0 \leq i \leq k \qquad\text{vs.}\qquad \mathbf{B}_{k+1} \Delta \mathbf{x}^{(i)}=\Delta \mathbf{g}^{(i)},\ 0 \leq i \leq k$$

    The BFGS update of $\mathbf{B}_k$ is

    $$\mathbf{B}_{k+1} =\mathbf{B}_{k}+\frac{\Delta \mathbf{g}^{(k)} \Delta \mathbf{g}^{(k) T}}{\Delta \mathbf{g}^{(k) T} \Delta \mathbf{x}^{(k)}}-\frac{\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]^{T}}{\Delta \mathbf{x}^{(k) T} \mathbf{B}_{k} \Delta \mathbf{x}^{(k)}}$$

    To obtain an approximation of the inverse Hessian, we invert:

    $$\mathbf{H}_{k+1} = \mathbf{B}_{k+1}^{-1} =\left(\mathbf{B}_{k}+\frac{\Delta \mathbf{g}^{(k)} \Delta \mathbf{g}^{(k) T}}{\Delta \mathbf{g}^{(k) T} \Delta \mathbf{x}^{(k)}}-\frac{\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]^{T}}{\Delta \mathbf{x}^{(k) T} \mathbf{B}_{k} \Delta \mathbf{x}^{(k)}}\right)^{-1}$$

    This inverse can be computed using the Sherman-Morrison-Woodbury formula:

    $$\left(\mathbf{A}+\mathbf{U C V}^{T}\right)^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1} \mathbf{U}\left(\mathbf{C}^{-1}+\mathbf{V}^{T} \mathbf{A}^{-1} \mathbf{U}\right)^{-1} \mathbf{V}^{T} \mathbf{A}^{-1}$$

    where $\mathbf{A},\mathbf{C}$ are invertible and the sizes of $\mathbf{U},\mathbf{C},\mathbf{V}$ are compatible.

    Hence, the update of $\mathbf{H}$ is (a sketch follows after this list):

    $$\mathbf{H}_{k+1}^{BFGS}=\mathbf{H}_{k}+\left(1+\frac{\Delta \mathbf{g}^{(k) \top} \mathbf{H}_{k} \Delta \mathbf{g}^{(k)}}{\Delta \mathbf{g}^{(k) \top} \Delta \mathbf{x}^{(k)}}\right) \frac{\Delta \mathbf{x}^{(k)} \Delta \mathbf{x}^{(k) \top}}{\Delta \mathbf{x}^{(k) \top} \Delta \mathbf{g}^{(k)}}-\frac{\mathbf{H}_{k} \Delta \mathbf{g}^{(k)} \Delta \mathbf{x}^{(k) \top}+\left(\mathbf{H}_{k} \Delta \mathbf{g}^{(k)} \Delta \mathbf{x}^{(k) \top}\right)^{\top}}{\Delta \mathbf{g}^{(k) \top} \Delta \mathbf{x}^{(k)}}$$
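A minimal sketch of the BFGS inverse-Hessian update above, with `dx` $=\Delta\mathbf{x}^{(k)}$ and `dg` $=\Delta\mathbf{g}^{(k)}$:

```python
import numpy as np

def bfgs_update(H, dx, dg):
    """BFGS update of the inverse-Hessian approximation H from dx = x_{k+1}-x_k, dg = g_{k+1}-g_k."""
    dgdx = float(dg @ dx)                    # Delta g^T Delta x; should be > 0 (curvature condition)
    Hdg = H @ dg
    term1 = (1.0 + (dg @ Hdg) / dgdx) * np.outer(dx, dx) / dgdx
    term2 = (np.outer(Hdg, dx) + np.outer(dx, Hdg)) / dgdx
    return H + term1 - term2
```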

### 🌟Limited-Memory BFGS

$$\mathbf{H}_{k+1}=\mathbf{V}_{k}^{T} \mathbf{H}_{k} \mathbf{V}_{k}+\rho_{k} \Delta \mathbf{x}_{k} \Delta \mathbf{x}_{k}^{T},\qquad \rho_{k}=\frac{1}{\Delta \mathbf{g}_{k}^{T} \Delta \mathbf{x}_{k}}, \quad \mathbf{V}_{k}=\mathbf{I}-\rho_{k} \Delta \mathbf{g}_{k} \Delta \mathbf{x}_{k}^{T}$$

We only need to store $m$ pairs of vectors $(\Delta\mathbf{x}_i,\Delta\mathbf{g}_i)$ instead of the full matrix: $\mathbf{H}_k$ is only ever used to compute the vector $\mathbf{H}_k\mathbf{g}_k$, and the stored pairs are enough for that computation.

$$\begin{aligned} \mathbf{H}_{k}=&\left(\mathbf{V}_{k-1}^{T} \cdots \mathbf{V}_{k-m}^{T}\right) \mathbf{H}_{k}^{0}\left(\mathbf{V}_{k-m} \cdots \mathbf{V}_{k-1}\right) \\ &+\rho_{k-m}\left(\mathbf{V}_{k-1}^{T} \cdots \mathbf{V}_{k-m+1}^{T}\right) \Delta \mathbf{x}_{k-m} \Delta \mathbf{x}_{k-m}^{T}\left(\mathbf{V}_{k-m+1} \cdots \mathbf{V}_{k-1}\right) \\ &+\rho_{k-m+1}\left(\mathbf{V}_{k-1}^{T} \cdots \mathbf{V}_{k-m+2}^{T}\right) \Delta \mathbf{x}_{k-m+1} \Delta \mathbf{x}_{k-m+1}^{T}\left(\mathbf{V}_{k-m+2} \cdots \mathbf{V}_{k-1}\right) \\ &+\cdots \\ &+\rho_{k-1} \Delta \mathbf{x}_{k-1} \Delta \mathbf{x}_{k-1}^{T} \end{aligned}$$

From this expression we can design a two-loop recursion to compute $\mathbf{H}_k\mathbf{g}_k$:

$$\mathbf{H}_{k}\mathbf{g}_{k}=\mathbf{V}_{k-1}^{T} \mathbf{H}_{k-1} (\mathbf{V}_{k-1}\mathbf{g}_{k})+ \Delta \mathbf{x}_{k-1}(\rho_{k-1} \Delta \mathbf{x}_{k-1}^{T}\mathbf{g}_{k})$$

Loop 1: compute $\{\alpha_i\}$ and $\{\mathbf{q}_i\}$ for $i=k-1,\cdots,k-m$.

$$\alpha_{k-1}:=\rho_{k-1} \Delta \mathbf{x}_{k-1}^{T}\mathbf{g}_{k}, \quad\alpha_i=\rho_{i} \Delta \mathbf{x}_{i}^{T}\mathbf{q}_{i+1}\\ \mathbf{q}_k:=\mathbf{g}_{k}, \quad\mathbf{q}_{i-1}=\mathbf{V}_{i-1}\mathbf{q}_i \\ \Rightarrow\mathbf{q}_i=\mathbf{V}_i\mathbf{q}_{i+1}= (\mathbf{I}-\rho_{i} \Delta \mathbf{g}_{i} \Delta \mathbf{x}_{i}^{T})\mathbf{q}_{i+1}= \mathbf{q}_{i+1}-\alpha_i\Delta\mathbf{g}_i$$

Loop 2: compute $\mathbf{p}_i$ for $i=k-m,\cdots,k$. Finally $\mathbf{H}_k\mathbf{g}_k=\mathbf{p}_k$.

$$\mathbf{p}_{k-m}:=\mathbf{H}_{k}^0\mathbf{q}_{k-m}, \quad \beta=\rho_i\Delta\mathbf{g}_i^T\mathbf{p}_i,\qquad \mathbf{p}_{i+1}=\mathbf{V}_{i}^T\mathbf{p}_i+\alpha_i\Delta\mathbf{x}_i = \mathbf{p}_i+(\alpha_i-\beta)\Delta\mathbf{x}_i$$


The choice of $\mathbf{H}_k^0$ could be

$$\mathbf{H}_k^0=\gamma_k\mathbf{I},\quad \gamma_{k}=\frac{\Delta \mathbf{x}_{k-1}^{T} \Delta \mathbf{g}_{k-1}}{\Delta \mathbf{g}_{k-1}^{T} \Delta \mathbf{g}_{k-1}}$$
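A minimal sketch of the two-loop recursion above, computing $\mathbf{H}_k\mathbf{g}_k$ from the $m$ most recent pairs $(\Delta\mathbf{x}_i,\Delta\mathbf{g}_i)$, stored oldest first, with the scaled identity $\mathbf{H}_k^0=\gamma_k\mathbf{I}$:

```python
import numpy as np

def lbfgs_direction(g, dx_list, dg_list):
    """Two-loop recursion: return p = H_k g from stored pairs (dx_i, dg_i), oldest first."""
    rhos = [1.0 / (dg @ dx) for dx, dg in zip(dx_list, dg_list)]
    q = np.array(g, dtype=float)
    alphas = []
    # Loop 1: newest pair to oldest
    for dx, dg, rho in zip(reversed(dx_list), reversed(dg_list), reversed(rhos)):
        a = rho * (dx @ q)
        alphas.append(a)
        q = q - a * dg
    # Initial approximation H_k^0 = gamma_k * I
    gamma = (dx_list[-1] @ dg_list[-1]) / (dg_list[-1] @ dg_list[-1])
    p = gamma * q
    # Loop 2: oldest pair to newest
    for dx, dg, rho, a in zip(dx_list, dg_list, rhos, reversed(alphas)):
        beta = rho * (dg @ p)
        p = p + (a - beta) * dx
    return p   # the quasi-Newton search direction is then -p
```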

Use a line search based on the weak Wolfe conditions (sufficient decrease, plus a curvature condition requiring the directional derivative to increase enough):

$$\begin{aligned} f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right) & \leq f\left(\mathbf{x}_{k}\right)+c_{1} \alpha_{k} \nabla f_{k}^{T} \mathbf{p}_{k} \\ \nabla f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right)^{T} \mathbf{p}_{k} & \geq c_{2} \nabla f_{k}^{T} \mathbf{p}_{k} \end{aligned}$$

with $0<c_{1}<c_{2}<1$, or the strong Wolfe conditions (which additionally require the directional derivative not to be too large in magnitude):

$$\begin{aligned} f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right) & \leq f\left(\mathbf{x}_{k}\right)+c_{1} \alpha_{k} \nabla f_{k}^{T} \mathbf{p}_{k} \\ \left|\nabla f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right)^{T} \mathbf{p}_{k}\right| & \leq c_{2}\left|\nabla f_{k}^{T} \mathbf{p}_{k}\right| \end{aligned}$$


## Majorization minimization

### Basic framework

Original problem:

$$\min_\mathbf{x} f(\mathbf{x})\quad \text{s.t.}\ \mathbf{x}\in C$$

Surrogate problem at iteration $k$:

$$\min_\mathbf{x} g_k(\mathbf{x})\quad \text{s.t.}\ \mathbf{x}\in C$$

where $g_k$ satisfies the global majorant conditions:

  1. $f(\mathbf{x})\le g_k(\mathbf{x}),\ \forall\mathbf{x}\in C$
  2. $f(\mathbf{x}_k)= g_k(\mathbf{x}_k)$

so that $\mathbf{x}_{k+1}=\underset{\mathbf{x}}{\operatorname{argmin}}\, g_{k}(\mathbf{x}) \Longrightarrow f\left(\mathbf{x}_{k+1}\right) \leq g_{k}\left(\mathbf{x}_{k+1}\right) \leq g_{k}\left(\mathbf{x}_{k}\right)=f\left(\mathbf{x}_{k}\right)$.

A local majorant replaces the first requirement with $f(\mathbf{x}_{k+1})\le g_k(\mathbf{x}_{k+1})$.
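A minimal sketch of the resulting MM loop; the surrogate minimizer `argmin_g` is an assumed callable supplied by the caller (a hypothetical interface for illustration):

```python
import numpy as np

def majorization_minimization(f, argmin_g, x0, n_iter=100, tol=1e-10):
    """Generic MM loop: x_{k+1} = argmin_x g_k(x) guarantees f(x_{k+1}) <= f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x_new = argmin_g(x)          # minimize the majorant g_k built at the current x
        if f(x) - f(x_new) <= tol:   # monotone decrease; stop when progress stalls
            return x_new
        x = x_new
    return x

# With the Lipschitz gradient surrogate discussed below and C = R^n,
# argmin_g(x) = x - (1/L) * grad_f(x), so the MM loop reduces to gradient descent.
```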


### Choice of majorant function

  • Lipschitz Gradient Surrogate

    $$\begin{aligned} g_{k}(\mathbf{x})&=f\left(\mathbf{x}_{k}\right)+\left\langle\nabla f\left(\mathbf{x}_{k}\right), \mathbf{x}-\mathbf{x}_{k}\right\rangle+\frac{1}{2 \alpha}\left\|\mathbf{x}-\mathbf{x}_{k}\right\|^{2} \\ \mathbf{x}_{k+1}&=\mathcal{P}_{\mathcal{C}}\left(\mathbf{x}_{k}-\alpha \nabla f\left(\mathbf{x}_{k}\right)\right) \quad\leftarrow\text{projected gradient descent}\\ \mathbf{x}_{k+1}&=\mathbf{x}_{k}-\alpha \nabla f\left(\mathbf{x}_{k}\right),\quad \mathcal{C}=\mathbb{R}^{n}\quad\leftarrow\text{gradient descent} \end{aligned}$$

    Choose $\alpha$ by backtracking line search. If $f$ is $L$-smooth, i.e., $\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|\le L\|\mathbf{x}-\mathbf{y}\|$, then one can choose $\alpha=L^{-1}$.

  • Quadratic Surrogate

    $$g_{k}(\mathbf{x})=f\left(\mathbf{x}_{k}\right)+\left\langle\nabla f\left(\mathbf{x}_{k}\right), \mathbf{x}-\mathbf{x}_{k}\right\rangle+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}_{k}\right)^{T} \mathbf{H}_{k}\left(\mathbf{x}-\mathbf{x}_{k}\right), \text { where } \mathbf{H}_{k} \succ \nabla^{2} f\\ \mathbf{x}_{k+1}=\mathbf{x}_{k}-\mathbf{H}_{k}^{-1} \nabla f\left(\mathbf{x}_{k}\right) \quad\left(\mathcal{C}=\mathbb{R}^{n}\right) \quad \text{(quasi-Newton method)}$$

  • Jensen Surrogate

    $$\min_\mathbf{x} f(\mathbf{x}) =\tilde{f}(\boldsymbol{\theta}^\top \mathbf{x}),\quad \text{s.t.}\ \mathbf{x}\in C,\ \tilde{f} \text{ convex}\\ g_{k}(\mathbf{x})=\sum_{i=1}^{n} w_{i} \tilde{f}\left(\frac{\theta_{i}}{w_{i}}\left(x_{i}-x_{k, i}\right)+\boldsymbol{\theta}^{T} \mathbf{x}_{k}\right)$$

    where $\mathbf{w} \in \mathbb{R}_{+}^{n}$, $\|\mathbf{w}\|_{1}=1$, and $w_{i} \neq 0$ whenever $\theta_{i} \neq 0$.

    • Application of Jensen’s Inequality: EM (Expectation Maximization) Algorithm

      $$-\log \left(\sum_{t=1}^{T} f^{t}(\boldsymbol{\theta})\right) \leq-\sum_{t=1}^{T} \mathbf{w}(t) \log \left(\frac{f^{t}(\boldsymbol{\theta})}{\mathbf{w}(t)}\right)$$

      The EM algorithm minimizes a negative log-likelihood. The right-hand side of this inequality can be interpreted as a majorizing surrogate of the left-hand side.

      Consider maximum likelihood estimation of the parameters of a Gaussian Mixture Model (GMM), whose density is

      $$\sum_{k=1}^{n} \theta_{k} \frac{1}{(2 \pi)^{d / 2}\left|\boldsymbol{\Sigma}_{k}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right)$$

      Given data $\mathbf{x}_{i}, i=1, \cdots, m$, we want to estimate the parameters $\left\{\theta_{k}, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right\}$ by maximizing the log-likelihood:

      $$\max _{\boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\Sigma}} \sum_{i=1}^{m} \log \sum_{k=1}^{n} \theta_{k} \frac{1}{(2 \pi)^{d / 2}\left|\boldsymbol{\Sigma}_{k}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{k}\right)\right)$$

  • Variational Surrogate

    Minimizing a function $h(\mathbf{x},\mathbf{y})$ jointly over $\mathbf{x}$ and $\mathbf{y}$ is equivalent to minimizing its partial minimum over $\mathbf{x}$:

    $$\min _{\mathbf{x}} f(\mathbf{x}), \quad \text { s.t. } \mathbf{x} \in \mathcal{C}, \quad \text { where } f(\mathbf{x})=\min _{\mathbf{y} \in \mathcal{Y}} h(\mathbf{x}, \mathbf{y})\\ g_{k}(\mathbf{x})=h\left(\mathbf{x}, \mathbf{y}_{k}^{*}\right), \quad \mathbf{y}_{k}^{*}=\underset{\mathbf{y} \in \mathcal{Y}}{\operatorname{argmin}}\, h\left(\mathbf{x}_{k}, \mathbf{y}\right).$$