I am taking Convex Optimization this semester (2021 Spring). These notes are based on my understanding of the course materials provided by Prof. Zhouchen Lin. They only offer concise definitions and formulas, with few proofs and examples.

From this chapter on, we introduce convex optimization algorithms, including descent methods, Newton's method, conjugate direction methods, and majorization minimization. I used to regard gradient descent as merely the simple workhorse of neural networks, so I never paid much attention to it, but it turns out that the underlying theory can be rather rich. It feels like learning the history and construction behind today's neural networks; theory is fascinating in this way.

# Unconstrained Optimization


## Problems

$$\min_\mathbf{x} f(\mathbf{x})$$

where $f:\mathbb{R}^n\to\mathbb{R}$ is convex and twice continuously differentiable. We compute a sequence of points such that $f(\mathbf{x}^{(k)})\to p^*$, and we can check termination by $\|\nabla f(\mathbf{x}^{(k)})\|\le \epsilon$.

  • Fixed-point algorithm

    • Banach Fixed Point Theorem

      Let $(X,d)$ be a non-empty complete metric space with a contraction mapping $T:X\to X$:

      $$d(T(\mathbf{x}),T(\mathbf{y}))\le L\,d(\mathbf{x},\mathbf{y}),\quad 0\le L<1$$

      Then, starting from an arbitrary $\mathbf{x}_0$, the iteration $\mathbf{x}_{n}=T(\mathbf{x}_{n-1})$ converges to the fixed point: $\mathbf{x}_n\to\mathbf{x}^*$.

    Given the theorem, we know that if $\varphi$ satisfies

    $$\|\varphi(\mathbf{x})-\varphi(\mathbf{y})\|\le L\|\mathbf{x}-\mathbf{y}\|,\quad 0\le L<1$$

    then the fixed-point iteration converges to the fixed point of $\varphi$.

  • Fixed point and gradient descent

    If we take $\varphi(\mathbf{x})=\mathbf{x}-\alpha\nabla f(\mathbf{x})$, then the fixed point satisfies $\mathbf{x}=\mathbf{x}-\alpha\nabla f(\mathbf{x})$, i.e. $\nabla f(\mathbf{x}^*)=\mathbf{0}$. (A small sketch of this iteration appears at the end of this section.)

    1. $\mathbf{x}_{k+1}=\mathbf{x}_k-\alpha\nabla f(\mathbf{x}_k)$. This is the update rule of gradient descent.
    2. $\mathbf{x}_k=\mathbf{x}_{k+1}+\alpha\nabla f(\mathbf{x}_{k+1})$. This means $\mathbf{x}_{k+1}=\arg\min_{\mathbf{x}}\alpha f(\mathbf{x})+\frac{1}{2}\|\mathbf{x}-\mathbf{x}_k\|^2$, which is the proximal operator.
  • Strong convexity and implications

    Strong convexity of the objective function implies:

    $$m\mathbf{I}\preceq \nabla^2f(\mathbf{x})\preceq M\mathbf{I} \Rightarrow \operatorname{cond}(\nabla^2f(\mathbf{x}))\le \frac{M}{m}$$

    $$f(\mathbf{y}) \geq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{m}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2}$$

    The second inequality can be used to bound $f(\mathbf{x})-p^{*}$ in terms of $\|\nabla f(\mathbf{x})\|_{2}$. Its right-hand side is a convex quadratic function of $\mathbf{y}$ (for fixed $\mathbf{x}$). Setting the gradient with respect to $\mathbf{y}$ equal to zero, we find that $\tilde{\mathbf{y}}=\mathbf{x}-(1/m)\nabla f(\mathbf{x})$ minimizes the right-hand side. Therefore we have

    $$\begin{aligned} f(\mathbf{y}) & \geq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{m}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2} \\ & \geq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\tilde{\mathbf{y}}-\mathbf{x})+\frac{m}{2}\|\tilde{\mathbf{y}}-\mathbf{x}\|_{2}^{2} \\ &=f(\mathbf{x})-\frac{1}{2 m}\|\nabla f(\mathbf{x})\|_{2}^{2} \end{aligned}$$

    Since this holds for any $\mathbf{y} \in S$, we have

    $$p^{*} \geq f(\mathbf{x})-\frac{1}{2 m}\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    This inequality shows how small $\|\nabla f(\mathbf{x})\|_2$ must be for the point to be nearly optimal:

    $$\|\nabla f(\mathbf{x})\|_2\le (2m\epsilon)^{1/2}\Rightarrow f(\mathbf{x})-p^*\le \epsilon$$

    We can also bound the distance between $\mathbf{x}$ and the optimal point $\mathbf{x}^*$:

    $$\|\mathbf{x}-\mathbf{x}^*\|_2\le \frac{2}{m}\|\nabla f(\mathbf{x})\|_2$$

    Similarly, we have the descent lemma and an upper bound on $p^*$:

    $$f(\mathbf{y}) \leq f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{M}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2},\qquad p^{*} \leq f(\mathbf{x})-\frac{1}{2 M}\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    • $M$-smoothness

      An upper bound on $\nabla^2 f(\mathbf{x})$ is equivalent to the definition of smoothness:

      $$\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2\le M\|\mathbf{x}-\mathbf{y}\|_2 \Leftrightarrow \nabla^2f(\mathbf{x})\preceq M\mathbf{I}$$
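To make the fixed-point view above concrete, here is a minimal sketch (my own illustration, not from the course materials) of gradient descent as a fixed-point iteration with the gradient-norm termination test, applied to a strongly convex quadratic:

```python
import numpy as np

def gradient_descent_fixed_point(grad, x0, alpha, eps=1e-8, max_iter=10000):
    """Iterate the fixed-point map phi(x) = x - alpha * grad(x) until ||grad(x)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:   # termination test ||grad f(x)|| <= eps
            break
        x = x - alpha * g              # x <- phi(x)
    return x

# Example: f(x) = 1/2 x^T Q x - b^T x, so grad f(x) = Q x - b and x* solves Q x = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
alpha = 1.0 / np.linalg.eigvalsh(Q).max()   # alpha <= 1/M makes phi a contraction here
x_hat = gradient_descent_fixed_point(lambda x: Q @ x - b, np.zeros(2), alpha)
print(x_hat, np.linalg.solve(Q, b))         # the two should agree
```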

## Descent methods

$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)}+t^{(k)}\Delta\mathbf{x}^{(k)}$$

where $t^{(k)}>0$ is the step size and $\Delta\mathbf{x}^{(k)}$ is the search direction.

We hope that $f(\mathbf{x}^{(k+1)})<f(\mathbf{x}^{(k)})$ unless $\mathbf{x}^{(k)}$ is optimal. By convexity, $\nabla f(\mathbf{x}^{(k)})^\top (\mathbf{y}-\mathbf{x}^{(k)})\ge 0$ implies $f(\mathbf{y})\ge f(\mathbf{x}^{(k)})$, so the search direction must satisfy

$$\langle\nabla f(\mathbf{x}^{(k)}), \Delta\mathbf{x}^{(k)}\rangle=\nabla f(\mathbf{x}^{(k)})^\top \Delta\mathbf{x}^{(k)} < 0$$

The stopping criterion is $\|\nabla f(\mathbf{x})\|_2\le \eta$.

### Step size

  • Exact line search

    $$t = {\arg\min}_{s\ge 0} f(\mathbf{x}+s\Delta\mathbf{x})$$

  • Backtracking line search (inexact)

    Algorithm:

    1. Given a search direction $\Delta \mathbf{x}$ and parameters $\alpha\in(0,1),\beta\in(0,1)$, set $t=1$.

    2. While $f(\mathbf{x}+t\Delta\mathbf{x}) > f(\mathbf{x})+\alpha t\nabla f(\mathbf{x})^\top \Delta \mathbf{x}$, reduce the step size: $t:=\beta t$.

    *(Figure: illustration of backtracking line search.)*

From the illustration we can see that the sufficient-decrease condition holds for $t\in[0, t_0]$. The backtracking search therefore stops either with $t=1$ or with $t\in(\beta t_0, t_0]$, i.e. $t\ge\min\{1,\beta t_0\}$.
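A minimal sketch of this backtracking procedure, assuming `f` and `grad_f` are callables and `dx` is a descent direction (the parameter defaults are my own):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until f(x + t*dx) <= f(x) + alpha * t * grad_f(x)^T dx."""
    t = 1.0
    fx = f(x)
    slope = grad_f(x) @ dx            # <grad f(x), dx>, negative for a descent direction
    while f(x + t * dx) > fx + alpha * t * slope:
        t *= beta                     # reduce the step size geometrically
    return t
```

For gradient descent one would call it with `dx = -grad_f(x)`.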

### Search direction

#### Gradient Descent

$$\Delta\mathbf{x}=-\nabla f(\mathbf{x})$$

  • Convergence analysis: Exact line search

    We use the lighter notation $\mathbf{x}^{+}=\mathbf{x}+t \Delta \mathbf{x}$ for $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)}+t^{(k)}\Delta \mathbf{x}^{(k)}$.

    Assume that $f$ is strongly convex on $S$; then there are positive constants $m$ and $M$ such that $m \mathbf{I} \preceq \nabla^{2} f(\mathbf{x}) \preceq M \mathbf{I},\ \forall \mathbf{x} \in S$.

    Define $\tilde{f}: \mathbb{R} \rightarrow \mathbb{R}$ by $\tilde{f}(t)=f(\mathbf{x}-t \nabla f(\mathbf{x}))$. Taking $\mathbf{y}=\mathbf{x}-t \nabla f(\mathbf{x})$ in the descent lemma, we obtain a quadratic upper bound on $\tilde{f}$:

    $$\tilde{f}(t) \leq f(\mathbf{x})-t\|\nabla f(\mathbf{x})\|_{2}^{2}+\frac{M t^{2}}{2}\|\nabla f(\mathbf{x})\|_{2}^{2} = f(\mathbf{x})+\left(\frac{M}{2}t^2-t\right)\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    Minimizing over $t$ (exact line search on the left, $t=1/M$ on the right-hand side), we have

    $$f(\mathbf{x}^+) \le f(\mathbf{x})-\frac{1}{2M}\|\nabla f(\mathbf{x})\|_{2}^{2},\qquad f(\mathbf{x}^+)-p^* \le f(\mathbf{x})-p^*-\frac{1}{2M}\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    We know that $\|\nabla f(\mathbf{x})\|_{2}^{2}\ge 2m(f(\mathbf{x})-p^*)$, so

    $$f\left(\mathbf{x}^{+}\right)-p^{*} \leq(1-m/M)\left(f(\mathbf{x})-p^{*}\right)$$

    Applying this recursively for $k$ steps, we have

    $$f\left(\mathbf{x}^{(k)}\right)-p^{*} \leq c^{k}\left(f\left(\mathbf{x}^{(0)}\right)-p^{*}\right), \text{ where } c=1-m/M<1.$$

    Thus, the method converges linearly (a numerical sanity check is sketched at the end of this subsection).

    In particular, we must have $f(\mathbf{x}^{(k)})-p^{*} \leq \epsilon$ after at most

    $$\frac{\log \left(\left(f\left(\mathbf{x}^{(0)}\right)-p^{*}\right) / \epsilon\right)}{\log (1 / c)}$$

    iterations of the gradient method with exact line search.

    • Interpretation:

      The numerator is the log of the ratio between the initial gap and the final tolerance $\epsilon$.

      The denominator, $\log(1/c)$, is determined by $M/m$, the bound on the condition number of the Hessian near $\mathbf{x}^*$. When $M/m$ is large, we have

      $$\log (1/c) = -\log (1-m/M)\approx m/M$$

      Hence, the number of iterations needed grows nearly linearly with $M/m$.

  • Convergence analysis: backtracking line search

    We already know that

    $$\tilde{f}(t) \leq f(\mathbf{x})+\left(\frac{M}{2}t^2-t\right)\|\nabla f(\mathbf{x})\|_{2}^{2}$$

    So the exit condition $\tilde{f}(t)\le f(\mathbf{x})-\alpha t\|\nabla f(\mathbf{x})\|_2^2$, $0<\alpha<0.5$, is always satisfied when $0\le t\le 1/M$, since on that interval $\frac{M}{2}t^2-t\le -t/2\le -\alpha t$.

    Hence, the backtracking search terminates either with $t=1$ or with $t\ge \beta/M$, which gives

    $${f}(\mathbf{x}^+) \leq f(\mathbf{x})-\min\left\{\alpha,\frac{\alpha\beta}{M}\right\} \|\nabla f(\mathbf{x})\|_{2}^{2}$$

    Given $\|\nabla f(\mathbf{x})\|_{2}^{2}\ge 2m(f(\mathbf{x})-p^*)$, subtracting $p^*$ from both sides gives

    $${f}(\mathbf{x}^+)-p^* \leq \left(1-\min\left\{2m\alpha,\frac{2m\alpha\beta}{M}\right\}\right)\left(f(\mathbf{x})-p^*\right)$$

    Taking $c=1-\min\{2m\alpha,\frac{2m\alpha\beta}{M}\}<1$, we have

    $$f\left(\mathbf{x}^{(k)}\right)-p^{*} \leq c^{k}\left(f\left(\mathbf{x}^{(0)}\right)-p^{*}\right)$$
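As a sanity check on the linear rate above, this sketch runs gradient descent with exact line search on a strongly convex quadratic (where the exact step has a closed form) and compares the observed contraction of $f(\mathbf{x}^{(k)})-p^*$ against the bound $c=1-m/M$; the test matrix and starting point are my own choices:

```python
import numpy as np

# f(x) = 1/2 x^T Q x - b^T x, strongly convex with m = lambda_min(Q), M = lambda_max(Q)
Q = np.diag([1.0, 10.0])          # condition number M/m = 10
b = np.array([1.0, 1.0])
p_star = -0.5 * b @ np.linalg.solve(Q, b)

x = np.array([5.0, 5.0])
gaps = []
for k in range(20):
    g = Q @ x - b
    t = (g @ g) / (g @ Q @ g)     # exact line search step for a quadratic
    x = x - t * g
    gaps.append(0.5 * x @ Q @ x - b @ x - p_star)

m, M = np.linalg.eigvalsh(Q).min(), np.linalg.eigvalsh(Q).max()
ratios = [gaps[k + 1] / gaps[k] for k in range(10)]
print(max(ratios), 1 - m / M)     # observed contraction factor vs. the bound c = 1 - m/M
```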

#### Steepest descent method

The first-order Taylor approximation of $f(\mathbf{x}+\mathbf{v})$ is

$$f(\mathbf{x}+\mathbf{v}) \approx f(\mathbf{x})+\nabla f(\mathbf{x})^\top \mathbf{v}$$

The second term can be interpreted as the directional derivative of $f$ at $\mathbf{x}$ in the direction $\mathbf{v}$. Our goal is to choose a step direction $\mathbf{v}$ that makes this derivative as negative as possible.

We define a normalized steepest descent direction as

$$\Delta\mathbf{x}_{nsd}={\arg\min}_\mathbf{v}\{\nabla f(\mathbf{x})^\top \mathbf{v}\mid \|\mathbf{v}\|=1 \}$$

We can also define an unnormalized steepest descent direction via the dual norm:

$$\Delta\mathbf{x}_{sd}=\|\nabla f(\mathbf{x})\|_*\,\Delta\mathbf{x}_{nsd}$$

This search direction satisfies

$$\nabla f(\mathbf{x})^\top \Delta\mathbf{x}_{sd}=\|\nabla f(\mathbf{x})\|_*\,\nabla f(\mathbf{x})^\top \Delta\mathbf{x}_{nsd} = -\|\nabla f(\mathbf{x})\|_*^2$$

  • Euclidean norm case

    When $\|\cdot\|$ is the Euclidean norm, $\Delta\mathbf{x}_{nsd} = -\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|_2}$ and $\Delta\mathbf{x}_{sd} = -\nabla f(\mathbf{x})$, which is exactly the gradient descent method.

  • Quadratic norm case

    $$\|\mathbf{z}\|_\mathbf{P}=(\mathbf{z}^\top \mathbf{P}\mathbf{z})^{1/2}=\|\mathbf{P}^{1/2}\mathbf{z}\|_2,\quad\mathbf{P}\in\mathbb{S}_{++}^n,\qquad \Delta\mathbf{x}_{nsd}=-\frac{\mathbf{P}^{-1}\nabla f(\mathbf{x})}{\left(\nabla f(\mathbf{x})^\top\mathbf{P}^{-1}\nabla f(\mathbf{x}) \right)^{1/2}}$$

    The dual norm is $\|\mathbf{z}\|_*=\|\mathbf{P}^{-1/2}\mathbf{z}\|_2$, so the steepest descent step is

    $$\Delta\mathbf{x}_{sd}=-\mathbf{P}^{-1}\nabla f(\mathbf{x})$$

  • Coordinate change

    $\bar{\mathbf{u}} = \mathbf{P}^{1/2}\mathbf{u}$ defines the coordinate change from the quadratic norm to the Euclidean norm: $\|\mathbf{u}\|_\mathbf{P} = \|\bar{\mathbf{u}}\|_2$.

    $$\bar{f}(\bar{\mathbf{u}}) = f(\mathbf{P}^{-1/2}\bar{\mathbf{u}}) = f(\mathbf{u})$$

    Applying the gradient method to $\bar{f}$:

    $$\Delta\bar{\mathbf{x}} = -\nabla \bar{f}(\bar{\mathbf{x}})=-\mathbf{P}^{-1/2}\nabla f(\mathbf{P}^{-1/2}\bar{\mathbf{x}}) = -\mathbf{P}^{-1/2}\nabla f({\mathbf{x}})$$

    Undoing the coordinate change, the steepest descent direction for the quadratic norm $\|\cdot\|_\mathbf{P}$ should be

    $$\Delta{\mathbf{x}} = \mathbf{P}^{-1/2}\Delta\bar{\mathbf{x}} = -\mathbf{P}^{-1}\nabla f({\mathbf{x}})$$

  • $\ell_1$-norm

    Let $i$ be an index for which $|(\nabla f(\mathbf{x}))_i|=\|\nabla f(\mathbf{x})\|_\infty$. Then

    $$\Delta\mathbf{x}_{nsd}=-\operatorname{sign}\left(\frac{\partial f(\mathbf{x})}{\partial x_i}\right)\mathbf{e}_i,\qquad \Delta\mathbf{x}_{sd}=\Delta\mathbf{x}_{nsd}\|\nabla f(\mathbf{x})\|_\infty=-\left(\frac{\partial f(\mathbf{x})}{\partial x_i}\right)\mathbf{e}_i$$

    Hence, the steepest descent direction for the $\ell_1$-norm is always along the coordinate axis in which the decrease in $f$ is greatest, i.e. a coordinate descent step. (A small sketch computing these norm-dependent directions appears at the end of this section.)

  • Choice of norm

    If we already know an approximation $\hat{\mathbf{H}}$ of the Hessian at the optimal point, then we can choose $\mathbf{P} = \hat{\mathbf{H}}$, so that the Hessian of $\bar{f}$ becomes

    $$\nabla^2\bar{f}(\bar{\mathbf{x}}) = \mathbf{P}^{-1/2}\nabla^2 f(\mathbf{P}^{-1/2}\bar{\mathbf{x}})\mathbf{P}^{-1/2}=\hat{\mathbf{H}}^{-1/2}\nabla^2 f(\mathbf{x})\hat{\mathbf{H}}^{-1/2}\approx \mathbf{I}$$

    Thus, the condition number of the transformed problem is likely to be low.

    Geometrically, the ellipsoid $\{\mathbf{x}\mid\|\mathbf{x}\|_\mathbf{P}\le 1 \}$ should approximate the shape of a sublevel set $\{\mathbf{x}\mid f(\mathbf{x})\le L \}$.

    In the following figure, the left ellipsoid is similar to the sublevel set, while the right one is not.

    *(Figure: two ellipsoids compared with the same sublevel set.)*

    If the condition number of a set is small, it means that the set has approximately the same width in all directions, i.e., it is nearly spherical.

    Clearly, the left choice makes the transformed sublevel set closer to a sphere.
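To make the norm-dependent formulas above concrete, here is a minimal sketch (helper names are my own) that computes the unnormalized steepest descent direction $\Delta\mathbf{x}_{sd}$ for the Euclidean, quadratic, and $\ell_1$ norms:

```python
import numpy as np

def sd_euclidean(grad):
    # Euclidean norm: Delta x_sd = -grad f(x)
    return -grad

def sd_quadratic(grad, P):
    # Quadratic norm ||.||_P with P symmetric positive definite: Delta x_sd = -P^{-1} grad f(x)
    return -np.linalg.solve(P, grad)

def sd_l1(grad):
    # l1 norm: move along the coordinate with the largest |partial derivative|
    i = np.argmax(np.abs(grad))
    d = np.zeros_like(grad)
    d[i] = -grad[i]                 # Delta x_sd = -(df/dx_i) e_i
    return d

g = np.array([0.5, -2.0, 1.0])
P = np.diag([1.0, 4.0, 9.0])
print(sd_euclidean(g), sd_quadratic(g, P), sd_l1(g))
```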

## Newton’s method

Motivation. Steepest descent only uses first derivatives; Newton’s method also exploits second derivatives.

### Method

Construct a quadratic approximation of the objective function that matches the first and second derivatives at the current point, then minimize this approximation:

$$f(\mathbf{x}) \approx f\left(\mathbf{x}^{(k)}\right)+\mathbf{g}^{(k)\top}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)^{\top} \mathbf{F}\left(\mathbf{x}^{(k)}\right)\left(\mathbf{x}-\mathbf{x}^{(k)}\right) \triangleq q(\mathbf{x})$$

Here, $\mathbf{g}^{(k)}=\nabla f\left(\mathbf{x}^{(k)}\right)$ and $\mathbf{F}\left(\mathbf{x}^{(k)}\right)=\nabla^2 f\left(\mathbf{x}^{(k)}\right)$ is the Hessian.


Applying the first-order necessary condition (FONC) to $q$ yields

$$\mathbf{0}=\nabla q(\mathbf{x})=\mathbf{g}^{(k)}+\mathbf{F}\left(\mathbf{x}^{(k)}\right)\left(\mathbf{x}-\mathbf{x}^{(k)}\right)$$

If $\mathbf{F}\left(\mathbf{x}^{(k)}\right) \succ 0$, then $q$ achieves its minimum at

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}$$

The above is the recursive update rule of Newton’s method.
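A minimal sketch of this pure Newton iteration; the gradient and Hessian callables are supplied by the caller (the names are my own):

```python
import numpy as np

def newton_method(grad, hess, x0, eps=1e-10, max_iter=100):
    """Pure Newton iteration x <- x - F(x)^{-1} g(x), stopped on the gradient norm."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - np.linalg.solve(hess(x), g)   # solve F(x) d = g rather than inverting F(x)
    return x
```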

### Analysis

Cons:

  1. If $\mathbf{F}\left(\mathbf{x}^{(k)}\right)$ is not positive definite, the algorithm might not head in a direction of decreasing objective value.
  2. Even if it is positive definite, it may happen that $f\left(\mathbf{x}^{(k)}\right)\le f\left(\mathbf{x}^{(k+1)}\right)$, e.g. when the starting point is far away from $\mathbf{x}^*$.

Pros:

  1. When the starting point is near $\mathbf{x}^*$, Newton’s method converges to $\mathbf{x}^*$ with order of convergence at least 2.

    Note that we only need to assume $f\in C^3$, $\nabla f(\mathbf{x}^*)=\mathbf{0}$, and that $\mathbf{F}(\mathbf{x}^*)$ is invertible. So Newton’s method does not necessarily converge to a local minimum.

  2. The descent property ($f\left(\mathbf{x}^{(k)}\right)\ge f\left(\mathbf{x}^{(k+1)}\right)$) holds with a small modification to the algorithm (the damped Newton’s method).

    If $\mathbf{F}\left(\mathbf{x}^{(k)}\right) \succ 0$ and $\mathbf{g}^{(k)}=\nabla f\left(\mathbf{x}^{(k)}\right)\ne \mathbf{0}$, then the direction

    $$\mathbf{d}^{(k)}=-\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}=\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}$$

    is a descent direction, in the sense that $\exists\hat{\alpha}>0$ such that $\forall\alpha\in(0,\hat{\alpha})$,

    $$f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})<f(\mathbf{x}^{(k)})$$

    Proof. Let

    $$\phi(\alpha) =f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})$$

    We need to show that $\exists\hat{\alpha}>0$ such that $\forall\alpha\in(0,\hat{\alpha})$, $\phi(\alpha)<\phi(0)$. Given $\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \succ 0$,

    $$\phi'(\alpha) =\nabla f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})^\top \mathbf{d}^{(k)},\qquad \phi'(0) =\nabla f(\mathbf{x}^{(k)})^\top \mathbf{d}^{(k)} =-\mathbf{g}^{(k)\top}\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}<0$$

    Hence, since $\phi'(0)<0$, there must exist an $\hat{\alpha}>0$ such that $\phi(\alpha)<\phi(0)$ for all $\alpha\in(0,\hat{\alpha})$.

### Damped Newton’s method

Modify the update rule with a step size $\alpha_k$ to ensure the descent property:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\alpha_k\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)},\qquad \alpha_k = \arg\min_{\alpha\ge 0} f\left(\mathbf{x}^{(k)}-\alpha\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1} \mathbf{g}^{(k)}\right)$$

Drawback: computing $\mathbf{F}\left(\mathbf{x}^{(k)}\right)$ and solving $\mathbf{F}\left(\mathbf{x}^{(k)}\right)\mathbf{d}^{(k)}=-\mathbf{g}^{(k)}$ are computationally expensive.

### Levenberg-Marquardt modification

To ensure the descent property even when the Hessian is not positive definite, modify the update rule as:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\alpha_k\left(\mathbf{F}\left(\mathbf{x}^{(k)}\right)+\mu_k\mathbf{I}\right)^{-1}\mathbf{g}^{(k)},\quad\mu_k\ge 0$$

The algorithm is actually locally minimizing

$$f(\mathbf{x})+\frac{\mu_k}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|^2$$

When $\mu_k\to 0$, it approaches the pure Newton’s method; when $\mu_k\to\infty$, it approaches gradient descent with a small step size. In practice, start from a small $\mu_k$ and increase it until the descent property is satisfied.
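A minimal sketch of one Levenberg-Marquardt-modified step following this recipe, with $\alpha_k$ fixed to 1 for brevity and $\mu$ increased until the step actually decreases $f$ (the initial value and growth factor are my own choices):

```python
import numpy as np

def lm_newton_step(f, grad, hess, x, mu=1e-4, grow=10.0, max_tries=20):
    """Return x+ = x - (F(x) + mu I)^{-1} g(x), increasing mu until f decreases."""
    g, F = grad(x), hess(x)
    fx = f(x)
    for _ in range(max_tries):
        x_new = x - np.linalg.solve(F + mu * np.eye(len(x)), g)
        if f(x_new) < fx:           # descent property satisfied
            return x_new, mu
        mu *= grow                  # otherwise make the step more gradient-like
    return x, mu
```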

### Case: Nonlinear least-squares

$$\min_\mathbf{x}\sum_{i=1}^m \left(r_i(\mathbf{x}) \right)^2$$

Here, the $r_i$ are given functions. Define $\mathbf{r} = [r_1,\cdots,r_m]^\top$; then the objective function is $f(\mathbf{x}) = \mathbf{r}(\mathbf{x})^\top\mathbf{r}(\mathbf{x})$. Denote the Jacobian matrix of $\mathbf{r}$ by

$$\mathbf{J}(\mathbf{x})=\left[\begin{array}{ccc} \frac{\partial r_{1}}{\partial x_{1}}(\mathbf{x}) & \cdots & \frac{\partial r_{1}}{\partial x_{n}}(\mathbf{x}) \\ \vdots & & \vdots \\ \frac{\partial r_{m}}{\partial x_{1}}(\mathbf{x}) & \cdots & \frac{\partial r_{m}}{\partial x_{n}}(\mathbf{x}) \end{array}\right]$$

Then we can compute the gradient and Hessian of $f$:

$$(\nabla f(\mathbf{x}))_{j} =\frac{\partial f}{\partial x_{j}}(\mathbf{x}) =2 \sum_{i=1}^{m} r_{i}(\mathbf{x}) \frac{\partial r_{i}}{\partial x_{j}}(\mathbf{x}),\qquad \nabla f(\mathbf{x}) = 2\mathbf{J}(\mathbf{x})^\top \mathbf{r}(\mathbf{x})$$

$$\begin{aligned} \frac{\partial^{2} f}{\partial x_{k} \partial x_{j}}(\mathbf{x}) &=\frac{\partial}{\partial x_{k}}\left(2 \sum_{i=1}^{m} r_{i}(\mathbf{x}) \frac{\partial r_{i}}{\partial x_{j}}(\mathbf{x})\right) \\ &=2 \sum_{i=1}^{m}\left(\frac{\partial r_{i}}{\partial x_{k}}(\mathbf{x}) \frac{\partial r_{i}}{\partial x_{j}}(\mathbf{x})+r_{i}(\mathbf{x}) \frac{\partial^{2} r_{i}}{\partial x_{k} \partial x_{j}}(\mathbf{x})\right) \end{aligned}$$

Letting $\mathbf{S}(\mathbf{x})$ be the matrix whose $(k, j)$-th component is

$$\sum_{i=1}^{m} r_{i}(\mathbf{x}) \frac{\partial^{2} r_{i}}{\partial x_{k} \partial x_{j}}(\mathbf{x})$$

We write the Hessian matrix as

$$\mathbf{F}(\mathbf{x})=2\left(\mathbf{J}(\mathbf{x})^{T} \mathbf{J}(\mathbf{x})+\mathbf{S}(\mathbf{x})\right)$$

Therefore, Newton’s method applied to the nonlinear least-squares problem is given by

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\left(\mathbf{J}(\mathbf{x})^{T} \mathbf{J}(\mathbf{x})+\mathbf{S}(\mathbf{x})\right)^{-1} \mathbf{J}(\mathbf{x})^{T} \mathbf{r}(\mathbf{x})$$

When we ignore the second-derivative term $\mathbf{S}$, we obtain the Gauss-Newton method:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\left(\mathbf{J}(\mathbf{x})^{T} \mathbf{J}(\mathbf{x})\right)^{-1} \mathbf{J}(\mathbf{x})^{T} \mathbf{r}(\mathbf{x})$$
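A minimal Gauss-Newton sketch in the notation above; the residual vector $\mathbf{r}$ and its Jacobian $\mathbf{J}$ are supplied by the caller, and the exponential-fitting example is made up for illustration:

```python
import numpy as np

def gauss_newton(r, J, x0, n_iter=20):
    """Gauss-Newton: x <- x - (J^T J)^{-1} J^T r, ignoring the S(x) term."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        Jx, rx = J(x), r(x)
        x = x - np.linalg.solve(Jx.T @ Jx, Jx.T @ rx)
    return x

# Made-up example: fit y ~ a * exp(b * t) to data, with parameters x = (a, b).
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)
r = lambda x: x[0] * np.exp(x[1] * t) - y
J = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(r, J, x0=[1.0, -1.0]))   # should approach (2.0, -1.5)
```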

## Conjugate direction methods

Motivation. Intermediate between steepest descent and Newton’s method:

  1. Solves quadratics of $n$ variables in $n$ steps.
  2. Requires no Hessian.
  3. Requires no matrix inversion and no storage of an $n\times n$ matrix.

$\mathbf{Q}$-conjugacy: let $\mathbf{Q}$ be a real symmetric $n\times n$ matrix. Directions $\mathbf{d}^{(0)},\mathbf{d}^{(1)},\cdots,\mathbf{d}^{(m)}$ are $\mathbf{Q}$-conjugate if $\forall i\ne j$, $\mathbf{d}^{(i)\top} \mathbf{Q} \mathbf{d}^{(j)}=0$.

Lemma. If $\mathbf{Q}\succ\mathbf{0}$ and the directions $\mathbf{d}^{(0)},\mathbf{d}^{(1)},\cdots,\mathbf{d}^{(k)}$, $k\le n-1$, are nonzero and $\mathbf{Q}$-conjugate, then they are linearly independent.

### 🌟Quadratic problems

$$f(\mathbf{x})=\frac{1}{2} \mathbf{x}^{T} \mathbf{Q} \mathbf{x}-\mathbf{x}^{T} \mathbf{b},\quad \mathbf{Q}=\mathbf{Q}^\top\succ\mathbf{0}$$

Motivation. Given $n$ nonzero $\mathbf{Q}$-conjugate directions (linearly independent by the lemma), we can write

$$\mathbf{x}^*-\mathbf{x}^{(0)}=\alpha_0\mathbf{d}^{(0)}+\cdots +\alpha_{n-1}\mathbf{d}^{(n-1)}$$

Hence, we need to find the $n$ directions and their corresponding ratios $\alpha_k$.

#### Ratio

Given the direction $\mathbf{d}^{(k)}$, we can compute its ratio $\alpha_k$. Multiplying the above equation by $\mathbf{d}^{(k)\top}\mathbf{Q}$, we have

$$\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^*-\mathbf{x}^{(0)})=\alpha_k\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}\ \Rightarrow\ \alpha_k=\frac{\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^*-\mathbf{x}^{(0)})}{\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}}$$

Since

$$\mathbf{x}^{(k)}-\mathbf{x}^{(0)}=\alpha_0\mathbf{d}^{(0)}+\cdots +\alpha_{k-1}\mathbf{d}^{(k-1)},\qquad \mathbf{g}^{(k)} = \mathbf{Q}\mathbf{x}^{(k)}-\mathbf{b},\quad \mathbf{Q}\mathbf{x}^*=\mathbf{b}$$

🌟We have (noting that $\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^{(k)}-\mathbf{x}^{(0)})=0$ by $\mathbf{Q}$-conjugacy)

$$\alpha_k=\frac{\mathbf{d}^{(k)\top}\mathbf{Q}(\mathbf{x}^*-\mathbf{x}^{(k)})}{\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}}=-\frac{\mathbf{d}^{(k)\top}\mathbf{g}^{(k)}}{\mathbf{d}^{(k)\top}\mathbf{Q}\mathbf{d}^{(k)}}$$

#### Direction

Lemma. By mathematical induction, we have

$$\begin{aligned} \left\langle\mathbf{g}^{(k+1)}, \mathbf{d}^{(i)}\right\rangle=&\left\langle\mathbf{Q} \mathbf{x}^{(k+1)}-\mathbf{b}, \mathbf{d}^{(i)}\right\rangle\\=&\left\langle\mathbf{Q}\left(\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}\right)-\mathbf{b}, \mathbf{d}^{(i)}\right\rangle \\ =&\left\langle\mathbf{g}^{(k)}+\alpha_{k} \mathbf{Q} \mathbf{d}^{(k)}, \mathbf{d}^{(i)}\right\rangle=0\\ \end{aligned}\\ \Rightarrow \mathbf{g}^{(k+1)\top} \mathbf{d}^{(i)}=0,\quad 0\le k\le n-1,\ 0\le i\le k$$

🌟Directions.

$$\left\langle\mathbf{g}^{(k+1)}, \mathbf{g}^{(i)}\right\rangle=\left\langle\mathbf{g}^{(k+1)}, \beta_{i-1} \mathbf{d}^{(i-1)}-\mathbf{d}^{(i)}\right\rangle=0, \quad \forall\, 0 \leq i \leq k$$

Then, by mathematical induction,

$$\begin{aligned} \left\langle\mathbf{d}^{(k+1)}, \mathbf{Q} \mathbf{d}^{(i)}\right\rangle=&\left\langle-\mathbf{g}^{(k+1)}+\beta_{k} \mathbf{d}^{(k)}, \mathbf{Q} \mathbf{d}^{(i)}\right\rangle\\=&-\frac{1}{\alpha_{i}}\left\langle\mathbf{g}^{(k+1)}, \mathbf{Q} \Delta \mathbf{x}^{(i)}\right\rangle \\ =&-\frac{1}{\alpha_{i}}\left\langle\mathbf{g}^{(k+1)}, \Delta \mathbf{g}^{(i)}\right\rangle=0 \end{aligned}$$

#### Algorithm

Given a starting point $\mathbf{x}^{(0)}$ (a sketch in code follows the steps below):

  1. Compute $\mathbf{g}^{(0)} =\nabla f\left(\mathbf{x}^{(0)}\right)=\mathbf{Q} \mathbf{x}^{(0)}-\mathbf{b}$. If $\mathbf{g}^{(0)}=\mathbf{0}$, stop; else set $\mathbf{d}^{(0)}=-\mathbf{g}^{(0)}$.
  2. Take $\alpha_{k} =-\frac{\mathbf{g}^{(k) \top} \mathbf{d}^{(k)}}{\mathbf{d}^{(k) \top} \mathbf{Q} \mathbf{d}^{(k)}}$. Update $\mathbf{x}^{(k+1)} =\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}$.
  3. Compute the gradient $\mathbf{g}^{(k+1)} =\nabla f\left(\mathbf{x}^{(k+1)}\right)=\mathbf{Q} \mathbf{x}^{(k+1)}-\mathbf{b}$. If $\mathbf{g}^{(k+1)}=\mathbf{0}$, stop.
  4. Take $\beta_{k} =\frac{\mathbf{g}^{(k+1) \top} \mathbf{Q}\mathbf{d}^{(k)}}{\mathbf{d}^{(k) \top} \mathbf{Q} \mathbf{d}^{(k)}}$. Get a new $\mathbf{Q}$-conjugate direction $\mathbf{d}^{(k+1)}=-\mathbf{g}^{(k+1)}+\beta_k\mathbf{d}^{(k)}$.
  5. Set $k:=k+1$ and return to step 2.
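A minimal sketch of the algorithm above for the quadratic problem, following the $\alpha_k$ and $\beta_k$ formulas from the steps:

```python
import numpy as np

def conjugate_gradient(Q, b, x0, eps=1e-10):
    """Conjugate gradient for f(x) = 1/2 x^T Q x - b^T x with Q symmetric positive definite."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x - b
    d = -g
    for _ in range(len(b)):                  # at most n steps in exact arithmetic
        if np.linalg.norm(g) <= eps:
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)          # exact step along d
        x = x + alpha * d
        g = Q @ x - b
        beta = (g @ Qd) / (d @ Qd)           # makes d_{k+1} Q-conjugate to d_k
        d = -g + beta * d
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b, np.zeros(2)), np.linalg.solve(Q, b))
```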

### Non-quadratic problems

Motivation. Treat the quadratic function $f(\mathbf{x})=\frac{1}{2} \mathbf{x}^{T} \mathbf{Q} \mathbf{x}-\mathbf{x}^{T} \mathbf{b}$ as a second-order Taylor approximation of a general objective. Then $\mathbf{Q}$ plays the role of the Hessian, which would need to be re-evaluated at each iteration.

To be Hessian-free, modify how $\alpha_k$ and $\beta_k$ are computed:

$\alpha_k=\arg\min_{\alpha>0} f(\mathbf{x}^{(k)}+\alpha\mathbf{d}^{(k)})$ can be computed (approximately) by a line search.

Replace $\mathbf{Q}\mathbf{d}^{(k)}$ in $\beta_k$ with $\frac{\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}}{\alpha_k}$:

$$\begin{aligned} \beta_{k} =&\frac{\mathbf{g}^{(k+1) \top} \mathbf{Q}\mathbf{d}^{(k)}}{\mathbf{d}^{(k) \top} \mathbf{Q} \mathbf{d}^{(k)}}\\ =&\frac{\mathbf{g}^{(k+1) \top} [\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}]}{\mathbf{d}^{(k) \top}[\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}]}\quad\text{(Hestenes-Stiefel formula)}\\ =&\frac{\mathbf{g}^{(k+1) \top} [\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}]}{\mathbf{g}^{(k)\top}\mathbf{g}^{(k)}}\quad\text{(Polak-Ribière formula)}\\ =&\frac{\mathbf{g}^{(k+1) \top} \mathbf{g}^{(k+1)}}{\mathbf{g}^{(k)\top}\mathbf{g}^{(k)}}\quad\text{(Fletcher-Reeves formula)} \end{aligned}$$

The last formula follows from the lemma above (the gradients are mutually orthogonal) and the definition of $\mathbf{d}^{(k)}$. A sketch of the resulting algorithm appears after the notes below.

  1. Non-quadratic problems might not converge in $n$ steps, so reinitialize the direction vector to the negative gradient every few iterations (e.g., every $n$ or $n+1$ iterations).
  2. If the line search for $\alpha$ is inaccurate, the Hestenes-Stiefel formula tends to work better; the best choice of formula depends on the objective function.
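A minimal sketch of nonlinear conjugate gradient with the Fletcher-Reeves choice of $\beta_k$ and a restart every $n$ iterations; the step size is delegated to the `backtracking_line_search` routine sketched in the step-size section (an assumption of this sketch, since the notes leave the line search unspecified):

```python
import numpy as np

def nonlinear_cg(f, grad, x0, n_iter=200, eps=1e-8):
    """Nonlinear CG with Fletcher-Reeves beta and a restart every n iterations."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    g = grad(x)
    d = -g
    for k in range(n_iter):
        if np.linalg.norm(g) <= eps:
            break
        alpha = backtracking_line_search(f, grad, x, d)   # inexact line search
        x = x + alpha * d
        g_new = grad(x)
        if (k + 1) % n == 0:
            d = -g_new                                    # periodic restart
        else:
            beta = (g_new @ g_new) / (g @ g)              # Fletcher-Reeves formula
            d = -g_new + beta * d
        g = g_new
    return x
```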

## Quasi-Newton methods

Motivation. To avoid computing $\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1}$ in the damped Newton’s method, replace it with an approximation $\mathbf{H}_k$:

$$\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}-\alpha_k\mathbf{H}_k\mathbf{g}^{(k)},\qquad \mathbf{H}_k\approx\mathbf{F}\left(\mathbf{x}^{(k)}\right)^{-1}$$

To ensure a decrease in $f$:

$$\begin{aligned} f\left(\mathbf{x}^{(k+1)}\right) &=f\left(\mathbf{x}^{(k)}\right)+\mathbf{g}^{(k) T}\left(\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right)+o\left(\left\|\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right\|\right) \\ &=f\left(\mathbf{x}^{(k)}\right)-\alpha_k \mathbf{g}^{(k) T} \mathbf{H}_{k} \mathbf{g}^{(k)}+o\left(\alpha_k\left\|\mathbf{H}_{k} \mathbf{g}^{(k)}\right\|\right) \end{aligned}$$

so for small $\alpha_k$ we must have

$$\mathbf{g}^{(k) T} \mathbf{H}_{k} \mathbf{g}^{(k)}>0$$

A simple way to ensure this requirement is to take $\mathbf{H}_k\succ \mathbf{0}$.

### Approximate the inverse Hessian

Suppose that $\mathbf{F}(\mathbf{x})=\mathbf{Q},\ \forall \mathbf{x}$, with $\mathbf{Q}=\mathbf{Q}^\top$. Then

$$\Delta \mathbf{g}^{(k)} = \mathbf{g}^{(k+1)}-\mathbf{g}^{(k)}=\mathbf{Q}\left(\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\right) = \mathbf{Q}\Delta \mathbf{x}^{(k)}$$

Therefore, we require the approximation $\mathbf{H}_{k+1}$ to satisfy

$$\mathbf{H}_{k+1}\Delta\mathbf{g}^{(i)}=\Delta\mathbf{x}^{(i)},\quad 0\le i\le k$$

Hence, if $\mathbf{H}_n$ satisfies $\mathbf{H}_n\Delta\mathbf{g}^{(i)}=\Delta\mathbf{x}^{(i)}$ for $0\le i\le n-1$ (with linearly independent $\Delta\mathbf{g}^{(i)}$), then $\mathbf{H}_n=\mathbf{Q}^{-1}$, and the algorithm is guaranteed to solve a problem with a quadratic objective in at most $n+1$ steps.

### Algorithm

$$\begin{aligned} \mathbf{d}^{(k)} &=-\mathbf{H}_{k} \mathbf{g}^{(k)} \\ \alpha_{k} &=\arg \min _{\alpha \geq 0} f\left(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)}\right) \\ \mathbf{x}^{(k+1)} &=\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)} \end{aligned}$$

where the matrices $\mathbf{H}_{0}, \mathbf{H}_{1}, \cdots$ are symmetric. In the quadratic case, they are required to satisfy

$$\mathbf{H}_{k+1} \Delta \mathbf{g}^{(i)}=\Delta \mathbf{x}^{(i)},\quad 0 \leq i \leq k$$

How do we derive an $\mathbf{H}_{k+1}$ from $\mathbf{H}_k$ that satisfies the above equation? We consider three specific updating formulas.

  1. Rank-one correction formula (single-rank symmetric (SRS) algorithm)

    $$\mathbf{H}_{k+1} = \mathbf{H}_{k}+a_k \mathbf{z}^{(k)}\mathbf{z}^{(k)\top}$$

    The correction term satisfies $\operatorname{rank}(\mathbf{z}^{(k)}\mathbf{z}^{(k)\top})=1$.

    $$\begin{array}{c} \mathbf{H}_{k+1} \Delta \mathbf{g}^{(k)}=\left(\mathbf{H}_{k}+a_{k} \mathbf{z}^{(k)} \mathbf{z}^{(k) \top}\right) \Delta \mathbf{g}^{(k)}=\Delta \mathbf{x}^{(k)} \\ \mathbf{z}^{(k)}=\frac{\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}}{a_{k} \mathbf{z}^{(k) \top} \Delta \mathbf{g}^{(k)}} \\ \mathbf{H}_{k+1}=\mathbf{H}_{k}+\frac{\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)^{\top}}{a_{k}\left(\mathbf{z}^{(k) \top} \Delta \mathbf{g}^{(k)}\right)^{2}} \end{array}$$

    We know from the first equation that $\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}=\left(a_{k} \mathbf{z}^{(k) \top} \Delta \mathbf{g}^{(k)}\right) \mathbf{z}^{(k)}$; multiplying it on the left by $\Delta\mathbf{g}^{(k)\top}$ lets us replace the denominator:

    $$\mathbf{H}_{k+1}=\mathbf{H}_{k}+\frac{\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)^{\top}}{\Delta \mathbf{g}^{(k)\top}\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)}$$

    Algorithm. Start with $k=0$ and a real symmetric positive definite $\mathbf{H}_0$.

    1. If $\mathbf{g}^{(k)}=\mathbf{0}$, stop; else set $\mathbf{d}^{(k)}=-\mathbf{H}_k\mathbf{g}^{(k)}$.

    2. Take $\alpha_{k} =\arg\min_{\alpha\ge 0}f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})$. Update $\mathbf{x}^{(k+1)} =\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}$.

    3. Compute the gradient and the next inverse-Hessian approximation, then return to step 1 with $k:=k+1$:

      $$\begin{aligned} \Delta \mathbf{x}^{(k)} &=\alpha_{k} \mathbf{d}^{(k)} \\ \Delta \mathbf{g}^{(k)} &=\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)} \\ \mathbf{H}_{k+1} &=\mathbf{H}_{k}+\frac{\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)^{\top}}{\Delta \mathbf{g}^{(k) \top}\left(\Delta \mathbf{x}^{(k)}-\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right)} \end{aligned}$$

  2. Davidon-Fletcher-Powell (DFP) Algorithm

    Rank two update

    Algorithm. Start with $k=0$ and a real symmetric positive definite $\mathbf{H}_0$.

    1. If $\mathbf{g}^{(k)}=\mathbf{0}$, stop; else set $\mathbf{d}^{(k)}=-\mathbf{H}_k\mathbf{g}^{(k)}$.

    2. Take $\alpha_{k} =\arg\min_{\alpha\ge 0}f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})$. Update $\mathbf{x}^{(k+1)} =\mathbf{x}^{(k)}+\alpha_{k} \mathbf{d}^{(k)}$.

    3. Compute the gradient and the next inverse-Hessian approximation, then return to step 1 with $k:=k+1$:

      $$\begin{aligned} \Delta \mathbf{x}^{(k)} &=\alpha_{k} \mathbf{d}^{(k)} \\ \Delta \mathbf{g}^{(k)} &=\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)} \\ \mathbf{H}_{k+1} &=\mathbf{H}_{k}+\frac{\Delta \mathbf{x}^{(k)} \Delta \mathbf{x}^{(k) T}}{\Delta \mathbf{x}^{(k) T} \Delta \mathbf{g}^{(k)}}-\frac{\left[\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right]\left[\mathbf{H}_{k} \Delta \mathbf{g}^{(k)}\right]^{T}}{\Delta \mathbf{g}^{(k) T} \mathbf{H}_{k} \Delta \mathbf{g}^{(k)}} \end{aligned}$$

  3. Broyden-Fletcher-Goldfarb-Shanno (BFGS) Algorithm

    Note that so far we approximate $\mathbf{Q}^{-1}$ by $\mathbf{H}_k$; we can also approximate $\mathbf{Q}$ itself by some $\mathbf{B}_k$ and take the inverse at the end:

    $$\mathbf{H}_{k+1} \Delta \mathbf{g}^{(i)}=\Delta \mathbf{x}^{(i)},\ 0 \leq i \leq k \qquad\text{vs.}\qquad \mathbf{B}_{k+1} \Delta \mathbf{x}^{(i)}=\Delta \mathbf{g}^{(i)},\ 0 \leq i \leq k$$

    The BFGS update of $\mathbf{B}_k$ is

    $$\mathbf{B}_{k+1} =\mathbf{B}_{k}+\frac{\Delta \mathbf{g}^{(k)} \Delta \mathbf{g}^{(k) T}}{\Delta \mathbf{g}^{(k) T} \Delta \mathbf{x}^{(k)}}-\frac{\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]^{T}}{\Delta \mathbf{x}^{(k) T} \mathbf{B}_{k} \Delta \mathbf{x}^{(k)}}$$

    To obtain an approximation of the inverse Hessian, we invert:

    $$\mathbf{H}_{k+1} = \mathbf{B}_{k+1}^{-1} =\left(\mathbf{B}_{k}+\frac{\Delta \mathbf{g}^{(k)} \Delta \mathbf{g}^{(k) T}}{\Delta \mathbf{g}^{(k) T} \Delta \mathbf{x}^{(k)}}-\frac{\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]\left[\mathbf{B}_{k} \Delta \mathbf{x}^{(k)}\right]^{T}}{\Delta \mathbf{x}^{(k) T} \mathbf{B}_{k} \Delta \mathbf{x}^{(k)}}\right)^{-1}$$

    This inverse can be computed using the Sherman-Morrison-Woodbury formula:

    $$\left(\mathbf{A}+\mathbf{U C V}^{T}\right)^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1} \mathbf{U}\left(\mathbf{C}^{-1}+\mathbf{V}^{T} \mathbf{A}^{-1} \mathbf{U}\right)^{-1} \mathbf{V}^{T} \mathbf{A}^{-1}$$

    where $\mathbf{A},\mathbf{C}$ are invertible and the sizes of $\mathbf{U},\mathbf{C},\mathbf{V}$ are compatible.

    Hence, the update of $\mathbf{H}$ is (a sketch follows after this list):

    $$\mathbf{H}_{k+1}^{BFGS}=\mathbf{H}_{k}+\left(1+\frac{\Delta \mathbf{g}^{(k) \top} \mathbf{H}_{k} \Delta \mathbf{g}^{(k)}}{\Delta \mathbf{g}^{(k) \top} \Delta \mathbf{x}^{(k)}}\right) \frac{\Delta \mathbf{x}^{(k)} \Delta \mathbf{x}^{(k) \top}}{\Delta \mathbf{x}^{(k) \top} \Delta \mathbf{g}^{(k)}}-\frac{\mathbf{H}_{k} \Delta \mathbf{g}^{(k)} \Delta \mathbf{x}^{(k) \top}+\left(\mathbf{H}_{k} \Delta \mathbf{g}^{(k)} \Delta \mathbf{x}^{(k) \top}\right)^{\top}}{\Delta \mathbf{g}^{(k) \top} \Delta \mathbf{x}^{(k)}}$$
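A minimal sketch of the BFGS inverse-Hessian update above, with `dx` $=\Delta\mathbf{x}^{(k)}$ and `dg` $=\Delta\mathbf{g}^{(k)}$:

```python
import numpy as np

def bfgs_update(H, dx, dg):
    """BFGS update of the inverse-Hessian approximation H from dx = x_{k+1}-x_k, dg = g_{k+1}-g_k."""
    dgdx = float(dg @ dx)                    # Delta g^T Delta x; should be > 0 (curvature condition)
    Hdg = H @ dg
    term1 = (1.0 + (dg @ Hdg) / dgdx) * np.outer(dx, dx) / dgdx
    term2 = (np.outer(Hdg, dx) + np.outer(dx, Hdg)) / dgdx
    return H + term1 - term2
```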

### 🌟Limited-Memory BFGS

$$\mathbf{H}_{k+1}=\mathbf{V}_{k}^{T} \mathbf{H}_{k} \mathbf{V}_{k}+\rho_{k} \Delta \mathbf{x}_{k} \Delta \mathbf{x}_{k}^{T},\qquad \rho_{k}=\frac{1}{\Delta \mathbf{g}_{k}^{T} \Delta \mathbf{x}_{k}}, \quad \mathbf{V}_{k}=\mathbf{I}-\rho_{k} \Delta \mathbf{g}_{k} \Delta \mathbf{x}_{k}^{T}$$

We only need to store $m$ pairs of vectors $(\Delta\mathbf{x}_i,\Delta\mathbf{g}_i)$ instead of the full matrix: $\mathbf{H}_k$ is only ever used to compute the vector $\mathbf{H}_k\mathbf{g}_k$, and the stored pairs are enough for that computation.

$$\begin{aligned} \mathbf{H}_{k}=&\left(\mathbf{V}_{k-1}^{T} \cdots \mathbf{V}_{k-m}^{T}\right) \mathbf{H}_{k}^{0}\left(\mathbf{V}_{k-m} \cdots \mathbf{V}_{k-1}\right) \\ &+\rho_{k-m}\left(\mathbf{V}_{k-1}^{T} \cdots \mathbf{V}_{k-m+1}^{T}\right) \Delta \mathbf{x}_{k-m} \Delta \mathbf{x}_{k-m}^{T}\left(\mathbf{V}_{k-m+1} \cdots \mathbf{V}_{k-1}\right) \\ &+\rho_{k-m+1}\left(\mathbf{V}_{k-1}^{T} \cdots \mathbf{V}_{k-m+2}^{T}\right) \Delta \mathbf{x}_{k-m+1} \Delta \mathbf{x}_{k-m+1}^{T}\left(\mathbf{V}_{k-m+2} \cdots \mathbf{V}_{k-1}\right) \\ &+\cdots \\ &+\rho_{k-1} \Delta \mathbf{x}_{k-1} \Delta \mathbf{x}_{k-1}^{T} \end{aligned}$$

From this expression we can design a two-loop recursion to compute $\mathbf{H}_k\mathbf{g}_k$:

$$\mathbf{H}_{k}\mathbf{g}_{k}=\mathbf{V}_{k-1}^{T} \mathbf{H}_{k-1} (\mathbf{V}_{k-1}\mathbf{g}_{k})+ \Delta \mathbf{x}_{k-1}(\rho_{k-1} \Delta \mathbf{x}_{k-1}^{T}\mathbf{g}_{k})$$

Loop 1: compute $\{\alpha_i\}$ and $\{\mathbf{q}_i\}$ for $i=k-1,\cdots,k-m$.

$$\alpha_{k-1}:=\rho_{k-1} \Delta \mathbf{x}_{k-1}^{T}\mathbf{g}_{k}, \quad\alpha_i=\rho_{i} \Delta \mathbf{x}_{i}^{T}\mathbf{q}_{i+1}\\ \mathbf{q}_k:=\mathbf{g}_{k}, \quad\mathbf{q}_{i-1}=\mathbf{V}_{i-1}\mathbf{q}_i \\ \Rightarrow\mathbf{q}_i=\mathbf{V}_i\mathbf{q}_{i+1}= (\mathbf{I}-\rho_{i} \Delta \mathbf{g}_{i} \Delta \mathbf{x}_{i}^{T})\mathbf{q}_{i+1}= \mathbf{q}_{i+1}-\alpha_i\Delta\mathbf{g}_i$$

Loop 2: compute $\mathbf{p}_i$ for $i=k-m,\cdots,k$. Finally $\mathbf{H}_k\mathbf{g}_k=\mathbf{p}_k$.

$$\mathbf{p}_{k-m}:=\mathbf{H}_{k}^0\mathbf{q}_{k-m}, \quad \beta=\rho_i\Delta\mathbf{g}_i^T\mathbf{p}_i,\qquad \mathbf{p}_{i+1}=\mathbf{V}_{i}^T\mathbf{p}_i+\alpha_i\Delta\mathbf{x}_i = \mathbf{p}_i+(\alpha_i-\beta)\Delta\mathbf{x}_i$$


The choice of $\mathbf{H}_k^0$ could be

$$\mathbf{H}_k^0=\gamma_k\mathbf{I},\quad \gamma_{k}=\frac{\Delta \mathbf{x}_{k-1}^{T} \Delta \mathbf{g}_{k-1}}{\Delta \mathbf{g}_{k-1}^{T} \Delta \mathbf{g}_{k-1}}$$
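A minimal sketch of the two-loop recursion above, computing $\mathbf{H}_k\mathbf{g}_k$ from the $m$ most recent pairs $(\Delta\mathbf{x}_i,\Delta\mathbf{g}_i)$, stored oldest first, with the scaled identity $\mathbf{H}_k^0=\gamma_k\mathbf{I}$:

```python
import numpy as np

def lbfgs_direction(g, dx_list, dg_list):
    """Two-loop recursion: return p = H_k g from stored pairs (dx_i, dg_i), oldest first."""
    rhos = [1.0 / (dg @ dx) for dx, dg in zip(dx_list, dg_list)]
    q = np.array(g, dtype=float)
    alphas = []
    # Loop 1: newest pair to oldest
    for dx, dg, rho in zip(reversed(dx_list), reversed(dg_list), reversed(rhos)):
        a = rho * (dx @ q)
        alphas.append(a)
        q = q - a * dg
    # Initial approximation H_k^0 = gamma_k * I
    gamma = (dx_list[-1] @ dg_list[-1]) / (dg_list[-1] @ dg_list[-1])
    p = gamma * q
    # Loop 2: oldest pair to newest
    for dx, dg, rho, a in zip(dx_list, dg_list, rhos, reversed(alphas)):
        beta = rho * (dg @ p)
        p = p + (a - beta) * dx
    return p   # the quasi-Newton search direction is then -p
```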

Use a line search based on the weak Wolfe conditions (sufficient decrease, plus a curvature condition requiring the directional derivative to increase enough):

$$\begin{aligned} f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right) & \leq f\left(\mathbf{x}_{k}\right)+c_{1} \alpha_{k} \nabla f_{k}^{T} \mathbf{p}_{k} \\ \nabla f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right)^{T} \mathbf{p}_{k} & \geq c_{2} \nabla f_{k}^{T} \mathbf{p}_{k} \end{aligned}$$

with $0<c_{1}<c_{2}<1$, or the strong Wolfe conditions (which additionally require the directional derivative not to be too large in magnitude):

$$\begin{aligned} f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right) & \leq f\left(\mathbf{x}_{k}\right)+c_{1} \alpha_{k} \nabla f_{k}^{T} \mathbf{p}_{k} \\ \left|\nabla f\left(\mathbf{x}_{k}+\alpha_{k} \mathbf{p}_{k}\right)^{T} \mathbf{p}_{k}\right| & \leq c_{2}\left|\nabla f_{k}^{T} \mathbf{p}_{k}\right| \end{aligned}$$


## Majorization minimization

### Basic framework

Original problem:

$$\min_\mathbf{x} f(\mathbf{x})\quad \text{s.t.}\ \mathbf{x}\in C$$

Surrogate problem at iteration $k$:

$$\min_\mathbf{x} g_k(\mathbf{x})\quad \text{s.t.}\ \mathbf{x}\in C$$

where $g_k$ satisfies the global majorant conditions:

  1. $f(\mathbf{x})\le g_k(\mathbf{x}),\ \forall\mathbf{x}\in C$
  2. $f(\mathbf{x}_k)= g_k(\mathbf{x}_k)$

so that $\mathbf{x}_{k+1}=\underset{\mathbf{x}}{\operatorname{argmin}}\, g_{k}(\mathbf{x}) \Longrightarrow f\left(\mathbf{x}_{k+1}\right) \leq g_{k}\left(\mathbf{x}_{k+1}\right) \leq g_{k}\left(\mathbf{x}_{k}\right)=f\left(\mathbf{x}_{k}\right)$.

A local majorant replaces the first requirement with $f(\mathbf{x}_{k+1})\le g_k(\mathbf{x}_{k+1})$.
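A minimal sketch of the resulting MM loop; the surrogate minimizer `argmin_g` is an assumed callable supplied by the caller (a hypothetical interface for illustration):

```python
import numpy as np

def majorization_minimization(f, argmin_g, x0, n_iter=100, tol=1e-10):
    """Generic MM loop: x_{k+1} = argmin_x g_k(x) guarantees f(x_{k+1}) <= f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x_new = argmin_g(x)          # minimize the majorant g_k built at the current x
        if f(x) - f(x_new) <= tol:   # monotone decrease; stop when progress stalls
            return x_new
        x = x_new
    return x

# With the Lipschitz gradient surrogate discussed below and C = R^n,
# argmin_g(x) = x - (1/L) * grad_f(x), so the MM loop reduces to gradient descent.
```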


### Choice of majorant function

  • Lipschitz Gradient Surrogate

    $$\begin{aligned} g_{k}(\mathbf{x})&=f\left(\mathbf{x}_{k}\right)+\left\langle\nabla f\left(\mathbf{x}_{k}\right), \mathbf{x}-\mathbf{x}_{k}\right\rangle+\frac{1}{2 \alpha}\left\|\mathbf{x}-\mathbf{x}_{k}\right\|^{2} \\ \mathbf{x}_{k+1}&=\mathcal{P}_{\mathcal{C}}\left(\mathbf{x}_{k}-\alpha \nabla f\left(\mathbf{x}_{k}\right)\right) \quad\leftarrow\text{projected gradient descent}\\ \mathbf{x}_{k+1}&=\mathbf{x}_{k}-\alpha \nabla f\left(\mathbf{x}_{k}\right),\quad \mathcal{C}=\mathbb{R}^{n}\quad\leftarrow\text{gradient descent} \end{aligned}$$

    Choose $\alpha$ by backtracking line search. If $f$ is $L$-smooth, i.e., $\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|\le L\|\mathbf{x}-\mathbf{y}\|$, then one can choose $\alpha=L^{-1}$.

  • Quadratic Surrogate

    $$g_{k}(\mathbf{x})=f\left(\mathbf{x}_{k}\right)+\left\langle\nabla f\left(\mathbf{x}_{k}\right), \mathbf{x}-\mathbf{x}_{k}\right\rangle+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}_{k}\right)^{T} \mathbf{H}_{k}\left(\mathbf{x}-\mathbf{x}_{k}\right), \text { where } \mathbf{H}_{k} \succ \nabla^{2} f\\ \mathbf{x}_{k+1}=\mathbf{x}_{k}-\mathbf{H}_{k}^{-1} \nabla f\left(\mathbf{x}_{k}\right) \quad\left(\mathcal{C}=\mathbb{R}^{n}\right) \quad \text{(quasi-Newton method)}$$

  • Jensen Surrogate

    $$\min_\mathbf{x} f(\mathbf{x}) =\tilde{f}(\boldsymbol{\theta}^\top \mathbf{x}),\quad \text{s.t.}\ \mathbf{x}\in C,\ \tilde{f} \text{ convex}\\ g_{k}(\mathbf{x})=\sum_{i=1}^{n} w_{i} \tilde{f}\left(\frac{\theta_{i}}{w_{i}}\left(x_{i}-x_{k, i}\right)+\boldsymbol{\theta}^{T} \mathbf{x}_{k}\right)$$

    where $\mathbf{w} \in \mathbb{R}_{+}^{n}$, $\|\mathbf{w}\|_{1}=1$, and $w_{i} \neq 0$ whenever $\theta_{i} \neq 0$.

    • Application of Jensen’s Inequality: EM (Expectation Maximization) Algorithm

      $$-\log \left(\sum_{t=1}^{T} f^{t}(\boldsymbol{\theta})\right) \leq-\sum_{t=1}^{T} \mathbf{w}(t) \log \left(\frac{f^{t}(\boldsymbol{\theta})}{\mathbf{w}(t)}\right)$$

      The EM algorithm minimizes a negative log-likelihood. The right-hand side of this inequality can be interpreted as a majorizing surrogate of the left-hand side.

      Consider maximum likelihood estimation of the parameters of a Gaussian Mixture Model (GMM), whose density is

      $$\sum_{k=1}^{n} \theta_{k} \frac{1}{(2 \pi)^{d / 2}\left|\boldsymbol{\Sigma}_{k}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right)$$

      Given data $\mathbf{x}_{i}, i=1, \cdots, m$, we want to estimate the parameters $\left\{\theta_{k}, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right\}$ by maximizing the log-likelihood:

      $$\max _{\boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\Sigma}} \sum_{i=1}^{m} \log \sum_{k=1}^{n} \theta_{k} \frac{1}{(2 \pi)^{d / 2}\left|\boldsymbol{\Sigma}_{k}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{k}\right)\right)$$

  • Variational Surrogate

    Minimizing a function $h(\mathbf{x},\mathbf{y})$ jointly over $\mathbf{x}$ and $\mathbf{y}$ is equivalent to minimizing its partial minimum over $\mathbf{x}$:

    $$\min _{\mathbf{x}} f(\mathbf{x}), \quad \text { s.t. } \mathbf{x} \in \mathcal{C}, \quad \text { where } f(\mathbf{x})=\min _{\mathbf{y} \in \mathcal{Y}} h(\mathbf{x}, \mathbf{y})\\ g_{k}(\mathbf{x})=h\left(\mathbf{x}, \mathbf{y}_{k}^{*}\right), \quad \mathbf{y}_{k}^{*}=\underset{\mathbf{y} \in \mathcal{Y}}{\operatorname{argmin}}\, h\left(\mathbf{x}_{k}, \mathbf{y}\right).$$