I am taking Convex Optimization this semester (2021 Spring). The notes are based on my understanding of the course materials provided by Prof. Zhouchen Lin. The notes only offer concise definitions and formulas, with few proofs and examples.
From this chapter on, we start to introduce convex optimization algorithms, including descent methods, Newton's method, conjugate direction methods, and majorization minimization. I used to consider gradient descent as the simple basis of neural networks, so I did not pay much attention to it. But it turns out that the underlying theory can be rather deep. It feels like learning the history and construction of today's neural networks. Theory is fascinating in this way.
Unconstrained Optimization
Problems
$$\min_x f(x)$$
where $f:\mathbb{R}^n\to\mathbb{R}$ is convex and twice continuously differentiable. We compute a sequence of points $x^{(k)}$ such that $f(x^{(k)})\to p^*$, and we can check termination by $\|\nabla f(x^{(k)})\|_2\le\epsilon$.
Fixed-point algorithm
Banach Fixed Point Theorem
Let $(X,d)$ be a non-empty complete metric space with a contraction mapping $T:X\to X$:
$$d(T(x),T(y))\le L\,d(x,y),\quad 0\le L<1$$
Then, starting from an arbitrary $x_0$ and iterating $x_n=T(x_{n-1})$, the sequence converges to the unique fixed point: $x_n\to x^*$.
Given the theorem, we know that if $\varphi$ satisfies
$$\|\varphi(x)-\varphi(y)\|\le L\|x-y\|,\quad 0\le L<1,$$
then the iteration $x_n=\varphi(x_{n-1})$ converges to the fixed point of $\varphi$.
Fixed point and gradient descent
If we take $\varphi(x)=x-\alpha\nabla f(x)$, then the fixed point satisfies $x=x-\alpha\nabla f(x)$, i.e. $\nabla f(x^*)=0$.
- $x_{k+1}=x_k-\alpha\nabla f(x_k)$. This is the update rule of gradient descent (see the sketch below).
- $x_k=x_{k+1}+\alpha\nabla f(x_{k+1})$. This means $x_{k+1}=\arg\min_x \alpha f(x)+\frac{1}{2}\|x-x_k\|^2$, which is the proximal operator (an implicit/backward step).
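As a small illustration, here is a minimal sketch of gradient descent viewed as fixed-point iteration of $\varphi(x)=x-\alpha\nabla f(x)$ (the quadratic test function at the bottom is a hypothetical example, not from the course):

```python
import numpy as np

def gradient_descent_fixed_point(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Iterate phi(x) = x - alpha * grad_f(x) until it is (numerically) a fixed point.

    A fixed point of phi satisfies grad_f(x) = 0, i.e. the optimality condition.
    phi is a contraction when f is strongly convex and alpha is small enough.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = x - alpha * grad_f(x)            # x_{k+1} = phi(x_k)
        if np.linalg.norm(x_next - x) <= tol:
            return x_next
        x = x_next
    return x

# Hypothetical example: f(x) = 0.5 x^T Q x - b^T x, so grad_f(x) = Q x - b and x* = Q^{-1} b.
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
print(gradient_descent_fixed_point(lambda x: Q @ x - b, np.zeros(2), alpha=0.2))
print(np.linalg.solve(Q, b))   # reference solution
```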
Strong convexity and implications
Strong convexity of the objective function implies:
$$mI\preceq\nabla^2 f(x)\preceq MI\;\Rightarrow\;\operatorname{cond}\big(\nabla^2 f(x)\big)\le\frac{M}{m}$$
$$f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{m}{2}\|y-x\|_2^2$$
The second inequality can be used to bound $f(x)-p^*$ in terms of $\|\nabla f(x)\|_2$. Its right-hand side is a convex quadratic function of $y$ (for fixed $x$). Setting the gradient with respect to $y$ equal to zero, we find that $\tilde y=x-(1/m)\nabla f(x)$ minimizes the right-hand side. Therefore we have
$$f(y)\ge f(x)+\nabla f(x)^\top(y-x)+\frac{m}{2}\|y-x\|_2^2\ge f(x)+\nabla f(x)^\top(\tilde y-x)+\frac{m}{2}\|\tilde y-x\|_2^2=f(x)-\frac{1}{2m}\|\nabla f(x)\|_2^2$$
Since this holds for any $y\in S$, we have
$$p^*\ge f(x)-\frac{1}{2m}\|\nabla f(x)\|_2^2$$
This inequality shows how small $\|\nabla f(x)\|_2$ must be for the point to be nearly optimal:
$$\|\nabla f(x)\|_2\le(2m\epsilon)^{1/2}\;\Rightarrow\;f(x)-p^*\le\epsilon$$
We can also bound the distance between $x$ and the optimal point $x^*$:
$$\|x-x^*\|_2\le\frac{2}{m}\|\nabla f(x)\|_2$$
Similarly, we have the descent lemma and an upper bound on $p^*$:
$$f(y)\le f(x)+\nabla f(x)^\top(y-x)+\frac{M}{2}\|y-x\|_2^2,\qquad p^*\le f(x)-\frac{1}{2M}\|\nabla f(x)\|_2^2$$
$$\|\nabla f(x)-\nabla f(y)\|_2\le M\|x-y\|_2\;\Leftrightarrow\;\nabla^2 f(x)\preceq MI$$
Descent methods
$$x^{(k+1)}=x^{(k)}+t^{(k)}\Delta x^{(k)}$$
where $t^{(k)}>0$ is the step size and $\Delta x^{(k)}$ is the search direction.
We want $f(x^{(k+1)})<f(x^{(k)})$ unless $x^{(k)}$ is optimal. We know by convexity that $\nabla f(x^{(k)})^\top(y-x^{(k)})\ge 0$ implies $f(y)\ge f(x^{(k)})$, so the search direction must satisfy
$$\langle\nabla f(x^{(k)}),\Delta x^{(k)}\rangle=\nabla f(x^{(k)})^\top\Delta x^{(k)}<0$$
The stopping criterion is $\|\nabla f(x)\|_2\le\eta$.
Step size
In backtracking line search with parameters $\alpha\in(0,0.5)$ and $\beta\in(0,1)$, we start from $t=1$ and repeatedly multiply $t$ by $\beta$ until $f(x+t\Delta x)\le f(x)+\alpha t\nabla f(x)^\top\Delta x$. This exit condition holds for all $t\in[0,t_0]$ for some $t_0>0$, so the search terminates with $t=1$ or with $t\in(\beta t_0,t_0]$, i.e., $t\ge\min\{1,\beta t_0\}$.
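A minimal sketch of this backtracking rule (the function name and default parameter values are my own choices, not prescribed by the notes):

```python
import numpy as np

def backtracking_line_search(f, grad_fx, x, delta_x, alpha=0.3, beta=0.8):
    """Backtracking line search: start at t = 1 and shrink by beta until
    f(x + t * delta_x) <= f(x) + alpha * t * grad_f(x)^T delta_x."""
    t = 1.0
    fx = f(x)
    slope = grad_fx @ delta_x      # directional derivative; negative for a descent direction
    while f(x + t * delta_x) > fx + alpha * t * slope:
        t *= beta
    return t
```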
Search direction
Gradient Descent
$$\Delta x=-\nabla f(x)$$
Convergence analysis: exact line search
Use the lighter notation $x^+=x+t\Delta x$ for $x^{(k+1)}=x^{(k)}+t^{(k)}\Delta x^{(k)}$.
Assume that $f$ is strongly convex on $S$; then there are positive constants $m$ and $M$ such that $mI\preceq\nabla^2 f(x)\preceq MI$ for all $x\in S$.
Define $\tilde f:\mathbb{R}\to\mathbb{R}$ by $\tilde f(t)=f(x-t\nabla f(x))$. Taking $y=x-t\nabla f(x)$ in the descent lemma, we obtain a quadratic upper bound on $\tilde f$:
$$\tilde f(t)\le f(x)-t\|\nabla f(x)\|_2^2+\frac{Mt^2}{2}\|\nabla f(x)\|_2^2=f(x)+\left(\frac{Mt^2}{2}-t\right)\|\nabla f(x)\|_2^2$$
Minimizing over $t$ on both sides (exact line search minimizes the left-hand side; the right-hand side is minimized at $t=1/M$), we have
$$f(x^+)\le f(x)-\frac{1}{2M}\|\nabla f(x)\|_2^2,\qquad f(x^+)-p^*\le f(x)-p^*-\frac{1}{2M}\|\nabla f(x)\|_2^2$$
We know that $\|\nabla f(x)\|_2^2\ge 2m(f(x)-p^*)$, so
$$f(x^+)-p^*\le(1-m/M)(f(x)-p^*)$$
Applying this recursively for $k$ steps, we get
$$f(x^{(k)})-p^*\le c^k\big(f(x^{(0)})-p^*\big),\quad\text{where } c=1-m/M<1.$$
Thus, the method converges linearly.
In particular, we must have $f(x^{(k)})-p^*\le\epsilon$ after at most
$$\frac{\log\big((f(x^{(0)})-p^*)/\epsilon\big)}{\log(1/c)}$$
iterations of the gradient method with exact line search.
Interpretation:
The numerator is the log of the ratio between the initial gap and the final gap.
The denominator, $\log(1/c)$, depends on $M/m$, an upper bound on the condition number of the Hessian near $x^*$. When $M/m$ is large, we have
$$\log(1/c)=-\log(1-m/M)\approx m/M$$
Hence, the number of iterations needed grows nearly linearly with $M/m$.
Convergence analysis: backtracking line search
We already know that
$$\tilde f(t)\le f(x)+\left(\frac{Mt^2}{2}-t\right)\|\nabla f(x)\|_2^2$$
So the backtracking condition $\tilde f(t)\le f(x)-\alpha t\|\nabla f(x)\|_2^2$ (with $0<\alpha<0.5$) is always satisfied when $0\le t\le 1/M$, since $\frac{Mt^2}{2}-t\le-\frac{t}{2}\le-\alpha t$ on that interval.
Hence, the search terminates either with $t=1$ or with $t\ge\beta/M$, which gives
$$f(x^+)\le f(x)-\min\{\alpha,\alpha\beta/M\}\,\|\nabla f(x)\|_2^2$$
Given $\|\nabla f(x)\|_2^2\ge 2m(f(x)-p^*)$, subtracting $p^*$ from both sides gives
$$f(x^+)-p^*\le f(x)-p^*-\min\{\alpha,\alpha\beta/M\}\,\|\nabla f(x)\|_2^2\le\big(1-\min\{2m\alpha,2m\alpha\beta/M\}\big)(f(x)-p^*)$$
Taking $c=1-\min\{2m\alpha,2m\alpha\beta/M\}<1$, we have
$$f(x^{(k)})-p^*\le c^k\big(f(x^{(0)})-p^*\big)$$
Steepest descent method
The first-order Taylor approximation of $f(x+v)$ is
$$f(x+v)\approx f(x)+\nabla f(x)^\top v$$
The second term can be interpreted as a directional derivative in the direction v. Our goal is to choose a step direction v such that the derivative is as negative as possible.
We define a normalized steepest descent direction as
$$\Delta x_{\mathrm{nsd}}=\arg\min_v\{\nabla f(x)^\top v\mid\|v\|=1\}$$
We can also define an unnormalized steepest descent direction via the dual norm:
$$\Delta x_{\mathrm{sd}}=\|\nabla f(x)\|_*\,\Delta x_{\mathrm{nsd}}$$
This search direction satisfies
$$\nabla f(x)^\top\Delta x_{\mathrm{sd}}=\|\nabla f(x)\|_*\,\nabla f(x)^\top\Delta x_{\mathrm{nsd}}=-\|\nabla f(x)\|_*^2$$
Euclidean norm case
When $\|\cdot\|$ is the Euclidean norm, $\Delta x_{\mathrm{nsd}}=-\frac{\nabla f(x)}{\|\nabla f(x)\|_2}$ and $\Delta x_{\mathrm{sd}}=-\nabla f(x)$, which is exactly gradient descent.
Quadratic norm case
$$\|z\|_P=(z^\top Pz)^{1/2}=\|P^{1/2}z\|_2,\quad P\in S_{++}^n,\qquad \Delta x_{\mathrm{nsd}}=-\frac{P^{-1}\nabla f(x)}{\big(\nabla f(x)^\top P^{-1}\nabla f(x)\big)^{1/2}}$$
The dual norm is $\|z\|_*=\|P^{-1/2}z\|_2$, so the steepest descent step is
$$\Delta x_{\mathrm{sd}}=-P^{-1}\nabla f(x)$$
Coordinate change
$\bar u=P^{1/2}u$ defines the coordinate change from the quadratic norm to the Euclidean norm: $\|u\|_P=\|\bar u\|_2$.
$$\bar f(\bar u)=f(P^{-1/2}\bar u)=f(u)$$
Applying the gradient method to $\bar f$,
$$\Delta\bar x=-\nabla\bar f(\bar x)=-P^{-1/2}\nabla f(P^{-1/2}\bar x)=-P^{-1/2}\nabla f(x)$$
Mapping back through the coordinate change, the steepest descent direction for the quadratic norm $\|\cdot\|_P$ is
$$\Delta x=P^{-1/2}\Delta\bar x=-P^{-1}\nabla f(x)$$
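A minimal sketch of the quadratic-norm steepest descent step, solving $P\,\Delta x=-\nabla f(x)$ rather than forming $P^{-1}$ explicitly (the function name is mine):

```python
import numpy as np

def quadratic_norm_sd_direction(grad_fx, P):
    """Steepest descent direction under ||z||_P = sqrt(z^T P z): Delta x_sd = -P^{-1} grad f(x)."""
    return np.linalg.solve(P, -grad_fx)

# Equivalent view: change coordinates x_bar = P^{1/2} x, take a plain gradient step on f_bar,
# and map the step back with P^{-1/2}.
```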
$\ell_1$-norm
Let $i$ be an index for which $|(\nabla f(x))_i|=\|\nabla f(x)\|_\infty$. Then
$$\Delta x_{\mathrm{nsd}}=-\operatorname{sign}\!\left(\frac{\partial f(x)}{\partial x_i}\right)e_i,\qquad \Delta x_{\mathrm{sd}}=\|\nabla f(x)\|_\infty\,\Delta x_{\mathrm{nsd}}=-\frac{\partial f(x)}{\partial x_i}e_i$$
Hence, the steepest descent direction under the $\ell_1$ norm is always the coordinate axis direction along which the first-order decrease in $f$ is greatest (a sketch follows below).
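A minimal sketch of this single-coordinate direction (the helper name is mine):

```python
import numpy as np

def l1_sd_direction(grad_fx):
    """Unnormalized steepest descent direction under the l1 norm:
    move only along the coordinate with the largest |partial derivative|."""
    d = np.zeros_like(grad_fx)
    i = np.argmax(np.abs(grad_fx))     # index achieving ||grad f(x)||_inf
    d[i] = -grad_fx[i]                 # Delta x_sd = -(df/dx_i) e_i
    return d
```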
Choice of norm
If we already know an approximation $\hat H$ of the Hessian at the optimal point, then we can choose $P=\hat H$, so that the Hessian of $\bar f$ becomes
$$\nabla^2\bar f(\bar x)=P^{-1/2}\nabla^2 f(P^{-1/2}\bar x)P^{-1/2}=\hat H^{-1/2}\nabla^2 f(x)\hat H^{-1/2}\approx I$$
Thus, the condition number is likely to be low.
Geometrically speaking, the ellipsoid $\{x\mid\|x\|_P\le 1\}$ should approximate the shape of the sublevel set $\{x\mid f(x)\le L\}$.
(In the accompanying figure, the left ellipsoid is similar to the sublevel set, while the right one is not.)
If the condition number of a set is small, the set has approximately the same width in all directions, i.e., it is nearly spherical. After the corresponding coordinate change, the left sublevel set is clearly much closer to a sphere.
Newton’s method
Motivation. Steepest descent only uses first derivatives; we now try to use higher-order derivatives.
Method
Construct a quadratic approximation to the objective function that matches the 1st and 2nd derivative values at that point. Then minimize the approximate function.
$$f(x)\approx f(x^{(k)})+g^{(k)\top}(x-x^{(k)})+\frac{1}{2}(x-x^{(k)})^\top F(x^{(k)})(x-x^{(k)})\triangleq q(x)$$
Here, $g^{(k)}=\nabla f(x^{(k)})$ and $F(x^{(k)})=\nabla^2 f(x^{(k)})$ is the Hessian.
Applying the First-Order Necessary Condition (FONC) to q yields
$$0=\nabla q(x)=g^{(k)}+F(x^{(k)})(x-x^{(k)})$$
If $F(x^{(k)})\succ 0$, then $q$ achieves its minimum at
$$x^{(k+1)}=x^{(k)}-F(x^{(k)})^{-1}g^{(k)}$$
This is the update rule of Newton's method (a sketch follows below).
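A minimal sketch of one Newton step; solving the linear system is preferred to explicitly inverting $F(x^{(k)})$ (the function name is mine):

```python
import numpy as np

def newton_step(grad_fx, hess_fx):
    """Pure Newton step: solve F(x_k) d = -g_k, so that x_{k+1} = x_k + d."""
    return np.linalg.solve(hess_fx, -grad_fx)

# Usage: x_next = x + newton_step(grad_f(x), hess_f(x))
```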
Analysis
Cons:
- If $F(x^{(k)})$ is not positive definite, the algorithm might not head in a direction of decreasing objective value.
- Even if it is positive definite, it may happen that $f(x^{(k)})\le f(x^{(k+1)})$, e.g., if the starting point is far from $x^*$.
Pros:
- When the starting point is near $x^*$, Newton's method converges to $x^*$ with order of convergence at least 2. Note that we only need to assume $f\in C^3$, $\nabla f(x^*)=0$, and that $F(x^*)$ is invertible, so Newton's method does not necessarily converge to a local minimum.
- The descent property ($f(x^{(k+1)})\le f(x^{(k)})$) holds with a small modification of the algorithm (damped Newton's method). If $F(x^{(k)})\succ 0$ and $g^{(k)}=\nabla f(x^{(k)})\ne 0$, then the direction
  $$d^{(k)}=-F(x^{(k)})^{-1}g^{(k)}=x^{(k+1)}-x^{(k)}$$
  is a descent direction in the sense that $\exists\,\hat\alpha>0$ such that $\forall\alpha\in(0,\hat\alpha)$,
  $$f(x^{(k)}+\alpha d^{(k)})<f(x^{(k)})$$
Proof. Let
$$\phi(\alpha)=f(x^{(k)}+\alpha d^{(k)})$$
Thus, we need to prove that $\exists\,\hat\alpha>0$ such that $\forall\alpha\in(0,\hat\alpha)$, $\phi(\alpha)<\phi(0)$. Given $F(x^{(k)})^{-1}\succ 0$,
$$\phi'(\alpha)=\nabla f(x^{(k)}+\alpha d^{(k)})^\top d^{(k)},\qquad \phi'(0)=\nabla f(x^{(k)})^\top d^{(k)}=-g^{(k)\top}F(x^{(k)})^{-1}g^{(k)}<0$$
Hence, by continuity of $\phi'$, there must exist an $\hat\alpha>0$ such that $\phi(\alpha)<\phi(0)$ for all $\alpha\in(0,\hat\alpha)$.
Damped Newton’s method
Modify the update rule with a step size $\alpha_k$ to ensure the descent property:
$$x^{(k+1)}=x^{(k)}-\alpha_k F(x^{(k)})^{-1}g^{(k)},\qquad \alpha_k=\arg\min_{\alpha\ge 0}f\big(x^{(k)}-\alpha F(x^{(k)})^{-1}g^{(k)}\big)$$
Drawback: computing $F(x^{(k)})$ and solving $F(x^{(k)})d^{(k)}=-g^{(k)}$ are computationally expensive.
Levenberg-Marquardt modification
To ensure the descent property even when the Hessian is not positive definite, modify the update rule as
$$x^{(k+1)}=x^{(k)}-\alpha_k\big(F(x^{(k)})+\mu_k I\big)^{-1}g^{(k)},\quad\mu_k\ge 0$$
The algorithm is actually locally minimizing
$$f(x)+\frac{\mu_k}{2}\|x-x^{(k)}\|^2$$
When $\mu_k\to 0$, it becomes the pure Newton's method; when $\mu_k\to\infty$, it approaches gradient descent with a small step size. In practice, start from a small $\mu_k$ and increase it until the descent property is satisfied (see the sketch below).
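A minimal sketch of the Levenberg-Marquardt modification, with a simple (hypothetical) heuristic that grows $\mu_k$ until the full step decreases $f$:

```python
import numpy as np

def lm_direction(grad_fx, hess_fx, mu):
    """Levenberg-Marquardt direction: solve (F(x_k) + mu * I) d = -g_k."""
    return np.linalg.solve(hess_fx + mu * np.eye(len(grad_fx)), -grad_fx)

def lm_step(f, x, grad_fx, hess_fx, mu0=1e-4, grow=10.0, max_tries=50):
    """Increase mu (pushing the step toward a small gradient step) until f decreases."""
    mu = mu0
    for _ in range(max_tries):
        d = lm_direction(grad_fx, hess_fx, mu)
        if f(x + d) < f(x):        # descent property achieved
            return x + d
        mu *= grow
    return x                        # give up if no decrease is found (e.g. near a stationary point)
```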
Case: Nonlinear least-squares
$$\min_x\sum_{i=1}^m\big(r_i(x)\big)^2$$
Here, the $r_i$ are given functions. Define $r=[r_1,\cdots,r_m]^\top$; then the objective function is $f(x)=r(x)^\top r(x)$. Denote the Jacobian matrix of $r$ by
$$J(x)=\begin{bmatrix}\dfrac{\partial r_1(x)}{\partial x_1}&\cdots&\dfrac{\partial r_1(x)}{\partial x_n}\\\vdots&&\vdots\\\dfrac{\partial r_m(x)}{\partial x_1}&\cdots&\dfrac{\partial r_m(x)}{\partial x_n}\end{bmatrix}$$
Then we can compute the gradient and Hessian of f:
$$(\nabla f(x))_j=\frac{\partial f(x)}{\partial x_j}=2\sum_{i=1}^m r_i(x)\frac{\partial r_i(x)}{\partial x_j}\;\Longrightarrow\;\nabla f(x)=2J(x)^\top r(x)$$
$$\frac{\partial^2 f(x)}{\partial x_k\partial x_j}=\frac{\partial}{\partial x_k}\left(2\sum_{i=1}^m r_i(x)\frac{\partial r_i(x)}{\partial x_j}\right)=2\sum_{i=1}^m\left(\frac{\partial r_i(x)}{\partial x_k}\frac{\partial r_i(x)}{\partial x_j}+r_i(x)\frac{\partial^2 r_i(x)}{\partial x_k\partial x_j}\right)$$
Letting $S(x)$ be the matrix whose $(k,j)$-th component is
$$\sum_{i=1}^m r_i(x)\frac{\partial^2 r_i(x)}{\partial x_k\partial x_j},$$
we write the Hessian matrix as
$$F(x)=2\big(J(x)^\top J(x)+S(x)\big)$$
Therefore, Newton's method applied to the nonlinear least-squares problem is given by
$$x^{(k+1)}=x^{(k)}-\big(J(x^{(k)})^\top J(x^{(k)})+S(x^{(k)})\big)^{-1}J(x^{(k)})^\top r(x^{(k)})$$
When we ignore the second-derivative term $S(x^{(k)})$, we obtain the Gauss-Newton method:
$$x^{(k+1)}=x^{(k)}-\big(J(x^{(k)})^\top J(x^{(k)})\big)^{-1}J(x^{(k)})^\top r(x^{(k)})$$
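A minimal Gauss-Newton sketch for $\min_x\|r(x)\|^2$; it solves the normal equations through a least-squares solve for numerical stability (function and argument names are mine):

```python
import numpy as np

def gauss_newton(r, jac, x0, tol=1e-10, max_iter=100):
    """Gauss-Newton: ignore S(x) and iterate x <- x - (J^T J)^{-1} J^T r.

    r(x)   returns the residual vector of shape (m,)
    jac(x) returns the Jacobian of r, shape (m, n)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J, res = jac(x), r(x)
        # Step d solves min_d ||J d + r||_2, i.e. d = -(J^T J)^{-1} J^T r for full-column-rank J.
        d, *_ = np.linalg.lstsq(J, -res, rcond=None)
        x = x + d
        if np.linalg.norm(d) <= tol:
            break
    return x
```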
Conjugate direction methods
Motivation. Intermediate between steepest descent and Newton’s method.
- Solve quadratics of n variables in n steps.
- Requires no Hessian.
- No matrix inversion, no storage of n×n matrix required.
Q-conjugate: let $Q$ be a real symmetric $n\times n$ matrix. Directions $d^{(0)},d^{(1)},\cdots,d^{(m)}$ are $Q$-conjugate if for all $i\ne j$, $d^{(i)\top}Qd^{(j)}=0$.
Lemma. If $Q\succ 0$ and the directions $d^{(0)},d^{(1)},\cdots,d^{(k)}$, $k\le n-1$, are nonzero and $Q$-conjugate, then they are linearly independent.
🌟Quadratic problems
$$f(x)=\frac{1}{2}x^\top Qx-x^\top b,\qquad Q=Q^\top\succ 0$$
Motivation. Given $n$ nonzero $Q$-conjugate directions (linearly independent by the lemma, hence spanning $\mathbb{R}^n$), we can write
$$x^*-x^{(0)}=\alpha_0 d^{(0)}+\cdots+\alpha_{n-1}d^{(n-1)}$$
Hence, we need to find the $n$ directions and their corresponding coefficients.
Coefficients
Given the direction $d^{(k)}$, we can compute its coefficient $\alpha_k$. Multiplying the above equation by $d^{(k)\top}Q$, we have
$$d^{(k)\top}Q(x^*-x^{(0)})=\alpha_k\,d^{(k)\top}Qd^{(k)}\;\Longrightarrow\;\alpha_k=\frac{d^{(k)\top}Q(x^*-x^{(0)})}{d^{(k)\top}Qd^{(k)}}$$
Since
$$x^{(k)}-x^{(0)}=\alpha_0 d^{(0)}+\cdots+\alpha_{k-1}d^{(k-1)},\qquad g^{(k)}=Qx^{(k)}-b,\qquad Qx^*=b,$$
and by conjugacy $d^{(k)\top}Q(x^{(k)}-x^{(0)})=0$,
🌟we have
$$\alpha_k=\frac{d^{(k)\top}Q(x^*-x^{(k)})}{d^{(k)\top}Qd^{(k)}}=-\frac{d^{(k)\top}g^{(k)}}{d^{(k)\top}Qd^{(k)}}$$
Direction
Lemma. By mathematical induction, we have
$$\langle g^{(k+1)},d^{(i)}\rangle=\langle Qx^{(k+1)}-b,d^{(i)}\rangle=\langle Q(x^{(k)}+\alpha_k d^{(k)})-b,d^{(i)}\rangle=\langle g^{(k)}+\alpha_k Qd^{(k)},d^{(i)}\rangle=0$$
(for $i<k$ by the induction hypothesis and conjugacy; for $i=k$ by the choice of $\alpha_k$), hence
$$g^{(k+1)\top}d^{(i)}=0,\quad 0\le k\le n-1,\ 0\le i\le k$$
🌟Directions. Since $g^{(i)}=\beta_{i-1}d^{(i-1)}-d^{(i)}$ by the construction of the directions,
$$\langle g^{(k+1)},g^{(i)}\rangle=\langle g^{(k+1)},\beta_{i-1}d^{(i-1)}-d^{(i)}\rangle=0,\quad\forall\,0\le i\le k$$
Then, by mathematical induction (the $\beta_k d^{(k)}$ term vanishes by the induction hypothesis for $i<k$ and by the choice of $\beta_k$ for $i=k$),
$$\langle d^{(k+1)},Qd^{(i)}\rangle=\langle -g^{(k+1)}+\beta_k d^{(k)},Qd^{(i)}\rangle=-\frac{1}{\alpha_i}\langle g^{(k+1)},Q\Delta x^{(i)}\rangle=-\frac{1}{\alpha_i}\langle g^{(k+1)},\Delta g^{(i)}\rangle=0$$
Algorithm
Given a starting point $x^{(0)}$:
1. $g^{(0)}=\nabla f(x^{(0)})=Qx^{(0)}-b$. If $g^{(0)}=0$, stop; else set $d^{(0)}=-g^{(0)}$.
2. Take $\alpha_k=-\dfrac{g^{(k)\top}d^{(k)}}{d^{(k)\top}Qd^{(k)}}$. Update $x^{(k+1)}=x^{(k)}+\alpha_k d^{(k)}$.
3. Compute the gradient $g^{(k+1)}=\nabla f(x^{(k+1)})=Qx^{(k+1)}-b$. If $g^{(k+1)}=0$, stop.
4. Take $\beta_k=\dfrac{g^{(k+1)\top}Qd^{(k)}}{d^{(k)\top}Qd^{(k)}}$. Get a new $Q$-conjugate direction $d^{(k+1)}=-g^{(k+1)}+\beta_k d^{(k)}$.
5. Set $k:=k+1$ and return to step 2.
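A minimal sketch of the algorithm above for the quadratic case (the small test problem at the bottom is hypothetical):

```python
import numpy as np

def conjugate_gradient(Q, b, x0=None, tol=1e-10):
    """Conjugate gradient for f(x) = 0.5 x^T Q x - b^T x with Q symmetric positive definite.
    In exact arithmetic it terminates in at most n steps."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    g = Q @ x - b                      # g^(0)
    d = -g                             # d^(0)
    for _ in range(n):
        if np.linalg.norm(g) <= tol:
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)    # exact minimizer along d (step 2)
        x = x + alpha * d
        g_new = Q @ x - b              # step 3
        beta = (g_new @ Qd) / (d @ Qd) # beta_k (step 4)
        d = -g_new + beta * d
        g = g_new
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b), np.linalg.solve(Q, b))
```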
Non-quadratic problems
Motivation. Consider the quadratic function $f(x)=\frac{1}{2}x^\top Qx-x^\top b$ as a second-order Taylor approximation of a general objective function. Then $Q$ plays the role of the Hessian, which would need to be re-evaluated at each iteration.
To be Hessian-free, we modify the computation of $\alpha_k$ and $\beta_k$:
$\alpha_k=\arg\min_{\alpha>0}f(x^{(k)}+\alpha d^{(k)})$ can be replaced by a line search.
Replace $Qd^{(k)}$ in $\beta_k$ with $\dfrac{g^{(k+1)}-g^{(k)}}{\alpha_k}$ (since $Qd^{(k)}=Q\Delta x^{(k)}/\alpha_k=\Delta g^{(k)}/\alpha_k$, and the factor $1/\alpha_k$ cancels in the ratio):
$$\begin{aligned}\beta_k&=\frac{g^{(k+1)\top}Qd^{(k)}}{d^{(k)\top}Qd^{(k)}}\\&=\frac{g^{(k+1)\top}\big[g^{(k+1)}-g^{(k)}\big]}{d^{(k)\top}\big[g^{(k+1)}-g^{(k)}\big]}&&\text{(Hestenes-Stiefel formula)}\\&=\frac{g^{(k+1)\top}\big[g^{(k+1)}-g^{(k)}\big]}{g^{(k)\top}g^{(k)}}&&\text{(Polak-Ribière formula)}\\&=\frac{g^{(k+1)\top}g^{(k+1)}}{g^{(k)\top}g^{(k)}}&&\text{(Fletcher-Reeves formula)}\end{aligned}$$
The last two formulas follow from the lemma above and the definition of $d^{(k)}$.
- Non-quadratic problems might not converge in $n$ steps, so reinitialize the direction vector to the negative gradient every few iterations (e.g., every $n$ or $n+1$ iterations).
- If the line search for $\alpha$ is inexact, the Hestenes-Stiefel formula tends to be better; in general, the best choice of formula depends on the objective function. A Hessian-free sketch follows below.
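A minimal Hessian-free sketch for non-quadratic objectives, using a backtracking (Armijo) line search in place of an exact minimization and a periodic restart to the negative gradient (parameter choices and safeguards are my own):

```python
import numpy as np

def nonlinear_cg(f, grad_f, x0, formula="FR", tol=1e-6, max_iter=1000, restart=None):
    """Conjugate gradient with Fletcher-Reeves ('FR') or Polak-Ribiere ('PR') beta."""
    x = np.asarray(x0, dtype=float)
    restart = x.size if restart is None else restart
    g = grad_f(x)
    d = -g
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        if g @ d >= 0:                          # safeguard: not a descent direction -> restart
            d = -g
        # Backtracking line search for alpha_k (inexact replacement for argmin over alpha).
        alpha, fx, slope = 1.0, f(x), g @ d
        for _ in range(50):
            if f(x + alpha * d) <= fx + 1e-4 * alpha * slope:
                break
            alpha *= 0.5
        x = x + alpha * d
        g_new = grad_f(x)
        if (k + 1) % restart == 0:
            beta = 0.0                          # reinitialize direction to -gradient
        elif formula == "PR":
            beta = g_new @ (g_new - g) / (g @ g)
        else:                                   # Fletcher-Reeves
            beta = (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d
        g = g_new
    return x
```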
Quasi-Newton methods
Motivation. To avoid computing $F(x^{(k)})^{-1}$ in the damped Newton's method, we try replacing it with an approximation $H_k$:
$$x^{(k+1)}=x^{(k)}-\alpha_k F(x^{(k)})^{-1}g^{(k)}\quad\Longrightarrow\quad x^{(k+1)}=x^{(k)}-\alpha_k H_k g^{(k)}$$
To ensure a decrease in $f$, note that
$$f(x^{(k+1)})=f(x^{(k)})+g^{(k)\top}(x^{(k+1)}-x^{(k)})+o\big(\|x^{(k+1)}-x^{(k)}\|\big)=f(x^{(k)})-\alpha_k g^{(k)\top}H_k g^{(k)}+o\big(\alpha_k\|H_k g^{(k)}\|\big)$$
so we must have
$$g^{(k)\top}H_k g^{(k)}>0$$
A simple way to ensure this requirement is to take $H_k\succ 0$.
Approximate the inverse Hessian
Suppose that $F(x)=Q$ for all $x$, with $Q=Q^\top$. Then
$$\Delta g^{(k)}=g^{(k+1)}-g^{(k)}=Q(x^{(k+1)}-x^{(k)})=Q\Delta x^{(k)}$$
Therefore, we require the approximation $H_{k+1}$ to satisfy
$$H_{k+1}\Delta g^{(i)}=\Delta x^{(i)},\quad 0\le i\le k$$
If $H_n$ satisfies $H_n\Delta g^{(i)}=\Delta x^{(i)}$ for $0\le i\le n-1$ (with linearly independent $\Delta x^{(i)}$), then $H_n=Q^{-1}$, and the algorithm is guaranteed to solve problems with quadratic objective functions in at most $n+1$ steps.
Algorithm
$$d^{(k)}=-H_k g^{(k)},\qquad \alpha_k=\arg\min_{\alpha\ge 0}f(x^{(k)}+\alpha d^{(k)}),\qquad x^{(k+1)}=x^{(k)}+\alpha_k d^{(k)}$$
where the matrices $H_0,H_1,\cdots$ are symmetric. In the quadratic case, they are required to satisfy
$$H_{k+1}\Delta g^{(i)}=\Delta x^{(i)},\quad 0\le i\le k$$
How can we derive an $H_{k+1}$ from $H_k$ that satisfies the above equation? We consider three specific updating formulas.
Rank One Correction Formula (single-rank symmetric, SRS, algorithm)
$$H_{k+1}=H_k+a_k z^{(k)}z^{(k)\top}$$
The correction satisfies $\operatorname{rank}(z^{(k)}z^{(k)\top})=1$. Imposing the quasi-Newton condition,
$$H_{k+1}\Delta g^{(k)}=\big(H_k+a_k z^{(k)}z^{(k)\top}\big)\Delta g^{(k)}=\Delta x^{(k)}\;\Longrightarrow\;z^{(k)}=\frac{\Delta x^{(k)}-H_k\Delta g^{(k)}}{a_k z^{(k)\top}\Delta g^{(k)}}$$
$$H_{k+1}=H_k+\frac{\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)^\top}{a_k\big(z^{(k)\top}\Delta g^{(k)}\big)^2}$$
We know from the first equation that $\Delta x^{(k)}-H_k\Delta g^{(k)}=\big(a_k z^{(k)\top}\Delta g^{(k)}\big)z^{(k)}$; multiplying both sides on the left by $\Delta g^{(k)\top}$ lets us replace the denominator:
$$H_{k+1}=H_k+\frac{\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)^\top}{\Delta g^{(k)\top}\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)}$$
Algorithm. Start with $k=0$ and a real symmetric positive definite $H_0$.
1. If $g^{(k)}=0$, stop; else set $d^{(k)}=-H_k g^{(k)}$.
2. Take $\alpha_k=\arg\min_{\alpha\ge 0}f(x^{(k)}+\alpha d^{(k)})$. Update $x^{(k+1)}=x^{(k)}+\alpha_k d^{(k)}$.
3. Compute the gradient and the next inverse-Hessian approximation, then return to step 1 with $k:=k+1$:
$$\Delta x^{(k)}=\alpha_k d^{(k)},\qquad \Delta g^{(k)}=g^{(k+1)}-g^{(k)},\qquad H_{k+1}=H_k+\frac{\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)^\top}{\Delta g^{(k)\top}\big(\Delta x^{(k)}-H_k\Delta g^{(k)}\big)}$$
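A minimal sketch of the rank-one update itself, with a standard safeguard (my addition) that skips the update when the denominator is nearly zero:

```python
import numpy as np

def rank_one_update(H, dx, dg, eps=1e-8):
    """SRS / rank-one correction: H_{k+1} = H + (dx - H dg)(dx - H dg)^T / (dg^T (dx - H dg))."""
    v = dx - H @ dg
    denom = dg @ v
    if abs(denom) < eps * np.linalg.norm(dg) * np.linalg.norm(v):
        return H                       # skip the update when it would be numerically unstable
    return H + np.outer(v, v) / denom
```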
Davidon-Fletcher-Powell (DFP) Algorithm
A rank-two update.
Algorithm. Start with $k=0$ and a real symmetric positive definite $H_0$.
1. If $g^{(k)}=0$, stop; else set $d^{(k)}=-H_k g^{(k)}$.
2. Take $\alpha_k=\arg\min_{\alpha\ge 0}f(x^{(k)}+\alpha d^{(k)})$. Update $x^{(k+1)}=x^{(k)}+\alpha_k d^{(k)}$.
3. Compute the gradient and the next inverse-Hessian approximation, then return to step 1 with $k:=k+1$:
$$\Delta x^{(k)}=\alpha_k d^{(k)},\qquad \Delta g^{(k)}=g^{(k+1)}-g^{(k)},\qquad H_{k+1}=H_k+\frac{\Delta x^{(k)}\Delta x^{(k)\top}}{\Delta x^{(k)\top}\Delta g^{(k)}}-\frac{\big[H_k\Delta g^{(k)}\big]\big[H_k\Delta g^{(k)}\big]^\top}{\Delta g^{(k)\top}H_k\Delta g^{(k)}}$$
Broyden-Fletcher-Goldfarb-Shanno (BFGS) Algorithm
Note that so far we approximate $Q^{-1}$ by $H_k$; we can instead approximate $Q$ itself by $B_k$ and take the inverse at the end. The two conditions are dual:
$$H_{k+1}\Delta g^{(i)}=\Delta x^{(i)},\quad 0\le i\le k\qquad\Longleftrightarrow\qquad B_{k+1}\Delta x^{(i)}=\Delta g^{(i)},\quad 0\le i\le k$$
The BFGS update of $B_k$ mirrors the DFP update of $H_k$ with the roles of $\Delta x$ and $\Delta g$ interchanged:
$$B_{k+1}=B_k+\frac{\Delta g^{(k)}\Delta g^{(k)\top}}{\Delta g^{(k)\top}\Delta x^{(k)}}-\frac{\big[B_k\Delta x^{(k)}\big]\big[B_k\Delta x^{(k)}\big]^\top}{\Delta x^{(k)\top}B_k\Delta x^{(k)}}$$
To obtain the approximation of the inverse Hessian, take
$$H_{k+1}=(B_{k+1})^{-1}=\left(B_k+\frac{\Delta g^{(k)}\Delta g^{(k)\top}}{\Delta g^{(k)\top}\Delta x^{(k)}}-\frac{\big[B_k\Delta x^{(k)}\big]\big[B_k\Delta x^{(k)}\big]^\top}{\Delta x^{(k)\top}B_k\Delta x^{(k)}}\right)^{-1}$$
This inverse can be computed using the Sherman-Morrison-Woodbury formula
$$(A+UCV^\top)^{-1}=A^{-1}-A^{-1}U\big(C^{-1}+V^\top A^{-1}U\big)^{-1}V^\top A^{-1},$$
where $A$ and $C$ are invertible and the sizes of $U$, $C$, $V$ are compatible.
Hence, the update of $H$ is
$$H_{k+1}^{\mathrm{BFGS}}=H_k+\left(1+\frac{\Delta g^{(k)\top}H_k\Delta g^{(k)}}{\Delta g^{(k)\top}\Delta x^{(k)}}\right)\frac{\Delta x^{(k)}\Delta x^{(k)\top}}{\Delta x^{(k)\top}\Delta g^{(k)}}-\frac{H_k\Delta g^{(k)}\Delta x^{(k)\top}+\big(H_k\Delta g^{(k)}\Delta x^{(k)\top}\big)^\top}{\Delta g^{(k)\top}\Delta x^{(k)}}$$
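A minimal sketch of this BFGS update of the inverse-Hessian approximation, written directly from the formula above:

```python
import numpy as np

def bfgs_update(H, dx, dg):
    """BFGS inverse-Hessian update, with dx = x_{k+1} - x_k and dg = g_{k+1} - g_k."""
    rho = 1.0 / (dg @ dx)
    Hdg = H @ dg
    term1 = (1.0 + rho * (dg @ Hdg)) * rho * np.outer(dx, dx)   # scaled dx dx^T term
    term2 = rho * (np.outer(Hdg, dx) + np.outer(dx, Hdg))       # symmetric cross terms
    return H + term1 - term2
```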
🌟Limited-Memory BFGS
$$H_{k+1}=V_k^\top H_k V_k+\rho_k\Delta x_k\Delta x_k^\top,\qquad \rho_k=\frac{1}{\Delta g_k^\top\Delta x_k},\quad V_k=I-\rho_k\Delta g_k\Delta x_k^\top$$
We only need to store the last $m$ pairs of vectors $(\Delta x_i,\Delta g_i)$ instead of the matrix: $H_k$ is only used to compute the vector $H_k g_k$, and the stored pairs suffice for that computation.
$$\begin{aligned}H_k={}&\big(V_{k-1}^\top\cdots V_{k-m}^\top\big)H_k^0\big(V_{k-m}\cdots V_{k-1}\big)\\&+\rho_{k-m}\big(V_{k-1}^\top\cdots V_{k-m+1}^\top\big)\Delta x_{k-m}\Delta x_{k-m}^\top\big(V_{k-m+1}\cdots V_{k-1}\big)\\&+\rho_{k-m+1}\big(V_{k-1}^\top\cdots V_{k-m+2}^\top\big)\Delta x_{k-m+1}\Delta x_{k-m+1}^\top\big(V_{k-m+2}\cdots V_{k-1}\big)\\&+\cdots+\rho_{k-1}\Delta x_{k-1}\Delta x_{k-1}^\top.\end{aligned}$$
From this expression we can design a two-loop algorithm to compute $H_k g_k$:
$$H_k g_k=V_{k-1}^\top H_{k-1}(V_{k-1}g_k)+\Delta x_{k-1}\big(\rho_{k-1}\Delta x_{k-1}^\top g_k\big)$$
Loop 1: compute $\{\alpha_i\}$ and $\{q_i\}$ for $i=k-1,\cdots,k-m$:
$$\alpha_{k-1}:=\rho_{k-1}\Delta x_{k-1}^\top g_k,\qquad \alpha_i=\rho_i\Delta x_i^\top q_{i+1}$$
$$q_k:=g_k,\qquad q_i=V_i q_{i+1}=(I-\rho_i\Delta g_i\Delta x_i^\top)q_{i+1}=q_{i+1}-\alpha_i\Delta g_i$$
Loop 2: compute $p_i$ for $i=k-m,\cdots,k$; finally $H_k g_k=p_k$:
$$p_{k-m}:=H_k^0 q_{k-m},\qquad \beta=\rho_i\Delta g_i^\top p_i,\qquad p_{i+1}=V_i^\top p_i+\alpha_i\Delta x_i=p_i+(\alpha_i-\beta)\Delta x_i$$
The choice of $H_k^0$ could be
$$H_k^0=\gamma_k I,\qquad \gamma_k=\frac{\Delta x_{k-1}^\top\Delta g_{k-1}}{\Delta g_{k-1}^\top\Delta g_{k-1}}$$
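A minimal sketch of the two-loop recursion; I store the pairs as `s_list` (the $\Delta x_i$) and `y_list` (the $\Delta g_i$), oldest first (the naming is mine):

```python
import numpy as np

def lbfgs_two_loop(g, s_list, y_list, gamma=None):
    """Compute H_k g from the stored pairs s_i = Delta x_i, y_i = Delta g_i (oldest first).
    The L-BFGS search direction is the negative of the returned vector."""
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = g.copy()
    alphas = []
    # Loop 1: newest to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # Initial matrix H_k^0 = gamma_k * I.
    if gamma is None:
        s, y = s_list[-1], y_list[-1]
        gamma = (s @ y) / (y @ y)
    p = gamma * q
    # Loop 2: oldest to newest.
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ p)
        p = p + (a - beta) * s
    return p
```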
Use a line search based on the weak Wolfe conditions (the directional derivative at the new point should not be too negative):
$$f(x_k+\alpha_k p_k)\le f(x_k)+c_1\alpha_k\nabla f_k^\top p_k,\qquad \nabla f(x_k+\alpha_k p_k)^\top p_k\ge c_2\nabla f_k^\top p_k$$
with $0<c_1<c_2<1$, or the strong Wolfe conditions (in addition, the magnitude of the directional derivative should not grow):
$$f(x_k+\alpha_k p_k)\le f(x_k)+c_1\alpha_k\nabla f_k^\top p_k,\qquad \big|\nabla f(x_k+\alpha_k p_k)^\top p_k\big|\le c_2\big|\nabla f_k^\top p_k\big|$$
Majorization minimization
Basic framework
Original problem:
$$\min_x f(x)\quad\text{s.t.}\quad x\in C$$
New problem:
$$\min_x g_k(x)\quad\text{s.t.}\quad x\in C$$
where $g_k$ satisfies the global majorant conditions:
- $f(x)\le g_k(x),\ \forall x\in C$
- $f(x_k)=g_k(x_k)$
so that $x_{k+1}=\arg\min_x g_k(x)$ implies $f(x_{k+1})\le g_k(x_{k+1})\le g_k(x_k)=f(x_k)$.
A local majorant replaces the first requirement with $f(x_{k+1})\le g_k(x_{k+1})$.
Choice of majorant function
Lipschitz Gradient Surrogate
$$g_k(x)=f(x_k)+\langle\nabla f(x_k),x-x_k\rangle+\frac{1}{2\alpha}\|x-x_k\|^2$$
$$x_{k+1}=P_C\big(x_k-\alpha\nabla f(x_k)\big)\quad\leftarrow\ \text{projected gradient descent}$$
$$x_{k+1}=x_k-\alpha\nabla f(x_k),\quad C=\mathbb{R}^n\quad\leftarrow\ \text{gradient descent}$$
Choose $\alpha$ through backtracking line search. If $f$ is $L$-smooth, i.e., $\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|$, then choose $\alpha=1/L$ (a sketch follows below).
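A minimal sketch of the resulting projected gradient iteration, with a hypothetical example (projection onto the nonnegative orthant):

```python
import numpy as np

def projected_gradient(grad_f, project, x0, alpha, tol=1e-8, max_iter=10_000):
    """Minimize the Lipschitz-gradient surrogate at each step:
    x_{k+1} = P_C(x_k - alpha * grad_f(x_k)); `project` is the Euclidean projection onto C."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = project(x - alpha * grad_f(x))
        if np.linalg.norm(x_next - x) <= tol:
            return x_next
        x = x_next
    return x

# Hypothetical example: minimize ||x - c||^2 over x >= 0 (projection = clip at 0, L = 2).
c = np.array([1.0, -2.0, 0.5])
print(projected_gradient(lambda x: 2 * (x - c), lambda z: np.maximum(z, 0.0),
                         x0=np.zeros(3), alpha=0.5))   # approx [1.0, 0.0, 0.5]
```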
Quadratic Surrogate
$$g_k(x)=f(x_k)+\langle\nabla f(x_k),x-x_k\rangle+\frac{1}{2}(x-x_k)^\top H_k(x-x_k),\quad\text{where } H_k\succ\nabla^2 f$$
$$x_{k+1}=x_k-H_k^{-1}\nabla f(x_k)\quad(C=\mathbb{R}^n)\quad\text{(quasi-Newton method)}$$
Jensen Surrogate
$$\min_x f(x)=\tilde f(\theta^\top x),\quad\text{s.t. }x\in C,\ \tilde f\text{ convex}$$
$$g_k(x)=\sum_{i=1}^n w_i\,\tilde f\!\left(\frac{\theta_i}{w_i}\big(x_i-x_{k,i}\big)+\theta^\top x_k\right)$$
where $w\in\mathbb{R}_+^n$, $\|w\|_1=1$, and $w_i\ne 0$ whenever $\theta_i\ne 0$.
Application of Jensen's Inequality: the EM (Expectation Maximization) Algorithm
$$-\log\left(\sum_{t=1}^T f_t(\theta)\right)\le-\sum_{t=1}^T w^{(t)}\log\left(\frac{f_t(\theta)}{w^{(t)}}\right)$$
for weights $w^{(t)}\ge 0$ with $\sum_t w^{(t)}=1$. The EM algorithm minimizes a negative log-likelihood; the right-hand side of this inequality can be interpreted as a majorizing surrogate of the left-hand side.
Consider the maximum likelihood estimation of the parameters of a Gaussian Mixture Model (GMM) with density
$$\sum_{k=1}^n\theta_k\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)\right)$$
Given data $x_i,\ i=1,\cdots,m$, we want to estimate the parameters $\{\theta_k,\mu_k,\Sigma_k\}$ from the data by maximizing the log-likelihood:
$$\max_{\theta,\mu,\Sigma}\sum_{i=1}^m\log\sum_{k=1}^n\theta_k\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x_i-\mu_k)^\top\Sigma_k^{-1}(x_i-\mu_k)\right)$$
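A compact, illustrative EM sketch for this GMM likelihood (E-step: responsibilities, i.e. the surrogate weights; M-step: closed-form update of $\{\theta_k,\mu_k,\Sigma_k\}$). This is not the course's code; the initialization scheme and the small ridge added to $\Sigma_k$ are my own numerical choices.

```python
import numpy as np

def gmm_em(X, n_components, n_iter=100, seed=0):
    """EM for a Gaussian mixture: data X of shape (m, d)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.full(n_components, 1.0 / n_components)          # mixing weights theta_k
    mu = X[rng.choice(m, n_components, replace=False)].copy()  # initialize means from the data
    Sigma = np.stack([np.eye(d)] * n_components)
    for _ in range(n_iter):
        # E-step: responsibilities w_{ik} proportional to theta_k * N(x_i | mu_k, Sigma_k).
        W = np.empty((m, n_components))
        for k in range(n_components):
            diff = X - mu[k]
            quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma[k]), diff)
            norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma[k]))
            W[:, k] = theta[k] * np.exp(-0.5 * quad) / norm
        W /= W.sum(axis=1, keepdims=True)
        # M-step: minimize the majorizing surrogate of the negative log-likelihood
        # (equivalently, maximize the weighted log-likelihood) in closed form.
        Nk = W.sum(axis=0)
        theta = Nk / m
        mu = (W.T @ X) / Nk[:, None]
        for k in range(n_components):
            diff = X - mu[k]
            Sigma[k] = (W[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return theta, mu, Sigma
```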
Variational Surrogate
When the objective is itself defined by a minimization over an auxiliary variable $y$, i.e.,
$$\min_x f(x),\ \text{s.t. }x\in C,\quad\text{where } f(x)=\min_{y\in\mathcal{Y}}h(x,y),$$
we can take the surrogate
$$g_k(x)=h(x,y_k^*),\qquad y_k^*=\arg\min_{y\in\mathcal{Y}}h(x_k,y).$$