I am taking Convex Optimization this semester (2021 Spring). The notes are based on my understanding of the course materials provided by Prof. Zhouchen Lin. The notes only offer concise definitions and formulas, with few proofs and examples.
This chapter is the last of the basics; in the next chapter we will start to introduce specific algorithms. We first define convex functions, how to verify convexity, and properties related to convexity. Next, we discuss operations that preserve convexity, just like the previous chapter. Finally, we introduce the conjugate function, the envelope function, and the proximal mapping, three important concepts for future use.
Convex Function
Table of Contents:
Basic definitions and examples
Convex function
A function $f:\R^n\to\R$ is convex if $\operatorname{dom} f$ is convex and for all $\mathbf{x}, \mathbf{y}\in \operatorname{dom} f$ and $0\le \theta\le 1$,

$$
f(\theta\mathbf{x}+(1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x})+ (1-\theta) f(\mathbf{y})
$$

A function is strictly convex if strict inequality holds whenever $\mathbf{x}\ne \mathbf{y}$ and $0<\theta<1$.

A function is $\mu$-strongly convex if

$$
f(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x})+(1-\theta) f(\mathbf{y})-\frac{\theta(1-\theta) \mu}{2}\|\mathbf{y}-\mathbf{x}\|^{2}, \quad \forall \theta \in[0,1]
$$
Extended-value extensions
Corollary. A convex function is differentiable almost everywhere on the relative interior of its domain.
Define the extended-value extension of a convex function $f$ as $\tilde{f}:\R^n\to\R\cup\{\infty\}$,

$$
\tilde{f}(\mathbf{x})=\left\{\begin{array}{ll}
f(\mathbf{x}), & \mathbf{x} \in \operatorname{dom} f \\
\infty, & \mathbf{x} \notin \operatorname{dom} f
\end{array}\right.
$$

Indicator function of a convex set $C$:

$$
\tilde{I}_C(\mathbf{x})=\left\{\begin{array}{ll}
0, & \mathbf{x} \in C \\
\infty, & \mathbf{x} \notin C
\end{array}\right.
$$

Now we have

$$
\min_\mathbf{x}f(\mathbf{x})\quad \mathrm{s.t.}\ \mathbf{x}\in C\quad\Leftrightarrow\quad \min_\mathbf{x}f(\mathbf{x})+\tilde{I}_C(\mathbf{x})
$$
Proper function
$f$ is proper if $f(\mathbf{x})<\infty$ for at least one $\mathbf{x}$ and $f(\mathbf{x})>-\infty$ for all $\mathbf{x}$.
A function is proper iff its epigraph is nonempty and does not contain a vertical line.
First-order conditions (to prove convexity)
Suppose $f$ is differentiable; then $f$ is convex iff $\operatorname{dom} f$ is convex and for all $\mathbf{x}, \mathbf{y}\in \operatorname{dom} f$,

$$
f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle
$$

This indicates that from local information of a convex function we can derive some global information (a global under-estimator). For instance, if $\nabla f(\mathbf{x})=\mathbf{0}$, then $\mathbf{x}$ is a global minimizer.
Proof.
$\Rightarrow$ Given that $f$ is convex,

$$
\begin{aligned}
f((1-\alpha) \mathbf{x}+\alpha \mathbf{y}) \leq&(1-\alpha) f(\mathbf{x})+\alpha f(\mathbf{y})\\
f(\mathbf{y})\ge &f(\mathbf{x})+\frac{f(\mathbf{x}+\alpha( \mathbf{y}-\mathbf{x}))-f(\mathbf{x}) }{\alpha}\\
f(\mathbf{y}) \geq& f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle,\quad\alpha\to0^+
\end{aligned}
$$
$\Leftarrow$ Given the inequality $f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle$.

Apply the inequality with base point $\alpha \mathbf{x}+(1-\alpha) \mathbf{y}$ to $\mathbf{x}$ and $\mathbf{y}$ separately:

$$
\begin{array}{l}
f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}) \leq f(\mathbf{x})-(1-\alpha)\langle\nabla f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}), \mathbf{x}-\mathbf{y}\rangle \\
f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}) \leq f(\mathbf{y})+\alpha\langle\nabla f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}), \mathbf{x}-\mathbf{y}\rangle
\end{array}
$$

Multiply the first inequality by $\alpha$ and the second by $1-\alpha$ and add them up; we get $f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x})+ (1-\alpha) f(\mathbf{y})$.
Strictly convex iff for all $\mathbf{x}\ne\mathbf{y}$,

$$
f(\mathbf{y}) > f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle
$$

Note that in the proof of $\Rightarrow$, when we take the limit $\alpha\to0^+$ we can only get $f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle$, not $>$. So the proof has to be altered slightly:
$$
\begin{aligned}
f((1-\alpha) \mathbf{x}+\alpha \mathbf{y}) <&(1-\alpha) f(\mathbf{x})+\alpha f(\mathbf{y})\\
f(\mathbf{y})> &f(\mathbf{x})+\frac{f(\mathbf{x}+\alpha( \mathbf{y}-\mathbf{x}))-f(\mathbf{x}) }{\alpha}\\
\text{Since }f(\mathbf{x}+\alpha( \mathbf{y}-\mathbf{x}))\ge &f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \alpha(\mathbf{y}-\mathbf{x})\rangle\text{ by convexity,}\\
f(\mathbf{y}) >& f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle
\end{aligned}
$$
Strongly convex iff $f(\mathbf{x})-\frac{\mu}{2}\|\mathbf{x}\|^2$ is convex, i.e.

$$
f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle+\frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|^2
$$
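As a quick sanity check of the first-order inequality (a minimal sketch that is not part of the original notes; the test function $f(\mathbf{x})=\log\sum_i e^{x_i}$ is just an assumed example), one can sample random pairs of points:

```python
import numpy as np

def f(x):
    # log-sum-exp, a smooth convex test function (assumed example)
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    z = np.exp(x)
    return z / z.sum()

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # first-order condition: f(y) >= f(x) + <grad f(x), y - x>
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
print("first-order inequality held on all sampled pairs")
```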
Second-order conditions
Suppose that $f$ is twice differentiable; then $f$ is convex iff $\operatorname{dom} f$ is convex and its Hessian is positive semidefinite, i.e. for all $\mathbf{x}\in \operatorname{dom} f$,

$$
\nabla^2f(\mathbf{x})\succeq0
$$

Note that the reverse is not true; for example, a saddle point also satisfies $\nabla f=\mathbf{0}$.

Strictly convex: $\nabla^2f(\mathbf{x})\succ0$ (sufficient but not necessary)

Strongly convex: $\nabla^2f(\mathbf{x})\succeq\mu\mathbf{I}$

🌟 Why do we need the condition "$\operatorname{dom}f$ is convex"?

e.g. $f(x)=1/x^2$ with $\operatorname{dom} f=\{x\in\R\mid x\ne 0\}$ satisfies $f''(x)>0$ for all $x\in\operatorname{dom} f$, but it is not a convex function.
Examples
Functions on $\R$

Exponential: $e^{ax}$ is convex on $\R$ for any $a\in\R$

Powers: $x^a$ on $\R_{++}$ is convex when $a\in(-\infty,0]\cup[1,+\infty)$, concave when $0\le a\le 1$

Powers of absolute value: $|x|^p$, $p\ge 1$

Logarithm: $\log x$ is concave on $\R_{++}$

Negative entropy: $x\log x$ is convex on $\R_{+}$ (define $f(0)=0$)
Functions on $\R^n$

Norms

Can be proved by the triangle inequality and homogeneity:

$$
f(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq f(\theta \mathbf{x})+f((1-\theta) \mathbf{y})=\theta f(\mathbf{x})+(1-\theta) f(\mathbf{y})
$$

Max function: $f(\mathbf{x}) = \max\{x_1,\cdots,x_n\}$. Prove by definition.
Quadratic over linear: $f(x,y)=x^2/y$, $\operatorname{dom}f=\R\times\R_{++}$

$$
\nabla^{2} f(x, y)=\frac{2}{y^{3}}\left[\begin{array}{cc}
y^{2} & -x y \\
-x y & x^{2}
\end{array}\right]=\frac{2}{y^{3}}\left[\begin{array}{c}
y \\
-x
\end{array}\right]\left[\begin{array}{c}
y \\
-x
\end{array}\right]^{T} \succeq \mathbf{0}
$$
Log-sum-exp: $f(\mathbf{x}) = \log(e^{x_1}+\cdots+e^{x_n})$

Differentiable approximation of the max function:

$$
\max\{x_1,\cdots,x_n\}\le f(\mathbf{x}) \le \max\{x_1,\cdots,x_n\}+\log n
$$
Prove by $\nabla^2f(\mathbf{x})\succeq0$:

$$
\nabla^{2} f(\mathbf{x})=\frac{1}{\langle\mathbf{1}, \mathbf{z}\rangle^{2}}\left(\langle\mathbf{1}, \mathbf{z}\rangle \operatorname{diag}(\mathbf{z})-\mathbf{z z}^{T}\right),\quad \mathbf{z}=\left(e^{x_{1}}, \ldots, e^{x_{n}}\right)
$$

$$
\forall\mathbf{v},\quad
\mathbf{v}^{T} \nabla^{2} f(\mathbf{x}) \mathbf{v}=\frac{1}{\langle\mathbf{1}, \mathbf{z}\rangle^{2}}\left(\left(\sum_{i=1}^{n} z_{i}\right)\left(\sum_{i=1}^{n} v_{i}^{2} z_{i}\right)-\left(\sum_{i=1}^{n} v_{i} z_{i}\right)^{2}\right) \geq 0
$$

The last inequality follows from the Cauchy-Schwarz inequality $\langle\mathbf{a},\mathbf{a}\rangle\cdot \langle\mathbf{b},\mathbf{b}\rangle\ge \langle\mathbf{a},\mathbf{b}\rangle^2$ with $a_i=\sqrt{z_i}$, $b_i=v_i\sqrt{z_i}$.
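A tiny numerical illustration of the sandwich bound (a sketch that is not part of the notes; the vector is an arbitrary assumption):

```python
import numpy as np

x = np.array([0.3, -1.2, 2.5, 2.5])        # arbitrary test vector
n = x.size
lse = np.log(np.sum(np.exp(x)))            # log-sum-exp
assert x.max() <= lse <= x.max() + np.log(n)
print(f"{x.max():.3f} <= {lse:.3f} <= {x.max() + np.log(n):.3f}")
```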
Geometric mean: $f(\mathbf{x})=\left(\prod_{i=1}^n x_i\right)^{1/n}$ is concave on $\operatorname{dom}f=\R_{++}^n$
Log-determinant: $f(\mathbf{X})=\log\operatorname{det}\mathbf{X}$ is concave on $\operatorname{dom}f=\mathbb{S}^n_{++}$

Consider an arbitrary line $g(t)=f(\mathbf{Z}+t\mathbf{V})$, $\mathbf{Z},\mathbf{V}\in\mathbb{S}^n$, where $t$ is restricted to values for which $\mathbf{X}=\mathbf{Z}+t\mathbf{V}\succ 0$. Let $\lambda_i$ denote the eigenvalues of $\mathbf{Z}^{-1/2}\mathbf{V}\mathbf{Z}^{-1/2}$; then

$$
\begin{aligned}
g(t) &=\log \operatorname{det}(\mathbf{Z}+t \mathbf{V}) \\
&=\log \operatorname{det}\left(\mathbf{Z}^{1 / 2}\left(\mathbf{I}+t \mathbf{Z}^{-1 / 2} \mathbf{V} \mathbf{Z}^{-1 / 2}\right) \mathbf{Z}^{1 / 2}\right) \\
&=\sum_{i=1}^{n} \log \left(1+t \lambda_{i}\right)+\log \operatorname{det} \mathbf{Z}
\end{aligned}
$$

$$
g^{\prime}(t)=\sum_{i=1}^{n} \frac{\lambda_{i}}{1+t \lambda_{i}},\qquad
g^{\prime \prime}(t)=-\sum_{i=1}^{n} \frac{\lambda_{i}^{2}}{\left(1+t \lambda_{i}\right)^{2}}\le 0
$$
Basic properties
Sublevels
The $\alpha$-sublevel set of a function $f:\R^n\to\R$ is defined as

$$
C_\alpha=\{\mathbf{x}\in\operatorname{dom}f\mid f(\mathbf{x})\le\alpha\}
$$

Sublevel sets of a convex function are convex for any $\alpha$. The converse is not true; for example, the concave function $f(x)=-e^x$ has convex sublevel sets.

To establish convexity of a set, we can express it as a sublevel set of a convex function, or as the superlevel set $\{\mathbf{x}\in\operatorname{dom}f\mid f(\mathbf{x})\ge\alpha\}$ of a concave function.
Example:

$$
G(\mathbf{x})=\left(\prod_{i=1}^{n} x_{i}\right)^{1 / n}, \quad A(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} x_{i},\qquad
\{\mathbf{x}\in\R_+^n\mid G(\mathbf{x})\ge\alpha A(\mathbf{x})\}
$$

It is the 0-sublevel set of $\alpha A(\mathbf{x})-G(\mathbf{x})$, which is convex. So the set is convex.
Epigraph
$$
\operatorname{epi}f=\{(\mathbf{x},t)\mid\mathbf{x}\in\operatorname{dom}f,f(\mathbf{x})\le t\}
$$
A function is convex iff its epigraph is a convex set.
Interpret first-order conditions geometrically using epigraph
If $(\mathbf{y},t)\in\operatorname{epi} f$, then

$$
t \geq f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle
$$

$$
(\mathbf{y}, t) \in \operatorname{epi} f \Longrightarrow\left[\begin{array}{c}
\nabla f(\mathbf{x}) \\
-1
\end{array}\right]^{T}\left(\left[\begin{array}{l}
\mathbf{y} \\
t
\end{array}\right]-\left[\begin{array}{c}
\mathbf{x} \\
f(\mathbf{x})
\end{array}\right]\right) \leq 0
$$

Thus, the hyperplane with normal $(\nabla f(\mathbf{x}), -1)$ supports the epigraph at the boundary point $(\mathbf{x},f(\mathbf{x}))$.
Jensen’s Inequality
$$
f(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x})+(1-\theta) f(\mathbf{y})
$$

$$
f\left(\theta_{1} \mathbf{x}_{1}+\cdots+\theta_{k} \mathbf{x}_{k}\right) \leq \theta_{1} f\left(\mathbf{x}_{1}\right)+\cdots+\theta_{k} f\left(\mathbf{x}_{k}\right)
$$

$$
f\left(\int_{S} p(\mathbf{x})\, \mathbf{x}\, d \mathbf{x}\right) \leq \int_{S} f(\mathbf{x}) p(\mathbf{x})\, d \mathbf{x}
$$

$$
f(\mathbb{E} \mathbf{x}) \leq \mathbb{E} f(\mathbf{x})
$$

Thus, adding a zero-mean random vector $\mathbf{z}$ will not decrease the value of a convex function on average:

$$
\mathbb{E}f(\mathbf{x}+\mathbf{z})\ge f(\mathbf{x})
$$
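A quick Monte Carlo illustration (a sketch, not from the notes; the test function $f(\mathbf{v})=\|\mathbf{v}\|^2$ and the Gaussian noise are assumed examples):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda v: np.sum(v ** 2, axis=-1)      # convex test function (assumed example)
x = np.array([1.0, -2.0])
z = rng.normal(size=(200000, 2))           # zero-mean perturbations
print(f(x + z).mean(), ">=", f(x))         # Jensen: E f(x+z) >= f(E[x+z]) = f(x)
```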
Inequality
Many inequalities can be derived by applying Jensen’s Inequality to some convex function.
Arithmetic-geometric mean inequality ($f(x)=-\log x$, $\theta=1/2$):

$$
\sqrt{ab}\le(a+b)/2
$$
Hölder's Inequality ($p,q>1$, $1/p+1/q=1$):

$$
\sum_{i=1}^{n} x_{i} y_{i} \leq\left(\sum_{i=1}^{n}\left|x_{i}\right|^{p}\right)^{1 / p}\left(\sum_{i=1}^{n}\left|y_{i}\right|^{q}\right)^{1 / q}
$$
Proof.
Still by $f(x)=-\log x$,

$$
a^{\theta} b^{1-\theta} \leq \theta a+(1-\theta) b
$$

Take

$$
a=\frac{\left|x_{i}\right|^{p}}{\sum_{j=1}^{n}\left|x_{j}\right|^{p}}, \quad b=\frac{\left|y_{i}\right|^{q}}{\sum_{j=1}^{n}\left|y_{j}\right|^{q}}, \quad \theta=1 / p,
$$

$$
\left(\frac{\left|x_{i}\right|^{p}}{\sum_{j=1}^{n}\left|x_{j}\right|^{p}}\right)^{1 / p}\left(\frac{\left|y_{i}\right|^{q}}{\sum_{j=1}^{n}\left|y_{j}\right|^{q}}\right)^{1 / q} \leq \frac{\left|x_{i}\right|^{p}}{p \sum_{j=1}^{n}\left|x_{j}\right|^{p}}+\frac{\left|y_{i}\right|^{q}}{q \sum_{j=1}^{n}\left|y_{j}\right|^{q}}
$$

Summing over $i$ yields the inequality.
Bregman distance
Let $f:C\to\R$ be a differentiable, strictly convex function, where $C\subset\R^n$ is a convex set. The Bregman distance is

$$
B_f(\mathbf{y},\mathbf{x})=f(\mathbf{y})-f(\mathbf{x})-\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle
$$

By the first-order condition, we know that $B_f(\mathbf{y},\mathbf{x})\ge 0$, but it may not be symmetric: in general $B_f(\mathbf{y},\mathbf{x})\ne B_f(\mathbf{x},\mathbf{y})$.
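Two standard instances, written as a minimal sketch rather than part of the notes: $f(\mathbf{x})=\frac{1}{2}\|\mathbf{x}\|^2$ gives half the squared Euclidean distance, and the negative entropy gives the generalized KL divergence.

```python
import numpy as np

def bregman(f, grad_f, y, x):
    # B_f(y, x) = f(y) - f(x) - <grad f(x), y - x>
    return f(y) - f(x) - grad_f(x) @ (y - x)

# f(x) = 1/2 ||x||^2  ->  B_f(y, x) = 1/2 ||y - x||^2
sq, grad_sq = lambda v: 0.5 * v @ v, lambda v: v
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(bregman(sq, grad_sq, y, x), 0.5 * np.sum((y - x) ** 2))

# negative entropy f(x) = sum x_i log x_i  ->  generalized KL: sum(y log(y/x) - y + x)
negent, grad_negent = lambda v: np.sum(v * np.log(v)), lambda v: 1.0 + np.log(v)
p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman(negent, grad_negent, p, q), np.sum(p * np.log(p / q) - p + q))
```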
🌟 Subgradient
$$
\partial f(\mathbf{x})=\{\mathbf{g} \mid f(\mathbf{y}) \geq f(\mathbf{x})+\langle\mathbf{g}, \mathbf{y}-\mathbf{x}\rangle, \forall \mathbf{y} \in \operatorname{dom} f\}
$$

Generalization of gradient: each subgradient defines a non-vertical supporting hyperplane. The directional derivative along $\mathbf{u}$ satisfies

$$
f'(\mathbf{x};\mathbf{u})=\max_{\mathbf{g}\in\partial f(\mathbf{x})}\langle\mathbf{u},\mathbf{g}\rangle
$$
Examples
$|x|$:

$$
\partial|x| = \left\{\begin{array}{ll}
1,&x>0\\
-1, &x<0\\
{[-1,1]},&x=0
\end{array} \right.
$$
$f(x) = \max\{0, \frac{1}{2}(x^2-1)\}$:

$$
\partial f(x) = \left\{\begin{array}{ll}
x,&|x|>1\\
0, &|x|<1\\
{[-1,0]},&x=-1\\
{[0,1]},&x=1
\end{array} \right.
$$
Indicator function of a convex set $C$:

$$
I_C(\mathbf{x})=\left\{\begin{array}{ll}
0,&\mathbf{x}\in C\\
+\infty, &\mathbf{x}\notin C
\end{array} \right.
$$

For $\mathbf{x},\mathbf{y}\in C$, the subgradient inequality $I_C(\mathbf{y})\ge I_C(\mathbf{x})+\langle \mathbf{g},\mathbf{y}-\mathbf{x}\rangle$ reduces to $0\ge\langle \mathbf{g},\mathbf{y}-\mathbf{x}\rangle$.

When $\mathbf{x}\in C^\circ$, $0\ge \langle \mathbf{g}, \mathbf{y}-\mathbf{x}\rangle$ for all directions $\mathbf{y}-\mathbf{x}$ implies $\mathbf{g}=\mathbf{0}$.

When $\mathbf{x}\in\partial C$, the directions $\mathbf{y}-\mathbf{x}$ generate the tangent cone, so $\partial I_C(\mathbf{x})$ is the normal cone of $C$ at $\mathbf{x}$.
Properties
Continuity
$$
\left.
\begin{array}{l}
x_k\to x^*\\
g_k\in\partial f(x_k),\ g_k\to g^*
\end{array}
\right\}\Rightarrow g^*\in\partial f(x^*)
$$
Chain Rule
$$
F(\mathbf{x})=f(\mathbf{Ax}) \Rightarrow\partial F(\mathbf{x}) = \mathbf{A}^\top\partial f(\mathbf{Ax})
$$

For a composition $F=h\circ f$, the directional derivative satisfies

$$
F'(\mathbf{x};\mathbf{y}) = h'(f(\mathbf{x}))f'(\mathbf{x};\mathbf{y})
$$
Subgradient of norms
For a norm $\|\cdot\|$ on a real Hilbert space with inner product $\langle\cdot,\cdot\rangle$,

$$
\partial \|\mathbf{x}\| = \{\mathbf{y}\mid \langle\mathbf{y},\mathbf{x}\rangle=\|\mathbf{x}\|,\|\mathbf{y}\|^*\le 1 \}
$$
Proof. Let $S =\{\mathbf{y}\mid \langle\mathbf{y},\mathbf{x}\rangle=\|\mathbf{x}\|,\|\mathbf{y}\|^*\le 1 \}$.

For every $\mathbf{y}\in\partial\|\mathbf{x}\|$, we have

$$
\|\mathbf{w}\|-\|\mathbf{x}\|\ge \langle\mathbf{y}, \mathbf{w}-\mathbf{x}\rangle,\quad\forall \mathbf{w}
$$

Choosing $\mathbf{w}=\mathbf{0}$ and $\mathbf{w}=2\mathbf{x}$ gives $\|\mathbf{x}\|=\langle\mathbf{y},\mathbf{x}\rangle$.

$$
\|\mathbf{w}-\mathbf{x}\|\ge\|\mathbf{w}\|-\|\mathbf{x}\|\ge \langle\mathbf{y}, \mathbf{w}-\mathbf{x}\rangle,\quad\forall \mathbf{w}
$$

$$
\left\langle\mathbf{y},\frac{\mathbf{w}-\mathbf{x}}{\|\mathbf{w}-\mathbf{x}\|}\right\rangle\le 1,\quad\forall\mathbf{w}\ne\mathbf{x}
$$

Therefore $\|\mathbf{y}\|^*\le 1$, and $\partial\|\mathbf{x}\|\subset S$.

For every $\mathbf{y}\in S$,

$$
\langle\mathbf{y}, \mathbf{w}-\mathbf{x}\rangle=\langle\mathbf{y}, \mathbf{w}\rangle-\langle\mathbf{y}, \mathbf{x}\rangle=\langle\mathbf{y}, \mathbf{w}\rangle-\|\mathbf{x}\| \leq\|\mathbf{y}\|^{*}\|\mathbf{w}\|-\|\mathbf{x}\| \leq\|\mathbf{w}\|-\|\mathbf{x}\|,\quad\forall \mathbf{w}
$$

Therefore $\mathbf{y}\in\partial\|\mathbf{x}\|$, and $S\subset\partial \|\mathbf{x}\|$.
Danskin’s Theorem
Suppose $\mathcal{Z}$ is a compact subset of $\R^m$, $\phi:\R^n\times\mathcal{Z}\to \R$ is continuous, and $\phi(\cdot,\mathbf{z}):\R^n\to\R$ is convex for each $\mathbf{z}\in\mathcal{Z}$. Define

$$
f(\mathbf{x}) = \max_{\mathbf{z}\in\mathcal{Z}}\phi(\mathbf{x},\mathbf{z}),\qquad
\mathcal{Z}(\mathbf{x}) = \arg\max_{\mathbf{z}\in\mathcal{Z}}\phi(\mathbf{x},\mathbf{z})
$$

Theorem: If $\phi(\cdot,\mathbf{z})$ is differentiable for all $\mathbf{z}$ and $\nabla_\mathbf{x}\phi(\mathbf{x},\cdot)$ is continuous on $\mathcal{Z}$, then

$$
\partial f(\mathbf{x}) = \operatorname{conv}\{\nabla_\mathbf{x}\phi(\mathbf{x},\mathbf{z})\mid \mathbf{z}\in\mathcal{Z}(\mathbf{x}) \}
$$
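As an illustration of the theorem on a finite maximum of linear functions (a sketch with assumed data, not from the notes), take $f(\mathbf{x})=\max_i \langle\mathbf{a}_i,\mathbf{x}\rangle$: the active indices play the role of $\mathcal{Z}(\mathbf{x})$, and any convex combination of the active gradients $\mathbf{a}_i$ is a subgradient.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # rows a_i (assumed data)
f = lambda x: np.max(A @ x)

x = np.array([1.0, 0.0])
vals = A @ x
active = np.flatnonzero(np.isclose(vals, vals.max()))   # Z(x): maximizing indices
w = rng.random(active.size)
w /= w.sum()                                            # random convex combination
g = w @ A[active]                                       # a subgradient by Danskin's theorem
for _ in range(1000):
    y = rng.normal(size=2)
    assert f(y) >= f(x) + g @ (y - x) - 1e-12
print("subgradient inequality verified for g =", g)
```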
Operations that preserve convexity
Nonnegative weighted sums
$$
f = w_1f_1+\cdots+w_mf_m,\quad w_i\ge 0
$$

Proof. Since the image of a convex set under a linear mapping is convex, we have

$$
\operatorname{epi}(wf) = \begin{bmatrix}
\mathbf{I} &\mathbf{0}\\
\mathbf{0}^\top &w
\end{bmatrix}\operatorname{epi}f
$$
Affine mapping
If $f$ is convex, then $g(\mathbf{x}) = f(\mathbf{Ax}+\mathbf{b})$ is also convex.
Pointwise maximum and supremum
$$
f(\mathbf{x}) = \max\{f_1(\mathbf{x}), f_2(\mathbf{x}) \},\quad \operatorname{dom} f = \operatorname{dom}f_1\cap\operatorname{dom}f_2
$$

The max function can be extended to $f_1,\cdots,f_m$.

Sum of the $r$ largest components: with $x_{[1]}\ge\cdots\ge x_{[n]}$,

$$
f(\mathbf{x}) = \sum_{i=1}^r x_{[i]}
$$

As for the proof, the function can be viewed as the maximum of all $\binom{n}{r}$ "sum of $r$ elements" functions, each of which is linear.
Support function of a set
$S_C(\mathbf{x}) = \sup\{\langle\mathbf{x},\mathbf{y}\rangle\mid\mathbf{y}\in C\}$ is the pointwise supremum of a family of linear functions, hence convex. For a unit vector $\mathbf{x}$, it gives the (signed) distance from the origin to the supporting hyperplane of $C$ with normal $\mathbf{x}$.
Distance to farthest point of a set
$f(\mathbf{x}) = \sup_{\mathbf{y}\in C}\|\mathbf{x}-\mathbf{y}\|$ (pointwise supremum of the convex functions $\|\mathbf{x}-\mathbf{y}\|$, one for each $\mathbf{y}\in C$).
Least-squares cost
$g(\mathbf{w}) = \inf_\mathbf{x}\sum_{i=1}^nw_i(\langle\mathbf{a}_i,\mathbf{x}\rangle-b_i)^2$ is the infimum of a family of functions that are linear in $\mathbf{w}$, hence concave.
Norm of a matrix
$f(\mathbf{X}) = \sup\{\mathbf{u^\top Xv}\mid \|\mathbf{u}\|_2=1,\|\mathbf{v}\|_2=1\}$, a pointwise supremum of functions that are linear in $\mathbf{X}$.
Composition
For the composition $f(\mathbf{x}) = h(g(\mathbf{x}))$:

$f$ is convex if $h$ is convex, $\tilde{h}$ is nondecreasing, and $g$ is convex.

$f$ is convex if $h$ is convex, $\tilde{h}$ is nonincreasing, and $g$ is concave.

Note that we need the extended-value extension $\tilde{h}$ (taking the value $\infty$ outside $\operatorname{dom}h$) to be nondecreasing or nonincreasing.
Minimization
If $f$ is convex (jointly in $(\mathbf{x},\mathbf{y})$) and $C$ is a nonempty convex set, then $g(\mathbf{x})=\inf_{\mathbf{y}\in C} f(\mathbf{x}, \mathbf{y})$ is convex.

Proof. $\operatorname{epi}g$ is convex because it is the projection onto the $(\mathbf{x},t)$ coordinates of the convex set $\{(\mathbf{x},\mathbf{y},t)\in\operatorname{epi}f\mid\mathbf{y}\in C\}$.

Example: $\operatorname{dist}(\mathbf{x}, S)=\inf_{\mathbf{y}\in S}\|\mathbf{x}-\mathbf{y}\|$ is convex if $S$ is convex, because $\|\mathbf{x}-\mathbf{y}\|$ is convex in $(\mathbf{x},\mathbf{y})$.
Perspective of a function
For $f:\R^n\to\R$, the perspective of $f$ is $g(\mathbf{x}, t)=tf(\mathbf{x}/t)$, with $\operatorname{dom}g=\{(\mathbf{x},t)\mid \mathbf{x}/t\in\operatorname{dom}f, t>0\}$.

The perspective preserves convexity, because $(\mathbf{x},t,s)\in \operatorname{epi}g$ means $tf(\mathbf{x}/t)\le s$, i.e. $f(\mathbf{x}/t)\le s/t$, which is $(\mathbf{x}/t, s/t)\in \operatorname{epi} f$.

Example: the perspective of the convex function $f(x)=-\log x$ on $\R_{++}$ is $g(x,t)=-t\log(x/t)=t\log t-t\log x$, which is convex on $\R_{++}^2$. $g$ is called the relative entropy.
🌟 Conjugate function
For $f:\R^n\to\R$,

$$
f^{*}(\mathbf{y})=\sup _{\mathbf{x} \in \operatorname{dom} f}(\langle\mathbf{y}, \mathbf{x}\rangle-f(\mathbf{x}))
=\sup _{\mathbf{x} \in \operatorname{dom} f}\left(\begin{array}{c}
\mathbf{y} \\
-1
\end{array}\right)^{T}\left(\begin{array}{c}
\mathbf{x} \\
f(\mathbf{x})
\end{array}\right)
=\sup _{(\mathbf{x}, t) \in \operatorname{epi} f}\left(\begin{array}{c}
\mathbf{y} \\
-1
\end{array}\right)^{T}\left(\begin{array}{c}
\mathbf{x} \\
t
\end{array}\right)
$$
It is the support function of the epigraph (equivalently, a supremum of affine functions of $\mathbf{y}$), so it is always convex. The domain $\operatorname{dom}f^*$ is the set of $\mathbf{y}$ for which the supremum is finite.

The conjugate $f^*(\mathbf{y})$ is the maximum gap between $\langle\mathbf{y},\mathbf{x}\rangle$ and $f(\mathbf{x})$. If $f$ is differentiable, the supremum is attained at a point $\mathbf{x}^*$ with $\nabla f(\mathbf{x}^*)=\mathbf{y}$ (when such a point exists).
Example. 1. Determine the domain of $y$. 2. Take the derivative to find the $x$ at which the supremum is attained.

$f(x)=e^x$.

$xy-e^x$ is bounded above only when $y\ge 0$.

Taking the derivative gives $y-e^x=0$, i.e. $x=\log y$ (for $y>0$). Substituting, we get $f^*(y) = y\log y-y$.

Note that in the case $y=0$, $\sup_x (-e^x)=0$, so we can interpret $0\log 0=0$.
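A grid-based check of this conjugate (a minimal numerical sketch, not part of the notes; the grid range is an arbitrary assumption):

```python
import numpy as np

# approximate f*(y) = sup_x (x*y - e^x) on a grid and compare with y*log(y) - y
xs = np.linspace(-20.0, 5.0, 500001)
for y in [0.5, 1.0, 3.0]:
    approx = np.max(xs * y - np.exp(xs))
    exact = y * np.log(y) - y
    print(f"y={y}:  grid sup = {approx:.6f},  y*log(y)-y = {exact:.6f}")
```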
Some special conjugate functions
Indicator function
The indicator function of any set $S$ is $I_S(\mathbf{x})=0$ on $\operatorname{dom}I_S=S$. Its conjugate is $I_S^*(\mathbf{y}) = \sup_{\mathbf{x}\in S}\langle\mathbf{y},\mathbf{x}\rangle$, which is the support function of $S$.

If $S$ is a closed convex set, then the conjugate of the support function is the indicator function of $S$.
Log Determinant
$f(\mathbf{X})=\log\det \mathbf{X}^{-1}$ on $\mathbb{S}_{++}^n$, with $f^*(\mathbf{Y})=\sup_{\mathbf{X}\succ 0}(\operatorname{tr}(\mathbf{Y^\top X})+\log\det\mathbf{X})$.

We first show the domain is $\mathbf{Y}\prec\mathbf{0}$. If $\mathbf{Y}\not\prec \mathbf{0}$, then it has a unit eigenvector $\mathbf{v}$, $\|\mathbf{v}\|_2=1$, with eigenvalue $\lambda\ge 0$. Take $\mathbf{X}=\mathbf{I}+t\mathbf{vv^\top}$; as $t\to+\infty$,

$$
\operatorname{tr}(\mathbf{Y^\top X})+\log\det\mathbf{X} = \operatorname{tr}(\mathbf{Y})+t\lambda + \log(1+t)\to +\infty
$$

Now take the derivative:

$$
\nabla_{\mathbf{X}}(\operatorname{tr}(\mathbf{Y^\top X})+\log\det\mathbf{X}) = \mathbf{Y}+\mathbf{X}^{-1}=\mathbf{0}\Rightarrow \mathbf{X}=-\mathbf{Y}^{-1}
$$

Thus $f^*(\mathbf{Y}) = \log\det(-\mathbf{Y}^{-1})-n$, with $\operatorname{dom}f^*=-\mathbb{S}_{++}^n$.
Norm
The conjugate of $f(\mathbf{x})=\|\mathbf{x}\|$ is $f^*(\mathbf{y})=I_B(\mathbf{y})$, where $B=\{\mathbf{y}\mid\|\mathbf{y}\|^*\le 1 \}$ is the unit ball of the dual norm.
Fenchel’s Inequality
$$
f(\mathbf{x})+f^*(\mathbf{y})\ge \langle\mathbf{x},\mathbf{y}\rangle, \quad\forall \mathbf{x},\mathbf{y}
$$

Example. $\frac{1}{2}\|\mathbf{x}\|^{2}+\frac{1}{2}\|\mathbf{y}\|_{*}^{2} \geq\langle\mathbf{x}, \mathbf{y}\rangle$.
Conjugate of conjugate
If $f$ is proper, convex, and closed, then $f^{**}=f$.

For any $f$, $f^{**}$ is the largest closed convex function not exceeding $f$, i.e. the convex envelope of $f$.

If $f$ is convex and differentiable, then $\mathbf{x}^*$ maximizes $\langle \mathbf{y},\mathbf{x}\rangle-f(\mathbf{x})$ iff $\mathbf{y}=\nabla f(\mathbf{x}^*)$.

Hence, if we can solve $\mathbf{y}=\nabla f(\mathbf{z})$ for $\mathbf{z}$, then we can compute $f^*(\mathbf{y}) = \langle\mathbf{z},\nabla f(\mathbf{z})\rangle-f(\mathbf{z})$.

Affine transformation: for $a>0$, $b\in \R$, the conjugate of $g(x)=af(x)+b$ is $g^*(y)=af^*(y/a)-b$.

Example. $g(\mathbf{x}) = f(\mathbf{Ax}+\mathbf{b})$ with $\mathbf{A}$ invertible has $g^*(\mathbf{y}) = f^*(\mathbf{A}^{-\top }\mathbf{y})-\mathbf{b^\top A^{-\top}y}$.

Separable: if $f(\mathbf{u},\mathbf{v})=f_1(\mathbf{u})+f_2(\mathbf{v})$ where $f_1$ and $f_2$ are convex, then $f^*(\mathbf{w},\mathbf{z})=f_1^*(\mathbf{w})+f_2^*(\mathbf{z})$.
Envelope function and Proximal mapping
$f:\R^n\to(-\infty,+\infty]$ is a closed proper function. For a scalar $c>0$, define:

$$
\begin{aligned}
\operatorname{Env}_{c} f(\mathbf{x}) &=\inf _{\mathbf{w}}\left\{f(\mathbf{w})+\frac{1}{2 c}\|\mathbf{w}-\mathbf{x}\|^{2}\right\} \\
\operatorname{Prox}_{c} f(\mathbf{x}) &=\underset{\mathbf{w}}{\operatorname{Argmin}}\left\{f(\mathbf{w})+\frac{1}{2 c}\|\mathbf{w}-\mathbf{x}\|^{2}\right\}
\end{aligned}
$$
Motivation. We are attempting to reduce the value of $f$ without straying too far from $\mathbf{x}$. You can imagine that this would be a useful sub-step in an optimization algorithm. The parameter $c$ can be viewed as a "step size" that controls how far from $\mathbf{x}$ we are allowed to move. When $c$ is very small, the penalty for moving away is severe.

Figure caption. $\operatorname{Prox}_{c_k} f(\mathbf{x}_{k})=\mathbf{x}_{k+1}$. At the point $x=x_{k+1}$ we have $f(x)+\frac{1}{2c_k}\|x-x_k\|^2=\gamma_k = \operatorname{Env}_{c_k}f(x_k)$. (Take the downward parabola $-\frac{1}{2c_k}\|x-x_k\|^2$; the least amount $\gamma_k$ by which it must be shifted upward to touch the graph of $f$ is the envelope value.) The slope of the parabola at the touching point is exactly a subgradient of $f$ at $x_{k+1}$.
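A minimal numerical sketch of these two objects (not part of the notes; the choice $f(w)=|w|$ and the grid approximation of the infimum are assumptions for illustration):

```python
import numpy as np

f = lambda w: np.abs(w)                      # assumed example: f(w) = |w|
c = 0.5
ws = np.linspace(-5.0, 5.0, 100001)          # grid used to approximate the infimum

def env_and_prox(x):
    vals = f(ws) + (ws - x) ** 2 / (2 * c)   # f(w) + ||w - x||^2 / (2c)
    i = np.argmin(vals)
    return vals[i], ws[i]                    # (Env_c f(x), Prox_c f(x))

for x in [-2.0, 0.2, 3.0]:
    env, prox = env_and_prox(x)
    print(f"x={x:+.1f}  Env={env:.3f} <= f(x)={f(x):.3f}  Prox={prox:+.3f}")
```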
Infimal convolution (compare with ordinary convolution):

$$
(f \boxtimes g)(\mathbf{x})=\inf _{\mathbf{y}+\mathbf{z}=\mathbf{x}}\{ f(\mathbf{y})+g(\mathbf{z})\},\qquad
(f \otimes g)(\mathbf{x})=\iint _{\mathbf{y}+\mathbf{z}=\mathbf{x}} f(\mathbf{y})g(\mathbf{z})\,d\mathbf{y}\,d\mathbf{z}
$$

In particular, $\operatorname{Env}_{c} f$ is the infimal convolution of $f$ with $\frac{1}{2c}\|\cdot\|^2$.
Taking $\mathbf{w}=\mathbf{x}$ shows that $\operatorname{Env}_{c} f(\mathbf{x})\le f(\mathbf{x})$.

$\mathbf{u}=\operatorname{Prox}_{c} f(\mathbf{x}) \Leftrightarrow \mathbf{x}-\mathbf{u} \in c \partial f(\mathbf{u})$, by the optimality condition $\mathbf{0}\in\partial f(\mathbf{u})+\frac{1}{c}(\mathbf{u}-\mathbf{x})$ and the definition of the subgradient.
Usage of proximal mapping
Let $T=I-\alpha\nabla f$ and $\tilde{T} = (I+\alpha\partial f)^{-1}$. We have updates in two ways:

$$
\text{Forward: }\mathbf{x}_{k+1}=T(\mathbf{x}_k)=\mathbf{x}_{k}-\alpha\nabla f(\mathbf{x}_{k})
$$

$$
\text{Backward: }\mathbf{x}_{k+1}=\tilde{T}(\mathbf{x}_k)=\arg\min_\mathbf{x}f(\mathbf{x})+\frac{1}{2\alpha}\|\mathbf{x}-\mathbf{x}_{k}\|^2,\qquad
\mathbf{x}_k-\mathbf{x}_{k+1} \in \alpha \partial f(\mathbf{x}_{k+1})
$$
The backward step gives sufficient descent, which encourages convergence (assuming $f$ is bounded below):

$$
f(\mathbf{x}_{k+1})-f(\mathbf{x}_k)\le -\frac{1}{2\alpha}\|\mathbf{x}_{k+1}-\mathbf{x}_k\|^2
\ \Rightarrow\ \sum_{k=0}^{+\infty}\|\mathbf{x}_{k+1}-\mathbf{x}_k\|^2<+\infty
$$

Under the KL condition, we further have $\sum_{k=0}^{+\infty}\|\mathbf{x}_{k+1}-\mathbf{x}_k\|<+\infty$.
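The following toy one-dimensional comparison (a sketch under assumed parameters, not from the notes) uses $f(x)=\frac{\lambda}{2}x^2$, for which the backward step has the closed form $x_{k+1}=x_k/(1+\alpha\lambda)$; it also checks the sufficient-descent inequality for the backward step.

```python
import numpy as np

lam, alpha = 10.0, 0.5                        # assumed toy problem: f(x) = lam/2 * x^2
f = lambda x: 0.5 * lam * x ** 2

x_fwd = x_bwd = 5.0
for k in range(20):
    x_fwd = x_fwd - alpha * lam * x_fwd       # forward step: diverges here since alpha*lam > 2
    x_prev, x_bwd = x_bwd, x_bwd / (1 + alpha * lam)   # backward (proximal) step, closed form
    # sufficient descent of the backward step
    assert f(x_bwd) - f(x_prev) <= -(x_bwd - x_prev) ** 2 / (2 * alpha) + 1e-9
print("forward iterate:", x_fwd, " backward iterate:", x_bwd)
```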
Example
$f(\mathbf{x}) = I_C(\mathbf{x})$:

$$
\operatorname{Prox}_{c} f(\mathbf{x}) = \arg\min_{\mathbf{y}}\ I_C(\mathbf{y})+\frac{1}{2c}\|\mathbf{y}-\mathbf{x}\|^2 = \arg\min_{\mathbf{y}\in C} \|\mathbf{y}-\mathbf{x}\|^2=P_C(\mathbf{x})
$$
Halfspace $C=\{\mathbf{x}\mid\mathbf{a^\top x}\le b \}$

Project onto the boundary hyperplane via the Lagrangian:

$$
\min_\mathbf{y}\frac{1}{2}\|\mathbf{y}-\mathbf{x}\|^2\quad \mathrm{s.t.}\ \mathbf{a^\top y}= b,\qquad
L(\mathbf{y},\lambda)=\frac{1}{2}\|\mathbf{y}-\mathbf{x}\|^2+\lambda(\mathbf{a^\top y}-b)
$$

$$
\left\{\begin{array}{l}
(\mathbf{y}-\mathbf{x})+\lambda\mathbf{a}=\mathbf{0}\\
\mathbf{a^\top y}=b
\end{array}
\right.
\Rightarrow \mathbf{y}=\mathbf{x}-\lambda\mathbf{a},\quad\lambda=\frac{\mathbf{a}^{T} \mathbf{x}-b}{\|\mathbf{a}\|^{2}}
$$

If $\mathbf{x}\in C$, take $\mathbf{y}=\mathbf{x}$. If not, take the projection of $\mathbf{x}$ onto the boundary hyperplane:

$$
P_{C}(\mathbf{x})=\left\{\begin{array}{ll}
\mathbf{x}+\frac{b-\mathbf{a}^{T} \mathbf{x}}{\|\mathbf{a}\|^{2}} \mathbf{a}, & \text { if } \mathbf{a}^{T} \mathbf{x}>b \\
\mathbf{x}, & \text { if } \mathbf{a}^{T} \mathbf{x} \leq b
\end{array}\right.
$$
Hyperbox $C=\{\mathbf{x}\mid\mathbf{l}\le \mathbf{x}\le\mathbf{u} \}$:

$$
\left(P_{C}(\mathbf{x})\right)_{i}=\left\{\begin{array}{ll}
l_{i}, & \text { if } x_{i} \leq l_{i} \\
x_{i}, & \text { if } l_{i} \leq x_{i} \leq u_{i} \\
u_{i}, & \text { if } x_{i} \geq u_{i}
\end{array}\right.
$$
Nonnegative orthant $C=\R_+^n$:

$$
P_C(\mathbf{x}) = \max(\mathbf{x},\mathbf{0})
$$
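The three projections above are simple enough to implement directly; a minimal sketch (the test vectors are arbitrary assumptions):

```python
import numpy as np

def proj_halfspace(x, a, b):
    # P_C for C = {x | a^T x <= b}
    viol = a @ x - b
    return x if viol <= 0 else x - viol / (a @ a) * a

def proj_box(x, l, u):
    # P_C for C = {x | l <= x <= u}, componentwise clipping
    return np.clip(x, l, u)

def proj_nonneg(x):
    # P_C for C = R_+^n
    return np.maximum(x, 0.0)

x = np.array([3.0, -1.0])
print(proj_halfspace(x, np.array([1.0, 1.0]), 1.0))                 # -> [ 2.5 -1.5]
print(proj_box(x, np.array([0.0, 0.0]), np.array([1.0, 1.0])))      # -> [1. 0.]
print(proj_nonneg(x))                                               # -> [3. 0.]
```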
Properties
Separable
If $f(\mathbf{u},\mathbf{v}) = f_1(\mathbf{u})+f_2(\mathbf{v})$, then $\operatorname{Prox}_{c} f(\mathbf{u}, \mathbf{v})=\left(\operatorname{Prox}_{c} f_{1}(\mathbf{u}), \operatorname{Prox}_{c} f_{2}(\mathbf{v})\right)$.
Scaling
For $a>0$, $b\in\R$, and $g(x)=f(ax+b)$:

$$
\operatorname{Prox}_{c} g(x)=a^{-1}\left(\operatorname{Prox}_{c a^{2}} f(a x+b)-b\right)
$$

Suppose $g(\mathbf{x})=f(\mathbf{A} \mathbf{x}+\mathbf{b})$, where $\mathbf{A} \in \mathbb{R}^{n \times m}$ satisfies $\mathbf{A} \mathbf{A}^{T}=\lambda^{-1} \mathbf{I}$ ($\lambda>0$) and $\mathbf{b} \in \mathbb{R}^{n}$. Then

$$
\operatorname{Prox}_{c} g(\mathbf{x})=\left(\mathbf{I}-\lambda \mathbf{A}^{T} \mathbf{A}\right) \mathbf{x}+\lambda \mathbf{A}^{T}\left(\operatorname{Prox}_{c \lambda^{-1}} f(\mathbf{A} \mathbf{x}+\mathbf{b})-\mathbf{b}\right)
$$
Gradient
For $f$ closed, proper, convex, and $c>0$,

$$
\nabla \operatorname{Env}_{c} f(\mathbf{x})=\frac{1}{c}\left(\mathbf{x}-\operatorname{Prox}_{c} f(\mathbf{x})\right)
$$
🌟 Moreau Decomposition
$$
\mathbf{x}=\operatorname{Prox}_{c} f(\mathbf{x})+c \operatorname{Prox}_{c^{-1}} f^{*}\left(c^{-1} \mathbf{x}\right)
$$
Proof. Let $\mathbf{u}=\operatorname{Prox}_{c} f(\mathbf{x})$. Then

$$
\begin{aligned}
\mathbf{u}=\operatorname{Prox}_{c} f(\mathbf{x}) & \Longleftrightarrow-c^{-1}(\mathbf{u}-\mathbf{x}) \in \partial f(\mathbf{u}) \\
& \Longleftrightarrow \mathbf{u} \in \partial f^{*}\left(-c^{-1}(\mathbf{u}-\mathbf{x})\right)
\end{aligned}
$$

Let $\mathbf{z}=-c^{-1}(\mathbf{u}-\mathbf{x})$. Then $\mathbf{u}=\mathbf{x}-c \mathbf{z}$, and thus

$$
\begin{aligned}
& \Longleftrightarrow \mathbf{x}-c \mathbf{z} \in \partial f^{*}(\mathbf{z}) \\
& \Longleftrightarrow \mathbf{0} \in \partial f^{*}(\mathbf{z})+c\left(\mathbf{z}-c^{-1} \mathbf{x}\right) \\
& \Longleftrightarrow \mathbf{z}=\operatorname{Prox}_{c^{-1}} f^{*}\left(c^{-1} \mathbf{x}\right)
\end{aligned}
$$

So $\mathbf{x}=\mathbf{u}+c \mathbf{z}=\operatorname{Prox}_{c} f(\mathbf{x})+c \operatorname{Prox}_{c^{-1}} f^{*}\left(c^{-1} \mathbf{x}\right)$.
Example: proximal mapping of norm
We know that if $f(\mathbf{x})=\|\mathbf{x}\|$, then $f^{*}(\mathbf{y})=I_{\mathcal{B}}(\mathbf{y})$, where $\mathcal{B}$ is the unit ball of the dual norm $\|\cdot\|_{*}$. Then, by the Moreau decomposition,

$$
\begin{aligned}
\operatorname{Prox}_{c} f(\mathbf{x}) &=\mathbf{x}-c \operatorname{Prox}_{c^{-1}} f^{*}(\mathbf{x} / c) \\
&=\mathbf{x}-c P_{\mathcal{B}}(\mathbf{x} / c) \\
&=\mathbf{x}-P_{c \mathcal{B}}(\mathbf{x})
\end{aligned}
$$
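For the $\ell_1$ norm, the dual ball is the $\ell_\infty$ unit ball, so the formula gives componentwise soft-thresholding; a minimal sketch (the test vector is an arbitrary assumption):

```python
import numpy as np

def prox_l1(x, c):
    # Prox_c ||.||_1 (x) = x - P_{cB}(x), where B is the unit l_inf ball
    return x - np.clip(x, -c, c)

x = np.array([3.0, -0.2, 0.8, -2.5])
print(prox_l1(x, 1.0))   # soft-thresholding: entries with |x_i| <= c are set to 0
```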