I am taking Convex Optimization this semester (2021 Spring). The notes are based on my understanding of the course materials provided by Prof. Zhouchen Lin. The notes only offer concise definitions and formulas, with few proofs and examples.

This chapter is the last of the basics; in the next chapter we will start to introduce specific algorithms. We first define convex functions, how to judge convexity, and properties related to convexity. Next, we discuss operations that preserve convexity, just like the previous chapter. Finally, we introduce the conjugate function, the envelope function, and the proximal mapping, three important concepts for future use.

Convex Function#

Basic definitions and examples#

Convex function#

A function $f:\R^n\to\R$ is convex if $\operatorname{dom} f$ is convex and for all $\mathbf{x, y}\in \operatorname{dom} f$, and $0\le \theta\le 1$,

$f(\theta\mathbf{x}+(1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x})+ (1-\theta) f(\mathbf{y})$

A function is strictly convex if strict inequality holds whenever $\mathbf{x}\ne \mathbf{y}$ and $0<\theta<1$.

A function is $\mu$-strongly convex if

$f(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x})+(1-\theta) f(\mathbf{y})-\frac{\theta(1-\theta) \mu}{2}\|\mathbf{y}-\mathbf{x}\|^{2}, \quad \forall \theta \in[0,1]$

Extended-value extensions#

Corollary. A convex function is differentiable almost everywhere on the relative interior of its domain.

Define the extended-value extension of a convex function $f$ as: $\tilde{f}:\R^n\to\R\cup\{\infty\}$

$\tilde{f}(\mathbf{x})=\left\{\begin{array}{ll} f(\mathbf{x}), & \mathbf{x} \in \operatorname{dom} f \\ \infty, & \mathbf{x} \notin \operatorname{dom} f \end{array}\right.$

• Indicator function of a convex set $C$

$\tilde{I}_C(\mathbf{x})=\left\{\begin{array}{ll} 0, & \mathbf{x} \in C \\ \infty, & \mathbf{x} \notin C \end{array}\right.$

Now we have:

$\min_\mathbf{x}f(\mathbf{x})\quad \text{s.t.}\ \mathbf{x}\in C\quad\Leftrightarrow\quad \min_\mathbf{x}f(\mathbf{x})+\tilde{I}_C(\mathbf{x})$
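As a quick numerical illustration of this equivalence (a minimal sketch; the quadratic $f(x)=(x-2)^2$, the interval $C=[-1,1]$, and the grid search standing in for a real solver are all made-up choices), the two formulations pick the same minimizer:

```python
import numpy as np

# Hypothetical example: minimize f(x) = (x - 2)^2 subject to x in C = [-1, 1].
f = lambda x: (x - 2.0) ** 2
in_C = lambda x: (-1.0 <= x) & (x <= 1.0)

xs = np.linspace(-3.0, 3.0, 6001)          # crude grid search in place of a solver
mask = in_C(xs)

# Constrained form: minimize over the feasible grid points only.
x_constrained = xs[mask][np.argmin(f(xs[mask]))]

# Unconstrained form: add the extended-value indicator (infinity outside C).
x_indicator = xs[np.argmin(np.where(mask, f(xs), np.inf))]

print(x_constrained, x_indicator)          # both 1.0, the feasible point nearest 2
```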

Proper function#

$f$ is proper if $f(\mathbf{x})<\infty$ for at least one $\mathbf{x}$ and $f(\mathbf{x})>-\infty$ for all $\mathbf{x}$.

A function is proper iff its epigraph is nonempty and does not contain a vertical line.

First-order conditions (to prove convexity)#

Suppose $f$ is differentiable. Then $f$ is convex iff $\operatorname{dom} f$ is convex and for all $\mathbf{x, y}\in \operatorname{dom} f$,

$f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle$

This indicates that from local information of a convex function, we can derive some global information (a global under-estimator). For instance, if $\nabla f(\mathbf{x})=0$, then $\mathbf{x}$ is a global minimizer.

Proof.

1. $\Rightarrow$ Given $f$ is convex.

\begin{aligned} f((1-\alpha) \mathbf{x}+\alpha \mathbf{y}) \leq&(1-\alpha) f(\mathbf{x})+\alpha f(\mathbf{y})\\ f(\mathbf{y})\ge &f(\mathbf{x})+\frac{f(\mathbf{x}+\alpha( \mathbf{y}-\mathbf{x}))-f(\mathbf{x}) }{\alpha}\\ f(\mathbf{y}) \geq& f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle,\quad\alpha\to0^+ \end{aligned}

2. $\Leftarrow$ Given the inequality $f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle$.

Apply the inequality with base point $\alpha \mathbf{x}+(1-\alpha) \mathbf{y}$ and second point $\mathbf{x}$, then $\mathbf{y}$, respectively:

$\begin{array}{l} f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}) \leq f(\mathbf{x})-(1-\alpha)\langle\nabla f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}), \mathbf{x}-\mathbf{y}\rangle \\ f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}) \leq f(\mathbf{y})+\alpha\langle\nabla f(\alpha \mathbf{x}+(1-\alpha) \mathbf{y}), \mathbf{x}-\mathbf{y}\rangle \end{array}$

Multiply the first inequality by $\alpha$ and the second by $1-\alpha$ and add them; we get $f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x})+ (1-\alpha) f(\mathbf{y})$.

$f$ is strictly convex iff, for all $\mathbf{x}\ne\mathbf{y}$,

$f(\mathbf{y}) > f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle$

Note that in the proof of $\Rightarrow$, when we take the limit $\alpha\to0^+$, we can only get $f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle$ but not $>$. So the proof should be altered a little:

\begin{aligned} f((1-\alpha) \mathbf{x}+\alpha \mathbf{y}) <&(1-\alpha) f(\mathbf{x})+\alpha f(\mathbf{y})\\ f(\mathbf{y})> &f(\mathbf{x})+\frac{f(\mathbf{x}+\alpha( \mathbf{y}-\mathbf{x}))-f(\mathbf{x}) }{\alpha}\\ \text{Since }f(\mathbf{x}+\alpha( \mathbf{y}-\mathbf{x}))\ge &f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \alpha(\mathbf{y}-\mathbf{x})\rangle\text{ by convexity,}\\ f(\mathbf{y}) >& f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle \end{aligned}

$f$ is $\mu$-strongly convex iff $f(\mathbf{x})-\frac{\mu}{2}\|\mathbf{x}\|^2$ is convex, i.e.

$f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle+\frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|^2$
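These inequalities are easy to probe numerically. Below is a minimal sketch for the basic first-order condition (the PSD quadratic $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^\top\mathbf{Q}\mathbf{x}$ is an illustrative choice, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative convex function: f(x) = 0.5 x^T Q x with Q = A^T A (PSD).
A = rng.standard_normal((4, 3))
Q = A.T @ A
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

# First-order condition: f(y) >= f(x) + <grad f(x), y - x> for all x, y.
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
print("tangent plane under-estimated f on 1000 random pairs")
```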

Second-order conditions#

Suppose that $f$ is twice differentiable. Then $f$ is convex iff $\operatorname{dom} f$ is convex and its Hessian is positive semidefinite, i.e. for all $\mathbf{x}\in \operatorname{dom} f$,

$\nabla^2f(\mathbf{x})\succeq0$

Note that the earlier remark "$\nabla f(\mathbf{x})=\mathbf{0}$ implies a global minimizer" relies on convexity: for a general function, a saddle point also satisfies $\nabla f=\mathbf{0}$.

Strictly convex: $\nabla^2f(\mathbf{x})\succ0$ (sufficient but not necessary; e.g. $f(x)=x^4$ is strictly convex yet $f''(0)=0$)

Strongly convex: $\nabla^2f(\mathbf{x})\succeq\mu\mathbf{I}$

🌟 Why do we need the condition ‘$\operatorname{dom}f$ is convex’?

e.g. $f(x)=1/x^2$ with $\operatorname{dom} f=\{x\in\R\mid x\ne 0\}$ satisfies $f''(x)>0$ for all $x\in\operatorname{dom} f$, but it is not a convex function.

Examples#

Functions on $\R$#

• Exponential: $e^{ax}$ is convex on $\R$ for any $a\in\R$
• Powers: $x^a$ on $\R_{++}$ is convex when $a\in(-\infty,0]\cup[1,+\infty)$, concave when $0\le a\le 1$
• Powers of absolute value: $|x|^p,p\ge 1$
• Logarithm: $\log x$ is concave on $\R_{++}$
• Negative entropy: $x\log x$ is convex on $\R_{+}$ (with $f(0)=0$)

Functions on $\R^n$#

• Norms

Can be proved by the triangle inequality together with absolute homogeneity:

$f(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq f(\theta \mathbf{x})+f((1-\theta) \mathbf{y})=\theta f(\mathbf{x})+(1-\theta) f(\mathbf{y})$

• Max function: $f(\mathbf{x}) = \max\{x_1,\cdots,x_n\}$. Prove by definition.

• Quadratic over linear: $f(x,y)=x^2/y,\operatorname{dom}f=\R\times\R_{++}$

$\nabla^{2} f(x, y)=\frac{2}{y^{3}}\left[\begin{array}{cc} y^{2} & -x y \\ -x y & x^{2} \end{array}\right]=\frac{2}{y^{3}}\left[\begin{array}{c} y \\ -x \end{array}\right]\left[\begin{array}{c} y \\ -x \end{array}\right]^{T} \succeq \mathbf{0}$

• Log-sum-exp: $f(\mathbf{x}) = \log(e^{x_1}+\cdots+e^{x_n})$

A differentiable approximation of the max function:

$\max\{x_1,\cdots,x_n\}\le f(\mathbf{x}) \le \max\{x_1,\cdots,x_n\}+\log n$

Prove by $\nabla^2f(\mathbf{x})\succeq0$:

$\nabla^{2} f(\mathbf{x})=\frac{1}{\langle\mathbf{1}, \mathbf{z}\rangle^{2}}\left(\langle\mathbf{1}, \mathbf{z}\rangle \operatorname{diag}(\mathbf{z})-\mathbf{z z}^{T}\right) ,\mathbf{z}=\left(e^{x_{1}}, \ldots, e^{x_{n}}\right)\\ \forall\mathbf{v},\quad \mathbf{v}^{T} \nabla^{2} f(\mathbf{x}) \mathbf{v}=\frac{1}{\langle\mathbf{1}, \mathbf{z}\rangle^{2}}\left(\left(\sum_{i=1}^{n} z_{i}\right)\left(\sum_{i=1}^{n} v_{i}^{2} z_{i}\right)-\left(\sum_{i=1}^{n} v_{i} z_{i}\right)^{2}\right) \geq 0$

The last inequality is derived from Cauchy-Schwarz inequality: $\langle\mathbf{a},\mathbf{a}\rangle\cdot \langle\mathbf{b},\mathbf{b}\rangle\ge \langle\mathbf{a},\mathbf{b}\rangle^2, a_i=\sqrt{z_i}, b_i=v_i\sqrt{z_i}$.
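A numerical sanity check of both facts (a sketch in NumPy; note that $\operatorname{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^\top$ with softmax weights $\mathbf{p}=\mathbf{z}/\langle\mathbf{1},\mathbf{z}\rangle$ is just the Hessian formula above rewritten):

```python
import numpy as np

def lse(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())          # numerically stable LSE

def lse_hessian(x):
    p = np.exp(x - x.max())
    p /= p.sum()                                    # softmax weights
    return np.diag(p) - np.outer(p, p)              # the Hessian formula above

rng = np.random.default_rng(1)
for _ in range(100):
    x = 3.0 * rng.standard_normal(5)
    assert np.linalg.eigvalsh(lse_hessian(x)).min() >= -1e-12     # PSD
    assert x.max() <= lse(x) <= x.max() + np.log(x.size) + 1e-12  # bounds
print("Hessian PSD and max <= LSE <= max + log n on 100 random points")
```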

• Geometric mean: $f(\mathbf{x})=\left(\prod_{i=1}^n x_i\right)^{1/n}$ is concave on $\operatorname{dom}f=\R_{++}^n$

• Log-determinant: $f(\mathbf{X})=\log\operatorname{det}\mathbf{X}$ is concave on $\operatorname{dom}f=\mathbb{S}^n_{++}$

Consider an arbitrary line $g(t)=f(\mathbf{Z}+t\mathbf{V})$, where $\mathbf{Z}\in\mathbb{S}^n_{++}$ and $\mathbf{V}\in\mathbb{S}^n$, with $t$ restricted to values such that $\mathbf{X}=\mathbf{Z}+t\mathbf{V}\succ 0$. Below, $\lambda_i$ are the eigenvalues of $\mathbf{Z}^{-1/2}\mathbf{V}\mathbf{Z}^{-1/2}$.

\begin{aligned} g(t) &=\log \operatorname{det}(\mathbf{Z}+t \mathbf{V}) \\ &=\log \operatorname{det}\left(\mathbf{Z}^{1 / 2}\left(\mathbf{I}+t \mathbf{Z}^{-1 / 2} \mathbf{V} \mathbf{Z}^{-1 / 2}\right) \mathbf{Z}^{1 / 2}\right) \\ &=\sum_{i=1}^{n} \log \left(1+t \lambda_{i}\right)+\log \operatorname{det} \mathbf{Z} \end{aligned}\\ g^{\prime}(t)=\sum_{i=1}^{n} \frac{\lambda_{i}}{1+t \lambda_{i}} \\ g^{\prime \prime}(t)=-\sum_{i=1}^{n} \frac{\lambda_{i}^{2}}{\left(1+t \lambda_{i}\right)^{2}}\le 0

Basic properties#

Sublevels#

The $\alpha$-sublevel set of a function $f:\R^n\to\R$ is defined as

$C_\alpha=\{\mathbf{x}\in\operatorname{dom}f\mid f(\mathbf{x})\le\alpha\}$

Sublevel sets of a convex function are convex for any $\alpha$. The converse is not true: e.g. the concave function $f(x)=-e^x$ has convex sublevel sets.

To establish convexity of a set, we can express it as a sublevel set of a convex function, or as the superlevel set $\{\mathbf{x}\in\operatorname{dom}f\mid f(\mathbf{x})\ge\alpha\}$ of a concave function.

Example:

$G(\mathbf{x})=\left(\prod_{i=1}^{n} x_{i}\right)^{1 / n}, \quad A(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} x_{i} \\ \{\mathbf{x}\in\R_+^n\mid G(\mathbf{x})\ge\alpha A(\mathbf{x})\}$

It is the 0-sublevel set of $\alpha A(\mathbf{x})-G(\mathbf{x})$, which is a convex function ($A$ is linear and $-G$ is convex since $G$ is concave). So the set is convex.

Epigraph#

$\operatorname{epi}f=\{(\mathbf{x},t)\mid\mathbf{x}\in\operatorname{dom}f,f(\mathbf{x})\le t\}$

A function is convex iff its epigraph is a convex set.

We can interpret the first-order condition geometrically via the epigraph.

If $(\mathbf{y},t)\in\operatorname{epi} f$, then

$t \geq f(\mathbf{y}) \geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle\\ (\mathbf{y}, t) \in \operatorname{epi} f \Longrightarrow\left[\begin{array}{c} \nabla f(\mathbf{x}) \\ -1 \end{array}\right]^{T}\left(\left[\begin{array}{l} \mathbf{y} \\ t \end{array}\right]-\left[\begin{array}{c} \mathbf{x} \\ f(\mathbf{x}) \end{array}\right]\right) \leq 0$

Thus, the hyperplane with normal $(\nabla f(\mathbf{x}), -1)$ is a supporting hyperplane of the epigraph at the boundary point $(\mathbf{x},f(\mathbf{x}))$.

Jensen’s Inequality#

$f(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x})+(1-\theta) f(\mathbf{y}) \\ f\left(\theta_{1} \mathbf{x}_{1}+\cdots+\theta_{k} \mathbf{x}_{k}\right) \leq \theta_{1} f\left(\mathbf{x}_{1}\right)+\cdots+\theta_{k} f\left(\mathbf{x}_{k}\right) \\ f\left(\int_{S} p(\mathbf{x}) \mathbf{x} d \mathbf{x}\right) \leq \int_{S} f(\mathbf{x}) p(\mathbf{x}) d \mathbf{x} \\ f(\mathbb{E} \mathbf{x}) \leq \mathbb{E} f(\mathbf{x})$

Thus, adding a zero-mean random vector $\mathbf{z}$ will not decrease the value of a convex function on average:

$\mathbb{E}f(\mathbf{x}+\mathbf{z})\ge f(\mathbf{x})$

Inequalities#

Many inequalities can be derived by applying Jensen’s Inequality to some convex function.

• Arithmetic-geometric mean inequality ($f(x)=-\log x, \theta=1/2$)

$\sqrt{ab}\le(a+b)/2$

• Hölder’s Inequality ($p,q>1$, $1/p+1/q=1$)

$\sum_{i=1}^{n} x_{i} y_{i} \leq\left(\sum_{i=1}^{n}\left|x_{i}\right|^{p}\right)^{1 / p}\left(\sum_{i=1}^{n}\left|y_{i}\right|^{q}\right)^{1 / q}$

Proof.

Still by the convexity of $f(x)=-\log x$, we have the weighted AM–GM inequality:

$a^{\theta} b^{1-\theta} \leq \theta a+(1-\theta) b\\ \begin{array}{c} a=\frac{\left|x_{i}\right|^{p}}{\sum_{j=1}^{n}\left|x_{j}\right|^{p}}, \quad b=\frac{\left|y_{i}\right|^{q}}{\sum_{j=1}^{n}\left|y_{j}\right|^{q}}, \quad \theta=1 / p, \\ \left(\frac{\left|x_{i}\right|^{p}}{\sum_{j=1}^{n}\left|x_{j}\right|^{p}}\right)^{1 / p}\left(\frac{\left|y_{i}\right|^{q}}{\sum_{j=1}^{n}\left|y_{j}\right|^{q}}\right)^{1 / q} \leq \frac{\left|x_{i}\right|^{p}}{p \sum_{j=1}^{n}\left|x_{j}\right|^{p}}+\frac{\left|y_{i}\right|^{q}}{q \sum_{j=1}^{n}\left|y_{j}\right|^{q}} \end{array}$

Summing over $i=1,\dots,n$ makes the right-hand side $1/p+1/q=1$, which rearranges to Hölder’s inequality.
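A brute-force numerical check of the inequality (a sketch; the exponent $p=3$ and the dimension 6 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3.0
q = p / (p - 1.0)                    # conjugate exponent, 1/p + 1/q = 1

for _ in range(1000):
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    lhs = np.sum(x * y)
    rhs = np.sum(np.abs(x) ** p) ** (1 / p) * np.sum(np.abs(y) ** q) ** (1 / q)
    assert lhs <= rhs + 1e-12
print("Holder's inequality held on 1000 random draws")
```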

Bregman distance#

The Bregman distance is defined for a differentiable, strictly convex function $f:C\to\R$, where $C\subset\R^n$ is a convex set:

$B_f(\mathbf{y},\mathbf{x})=f(\mathbf{y})-f(\mathbf{x})-\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle$

By the first-order condition, $B_f(\mathbf{y},\mathbf{x})\ge 0$; note that it may not be symmetric: $B_f(\mathbf{y},\mathbf{x})\ne B_f(\mathbf{x},\mathbf{y})$ in general.

Subgradient#

The subdifferential of $f$ at $\mathbf{x}$ is the set of subgradients

$\partial f(\mathbf{x})=\{\mathbf{g} \mid f(\mathbf{y}) \geq f(\mathbf{x})+\langle\mathbf{g}, \mathbf{y}-\mathbf{x}\rangle, \forall \mathbf{y} \in \operatorname{dom} f\}$

This generalizes the gradient: each subgradient gives a non-vertical supporting hyperplane of the epigraph. The directional derivative is the support function of the subdifferential:

$$f'(\mathbf{x};\mathbf{u})=\max_{\mathbf{g}\in\partial f(\mathbf{x})}\langle\mathbf{u},\mathbf{g}\rangle$$

Examples#

• $|x|$

$\partial|x| = \left\{\begin{array}{ll} 1,&x>0\\ -1, &x<0\\ [-1,1],&x=0 \end{array} \right.$

• $\max\{0, \frac{1}{2}(x^2-1)\}$

$\partial f(x) = \left\{\begin{array}{ll} x,&|x|>1\\ 0, &|x|<1\\ [-1,0],&x=-1\\ [0,1],&x=1 \end{array} \right.$

• $I_C(\mathbf{x})=\left\{\begin{array}{ll} 0,&\mathbf{x}\in C\\ +\infty, &\mathbf{x}\notin C \end{array} \right.$

By the subgradient definition, for $\mathbf{x}\in C$, $\mathbf{g}\in\partial I_C(\mathbf{x})$ iff $I_C(\mathbf{y})\ge I_C(\mathbf{x})+\langle \mathbf{g},\mathbf{y}-\mathbf{x}\rangle$ for all $\mathbf{y}$, i.e.

$0\ge \langle \mathbf{g}, \mathbf{y}-\mathbf{x}\rangle,\quad\forall \mathbf{y}\in C$

When $\mathbf{x}\in C^\circ$, the direction $\mathbf{y}-\mathbf{x}$ can be arbitrary, so $\mathbf{g}=\mathbf{0}$.

When $\mathbf{x}\in\partial C$, the directions $\mathbf{y}-\mathbf{x}$ generate the tangent cone, so $\partial I_C(\mathbf{x})$ is the normal cone of $C$ at $\mathbf{x}$.

Properties#

• Continuity

$\left. \begin{array}{l} x_k\to x^*\\ g_k\in\partial f(x_k), g_k\to g^* \end{array} \right\}\Rightarrow g^*\in\partial f(x^*)$

• Chain Rule

$F(\mathbf{x})=f(\mathbf{Ax}) \Rightarrow\partial F(\mathbf{x}) = \mathbf{A}^\top\partial f(\mathbf{Ax})$; for a scalar composition $F=h\circ f$, $F'(\mathbf{x};\mathbf{y}) = h'(f(\mathbf{x}))f'(\mathbf{x};\mathbf{y})$

In a real Hilbert space endowed with an inner product and a norm, we have

$\partial \|\mathbf{x}\| = \{\mathbf{y}\mid \langle\mathbf{y},\mathbf{x}\rangle=\|\mathbf{x}\|,\|\mathbf{y}\|^*\le 1 \}$

Proof. Let $S =\{\mathbf{y}\mid \langle\mathbf{y},\mathbf{x}\rangle=\|\mathbf{x}\|,\|\mathbf{y}\|^*\le 1 \}$.

1. For every $\mathbf{y}\in\partial\|\mathbf{x}\|$, we have

$\|\mathbf{w}\|-\|\mathbf{x}\|\ge \langle\mathbf{y}, \mathbf{w}-\mathbf{x}\rangle,\forall \mathbf{w}$

Choosing $\mathbf{w}=\mathbf{0}$ and $\mathbf{w}=2\mathbf{x}$ gives $\|\mathbf{x}\|=\langle\mathbf{y},\mathbf{x}\rangle$.

$\|\mathbf{w}-\mathbf{x}\|\ge\|\mathbf{w}\|-\|\mathbf{x}\|\ge \langle\mathbf{y}, \mathbf{w}-\mathbf{x}\rangle,\forall \mathbf{w}\\ \left\langle\mathbf{y},\frac{\mathbf{w}-\mathbf{x}}{\|\mathbf{w}-\mathbf{x}\|}\right\rangle\le 1,\forall\mathbf{w}\ne\mathbf{x}$

Therefore $\|\mathbf{y}\|^*\le 1$, so $\partial\|\mathbf{x}\|\subset S$.

2. For every $\mathbf{y}\in S$,

$\langle\mathbf{y}, \mathbf{w}-\mathbf{x}\rangle=\langle\mathbf{y}, \mathbf{w}\rangle-\langle\mathbf{y}, \mathbf{x}\rangle=\langle\mathbf{y}, \mathbf{w}\rangle-\|\mathbf{x}\| \leq\|\mathbf{y}\|^{*}\|\mathbf{w}\|-\|\mathbf{x}\| \leq\|\mathbf{w}\|-\|\mathbf{x}\|,\forall \mathbf{w}$

Therefore $\mathbf{y}\in\partial\|\mathbf{x}\|$, so $S\subset\partial \|\mathbf{x}\|$.

• Danskin’s Theorem

Let $\mathcal{Z}$ be a compact subset of $\R^m$, and let $\phi:\R^n\times\mathcal{Z}\to \R$ be continuous with $\phi(\cdot,\mathbf{z}):\R^n\to\R$ convex for each $\mathbf{z}\in\mathcal{Z}$. Define

$f(\mathbf{x}) = \max_{\mathbf{z}\in\mathcal{Z}}\phi(\mathbf{x},\mathbf{z})\\ \mathcal{Z}(\mathbf{x}) = \arg\max_{\mathbf{z}\in\mathcal{Z}}\phi(\mathbf{x},\mathbf{z})$

Theorem: If $\phi(\cdot,\mathbf{z})$ is differentiable for all $\mathbf{z}$ and $\nabla_x\phi(\mathbf{x},\cdot)$ is continuous on $\mathcal{Z}$, then

$\partial f(\mathbf{x}) = \operatorname{conv}\{\nabla_x\phi(\mathbf{x},\mathbf{z})\mid \mathbf{z}\in\mathcal{Z}(\mathbf{x}) \}$
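For a concrete instance (a sketch with a finite $\mathcal{Z}$: $f(\mathbf{x})=\max_i\langle\mathbf{a}_i,\mathbf{x}\rangle$ over the rows of a random matrix; the matrix and test points are assumptions of this example, not part of the theorem):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))       # rows a_i; phi(x, i) = <a_i, x>
f = lambda v: (A @ v).max()

def subgradient(x, tol=1e-9):
    """Danskin: partial f(x) = conv{a_i : i attains the max}. Return one
    element of that convex hull (the mean of the active gradients)."""
    vals = A @ x
    return A[vals >= vals.max() - tol].mean(axis=0)

x = rng.standard_normal(3)
g = subgradient(x)
for _ in range(1000):                 # verify f(y) >= f(x) + <g, y - x>
    y = rng.standard_normal(3)
    assert f(y) >= f(x) + g @ (y - x) - 1e-9
print("Danskin subgradient verified on 1000 test points")
```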

Operations that preserve convexity#

Nonnegative weighted sums#

$f = w_1f_1+\cdots+w_mf_m, w_i\ge 0$

Proof. Since the image of a convex set under a linear mapping is convex, we have

$\operatorname{epi}(wf) = \begin{bmatrix} \mathbf{I} &\mathbf{0}\\ \mathbf{0}^\top &w \end{bmatrix}\operatorname{epi}f$

Affine mapping#

If $f$ is convex, then $g(\mathbf{x}) = f(\mathbf{Ax}+\mathbf{b})$ is also convex.

Pointwise maximum and supremum#

$f(\mathbf{x}) = \max\{f_1(\mathbf{x}), f_2(\mathbf{x}) \}, \operatorname{dom} f = \operatorname{dom}f_1\cap\operatorname{dom}f_2$

The max function can be extended to $f_1,\cdots,f_m$.

• Sum of $r$ largest components:

$x_{[1]}\ge\cdots\ge x_{[n]}\\ f(\mathbf{x}) = \sum_{i=1}^r x_{[i]}$

As for the proof, $f$ is the pointwise maximum of the $\binom{n}{r}$ linear functions that sum $r$ of the components; a quick numeric check follows.
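A tiny check that the two views agree (a sketch; the dimension and $r$ are arbitrary):

```python
import numpy as np
from itertools import combinations

def sum_r_largest(x, r):
    return np.sort(x)[-r:].sum()

def max_of_linear(x, r):
    # Pointwise maximum of the C(n, r) linear "sum of r components" functions.
    return max(x[list(idx)].sum() for idx in combinations(range(x.size), r))

x = np.random.default_rng(4).standard_normal(6)
print(sum_r_largest(x, 3), max_of_linear(x, 3))   # identical values
```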

• Support function of a set

$S_C(\mathbf{x}) = \sup\{\langle\mathbf{x},\mathbf{y}\rangle\mid\mathbf{y}\in C\}$ is the pointwise supremum of a family of linear functions, hence convex. For a unit vector $\mathbf{x}$, it gives the signed distance from the origin to the supporting hyperplane of $C$ with normal $\mathbf{x}$.

• Distance to farthest point of a set

$f(\mathbf{x}) = \sup_{\mathbf{y}\in C}\|\mathbf{x}-\mathbf{y}\|$ (Pointwise supremum of convex function $\|\mathbf{x}-\mathbf{y}\|$.)

• Least-squares cost

$g(\mathbf{w}) = \inf_\mathbf{x}\sum_{i=1}^nw_i(\langle\mathbf{a}_i,\mathbf{x}\rangle-b_i)^2$ is the infimum of a family of linear functions of $\mathbf{w}$, hence concave.

• Norm of a matrix

$f(\mathbf{X}) = \sup\{\mathbf{u^\top Xv}\mid \|\mathbf{u}\|_2=1,\|\mathbf{v}\|_2=1\}$ (the spectral norm, i.e. the largest singular value of $\mathbf{X}$) is a pointwise supremum of linear functions of $\mathbf{X}$, hence convex.

Composition#

For the composition $f(\mathbf{x}) = h(g(\mathbf{x}))$:

• $f$ is convex if $h$ is convex, $\tilde{h}$ is nondecreasing, and $g$ is convex.
• $f$ is convex if $h$ is convex, $\tilde{h}$ is nonincreasing, and $g$ is concave.

Note that we need the extended-value extension $\tilde{h}$ ($\infty$ for points outside $\operatorname{dom}h$) to be nonincreasing or nondecreasing, not just $h$ itself.

Minimization#

If $f$ is convex in $(\mathbf{x},\mathbf{y})$ jointly and $C$ is a nonempty convex set, then $g(\mathbf{x})=\inf_{\mathbf{y}\in C} f(\mathbf{x}, \mathbf{y})$ is convex.

Proof. $\operatorname{epi}g=\{(\mathbf{x},t)\mid (\mathbf{x},\mathbf{y},t)\in\operatorname{epi}f,\mathbf{y}\in C\}$ is convex because it is the projection of a convex subset of $\operatorname{epi} f$ onto the $(\mathbf{x},t)$-coordinates.

Example: $\operatorname{dist}(\mathbf{x}, S)=\inf_{\mathbf{y}\in S}\|\mathbf{x}-\mathbf{y}\|$ is convex if $S$ is convex, because $\|\mathbf{x}-\mathbf{y}\|$ is convex in $(\mathbf{x},\mathbf{y})$.

Perspective of a function#

$f:\R^n\to\R$, the perspective of $f$ is $g(\mathbf{x}, t)=tf(\mathbf{x}/t)$, $\operatorname{dom}g=\{(\mathbf{x},t)\mid \mathbf{x}/t\in\operatorname{dom}f, t>0\}$.

Perspective preserves convexity, because $(\mathbf{x},t,s)\in \operatorname{epi}g$ means $tf(\mathbf{x}/t)\le s$, i.e. $f(\mathbf{x}/t)\le s/t$, which is $(\mathbf{x}/t, s/t)\in \operatorname{epi} f$; thus $\operatorname{epi}g$ is the inverse image of the convex set $\operatorname{epi}f$ under a perspective mapping.

Example: The perspective of the convex function $f(x)=-\log x$ on $\R_{++}$ is $g(x,t)=-t\log(x/t)=t\log t-t\log x$, which is convex on $\R_{++}^2$. $g$ is called the relative entropy.

🌟 Conjugate function#

For $f:\R^n\to\R$, the conjugate function is

$f^{*}(\mathbf{y})=\sup _{\mathbf{x} \in \operatorname{dom} f}(\langle\mathbf{y}, \mathbf{x}\rangle-f(\mathbf{x})) =\sup _{\mathbf{x} \in \operatorname{dom} f}\left(\begin{array}{c} \mathbf{y} \\ -1 \end{array}\right)^{T}\left(\begin{array}{c} \mathbf{x} \\ f(\mathbf{x}) \end{array}\right) =\sup _{(\mathbf{x}, t) \in \operatorname{epi} f}\left(\begin{array}{c} \mathbf{y} \\ -1 \end{array}\right)^{T}\left(\begin{array}{c} \mathbf{x} \\ t \end{array}\right)$

It is the support function of the epigraph (or say, supremum of affine functions), so it is always convex. The domain is where the supremum is finite.

The conjugate $f^*(\mathbf{y})$ is the maximum gap between the linear function $\langle\mathbf{y},\mathbf{x}\rangle$ and $f(\mathbf{x})$. If $f$ is differentiable, the supremum is attained at $x^*$ satisfying $\nabla f(x^*)=\mathbf{y}$ (when such a point exists).

Example recipe: 1. Determine the domain of $f^*$, i.e. the set of $\mathbf{y}$ for which the supremum is finite. 2. Take the derivative to find the $\mathbf{x}$ at which the supremum is attained.

• $f(x)=e^x$.

$xy-e^x$ is bounded above only when $y\ge 0$.

Taking the derivative, $y-e^x=0$, so $x=\log y$. Substituting back gives $f^*(y) = y\log y-y$.

Note that in the case $y=0$, $\sup_x (-e^x)=0$, so we can interpret $0\log 0=0$.
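The recipe can be cross-checked by brute force (a sketch; the grid $[-10,10]$ is an assumption, wide enough for the tested values of $y$):

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 200001)
f_conj = lambda y: np.max(y * xs - np.exp(xs))      # numeric sup of xy - e^x

for y in [0.5, 1.0, 2.0, 5.0]:
    print(y, f_conj(y), y * np.log(y) - y)          # the two columns agree
```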

Some special conjugate functions#

Indicator function#

The indicator function of any set $S$ is $I_S(\mathbf{x})=0$ on $\operatorname{dom}I_S=S$. Its conjugate is $I_S^*(\mathbf{y}) = \sup_{\mathbf{x}\in S}\langle\mathbf{y},\mathbf{x}\rangle$, which is the support function of $S$.

If $S$ is a closed convex set, then the conjugate of its support function is the indicator function of $S$.

Log Determinant#

$f(\mathbf{X})=\log\det \mathbf{X}^{-1}$ on $\mathbb{S}_{++}^n$, $f^*(\mathbf{Y})=\sup_{\mathbf{X}\succ 0}\left(\operatorname{tr}(\mathbf{Y^\top X})+\log\det\mathbf{X}\right)$.

We first show that the domain is $\mathbf{Y}\prec0$. If $\mathbf{Y}\not\prec 0$, then it has a unit eigenvector $\mathbf{v}$, $\|\mathbf{v}\|_2=1$, with eigenvalue $\lambda\ge 0$. Take $\mathbf{X}=\mathbf{I}+t\mathbf{vv^\top}$:

$\operatorname{tr}(\mathbf{Y^\top X})+\log\det\mathbf{X} = \operatorname{tr}(\mathbf{Y})+t\lambda + \log(1+t)\to +\infty\quad\text{as } t\to+\infty$

Now take the derivative:

$\nabla_{\mathbf{X}}(\operatorname{tr}(\mathbf{Y^\top X})+\log\det\mathbf{X}) = \mathbf{Y}+\mathbf{X}^{-1}=0\Rightarrow \mathbf{X}=-\mathbf{Y}^{-1}$

Thus, $f^*(\mathbf{Y}) = \log\det(-\mathbf{Y}^{-1})-n, \operatorname{dom}f^*=-\mathbb{S}_{++}^n$.

Norm#

Conjugate function of $f(\mathbf{x})=\|\mathbf{x}\|$ is $f^*(\mathbf{y})=I_B(\mathbf{y}),B=\{\mathbf{y}\mid\|\mathbf{y}\|^*\le 1 \}$.

Fenchel’s Inequality#

$f(\mathbf{x})+f^*(\mathbf{y})\ge \langle\mathbf{x},\mathbf{y}\rangle, \forall \mathbf{x},\mathbf{y}$

Example. $\frac{1}{2}\|\mathbf{x}\|^{2}+\frac{1}{2}\|\mathbf{y}\|_{*}^{2} \geq\langle\mathbf{x}, \mathbf{y}\rangle$.

Conjugate of conjugate#

• If $f$ is proper, convex, and closed, then $f^{**}=f$.
• For any $f$, $f^{**}$ is the largest convex function not exceeding $f$, i.e. the convex envelope of $f$.

• If $f$ is convex and differentiable, then $x^*$ maximizes $\langle y,x\rangle-f(x)$ iff $y=\nabla f(x^*)$.

Hence, if we can solve $y=\nabla f(z)$ for $z$, we can compute $f^*(y) = \langle z, y\rangle - f(z) = \langle z, \nabla f(z)\rangle-f(z)$.

• Affine: for $a>0, b\in \R$, the conjugate of $g(x)=af(x)+b$ is $g^*(y)=af^*(y/a)-b$.

Example. $g(\mathbf{x}) = f(\mathbf{Ax}+\mathbf{b})$ with $\mathbf{A}$ invertible: $g^*(\mathbf{y}) = f^*(\mathbf{A}^{-\top }\mathbf{y})-\mathbf{b^\top A^{-\top}y}$.

• Separable: $f(u,v)=f_1(u)+f_2(v)$ where $f_1$ and $f_2$ are convex, then $f^*(w,z)=f_1^*(w)+f_2^*(z)$.

Envelope function and Proximal mapping#

$f:\R^n\to(-\infty,+\infty]$ is a closed proper function. For a scalar $c>0$, define the envelope function and the proximal mapping:

\begin{aligned} \operatorname{Env}_{c} f(\mathbf{x}) &=\inf _{\mathbf{w}}\left\{f(\mathbf{w})+\frac{1}{2 c}\|\mathbf{w}-\mathbf{x}\|^{2}\right\} \\ \operatorname{Prox}_{c} f(\mathbf{x}) &=\underset{\mathbf{w}}{\operatorname{Argmin}}\left\{f(\mathbf{w})+\frac{1}{2 c}\|\mathbf{w}-\mathbf{x}\|^{2}\right\} . \end{aligned}

Motivation. We are attempting to reduce the value of $f$ without straying too far from $\mathbf{x}$. You can imagine that this would be a useful sub-step in an optimization algorithm. The parameter $c$ can be viewed as a “step size” that controls how far from $\mathbf{x}$ we are allowed to move. When $c$ is very small, the penalty for moving away is severe.

(Figure caption.) $\operatorname{Prox}_{c} f(\mathbf{x}_{k})=\mathbf{x}_{k+1}$: at the point $x=x_{k+1}$, we have $f(x)+\frac{1}{2c_k}\|x-x_k\|^2=\gamma_k = \operatorname{Env}_cf(x_k)$. Geometrically, take the downward parabola $-\frac{1}{2c_k}\|x-x_k\|^2$; the least amount $\gamma_k$ by which it must be shifted upward to touch the graph of $f$ is the envelope, and the slope at the touching point is exactly the subgradient.

Infimal convolution (compare with ordinary convolution):

$(f \boxtimes g)(\mathbf{x})=\inf _{\mathbf{y}+\mathbf{z}=\mathbf{x}}\{f(\mathbf{y})+g(\mathbf{z})\},\qquad (f \otimes g)(\mathbf{x})=\int f(\mathbf{y})g(\mathbf{x}-\mathbf{y})d\mathbf{y}$

In this notation, $\operatorname{Env}_{c} f = f\boxtimes\frac{1}{2c}\|\cdot\|^{2}$.

• Taking $\mathbf{w}=\mathbf{x}$ shows $\operatorname{Env}_{c} f(\mathbf{x})\le f(\mathbf{x})$.
• $\mathbf{u}=\operatorname{Prox}_{c} f(\mathbf{x}) \Leftrightarrow \mathbf{x}-\mathbf{u} \in c \partial f(\mathbf{u})$, by the optimality condition and the definition of the subgradient.

Usage of proximal mapping#

1. Let $T=I-\alpha\nabla f$ and $\tilde{T} = (I+\alpha\nabla f)^{-1}$. This gives two kinds of updates:

$\text{Forward: }\mathbf{x}_{k+1}=T(\mathbf{x}_k)=\mathbf{x}_{k}-\alpha\nabla f(\mathbf{x}_{k})\\ \text{Backward: }\mathbf{x}_{k+1}=\tilde{T}(\mathbf{x}_k)=\arg\min_\mathbf{x}f(\mathbf{x})+\frac{1}{2\alpha}\|\mathbf{x}-\mathbf{x}_{k}\|^2\\ \mathbf{x}_k-\mathbf{x}_{k+1} \in \alpha \partial f(\mathbf{x}_{k+1})$

2. Sufficient descent, which encourages convergence. Since $\mathbf{x}_{k+1}$ minimizes $f(\mathbf{x})+\frac{1}{2\alpha}\|\mathbf{x}-\mathbf{x}_k\|^2$, comparing its objective value with that of $\mathbf{x}=\mathbf{x}_k$ gives

$f(\mathbf{x}_{k+1})-f(\mathbf{x}_k)\le -\frac{1}{2\alpha}\|\mathbf{x}_{k+1}-\mathbf{x}_k\|^2\\ \Rightarrow \sum_{k=0}^{+\infty}\|\mathbf{x}_{k+1}-\mathbf{x}_k\|^2<+\infty\quad\text{(if $f$ is bounded below)}$

Under the KL (Kurdyka–Łojasiewicz) condition, the stronger bound $\sum_{k=0}^{+\infty}\|\mathbf{x}_{k+1}-\mathbf{x}_k\|<+\infty$ holds.
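A minimal sketch of the backward (proximal-point) update on $f(x)=|x|$, whose prox is the soft-thresholding operator (a known closed form; the step size $\alpha=0.3$ and starting point are arbitrary), checking the sufficient-descent inequality at every step:

```python
import numpy as np

alpha = 0.3
prox = lambda x: np.sign(x) * max(abs(x) - alpha, 0.0)   # Prox_alpha |.|
f = abs

x = 5.0
while x != 0.0:
    x_new = prox(x)
    # sufficient descent: f(x_{k+1}) - f(x_k) <= -|x_{k+1} - x_k|^2 / (2 alpha)
    assert f(x_new) - f(x) <= -((x_new - x) ** 2) / (2 * alpha) + 1e-12
    x = x_new
print("converged to", x)          # hits the minimizer 0 in finitely many steps
```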

Example#

• $f(\mathbf{x}) = I_C(\mathbf{x})$

$\operatorname{Prox}_{c} f(\mathbf{x}) = \arg\min_{\mathbf{y}} f(\mathbf{y})+\frac{1}{2c}\|\mathbf{y}-\mathbf{x}\|^2 = \arg\min_{\mathbf{y}\in C} \|\mathbf{y}-\mathbf{x}\|^2=P_C(\mathbf{x})$

• Halfspace $C=\{\mathbf{x}\mid\mathbf{a^\top x}\le b \}$

$\min_\mathbf{y}\frac{1}{2}\|\mathbf{y}-\mathbf{x}\|^2\quad \text{s.t.}\ \mathbf{a^\top y}= b\\ L(\mathbf{y},\lambda)=\frac{1}{2}\|\mathbf{y}-\mathbf{x}\|^2+\lambda(\mathbf{a^\top y}-b)\\ \left\{\begin{array}{l} (\mathbf{y}-\mathbf{x})+\lambda\mathbf{a}=0\\ \mathbf{a^\top y}=b \end{array} \right. \Rightarrow \mathbf{y}=\mathbf{x}-\lambda\mathbf{a},\quad\lambda=\frac{\mathbf{a}^{T} \mathbf{x}-b}{\|\mathbf{a}\|^{2}}$

If $\mathbf{x}\in C$, take $\mathbf{y}=\mathbf{x}$. If not, project $\mathbf{x}$ onto the boundary hyperplane $\{\mathbf{y}\mid\mathbf{a^\top y}=b\}$ as above:

$P_{\mathcal{C}}(\mathbf{x})=\left\{\begin{array}{ll} \mathbf{x}+\frac{b-\mathbf{a}^{T} \mathbf{x}}{\|\mathbf{a}\|^{2}} \mathbf{a}, & \text { if } \mathbf{a}^{T} \mathbf{x}>b \\ \mathbf{x}, & \text { if } \mathbf{a}^{T} \mathbf{x} \leq b \end{array}\right.$

• Hyperbox $C=\{\mathbf{x}\mid\mathbf{l}\le \mathbf{x}\le\mathbf{u} \}$

$\left(P_{\mathcal{C}}(\mathbf{x})\right)_{i}=\left\{\begin{array}{ll} l_{i}, & \text { if } x_{i} \leq l_{i} \\ x_{i}, & \text { if } l_{i} \leq x_{i} \leq u_{i} \\ u_{i}, & \text { if } x_{i} \geq u_{i} \end{array}\right.$

• Nonnegative orthant: $C=\R_+^n$

$P_C(\mathbf{x}) = \max(\mathbf{x},\mathbf{0})$
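The three projections above translate directly into code (a minimal sketch; the test vectors are arbitrary):

```python
import numpy as np

def proj_halfspace(x, a, b):
    """Project onto C = {x : a^T x <= b}."""
    viol = a @ x - b
    return x if viol <= 0 else x - (viol / (a @ a)) * a

def proj_box(x, l, u):
    """Project onto C = {x : l <= x <= u} (componentwise clipping)."""
    return np.clip(x, l, u)

def proj_orthant(x):
    """Project onto the nonnegative orthant."""
    return np.maximum(x, 0.0)

x = np.array([2.0, -1.5, 0.5])
a, b = np.ones(3), 0.0
p = proj_halfspace(x, a, b)
print(p, a @ p)                        # a^T p = 0: projected onto the boundary
print(proj_box(x, -1.0, 1.0))          # [ 1.  -1.   0.5]
print(proj_orthant(x))                 # [2.  0.  0.5]
```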

Properties#

• Separable

$f(\mathbf{u},\mathbf{v}) = f_1(\mathbf{u})+f_2(\mathbf{v})$. $\operatorname{Prox}_{c} f(\mathbf{u}, \mathbf{v})=\left(\operatorname{Prox}_{c} f_{1}(\mathbf{u}), \operatorname{Prox}_{c} f_{2}(\mathbf{v})\right)$.

• Scaling

$a>0,b\in\R,g(x)=f(ax+b)$.

$\operatorname{Prox}_{c} g(x)=a^{-1}\left(\operatorname{Prox}_{c a^{2}} f(a x+b)-b\right)$

Suppose $g(\mathbf{x})=f(\mathbf{A} \mathbf{x}+\mathbf{b})$, where $\mathbf{A} \in \mathbb{R}^{n \times m}$ satisfies $\mathbf{A} \mathbf{A}^{T}=\lambda^{-1} \mathbf{I}(\lambda>0)$
and $\mathbf{b} \in \mathbb{R}^{n}$. Then

$\operatorname{Prox}_{c} g(\mathbf{x})=\left(\mathbf{I}-\lambda \mathbf{A}^{T} \mathbf{A}\right) \mathbf{x}+\lambda \mathbf{A}^{T}\left(\operatorname{Prox}_{c \lambda^{-1}} f(\mathbf{A} \mathbf{x}+\mathbf{b})-\mathbf{b}\right)$

• Gradient of the envelope: for $f$ closed, proper, and convex, and $c>0$,

$\nabla \operatorname{Env}_{c} f(\mathbf{x})=\frac{1}{c}\left(\mathbf{x}-\operatorname{Prox}_{c} f(\mathbf{x})\right)$
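For $f(x)=|x|$, $\operatorname{Prox}_c f$ is soft-thresholding and $\operatorname{Env}_c f$ is the Huber function (both known closed forms), so the identity can be checked against finite differences (a minimal sketch; $c=0.7$ and the test points are arbitrary):

```python
import numpy as np

c = 0.7
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - c, 0.0)   # soft-threshold
env = lambda x: np.where(np.abs(x) <= c, x**2 / (2 * c), np.abs(x) - c / 2)

for x in [-2.0, -0.3, 0.1, 1.5]:
    analytic = (x - prox(x)) / c                   # grad Env_c f per the identity
    h = 1e-6
    numeric = (env(x + h) - env(x - h)) / (2 * h)  # central finite difference
    print(x, analytic, float(numeric))             # the two columns agree
```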

• 🌟 Moreau Decomposition

$\mathbf{x}=\operatorname{Prox}_{c} f(\mathbf{x})+c \operatorname{Prox}_{c^{-1}} f^{*}\left(c^{-1} \mathbf{x}\right)$

Proof. Let $\mathbf{u}=\operatorname{Prox}_{c} f(\mathbf{x})$. Then

\begin{aligned} \mathbf{u}=\operatorname{Prox}_{c} f(\mathbf{x}) & \Longleftrightarrow-c^{-1}(\mathbf{u}-\mathbf{x}) \in \partial f(\mathbf{u}) \\ & \Longleftrightarrow \mathbf{u} \in \partial f^{*}\left(-c^{-1}(\mathbf{u}-\mathbf{x})\right) \end{aligned}

Let $\mathbf{z}=-c^{-1}(\mathbf{u}-\mathbf{x})$. Then $\mathbf{u}=\mathbf{x}-c \mathbf{z}$ and thus

\begin{aligned} & \Longleftrightarrow \mathbf{x}-c \mathbf{z} \in \partial f^{*}(\mathbf{z}) \\ & \Longleftrightarrow \mathbf{0} \in \partial f^{*}(\mathbf{z})+c\left(\mathbf{z}-c^{-1} \mathbf{x}\right) \\ & \Longleftrightarrow \mathbf{z}=\operatorname{Prox}_{c^{-1}} f^{*}\left(c^{-1} \mathbf{x}\right) \\ \end{aligned}

So $\mathbf{x}=\mathbf{u}+c \mathbf{z}=\operatorname{Prox}_{c} f(\mathbf{x})+c \operatorname{Prox}_{c^{-1}} f^{*}\left(c^{-1} \mathbf{x}\right)$.

• Example: proximal mapping of norm

We know that if $f(\mathbf{x})=\|\mathbf{x}\|$, then $f^{*}(\mathbf{y})=I_{\mathcal{B}}(\mathbf{y})$, where $\mathcal{B}$ is the unit ball of the dual norm $\|\cdot\|_{*}$. Then by the Moreau decomposition:

\begin{aligned} \operatorname{Prox}_{c} f(\mathbf{x}) &=\mathbf{x}-c \operatorname{Prox}_{c^{-1}} f^{*}(\mathbf{x} / c) \\ &=\mathbf{x}-c P_{\mathcal{B}}(\mathbf{x} / c) \\ &=\mathbf{x}-P_{c \mathcal{B}}(\mathbf{x}) \end{aligned}
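To close the loop, the decomposition can be verified numerically for $f=\|\cdot\|_1$ (a sketch: its prox is componentwise soft-thresholding, $f^*$ is the indicator of the unit $\ell_\infty$ ball $\mathcal{B}$, and $\operatorname{Prox}_{c^{-1}} f^*$ is the projection $P_{\mathcal{B}}$, i.e. clipping; $c=0.8$ is arbitrary):

```python
import numpy as np

c = 0.8
soft = lambda x: np.sign(x) * np.maximum(np.abs(x) - c, 0.0)  # Prox_c ||.||_1
proj_B = lambda x: np.clip(x, -1.0, 1.0)                      # P_B = Prox of f*

x = np.random.default_rng(5).standard_normal(6) * 2.0
# Moreau decomposition: x = Prox_c f(x) + c * Prox_{1/c} f*(x / c)
print(np.allclose(x, soft(x) + c * proj_B(x / c)))            # True
```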