I am taking Convex Optimization this semester (2021 Spring). Here is the first chapter of my course notes. The notes are based on my understanding of the course materials provided by Prof. Zhouchen Lin. They offer only concise definitions and formulas, with few proofs and examples.

Mathematical Backgrounds#


Topology in $\mathbb{R}^n$#

  • Open, closed, bounded, compact

    • Open: $\forall x\in \mathcal{C}, \exists\epsilon>0, B_{\epsilon}(x)\subseteq \mathcal{C}$
    • Closed: $\mathcal{C}^c$ is open
    • Bounded: $\exists R>0$ such that $\|x\|<R$ for all $x\in \mathcal{C}$
    • Compact: bounded + closed
  • Interior, closure, boundary (a worked example follows this list)

    • Interior: $\mathcal{C}^\circ = \{x\mid \exists \epsilon>0, B_{\epsilon}(x)\subseteq \mathcal{C}\}$

      The interior consists of the points that satisfy the openness condition, so the interior of an open set $\mathcal{C}$ is $\mathcal{C}$ itself.

    • Closure: $\overline{\mathcal{C}} = \left((\mathcal{C}^c)^\circ\right)^c$

    • Boundary: $\partial \mathcal{C} = \overline{\mathcal{C}} \setminus \mathcal{C}^\circ$
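
To make these definitions concrete, here is a small worked example I added (not from the original notes), using the half-open interval $\mathcal{C}=[0,1)\subset\mathbb{R}$:

```latex
% C = [0,1) in R
% Interior: points with a small ball (interval) still inside C
\mathcal{C}^\circ = (0,1)
% Closure: complement of the interior of the complement; adds the limit point 1
\overline{\mathcal{C}} = [0,1]
% Boundary: closure minus interior
\partial \mathcal{C} = \overline{\mathcal{C}} \setminus \mathcal{C}^\circ = \{0,1\}
% C is neither open (no ball around 0 stays inside C) nor closed (1 is a limit
% point not in C); it is bounded but not compact.
```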

Analysis in $\mathbb{R}^n$#

  • Accumulation point

    • Def 1: $\forall \epsilon>0$, infinitely many terms of $\{x_k\}$ lie in $B_\epsilon(x^*)$
    • Def 2: $x^*$ is the limit of some subsequence of $\{x_k\}$
  • Bolzano–Weierstrass theorem

    Any bounded sequence in $\mathbb{R}^n$ contains a convergent subsequence.

    Proof sketch: the sequence lies in a bounded box; repeatedly bisect the box, always keeping a half that contains infinitely many terms. The nested halves shrink to a point, which is the limit of a subsequence.

  • Convergence rate

    When an algorithm needs infinitely many iterations to reach the solution, the convergence rate is used to assess how fast it approaches the solution. (A small numerical sketch for estimating the rate appears at the end of this section.)

    • Q-linear

      Assume $x_k\to x^{*}$ and define the sequence of errors $e_k = x_k-x^{*}$.

      The sequence $\{x_k\}$ Q-converges to $x^{*}$ with order $r$ and rate constant $C$ if

      $$\lim_{k\to \infty}\frac{\|e_{k+1}\|}{\|e_k\|^r} = C,\quad C<\infty$$

      • Linear: $r=1, 0<C<1$

        Sublinear: $r=1, C=1$

        Superlinear: $r=1, C=0$

      • Quadratic: $r=2$

    • R-linear

      When $\lim_{k\to \infty}\frac{\|e_{k+1}\|}{\|e_k\|^r}$ does not exist, use the R-linear definition instead.

      The sequence $\{x_k\}$ R-linearly converges to $x^{*}$ if $\|x_k-x^{*}\|\le e_k$ for some sequence $\{e_k\}$ that converges to $0$ Q-linearly.

  • Continuity

    $f$ is continuous at $x \in \operatorname{dom} f$ if $\forall \epsilon>0, \exists \delta>0$ such that $y\in \operatorname{dom} f,\ \|\mathbf{y}-\mathbf{x}\|_{2} \leq \delta \Rightarrow\|f(\mathbf{y})-f(\mathbf{x})\|_{2} \leq \epsilon$

  • Minimum vs minimal point $x^*$

    • Minimum: minimizes $f$ over all points $x$ in the domain
    • Minimal: minimizes $f$ over $B_\epsilon (x^*)\cap \operatorname{dom} f$ for some sufficiently small $\epsilon>0$
  • Closedness

    • $f$ is closed if, for each $\alpha \in \mathbb{R}$, the sublevel set $\{\mathbf{x} \in \operatorname{dom} f \mid f(\mathbf{x}) \leq \alpha\}$ is closed

    • Equivalently, $\operatorname{epi} f=\left\{(\mathbf{x}, t) \in \mathbb{R}^{n+1} \mid \mathbf{x} \in \operatorname{dom} f, f(\mathbf{x}) \leq t\right\}$ is closed

    • Continuous + closed domain $\Rightarrow$ closed function

      Continuous + open domain: closed iff $f\to \infty$ along every sequence converging to a boundary point of the domain
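
The following is a small numerical sketch I added (not in the original notes): it estimates the order $r$ and constant $C$ from an observed error sequence by fitting $\log\|e_{k+1}\| \approx \log C + r\log\|e_k\|$.

```python
# Sketch: estimate the Q-convergence order r and constant C from errors e_k = ||x_k - x*||.
import numpy as np

def estimate_rate(errors):
    """Fit log||e_{k+1}|| = log C + r log||e_k|| by least squares."""
    e = np.asarray(errors, dtype=float)
    r, logC = np.polyfit(np.log(e[:-1]), np.log(e[1:]), 1)
    return r, np.exp(logC)

# Linearly convergent example: e_k = 0.5^k  ->  r ~ 1, C ~ 0.5
linear = [0.5 ** k for k in range(1, 20)]
# Quadratically convergent example (Newton-like): e_{k+1} = e_k^2  ->  r ~ 2
quadratic = [0.1]
for _ in range(5):
    quadratic.append(quadratic[-1] ** 2)

print(estimate_rate(linear))     # approximately (1.0, 0.5)
print(estimate_rate(quadratic))  # approximately (2.0, 1.0)
```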

Derivative#

If $f:\mathbb{R}^n\to\mathbb{R}^m$, then $Df(\mathbf{x})= \mathbf{J}\in\mathbb{R}^{m\times n}$ if, for every sequence $\mathbf{z}\in \operatorname{dom} f$ approaching $\mathbf{x}$,

$$\lim _{\mathbf{z} \in \operatorname{dom} f, \mathbf{z} \neq \mathbf{x}, \mathbf{z} \rightarrow \mathbf{x}} \frac{\|f(\mathbf{z})-f(\mathbf{x})-\mathbf{J}(\mathbf{z}-\mathbf{x})\|_{2}}{\|\mathbf{z}-\mathbf{x}\|_{2}}=0$$

Letting $\mathbf{z} = \mathbf{x}+t\mathbf{e}_i$ and $t\to 0$ shows that the $i$-th column of $\mathbf{J}$ is $\frac{\partial f(\mathbf{x})}{\partial x_{i}}$, so

$$\mathbf{J}=\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}^{T}}=\left(\frac{\partial f_{i}(\mathbf{x})}{\partial x_{j}}\right), \quad 1\le i\le m,\ 1\le j\le n.$$
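
A quick sanity check I added (not in the original notes): the Jacobian of a toy map $f:\mathbb{R}^2\to\mathbb{R}^2$ (my own choice) compared against central finite differences, whose $i$-th column approximates $\partial f/\partial x_i$.

```python
# Sketch: verify a hand-computed Jacobian column by column with central differences.
import numpy as np

def f(x):
    return np.array([x[0] ** 2 * x[1], np.sin(x[1])])

def jacobian_analytic(x):
    # J_{ij} = d f_i / d x_j  (m x n, here 2 x 2)
    return np.array([[2 * x[0] * x[1], x[0] ** 2],
                     [0.0,             np.cos(x[1])]])

def jacobian_fd(f, x, h=1e-6):
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))  # i-th column ~ df/dx_i
    return np.stack(cols, axis=1)

x = np.array([1.0, 2.0])
print(np.allclose(jacobian_analytic(x), jacobian_fd(f, x), atol=1e-5))  # True
```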

  • Gradient

    1. If $f:\mathbb{R}^n\to\mathbb{R}$, then $Df(\mathbf{x})$ is a $1\times n$ row vector. We call its transpose the gradient of the function:

      $$\nabla f(\mathbf{x})=D f(\mathbf{x})^{\top}$$

      The first-order approximation can thus be written as $f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{z}-\mathbf{x}\rangle$.

      Note that the gradient is defined (and differs from the derivative by a transpose) only when $f$ is real-valued.

      • Why define the gradient?

        Because we normally view a vector as a column vector.

    2. If $f(\mathbf{X})$ takes a matrix argument $\mathbf{X}$, then define $\frac{\partial f}{\partial \mathbf{X}} \triangleq\left(\frac{\partial f(\mathbf{X})}{\partial X_{i j}}\right)$

      More on matrix calculus.

  • First-order approximation

    An alternative way to find the derivative is to derive a first-order approximation:

    $$f(\mathbf{z}) \approx f(\mathbf{x})+Df(\mathbf{x})(\mathbf{z}-\mathbf{x})$$

    Example: $f(\mathbf{X})=\log \operatorname{det} \mathbf{X}$, $\operatorname{dom} f = \mathbb{S}^n_{++}$

    Let $\mathbf{Z}$ be close to $\mathbf{X}$, $\Delta\mathbf{X} = \mathbf{Z}-\mathbf{X}$, and let $\lambda_i$ be the $i$-th eigenvalue of $\mathbf{X}^{-1/2} \Delta \mathbf{X}\, \mathbf{X}^{-1/2}$. Since $\Delta\mathbf{X}$ is close to $\mathbf{0}$, the $\lambda_i$ are also small.

    Lemma:

    $$\det\mathbf{A}\mathbf{B} = \det\mathbf{B}\mathbf{A},\quad \operatorname{tr}\mathbf{A}\mathbf{B} = \operatorname{tr}\mathbf{B}\mathbf{A},\\ \det\mathbf{A} = \prod_{i=1}^n \lambda_i(\mathbf{A}),\quad \operatorname{tr}\mathbf{A} = \sum_{i=1}^n \lambda_i(\mathbf{A}),\\ \langle\mathbf{A},\mathbf{B}\rangle = \operatorname{tr}\mathbf{A}^\top\mathbf{B}$$

    (Why $\det\mathbf{A} = \prod_{i=1}^n \lambda_i(\mathbf{A})$? Each $\lambda_i$ is the factor by which the corresponding eigenvector direction is stretched, and $\det$ is the factor by which volume is scaled.)

    $$\begin{aligned} \log \operatorname{det} \mathbf{Z} &=\log \operatorname{det}(\mathbf{X}+\Delta \mathbf{X}) \\ &=\log \operatorname{det}\left(\mathbf{X}^{1 / 2}\left(\mathbf{I}+\mathbf{X}^{-1 / 2} \Delta \mathbf{X} \mathbf{X}^{-1 / 2}\right) \mathbf{X}^{1 / 2}\right) \\ &=\log \operatorname{det} \mathbf{X}+\log \operatorname{det}\left(\mathbf{I}+\mathbf{X}^{-1 / 2} \Delta \mathbf{X} \mathbf{X}^{-1 / 2}\right) \\ &=\log \operatorname{det} \mathbf{X}+\sum_{i=1}^{n} \log \left(1+\lambda_{i}\right) \approx \log \operatorname{det} \mathbf{X}+\sum_{i=1}^{n} \lambda_{i} \\ &=\log \operatorname{det} \mathbf{X}+\operatorname{tr}\left(\mathbf{X}^{-1 / 2} \Delta \mathbf{X} \mathbf{X}^{-1 / 2}\right) \\ &=\log \operatorname{det} \mathbf{X}+\operatorname{tr}\left(\mathbf{X}^{-1} \Delta \mathbf{X}\right) \\ &=\log \operatorname{det} \mathbf{X}+\operatorname{tr}\left(\mathbf{X}^{-1}(\mathbf{Z}-\mathbf{X})\right) \end{aligned}$$

    Since $\langle \mathbf{A},\mathbf{B}\rangle=\operatorname{tr}(\mathbf{A}^\top\mathbf{B})$ and $\mathbf{X}^{-1}$ is symmetric, $\nabla f(\mathbf{X}) = \mathbf{X}^{-1}$. $\blacksquare$

  • Chain rule

    For $h = g\circ f$,

    $$D h(\mathbf{x})=D g(f(\mathbf{x})) D f(\mathbf{x})$$

    Remember to take the transpose when the function is real-valued and you want the gradient.

    • Affine function: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^m$, $g: \mathbb{R}^p \rightarrow \mathbb{R}^m$, $g(\mathbf{x}) = f(\mathbf{Ax+b})$, $\mathbf{A}\in \mathbb{R}^{n\times p}$, $\mathbf{b}\in \mathbb{R}^n$

      $$D g(\mathbf{x})=D f(\mathbf{Ax+b})\, \mathbf{A},\qquad \nabla g(\mathbf{x})=\mathbf{A}^\top\nabla f(\mathbf{Ax+b}) \ \text{(when } m=1\text{)}$$

  • 🌟Understand derivative as a mapping

    The derivative is essentially a linear mapping $\Delta x \to \Delta f(x)$, so the chain rule is just a composition of such mappings. (A numerical check of the $\log\det$ gradient follows at the end of this section.)

    Consider $f(\mathbf{x}) = \log \det (\mathbf{F}_0+x_1\mathbf{F}_1+\cdots+x_n\mathbf{F}_n)$. If we fix all coordinates except $x_i$, then we have two mappings:

    1. $x_i \to \mathbf{F}_0+x_1\mathbf{F}_1+\cdots+x_n\mathbf{F}_n$

      Its derivative is the mapping $\mathbb{R}\to \mathbb{R}^{n\times n}$: $\Delta x_i\to \Delta x_i\,\mathbf{F}_i$

      Here, $\mathbf{F}_i$ is both the matrix of the mapping and the derivative.

    2. $\mathbf{M} \to \log \det \mathbf{M}$

      Its derivative is the mapping $\mathbb{R}^{n\times n}\to \mathbb{R}$: $\mathbf{N} \to \operatorname{tr}(\mathbf{N} \cdot \nabla \log \det \mathbf{M})$

      Here, the derivative would have size $1\times(n\times n)$, so it cannot be written down as an ordinary matrix; we can still write out the mapping above.

    By the chain rule, composing these two mappings gives:

    $$\frac{\partial f(\mathbf{x})}{\partial x_i} = \operatorname{tr}(\mathbf{F}_i\cdot \nabla \log \det\mathbf{F}) = \operatorname{tr}(\mathbf{F}^{-1}\mathbf{F}_i), \quad\mathbf{F}=\mathbf{F}_{0}+x_{1} \mathbf{F}_{1}+\cdots+x_{n} \mathbf{F}_{n}\\ \Rightarrow \nabla f(\mathbf{x})=\left[\begin{array}{c}\operatorname{tr}\left(\mathbf{F}^{-1} \mathbf{F}_{1}\right) \\ \vdots \\ \operatorname{tr}\left(\mathbf{F}^{-1} \mathbf{F}_{n}\right)\end{array}\right]$$

    • Why do we know the second mapping is $\mathbf{N} \to \operatorname{tr}(\mathbf{N} \cdot \nabla \log \det \mathbf{M})$?

      Go back to how we computed the gradient of $f(\mathbf{X})=\log \operatorname{det} \mathbf{X}$, $\operatorname{dom} f = \mathbb{S}^n_{++}$. By the first-order approximation,

      $$f(\mathbf{z}) \approx f(\mathbf{x})+Df(\mathbf{x})(\mathbf{z}-\mathbf{x}) = f(\mathbf{x})+ \langle \nabla f(\mathbf{x}), \mathbf{z}-\mathbf{x}\rangle =f(\mathbf{x})+\operatorname{tr}\left((\mathbf{z}-\mathbf{x})^\top\nabla f(\mathbf{x})\right)$$

      Hence the derivative is exactly the linear mapping that sends $\mathbf{z}-\mathbf{x}$ to the first-order change $\operatorname{tr}\left((\mathbf{z}-\mathbf{x})^\top\nabla f(\mathbf{x})\right)$.
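
As a sanity check I added (not in the original notes), the claim $\nabla \log\det\mathbf{X} = \mathbf{X}^{-1}$ on $\mathbb{S}^n_{++}$ can be verified entry-wise with symmetric finite-difference perturbations:

```python
# Sketch: numerically check grad log det X = X^{-1} for a random positive definite X.
import numpy as np

def grad_logdet_fd(X, h=1e-6):
    n = X.shape[0]
    G = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            E = np.zeros_like(X)
            E[i, j] = h
            E = (E + E.T) / 2            # symmetric perturbation keeps X +/- E symmetric
            G[i, j] = (np.linalg.slogdet(X + E)[1]
                       - np.linalg.slogdet(X - E)[1]) / (2 * h)
    return G

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
X = B @ B.T + 4 * np.eye(4)              # a positive definite matrix
print(np.allclose(grad_logdet_fd(X), np.linalg.inv(X), atol=1e-4))  # True
```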

Second derivative#

If $f:\mathbb{R}^n\to \mathbb{R}$, then, viewed as functions of $\mathbf{x}$, $\nabla f: \mathbb{R}^n\to \mathbb{R}^n$ and $\nabla^2 f:\mathbb{R}^n\to \mathbb{R}^{n\times n}$.

$$D^{2} f(\mathbf{x})=\nabla^{2} f(\mathbf{x})=\frac{\partial(D f(\mathbf{x}))^{T}}{\partial \mathbf{x}^{T}}=\left(\frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{j}}\right)=\mathbf{H}$$

  • Second-order approximation

    The second-order approximation $\hat{f}$ must satisfy:

    $$\lim _{\mathbf{z} \in \operatorname{dom} f, \mathbf{z} \neq \mathbf{x}, \mathbf{z} \rightarrow \mathbf{x}} \frac{|f(\mathbf{z})-\hat{f}(\mathbf{z})|}{\|\mathbf{z}-\mathbf{x}\|_{2}^{2}}=0$$

    So we have:

    $$\hat{f}(\mathbf{z})=f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{z}-\mathbf{x})+\frac{1}{2}(\mathbf{z}-\mathbf{x})^{\top} \nabla^{2} f(\mathbf{x})(\mathbf{z}-\mathbf{x}) \\ =f(\mathbf{x})+D f(\mathbf{x})(\mathbf{z}-\mathbf{x})+\frac{1}{2}\left\langle D^{2} f(\mathbf{x})(\mathbf{z}-\mathbf{x}), \mathbf{z}-\mathbf{x}\right\rangle$$

  • Chain rule for second derivative (a numerical check of the affine case follows this list)

    • Scalar function: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, $g: \mathbb{R} \rightarrow \mathbb{R}$, $h(\mathbf{x}) = g(f(\mathbf{x}))$

      $$\nabla^{2} h(\mathbf{x})=g^{\prime}(f(\mathbf{x})) \nabla^{2} f(\mathbf{x})+g^{\prime \prime}(f(\mathbf{x})) \nabla f(\mathbf{x}) \nabla f(\mathbf{x})^{T}$$

    • Affine function: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, $g: \mathbb{R}^m \rightarrow \mathbb{R}$, $g(\mathbf{x}) = f(\mathbf{Ax+b})$, $\mathbf{A}\in \mathbb{R}^{n\times m}$, $\mathbf{b}\in \mathbb{R}^n$

      $$\nabla^{2} g(\mathbf{x})=\mathbf{A}^{\top} \nabla^{2} f(\mathbf{A} \mathbf{x}+\mathbf{b}) \mathbf{A}$$
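
Below is a small numerical sketch I added (not in the original notes): it checks $\nabla^{2} g(\mathbf{x})=\mathbf{A}^{\top} \nabla^{2} f(\mathbf{A} \mathbf{x}+\mathbf{b}) \mathbf{A}$ for the toy choice $f=\log\sum_i\exp(y_i)$ (my own choice), whose Hessian is $\operatorname{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^\top$ with $\mathbf{p}=\operatorname{softmax}(\mathbf{y})$.

```python
# Sketch: verify the affine chain rule for the Hessian with f = log-sum-exp.
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def hess_f(y):                           # Hessian of log-sum-exp at y
    p = softmax(y)
    return np.diag(p) - np.outer(p, p)

def hess_fd(fun, x, h=1e-4):             # central finite-difference Hessian
    m = x.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = h
            ej = np.zeros(m); ej[j] = h
            H[i, j] = (fun(x + ei + ej) - fun(x + ei - ej)
                       - fun(x - ei + ej) + fun(x - ei - ej)) / (4 * h * h)
    return H

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # A in R^{n x m} with n = 5, m = 3
b = rng.standard_normal(5)
x = rng.standard_normal(3)
g = lambda x: np.log(np.exp(A @ x + b).sum())

print(np.allclose(A.T @ hess_f(A @ x + b) @ A, hess_fd(g, x), atol=1e-4))  # True
```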

Neural Network#

  • Back Propagation

    The error function and the layer computation can be written as:

    $$\begin{array}{l}e(\mathbf{w}, \mathbf{v}, \mathbf{y})=V\left(\mathbf{y}, \mathbf{x}_{o}\right) \\ x_{i}=\sigma\left(\sum_{j \in \mathrm{pa}(i)} w_{i j} x_{j}\right) = \sigma(a_i)\\ \frac{\partial V}{\partial w_{i j}}=\frac{\partial V}{\partial a_{i}} \frac{\partial a_{i}}{\partial w_{i j}}=\delta_{i} x_{j},\quad \delta_i \triangleq \frac{\partial V}{\partial a_{i}}\end{array}$$

  • Automatic differentiation


  • Reparameterization trick

    Case: the parameter appears in the sampling distribution. (A Monte Carlo sketch appears at the end of this section.)

    Motivation: how to compute the gradient of $L(\theta)=\mathbb{E}_{z \sim p_{\theta}(z)}[f(z)]$? After sampling $z$, $\theta$ no longer appears in the expression, so back propagation cannot work.

    (Note that $\theta$ does not appear in $f(z)$ itself.)

    1. Choose a distribution $q(\epsilon)$ without parameters (e.g. standard normal, uniform, ...) and sample an instance $\epsilon$
    2. Generate a sample of $z$ by $z=g_\theta(\epsilon)$. The choice of $g_\theta$ should make $z\sim p_\theta(z)$

    Now we have $L(\theta)=\mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\tilde{f}_{\theta}(\varepsilon)\right]$ with $\tilde{f}_{\theta} = f\circ g_\theta$; the sampling no longer wipes out $\theta$, so:

    $$\frac{\partial L(\theta)}{\partial \theta}=\frac{\partial}{\partial \theta} \mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\tilde{f}_{\theta}(\varepsilon)\right]=\mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\frac{\partial}{\partial \theta} \tilde{f}_{\theta}(\varepsilon)\right]=\mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\frac{\partial f}{\partial g} \frac{\partial g_{\theta}(\varepsilon)}{\partial \theta}\right]$$


    Example: normal distribution

    Since a sample from $\mathcal{N}\left(z ; \mu, \sigma^{2}\right)$ can be generated as $z = \sigma\varepsilon+\mu$ with $\varepsilon\sim\mathcal{N}(0,1)$, we have

    $$\mathbb{E}_{z \sim \mathcal{N}\left(z ; \mu, \sigma^{2}\right)}[f(z)]=\mathbb{E}_{\varepsilon \sim \mathcal{N}(\varepsilon ; 0,1)}[f(\sigma \varepsilon+\mu)].$$

  • Gumbel-Softmax Trick

    Case: taking a max.

    Motivation: how to compute the gradient of $L(\theta)=\underset{i}{\operatorname{argmax}}\left\{x_{i}(\theta) \mid i=1, \cdots, K\right\}$?

    Choose $x_i$ deterministically $\to$ choose (sample) $x_i$ according to $\pi_{i}=\frac{\exp x_{i}}{\sum_{k=1}^{K} \exp x_{k}}$.

    Since $\pi$ is a probability distribution, we cannot simply take $\arg\max_i \pi_i$ as the result; sampling from $\pi$ requires some randomness, which the Gumbel-max trick adds:

    $$z =\text{onehot}\left( \arg\max_i[g_i+\log \pi_i] \right) =\text{onehot}\left( \arg\max_i x_i'\right),\quad g_i\sim \text{Gumbel}(0,1)\\ G\sim \text{Gumbel}(0,1) \Leftrightarrow G= -\log(-\log U),\ U\sim \text{Unif}[0,1]$$

    Since $\arg\max$ is not differentiable, we use a softmax to approximate it:

    $$\pi_{\tau, i}^{\prime}=\frac{\exp \left(x_{i}^{\prime} / \tau\right)}{\sum_{k=1}^{K} \exp \left(x_{k}^{\prime} / \tau\right)}$$

    Here, $\tau$ is a hyper-parameter called the temperature. The smaller $\tau$ is, the more closely the softmax approximates a one-hot vector.
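
The following is a minimal Monte Carlo sketch I added (not in the original notes): reparameterized gradients of $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[f(z)]$ for the toy choice $f(z)=z^2$, so the true value is $\mu^2+\sigma^2$ and the true gradient is $(2\mu, 2\sigma)$.

```python
# Sketch: reparameterization trick for a Gaussian, with f(z) = z^2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8

eps = rng.standard_normal(100_000)   # step 1: sample from the fixed q(eps) = N(0, 1)
z = mu + sigma * eps                 # step 2: z = g_theta(eps), so z ~ N(mu, sigma^2)

# df/dz = 2z; chain rule through g_theta: dz/dmu = 1, dz/dsigma = eps
grad_mu = np.mean(2 * z)
grad_sigma = np.mean(2 * z * eps)

print(grad_mu, grad_sigma)           # roughly 2*mu = 3.0 and 2*sigma = 1.6
```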

Linear Algebra#

  • Symmetric eigenvalue decomposition (EVD)

    If $\mathbf{A}\in \mathbb{S}^n$ is a real symmetric $n\times n$ matrix, then $\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^{\top}$, where $\mathbf{Q}$ is an orthogonal matrix of eigenvectors satisfying $\mathbf{Q}^\top \mathbf{Q} =\mathbf{I}$ and $\mathbf{\Lambda}=\text{diag}(\lambda_1,\cdots,\lambda_n)$.

    $$\operatorname{det} \mathbf{A}=\prod_{i=1}^{n} \lambda_{i}, \quad \operatorname{tr} \mathbf{A}=\sum_{i=1}^{n} \lambda_{i},\quad\|\mathbf{A}\|_{2}=\max_i|\lambda_i|,\quad\|\mathbf{A}\|_{F}=\left(\sum_{i=1}^{n} \lambda_{i}^{2}\right)^{1 / 2}$$

  • Definiteness

    By definition, the largest and smallest eigenvalues satisfy

    $$\lambda_{\max }(\mathbf{A})=\sup _{\mathbf{x} \neq 0} \frac{\mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\mathbf{x}^{\top} \mathbf{x}}, \quad \lambda_{\min }(\mathbf{A})=\inf _{\mathbf{x} \neq 0} \frac{\mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\mathbf{x}^{\top} \mathbf{x}},\\ \lambda_{\min }(\mathbf{A})\,\mathbf{x}^{\top} \mathbf{x}\le\mathbf{x}^{\top} \mathbf{A} \mathbf{x} \le \lambda_{\max }(\mathbf{A})\,\mathbf{x}^{\top} \mathbf{x}$$

    For $\mathbf{A}\in \mathbb{S}^n$, positive definiteness $\mathbf{A} \succ \mathbf{0}$ means: $\forall \mathbf{x}\ne \mathbf{0}, \mathbf{x}^\top\mathbf{A}\mathbf{x}>0 \Leftrightarrow \lambda_{\min}(\mathbf{A})>0$.

  • Symmetric square root

    For $\mathbf{A}\in \mathbb{S}^n_{+}$,

    $$\mathbf{A}^{1 / 2}=\mathbf{Q} \operatorname{diag}\left(\lambda_{1}^{1 / 2}, \ldots, \lambda_{n}^{1 / 2}\right) \mathbf{Q}^{\top}$$

    $\mathbf{A}^{1 / 2}$ is the unique symmetric positive semidefinite solution of $\mathbf{X}^2 = \mathbf{A}$.

  • SVD

    Any $\mathbf{A}\in \mathbb{R}^{m\times n}$ can be factored as $\mathbf{A}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$, $\mathbf{\Sigma}=\text{diag}\{\sigma_1,\cdots,\sigma_r\}$.

    The singular values of a symmetric matrix are the absolute values of its nonzero eigenvalues, sorted in descending order.

  • Condition number

    The condition number measures the sensitivity of the output to perturbations in the input. (A numerical sketch appears at the end of this section.)

    For $\mathbf{Ax=b}$, perturb the right-hand side so that $\mathbf{A(x+\delta x)=b+\delta b}$. Since $\|\mathbf{b}\|\le \sigma_{\max}\|\mathbf{x}\|$ and $\|\delta \mathbf{b}\|\ge \sigma_{\min}\|\delta\mathbf{x}\|$, we have:

    $${\|\delta \mathbf{x}\| \over \|\mathbf{x}\|} \leq \kappa(\mathbf{A}) {\|\delta \mathbf{b}\| \over \|\mathbf{b}\|}$$

    A nonsingular (invertible) matrix $\mathbf{A}\in \mathbb{R}^{n\times n}$ has condition number:

    $$\operatorname{cond}(\mathbf{A})=\|\mathbf{A}\|_{2}\left\|\mathbf{A}^{-1}\right\|_{2}=\sigma_{\max }(\mathbf{A}) / \sigma_{\min }(\mathbf{A})$$

  • Pseudo-inverse

    With the compact SVD $\mathbf{A}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ (keeping only the nonzero singular values),

    $$\mathbf{A}^{\dagger}=\mathbf{V} \boldsymbol{\Sigma}^{-1} \mathbf{U}^{\top} \in \mathbb{R}^{n \times m}$$

  • Adjoint operator

    Given a linear operator $\mathcal{A}:\mathbb{R}^m\to \mathbb{R}^n$, its adjoint operator $\mathcal{A}^*$ satisfies:

    $$\left\langle\mathcal{A}^{*}(\mathbf{x}), \mathbf{y}\right\rangle=\langle\mathbf{x}, \mathcal{A}(\mathbf{y})\rangle, \quad \forall \mathbf{x} \in \mathbb{R}^{n}, \mathbf{y} \in \mathbb{R}^{m}$$

    Example: if $\mathcal{A}(\mathbf{x}) =\mathbf{Ax}$, then $\mathcal{A}^{*}(\mathbf{x}) =\mathbf{A}^\top \mathbf{x}$.

  • Von Neumann trace theorem

    Suppose $\mathbf{A,B}\in \mathbb{R}^{m\times n}$. Then:

    $$|\langle\mathbf{A}, \mathbf{B}\rangle| \leq \sum_{i=1}^{\min (m, n)} \sigma_{i}(\mathbf{A}) \sigma_{i}(\mathbf{B})$$

    When both are p.s.d. matrices, we further have:

    $$\sum_{i=1}^{n} \sigma_{i}(\mathbf{A}) \sigma_{n-i+1}(\mathbf{B}) \leq \operatorname{tr}(\mathbf{A B}) \leq \sum_{i=1}^{n} \sigma_{i}(\mathbf{A}) \sigma_{i}(\mathbf{B})$$
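
A small numerical sketch I added (not in the original notes): the condition number and pseudo-inverse computed from the SVD, together with the perturbation bound above.

```python
# Sketch: condition number and pseudo-inverse from the SVD, plus the bound
# ||dx||/||x|| <= cond(A) * ||db||/||b|| for Ax = b.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))                # almost surely nonsingular
x = rng.standard_normal(4)
b = A @ x

U, s, Vt = np.linalg.svd(A)
cond = s.max() / s.min()                       # = ||A||_2 ||A^{-1}||_2
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T         # pseudo-inverse (here the true inverse)

db = 1e-6 * rng.standard_normal(4)             # perturb the right-hand side
dx = A_pinv @ db                               # so that A(x + dx) = b + db
lhs = np.linalg.norm(dx) / np.linalg.norm(x)
rhs = cond * np.linalg.norm(db) / np.linalg.norm(b)
print(lhs <= rhs, np.isclose(cond, np.linalg.cond(A)))  # True True
```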

Norms#

A function $f:\mathbb{R}^n\to \mathbb{R}$ is called a norm if $f$ is nonnegative, definite ($f(\mathbf{x})=0 \Leftrightarrow \mathbf{x=0}$), homogeneous ($f(t\mathbf{x})=|t|f(\mathbf{x})$), and satisfies the triangle inequality ($f(\mathbf{x+y})\le f(\mathbf{x})+f(\mathbf{y})$).

  • Inner product

    $$\langle\mathbf{x}, \mathbf{y}\rangle = \mathbf{x}^\top \mathbf{y}, \qquad \langle\mathbf{X}, \mathbf{Y}\rangle = \text{tr}\left(\mathbf{X}^\top \mathbf{Y}\right)$$

  • Unit balls

    Given a norm $\|\cdot\|$, its unit ball is $\mathcal{B} = \{\mathbf{x}\in \mathbb{R}^n\mid \|\mathbf{x}\|\le 1\}$. It is symmetric about the origin, convex, closed, bounded, and has nonempty interior.

    Conversely, if a set $\mathcal{C}$ satisfies the above properties, it is the unit ball of the norm $\|\mathbf{x}\| = \left(\sup \{t\ge 0\mid t\mathbf{x}\in \mathcal{C}\}\right)^{-1}$.

  • Quadratic norm

    For $\mathbf{P}\in \mathbb{S}_{++}^n$, $\|\mathbf{x}\|_{\mathbf{P}} = (\mathbf{x}^\top \mathbf{P}\mathbf{x})^{1/2} = \|\mathbf{P}^{1/2}\mathbf{x}\|_2$. Its unit ball is an ellipsoid.

  • Equivalence of norms

    $$\exists\, \alpha,\beta>0:\quad \alpha \|\mathbf{x}\|_a\le \|\mathbf{x}\|_b\le \beta \|\mathbf{x}\|_a, \quad\forall \mathbf{x}\in \mathbb{R}^n$$

    This means $\|\cdot\|_a$ and $\|\cdot\|_b$ are equivalent, i.e. they define the same open sets and the same convergent sequences.

  • Operator/induced norms

    Given $\|\cdot\|_a$ on $\mathbb{R}^m$ and $\|\cdot\|_b$ on $\mathbb{R}^n$, the induced operator norm of $\mathbf{X}\in \mathbb{R}^{m\times n}$ is:

    $$\|\mathbf{X}\|_{a,b} = \sup \{\|\mathbf{Xu}\|_a\mid \|\mathbf{u}\|_b\le 1\}$$

  • Matrix norms

    $$\|\mathbf{X}\|_F = \left(\text{tr}\left(\mathbf{X}^\top \mathbf{X}\right)\right)^{1/2}\\ \|\mathbf{X}\|_{1}=\max _{j}\sum _{i}|x_{ij}| \ \text{(max column sum)}\\ \|\mathbf{X}\|_{2}=\sigma _{\max }(\mathbf{X})\\ \|\mathbf{X}\|_{\infty} = \max_i\sum_{j}|x_{ij}| \ \text{(max row sum)}$$

  • Nuclear norm

    $$\|\mathbf{X}\|_* = \sum_i\sigma_i(\mathbf{X}) = \max_{\mathbf{U}^\top \mathbf{U}=\mathbf{I},\,\mathbf{V}^\top \mathbf{V}=\mathbf{I}} \text{tr}(\mathbf{U}^\top\mathbf{X}\mathbf{V})$$

    First, $\|\mathbf{X}\|_* \ge \text{tr}(\mathbf{U}^\top\mathbf{X}\mathbf{V})$ follows from the Von Neumann trace theorem:

    $$\text{tr}(\mathbf{U}^\top\mathbf{X}\mathbf{V}) = \langle \mathbf{X}, \mathbf{U}\mathbf{V}^\top\rangle \le \sum_i\sigma_i(\mathbf{X})\sigma_i(\mathbf{U}\mathbf{I}\mathbf{V}^\top) = \sum_i\sigma_i(\mathbf{X}) = \|\mathbf{X}\|_*$$

    Next, if $\mathbf{X}=\mathbf{U}_x\mathbf{\Sigma}_x\mathbf{V}_x^\top$ is the SVD, taking $\mathbf{U}=\mathbf{U}_x, \mathbf{V}=\mathbf{V}_x$ attains equality:

    $$\langle \mathbf{X}, \mathbf{U}\mathbf{V}^\top\rangle =\text{tr}(\mathbf{U}_x^\top\mathbf{X}\mathbf{V}_x) =\text{tr}(\mathbf{\Sigma}_x)= \|\mathbf{X}\|_*$$

  • $(p,q)$-norm

    $$\|\mathbf{A}\|_{p, q}=\left(\sum_{i=1}^{n}\left\|\mathbf{A}_{i}\right\|_{p}^{q}\right)^{\frac{1}{q}}, \quad p,q\ge 1,$$

    where $\mathbf{A}_{i}$ denotes the $i$-th column of $\mathbf{A}$.

  • Dual norm

    If we have a linear function $f_\mathbf{z}(\mathbf{x}) = \mathbf{z}^\top \mathbf{x}$, how big is its value relative to the norm of $\mathbf{x}$? The ratio is $\frac{ \mathbf{z}^\top \mathbf{x}}{\|\mathbf{x}\|}$. Hence, $\| \mathbf{z}\|^*$ can be interpreted as a stretching factor, a linear-functional version of the operator norm. (See more intuition here.)

    For a norm $\|\cdot\|$, its dual norm is:

    $$\|\mathbf{z}\|^* = \sup\{\mathbf{z}^\top \mathbf{x}\mid \|\mathbf{x}\|\le 1\}$$

    Examples (a numerical check follows this list):

    • $\|z\|_2\Leftrightarrow \|z\|_2$: $\|z\|_2^* = \sup_x\{\langle z,x\rangle\mid \|x\|_2\le 1\}$. By the Cauchy-Schwarz inequality, $\langle z,x\rangle\le \|x\|_2\|z\|_2\le \|z\|_2$. When $x = \frac{z}{\|z\|_2}$, $\langle z,x\rangle= \|z\|_2$. Hence, $\|z\|_2^* = \|z\|_2$.

    • $\|z\|_p\Leftrightarrow \|z\|_q$: $\|z\|_p^* = \sup_x\{\langle z,x\rangle\mid \|x\|_p\le 1\}$. By Hölder's inequality, $\langle z,x\rangle\le \|z\|_q\|x\|_p$ with $\frac{1}{p}+\frac{1}{q}=1$, and equality is attainable, so $\|z\|_p^* = \|z\|_q$.

    • $\|z\|_1\Leftrightarrow \|z\|_\infty$: $|\sum_i z_ix_i|\le \sum_i|z_i||x_i|\le \left(\sum_i|x_i|\right) \max_i |z_i| = \|x\|_1\|z\|_\infty$; equality holds for $x = \text{sgn}(z_{i^*})\,e_{i^*}$ with $i^*=\arg\max_i|z_i|$.

    • Spectral norm $\Leftrightarrow$ nuclear norm: $\|Z\|_{2}^*=\sup\{\operatorname{tr} (Z^{\top}X)\mid\|X\|_{2}\leq 1\}=\sigma _{1}(Z)+\cdots +\sigma _{r}(Z)=\operatorname{tr} \left(\sqrt {Z^{\top}Z}\right)$
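
As a closing sanity check I added (not in the original notes), two of the dual-norm facts above can be verified numerically by plugging in the maximizers used in the arguments:

```python
# Sketch: the dual of ||.||_1 is ||.||_inf, and the dual of the spectral norm is
# the nuclear norm, checked by evaluating z^T x (resp. tr(Z^T X)) at the maximizers.
import numpy as np

rng = np.random.default_rng(0)

# Dual of l1: sup{ z^T x : ||x||_1 <= 1 } = ||z||_inf, attained at sgn(z_i*) e_i*.
z = rng.standard_normal(5)
i_star = np.argmax(np.abs(z))
x_star = np.zeros(5)
x_star[i_star] = np.sign(z[i_star])
print(np.isclose(z @ x_star, np.abs(z).max()))          # True

# Dual of the spectral norm: sup{ tr(Z^T X) : ||X||_2 <= 1 } = sum of singular
# values of Z, attained at X = U V^T where Z = U diag(s) V^T is the SVD.
Z = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
X_star = U @ Vt                                          # ||X_star||_2 = 1
print(np.isclose(np.trace(Z.T @ X_star), s.sum()))       # True
```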