Special Notes on May 15, 2019

Date: May 15, 2019
Last Updated: May 15, 2019
Categories: Notes, Theory
Tags: research, algorithm, NIPS, optimization

In this article, we discuss the tricks used in the training and testing phases of dictionary learning (sparse coding). The original work can be found here. As extra reading material, we suggest Convex Optimization for understanding how to apply the Lagrangian method, and the Matrix Cookbook for the conclusions about calculating gradients with respect to matrices.

The PDF version of this article can be found here:

PDF version

Solve the Lasso problem

Consider the testing phase of sparse coding, which can be formulated as

\begin{align} \min\limits_{{\{\boldsymbol{\alpha}_i\}}_{i=1}^{N}}~&{\sum_{i=1}^{N} \left({\lVert \mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i \rVert}_2^2 + \lambda {\lVert \boldsymbol{\alpha}_i \rVert}_1 \right)}. \end{align}

Then we can find the stationary point by setting the first-order gradient to zero,

\begin{equation} \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\alpha}_k} &\left(\sum_{i=1}^{N} \left({\lVert \mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i \rVert}_2^2 + \lambda {\lVert \boldsymbol{\alpha}_i \rVert}_1 \right) \right) \\ &= \frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\alpha}_k} \left( {\lVert \mathbf{x}_k - \mathbf{D}\boldsymbol{\alpha}_k \rVert}_2^2 + \lambda {\lVert \boldsymbol{\alpha}_k \rVert}_1 \right) \\ &= \frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\alpha}_k} \left( (\mathbf{x}_k - \mathbf{D}\boldsymbol{\alpha}_k)^T(\mathbf{x}_k - \mathbf{D}\boldsymbol{\alpha}_k) \right) + \lambda\mathrm{sign}(\boldsymbol{\alpha}_k)\\ &= \frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\alpha}_k} \left( -2\boldsymbol{\alpha}_k^T\mathbf{D}^T\mathbf{x}_k + \boldsymbol{\alpha}_k^T\mathbf{D}^T\mathbf{D}\boldsymbol{\alpha}_k \right) + \lambda\mathrm{sign}(\boldsymbol{\alpha}_k)\\ &= -2 \mathbf{D}^T\mathbf{x}_k + 2 \mathbf{D}^T\mathbf{D} \boldsymbol{\alpha}_k + \lambda \mathrm{sign}(\boldsymbol{\alpha}_k) = 0. \end{aligned} \end{equation}

Equation $(2)$ implicitly gives the analytical solution of the Lasso problem. To find the optimal $\boldsymbol{\alpha}_k$, we still need some tricks. Denoting $\boldsymbol{\theta}_k = \mathrm{sign}(\boldsymbol{\alpha}_k)$, we can rearrange $(2)$ as

\begin{equation} \begin{aligned} \boldsymbol{\alpha}_k = (\mathbf{D}^T\mathbf{D})^{-1}(\mathbf{D}^T\mathbf{x}_k-\frac{\lambda}{2}\boldsymbol{\theta}_k). \end{aligned} \end{equation}

Since $\boldsymbol{\theta}_k = \mathrm{sign}(\boldsymbol{\alpha}_k)$, we can apply $\mathrm{sign}(\cdot)$ to both sides of $(3)$. Noting that $(\mathbf{D}^T\mathbf{D})^{-1}$ does not change the signs when, for example, the dictionary atoms are mutually orthogonal, and denoting $\mathbf{y}_k = 2\mathbf{D}^T\mathbf{x}_k$, we get

\begin{equation}\label{fml:c5:LassoSoluRelax1} \begin{aligned} \boldsymbol{\theta}_k &= \mathrm{sign}\left((\mathbf{D}^T\mathbf{D})^{-1}\left(\mathbf{D}^T\mathbf{x}_k-\frac{\lambda}{2}\boldsymbol{\theta}_k\right)\right) \\ &= \mathrm{sign}\left(\mathbf{D}^T\mathbf{x}_k-\frac{\lambda}{2}\boldsymbol{\theta}_k\right) \\ &= \mathrm{sign}\left(\mathbf{y}_k-\lambda\boldsymbol{\theta}_k\right). \end{aligned} \end{equation}

Equation $(4)$ indicates a fast way to solve $(2)$: considering the $i^{\mathrm{th}}$ element of $\mathbf{y}_k$, i.e. $y_{ki}$, we find that when $y_{ki} > \lambda$, $\alpha_{ki} > 0$; when $y_{ki} < -\lambda$, $\alpha_{ki} < 0$; and when $y_{ki} \in [-\lambda,~\lambda]$, $\alpha_{ki}=0$. After determining $\boldsymbol{\theta}_k$ in this way, $\boldsymbol{\alpha}_k$ can be obtained from $(3)$ directly.
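
To make the recipe concrete, here is a minimal NumPy sketch of this coding step. The function name `sparse_code` and the array shapes are our own assumptions, and the routine follows $(3)$ and $(4)$ literally rather than being a general-purpose Lasso solver.

```python
import numpy as np

def sparse_code(D, x, lam):
    """Sign-guessing solve of (1) for one sample, following (3) and (4).

    Assumed shapes: D is (d, K), x is (d,); lam is the l1 weight lambda.
    """
    # Guess the signs theta_k from y_k = 2 D^T x_k, following (4):
    #   y_ki > lam -> +1,  y_ki < -lam -> -1,  otherwise 0.
    y = 2.0 * (D.T @ x)
    theta = np.sign(y) * (np.abs(y) > lam)
    # Recover alpha_k from the closed form (3).
    return np.linalg.solve(D.T @ D, D.T @ x - 0.5 * lam * theta)
```

For a batch of samples, the same routine can be applied to each column of $\mathbf{X}$, which gives the code matrix used in the next section.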

Learn the dictionary

If we collect the codes into a matrix as below,

\begin{equation} \mathbf{A} = \left[ \begin{array}{c c c c} \boldsymbol{\alpha}_1 & \boldsymbol{\alpha}_2 & \cdots & \boldsymbol{\alpha}_N \end{array} \right], \end{equation}

then we can rewrite the dictionary learning problem with the Frobenius norm,

\begin{equation} \begin{aligned} \min\limits_{\mathbf{D}}~&{\lVert \mathbf{X} - \mathbf{D}\mathbf{A}\rVert}^2_F, \\ \textrm{s.t.}~&{\lVert D(:,k) \rVert}_2 \leqslant 1,~\forall k \in \{1,2,\ldots,K\}. \end{aligned} \end{equation}

Training the dictionary requires us to update $\mathbf{D}$ and $\mathbf{A}$ alternately. The method for updating $\mathbf{A}$ has been discussed above, so the following part focuses on how to update $\mathbf{D}$. Note that the only trainable variable in $(6)$ is $\mathbf{D}$, which means $(1)$ and $(6)$ are solved alternately, as sketched below.
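
Here `sparse_code` is the coding sketch above, and `update_dictionary` is the dictionary step sketched at the end of this section; both are our own illustrative helpers rather than a reference implementation.

```python
import numpy as np

def dictionary_learning(X, K, lam, n_iter=20, seed=0):
    """Alternate between the coding problem (1) and the dictionary problem (6).

    Assumed shapes: X is (d, N); K is the number of dictionary atoms.
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)           # start from unit-norm atoms
    for _ in range(n_iter):
        # Coding step: solve (1) column by column with D fixed.
        A = np.stack([sparse_code(D, X[:, i], lam) for i in range(N)], axis=1)
        # Dictionary step: solve (6) with A fixed (derived in the rest of this section).
        D = update_dictionary(X, A)
    return D, A
```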

Solving the dictionary update requires the Lagrange multiplier method. Denoting the multiplier of the $j^{\mathrm{th}}$ constraint as $\mu_j$, we can incorporate the constraints into the problem,

\begin{equation} \begin{aligned} \mathcal{L}(\mathbf{D},~\boldsymbol{\mu}) = {\lVert \mathbf{X} - \mathbf{D}\mathbf{A}\rVert}^2_F + \sum_{j=1}^{K}\mu_j\left(\sum_{i=1}^{D}D_{ij}^2-1\right). \end{aligned} \end{equation}

The first term of $(7)$ can be expanded as

\begin{equation} \begin{aligned} {\lVert \mathbf{X} - \mathbf{D}\mathbf{A}\rVert}^2_F &= \mathrm{Tr}((\mathbf{X} - \mathbf{D}\mathbf{A})(\mathbf{X} - \mathbf{D}\mathbf{A})^T) \\ &= \mathrm{Tr}(\mathbf{X}\mathbf{X}^T) + \mathrm{Tr}(\mathbf{D}\mathbf{A}\mathbf{A}^T\mathbf{D}^T) - 2\mathrm{Tr}(\mathbf{D}\mathbf{A}\mathbf{X}^T). \end{aligned} \end{equation}

Denoting by $\boldsymbol{\Lambda}$ the diagonal matrix whose $j^{\mathrm{th}}$ diagonal element is $\mu_j$, the second term can be rewritten as

\begin{equation} \begin{aligned} \sum_{j=1}^{K}\mu_j\left(\sum_{i=1}^{D}D_{ij}^2-1\right) &= \sum_{j=1}^{K}\mu_j\sum_{i=1}^{D}D_{ij}^2 - \sum_{j=1}^{K}\mu_j\\ &= \mathrm{Tr}(\mathbf{D}\boldsymbol{\Lambda}\mathbf{D}^T - \boldsymbol{\Lambda}). \end{aligned} \end{equation}
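
As a quick sanity check of this identity, the snippet below compares the sum form with the trace form on random data (all variable names are throwaway test data):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 6, 4
D_mat = rng.standard_normal((d, K))      # plays the role of the dictionary D
mu = rng.random(K) + 0.1                 # positive multipliers mu_j
Lam = np.diag(mu)                        # Lambda = diag(mu_1, ..., mu_K)

lhs = sum(mu[j] * (np.sum(D_mat[:, j] ** 2) - 1.0) for j in range(K))
rhs = np.trace(D_mat @ Lam @ D_mat.T) - np.trace(Lam)   # Tr(D Lam D^T) - Tr(Lam)
assert np.isclose(lhs, rhs)
```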

Hence we can rewrite $(7)$ as

\begin{equation} \begin{aligned} \mathcal{L}(\mathbf{D},~\boldsymbol{\Lambda}) = \mathrm{Tr}(\mathbf{X}\mathbf{X}^T + \mathbf{D}\mathbf{A}\mathbf{A}^T\mathbf{D}^T - 2\mathbf{D}\mathbf{A}\mathbf{X}^T + \mathbf{D}\boldsymbol{\Lambda}\mathbf{D}^T - \boldsymbol{\Lambda}). \end{aligned} \end{equation}

Applying the first-order partial gradient with respect to $\mathbf{D}$, we have

\begin{equation} \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}\mathbf{D}} \left( \mathcal{L}(\mathbf{D},~\boldsymbol{\Lambda}) \right) &= \frac{\mathrm{d}}{\mathrm{d}\mathbf{D}} \mathrm{Tr}\left( \mathbf{D}\mathbf{A}\mathbf{A}^T\mathbf{D}^T - 2\mathbf{D}\mathbf{A}\mathbf{X}^T + \mathbf{D}\boldsymbol{\Lambda}\mathbf{D}^T \right)\\ &= 2\mathbf{D}\mathbf{A}\mathbf{A}^T - 2\mathbf{X}\mathbf{A}^T + 2\mathbf{D}\boldsymbol{\Lambda} = 0,\\ \mathbf{D} &= \mathbf{X}\mathbf{A}^T(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})^{-1}. \end{aligned} \end{equation}

Substituting $(11)$ into $(10)$, we have

\begin{equation} \begin{aligned} \mathcal{L}(\boldsymbol{\Lambda}) &= \min\limits_{\mathbf{D}}~\mathcal{L}(\mathbf{D},~\boldsymbol{\Lambda}) \\ &= \mathrm{Tr}(\mathbf{X}\mathbf{X}^T + \mathbf{D}\mathbf{A}\mathbf{A}^T\mathbf{D}^T - 2\mathbf{D}\mathbf{A}\mathbf{X}^T + \mathbf{D}\boldsymbol{\Lambda}\mathbf{D}^T - \boldsymbol{\Lambda})\\ &= \mathrm{Tr}(\mathbf{X}\mathbf{X}^T + \mathbf{D}(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})\mathbf{D}^T - 2\mathbf{D}\mathbf{A}\mathbf{X}^T - \boldsymbol{\Lambda})\\ &= \mathrm{Tr}(\mathbf{X}\mathbf{X}^T + \mathbf{X}\mathbf{A}^T(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})^{-1}(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})^{-1}\mathbf{A}\mathbf{X}^T \\ &~~~~~~- 2\mathbf{X}\mathbf{A}^T(\mathbf{A}\mathbf{A}^T + \boldsymbol{\Lambda})^{-1}\mathbf{A}\mathbf{X}^T - \boldsymbol{\Lambda})\\ &= \mathrm{Tr}(\mathbf{X}\mathbf{X}^T - \mathbf{X}\mathbf{A}^T(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})^{-1}\mathbf{A}\mathbf{X}^T - \boldsymbol{\Lambda}). \end{aligned} \end{equation}

Since $\mathbf{D}$ has been expressed in terms of $\boldsymbol{\Lambda}$, optimizing $\mathcal{L}(\cdot)$ now only involves $\boldsymbol{\Lambda}$. Hence we have

\begin{equation} \begin{aligned} \frac{\partial \min\limits_{\mathbf{D}}~\mathcal{L}}{\partial \mu_i} &= \mathrm{Tr}\left( \frac{\partial \mathbf{X}\mathbf{X}^T}{\partial \mu_i} - \frac{ \partial \mathbf{X}\mathbf{A}^T(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})^{-1}\mathbf{A}\mathbf{X}^T}{ \partial \mu_i } - \frac{\partial\boldsymbol{\Lambda}}{\partial \mu_i} \right)\\ &= -\mathrm{Tr}\left( \frac{ \partial \mathbf{X}\mathbf{A}^T(\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})^{-1}\mathbf{A}\mathbf{X}^T }{ \partial \mu_i } \right) - 1. \end{aligned} \end{equation}

The Matrix Cookbook provides the conclusion that (with $x_i$ denoting the $i^{\mathrm{th}}$ diagonal element of $\mathbf{X}$)

\begin{align} \mathrm{Tr}\left( \frac{\partial \mathbf{P}^T (\mathbf{X}+\mathbf{A})^{-1} \mathbf{P} }{\partial x_i} \right) = -{\lVert \mathbf{P}^T (\mathbf{X}+\mathbf{A})^{-1} \mathbf{e}_i \rVert}^2_2. \end{align}
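
This conclusion can be sanity-checked numerically with throwaway random matrices, taking $\mathbf{X}$ to be diagonal (as our $\boldsymbol{\Lambda}$ is) and $\mathbf{A}$ to be symmetric positive definite:

```python
import numpy as np

rng = np.random.default_rng(2)
K, m = 4, 5
P = rng.standard_normal((K, m))
A_sym = rng.standard_normal((K, K))
A_sym = A_sym @ A_sym.T + K * np.eye(K)  # a symmetric positive-definite "A"
x_diag = rng.random(K) + 1.0             # "X" = diag(x_diag)

def f(v):
    # Tr(P^T (X + A)^{-1} P) with X = diag(v)
    return np.trace(P.T @ np.linalg.inv(np.diag(v) + A_sym) @ P)

i, eps = 2, 1e-6
e_i = np.eye(K)[i]
fd = (f(x_diag + eps * e_i) - f(x_diag - eps * e_i)) / (2 * eps)   # finite difference
closed = -np.linalg.norm(P.T @ np.linalg.inv(np.diag(x_diag) + A_sym) @ e_i) ** 2
assert np.isclose(fd, closed, rtol=1e-4)
```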

Applying $(14)$ to $(13)$, we have

\begin{equation} \begin{aligned} \frac{\partial \min\limits_{\mathbf{D}}~\mathcal{L}}{\partial \mu_i} = {\lVert \mathbf{X}\mathbf{A}^T (\mathbf{A}\mathbf{A}^T+\boldsymbol{\Lambda})^{-1} \mathbf{e}_i \rVert}^2_2 - 1 = 0. \end{aligned} \end{equation}

Equation $(15)$ stems from the Lagrange dual $(12)$, which is concave in $\boldsymbol{\Lambda}$, so we can solve it for $\boldsymbol{\Lambda}$ (analytically in simple cases, numerically in general). Substituting the resulting $\boldsymbol{\Lambda}$ into $(11)$, we obtain $\mathbf{D}$.
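
Below is a sketch of this dictionary step. It uses SciPy's generic root finder as a stand-in for a dedicated Newton scheme on the dual; the function name `update_dictionary`, the initialization, and the solver choice are our own assumptions.

```python
import numpy as np
from scipy.optimize import root

def update_dictionary(X, A):
    """Dictionary step: solve (15) for the diagonal of Lambda, then use (11).

    Assumed shapes: X is (d, N), A is (K, N).
    """
    K = A.shape[0]
    XAt = X @ A.T                        # X A^T, shape (d, K)
    AAt = A @ A.T                        # A A^T, shape (K, K)

    def residual(mu):
        # g_i(mu) = ||X A^T (A A^T + Lambda)^{-1} e_i||_2^2 - 1, cf. (15)
        M = XAt @ np.linalg.inv(AAt + np.diag(mu))
        return np.sum(M ** 2, axis=0) - 1.0

    mu0 = np.full(K, np.trace(AAt) / K)  # a crude positive initialization
    mu = root(residual, mu0).x           # a dedicated solver would also keep mu >= 0
    # D = X A^T (A A^T + Lambda)^{-1}, cf. (11)
    return XAt @ np.linalg.inv(AAt + np.diag(mu))
```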

An interesting fact is that if we substitute $(11)$ into $(15)$, we obtain

\begin{equation} \begin{aligned} {\lVert \mathbf{D} \mathbf{e}_i \rVert}^2_2 = {\lVert D(:,i) \rVert}^2_2 = 1, \end{aligned} \end{equation}

which shows that the solution given by $(11)$ and $(15)$ satisfies the constraints in $(6)$ with equality.
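
This can also be checked numerically with the `update_dictionary` sketch above: on random test data, the returned atoms should come back with (approximately) unit norm.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 50))         # toy data matrix
A = rng.standard_normal((5, 50))         # toy code matrix
D = update_dictionary(X, A)
print(np.linalg.norm(D, axis=0))         # each column norm should be close to 1, cf. (16)
```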