Special Notes on Aug. 24, 2018
| Date: | Aug 24, 2018 |
| --- | --- |
| Last Updated: | Jan 29, 2019 |
| Categories: | |
| Tags: | |
Introduction
Check here to see the Levenberg–Marquardt algorithm on Wikipedia.
Check the link above to review the theory of the Levenberg–Marquardt algorithm (LMA), which is an improvement over the plain first-order gradient descent method. In this short discussion, we would like to talk about how and why such a method can be derived, so that we know when this method can be used and when it cannot.
To understand the following parts, we first need to introduce the following concepts:
Derivatives of vector-valued functions
Suppose that we have a function $\mathbf{y} = \mathbf{f}(\mathbf{x})$, where $\mathbf{x}$ is an $M$-dimensional vector (viewed as an $M \times 1$ column matrix, the convention we use below) and $\mathbf{y}$ is an $N$-dimensional vector. Thus we know $\mathbf{f}$ is an $\mathbb{R}^{M} \rightarrow \mathbb{R}^{N}$ mapping. In particular, if $N=1$, the function accepts an $M$-dimensional vector but gives only one output value; we use $y = f(\mathbf{x})$ to represent this case. In the other special case, the function accepts only one value but outputs a vector; we use $\mathbf{y} = \mathbf{f}(x)$ to represent this case.
First, let us talk about the special case $f(\mathbf{x})$. Since the input of $f$ is a vector, we can view it as $f(x_1,~x_2,~\ldots,~x_M)$, where $x_i$ is the $i^{\mathrm{th}}$ element of $\mathbf{x}$. We can calculate $\frac{\partial f}{\partial x_i}$ for any $i$, and collect these partial derivatives to obtain the derivative of $f$ with respect to $\mathbf{x}$.
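Here we adopt the column-vector convention (kept throughout this note), so the derivative has the same shape as $\mathbf{x}$:

$$\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_M} \end{bmatrix}^T.$$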
The other special case is $\mathbf{f}(x)$. Denote by $f_i(x)$ the $i^{\mathrm{th}}$ element of $\mathbf{y}$. Since we can calculate $\frac{\partial f_i}{\partial x}$ for any $i$, we can likewise stack these values to obtain the derivative of $\mathbf{f}$ with respect to $x$.
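Stacking the component derivatives as a column vector:

$$\frac{\partial \mathbf{f}}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_2}{\partial x} & \cdots & \frac{\partial f_N}{\partial x} \end{bmatrix}^T.$$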
Jacobian matrix
Check here to see the Jacobian matrix on Wikipedia.
With the above definitions, we may now want to define the derivative of $\mathbf{f}(\mathbf{x})$. Since it has multiple inputs and multiple outputs, the meaning of its derivative may not be very apparent. Such a derivative, or gradient, is generally called the “Jacobian matrix”. It has $N$ rows and $M$ columns and can be formulated as follows.
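Its $(i,\,j)$ entry is $\frac{\partial f_i}{\partial x_j}$, i.e. the derivative of the $i^{\mathrm{th}}$ output with respect to the $j^{\mathrm{th}}$ input:

$$\mathbf{J}(\mathbf{f}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_M} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_M} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_N}{\partial x_1} & \frac{\partial f_N}{\partial x_2} & \cdots & \frac{\partial f_N}{\partial x_M} \end{bmatrix}.$$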
To understand why the Jacobian matrix can be defined as the derivative, suppose that we have a function composition $z = g(\mathbf{y}) = g(\mathbf{f}(\mathbf{x}))$ and that we have already obtained the derivative with respect to $\mathbf{y}$, i.e. we know $\frac{\partial g}{\partial \mathbf{y}}$. Then we can derive $\frac{\partial g}{\partial \mathbf{x}}$ as follows.
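Under the column-gradient convention assumed above, the chain rule takes the form:

$$\frac{\partial g}{\partial \mathbf{x}} = \mathbf{J}(\mathbf{f})^T \, \frac{\partial g}{\partial \mathbf{y}}.$$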
In this equation, the Jacobian matrix serves as a weight matrix which re-weights the gradient over $\mathbf{y}$ and maps the gradient from the space $\mathbb{R}^N$ to $\mathbb{R}^M$. If we calculate $\frac{\partial g}{\partial \mathbf{x}}$ and $\frac{\partial g}{\partial \mathbf{y}}$ numerically, we will find that the above equation is satisfied.
Hessian matrix
Check here to see Newton's method in optimization on Wikipedia.
In some cases, the second-order derivative can help us get better or faster convergence. For example, Newton's method is the plainest idea for utilizing the second-order term in optimization. Compared to the gradient descent method (a first-order method), it can reduce the influence of the local gradient and take a shorter route by making use of the curvature.
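For a single-valued objective $f(\mathbf{x})$, the Newton step rescales the gradient by the inverse of the Hessian matrix $\mathbf{H}(f)$ (defined below), with both evaluated at the current iterate $\mathbf{x}_k$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \big[\mathbf{H}(f)\big]^{-1}\,\frac{\partial f}{\partial \mathbf{x}}\bigg|_{\mathbf{x}=\mathbf{x}_k}.$$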
Check here to see the Gauss-Newton algorithm on Wikipedia.
Furthermore, the Gauss-Newton algorithm is a practical version of Newton's method. By approximating the Hessian matrix (second-order derivative) as $\mathbf{J}^T\mathbf{J}$, it provides a good approximation of the second-order term in least-squares problems.
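To see why this works, consider the least-squares objective $g(\mathbf{x}) = \frac{1}{2}\lVert \mathbf{y} - \mathbf{f}(\mathbf{x}) \rVert^2$ (the factor $\frac{1}{2}$ is only a convention). Its exact Hessian is $\mathbf{J}^T\mathbf{J}$ plus a term weighted by the residuals, which becomes negligible when the fit is good; $\mathbf{H}(f_i)$ below denotes the Hessian of the $i^{\mathrm{th}}$ component of $\mathbf{f}$, defined in the next paragraphs:

$$\mathbf{H}(g) = \mathbf{J}^T\mathbf{J} - \sum_{i=1}^{N}\big(y_i - f_i(\mathbf{x})\big)\,\mathbf{H}(f_i) \approx \mathbf{J}^T\mathbf{J}.$$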
Check here to see the Hessian matrix on Wikipedia.
First, suppose that there is a single-valued function $f(\mathbf{x})$ that accepts a vector. Then its corresponding Hessian matrix is defined as follows.
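It is the symmetric $M \times M$ matrix whose $(i,\,j)$ entry is $\frac{\partial^2 f}{\partial x_i \partial x_j}$:

$$\mathbf{H}(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_M} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_M} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_M \partial x_1} & \frac{\partial^2 f}{\partial x_M \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_M^2} \end{bmatrix}.$$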
In the case of a vector-valued function $\mathbf{y} = \mathbf{f}(\mathbf{x})$, where we denote $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) & f_2(\mathbf{x}) & \cdots & f_N(\mathbf{x}) \end{bmatrix}^T$, the Hessian becomes a 3D array: each element along its first axis is the 2D Hessian matrix of one component $f_i$.
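Written as a stack of the per-component Hessians:

$$\mathbf{H}(\mathbf{f}) = \begin{bmatrix} \mathbf{H}(f_1) & \mathbf{H}(f_2) & \cdots & \mathbf{H}(f_N) \end{bmatrix},$$

where each $\mathbf{H}(f_i)$ is the $M \times M$ Hessian of $f_i$.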
Derive LMA from a second-order expansion view
In this part, we derive the LMA from a second-order expansion view. In fact, LMA can be regarded as a semi-second-order gradient approach: when calculating the derivative, we use a second-order expansion for the least-squares objective, while the inner function, i.e. the forward function, is only expanded to first order.
General form
Consider a function composition $g(\mathbf{f}(\mathbf{x}))$, where we denote the inner function vector as $\mathbf{z} = \mathbf{f}(\mathbf{x})$. Then we can calculate the increment of the single-valued function as follows.
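Keeping terms up to second order in $\Delta\mathbf{z}$:

$$g(\mathbf{z} + \Delta\mathbf{z}) - g(\mathbf{z}) \approx \left(\frac{\partial g}{\partial \mathbf{z}}\right)^T \Delta\mathbf{z} + \frac{1}{2}\,\Delta\mathbf{z}^T\,\mathbf{H}(g)\,\Delta\mathbf{z},$$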
which indicates that we expand the function $g$ to second order. Note that we use the Hessian matrix $\mathbf{H}(g)$ to represent the second-order derivative. Then we can represent the increment $\Delta \mathbf{z}$ as follows.
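Linearizing the inner function around $\mathbf{x}$:

$$\Delta\mathbf{z} = \mathbf{f}(\mathbf{x} + \Delta\mathbf{x}) - \mathbf{f}(\mathbf{x}) \approx \mathbf{J}(\mathbf{f})\,\Delta\mathbf{x}.$$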
Note that we have used the Jacobian matrix $\mathbf{J}(\mathbf{f})$ to represent the derivative of the function vector $\mathbf{f}$ with respect to $\mathbf{x}$. Actually, we could also expand $\mathbf{f}(\mathbf{x}+\Delta \mathbf{x})$ to second order, but here we keep only the first-order term, so the expansion of the inner function vector $\mathbf{f}$ remains first-order.
LMA for the least-squares problem
Consider the least-squares problem below.
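Assuming the conventional factor of $\frac{1}{2}$ (which does not affect the minimizer), the objective is the composition:

$$\min_{\mathbf{x}}\ g(\mathbf{f}(\mathbf{x})), \qquad g(\mathbf{z}) = \frac{1}{2}\,\big\lVert \mathbf{y} - \mathbf{z} \big\rVert^2,$$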
where $\mathbf{y}$ is constant (usually viewed as “observation”), and $\mathbf{x}$ is what we need to optimize.
So the least-squares problem can be described as finding a stationary point of this function composition.
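Using the chain rule from the previous section with $\frac{\partial g}{\partial \mathbf{z}} = -(\mathbf{y} - \mathbf{z})$, the stationarity condition reads:

$$\frac{\partial}{\partial \mathbf{x}}\,g(\mathbf{f}(\mathbf{x})) = -\,\mathbf{J}^T\big(\mathbf{y} - \mathbf{f}(\mathbf{x})\big) = \mathbf{0},$$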
where we denote the Jacobian matrix of the inner function vector $\mathbf{f}$ as $\mathbf{J}$ for convenience.
Then we can rewrite the increment for this particular problem.
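Substituting $\Delta\mathbf{z} \approx \mathbf{J}\,\Delta\mathbf{x}$, $\frac{\partial g}{\partial \mathbf{z}} = -(\mathbf{y} - \mathbf{f}(\mathbf{x}))$, and $\mathbf{H}(g) = \mathbf{I}$ into the general second-order expansion gives:

$$g(\mathbf{f}(\mathbf{x} + \Delta\mathbf{x})) - g(\mathbf{f}(\mathbf{x})) \approx -\big(\mathbf{y} - \mathbf{f}(\mathbf{x})\big)^T\mathbf{J}\,\Delta\mathbf{x} + \frac{1}{2}\,\Delta\mathbf{x}^T\mathbf{J}^T\mathbf{J}\,\Delta\mathbf{x}.$$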
We then set the derivative of this expansion with respect to $\Delta \mathbf{x}$ to zero.
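This yields the normal equations, whose solution is the Gauss-Newton step:

$$\mathbf{J}^T\mathbf{J}\,\Delta\mathbf{x} = \mathbf{J}^T\big(\mathbf{y} - \mathbf{f}(\mathbf{x})\big), \qquad \Delta\mathbf{x} = \big(\mathbf{J}^T\mathbf{J}\big)^{-1}\mathbf{J}^T\big(\mathbf{y} - \mathbf{f}(\mathbf{x})\big).$$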
The Jacobian matrix is not square in most cases. However, the symmetric matrix $\mathbf{J}^T \mathbf{J}$ is square, so we can calculate its inverse (as long as $\mathbf{J}$ has full column rank). This result also explains why we use $\mathbf{J}^T\mathbf{J}$ as the approximation of the Hessian matrix. Frankly, this technique can only be used in problems like least squares.
To prevent the inverse problem from being underdetermined, we can add a damping term, which pulls the update toward the pure first-order gradient direction.
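With a damping factor $\lambda > 0$:

$$\Delta\mathbf{x} = \big(\mathbf{J}^T\mathbf{J} + \lambda\,\mathbf{I}\big)^{-1}\mathbf{J}^T\big(\mathbf{y} - \mathbf{f}(\mathbf{x})\big).$$

When $\lambda \rightarrow 0$ this recovers the Gauss-Newton step, while for large $\lambda$ the step approaches a small gradient-descent step.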
Furthermore, there is an adaptation that makes the damping term scale with each parameter.
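Replacing the identity with the diagonal of $\mathbf{J}^T\mathbf{J}$, so the damping respects the curvature of each coordinate:

$$\Delta\mathbf{x} = \big(\mathbf{J}^T\mathbf{J} + \lambda\,\mathrm{diag}(\mathbf{J}^T\mathbf{J})\big)^{-1}\mathbf{J}^T\big(\mathbf{y} - \mathbf{f}(\mathbf{x})\big).$$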
This solution is exactly what we call “LMA”.
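As a minimal sketch of how the update above could be iterated in practice, assuming a user-supplied residual model `f` and its Jacobian `jac` (the function name, defaults, and the simple halving/doubling rule for adjusting $\lambda$ are only illustrative choices, not taken from a particular library):

```python
import numpy as np

def levenberg_marquardt(f, jac, x0, y, lam=1e-2, n_iter=100, tol=1e-10):
    """Minimize 0.5 * ||y - f(x)||^2 with the damped update derived above.

    f   : callable mapping an (M,) parameter vector to an (N,) prediction
    jac : callable returning the (N, M) Jacobian of f at a given x
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r = y - f(x)                                  # residual y - f(x)
        J = jac(x)                                    # N x M Jacobian
        A = J.T @ J                                   # Gauss-Newton Hessian approximation
        g = J.T @ r                                   # J^T (y - f(x)), the negative gradient
        # Damped normal equations: (J^T J + lam * diag(J^T J)) dx = J^T (y - f(x))
        dx = np.linalg.solve(A + lam * np.diag(np.diag(A)), g)
        if 0.5 * np.sum((y - f(x + dx)) ** 2) < 0.5 * np.sum(r ** 2):
            x, lam = x + dx, lam * 0.5                # step accepted: trust the model more
        else:
            lam *= 2.0                                # step rejected: lean toward gradient descent
        if np.linalg.norm(dx) < tol:
            break
    return x
```

For instance, passing `f = lambda x: A @ x` and `jac = lambda x: A` for some fixed matrix `A` solves a linear least-squares problem; more sophisticated implementations replace the halving/doubling schedule with a gain-ratio-based update of $\lambda$.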