Notes from Jan. 29, 2019 to Feb. 15, 2019

Date: Jan 29, 2019
Last Updated: Feb 15, 2019
Categories:
Notes Papers
Tags:
research deep-learning stochastic inverse-problem NIPS

Contents


Introduction

In this article we would summarize some popular solutions for the inverse problem. To be specific, here we only discuss the methods of inverse problems. Although when introducing some algorithms, we may need to explain what problem they work on, the various topics about which problem we solve is not what we concentrate on in this article.

Generally the solutions could be divided into 2 parts: optimizing methods and stochastic approaches. In the first part, i.e. the optimizing methods, we would introduce different iterative algorithms which are used to find the optimal solution of the problem. In most cases, these algorithms are based on deterministic methods, i.e. the derivatives. In the second part, we would introduce some stochastic methods especially some heuristic methods where we do not need to calculate the gradient but only need to evaluate the loss function. In those 2 parts, we would also talks about machine learning, regularization and some other related topics which have been applied to enhance the performance of the plain algorithms.

In the following parts, we will show not only the methods and ideas but also the state-of-art progress on inverse problem. Our literature resource is mainly from NIPS and IEEE Xplore.

In the last section, we attach 3 groups of slides which are related to this article. The slides include the weekly report from 02/01/2019 to 02/15/2019.

Optimizing methods

Optimizing methods for solving non-linear inverse problem

Conjugate gradient descent

  • Title: A new family of conjugate gradient methods for unconstrained optimization
  • Author: Mohd Rivaie, Muhammad Fauzi, and Mustafa Mamat
  • Year: 2011
  • Theory level: Theoretical
  • Theory type: Gradient based algotirhm
  • Used data: Standard testing benchmarks
  • Source: International Conference on Modeling, Simulation and Applied Optimization

Reference

  • External Reference: A modified PRP conjugate gradient method with Armijo line search for large-scale unconstrained optimization.

Reference

This paper proposes a new conjugate gradient coefficient which is used in conjugate gradient descent method. As an alternative of Newton's method, conjugate gradient method aims at approximating the Hessian matrix with first order gradient to avoid the computationally expensive second-order term. Regardless the computational cost, this method is slower than Newton's method but faster than first-order gradient descent.

Denote that we need to solve such an unconstrained problem which could be non-linear,

\begin{align} \hat{\mathbf{x}} = \arg \min\limits_{\mathbf{x}} f(\mathbf{x}). \end{align}

The algorithm could be described as:

  1. Initialize the input parameter $\mathbf{x}_0$, $k=0$.
  2. Calculate first-order gradient $\mathbf{g}_k = \nabla f(\mathbf{x}_k)$.
  3. Compute $\boldsymbol{\beta}_k$ which is the conjugate gradient coefficient.
  4. Update descent direction: when $k=0$, let $\mathbf{d}_k=-\mathbf{g}_k$; when $k > 0$, $\mathbf{d}_k=-\mathbf{g}_k+\boldsymbol{\beta}_k\mathbf{d}_{k-1}$.
  5. Use line search to find the best update parameter: $\alpha_k=\arg \min_{\alpha}f(\mathbf{x}_k+\alpha \mathbf{d}_k)$.
  6. Let $\mathbf{x}_{k+1} = \mathbf{x}_k+\alpha_k \mathbf{d}_k$. If $f(\mathbf{x}_{k+1}) < f(\mathbf{x}_k)$ and $\lVert \mathbf{g}_k \rVert < \varepsilon$, stop; otherwise get back to step 2.

The author also proposes a new conjugate gradient coefficient and proves the convergence of the new coefficient.

Frank–Wolfe Algorithm

  • Title: Decentralized Frank–Wolfe Algorithm for Convex and Nonconvex Problems
  • Author: Hoi-To Wai, Jean Lafond, Anna Scaglione, and Eric Moulines
  • Year: 2017
  • Theory level: Theoretical
  • Theory type: Gradient based algotirhm
  • Used data: Standard testing benchmarks
  • Source: IEEE Transactions on Automatic Control

Reference

This paper introduce a decentralized version of Frank–Wolfe Algorithm which is used to solve a strictly constrained optimizing problem. In this paper, the author assume that there a series of functions $f_i (\mathbf{x})$ which might not be convex, then the problem is

\begin{align} \hat{\mathbf{x}} = \arg \min\limits_{\mathbf{x} \in \mathcal{D}} F(\mathbf{x}) = \arg \min\limits_{\mathbf{x} \in \mathcal{D}} \sum_i f_i(\mathbf{x}). \end{align}

Denote that the adjacent matrix $\mathbf{W}$ has such condition: $\sum_{j} W_{ij} = 1$. Then the algorithm could be described as:

  1. For each agent, calculate the local average iterate among its neighbor: $\bar{\mathbf{x}}_i = \sum_{j} W_{ij} \mathbf{x}_j$, where $W_{ij}$ is an element of the adjacent matrix.
  2. For each agent, calculate the local average gradient among its neighbor: $\overline{\nabla F}_i= \sum_{j} W_{ij} \nabla f_j (\mathbf{x}_j)$.
  3. Let $\boldsymbol{\alpha}_i = \arg \min\limits_{\boldsymbol{\alpha}_i \in \mathcal{D}} \boldsymbol{\alpha}_i^T \overline{\nabla F}_i$.
  4. Update iterate: $\mathbf{x}_{i+1} = (1-\gamma) \bar{\mathbf{x}}_i + \gamma \boldsymbol{\alpha}_i$.

Note that the domain $\mathcal{D}$ is a strict constraint. For example, we could set that $\mathcal{D} = \{ \mathbf{x} | \lVert \mathbf{x} \rVert_1 < \lambda \}$. Then the problem would become a lasso problem.

The author also discuss the convergence when $f_i$ is convex functions, and the bound when $f_i$ is non-convex. Compared to the conventional FW algorithm, the proposed one is decentralized, which means it could make use of multiple agents and the parallel computing. The author has proved that although in the algorithm we only calculate the "local average" for each agent, when considering the whole system product, the "local average" would converge to the real one.

Optimizing methods for solving linear inverse problem

LISTA-CPSS Algorithm

  • Title: Theoretical Linear Convergence of Unfolded ISTA and its PracticalWeights and Thresholds
  • Author: Xiaohan Chen, Jialin Liu, Zhangyang Wang and Wotao Yin
  • Year: 2018
  • Theory level: Theoretical
  • Theory type: Machine learning
  • Used data: 11 test images for sparse coding
  • Source: Advances in Neural Information Processing Systems 31 (NIPS 2018) pre-proceedings

Reference

This paper concentrates on the inspection on LISTA (Learning-ISTA). To check the basic idea of LISTA and LAMP, please check this article:

Reference

LISTA which is adapted from ISTA, has 2 weight matrices in each layer. The network structure could be described in the following figure.

The network still preserves the workflow of ISTA. The observed data is the input for each layer, while we still need an initial guess for the prediction. The initial guess is passed to the first layer and adjusted in each layer until we get the accurate prediction.

In this paper, the author propose two improvements on the original LISTA:

  1. Partial weight coupling (CP): The author proves that the primal LISTA could not converge unless $\mathbf{W}_{k2} = \mathbf{I} - \mathbf{W}_{k1} \mathbf{A}$. Hence we would only has one weight matrix in each layer.
  2. Support selection technique: The author proposes that after we calculate the $\mathbf{v}_k = \mathbf{W}_{k1} \mathbf{b} + \mathbf{W}_{k2} \mathbf{x}_k$, we could use a new threshold technique. When $\mathbf{v}_k$ is small, use soft thresholding; when $\mathbf{v}_k$ is large, use hard thresholding.

With the two above methods equipped, the convergence speed would be accelerated.

Inf-ADMM-ADNN

  • Title: An inner-loop free solution to inverse problems using deep neural networks
  • Author: Kai Fan, Qi Wei, Lawrence Carin and Katherine A. Heller
  • Year: 2017
  • Theory level: Application
  • Theory type: Machine learning
  • Used data: Deblurring, super resolution and colorization
  • Source: Advances in Neural Information Processing Systems 30 (NIPS 2017)

Reference

  • External Reference: Distributed optimization and statistical learning via the alternating direction method of multipliers.

Reference

This paper aims at solve a general form of liner inverse problem:

\begin{equation} \begin{aligned} \arg \min\limits_{\mathbf{x}} &\lVert \mathbf{y} - \mathbf{A}\mathbf{z} \rVert + \lambda \mathcal{R}(\mathbf{x},~\mathbf{y}),\\ \mathrm{s.t.}&~\mathbf{z}=\mathbf{x}. \end{aligned} \end{equation}

By using Lagrange multiplier method, we could decompose this optimization into 3 steps in each iteration. We call this method alternating direction method of multipliers (ADMM) framework:

\begin{equation} \left\{ \begin{aligned} \mathbf{x}^{k+1} &= \arg \min\limits_{\mathbf{x}} \beta \left\lVert \mathbf{x} - \mathbf{z}^k + \frac{\mathbf{u}^k}{2\beta} \right\rVert^2 + \lambda \mathcal{R}(\mathbf{x},~\mathbf{y}), \\ \mathbf{z}^{k+1} &= \arg \min\limits_{\mathbf{z}} \left\lVert \mathbf{y} - \mathbf{A}\mathbf{z} \right\rVert^2 + \beta \left\lVert \mathbf{x} - \mathbf{z}^k + \frac{\mathbf{u}^k}{2\beta} \right\rVert^2, \\ \mathbf{u}^{k+1} &= \mathbf{u}^k + 2 \beta \left( \mathbf{x}^{k+1} - \mathbf{z}^{k+1} \right) \end{aligned} \right. \end{equation}

In practice, the matrix $\mathbf{A}$ may be a very large one. To solve this problem, we have overcome two main challenges. The first one is the solution for $\mathbf{z}^{k+1}$. Although it has a closed-form solution, we need to calculate the inverse of $\mathbf{B} = \left( \beta\mathbf{I} + \mathbf{A}\mathbf{A}^T \right)^{-1}$. Thus the author proposes a network which is used to learn $\mathbf{C}_{\phi} \rightarrow \mathbf{B}^{-1}$

In some cases, for example, if $\mathcal{R} = \lVert \cdot \rVert_1$, $\mathbf{x}^{k+1}$ which is deduced from proximal operator could get the closed-form solution. However, if $\mathcal{R}$ is in a generalized form, for the proximal operator $\arg\min\limits_{\mathbf{x}} \frac{1}{2} \lVert \mathbf{x} - \mathbf{v} \rVert_2^2 + \mathcal{R}(\mathbf{x},~\mathbf{y})$, we could find that

\begin{align} \mathbf{v} - \mathbf{x} \propto \partial \mathcal{R}. \end{align}

Thus we know that $\mathbf{v} = \mathcal{F} (\mathbf{x})$. And we use the following network to learn the inverse $\mathbf{x} = \mathcal{F}^{-1}(\mathbf{v})$.

VAMP equipped with non-separable denoiser

  • Title: Plug-in Estimation in High-Dimensional Linear Inverse Problems: A Rigorous Analysis
  • Author: Alyson K. Fletcher, Parthe Pandit, Sundeep Rangan, Subrata Sarkar and Philip Schniter
  • Year: 2018
  • Theory level: Theoretical
  • Theory type: Gradient based algorithm
  • Used data: Compressive Image Recovery and Lifting
  • Source: Advances in Neural Information Processing Systems 31 (NIPS 2018) pre-proceedings

Reference

  • External Reference: To learn more about AMP, please check this note:

Reference

This paper extends the Vector-AMP (VAMP) to a wider case. First, let us introduce VAMP. VAMP extend the conventional AMP into a two-part iteration structure. In the first part, we use a denoising operator $\mathbf{g}_1(\cdot)$ as we have done in AMP. In the other part, we use a linear minimum mean-squared error (LMMSE) operator to ensure a state evolution (SE) analysis. This technique extend the algorithm to a case that $\mathbf{A}$ is not required to be in the i.i.d. sub-Gaussian distribution but only need to be an arbitrary right rotationally invariant matrix.

This is the algorithm of VAMP:

The sub-structure in each part is the same as AMP. However, in this algorithm, we have 2 function vectors $\mathbf{g}_1(\cdot)$ and $\mathbf{g}_2(\cdot)$, where $\mathbf{g}_2(\cdot)$ needs to be a solution to the L2-penalized linear inverse problem:

\begin{align} \mathbf{g}_2(\mathbf{r}_{2k},~ \gamma_{2k}) := \left( \gamma_{\omega} \mathbf{A}^T\mathbf{A} + \gamma_{2k}\mathbf{I} \right)^{-1} \left( \gamma_{2k} \mathbf{A}^T\mathbf{y} + \gamma_{2k}\mathbf{r}_{2k} \right). \end{align}

However, $\mathbf{g}_1(\cdot)$ could be many kinds of denoisers. For example, soft thresholding is the solution derived from L1 norm penalty, which is a separable denoiser. In this article, the author propose that with the ground truth $\mathbf{x}^{\ast}$ companied by Gaussian noise, both of the errors $\mathbf{p}_k = \mathbf{r}_{1k} - \mathbf{x}^{\ast}$ and $\mathbf{q}_k = \mathbf{V}^T \left(\mathbf{r}_{1k} - \mathbf{x}^{\ast}\right)$

converge to a vector with each element in the Gaussian distribution with zero mean and $\tau_{1k}$, $\tau_{2k}$ variance respectively.

Regularization / Penalty

Adversarial Regularizers

  • Title: Adversarial Regularizers in Inverse Problems
  • Author: Sebastian Lunz, Carola Schoenlieb and Ozan Öktem
  • Year: 2018
  • Theory level: Theoretical
  • Theory type: Machine learning
  • Used data: image inverse, the data is from BSDS500 dataset
  • Source: Advances in Neural Information Processing Systems 31 (NIPS 2018) pre-proceedings

Reference

Consider such a problem:

\begin{align} \hat{\mathbf{x}} = \arg \min\limits_{\mathbf{x}} \lVert \mathbf{y} - \mathbf{A}\mathbf{x}\rVert_2^2 + \lambda f(\mathbf{x}). \end{align}

This paper proposes that we could use a network $\Psi_{\boldsymbol{\Theta}}(\cdot)$ to replace the regularization term $f$. Since the network could learn from the distribution of the data, it could serve better than a specially designed regularization.

Considering that $\mathbf{y} = \mathbf{A} \mathbf{x}$, we may have a direct inverse method that $\mathbf{x} = \mathbf{A}^{\ast}\mathbf{x}$. Note that because $\mathbf{A}$ may not be a square matrix, it would not have inverse in most cases. However, $\mathbf{A}^{\ast}$ could be viewed as a pseudo-inverse of $\mathbf{A}$. There exists some techniques where we could calculate $\mathbf{A}^{\ast}$ efficiently. And a research also discloses that such conclusion also holds when we consider a pseudo-inverse of a convolution.

Considering that we have ground truth set $\mathbb{P}_r$, and the observation set $\mathbf{P}_y$. By $\mathbf{A}^{\ast}$ we could project the observation into a pseudo-inverse set $\mathbf{P}_n$. Then we could use such loss function to train network:

\begin{align} \mathcal{L} = \Psi_{\boldsymbol{\Theta}}(\mathbf{x}_r) - \Psi_{\boldsymbol{\Theta}}(\mathbf{x}_n) + \mu \max\left( \lVert \nabla_{\mathbf{x}_i} \Psi_{\boldsymbol{\Theta}}(\mathbf{x}_i) \rVert^2_2,~0\right)^2, \end{align}

where $\mathbf{x}_r \sim \mathbb{P}_r$, $\mathbf{x}_n \sim \mathbb{P}_n$ and $\mathbf{x}_i = \varepsilon \mathbf{x}_r + (1 - \varepsilon \mathbf{x}_n)$. $\varepsilon$ is a random variable in the uniform distribution $U(0,~1)$. The last term in the loss function is used to enforce $\Psi_{\boldsymbol{\Theta}}$ to be Lipschitz continuous. During the training, we sample $\mathbf{x}_r$ and $\mathbf{x}_n$ randomly. Since they are not coupled, the loss function is essentially the Wasserstein distance which makes the network learn the minimal distance between the two distribution $\mathbb{P}_r$ and $\mathbb{P}_n$.

After the network getting trained, the network parameter $\boldsymbol{\Theta}$ would be fixed. Then we could use gradient descent to solve the inverse problem:

\begin{align} \mathbf{x}^{k+1} = \mathbf{x}^k - \alpha \nabla_{\mathbf{x}} \left( \lVert \mathbf{y} - \mathbf{A}\mathbf{x}^k\rVert_2^2 + \lambda \Psi_{\boldsymbol{\Theta}}(\mathbf{x}^k) \right). \end{align}

Proximal gradient method

  • Title: Parallel proximal methods for total variation minimization
  • Author: Ulugbek S. Kamilov
  • Year: 2017
  • Theory level: Theoretical
  • Theory type: Regularization
  • Used data: Shepp-Logan images
  • Source: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Reference

In this paper, the author gives the general formulation of the proximal algorithm where we need to solve that

\begin{align} \hat{\mathbf{x}} = \arg \min\limits_{\mathbf{x}} \mathcal{D}(\mathbf{x}) + \frac{1}{K} \sum_{k=1}^K \mathcal{R}_k(\mathbf{x}), \end{align}

where $\mathcal{D},~\mathcal{R}_k$ are convex. Such problem often appears when solving dictionary learning. To get the solution, we need to apply ISTA-based optimization:

\begin{align} \mathbf{z}_t = \mathbf{x}_{t-1} - \gamma_t \nabla \mathcal{D} (\mathbf{x}_{t-1}), \\ \mathbf{x}_t = \frac{1}{K} \sum_{k=1}^K \mathrm{prox}_{\gamma_t \mathcal{R}_k}(\mathbf{z}_t). \end{align}

To learn the details about what is proximal operator ($\mathbf{prox}(\cdot)$) and how to solve it, please check:

Reference

Landweber iteration

  • Title: Learning Model-Based Sparsity via Projected Gradient Descent
  • Author: Sohail Bahmani, Petros T. Boufounos, and Bhiksha Raj
  • Year: 2017
  • Theory level: Theoretical
  • Theory type: Regularization
  • Used data: None (pure theoretical)
  • Source: IEEE Transactions on Information Theory

Reference

This paper gives the proof of how the Landweber iteration (Projected Gradient Descent) converges, and it also verify some other features like Stable Model-Restricted Hessian (SMRH) condition.

The Landweber iteration is used when there is a strict constraint accompanied with the optimization.

\begin{align} \hat{\mathbf{x}} = \arg \min\limits_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x}). \end{align}

Different from the plain gradient descent, it needs to project the updated parameter to the strictly limited space:

\begin{align} \mathbf{x}_{t+1} = \mathcal{P}_{\mathcal{C}} (\mathbf{x}_t - \alpha \nabla f(\mathbf{x})). \end{align}

Note that to solve this problem, we need to use the project function $\mathcal{P}$ which means we find a solution $\mathbf{x}_{t+1} \in \mathcal{C}$ that has the smallest distance to $\mathbf{x}_t - \alpha \nabla f(\mathbf{x})$.

Stochastic methods

Sampling methods

Metropolis-Hastings

  • Title: Bayesian calibration of a large‐scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis Hastings algorithm
  • Author: T. Cui, C. Fox, and M. J. O'Sullivan
  • Year: 2011
  • Theory level: Algorithm
  • Theory type: Stochastic sampling algorithm
  • Used data: Geothermal reservoir modeling
  • Source: Water Resources Research

Reference

Metropolis-Hastings method is widely used in Monte-Carlo-Markov-Chain (MCMC) methods. This paper proposes an improvement for solving inverse problems with a highly complicated forward model. Suppose we have a model function $\mathbf{f}$, the likelihood function derived from Bayesian method could be formulated as

\begin{align} p(\mathbf{x}|\mathbf{y}) \propto e^{-\frac{1}{2} (\mathbf{y} - \mathbf{f}(\mathbf{x}))^T \Sigma_e^{-1} (\mathbf{y} - \mathbf{f}(\mathbf{x})) } p(\mathbf{x}), \end{align}

where $\Sigma_e$ is the covariance matrix of the misfit vector $\mathbf{e} = \mathbf{y} - \mathbf{f}(\mathbf{x})$. Note that if we have a coarse model $\mathbf{f}^{\ast}(\mathbf{x})$ which is derived by reducing the order of $\mathbf{f}$, then we would have $\mathbf{e} = \mathbf{y} - \mathbf{f}_d(\mathbf{x}) + \left(\mathbf{f}_d(\mathbf{x}) - \mathbf{f}(\mathbf{x}) \right)$. Denote that $\mathbf{d} = \mathbf{f}_d(\mathbf{x}) - \mathbf{f}(\mathbf{x})$, and suppose that $\mathbf{d} \sim \mathcal{N}(\mu_d, \Sigma_d)$. Then we would have a new likelihood function

\begin{align} p_d(\mathbf{x}|\mathbf{y}) \propto e^{-\frac{1}{2} (\mathbf{y}_d - \mathbf{f}_d(\mathbf{x}) - \mu_d )^T (\Sigma_e + \Sigma_d)^{-1} (\mathbf{y}_d - \mathbf{f}_d(\mathbf{x}) - \mu_d) } p(\mathbf{x}). \end{align}

Suppose that we have a probability function $\mathbf{g}(\cdot)$, and random vector $\boldsymbol{\lambda}$ follows the distribution of $\mathbf{g}$. Then we could update a model parameter $\mathbf{x}_k$ by updating function $\mathbf{x}^{\dagger} = \boldsymbol{\eta}(\mathbf{x}_k,~\boldsymbol{\lambda})$. Consider another probablity function $\mathbf{g}_r(\cdots)$ which generates another random vector $\boldsymbol{\lambda}_r$. We may have another updating function $\mathbf{x}_k = \boldsymbol{\eta}_r(\mathbf{x}^{\dagger},~\boldsymbol{\lambda}_r)$. Hence we know $\boldsymbol{\eta},~\boldsymbol{\eta}_r$ are reverse transitions compared to each other. After that, we could define the proposal distribution as

\begin{equation} \left\{ \begin{aligned} &q(\mathbf{x}_k \rightarrow \mathbf{x}^{\dagger}) = g(\boldsymbol{\lambda}),\\ &q(\mathbf{x}^{\dagger} \rightarrow \mathbf{x}_k) = g_r(\boldsymbol{\lambda}_r) \mathbf{J}, \\ &\mathbf{J} = \left| \frac{ \partial \boldsymbol{\eta} \left( \begin{bmatrix} \mathbf{x} \\ \boldsymbol{\lambda} \end{bmatrix} \right) }{ \partial \begin{bmatrix} \mathbf{x} \\ \boldsymbol{\lambda} \end{bmatrix} } \right|. \end{aligned} \right. \end{equation}

Then, the Metropolis-Hastings algorithm could be described as

  1. Generate a random update $\mathbf{x}^{\dagger}$ and calculate its bi-directional transitions $q(\mathbf{x}_k \rightarrow \mathbf{x}^{\dagger}) ,~ q(\mathbf{x}^{\dagger} \rightarrow \mathbf{x}_k)$.
  2. Use Metropolis-Hastings algorithm to calculate acception rate over the coarse likelihood:
    \begin{align} \alpha(\mathbf{x}_k ,~ \mathbf{x}^{\dagger}) = \min \left( 1,~ \frac{p_d(\mathbf{x}^{\dagger}|\mathbf{y}) q(\mathbf{x}^{\dagger} \rightarrow \mathbf{x}_k)}{p_d(\mathbf{x}_k|\mathbf{y}) q(\mathbf{x}_k \rightarrow \mathbf{x}^{\dagger})} \right). \end{align}
  3. If $\mathbf{x}^{\dagger}$ could be accept according to the acception rate $\alpha$, then calculate another accpetion rate over the detail likelihood:
    \begin{align} \beta(\mathbf{x}_k ,~ \mathbf{x}^{\dagger}) = \min \left( 1,~ \frac{p(\mathbf{x}^{\dagger}|\mathbf{y}) \alpha(\mathbf{x}^{\dagger} ,~ \mathbf{x}_k) q(\mathbf{x}^{\dagger} \rightarrow \mathbf{x}_k)}{p(\mathbf{x}_k|\mathbf{y}) \alpha(\mathbf{x}_k ,~ \mathbf{x}^{\dagger}) q(\mathbf{x}_k \rightarrow \mathbf{x}^{\dagger})} \right). \end{align}
  4. If $\mathbf{x}^{\dagger}$ could be accept according to the acception rate $\beta$, then perform update $\mathbf{x}_{k+1} = \mathbf{x}^{\dagger}$.

The idea of this algorithm is, if the sample is rejected by the coarse likelihood, then we do not need to calculate the complicated forward model. Although calculating the coarse model would reduce the efficiency, if the sample is easy to be rejected, then we would gain better performance by this method.

Gibbs sampling

  • Title: A Bayesian inference approach to the inverse heat conduction problem
  • Author: Jingbo Wang, and Nicholas Zabaras
  • Year: 2004
  • Theory level: Algorithm
  • Theory type: Stochastic sampling algorithm
  • Used data: Two-dimensional inverse heat conduction problem (IHCP)
  • Source: International Journal of Heat and Mass Transfer

Reference

  • External Reference: This is another article about solving multi-frequency problem (i.e. multiple forward model) with Gibbs sampling:

Reference

In this paper, we have $y_j \sim \mathbf{T}(\mathbf{x},t_j)$, and $k \frac{\partial T(\mathbf{x}, t)}{\partial \mathbf{n}} = \sum \theta_i \mathbf{w}_i (\mathbf{x},~t)$. Hence we define that vector $\mathbf{y}$ should fulfil that $\mathbf{y} \sim \mathbf{f} (\boldsymbol{\theta})$, where $\mathbf{f}(\cdot)$ is known by some prior work.

According to Bayesian method, $p(\boldsymbol{\theta}|\mathbf{y}) \propto p(\mathbf{y}|\boldsymbol{\theta}) p(\boldsymbol{\theta})$, hence we have

\begin{align} p(\boldsymbol{\theta}|\mathbf{y}) \propto e^{-\frac{1}{2\sigma^2}\left( \mathbf{f}(\boldsymbol{\theta}) - \mathbf{y} \right)^T\left( \mathbf{f}(\boldsymbol{\theta}) - \mathbf{y} \right)} p(\boldsymbol{\theta}). \end{align}

To specify the prior $p(\boldsymbol{\theta})$, we use the random field, which means

\begin{align} p(\boldsymbol{\theta}) \propto e^{-\sum_{i,~j} W_{ij} (\theta_i - \theta_j)^2} = \lambda^{\frac{m}{2}} e^{- \frac{1}{2} \boldsymbol{\theta}^T \mathbf{W} \boldsymbol{\theta}} \end{align}

Then in each iteration, we could use Gibbs sampling method, which means in the kth iteration, if we have $\boldsymbol{\theta}^{(k)}$, then we perform that

  1. $\theta^{(k+1)}_1 \sim p\left(\theta_1 \left| \theta_2=\theta^{(k)}_2,~ \theta_3=\theta^{(k)}_3,~ \theta_n=\theta^{(k)}_n \right.\right)$
  2. $\theta^{(k+1)}_2 \sim p\left(\theta_2 \left| \theta_1=\theta^{(k+1)}_1,~ \theta_3=\theta^{(k)}_3,~ \theta_n=\theta^{(k)}_n \right.\right)$
  3. $\theta^{(k+1)}_3 \sim p\left(\theta_3 \left| \theta_1=\theta^{(k+1)}_2,~ \theta_2=\theta^{(k+1)}_2,~ \theta_n=\theta^{(k)}_n \right.\right)$
  4. ...
  5. $\theta^{(k+1)}_n \sim p\left(\theta_n \left| \theta_1=\theta^{(k+1)}_1,~ \theta_2=\theta^{(k+1)}_2,~ \theta_n=\theta^{(k+1)}_{n-1} \right.\right)$

Heuristic methods

Ant colony algorithm

  • Title: Application of homogenous continuous Ant Colony Optimization algorithm to inverse problem of one-dimensional coupled radiation and conduction heat transfer
  • Author: Biao Zhang, Hong Qi, Ya-Tao Ren, Shuang-Cheng Sun, and Li-Ming Ruan
  • Year: 2013
  • Theory level: Algorithm
  • Theory type: Stochastic heruistic algorithm
  • Used data: Steady-state coupled radiation and conduction heat transfer in an absorbing, emitting and scattering plane-parallel slab
  • Source: International Journal of Heat and Mass Transfer

Reference

  • External Reference: To learn a visual demo for ant colony optimization, please check this link:

Reference

This paper introduce a method where we could enhance the optimization results by ant colony optimization (ACO). First, this paper introduce the basic-ACO (BACO), which shrinks the search scope in each iteration to reduce the calculation complexity. However, because of the grid-scheme and the scope-shrinking strategy, BACO may converge to the local minimum in some cases. To solve this problem, in this paper the author proposes the homogeneous ACO (HACO).

Suppose that we need to solve $\arg \min\limits_{\mathbf{x}} \lVert \mathbf{y} - \mathbf{f}(\mathbf{x})\rVert_2^2$, where $\mathbf{f}$ is a highly nonlinear model. Then we could initialize $M$ solutions $\mathbf{x}^{(1)},\mathbf{x}^{(2)},\cdots,\mathbf{x}^{(M)}$, which is also called "ants" in ACO. Suppose that the metric function $\mathcal{L}(\mathbf{x})=\frac{1}{N}\left\lVert \mathbf{y} - \mathbf{f}\left(\mathbf{x}^{(k)}\right) \right\rVert_2$ is used to evaluate the solution $\mathbf{x}^{(k)}$. For each parameter of any $\mathbf{x}^{(k)}=\begin{bmatrix}x^{(k)}_1 & x^{(k)}_2 & \cdots & x^{(1)}_n \end{bmatrix}$, we set the initial searching scope $x^{(k)}_i \in [L_i, S_i]$ and divide the scope into $N_i$ intervals. Then we could get a series of discrete positions. Initialize that $\tau_{ij} = Q$ and $\eta_{ij} = \frac{1}{N}$. After that we could perform such iterations:

  1. For each solution $k$, $p^{(k)}_{ij} = \frac{\zeta}{N_i} + \left( 1 - \zeta \right)\frac{ \tau^{\alpha}_{ij} \eta^{\beta}_{ij} }{ \sum_{j=1}^{N_i} \tau^{\alpha}_{ij} \eta^{\beta}_{ij} }$, where $\alpha,~\beta > 0$ and $\zeta \in [0,~1]$. $p^{(k)}_{ij}$ is the probability that how possible $x^{(k)}_i$ would transfer into $x^{(k)}_j$. Since when $\zeta \rightarrow 1$, the move tends to be random, we call such update "homogeneous updating".
  2. After we update the new solutions according to $p^{(k)}_{ij}$, we could search a local minimum for each solution in a local area. This search could be totally random, i.e. we just adjust the solutions by adding random noise. And we could also use some deterministic methods. If we calculate $\mathcal{L}(\mathbf{x}^{(k)})$ for each solution, the local minimum should satisfy that $\mathcal{L}(\mathbf{x}^{(k)} + \boldsymbol{\varepsilon}^{(k)}) < \mathcal{L}(\mathbf{x}^{(k)})$. This step is designed for reducing the discrete sampling effect. Record and update the historical best solution $\mathbf{x}^{\ast}$ and the best 3 solutions in this step: $\mathbf{x}^{(s_1)},~\mathbf{x}^{(s_2)},~\mathbf{x}^{(s_3)}$.
  3. Calculate the increments for each solution $k$, which means in step 1, if $x^{(k)}_i$ changes to $x^{(k)}_j$, then we have $\Delta \tau^{(k)}_{ij} = Q$ and $\Delta \eta^{(k)}_{ij} = \frac{1}{N \mathcal{L}(\mathbf{x}_j)}$.
  4. Update the information variables: \begin{align*} \tau_{ij} &= (1-\rho) \tau_{ij} + \sum_{k=1}^{M} \Delta \tau^{(k)}_{ij},\\ \eta_{ij} &= \max \left( \eta_{ij},~ \sum_{k=1}^{M} \Delta \eta^{(k)}_{ij} \right). \end{align*}
  5. Defining $\gamma \in (0,1)$ which indicates the shrinking ratio. Then we could shrink the searching scopes according to such equations: \begin{align*} L_i &= \max \left( \min \left( x^{\ast}_i,~ x^{(s_1)}_i,~ x^{(s_2)}_i,~ x^{(s_3)}_i \right) - \gamma \frac{S_i - L_i}{2},~ L_i \right),\\ S_i &= \min \left( \max \left( x^{\ast}_i,~ x^{(s_1)}_i,~ x^{(s_2)}_i,~ x^{(s_3)}_i \right) + \gamma \frac{S_i - L_i}{2},~ S_i \right). \end{align*}
  6. Resampling the new $N_i$ intervals according to new scope $x^{(k)}_i \in [L_i, S_i]$. Note that we need to interpolate $\tau_{ij},~\eta_{ij}$ for new intervals due to the resampling. Then start a new loop beginning from step 1.

Bat algorithm

  • Title: A New Metaheuristic Bat-Inspired Algorithm
  • Author: Xin-She Yang
  • Year: 2010
  • Theory level: Algorithm
  • Theory type: Stochastic heruistic algorithm
  • Used data: Standard benchmark non-linear functions
  • Source: Nature Inspired Cooperative Strategies for Optimization

Reference

Bat algorithm could be viewed as an adaptation of particle swarm optimizaion (PSO). The author of this paper has proposed several algorithms in this kind, for example, the firefly algorithm. This algorithm imitates bats' behavior. When they have not hunt the best solution, they tend to move close to the current optimal solution; once they get the best solution, they would try to stay in the local position.

Consider we have the loss function $\mathcal{L}(\mathbf{x})$, and we initialize a series of solutions randomly $\mathbf{x}_1,~\mathbf{x}_2,~\mathbf{x}_n$, where each solution could be viewed as a bat. Attach each bat with a velocity $\mathbf{v}_i$, a frequency $f_i$, a pulse rate $r_i=0$, and a loudness $A_i=1$. Set $\alpha < 1$. Then perform such iterations:

  1. In iteration $t$, mark the current optimal solution as $\mathbf{x}^{\ast}$. For each bat (solution) $\mathbf{x}_i$, generate a new solution $\mathbf{x}'_i$:
    1. Update its frequency in uniform distribution $f_i \sim U(0,~\alpha)$.
    2. Update the velocity $\mathbf{v}'_{i} = \mathbf{v}_{i} (1-f_i) + \left( \mathbf{x}^{\ast} - \mathbf{x}_i\right) f_i$.
    3. Update its position $\mathbf{x}'_i = \mathbf{x}_i + \mathbf{v}'_{i}$.
    4. Generate $r \sim U(0,1)$. If $r > r_i$, select a solution from the M best solutions. And use a random vector $\boldsymbol{\epsilon}$ to update it, i.e. $\mathbf{x}'_i = \mathbf{x}_{\mathrm{selected}} + \mathrm{diag}(\boldsymbol{\epsilon})\mathbf{A}$.
    5. Adjust $\mathbf{x}_i$ slightly and randomly.
    6. Generate $a \sim U(0,1)$. If $a < A_i$, and $\mathcal{L}(\mathbf{x}'_i) < \mathcal{L}(\mathbf{x}_i)$, then let \begin{align*} \mathbf{x}_i &= \mathbf{x}'_i,\\ r_i &= 1 - e^{\lambda t},\\ A_i &= \beta A_i. \end{align*}
  2. After iteration $t$, sort all solutions $\{\mathbf{x}_i\}$, then we could update the optimal solution $\mathbf{x}^{\ast}$.

In the beginning, $r$ is small while $A$ is large, the algorithm tends to abandon the solutions which are not good and accept new solutions. When the algorithm is nearly converged, the new solutions would not be accepted.

Related slides

Slides on week 1, Feb. 1, 2019

Ooops! Your browser does not support viewing pdfs.

Download PDF

Slides on week 2, Feb. 8, 2019

Ooops! Your browser does not support viewing pdfs.

Download PDF

Slides on week 3-a, Feb. 13, 2019

Ooops! Your browser does not support viewing pdfs.

Download PDF

Slides on week 3-b, Feb. 15, 2019

Ooops! Your browser does not support viewing pdfs.

Download PDF