Supplementary 2


Supplementary of 'Online monitoring dynamic characteristics in thin-walled structures milling: A physics-constrained Bayesian updating approach'

A. The derivation of the training target of the proposed diffusion model

Training: To obtain the optimal parameters of the neural network, we need to minimize the cross entropy between \(q\left( \mathbf{x}_0 \right)\) and \(p_\theta\left( \mathbf{x}_0 \right)\). The loss \(L\) can be expressed as:

$$ \begin{equation} \begin{aligned} L &= \mathbb{E}_{q\left(\mathbf{x}_0\right)}\left[-\log p_\theta\left(\mathbf{x}_0\right)\right]\\ &\le \mathbb{E}_{q\left(\mathbf{x}_0\right)}\left[-\log p_\theta\left(\mathbf{x}_0\right)+D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)\right)\right]\\ &= \mathbb{E}_{q\left(\mathbf{x}_0\right)}\left[-\log p_\theta\left(\mathbf{x}_0\right)+\mathbb{E}_{q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)}\left[\log \frac{q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0:T}\right)/p_\theta\left(\mathbf{x}_0\right)}\right]\right]\\ &= \mathbb{E}_{q\left(\mathbf{x}_{0:T}\right)}\left[\log \frac{q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0:T}, \mathbf{y}\right)}\right] \end{aligned} \end{equation} $$

where \(D_{\mathrm{KL}}\left( \cdot \| \cdot \right)\) denotes the Kullback-Leibler (KL) divergence, which is always non-negative.

Further, the loss function can be decomposed as follows:

$$ \begin{equation} \begin{aligned} L ={} & \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \vert \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_T\right)\right)}_{L_T} + \sum_{t=1}^T \underbrace{\mathbb{E}_{q\left(\mathbf{x}_t \vert \mathbf{x}_0\right)}\left[D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{y}\right)\right)\right]}_{L_{t-1}} \end{aligned} \end{equation} $$
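To see how this decomposition follows from the bound above, recall that the forward process is a Markov chain and that each forward transition can be rewritten with Bayes' rule; a compact sketch of the two identities used in the standard DDPM derivation is:

$$ \begin{equation} \begin{aligned} q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)=\prod_{t=1}^T q\left(\mathbf{x}_t \vert \mathbf{x}_{t-1}\right), \qquad q\left(\mathbf{x}_t \vert \mathbf{x}_{t-1}\right)=q\left(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0\right)=\frac{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right) q\left(\mathbf{x}_t \vert \mathbf{x}_0\right)}{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_0\right)} \end{aligned} \end{equation} $$

Substituting these identities, together with the Markov factorization of the reverse process into transitions \(p_\theta\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{y}\right)\), into the log ratio of the bound and collecting terms yields the sum of KL divergences above.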

The final optimization objective contains \(T+1\) terms. Because the prior distribution is \(p_\theta \left(\mathbf{x}_T\right)=\mathcal{N}(\mathbf{0}, \mathbf{I})\) and \(q\left(\mathbf{x}_T \vert \mathbf{x}_0\right)\) can also be approximated as isotropic Gaussian noise, \(L_T\) turns out to be a constant that can be ignored during optimization.
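As a brief check, recall the closed-form marginal of the forward process (standard in DDPM), written with \(\alpha_t=1-\beta_t\) and \(\bar{\alpha}_t=\prod_{i=1}^t \alpha_i\) as defined below; for a sufficiently long noise schedule \(\bar{\alpha}_T \approx 0\), so that:

$$ \begin{equation} \begin{aligned} q\left(\mathbf{x}_T \vert \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_T ; \sqrt{\bar{\alpha}_T}\, \mathbf{x}_0, \left(1-\bar{\alpha}_T\right) \mathbf{I}\right) \approx \mathcal{N}\left(\mathbf{x}_T ; \mathbf{0}, \mathbf{I}\right) \end{aligned} \end{equation} $$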

According to Bayes' rule, the posterior distribution \(q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right)\) can be written as:

$$ \begin{equation}\label{eq7} \begin{aligned} q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right), \tilde{\beta}_t \mathbf{I}\right) \end{aligned} \end{equation} $$

We can derive the mean and variance of \(q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right)\) from the Gaussian forward transitions as \(\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t\) and \(\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t\), respectively, where \(\alpha_t=1-\beta_t\) and \(\bar{\alpha}_t=\prod_{i=1}^t \alpha_i\).
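As an illustration of these closed-form statistics, a minimal Python sketch is given below; the linear \(\beta\) schedule, the number of steps \(T=1000\), and the variable names are assumptions for illustration only, not the settings used in the paper.

```python
import torch

# Illustrative sketch (not from the paper): posterior statistics of
# q(x_{t-1} | x_t, x_0) for a linear beta schedule, following the formulas above.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)                          # beta_t
alphas = 1.0 - betas                                           # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)                      # \bar{alpha}_t = prod_i alpha_i
alphas_bar_prev = torch.cat([torch.ones(1), alphas_bar[:-1]])  # \bar{alpha}_{t-1}

# Coefficients of x_0 and x_t in the posterior mean \tilde{mu}_t
coef_x0 = torch.sqrt(alphas_bar_prev) * betas / (1.0 - alphas_bar)
coef_xt = torch.sqrt(alphas) * (1.0 - alphas_bar_prev) / (1.0 - alphas_bar)
# Posterior variance \tilde{beta}_t
posterior_var = (1.0 - alphas_bar_prev) / (1.0 - alphas_bar) * betas

def posterior_mean(x_t, x_0, t):
    """Mean of q(x_{t-1} | x_t, x_0) at (0-indexed) step t."""
    return coef_x0[t] * x_0 + coef_xt[t] * x_t
```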

For the transition probability \(p_\theta\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t,\mathbf{y}\right)\), since \(\mathbf{x}_{t}\) is known in the reverse process, we may choose the parameterization of \(\boldsymbol{\mu}_\theta\) as

$$ \begin{equation} \begin{aligned} \boldsymbol{\mu}_\theta\left(\mathbf{x}_t,\mathbf{y}, t\right) =\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)\right) \end{aligned} \end{equation} $$

and set \(\sigma_\theta\left(\mathbf{x}_t, t\right)\) to the fixed constant \(\tilde{\beta}_t^{1/2}\).
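With this parameterization, a single reverse (denoising) step can be sketched as follows; `eps_model` stands for the trained noise-prediction network \(\boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)\) and the schedule tensors follow the previous sketch, so this is an illustrative sketch rather than the paper's implementation.

```python
import torch

def reverse_step(eps_model, x_t, y, t, betas, alphas, alphas_bar, posterior_var):
    """Sketch of one reverse step x_t -> x_{t-1} using mu_theta and sigma_t = tilde(beta)_t^{1/2}."""
    eps = eps_model(x_t, y, t)                                 # predicted noise eps_theta(x_t, y, t)
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
    if t > 0:                                                  # add noise except at the final step
        return mean + torch.sqrt(posterior_var[t]) * torch.randn_like(x_t)
    return mean
```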

Consequently, by considering the KL divergence between two Gaussian distributions, \(D_{\mathrm{KL}}(p, q)=\log \frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2 \sigma_2^2}-\frac{1}{2}\), the optimization objective \(L_{t-1}\) can be further simplified as

$$ \begin{equation} \begin{aligned} L_{t-1} &= \mathbb{E}_{q\left(\mathbf{x}_t \vert \mathbf{x}_0\right)}\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)\right\|_2^2\right] + C\\ &= \kappa_t\, \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, \mathbf{y}, t\right)\right\|_2^2\right] + C \end{aligned} \end{equation} $$

where \(C\) is a constant that does not depend on \(\theta\) and \(\kappa_t=\frac{\beta_t}{2 \alpha_t\left(1-\bar{\alpha}_{t-1}\right)}\) is a weighting coefficient.
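The second equality above follows from the reparameterization \(\mathbf{x}_t=\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\); a brief sketch of the substitution, following the standard DDPM derivation, is:

$$ \begin{equation} \begin{aligned} \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}\right), \qquad \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)=\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\left(\boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)-\boldsymbol{\epsilon}\right) \end{aligned} \end{equation} $$

so that \(\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t-\boldsymbol{\mu}_\theta\right\|_2^2=\kappa_t\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\right\|_2^2\) with \(\sigma_t^2=\tilde{\beta}_t\).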

The key point is to transform the intractable cross entropy between \(q\left(\mathbf{x}_0\right)\) and \(p_\theta\left(\mathbf{x}_0, \mathbf{y}\right)\) into closed-form KL divergences.

According to Ho et al., a simplified loss function that discards the weighting \(\kappa_t\) of \(L_{t-1}\) has proved more effective:

$$ \begin{equation} \begin{aligned} L_{simple}= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, \mathbf{y}, t\right)\right\|_2^2\right] \end{aligned} \end{equation} $$
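As an illustration of how \(L_{simple}\) is evaluated during training, a minimal PyTorch-style sketch is given below; the network `eps_model` and the condition tensor `y` are placeholders for illustration, not the architecture used in the paper.

```python
import torch

def simple_loss(eps_model, x0, y, alphas_bar):
    """L_simple sketch: sample t and eps, form x_t, and regress the injected noise."""
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)      # random diffusion step per sample
    eps = torch.randn_like(x0)                                      # eps ~ N(0, I)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))         # broadcast \bar{alpha}_t
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps    # forward-process sample
    eps_pred = eps_model(x_t, y, t)                                 # eps_theta(x_t, y, t)
    return torch.mean((eps - eps_pred) ** 2)                        # || eps - eps_theta ||^2
```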

It can be noticed that the diffusion step \(t\) is explicitly provided to \(\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, \mathbf{y}, t\right)\) through a sinusoidal position embedding so that all diffusion steps can share the same model parameters; without this embedding, a different model would have to be learned to represent the reverse process at every diffusion step \(t\) (a minimal sketch of such an embedding is given below). The diffusion step embedding structure within the neural network is further discussed in the next section.
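As an illustration of the sinusoidal position embedding of the diffusion step, a minimal sketch is given below; the embedding dimension and the scaling constant 10000 follow the common Transformer-style convention and are assumptions for illustration, not necessarily the values used in the paper.

```python
import math
import torch

def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion steps t (shape [batch]) into vectors of size dim.
    Common Transformer-style convention; illustrative, not the paper's exact setup."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)              # [batch, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)     # [batch, 2 * half]
```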

