Supplementary 2


Supplementary of 'Online monitoring dynamic characteristics in thin-walled structures milling: A physics-constrained Bayesian updating approach'

A. The derivation of the training target of the proposed diffusion model

Training: To obtain the optimal parameters of the neural network, we need to minimize the cross entropy between \(q\left( \mathbf{x}_0 \right)\) and \(p_\theta\left( \mathbf{x}_0 \right)\). The loss \(L\) can be expressed as:

$$ \begin{equation} \begin{aligned} L &= \mathbb{E}_{q\left(\mathbf{x}_0\right)}\left[-\log p_\theta\left(\mathbf{x}_0\right)\right]\\ &\le \mathbb{E}_{q\left(\mathbf{x}_0\right)}\left[-\log p_\theta\left(\mathbf{x}_0\right)+D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)\right)\right]\\ &= \mathbb{E}_{q\left(\mathbf{x}_0\right)}\left[-\log p_\theta\left(\mathbf{x}_0\right)+\mathbb{E}_{q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)}\left[\log \frac{q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0:T}\right)/p_\theta\left(\mathbf{x}_0\right)}\right]\right]\\ &= \mathbb{E}_{q\left(\mathbf{x}_{0:T}\right)}\left[\log \frac{q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0:T}, \mathbf{y}\right)}\right] \end{aligned} \end{equation} $$

where \(D_{\mathrm{KL}}\left( \cdot \| \cdot \right)\) denotes the Kullback-Leibler (KL) divergence, which is always non-negative.

Further, the loss function can be decomposed as follows:

$$ \begin{equation} \begin{aligned} L ={} & \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \vert \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_T\right)\right)}_{L_T} + \sum_{t=1}^T \underbrace{\mathbb{E}_{q\left(\mathbf{x}_t \vert \mathbf{x}_0\right)}\left[D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{y}\right)\right)\right]}_{L_{t-1}} \end{aligned} \end{equation} $$
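To see how this decomposition follows from the bound above, recall that the forward process is a Markov chain and that each forward transition can be rewritten with Bayes' rule; a compact sketch of the two identities used in the standard DDPM derivation is:

$$ \begin{equation} \begin{aligned} q\left(\mathbf{x}_{1:T} \vert \mathbf{x}_0\right)=\prod_{t=1}^T q\left(\mathbf{x}_t \vert \mathbf{x}_{t-1}\right), \qquad q\left(\mathbf{x}_t \vert \mathbf{x}_{t-1}\right)=q\left(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0\right)=\frac{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right) q\left(\mathbf{x}_t \vert \mathbf{x}_0\right)}{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_0\right)} \end{aligned} \end{equation} $$

Substituting these identities, together with the Markov factorization of the reverse process into transitions \(p_\theta\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{y}\right)\), into the log ratio of the bound and collecting terms yields the sum of KL divergences above.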

The final optimization objective contains \(T+1\) terms. Because the prior distribution is \(p_\theta \left(\mathbf{x}_T\right)=\mathcal{N}(\mathbf{0}, \mathbf{I})\) and \(q\left(\mathbf{x}_T \vert \mathbf{x}_0\right)\) can also be approximated as isotropic Gaussian noise, \(L_T\) turns out to be a constant that can be ignored during optimization.
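As a brief check, recall the closed-form marginal of the forward process (standard in DDPM), written with \(\alpha_t=1-\beta_t\) and \(\bar{\alpha}_t=\prod_{i=1}^t \alpha_i\) as defined below; for a sufficiently long noise schedule \(\bar{\alpha}_T \approx 0\), so that:

$$ \begin{equation} \begin{aligned} q\left(\mathbf{x}_T \vert \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_T ; \sqrt{\bar{\alpha}_T}\, \mathbf{x}_0, \left(1-\bar{\alpha}_T\right) \mathbf{I}\right) \approx \mathcal{N}\left(\mathbf{x}_T ; \mathbf{0}, \mathbf{I}\right) \end{aligned} \end{equation} $$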

According to Bayes' rule, the posterior distribution \(q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right)\) can be written as:

$$ \begin{equation}\label{eq7} \begin{aligned} q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right), \tilde{\beta}_t \mathbf{I}\right) \end{aligned} \end{equation} $$

We can derive the mean and variance of \(q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0\right)\) from the Gaussian forward transitions as \(\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t\) and \(\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t\), respectively, where \(\alpha_t=1-\beta_t\) and \(\bar{\alpha}_t=\prod_{i=1}^t \alpha_i\).
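As an illustration of these closed-form statistics, a minimal Python sketch is given below; the linear \(\beta\) schedule, the number of steps \(T=1000\), and the variable names are assumptions for illustration only, not the settings used in the paper.

```python
import torch

# Illustrative sketch (not from the paper): posterior statistics of
# q(x_{t-1} | x_t, x_0) for a linear beta schedule, following the formulas above.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)                          # beta_t
alphas = 1.0 - betas                                           # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)                      # \bar{alpha}_t = prod_i alpha_i
alphas_bar_prev = torch.cat([torch.ones(1), alphas_bar[:-1]])  # \bar{alpha}_{t-1}

# Coefficients of x_0 and x_t in the posterior mean \tilde{mu}_t
coef_x0 = torch.sqrt(alphas_bar_prev) * betas / (1.0 - alphas_bar)
coef_xt = torch.sqrt(alphas) * (1.0 - alphas_bar_prev) / (1.0 - alphas_bar)
# Posterior variance \tilde{beta}_t
posterior_var = (1.0 - alphas_bar_prev) / (1.0 - alphas_bar) * betas

def posterior_mean(x_t, x_0, t):
    """Mean of q(x_{t-1} | x_t, x_0) at (0-indexed) step t."""
    return coef_x0[t] * x_0 + coef_xt[t] * x_t
```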

For the transition probability \(p_\theta\left(\mathbf{x}_{t-1} \vert \mathbf{x}_t,\mathbf{y}\right)\), since \(\mathbf{x}_{t}\) is known in the reverse process, we may choose the parameterization of \(\boldsymbol{\mu}_\theta\) as

$$ \begin{equation} \begin{aligned} \boldsymbol{\mu}_\theta\left(\mathbf{x}_t,\mathbf{y}, t\right) =\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)\right) \end{aligned} \end{equation} $$

and set \(\sigma_\theta\left(\mathbf{x}_t, t\right)\) to the fixed constant \(\tilde{\beta}_t^{1/2}\).
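With this parameterization, a single reverse (denoising) step can be sketched as follows; `eps_model` stands for the trained noise-prediction network \(\boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)\) and the schedule tensors follow the previous sketch, so this is an illustrative sketch rather than the paper's implementation.

```python
import torch

def reverse_step(eps_model, x_t, y, t, betas, alphas, alphas_bar, posterior_var):
    """Sketch of one reverse step x_t -> x_{t-1} using mu_theta and sigma_t = tilde(beta)_t^{1/2}."""
    eps = eps_model(x_t, y, t)                                 # predicted noise eps_theta(x_t, y, t)
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
    if t > 0:                                                  # add noise except at the final step
        return mean + torch.sqrt(posterior_var[t]) * torch.randn_like(x_t)
    return mean
```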

Consequently, by considering the KL divergence between two Gaussian distributions, \(D_{\mathrm{KL}}(p, q)=\log \frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2 \sigma_2^2}-\frac{1}{2}\), the optimization objective \(L_{t-1}\) can be further simplified as

$$ \begin{equation} \begin{aligned} L_{t-1} &= \mathbb{E}_{q\left(\mathbf{x}_t \vert \mathbf{x}_0\right)}\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)\right\|_2^2\right] + C\\ &= \kappa_t\, \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, \mathbf{y}, t\right)\right\|_2^2\right] + C \end{aligned} \end{equation} $$

where \(C\) is a constant that does not depend on \(\theta\) and \(\kappa_t=\frac{\beta_t}{2 \alpha_t\left(1-\bar{\alpha}_{t-1}\right)}\) is a weighting coefficient.
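The second equality above follows from the reparameterization \(\mathbf{x}_t=\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\); a brief sketch of the substitution, following the standard DDPM derivation, is:

$$ \begin{equation} \begin{aligned} \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}\right), \qquad \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)=\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\left(\boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, \mathbf{y}, t\right)-\boldsymbol{\epsilon}\right) \end{aligned} \end{equation} $$

so that \(\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t-\boldsymbol{\mu}_\theta\right\|_2^2=\kappa_t\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\right\|_2^2\) with \(\sigma_t^2=\tilde{\beta}_t\).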

The key point is to transform the intractable cross entropy between \(q\left(\mathbf{x}_0\right)\) and \(p_\theta\left(\mathbf{x}_0, \mathbf{y}\right)\) into closed-form KL divergences.

According to Ho et al., a simplified loss function that discards the weighting \(\kappa_t\) of \(L_{t-1}\) has proved more effective:

$$ \begin{equation} \begin{aligned} L_{simple}= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, \mathbf{y}, t\right)\right\|_2^2\right] \end{aligned} \end{equation} $$
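As an illustration of how \(L_{simple}\) is evaluated during training, a minimal PyTorch-style sketch is given below; the network `eps_model` and the condition tensor `y` are placeholders for illustration, not the architecture used in the paper.

```python
import torch

def simple_loss(eps_model, x0, y, alphas_bar):
    """L_simple sketch: sample t and eps, form x_t, and regress the injected noise."""
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)      # random diffusion step per sample
    eps = torch.randn_like(x0)                                      # eps ~ N(0, I)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))         # broadcast \bar{alpha}_t
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps    # forward-process sample
    eps_pred = eps_model(x_t, y, t)                                 # eps_theta(x_t, y, t)
    return torch.mean((eps - eps_pred) ** 2)                        # || eps - eps_theta ||^2
```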

It can be noticed that the diffusion step \(t\) is explicitly provided to \(\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, \mathbf{y}, t\right)\) through a sinusoidal position embedding so that all diffusion steps can share the same model parameters; without this embedding, a different model would have to be learned to represent the reverse process at every diffusion step \(t\) (a minimal sketch of such an embedding is given below). The diffusion step embedding structure within the neural network is further discussed in the next section.
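As an illustration of the sinusoidal position embedding of the diffusion step, a minimal sketch is given below; the embedding dimension and the scaling constant 10000 follow the common Transformer-style convention and are assumptions for illustration, not necessarily the values used in the paper.

```python
import math
import torch

def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion steps t (shape [batch]) into vectors of size dim.
    Common Transformer-style convention; illustrative, not the paper's exact setup."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)              # [batch, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)     # [batch, 2 * half]
```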

