Autoencoder

An autoencoder is a machine-learning algorithm that uses a neural network for dimensionality reduction. Geoffrey Hinton and colleagues proposed it in 2006^[1].

Overview

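An autoencoder is a three-layer neural network trained by unsupervised learning, with the same data presented at the input layer and used as the target at the output layer. When the training data are real-valued and unbounded, the identity map is often chosen as the output-layer activation function, which makes the output layer a linear transformation. If the identity map is also chosen as the hidden-layer activation function, the result nearly coincides with principal component analysis. In practice, autoencoders are used for anomaly detection by taking the difference between the input and the output.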
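The two claims above (a fully linear autoencoder behaves like principal component analysis, and reconstruction error can flag anomalies) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the toy data, the function name `train_linear_autoencoder`, the learning rate, the step count, and the choice of the largest training error as the threshold are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 points in 5-D lying close to a 2-D subspace.
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # orthonormal columns, shape (5, 2)
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.5])
X = latent @ basis.T + 0.05 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                             # center the data, as PCA does


def train_linear_autoencoder(X, d_m, lr=0.01, steps=3000, seed=1):
    """Three-layer autoencoder with identity activations, trained by
    full-batch gradient descent on the squared reconstruction error."""
    init = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = 0.1 * init.normal(size=(d, d_m))       # input layer -> hidden layer
    W_dec = 0.1 * init.normal(size=(d_m, d))       # hidden layer -> output layer
    for _ in range(steps):
        Z = X @ W_enc                              # hidden code
        err = Z @ W_dec - X                        # the target is the input itself
        grad_dec = Z.T @ err / n
        grad_enc = X.T @ (err @ W_dec.T) / n
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec


W_enc, W_dec = train_linear_autoencoder(X, d_m=2)


def reconstruction_error(points):
    """Squared reconstruction error of each row of `points`."""
    return np.sum((points @ W_enc @ W_dec - points) ** 2, axis=1)


ae_mse = reconstruction_error(X).mean()

# PCA with the same number of components, computed via SVD, for comparison.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pca_mse = np.sum((X @ Vt[:2].T @ Vt[:2] - X) ** 2, axis=1).mean()

# Anomaly detection: a point far from the learned subspace reconstructs poorly.
off = rng.normal(size=5)
off -= basis @ (basis.T @ off)                     # keep only the off-subspace part
outlier = 3.0 * off / np.linalg.norm(off)
threshold = reconstruction_error(X).max()          # hypothetical threshold choice
is_anomaly = reconstruction_error(outlier[None, :])[0] > threshold
```

Comparing `ae_mse` with `pca_mse` shows how closely the trained linear autoencoder approaches the PCA reconstruction error on this toy data. In a real setting the threshold would normally be chosen on held-out data.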

Properties and limitations

Autoencoders are designed to have the properties that dimensionality reduction requires.

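An autoencoder is constrained so that the dimension of the hidden layer, $d_{m}$, is smaller than the dimension of the input and output layers, $d_{i,o}$. The reason is that when $d_{i,o}\leq d_{m}$, the autoencoder can reach zero reconstruction error with nothing more than an identity mapping^[2].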
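A small numerical sketch of this degenerate case: with a hidden layer at least as wide as the input, hand-picked weights that simply copy the input already give zero reconstruction error. The dimensions and variable names are assumptions made for illustration.

```python
import numpy as np

d_io, d_m = 4, 6   # hidden layer wider than the input/output layers

# Encoder: copy the input into the first d_io hidden units, leave the rest at zero.
W_enc = np.eye(d_io, d_m)    # shape (d_io, d_m)
# Decoder: read the first d_io hidden units back out.
W_dec = np.eye(d_m, d_io)    # shape (d_m, d_io)

x = np.random.default_rng(0).normal(size=(10, d_io))
x_hat = (x @ W_enc) @ W_dec
reconstruction_error = np.mean((x - x_hat) ** 2)
```

Nothing about the data has been learned here, which is why the bottleneck constraint is imposed.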

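An autoencoder achieves dimensionality reduction, but this does not necessarily mean that it learns a good representation^[3]. Making $d_{m}$ small is expected to preserve only the features of the input that carry the most information, that is, the features that let an image be reconstructed from less data (cf. lossy compression), but such features are not necessarily good features.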

Theory

Why an autoencoder (AE) can learn reconstruction and dimensionality reduction has been analyzed theoretically.

An autoencoder network $AE_{\phi ,\theta }(x)$ consists of an encoder network $NN_{\phi }(x)$ and a decoder network $NN_{\theta }(x)$. Under the deterministic interpretation, the AE directly outputs the reconstructed input, that is, ${\hat {x}}=AE_{\phi ,\theta }(x)=NN_{\theta }(NN_{\phi }(x))$.
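The composition can be written out directly. The sketch below assumes a tanh encoder and a linear decoder with randomly initialized parameters standing in for trained ones; the dimensions and the dictionary layout of `phi` and `theta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 6, 2   # hypothetical input and code dimensions

# Randomly initialized parameters stand in for trained ones.
phi = {"W": rng.normal(size=(d_x, d_z)), "b": np.zeros(d_z)}
theta = {"W": rng.normal(size=(d_z, d_x)), "b": np.zeros(d_x)}


def nn_phi(x):
    """Encoder NN_phi: maps an input to its low-dimensional code."""
    return np.tanh(x @ phi["W"] + phi["b"])


def nn_theta(z):
    """Decoder NN_theta: maps a code back to input space."""
    return z @ theta["W"] + theta["b"]


def ae(x):
    """AE_{phi,theta}(x) = NN_theta(NN_phi(x))."""
    return nn_theta(nn_phi(x))


x = rng.normal(size=(3, d_x))
z = nn_phi(x)        # codes, shape (3, d_z)
x_hat = ae(x)        # reconstructions, shape (3, d_x)
```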

Probabilistic interpretation

From the viewpoint of probabilistic models, an AE can be regarded as a kind of deep latent variable model and formulated as follows:

{\begin{aligned}z_{|x}\sim p_{\phi }(Z|X)&=p(Z|\lambda =NN_{\phi }(X))=\delta (Z-NN_{\phi }(X))\\{\hat {x}}_{|z}\sim p_{\theta }({\hat {X}}|Z)&=p({\hat {X}}|\mu =NN_{\theta }(Z))\end{aligned}}

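That is, $NN_{\phi }(x)$ and $NN_{\theta }(x)$ can be interpreted as outputting the distribution parameters $\lambda$ and $\mu$, with $z$ and ${\hat {x}}$ obtained through those distributions^[4]^[5]. Because the encoder of an AE behaves deterministically, its mapping is expressed as a conditional probability distribution given by the delta function $\delta$. Since $\delta$ is deterministic, $NN_{\phi }(x)$ and $NN_{\theta }(x)$ can be merged, and the AE is described by the following probabilistic expression: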

{\hat {x}}_{|x}\sim p({\hat {X}}|\mu =AE_{\phi ,\theta }(X))

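Mean squared error (MSE, L₂) and various other loss functions are used to train AEs, chosen empirically from the deterministic viewpoint. Because these choices are empirical, convergence of training is not necessarily guaranteed. Theoretical work has shown that, for some loss functions, training can be formulated as infomax learning with a particular distribution assigned to $p_{\theta }({\hat {X}}|Z)$.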

Fixed-variance normal distribution model

Consider a normal distribution with fixed variance, $N(X|\mu _{\theta },\sigma )$. Its negative log-likelihood $L_{n}(\theta )$ is:

L_{n}(\theta )={\frac {\|x-\mu _{\theta }\|^{2}}{2\sigma ^{2}}}+\log({\sqrt {2\pi \sigma ^{2}}})={\frac {1}{2\sigma ^{2}}}\|x-\mu _{\theta }\|^{2}+\mathrm {const.}

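Up to a positive scale factor and an additive constant that do not depend on $\theta$, this is the squared error between $x$ and $\mu _{\theta }$. In other words, minimizing the NLL of $N(X|\mu _{\theta }=AE_{\phi ,\theta }(x),\sigma )$ can be regarded as equivalent to minimizing the squared error of ${\hat {x}}=AE_{\phi ,\theta }(x)$^[6]. Put differently, an autoencoder trained with squared error can be regarded as a model that outputs the mode of the fixed-variance normal distribution $N(X|\mu _{\theta }=AE_{\phi ,\theta }(x),\sigma )$ fitted by maximum likelihood.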
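The equivalence can be checked numerically. The sketch below uses the standard density of an isotropic normal distribution and compares two hypothetical reconstructions of the same input; the concrete numbers and the helper names `gaussian_nll` and `squared_error` are assumptions made for illustration.

```python
import numpy as np


def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under N(mu, sigma^2 I), summed over dimensions."""
    d = x.size
    return np.sum((x - mu) ** 2) / (2 * sigma**2) + d * np.log(np.sqrt(2 * np.pi * sigma**2))


def squared_error(x, mu):
    return np.sum((x - mu) ** 2)


x = np.array([0.5, -1.0, 2.0])
mu_a = np.array([0.4, -0.8, 1.5])   # hypothetical reconstruction A
mu_b = np.array([1.0, 0.0, 0.0])    # hypothetical reconstruction B
sigma = 0.7                         # fixed, not learned

# With sigma fixed, the constant term cancels when two reconstructions are compared,
# so the NLL gap equals the squared-error gap divided by 2 * sigma^2.
nll_gap = gaussian_nll(x, mu_a, sigma) - gaussian_nll(x, mu_b, sigma)
se_gap = squared_error(x, mu_a) - squared_error(x, mu_b)
```

Because the two objectives differ only by a positive scale and a constant, they rank every pair of reconstructions identically and share the same minimizer.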

Variants

Many variants and derived models of the autoencoder exist. Some examples:

Variational autoencoder (VAE)
Contractive AutoEncoder
Saturating AutoEncoder
Nonparametrically Guided AutoEncoder
Unfolding Recursive AutoEncoder

Sparse autoencoder

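A sparse autoencoder is an autoencoder with an added regularization term, intended to improve generalization when training a feedforward neural network. The regularization, however, pushes the hidden-layer activations themselves toward 0, not the network weights.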
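One common way to realize such a term is an L1 penalty on the hidden activations. The sketch below makes that choice as an assumption (the text does not say which penalty is meant); the ReLU hidden layer, the function name `sparse_autoencoder_loss`, and `sparsity_weight` are also hypothetical.

```python
import numpy as np


def sparse_autoencoder_loss(x, W_enc, b_enc, W_dec, b_dec, sparsity_weight=1e-2):
    """Reconstruction error plus an L1 penalty on the hidden activations."""
    h = np.maximum(0.0, x @ W_enc + b_enc)            # hidden activations (ReLU, an assumption)
    x_hat = h @ W_dec + b_dec
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(h), axis=1))     # penalizes activations, not weights
    return reconstruction + sparsity_weight * sparsity, reconstruction, sparsity


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
W_enc, b_enc = rng.normal(size=(5, 3)), np.zeros(3)
W_dec, b_dec = rng.normal(size=(3, 5)), np.zeros(5)

total, reconstruction, sparsity = sparse_autoencoder_loss(
    x, W_enc, b_enc, W_dec, b_dec, sparsity_weight=0.1
)
```

Weight decay, by contrast, would add a penalty on `W_enc` and `W_dec` themselves.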

Stacked autoencoder

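With backpropagation, a network that has two or more hidden layers usually converges to a poor local minimum. To avoid this, an autoencoder with a single hidden layer is built and trained first. Its hidden layer is then treated as an input layer, and one more layer is stacked on top. Building a multilayer autoencoder by repeating this procedure is called a stacked autoencoder.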
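The layer-by-layer procedure can be sketched as follows. This is an illustrative outline under assumptions, not the method of any particular paper: each layer is a tanh autoencoder with a linear output trained by plain gradient descent, and the function names, layer sizes, learning rate, and step count are hypothetical.

```python
import numpy as np


def train_single_layer_ae(X, d_hidden, lr=0.05, steps=500, seed=0):
    """Train one autoencoder with a single hidden layer and return its encoder."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = 0.1 * rng.normal(size=(d, d_hidden))
    b1 = np.zeros(d_hidden)
    W2 = 0.1 * rng.normal(size=(d_hidden, d))
    b2 = np.zeros(d)
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)               # hidden layer
        err = H @ W2 + b2 - X                  # linear output minus target (the input)
        dH = (err @ W2.T) * (1.0 - H**2)       # backpropagate through tanh
        W2 -= lr * H.T @ err / n
        b2 -= lr * err.mean(axis=0)
        W1 -= lr * X.T @ dH / n
        b1 -= lr * dH.mean(axis=0)
    return W1, b1


def train_stacked_encoder(X, layer_sizes):
    """Greedy layer-wise training: each hidden code becomes the next layer's input."""
    encoders, codes = [], X
    for i, d_hidden in enumerate(layer_sizes):
        W, b = train_single_layer_ae(codes, d_hidden, seed=i)
        encoders.append((W, b))
        codes = np.tanh(codes @ W + b)
    return encoders, codes


X = np.random.default_rng(0).normal(size=(100, 8))   # toy data (assumption)
encoders, codes = train_stacked_encoder(X, layer_sizes=[6, 4, 2])
```

Each call to `train_single_layer_ae` only ever backpropagates through one hidden layer, which is the point of the stacking scheme.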

Denoising AutoEncoder

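An autoencoder trained with noise added to the input-layer data. Its results nearly coincide with those of a restricted Boltzmann machine. If the probability distribution of the noise is known, it is better to follow it; if it is unknown, a uniform distribution suffices.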
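A minimal sketch of the training scheme: the network receives a corrupted input but is scored against the clean one. Uniform noise is used here, following the remark about unknown noise distributions; the tanh architecture, the function name `train_denoising_ae`, the toy data, and all hyperparameters are assumptions.

```python
import numpy as np


def train_denoising_ae(X, d_hidden, noise_width=0.5, lr=0.05, steps=500, seed=0):
    """Train on noisy inputs while measuring error against the clean inputs."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = 0.1 * rng.normal(size=(d, d_hidden))
    b1 = np.zeros(d_hidden)
    W2 = 0.1 * rng.normal(size=(d_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(steps):
        # Fresh uniform noise is drawn at every step.
        X_noisy = X + rng.uniform(-noise_width, noise_width, size=X.shape)
        H = np.tanh(X_noisy @ W1 + b1)
        err = H @ W2 + b2 - X                  # target is the CLEAN input
        losses.append(0.5 * np.mean(np.sum(err**2, axis=1)))
        dH = (err @ W2.T) * (1.0 - H**2)
        W2 -= lr * H.T @ err / n
        b2 -= lr * err.mean(axis=0)
        W1 -= lr * X_noisy.T @ dH / n
        b1 -= lr * dH.mean(axis=0)
    return (W1, b1, W2, b2), losses


# Toy data (assumption): 200 points on a 2-D subspace of a 5-D space.
data_rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(data_rng.normal(size=(5, 2)))
X = data_rng.normal(size=(200, 2)) @ basis.T

params, losses = train_denoising_ae(X, d_hidden=3)
```

The only difference from an ordinary autoencoder training loop is that the forward pass uses `X_noisy` while the error is computed against `X`.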

Related techniques

Footnotes


References

[1] Geoffrey E. Hinton; R. R. Salakhutdinov (2006-07-28). “Reducing the Dimensionality of Data with Neural Networks”. Science 313 (5786): 504-507.
[2] "autoencoder where Y is of the same dimensionality as X (or larger) can achieve perfect reconstruction simply by learning an identity mapping." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[3] "The criterion that representation Y should retain information about input X is not by itself sufficient to yield a useful representation." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[4] "a deterministic mapping from X to Y, that is, ... equivalently $q(Y|X;\theta )=\delta (Y-f_{\theta }(X))$ ... The deterministic mapping $f_{\theta }$ that transforms an input vector ${\boldsymbol {x}}$ into hidden representation ${\boldsymbol {y}}$ is called the encoder." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[5] " ${\boldsymbol {z}}=g_{\theta ^{'}}({\boldsymbol {y}})$ . This mapping $g_{\theta ^{'}}$ is called the decoder. ... In general ${\boldsymbol {z}}$ is not to be interpreted as an exact reconstruction of ${\boldsymbol {x}}$ , but rather in probabilistic terms as the parameters (typically the mean) of a distribution $p(X|Z={\boldsymbol {z}})$ " Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
[6] " $g_{\theta ^{'}}$ is called the decoder ... $Z=g_{\theta ^{'}}({\boldsymbol {y}})$ ... associated loss function $L({\boldsymbol {x}},{\boldsymbol {z}})$ ... $X|{\boldsymbol {z}}\sim N({\boldsymbol {z}},{\boldsymbol {\sigma }}^{2}{\boldsymbol {I}})$ ... This yields $L({\boldsymbol {x}},{\boldsymbol {z}})=L_{2}({\boldsymbol {x}},{\boldsymbol {z}})=C(\sigma ^{2})\|{\boldsymbol {x}}-{\boldsymbol {z}}\|^{2}$ ... This is the squared error objective found in most traditional autoencoders." Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
