End-to-end reinforcement learning

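In end-to-end reinforcement learning, the entire process from sensors to motors in a robot or agent is handled by a single layered or recurrent neural network without modularization, and that network is trained by reinforcement learning (RL). The approach has been studied for a long time, but it regained attention after the successful results in learning to play Atari 2600 video games (2013–15)^[1]^[2]^[3]^[4] and in AlphaGo (2016)^[5] by Google DeepMind.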

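RL has traditionally required explicit design of the state space and the action space, while only the mapping from the state space to the action space is learned^[6]. RL was therefore limited to learning actions: before learning could start, human designers had to design how the state space is constructed from sensor signals and had to specify how the motor commands for each action are generated. Neural networks have often been used in RL to provide non-linear function approximation and so avoid the curse of dimensionality. Recurrent neural networks have also been employed, mainly to cope with perceptual aliasing, that is, with partially observable Markov decision processes (POMDPs)^[7]^[8]^[9]^[10]^[11].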
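The division of labour in the traditional setting can be sketched with tabular Q-learning. This is only an illustrative toy, not taken from the cited works: the corridor task, the names `read_sensor`, `sensor_to_state` and `step`, and all hyperparameters are assumptions. The point is that the state construction and the motor commands are written by hand, and only the Q-table is learned.

```python
import random

N_CELLS = 5                 # toy corridor (assumption); goal is the rightmost cell
ACTIONS = (-1, +1)          # designer-defined motor commands: left, right


def read_sensor(cell):
    """Raw sensor reading in [0, 1): the centre of the occupied cell."""
    return (cell + 0.5) / N_CELLS


def sensor_to_state(reading):
    """Designer-written mapping from a raw reading to a discrete state."""
    return min(int(reading * N_CELLS), N_CELLS - 1)


def step(cell, action):
    """Apply a motor command; reward 1.0 only on reaching the goal cell."""
    cell = min(max(cell + ACTIONS[action], 0), N_CELLS - 1)
    done = cell == N_CELLS - 1
    return cell, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_CELLS)]   # the only learned component
    for _ in range(episodes):
        cell, done = 0, False
        while not done:
            s = sensor_to_state(read_sensor(cell))
            best = max(q[s])
            if rng.random() < epsilon:
                a = rng.randrange(2)               # explore
            else:                                  # exploit, random tie-break
                a = rng.choice([i for i in range(2) if q[s][i] == best])
            cell, reward, done = step(cell, a)
            s2 = sensor_to_state(read_sensor(cell))
            target = reward if done else reward + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
    return q


q_table = train()
# Greedy action index (0 = left, 1 = right) in each non-goal state.
greedy = [max(range(2), key=lambda i: q_table[s][i]) for s in range(N_CELLS - 1)]
```

Changing the sensor or the actuator in this sketch would require rewriting `sensor_to_state` or `ACTIONS` by hand, which is the design burden described above.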

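End-to-end RL extends RL from learning only actions to learning the entire process from sensors to motors, including higher-level functions that are difficult to develop independently of other functions. Because higher-level functions connect directly to neither sensors nor motors, even specifying their inputs and outputs is difficult.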
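A miniature version of this idea can be sketched as follows. Everything here is an assumption made for illustration: the one-dimensional "image", the single linear layer (a real end-to-end system would use a deep or recurrent network), and the choice of the REINFORCE policy-gradient update. What it shows is that raw pixels go straight into the learner and motor-command probabilities come straight out, with no hand-written state construction in between.

```python
import math
import random

N_PIXELS = 8   # raw sensor: a 1-D "image" (hypothetical toy setup)


def observe(target_pixel):
    """Raw pixel intensities, with no hand-designed feature extraction."""
    return [1.0 if i == target_pixel else 0.0 for i in range(N_PIXELS)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(weights, pixels):
    """One layer from pixels to motor-command probabilities (left, right)."""
    logits = [sum(w * x for w, x in zip(row, pixels)) for row in weights]
    return softmax(logits)


def train(steps=3000, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights = [[0.0] * N_PIXELS for _ in range(2)]
    for _ in range(steps):
        target = rng.randrange(N_PIXELS)
        pixels = observe(target)
        probs = policy(weights, pixels)
        action = 0 if rng.random() < probs[0] else 1
        # The reward only says whether the command moved toward the target.
        correct = 0 if target < N_PIXELS // 2 else 1
        reward = 1.0 if action == correct else 0.0
        # REINFORCE: reward times the gradient of log pi(action | pixels).
        for a in range(2):
            grad_logit = (1.0 if a == action else 0.0) - probs[a]
            for i in range(N_PIXELS):
                weights[a][i] += lr * reward * grad_logit * pixels[i]
    return weights


weights = train()
```

After training, `policy(weights, observe(t))` puts most of its probability on the command that moves toward pixel `t`, although nothing in the code tells the learner what a pixel position means.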

History

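The approach originated with TD-Gammon (1992)^[12]. In backgammon, the evaluation of game situations during self-play was learned through TD($\lambda$) using a layered neural network. Four inputs were used to encode the number of pieces of a given color at a given location on the board, giving 198 input signals in total. With zero knowledge built in, the network learned to play the game at an intermediate level.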
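The TD($\lambda$) update itself can be sketched on a much smaller problem. The following is a toy under stated assumptions, not TD-Gammon: the task is a five-state random walk instead of backgammon, the value function is a table of one weight per state instead of a layered network, and the step size, $\lambda$ and episode count are arbitrary choices. It shows how the temporal-difference error is spread over recently visited states through an eligibility trace.

```python
import random

N_STATES = 5   # non-terminal states of a toy random walk (assumption)
# Known true values for this walk: probability of exiting on the right.
TRUE_VALUES = [(i + 1) / (N_STATES + 1) for i in range(N_STATES)]


def td_lambda(episodes=30000, alpha=0.002, lam=0.8, gamma=1.0, seed=0):
    rng = random.Random(seed)
    w = [0.5] * N_STATES                 # one weight per state (one-hot features)
    for _ in range(episodes):
        trace = [0.0] * N_STATES         # eligibility trace, reset each episode
        s = N_STATES // 2                # start in the middle
        while True:
            s2 = s + rng.choice((-1, 1))
            if s2 < 0:                   # left exit: reward 0
                reward, v2, done = 0.0, 0.0, True
            elif s2 >= N_STATES:         # right exit: reward 1
                reward, v2, done = 1.0, 0.0, True
            else:
                reward, v2, done = 0.0, w[s2], False
            delta = reward + gamma * v2 - w[s]        # TD error
            # Decay every trace, then mark the state just visited.
            trace = [gamma * lam * e for e in trace]
            trace[s] += 1.0
            for i in range(N_STATES):
                w[i] += alpha * delta * trace[i]
            if done:
                break
            s = s2
    return w


values = td_lambda()
```

Comparing `values` with `TRUE_VALUES` gives a quick check that the estimates settle near the exit probabilities; in a network-based learner the one-hot trace would be replaced by a trace over the gradient of the network output.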

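Shibata's group began working with this framework in 1997^[13]. They employed Q-learning and actor-critic for continuous motion tasks^[14] and used a recurrent neural network for tasks that require memory^[15]. They also applied the framework to real robot tasks^[16] and demonstrated the learning of various functions.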

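Beginning around 2013, Google DeepMind showed impressive learning results in video games^[1]^[2] and in the game of Go (AlphaGo)^[5]. They used a deep convolutional neural network, a type of network that had shown superior results in image recognition. As input they used 4 frames of almost raw RGB pixels (84×84). The network was trained by RL, with a reward representing the sign of the change in game score. All 49 games were learned using the same network architecture and Q-learning with minimal prior knowledge; the system outperformed competing methods on most of the games and performed at a level comparable or superior to that of a professional human games tester^[2]. This system is sometimes called the deep Q-network (DQN). In AlphaGo, deep neural networks are trained not only by reinforcement learning but also by supervised learning and Monte Carlo tree search^[5].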
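Three of the ingredients mentioned above (the sign-of-score-change reward, the 4-frame input, and the Q-learning target) can be sketched without any deep learning library. This is a minimal sketch, not DeepMind's implementation: the names `clip_reward`, `FrameStack` and `q_learning_target`, the default `gamma=0.99`, and the choice to fill the stack by repeating the first frame are assumptions, and the convolutional network, image preprocessing and training loop are left out.

```python
from collections import deque


def clip_reward(score_change):
    """Reduce a change in game score to its sign: -1, 0 or +1."""
    return (score_change > 0) - (score_change < 0)


class FrameStack:
    """Keeps the most recent k frames as the network input."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Assumption: fill the stack by repeating the episode's first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        self.frames.append(frame)   # the oldest frame drops out automatically
        return list(self.frames)


def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max over next actions of Q."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Usage with placeholder frames; real frames would be 84x84 pixel arrays.
stack = FrameStack()
state = stack.reset("frame0")
state = stack.push("frame1")
target = q_learning_target(clip_reward(250), [0.2, 1.5, -0.3], done=False)
```

Reducing the reward to its sign keeps the scale of the learning signal the same across games whose scores differ by orders of magnitude, which is one reason a single set of settings can be reused for every game.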

Emergence of functions

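Shibata's group showed that a variety of functions emerge in this framework, including the following^[17].

Image recognition
Color constancy (optical illusion)
Sensor motion (active recognition)
Hand-eye coordination and hand reaching movement
Explanation of brain activities
Knowledge transfer
Memory
Selective attention
Prediction
Exploration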

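Communication has also been established within this framework. The modes include the following^[18].

Dynamic communication (negotiation)
Binarization of signals
Grounded communication using a real robot and camera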

References

[1] Mnih, Volodymyr; et al. (December 2013). Playing Atari with Deep Reinforcement Learning (PDF). NIPS Deep Learning Workshop 2013.
[2] Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature 518 (7540): 529–533. Bibcode: 2015Natur.518..529M. doi:10.1038/nature14236. PMID 25719670.
[3] Mnih, V. (26 February 2015). Performance of DQN in the Game Space Invaders.
[4] Mnih, V. (26 February 2015). Demonstration of Learning Progress in the Game Breakout.
[5] Silver, David; Huang, Aja; Maddison, Chris J.; Guez, Arthur; Sifre, Laurent; Driessche, George van den; Schrittwieser, Julian; Antonoglou, Ioannis et al. (28 January 2016). "Mastering the game of Go with deep neural networks and tree search". Nature 529 (7587): 484–489. Bibcode: 2016Natur.529..484S. doi:10.1038/nature16961. ISSN 0028-0836. PMID 26819042.
[6] Sutton, Richard S.; Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press. ISBN 978-0262193986.
[7] Lin, Long-Ji; Mitchell, Tom M. (1993). Reinforcement Learning with Hidden States. From Animals to Animats. Vol. 2. pp. 271–280.
[8] Onat, Ahmet; Kita, Hajime (1998). Q-learning with Recurrent Neural Networks as a Controller for the Inverted Pendulum Problem. The 5th International Conference on Neural Information Processing (ICONIP). pp. 837–840.
[9] Onat, Ahmet; Kita, Hajime (1998). Recurrent Neural Networks for Reinforcement Learning: Architecture, Learning Algorithms and Internal Representation. International Joint Conference on Neural Networks (IJCNN). pp. 2010–2015. doi:10.1109/IJCNN.1998.687168.
[10] Bakker, Bram; Linaker, Fredrik (2002). Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction (PDF). 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 938–943.
[11] Bakker, Bram; Zhumatiy, Viktor (2003). A Robot that Reinforcement-Learns to Identify and Memorize Important Previous Observation (PDF). 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 430–435.
[12] Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM 38 (3): 58–68. doi:10.1145/203330.203343. Retrieved 10 March 2017.
[13] Shibata, Katsunari; Okabe, Yoichi (1997). Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs (PDF). International Conference on Neural Networks (ICNN) 1997.
[14] Shibata, Katsunari; Iida, Masaru (2003). Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning (PDF). SICE Annual Conference 2003.
[15] Utsunomiya, Hiroki; Shibata, Katsunari (2008). Contextual Behavior and Internal Representations Acquired by Reinforcement Learning with a Recurrent Neural Network in a Continuous State and Action Space Task (PDF). International Conference on Neural Information Processing (ICONIP) '08. [dead link]
[16] Shibata, Katsunari; Kawano, Tomohiko (2008). Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network (PDF). International Conference on Neural Information Processing (ICONIP) '08.
[17] Shibata, Katsunari (7 March 2017). "Functions that Emerge through End-to-End Reinforcement Learning". arXiv:1703.02239 [cs.AI].
[18] Shibata, Katsunari (10 March 2017). "Communications that Emerge through Reinforcement Learning Using a (Recurrent) Neural Network". arXiv:1703.03543 [cs.AI].
