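Most of the papers cited here come from Berkeley, Google Brain, DeepMind, and OpenAI over the past few years, because that work is the most visible from my point of view. I am almost certainly missing older literature and work from other institutions, and I apologize for that. I'm just one person, after all.
Introduction
I once made the following claim on Facebook.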
"Whenever someone asks me whether reinforcement learning (RL) can solve their problem, I tell them it can't. I think this is right at least 70% of the time."
Deep reinforcement learning is surrounded by a mountain of hype. And for good reasons! Reinforcement learning is an incredibly general paradigm. In principle, a robust and high-performing RL system should be great at everything. Merging this paradigm with the empirical power of deep learning is an obvious fit. Deep RL is what looks most like strong AI, and that kind of dream fuels billions of dollars of funding.
Unfortunately, it doesn't really work yet.
I do believe it can work. If I didn't, I wouldn't be working on this topic. But there are a lot of problems ahead, and many of them are fundamentally hard. The beautiful demos of trained agents hide all the blood, sweat, and tears that went into making them.
Several times now, I've seen people get lured in by recent results. They try deep RL for the first time and always underestimate the difficulty. Without fail, the "toy problem" is not as simple as it looks. And without fail, the field breaks them a few times before they learn to set realistic expectations for their research.
This isn't anyone's personal fault. It's a systemic problem. It's easy to tell a story around a positive result. Try doing that with a negative one. The trouble is that negative results are what researchers run into most of the time. In some ways, those results matter even more than the positive ones.
In this post, I explain why deep RL doesn't work, show cases where in my opinion it does work, and suggest how it could work more reliably in the future. I'm not doing this to make people stop working on deep RL. I'm doing it because progress is easier when everyone understands the problems. It's easier to reach agreement if people actually talk about the problems, instead of each of us separately stepping on the same rake over and over again.
I'd like to see more deep RL research. I'd like new people to come into the field. And I'd like them to know what they're getting into.
Before going further, a few remarks.
- I cite several papers here. Usually I cite a paper for its compelling negative examples and stay quiet about the positive ones. That doesn't mean I dislike the paper. They're all good, and worth reading if you have the time.
- I use "reinforcement learning" and "deep reinforcement learning" interchangeably, because in my day-to-day work RL always means deep RL. I am criticizing the empirical behavior of deep reinforcement learning, not reinforcement learning in general. The cited papers usually describe agents built on deep neural networks. The empirical criticisms may also apply to linear RL or tabular RL, but I'm not sure they extend to smaller problems. The hype around deep RL comes from RL being presented as the solution for large, complex, high-dimensional environments where good function approximation is needed. That hype in particular is what needs sorting out.
- This post is structured to move from pessimism to optimism. I know it's a bit long, but I'd really appreciate it if you took the time to read the whole thing before replying.
Without further ado, here are some of the cases where deep RL fails.
Deep reinforcement learning can be horribly sample inefficient
The best-known benchmark for deep reinforcement learning is Atari games. As shown in the now-famous Deep Q-Networks (DQN) paper, if you combine Q-learning with reasonably sized neural networks and a few optimization tricks, you can match or beat human performance on several Atari games.
Atari games run at 60 frames per second. Off the top of your head, can you guess how many frames a state-of-the-art DQN needs to process to reach human-level results?
The answer depends on the game, so take a look at a recent DeepMind paper, Rainbow DQN (Hessel et al, 2017). It shows how several successive improvements to the original DQN architecture each improve the results, and that combining all of them works best. The network beats human results on more than 40 of the 57 Atari games. The results are shown in this handy chart.
The vertical axis is the "median human-normalized score". It is computed by training 57 DQN networks, one per Atari game, normalizing each agent's score so that human performance is 100%, and then taking the median across the 57 games. Rainbow DQN passes the 100% threshold after processing about 18 million frames. That corresponds to roughly 83 hours of play, plus however long training itself takes. That's a lot of time for simple Atari games that most people pick up in a few minutes.
Keep in mind that 18 million frames is actually a very good result: the previous record belonged to Distributional DQN (Bellemare et al, 2017), which needed 70 million frames to reach 100%, about 4 times longer. As for the Nature DQN (Mnih et al, 2015), it never reaches a 100% median at all, even after 200 million frames.
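To make the arithmetic behind these numbers concrete, here is a minimal sketch in Python. The function names are my own, and the normalization is simplified: the optional `random_baseline` argument is an assumption for scores reported relative to a random-play baseline, and it defaults to zero so that the function matches the description above.

```python
from statistics import median

FPS = 60  # Atari frame rate quoted in the text


def frames_to_hours(frames, fps=FPS):
    """Convert a number of game frames into hours of real-time play."""
    return frames / fps / 3600


def human_normalized(agent_score, human_score, random_baseline=0.0):
    """Agent score as a percentage of human score (human = 100%)."""
    return 100.0 * (agent_score - random_baseline) / (human_score - random_baseline)


def median_human_normalized(agent_scores, human_scores):
    """Median of the per-game human-normalized scores."""
    return median(human_normalized(a, h) for a, h in zip(agent_scores, human_scores))


rainbow_hours = frames_to_hours(18_000_000)         # about 83 hours
distributional_hours = frames_to_hours(70_000_000)  # about 324 hours
slowdown = distributional_hours / rainbow_hours     # about 3.9x
```

Because the summary statistic is a median, the curve crossing 100% only says that half of the games are at or above human level. It says nothing about how badly the other half is doing.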
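The "planning fallacy" cognitive bias says that finishing a task usually takes longer than you expect. Reinforcement learning has its own planning fallacy: training usually needs more samples than you thought it would.
The problem isn't limited to Atari games. The second most popular test is the MuJoCo benchmarks, a set of tasks in the MuJoCo physics engine. In these tasks, the input is usually the position and velocity of each joint of some simulated robot. Even though there is no vision problem to solve, RL systems still need a very large number of steps to learn, with the exact count depending on the task. That is an unbelievable amount of experience for controlling such a simple environment.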
The DeepMind parkour paper (Heess et al, 2017), shown below, was trained using 64 workers for over 100 hours. The paper doesn't say what a worker is, but I assume it means a single CPU.
This is a great result. When it first came out, I was surprised that deep RL could learn these running gaits at all.
Still, needing 6400 CPU hours is a bit disheartening. It's not that I expected less time. It's just disappointing that, even for simple skills, deep RL is still an order of magnitude or more short of a practically useful level of sample efficiency.
There's an obvious counterargument: what if we just ignore sample efficiency? There are settings where generating experience is easy. Games, for example. But in any setting where that isn't possible, RL faces an uphill battle, and unfortunately most real-world settings fall into that category.
If you only care about final performance, many problems are better solved by other methods
When looking for a solution to any problem, you usually have to trade off between different goals. You can focus on getting a really good solution to that particular problem, or you can focus on making the largest contribution to research overall. The best problems are the ones where getting a good solution requires making a research contribution. In practice, problems that meet this criterion are hard to find.
In terms of pure performance, deep RL's track record isn't that impressive, because it keeps getting beaten by other methods. Here is a video of MuJoCo robots controlled by online trajectory optimization. The correct actions are computed in near real time, online, with no offline training. Oh, and all of this runs on 2012 hardware (Tassa et al, IROS 2012).
I think this work compares well with the DeepMind parkour paper. So what's the difference?
The difference is that here the authors use model predictive control, working against a ground-truth model of the world (the physics engine). Model-free RL has no such model, which makes the job much harder. On the other hand, if planning against a model improves results this much, why bother with all the tricks of RL training?
Similarly, you can easily outperform DQN networks on Atari with off-the-shelf Monte Carlo Tree Search (MCTS). Here are the headline numbers from Guo et al, NIPS 2014. The authors compare the scores of a trained DQN with the scores of a UCT agent (UCT being the standard modern version of MCTS).
Again, this is an unfair comparison, because DQN does no search, while MCTS searches against a ground-truth model of the world (the Atari emulator). But sometimes you don't care whether a comparison is fair or unfair. Sometimes you just need the thing to work. (If you want a full evaluation of UCT, see the appendix of the original Arcade Learning Environment paper (Bellemare et al, JAIR 2013).)
Reinforcement learning is theoretically suited to anything, including environments where the model of the world is unknown. However, that generality comes at a price: it's hard to exploit any problem-specific information that could help learning. Because of this, you need a lot of samples to learn things that could simply have been hardcoded.
Experience shows that, except in rare cases, algorithms tailored to a specific task work faster and better than reinforcement learning. This isn't a problem if you're doing deep RL for deep RL's sake, but personally I find it frustrating to compare RL's performance with, well, anything else. One reason I liked AlphaGo so much is that it was an unambiguous win for deep RL.
For all these reasons, it's harder to explain to people why my problems are so cool, hard, and interesting, because they often lack the context or experience to appreciate why they're hard. There's a clear gap between what people think deep RL can do and what it can actually do. I work in robotics right now. Consider the company most people think of when you mention robotics: Boston Dynamics.
This doesn't use reinforcement learning. I've met people several times who thought RL was being used here, but it isn't. If you look up the research papers published by the group, you'll find papers on time-varying linear-quadratic regulators, quadratic programming solvers, and convex optimization. In other words, they mostly apply classical robotics methods. It turns out those classical techniques work pretty well when you apply them properly.
Reinforcement learning usually requires a reward function
Reinforcement learning assumes that a reward function exists. Usually it is either given up front or hand-tuned offline and then left unchanged during training. I say "usually" because there are exceptions, such as imitation learning or inverse RL (where the reward function is recovered after the fact), but in most cases RL treats the reward as an oracle.
It's important to note that for RL to work properly, the reward function has to capture exactly what you want. And I mean exactly. RL is prone to annoying overfitting to the reward, which leads to results you didn't expect. This is why Atari is such a nice benchmark. Not only is it easy to get lots of samples, but every game has a clear goal (the score), so you never have to worry about finding a reward function. And you know everyone else has the same one.
The popularity of the MuJoCo tasks has the same explanation. Because they run in simulation, you have complete information about the state of every object, which makes writing a reward function much simpler.
In the Reacher task, you control a two-segment arm connected to a central point, and the goal is to move the end of the arm to a given target. See below for an example of successful learning.
Since all coordinates are known, the reward can be defined as the distance from the end of the arm to the target, plus a small control cost. In principle you can run the same experiment in the real world, if you have sensors good enough to measure the coordinates accurately. But depending on what the system needs to do, defining a reasonable reward can be hard.
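A reward of the kind described for Reacher can be sketched in a few lines. This is an illustration under my own assumptions, not the benchmark's actual source: the function name, the sign convention (distance and control cost are both penalized, so the best possible reward is zero), and the `control_weight` value are all hypothetical.

```python
import math


def reacher_reward(fingertip, target, action, control_weight=0.1):
    """Shaped reward: penalize distance to the target and large torques.

    fingertip, target: (x, y) positions, assumed to be known exactly.
    action: joint torques applied at this step.
    control_weight: assumed coefficient for the control-cost term.
    """
    distance = math.dist(fingertip, target)
    control_cost = control_weight * sum(a * a for a in action)
    return -distance - control_cost


# Fingertip at the origin, target 5 units away, no torque applied.
print(reacher_reward((0.0, 0.0), (3.0, 4.0), (0.0, 0.0)))  # -5.0
```

Every input here comes for free in simulation. On a physical robot, `fingertip` and `target` would have to be estimated from sensors, which is where defining the reward starts to get hard.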
By itself, needing a reward function wouldn't be a big deal, except...
Reward function design is difficult
Writing a reward function isn't that hard. The difficulty comes when you try to write one that encourages the right behavior while still letting the system learn.
In HalfCheetah, there is a two-legged robot confined to a vertical plane, which means it can only move forward or backward.
The goal is to learn to run. The reward is the velocity of the HalfCheetah (video).
This is a smooth, or shaped, reward, meaning it increases as you get closer to the end goal. This is in contrast to a sparse reward, which is given only when the goal state is reached and is absent everywhere else. A smoothly increasing reward is often much easier to learn from, because it provides positive feedback even when training hasn't yet produced a complete solution to the problem.
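The contrast between the two kinds of reward can be sketched on a hypothetical one-dimensional task where the agent has to reach a goal position. The task, the function names, and the tolerance are my own illustration, not taken from any benchmark.

```python
def sparse_reward(position, goal, tolerance=0.05):
    """Reward only when the goal is (almost) reached, zero everywhere else."""
    return 1.0 if abs(position - goal) <= tolerance else 0.0


def shaped_reward(position, goal):
    """Reward that grows smoothly as the agent gets closer to the goal."""
    return -abs(position - goal)


goal = 1.0
# An agent that makes steady progress but has not reached the goal yet.
trajectory = [0.0, 0.2, 0.5, 0.8]

sparse = [sparse_reward(p, goal) for p in trajectory]
shaped = [shaped_reward(p, goal) for p in trajectory]
```

On this trajectory the sparse reward is zero at every step, so the learner gets no hint that it is improving. The shaped reward rises at every step, which is the positive feedback described above.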
Unfortunately, a shaped reward can be biased. As already mentioned, this leads to unexpected and unwanted behavior. A good example is the boat race from an OpenAI blog post. The intended goal is to reach the finish line. You could imagine a reward of +1 for finishing the race within a given time and 0 otherwise.
The actual reward function gives points for passing checkpoints and for collecting bonuses that let you reach the finish line faster. As it turned out, collecting bonuses gives more points than finishing the race.
To be honest, that post annoyed me a bit at first. Not because it was wrong! It was because it seemed to be demonstrating the obvious. Of course reinforcement learning gives weird results when the reward is defined incorrectly! The post seemed to be making an unreasonably big deal out of this particular case.
But then I started writing this post and realized that the most compelling example of a misspecified reward is that very boat race video. Since then it has been used in several presentations on the topic to draw attention to the problem. So fine, I'll reluctantly admit it was a good blog post.
RL algorithms sit on a continuum according to how much they are allowed to assume about the world around them. The most general category, model-free RL, is almost like black-box optimization. Such systems are only allowed to assume they are in an MDP (Markov decision process), and nothing more. The agent is simply told that this gets it +1 and that doesn't, and it has to figure out everything else on its own. As with black-box optimization, the problem is that any behavior that gives +1 counts as good, even if the reward was obtained the wrong way.
A classic example comes from outside RL: someone applied genetic algorithms to microcircuit design and got a circuit in which the final design required a logic gate that wasn't connected to anything.
The gray cells are needed for the circuit to behave correctly, including the cell in the top-left corner, even though it is connected to nothing. From the paper "An Evolved Circuit, Intrinsic in Silicon, Entwined with Physics".
Or, for a more recent example, there is this 2017 Salesforce blog post. Their goal was text summarization. The baseline model was trained with supervised learning and then evaluated with an automated metric called ROUGE. ROUGE is a non-differentiable reward, but RL can handle those. So they tried applying RL to optimize ROUGE directly. That gives a high ROUGE (hooray!), but the summaries aren't actually very good. Here's an example.
Button was denied his 100th race for McLaren after an ERS prevented him from making it to the start-line. It capped a miserable weekend for the Briton. Button has out-qualified. Finished ahead of Nico Rosberg at Bahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000 laps. . In. . . And. (Paulus et al, 2017)
So although the RL model showed the highest ROUGE score...
...they ended up deciding to use a different model for writing summaries.
Another fun example. This one is from Popov et al, 2017, also known as "the Lego stacking paper". The authors use a distributed version of DDPG to learn a grasping policy. The goal is to grasp the red block and put it on top of the blue one.
They got it to work, but ran into an interesting failure case. The initial lifting motion is rewarded based on how high the red block is lifted, which is determined by the z-coordinate of the block's bottom face. In one of the failure modes, the model learned to flip the red block over so its bottom face points up, instead of picking it up.
Clearly this behavior wasn't intended. But RL doesn't care. From the point of view of reinforcement learning, it got rewarded for flipping the block, so it will keep flipping the block.
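The geometry of this failure is easy to reproduce. The sketch below is a deliberately crude model under my own assumptions, not the reward from the paper: a cube sits on a table at height zero, and the "reward" is the z-coordinate of the face that is the bottom of the cube in its own body frame. The side length and function names are hypothetical.

```python
def bottom_face_height(center_z, side, flipped):
    """z-coordinate of the face that is the cube's bottom in its body frame.

    When the cube is upside down, that face points up, so it sits half a
    side above the center instead of half a side below it.
    """
    half = side / 2
    return center_z + half if flipped else center_z - half


def lift_reward(center_z, side, flipped=False):
    """Intended meaning: 'how high has the block been lifted?'"""
    return bottom_face_height(center_z, side, flipped)


SIDE = 0.05  # assumed cube side length, in meters

resting = lift_reward(center_z=SIDE / 2, side=SIDE)                # on the table
lifted = lift_reward(center_z=SIDE / 2 + 0.10, side=SIDE)          # raised 10 cm
tipped = lift_reward(center_z=SIDE / 2, side=SIDE, flipped=True)   # flipped over
```

The tipped cube never leaves the table, yet its reward equals one full side length. Nothing in the reward distinguishes "lifted" from "upside down", so the optimizer is free to pick whichever is easier to reach.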
One way to fix this is to make the reward sparse, giving it only after the blocks have been stacked. Sometimes this works, because a sparse reward can still be learnable. But often it doesn't, because the lack of positive reinforcement makes everything too hard.
The other fix is careful reward shaping: adding new reward terms and tweaking the coefficients of existing ones until the RL algorithm shows the desired behavior in training. Yes, it is possible to beat RL on this front, but it is a very unsatisfying fight. Sometimes it's necessary, but I've never felt I learned anything in the process.
For reference, here is one of the reward functions from the Lego stacking paper.
I don't know how much time went into designing this function, but judging by the number of terms and different coefficients, I'd say "a lot".
In conversations with other RL researchers, I've heard several stories about the novel behavior of models with improperly specified rewards.
- A coworker was teaching an agent to navigate a room. The episode ended if the agent went out of bounds, but no penalty was applied in that case. By the end of training, the agent had adopted suicidal behavior: negative reward was very easy to get and positive reward was hard, so a quick death with a result of 0 was preferable to a long life with a high risk of a negative result.
- A friend was training a simulated robot arm to reach toward a particular point above a table. It turned out the point was defined relative to the table, and the table wasn't attached to anything. The model learned to slam the table very hard, which knocked it over and moved the target point. And the point ended up right next to the arm.
- A researcher talked about using RL to train a simulated robot arm to hammer in a nail with a hammer. Initially the reward was defined by how far the nail went into the hole. Instead of picking up the hammer, the robot drove the nail in with its own limbs. So they added a reward to encourage the robot to pick up the hammer. As a result, the robot's learned strategy was to pick up the hammer and throw it at the nail, rather than use it the normal way.
Admittedly, these are all secondhand stories. I haven't personally seen videos of this behavior. But none of these stories strikes me as implausible. I've been burned by RL too many times to believe otherwise.
I know people who like to tell stories about paperclip optimizers. Fine, I honestly get it. But really, I'm tired of hearing those stories, because they always invoke some superhuman misaligned strong AI as if it were the real story. Why make that up when there are so many real stories happening around us every day?
Even given a good reward, local optima can be hard to escape
The previous RL examples are often called "reward hacking". To me, that implies a clever, out-of-the-box solution that earns more reward than the solution the task designer expected.
Reward hacking is the exception. Much more common is a poor local optimum that comes from getting the trade-off between exploration and exploitation wrong.
Here's one of my favorite videos. It shows an implementation of Normalized Advantage Functions learning on HalfCheetah. From an outside observer's point of view, this is really, really dumb.
But we only call it dumb because we're watching from the side and have a lot of knowledge telling us that moving on your feet is better than lying on your back. RL doesn't know this! It sees a state vector, sends action vectors, and sees that it's getting some positive reward. That's all.
Here is the most plausible explanation I can come up with for what happened during training.
- During random exploration, the model discovered that falling forward pays better than standing still.
- The model did this often enough to "burn in" the behavior, and started falling consistently.
- After falling forward, the model learned that if it applies enough force, it can do a backflip that gives a little more reward.
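- The model explored the backflip enough to become confident it was a good idea, and now backflipping is "burned in" to the behavior.
- Once the model is backflipping consistently, which is easier for it: learning to right itself and run "the standard way", or figuring out how to move forward while lying on its back? I'd bet on the latter.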
It's very funny, but it's clearly not what we wanted from the robot.
Here's another failed run, this time on Reacher (video). In this run, the random initial weights tended to output strongly positive or strongly negative values for the actions. Because of this, most actions were executed with the maximum or minimum possible acceleration. It's actually very easy to spin the model up: just apply a large force at every joint. Once the robot is spinning, it's hard to get out of that state in any sensible way: you would have to take several exploration steps to stop the rampant rotation. It's certainly possible. But it didn't happen in this run.
In both cases we see the classic exploration-exploitation problem that has dogged reinforcement learning since time immemorial. Your data comes from your current policy. If the current policy explores too much, you get junk data and learn nothing. Exploit too much, and you burn in behavior that isn't optimal.
There are several intuitively pleasing ideas on this topic: intrinsic motivation and curiosity, count-based exploration, and so on. Many of these approaches were first proposed in the 1980s or earlier, and some were later revisited for deep learning models. But as far as I know, no approach works reliably in all environments. Sometimes it helps, sometimes it doesn't. It would be nice if some exploration trick worked everywhere, but I doubt a silver bullet of that caliber will be found any time soon. Not because nobody's trying, but because exploration-exploitation is a really, really, really, really hard problem. To quote the Wikipedia article on the multi-armed bandit:
"Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, it was proposed that the problem be dropped over Germany so that German scientists could also waste their time on it." (Source: Q-Learning for Bandit Problems, Duff 1995)
I picture deep RL as a demon that deliberately misunderstands your reward and actively searches for the laziest way to reach a local optimum. It's a bit silly, but it has turned out to be a genuinely productive mindset.
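The exploration-exploitation trade-off discussed above can be seen in its simplest form on a multi-armed bandit. The following is a minimal epsilon-greedy sketch, assuming Gaussian rewards with unit variance. The function name, the arm means, and the step count are my own choices for illustration.

```python
import random


def run_bandit(arm_means, epsilon, steps=2000, seed=0):
    """Play an epsilon-greedy agent on a Gaussian multi-armed bandit.

    With probability epsilon the agent explores (random arm); otherwise it
    exploits the arm with the highest estimated value so far.
    Returns the average reward and how often each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean of observed reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda i: values[i])
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total / steps, counts


arm_means = [0.2, 0.5, 1.0]  # hypothetical arms; the last one is best
for eps in (0.0, 0.1, 1.0):
    avg, counts = run_bandit(arm_means, epsilon=eps)
    print(f"epsilon={eps}: average reward {avg:.2f}, pulls per arm {counts}")
```

With `epsilon=1.0` the agent only explores and its average reward stays near the mean of all arms. With `epsilon=0.0` it only exploits, and can lock onto whichever arm happened to look good first. A small intermediate value usually does better than either extreme, but which value is right depends on the problem, which is the point of the paragraph above.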
Even when deep RL works, it may just be overfitting to weird patterns
Deep learning with reinforcement is popular because it's the only area of machine learning where it's socially acceptable to train on the test set. (Source)
The upside of reinforcement learning is that if you want good results in a specific environment, you can overfit like crazy. The downside is that if you need the model to extend to any other environment, it will probably work poorly, precisely because of that crazy overfitting.
DQN networks can handle many Atari games, because all the training of each model is focused on a single goal: getting the best result in one game. The final model won't extend to other games, because it was never trained that way. You can fine-tune a trained DQN to a new Atari game (see Progressive Neural Networks (Rusu et al, 2016)), but there's no guarantee the transfer will work, and usually nobody expects it to. It's not the wild success people see with features pretrained on ImageNet.
To head off some obvious comments: yes, in principle, training on a wide range of environments can solve some of these problems. In some cases this kind of generalization comes for free. An example is navigation, where you can sample random target locations and use universal value functions to generalize. (See Universal Value Function Approximators, Schaul et al, ICML 2015.) I find this work very promising, and I give more examples from it later. But I don't think deep RL's ability to generalize is yet strong enough to cope with a diverse set of tasks. Perception has gotten a lot better, but deep RL still hasn't had its "ImageNet for control" moment. OpenAI Universe tried to light that fire, but from what I've heard, the tasks were too hard, so not much came of it.
Until we get that kind of generalization moment, we're stuck with models that are surprisingly narrow in scope. As an example (and as an excuse to laugh at my own work), take a look at the paper Can Deep RL Solve Erdos-Selfridge-Spencer Games? (Raghu et al, 2017). We studied a two-player combinatorial game that has a closed-form analytic solution for optimal play. In one of the first experiments, we fixed player 1's behavior and then trained player 2 with RL. In that setup you can treat player 1's actions as part of the environment. By training player 2 against an optimal player 1, we showed that RL can reach a high score. But when we deployed the same policy against a non-optimal player 1, player 2's performance dropped, because it didn't generalize to non-optimal opponents.
Lanctot et al, NIPS 2017 got a similar result. Here, two agents play laser tag. The agents are trained with multi-agent reinforcement learning. To test generalization, training was run with 5 random seeds. Here is a video of agents that were trained against each other.
As you can see, they learn to approach and shoot each other. The authors then took player 1 from one experiment and paired it with player 2 from another. If the learned policies generalize, we should see similar behavior.
Spoiler: we don't.
This seems to be a common problem in multi-agent RL. When agents are trained against each other, a kind of co-evolution takes place. The agents get really good at fighting each other, but when they're pitted against a player they haven't met before, their performance drops. Note that the only difference between these videos is the random seed. Same learning algorithm, same hyperparameters. The difference in behavior comes purely from the randomness of the initial conditions.
Nevertheless, there are some impressive results from competitive self-play environments that seem to contradict this general thesis. The OpenAI blog has a good post about their work in this area. Self-play is also an important part of AlphaGo and AlphaZero. My intuition is that if the agents learn at the same pace, they can keep challenging each other and speed up each other's learning, but if one learns much faster than the other, it exploits the weaker player's flaws too much and overfits. As you move from symmetric self-play to general multi-agent settings, it gets much harder to make sure learning proceeds at the same speed.
Even ignoring generalization, the final results can turn out to be unstable and hard to reproduce
Almost every machine learning algorithm has hyperparameters that affect the behavior of the learning system. Often they are chosen by hand or by random search.
Supervised learning is stable. The dataset is fixed and you check against ground-truth targets. If you change the hyperparameters slightly, performance doesn't change much. Not all hyperparameters work well, but thanks to the many empirical tricks found over the years, many hyperparameter settings show signs of life during training. Those signs of life are very important. They tell you that you're on the right track, that you're doing something reasonable, and that it's worth investing more time.
Right now, deep RL is not stable at all, which is hugely annoying for research.
When I started working at Google Brain, one of the first things I did was implement the algorithm from the Normalized Advantage Functions (NAF) paper mentioned above. I figured it would take two to three weeks. I had several things going for me: some familiarity with Theano (which carries over well to TensorFlow), some deep RL experience, and the first author of the NAF paper was interning at Brain, so I could bug him with questions.
In the end, reproducing the results took me six weeks, because of several software bugs. The question is, why did those bugs stay hidden for so long?
To answer that, consider the simplest continuous control task in OpenAI Gym: the Pendulum task. In this task, a pendulum is anchored at a point and gravity acts on it. The input is a 3-dimensional state. The action space is 1-dimensional: the torque applied to the pendulum. The goal is to balance the pendulum exactly upright.
This is a small problem, and it's made even easier by a well-defined reward. The reward depends on the angle of the pendulum. Actions that bring the pendulum closer to vertical don't just give reward, they give increasing reward.
Here is a video of a model that mostly works. It doesn't bring the pendulum exactly upright, but it outputs the exact torque needed to counteract gravity.
And here is the performance plot after fixing all the bugs. Each line is the reward curve from one of 10 independent runs. Same hyperparameters, the only difference is the random seed.
Seven of the 10 runs worked. Three did not. A 30% failure rate counts as working.
Here is another plot, from the published work Variational Information Maximizing Exploration (Houthooft et al, NIPS 2016). The environment is HalfCheetah. The reward was made sparse, but the details aren't that important. The y-axis is episode reward, the x-axis is the number of timesteps, and the algorithm used is TRPO.
The dark line is the median performance over 10 random seeds, and the shaded region covers the 25th to the 75th percentile. Don't get me wrong: this plot is a good argument in favor of VIME. But on the other hand, the 25th percentile line is really close to zero. That means about 25% of runs fail purely because of the random seed.
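The way such plots summarize runs can be reproduced with the standard library. The returns below are invented purely for illustration (they are not taken from either paper), and the success threshold is an arbitrary assumption. The sketch shows how a median can look healthy while a sizeable fraction of seeds fails.

```python
from statistics import median, quantiles


def summarize_runs(final_returns, success_threshold):
    """Median, quartiles, and failure rate across independent seeds."""
    p25, _, p75 = quantiles(final_returns, n=4, method="inclusive")
    failures = sum(r < success_threshold for r in final_returns)
    return {
        "median": median(final_returns),
        "p25": p25,
        "p75": p75,
        "failure_rate": failures / len(final_returns),
    }


# Hypothetical final returns from 10 seeds: 7 runs learn, 3 stay near zero.
returns = [900, 950, 870, 20, 910, 15, 940, 30, 880, 920]
summary = summarize_runs(returns, success_threshold=500)
print(summary)
```

Here the median sits with the successful runs, so a plot showing only the median would hide the three failures. Reporting the lower percentile, or the raw failure rate, is what exposes them.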
Look, supervised learning has variance too, but it's nowhere near this bad. If my supervised learning code failed to beat random chance in 30% of runs with random seeds, I'd be sure there was a bug somewhere in data loading or training. If my reinforcement learning code does no better than random, I have no idea whether it's a bug, bad hyperparameters, or just bad luck.
This illustration is from the post "Why is machine learning 'hard'?". The main point is that machine learning adds extra dimensions to the space of failures, which exponentially increases the number of ways to fail. Deep RL adds one more dimension: randomness. And the only way to deal with randomness is to run enough experiments to drown out the noise.
When your learning algorithm is both sample inefficient and unstable, it drastically slows down your research. Maybe it only needs 1 million steps. But once you multiply that by five random seeds, and then multiply by the spread of hyperparameters, the compute needed to test a hypothesis effectively grows explosively.
If it makes you feel any better, I've been doing this for a while, and it took me about 6 weeks to get a from-scratch policy gradients implementation to work 50% of the time on a bunch of RL problems. And I have a GPU cluster, and a number of friends I have lunch with every day who have been working in this area for the last few years.
Also, what we know from supervised learning about good convolutional network design doesn't seem to carry over to reinforcement learning, because the main bottleneck is credit assignment and supervision bitrate, not the lack of a powerful representation. Your ResNets, batchnorms, and very deep networks have no power here.
[Supervised learning] wants to work. Even if you screw something up, you'll usually get something non-random back. RL has to be forced to work. If you screw something up or don't tune something well enough, you're almost guaranteed to get a policy that is even worse than random. And even when everything is perfectly tuned, you'll get a bad result 30% of the time. Why? Just because.
Long story short, your failure is more likely due to the difficulty of deep RL than to the difficulty of "designing neural networks". (Hacker News comment from Andrej Karpathy of OpenAI)
Instability with respect to the random seed is like a canary in a coal mine. If pure chance leads to this much difference between runs, imagine how much difference an actual change to the code could make.
Luckily, we don't have to run that thought experiment, because it has already been done and written up in the paper "Deep Reinforcement Learning That Matters" (Henderson et al, AAAI 2018). Its conclusions include:
- Multiplying the reward by a constant can cause significant differences in performance.
- Five random seeds (a common reporting metric) may not be enough to show significant results, because with careful selection you can get non-overlapping confidence intervals.
- Different implementations of the same algorithm perform differently on the same task, even with the same hyperparameters.
My theory is that RL is very sensitive both to initialization and to the dynamics of the training process, because the data is always collected online and the only supervision signal is the scalar reward. A model that randomly stumbles onto good training examples bootstraps itself much faster. A model that doesn't find good examples in time may end up learning nothing at all, as it becomes more and more convinced that any deviation it tries will fail.
But what about all the great achievements of deep RL?
Of course, deep reinforcement learning has achieved some excellent results. DQN is old news now, but it was an absolutely crazy discovery at the time. The same model learns directly from pixels, without being tuned to each game individually. AlphaGo and AlphaZero also remain very impressive achievements.
However, outside of these successes, it's hard to find cases where deep RL has created practical value for the real world.
I tried to think of ways deep RL is used for real tasks in the real world, and it's surprisingly hard. I expected to find some use in recommendation systems, but as far as I can tell those are still dominated by collaborative filtering and contextual bandits.
The best I could find in the end were two Google projects: reducing data center energy consumption, and the recently announced AutoML Vision project. Jack Clark of OpenAI tweeted a similar question to his readers and came to the same conclusion. (The tweet is from last year, before the AutoML announcement.)
I know Audi is doing something interesting with deep RL, because at NIPS they showed a small model of a self-driving race car and said a deep RL system had been developed for it. I know there is neat work on optimizing device placement for large TensorFlow graphs (Mirhoseini et al, ICML 2017). Salesforce has its text summarization model, which works if you apply RL carefully enough. Finance companies are probably experimenting with RL as you read this, but there's no evidence of that yet. (Of course, finance companies have reasons to hide how they play the market, so we may never get solid evidence.) Facebook is doing nice work with deep RL for chatbots and conversation. Every Internet company in history has probably thought about putting RL into its ad-serving model, but if anyone has actually done it, they're keeping quiet about it.
So as I see it, either deep RL is still a topic of academic research rather than a technology robust enough for widespread use, or it really can be made to work effectively, and the people who have succeeded aren't disclosing it. I think the first option is more likely.
If you came to me with an image classification problem, I'd point you to pretrained ImageNet models, and they'd probably do the job just fine. We live in a world where the makers of the Silicon Valley TV series can build, as a joke, a real AI app for recognizing hot dogs. I can't say the same about deep RL.
Given these limitations, when should you use deep RL?
A priori, that's a hard question. The problem is that you're trying to apply the same RL approach to very different environments. It's only natural that it won't always work.
With that said, we can draw some conclusions from the existing achievements of reinforcement learning. These are projects where deep RL either learns some qualitatively impressive behavior, or learns better than previous systems in the area (admittedly, these are very subjective criteria).
Here is my list so far.
- The projects mentioned in the previous sections: DQN, AlphaGo, AlphaZero, the parkour bot, reducing data center energy consumption, and AutoML with Neural Architecture Search.
- OpenAI's Dota 2 1v1 Shadow Fiend bot, which beat top professional players in a simplified duel setting.
- The Super Smash Brothers Melee bot, which can beat professional players in 1v1 Falcon matches (Firoiu et al, 2017).
(A quick aside: machine learning recently beat professional players at no-limit Texas Hold'em. This was done by both Libratus (Brown et al, IJCAI 2017) and DeepStack (Moravčík et al, 2017). Some people assume these programs use deep RL. Both systems are very cool, but they don't use deep reinforcement learning. They use counterfactual regret minimization and clever iterative solving of subgames.)
From this list, we can pick out common properties that make learning easier. None of the properties listed below is required for learning, but the more of them are present, the better the results will definitely be.
- It is easy to generate near-unlimited experience. The benefit here is obvious. The more data you have, the easier training is. This applies to Atari, Go, chess, shogi, and the simulated environments for the parkour bot. It probably applies to the data center power project too, because earlier work (Gao, 2014) showed that neural networks can predict energy efficiency with high accuracy. That's exactly the kind of simulated model you'd want for training an RL system.
This principle may apply to the Dota 2 and SSBM work as well, but it depends on how fast the games can be run and on the number of processors available.
- The problem is simplified into an easier form. One of the most common mistakes in deep RL is overly ambitious plans and dreams. Reinforcement learning can do anything! But that doesn't mean you have to take on everything at once.
The OpenAI Dota 2 bot only played the early game, only Shadow Fiend against Shadow Fiend, only 1v1, with particular settings and fixed positions and build types, and it probably used the Dota 2 API to avoid having to solve the graphics processing task. The SSBM bot showed superhuman performance, but only in 1v1 games, only with Captain Falcon, and only on Battlefield with infinite match time.
This isn't a dig at either bot. Really, why tackle a hard problem when you don't even know whether the simple one is solvable? The general approach in this field is to get a minimal proof of concept first and generalize it later. OpenAI is extending its Dota 2 work, and work continues on extending the SSBM bot to other characters.
- There is a way to learn through self-play. This is how AlphaGo, AlphaZero, the Dota 2 Shadow Fiend bot, and the SSBM Falcon bot work. By self-play I mean a competitive game in which both players can be controlled by the same agent. This setting seems to give the most stable results.
- There is a clean way to define a learnable reward that can't be gamed. In two-player games, this is +1 for a win and -1 for a loss. In the original Neural Architecture Search paper by Zoph et al, ICLR 2017, it was the validation accuracy of the trained model. Every time you introduce a shaped reward, you introduce a chance of learning a non-optimal policy that optimizes for the wrong objective.
If you want to read more about what makes a good reward, try searching for the phrase "proper scoring rule". This Terence Tao blog post gives an accessible example.
As for learnability, I have no advice other than to try it and see what works and what doesn't.
- If the reward has to be shaped, it should at least be rich. In Dota 2, reward can be given for last hits (triggered for every monster killed by either player) and for health (triggered after every attack or skill that lands). These signals arrive quickly and often. For the SSBM bot, reward can be given for damage dealt and taken, which provides a signal on every successful attack. The shorter the delay between action and consequence, the faster the feedback loop closes, and the easier it is for a reinforcement learning system to find its way to the highest reward.
Case study: Neural Architecture Search
We can combine several of these principles to analyze the success of Neural Architecture Search. According to the original ICLR 2017 version, after 12,800 samples deep RL was able to design state-of-the-art neural network architectures. Admittedly, each sample required training a neural network to convergence, but in terms of sample count this is still very efficient.
As mentioned above, the reward is validation accuracy. This is a very rich reward signal: even if a change to the network's structure only improves accuracy from 70% to 71%, RL still benefits from it. On top of that, there is evidence that hyperparameters in deep learning are close to linearly independent. (This was shown empirically in Hyperparameter Optimization: A Spectral Approach (Hazan et al, 2017); my summary is here if you're interested.) NAS isn't exactly tuning hyperparameters, but I think it's quite reasonable that neural network design decisions behave the same way. This is good news for learning, because the correlation between decisions and performance is strong. Finally, the reward here is not only rich, it is exactly what we care about when we train models.
Together, these reasons make it clear why NAS needs "only" about 12,800 trained networks to find the best neural network, compared with the millions of examples needed in other environments. Several aspects of the problem all work in RL's favor.
Overall, success stories like this are still the exception, not the rule. A lot of puzzle pieces have to fit together for reinforcement learning to work convincingly, and even then it's hard.
In short, deep RL is not yet a technology you can use out of the box.
Looking to the future
There's an old saying: over time, every researcher learns to hate their own field of study. The joke is that researchers keep at it anyway, because they like the problems too much.
That's roughly how I feel about deep reinforcement learning. Despite everything said above, I'm absolutely sure we should keep trying to use RL on different problems, including ones where it probably won't work. How else are we going to make RL better?
I see no reason why deep RL couldn't work in the future, given time for the technology to improve. Some very interesting things will start to happen once deep RL becomes reliable enough for widespread use. The question is how to get there.
Below I've listed some plausible ways things could develop. Where further research is needed to move in a given direction, I've included links to relevant papers in that area.
Local optima are good enough. It would be arrogant to claim that humans are globally optimal at anything. My guess is that we are only just good enough, compared with other species, to have built a civilization. In the same spirit, an RL solution does not need to reach a global optimum as long as its local optimum beats the human baseline.
Hardware solves everything. I know people who believe the most important thing that can be done for AI is simply to make hardware faster. Personally, I doubt hardware will fix every problem, but it will certainly contribute a lot. The faster everything runs, the less sample inefficiency matters, and the easier it becomes to brute-force your way past exploration problems.
Add more learning signal. Sparse rewards are hard to learn from because they carry very little information about what actually helped. We may be able to hallucinate positive rewards (Hindsight Experience Replay, Andrychowicz et al, NIPS 2017), define auxiliary tasks (UNREAL, Jaderberg et al, NIPS 2016), or bootstrap with self-supervised learning to build a good world model. Adding more cherries to the cake, so to speak.
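To show what "hallucinating positive rewards" can look like, here is a rough sketch of hindsight relabeling. It is a simplified illustration of the idea, not the paper's implementation: the transition layout `(state, action, achieved, goal)`, the 0 / -1 reward convention, and the toy number-line episode are assumptions I made up for the example.

```python
import random


def sparse_reward(achieved, goal):
    # Assumed convention: 0 when the goal is reached, -1 otherwise.
    return 0.0 if achieved == goal else -1.0


def relabel_with_hindsight(episode, k=4, rng=random):
    """Return the original transitions plus k relabeled copies of each.

    episode: list of (state, action, achieved, goal) tuples, where
    `achieved` is the outcome actually reached after taking the action.
    Each relabeled copy pretends the goal was an outcome reached at this
    step or later in the same episode.
    """
    out = []
    for t, (state, action, achieved, goal) in enumerate(episode):
        out.append((state, action, achieved, goal, sparse_reward(achieved, goal)))
        future = episode[t:]
        for _ in range(k):
            new_goal = rng.choice(future)[2]
            out.append(
                (state, action, achieved, new_goal, sparse_reward(achieved, new_goal))
            )
    return out


# Toy episode on a number line: the agent wanted to reach 10 but only got to 3.
episode = [(0, +1, 1, 10), (1, +1, 2, 10), (2, +1, 3, 10)]
replay = relabel_with_hindsight(episode, k=4, rng=random.Random(0))
```

Every original transition here has reward -1, so a learner would see no success at all. The relabeled copies contain transitions that did reach their substituted goal, which gives the learner something to work with.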
Model-based learning improves sample efficiency. Here is how I describe model-based RL: "Everyone wants to do it, and hardly anyone knows how." In principle, a good model fixes a lot of problems. As AlphaGo showed, simply having a model makes it much easier to search for a good solution. A good world model transfers well to new tasks, and rolling out the world model lets you imagine new experience. From what I have seen, model-based approaches also need fewer samples.
The catch is that learning a good model is hard. My impression is that models over low-dimensional state sometimes work, while image models are usually too hard. But if this ever gets easier, some interesting things could happen.
Dyna (Sutton, 1991) and Dyna-2 (Silver et al., ICML 2008) are the classics in this area. For work that combines model-based learning with deep networks, I recommend a few recent papers from the Berkeley robotics labs:
- Neural Network Dynamics for Model-Based Deep RL with Model-Free Fine-Tuning (Nagabandi et al, 2017);
- Self-Supervised Visual Planning with Temporal Skip Connections (Ebert et al, CoRL 2017);
- Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning (Chebotar et al, ICML 2017);
- Deep Spatial Autoencoders for Visuomotor Learning (Finn et al, ICRA 2016);
- Guided Policy Search (Levine et al, JMLR 2016).
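For readers who have not seen the Dyna idea, here is a tabular sketch in that spirit: the agent learns from each real step, records what happened in a model, and then replays imagined transitions from the model. The six-state chain environment and all the hyperparameter values are made up for illustration, and nothing here is specific to the deep-network papers listed above.

```python
import random
from collections import defaultdict

N_STATES = 6        # toy chain 0..5; reward only on reaching state 5
ACTIONS = (-1, +1)  # step left or right


def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> estimated value
    model = {}              # (state, action) -> (next_state, reward, done)

    def greedy(s):
        best = max(q[(s, a)] for a in ACTIONS)
        return rng.choice([a for a in ACTIONS if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS) if rng.random() < eps else greedy(s)
            s2, r, done = step(s, a)
            update(s, a, r, s2, done)        # learn from real experience
            model[(s, a)] = (s2, r, done)    # remember what the world did
            for _ in range(planning_steps):  # learn from imagined experience
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q


q_values = dyna_q()
```

Setting `planning_steps=0` turns this back into plain Q-learning, which makes it easy to compare how many real steps each variant needs. On a toy chain the model is trivially correct; the hard part described above is getting a model that good in realistic environments.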
Use reinforcement learning only as the fine-tuning step. The first AlphaGo paper started with supervised learning and then applied RL fine-tuning on top. This is a good recipe, because it lets you speed up initial learning with a method that is faster but less powerful. It has worked in other contexts as well; see Sequence Tutor (Jaques et al, ICML 2017). You can think of it as starting the RL process from a reasonable prior rather than a random one, with the job of producing that prior handed off to another system.
Reward functions could become learnable. The promise of machine learning is that, given data, we can learn things that are better than what humans design. If choosing a reward function is so hard, why not apply machine learning to that task too? Imitation learning and inverse RL are both rich fields, and they have shown that a reward function can be defined implicitly through human demonstrations or human ratings.
The best-known papers on inverse RL and imitation learning are Algorithms for Inverse Reinforcement Learning (Ng and Russell, ICML 2000), Apprenticeship Learning via Inverse Reinforcement Learning (Abbeel and Ng, ICML 2004), and DAgger (Ross, Gordon, and Bagnell, AISTATS 2011).
Recent work extending these ideas to deep learning includes Guided Cost Learning (Finn et al, ICML 2016), Time-Contrastive Networks (Sermanet et al, 2017), and Learning From Human Preferences (Christiano et al, NIPS 2017). The last paper in particular showed that a reward derived from human ratings was actually better shaped for learning than the original hardcoded reward, which is a nice practical result.
Among longer-term work that does not rely on deep learning, I liked Inverse Reward Design (Hadfield-Menell et al, NIPS 2017) and Learning Robot Objectives from Physical Human Interaction (Bajcsy et al, CoRL 2017).
Transfer learning saves the day. The promise of transfer learning is that knowledge from previous tasks can speed up learning on new ones. I am absolutely convinced this is the future, once learning is reliable enough to solve a range of disparate tasks. It is hard to transfer learning when you cannot learn at all, and given tasks A and B it can be very hard to predict whether learning will transfer from A to B. In my experience the answer is either very obvious or completely unclear, and even the most obvious cases take non-trivial work to get going.
Recent work in this area includes Universal Value Function Approximators (Schaul et al, ICML 2015), Distral (Whye Teh et al, NIPS 2017), and Overcoming Catastrophic Forgetting (Kirkpatrick et al, PNAS 2017). For older work, see Horde (Sutton et al, AAMAS 2011).
Robotics, for example, has made good progress on sim-to-real transfer (from a simulated version of a task to the real task). See Domain Randomization (Tobin et al, IROS 2017), Sim-to-Real Robot Learning with Progressive Nets (Rusu et al, CoRL 2017), and GraspGAN (Bousmalis et al, 2017). (Disclaimer: I worked on GraspGAN.)
Good priors could cut learning time dramatically. This is closely tied to several of the previous points. In one view, transfer learning is about using past experience to build a prior for other tasks. RL algorithms are designed to work on any Markov decision process, and that is where the pain of generality comes in. If we accept that our solutions only need to work well on a narrow slice of environments, we should be able to exploit the structure those environments share to solve them efficiently.
Pieter Abbeel likes to point out in his talks that deep RL only needs to handle the tasks we expect to solve in the real world. I agree that this makes a lot of sense. There should be a real-world prior that lets us learn new real-world tasks quickly, at the cost of learning unrealistic tasks more slowly. That is a perfectly acceptable trade-off.
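The difficulty is that such a real-world prior is very hard to design. Still, I think there is a good chance it is possible. Personally, I am excited by recent work on meta-learning, because it offers a data-driven way to generate reasonable priors. For example, if I wanted to use RL to navigate a warehouse, it would be interesting to first use meta-learning to learn a good general navigation prior, and then fine-tune that prior for the specific warehouse the robot will operate in. This looks a lot like the future; the question is whether meta-learning will get us there.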
For an overview of recent learning-to-learn work, see this post from BAIR (Berkeley AI Research).
Harder environments could paradoxically be easier. One of the main conclusions of DeepMind's parkour paper is that making a task much harder by adding several task variations can actually simplify learning, because the policy cannot overfit to one setting without losing performance on all the others. We have seen something similar in the domain randomization papers, and even with ImageNet: models trained on ImageNet generalize to other settings far better than models trained on CIFAR-100. As I said earlier, perhaps all we need to move toward much more general RL is an "ImageNet for control".
There are plenty of options for environments. OpenAI Gym is the most popular, but there are also the Arcade Learning Environment, Roboschool, DeepMind Lab, the DeepMind Control Suite, and ELF.
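As a small illustration of the "task variations" idea, here is a sketch of drawing fresh environment parameters for every training episode, so that no single setting can be memorized. The parameter names and ranges are hypothetical and not taken from any of the papers or environments mentioned above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvParams:
    # Hypothetical knobs a simulator might expose.
    friction: float
    mass: float
    light_level: float


# Assumed ranges, chosen only for the example.
RANGES = {
    "friction": (0.5, 1.5),
    "mass": (0.8, 1.2),
    "light_level": (0.2, 1.0),
}


def sample_params(rng: random.Random) -> EnvParams:
    """Draw one random variation of the environment."""
    return EnvParams(
        **{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    )


def training_envs(n_episodes: int, seed: int = 0) -> list:
    """One freshly randomized parameter set per training episode."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n_episodes)]


envs = training_envs(100)
```

A policy trained across all of `envs` has to cope with the whole range rather than with one particular friction or lighting value.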
Finally, although it is unsatisfying from an academic point of view, the empirical problems of deep RL may not matter in practice. As a hypothetical example, suppose a finance company uses deep RL. They train a trading agent on historical US stock market data using 3 random seeds. In live A/B testing, the first seed brings in 2% less revenue, the second performs about average, and the third brings in 2% more. In that hypothetical, reproducibility does not matter: you deploy the model with 2% more revenue and celebrate. Likewise, it does not matter that the trading agent only works well in the United States; if it does poorly on the world market, just do not use it there. There is a big gap between building an extraordinary system and building a reproducible extraordinary system, and perhaps it is worth focusing on the former first.
Where we are now
In many ways, the current state of deep RL annoys me. And yet it has attracted stronger interest among researchers than anything I have seen in other fields. My feelings are best summed up by a phrase Andrew Ng used in his talk Nuts and Bolts of Applying Deep Learning: strong short-term pessimism, balanced by even stronger long-term optimism. Deep RL is a bit of a mess right now, but I still believe in its future.
That said, if someone asks me again whether reinforcement learning (RL) can solve their problem, I will still answer right away: no, it cannot. But I will ask them to repeat the question in a few years. By then, maybe it all will work.