Capstone Project: Predicting Allstate Insurance Claims Severity
Part 1. Project Definition
Project Overview
Many people who work with data know Kaggle, a platform for data science competitions. Kaggle currently hosts more than 600,000 data scientists and many well-known companies. A company describes its problem, sets a quality metric, and publishes a dataset that can help solve it; participants then look for original ways to solve the problem the company has posed.
This Kaggle competition is sponsored by Allstate, the largest publicly held personal lines insurer in the United States. Allstate is currently developing automated methods for predicting the cost (severity) of insurance claims and has asked the Kaggle community to demonstrate new ideas and new approaches to this problem.
The company wants to improve the quality of its claims handling service, and it has published a dataset describing accidents that occurred in households (each household is represented by a vector of anonymized features) together with a numeric estimate of the cost of each claim. Our task was to predict the severity of a claim for a new household.
Kaggle has several other datasets related to this task:
• Allstate Insurance Prediction Competition: an earlier Allstate competition designed to predict insurance payments from the characteristics of the insured vehicle. Its dataset offers a chance to dig into the insurance domain.
• Fire Peril Loss Cost competition: a competition held by Liberty Mutual Group to predict expected fire losses in order to draw up insurance contracts. This is another example of an insurance dataset, and it helped me better understand how prediction problems are approached in the insurance industry.
Separately, note that the source dataset is heavily anonymized (both the feature names and their values). This makes it harder to understand what the features mean and harder to enrich the dataset from external sources. Competition participants made various attempts to enrich and interpret the source data, but the success of those attempts is debatable. On the other hand, this dataset does not appear to contain a data leak, which occurs when extra information is left in the training data. Such information can correlate strongly with the target variable and lead to unreasonably accurate predictions. A considerable number of recent Kaggle competitions have suffered from such leaks.
Problem Statement
We have at our disposal a dataset containing records of insurance claims made by Allstate customers. Each record contains both categorical and continuous features. The target variable is a numeric estimate of the loss caused by the claim. All features are anonymized as far as possible: we know neither their real names nor their true meaning.
Our goal is to build a model that can correctly predict future losses from the given feature values. This is clearly a regression task, since the target variable is numeric. It is also a supervised learning task: the target variable is explicitly defined in the training dataset, and we need to obtain its value for every record in the test set.
Allstate did an excellent job of cleaning and preprocessing the data: the dataset provided is very clean and (after a little additional processing) can be passed to a wide range of supervised learning algorithms. As we will see in the part of the report devoted to data exploration, the Allstate task leaves little room for generating new features or preprocessing existing ones. On the other hand, this dataset encourages the use and testing of a variety of machine learning algorithms and ensembles, which is exactly what a capstone project needs.
I applied the following approach to the Allstate project:
1. Explore the dataset, understand the meaning of the data, the features, and the target variable, and find simple relationships in the data. This step is carried out in the Data Discovery notebook.
2. Perform the necessary data preprocessing and train several different machine learning algorithms (XGBoost and a multilayer perceptron). Obtain baseline results. These tasks are covered in the XGBoost and MLP notebooks.
3. Tune the models and achieve a noticeable improvement in the results of each one. This step is also implemented in the XGBoost and MLP notebooks.
4. Train an ensemble using the stacking technique, with the previous models as base predictors. Obtain the final results, which are considerably better than the earlier ones. This step is implemented in the Stacking notebook.
5. Briefly discuss the results, assess the final position on the competition leaderboard, and look for additional ways to improve it. The results are discussed in this report and in the last part of the Stacking notebook.
Metrics
On the Kaggle platform, the company hosting a competition must clearly define the metric on which participants compete. Allstate chose MAE for this purpose. MAE (mean absolute error) is a very simple and transparent metric that directly compares predicted values with true values.
This metric is fixed and cannot be changed; it is part of the competition rules. Nevertheless, I consider it well suited to this task. First, MAE (unlike MSE, the mean squared error) does not impose a heavy penalty for inaccurate estimates of outliers, and the dataset contains several outliers with abnormally high loss values. Second, MAE is easy to understand: the error is expressed in the same units as the target variable itself. In general, MAE is a good metric for newcomers to data science. It is easy to compute, easy to understand, and hard to misinterpret.
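To make the comparison between MAE and MSE concrete, here is a minimal sketch. The loss values are hypothetical and chosen only to show how a single extreme claim affects each metric; they are not taken from the Allstate data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, expressed in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; large misses dominate the score."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical losses: three ordinary claims and one extreme claim.
y_true = [1000.0, 2000.0, 3000.0, 50000.0]
y_pred = [1100.0, 1900.0, 3100.0, 10000.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(mae, mse)
```

The badly predicted extreme claim contributes linearly to MAE but quadratically to MSE, which is why MSE reacts much more strongly to outliers.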
Part 2. Analysis
Data Exploration
For a complete overview of this step, see the Data Discovery notebook.
The full training set consists of 188318 records indexed by the id variable. The index carries no additional information useful for working with the data: it is a simple numbering that starts at 1 and has some gaps. I do not use the test set in this project (it is only needed for submitting results to Kaggle), but note that the test dataset is organized in the same way as the training set. Evidently the training and test samples were obtained from the same dataset by a splitting procedure such as train_test_split from the sklearn package.
The main findings of this part of the project are as follows:
• The dataset contains 130 different features (not counting the id index and the target variable loss). Given the size of the dataset, this is a very reasonable number of features. We are unlikely to run into the curse of dimensionality here.
• There are 116 categorical features and 14 numeric ones. Most machine learning algorithms cannot handle categorical variables properly, so we will probably need to encode these 116 features. Ways of doing this, and the differences between them, are discussed later.
• There is not a single missing value in the whole dataset. This simply confirms that Allstate supplied heavily preprocessed data to make it accessible and easy to use.
• Most categorical features (72, or 62%) are binary (yes/no, male/female), but their values are written simply as "A" and "B", so their meaning cannot be guessed at all. Three features take 3 distinct values, and 12 features take 4 distinct values.
• The numeric features are already scaled to the range from 0 to 1; all their standard deviations are close to 0.2, and their means are on the order of 0.5. Consequently, nothing can be inferred about the meaning of these features either.
• It appears that some numeric features were categorical before they were converted to numbers with LabelEncoder or a similar procedure.
• Plotting histograms of the various features shows that none of them follows a normal distribution. We can try to reduce the skewness of the data (for features with scipy.stats.mstats.skew > 0.25), but even after such a transformation we fail to obtain a distribution close to normal.
• The target variable is not normally distributed either, but a simple log transformation brings it close to a normal distribution.
• The target variable contains some outliers with abnormally high values (very severe incidents). Ideally, we would like the model to identify such outliers and predict them correctly. At the same time, it is easy to overfit to them if we are not careful. Some compromise is clearly needed here.
• The data distributions of the training and test samples are similar. This is the ideal property of a train/test split: it greatly simplifies cross-validation and lets us make informed decisions about model quality using cross-validation on the training set alone. This greatly simplified participation in the Kaggle competition, but it is of no particular use for the capstone project.
• Several continuous features are strongly correlated (the correlation matrix is shown in Figure 1 below). This leads to multicollinearity in the dataset, which can significantly degrade the predictive power of linear regression models. The problem can be partly addressed with L1 or L2 regularization.
Figure 1: Correlation matrix of the continuous features
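The check for strongly correlated continuous features can be sketched as follows. This is an illustration only: the data is synthetic, the feature names f1, f2, f3 are hypothetical placeholders rather than the Allstate columns, and the 0.8 threshold is an assumption.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.8):
    """Return (name_a, name_b, r) for feature pairs with |Pearson r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are treated as variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic data: f2 is almost a copy of f1, f3 is independent of both.
rng = np.random.default_rng(0)
f1 = rng.random(500)
X = np.column_stack([f1, f1 + 0.05 * rng.random(500), rng.random(500)])

pairs = correlated_pairs(X, ["f1", "f2", "f3"])
print(pairs)
```

Pairs reported by such a check are natural candidates for dropping one feature or for relying on L1/L2 regularization when a linear model is used.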
Exploratory Visualization
To illustrate the key properties of this dataset, namely its high degree of anonymization and its preprocessing, I show one visualization.
Below are histograms of the 14 continuous features, labeled cont#. As the figure shows, the distributions have multiple peaks, and their density functions are far from Gaussian. We can try to reduce the skewness of the data, but normalization algorithms (such as the Box-Cox transformation) have almost no effect on this dataset.
Figure 2: Histograms of the continuous features
The cont2 feature is particularly interesting. It most likely originated from a categorical feature and may reflect age or an age category. Unfortunately, I did not investigate this feature further, and it had no effect on my project.
Algorithms and Techniques
This section is covered in more detail in two documents: the XGBoost notebook and the MLP notebook.
XGBoost. One of the reasons this project interested me was the chance to try tree boosting methods, and XGBoost in particular. Indeed, thanks to its scalability, flexibility, and impressive predictive power, this algorithm has become a kind of standard Swiss Army knife in many Kaggle competitions.
XGBoost is suitable for supervised learning tasks like the present one (with a well-defined training dataset and target variable). Below I describe the principles behind the XGBoost algorithm.
XGBoost is essentially a variation of boosting, "a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones" (source: Wikipedia). The idea of boosting is rooted in the question, posed by Kearns and Valiant, of whether a "weak" learning algorithm that performs only slightly better than random guessing in the PAC (probably approximately correct) model can be "boosted" into an arbitrarily accurate "strong" learning algorithm (source: A Short Introduction to Boosting, Yoav Freund and Robert E. Schapire). An affirmative answer to this question was given by R. E. Schapire in his paper The Strength of Weak Learnability, which led to the development of many boosting algorithms.
As you can see, the fundamental principle of boosting is the sequential application of weak learning algorithms. Each subsequent weak algorithm tries to reduce the bias of the overall model, so that the weak algorithms are combined into a powerful ensemble model. There are many examples of boosting algorithms and methods, including AdaBoost (adaptive boosting, which adapts to the weak learners), LPBoost, and gradient boosting.
XGBoost, specifically, is a library that implements the gradient boosting scheme. A gradient boosting model is built stage by stage, as with other boosting methods. This boosting method generalizes weak learners by allowing optimization of an arbitrary differentiable loss function (a loss function whose gradient can be computed).
As a flavor of boosting, XGBoost includes an original decision-tree-based learning algorithm that is suited to sparse data. A theoretically justified procedure makes it possible to handle the weights of individual instances during tree learning (source: XGBoost: A Scalable Tree Boosting System, Tianqi Chen and Carlos Guestrin).
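The stage-wise idea behind gradient boosting can be illustrated with a deliberately tiny sketch. This is plain gradient boosting with one-split regression stumps and squared loss on a single feature; it is not XGBoost's regularized objective or its tree learning algorithm, and the toy data and default parameter values are assumptions made for illustration.

```python
def fit_stump(x, residuals):
    """Find the single split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosting(x, y, n_rounds=50, eta=0.1):
    """Stage-wise boosting for squared loss with regression stumps."""
    base = sum(y) / len(y)  # stage 0: a constant prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each weak learner's contribution by the learning rate eta.
        preds = [pi + eta * predict_stump(stump, xi) for pi, xi in zip(preds, x)]
    return base, eta, stumps


def predict_boosting(model, xi):
    base, eta, stumps = model
    return base + eta * sum(predict_stump(s, xi) for s in stumps)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): a smooth nonlinear target on one feature.
x = [i / 20 for i in range(20)]
y = [xi ** 2 for xi in x]

model = fit_boosting(x, y)
baseline_mae = mae(y, [sum(y) / len(y)] * len(y))
boosted_mae = mae(y, [predict_boosting(model, xi) for xi in x])
print(baseline_mae, boosted_mae)
```

Each round fits a weak learner to the current residuals and adds a shrunken copy of it to the ensemble, which is the role the eta and num_boost_round parameters play later in this report.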
The XGBoost algorithm has many advantages:
• Regularization. As the section devoted to multilayer perceptron models shows, it is easy to end up with an overfitted model when using other algorithms. XGBoost provides regularization tools that make the model more robust, together with a set of options for configuring the process. These parameters include gamma (the minimum reduction in the loss function required for a further tree split), alpha (the weight of L1 regularization), lambda (the weight of L2 regularization), max_depth (the maximum tree depth), and min_child_weight (the minimum sum of instance weights required in a child node).
• Parallel and distributed computing. Unlike many other boosting algorithms, training here can run in parallel, which shortens training time. XGBoost is very fast. According to the authors of the paper mentioned above, "the system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings."
• Built-in cross-validation. Cross-validation is a prerequisite for assessing the quality of the resulting model, and with XGBoost the process of working with it is very simple and clear.
MLP. The second model we build is a fully connected feed-forward neural network, or multilayer perceptron. Since the ultimate goal is to build an ensemble (stacking) of base regressors, and we have already settled on the type of the first regressor (XGBoost), we need to find another "type" of generalizing algorithm that explores the dataset in a different way.
This is what Wolpert has in mind in Stacked Generalization when he says that the zero-level generalizers should "span the space."
By adding layers and more units within each layer, a neural network can capture very complex nonlinear relationships in the data. The universal approximation theorem states that a feed-forward neural network can approximate any continuous function on Euclidean space. The multilayer perceptron is therefore a very powerful modeling algorithm. Multilayer perceptrons overfit easily, but we have all the tools needed to reduce this effect at our disposal: random deactivation of neurons (dropout), L1-L2 regularization, batch normalization, and so on. We can also train several similar neural networks and average their predictions.
The deep learning community has developed high-quality software for training and evaluating models based on artificial neural networks. My model is based on TensorFlow, a tensor computation library developed by Google. To simplify model building, I decided to use Keras, a high-level front end for TensorFlow and Theano that takes care of most of the standard operations needed to build and train neural networks.
GridSearch and Hyperopt. These methods are used for model selection and hyperparameter tuning. GridSearch exhaustively searches all possible combinations of parameters, whereas Hyperopt either samples a given number of candidates from the parameter space according to specified distributions or uses a form of Bayesian optimization. With both selection methods, I use cross-validation to evaluate model performance. Depending on the computational complexity of the model, I use k-fold cross-validation with 3 or 5 folds.
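The difference between exhaustive grid search and sampling a fixed number of candidates can be sketched in plain Python. This is not the GridSearchCV or Hyperopt API: the search space values are hypothetical, toy_score is a stand-in for a cross-validated MAE, and the random sampler is a simplification that does not implement Bayesian optimization.

```python
import itertools
import random


def grid_candidates(space):
    """Yield every combination of parameter values (exhaustive search)."""
    keys = sorted(space)
    for combo in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))


def random_candidates(space, n, seed=0):
    """Sample n candidates uniformly from the space (fixed evaluation budget)."""
    rng = random.Random(seed)
    keys = sorted(space)
    return [{k: rng.choice(space[k]) for k in keys} for _ in range(n)]


def best_by_score(candidates, score_fn):
    """Pick the candidate with the lowest score (lower MAE is better)."""
    return min(candidates, key=score_fn)


# Hypothetical search space for two tree parameters.
space = {"max_depth": [4, 5, 6, 7, 8], "min_child_weight": [1, 3, 6]}


def toy_score(params):
    # Stand-in for a cross-validated MAE; constructed to be minimal at (8, 6).
    return (params["max_depth"] - 8) ** 2 + (params["min_child_weight"] - 6) ** 2


grid = list(grid_candidates(space))
best = best_by_score(grid, toy_score)
sampled = random_candidates(space, n=5)
print(len(grid), best, sampled)
```

The grid grows multiplicatively with every added parameter, which is why a sampling-based search with a fixed budget becomes attractive for the larger neural network search space.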
Stacking. We combine the predictions of the two models (XGBoost and the multilayer perceptron) and produce the final prediction with a meta-regressor. This method is called stacking and is used heavily on Kaggle (often excessively). The idea of stacking is to split the training set into k folds, train each base regressor on k-1 folds, and make predictions on the remaining fold. As a result, we obtain a training sample that contains the regressors' out-of-fold predictions while retaining the true values of the target variable. We then train a meta-model on this data, using the predictions of each regressor as the features of the meta-model and the true values as its target.
We then feed the base regressors' predictions for the test sample into the trained meta-model and obtain final predictions in which the meta-model takes into account the characteristic errors of each regressor. The implementation of this step is described in detail in the Stacking notebook.
Benchmark
The first benchmark was set by Allstate: they trained a random forest ensemble model and obtained MAE = 1217.52141. This result is easy to beat even with a simple XGBoost model, and most participants managed to do so.
I also set myself several benchmarks while training the models. The lower bar was the performance of a simple model of the corresponding class. For XGBoost, this benchmark is MAE = 1219.57, achieved by a simple model with 50 trees and no optimization or hyperparameter tuning. I took standard hyperparameter values (those recommended in the Analytics Vidhya article), kept the number of trees small, and obtained this initial result.
For the multilayer perceptron, I chose as the baseline the performance of a two-layer model with a small number of units in the hidden layer (128), the ReLU activation function, standard weight initialization, and the Adam gradient descent optimizer: MAE = 1190.73.
In this capstone project I understand that all results must be reproducible, so I avoided complex baseline models. I also took part in this Kaggle competition, but training all the models used in the competition (mostly combinations involving a large number of algorithms) would certainly take too much of a reader's time. In the Kaggle competition I hope to beat MAE = 1100.
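Part 3. Methodology
Data Preprocessing
I have already mentioned that this dataset is well prepared and preprocessed: for example, the continuous features are scaled to the interval [0,1], and the categorical features have been renamed and their values recoded. In fact, there is not much left to do in terms of preprocessing. However, some work is still needed before we can train proper models.
Target Variable Preprocessing
The target variable has a skewed, exponential-like distribution, which can degrade the quality of regression models. As is well known, regression models work best when the target variable is normally distributed.
To solve this problem, we simply apply a log transformation to the target variable loss: np.log(train['loss']).
Figure 3: Histograms of the target variable distribution before and after the log transformation
The result can be improved further: several outlying observations lie to the left of the main bell of the distribution. To get rid of these outliers, we can shift all values of the loss variable to the right by 200 points (loss + 200) and then take the logarithm.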
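A minimal sketch of the shifted log transformation and its inverse is shown below. The shift of 200 comes from the text; the sample loss values are hypothetical. Predictions made in log space must be mapped back to the original scale before MAE is computed, which is what the inverse function is for.

```python
import math

SHIFT = 200  # shift applied before the logarithm (loss + 200)


def to_log_target(loss, shift=SHIFT):
    """Transform a raw loss value into the log space used for training."""
    return math.log(loss + shift)


def from_log_target(log_value, shift=SHIFT):
    """Invert the transformation to get back a loss in original units."""
    return math.exp(log_value) - shift


# Hypothetical raw losses, from a very small claim to a very large one.
losses = [0.67, 1204.0, 3500.0, 121012.25]
transformed = [to_log_target(v) for v in losses]
restored = [from_log_target(v) for v in transformed]
print(transformed)
```

Because the shift keeps the argument of the logarithm well above zero, very small losses no longer produce the extreme negative values that form the left tail of the log-transformed distribution.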
Encoding Categorical Variables
Most machine learning algorithms cannot use categorical variables directly. XGBoost is no exception, so we need to convert the categorical variables to numeric ones. Here we can choose between two standard strategies: label encoding and one-hot encoding. Which strategy to use is an important question, and several factors need to be considered.
One-hot encoding is the basic way of working with categorical features. It produces a sparse matrix in which each new column represents one possible value of one feature. Since we have 116 categorical variables and cat116 alone takes 326 values, we would get a sparse matrix with a huge number of zeros. This leads to longer training times, higher memory costs, and can even worsen the results. Another drawback of one-hot encoding is the loss of information in cases where the order of the categories matters.
Label encoding, on the other hand, simply normalizes an input column so that it contains only values from 0 to the number of classes minus 1. For many regression algorithms this is not a good strategy, but XGBoost can handle such a transformation very well.
For XGBoost we use LabelEncoder to normalize the input. For the multilayer perceptron we need to create dummy variables, so the choice there is one-hot encoding.
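The two encoding strategies can be illustrated with a small hand-rolled sketch. It mirrors the idea behind LabelEncoder and dummy variables but is not the scikit-learn or pandas implementation; the sample column is hypothetical.

```python
def label_encode(values):
    """Map each category to an integer from 0 to n_classes - 1."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Create one 0/1 column per category (dummy variables)."""
    codes, classes = label_encode(values)
    rows = [[1 if code == k else 0 for k in range(len(classes))] for code in codes]
    return rows, classes


# Hypothetical anonymized categorical column.
column = ["A", "B", "A", "C"]

codes, classes = label_encode(column)
rows, _ = one_hot_encode(column)
print(codes)
print(rows)
```

Label encoding keeps a single column per feature, while one-hot encoding adds one column per category, which is why a feature with hundreds of values makes the one-hot matrix so wide and sparse.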
Implementation and Refinement
As mentioned earlier, the methodology for implementing the machine learning part is divided into two sections:
• Training, tuning, and cross-validation of the base (zero-level) models: XGBoost and the multilayer perceptron, on the assumption that slightly different data preparation has already been carried out for these two models. The differences lie in the encoding of the features (one-hot encoding versus label encoding) and, unfortunately, in the absence of the log transformation of the target variable for the multilayer perceptron model. The outcome of this part is two tuned models whose results meet the established benchmarks.
• Training and validation of the level-1 model, that is, the stacker. The outcome of this section is a new meta-model that gives better results than either of the base zero-level models trained earlier.
A detailed overview of each section follows.
Section 1. Zero-level models: training, tuning, cross-validation
XGBoost model training methodology (an adapted version of the Analytics Vidhya XGBoost tuning guide):
1. Train a shallow, simple model with the parameters num_boost_round = 50 and max_depth = 5, which gives MAE = 1219.57. We take this result as the lower bar and improve on it by tuning the model.
2. To make hyperparameter optimization easier, implement a custom class XGBoostRegressor built on top of XGBoost. Strictly speaking, this class is not required for the model to work, but it offers a number of advantages when using the GridSearchCV grid search implemented in scikit-learn (we can use our own loss function and minimize it rather than maximize it).
3. Determine and fix the learning rate and the number of trees that will be used in all subsequent grid searches. Since our task is to obtain good results in the shortest possible time, we set a small number of trees and a high learning rate: eta = 0.1, num_boost_round = 50.
4. Tune the parameters max_depth and min_child_weight. It is recommended to tune these hyperparameters together: increasing max_depth increases the complexity of the model (and the chance of overfitting), while min_child_weight acts as a regularization parameter. We obtain the following optimal parameters: max_depth = 8, min_child_weight = 6. This improves the result from MAE = 1219.57 to MAE = 1186.5.
5. Tune gamma, the regularization parameter.
6. Tune the fraction of features and of training instances used in each tree (colsample_bytree, subsample). We obtain the following optimal configuration: subsample = 0.9, colsample_bytree = 0.6. The result improves to MAE = 1183.7.
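7. Finally, add more trees (increase num_boost_round) and lower the learning rate eta. With 200 trees and eta = 0.07, the model reaches MAE = 1145.9.
Grid search results (Figure 4):
Figure 4: Grid search over max_depth and min_child_weight, and over colsample_bytree and subsample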
Multilayer perceptron training methodology:
1. Start with something simple: build a base model with a single hidden layer (two layers in total), the ReLU activation function, and the Adam optimizer, which implements gradient descent. Such a shallow model is hard to overfit, trains quickly, and gives reasonable initial results. In terms of the bias-variance trade-off, it is a biased model, but it is stable and gives a decent result for such a simple model: MAE = 1190.73.
2. Use k-fold cross-validation to measure the performance of deeper models and to visualize overfitting. Train a three-layer model and show that it tends to overfit easily.
3. Add regularization to the three-layer model: dropout and early stopping. Define several candidate configurations for subsequent manual testing; these configurations differ in the number of hidden units and in the dropout probabilities. Train these models, examine and compare the results obtained by cross-validation, and select the best one. In practice, this approach produced no improvement; the results were only worse than those of the two-layer model. This may be due to the manual (and therefore imprecise) approach to regularizing the multilayer perceptron: we merely pick a plausible dropout strength for each layer, with no guarantee that the chosen strength is optimal.
4. Introduce Hyperopt to search the hyperparameter space in an automated and more intelligent way (we use tpe.suggest, the Tree of Parzen Estimators algorithm). Run several iterations of Hyperopt over a large number of hyperparameter configurations, including different dropout rates, layer compositions, and numbers of hidden units. In the end, the best option turns out to be a four-layer architecture (three hidden layers) with the adadelta optimizer, batch normalization, and dropout.
The final architecture of the multilayer perceptron:
Figure 5: Final architecture of the multilayer perceptron
The cross-validation result for this model is MAE = 1150.009.
Section 2. Training the level-1 model
By this point we have trained and tuned the zero-level models: XGBoost and the multilayer perceptron. In this section we assemble a dataset from the out-of-fold predictions produced during cross-validation of the zero-level models (for which the true values are known), together with a test sample of zero-level model predictions, which is used for the final assessment of the meta-model's quality.
To fully understand the process of building the ensemble of models, please refer to this file.
The ensemble-building methodology I used is described below.
• Step 1. Train/holdout split. Since we are not submitting results to Kaggle or probing the leaderboard, we need to split the training sample into two parts: a training part and a test part. The training subsample is used to generate the out-of-fold predictions of the zero-level models with a k-fold split, while the holdout dataset is used only for the final assessment of the performance of the two zero-level models and the meta-model.
• Step 2. Fold split. Split the training set into k folds, which are used to train the zero-level models.
• Step 3. Out-of-fold predictions. Train each zero-level model on K-1 folds and make predictions for the remaining fold. Repeat this process for all K folds. In the end, we obtain predictions for the entire training subsample (for which the labels are known).
• Step 4. Training on the full sample. Train each zero-level model on the entire training dataset and obtain predictions for the test set. Build a new dataset from the resulting predictions, in which each feature is the prediction of one zero-level model.
• Step 5. Training the level-1 model. Train the level-1 model on the predictions obtained during cross-validation, using the corresponding labels from the training set as the labels for the level-1 model. Then use the combined set of zero-level predictions for the test data to obtain the final predictions of the level-1 model.
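The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic, and the two zero-level models are simple least-squares fits on different feature subsets that merely stand in for XGBoost and the multilayer perceptron. All function names here are hypothetical and do not come from the Stacking notebook.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear model with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def out_of_fold_predictions(X, y, k=3, seed=0):
    """Steps 2-3: train on k-1 folds, predict the remaining fold, repeat."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    oof = np.zeros(len(X))
    for i, holdout_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef = fit_linear(X[train_idx], y[train_idx])
        oof[holdout_idx] = predict_linear(coef, X[holdout_idx])
    return oof


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


# Synthetic data; each stand-in base model sees only part of the features.
rng = np.random.default_rng(42)
X = rng.random((300, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.5 * X[:, 3] + 0.05 * rng.standard_normal(300)
subsets = [[0, 1], [2, 3]]

# Step 1: train/holdout split.
X_train, y_train, X_hold, y_hold = X[:200], y[:200], X[200:], y[200:]

# Steps 2-3: out-of-fold predictions become the meta-model's features.
oof = np.column_stack(
    [out_of_fold_predictions(X_train[:, cols], y_train) for cols in subsets])

# Step 4: refit each base model on the full training part, predict the holdout.
hold = np.column_stack(
    [predict_linear(fit_linear(X_train[:, cols], y_train), X_hold[:, cols])
     for cols in subsets])

# Step 5: train the level-1 linear regression and score it on the holdout.
meta = fit_linear(oof, y_train)
stacked_mae = mae(y_hold, predict_linear(meta, hold))
base_maes = [mae(y_hold, hold[:, i]) for i in range(len(subsets))]
print(base_maes, stacked_mae)
```

The key point is that the meta-model is trained only on predictions made for rows the base models did not see during fitting, so it learns how far to trust each base model without being misled by their training-set fit.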
I chose linear regression as the level-1 model: meta-models overfit easily (and, frankly, nothing worked better than a simple linear regression as the meta-model in the competition). This stacker works very well and improves the results considerably. After validating the zero-level models and the final ensemble model on the holdout dataset, I obtained the following results:
MAE XGBoost: 1149.19888471
MAE MLP: 1145.49726607
MAE stacker: 1136.21813333
The stacking result, MAE = 1136.21, is noticeably better than the result of the best model in the ensemble. Of course, this result can be improved further, but in this project we compromise between increasing the predictive power of the model and reducing training time.
Note: this set of results was computed on the holdout sample, not by cross-validation. We therefore have no right to compare the results obtained with cross-validation directly against the results on the holdout dataset. However, the holdout dataset has, as expected, a distribution close to that of the whole dataset. This is why we can claim that stacking really did improve performance.
As a side note, it is interesting to see what weights the stacker assigned to the zero-level models. With linear regression, the final prediction is simply a linear combination of the weights and the base predictions:
PREDICTION = 0.59 * XGB_PREDICTION + 0.41 * MLP_PREDICTION
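As a worked example of this linear combination, the snippet below applies the two reported weights to a pair of base predictions. The input values are hypothetical and only illustrate the arithmetic.

```python
XGB_WEIGHT = 0.59  # weight of the XGBoost prediction reported above
MLP_WEIGHT = 0.41  # weight of the multilayer perceptron prediction


def blend(xgb_prediction, mlp_prediction):
    """Final prediction as a weighted sum of the two zero-level predictions."""
    return XGB_WEIGHT * xgb_prediction + MLP_WEIGHT * mlp_prediction


# Hypothetical base predictions for a single claim.
final = blend(2000.0, 2100.0)
print(final)
```

Since the two weights sum to 1, the blended value always lies between the two base predictions, slightly closer to the XGBoost one.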
Part 4. Results
Model Evaluation and Validation
To evaluate the results, we train and validate the final models (both individual and ensemble) on different subsets of the dataset. This lets us check the stability of the models and whether they give stable results regardless of the initial training sample. To achieve this, I generalized the stacker code from the Stacking notebook into the class in modules/stacker.py, which makes it possible to call the model evaluation procedure quickly with different seeds (so that the models differ slightly from one another).
We train the zero-level and level-1 models with 5 different seeds and write the results to a table. We then use the pandas describe method to obtain aggregate performance statistics for each model. The most telling metrics here are the mean and the standard deviation (std).
As you can see, the models are very stable (the standard deviation is low), and the ensemble always outperforms every other model. Its result is better than the best result of the best individual model (MAE = 1132.165 versus MAE = 1136.59).
One more remark: although we tried to train and validate the models carefully, there may still be room for an information leak that went unnoticed. All the models show an improvement in results that could have been caused by one such leak (then again, we trained only 5 models, and the parameter seed = 0 may simply give worse results). Even so, the final conclusion still holds. Averaging several stackers trained with different seeds improves the final result.
Justification
The baseline results are MAE = 1217.52 (Allstate's random forest model) and MAE = 1190.73 (the simple multilayer perceptron). The final model improves on the first result by 7.2% and on the second by 5.1%.
To measure the significance of these results, we add each baseline result to the table of results obtained in the previous section and check whether the baseline can be called an outlier. If a baseline result can be regarded as an outlier, then the difference between the final result and the baseline is significant.
To run this test, we compute the IQR (interquartile range), which is used to detect anomalies and outliers. We then compute the third quartile of the data (Q3) and use the formula Q3 + 1.5 * IQR to set an upper bound for the results. A value that exceeds this bound is considered an outlier. Running this test shows that both baseline results are outliers. We can therefore say that the stacker goes well beyond the baseline benchmarks.
    import numpy as np
    from scipy.stats import iqr

    # `scores` is the table of results from the previous section.
    for baseline in [1217.52, 1190.73]:
        stacker_scores = list(scores.stacker)
        stacker_scores.append(baseline)
        # Upper bound for non-outliers: Q3 + 1.5 * IQR.
        max_margin = np.percentile(stacker_scores, 75) + 1.5 * iqr(stacker_scores)
        if baseline - max_margin > 0:
            print('MAE =', baseline, 'is an outlier.')
        else:
            print('MAE =', baseline, 'is not an outlier.')
The output is as follows:
    MAE = 1217.52 is an outlier.
    MAE = 1190.73 is an outlier.
Part 5. Conclusion
Free-Form Visualization
Let us step back from the details and look at the project as a whole. We trained and optimized two main models: XGBoost and a multilayer perceptron. We took many steps from the simplest models to tuned, more stable, and more complex ones. We then built a level-1 model (linear regression) and combined the predictions of the zero-level models using the stacking technique. Finally, we validated the performance of stacking and trained 5 variations of it. Averaging the results of these 5 models gave a final result of MAE = 1129.91.
Figure 6: Key steps in improving the results
Of course, we could include many more models at the zero level, or make the ensembles more complex. One possible approach is to train several completely different ensembles and combine their predictions (for example, as a linear combination) at a new level, level 2.
Reflection
An end-to-end solution to the problem
The idea of this project was to work with a dataset that does not require the creation of new features. Having such a dataset is very valuable, because it lets us concentrate on the algorithms and their optimization rather than on data preprocessing and transformation. Naturally, I wanted to test XGBoost and neural networks in action, and the chosen dataset gave me a good baseline assessment of the quality of my models compared with the results of other Kaggle participants.
Even with the lightweight models presented in the project, we exceeded the baseline result considerably, improving on it by 7.2%. We compared the baseline results with the final result and confirmed that the final result is indeed a significant improvement.
Difficulties
This project was not easy. Although I did not demonstrate any special or creative approaches to the dataset, such as creating new features, reducing dimensionality, or augmenting the data, the computational complexity of the models that Kaggle requires turned out to be higher than I expected.
The main challenge in Allstate Claims Severity was the reproducibility of the computations. To guarantee it, a number of preliminary requirements had to be met: the process of obtaining the results had to be clear, the computations had to be deterministic (or at least subject to limited variation), and they had to be reproducible in a reasonable amount of time on modern hardware. As a result, I significantly reduced the complexity of the existing models and excluded some techniques altogether (for example, I excluded bagging for XGBoost and the multilayer perceptron, even though bagging these models gave noticeably better results).
The project was carried out mostly on Amazon Web Services infrastructure: I used a p2.xlarge instance with GPU computing for the multilayer perceptron (NVIDIA Tesla K80 GPU, 12 GB of GPU memory) and a CPU instance for XGBoost (36 vCPUs, 60 GB of memory). Several sections of my project require heavy computation:
• Grid search. Iterating over all the grids in the XGBoost section takes a lot of time. On an ordinary computer, all the computations required for XGBoost can take 1 to 2 hours.
• Cross-validation of the multilayer perceptron models. Unfortunately, there is no magic way to speed up cross-validation when training a multilayer perceptron. Cross-validation is a reliable way of assessing model quality, so it should not be skipped, but enough time has to be set aside for it.
• Hyperparameter optimization. This is undoubtedly the most demanding part of all the computations. Searching through the various combinations with Hyperopt and finding a suitable one takes several hours.
• Generating the out-of-fold predictions for the multilayer perceptron and XGBoost.
The best way to reproduce the results is to run the pretrained models first and then recompute some of the simplest models.
Improvement
The project leaves plenty of room for future improvement. Below I consider which other improvements could be applied.
Data preprocessing
1. I started out by working on the multilayer perceptron models, and only later discovered the technique of taking the logarithm of the target variable. As a result, my multilayer perceptron was trained without the log transformation. All the multilayer perceptron models could, of course, be retrained with a log-transformed target variable.
2. A promising way to transform the target variable is to shift it 200 points to the right (add 200 to every value). If we apply such a shift and then take the logarithm, the outliers on the left side of the target variable's density function disappear, and the distribution becomes closer to normal.
XGBoost
1. Train a more refined XGBoost model by adding trees and reducing the eta parameter at the same time. In my working model I use 28,000 trees and eta = 0.003, values determined with the grid search procedure.
2. Use early_stopping_rounds instead of a fixed num_boost_round to stop training and avoid overfitting the model. In this case, set eta to a small number and num_boost_round to a very large one (up to 100,000). Keep in mind that a validation sample has to be prepared for this. As a result, the model has less data for training, and its performance may drop.
3. Run grid search over other hyperparameter values. For example, we could test values of colsample_bytree between 0 and 0.5, which often give good results. The idea here is that the hyperparameter space contains several local optima, and we should find several of them.
4. Combine several XGBoost models trained with different hyperparameters. This can be done by averaging the models' results, by blending, or by stacking.
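The early-stopping idea from item 2 can be sketched in a few lines. This is a simplified illustration of the stopping rule applied to a list of per-round validation errors, not XGBoost's own implementation of early_stopping_rounds; the error values are hypothetical.

```python
def best_round_with_early_stopping(validation_errors, patience):
    """Stop once the validation error has not improved for `patience` rounds."""
    best_round, best_error = 0, float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, best_error


# Hypothetical validation MAE after each boosting round.
errors = [1220.0, 1190.0, 1170.0, 1165.0, 1166.0, 1168.0, 1167.0, 1150.0]

stopped_at, stopped_error = best_round_with_early_stopping(errors, patience=3)
print(stopped_at, stopped_error)
```

In this example training stops before the last round is ever evaluated, which shows the trade-off: a small patience value saves time but can miss a later improvement.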
Multilayer perceptron
1. Use bagging to average several multilayer perceptrons of the same model. Our model is stochastic (for example, because it uses dropout), so bagging smooths out its behavior and improves its predictive power.
2. Try deeper network architectures, other unit configurations, and other hyperparameter values. We can test other optimizers based on gradient descent and change the number of layers and the number of units in each layer; the full power of Hyperopt is at our disposal.
3. Combine several different multilayer perceptrons using any of the known techniques (for example, train two-, three-, and four-layer neural networks). These models cover the data space in different ways and should give an improvement over the baseline result.
Cross-validation
Instead of simple 3-fold cross-validation (in some cases I used 5 folds), we could move to 10-fold cross-validation. Such cross-validation is better suited to the Kaggle competition and will almost certainly improve the results (each model gets more training data).
Stacking
1. We can add more zero-level models to the ensemble. First, we can simply train more multilayer perceptrons and XGBoost models, but they need to differ from one another. For example, we can train these models on different subsets of the data, and for some models we can preprocess the data (the loss variable) in different ways. Second, we can introduce completely different models: LightGBM, the k-nearest neighbors algorithm, factorization machines (FM), and so on.
2. Another idea is to add a second layer as a new stacking layer. The result is two-level stacking: on the out-of-fold predictions produced by the zero-level regressors (L0), we train several different level-1 meta-models (L1). We then take a linear combination of the meta-model predictions (L2) to obtain the final estimate.
3. Try PCA, the principal component analysis method. There are a couple of ideas here. First, we can mix the resulting components with the out-of-fold predictions on which the linear regression (L1) is trained, in order to give the meta-model additional information. Second, we can select only the features with the largest variance and discard the rest as noise. This can also help improve the quality of the resulting model.
A description of the configuration on which the whole project was run.