This article describes the use of various machine learning methods to solve problems in natural language processing (NLP). One such problem is automatically determining the emotional coloring (positive, negative, neutral) of text data, that is, sentiment analysis. The goal of the task is to decide whether a given text (a movie review, a comment, and so on) is positive, negative, or neutral in its effect on the reputation of a particular object. Sentiment is hard to analyze because language is emotionally rich: slang, ambiguity, uncertainty, and sarcasm all mislead not only people but computers as well.
Articles on sentiment detection have appeared on Habr several times (1, 2, 3). In any case, the topic has recently been one of the most discussed worldwide [1, 2, 3, 4].
I should say up front that there is no innovation in this article; the material is meant as a tutorial for beginners in machine learning and NLP. The main material I used can be found at this link. All of the source code can be found at this link.
So, what is the problem, and how do we solve it?
Suppose we have a text message (a movie description, a review, a comment):
"This movie upset me. It just takes your free time and throws it in the trash!!!"
Or a different one:
"The best movie I have ever seen!!! The music, the actors, the script... it is all amazing!!!"
In the first example the comment is negative, so the system should return a negative result; in the second it should return a positive one. In machine learning, this kind of task is called classification, and the method is supervised learning. That is, the algorithm is first trained on a training set and stores the necessary coefficients and other model data; then, when new data arrives, it classifies that data with a certain probability. The coefficients are those of a formula like the following.
Here, the parameter values are the coefficients obtained by training on the text data. As you can see, the formula ultimately returns a value between 0 and 1 (see sigmoid for details): the closer the value is to 0, the more likely it is that the text carries negative sentiment.
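The sigmoid mentioned above squashes a weighted sum of features into the range from 0 to 1. The following is a minimal sketch under the assumption of a logistic-style score; the names `theta` and `x` and all numbers are hypothetical and are not taken from the article.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(theta, x):
    """Score feature vector x with coefficients theta (hypothetical names)."""
    z = sum(t * v for t, v in zip(theta, x))
    return sigmoid(z)


# Hypothetical coefficients for a 3-word vocabulary (assumed, not learned here).
theta = [-2.0, 0.5, 1.5]
negative_review = [1, 0, 0]  # contains only the word with a negative weight
positive_review = [0, 1, 1]  # contains only words with positive weights

p_neg = positive_probability(theta, negative_review)
p_pos = positive_probability(theta, positive_review)
```

A score below 0.5 would be read as negative and a score above 0.5 as positive; the 0.5 cut-off is itself an assumption of this sketch.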
For the training sample I used an open dataset from www.kaggle.com: a set of 50,000 movie reviews from the IMDB website, selected specifically for sentiment analysis. The sentiment label is binary: reviews with an IMDB rating < 5 are assigned the value 0, and reviews with a rating >= 7 are assigned the value 1.
Each record in this dataset consists of the following fields:
- id: a unique identifier for each review;
- sentiment: the sentiment of the review, 1 or 0;
- review: the text of the review.
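The labeling rule and the record layout described above can be sketched as follows. The sample records are invented, and returning `None` for ratings of 5 and 6 is an assumption: the text only says which ratings map to 0 and to 1.

```python
def label_from_rating(rating):
    """Return 0 for ratings below 5, 1 for ratings of 7 or more."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None  # assumption: ratings 5 and 6 belong to neither class


# Hypothetical records with the three fields: id, sentiment, review.
records = [
    {"id": "r1", "sentiment": label_from_rating(2), "review": "Waste of time."},
    {"id": "r2", "sentiment": label_from_rating(9), "review": "Best movie ever!"},
]
labels = [record["sentiment"] for record in records]
```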
Algorithm
So, let us move straight on to solving the problem. The entire algorithm described in this article is implemented in Python (v. 2.7). For readability, I have split the algorithm into the following steps.
Step 1. Preprocessing
Before the data can be processed, it has to be preprocessed. At this stage all HTML tags, punctuation marks, and symbols are removed. This is done with the Python library "Beautiful Soup". In addition, all numbers and links in the text are replaced with tags. The text also contains so-called "stop words": frequent words of the language that carry essentially no meaning (in English, for example, words such as "the, at, about..."). Stop words are removed with the Python Natural Language Toolkit (NLTK). After preprocessing the source text, we get the following result:
[account, mood, feature, film, watch, go, see, movie, originally], that is, a set of words.
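The preprocessing steps above can be sketched with the standard library alone. This is not the article's code: it swaps in the `re` module and a tiny hand-written stop-word list in place of Beautiful Soup and NLTK so that it runs without extra packages. The tag names `NUM` and `LINK` and the function name `clean_review` are assumptions.

```python
import re

# Tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "at", "about", "and", "is", "it", "this", "of", "to", "in"}


def clean_review(raw_html):
    """Turn a raw review into a list of lowercase words and placeholder tags."""
    text = re.sub(r"<[^>]+>", " ", raw_html).lower()  # strip HTML tags
    text = re.sub(r"https?://\S+", " LINK ", text)    # replace links with a tag
    text = re.sub(r"\d+", " NUM ", text)              # replace numbers with a tag
    text = re.sub(r"[^A-Za-z]+", " ", text)           # drop punctuation and symbols
    return [word for word in text.split() if word not in STOP_WORDS]


sample = "<b>The movie</b> is about 2 friends!!! See http://example.com/x1"
cleaned = clean_review(sample)
```

Links are replaced before numbers so that digits inside a URL do not produce stray `NUM` tags.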
At this stage you could refine things further, for example by reducing each word to its base form (stemming). For this experiment, however, I decided to limit myself and not do that.
Step 2. Representation as a vector
Approach 1
The point is that mathematical formulas, and computers too, find it easier to work with numbers than with sets of words. So the text has to be represented as a vector of numbers. To do that, we can build a dictionary containing all the words: either gather every word found in the texts into one large dictionary, or use a ready-made dictionary (Dahl or Zaliznyak), and then replace the words of a text with their indices in the dictionary. Suppose we have only 3 reviews with the following preprocessed word vectors:
- [account, mood, feature]
- [film, watch, go]
- [originally, movie, see]
Combining all the words from these lists into one gives the following dictionary (let us call it the basis of the vectors):
[account, movie, feature, film, go, originally, mood, watch, see]
Replacing the earlier word vectors with indicators over the dictionary positions gives:
- [1, 0, 1, 0, 0, 0, 1, 0, 0]
- [0, 0, 0, 1, 1, 0, 0, 1, 0]
- [0, 1, 0, 0, 0, 1, 0, 0, 1]
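The dictionary-and-index idea above can be sketched in a few lines of plain Python. The three toy reviews are invented for illustration, and this sketch stores word counts rather than 0/1 indicators, so a repeated word shows up as 2.

```python
# Hypothetical preprocessed reviews (not the article's data).
reviews = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "plot"],
]

# Build the dictionary (the basis of the vectors) from every word seen.
vocabulary = sorted({word for review in reviews for word in review})
position = {word: i for i, word in enumerate(vocabulary)}


def to_vector(words):
    """Count how often each dictionary word occurs in one review."""
    vector = [0] * len(vocabulary)
    for word in words:
        vector[position[word]] += 1
    return vector


vectors = [to_vector(review) for review in reviews]
```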
Doing this for every review produces fairly long vectors (in my case I took the 5000 most frequent words). Such vectors are called "feature vectors". In this way we get a vector for each review, and the vectors can be compared with standard metrics such as Euclidean distance or cosine distance. This approach is called "Bag of Words".
    from sklearn.feature_extraction.text import CountVectorizer

    # Initialize sklearn's "Bag-Of-Words" tool, keeping the 5000 most frequent words
    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 max_features=5000)

    # Learn the vocabulary and convert the cleaned reviews into count vectors
    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    train_data_features = train_data_features.toarray()
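For comparing two feature vectors, the two metrics named in the text can be sketched as below. The vectors are hypothetical, and both are assumed to be non-zero, since cosine distance is undefined for an all-zero vector.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


v1 = [0, 1, 1, 0]
v2 = [1, 0, 1, 0]  # shares one word with v1
v3 = [0, 2, 2, 0]  # same words as v1, each used twice

d_euclid_12 = euclidean_distance(v1, v2)
d_cos_12 = cosine_distance(v1, v2)
d_cos_13 = cosine_distance(v1, v3)
```

Cosine distance treats `v1` and `v3` as identical because it ignores vector length, whereas Euclidean distance does not.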
Approach 2
The first approach is quite common and very easy to implement, but it is not without drawbacks. When two vectors are compared, only exact word matches count, and important information is lost. One kind of lost information is the semantics of the words. For example, the word "black" can easily be replaced by the word "dark", because their meanings are very close. Such words can be called semantically related. Groups of such words include synonyms, hypernyms, hyponyms, and so on.
The other approach tries to replace each word in the list with the number of its semantic group. The result is something like a "bag of words", but with a deeper meaning. To do this, we use Google's Word2Vec technology. It is available in the gensim library, which ships with a built-in Word2Vec model.
The essence of the Word2Vec model is as follows: a large amount of text is passed in as input (in our case, about 10,000 reviews), and the output is a fixed-length weighted vector (the length is set by hand) for each word that occurs in the dataset. For example, comparing the word "man" with all other words and sorting in descending order gave the following result (cosine similarity was chosen as the measure of closeness).
Words related in meaning to the word "man"
Word | Similarity |
woman | 0.6056 |
guy | 0.4935 |
boy | 0.4893 |
men | 0.4632 |
person | 0.4574 |
lady | 0.4487 |
himself | 0.4288 |
girl | 0.4166 |
his | 0.3853 |
he | 0.3829 |
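A ranking like the one in the table can be produced by scoring every other word against the query word and sorting in descending order. The sketch below uses invented 3-dimensional vectors in place of a trained Word2Vec model, so the words, numbers, and the function name `most_similar` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real model would learn these from text.
embeddings = {
    "man": [0.9, 0.1, 0.3],
    "woman": [0.8, 0.3, 0.4],
    "boy": [0.7, 0.0, 0.6],
    "movie": [0.0, 1.0, 0.1],
}


def most_similar(word, top_n=3):
    """Rank all other words by similarity to `word`, highest first."""
    query = embeddings[word]
    scores = [(other, cosine_similarity(query, vector))
              for other, vector in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


ranking = most_similar("man")
ranked_words = [word for word, _ in ranking]
```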
You can read more about how the Word2Vec model works at this link.
Next, we use clustering to combine words that are close in meaning. Yes, here is yet another fancy word: clustering. I will not explain it in detail; I think the wiki article covers it well. I will, however, describe the essence of the most primitive clustering algorithm, K-means: you fix a number of clusters N; the algorithm learns from the training data, splits it into clusters, and finds the center of each one; when test data comes in, the algorithm assigns it the number of the cluster whose center is nearest. In our case, I divided the number of words in the dictionary by 5, on the assumption that each cluster contains 5 words on average. That gave about 3000 clusters. Then we do the same thing as in the first "Bag-Of-Words" approach, replacing each word with the index of its cluster, only this time we get something like a "Bag-Of-Clusters". The complete source code, with a description of this method, is available at this link.
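The K-means step and the resulting "Bag-Of-Clusters" can be sketched in plain Python as follows. This is a toy version under stated assumptions: the 2-dimensional word vectors are invented, the centers are naively initialized from the first k points, and a fixed number of iterations is used. It is not the implementation from the article's source code.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10):
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(c) / float(len(group)) for c in zip(*group)]
    return centers


# Hypothetical word vectors standing in for Word2Vec output.
words = ["good", "bad", "great", "awful", "nice"]
word_vectors = {
    "good": [1.0, 1.0], "bad": [5.0, 5.0], "great": [1.2, 0.8],
    "awful": [5.1, 4.9], "nice": [0.9, 1.1],
}
centers = kmeans([word_vectors[w] for w in words], k=2)
word_to_cluster = {w: nearest(word_vectors[w], centers) for w in words}


def bag_of_clusters(review_words, k=2):
    """Count how many words of a review fall into each cluster."""
    bag = [0] * k
    for word in review_words:
        if word in word_to_cluster:  # words without a vector are skipped
            bag[word_to_cluster[word]] += 1
    return bag


example_bag = bag_of_clusters(["good", "great", "awful", "unknown"])
```

In the article's setting the number of clusters would be the dictionary size divided by 5 rather than the `k=2` used here.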
Step 3. Text classification
So, by this stage we have removed everything unnecessary and converted the texts into vectors, and we are on the home stretch. In this experiment the Random Forest classification algorithm is used to classify the documents. The algorithm is already implemented in the scikit-learn package; all that remains is to feed it the vectorized text data and specify the number of trees. The algorithm then takes care of everything else: it trains on the training set and stores all the data it needs.
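    from sklearn.ensemble import RandomForestClassifier

    # Initialize a Random Forest with 100 trees
    forest = RandomForestClassifier(n_estimators=100)

    # Train on the feature vectors, using the sentiment column as labels
    forest = forest.fit(train_data_features, train["sentiment"])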
Results
In short, I ran the classifier on feature vectors obtained with each of the two approaches. I got these interesting results:
Method | Precision | Recall | F-measure | Accuracy |
Bag of Words | 85.2% | 83.7% | 84.4% | 84.5% |
Word2vec | 90.3% | 87.2% | 88.7% | 89.8% |
As you can see, Word2Vec showed comparatively better results than the good old Bag-Of-Words, bearing in mind that running Word2Vec on an old laptop took 2 hours.
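The four columns of the results table are standard metrics that can be computed from the counts of true and false positives and negatives. A minimal sketch, with hypothetical counts that are not the article's data:

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F-measure, and accuracy from confusion counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy


# Hypothetical counts for 200 test reviews.
precision, recall, f_measure, accuracy = classification_metrics(
    tp=80, fp=20, fn=10, tn=90)
```

The F-measure is the harmonic mean of precision and recall, which is consistent with the table: for example, 85.2% and 83.7% combine to about 84.4%.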
䜿çšææïŒ
[1] I. ChetviorkinãPãBraslavskiyãNãLoukachevichããROMIP 2011ã®ã»ã³ãã¡ã³ãåæãã©ãã¯ããèšç®èšèªåŠããã³ç¥çæè¡ïŒåœéäŒè°ã®è°äºé²ãDialog 2012ããBekosovoã2012幎ãppã 1-14ã
[2] AA PakãSS NarynovãAS ZharmagambetovãSN SagyndykovaãZE KenzhebayevaãIãTuremuratovichãã泚éãªãã³ãŒãã¹ããã®å矩èªæœåºã®æ¹æ³ããIn procã DINWC2015ãã¢ã¹ã¯ã¯ã2015幎ãppã 1-5
[3] T. MikolovãKãChenãGãCorradoãJãDeanãããã¯ãã«ç©ºéã«ãããåèªè¡šçŸã®å¹ççãªæšå®ããProcã ICLRã§ã®ã¯ãŒã¯ã·ã§ãã2013幎ã
[4] P. Boããã³L. Leeããã»ã³ãã¡ã³ã¿ã«æè²ïŒæå°ã«ããã«åºã¥ã䞻芳çèŠçŽã䜿çšããã»ã³ãã¡ã³ãåæããACLã®è°äºé²ã2004幎
[5] T.ãšã¢ãã ã¹ãããµããŒããã¯ã¿ãŒãã·ã³ã䜿çšããããã¹ãã®åé¡ïŒé¢é£ããå€ãã®æ©èœã䜿çšããåŠç¿ãã欧å·æ©æ¢°åŠç¿äŒè°ïŒECMLïŒãSpringer Berlin / Heidelbergã1998幎ãppã 137-142
[6] PD Turneyãã芪æãç«ãŠãã®ãã芪æãäžããã®ãïŒ ã¬ãã¥ãŒã®æåž«ãªãåé¡ã«é©çšãããã»ãã³ãã£ãã¯æåããèšç®èšèªåŠåäŒïŒACL'02ïŒã®ç¬¬40å幎次äŒè°ã®è°äºé²ããã³ã·ã«ããã¢å·ãã£ã©ãã«ãã£ã¢ã2002幎ãppã 417-424ã
[7] A. GoãRãBhayaniãLãHuangããé éç£èŠã䜿çšããTwitterææ åé¡ãããã¯ãã«ã«ã¬ããŒããã¹ã¿ã³ãã©ãŒãã 2009ã
[8] J. FurnkranzãTãMitchellãããã³E. RiloffããWWWäžã®ããã¹ãåé¡ã«èšèªå¥ã䜿çšããå Žåã®äºäŸç 究ããAAAI / ICML Workshop on Learning for Text Categorizationã1998 5-12ã
[9] MF CaropresoãSãMatwinãFãSebastianiããèªåããã¹ãåé¡ã®ããã®çµ±èšçãã¬ãŒãºã®æçšæ§ã®åŠç¿è ã«äŸåããªãè©äŸ¡ããããã¹ãããŒã¿ããŒã¹ãšããã¥ã¡ã³ã管çïŒTheory and practiceã2001ãppã 78-102ã