Hello, my name is Kirill, and I have been an alcoholic (I mean, an IT manager) for more than 10 years. While studying at MIPT I wrote code, sometimes for pay. But faced with harsh reality (you need to earn money, preferably more of it), I went downhill and became a manager.
It's not all that bad, though! Recently my partners and I went all in on developing our startup Okdesk, a system for tracking customers and customer requests. On the one hand, that means more freedom in choosing where to go. On the other hand, we can't simply assign "3 developers for 6 months to do research...". We have to do a lot ourselves, including non-core development experiments (that is, experiments unrelated to the product's main functionality).
One of these experiments was developing an algorithm that classifies customer requests by their text, so that they can then be routed to a group of assignees. In this article I want to describe how a "non-programmer" can learn Python in the background in 1.5 months and write a simple ML algorithm of practical use.
How to learn?
In my case, it was distance learning on Coursera. There are a great many courses there on machine learning and other disciplines related to artificial intelligence. The classic is considered to be the course by Andrew Ng, a co-founder of Coursera. The drawback of that course (besides being in English, which does not suit everyone) is its unusual toolkit, Octave (a free analog of MATLAB). This is not the main thing for understanding the algorithms, but it is better to learn on more popular tools.
I chose the "Machine Learning and Data Analysis" specialization from MIPT and Yandex (it has 6 courses; the first 2 are enough for what is described in this article). The main advantage of the specialization is a lively community of students and mentors in Slack, where at almost any time of day there is someone to ask.
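What is machine learning?
Within this article I will not dive into terminological disputes, so those who want to find fault with insufficient mathematical precision are asked to refrain (I promise not to go beyond the bounds of decency :)).
So what exactly is machine learning? It is a set of methods for using a computer to solve problems that would otherwise require human intellectual effort. A distinctive feature of machine learning methods is that they are "trained" on precedents (that is, on examples whose correct answers are known in advance).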
A more mathematical definition looks like this:
- There is a set of objects with a set of properties. Denote this set by X.
- There is a set of answers. Denote this set by Y.
- There is an (unknown) relationship between the set of objects and the set of answers, that is, a function that maps objects of X to elements of Y. Call it the function y.
- There is a finite subset of objects from X (the training set) for which the answers from Y are known.
- Using the training set, we need to approximate the function y as well as possible by some function a, so that for any object from X the function a returns the correct answer from Y with good probability (or with good accuracy, if the answers are numerical). Finding the function a is the machine learning task.
Here is an example from real life. A bank issues loans. It has accumulated many borrower questionnaires for which the outcome is already known: the loan was repaid, not repaid, repaid late, and so on. The object in this example is a borrower with a completed questionnaire. The data from the questionnaire are the object's parameters. The fact that the loan was or was not repaid is the "answer" for the object (the borrower's questionnaire). The set of questionnaires with known outcomes is the training set.
There is a natural desire to be able to predict, from a potential borrower's questionnaire, whether the loan will be repaid. Finding such a prediction algorithm is a machine learning task.
There are many examples of machine learning tasks. In this article we look more closely at the task of text classification.
Problem statement
Recall that we are developing Okdesk, a cloud service for customer support. Companies that use Okdesk accept customer requests through various channels: a client portal, email, web forms on a site, instant messengers, and so on. A request belongs to one category or another, and depending on the category it gets one assignee or another. For example, requests about 1C are sent to 1C specialists, while requests about the office network go to the group of system administrators.
To classify the flow of requests you can assign a dispatcher. But first, this costs money (salary, taxes, office rent). Second, classifying and routing a request takes time, so the request is resolved later. It would be great if requests could be classified automatically by their content! Let's try to solve this problem with machine learning (and one IT manager).
For the experiment we took a sample of 1,200 requests with categories already assigned. In the sample the requests are spread across 14 categories. The goal of the experiment is to develop a mechanism that automatically classifies requests by their content and gives quality several times better than random classification. Based on the results, we need to decide whether to develop the algorithm further and build a production request-classification service on top of it.
Toolkit
For the experiment I used a Lenovo laptop (Core i7, 8 GB RAM) and the Python 2.7 programming language with the NumPy, Pandas, Scikit-learn and re libraries and the IPython shell. A bit more about the libraries used:
- NumPy: a library with many convenient methods and classes for arithmetic on large multidimensional numerical arrays.
- Pandas: a library that makes it easy and natural to analyze and visualize data and to perform operations on it. The main data structures (object types) are Series (a one-dimensional structure) and DataFrame (a two-dimensional structure, in effect a set of Series of the same length).
- Scikit-learn: a library that implements most machine learning methods.
- re: the regular expression library. Regular expressions are an indispensable tool for tasks that involve text analysis.
From Scikit-learn we will need several modules, whose purpose I will describe as we go. So, let's import all the necessary libraries and modules:
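import pandas as pd
import numpy as np
import re
from sklearn import neighbors, model_selection, ensemble
# GridSearchCV lived in sklearn.grid_search in older releases;
# since scikit-learn 0.18 it is available from sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score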
(The construct import xxx as yy means that we load the xxx library but refer to it in the code as yy.)
Now let's move on to preparing the data.
Data preparation
When you solve your first real (not laboratory) machine learning task, you discover that training the algorithm itself (choosing the algorithm, selecting parameters, comparing the quality of different algorithms, and so on) takes only a small part of the time. Most of the resources go into collecting, analyzing and preparing the data.
There are various techniques, methods and recommendations for preparing data for different classes of machine learning tasks. But most experts call data preparation an art rather than a science. There is even a term for it: feature engineering (that is, constructing the parameters that describe an object).
In the text classification task an object has a single "feature": the text. You cannot feed it to a machine learning algorithm as it is (I admit that I may not know everything :)). The text has to be digitized and formalized somehow.
Within the experiment I used a primitive method of formalizing the text (which nevertheless showed good results). More on that below.
Loading the data
Recall that the source data is an export of 1,200 requests (unevenly distributed across 14 categories). Each request has a Subject field, a Description field and a Category field. The Subject field is a short summary of the request and is mandatory; the Description field is an extended description and may be empty.
We load the data from an .xlsx file into a DataFrame. The .xlsx file has many columns (parameters of real requests), but we only need Subject, Description and Category.
After loading the data we merge the Subject and Description fields into a single field so that further processing is more convenient. To do this, we first need to fill all empty Description fields with something (for example, an empty string).
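# read the export; the three column headers below follow the field names used
# in the text (Subject, Description, Category) and are an assumption,
# so adjust them to the headers in your own .xlsx file
raw = pd.read_excel('issues.xlsx')
# keep only the needed columns in the issues DataFrame, as Theme, Description and Cat
issues = raw[[u'Subject', u'Description', u'Category']].copy()
issues.columns = ['Theme', 'Description', 'Cat']
# fill the empty Description cells with an empty string
issues['Description'] = issues['Description'].fillna('')
# merge Theme and Description (separated by a space) into the Content column
issues['Content'] = issues['Theme'] + ' ' + issues['Description']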
So we have the variable issues of type DataFrame, in which we will work with the columns Content (the merged Subject and Description fields) and Cat (the request category). Let's move on to formalizing the request content (that is, the Content column).
Formalizing the request content
Description of the formalization approach
As mentioned above, the first step is to formalize the request text. We do it as follows:
1. Split the content of each request into words. By a word we mean a sequence of two or more characters bounded by delimiter characters (dash, hyphen, period, space, newline, and so on). As a result, for each request we get an array of the words its content contains.
2. From each request, exclude "parasite words" that carry no semantic load (for example, the words of greeting phrases: "hello", "good", "day", and so on).
3. From the resulting arrays, compile a dictionary: the set of words used to write the content of all requests.
4. Build a matrix of size (number of requests) x (number of words in the dictionary), where the cell in row i and column j holds the number of occurrences of the j-th dictionary word in the i-th request.
The matrix from step 4 is the formalized description of the request content. In mathematical terms, each row of the matrix holds the coordinates of the corresponding request's vector in the dictionary space. We will use this matrix to train the algorithm.
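As a rough illustration of steps 1, 3 and 4, here is a minimal sketch on two made-up requests. The sample texts, the `toy_` names and the whitespace-only splitting are assumptions for the example; step 2 (dropping parasite words) is left out for brevity.

```python
# Toy illustration of the formalization steps; the texts are invented.
toy_requests = ["printer does not print", "printer paper jam"]

# Step 1: split each request into words (here simply on whitespace).
toy_words = [text.split() for text in toy_requests]

# Step 3: the dictionary is every distinct word, in order of first appearance.
toy_dictionary = []
for word_list in toy_words:
    for word in word_list:
        if word not in toy_dictionary:
            toy_dictionary.append(word)

# Step 4: toy_matrix[i][j] = occurrences of dictionary word j in request i.
toy_matrix = [[word_list.count(word) for word in toy_dictionary]
              for word_list in toy_words]

print(toy_dictionary)
print(toy_matrix)
```

Each row of `toy_matrix` is one request expressed as a vector in the space of the six dictionary words.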
An important point: step 3 is performed after a random subsample has been set aside from the training data for checking the quality of the algorithm (the test set). This is needed to get a better idea of the quality the algorithm will show "in production" on new data. For example, it is not hard to build an algorithm that gives perfectly correct answers on the training sample but works no better than random on new data (this situation is called overfitting). Separating the test sample before compiling the dictionary matters for the following reason: if the dictionary were compiled with the test data included, the algorithm trained on the sample would already be familiar with the supposedly unknown objects, and conclusions about its quality on unknown data would be incorrect.
Now let's see what steps 1-4 look like in code.
Splitting the content into words and removing parasite words
First, convert all text to lowercase ("Printer" and "printer" are the same word only for people, not for a machine):
# function that converts a string to lowercase
def lower(text):
    return text.lower()

# apply the function to every cell of the Content column
issues['Content'] = issues.Content.apply(lower)
Next, define an auxiliary dictionary of "parasite words" (it was filled in through iterative experiments on this particular sample of requests):
garbagelist = [u'', u'', u'', u'', u'',u'', u'', u'', u'']
Declare a function that splits the text of each request into words of 2 or more characters and puts the resulting words, excluding the parasite words, into an array:
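def splitstring(text):
    # empty array for the words
    words = []
    # split the text on any of the delimiter characters listed in []
    for i in re.split(r'[;,.\s:\-+()=/«»@\d!?"]', text):
        # keep only "words" of 2 or more characters
        if len(i) > 1:
            # skip parasite words
            if i not in garbagelist:
                words.append(i)
    return words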
The regular expression library re and its split method are used to split the text into words by delimiter characters. The split method receives the set of delimiter characters (the set was extended iteratively) and the string to be split.
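To see what `re.split` returns on such a delimiter set, here is a small sketch. The sample sentence and the shortened delimiter set are assumptions made for the example.

```python
import re

# Hypothetical request text; the delimiter set is a shortened version of the one above.
sample = "printer-2 is broken, please help!"
parts = re.split(r'[;,.\s:\-+()=/@\d!?"]', sample)

# Adjacent delimiters produce empty strings, which the length check removes.
kept = [part for part in parts if len(part) > 1]

print(parts)
print(kept)
```

The empty strings in `parts` are the reason the function filters by length before collecting words.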
Apply the declared function to every request. As output we get the original DataFrame with a new Words column that holds, for each request, the array of words it consists of (excluding the parasite words).
issues['Words'] = issues.Content.apply(splitstring)
Building the dictionary
Now let's compile the dictionary of words that occur in the content of all requests. But before that, as written above, we split the sample into a control sample (also called "test" or "hold-out") and the sample on which the algorithm will be trained.
The split is done by the train_test_split method of the model_selection module of the Scikit-learn library. We pass it the array with the data (the request texts), the array with the labels (the request categories) and the size of the test sample (usually 30% of the total). As output we get 4 objects: the data for training, the data for control, the labels for training and the labels for control.
issues_train, issues_test, labels_train, labels_test = model_selection.train_test_split(issues.Words, issues.Cat, test_size = 0.3)
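Conceptually, the split shuffles the objects and holds out a share of them. The sketch below is a simplified stand-in written for illustration, not the Scikit-learn implementation; the `simple_split` name, the fixed seed and the toy data are assumptions.

```python
import random

def simple_split(items, labels, test_share=0.3, seed=0):
    """Shuffle the indices and hold out `test_share` of the objects for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_share))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # same output order as train_test_split: data first, then labels
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

toy_texts = ["request %d" % i for i in range(10)]
toy_cats = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = simple_split(toy_texts, toy_cats)
print(len(x_train), len(x_test))
```

Every object lands in exactly one of the two parts, and each label travels with its object.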
Now declare a function that compiles a dictionary from the data left for training (issues_train), and apply it to that data:
def WordsDic(dataset):
    # collect every distinct word, in order of first appearance
    WD = []
    seen = set()
    for word_list in dataset:
        for word in word_list:
            if word not in seen:
                seen.add(word)
                WD.append(word)
    return WD

# apply the function to the training data
words = WordsDic(issues_train)
So we have compiled a dictionary of the words that make up the texts of all requests in the training sample (excluding the requests held out for control). The dictionary is stored in the variable words. The words array turned out to contain 12,015 elements (that is, words).
Translating the request content into the dictionary space
Let's move on to the final step of preparing the training data. We build a matrix of size (number of requests in the sample) x (number of words in the dictionary), where row i and column j hold the number of occurrences of the j-th dictionary word in the i-th request of the sample.
# zero matrix of size len(issues_train) x len(words)
train_matrix = np.zeros((len(issues_train), len(words)))
# fill it so that cell [i][j] holds the number of occurrences
# of the j-th word of words in the i-th training request
for i, word_list in enumerate(issues_train):
    for j in word_list:
        if j in words:
            train_matrix[i][words.index(j)] += 1
We now have everything needed for training: the matrix train_matrix (the formalized content of the requests, as coordinates of the corresponding vectors in the space of the dictionary compiled over the training requests) and the labels labels_train (the categories of the requests from the sample left for training).
Training
Let's move on to training the algorithms on labeled data (that is, data for which the correct answers are known: the matrix train_matrix with the labels labels_train). This section has very little code, because most machine learning methods are implemented in the Scikit-learn library. Developing your own methods can help in mastering the material, but from a practical point of view there is no need for it.
Below I will try to explain the principles of specific machine learning methods in simple language.
On the principles of choosing the best algorithm
You never know in advance which machine learning algorithm will give the best result on specific data. But if you understand the task, you can narrow things down to a set of the most suitable algorithms so as not to try every existing one. The choice of the algorithm that will be used to solve the task is made by comparing the quality of the algorithms on the training set.
What counts as the quality of an algorithm depends on the task being solved. Choosing a quality metric is a separate large topic. For request classification a simple metric was chosen: accuracy. Accuracy is defined as the share of objects in the sample for which the algorithm gave the correct answer (assigned the correct request category). So we will choose the algorithm that predicts request categories most accurately.
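The accuracy metric described above fits in a few lines. This is a minimal sketch with invented labels; the experiment itself uses `accuracy_score` from Scikit-learn.

```python
def accuracy(true_labels, predicted_labels):
    """Share of objects whose predicted label equals the true one."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / float(len(true_labels))

# Invented categories, used only to show the calculation.
toy_true = ["1C", "network", "1C", "printer"]
toy_predicted = ["1C", "network", "printer", "printer"]
print(accuracy(toy_true, toy_predicted))
```

Three of the four toy predictions match, so the function returns 0.75.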
It is important to mention the concept of algorithm hyperparameters. Machine learning algorithms have external parameters (that is, parameters that cannot be derived analytically from the training set) that determine the quality of their work. For example, in algorithms that need to compute the distance between objects, distance can mean different things: the Manhattan distance, the classical Euclidean distance, and so on.
Each machine learning algorithm has its own set of hyperparameters. Oddly enough, the best hyperparameter values are chosen by brute force: for each combination of parameter values the quality of the algorithm is computed, and the best combination is then used for that algorithm. The process is computationally expensive, but there is no way around it.
Cross-validation is used to determine the quality of the algorithm on each combination of hyperparameters. Let me explain what it is. The training sample is split into N equal parts. The algorithm is trained in turn on a subsample of N-1 parts, and its quality is computed on the one held-out part. As a result, each of the N parts is used once to compute quality and N-1 times to train the algorithm. The quality of the algorithm on a parameter combination is taken to be the mean of the quality values obtained during cross-validation. Cross-validation is needed so that the obtained quality value can be trusted more (averaging smooths out the possible "skews" of one particular split of the sample). You know where to find a bit more detail.
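The splitting and averaging can be sketched as follows. The function names are hypothetical, and the sketch splits the objects in their original order without stratifying by class, which library implementations may do differently.

```python
def k_fold_indices(n_objects, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds parts."""
    indices = list(range(n_objects))
    # parts are as equal as possible; the first ones absorb the remainder
    fold_sizes = [n_objects // n_folds + (1 if i < n_objects % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_quality(fit_and_score, n_objects, n_folds):
    """Average the quality returned by fit_and_score over all folds."""
    scores = [fit_and_score(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_objects, n_folds)]
    return sum(scores) / float(len(scores))

folds = list(k_fold_indices(7, 3))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Here `fit_and_score` stands for any function that trains on the first index list and returns the quality measured on the second.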
So, to choose the best algorithm, for each algorithm:
- All possible combinations of hyperparameter values are enumerated (each algorithm has its own set of hyperparameters and their values);
- For each combination of hyperparameter values, the quality of the algorithm is computed using cross-validation;
- The algorithm with the combination of hyperparameter values that shows the best quality is chosen.
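The enumeration in the steps above can be sketched like this. The `grid_search` helper and the `toy_quality` function are made up for the illustration; in the experiment the quality of each combination would be the cross-validated accuracy.

```python
from itertools import product

def grid_search(param_grid, quality):
    """Try every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = quality(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Made-up quality function standing in for a cross-validated score.
def toy_quality(params):
    return -abs(params["n_neighbors"] - 3) - 0.1 * params["p"]

best_params, best_score = grid_search(
    {"n_neighbors": range(1, 8), "p": [1, 2]}, toy_quality)
print(best_params, best_score)
```

With 7 values of one parameter and 2 of the other, the quality function is evaluated 14 times, which is why the search gets expensive as grids grow.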
From a programming point of view there is nothing complicated in the algorithm described above. But there is no need to write it either: the Scikit-learn library has a ready-made method for selecting parameters over a grid (the GridSearchCV method of the grid_search module; in scikit-learn 0.18 and later it lives in the model_selection module). All you need to do is pass it the algorithm, the parameter grid and the number N (the number of parts the sample is split into for cross-validation, also called "folds").
Two algorithms were chosen for comparison in this task: k nearest neighbors and a composition of random trees. A story about each of them follows.
k nearest neighbors (kNN)
The k nearest neighbors method is the easiest one to understand. It works as follows.
There is a training sample whose data are already formalized (prepared for training). That is, the objects are represented as vectors of some space. In our case, the requests are represented as vectors in the dictionary space. For each vector of the training set the correct answer is known.
For each new object, the pairwise distances between this object and the objects of the training set are computed. Then the k nearest objects of the training set are taken, and the answer that prevails among those k nearest objects is returned for the new object (for tasks where a number has to be predicted, you can take the mean of the k nearest values).
The algorithm can be developed further by giving more weight to the labels of closer objects. But for the request classification task we will not do this.
Within this task the hyperparameters of the algorithm are the number k (how many nearest neighbors the conclusion is based on) and the definition of distance. The number of neighbors is chosen in the range from 1 to 7, and the distance is chosen between the Manhattan distance (the sum of the absolute values of the coordinate differences) and the Euclidean distance (the square root of the sum of the squared coordinate differences).
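A bare-bones version of the method could look like the sketch below. The vectors and categories are invented, and the code favors clarity over speed; it is not how Scikit-learn implements `KNeighborsClassifier`.

```python
from collections import Counter

def minkowski(a, b, p):
    """p=1 gives the Manhattan distance, p=2 the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_vectors, train_labels, new_vector, k=3, p=1):
    """Return the label that prevails among the k nearest training vectors."""
    by_distance = sorted(zip(train_vectors, train_labels),
                         key=lambda pair: minkowski(pair[0], new_vector, p))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented word-count vectors with invented categories.
toy_vectors = [[2, 0, 0], [1, 1, 0], [0, 0, 3], [0, 1, 2]]
toy_labels = ["printer", "printer", "network", "network"]

print(minkowski([0, 0], [3, 4], 1), minkowski([0, 0], [3, 4], 2))
print(knn_predict(toy_vectors, toy_labels, [1, 0, 0], k=3, p=1))
```

The parameter `p` plays the same role as the `p` hyperparameter searched over in the grid: 1 for Manhattan, 2 for Euclidean.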
Run some simple code:
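%%time
# grid of hyperparameter values to search over
param_grid = {'n_neighbors': np.arange(1, 8), 'p': [1, 2]}
# number of folds for cross-validation
cv = 3
# the classifier to tune
estimator_kNN = neighbors.KNeighborsClassifier()
# grid search: every combination is scored with cv-fold cross-validation
optimazer_kNN = GridSearchCV(estimator_kNN, param_grid, cv=cv)
# run the search on the training data
optimazer_kNN.fit(train_matrix, labels_train)
# best cross-validated quality and the hyperparameters that gave it
print(optimazer_kNN.best_score_)
print(optimazer_kNN.best_params_)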
After 2 minutes 40 seconds we learn that the best quality, 53.23%, is shown by the algorithm with 3 nearest neighbors determined by the Manhattan distance.
Composition of random trees
Decision trees
Decision trees are another machine learning algorithm. Training the algorithm consists of partitioning the training sample step by step into parts (most often into two, although in general this is not required) according to some feature.
A decision tree has internal nodes (in which decisions about further partitioning of the sample are made) and terminal nodes (leaves), which are used to give predictions for the objects that fall into them.
In a decision node a simple condition is checked: whether some (in general, the j-th) feature x_j of the object is less than or equal to some value t. Objects that satisfy the condition are sent to one branch, and objects that do not are sent to the other.
When training the algorithm, one could keep splitting the training set until every node holds a single object. This approach gives an excellent result on the training sample, but on unknown data the result is worthless. It is therefore important to define a "stopping criterion": a condition under which a node becomes a leaf and further branching at that node stops. The stopping criterion depends on the task. Some types of criteria are the minimum number of objects in a node and a limit on the depth of the tree. In this task the criterion of the minimum number of objects in a node was used. The number equal to the minimum number of objects is a hyperparameter of the algorithm.
New objects (those that need a prediction) are run through the trained tree and fall into the corresponding leaf. For the objects that fall into a leaf we give the following answer:
- For classification tasks, return the class that is most common among the training objects in that leaf;
- For regression tasks (that is, tasks where the answer is a number), return the mean value over the training objects from that leaf.
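It remains to explain how to choose, for each node, the feature j (by which feature the sample is split at a particular node of the tree) and the threshold t that corresponds to this feature. For this purpose an error criterion Q(X_m, j, t) is introduced. As you can see, the error criterion depends on the sample X_m (the part of the training sample that reached the node in question), on the parameter j by which the sample X_m is split at this node, and on the threshold t. We need to choose j and t so that the error criterion is minimal. Since the set of possible values of j and t for each training set is finite, the problem is solved by enumeration.
What is the error criterion? In the draft of the article this place held many formulas and an accompanying story about the informativeness criterion and its special cases (the Gini criterion and the entropy criterion). But the article had already become very bloated. Those who want to understand the formalities and the mathematics can read all about it on the internet (for example, here). I will limit myself to the "physical meaning" in plain terms. The error criterion shows the level of "diversity" of the objects in the subsamples obtained after the split. "Diversity" in classification tasks means the presence of different classes, and in regression tasks (where numbers are predicted) it means variance. So when splitting a sample, we want to minimize the "diversity" in the resulting subsamples.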
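As one possible concrete reading of the error criterion, the sketch below uses the Gini criterion mentioned above as the measure of "diversity" and weights the two subsamples by their size. The weighting scheme, the function names and the toy data are assumptions for illustration; the article does not spell out its formulas.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure subsample, larger for mixed classes."""
    total = float(len(labels))
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_quality(values, labels, t):
    """Q for the split `value <= t`: size-weighted impurity of both parts."""
    left = [l for v, l in zip(values, labels) if v <= t]
    right = [l for v, l in zip(values, labels) if v > t]
    n = float(len(labels))
    return sum(len(part) / n * gini(part) for part in (left, right) if part)

def best_threshold(values, labels):
    """Enumerate candidate thresholds and keep the one with minimal Q."""
    # the largest value is skipped: splitting there leaves one side empty
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda t: split_quality(values, labels, t))

# One invented feature and two invented classes.
toy_feature = [0, 1, 2, 3, 4, 5]
toy_classes = ["a", "a", "a", "b", "b", "b"]
print(best_threshold(toy_feature, toy_classes))
print(split_quality(toy_feature, toy_classes, 2))
```

For a full tree the same enumeration would also run over every feature j, not just over the thresholds of one feature.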
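We have dealt with trees. Now on to compositions of trees.
Compositions of decision trees
Decision trees can uncover very complex patterns in the training set. The better a tree fits the training data, though, the worse it tends to do on new data (overfitting).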
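A composition combines N trees. Each tree is trained on its own random part of the training data (objects and features), so the trees come out different from one another.
A new object is then run through all N trees, which gives N answers. For classification, the answer of the composition is the class that most trees voted for; for regression, it is the mean of the trees' answers.
In this task the algorithm has 2 hyperparameters: the number of trees in the composition (n_estimators) and the minimum number of objects a node must hold to be split further (min_samples_split).
Run the code: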
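%%time
# grid of hyperparameter values to search over
param_grid = {'n_estimators': np.arange(20, 101, 10), 'min_samples_split': np.arange(4, 11, 1)}
# number of folds for cross-validation
cv = 3
# the classifier to tune
estimator_tree = ensemble.RandomForestClassifier()
# grid search: every combination is scored with cv-fold cross-validation
optimazer_tree = GridSearchCV(estimator_tree, param_grid, cv=cv)
# run the search on the training data
optimazer_tree.fit(train_matrix, labels_train)
# best cross-validated quality and the hyperparameters that gave it
print(optimazer_tree.best_score_)
print(optimazer_tree.best_params_)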
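After 3 minutes 30 seconds we learn that the best quality, 65.82%, is shown by the algorithm with 60 trees and a minimum of 4 objects per node.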
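Results
Now let's check the tuned algorithms on the hold-out sample (that is, on data they did not see during training).
First we build test_matrix, the matrix for the test requests, in the same way as before (that is, like train_matrix, but from the test data and with the dictionary compiled on the training sample).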
# zero matrix of size len(issues_test) x len(words)
test_matrix = np.zeros((len(issues_test), len(words)))
# fill it so that cell [i][j] holds the number of occurrences
# of the j-th word of words in the i-th test request
for i, word_list in enumerate(issues_test):
    for j in word_list:
        if j in words:
            test_matrix[i][words.index(j)] += 1
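Quality is measured with the accuracy_score function from the metrics module of Scikit-learn, which compares the predicted labels with the true ones: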
# accuracy of both tuned models on the test sample
print('Random forest: %s' % accuracy_score(labels_test, optimazer_tree.best_estimator_.predict(test_matrix)))
print('kNN: %s' % accuracy_score(labels_test, optimazer_kNN.best_estimator_.predict(test_matrix)))
We get 51.39% for k nearest neighbors and 73.46% for the composition of random trees.
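Comparison with "dumb" algorithms
Recall that the goal was quality several times better than random classification, so a quality figure means little until it is compared with algorithms that ignore the request content altogether.
There are 3 such "dumb" algorithms to compare against:
- pure random;
- always the most popular category;
- Random that follows the category shares of the training sample.
Pure random over 14 categories gives 1/14 * 100% = 7.14% quality. Always answering with the most popular category gives 14.5% (the share of that category). For the random algorithm that follows the category shares, the quality is easier to measure than to derive. Here is that random algorithm: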
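# import the random module
import random
# list for the answers of the random algorithm
rand_ans = []
# for every test request, take the category of a randomly chosen training request
# (randrange excludes the upper bound, so the index always stays in range)
for i in range(test_matrix.shape[0]):
    rand_ans.append(labels_train[labels_train.index[random.randrange(len(labels_train))]])
# quality of the random algorithm
print('random: %s' % accuracy_score(labels_test, rand_ans))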
We get 14.52%.
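For reference, the expected quality of the two random baselines can also be estimated directly. The sketch assumes the test sample has the same category shares as the labels being guessed from; the function names and the toy labels are made up.

```python
from collections import Counter

def uniform_baseline(n_categories):
    """Expected accuracy of guessing one of n categories uniformly at random."""
    return 1.0 / n_categories

def proportional_baseline(labels):
    """Expected accuracy of guessing with the labels' own category shares."""
    total = float(len(labels))
    return sum((count / total) ** 2 for count in Counter(labels).values())

print(round(uniform_baseline(14) * 100, 2))
print(proportional_baseline(["a", "a", "b", "c"]))
```

The second function sums the squared category shares, since a guess is right when the random pick and the true category coincide.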
So both algorithms do several times better than the "dumb" ones. Hooray!
What's next?
, 90% â . , . ââ ( , ). -, (: " ", " ", " " ..) â ( ) .
, : . " " "", . . , , , .
, Okdesk .
"! !" ïŒcïŒ