Recently I needed a library for fixing typos. Most open-source spell checkers (hunspell, for example) do not take context into account, and without it decent accuracy is hard to reach. Taking Peter Norvig's spell checker as a base, I added a language model (based on N-grams), sped it up (using the SymSpell approach), cut its heavy memory consumption (with a Bloom filter and perfect hashing), and packaged the result as a C++ library with SWIG bindings for other languages.
Quality metrics
Before writing the spell checker itself, I needed a way to measure its quality. For this purpose Norvig used a ready-made collection of typos: a list of misspelled words together with their correct versions. That method does not fit our case because it carries no context. Instead, a simple typo generator was written first.
The typo generator takes a word as input and returns that word with errors. There are four kinds of error: replacing one letter with another, inserting a new letter, deleting an existing letter, and swapping two adjacent letters. The probability of each error type is configured separately, as are the overall probability of a typo (which depends on word length) and the probability of a second typo in the same word.
So far all parameters have been chosen by intuition: the probability of an error is about 1 in 10 words, and the simplest kind of error (replacing one letter with another) is 7 times more likely than each of the other kinds.
This model has many shortcomings: it is not based on real typo statistics, it ignores the keyboard layout, and it neither glues words together nor splits them. It is good enough for the initial version, though, and it will be improved in future versions of the library.
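A generator along these lines can be sketched in a few lines of Python. This is only an illustration, not the library's code: the names `add_typo` and `corrupt_text`, the default probabilities, and the lowercase Latin alphabet are assumptions, and the dependence of the typo probability on word length is left out for brevity.

```python
import random
import string

# Relative weights of the four error types: substitution is 7x more likely
# than each of the others, as described in the text. The rest is assumed.
ERROR_WEIGHTS = {"substitute": 7, "insert": 1, "delete": 1, "transpose": 1}


def add_typo(word, rng, alphabet=string.ascii_lowercase):
    """Apply one random edit to `word` and return the result."""
    kind = rng.choices(list(ERROR_WEIGHTS), weights=list(ERROR_WEIGHTS.values()))[0]
    if kind == "insert":
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + rng.choice(alphabet) + word[pos:]
    if not word:
        return word
    if kind == "transpose" and len(word) >= 2:
        pos = rng.randrange(len(word) - 1)
        return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    pos = rng.randrange(len(word))
    if kind == "delete":
        return word[:pos] + word[pos + 1:]
    # Substitution (also the fallback for transposing a one-letter word).
    replacement = rng.choice([c for c in alphabet if c != word[pos]])
    return word[:pos] + replacement + word[pos + 1:]


def corrupt_text(words, rng, p_typo=0.1, p_second=0.1):
    """Return a copy of `words` in which roughly p_typo of them contain typos."""
    noisy = []
    for word in words:
        if rng.random() < p_typo:
            word = add_typo(word, rng)
            if rng.random() < p_second:  # occasionally add a second typo
                word = add_typo(word, rng)
        noisy.append(word)
    return noisy
```

Passing an explicit `random.Random(seed)` keeps the generated test set reproducible, which matters when comparing spell checkers on the same corrupted text.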
With the typo generator, any text can be run through it to obtain a similar text with errors. As a quality metric for a spell checker, we can use the percentage of errors that remain in the text after the spell checker has processed it. In addition to this metric, the following were used:
- the percentage of corrected words (unlike the error-rate metric, it is computed only over the words that contained errors, not over the whole text)
- the percentage of broken words (cases where a word had no error, but the spell checker decided it did and "corrected" it)
- the percentage of words for which the correct variant appeared in a list of N candidates (spell checkers usually offer several correction options)
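The first three metrics can be computed from three aligned word lists: the original text, the text after the typo generator, and the text after the spell checker. The sketch below is an illustration only; the function name `evaluate` and the choice of denominators (especially for the broken-word rate) are assumptions about how the definitions above are meant. The top-N metric would additionally need the candidate lists and is omitted.

```python
def evaluate(original, corrupted, fixed):
    """Compute error rate, fix rate and broken rate for aligned word lists."""
    assert len(original) == len(corrupted) == len(fixed)
    total = len(original)
    remaining = had_error = fixed_ok = was_clean = broken = 0
    for orig, bad, out in zip(original, corrupted, fixed):
        if out != orig:
            remaining += 1          # still wrong after spell checking
        if bad != orig:
            had_error += 1
            fixed_ok += out == orig  # a typo that was repaired
        else:
            was_clean += 1
            broken += out != orig    # a correct word that was damaged
    return {
        "errors": remaining / total if total else 0.0,
        "fix_rate": fixed_ok / had_error if had_error else 0.0,
        "broken": broken / was_clean if was_clean else 0.0,
    }
```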
Peter Norvig's spell checker
Peter Norvig described a simple spell checker. For each word it generates all possible variations (deletion + insertion + replacement + transposition), recursively, to a depth of at most 2. The resulting words are checked against a dictionary (a hash table), and the most frequently occurring word is picked from the set of matching options. For details about this spell checker, see Norvig's original article.
The main drawbacks of this spell checker are its long running time (especially on long words) and its lack of context awareness. We start by fixing the latter: we add a language model and use the score it returns in place of the plain word frequency.
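For reference, the core of that approach fits in a short function. The sketch below follows the structure of Norvig's published code but is simplified; the `counts` dictionary of word frequencies and the lowercase Latin alphabet are assumptions made for the example.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Pick the most frequent dictionary word within edit distance 2."""
    if word in counts:
        return word
    one_edit = edits1(word)
    candidates = [w for w in one_edit if w in counts]
    if not candidates:
        # Depth 2: this is the step that gets slow for long words.
        candidates = [w2 for w1 in one_edit for w2 in edits1(w1) if w2 in counts]
    return max(candidates, key=counts.get) if candidates else word
```

For a word of length L over a 26-letter alphabet, `edits1` produces on the order of 54L strings, and the depth-2 step squares that, which is where the running time goes.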
N-gram language model
An N-gram is a sequence of n elements, for example a sequence of sounds, syllables, words, or letters.
A language model answers the question: how likely is this sentence to occur in the language? Two main approaches are in use today: models based on N-grams and models based on neural networks. For the first version of the library the N-gram model was chosen because it is simpler. There are plans to try a neural network model in the future.
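The N-gram model works as follows. We slide a window of N words over the text used to train the model and count how many times each combination (n-gram) occurs. When the model is queried, we slide the same window over the sentence and take the product of the probabilities of all its n-grams. The probability of encountering an n-gram is estimated from the number of such n-grams in the training text.
The probability P(w_1, ..., w_m) of encountering a sentence of m words (w_1, ..., w_m) is approximately equal to the product of the probabilities of all n-grams of size n that make up the sentence.
The probability of each n-gram is the number of times that n-gram was encountered divided by the number of times the same n-gram without its last word was encountered.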
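In standard notation these two statements are usually written as follows, where count(·) is the number of occurrences in the training text. This is the textbook formulation and not necessarily the exact form used in the library.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P\bigl(w_i \mid w_{i-(n-1)}, \dots, w_{i-1}\bigr)

P\bigl(w_i \mid w_{i-(n-1)}, \dots, w_{i-1}\bigr)
  = \frac{\operatorname{count}(w_{i-(n-1)}, \dots, w_{i-1}, w_i)}
         {\operatorname{count}(w_{i-(n-1)}, \dots, w_{i-1})}
```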
In practice such a model is not used in its pure form because of the following problem: if some n-gram never occurred in the training text, the whole sentence immediately gets zero probability. To solve this, one of the smoothing techniques is used. In the simplest form, smoothing adds one to the occurrence count of every n-gram; in more elaborate forms, lower-order n-grams are used when a higher-order n-gram is missing.
The most popular smoothing technique is Kneser-Ney smoothing. It requires storing additional information for every n-gram, however, and the gain over simple smoothing turned out to be small (at least in experiments with small models, up to 50 million n-grams). For simplicity, the smoothed probability of each n-gram is taken to be the product of the n-grams of all orders; for a trigram, for example, that is the product of the unigram, bigram, and trigram estimates.
Now that we have a language model, we pick, among the candidate corrections, the one for which the language model gives the best score given the context. In addition, to avoid a large number of false positives, we add a small penalty to the score for changing the original word. Varying this penalty tunes the false-positive rate: in a text editor, for example, a higher false-positive rate can be tolerated, while for automatic text correction it should be lower.
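The simplest variant, add-one smoothing, can be illustrated with a tiny bigram model. This is a minimal sketch rather than the library's implementation: the class name `BigramModel`, the `<s>` start marker, and the use of log-probabilities are choices made for the example.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for words in sentences:
            padded = ["<s>"] + list(words)
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, words):
        """Log-probability of a sentence; unseen bigrams never yield zero."""
        padded = ["<s>"] + list(words)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total
```

Working in log space avoids numeric underflow when many small probabilities are multiplied; comparing two sentences by `log_prob` gives the same ordering as comparing their probabilities.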
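The selection step with a penalty can be sketched as below. The function `best_candidate`, its arguments, and the idea of passing the language model as a `score` callable that returns a log-probability are assumptions made for illustration; the library's internals may differ.

```python
def best_candidate(left, word, right, candidates, score, penalty=1.0):
    """Choose the replacement for `word` that scores best in its context.

    `score` maps a list of words to a log-probability-like number.
    Every candidate other than the original word pays `penalty`.
    """
    best, best_score = word, score(left + [word] + right)
    for candidate in candidates:
        if candidate == word:
            continue
        candidate_score = score(left + [candidate] + right) - penalty
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

A larger `penalty` makes the checker more conservative: it replaces a word only when the context strongly favours the alternative.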
SymSpell
The next problem with Norvig's spell checker is that it is slow when there are no candidates. For a 15-letter word the algorithm runs for about a second, which is not good enough for practical use. One way to speed it up is the SymSpell algorithm, which according to its author runs a million times faster. SymSpell works on the following principle: for every dictionary word, its deletions are added to a separate index together with a reference to the original word. The deletions are all the strings obtained from the original word by removing one or more letters (usually 1 and 2). When searching for candidates, the same deletions are applied to the input word and looked up in the index. This algorithm handles all kinds of error correctly: replacement, transposition, insertion, and deletion of letters.
As an example, consider a replacement (only distance 1 is considered here). Suppose the dictionary contains the word "test" and we typed the word "temt". The index contains all deletions of "test": est, tst, tet, tes. For the word "temt" the deletions are emt, tmt, tet, tem. The deletion "tet" is in the index, so the word "test" corresponds to the misspelled word "temt".
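The example above can be reproduced with a small deletion index. This is a simplified sketch, not SymSpell itself: the function names are invented for the example, and the final edit-distance check that a real implementation performs on each candidate is omitted.

```python
from collections import defaultdict


def deletes(word, max_distance=1):
    """All strings obtained from `word` by deleting 1..max_distance letters."""
    result, frontier = set(), {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        result |= frontier
    return result


def build_index(dictionary, max_distance=1):
    """Map every deletion (and the word itself) to its source words."""
    index = defaultdict(set)
    for word in dictionary:
        index[word].add(word)
        for deletion in deletes(word, max_distance):
            index[deletion].add(word)
    return index


def candidates(word, index, max_distance=1):
    """Dictionary words that share a deletion with `word`."""
    found = set()
    for key in {word} | deletes(word, max_distance):
        found |= index.get(key, set())
    return found
```

With `index = build_index(["test"])`, the call `candidates("temt", index)` finds "test" through the shared deletion "tet". The lookup cost depends only on the number of deletions of the input word, not on the alphabet size, which is where the speed-up over generating all edits comes from.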
Perfect hash
The next problem is memory consumption. A model trained on 2 million sentences (1 million from Wikipedia plus 1 million from news texts) occupied 7 GB of RAM. About half of that was used by the language model (n-grams with their occurrence counts) and the other half by the SymSpell index. With such memory consumption the application was not very practical to use.
I did not want to shrink the dictionary, because quality started to drop noticeably. As it turned out, this is not a new problem. Research papers propose various ways to reduce the memory consumption of a language model. One interesting approach (see the paper "Efficient Minimal Perfect Hash Language Models") is to store the n-gram information using a perfect hash (specifically, the CHD algorithm). A perfect hash is a hash that produces no collisions on a fixed data set. With no collisions there is no need to compare keys, and therefore no need to store them. As a result, it is enough to keep in memory an array whose length equals the number of n-grams and which stores their occurrence counts. This saves a great deal of memory, because the n-grams themselves take up far more space than their counts.
There is one problem, though. When the model is used, it will receive n-grams that never appeared in the training text. For such an n-gram, the perfect hash returns the hash of some other, existing n-gram. To solve this, the authors of the paper propose storing an additional hash for each n-gram, which makes it possible to check whether the n-gram matches. If the hashes differ, the n-gram does not exist and its occurrence count is taken to be zero.
For example, suppose we have three n-grams, n1, n2, and n3, which occurred 10, 15, and 3 times, and also an n-gram n4 that did not occur in the source text.
N-gram | n1 | n2 | n3 | n4 |
Perfect hash | 1 | 0 | 2 | 1 |
Second hash | 42 | 13 | 24 | 18 |
Frequency | 10 | 15 | 3 | 0 |
We create an array that stores the occurrence count together with the additional hash, and use the value of the perfect hash as the index into this array:
15, 13 | 10, 42 | 3, 24 |
Suppose we encounter the n-gram n1. Its perfect hash is 1 and its second hash is 42. We go to index 1 of the array and compare the hash stored there. It matches, so the count of the n-gram is 10. Now consider the n-gram n4. Its perfect hash is also 1, but its second hash is 18. This differs from the hash stored at index 1, so the occurrence count is 0.
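The lookup just described can be written down directly. In this sketch the two hash functions are replaced by hard-coded dictionaries holding the values from the table above; a real implementation would compute them (for example with CHD for the perfect hash), so everything named here is illustrative.

```python
# Toy stand-ins for the two hash functions, hard-coded from the example.
PERFECT_HASH = {"n1": 1, "n2": 0, "n3": 2, "n4": 1}
SECOND_HASH = {"n1": 42, "n2": 13, "n3": 24, "n4": 18}

# One (count, fingerprint) slot per known n-gram; the n-grams themselves
# are not stored anywhere.
TABLE = [(15, 13), (10, 42), (3, 24)]


def frequency(ngram):
    """Return the stored count, or 0 if the fingerprint does not match."""
    count, fingerprint = TABLE[PERFECT_HASH[ngram]]
    return count if fingerprint == SECOND_HASH[ngram] else 0
```

Here `frequency("n1")` returns 10, while `frequency("n4")` lands in the same slot, fails the fingerprint comparison, and returns 0.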
In practice a 16-bit CityHash was used as the second hash. A hash of course does not eliminate false positives completely, but it makes them rare enough that they do not affect the final quality metrics.
The occurrence counts themselves were also encoded more compactly, from 32-bit numbers down to 16-bit ones, by means of non-linear quantization: small numbers are mapped 1 to 1, larger ones 1 to 2, then 1 to 4, and so on. Again, quantization had no effect on the final metrics.
Most likely both the hash and the occurrence count can be packed even more tightly, but that is left for future versions. In the current version the model shrank to 260 MB, more than 10 times smaller, with no loss of quality.
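One way such a quantization could look is sketched below. The scheme is hypothetical: the article does not give the actual segment sizes, so the constant `SEGMENT` and the doubling of the step in every segment are assumptions chosen only to match the "1 to 1, 1 to 2, 1 to 4" description.

```python
SEGMENT = 4096  # codes per segment (assumed); 16 segments fill a 16-bit code


def quantize(count):
    """Map a non-negative count to a 16-bit code with growing step size."""
    base, step = 0, 1
    for segment in range(16):
        span = SEGMENT * step
        if count < base + span:
            return segment * SEGMENT + (count - base) // step
        base += span
        step *= 2
    return 16 * SEGMENT - 1  # saturate at the largest code


def dequantize(code):
    """Return the smallest count that maps to `code`."""
    segment, offset = divmod(code, SEGMENT)
    step = 1 << segment
    base = SEGMENT * (step - 1)  # total span of all earlier segments
    return base + offset * step
```

Counts below 4096 survive the round trip exactly; above that the error is bounded by the step of the segment, so the relative error stays small, which fits the observation that rare and frequent n-grams are both scored accurately enough.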
Bloom filter
Besides the language model there was still the index from the SymSpell algorithm, which also took up a lot of space. There was no ready-made solution for it, so I had to think a little more. Research papers on compact representations of language models often use a Bloom filter, and it looked like it could help with this task as well. The Bloom filter could not be applied head-on: for every entry in the deletion index we need references to the original words, and a Bloom filter does not store values, it only checks for presence. On the other hand, if the Bloom filter says that a deletion is in the index, we can recover the original word by performing insertions on the deletion and checking the results against the dictionary. The final adaptation of the SymSpell algorithm is as follows.
We store all deletions of the dictionary words in a Bloom filter. When searching for candidates, we first make deletions from the input word down to the required depth (as in SymSpell). Unlike SymSpell, however, the next step for each deletion is to perform insertions and check the resulting words against the original dictionary. The deletion index stored in the Bloom filter is used to skip insertions for deletions that are not in it. False positives are harmless here: they only cause a little extra work.
The performance of the resulting solution practically did not drop, while the memory used fell very substantially, to 140 MB (about 25 times less). As a result, the total memory size was reduced from 7 GB to 400 MB.
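Results
The table below shows the results for English text. Training used 300K sentences from Wikipedia and 300K sentences from news texts (the texts come from the Leipzig Corpora Collection, listed in the references). The original sample was split into two parts: 95% was used for training and 5% for evaluation. Results: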
Spell checker | Errors | Top-7 errors | Fix rate | Top-7 fix rate | Broken | Speed (words per second) |
JamSpell | 3.25% | 1.27% | 79.53% | 84.10% | 0.64% | 1833 |
Norvig | 7.62% | 5.00% | 46.58% | 66.51% | 0.69% | 395 |
Hunspell | 13.10% | 10.33% | 47.52% | 68.56% | 7.14% | 163 |
Dummy | 13.14% | 13.14% | 0.00% | 0.00% | 0.00% | - |
JamSpell is the resulting spell checker. Dummy is a corrector that does nothing; it is included to show the percentage of errors in the source text. Norvig is Peter Norvig's spell checker. Hunspell is one of the most popular open-source spell checkers. To keep the experiment clean, a check on a literary text was carried out as well. Metrics on the text of "The Adventures of Sherlock Holmes":
Spell checker | Errors | Top-7 errors | Fix rate | Top-7 fix rate | Broken | Speed (words per second) |
JamSpell | 3.56% | 1.27% | 72.03% | 79.73% | 0.50% | 1764 |
Norvig | 7.60% | 5.30% | 35.43% | 56.06% | 0.45% | 647 |
Hunspell | 9.36% | 6.44% | 39.61% | 65.77% | 2.95% | 284 |
Dummy | 11.16% | 11.16% | 0.00% | 0.00% | 0.00% | - |
In both tests JamSpell showed better quality and performance than Hunspell and Norvig's spell checker, both with a single candidate and with the top 7 candidates.
The following table shows the metrics for different languages and for training sets of different sizes:
Language (training set) | Errors | Top-7 errors | Fix rate | Top-7 fix rate | Broken | Speed (words per second) | Memory |
English (300K Wikipedia + 300K news) | 3.25% | 1.27% | 79.53% | 84.10% | 0.64% | 1833 | 86.2 MB |
Russian (300K Wikipedia + 300K news) | 4.69% | 1.57% | 76.77% | 82.13% | 1.07% | 1482 | 138.7 MB |
Russian (1M Wikipedia + 1M news) | 3.76% | 1.22% | 80.56% | 85.47% | 0.71% | 1375 | 341.4 MB |
German (300K Wikipedia + 300K news) | 5.50% | 2.02% | 70.76% | 75.33% | 1.08% | 1559 | 189.2 MB |
French (300K Wikipedia + 300K news) | 3.32% | 1.26% | 76.56% | 81.25% | 0.76% | 1543 | 83.9 MB |
Summary
The result is a fast, high-quality spell checker that outperforms comparable open solutions. Possible uses include text editors, messengers, and preprocessing of dirty text in machine learning tasks.
The source code is available on GitHub under the MIT license. The library is written in C++, and bindings for other languages are available through SWIG. Example of use in Python:
    import jamspell

    corrector = jamspell.TSpellCorrector()
    corrector.LoadLangModel('model_en.bin')

    # Fix a whole fragment of text
    corrector.FixFragment('I am the begt spell cherken!')
    # u'I am the best spell checker!'

    # Candidates for the word at position 3 ('begt')
    corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
    # (u'best', u'beat', u'belt', u'bet', u'bent', ... )

    # Candidates for the word at position 5 ('cherken')
    corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
    # (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
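Further improvements include better language model quality, lower memory consumption, handling of words that are glued together or split apart, and support for the specifics of different languages. If anyone would like to take part in improving the library, I will be glad to receive your pull requests.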
References
- JamSpell sources: github.com/bakwc/JamSpell
- How to Write a Spelling Corrector, Peter Norvig: norvig.com/spell-correct.html
- Hunspell: github.com/hunspell/hunspell
- Language Modeling with N-grams, Daniel Jurafsky and James H. Martin: web.stanford.edu/~jurafsky/slp3/4.pdf
- Understanding LSTM Networks, Christopher Olah: colah.github.io/posts/2015-08-Understanding-LSTMs/
- N-gram Language Modeling using Recurrent Neural Network Estimation, Google Tech Report, Ciprian Chelba, Mohammad Norouzi, Samy Bengio: static.googleusercontent.com/media/research.google.com/en//pubs/archive/46183.pdf
- An Empirical Study of Smoothing Techniques for Language Modeling, Stanley F. Chen and Joshua Goodman: u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf
- 1000x Faster Spelling Correction algorithm, Wolf Garbe: blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
- Efficient Minimal Perfect Hash Language Models, David Guthrie, Mark Hepple, Wei Liu: www.lrec-conf.org/proceedings/lrec2010/pdf/860_Paper.pdf
- Perfect hash function, Wikipedia: en.wikipedia.org/wiki/Perfect_hash_function
- Hash, displace, and compress, Djamal Belazzougui, Fabiano C. Botelho, Martin Dietzfelbinger: cmph.sourceforge.net/papers/esa09.pdf
- Bloom filter, habrahabr: habrahabr.ru/post/112069/
- Leipzig Corpora Collection: wortschatz.uni-leipzig.de/en/download/