Introduction
Fuzzy search algorithms (also known as similarity search or fuzzy string search) are the basis of spell-checking systems and of full-fledged search engines such as Google and Yandex. For example, such algorithms power the "Did you mean ..." feature of those search engines.
This review article covers the following concepts, methods and algorithms:
- Levenshtein distance
- Damerau-Levenshtein distance
- Bitap algorithm with modifications by Wu and Manber
- Sample expansion algorithm
- N-gram method
- Signature hashing
- BK-trees
So...
Fuzzy search is a very useful feature of any search engine. At the same time, implementing it efficiently is much harder than implementing a simple exact-match search.
The fuzzy search problem can be stated as follows:
"Given a word, find in a text or dictionary of size n all words that match this word (or start with it), allowing for k possible differences."
For example, for the Russian query "Mashina" ("machine") with two possible errors, the search should find words such as "Mashinka", "Makhina", "Malina" and "Kalina".
Fuzzy search algorithms are characterized by a metric: a function of the distance between two words that measures how similar they are in a given context. The strict mathematical definition of a metric requires, among other things, the triangle inequality (X is the set of words, p is the metric): p(x, y) <= p(x, z) + p(z, y) for all x, y, z in X.
In most cases, however, "metric" is used in a more general sense that does not require this condition; this looser concept is also called a distance.
The best-known metrics are the Hamming, Levenshtein and Damerau-Levenshtein distances. Note that the Hamming distance is a metric only on sets of words of equal length, which greatly limits where it can be applied.
In practice the Hamming distance is of little use and gives way to metrics that feel more natural from a human point of view. These are discussed below.
Levenshtein distance
The most commonly used metric is the Levenshtein distance, or edit distance. Algorithms for computing it can be found at every turn.
Even so, a few remarks are in order about the most popular way of computing it, the Wagner-Fischer method.
The original version of this algorithm has O(mn) time complexity and uses O(mn) memory, where m and n are the lengths of the compared strings. The whole computation can be pictured as filling an (m + 1) x (n + 1) matrix.
If you look at how the algorithm proceeds, each step uses only the last two rows of the matrix, so memory consumption can be reduced to O(min(m, n)).
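The two-row idea can be sketched as follows. This is a minimal illustration, not the article's own implementation; the class and method names are made up for the example.

```java
final class LevenshteinSketch {
    /** Wagner-Fischer edit distance that keeps only two rows of the matrix. */
    static int distance(String a, String b) {
        // Put the shorter string along the row so the arrays are O(min(m, n)).
        if (a.length() < b.length()) {
            String t = a; a = b; b = t;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t; // reuse the arrays
        }
        return prev[b.length()];
    }
}
```

Swapping the two arrays instead of allocating a new row keeps the memory footprint constant across iterations.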
But that is not all: the algorithm can be optimized further when the task is to find no more than k differences. In that case only a diagonal band of width 2k + 1 needs to be computed in the matrix (the Ukkonen cut-off), which reduces the time complexity to O(k min(m, n)).
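A possible shape of the band-limited computation is sketched below. It is an assumption-laden illustration rather than the author's code: the name `distanceAtMost` and the convention of returning -1 when the limit is exceeded are choices made for this example.

```java
final class BoundedLevenshteinSketch {
    /** Returns the edit distance of a and b if it is at most k, otherwise -1. */
    static int distanceAtMost(String a, String b, int k) {
        if (k < 0) return -1;
        if (a.length() > b.length()) { // make a the shorter string
            String t = a; a = b; b = t;
        }
        int m = a.length(), n = b.length();
        if (n - m > k) return -1;      // length difference alone exceeds k
        final int inf = k + 1;         // any value above k means "too far"
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= Math.min(n, k + 1); j++) {
            prev[j] = Math.min(j, inf);
        }
        for (int i = 1; i <= m; i++) {
            int lo = Math.max(1, i - k); // band limits for this row
            int hi = Math.min(n, i + k);
            curr[lo - 1] = (i <= k) ? i : inf; // cell just left of the band
            int rowMin = curr[lo - 1];
            for (int j = lo; j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int v = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                 prev[j - 1] + cost);
                curr[j] = Math.min(v, inf);
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (hi < n) curr[hi + 1] = inf; // cell just right of the band
            if (rowMin > k) return -1;      // the whole band is already too far
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[n] <= k ? prev[n] : -1;
    }
}
```

The inner loop touches at most 2k + 1 cells per row, and the early exit stops as soon as no cell in the band can still lead to a distance within the limit.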
Prefix distance
It is also often necessary to compute the distance between a prefix pattern and a string, that is, the distance between the given prefix and the closest prefix of the string. In this case you take the smallest of the distances from the prefix pattern to every prefix of the string. Clearly, the prefix distance is not a metric in the strict mathematical sense, which limits its use.
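One way to compute this, sketched here under the assumption that the underlying distance is plain Levenshtein, is to fill the matrix as usual and then take the minimum over the last row, since cell j of that row is the distance from the whole pattern to the first j characters of the string. The names below are illustrative.

```java
final class PrefixDistanceSketch {
    /** Smallest Levenshtein distance from pattern to any prefix of text. */
    static int prefixDistance(String pattern, String text) {
        int m = pattern.length(), n = text.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= n; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = pattern.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        // prev now holds the last row: distance(pattern, text[0..j)) for each j.
        int best = prev[0];
        for (int j = 1; j <= n; j++) {
            best = Math.min(best, prev[j]);
        }
        return best;
    }
}
```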
In fuzzy search, what often matters is not the distance itself but whether it exceeds a certain value.
Damerau-Levenshtein distance
This variation adds one more rule to the definition of the Levenshtein distance: besides insertion, deletion and substitution, the transposition (swap) of two adjacent letters also counts as a single operation.
Until a few years ago, Frederick Damerau could have asserted that most typing errors are simply transpositions. This is why this particular metric gives the best results in practice.
To compute this distance, it is enough to modify the regular Levenshtein algorithm slightly: keep the last three rows of the matrix instead of the last two, and add a corresponding extra condition so that, when a transposition is detected, its cost is taken into account in the distance.
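The modification can be sketched like this. Note the assumption: this is the restricted variant (often called optimal string alignment), in which no substring is edited more than once; the unrestricted Damerau-Levenshtein distance needs a slightly different algorithm. Names are illustrative.

```java
final class DamerauLevenshteinSketch {
    /** Edit distance with adjacent transpositions, keeping three matrix rows. */
    static int distance(String a, String b) {
        int n = b.length();
        int[] prev2 = new int[n + 1]; // row i - 2
        int[] prev = new int[n + 1];  // row i - 1
        int[] curr = new int[n + 1];  // row i
        for (int j = 0; j <= n; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int v = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                 prev[j - 1] + cost);
                // Extra condition: the last two characters are swapped.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    v = Math.min(v, prev2[j - 2] + 1);
                }
                curr[j] = v;
            }
            int[] t = prev2; prev2 = prev; prev = curr; curr = t; // rotate rows
        }
        return prev[n];
    }
}
```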
Besides the distances described above, many others are used in practice, such as the Jaro-Winkler metric. Many of them are available in the SimMetrics and SecondString libraries.
Fuzzy search algorithms without indexing (online)
These algorithms are designed to search text that is not known in advance and can be used, for example, in text editors, document viewers, or web browsers for in-page search. They do not require any preprocessing of the text and can work on a continuous stream of data.
Linear search
This is the simple sequential application of a given metric (for example, the Levenshtein metric) to the words of the input text. With a bounded metric, this method achieves the best speed it can. At the same time, the larger k is, the longer it runs. The asymptotic time estimate is O(kn).
Bitap (also known as Shift-Or or Baeza-Yates-Gonnet, and its modifications by Wu and Manber)
The Bitap algorithm and its various modifications are the ones most often used for fuzzy search without indexing. A variation of it is used, for example, in the Unix utility agrep, which works like the standard grep but supports errors in the search query and offers limited support for regular expressions.
The idea of this algorithm was first proposed by Ricardo Baeza-Yates and Gaston Gonnet, who published the corresponding article in 1992.
The original version of the algorithm handles only character substitutions and in effect computes the Hamming distance. A little later, Sun Wu and Udi Manber proposed a modification of the algorithm that computes the Levenshtein distance, that is, they added support for insertions and deletions, and built the first version of the agrep utility on top of it.
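[Table: bit-shift operations for insertion, deletion and substitution, and the resulting value; the formulas themselves are missing from the source.]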
Here k is the number of errors, j is the character index, and s_x is the character mask (the set bits of the mask are at the positions where that character occurs in the query).
Whether the query matches is determined by the last bit of the resulting vector R.
The high speed of this algorithm comes from the bit-level parallelism of the computation: a single operation can process 32 or more bits at once.
A trivial implementation, on the other hand, supports only search words of at most 32 characters. This limit is set by the width of the standard int type (on 32-bit architectures). Wider types can be used, but this may slow the algorithm down to some extent.
Although the asymptotic running time of this algorithm, O(kn), is the same as that of the linear method, it is considerably faster for long queries and for error counts k above 2.
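For illustration, here is a compact sketch of the Wu-Manber scheme in its Shift-And form (set bits mean "matched so far"). It is not the article's implementation: the class name, the use of a 64-bit `long` instead of `int`, and the convention of returning the end position of the first match are all assumptions of this example.

```java
import java.util.HashMap;
import java.util.Map;

final class BitapSketch {
    /**
     * Returns the index in text at which the first substring within
     * Levenshtein distance k of pattern ends, or -1 if there is none.
     */
    static int findEnd(String text, String pattern, int k) {
        int m = pattern.length();
        if (m == 0 || m > 64 || k < 0 || k >= m) {
            throw new IllegalArgumentException(
                "need 1 <= pattern length <= 64 and 0 <= k < pattern length");
        }
        // Character masks: bit i is set if pattern[i] equals the character.
        Map<Character, Long> masks = new HashMap<>();
        for (int i = 0; i < m; i++) {
            masks.merge(pattern.charAt(i), 1L << i, (x, y) -> x | y);
        }
        // r[d], bit i: pattern[0..i] matches a suffix of the text read so far
        // with at most d errors. Initially d leading pattern chars may be skipped.
        long[] r = new long[k + 1];
        for (int d = 1; d <= k; d++) {
            r[d] = (1L << d) - 1;
        }
        long accept = 1L << (m - 1); // last bit of the vector
        for (int j = 0; j < text.length(); j++) {
            long mask = masks.getOrDefault(text.charAt(j), 0L);
            long oldPrev = r[0];                 // level d - 1 before this character
            r[0] = ((r[0] << 1) | 1L) & mask;    // exact matching
            long newPrev = r[0];                 // level d - 1 after this character
            for (int d = 1; d <= k; d++) {
                long old = r[d];
                r[d] = (((old << 1) | 1L) & mask) // match
                     | oldPrev                    // insertion in the text
                     | ((oldPrev << 1) | 1L)      // substitution
                     | ((newPrev << 1) | 1L);     // deletion from the text
                oldPrev = old;
                newPrev = r[d];
            }
            if ((r[k] & accept) != 0) {
                return j;
            }
        }
        return -1;
    }
}
```

Each text character costs k + 1 vector updates, which is where the O(kn) estimate comes from; the work per update does not depend on the pattern length as long as the pattern fits into one machine word.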
Testing
Testing was performed on a text of 3.2 million words with an average word length of 10.
Exact match search
Search time: 3562 ms
Search using the Levenshtein metric
Search time with k = 2: 5728 ms
Search time with k = 5: 8385 ms
Search using the Bitap algorithm with Wu-Manber modifications
Search time with k = 2: 5499 ms
Search time with k = 5: 5928 ms
Clearly, a simple search using the metric depends heavily on the number of errors k, in contrast to the Bitap algorithm.
However, when it comes to searching large amounts of text that does not change, search time can be cut significantly by preprocessing the text, which is also called indexing.
Fuzzy search algorithms with indexing (offline)
What all fuzzy search algorithms with indexing have in common is that the index is built from a dictionary compiled from the source text or from a list of records in a database.
These algorithms take different approaches to the problem: some reduce it to exact search, others use the properties of the metric to build various spatial structures, and so on.
As a first step, a dictionary containing the words and their positions in the text is built from the source text. Word and phrase frequencies can also be counted to improve the quality of search results.
The index, like the dictionary, is assumed to be loaded into memory in full.
Dictionary characteristics:
- Source text: 8.2 gigabytes of material from the Moshkov library (lib.ru), 680 million words;
- Dictionary size: 65 megabytes;
- Number of words: 3.2 million;
- Average word length: 9.5 characters;
- Root mean square word length (may be useful for evaluating some algorithms): 10.0 characters;
- Alphabet: uppercase Russian letters А-Я, without Ё (to simplify some operations). Words containing non-alphabetic characters are not included in the dictionary.
Sample expansion algorithm
This algorithm is often used in spell-checking systems (spell checkers), where the dictionary is small or where speed is not the main criterion.
It is based on reducing the fuzzy search problem to an exact search problem.
A set of "mistaken" words is generated from the original query, and an exact search in the dictionary is then performed for each of them.
Its running time depends heavily on the number of errors k and on the size of the alphabet A. With binary search over the dictionary it is on the order of O((m|A|)^k * m * log n), where m is the length of the query.
For example, for k = 1 and a word of length 7 (the example given is the Russian word for "crocodile") in the Russian alphabet, the set of mistaken words has about 450 entries, so 450 dictionary queries have to be made.
But already at k = 2 the size of this set exceeds 115 thousand variants, which corresponds to a complete scan of a small dictionary, or 1/27 of the dictionary in our case, so the running time becomes quite long. It should also be kept in mind that an exact-match dictionary lookup is still needed for each of these words.
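As a rough check of the figure for k = 1: a word of length m over an alphabet of |A| letters yields m deletions, m - 1 transpositions, m(|A| - 1) substitutions and (m + 1)|A| insertions. Assuming a 32-letter alphabet and m = 7, that is 7 + 6 + 217 + 256 = 486 variants before duplicates are removed, which is in line with the "about 450" quoted above. A minimal sketch of the generation step follows; the names are hypothetical, and a hash set stands in for the dictionary instead of binary search.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SampleExpansionSketch {
    /** All strings within one edit (including transposition) of the word. */
    static Set<String> edits1(String word, String alphabet) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            if (!tail.isEmpty()) {
                out.add(head + tail.substring(1));                 // deletion
            }
            if (tail.length() > 1) {                               // transposition
                out.add(head + tail.charAt(1) + tail.charAt(0) + tail.substring(2));
            }
            for (char c : alphabet.toCharArray()) {
                if (!tail.isEmpty()) {
                    out.add(head + c + tail.substring(1));         // substitution
                }
                out.add(head + c + tail);                          // insertion
            }
        }
        return out;
    }

    /** Exact lookups of every generated variant (k = 1 only). */
    static List<String> search(String query, Set<String> dictionary, String alphabet) {
        Set<String> candidates = edits1(query, alphabet);
        candidates.add(query); // the query itself, for an exact hit
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (dictionary.contains(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }
}
```

For k = 2 the same generator would be applied to every variant again, which is exactly where the set size explodes.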
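Features:
The algorithm can easily be modified to generate "mistaken" variants according to arbitrary rules. Moreover, it requires no preprocessing of the dictionary and therefore no additional memory.
Possible improvements:
Instead of generating the whole set of "mistaken" words, it is possible to generate only those most likely to occur in real situations, for example words with common spelling mistakes or typos.
N-gram method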
This method was invented long ago and is the most widely used one, because it is very simple to implement and gives reasonably good performance. The algorithm is based on the following principle:
"If word A matches word B allowing for some errors, then with high probability they share at least one common substring of length N."
These substrings of length N are called N-grams.
During indexing, each word is split into such N-grams, and the word is added to the list of each of these N-grams. During search, the query is also split into N-grams, and for each of them the list of words containing that substring is scanned sequentially.
Trigrams, substrings of length 3, are the most commonly used in practice. Choosing a larger N raises the minimum word length at which errors can still be detected.
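The indexing and lookup steps described above might look like the following sketch. It only collects candidates; each candidate would still have to be checked against the query with the chosen metric. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TrigramIndexSketch {
    private static final int N = 3;
    // N-gram -> list of dictionary words that contain it
    private final Map<String, List<String>> lists = new HashMap<>();

    /** Distinct N-grams of a word, in order of first occurrence. */
    static Set<String> ngrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + N <= word.length(); i++) {
            grams.add(word.substring(i, i + N));
        }
        return grams;
    }

    /** Indexing: add the word to the list of each of its N-grams. */
    void add(String word) {
        for (String gram : ngrams(word)) {
            lists.computeIfAbsent(gram, g -> new ArrayList<>()).add(word);
        }
    }

    /** Search: union of the lists of all N-grams of the query. */
    Set<String> candidates(String query) {
        Set<String> out = new LinkedHashSet<>();
        for (String gram : ngrams(query)) {
            out.addAll(lists.getOrDefault(gram, Collections.emptyList()));
        }
        return out;
    }
}
```

For instance, with "vodka" in the index, the query "vodca" shares the trigram "vod" with it, so "vodka" ends up among the candidates.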
Features:
The N-gram algorithm does not find every word that may contain errors. Take, for example, the word VOTKA and split it into trigrams: VOT, OTK, TKA. Every one of them contains the erroneous T. The word "VODKA" will therefore not be found, because it contains none of these trigrams and is not in any of their lists. So the shorter the word and the more errors it contains, the higher the chance that it is missing from the lists corresponding to the N-grams of the query and hence from the result.
On the other hand, the N-gram method leaves full freedom to use your own metrics with arbitrary properties and complexity. This comes at a price: using it requires a sequential scan of about 15% of the dictionary, which is quite a lot for large dictionaries.
Possible improvements:
The hash tables of N-grams can be split by the position of the N-gram within the word (modification 1). Just as the lengths of the searched word and the query cannot differ by more than k, the positions of an N-gram within the word cannot differ by more than k. It is therefore enough to check the table corresponding to the position of this N-gram in the word, plus the k tables to its left and the k tables to its right, 2k + 1 neighboring tables in total.
The size of the set to scan can be reduced further by also splitting the tables by word length and, in the same way, looking only at the 2k + 1 neighboring tables (modification 2).
Signature hashing
This algorithm is described in the article "Signature hashing" by L. M. Boytsov. It is based on a fairly obvious representation of the "structure" of a word as a string of bits, which is used as a hash (signature) in a hash table.
When the index is built, such a hash is computed for every word, and the mapping from each hash to the list of dictionary words that have it is stored in the table. During search, the hash of the query is computed, and all neighboring hashes that differ from it in at most k bits are enumerated. For each of these hashes, the list of corresponding words is scanned.
Hash computation: each bit of the hash is associated with a group of characters of the alphabet. A 1 bit at position i of the hash means that the word contains a character from the i-th group of the alphabet. The order of the letters in the word is completely irrelevant.
Deleting one character either leaves the hash unchanged (if the word still contains characters from the same alphabet group) or changes the bit of that group to 0. Similarly, an insertion either sets one bit to 1 or changes nothing. Substitution is a little more complicated: the hash may stay the same, or it may change in 1 or 2 positions. A transposition changes nothing at all, because, as noted above, the order of characters is ignored when the hash is built. So, to cover k errors completely, at least 2k bits of the hash have to be varied.
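The hash computation and the enumeration of neighboring hashes can be sketched as follows. The 16-bit hash width and the grouping of characters by character code modulo the hash width are assumptions made only for this example; the cited article may group the alphabet differently.

```java
import java.util.ArrayList;
import java.util.List;

final class SignatureHashSketch {
    static final int BITS = 16; // hypothetical hash width

    /** Bit i is set if the word contains a character from group i. */
    static int signature(String word) {
        int hash = 0;
        for (int i = 0; i < word.length(); i++) {
            // Assumed grouping: character code modulo the hash width.
            int group = Character.toLowerCase(word.charAt(i)) % BITS;
            hash |= 1 << group;
        }
        return hash;
    }

    /** The hash itself plus all hashes that differ from it in at most k bits. */
    static List<Integer> neighbours(int hash, int k) {
        List<Integer> out = new ArrayList<>();
        collect(hash, 0, k, out);
        return out;
    }

    // Flips bits in increasing order so that every bit subset is visited once.
    private static void collect(int hash, int fromBit, int flipsLeft, List<Integer> out) {
        out.add(hash);
        if (flipsLeft == 0) {
            return;
        }
        for (int b = fromBit; b < BITS; b++) {
            collect(hash ^ (1 << b), b + 1, flipsLeft - 1, out);
        }
    }
}
```

With these assumptions, varying up to 2 bits of a 16-bit hash means probing 1 + 16 + 120 = 137 table entries per query.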
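Average running time with k "incomplete" errors (insertions, deletions, transpositions, and some of the substitutions): [formula missing in the source]
Features:
Because replacing a single character can change two bits at once, an algorithm that varies, say, no more than 2 bits at a time will not return the complete set of results: a significant share of the words with two substitutions will be missing (how large a share depends on the ratio of the hash size to the alphabet size). The larger the hash, the more often a character substitution changes two bits at once, and the less complete the result becomes. In addition, this algorithm does not allow prefix search.
BK-trees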
Burkhard-Keller trees are metric trees. The algorithms for building such trees rely on the property of a metric that corresponds to the triangle inequality: p(x, y) <= p(x, z) + p(z, y).
This property allows a metric to form a metric space of arbitrary dimension. Such metric spaces are not necessarily Euclidean; for example, the Levenshtein and Damerau-Levenshtein metrics form non-Euclidean spaces. Based on these properties, one can build a data structure for searching in such a metric space, and this is what the Burkhard-Keller tree is.
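A minimal sketch of such a tree is shown below, assuming plain Levenshtein distance as the metric. Each child edge is labelled with the distance between the child's word and its parent's word; by the triangle inequality, a search for words within k of the query only needs to follow edges whose label lies between d - k and d + k, where d is the distance from the query to the current node. Names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BkTreeSketch {
    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>(); // keyed by distance
        Node(String word) { this.word = word; }
    }

    private Node root;

    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) return;                    // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;                          // descend along the edge d
        }
    }

    List<String> search(String query, int k) {
        List<String> result = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        if (root != null) stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = distance(query, node.word);
            if (d <= k) result.add(node.word);
            for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
                // Triangle inequality: other subtrees cannot contain matches.
                if (e.getKey() >= d - k && e.getKey() <= d + k) stack.push(e.getValue());
            }
        }
        return result;
    }

    // Plain two-row Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[b.length()];
    }
}
```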
Improvements:
The ability of some metrics to compute a distance with a limit can be exploited by setting an upper bound equal to the sum of the maximum distance to the node's descendants and the required distance, which speeds the process up a little.
Testing
Testing was performed on a laptop with an Intel Core Duo T2500 (2GHz / 667MHz FSB / 2MB), 2Gb RAM, OS: Ubuntu 10.10 Desktop i686, JRE: OpenJDK 6 Update 20.
Testing used the Damerau-Levenshtein distance and an error count of k = 2. Index sizes are given including the dictionary (65 MB).
Sample expansion
Index size: 65 MB
Search time: 320 ms / 330 ms
Result completeness: 100%
N-gram (original)
Index size: 170 MB
Index creation time: 32 s
Search time: 71 ms / 110 ms
Result completeness: 65%
N-gram (modification 1)
Index size: 170 MB
Index creation time: 32 s
Search time: 39 ms / 46 ms
Result completeness: 63%
N-gram (modification 2)
Index size: 170 MB
Index creation time: 32 s
Search time: 37 ms / 45 ms
Result completeness: 62%
Signature hashing
Index size: 85 MB
Index creation time: 0.6 s
Search time: 55 ms
Result completeness: 56.5%
BK-trees
Index size: 150 MB
Index creation time: 120 s
Search time: 540 ms
Result completeness: 63%
Summary
Most fuzzy search algorithms with indexing are not truly sublinear (that is, they do not have an asymptotic running time of O(log n) or lower), and their speed usually depends directly on the dictionary size n. Nevertheless, numerous refinements and improvements make it possible to reach sufficiently short search times even with very large dictionaries.
There are also many more varied and less efficient methods, based among other things on adapting techniques already used elsewhere to this subject area. One of them is the adaptation of prefix trees (Trie) to fuzzy search problems, which I left out because of its low efficiency. But there are also algorithms based on original approaches, for example the Maass-Novak algorithm. It has sublinear asymptotic running time, but it is highly inefficient because of the huge constants hidden behind that estimate, which show up as a huge index size.
The practical use of fuzzy search algorithms in real search engines is closely tied to phonetic algorithms, to lexical stemming algorithms that extract the base part of different forms of the same word (functionality provided, for example, by Snowball and Yandex mystem), and to ranking based on statistical information or to the use of complex, sophisticated metrics.
My Java implementations can be found at http://code.google.com/p/fuzzy-search-tools:
- Levenshtein distance (with cut-off and prefix options);
- Damerau-Levenshtein distance (with cut-off and prefix options);
- Bitap algorithm (Shift-OR / Shift-AND with Wu-Manber modifications);
- Sample expansion algorithm;
- N-gram method (original and with modifications);
- Signature hashing method;
- BK-trees.
It is worth noting that, while studying this topic, I came up with some developments of my own that cut search time significantly at the cost of a moderate increase in index size and some restriction on the choice of metric. But that is a completely different story.
References:
- Java source code for this article. http://code.google.com/p/fuzzy-search-tools
- Levenshtein distance. http://ru.wikipedia.org/wiki/Levenshtein_Distance
- Damerau-Levenshtein distance. http://en.wikipedia.org/wiki/Damerau–Levenshtein_distance
- Bitap: German-language description of Shift-Or with Wu-Manber modifications. http://de.wikipedia.org/wiki/Baeza-Yates-Gonnet-Algorithmus
- N-gram method. http://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf
- Signature hashing. http://itman.narod.ru/articles/rtf/confart.zip
- Leonid Moiseevich Boytsov's site, devoted entirely to fuzzy search. http://itman.narod.ru/
- Implementation of Shift-Or and other algorithms. http://johannburkard.de/software/stringsearch/
- Fast text searching with Agrep (Wu and Manber). http://www.at.php.net/utils/admin-tools/agrep/agrep.ps.1
- Damn Cool Algorithms: Levenshtein automata, BK-trees, and other algorithms. http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
- BK-tree in Java. http://code.google.com/p/java-bk-tree/
- Maass-Novak algorithm. http://yury.name/internet/09ia-seminar.ppt
- SimMetrics metrics library. http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html
- SecondString metrics library. http://sourceforge.net/projects/secondstring/
Russian version: Fuzzy string search