One of the duties of a member of our support staff is distributing incoming emails among specialists. To do this, he analyzes each email and determines a number of its characteristics. For example, the "request type": does the user need help with a system error, does he simply want a consultation, or is he asking for some new feature? He also determines the "functional module of the system": the accounting module, device authentication, and so on. Once all these characteristics are determined, the email is redirected to the appropriate specialist.
"Why not write a program that does this automatically?" I said to myself.
From this lyrical introduction, let's move on to the technical part.
Formalizing the task
- The input is a text of arbitrary length and content.
- The text may contain typos and abbreviations; in general, it is text written by a human, not always correctly.
- The text must be analyzed, that is, classified by several mutually independent characteristics: not only the type of request (an error message, a request for a new feature, or a consultation) but also the functional module (in our case, the essence of the domain had to be determined: payroll calculation, warehouse management, and so on).
- Each characteristic can be assigned only one value.
- The set of possible values of each characteristic is known in advance.
- Several thousand already classified emails are available.
Having formalized the task to fit our specific needs and started development, I realized it would be better not to tie myself strictly to a few particular characteristics and their number, but to develop a more universal tool. The result is a tool that can classify text by arbitrary characteristics.
Since I had both the time and the interest to do this myself, I did not look for a ready-made solution; instead, in parallel with the development, I plunged into studying neural networks.
Choosing the tools
I decided to develop in Java.
As the database I used SQLite and H2, with Hibernate on top; this turned out to be quite convenient.
The neural network itself I took ready-made: the Encog machine learning framework. I decided to use a neural network as the classifier rather than, say, a naive Bayes classifier, first because in theory a neural network should be more accurate, and second because I simply wanted to play with neural networks.
To read the Excel files with the data needed for training the network, I used Apache POI.
And, traditionally, JUnit 4 + Mockito for the tests.
A little theory
(Since there is plenty of good introductory material, I will not explain neural network theory here in detail.) Simply put, a network has an input layer, intermediate (hidden) layers, and an output layer. The number of neurons in each layer is fixed by the developer in advance: it cannot be changed after the network has been trained, and adding or removing even a single neuron means the network has to be retrained from scratch. A normalized number (in most cases from 0 to 1) is fed to each neuron of the input layer. Depending on the set of numbers at the input layer, after some computation we obtain a number from 0 to 1 on each output neuron.
The essence of training a network is adjusting the weights of the links involved in the computation so that for certain predefined sets of numbers at the input layer we receive predefined known sets of numbers at the output layer. The adjustment of the weights is an iterative process that runs either until a fixed number of iterations is reached or until the network reaches a specified accuracy on the training set. After training, when a set of numbers similar to one from the training set is fed to the input, the network is expected to produce something close to the corresponding reference set on the output layer.
Getting down to practice
The first task was to find a way to convert text into a form that can be passed to the input of the neural network. But before that, the size of the network's input layer had to be determined, since it must be fixed in advance. It is obvious that the input layer must accommodate a text of arbitrary size. The first thing that came to mind: the size of the input layer must equal the size of a vocabulary containing all the words/phrases of the context.
There are many ways to build a vocabulary. For example, we could take all the words of the Russian language, and they would form our vocabulary. But this approach is unsuitable. First, the input layer would become so huge that a simple workstation would not have enough resources for a neural network of that size. Imagine, for instance, that the vocabulary contains 100,000 words; then the input layer has 100,000 neurons. With a hidden layer of, say, 80,000 neurons (sized by the method described below) and 25 output neurons, we would need about 60 gigabytes of RAM just to store the connection weights: ((100,000 * 80,000) + (80,000 * 25)) * 64 bits (the size of a double in Java). Second, the approach is unsuitable because the texts may use domain-specific terms that are absent from a general dictionary.
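As a sanity check, the memory estimate above can be reproduced in a few lines (the class and method names here are mine, purely illustrative, not part of the project):

```java
public class MemoryEstimate {
  // Connection weights of a fully connected network with the given layer
  // sizes, one Java double (8 bytes) per weight, expressed in gigabytes.
  static double weightGigabytes(long input, long hidden, long output) {
    long weights = input * hidden + hidden * output;
    return weights * 8.0 / (1024.0 * 1024 * 1024);
  }

  public static void main(String[] args) {
    // numbers from the paragraph above: 100,000-80,000-25 network
    System.out.printf("%.1f GB%n", weightGigabytes(100_000, 80_000, 25)); // roughly 60 GB
  }
}
```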
Hence the conclusion: the vocabulary should be built only from the words/phrases that make up the texts being analyzed. It is important to understand that in this case the vocabulary must be built on a sufficiently large portion of the training data.
One of the ways to extract words/phrases (more precisely, fragments) from text is to build so-called N-grams. The most popular are unigrams and bigrams. There are also character N-grams, where the text is chopped into segments of a fixed length rather than into individual words. It is hard to say in advance which kind of N-grams will be effective for a particular task, so you have to experiment.
Text | Unigrams | Bigrams | 3-character N-grams |
---|---|---|---|
This text must be split into parts | ["this", "text", "must", "be", "split", "into", "parts"] | ["this text", "text must", "must be", "be split", "split into", "into parts"] | ["thi", "ste", "xtm", "ust", "bes", "pli", "tin", "top", "art"] |
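The character N-gram splitting from the last column can be sketched like this (a hypothetical helper written for illustration, not taken from the project's source code):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch: chop a text into non-overlapping fragments of length n.
class CharNGram {
  static Set<String> getNGram(String text, int n) {
    // drop whitespace so fragments span only visible characters
    String cleaned = text.toLowerCase().replaceAll("\\s+", "");

    Set<String> result = new LinkedHashSet<>();
    for (int i = 0; i + n <= cleaned.length(); i += n) {
      result.add(cleaned.substring(i, i + n));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(CharNGram.getNGram("this text", 3)); // → [thi, ste]
  }
}
```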
I decided to move from the simple to the complex, so I started by designing the Unigram class.
The Unigram class
class Unigram implements NGramStrategy {
  @Override
  public Set<String> getNGram(String text) {
    if (text == null) {
      text = "";
    }

    // get all words and digits
    String[] words = text.toLowerCase().split("[ \\pP\n\t\r$+<>№=]");

    Set<String> uniqueValues = new LinkedHashSet<>(Arrays.asList(words));
    uniqueValues.removeIf(s -> s.equals(""));

    return uniqueValues;
  }
}
After processing the texts with this class, I obtained a vocabulary of about 32,000 elements. Analyzing its contents, I noticed that it held a lot of repetition and junk that I wanted to get rid of. To do this:
- I removed all non-alphanumeric characters (punctuation marks, arithmetic operators, and so on), since they normally carry no meaning.
- I ran the words through Porter stemming (using a ready-made stemmer for Russian), which discards word endings. A handy side effect of this procedure is the "unisexization" of words: the masculine and feminine past-tense forms of the Russian verb "to do" are both reduced to the stem "sdela".
- At first I also wanted to detect typos and grammatical errors. I read up on Oliver's algorithm (the similar_text function in PHP) and the Levenshtein distance. But this problem turned out to have a much simpler solution: I decided that if an N-gram element occurs in fewer than four texts of the training set, it should not be included in the vocabulary, since it will be of no use in the future. This disposed of most typos, misspelled words, "gluedTogetherWords", and very rare words. You should understand, however, that if future texts will often contain typos and grammatical errors, the accuracy of classifying such texts will be lower; in that case a mechanism for correcting typos and grammatical errors needs to be implemented. Important: building the vocabulary this way, with rare words thrown out, requires a large amount of training data so that nothing useful is lost.
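For completeness, the Levenshtein distance mentioned above can be sketched with the classic dynamic-programming approach (this helper class is mine; as noted, the project ended up not needing it):

```java
// Illustrative sketch of the Levenshtein (edit) distance: the minimum number
// of single-character insertions, deletions, and substitutions turning a into b.
class Levenshtein {
  static int distance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];

    for (int i = 0; i <= a.length(); i++) d[i][0] = i; // deletions only
    for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insertions only

    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + cost);
      }
    }
    return d[a.length()][b.length()];
  }
}
```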
All of this is implemented in the FilteredUnigram and VocabularyBuilder classes.
The FilteredUnigram class
public class FilteredUnigram implements NGramStrategy {
  @Override
  public Set<String> getNGram(String text) {
    // get all significant words
    String[] words = clean(text).split("[ \n\t\r$+<>№=]");

    // remove endings of words
    for (int i = 0; i < words.length; i++) {
      words[i] = PorterStemmer.doStem(words[i]);
    }

    Set<String> uniqueValues = new LinkedHashSet<>(Arrays.asList(words));
    uniqueValues.removeIf(s -> s.equals(""));

    return uniqueValues;
  }

  private String clean(String text) {
    // remove all digits and punctuation marks
    if (text != null) {
      return text.toLowerCase().replaceAll("[\\pP\\d]", " ");
    } else {
      return "";
    }
  }
}
The VocabularyBuilder class
class VocabularyBuilder {
  private final NGramStrategy nGramStrategy;

  VocabularyBuilder(NGramStrategy nGramStrategy) {
    if (nGramStrategy == null) {
      throw new IllegalArgumentException();
    }

    this.nGramStrategy = nGramStrategy;
  }

  List<VocabularyWord> getVocabulary(List<ClassifiableText> classifiableTexts) {
    if (classifiableTexts == null || classifiableTexts.size() == 0) {
      throw new IllegalArgumentException();
    }

    Map<String, Integer> uniqueValues = new HashMap<>();
    List<VocabularyWord> vocabulary = new ArrayList<>();

    // count frequency of use of each word (converted to n-gram) from all classifiable texts
    for (ClassifiableText classifiableText : classifiableTexts) {
      for (String word : nGramStrategy.getNGram(classifiableText.getText())) {
        if (uniqueValues.containsKey(word)) {
          // increase counter
          uniqueValues.put(word, uniqueValues.get(word) + 1);
        } else {
          // add new word
          uniqueValues.put(word, 1);
        }
      }
    }

    // convert uniqueValues to vocabulary, excluding infrequent words
    for (Map.Entry<String, Integer> entry : uniqueValues.entrySet()) {
      if (entry.getValue() > 3) {
        vocabulary.add(new VocabularyWord(entry.getKey()));
      }
    }

    return vocabulary;
  }
}
An example of the vocabulary:
Text | Filtered unigrams | Vocabulary |
---|---|---|
Need to find a sequence of 12 tasks | need, find, sequenc, task | need, find, sequenc, task, problem, ani, add, transposit |
A problem for any sequence | problem, ani, sequenc | |
Add any transposition | add, ani, transposit | |
To build bigrams I wrote one more class. (Running ahead: after the experiments, I stopped at the alternative that produced the best ratio of vocabulary size to classification accuracy.)
The Bigram class
class Bigram implements NGramStrategy {
  private NGramStrategy nGramStrategy;

  Bigram(NGramStrategy nGramStrategy) {
    if (nGramStrategy == null) {
      throw new IllegalArgumentException();
    }

    this.nGramStrategy = nGramStrategy;
  }

  @Override
  public Set<String> getNGram(String text) {
    List<String> unigram = new ArrayList<>(nGramStrategy.getNGram(text));

    // concatenate words to bigrams
    // example: "How are you doing?" => {"how are", "are you", "you doing"}
    Set<String> uniqueValues = new LinkedHashSet<>();

    for (int i = 0; i < unigram.size() - 1; i++) {
      uniqueValues.add(unigram.get(i) + " " + unigram.get(i + 1));
    }

    return uniqueValues;
  }
}
At this point I decided to stop, although the processing of words for the vocabulary could be developed further. For example, you could define synonyms, analyze the similarity of words sharing a common root, and so on. I tried some of this, but as a rule these refinements did not give a noticeable increase in classification accuracy.
Moving on. The size of the neural network's input layer we have already computed: it equals the number of elements in the vocabulary.
The size of the output layer for our task equals the number of possible values of a characteristic. For example, suppose the "request type" characteristic has three possible values: a system error, help for the user, and a new feature. Then the number of neurons in the output layer equals 3. During training, for each value of the characteristic we define a unique reference set of numbers that we expect to receive on the output layer: 1 0 0 for the first value, 0 1 0 for the second, 0 0 1 for the third, and so on.
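Such reference outputs are one-hot vectors; a minimal sketch with illustrative names (not the project's actual code):

```java
import java.util.Arrays;

class OutputVectors {
  // Hypothetical helper: encode the valueIndex-th of n possible characteristic
  // values as a reference vector for the output layer.
  static double[] toReferenceVector(int valueIndex, int possibleValues) {
    double[] vector = new double[possibleValues];
    vector[valueIndex] = 1.0; // exactly one neuron "fires" per value
    return vector;
  }

  public static void main(String[] args) {
    // second of three values
    System.out.println(Arrays.toString(toReferenceVector(1, 3))); // → [0.0, 1.0, 0.0]
  }
}
```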
As for the number and sizes of the hidden layers, there are no strict recommendations. The sources write that the optimal size can only be found experimentally for each specific task, and suggest starting with a single hidden layer whose size lies somewhere between the sizes of the input and output layers, a network "tapering" toward the output. I started with a hidden layer of 2/3 of the input layer and then experimented with the number and lengths of the hidden layers. Here and here you can read some of the theory and guidelines on this topic. Much also depends on how much data there is to train on.
So, we have created the network. Now we need to decide how to convert text into a set of numbers suitable for "feeding" to the neural network. To do this, the text must be converted into a text vector. In addition, every word in the vocabulary must be assigned a unique word vector in advance; the length of a word vector must equal the size of the vocabulary. Changing a word's vector after the network has been trained is impossible. For a vocabulary of four words it looks like this:
Word in vocabulary | Word vector |
---|---|
hello | 1 0 0 0 |
how | 0 1 0 0 |
are | 0 0 1 0 |
you | 0 0 0 1 |
Converting a text into a text vector means taking the union of the vectors of all the words the text contains: the text "hello how you" is converted into the vector "1 1 0 1". This vector can already be fed to the input of the neural network (each number of the vector goes to its own neuron of the input layer; the number of input neurons exactly equals the length of the text vector).
The method that computes the text vector:
private double[] getTextAsVectorOfWords(ClassifiableText classifiableText) {
  double[] vector = new double[inputLayerSize];

  // convert text to nGram
  Set<String> uniqueValues = nGramStrategy.getNGram(classifiableText.getText());

  // create vector
  for (String word : uniqueValues) {
    VocabularyWord vw = findWordInVocabulary(word);

    if (vw != null) { // word found in vocabulary
      vector[vw.getId() - 1] = 1;
    }
  }

  return vector;
}
At this point, the preparation of texts for analysis can be considered finished.
Classification accuracy
After trying different algorithms for forming the vocabulary and different numbers and sizes of hidden layers, I settled on this variant: the vocabulary is formed by FilteredUnigram with rarely used words removed, and there are two hidden layers, the first 1/6 the size of the vocabulary and the second 1/4 the size of the first layer.
After training on 20,000 texts (which is very little for a network of this size), I ran the network on 2,000 reference texts and got the following results.
N-gram | Accuracy | Vocabulary size (without rarely used words) |
---|---|---|
Unigram | 58% | ~25,000 |
Filtered unigram | 73% | ~1,200 |
Bigram | 63% | ~8,000 |
Filtered bigram | 69% | ~3,000 |
This is the accuracy of predicting a single characteristic. If you need the accuracy for several characteristics at once, the formulas are as follows:
- The probability of guessing all characteristics correctly equals the product of the probabilities of guessing each characteristic.
- The probability of guessing at least one characteristic equals one minus the product of the probabilities of determining each characteristic incorrectly.
An example:
Suppose the accuracy of the first characteristic is 73% and of the second one 65%. Then both at once are guessed with probability 0.65 * 0.73 = 0.4745, i.e. 47.45%, and at least one of the two characteristics is determined correctly with probability 1 - (1 - 0.65) * (1 - 0.73) = 0.9055, i.e. 90.55%.
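The arithmetic of this example can be verified in a few lines (the class is mine, written only to check the numbers from the text):

```java
public class AccuracyMath {
  // probability of guessing both characteristics at once
  static double all(double p1, double p2) {
    return p1 * p2;
  }

  // probability of guessing at least one of two characteristics
  static double atLeastOne(double p1, double p2) {
    return 1 - (1 - p1) * (1 - p2);
  }

  public static void main(String[] args) {
    // values from the example above: 73% and 65%
    System.out.printf("both: %.4f, at least one: %.4f%n",
        all(0.73, 0.65), atLeastOne(0.73, 0.65)); // 0.4745 and 0.9055
  }
}
```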
This is a good result for a tool that requires no manual preprocessing of the input data.
If only it were that simple, though: the accuracy strongly depends on how similar the texts classified into different categories are. That is, if the system has to separate, say, texts about system errors from very similar texts belonging to another category, the accuracy will be lower. Therefore, with the same network settings and vocabulary, different tasks and different texts can yield a large difference in accuracy.
Overview of the program
As I said at the beginning, I wrote a universal program that is not tied to a specific number of characteristics and can classify arbitrary texts. I will not describe every detail and every algorithm of the development here; a link to the source code is at the end of the article.
The general algorithm of the program:
- On first launch, the program asks for an XLSX file with the training data. The file should consist of one or two sheets: the first with the data for training, the second (optional) with data for an additional test of the trained network's accuracy. The structure of both sheets is identical: the first column contains the text to be analyzed, and the subsequent columns (there can be any number of them) contain the values of that text's characteristics; the first row contains the names of the characteristics.
- Based on this file, the vocabulary is built, along with the list of characteristics and the list of allowed unique values of each characteristic. All of this is stored in a repository.
- A separate neural network is created for each characteristic.
- All created neural networks are trained and then saved.
- On subsequent launches, the program loads all the saved trained neural networks and is ready to analyze texts.
- Each incoming text is processed by every neural network independently, and the overall result is given as a value for each characteristic.
Plans:
- Add other types of classifiers (for example, a convolutional neural network and a naive Bayes classifier).
- Try a combined vocabulary consisting of unigrams and bigrams.
- Add a mechanism for eliminating typos and grammatical errors.
- Add a dictionary of synonyms.
- Skip words that occur too often, since they are usually informational noise.
- Add weights to individual words according to their influence on the classification of the analyzed text.
- Simplify the API and change the architecture so that the core functionality can be used as a separate library.
About the source code
The source code may be useful as an example of the use of certain design patterns. I was saving this topic for a separate article, but it makes sense to touch on it in this one: since I am sharing my experience anyway, there is no reason to keep silent about this part.
- The "Strategy" pattern: the NGramStrategy interface and the classes implementing it.
- The "Observer" pattern: the LogWindow class and the classifier implement the Observer and Observable interfaces.
- The "Decorator" pattern: the Bigram class.
- The "Simple Factory" pattern: the getNGramStrategy() method of the NGramStrategy interface.
- The "Factory Method" pattern: the getNgramStrategy() method of the NGramStrategyTest class.
- The "Abstract Factory" pattern: the JDBCDAOFactory class.
- The "Singleton" pattern (antipattern?): the EMFProvider class.
- The "Template Method" pattern: the initializeIdeal() method of the NGramStrategyTest class. Admittedly, not the most classic application of it.
- The "DAO" pattern: the CharacteristicDAO, ClassifiableTextDAO, and other interfaces.
Full source code: https://github.com/RusZ/TextClassifier
Constructive suggestions and criticism are welcome.
P.S.: In case of claims from Lev, don't worry: he is a fictional character.