ç§ã¯è±èªãå匷ãããã®ããã»ã¹ãããããæ¹æ³ã§ç°¡çŽ åããŸãã ã©ãããããããç¹å®ã®ããã¹ãã®ç¿»èš³ãšè»¢åãšãšãã«åèªã®ãªã¹ããååŸããå¿ èŠããããŸããã ã¿ã¹ã¯ã¯é£ãããªããä»äºã«åãæãããŸããã å°ãåŸã«ã Pythonã¹ã¯ãªãããäœæãããŸãããPythonã¹ã¯ãªããã¯ãããããã¹ãŠç¥ã£ãŠããŸããããã«ãè±èªã®ããã¹ããå«ããã¹ãŠã®ãã¡ã€ã«ããåšæ³¢æ°èŸæžãååŸãããã£ãã®ã§ãããå°ãç¥ã£ãŠããŸãã ããã§ãã¹ã¯ãªããã®å°ããªã»ãããåºãŠããŸãããããã«ã€ããŠã話ãããããšæããŸãã
ã¹ã¯ãªããã¯ããã¡ã€ã«ãè§£æããè±èªã®åèªãéžæããããããæ£èŠåããŠãçµæã®è±èªã®åèªã®ãªã¹ãå šäœããæåã®countWordã®åèªãã«ãŠã³ãããã³çºè¡ããããšã«ãã£ãŠæ©èœããŸãã
æçµãã¡ã€ã«ã§ã¯ãåèªã¯æ¬¡ã®ããã«èšè¿°ãããŸãã
[ç¹°ãè¿ãåæ°] [åèªèªäœ] [åèªã®ç¿»èš³]
次ã«äœãèµ·ãããïŒ
- ãã¡ã€ã«ããè±èªã®åèªã®ãªã¹ããååŸããããšããå§ããŸãïŒ æ£èŠè¡šçŸã䜿çšïŒã
- 次ã«ãåèªã®æ£èŠåãéå§ããŸããã€ãŸããåèªãèªç¶ãªåœ¢åŒããèŸæžã«ä¿åãããŠãã圢åŒã«æ»ããŸãïŒããã§ã¯ã WordNet圢åŒã«ã€ããŠå°ãåŠç¿ããŸãïŒã
- 次ã«ããã¹ãŠã®æ£èŠåãããåèªã®åºçŸåæ°ãã«ãŠã³ãããŸãïŒããã¯è¿ éãã€ç°¡åã§ãïŒã
- ããã«ã StarDict圢åŒã®è©³çްã«ã€ããŠã説æããŸããããã¯ããããå©çšããŠç¿»èš³ãšæåèµ·ãããè¡ãããã§ãã
- æåŸã«ãçµæãã©ããã«æžã蟌ã¿ãŸãïŒ Excelãã¡ã€ã«ãéžæããŸããïŒã
ç§ã¯python 3.3ã䜿çšããŸããããå€ãã®å Žåãå¿ èŠãªã¢ãžã¥ãŒã«ãæ¬ èœããŠãããããpython 2.7ã§èšè¿°ããªãã£ãããšãåŸæããªããã°ãªããŸããã
åšæ³¢æ°ã¢ãã©ã€ã¶ãŒã
ããã§ã¯ãç°¡åãªãã®ããå§ããŠããã¡ã€ã«ãååŸããããããåèªã«è§£æããã«ãŠã³ããããœãŒãããçµæãçæããŸãããã
ãŸããããã¹ãå ã®è±èªã®åèªãæ€çŽ¢ããããã®æ£èŠè¡šçŸãäœæããŸãã
è±èªã®åèªãæ€çŽ¢ããããã®æ£èŠè¡šçŸ
ãoverããªã©ã®ç°¡åãªè±èªã®åèªã¯ã ãïŒ[a-zA-Z] +ïŒããšãã衚çŸã䜿çšããŠèŠã€ããããšãã§ããŸããè±èªã®ã¢ã«ãã¡ãããã®1ã€ä»¥äžã®æåãããã§æ€çŽ¢ãããŸãã
ãåžä»€å®ããªã©ã®è€åèªãèŠã€ããã®ã¯ããé£ãããªããŸãããåžä»€å®ãããin-ãããããŒãããšãã圢åŒã®é£ç¶ããéšååŒãæ¢ãå¿ èŠããããŸãã æ£èŠè¡šçŸã®åœ¢åŒã¯ã ãïŒïŒ[a-zA-Z] +-ïŒïŒ* [A-zA-Z] +ïŒãã§ãã
äžééšååŒãåŒã«ååšããå Žåãçµæã«ãå«ãŸããŸãã ãã®ããããåžä»€å®ããšããåèªã ãã§ãªããèŠã€ãã£ããã¹ãŠã®éšååŒãæ€çŽ¢çµæã«å«ãŸããŸãã æ£èŠè¡šçŸã¯ã ãïŒïŒ?: [A-zA-Z] +-ïŒïŒ* [A-zA-Z] +ïŒããšãã圢åŒãåããŸãã ç§ãã¡ã¯ãŸã ã衚çŸã«ãdid n'tããšãã圢åŒã®ã¢ãã¹ãããã£ãå«ãåèªãå«ããå¿ èŠããããŸãã ãããè¡ãã«ã¯ãæåã®éšååŒã®ã-ïŒãã眮ãæããŸã ã[-']ïŒã ã
ããã ãã§ããæ£èŠè¡šçŸã®æ¹åãçµäºããŸããããã«æ¹åããããšãã§ããŸãããããã«ã€ããŠè©³ãã説æããŸãã
ãïŒïŒïŒïŒ[a-zA-Z] + [-']ïŒïŒ* [a-zA-Z] +ïŒã
è±åèªã®åšæ³¢æ°åæåšã®å®è£
  è±èªã®åèªãæœåºããããããã«ãŠã³ãããŠçµæãçæã§ããå°ããªã¯ã©ã¹ãäœæããŸãã 
        
        
        
      
    
      # -*- coding: utf-8 -*- import re import os from collections import Counter class FrequencyDict: def __init__(): #        self.wordPattern = re.compile("((?:[a-zA-Z]+[-']?)*[a-zA-Z]+)") #  (  collections.Counter       ) self.frequencyDict = Counter() #   ,     def ParseBook(self, file): if file.endswith(".txt"): self.__ParseTxtFile(file, self.__FindWordsFromContent) else: print('Warning: The file format is not supported: "%s"' %file) #      txt def __ParseTxtFile(self, txtFile, contentHandler): try: with open(txtFile, 'rU') as file: for line in file: #    contentHandler(line) #       except Exception as e: print('Error parsing "%s"' % txtFile, e) #               def __FindWordsFromContent(self, content): result = self.wordPattern.findall(content) #       for word in result: word = word.lower() #      self.frequencyDict[word] += 1 #         #    countWord   ,      def FindMostCommonElements(self, countWord): dict = list(self.frequencyDict.items()) dict.sort(key=lambda t: t[0]) dict.sort(key=lambda t: t[1], reverse = True) return dict[0 : int(countWord)]
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     ããã§ãæ¬è³ªçã«ãåšæ³¢æ°èŸæžã䜿çšããäœæ¥ã¯å®äºã§ããŸãããäœæ¥ã¯ãŸã å§ãŸã£ãã°ããã§ãã åé¡ã¯ãããã¹ãå ã®åèªãææ³èŠåãèæ ®ããŠæžãããŠããããšã§ããã€ãŸããæ«å°Ÿã«edãingãªã©ã®åèªãããã¹ãå ã§çºçããå¯èœæ§ããããŸãã å®éãåè©ã®åœ¢åŒïŒamãisãareïŒã§ãããç°ãªãåèªãšããŠã«ãŠã³ããããŸãã
ãã®ãããåèªãåèªã«ãŠã³ã¿ãŒã«è¿œå ããåã«ãæ£ãã圢åŒã«ããå¿ èŠããããŸãã
次ã®ããŒãã«ç§»ããŸã- è±åèªã®ããŒãã©ã€ã¶ãŒãæžããŸã ã
è±èªã®è£å©è©
ã¹ããã³ã°ãšèŠåºãèªåã® 2ã€ã®ã¢ã«ãŽãªãºã ããããŸãã ã¹ããã³ã°ãšã¯ããã¥ãŒãªã¹ãã£ãã¯åæãæããããŒã¹ã¯äœ¿çšããŸããã èŠåºãèªåã§ã¯ãããŸããŸãªåèªããŒã¹ã䜿çšãããææ³èŠåã«åŸã£ã倿ãé©çšãããŸãã çµæã®èª€å·®ã¯åæ»æãããã¯ããã«å°ãããããç®çã«å¿ããŠè£é¡ã䜿çšããŸãã
èŠåºãèªåã«ã€ããŠã¯ããã§ã«ããããããªã©ãhabrã«é¢ããèšäºãããã€ããããŸãã ã 圌ãã¯AOTããŒã¹ã䜿çšããŠããŸãã ç¹°ãè¿ããããããŸããã§ããããŸããä»ã®èŠåºãèªåã®ããŒã¹ãæ¢ãã®ãé¢çœãã£ãã§ãã WordNetã«ã€ããŠã話ãããããšæããŸããããã®äžã§è£é¡ãäœæããŸãã ãŸãã å ¬åŒã®WordNet Webãµã€ãã§ãããã°ã©ã ã®ãœãŒã¹ã³ãŒããšããŒã¿ããŒã¹èªäœãããŠã³ããŒãã§ããŸãã WordNetã«ã¯å€ãã®æ©èœããããŸãããå¿ èŠãªæ©èœã¯ããäžéšãã€ãŸãåèªã®æ£èŠåã ãã§ãã
ããŒã¿ããŒã¹ã®ã¿ãå¿ èŠã§ãã WordNetã®ãœãŒã¹ããã»ã¹ïŒCïŒã¯ãæ£èŠåããã»ã¹èªäœã説æããŠããŸããæ¬è³ªçã«ã¯ãããããã¢ã«ãŽãªãºã ãåãåºããPythonã§æžãçŽããŸããã ããããã¡ãããPythonçšã®WordNetçšã®ã©ã€ãã©ãª-nltkããããŸããããŸããPython 2.7ã§ã®ã¿åäœããæ¬¡ã«èŠãéããæ£èŠåã§ã¯WordNetãµãŒããŒãžã®ãªã¯ãšã¹ãã®ã¿ãéä¿¡ãããŸãã
ã¬ã³ãã¿ã€ã¶ãŒã®äžè¬çãªã¯ã©ã¹å³ïŒ
 
      å³ãããããããã«ãæ£èŠåãããŠããã®ã¯4ã€ã®åè©ïŒåè©ãåè©ã圢容è©ãå¯è©ïŒã ãã§ãã
æ£èŠåããã»ã¹ãç°¡åã«èª¬æãããšã次ã®ããã«ãªããŸãã
1.åè©ããšã«ã2ã€ã®ãã¡ã€ã«ãWordNetããããŠã³ããŒããããŸã-ã€ã³ããã¯ã¹èŸæžïŒåè©ã«å¿ããååã€ã³ããã¯ã¹ãšæ¡åŒµåãå¯è©ã®å Žåã¯index.advãªã©ïŒãšäŸå€ãã¡ã€ã«ïŒåè©ã«å¿ããæ¡åŒµåexcãšååãããšãã°adv.excã®å ŽåïŒå¯è©ïŒã
2.æ£èŠåäžã«ãäŸå€ã®é åãæåã«ãã§ãã¯ãããåèªãååšããå Žåããã®æ£èŠåããã圢åŒãè¿ãããŸãã åèªãäŸå€ã§ã¯ãªãå Žåãåèªã®ãŽãŒã¹ãã¯ææ³èŠåã«åŸã£ãŠå§ãŸããŸããã€ãŸããèªå°ŸãåãæšãŠãããæ°ããèªå°Ÿãæ¥çãããæ¬¡ã«åèªãã€ã³ããã¯ã¹é åã§æ€çŽ¢ãããããã«ããå Žåãåèªã¯æ£èŠåããããšèŠãªãããŸãã ãã以å€ã®å Žåãã«ãŒã«ãçµäºããããåèªã以åã«æ£èŠåããããŸã§ã次ã®ã«ãŒã«ãé©çšãããŸãã
Lemmalizerã®ã¯ã©ã¹ïŒ
  åè©ã®åºæ¬ã¯ã©ã¹BaseWordNetItem.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- import os class BaseWordNetItem: #  def __init__(self, pathWordNetDict, excFile, indexFile): self.rule=() #        . self.wordNetExcDict={} #   self.wordNetIndexDict=[] #   self.excFile = os.path.join(pathWordNetDict, excFile) #      self.indexFile = os.path.join(pathWordNetDict, indexFile) #      self.__ParseFile(self.excFile, self.__AppendExcDict) #    self.__ParseFile(self.indexFile, self.__AppendIndexDict) #    self.cacheWords={} #  .     ,  -  ,  -   #       . #     : [-][][] def __AppendExcDict(self, line): #     ,     2      (  - ,  - ).         group = [item.strip() for item in line.replace("\n","").split(" ")] self.wordNetExcDict[group[0]] = group[1] #       . def __AppendIndexDict(self, line): #        group = [item.strip() for item in line.split(" ")] self.wordNetIndexDict.append(group[0]) #     ,          ,    def __ParseFile(self, file, contentHandler): try: with open(file, 'r') as openFile: for line in openFile: contentHandler(line) #       except Exception as e: raise Exception('File does not load: "%s"' %file) #      .      ,   . #        def _GetDictValue(self, dict, key): try: return dict[key] except KeyError: return None #     ,    True,  False. #  ,  ,   ,   (       ). def _IsDefined(self, word): if word in self.wordNetIndexDict: return True return False #   (  ) def GetLemma(self, word): word = word.strip().lower() #     if word == None: return None #   ,           lemma = self._GetDictValue(self.cacheWords, word) if lemma != None: return lemma # ,      ,    if self._IsDefined(word): return word #   ,    ,     lemma = self._GetDictValue(self.wordNetExcDict, word) if lemma != None: return lemma #    ,         ,      . lemma = self._RuleNormalization(word) if lemma != None: self.cacheWords[word] = lemma #       return lemma return None #     (  ,     ) def _RuleNormalization(self, word): #    ,         ,  ,   . for replGroup in self.rule: endWord = replGroup[0] if word.endswith(endWord): lemma = word #     lemma = lemma.rstrip(endWord) #    lemma += replGroup[1] #    if self._IsDefined(lemma): # ,        ,    ,    return lemma return None
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        åè©ãæ£èŠåããããã®ã¯ã©ã¹WordNetVerb.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- from WordNet.BaseWordNetItem import BaseWordNetItem #     #    BaseWordNetItem class WordNetVerb(BaseWordNetItem): def __init__(self, pathToWordNetDict): #   (BaseWordNetItem) BaseWordNetItem.__init__(self, pathToWordNetDict, 'verb.exc', 'index.verb') #        .  ,  "s"   "" , "ies"   "y" . self.rule = ( ["s" , "" ], ["ies" , "y" ], ["es" , "e" ], ["es" , "" ], ["ed" , "e" ], ["ed" , "" ], ["ing" , "e" ], ["ing" , "" ] ) #      GetLemma(word)     BaseWordNetItem
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        åè©ã®æ£èŠåã®ããã®ã¯ã©ã¹WordNetNoun.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- from WordNet.BaseWordNetItem import BaseWordNetItem #       #    BaseWordNetItem class WordNetNoun(BaseWordNetItem): def __init__(self, pathToWordNetDict): #   (BaseWordNetItem) BaseWordNetItem.__init__(self, pathToWordNetDict, 'noun.exc', 'index.noun') #        .  ,  "s"   "", "ses"   "s"  . self.rule = ( ["s" , "" ], ["'s" , "" ], ["'" , "" ], ["ses" , "s" ], ["xes" , "x" ], ["zes" , "z" ], ["ches" , "ch" ], ["shes" , "sh" ], ["men" , "man" ], ["ies" , "y" ] ) #    (  ) #       BaseWordNetItem,          , #      def GetLemma(self, word): word = word.strip().lower() #    ,         if len(word) <= 2: return None #     "ss",         if word.endswith("ss"): return None #   ,           lemma = self._GetDictValue(self.cacheWords, word) if lemma != None: return lemma # ,      ,    if self._IsDefined(word): return word #   ,    ,     lemma = self._GetDictValue(self.wordNetExcDict, word) if (lemma != None): return lemma #     "ful",   "ful",   ,     . #  ,  ,   "spoonsful"    "spoonful" suff = "" if word.endswith("ful"): word = word[:-3] #   "ful" suff = "ful" #   "ful",     #    ,         ,      . lemma = self._RuleNormalization(word) if (lemma != None): lemma += suff #     "ful",    self.cacheWords[word] = lemma #       return lemma return None
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        å¯è©WordNetAdverb.pyãæ£èŠåããããã®ã¯ã©ã¹ 
        
        
        
      
    
       # -*- coding: utf-8 -*- from WordNet.BaseWordNetItem import BaseWordNetItem #     #    BaseWordNetItem class WordNetAdverb(BaseWordNetItem): def __init__(self, pathToWordNetDict): #   (BaseWordNetItem) BaseWordNetItem.__init__(self, pathToWordNetDict, 'adv.exc', 'index.adv') #      (adv.exc)    (index.adv). #           .
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        圢容è©WordNetAdjective.pyãæ£èŠåããããã®ã¯ã©ã¹ 
        
        
        
      
    
       # -*- coding: utf-8 -*- from WordNet.BaseWordNetItem import BaseWordNetItem #       #    BaseWordNetItem class WordNetAdjective(BaseWordNetItem): def __init__(self, pathToWordNetDict): #   (BaseWordNetItem) BaseWordNetItem.__init__(self, pathToWordNetDict, 'adj.exc', 'index.adj') #        .  ,  "er"   ""  "e"  . self.rule = ( ["er" , "" ], ["er" , "e"], ["est" , "" ], ["est" , "e"] ) #      GetLemma(word)     BaseWordNetItem
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        Lemmatizer Lemmatizer.pyã®ã¯ã©ã¹ 
        
        
        
      
    
       # -*- coding: utf-8 -*- from WordNet.WordNetAdjective import WordNetAdjective from WordNet.WordNetAdverb import WordNetAdverb from WordNet.WordNetNoun import WordNetNoun from WordNet.WordNetVerb import WordNetVerb class Lemmatizer: def __init__(self, pathToWordNetDict): #    self.splitter = "-" #      adj = WordNetAdjective(pathToWordNetDict) #  noun = WordNetNoun(pathToWordNetDict) #  adverb = WordNetAdverb(pathToWordNetDict) #  verb = WordNetVerb(pathToWordNetDict) #  self.wordNet = [verb, noun, adj, adverb] #     (, ) def GetLemma(self, word): #     ,    ,   ( )  ,    wordArr = word.split(self.splitter) resultWord = [] for word in wordArr: lemma = self.__GetLemmaWord(word) if (lemma != None): resultWord.append(lemma) if (resultWord != None): return self.splitter.join(resultWord) return None #   (  ) def __GetLemmaWord(self, word): for item in self.wordNet: lemma = item.GetLemma(word) if (lemma != None): return lemma return None
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
      ããŠãæ£èŠåãçµäºããŸããã ããã§ãåšæ³¢æ°ã¢ãã©ã€ã¶ãŒã¯åèªãæ£èŠåã§ããŸãã ã¿ã¹ã¯ã®æåŸã®éšå-è±èªã®åèªã®ç¿»èš³ãšè»¢åãååŸããŸãã
StarDictèŸæžã䜿çšããå€åœèªç¿»èš³è
StarDictã«ã€ããŠã¯é·ãéæžãããšãã§ããŸããããã®åœ¢åŒã®äž»ãªå©ç¹ã¯ãã»ãšãã©ãã¹ãŠã®èšèªã§å€ãã®èŸæžããŒã¿ããŒã¹ãããããšã§ãã Habrã®StarDictãããã¯ã«é¢ããèšäºã¯ãããŸããã§ããããã®ã®ã£ãããåããæãæ¥ãŸããã StarDict圢åŒãèšè¿°ãããã¡ã€ã«ã¯éåžžããœãŒã¹èªäœã®é£ã«ãããŸãã
ãã¹ãŠã®è¿œå ãç Žæ£ããå Žåããã®åœ¢åŒã§æãæå°éã®ç¥èã»ããã¯æ¬¡ã®ããã«ãªããŸãã
åèŸæžã«ã¯3ã€ã®å¿ é ãã¡ã€ã«ãå«ãŸããŠããå¿ èŠããããŸãã
1. ifoæ¡åŒµåãæã€ãã¡ã€ã«-èŸæžèªäœã®äžè²«ãã説æãå«ãŸããŠããŸãã
2.æ¡åŒµåãidxã®ãã¡ã€ã«ã idxãã¡ã€ã«å ã®åãšã³ããªã¯ã次ã ã«ç¶ã3ã€ã®ãã£ãŒã«ãã§æ§æãããŸãã
- word_str -'\ 0'ã§çµããutf-8圢åŒã®æååã
- word_data_offset- .dictãã¡ã€ã«ã«æžã蟌ãåã®ãªãã»ããïŒ32ãŸãã¯64ããããµã€ãºïŒã
- word_data_size -.dictãã¡ã€ã«ã®ãšã³ããªå šäœã®ãµã€ãºã
3. dictæ¡åŒµåãæã€ãã¡ã€ã«-翻蚳èªäœãå«ãŸããŠããŸãã翻蚳ãžã®ãªãã»ãããç¥ãããšã§ã¢ã¯ã»ã¹ã§ããŸãïŒãªãã»ããã¯idxãã¡ã€ã«ã«èšé²ãããŸãïŒã
æçµçã«ã©ã®ã¯ã©ã¹ã«ãªããã«ã€ããŠèãçŽãããšãªãããã¡ã€ã«ããšã«1ã€ã®ã¯ã©ã¹ãäœæããããããçµåãã1ã€ã®äžè¬çãªStarDictã¯ã©ã¹ãäœæããŸããã
çµæã®ã¯ã©ã¹å³ïŒ
 
      StarDictã®ã¯ã©ã¹ïŒ
  èŸæžãšã³ããªã®åºæ¬ã¯ã©ã¹BaseStarDictItem.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- import os class BaseStarDictItem: def __init__(self, pathToDict, exp): #     self.encoding = "utf-8" #      self.dictionaryFile = self.__PathToFileInDirByExp(pathToDict, exp) #    self.realFileSize = os.path.getsize(self.dictionaryFile) #     path      exp def __PathToFileInDirByExp(self, path, exp): if not os.path.exists(path): raise Exception('Path "%s" does not exists' % path) end = '.%s'%(exp) list = [f for f in os.listdir(path) if f.endswith(end)] if list: return os.path.join(path, list[0]) #    else: raise Exception('File does not exist: "*.%s"' % exp)
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        ã¯ã©ã¹ifo.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- from StarDict.BaseStarDictItem import BaseStarDictItem from Frequency.IniParser import IniParser class Ifo(BaseStarDictItem): def __init__(self, pathToDict): #   (BaseStarDictItem) BaseStarDictItem.__init__(self, pathToDict, 'ifo') #     self.iniParser = IniParser(self.dictionaryFile) #   ifo   #        ,        self.bookName = self.__getParameterValue("bookname", None) #   [ ] self.wordCount = self.__getParameterValue("wordcount", None) #    ".idx"  [ ] self.synWordCount = self.__getParameterValue("synwordcount", "") #    ".syn"   [ ,    ".syn"] self.idxFileSize = self.__getParameterValue("idxfilesize", None) #  ( ) ".idx" .    ,        [ ] self.idxOffsetBits = self.__getParameterValue("idxoffsetbits", 32) #    (32  64),         .dict.      3.0.0,      32 [ ] self.author = self.__getParameterValue("author", "") #   [ ] self.email = self.__getParameterValue("email", "") #  [ ] self.description = self.__getParameterValue("description", "") #   [ ] self.date = self.__getParameterValue("date", "") #    [ ] self.sameTypeSequence = self.__getParameterValue("sametypesequence", None) # ,    [ ] self.dictType = self.__getParameterValue("dicttype", "") #     ,  WordNet[ ] def __getParameterValue(self, key, defaultValue): try: return self.iniParser.GetValue(key) except: if defaultValue != None: return defaultValue raise Exception('\n"%s" has invalid format (missing parameter: "%s")' % (self.dictionaryFile, key))
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        ã¯ã©ã¹idx.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- from struct import unpack from StarDict.BaseStarDictItem import BaseStarDictItem class Idx(BaseStarDictItem): #  def __init__(self, pathToDict, wordCount, idxFileSize, idxOffsetBits): #   (BaseStarDictItem) BaseStarDictItem.__init__(self, pathToDict, 'idx') self.idxDict ={} # , self.idxDict = {'.': [_____dict, _____dict], ...} self.idxFileSize = int(idxFileSize) #   .idx,   .ifo  self.idxOffsetBytes = int(idxOffsetBits/8) #  ,         .dict.        self.wordCount = int(wordCount) #    ".idx"  #    (  .ifo    .idx  [idxfilesize]      ) self.__CheckRealFileSize() #   self.idxDict    .idx self.__FillIdxDict() #    (  .ifo     [wordcount]        .idx ) self.__CheckRealWordCount() #    ,   .ifo ,           def __CheckRealFileSize(self): if self.realFileSize != self.idxFileSize: raise Exception('size of the "%s" is incorrect' %self.dictionaryFile) #    ,   .ifo ,       .idx       def __CheckRealWordCount(self): realWordCount = len(self.idxDict) if realWordCount != self.wordCount: raise Exception('word count of the "%s" is incorrect' %self.dictionaryFile) #         ,      def __getIntFromByteArray(self, sizeInt, stream): byteArray = stream.read(sizeInt) #   ,    #       formatCharacter = 'L' #   "unsigned long" ( sizeInt = 4) if sizeInt == 8: formatCharacter = 'Q' #   "unsigned long long" ( sizeInt = 8) format = '>' + formatCharacter #     : "  " + " " #  '>' - ,       int( formatCharacter)     . integer = (unpack(format, byteArray))[0] #       return int(integer) #    .idx    (   3- )       self.idxDict def __FillIdxDict(self): languageWord = "" with open(self.dictionaryFile, 'rb') as stream: while True: byte = stream.read(1) #    if not byte: break #    ,     if byte != b'\0': #        '\0',      languageWord += byte.decode("utf-8") else: #    '\0',  ,         ("     dict"  "     dict") wordDataOffset = self.__getIntFromByteArray(self.idxOffsetBytes, stream) #    "     dict" wordDataSize = self.__getIntFromByteArray(4, stream) #    "     dict" self.idxDict[languageWord] = [wordDataOffset, wordDataSize] #    self.idxDict :   +  +   languageWord = "" #  ,     #       .dict ("     dict"  "     dict"). #      ,   None def GetLocationWord(self, word): try: return self.idxDict[word] except KeyError: return [None, None]
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
        ã¯ã©ã¹Dict.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- from StarDict.BaseStarDictItem import BaseStarDictItem #     ( , sametypesequence = tm). #  -x    (       utf-8,  '\0'): # 'm' -     utf-8,  '\0' # 'l' -       utf-8,  '\0' # 'g' -        Pango # 't' -    utf-8,  '\0' # 'x' -    utf-8,    xdxf # 'y' -    utf-8,  (YinBiao)   (KANA)  # 'k' -    utf-8,    KingSoft PowerWord XML # 'w' -     MediaWiki # 'h' -     Html # 'n' -     WordNet # 'r' -    .      (jpg),  (wav),  (avi), (bin)   . # 'W' - wav  # 'P' -  # 'X' -       class Dict(BaseStarDictItem): def __init__(self, pathToDict, sameTypeSequence): #   (BaseStarDictItem) BaseStarDictItem.__init__(self, pathToDict, 'dict') # ,     self.sameTypeSequence = sameTypeSequence def GetTranslation(self, wordDataOffset, wordDataSize): try: #              .dict self.__CheckValidArguments(wordDataOffset, wordDataSize) #   .dict   with open(self.dictionaryFile, 'rb') as file: #   file.seek(wordDataOffset) #      ,     byteArray = file.read(wordDataSize) #   ,     return byteArray.decode(self.encoding) #       o (self.encoding     BaseDictionaryItem) except Exception: return None def __CheckValidArguments(self, wordDataOffset, wordDataSize): if wordDataOffset is None: pass if wordDataOffset < 0: pass endDataSize = wordDataOffset + wordDataSize if wordDataOffset < 0 or wordDataSize < 0 or endDataSize > self.realFileSize: raise Exception
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
      ããŠã翻蚳è ã¯æºåãã§ããŠããŸããä»åºŠã¯ãåšæ³¢æ°ã¢ãã©ã€ã¶ãã¯ãŒãããŒãã©ã€ã¶ãããã³ãã©ã³ã¹ã¬ãŒã¿ãçµã¿åãããå¿ èŠããããŸããmain.pyãã¡ã€ã«ãšSettings.inièšå®ãã¡ã€ã«ãäœæããŸãã
 ã¡ã€ã³ãã¡ã€ã«main.py 
        
        
        
      
    
       # -*- coding: utf-8 -*- import os import xlwt3 as xlwt from Frequency.IniParser import IniParser from Frequency.FrequencyDict import FrequencyDict from StarDict.StarDict import StarDict ConfigFileName="Settings.ini" class Main: def __init__(self): self.listLanguageDict = [] #      StarDict self.result = [] #      ( , ,  ) try: #    - config = IniParser(ConfigFileName) self.pathToBooks = config.GetValue("PathToBooks") #   ini   PathToBooks,     (,   ),      self.pathResult = config.GetValue("PathToResult") #   ini   PathToResult,       self.countWord = config.GetValue("CountWord") #   ini   CountWord,       ,    self.pathToWordNetDict = config.GetValue("PathToWordNetDict") #   ini   PathToWordNetDict,      WordNet self.pathToStarDict = config.GetValue("PathToStarDict") #   ini   PathToStarDict,        StarDict #    StarDict           .      listPathToStarDict listPathToStarDict = [item.strip() for item in self.pathToStarDict.split(";")] #       StarDict     for path in listPathToStarDict: languageDict = StarDict(path) self.listLanguageDict.append(languageDict) #   ,      self.listBooks = self.__GetAllFiles(self.pathToBooks) #    self.frequencyDict = FrequencyDict(self.pathToWordNetDict) #  ,   StarDict  WordNet.    ,      ,     self.__Run() except Exception as e: print('Error: "%s"' %e) #    ,    path def __GetAllFiles(self, path): try: return [os.path.join(path, file) for file in os.listdir(path)] except Exception: raise Exception('Path "%s" does not exists' % path) #     ,      .        ,    def __GetTranslate(self, word): valueWord = "" for dict in self.listLanguageDict: valueWord = dict.Translate(word) if valueWord != "": return valueWord return valueWord #   ( , ,  )   countWord     Excel def __SaveResultToExcel(self): try: if not os.path.exists(self.pathResult): raise Exception('No such directory: "%s"' %self.pathResult) if self.result: description = 'Frequency Dictionary' style = xlwt.easyxf('font: name Times New Roman') wb = xlwt.Workbook() ws = wb.add_sheet(description + ' ' + self.countWord) nRow = 0 for item in self.result: ws.write(nRow, 0, item[0], style) ws.write(nRow, 1, item[1], style) ws.write(nRow, 2, item[2], style) nRow +=1 wb.save(os.path.join(self.pathResult, description +'.xls')) except Exception as e: print(e) #      def __Run(self): #       for book in self.listBooks: self.frequencyDict.ParseBook(book) #   countWord        mostCommonElements = self.frequencyDict.FindMostCommonElements(self.countWord) #      for item in mostCommonElements: word = item[0] counterWord = item[1] valueWord = self.__GetTranslate(word) self.result.append([counterWord, word, valueWord]) #      Excel self.__SaveResultToExcel() if __name__ == "__main__": main = Main()
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
       Settings.inièšå®ãã¡ã€ã« 
        
        
        
      
    
       ;   (,   ),      PathToBooks = e:\Bienne\Frequency\Books ;    WordNet(    ) PathToWordNetDict = e:\Bienne\Frequency\WordNet\wn3.1.dict\ ;      StarDict(   ) PathToStarDict = e:\Bienne\Frequency\Dict\stardict-comn_dictd04_korolew ;     ,        Excel CountWord = 100 ; ,    (   Excel     -  , ,  ) PathToResult = e:\Bienne\Frequency\Books
      
      
        
        
        
      
    
        
        
        
      
      
        
        
        
      
    
     
      ããŠã³ããŒãããŠè¿œå ã§ã€ã³ã¹ããŒã«ããå¿ èŠãããå¯äžã®ãµãŒãããŒãã£ã©ã€ãã©ãªã¯xlwtã§ããExcel圢åŒã®ãã¡ã€ã«ãäœæããå¿ èŠããããŸãïŒçµæã¯ããã«æžã蟌ãŸããŸãïŒã
PathToStarDict倿°ã®Settings.inièšå®ãã¡ã€ã«ã§ã¯ãã;ãã䜿çšããŠè€æ°ã®èŸæžãäœæã§ããŸãããã®å Žåãåèªã¯èŸæžã®é ã«æ€çŽ¢ãããŸã-åèªãæåã®èŸæžã§èŠã€ãã£ãå Žåãæ€çŽ¢ã¯çµäºããŸãããã以å€ã®å Žåãä»ã®ãã¹ãŠã®StarDictèŸæžãæ€çŽ¢ãããŸãã
ããšãã
ãã®èšäºã§èª¬æãããŠãããã¹ãŠã®ãœãŒã¹ã¯ãgithubããããŠã³ããŒãã§ããŸãã
éç¥ïŒ
- ã¹ã¯ãªããã¯ãŠã£ã³ããŠã®äžã«æžãããŸããã
- 䜿çšãããpython 3.3 ;
- ããã«ãxlwtã©ã€ãã©ãªãExcelã§åäœããããã«é 眮ããå¿ èŠããããŸãã
- åå¥ã«ãWordNetããã³StarDictã®èŸæžããŒã¿ããŒã¹ãããŠã³ããŒãããå¿ èŠããããŸãïŒStarDictèŸæžã®å Žåã¯ãã¢ãŒã«ã€ããããããã¯ãã¡ã€ã«ãdictæ¡åŒµåã§ããã«è§£åããå¿ èŠããããŸãïŒã
- Settings.iniãã¡ã€ã«ã§ãèŸæžã®ãã¹ãšçµæãä¿åããå Žæãæå®ããå¿ èŠããããŸãã
- , StarDict, ( ).