Basic principles
This text continues the post about the Strutext C++ text-processing library. It describes how the library implements the lexical level of language representation, and morphology in particular.
In the author's view, the main tasks to solve when implementing a software model of the lexical level are the following:
- Extracting meaningful chains of characters from the source text, so that the text is represented as a sequence of words.
- Identifying each extracted chain as an element of a lexical type.
- Determining the lexical attributes of the extracted chain (see below for details).
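A lexical type is usually represented as a finite set of character chains that carry the same meaning in the sentences of a language. The elements of a lexical type are usually called word forms, the set of word forms is called a paradigm, and the lexical type itself is called a word or a lemma. For example, the Russian lexical type "мама" ("mom") consists of the word forms {мама, мамы, маме, ..., мамы, мам, ...}.
Lexical types are divided into syntactic categories (parts of speech). A part of speech defines the role a word plays in a sentence. This role matters for determining the correct place of the word in the sentence and therefore for determining the meaning of the sentence. Well-known parts of speech in Russian are the noun, adjective, verb, adverb, and so on.
The word forms of a lexical type have properties. Such properties are called lexical attributes or lexical features. Which properties apply depends on the syntactic category the lexical type belongs to. For example, case is important for nouns, but verbs do not have this attribute.
The specific set of syntactic categories used to group lexical types, and the lexical attributes those categories carry, depend both on the language being modeled and on the concrete model of lexical analysis. Below we consider the lexical model of AOT.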
Lexical ambiguities
Ambiguities can arise while words are being extracted from the source text. Two kinds of ambiguity are considered here.
- Ambiguity of the first kind arises when a lexical type is assigned to a chain extracted from the text. Consider the example "мама мыла раму" ("Mom washed the frame"). Here the chain "мыла" can be a form of the verb "мыть" ("to wash") as well as a form of the noun "мыло" ("soap"). Such ambiguous cases are also called lexical homonymy.
- Ambiguity of the second kind arises when the source text is split into a chain of words. In most natural languages words are separated by spaces, although this principle is sometimes violated (compound words in German are one example). Programming languages offer interesting examples too. Consider an expression of the form "a >> b" in C++. In classic C this expression is interpreted unambiguously: identifier "a", right-shift operator ">>", identifier "b". In C++11 and later, however, the same characters can mark the end of nested template parameter lists when "a" is the last parameter of the inner list. In that case the sequence of words is: identifier "a", end of template parameter list ">", end of template parameter list ">", identifier "b".
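The two readings of `>>` can be seen side by side in a short sketch. This is an illustration only: the function names `ShiftRight` and `NestedRows` are invented for this example, and a C++11 or later compiler is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ">>" read as ONE token: the right-shift operator.
int ShiftRight(int a, int b) {
  return a >> b;
}

// ">>" read as TWO tokens: each ">" closes one template argument list.
// Before C++11 this had to be written "std::vector<std::vector<int> >".
std::size_t NestedRows() {
  std::vector<std::vector<int>> table(3, std::vector<int>(2, 0));
  return table.size();
}
```

The tokenizer cannot decide between the two readings from the characters alone; it needs to know whether a template argument list is currently open.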
This text considers only lexical ambiguities of the first kind.
Morphological model of the AOT dictionary
The Strutext library implements the morphological model from AOT, so some space is devoted to describing it here.
In the AOT dictionary each lexical type is defined by two parameters:
- A base string (the root of the word), to which suffixes are appended to form word forms.
- The number of a declension paradigm, which is a list of pairs (suffix, set of lexical attributes).
There are relatively few distinct sets of lexical attributes. They are listed in a special file, and each set is encoded with a two-letter code. Example:
    A ,,
    A ,,
    A ,,,2
    ...
    Y ,,,,,
    Y ,,,,,
    ...
    a ,,1,
    a ,,1,
    a ,,2,
    ...
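Here the first element of each line is the two-letter code of the attribute set, the third element is the part-of-speech code (С for noun, П for adjective, Г for verb, and so on), and the grammeme codes follow as a comma-separated list.
The dictionary description file consists of five sections, two of which are the most important: the section that describes declension paradigms and the section of bases (lexical types). Each line of the paradigm section describes one declension paradigm. In the section of lexical types, each entry gives the base together with the line number of its declension paradigm.
For example, consider the word "зеленка" (zelenka). In the AOT dictionary the lexical type of this word is given by a line of the form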
15 12 1 -
Here 15 is the number of the declension paradigm in the paradigm section. The line for this paradigm looks like this:
%*%*%*%*%*%*%*%*%*%*%*%*%*
The pairs in a paradigm are separated from each other by the symbol "%", and the two elements of a pair are separated by the symbol "*". The first pair (КА, га) defines the word form ЗЕЛЕН + КА = ЗЕЛЕНКА and has the attribute set га = G С жр,ед,им, that is: noun, feminine, singular, nominative. The other pairs of the paradigm are decoded in the same way.
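As an illustration of the "%suffix*code" layout just described, here is a minimal parsing sketch. It is not the AOT or Strutext parser: `ParadigmEntry`, `ParseParadigm` and `BuildForms` are hypothetical names, and the sample strings in the usage note are Latin transliterations chosen to keep the example ASCII-only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One paradigm entry: (suffix, two-letter attribute code).
typedef std::pair<std::string, std::string> ParadigmEntry;

// Splits a line of the form "%SUFFIX*CODE%SUFFIX*CODE..." into entries.
// Items that contain no '*' separator are skipped.
std::vector<ParadigmEntry> ParseParadigm(const std::string& line) {
  std::vector<ParadigmEntry> result;
  std::size_t pos = 0;
  while (pos < line.size()) {
    if (line[pos] != '%') { ++pos; continue; }
    std::size_t next = line.find('%', pos + 1);
    if (next == std::string::npos) next = line.size();
    const std::string item = line.substr(pos + 1, next - pos - 1);
    const std::size_t star = item.find('*');
    if (star != std::string::npos) {
      result.push_back(ParadigmEntry(item.substr(0, star), item.substr(star + 1)));
    }
    pos = next;
  }
  return result;
}

// Builds every word form of a lexical type from its base and its paradigm.
std::vector<std::string> BuildForms(const std::string& base,
                                    const std::vector<ParadigmEntry>& paradigm) {
  std::vector<std::string> forms;
  for (std::size_t i = 0; i < paradigm.size(); ++i) {
    forms.push_back(base + paradigm[i].first);
  }
  return forms;
}
```

For instance, `ParseParadigm("%KA*ga%KI*gb")` yields the two pairs (KA, ga) and (KI, gb), and `BuildForms("ZELEN", ...)` then produces ZELENKA and ZELENKI. An empty suffix (an item such as "*ga") is kept as an empty string, which is how a form equal to the bare base would be expressed in this sketch.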
The word-encoding scheme used by AOT has its advantages and disadvantages. We do not discuss them here and note only one interesting fact: the dictionary contains lexical types with an empty base. For example, the plural of the word "человек" ("person") is the word form "люди" ("people"), which shares no common base with the form "человек". Such a word therefore has to be defined by simply enumerating its word forms:
%*%*%*%*%*%*%*%*%*%*%*%*%*%*%*%*
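This paradigm can also be used by other words that do have a non-empty root, such as "богочеловек" ("god-man") and "обезьяночеловек" ("ape-man").
Let us look more closely at the set of syntactic categories and the corresponding lexical attributes of the AOT dictionary.
AOT syntactic categories
As noted above, the syntactic categories of the AOT dictionary are defined in a separate file as a set of lines, each of which maps a two-letter code to a part of speech and a set of lexical attributes. In the Strutext library, parts of speech and their attributes are represented as a hierarchy of C++ classes. Let us consider this implementation in more detail.
The models of the AOT syntactic categories are defined in the morpho/models directory. Models are provided for Russian and English. Consider some fragments of the file morpho/models/rus_model.h, which describes the Russian model.
The base class of all models is the abstract class PartOfSpeech. It contains the language tags as an enumeration and declares a virtual method that returns the tag: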
    class PartOfSpeech : private boost::noncopyable {
    public:
      /// Type of smart pointer to the class object.
      typedef boost::shared_ptr<PartOfSpeech> Ptr;

      /// Language tag definitions.
      enum LanguageTag {
        UNKNOWN_LANG = 0    ///< Unknown language.
        , RUSSIAN_LANG = 1  ///< Russian language.
        , ENGLISH_LANG = 2  ///< English language.
      };

      /// Language tag.
      virtual LanguageTag GetLangTag() const = 0;

      /// Virtual destruction for abstract class.
      virtual ~PartOfSpeech() {}
    };
The base class of all Russian syntactic categories inherits from this class:
    struct RussianPos : public PartOfSpeech {
      /// Type of smart pointer to the class object.
      typedef boost::shared_ptr<RussianPos> Ptr;

      /// Possible parts of speech.
      enum PosTag {
        UNKNOWN_PS = 0                  ///< Unknown part of speech.
        , NOUN_PS = 1                   ///< Noun.
        , ADJECTIVE_PS = 2              ///< Adjective.
        , PRONOUN_NOUN_PS = 3           ///< Pronoun-noun.
        , VERB_PS = 4                   ///< Verb.
        , PARTICIPLE_PS = 5             ///< Participle.
        , ADVERB_PARTICIPLE_PS = 6      ///< Adverbial participle.
        , PRONOUN_PREDICATIVE_PS = 7    ///< Pronoun-predicative.
        , PRONOUN_ADJECTIVE_PS = 8      ///< Pronominal adjective.
        , NUMERAL_QUANTITATIVE_PS = 9   ///< Numeral (quantitative).
        , NUMERAL_ORDINAL_PS = 10       ///< Ordinal numeral.
        , ADVERB_PS = 11                ///< Adverb.
        , PREDICATE_PS = 12             ///< Predicative.
        , PREPOSITION_PS = 13           ///< Preposition.
        , CONJUCTION_PS = 14            ///< Conjunction.
        , INTERJECTION_PS = 15          ///< Interjection.
        , PARTICLE_PS = 16              ///< Particle.
        , INTRODUCTORY_WORD_PS = 17     ///< Parenthetical word.
        , UP_BOUND_PS
      };

      /// Number.
      enum Number {
        UNKNOUN_NUMBER = 0         ///< Unknown number.
        , SINGULAR_NUMBER = 0x01   ///< Singular.
        , PLURAL_NUMBER = 0x02     ///< Plural.
      };

      /// Language.
      enum Lang {
        NORMAL_LANG = 0       ///< Normal language.
        , SLANG_LANG = 1      ///< Slang.
        , ARCHAIZM_LANG = 2   ///< Archaism.
        , INFORMAL_LANG = 3   ///< Informal language.
      };

      /// Gender definitions.
      enum Gender {
        UNKNOWN_GENDER = 0          ///< Unknown gender value.
        , MASCULINE_GENDER = 0x01   ///< Masculine.
        , FEMININE_GENDER = 0x02    ///< Feminine.
        , NEUTER_GENDER = 0x04      ///< Neuter.
      };

      /// Case definition.
      enum Case {
        UNKNOWN_CASE = 0            ///< Unknown case.
        , NOMINATIVE_CASE = 1       ///< Nominative.
        , GENITIVE_CASE = 2         ///< Genitive.
        , GENITIVE2_CASE = 3        ///< Second genitive.
        , DATIVE_CASE = 4           ///< Dative.
        , ACCUSATIVE_CASE = 5       ///< Accusative.
        , INSTRUMENTAL_CASE = 6     ///< Instrumental.
        , PREPOSITIONAL_CASE = 7    ///< Prepositional.
        , PREPOSITIONAL2_CASE = 8   ///< Second prepositional.
        , VOCATIVE_CASE = 9         ///< Vocative.
      };

      /// Time.
      enum Time {
        UNKNOWN_TIME = 0        ///< Unknown time.
        , PRESENT_TIME = 0x01   ///< Present.
        , FUTURE_TIME = 0x02    ///< Future.
        , PAST_TIME = 0x04      ///< Past.
      };

      /// Person.
      enum Person {
        UNKNOWN_PERSON = 0       ///< Unknown person.
        , FIRST_PERSON = 0x01    ///< First person.
        , SECOND_PERSON = 0x02   ///< Second person.
        , THIRD_PERSON = 0x04    ///< Third person.
      };

      /// Entity kind.
      enum Entity {
        UNKNOWN_ENTITY = 0          ///< Unknown entity, for ordinal words.
        , ABBREVIATION_ENTITY = 1   ///< Abbreviation.
        , FIRST_NAME_ENTITY = 2     ///< First name.
        , MIDDLE_NAME_ENTITY = 3    ///< Middle name.
        , FAMILY_NAME_ENTITY = 4    ///< Family name.
      };

      /// Animation.
      enum Animation {
        UNKNOWN_ANIMATION = 0
        , ANIMATE_ANIMATION = 0x01     ///< Animate.
        , INANIMATE_ANIMATION = 0x02   ///< Inanimate.
      };

      /// Voice definition.
      enum Voice {
        UNKNOWN_VOICE = 0        ///< Unknown voice.
        , ACTIVE_VOICE = 0x01    ///< Active voice.
        , PASSIVE_VOICE = 0x02   ///< Passive voice.
      };

      /// Language tag.
      LanguageTag GetLangTag() const { return RUSSIAN_LANG; }

      /// The class is abstract, hence the virtual destructor.
      virtual ~RussianPos() {}

      /// Get part of speech tag.
      virtual PosTag GetPosTag() const = 0;

      /// Serialization implementation.
      virtual void Serialize(uint32_t& out) const = 0;

      /// Deserialization implementation.
      virtual void Deserialize(const uint32_t& in) = 0;

      /// Write POS signature.
      static void WritePosSign(PosTag pos, uint32_t& out) {
        // Write to lower 5 bits.
        out |= static_cast<uint32_t>(pos);
      }

      /// Read POS signature.
      static PosTag ReadPosSign(const uint32_t& in) {
        return PosTag(in & 0x1f);
      }
    };
This class contains the labels of the syntactic categories in the form of the PosTag enumeration and defines the lexical attributes. Besides the grammatical components, the class declares the Serialize and Deserialize methods for converting to and from a binary format. For every syntactic type, the conversion targets four bytes represented by the uint32_t type.
The RussianPos class is abstract; the classes that represent specific syntactic categories inherit from it. For example, the Noun class defines the noun:
    struct Noun : public RussianPos {
      Noun()
        : number_(UNKNOUN_NUMBER)
        , lang_(NORMAL_LANG)
        , gender_(UNKNOWN_GENDER)
        , case_(UNKNOWN_CASE)
        , entity_(UNKNOWN_ENTITY) {}

      /// Get part of speech tag.
      PosTag GetPosTag() const { return NOUN_PS; }

      /**
       * \brief Serialization implementation.
       *
       * Binary map of the object:
       *   13       3       4       3       2       2         5
       * -----------------------------------------------------------
       * Unused | Entity | Case | Gender | Lang | Number | POS tag |
       * -----------------------------------------------------------
       *
       * \param[out] ob The buffer to write to.
       */
      void Serialize(uint32_t& ob) const {
        ob |= static_cast<uint32_t>(number_) << 5;
        ob |= static_cast<uint32_t>(lang_) << 7;
        ob |= static_cast<uint32_t>(gender_) << 9;
        ob |= static_cast<uint32_t>(case_) << 12;
        ob |= static_cast<uint32_t>(entity_) << 16;
      }

      /**
       * \brief Deserialization implementation.
       *
       * Binary map of the object:
       *   13       3       4       3       2       2         5
       * -----------------------------------------------------------
       * Unused | Entity | Case | Gender | Lang | Number | POS tag |
       * -----------------------------------------------------------
       *
       * \param ib The buffer to read from.
       */
      void Deserialize(const uint32_t& ib) {
        number_ = static_cast<Number>((ib & 0x0060) >> 5);
        lang_   = static_cast<Lang>((ib & 0x0180) >> 7);
        gender_ = static_cast<Gender>((ib & 0x0e00) >> 9);
        case_   = static_cast<Case>((ib & 0xf000) >> 12);
        entity_ = static_cast<Entity>((ib & 0x070000) >> 16);
      }

      Number number_;
      Lang   lang_;
      Gender gender_;
      Case   case_;
      Entity entity_;
    };
The Noun class stores lexical attributes such as number, kind of language (normal, archaic, colloquial, and so on), gender, case, and a marker that the word is a name or an abbreviation.
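The bit layout used by Noun can be exercised in isolation. The following is a sketch, not the library code: `NounBits`, `Pack` and `Unpack` are hypothetical names, and the field widths (5, 2, 2, 3, 4, 3 bits from the least significant end) are taken from the layout comment shown in the class.

```cpp
#include <cassert>
#include <stdint.h>

// Plain-integer stand-in for the enum-typed fields of Noun.
struct NounBits {
  uint32_t pos;        // 5 bits, offset 0
  uint32_t number;     // 2 bits, offset 5
  uint32_t lang;       // 2 bits, offset 7
  uint32_t gender;     // 3 bits, offset 9
  uint32_t gram_case;  // 4 bits, offset 12
  uint32_t entity;     // 3 bits, offset 16
};

// Packs all fields into one 32-bit word; each value is masked to its width.
uint32_t Pack(const NounBits& n) {
  uint32_t out = 0;
  out |= (n.pos & 0x1fu);
  out |= (n.number & 0x03u) << 5;
  out |= (n.lang & 0x03u) << 7;
  out |= (n.gender & 0x07u) << 9;
  out |= (n.gram_case & 0x0fu) << 12;
  out |= (n.entity & 0x07u) << 16;
  return out;
}

// Restores the fields from a packed word.
NounBits Unpack(uint32_t in) {
  NounBits n;
  n.pos = in & 0x1fu;
  n.number = (in >> 5) & 0x03u;
  n.lang = (in >> 7) & 0x03u;
  n.gender = (in >> 9) & 0x07u;
  n.gram_case = (in >> 12) & 0x0fu;
  n.entity = (in >> 16) & 0x07u;
  return n;
}
```

With the enum values from RussianPos, a feminine singular nominative noun (pos 1, number 1, gender 2, case 1) packs to 0x1421. Note that the sketch starts from a zeroed word, whereas Serialize in the fragment above ORs into the caller's buffer, so a caller of the library method presumably has to supply a clean buffer.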
State machines for encoding dictionaries
To store dictionaries and to look words up in them efficiently, the Strutext library uses finite state machines. They are defined by the corresponding C++ types in the automata directory.
Recall that a finite state machine is defined by a transition function that maps pairs (state, symbol) to a state, δ: Q × V → Q. There is one start state, in which the machine begins its work, and some number of "accepting" states. The machine reads the input string symbol by symbol. If the transition function maps the current state and the symbol just read to some state, the machine "moves" to that new state, and the cycle of reading a symbol starts again. The automaton can stop in two cases: when there is no transition for the pair (current state, symbol read), and when the whole chain of symbols has been read to the end. In the first case the input chain is considered not accepted by the machine. In the second case the chain is accepted if the machine is in one of the accepting states after it stops.
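The acceptance procedure described above fits in a few lines. This is a minimal sketch that stores the transition function in a std::map keyed by (state, symbol); `SimpleFsm` and `MakeExample` are hypothetical names and do not reflect the Strutext class layout.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

typedef unsigned StateId;

struct SimpleFsm {
  std::map<std::pair<StateId, char>, StateId> delta;  // transition function
  std::set<StateId> accepting;                        // accepting states
  StateId start;                                      // start state

  SimpleFsm() : start(0) {}

  // Runs the machine over the chain and reports whether it is accepted.
  bool Accepts(const std::string& chain) const {
    StateId state = start;
    for (std::size_t i = 0; i < chain.size(); ++i) {
      std::map<std::pair<StateId, char>, StateId>::const_iterator it =
          delta.find(std::make_pair(state, chain[i]));
      if (it == delta.end()) return false;  // no transition: chain rejected
      state = it->second;
    }
    // Whole chain read: accepted only if we stopped in an accepting state.
    return accepting.count(state) != 0;
  }
};

// Builds a machine that accepts exactly the chains "ab" and "abc".
SimpleFsm MakeExample() {
  SimpleFsm fsm;
  fsm.delta[std::make_pair(0u, 'a')] = 1;
  fsm.delta[std::make_pair(1u, 'b')] = 2;
  fsm.delta[std::make_pair(2u, 'c')] = 3;
  fsm.accepting.insert(2);
  fsm.accepting.insert(3);
  return fsm;
}
```

Both stopping cases from the text are visible in `Accepts`: the early `return false` is the missing-transition case, and the final check is the end-of-chain case.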
So each time a new symbol of the input chain is read, the automaton has to find the new state for the pair (state, symbol). In the Strutext library the implementation of this lookup is factored out into a separate class called Transition. An automaton is an array of transition objects, one for each state (automata/fsm.h):
    template <typename TransImpl>
    struct FiniteStateMachine {
      /// Type of transition table.
      typedef TransImpl Transitions;
      ...
      /// State definition.
      struct State {
        Transitions trans_;  ///< Move table.
        bool is_accepted_;   ///< Is the state acceptable.

        /// Default initialization.
        explicit State(bool is_accepted = false) : is_accepted_(is_accepted) {}
      };

      /// Type of states' list.
      typedef std::vector<State> StateTable;
      ...
      StateTable states_;  ///< The table of states.
    };
Here the TransImpl template parameter represents the transition function.
The Strutext library has two ways of implementing the transition function. One is based on an ordinary std::map (automata/flex_transitions.h), where the key is the symbol code and the value is the state number. The other (automata/flat_transitions.h) is based on a plain array allocated for every possible symbol code. Each element of the array holds a state number. The value zero is reserved for the invalid state, that is, it means there is no transition. For a non-zero value, the pair (array index = symbol code, state number in the array cell) defines a transition.
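The trade-off between the two transition tables can be sketched as follows. These are simplified stand-ins, not the contents of flex_transitions.h or flat_transitions.h: the names `MapTransitions`, `ArrayTransitions`, `Step` and the 256-symbol alphabet are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

typedef unsigned StateId;
const StateId kInvalidState = 0;  // zero is reserved: "no transition"

// Map-based table: memory grows with the number of stored transitions.
struct MapTransitions {
  std::map<unsigned char, StateId> table;
  void Add(unsigned char symbol, StateId to) { table[symbol] = to; }
  StateId Go(unsigned char symbol) const {
    std::map<unsigned char, StateId>::const_iterator it = table.find(symbol);
    return it == table.end() ? kInvalidState : it->second;
  }
};

// Array-based table: one cell per possible symbol code, constant-time lookup.
struct ArrayTransitions {
  std::vector<StateId> table;
  ArrayTransitions() : table(256, kInvalidState) {}
  void Add(unsigned char symbol, StateId to) { table[symbol] = to; }
  StateId Go(unsigned char symbol) const { return table[symbol]; }
};

// Works with either table, in the spirit of the TransImpl template parameter.
template <typename Trans>
StateId Step(const Trans& trans, unsigned char symbol) {
  return trans.Go(symbol);
}
```

The array variant spends a full row of cells on every state, which pays off for dense states with many outgoing transitions; the map variant is the more economical choice for sparse states.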
The FiniteStateMachine class can say nothing about an input chain except whether the chain is accepted. To store additional information about accepted chains, attributes have to be added to the accepting states. This is done in the AttributeFsm template class. As template parameters it takes the implementation of the transition function and the type of the attribute of an accepting state. Note that attributes can be attached to non-accepting states as well (although it is unclear whether that is useful), and that several attributes can be attached to one state; they are all stored in a vector.
When a dictionary is stored in a state machine, the transition function of that machine has a tree structure. For such structures the term trie, popularized by D. Knuth, is used. The Strutext library implements such a state machine in the file automata/trie.h:
    template <class Trans, typename Attribute>
    struct Trie : public AttributeFsm<Trans, Attribute> {
      /// Chain identifier type.
      typedef Attribute ChainId;

      /// Attribute FSM type.
      typedef AttributeFsm<Trans, Attribute> AttributeFsmImpl;

      /// Default initialization.
      explicit Trie(size_t rsize = AttributeFsmImpl::kReservedStateTableSize)
        : AttributeFsmImpl(rsize) {}

      /// It may be base class.
      virtual ~Trie() {}

      /**
       * \brief Adding chain of symbols.
       *
       * \param begin Iterator of the chain's begin.
       * \param end   Iterator of the chain's end.
       * \param id    Chain identifier.
       *
       * \return The number of last state of the chain.
       */
      template <typename SymbolIterator>
      StateId AddChain(SymbolIterator begin, SymbolIterator end, const ChainId& id);

      /**
       * \brief Adding chain of symbols.
       *
       * \param begin Iterator of the chain's begin.
       * \param end   Iterator of the chain's end.
       *
       * \return The number of last state of the chain.
       */
      template <typename SymbolIterator>
      StateId AddChain(SymbolIterator begin, SymbolIterator end);

      /**
       * \brief Search of the passed chain in the trie
       *
       * \param begin Iterator of the chain's begin.
       * \param end   Iterator of the chain's end.
       * \result The reference to the list of attributes of the chain if any.
       */
      template <typename SymbolIterator>
      const typename AttributeFsmImpl::AttributeList& Search(SymbolIterator begin,
                                                             SymbolIterator end) const;
    };
As the code shows, there are two main methods, AddChain and Search. The latter is notable because it returns a reference to the attribute vector, so the state attributes are not copied during a search. If the input chain is not found, the attribute vector is empty.
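To make the AddChain/Search contract concrete, here is a much simplified trie sketch. It is not the Strutext Trie: `SimpleTrie` is a hypothetical class that takes std::string chains and int attributes instead of iterator pairs and a template attribute type, but it keeps the two properties stressed above, namely several attributes per state and a by-reference search result.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

class SimpleTrie {
 public:
  typedef std::vector<int> AttributeList;

  SimpleTrie() : nodes_(1) {}  // node 0 is the start state

  // Adds a chain and attaches an attribute to its final state.
  void AddChain(const std::string& chain, int attribute) {
    std::size_t state = 0;
    for (std::size_t i = 0; i < chain.size(); ++i) {
      std::map<char, std::size_t>::const_iterator it =
          nodes_[state].next.find(chain[i]);
      if (it == nodes_[state].next.end()) {
        const std::size_t created = nodes_.size();
        nodes_.push_back(Node());  // may reallocate, so re-index afterwards
        nodes_[state].next[chain[i]] = created;
        state = created;
      } else {
        state = it->second;
      }
    }
    nodes_[state].attrs.push_back(attribute);
  }

  // Returns a reference to the attributes of the chain; empty if not found.
  const AttributeList& Search(const std::string& chain) const {
    std::size_t state = 0;
    for (std::size_t i = 0; i < chain.size(); ++i) {
      std::map<char, std::size_t>::const_iterator it =
          nodes_[state].next.find(chain[i]);
      if (it == nodes_[state].next.end()) return empty_;
      state = it->second;
    }
    return nodes_[state].attrs;
  }

 private:
  struct Node {
    std::map<char, std::size_t> next;  // outgoing transitions
    AttributeList attrs;               // attributes of chains ending here
  };
  std::vector<Node> nodes_;
  AttributeList empty_;  // shared result for "not found"
};
```

A prefix of a stored chain that was never added itself (for example "zel" after adding "zelen") reaches an existing state with no attributes, so it also yields an empty list.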
The Strutext library also implements the Aho-Corasick automaton for efficient search of dictionary elements in a text. The implementation is in automata/aho_corasick.h. The principles and methods behind it are beyond the scope of this text; we note only that the interface is quite simple to use and that there is an iterator over the chains found in a text.
Note also that all automata can be serialized to and deserialized from std::stream. This makes it possible to store machines in files on disk and to use them as storage for dictionaries in binary form.
Morphological analyzer
The morphological analyzer is a library located in the morpho/morpholib directory. The main interface class, Morphologist, is in the file morpho/morpholib/morpho.h.
Before describing the interface and implementation of the class, we describe the basic principles the implementation rests on.
First, there is a dictionary of bases, implemented as an object of the Trie class.
Second, a declension paradigm is attached to each base at its accepting state. As before, it is a vector of pairs (suffix, set of lexical attributes); the attribute set is represented by an instance of a class derived from PartOfSpeech.
Third, each lexical type is assigned a unique numeric identifier, the number of its base in the dictionary.
Thus, to recognize a given word form as a word, we have to find the base in the automaton (which yields the identifier of the lexical type for that base), then find the ending, and finally find the attributes that correspond to the ending. All of this has to take ambiguity into account, both when searching for bases and when determining endings. The search code is as follows:
    /**
     * \brief Implementation of morphological analysis of passed form.
     *
     * \param      text     Input text in UTF-8 encoding.
     * \param[out] lem_list List of lemmas with morphological attributes.
     */
    void Analize(const std::string& text, LemList& lem_list) const {
      // The first phase. Go through the passed word text, encode each symbol
      // and remember the symbol codes in the string. If a word base is found
      // at some position, remember the attribute and the position for each
      // attribute.

      // Try to start with empty bases.
      typedef std::list<std::pair<Attribute, size_t> > BaseList;
      BaseList base_list;
      strutext::automata::StateId state = strutext::automata::kStartState;
      if (bases_trie_.IsAcceptable(state)) {
        const typename Trie::AttributeList& attrs = bases_trie_.GetStateAttributes(state);
        for (size_t i = 0; i < attrs.size(); ++i) {
          base_list.push_back(std::make_pair(attrs[i], 0));
        }
      }

      // Perform the first phase.
      std::string code_str;
      typedef strutext::encode::Utf8Iterator<std::string::const_iterator> Utf8Iterator;
      for (Utf8Iterator sym_it(text.begin(), text.end()); sym_it != Utf8Iterator(); ++sym_it) {
        Code c = alphabet_.Encode(*sym_it);
        code_str += c;
        if (state != strutext::automata::kInvalidState) {
          state = bases_trie_.Go(state, c);
          if (bases_trie_.IsAcceptable(state)) {
            const typename Trie::AttributeList& attrs = bases_trie_.GetStateAttributes(state);
            for (size_t i = 0; i < attrs.size(); ++i) {
              base_list.push_back(std::make_pair(attrs[i], code_str.size()));
            }
          }
        }
      }

      // The second phase. Go through the found base list and find suffixes for them.
      // If suffixes have been found then add them to the lemma list.
      lem_list.clear();
      for (BaseList::iterator base_it = base_list.begin(); base_it != base_list.end(); ++base_it) {
        AttrMap attr;
        attr.auto_attr_ = base_it->first;
        SuffixStorage::AttrList att_list;
        std::string suffix = code_str.substr(base_it->second);

        // If suffix is empty (empty suffix passed), add zero symbol to it.
        if (suffix.empty()) {
          suffix.push_back('\0');
        }
        if (const SuffixStorage::AttrList* att_list = suff_store_.SearchAttrs(attr.line_id_, suffix)) {
          for (size_t i = 0; i < att_list->size(); ++i) {
            lem_list.push_back(Lemma(attr.lem_id_, (*att_list)[i]));
          }
        }
      }
    }
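As you can see, the algorithm is split into two phases. In the first phase the bases are extracted (the possibility of empty bases has to be taken into account here). For each base, its position in the input chain is remembered so that the ending can be extracted later. In the second phase the endings that correspond to the found bases are looked up. If an ending is found in the declension paradigm that belongs to the given base, the lexical attributes of that ending are returned together with the identifier of the word.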
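The two-phase idea can be reproduced on a toy dictionary without the trie or the symbol encoding. The sketch below is an assumption-laden simplification: `ToyDictionary`, `Paradigm`, `Analysis` and `Analyze` are hypothetical names, bases are looked up in a std::multimap instead of a trie, and attribute sets are plain two-letter strings.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// suffix -> attribute code; a multimap, because one suffix may be ambiguous.
typedef std::multimap<std::string, std::string> Paradigm;

struct ToyDictionary {
  std::multimap<std::string, int> bases;  // base -> paradigm number
  std::map<int, Paradigm> paradigms;      // paradigm number -> paradigm
};

// Result item: (base, attribute code).
typedef std::pair<std::string, std::string> Analysis;

std::vector<Analysis> Analyze(const ToyDictionary& dict, const std::string& word) {
  // Phase 1: collect every prefix, including the empty one, that is a base.
  std::vector<std::pair<std::size_t, int> > found;  // (base length, paradigm)
  for (std::size_t len = 0; len <= word.size(); ++len) {
    typedef std::multimap<std::string, int>::const_iterator BaseIt;
    std::pair<BaseIt, BaseIt> range = dict.bases.equal_range(word.substr(0, len));
    for (BaseIt it = range.first; it != range.second; ++it) {
      found.push_back(std::make_pair(len, it->second));
    }
  }
  // Phase 2: match the rest of the word against the paradigm of each base.
  std::vector<Analysis> result;
  for (std::size_t i = 0; i < found.size(); ++i) {
    std::map<int, Paradigm>::const_iterator par = dict.paradigms.find(found[i].second);
    if (par == dict.paradigms.end()) continue;
    const std::string suffix = word.substr(found[i].first);
    std::pair<Paradigm::const_iterator, Paradigm::const_iterator> hits =
        par->second.equal_range(suffix);
    for (Paradigm::const_iterator it = hits.first; it != hits.second; ++it) {
      result.push_back(Analysis(word.substr(0, found[i].first), it->second));
    }
  }
  return result;
}
```

Ambiguity shows up naturally as more than one element in the result: either several bases match a prefix of the word, or one suffix carries several attribute codes within the same paradigm.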
The Morphologist class also provides a service that generates a word form from a base number and the given lexical attributes. The Generate method does this:
    /**
     * \brief Generate form.
     *
     * \param lem_id The lemma identifier.
     * \param attrs  The attributes of the form.
     * \return       Generated text in UTF-8 encoding.
     */
    std::string Generate(uint32_t lem_id, uint32_t attrs) const;
There is also a GenAllForms method, which generates all forms of a given word, and a GenMainForm method, which returns the main form of a word. For nouns this is the nominative singular form.
The file main.cpp in the morpho/aot directory implements a parser for the AOT dictionary in its original format. As a result it produces a binary representation that is compatible with the morphology library. The resulting binary dictionary can be used with the Morphologist class. The binary dictionaries themselves are not stored in the repository, but users can generate them when needed. To build the Russian dictionary, the following command can be used:
./Release/bin/aot-parser -t ../morpho/aot/rus_tabs.txt -d ../morpho/aot/rus_morphs.txt -m rus -b aot-rus.bin
In binary form the dictionary is less than 20 MB in size.
To extract word forms from source text, you can use the WordIterator class defined in utility/word_iterator.h. This class treats sequences of letters (symbols::IsLetter) as words. The iterator returns each word as a Unicode string. That string can be transcoded to UTF-8 with the GetUtf8Sequence function defined in encode/utf8_generator.h.
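The idea of treating maximal runs of letters as words can be shown with a small stand-alone function. This sketch does not use WordIterator or symbols::IsLetter: `SplitWords` is a hypothetical helper that recognizes ASCII letters only, whereas the library, according to the text above, works on Unicode symbols.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits text into maximal runs of letters. Only ASCII letters count as
// letters here; every other byte acts as a separator.
std::vector<std::string> SplitWords(const std::string& text) {
  std::vector<std::string> words;
  std::string current;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const unsigned char ch = static_cast<unsigned char>(text[i]);
    if (ch < 128 && std::isalpha(ch)) {
      current.push_back(text[i]);
    } else if (!current.empty()) {
      words.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) words.push_back(current);  // flush the last word
  return words;
}
```

Each word obtained this way would then be passed to the morphological analysis, one word form at a time.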
Afterword
The text turned out to be rather long and is probably hard to read. The author tried to simplify the presentation as far as possible, but given the complexity of the material there was evidently not much room for that.
Nevertheless, the author hopes that the Strutext library described here will prove useful and that the work put into its implementation will not be wasted.