A word2vec classifier

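After a recent conversation, the question came up of finding a classifier that can handle Russian text without crutches such as a combination of Watson NLC and the Bing translator. It was decided to reinvent the wheel. Word2vec is used as the basis for obtaining vector representations of the examples and of the user input. An example of working with it can be found, for instance, here. By the way, a question for the more experienced: are there more suitable alternatives? Classifying huge texts is not planned. Recall that word2vec lets you obtain a vector representation of the words you pass in (addition, subtraction and multiplication by a numeric coefficient can be applied to the resulting vectors). Here the vectors live in a space whose "axes" can be taken to be words.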
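To make the vector operations mentioned above concrete, here is a minimal sketch in plain Python. The three-dimensional "embeddings" in `toy` are invented for illustration and do not come from any real word2vec model; the helper names `add`, `sub`, `scale` and `cosine` are likewise hypothetical.

```python
import math

# Toy 3-dimensional "embeddings", invented purely for illustration.
# Real word2vec vectors (e.g. GoogleNews) have 300 dimensions.
toy = {
    "king":  [0.8, 0.7, 0.1],
    "man":   [0.6, 0.1, 0.1],
    "woman": [0.6, 0.1, 0.9],
    "queen": [0.8, 0.7, 0.9],
}

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]

def scale(k, a):
    """Multiply a vector by a numeric coefficient."""
    return [k * x for x in a]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, then look for the closest toy word by cosine similarity.
result = add(sub(toy["king"], toy["man"]), toy["woman"])
nearest = max(toy, key=lambda word: cosine(result, toy[word]))
print(nearest)
```

With a real model the same arithmetic is done on the model's own vectors; the toy table only shows the mechanics.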

The code is at https://github.com/alex4321/word2vec-nlc. It is written using gensim. The model used (for English) is GoogleNews-vectors-negative300.bin.gz.

For example, for the source phrase "I have a dog", the vectorization result is (yes, there is clearly garbage in it):

{'SheldenWilliams_@': 0.5255231261253357, "have't": 0.5583386421203613, 'personnaly': 0.5540199875831604, 'havent': 0.597449541091919, 'happening_Truver': 0.5273309946060181, 'Durcinka': 0.5368120670318604, 'wouldnt': 0.5314139723777771, 'whine_nag': 0.5264301300048828, "I'v": 0.532825231552124, 'dogs': 0.5436486601829529, 'i_realy': 0.5310240983963013, "'ve": 0.549283504486084, 'theyd': 0.5327804684638977, 'coz_i': 0.5257705450057983, 'love_veronica_Mars': 0.5656633973121643, 'LOVE_YOU_ALL': 0.5473299622535706, 'JeremyShockey_@': 0.5405183434486389, 'i_havnt': 0.5311092138290405, 've': 0.5290054678916931, 'Reputable_breeders': 0.5303832292556763, 'samantharonson_@': 0.542853593826294, 'hadnt': 0.5852461457252502, 'Er_um': 0.5257833003997803, 'couldve': 0.529604434967041, 'that.I': 0.5449035167694092, 'ive': 0.5347827672958374, "were'nt": 0.5263391733169556, 'i_havent': 0.545918345451355, "havn't": 0.6071761846542358, 'wouldve': 0.556045651435852}


And for "I have a computer":

{'psn': 0.5582104325294495, '•_vaibhav_sir': 0.5425109267234802, 'theyve': 0.5318623781204224, 'love_veronica_Mars': 0.564369261264801, 'receive_MacMall_Exclusive': 0.5217918157577515, 'havnt': 0.517998456954956, 'gfx_card': 0.5196627378463745, 'macbook_pro': 0.5703814029693604, 'droid_x': 0.5607396364212036, 'dell_laptop': 0.5578193664550781, "'ve": 0.5209441184997559, 'Arrendondo_cook': 0.5448468923568726, "I'v": 0.5355654358863831, 'lenovo': 0.5266544222831726, 'reinstall_XP': 0.537743091583252, 'wifi_adapter': 0.5497201085090637, 'havent': 0.5768314003944397, 'LOVE_YOU_ALL': 0.5329291820526123, 'haha_i': 0.5369561314582825, 'computers': 0.5202767252922058, 'automaticly': 0.523144543170929, 'hadnt': 0.5282260775566101, 'ive': 0.5156651735305786, 'google_docs': 0.5261930227279663, 'google_chrome': 0.5319492816925049, 'i_havent': 0.5323601961135864, "havn't": 0.593987226486206, 'mainframes_minicomputers': 0.5255732536315918, 'Ive': 0.518458902835846, 'cect_u##_china': 0.5316625237464905}
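Output of this shape (30 words, each with a similarity score) resembles what gensim's `most_similar` returns, so a phrase could plausibly be vectorized along the following lines. This is only a guess at the approach, not the repository's actual code: the function names `load_model` and `phrase_neighbours`, the local model path, and the whitespace tokenization are all assumptions.

```python
from pathlib import Path

# Assumed local path to the pretrained model named in the article.
MODEL_PATH = Path("GoogleNews-vectors-negative300.bin.gz")

def load_model(path=MODEL_PATH):
    """Load pretrained vectors with gensim (hypothetical helper)."""
    # Imported lazily so the rest of the sketch works without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(str(path), binary=True)

def phrase_neighbours(model, phrase, topn=30):
    """Map a phrase to {neighbour_word: similarity}.

    Assumption: the phrase is split on whitespace and words missing from
    the model's vocabulary are silently dropped.
    """
    words = [w for w in phrase.split() if w in model]
    if not words:
        return {}
    return dict(model.most_similar(positive=words, topn=topn))

if __name__ == "__main__" and MODEL_PATH.exists():
    model = load_model()
    print(phrase_neighbours(model, "I have a dog"))
```

The model file is several gigabytes once loaded, so the example only runs the lookup when the file is actually present.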


The following algorithm is used for classification:

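1. Before vectorization, overly common words are cut out (from both the examples and the user input).
2. The provided examples are vectorized (the example_i vectors).
3. From the coordinates obtained in the previous step, the centers of the clusters corresponding to the classes are computed.
4. The user input is vectorized.
5. Records of the form [(class_name, length(input_vector - class_center))] are produced.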

The resulting array is sorted by the distance between the cluster center and the user input (so the class closest to the user input vector comes first).
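The steps above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not the repository's implementation: the stop-word list, the two-dimensional `TOY_VECTORS` table and all function names are hypothetical, and a phrase is vectorized here as the mean of its word vectors, which may differ from what the original code does.

```python
import math
import re

# Hypothetical stop-word list; the article does not give the real one.
STOP_WORDS = {"i", "you", "a", "the"}

# Toy 2-D embeddings invented for illustration; a real setup would take
# vectors from a word2vec model instead.
TOY_VECTORS = {
    "dog": [1.0, 0.0],
    "computer": [0.0, 1.0],
    "laptop": [0.1, 0.9],
}

def tokenize(text):
    """Lowercase, split into words, drop overly common words (step 1)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def vectorize(text):
    """Mean of the known word vectors, or None if no word is known."""
    known = [TOY_VECTORS[w] for w in tokenize(text) if w in TOY_VECTORS]
    return mean(known) if known else None

def train(examples):
    """Compute one cluster center per class (steps 2 and 3)."""
    centers = {}
    for class_name, phrases in examples.items():
        vectors = [v for v in map(vectorize, phrases) if v is not None]
        if vectors:  # skip classes with no usable example
            centers[class_name] = mean(vectors)
    return centers

def distance(a, b):
    """Euclidean length of the difference vector a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(centers, text):
    """Return (class_name, distance) records, nearest first (steps 4, 5, sort)."""
    vector = vectorize(text)
    if vector is None:
        return []
    records = [(name, distance(vector, c)) for name, c in centers.items()]
    return sorted(records, key=lambda record: record[1])

EXAMPLES = {
    "computer": ["I have a computer", "You have a laptop"],
    "dog": ["I have a dog", "Have you a dog?"],
}

centers = train(EXAMPLES)
for name, dist in classify(centers, "Did you have a dog?"):
    print(name, dist)
```

The distances printed by this toy version are not comparable with the figures in the article, since the vectors are made up; only the ordering of the classes is meaningful.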

When testing on the following data:

Classes:

{
    'computer': ['I have a computer', 'You have a laptop'],
    'dog': ['I have a dog', 'Have you a dog?']
}


Input: "Did you have a dog?"

The following list was produced:

dog 3.204362832876183 0
computer 3.577504792988848 0


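In other words, we can conclude that the input is closer to the dog class than to the computer class.

P.S. On working with Russian: technically, putting together a word2vec model that handles it is not a problem. There were in fact several available, but they seemed rather small to me (though they may well be good enough for practical use).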
