{"id":1403,"date":"2022-05-25T00:36:03","date_gmt":"2022-05-24T16:36:03","guid":{"rendered":"http:\/\/www.eait.co\/?p=1403"},"modified":"2022-05-25T00:36:03","modified_gmt":"2022-05-24T16:36:03","slug":"nlp-word2vec","status":"publish","type":"post","link":"https:\/\/notes.coremix.net\/?p=1403","title":{"rendered":"NLP: Word2Vec"},"content":{"rendered":"<p>I finished going through this and found it quite interesting. The basic analysis is as follows:<br \/>\nFirst read the data, tokenize it, remove stop words, and build a two-dimensional list (one token list per article) to hand to Word2Vec.<br \/>\nFinally, call the relevant functions. The results are not ideal, but compared with earlier tests the accuracy rose to about 70%.<br \/>\nCode first:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# -*- coding: utf-8 -*-\r\nimport time\r\nimport pandas as pd\r\nimport jieba\r\nfrom gensim.models import Word2Vec\r\n\r\n# Local paths to the Sohu news data and the stop-word list\r\ntrain_file = r'D:\\win7\u8fdc\u7a0b\\NLP \u5927\u6570\u636e\u4eba\u5de5\u667a\u80fd\u81ea\u7136\u8bed\u8a00\u5904\u7406&#92;&#48;725-\u57fa\u4e8eFastText\u7684\u4e2d\u6587\u5206\u7c7b\\sohu_train.txt'\r\ntest_file = r'D:\\win7\u8fdc\u7a0b\\NLP \u5927\u6570\u636e\u4eba\u5de5\u667a\u80fd\u81ea\u7136\u8bed\u8a00\u5904\u7406&#92;&#48;725-\u57fa\u4e8eFastText\u7684\u4e2d\u6587\u5206\u7c7b\\sohu_test.txt'\r\nstopwords = r'D:\\win7\u8fdc\u7a0b\\NLP \u5927\u6570\u636e\u4eba\u5de5\u667a\u80fd\u81ea\u7136\u8bed\u8a00\u5904\u7406&#92;&#48;725-\u57fa\u4e8eFastText\u7684\u4e2d\u6587\u5206\u7c7b\\stop_words.txt'\r\n\r\ndef processData():\r\n    train_pd = pd.read_csv(train_file, sep='\\t', header=None)\r\n    test_pd = pd.read_csv(test_file, sep='\\t', header=None)\r\n    # Keep only the first 5000 rows of each set\r\n    train_pd = train_pd.head(5000)\r\n    test_pd = test_pd.head(5000)\r\n    # Column names: category, article text\r\n    train_pd.columns = &#x5B;'\u5206\u7c7b', &quot;\u6587\u7ae0&quot;]\r\n    
# Remove stop words:\r\n    stopwords_list = &#x5B;k.strip() for k in open(stopwords,encoding='utf-8').readlines()\r\n                      if k.strip() !='']\r\n    wordscuts_list = &#x5B;]  # one list of tokens per article\r\n    time0 = time.time()\r\n    i  = 0\r\n    for article in train_pd&#x5B;'\u6587\u7ae0']:\r\n        wordsCut= &#x5B;k for k in  jieba.cut(article) if k not in stopwords_list]\r\n        wordscuts_list.append(wordsCut)\r\n        i+=1\r\n        if i%10==0:\r\n            print('Processed %d articles, total time %.2f s'%(i,time.time()-time0))\r\n    return wordscuts_list\r\n\r\n\r\nif __name__ == '__main__':\r\n    # Data preprocessing\r\n    wordcutslist = processData()\r\n    # Build the word vectors\r\n    print(&quot;Starting Word2Vec&quot;)\r\n    # gensim 4.0 and later use vector_size; gensim 3.x called this parameter size\r\n    word2vec = Word2Vec(wordcutslist,sg=0,window=5,vector_size=192,min_count=10,workers=5)\r\n    # Similarity lookup:\r\n    print('Similarity',word2vec.wv.most_similar('\u8d64\u58c1'))\r\n    # Machine learning: LR classification (feature engineering)\r\n\r\n    pass\r\n<\/pre>\n<p>Debug output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"931\" height=\"395\" class=\"alignnone wp-image-1404 size-full\" src=\"http:\/\/www.eait.co\/wp-content\/uploads\/2022\/05\/Pasted-1.png\" srcset=\"https:\/\/notes.coremix.net\/wp-content\/uploads\/2022\/05\/Pasted-1.png 931w, https:\/\/notes.coremix.net\/wp-content\/uploads\/2022\/05\/Pasted-1-300x127.png 300w, https:\/\/notes.coremix.net\/wp-content\/uploads\/2022\/05\/Pasted-1-768x326.png 768w\" sizes=\"auto, (max-width: 931px) 100vw, 931px\" \/><img loading=\"lazy\" decoding=\"async\" width=\"963\" height=\"447\" class=\"alignnone wp-image-1405 size-full\" src=\"http:\/\/www.eait.co\/wp-content\/uploads\/2022\/05\/Pasted-2.png\" 
srcset=\"https:\/\/notes.coremix.net\/wp-content\/uploads\/2022\/05\/Pasted-2.png 963w, https:\/\/notes.coremix.net\/wp-content\/uploads\/2022\/05\/Pasted-2-300x139.png 300w, https:\/\/notes.coremix.net\/wp-content\/uploads\/2022\/05\/Pasted-2-768x356.png 768w\" sizes=\"auto, (max-width: 963px) 100vw, 963px\" \/>Other notes:<\/p>\n<p>word2vec<\/p>\n<p>(1) word2vec: maps (embeds) one-hot vectors into low-dimensional, continuous dense vectors.<\/p>\n<p>Neural network (3 layers)<\/p>\n<p>Input layer: one-hot vector<\/p>\n<p>Hidden layer: linear units (computed from the input layer)<\/p>\n<p>Output layer: softmax function (normalizes the values to between 0 and 1, giving the probability of each class, e.g. 0.1, 0.3, 0.4, 0.05, 0.15).<\/p>\n<p>(2) CBOW (Continuous Bag of Words) &amp; Skip-Gram<\/p>\n<p>(3) a. Install Gensim<\/p>\n<p>pip install gensim (without the C extension)<\/p>\n<p>conda install -c conda-forge gensim (recommended; -c specifies the channel)<\/p>\n<p>b. Call it directly<\/p>\n<p>from gensim.models import Word2Vec<\/p>\n<p>model=Word2Vec(sentences,sg=0,size=,window,min_count,workers)<\/p>\n<p>sentences: the preprocessed corpus (split into sentences, then tokenized).<\/p>\n<p>sg: 0 = CBOW, 1 = Skip-gram<\/p>\n<p>size: dimensionality of the feature vectors; the default is 100, and typical values range from tens to a few hundred. (Small gotcha: this parameter was later renamed vector_size.)<\/p>\n<p>window: maximum distance between the current word and the predicted word within a sentence.<\/p>\n<p>alpha: learning rate, 
between 0 and 1.<\/p>\n<p>min_count: frequency threshold; the minimum number of occurrences for a word to be kept.<\/p>\n<p>workers: number of threads<\/p>\n<p>(4) Sohu news classification.<\/p>\n<p>a. Data preprocessing<\/p>\n<p>b. Declare the word2vec model: convert words to word vectors<\/p>\n<p>c. Machine learning: feature engineering. Deep learning: not needed; declare the network structure and parameters<\/p>\n<p>d. Training and testing (evaluation metrics: accuracy, recall, precision, F1-score)<\/p>\n<p>man - woman + man = man<\/p>\n<p>man - woman + king = monarch<\/p>\n<p>5. doc2vec<\/p>\n<p>(1) Proposed in 2014 as an extension of word2vec; it measures similarity between sentences, paragraphs, and articles.<\/p>\n<p>(2) DM: corresponds to word2vec's CBOW with an added document vector. It mainly predicts a word from the other words in its context.<\/p>\n<p>DBOW: corresponds to word2vec's skip-gram; it uses the paragraph vector to predict the other words.<\/p>\n<p>A Huffman tree (Huffman coding) can speed up computation.<\/p>\n<p>(3) Doc2Vec(dm=0,size,window,min_count,workers)<\/p>\n<p>dm=0 selects DBOW; dm=1 selects the DM algorithm.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I finished going through this and found it quite interesting. The basic analysis is as follows: 
First read the data, tokenize it, remove stop words, and build a two-dimensional list to hand to Word2Vec. Finally [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[32,20],"class_list":["post-1403","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-nlp","tag-python"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/notes.coremix.net\/index.php?rest_route=\/wp\/v2\/posts\/1403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/notes.coremix.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notes.coremix.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notes.coremix.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/notes.coremix.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1403"}],"version-history":[{"count":1,"href":"https:\/\/notes.coremix.net\/index.php?rest_route=\/wp\/v2\/posts\/1403\/revisions"}],"predecessor-version":[{"id":1406,"href":"https:\/\/notes.coremix.net\/index.php?rest_route=\/wp\/v2\/posts\/1403\/revisions\/1406"}],"wp:attachment":[{"href":"https:\/\/notes.coremix.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notes.coremix.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1403"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notes.coremix.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}