Word2vec 是基於非監督式學習,訓練集建議越大越好,語料涵蓋的越全面,訓練出來的結果相對比較好,當然也有可能 garbage input 進而得到 garbage output ,由於檔案所使用的資料集較大,所以每個過程中都請耐心等候。(ps: word2vec 如果把每種字當成一個維度,假設總共有 4000 個總字,那麼向量就會有 4000 維度。故可透過它來降低維度)
WiKi簡介:
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
By the way :
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words.
Input:
1.download wiki data(請參考資料集)
2.進入 Word2Vec 資料夾
3.執行 python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2(wiki xml 轉換成 wiki text)
4.執行 python segmentation.py(簡體轉繁體,在進行斷詞並同步過濾停用詞,由於檔案較大故斷詞較久)
5.執行 python train.py(訓練並產生 model ,時間上也會比較久)
5.執行 python main.py(使用 Model,輸入詞彙)
註:如果在 Windows cmd 下執行 python 時有編碼問題請下以下指令:chcp 65001(使用utf-8)
Output:
1.輸入一個詞彙會找出前5名相似
2.輸入兩個詞彙會算出兩者之間相似度
3.輸入三個詞彙爸爸之於老公,如媽媽之於老婆
輸入格式( Ex: 爸爸,媽媽,....註:最多三個詞彙)
老師
詞彙相似詞前 5 排序
班導,0.6360481977462769
班導師,0.6360464096069336
代課,0.6358826160430908
級任,0.6271134614944458
班主任,0.6270170211791992
輸入格式( Ex: 爸爸,媽媽,....註:最多三個詞彙)
爸爸,媽媽
計算兩個詞彙間 Cosine 相似度
0.780765200371
輸入格式( Ex: 爸爸,媽媽,....註:最多三個詞彙)
爸爸,老公,媽媽
爸爸之於老公,如媽媽之於
老婆,0.5401346683502197
蠢萌,0.5245970487594604
夠秤,0.5059393048286438
駁命,0.4888317286968231
孔爵,0.4857243597507477
主要以 pages-articles.xml.bz2 結尾之檔案類型,這邊使用 zhwiki-latest-pages-articles.xml.bz2。
- 維基資料集:
https://zh.wikipedia.org/wiki/Wikipedia:%E6%95%B0%E6%8D%AE%E5%BA%93%E4%B8%8B%E8%BD%BD - zhwiki-latest-pages-articles.xml.bz2 下載網址:
https://drive.google.com/file/d/0B4rlWa2S_JMBUmlMSG5IRVRMbnc/view?usp=sharing - 程式參考網址:
https://radimrehurek.com/gensim/corpora/wikicorpus.html
https://radimrehurek.com/gensim/models/word2vec.html
Python 3.5.2
pip install gensim
pip install jieba
pip install hanziconv
note: if your gensim can't install, please check your os then install correct gensim version.