Kong Yiji (断文识字的“孔乙己”), a simple, fine-tuned Chinese tokenizer
-
Trained on Chinese Treebank 8.0 (CTB). The current release, version 1, uses an extended word-level Hidden Markov Model (HMM), in contrast to the character-level HMM of the earlier version.
-
Fine-tuned to handle out-of-vocabulary (OOV) words (未登录词, 网络新词: unregistered words and Internet neologisms). If the algorithm fails to find such a word, add it to the user dictionary (see Constructor) and adjust user_dict_weight if necessary.
-
Debug information is fully exported through the functions below:
- postable : table of the part-of-speech (POS) tags used in CTB
- h2vtable : table from hidden states (POS tags) to visible observations (words), i.e., the emission matrix
- v2htable : the reverse of the above (words to POS tags)
- h2htable : table from hidden state to hidden state, i.e., the transition matrix
- hprtable : table of hidden-state priors, i.e., the prior probabilities
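To make the role of the three probability tables concrete, here is a minimal sketch of Viterbi decoding over a toy HMM. It is not the package's implementation: the function `viterbi`, the variables `hpr`, `h2h`, `h2v`, and every number are made up for illustration, and the input is assumed to be already segmented into words.

```julia
# Toy tables, invented for illustration only (not taken from CTB).
tags = ["NR", "VV", "NN"]

# hpr: prior probability of each hidden state (POS tag)
hpr = Dict("NR" => 0.5, "VV" => 0.2, "NN" => 0.3)

# h2h: transition probabilities, indexed as h2h[previous_tag][next_tag]
h2h = Dict(
    "NR" => Dict("NR" => 0.1, "VV" => 0.7, "NN" => 0.2),
    "VV" => Dict("NR" => 0.3, "VV" => 0.1, "NN" => 0.6),
    "NN" => Dict("NR" => 0.2, "VV" => 0.5, "NN" => 0.3),
)

# h2v: emission probabilities, indexed as h2v[tag][word]
h2v = Dict(
    "NR" => Dict("孔乙己" => 0.9, "酒" => 0.1),
    "VV" => Dict("喝" => 1.0),
    "NN" => Dict("孔乙己" => 0.1, "酒" => 0.9),
)

# Return the most probable tag sequence for `words` (log-space Viterbi).
function viterbi(words, tags, hpr, h2h, h2v; floor = 1e-12)
    n, k = length(words), length(tags)
    score = fill(-Inf, k, n)  # best log-probability of a path ending in tag i at position t
    back = zeros(Int, k, n)   # backpointers to the previous tag index
    for i in 1:k
        # unseen (tag, word) pairs get a tiny floor probability instead of zero
        score[i, 1] = log(hpr[tags[i]]) + log(get(h2v[tags[i]], words[1], floor))
    end
    for t in 2:n, j in 1:k
        emit = log(get(h2v[tags[j]], words[t], floor))
        for i in 1:k
            s = score[i, t-1] + log(h2h[tags[i]][tags[j]]) + emit
            if s > score[j, t]
                score[j, t] = s
                back[j, t] = i
            end
        end
    end
    # follow the backpointers from the best final state
    path = zeros(Int, n)
    path[n] = argmax(score[:, n])
    for t in n:-1:2
        path[t-1] = back[path[t], t]
    end
    return [tags[i] for i in path]
end

result = viterbi(["孔乙己", "喝", "酒"], tags, hpr, h2h, h2v)
```

With these toy numbers, `result` is `["NR", "VV", "NN"]`. The real tokenizer also has to decide where the word boundaries are, which this sketch leaves out.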
-
Digit characters are masked to reduce the number of parameters and the risk of overfitting.
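One way to picture digit masking is the sketch below, which maps every ASCII or full-width digit to one placeholder character, so that all numbers of the same length share a single entry in the model. The helper name `mask_digits` and the placeholder `0` are assumptions; the package may mask digits differently.

```julia
# Replace each digit (ASCII 0-9 or full-width ０-９) with the placeholder "0".
# Both the function name and the placeholder are hypothetical.
mask_digits(s::AbstractString) = replace(s, r"[0-9０-９]" => "0")

masked = mask_digits("2019年12月")
```

Here `masked` is `"0000年00月"`, so "2019年" and "1998年" end up as the same token pattern.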
-
Low-discrimination word-to-postag probabilities are pruned: only the two most probable postags are kept for each word.
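The pruning step can be sketched as follows. This is an illustration under assumptions, not the package's code: `v2h`, `keep_top`, and the counts are invented, and raw counts stand in for probabilities (for a fixed word, ranking tags by count gives the same order as ranking by probability).

```julia
# v2h[word][tag] = count; toy numbers for illustration only.
v2h = Dict(
    "发展" => Dict("NN" => 120, "VV" => 95, "JJ" => 3, "NR" => 1),
    "的" => Dict("DEC" => 500, "DEG" => 480),
)

# Keep only the k tags with the highest counts for one word (hypothetical helper).
function keep_top(tagcounts::Dict{String,Int}, k::Int = 2)
    ranked = sort(collect(tagcounts); by = last, rev = true)
    return Dict(ranked[1:min(k, length(ranked))])
end

pruned = Dict(word => keep_top(tc) for (word, tc) in v2h)
```

After pruning, "发展" keeps only its `NN` and `VV` entries, while "的" is unchanged because it has just two tags.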
- POS tag neural language model (RNN) to model the unbounded history of POS tags.
kong(; user_dict_path="", user_dict_array=[], user_dict_weight=1)
-
user_dict_path : path to a user dictionary file. Each line contains one word, optionally preceded by a part-of-speech tag (postag). If no postag is supplied, NR (proper noun, 专有名词) is inserted automatically.
-
user_dict_array : a Vector{Tuple{String, String}} of (postag, word) pairs
-
user_dict_weight : if the value is m, the frequency of (postag, word) in the user dictionary is set to $ m * maximum(values(h2v[postag])) $
Note that all user-supplied postags MUST conform to the Chinese Treebank specification.
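The sketch below shows how user dictionary lines and the weight formula could fit together. It rests on several assumptions: the postag and the word on a line are separated by whitespace, `parse_user_dict` and `base` are hypothetical names, and the counts in `h2v` are made up. It works on a vector of lines so that it stays self-contained; for a real file the lines could come from `eachline(path)`.

```julia
# Parse lines of the assumed form "[postag] word" into (postag, word) tuples.
function parse_user_dict(lines)
    entries = Tuple{String,String}[]
    for line in lines
        fields = split(strip(line))
        if length(fields) == 1
            push!(entries, ("NR", String(fields[1])))  # no postag given: default to NR
        elseif length(fields) >= 2
            push!(entries, (String(fields[1]), String(fields[2])))
        end                                            # blank lines are skipped
    end
    return entries
end

# Toy emission counts h2v[postag][word]; numbers are invented.
h2v = Dict(
    "NR" => Dict("鲁迅" => 40, "北京" => 75),
    "NN" => Dict("酒" => 30, "书" => 12),
)

entries = parse_user_dict(["NN 内卷", "孔乙己", ""])

# Record each tag's maximum before inserting, so user entries do not affect each other.
base = Dict(tag => maximum(values(counts)) for (tag, counts) in h2v)

m = 2  # plays the role of user_dict_weight
for (postag, word) in entries
    h2v[postag][word] = m * base[postag]
end
```

With `m = 2`, "内卷" receives a frequency of 60 under `NN` and "孔乙己" receives 150 under `NR`, so each user word outweighs every corpus word carrying the same tag. The `entries` vector has the `Vector{Tuple{String, String}}` shape described for user_dict_array.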
See test/runtests.jl
- Filter low-frequency words from CTB
- Expand the summary of the POS table: add an example column and contrast it with other POS schemes (PKU, etc.)
- Explore maximum entropy and CRF-related algorithms