CJieba word tokenizer and part-of-speech tagging in Julia.
CJieba is simple and fast, and quite adequate for tasks that prioritize speed over accuracy.
```julia
using CJieba

s = "他来到了网易杭研大厦"
```
Tokenize a sentence:
```julia
cut(s)
```
Output:
```
6-element Vector{String}:
 "他"
 "来到"
 "了"
 "网易"
 "杭研"
 "大厦"
```
Tokenize with part-of-speech (POS) tagging:
```julia
tag(s)
```
Output:
```
6-element Vector{Tuple{String, String}}:
 ("他", "r")
 ("来到", "v")
 ("了", "ul")
 ("网易", "n")
 ("杭研", "x")
 ("大厦", "n")
```
Tokenize while dropping tokens with a given POS tag (here "x", the tag assigned to "杭研" above):
```julia
cut(s; without="x")
```
Output:
```
5-element Vector{String}:
 "他"
 "来到"
 "了"
 "网易"
 "大厦"
```
Manually create a handle, add a user word, and free the handle yourself:
```julia
jieba = JIEBA()
add_word!(jieba, "网易杭研大厦")
cut(jieba, s)
tag(jieba, s)
free!(jieba)
```
Output:
```
4-element Vector{String}:
 "他"
 "来到"
 "了"
 "网易杭研大厦"

4-element Vector{Tuple{String, String}}:
 ("他", "r")
 ("来到", "v")
 ("了", "ul")
 ("网易杭研大厦", "u")
```
Passing the user words at construction works too:
```julia
jieba = JIEBA(; user_words=["网易杭研大厦"])
```
Or pass the path of a user dictionary file:
jieba = JIEBA(; user_words="/path/to/your/user_dict")
Or use `do`-block syntax (the handle is freed automatically):
```julia
JIEBA(; user_words=["网易杭研大厦"]) do jieba
    tag(jieba, s)
end
```
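The `do` form is also convenient for batching several sentences over a single handle; a small sketch (the sentence list is illustrative):

```julia
sentences = ["他来到了网易杭研大厦", "我爱北京天安门"]

JIEBA(; user_words=["网易杭研大厦"]) do jieba
    # Reuse one handle for every sentence; it is freed when the block returns.
    [tag(jieba, sent) for sent in sentences]
end
```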
Design goals:

- Make it idiomatic to use from Julia.
- Make it safe, mostly (I'm pretty sure some weird, ill-formatted user_dict can still segfault it...).
- Keep it simple (the keyword extractor isn't even exposed).