Julia interface to GloVe.
This package provides functionality for generating and working with GloVe word embeddings. The training is done using the original C code from the GloVe github repository.
Note that there is also a package called Glove.jl that provides a pure Julia implementation of the algorithm.
Pkg.clone("https://github.com/zgornel/Glowe.jl")
for the latest master
or
Pkg.add("Glowe")
for the stable versions.
Most of the documentation is provided in Julia's native docsystem.
Following Word2Vec.jl's example, considering the corpus from http://mattmahoney.net/dc/text8.zip extracted as text file text8
in the current working directory, the GloVe model can be obtained with:
julia> # Training (may take a while)
vocab_count("text8", "vocab.txt", min_count=5, verbose=1);
cooccur("text8", "vocab.txt", "cooccurrence.bin", memory=8.0, verbose=1);
shuffle("cooccurrence.bin", "cooccurrence.shuf.bin", memory=8.0, verbose=1);
glove("cooccurrence.shuf.bin", "vocab.txt", "text8-vec", threads=8,
x_max=10.0, iter=15, vector_size=300, binary=0, write_header=1,
verbose=1);
# BUILDING VOCABULARY
# Truncating vocabulary at min count 5.
# Using vocabulary of size 71290.
#
# COUNTING COOCCURRENCES
# window size: 15
# context: symmetric
# Merging cooccurrence files: processed 60666468 lines.
#
# SHUFFLING COOCCURRENCES
# array size: 510027366
# Merging temp files: processed 60666468 lines.
#
# TRAINING MODEL
# Read 60666468 lines.
# vector size: 300
# vocab size: 71290
# x_max: 10.000000
# alpha: 0.750000
# 12/11/18 - 12:58.58AM, iter: 001, cost: 0.070201
# 12/11/18 - 01:00.33AM, iter: 002, cost: 0.052521
# ...
The model can be imported with
model = wordvectors("text8-vec.txt", Float32, header=true, kind=:text)
# WordVectors 71291 words, 300-element Float32 vectors
The vector representation of a word can be obtained using get_vector
.
julia> get_vector(model, "book")
# 300-element Array{Float32,1}:
# 0.006189716
# 0.04822071
# 0.017121462
# ...
The cosine similarity of book
, for example, can be computed using cosine_similar_words
.
julia> cosine_similar_words(model, "book")
# 10-element Array{String,1}:
# "book"
# "books"
# "published"
# "domesday"
# "novel"
# "comic"
# "written"
# "bible"
# "urantia"
# "work"
Word vectors have many interesting properties. For example,
vector("king") - vector("man") + vector("woman")
is close to vector("queen")
.
julia> analogy_words(model, ["king", "woman"], ["man"])
# 5-element Array{String,1}:
# "queen"
# "daughter"
# "children"
# "wife"
# "son"
This code has an MIT license and therefore it is free. GloVe is released under an Apache License v2.0.
[1] GloVe: Global Vectors for Word Representation
[2] Glove.jl - native Julia implementation
The design of the package relies on design concepts from the word2vec Julia interface, Word2Vec.jl.
Please file an issue to report a bug or request a feature.