A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching. This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large amounts of text). Currently, this package supports both character-based and word-based N-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.
- Fast algorithm for string matching
- 100% exact retrieval
- Unicode support
- Support for building databases directly from text files
- MeCab-based tokenizer support
- Support for persistent databases like MongoDB
- Dice coefficient
- Jaccard coefficient
- Cosine coefficient
- Overlap coefficient
- Exact match
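For readers unfamiliar with the measures listed above, here is a minimal sketch of how they are commonly defined over two feature sets, using character N-grams as features. This is plain Julia and does not use this package's API: the names `char_ngrams`, `dice`, `jaccard`, `cosine`, `overlap` and `exact_match` are hypothetical helpers written for illustration, and the sketch assumes simple sets of N-grams with no padding or special handling of repeated N-grams, which may differ from what the package does internally.

```julia
# Illustrative helpers only; these are NOT part of the SimString API.

# Character n-grams of a string, collected into a set (duplicates collapse).
function char_ngrams(s::AbstractString, n::Int)
    chars = collect(s)  # Vector{Char}, so multi-byte Unicode characters are safe
    length(chars) < n && return Set{String}()
    return Set{String}(String(chars[i:i+n-1]) for i in 1:length(chars)-n+1)
end

# Set-based similarity measures; X and Y are feature sets.
# Note: these return NaN when both sets are empty (0/0).
dice(X, Y)    = 2 * length(intersect(X, Y)) / (length(X) + length(Y))
jaccard(X, Y) = length(intersect(X, Y)) / length(union(X, Y))
cosine(X, Y)  = length(intersect(X, Y)) / sqrt(length(X) * length(Y))
overlap(X, Y) = length(intersect(X, Y)) / min(length(X), length(Y))
exact_match(X, Y) = X == Y

x = char_ngrams("night", 2)  # ni, ig, gh, ht
y = char_ngrams("nacht", 2)  # na, ac, ch, ht

println(dice(x, y))     # 0.25 (one shared bigram "ht", four bigrams each)
println(jaccard(x, y))  # 1/7, one shared bigram out of seven distinct
println(cosine(x, y))   # 0.25
println(overlap(x, y))  # 0.25
```

All four coefficients range from 0 (no shared features) to 1; they differ only in how the size of the intersection is normalized, which is why the choice of measure changes which candidates pass a given threshold.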
You can grab the latest stable version of this package from the Julia registries by running the command below.
NB: Don't forget to invoke Julia's package manager with ] first.
pkg> add SimString
The brave few can try the current experimental features by adding the main branch to their development environment, after invoking the package manager with ]:
pkg> add SimString#main
You are now good to go with bleeding-edge features and breakages!
To revert to the stable version, run:
pkg> free SimString
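If you prefer to script these steps (for example in a setup script or CI job), the same operations are available through Julia's `Pkg` functional API. The sketch below wraps each step in a small function; the function names are hypothetical and chosen only for illustration, while `Pkg.add` and `Pkg.free` are the standard library calls behind the `pkg>` commands shown above. Running any of them requires network access to the registry.

```julia
using Pkg

# Hypothetical wrapper names; each body mirrors one `pkg>` command above.

# pkg> add SimString
install_stable() = Pkg.add("SimString")

# pkg> add SimString#main
install_main() = Pkg.add(name="SimString", rev="main")

# pkg> free SimString
revert_to_stable() = Pkg.free("SimString")
```

Call `install_stable()` for the registered release, or `install_main()` followed later by `revert_to_stable()` if you want to try the development branch and then go back.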