This is the Python version of the BulStem stemming algorithm. It follows the algorithm presented in
Nakov, P. BulStem: Design and evaluation of inflectional stemmer for Bulgarian. In Workshop on
Balkan Language Resources and Tools (Balkan Conference in Informatics).
See http://people.ischool.berkeley.edu/~nakov/bulstem/ for the homepage of the algorithm. Also, check the original paper for more details and examples.
Unlike the other available implementations, this one uses a trie instead of a dictionary/hash table to find the longest rule that can be applied to a token.
Basic algorithm steps:
- Find the position of the first vowel in the token.
- Find the longest matching rule by traversing the token in reverse order, going no further back than the position of the first vowel (found in the first step).
- Prepend the non-stemmed prefix to the stemmed suffix produced in the second step.
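The steps above can be sketched roughly as follows. This is a minimal illustration and not the library's actual code: the names `SuffixTrie`, `first_vowel`, and `stem` are hypothetical, the vowel set is assumed, and the sketch assumes a rule may only replace characters that come after the first vowel. It ignores `min_freq` and `left_context`.

```python
# Illustrative sketch only; the names below are hypothetical, not the bulstem API.
VOWELS = frozenset("аъоуеияю")  # lowercase Bulgarian vowels (assumed set)


class SuffixTrie:
    """Trie keyed on reversed suffixes, so lookups walk a token right to left."""

    def __init__(self):
        self.root = {}

    def add(self, suffix, replacement):
        node = self.root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[None] = replacement  # the None key marks the end of a rule

    def longest_match(self, token, stop):
        """Return (start, replacement) for the longest rule matching a suffix
        of `token` that starts at index >= `stop`, or None if no rule matches."""
        node, best = self.root, None
        for i in range(len(token) - 1, stop - 1, -1):
            node = node.get(token[i])
            if node is None:
                break
            if None in node:
                best = (i, node[None])  # later hits are longer, so keep the last
        return best


def first_vowel(token):
    """Step 1: index of the first vowel, or len(token) if there is none."""
    return next((i for i, ch in enumerate(token) if ch in VOWELS), len(token))


def stem(token, trie):
    # Assumption: a rule may only replace characters after the first vowel.
    stop = first_vowel(token) + 1
    match = trie.longest_match(token, stop)  # step 2
    if match is None:
        return token
    start, replacement = match
    return token[:start] + replacement  # step 3


trie = SuffixTrie()
trie.add("ой", "о")
trie.add("й", "")
print(stem("порой", trie))  # поро
```

Storing the suffixes reversed means a single right-to-left pass over the token visits every candidate suffix, and the last rule end seen on that pass is the longest match.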
This library is compatible with Python >= 3.6.
Clone the repository and run:
pip install -e .
pip install -r requirements.txt
A set of tests is included in the project, under the tests folder. The test suite can be run as follows:
pip install -e ".[testing]"
pip install -r requirements-test.txt
python -m unittest
The library works with a set of rules used for stemming. The rules can be either passed as a list to the BulStemmer
constructor, or as a path to a file.
For both options the rules need to be formatted as follows:
word ==> stem ==> freq
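As a rough illustration of this format, a rule line could be split into its three fields as sketched below. `Rule` and `parse_rule` are hypothetical helpers, not part of the library, and the sketch assumes the fields contain no internal whitespace. It also tolerates a line where the frequency is separated by a space only.

```python
from typing import NamedTuple


class Rule(NamedTuple):  # hypothetical container, not part of bulstem
    suffix: str
    stem: str
    freq: int


def parse_rule(line: str) -> Rule:
    """Parse 'word ==> stem ==> freq'; also tolerates 'word ==> stem freq'."""
    # Treat every '==>' as a separator, then split on whitespace.
    parts = line.replace("==>", " ").split()
    if len(parts) != 3:
        raise ValueError(f"malformed rule: {line!r}")
    suffix, stem, freq = parts
    return Rule(suffix, stem, int(freq))  # int() rejects non-numeric frequencies


print(parse_rule("ой ==> о ==> 10"))
```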
A pre-defined set of rules is included in the package and can be used directly (see the examples that load pre-defined rule sets and that read the rules from an external file). The stemming rules can be found here.
from bulstem.stem import BulStemmer
stemmer = BulStemmer(["ой ==> о 10"], min_freq=0, left_context=2)
stemmer.stem('порой')  # Expected output: 'поро'
BulStemmer constructor params:
- rules - Iterable of strings containing rules.
- min_freq - The minimum frequency of a rule to be used when stemming.
- left_context - Size of the prefix which will not be stemmed.
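To make the two numeric parameters more concrete, here is a small sketch of how they could be interpreted, based only on the descriptions above. Both helpers are hypothetical. Whether the library compares frequencies inclusively, and exactly how it applies the protected prefix, are assumptions here.

```python
def filter_rules(rules, min_freq):
    """Keep (suffix, stem, freq) rules with freq >= min_freq.

    Assumption: the comparison is inclusive; the library may differ.
    """
    return [rule for rule in rules if rule[2] >= min_freq]


def protected_split(token, left_context):
    """Split a token into a prefix of `left_context` characters that is
    never stemmed and the remainder that rules may rewrite."""
    return token[:left_context], token[left_context:]


rules = [("ой", "о", 10), ("ият", "и", 1)]
print(filter_rules(rules, min_freq=2))           # only the first rule remains
print(protected_split("вторият", left_context=2))
```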
from bulstem.stem import BulStemmer
# Pre-defined names of rule sets
PRE_DEFINED_RULES = ['stem-context-1',
                     'stem-context-2',
                     'stem-context-3']
# Expected output:
# 1 втор
# 2 втори
# 3 вторият
for i, rules_name in enumerate(PRE_DEFINED_RULES, start=1):
    stemmer = BulStemmer.from_file(rules_name, min_freq=2, left_context=i)
    print(i, stemmer.stem('вторият'))
stemmer = BulStemmer.from_file('stem_rules_context_2_utf8.txt', min_freq=2, left_context=2)
stemmer.stem('вторият')  # Expected output: 'втори'
stemmer.stem('вероятен')  # Expected output: 'вероят'
BulStemmer.from_file params:
- path - Path (or pre-defined name) to the rules file formatted as follows: word ==> stem ==> freq.
- min_freq - The minimum frequency of a rule to be used when stemming.
- left_context - Size of the prefix which will not be stemmed.
Other implementations: Perl (original), Java (JDK 1.4), Ruby, C#, Python 2, GATE plugin (Java)
For license information, see LICENSE.