understanding custom pipelines #73
Yes, I suggest you use the latest version of botok. We have simplified the botok config, which in turn simplifies building custom pipelines. The latest version introduces dialect packs, which are similar to the profiles in previous versions but do a bit more. Each dialect pack has two components: Dictionary and Adjustments. The Dictionary component contains all the standardized word lists and rules (to adjust segmentation) used for tokenization, while the Adjustments component is for researching and testing segmentation; its content will eventually be merged into the Dictionary component. Adjustments can also be used to customize the default tokenization.

As for getting the expected output from the toy example above, here is the corrected version:

```python
from botok import BoSyl, Config, TokChunks, Tokenize, Trie

in_str = "༈ བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ཆོས་ལམ་དུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ལམ་འཁྲུལ་བ་ཞིག་པར་བྱིན་གྱིས་རློབས། །འཁྲུལ་པ་ཡེ་ཤེས་སུ་འཆར་བར་བྱིན་གྱིས་རློབས། །"

# Build the trie from the default dialect pack.
config = Config()
trie = Trie(
    BoSyl,
    profile=config.profile,
    main_data=config.dictionary,
    custom_data=config.adjustments,
    pickle_path=config.dialect_pack_path.parent,
)
tok = Tokenize(trie)

# Pre-process the input into syllable chunks and feed them to the trie.
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()

tokens = tok.tokenize(preproc)
out = [token["text"] for token in tokens]
print(out)
```

Output:
To customize the adjustment rules, you can also refer to the WordTokenizer class: https://github.com/Esukhia/botok/blob/7d85cbb0df62ff4c9da3c70088ad671f03472a18/botok/tokenizers/wordtokenizer.py#L28. PS: We will be releasing the botok documentation soon.
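Conceptually, botok serves syllables to a trie and keeps the longest sequence that forms a dictionary word; the Adjustments component then tweaks those matches. The following self-contained sketch (plain Python, not botok's actual implementation; the word list and input are made up) illustrates the longest-match idea:

```python
# Toy longest-match segmenter over a syllable trie.
# A conceptual sketch, NOT botok's real code; the dictionary is invented.

def build_trie(words):
    """Build a nested-dict trie from words given as lists of syllables."""
    trie = {}
    for word in words:
        node = trie
        for syl in word:
            node = node.setdefault(syl, {})
        node["_end"] = True  # marks the end of a complete dictionary word
    return trie

def segment(syllables, trie):
    """Greedy longest match: at each position keep the longest trie word."""
    out, i = [], 0
    while i < len(syllables):
        node, longest = trie, 1  # fall back to a single syllable
        for j in range(i, len(syllables)):
            if syllables[j] not in node:
                break
            node = node[syllables[j]]
            if "_end" in node:
                longest = j - i + 1
        out.append("".join(syllables[i:i + longest]))
        i += longest
    return out

# Made-up dictionary with an ambiguous prefix: "ab" vs the longer "abc".
words = [["a", "b"], ["a", "b", "c"], ["c", "d"]]
trie = build_trie(words)
print(segment(["a", "b", "c", "d", "e"], trie))  # ['abc', 'd', 'e']
```

Note how the longer match "abc" wins over "ab", leaving "d" and "e" as single-syllable tokens; this is the behavior a dialect pack's word list drives in botok.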
How wonderful! As for dialect packs, our use is 100% on Buddhadharma texts. Do you have a recommendation for which dialect pack to use?
Currently, we only have a dialect pack for the general Tibetan language. Our research team is working on a dialect pack for Buddhadharma texts. Until then, you can experiment with the general dialect pack to improve the segmentation. We will be releasing detailed documentation on customizing any dialect pack.
In the toy example below, my expectation is to obtain a tokenized version of the input text. With the code below, the result is a list of tokens, but the tokens are syllables only.
How can I change the code to achieve what I'm after?