Multiprocessing tokenization #70

Closed
wants to merge 4 commits into from
Conversation

Contributor

@10zinten 10zinten commented Jul 10, 2020

fix #64

@pep8speaks

pep8speaks commented Jul 10, 2020

Hello @10zinten! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 26:5: E303 too many blank lines (2)
Line 28:23: E741 ambiguous variable name 'l'
Line 40:1: W293 blank line contains whitespace
Line 57:80: E501 line too long (82 > 79 characters)
Line 63:80: E501 line too long (80 > 79 characters)
Line 67:5: E303 too many blank lines (2)
Line 302:10: W292 no newline at end of file

Line 70:80: E501 line too long (82 > 79 characters)

Line 10:80: E501 line too long (101 > 79 characters)

Line 170:6: W292 no newline at end of file

Line 85:80: E501 line too long (91 > 79 characters)
Line 89:80: E501 line too long (88 > 79 characters)
Line 110:80: E501 line too long (166 > 79 characters)
Line 115:80: E501 line too long (88 > 79 characters)
Line 178:1: E302 expected 2 blank lines, found 1

Line 136:80: E501 line too long (104 > 79 characters)
Line 144:80: E501 line too long (81 > 79 characters)
Line 152:1: W293 blank line contains whitespace
Line 171:1: W293 blank line contains whitespace
Line 178:80: E501 line too long (97 > 79 characters)
Line 182:80: E501 line too long (84 > 79 characters)
Line 191:80: E501 line too long (94 > 79 characters)
Line 200:1: W293 blank line contains whitespace
Line 205:80: E501 line too long (83 > 79 characters)
Line 222:1: W293 blank line contains whitespace
Line 235:80: E501 line too long (82 > 79 characters)
Line 249:1: E302 expected 2 blank lines, found 1
Line 257:35: W292 no newline at end of file

Comment last updated at 2020-07-27 06:08:31 UTC

@mikkokotila
Contributor

Looks like it will not parallelize properly.

I'm using the following example from the commit:

from botok import *
profile = "empty"
main, custom = Config().get_tok_data_paths(profile)
tok = Tokenize(Trie(BoSyl, profile, main, custom))
in_str = "མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་"
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
tokens = tok.parallelized_tokenize(preproc)

But I'm replacing in_str with a much longer text, something like this volume.

The result is that the workload occupies only a single thread and wall-clock time is unaffected.

@10zinten 10zinten closed this Jul 16, 2020
@10zinten 10zinten reopened this Jul 16, 2020
@10zinten
Contributor Author

10zinten commented Jul 16, 2020

Interesting, I will look into it as soon as possible.

@mikkokotila
Contributor

Do you have any update available for this?

@mikkokotila
Contributor

Related to this matter, here is a working code example for running Botok in a multiprocessing manner.
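
For readers without access to the linked example, the general pattern can be sketched roughly as follows. This is not the linked code and not the PR's implementation: `split_on_shad`, `tokenize_chunk`, and `tokenize_parallel` are hypothetical names, and `tokenize_chunk` is a stand-in that only splits on the tsek, where a real run would call the Botok tokenizer instead. The sketch also assumes that no token spans a shad, so the text can be cut there and the pieces handed to separate worker processes.

```python
import re
from multiprocessing import Pool

SHAD = "\u0f0d"  # Tibetan shad, used here as a safe place to cut the text
TSEK = "\u0f0b"  # Tibetan tsek, the syllable delimiter

# A segment is a run of text ending in one or more shads,
# or a trailing run of text with no shad at all.
SEGMENT_RE = re.compile(f"[^{SHAD}]*{SHAD}+|[^{SHAD}]+")
SYLLABLE_RE = re.compile(f"[^{TSEK}]+{TSEK}?")


def split_on_shad(text):
    """Cut text into shad-terminated segments without losing characters."""
    return SEGMENT_RE.findall(text)


def tokenize_chunk(chunk):
    """Stand-in tokenizer (assumption): split one segment into syllables.

    Replace the body with a call to the real tokenizer. The function must
    live at module level so that worker processes can import it.
    """
    return SYLLABLE_RE.findall(chunk)


def tokenize_parallel(text, workers=2):
    """Tokenize segments in a process pool and return one flat token list."""
    segments = split_on_shad(text)
    if workers <= 1 or len(segments) < 2:
        # Serial fallback: nothing to gain from starting a pool.
        per_segment = [tokenize_chunk(s) for s in segments]
    else:
        # Batch several segments per task to limit inter-process overhead.
        chunksize = max(1, len(segments) // (workers * 4))
        with Pool(processes=workers) as pool:
            # Pool.map preserves input order, so tokens stay in text order.
            per_segment = pool.map(tokenize_chunk, segments, chunksize)
    return [token for tokens in per_segment for token in tokens]
```

On platforms that start workers by spawning a fresh interpreter, the call to `tokenize_parallel` should sit under an `if __name__ == "__main__":` guard. Whether this actually reduces wall-clock time depends on how expensive each segment is relative to the cost of sending it to a worker and collecting the result.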

Development

Successfully merging this pull request may close these issues.

multi-threading
3 participants