
statistics performance with tokenizer.list_word_types #61

Open
mikkokotila opened this issue Oct 20, 2019 · 3 comments

Comments

@mikkokotila
Contributor

mikkokotila commented Oct 20, 2019

As it stands, Text(doc).list_word_types combines tokenization with a statistical operation (basic word frequency). In a typical workflow I might first tokenize and then compute statistics on the result. With a bigger doc this gets quite painful, as I would essentially have to spend twice the time.

May I suggest we separate statistics into its own class that accepts as input any of the outputs of Text(doc)? That way we could also offer other common operations such as co-occurrence and n-gram statistics.

I would be happy to make a PR for such a class if you think it's a good idea. It would also help keep the namespace of Text() clean, which, by the way, you have done a fantastic job with. It's rare to see a Python package with this level of namespace clarity.
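For illustration, the proposed separation could look something like this minimal sketch (the name `word_frequencies` is hypothetical, not an existing botok API): the statistics step takes an already-tokenized sequence, so the tokenizer never runs a second time.

```python
from collections import Counter

def word_frequencies(tokens):
    """Basic word-frequency statistics over a pre-tokenized doc.

    `tokens` is any iterable of token strings, e.g. the output of a
    prior Text(doc) tokenization pass; no re-tokenization happens here.
    """
    return Counter(tokens)

# Usage: tokenize once, then derive statistics from the cached tokens.
tokens = ["bod", "skad", "bod", "yig"]
freqs = word_frequencies(tokens)
```

The same cached `tokens` list could then feed any number of other statistics without paying the tokenization cost again.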

@mikkokotila
Contributor Author

The proposed way is over 2,000 times faster:

current way: 5.92 s ± 55.5 ms per loop
proposed way: 2.27 ms ± 16.7 µs per loop

# bonus for bigrams (and other more complex ops)
proposed way for bigrams: 16.9 ms ± 349 µs per loop
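A sketch of how the bigram numbers above become achievable once tokens are cached (`ngram_counts` is a hypothetical helper for illustration, not an existing botok function): counting n-grams is just a cheap pass over the token list, with no tokenizer involved.

```python
from collections import Counter
from itertools import islice

def ngram_counts(tokens, n=2):
    """Count n-grams over an already-tokenized list; no re-tokenization.

    Builds n shifted views of the token sequence and zips them together,
    so each n-gram is a tuple of n consecutive tokens.
    """
    return Counter(zip(*(islice(tokens, i, None) for i in range(n))))

tokens = ["bod", "skad", "bod", "skad"]
bigrams = ngram_counts(tokens, n=2)  # counts ('bod','skad') twice
```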

@ngawangtrinley
Contributor

Sounds good to me, but I'll let @drupchen confirm.

@drupchen
Collaborator

@mikkokotila, I'm all for improvements! Thanks for the proposal.

Have you looked at how easy it is to set up different components for the Text class?
Here's how you use them: https://github.com/Esukhia/botok/blob/master/tests/test_text.py#L169-L183
Have a look at the function signatures there to see what the building blocks of the pipeline take in and spit out. They are all meant to be independent from one another, so it would be pretty easy to plug in any external code, class, function or whatever.

One good use of that modularity is to create a tokenizer pipe with the tokenizer instantiated outside of Text, so it does not need to be loaded into memory every time tokenizing needs to be done in a Text object.
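As a generic illustration of that pattern (using a stand-in class, not botok's actual API): pay the expensive construction cost once, then reuse the same instance across documents.

```python
class SlowLoadTokenizer:
    """Stand-in for a tokenizer with an expensive constructor,
    e.g. one that loads a trie from disk. Hypothetical, not botok's API."""

    def __init__(self):
        # Imagine heavy file I/O / trie building happening here, once.
        self.ready = True

    def tokenize(self, text):
        # Trivial whitespace split, just to keep the sketch runnable.
        return text.split()

# Instantiate once, outside the per-document objects...
tok = SlowLoadTokenizer()

# ...then reuse it for every document instead of reloading each time.
docs = ["bod skad", "yig gzugs"]
token_lists = [tok.tokenize(d) for d in docs]
```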

I wanted to make sure the intent behind this pipeline was clear to you before you start on a PR.
Thanks for your help!


No branches or pull requests

3 participants