
statistics performance with tokenizer.list_word_types #61

Open
mikkokotila opened this issue Oct 20, 2019 · 3 comments

Comments

@mikkokotila
Contributor

mikkokotila commented Oct 20, 2019

As it stands, Text(doc).list_word_types combines tokenization with a statistical operation (basic word frequency). In a typical workflow I might first tokenize and then compute statistics on the result. With a bigger doc this gets quite painful, as I would essentially have to spend twice the time.

May I suggest we separate statistics into its own class that accepts as input any of the outputs of Text(doc)? That way we could also offer other common operations such as co-occurrence and n-gram statistics.

I would be happy to make a PR for such a class if you think it's a good idea. It would also help keep the namespace of Text() clean, which, by the way, you have done a fantastic job with. It's rare to see a Python package with this level of namespace clarity.
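For illustration, the proposed separation could look something like this minimal sketch (the name `word_frequencies` is hypothetical, not an existing botok API): the statistics step takes an already-tokenized sequence, so the tokenizer never runs a second time.

```python
from collections import Counter

def word_frequencies(tokens):
    """Basic word-frequency statistics over a pre-tokenized doc.

    `tokens` is any iterable of token strings, e.g. the output of a
    prior Text(doc) tokenization pass; no re-tokenization happens here.
    """
    return Counter(tokens)

# Usage: tokenize once, then derive statistics from the cached tokens.
tokens = ["bod", "skad", "bod", "yig"]
freqs = word_frequencies(tokens)
```

The same cached `tokens` list could then feed any number of other statistics without paying the tokenization cost again.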

@mikkokotila
Contributor Author

The proposed way is over 2,000 times faster:

current way: 5.92 s ± 55.5 ms per loop
proposed way: 2.27 ms ± 16.7 µs per loop

# bonus for bigrams (and other more complex ops)
proposed way for bigrams: 16.9 ms ± 349 µs per loop
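A sketch of how the bigram numbers above become achievable once tokens are cached (`ngram_counts` is a hypothetical helper for illustration, not an existing botok function): counting n-grams is just a cheap pass over the token list, with no tokenizer involved.

```python
from collections import Counter
from itertools import islice

def ngram_counts(tokens, n=2):
    """Count n-grams over an already-tokenized list; no re-tokenization.

    Builds n shifted views of the token sequence and zips them together,
    so each n-gram is a tuple of n consecutive tokens.
    """
    return Counter(zip(*(islice(tokens, i, None) for i in range(n))))

tokens = ["bod", "skad", "bod", "skad"]
bigrams = ngram_counts(tokens, n=2)  # counts ('bod','skad') twice
```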

@ngawangtrinley
Contributor

Sounds good to me, but I'll let @drupchen confirm.

@drupchen
Collaborator

@mikkokotila, I'm all for improvements! Thanks for the proposal.

Have you looked at how easy it is to set up different components for the Text class?
Here's how you use them: https://github.com/Esukhia/botok/blob/master/tests/test_text.py#L169-L183
Have a look at the function signatures there to see what the building blocks of the pipeline take in and spit out. They are all meant to be independent from one another, so it would be pretty easy to plug in any external code, class, function or whatever.

One good use of that modularity is to create a tokenizer pipe with the tokenizer instantiated outside of Text, so it does not need to be loaded into memory every time tokenizing needs to be done in a Text object.
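As a generic illustration of that pattern (using a stand-in class, not botok's actual API): pay the expensive construction cost once, then reuse the same instance across documents.

```python
class SlowLoadTokenizer:
    """Stand-in for a tokenizer with an expensive constructor,
    e.g. one that loads a trie from disk. Hypothetical, not botok's API."""

    def __init__(self):
        # Imagine heavy file I/O / trie building happening here, once.
        self.ready = True

    def tokenize(self, text):
        # Trivial whitespace split, just to keep the sketch runnable.
        return text.split()

# Instantiate once, outside the per-document objects...
tok = SlowLoadTokenizer()

# ...then reuse it for every document instead of reloading each time.
docs = ["bod skad", "yig gzugs"]
token_lists = [tok.tokenize(d) for d in docs]
```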

I wanted to make sure the intent behind this pipeline was clear to you before you start on a PR.
Thanks for your help!


No branches or pull requests

3 participants