As it stands, Text(doc).list_word_types combines tokenization with a statistical operation (basic word frequency). In a typical workflow I might first tokenize and then compute some statistics on the result. With a bigger doc this would be quite painful, as I would basically spend twice the time.
May I suggest we separate statistics into its own class that accepts as its input any of the outputs from Text(doc)? That way we can offer other common things like co-occurrence and ngram statistics.
I would be happy to make a PR for such a class if you think it's a good idea. I think it would also help keep the namespace of Text() clean, where by the way you have done a fantastic job. It's rare to see a Python package with this standard of namespace clarity.
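To make the proposal concrete, here is a rough sketch of what such a class could look like. It is only an illustration: `TokenStats` and its method names are hypothetical and not part of botok, and it assumes the tokens have already been produced once (for example from one of the Text(doc) outputs) and are passed in as a plain list of strings.

```python
from collections import Counter
from typing import Iterable, List


class TokenStats:
    """Hypothetical statistics helper (not part of botok).

    Takes already-tokenized input so tokenization is never repeated.
    """

    def __init__(self, tokens: Iterable[str]):
        # Materialize once so every statistic reuses the same list.
        self.tokens: List[str] = list(tokens)

    def word_frequencies(self) -> Counter:
        """Count how often each token occurs."""
        return Counter(self.tokens)

    def ngram_frequencies(self, n: int = 2) -> Counter:
        """Count contiguous n-grams, each represented as a tuple of tokens."""
        if n < 1:
            raise ValueError("n must be at least 1")
        shifted = (self.tokens[i:] for i in range(n))
        return Counter(zip(*shifted))

    def cooccurrences(self, window: int = 2) -> Counter:
        """Count unordered token pairs that appear within `window` positions."""
        pairs = Counter()
        for i, left in enumerate(self.tokens):
            for right in self.tokens[i + 1 : i + 1 + window]:
                pairs[tuple(sorted((left, right)))] += 1
        return pairs


# Placeholder tokens; in practice these would come from a single tokenization pass.
tokens = ["a", "b", "a", "b", "c"]
stats = TokenStats(tokens)
frequencies = stats.word_frequencies()
bigrams = stats.ngram_frequencies(2)
pairs = stats.cooccurrences(window=1)
```

The point of the design is that the tokenized output is computed once and every statistic is derived from it, so adding bigrams or co-occurrence does not trigger another tokenization pass.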
current way: 5.92 s ± 55.5 ms per loop
proposed way: 2.27 ms ± 16.7 µs per loop
# bonus for bigrams (and other more complex ops)
proposed way for bigrams: 16.9 ms ± 349 µs per loop
@mikkokotila, I'm all for improvements! Thanks for the proposal.
Have you looked at how easy it is to set up different components for the Text class?
Here's how you use them: https://github.com/Esukhia/botok/blob/master/tests/test_text.py#L169-L183
Have a look at the function signatures here to see what the building blocks of the pipeline take in and return. They are all meant to be independent of one another, so it would be pretty easy to plug in any external code, class, function or whatever.
One good use of that modularity is to create a tokenizer pipe with the tokenizer instantiated outside of Text, so it does not need to be loaded into memory every time tokenizing needs to be done in a Text object.
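The general pattern behind that last point can be sketched as follows. This deliberately does not use botok's API: `ExpensiveTokenizer` and `make_tokenizer_pipe` are hypothetical stand-ins, and the whitespace split is just a placeholder for real tokenization. How the resulting pipe is handed to Text is shown in the linked tests.

```python
from typing import Callable, List


class ExpensiveTokenizer:
    """Hypothetical stand-in for a tokenizer that is costly to construct."""

    instances = 0  # counts constructions, to show the tokenizer is built once

    def __init__(self) -> None:
        ExpensiveTokenizer.instances += 1

    def tokenize(self, text: str) -> List[str]:
        # Placeholder logic: split on whitespace.
        return text.split()


def make_tokenizer_pipe(tokenizer: ExpensiveTokenizer) -> Callable[[str], List[str]]:
    """Wrap an existing tokenizer in a plain function usable as a pipeline step."""

    def pipe(text: str) -> List[str]:
        return tokenizer.tokenize(text)

    return pipe


# Build the tokenizer once, outside any per-document object...
shared_tokenizer = ExpensiveTokenizer()
tokenizer_pipe = make_tokenizer_pipe(shared_tokenizer)

# ...then reuse the same pipe for every document.
documents = ["a b", "c d e"]
results = [tokenizer_pipe(doc) for doc in documents]
```

Because the pipe closes over an already constructed tokenizer, processing many documents pays the loading cost only once.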
I wanted to make sure you understood what I meant to do with this pipeline.
Thanks for your help!