
Ability to specify custom tokenizer #131

Open
ibnesayeed opened this issue Jan 16, 2017 · 6 comments

Comments

@ibnesayeed
Contributor

Currently, the following code is used to split a document into tokens/words for training and classification:

str.gsub(/[^\p{WORD}\s]/, '').downcase.split

This covers the general case, but there are situations where the user might want to customize how a document is split into words. For example, tokenizing Japanese text is a whole different problem. Another situation that calls for a custom tokenizer is training the model on N-grams (for example, bi-grams such as New York). Splitting New York into New and York would mean New gets removed if it is present in the stopwords. Similarly, to be or not to be is another popular example of a significant phrase made entirely of common stopwords.

N-grams often play a significant role in contextualizing a document and can improve the accuracy of the model in special situations. In many languages (Arabic, Persian, and Urdu, to name a few) two or more words are combined (still separated by spaces, just used together) to form various linguistic constructs. This can matter if one wants to identify the author of a relatively small piece of text, such as posts on forums.

It would be nice if we could pass a lambda as the tokenizer at classifier initialization time, or some other more expressive means to tell the system how to split the text.
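
A minimal sketch of what that might look like, assuming a hypothetical tokenizer: keyword argument (not part of the current API):

# Hypothetical interface: the tokenizer: keyword does not exist yet.
require 'classifier-reborn'

# A word-bigram tokenizer, so phrases like "New York" survive as one token.
bigram_tokenizer = lambda do |str|
  words = str.gsub(/[^\p{WORD}\s]/, '').downcase.split
  words.each_cons(2).map { |pair| pair.join(' ') }
end

classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting',
                                         tokenizer: bigram_tokenizer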

@Ch4s3
Member

Ch4s3 commented Jan 17, 2017

I was thinking about adding n-gram support as well. I want to do this by abstracting tokenizing out into a separate public API that can either be called by the classifier or passed in. I'm not sure which approach would be better.
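
For the first option, a rough sketch of tokenizing as its own public API; the module and method names here are illustrative, not the gem's actual interface:

module ClassifierReborn
  module Tokenizer
    module Whitespace
      # Reproduces the current splitting behavior as a standalone unit.
      def self.call(str)
        str.gsub(/[^\p{WORD}\s]/, '').downcase.split
      end
    end
  end
end

ClassifierReborn::Tokenizer::Whitespace.call('Brand New York Times')
# => ["brand", "new", "york", "times"]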

@ibnesayeed
Contributor Author

Would dependency injection be a good idea, where we create an instance of the tokenizer and then pass it during the initialization of the classifier, the way we do for the storage backend support?
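
Something along these lines, perhaps; the class and the tokenizer: keyword are both hypothetical, mirroring the existing backend injection pattern:

# Hypothetical: an injected tokenizer, any object responding to #call.
class JapaneseTokenizer
  def call(str)
    # A real implementation would delegate to a morphological analyzer
    # such as MeCab; plain whitespace splitting is only a placeholder.
    str.split
  end
end

classifier = ClassifierReborn::Bayes.new 'Sports', 'Politics',
                                         tokenizer: JapaneseTokenizer.new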

@ibnesayeed
Contributor Author

In the first post, what I described were n-grams based on words, which are also called shingles. However, one can also use letter-based n-grams, which often produce good results while putting a finite upper bound on total memory use (the maximum possible number of keys is the number of possible letters raised to the power of the n-gram length), and could be ideal for training on large collections.
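
For instance, with a 26-letter alphabet and n = 3 the key space is capped at 26^3 = 17,576 trigrams regardless of corpus size. A sketch of such a tokenizer, again assuming it would be passed in as above:

# Letter-based n-gram (character trigram) tokenizer; illustrative only.
char_trigrams = lambda do |str|
  normalized = str.gsub(/[^\p{WORD}]/, '').downcase
  normalized.chars.each_cons(3).map(&:join)
end

char_trigrams.call('New York') # => ["new", "ewy", "wyo", "yor", "ork"]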

@Ch4s3
Member

Ch4s3 commented Feb 8, 2017

Yeah, I think dependency injection is the way to go here.

@piroor
Contributor

piroor commented Jun 29, 2017

I've opened #161, but it should be resolved as part of this tokenizer issue... Sorry, I didn't research this before I opened it.

@Ch4s3
Member

Ch4s3 commented Jul 31, 2017

@piroor Thanks for hopping in!
