Question #53
Replies: 4 comments 4 replies
-
Hello,
Very pertinent question! Yes, BM25Vectorizer and Similarity are language agnostic. Since winkNLP's tokenizer is multilingual (it supports Latin scripts), it can easily tokenize the text; please refer to the features section in the README. Once the text is tokenized, these two utilities can use the tokens for further processing. For example, look at the following code with German text as input:
const winkNLP = require( 'wink-nlp' );
const its = require( 'wink-nlp/src/its.js' );
// Use web model for RunKit.
const model = require( 'wink-eng-lite-web-model' );
const nlp = winkNLP( model );
const text = `winkNLP ist eine JavaScript-Bibliothek für Natural Language Processing
(NLP). WinkNLP wurde speziell entwickelt, um die Entwicklung von NLP-Lösungen einfacher
und schneller zu machen, und ist für das richtige Gleichgewicht zwischen Leistung und
Genauigkeit optimiert.`;
const doc = nlp.readDoc( text );
// Print tokens.
console.log( doc.tokens().out() );
// Print each token's type.
console.log( doc.tokens().out( its.type ) );
Note that the above example uses the English model. The resulting tokens can be used as input to the above-mentioned utilities.
Best,
-
Both |
-
Ah, I think I missed that your German example used the "English model."
If I'm now understanding it correctly, then as long as I'm using a
Latin-script language, my current winkNLP setup will work, right?
Please confirm, and thanks very much for your help.
Cordially,
Paul
…On Mon, Sep 6, 2021, 4:59 AM Sanjaya Kumar Saxena ***@***.***> wrote:
For any *bag-of-words* based similarity, we only need tokens. WinkNLP and
wink-tokenizer both can tokenize any language that is based on Latin
script.
Therefore if you want an integrated package, please use 'winkNLP'.
Otherwise you can use the combination of wink-distance and wink-tokenizer.
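To illustrate why bag-of-words similarity needs nothing beyond tokens, here is a minimal, library-free sketch. It does not use the winkNLP or wink-distance APIs: `tokenize`, `bagOfWords`, and `cosine` are hypothetical helpers written only for this example, and the regex tokenizer is a crude stand-in for a real multilingual tokenizer. The German sentences are made up for the demonstration.

```javascript
// Hypothetical stand-in tokenizer (NOT winkNLP's): lowercases the text and
// keeps runs of Unicode letters and digits, so umlauts survive intact.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

// Build a bag-of-words: token -> number of occurrences.
function bagOfWords(tokens) {
  const bow = new Map();
  for (const t of tokens) bow.set(t, (bow.get(t) || 0) + 1);
  return bow;
}

// Cosine similarity between two bags-of-words.
// Returns 0 when either bag is empty.
function cosine(bowA, bowB) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [token, count] of bowA) {
    normA += count * count;
    if (bowB.has(token)) dot += count * bowB.get(token);
  }
  for (const count of bowB.values()) normB += count * count;
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// No language model is consulted anywhere; only token overlap matters.
const a = bagOfWords(tokenize('Die Katze schläft auf dem Sofa.'));
const b = bagOfWords(tokenize('Die Katze schläft auf dem Teppich.'));
const c = bagOfWords(tokenize('Der Zug fährt nach Berlin.'));

console.log(cosine(a, b)); // the two sentences share five of six tokens
console.log(cosine(a, c)); // the two sentences share no tokens
```

Nothing in the computation depends on the language of the text, which is the sense in which such utilities are language agnostic. Anything beyond tokenization (for example stemming or stop-word removal) would still be language specific.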
-
Thank you, sir!
…-Paul
On Mon, Sep 6, 2021 at 9:10 AM Sanjaya Kumar Saxena < ***@***.***> wrote:
Right @paul-bell <https://github.com/paul-bell>, as long as we need only
tokenization/bow similarity. 🙂
-
Hi,
I am about to start looking at BM25 Vectorizer. I recalled our discussion, available at
#31
wherein Sanjaya remarks that both BM25 and Wink's Similarity are language agnostic. If this is true, why do its examples, e.g.,
https://winkjs.org/wink-nlp/bm25-vectorizer.html
show an English model?
Thank you.