Incorrect similar words for words with 'vbz' pos #177
Comments
I think this problem appears because some words have incorrect pos tags in the dict:
"computerized":["k-ah-m p-y-uw1 t-er ay-z-d","jj nn vb vbn"],
"discriminated":["d-ih s-k-r-ih1 m-ah n-ey t-ah-d","vbd jj nn vb"],
"expected":["ih-k s-p-eh1-k t-ah-d","vbn vbd jj vb"]
Words like 'computerized' are treated as base-form verbs (because their pos contains 'vb'), so when the target pos is vbz, the conjugator directly returns 'computerizeds'. The easiest fix might be to just correct these words' pos in the dict?
#179 might be for the same reason.
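To make the failure mode described in the first comment concrete, here is a minimal sketch. It is not RiTa's actual conjugator: `dict`, `isBaseForm`, and `toThirdPerson` are hypothetical names, and the only behavior assumed is the one described above, namely that any word whose tag list contains 'vb' is treated as a base form.

```javascript
// Hypothetical mini-lexicon, using the same [phones, tags] entry shape as the dict
const dict = {
  "computerized": ["k-ah-m p-y-uw1 t-er ay-z-d", "jj nn vb vbn"],
  "enter": ["eh1-n t-er", "vb vbn vbp"]
};

// Assumed check: a word counts as a base-form verb if its tags include 'vb'
function isBaseForm(word, lexicon) {
  const entry = lexicon[word];
  return entry !== undefined && entry[1].split(" ").includes("vb");
}

// Naive vbz rule for illustration: append 's' to anything taken as a base form
function toThirdPerson(word, lexicon) {
  return isBaseForm(word, lexicon) ? word + "s" : null;
}

console.log(toThirdPerson("enter", dict));        // "enters"
console.log(toThirdPerson("computerized", dict)); // "computerizeds" (the bug)

// With the stray 'vb' (and 'nn') tags removed, the past form is no longer
// mistaken for a base form
const fixed = { ...dict, "computerized": [dict["computerized"][0], "jj vbn"] };
console.log(toThirdPerson("computerized", fixed)); // null
```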
Good catch -- I wonder if we might be able to remove all the 'vbn' tags from the dictionary, since we can compute them from the base form.
So we have done this before with verb tenses (see earlier tickets from @cqx931 below). Once we find a pos that we want to remove from the dict, we need to find all the places in the code that would need updating to deal with that pos (soundsLike, spellsLike, search, pos, conjugate, hasWord, tag, etc.), then add tests (which will fail), then add the code to handle these cases, then remove the words with a script, then re-run the tests and adjust until they pass. See:
So here's a list of verbs with incorrect pos in the current dict; the pos tags I think need to be removed/added are in the comments:
"beat": ["b-iy1-t", "vb jj nn vbd vbn vbp"], //-vbn
"become": ["b-ih k-ah1-m", "vb vbd vbn vbp"], //-vbd
"bit": ["b-ih1-t", "nn vbd vbn jj rb vb"], //-vb, -vbn
"bore": ["b-ao1-r", "vbd vbp jj nn vb"], //-vbd
"broke": ["b-r-ow1-k", "vbd vbn jj rb vb"], //-vb, -vbn
"build": ["b-ih1-l-d", "vb vbn vbp nn"], //-vbn
"called": ["k-ao1-l-d", "vbn vbd vb"], //-vb
"come": ["k-ah1-m", "vb vbd vbn vbp vbz jj"], //-vbd, -vbz
"committed": ["k-ah m-ih1 t-ah-d", "vbn jj vb vbd"], //-vb
"computerized": ["k-ah-m p-y-uw1 t-er ay-z-d", "jj nn vb vbn"], //-vb, -nn
"concerned": ["k-ah-n s-er1-n-d", "vbn jj vb vbd"], //-vb
"discriminated": ["d-ih s-k-r-ih1 m-ah n-ey t-ah-d", "vbd jj nn vb"], //-vb, -nn
"ended": ["eh1-n d-ah-d", "vbd jj vb vbn"], //-vb
"enter": ["eh1-n t-er", "vb vbn vbp"], //-vbn
"expected": ["ih-k s-p-eh1-k t-ah-d", "vbn vbd jj vb"], //-vb
"finished": ["f-ih1 n-ih-sh-t", "vbd jj vb vbn"], //-vb
"gained": ["g-ey1-n-d", "vbd vbn vb"], //-vb
"got": ["g-aa1-t", "vbd vbn vbp vb"], //-vb, -vbn
"have": ["hh-ae1-v", "vbp jj nn vb vbn"], //-vbn
"include": ["ih-n k-l-uw1-d", "vbp vbn vb"], //-vbn
"increased": ["ih-n k-r-iy1-s-t", "vbn jj vb vbd"], //-vb
"involved": ["ih-n v-aa1-l-v-d", "vbn vbd jj vb"], //-vb
"knit": ["n-ih1-t", "vbn jj nn vb"], //+vbd
"launched": ["l-ao1-n-ch-t", "vbn vbd vb"], //-vb
"lead": ["l-eh1-d", "vb vbn vbp jj nn"], //-vbn
"led": ["l-eh1-d", "vbn vbd vb"], //-vb
"lived": ["l-ay1-v-d", "vbd vbn vb"], //-vb
"outpaced": ["aw1-t p-ey-s-t", "vbd nn vb vbn vbp"], //-vb
"oversaw": ["ow1 v-er s-ao", "vbd vb"], //-vb
"oversold": ["ow1 v-er s-ow1-l-d", "vbn jj vb"], //-vb
"own": ["ow1-n", "jj vbn vbp vb"], //-vbn
"paled": ["p-ey1-l-d", "vbd vb vbn"], //-vb
"pay": ["p-ey1", "vb vbd vbp nn"], //-vbd
"plan": ["p-l-ae1-n", "nn vb vbn vbp"], //-vbn
"post": ["p-ow1-s-t", "nn in jj vb vbd vbp"], //-vbd
"prepaid": ["p-r-iy p-ey1-d", "jj vbn vb"], //-vb
"pressured": ["p-r-eh1 sh-er-d", "vbn jj nn vb vbd"], //-vb
"proliferated": ["p-r-ah l-ih1 f-er ey t-ih-d", "vbn vb vbd"], //-vb
"remade": ["r-iy m-ey1-d", "vbn nn vb"], //-vb, +vbd
"rent": ["r-eh1-n-t", "nn vb vbn vbp"], //-vbn
"reopened": ["r-iy ow1 p-ah-n-d", "vbd vbn vb"], //-vb
"reported": ["r-iy p-ao1-r t-ah-d", "vbd jj vb vbn vbp"], //-vb
"repurchase": ["r-iy p-er1 ch-ah-s", "nn vbd vbn jj vb"], //-vbd, -vbn
"resold": ["r-iy s-ow1-l-d", "vbn vbd vbp vb"], //-vb
"roast": ["r-ow1-s-t", "nn vb vbn"], //-vbn
"settled": ["s-eh1 t-ah-l-d", "vbd vbn jj vb"], //-vb
"spit": ["s-p-ih1-t", "vb nn vbd"], //+vbn
"started": ["s-t-aa1-r t-ah-d", "vbd jj vbn vb"], //-vb
"sublet": ["s-ah1 b-l-eh-t", "vb vbn"], //+vbd
"trouble": ["t-r-ah1 b-ah-l", "nn vbd vbp jj vb"], //-vbd
"wed": ["w-eh1-d", "vbn vb"], //+vbd
"were": ["w-er", "vbd vb"], //-vb
"weren't": ["w-er-ah-n-t", "vbd vb"], //-vb
"wet": ["w-eh1-t", "jj nn vbd vb vbp"], //+vbn
I suggest that the first step is to remove the 'vb' tags from words that are not in base form, which should fix the problem in this ticket. Then we can consider removing those verbs with only vb* tags and no other tags, as suggested in dhowe/RiTaV1#357.
For step 1, below are the corresponding tests to be added, taking 'concern' ('concerned') as an example:
//hasWord
expect(RiTa.hasWord("concerned")).to.be.true;
expect(RiTa.hasWord("concerneds")).to.be.false;
expect(RiTa.hasWord("concerneded")).to.be.false;
//pos
eql(RiTa.pos("concerned"), ["vbd"]);
eql(RiTa.pos("concerned", { simple: 1 }), ["v"]);
//search
expect(RiTa.search({ pos: "vb",limit: -1 }).includes("concerned")).to.be.false;
expect(RiTa.search({ pos: "vbn",limit: -1 }).includes("concerned")).to.be.true;
expect(RiTa.search('concern', { pos: "vbd", limit: -1 })).eql([ 'concerned']);
expect(RiTa.search('concern', { pos: "vbn", limit: -1 })).eql([ 'concerned']);
//conjugate
let opt = {
  number: RiTa.SINGULAR,
  person: RiTa.FIRST,
  tense: RiTa.PAST
};
expect(RiTa.conjugate("concern", opt)).eq("concerned");
//unconjugate
expect(RiTa.conjugator.unconjugate("concerned")).eq("concern");
//allTags
expect(RiTa.tagger.allTags("concerned")).eql(['vbd','jj','vbn']);
//tag
eq(RiTa.tagger.tag(["I", "am", "concerned", "about","this", "."], { inline: true }), "I/prp am/vbp concerned/jj about/in this/dt .");
//soundsLike
expect(RiTa.soundsLike("concern", { pos: 'vb' }).includes("concerned")).to.be.false;
//spellsLike
expect(RiTa.spellsLike("concern", { pos: 'vb' }).includes("concerned")).to.be.false;
Please let me know if any part of the list or tests has problems.
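The tag changes proposed in the list could be applied with a small script along these lines. This is only a sketch under stated assumptions: `applyTagEdits` and the `edits` table are hypothetical (the table just mirrors the '//-tag' and '//+tag' comments for three of the listed entries), and the real dict file and any existing RiTa tooling may look different.

```javascript
// Three entries copied from the list, in the dict's [phones, tags] shape
const dict = {
  "concerned": ["k-ah-n s-er1-n-d", "vbn jj vb vbd"],
  "computerized": ["k-ah-m p-y-uw1 t-er ay-z-d", "jj nn vb vbn"],
  "knit": ["n-ih1-t", "vbn jj nn vb"]
};

// Hypothetical edit table mirroring the '//-tag' and '//+tag' comments
const edits = {
  "concerned": { remove: ["vb"], add: [] },
  "computerized": { remove: ["vb", "nn"], add: [] },
  "knit": { remove: [], add: ["vbd"] }
};

// Returns a new lexicon with the requested tags removed/added per word;
// words without an entry in tagEdits are copied unchanged
function applyTagEdits(lexicon, tagEdits) {
  const result = {};
  for (const [word, [phones, tags]] of Object.entries(lexicon)) {
    const edit = tagEdits[word] || { remove: [], add: [] };
    const kept = tags.split(" ").filter(t => !edit.remove.includes(t));
    const added = edit.add.filter(t => !kept.includes(t));
    result[word] = [phones, kept.concat(added).join(" ")];
  }
  return result;
}

const updated = applyTagEdits(dict, edits);
console.log(updated["concerned"][1]);    // "vbn jj vbd"
console.log(updated["computerized"][1]); // "jj vbn"
console.log(updated["knit"][1]);         // "vbn jj nn vb vbd"
```

Keeping the edits in a separate table, rather than editing the dict by hand, makes the change easy to review against the list and to re-run if the dict is regenerated.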
This looks really good -- I think the ultimate goal is to only have 'vb' for each of the regular verbs (plus all needed forms for irregular verbs) and compute all the other forms when needed... But this is a great first step -- do you want to do a PR in ritajs to start?
Yes, I'll make the tests pass and create a PR.
great -- also needs to handle:
RiTa.analyze('concerned')
RiTa.analyze('concerns')
@KarlieZhao status?
The issue in this ticket should be fixed now. However, I think we can go ahead and try to remove the words with only vb* tags from the lexicon...
good - this will take some thought, so first come up with a plan... then we can discuss
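As one possible starting point for such a plan, the candidate words could first be listed rather than removed. The sketch below is an assumption-laden illustration: `verbOnlyNonBase` is a hypothetical helper, and the tag strings are what the listed entries would look like after the stray 'vb' tags are dropped, not values read from the current dict.

```javascript
// Assumed post-fix tags for a few entries from the earlier list
const dict = {
  "called": ["k-ao1-l-d", "vbn vbd"],
  "oversaw": ["ow1 v-er s-ao", "vbd"],
  "finished": ["f-ih1 n-ih-sh-t", "vbd jj vbn"],
  "enter": ["eh1-n t-er", "vb vbp"]
};

// Lists words whose tags are all vb* but do not include the base-form tag 'vb'
function verbOnlyNonBase(lexicon) {
  return Object.keys(lexicon).filter(word => {
    const tags = lexicon[word][1].split(" ");
    return tags.every(t => t.startsWith("vb")) && !tags.includes("vb");
  });
}

console.log(verbOnlyNonBase(dict)); // [ 'called', 'oversaw' ]
```

Note that 'oversaw' shows up as a candidate even though it is irregular, so a listing like this would still need a second pass to separate forms that can be computed from the base form from those that have to stay in the dict.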
For example, many incorrect verb forms in list for 'spreads':