finding sentence limits #48
out of the box, here is what pybo's preprocessor does. I think it covers some part of the behaviour you expect, but not everything...

```python
from pybo import Chunks

strs = ["གཅིག། གཉིས",
        "གཅིག། །གཉིས",
        "གཅིག༑ གཉིས",
        "གཅིག ༑ གཉིས",
        "༑ གཉིས",
        "གཅིག ༑གཉིས",
        "གཅིག\u0f14གཉིས",
        "གཅིག\u0f7f གཉིས",
        "གཅིག།། །། ༆ ། །གཉིས",
        "གཅིག༎ ༎༆ ༎གཉིས",
        "སྤྱི་ལོ་༢༠༡༧ ཟླ་༡ ཚེས་༡༤ ཉིན་ལ་བྲིས་པ་དགེ",
        "བཀྲིས་ ༼བཀྲ་ཤིས༽ ངའི་གྲོགས་པོ་རེད།",
        "ག གི གྲ ཀ ཤ པ མ",
        "༄༅། ཀ༌ཀོ་ཀཿཀ࿒ཀ་ཀ ཀ་རང་ཀ།་"]

for s in strs:
    c = Chunks(s)
    chunks = c.make_chunks()
    r = c.get_markers(chunks)
    readable = c.get_readable(chunks)
    print(r)
    print(readable)
    print()
```

output:

```
[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '། '), ('TEXT', 'གཉིས')]
[('TEXT', 0, 4), ('PUNCT', 4, 3), ('TEXT', 7, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '། །'), ('TEXT', 'གཉིས')]
[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༑ '), ('TEXT', 'གཉིས')]
[('TEXT', 0, 4), ('PUNCT', 4, 3), ('TEXT', 7, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', ' ༑ '), ('TEXT', 'གཉིས')]
[('PUNCT', 0, 2), ('TEXT', 2, 4)]
[('PUNCT', '༑ '), ('TEXT', 'གཉིས')]
[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', ' ༑'), ('TEXT', 'གཉིས')]
[('TEXT', 0, 4), ('PUNCT', 4, 1), ('TEXT', 5, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༔'), ('TEXT', 'གཉིས')]
[('TEXT', 0, 5), ('TEXT', 5, 5)]
[('TEXT', 'གཅིགཿ'), ('TEXT', ' གཉིས')]
[('TEXT', 0, 4), ('PUNCT', 4, 11), ('TEXT', 15, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '།། །། ༆ ། །'), ('TEXT', 'གཉིས')]
[('TEXT', 0, 4), ('PUNCT', 4, 6), ('TEXT', 10, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༎ ༎༆ ༎'), ('TEXT', 'གཉིས')]
[('TEXT', 0, 5), ('TEXT', 5, 3), ('NUM', 8, 5), ('TEXT', 13, 3), ('NUM', 16, 2), ('TEXT', 18, 4), ('NUM', 22, 3), ('TEXT', 25, 4), ('TEXT', 29, 2), ('TEXT', 31, 5), ('TEXT', 36, 2), ('TEXT', 38, 3)]
[('TEXT', 'སྤྱི་'), ('TEXT', 'ལོ་'), ('NUM', '༢༠༡༧ '), ('TEXT', 'ཟླ་'), ('NUM', '༡ '), ('TEXT', 'ཚེས་'), ('NUM', '༡༤ '), ('TEXT', 'ཉིན་'), ('TEXT', 'ལ་'), ('TEXT', 'བྲིས་'), ('TEXT', 'པ་'), ('TEXT', 'དགེ')]
[('TEXT', 0, 6), ('PUNCT', 6, 2), ('TEXT', 8, 4), ('TEXT', 12, 3), ('PUNCT', 15, 2), ('TEXT', 17, 4), ('TEXT', 21, 6), ('TEXT', 27, 3), ('TEXT', 30, 3), ('PUNCT', 33, 1)]
[('TEXT', 'བཀྲིས་'), ('PUNCT', ' ༼'), ('TEXT', 'བཀྲ་'), ('TEXT', 'ཤིས'), ('PUNCT', '༽ '), ('TEXT', 'ངའི་'), ('TEXT', 'གྲོགས་'), ('TEXT', 'པོ་'), ('TEXT', 'རེད'), ('PUNCT', '།')]
[('TEXT', 0, 15)]
[('TEXT', 'ག གི གྲ ཀ ཤ པ མ')]
[('PUNCT', 0, 4), ('TEXT', 4, 2), ('TEXT', 6, 3), ('TEXT', 9, 2), ('TEXT', 11, 1), ('SYM', 12, 1), ('TEXT', 13, 2), ('TEXT', 15, 4), ('TEXT', 19, 3), ('TEXT', 22, 1), ('PUNCT', 23, 2)]
[('PUNCT', '༄༅། '), ('TEXT', 'ཀ༌'), ('TEXT', 'ཀོ་'), ('TEXT', 'ཀཿ'), ('TEXT', 'ཀ'), ('SYM', '࿒'), ('TEXT', 'ཀ་'), ('TEXT', 'ཀ ཀ་'), ('TEXT', 'རང་'), ('TEXT', 'ཀ'), ('PUNCT', '།་')]
```
thanks! Indeed, if we take
for instance, ideally it would do something like
so that we could tag that a sentence break happens between tokens 2 and 3... I'm not sure what the API would be to tag the sentence limits... maybe some sort of tags parallel to the POS tags?
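For what it's worth, here is a minimal sketch of what such parallel tags could look like. Everything in it is a hypothetical illustration (the token list, the POS tags, the "E"/"-" boundary scheme, and the `sentences` helper), not an existing pybo API:

```python
# Hypothetical sketch: a list of sentence-boundary tags kept parallel to the
# POS tags, where "E" marks the last token of a sentence. None of this is
# pybo's actual API.

tokens = ["གཅིག", "།", "གཉིས"]      # tokens, as a tokenizer might emit them
pos_tags = ["NUM", "PUNCT", "NUM"]   # POS tags, one per token (illustrative)
sent_tags = ["-", "E", "-"]          # parallel tags: break between tokens 2 and 3

def sentences(tokens, sent_tags):
    """Group tokens into sentences using the parallel boundary tags."""
    out, current = [], []
    for tok, tag in zip(tokens, sent_tags):
        current.append(tok)
        if tag == "E":  # this token ends a sentence
            out.append(current)
            current = []
    if current:
        out.append(current)
    return out

print(sentences(tokens, sent_tags))  # [['གཅིག', '།'], ['གཉིས']]
```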
the cases of

Ideally, we would have something similar to the

Then, in a second pass over the list of tokens, we could use matchers (like in this test), providing a

The best would be that my code to split into sentences (not fully integrated yet) is run first, so we only process the punctuation that is near the end-of-sentence already found. That might be the best integration in pybo I can currently think of... Although it might sound overkill for what you are trying to achieve.
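A rough sketch of that two-pass idea, under the assumption that the first pass yields approximate end-of-sentence positions and the second pass only inspects punctuation near them; all names and placeholder rules here are my assumptions, not the sentence-splitting code being discussed:

```python
# Hypothetical two-pass pipeline; the functions and the placeholder rules are
# assumptions, not the code mentioned above.

def find_sentence_ends(tokens):
    """Pass 1: rough end-of-sentence token indexes (placeholder rule: a shad)."""
    return [i for i, t in enumerate(tokens) if t == "།"]

def refine_cut_points(tokens, rough_ends):
    """Pass 2: only examine punctuation near each rough end and choose the
    exact cut point (placeholder: cut right after the shad)."""
    return [i + 1 for i in rough_ends]

tokens = ["གཅིག", "།", "གཉིས", "།"]
cuts = refine_cut_points(tokens, find_sentence_ends(tokens))
print(cuts)  # [2, 4] -> sentences are tokens[0:2] and tokens[2:4]
```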
@ngawangtrinley says:

> in case of a yigo+shad+text, the shad will belong to the yigo, which in turn will belong to the text

Does that sound like what you have implemented?
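To make the rule concrete, here is a small sketch of how that attachment could be expressed over the chunk pairs shown earlier. It reflects my reading of the rule, not the actual Java implementation; `YIGO_CHARS` and `attach_punct` are made-up names:

```python
# Sketch of the attachment rule above: punctuation containing a yigo (head
# mark) attaches to the *following* text, while plain shad-like punctuation
# attaches to the *preceding* text. A chunk mixing a closing shad and a
# following yigo would first need to be split, which this sketch skips.

YIGO_CHARS = set("༄༅")  # simplified; the real set of head marks is larger

def attach_punct(chunks):
    """chunks: (kind, text) pairs like the get_readable() output above.
    Returns text spans with punctuation attached to one side."""
    spans, pending = [], ""
    for kind, text in chunks:
        if kind == "PUNCT":
            if YIGO_CHARS & set(text):
                pending += text    # yigo (with its shad) opens the next span
            elif spans:
                spans[-1] += text  # plain shad closes the previous span
            else:
                pending += text    # leading punctuation before any text
        else:
            spans.append(pending + text)
            pending = ""
    if pending:
        spans.append(pending)
    return spans

chunks = [('PUNCT', '༄༅། '), ('TEXT', 'ཀ་'), ('PUNCT', '།'), ('TEXT', 'ཁ་')]
print(attach_punct(chunks))  # ['༄༅། ཀ་།', 'ཁ་']
```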
in short, yes. I also consulted NT for the code and nagged him with many boring edge cases... I think what's implemented in Java is pretty good; unfortunately it's not very fresh in my mind...
ok, then it's perfect. When I have more time, I'll port your implementation to Python.
thanks a lot! Finding the utterances is the first step before segmentation in the workflow to create the ACTIB corpus (and probably others), so I think having a way to do that with Pybo will be helpful!
finding utterances is the same as tokenizing into sentences. This can't be done with the punctuation alone (correct me if I'm wrong). That is what I tried to do with the

As for the next function,

What your code seems to be doing is to identify where the punctuation belongs, either to the text preceding it or to the one following it. Am I right?
yes, properly finding utterances involves more than punctuation, and this script is a very good start; it's just that ideally it would be able to give the index of the exact right place (right of the shad, left of the yigo, etc.). Thanks a lot for considering it!
also, just a small detail for the sake of completeness in the sentencify script: some titles in the Tengyur end with
There is also another bug that @10zinten reported to me (still waiting for test data): the སྟེ་ ཏེ་ དེ་ particles often seem to be used in the middle of sentences. Gar Tsering and others complained that they had partial sentences preventing them from correctly tagging sentences in LightTag. On the other hand, removing these markers resulted in huge sentences, which they anyhow seemed to have preferred.
While it seems quite reasonable to cut on naro + shad, there are so many edge cases where the proper cut is difficult to find that it would be helpful to have some code doing that in pybo. I'm thinking of the various types of punctuation that appear at the beginning of sentences and not at the end, etc. I have some code that does it here, plus some tests here (only part of this code is interesting, namely `getAllBreakingCharsIndexes`), but it could certainly be improved (and better documented, mea culpa!). This could then be combined with some heuristics to find actual sentences, not just shunits.
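For readers following the thread, here is a minimal sketch of the naro + shad cut described above. It is an illustration only: the lookback window, the `naro_shad_cut_indexes` name, and the sample string are all my assumptions, not a port of the Java `getAllBreakingCharsIndexes`:

```python
# Minimal sketch of the "cut on naro + shad" heuristic: treat a shad (།) as a
# sentence break when the syllable just before it carries the naro vowel (ོ).
# Illustration only; it ignores the edge cases discussed above (yigo,
# sentence-initial punctuation, etc.).

NARO = "\u0f7c"  # TIBETAN VOWEL SIGN O (naro)
SHAD = "\u0f0d"  # TIBETAN MARK SHAD

def naro_shad_cut_indexes(text):
    """Return the index just after each shad whose preceding few
    characters contain a naro."""
    cuts = []
    for i, ch in enumerate(text):
        if ch == SHAD and NARO in text[max(0, i - 3):i]:
            cuts.append(i + 1)
    return cuts

s = "ཁོ་འགྲོ་བོ། གཉིས་པ་རེད།"
print(naro_shad_cut_indexes(s))  # [11]: only the first shad follows a naro
```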