
finding sentence limits #48

Open
eroux opened this issue Jun 27, 2019 · 11 comments

@eroux
Contributor

eroux commented Jun 27, 2019

While it seems quite reasonable to cut on naro + shad, there are so many edge cases where the proper cut is difficult to find that it would be helpful to have some code doing that in pybo. I'm thinking of the various types of punctuation that appear at the beginning of sentences rather than at the end, etc. I have some code that does it here, plus some tests here (only part of this code is interesting, namely getAllBreakingCharsIndexes), but it could certainly be improved (and better documented, mea culpa!). This could then be combined with some heuristics to find actual sentences, not just shad units.
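
For reference, a minimal Python sketch of the idea behind getAllBreakingCharsIndexes; the character set and the "consume trailing spaces" rule are simplified assumptions, not the actual Java logic:

    # Rough sketch: collect candidate break indexes after runs of shad-like
    # punctuation. The character set and the skipping rule are simplified
    # assumptions, not the actual Java implementation.
    BREAKING_CHARS = {"།", "༎", "༏", "༐", "༑", "༔"}

    def get_breaking_indexes(text):
        """Return the index just after each run of breaking punctuation."""
        indexes = []
        i = 0
        while i < len(text):
            if text[i] in BREAKING_CHARS:
                # consume the whole punctuation run plus trailing spaces
                while i < len(text) and (text[i] in BREAKING_CHARS or text[i] == " "):
                    i += 1
                indexes.append(i)
            else:
                i += 1
        return indexes

    print(get_breaking_indexes("གཅིག། གཉིས། གསུམ"))  # [6, 12]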

@drupchen
Collaborator

Out of the box, here is what pybo's preprocessor does.
In the first line of output, ('TEXT', 0, 4), 0 stands for the starting index of the chunk and 4 for the length of the chunk.

I think it covers some part of the behaviour you expect, but not everything...

    from pybo import Chunks

    strs = ["གཅིག། གཉིས",
            "གཅིག། །གཉིས",
            "གཅིག༑ གཉིས",
            "གཅིག ༑ གཉིས",
            "༑ གཉིས",
            "གཅིག ༑གཉིས",
            "གཅིག\u0f14གཉིས",
            "གཅིག\u0f7f གཉིས",
            "གཅིག།། །། ༆ ། །གཉིས",
            "གཅིག༎ ༎༆ ༎གཉིས",
            "སྤྱི་ལོ་༢༠༡༧ ཟླ་༡ ཚེས་༡༤ ཉིན་ལ་བྲིས་པ་དགེ",
            "བཀྲིས་ ༼བཀྲ་ཤིས༽ ངའི་གྲོགས་པོ་རེད།",
            "ག གི གྲ ཀ ཤ པ མ",
            "༄༅། ཀ༌ཀོ་ཀཿཀ࿒ཀ་ཀ ཀ་རང་ཀ།་"]
    for s in strs:
        c = Chunks(s)
        chunks = c.make_chunks()
        r = c.get_markers(chunks)            # (chunk type, start index, length)
        readable = c.get_readable(chunks)    # (chunk type, substring)
        print(r)
        print(readable)
        print()

output:

[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '། '), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 3), ('TEXT', 7, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '། །'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༑ '), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 3), ('TEXT', 7, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', ' ༑ '), ('TEXT', 'གཉིས')]

[('PUNCT', 0, 2), ('TEXT', 2, 4)]
[('PUNCT', '༑ '), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', ' ༑'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 1), ('TEXT', 5, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༔'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 5), ('TEXT', 5, 5)]
[('TEXT', 'གཅིགཿ'), ('TEXT', ' གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 11), ('TEXT', 15, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '།། །། ༆ ། །'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 6), ('TEXT', 10, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༎ ༎༆ ༎'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 5), ('TEXT', 5, 3), ('NUM', 8, 5), ('TEXT', 13, 3), ('NUM', 16, 2), ('TEXT', 18, 4), ('NUM', 22, 3), ('TEXT', 25, 4), ('TEXT', 29, 2), ('TEXT', 31, 5), ('TEXT', 36, 2), ('TEXT', 38, 3)]
[('TEXT', 'སྤྱི་'), ('TEXT', 'ལོ་'), ('NUM', '༢༠༡༧ '), ('TEXT', 'ཟླ་'), ('NUM', '༡ '), ('TEXT', 'ཚེས་'), ('NUM', '༡༤ '), ('TEXT', 'ཉིན་'), ('TEXT', 'ལ་'), ('TEXT', 'བྲིས་'), ('TEXT', 'པ་'), ('TEXT', 'དགེ')]

[('TEXT', 0, 6), ('PUNCT', 6, 2), ('TEXT', 8, 4), ('TEXT', 12, 3), ('PUNCT', 15, 2), ('TEXT', 17, 4), ('TEXT', 21, 6), ('TEXT', 27, 3), ('TEXT', 30, 3), ('PUNCT', 33, 1)]
[('TEXT', 'བཀྲིས་'), ('PUNCT', ' ༼'), ('TEXT', 'བཀྲ་'), ('TEXT', 'ཤིས'), ('PUNCT', '༽ '), ('TEXT', 'ངའི་'), ('TEXT', 'གྲོགས་'), ('TEXT', 'པོ་'), ('TEXT', 'རེད'), ('PUNCT', '།')]

[('TEXT', 0, 15)]
[('TEXT', 'ག གི གྲ ཀ ཤ པ མ')]

[('PUNCT', 0, 4), ('TEXT', 4, 2), ('TEXT', 6, 3), ('TEXT', 9, 2), ('TEXT', 11, 1), ('SYM', 12, 1), ('TEXT', 13, 2), ('TEXT', 15, 4), ('TEXT', 19, 3), ('TEXT', 22, 1), ('PUNCT', 23, 2)]
[('PUNCT', '༄༅། '), ('TEXT', 'ཀ༌'), ('TEXT', 'ཀོ་'), ('TEXT', 'ཀཿ'), ('TEXT', 'ཀ'), ('SYM', '࿒'), ('TEXT', 'ཀ་'), ('TEXT', 'ཀ ཀ་'), ('TEXT', 'རང་'), ('TEXT', 'ཀ'), ('PUNCT', '།་')]

@eroux
Contributor Author

eroux commented Jun 27, 2019

Thanks! Indeed, if we take

[('TEXT', 'གཅིག'), ('PUNCT', '།། །། ༆ ། །'), ('TEXT', 'གཉིས')]

for instance, ideally it would do something like

[('TEXT', 'གཅིག'), ('PUNCT', '།། །། '), ('PUNCT', '༆ ། །'), ('TEXT', 'གཉིས')]

so that we could mark that a sentence break happens between tokens 2 and 3... I'm not sure what the API for tagging sentence limits would be... maybe some sort of tags parallel to the POS tags?

@drupchen
Collaborator

drupchen commented Jun 27, 2019

The cases of ག གི གྲ ཀ ཤ པ མ are not handled yet (the issue is still open).
I think the best approach would be to work on a list of Token objects, so we can use all the info gathered during tokenization. Every Token object has a Token._ attribute containing an empty dict that can receive any custom data. That sounds like a good use case.

Ideally, we would have something similar to the SplitAffixed class (which splits affixed particles into a separate token) calling the TokenSplit class (see the file of the same name) at some point to split the punct tokens into two.

Then, in a second pass over the list of tokens, we could use matchers (like in this test), providing a match_query to identify where to insert the end-of-sentence info and a replace to add it to the right token.

Ideally, my code to split into sentences (not fully integrated yet) would run first, so we would only process the punctuation near the end-of-sentence marks already found.

That might be the best integration in pybo I can currently think of... although it might sound overkill for what you are trying to achieve.
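
To make that concrete, here is a rough sketch of the second pass, annotating the custom-data dict mentioned above; the Token class below is a minimal stand-in rather than pybo's actual class, and the end-of-sentence rule is a deliberately naive assumption:

    # Rough sketch of the second pass: walk the token list and mark the token
    # where a sentence ends via the custom-data dict. The Token class is a
    # minimal stand-in for pybo's Token; the rule is a simplified assumption.
    class Token:
        def __init__(self, text, pos):
            self.text = text
            self.pos = pos    # e.g. 'TEXT', 'PUNCT'
            self._ = {}       # free-form dict for custom annotations

    SENTENCE_FINAL = ("།", "༎")  # simplified assumption

    def mark_sentence_ends(tokens):
        for i, tok in enumerate(tokens):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # a punct token ending in a shad and not followed by more punct
            # is treated as closing the sentence
            if tok.pos == "PUNCT" and tok.text.rstrip(" ").endswith(SENTENCE_FINAL) \
                    and (nxt is None or nxt.pos != "PUNCT"):
                tok._["sentence_end"] = True
        return tokens

    tokens = [Token("གཅིག", "TEXT"), Token("། ", "PUNCT"), Token("གཉིས", "TEXT")]
    for t in mark_sentence_ends(tokens):
        print(t.text, t._)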

@drupchen
Collaborator

@ngawangtrinley says:

  • shad always belongs to what is on its left
  • all yigo types belong to the text on their right
  • separators like drulshad belong to what is on their right

In the case of yigo+shad+text, the shad will belong to the yigo, which in turn will belong to the text (see the sketch below).
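
A rough sketch of how those rules could decide which side a punctuation chunk attaches to; the character sets here are simplified assumptions, not what the Java code uses:

    # Rough sketch of the attachment rules above: shad attaches to the text on
    # its left; yigo and drulshad attach to the text on their right. The
    # character sets are simplified assumptions.
    SHAD = {"།", "༎", "༏", "༐", "༑"}
    YIGO = {"༄", "༅"}
    DRULSHAD = {"༈"}

    def attaches_left(punct_chunk):
        """True if the punctuation chunk belongs to the preceding text."""
        chars = [c for c in punct_chunk if c != " "]
        if not chars:
            return True
        # a chunk containing a yigo or drulshad belongs to the text on its
        # right, even if a shad follows the yigo (yigo+shad+text case)
        if any(c in YIGO or c in DRULSHAD for c in chars):
            return False
        return chars[0] in SHAD

    print(attaches_left("། "))      # plain shad -> left
    print(attaches_left("༄༅། "))    # yigo + shad -> right
    print(attaches_left("༈ "))      # drulshad -> right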

Does that sound like what you have implemented?

@eroux
Contributor Author

eroux commented Jun 27, 2019

In short, yes. I also consulted NT for the code and nagged him with many boring edge cases... I think what's implemented in Java is pretty good; unfortunately it's not very fresh in my mind...

@drupchen
Collaborator

OK, then it's perfect.

When I have more time, I'll port your implementation to Python.

@eroux
Contributor Author

eroux commented Jun 27, 2019

Thanks a lot! Finding the utterances is the first step before segmentation in the workflow used to create the ACTIB corpus (and probably others), so I think having a way to do that with pybo will be helpful!

@drupchen
Collaborator

drupchen commented Jun 27, 2019

Finding utterances is the same as tokenizing into sentences. This can't be done with punctuation alone (correct me if I'm wrong).

That is what I tried to do with the sentencify code that is here.
It first tokenizes, then uses heuristics to identify sentences. Punctuation plays a role in it, but many other things are involved.

As for the next function, paragraphify, it groups the sentences produced by sentencify until a certain size is reached.
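
For illustration, a minimal sketch of that size-based grouping; the names and the token-count threshold are hypothetical, not the actual paragraphify code:

    # Minimal sketch of the grouping idea: accumulate sentences until a size
    # threshold is reached, then start a new paragraph. The threshold and the
    # names are hypothetical, not the actual paragraphify implementation.
    def paragraphify(sentences, max_tokens=50):
        paragraphs, current, size = [], [], 0
        for sent in sentences:        # each sentence is a list of tokens
            current.append(sent)
            size += len(sent)
            if size >= max_tokens:
                paragraphs.append(current)
                current, size = [], 0
        if current:
            paragraphs.append(current)
        return paragraphs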

What your code seems to do is identify where the punctuation belongs, either to the text preceding it or to the text following it. Am I right?

@eroux
Contributor Author

eroux commented Jun 27, 2019

Yes, properly finding utterances involves more than punctuation, and this script is a very good start; it's just that ideally it would give the index of the exact right place (right of the shad, left of the yigo, etc.). Thanks a lot for considering it!

@eroux
Contributor Author

eroux commented Jun 28, 2019

Also, just a small detail for the sake of completeness: in the sentencify script, some titles in the Tengyur end with བཞུགས (without the སོ), and perhaps other sentences do too. Maybe it could be added to the ending_word list?

@drupchen
Collaborator

drupchen commented Jul 1, 2019

There is also another bug that @10zinten reported to me (still waiting for test data): the སྟེ་ ཏེ་ དེ་ particles often seem to be used in the middle of sentences. Gar Tsering and others complained that they ended up with partial sentences, preventing them from correctly tagging sentences in LightTag. On the other hand, removing these markers resulted in huge sentences, which they seemed to prefer anyway.
