
finding sentence limits #48

Open
eroux opened this issue Jun 27, 2019 · 11 comments

@eroux
Contributor

eroux commented Jun 27, 2019

While it seems quite reasonable to cut on naro + shad, there are so many edge cases where the proper cut is difficult to find that it would be helpful to have some code doing that in pybo. I'm thinking of the various types of punctuation that appear at the beginning of sentences rather than at the end, etc. I have some code that does it here, plus some tests here (only part of this code is interesting, namely getAllBreakingCharsIndexes), but it could certainly be improved (and better documented, mea culpa!). This could then be combined with some heuristics to find actual sentences, not just shad units.
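
For reference, a minimal Python sketch of the idea behind getAllBreakingCharsIndexes; the character set and the "consume trailing spaces" rule are simplified assumptions, not the actual Java logic:

    # Rough sketch: collect candidate break indexes after runs of shad-like
    # punctuation. The character set and the skipping rule are simplified
    # assumptions, not the actual Java implementation.
    BREAKING_CHARS = {"།", "༎", "༏", "༐", "༑", "༔"}

    def get_breaking_indexes(text):
        """Return the index just after each run of breaking punctuation."""
        indexes = []
        i = 0
        while i < len(text):
            if text[i] in BREAKING_CHARS:
                # consume the whole punctuation run plus trailing spaces
                while i < len(text) and (text[i] in BREAKING_CHARS or text[i] == " "):
                    i += 1
                indexes.append(i)
            else:
                i += 1
        return indexes

    print(get_breaking_indexes("གཅིག། གཉིས། གསུམ"))  # [6, 12]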

@drupchen
Collaborator

Out of the box, here is what pybo's preprocessor does.
In the first line of output, ('TEXT', 0, 4), 0 stands for the starting index of the chunk and 4 for the length of the chunk.

I think it covers some part of the behaviour you expect, but not everything...

    from pybo import Chunks

    strs = ["གཅིག། གཉིས",
            "གཅིག། །གཉིས",
            "གཅིག༑ གཉིས",
            "གཅིག ༑ གཉིས",
            "༑ གཉིས",
            "གཅིག ༑གཉིས",
            "གཅིག\u0f14གཉིས",
            "གཅིག\u0f7f གཉིས",
            "གཅིག།། །། ༆ ། །གཉིས",
            "གཅིག༎ ༎༆ ༎གཉིས",
            "སྤྱི་ལོ་༢༠༡༧ ཟླ་༡ ཚེས་༡༤ ཉིན་ལ་བྲིས་པ་དགེ",
            "བཀྲིས་ ༼བཀྲ་ཤིས༽ ངའི་གྲོགས་པོ་རེད།",
            "ག གི གྲ ཀ ཤ པ མ",
            "༄༅། ཀ༌ཀོ་ཀཿཀ࿒ཀ་ཀ ཀ་རང་ཀ།་"]
    for s in strs:
        c = Chunks(s)
        chunks = c.make_chunks()
        r = c.get_markers(chunks)            # (chunk type, start index, length)
        readable = c.get_readable(chunks)    # (chunk type, substring)
        print(r)
        print(readable)
        print()

output:

[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '། '), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 3), ('TEXT', 7, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '། །'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༑ '), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 3), ('TEXT', 7, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', ' ༑ '), ('TEXT', 'གཉིས')]

[('PUNCT', 0, 2), ('TEXT', 2, 4)]
[('PUNCT', '༑ '), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 2), ('TEXT', 6, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', ' ༑'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 1), ('TEXT', 5, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༔'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 5), ('TEXT', 5, 5)]
[('TEXT', 'གཅིགཿ'), ('TEXT', ' གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 11), ('TEXT', 15, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '།། །། ༆ ། །'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 4), ('PUNCT', 4, 6), ('TEXT', 10, 4)]
[('TEXT', 'གཅིག'), ('PUNCT', '༎ ༎༆ ༎'), ('TEXT', 'གཉིས')]

[('TEXT', 0, 5), ('TEXT', 5, 3), ('NUM', 8, 5), ('TEXT', 13, 3), ('NUM', 16, 2), ('TEXT', 18, 4), ('NUM', 22, 3), ('TEXT', 25, 4), ('TEXT', 29, 2), ('TEXT', 31, 5), ('TEXT', 36, 2), ('TEXT', 38, 3)]
[('TEXT', 'སྤྱི་'), ('TEXT', 'ལོ་'), ('NUM', '༢༠༡༧ '), ('TEXT', 'ཟླ་'), ('NUM', '༡ '), ('TEXT', 'ཚེས་'), ('NUM', '༡༤ '), ('TEXT', 'ཉིན་'), ('TEXT', 'ལ་'), ('TEXT', 'བྲིས་'), ('TEXT', 'པ་'), ('TEXT', 'དགེ')]

[('TEXT', 0, 6), ('PUNCT', 6, 2), ('TEXT', 8, 4), ('TEXT', 12, 3), ('PUNCT', 15, 2), ('TEXT', 17, 4), ('TEXT', 21, 6), ('TEXT', 27, 3), ('TEXT', 30, 3), ('PUNCT', 33, 1)]
[('TEXT', 'བཀྲིས་'), ('PUNCT', ' ༼'), ('TEXT', 'བཀྲ་'), ('TEXT', 'ཤིས'), ('PUNCT', '༽ '), ('TEXT', 'ངའི་'), ('TEXT', 'གྲོགས་'), ('TEXT', 'པོ་'), ('TEXT', 'རེད'), ('PUNCT', '།')]

[('TEXT', 0, 15)]
[('TEXT', 'ག གི གྲ ཀ ཤ པ མ')]

[('PUNCT', 0, 4), ('TEXT', 4, 2), ('TEXT', 6, 3), ('TEXT', 9, 2), ('TEXT', 11, 1), ('SYM', 12, 1), ('TEXT', 13, 2), ('TEXT', 15, 4), ('TEXT', 19, 3), ('TEXT', 22, 1), ('PUNCT', 23, 2)]
[('PUNCT', '༄༅། '), ('TEXT', 'ཀ༌'), ('TEXT', 'ཀོ་'), ('TEXT', 'ཀཿ'), ('TEXT', 'ཀ'), ('SYM', '࿒'), ('TEXT', 'ཀ་'), ('TEXT', 'ཀ ཀ་'), ('TEXT', 'རང་'), ('TEXT', 'ཀ'), ('PUNCT', '།་')]

@eroux
Contributor Author

eroux commented Jun 27, 2019

Thanks! Indeed, if we take

[('TEXT', 'གཅིག'), ('PUNCT', '།། །། ༆ ། །'), ('TEXT', 'གཉིས')]

for instance, ideally it would do something like

[('TEXT', 'གཅིག'), ('PUNCT', '།། །། '), ('PUNCT', '༆ ། །'), ('TEXT', 'གཉིས')]

so that we could mark that a sentence break happens between tokens 2 and 3... I'm not sure what the API for tagging sentence limits would be... maybe some sort of tags parallel to the POS tags?

@drupchen
Collaborator

drupchen commented Jun 27, 2019

The cases of ག གི གྲ ཀ ཤ པ མ are not handled yet (the issue is still open).
I think the best approach would be to work on a list of Token objects, so we can use all the info gathered during tokenization. Every Token object has a Token._ attribute containing an empty dict that can receive any custom data. That sounds like a good use case.

Ideally, we would have something similar to the SplitAffixed class (which splits affixed particles into a separate token) calling the TokenSplit class (see the file of the same name) at some point to split the punct tokens into two.

Then, in a second pass over the list of tokens, we could use matchers (like in this test), providing a match_query to identify where to insert the end-of-sentence info and a replace to add it to the right token.

Ideally, my code to split into sentences (not fully integrated yet) would run first, so we would only process the punctuation near the end-of-sentence marks already found.

That might be the best integration in pybo I can currently think of... although it might sound overkill for what you are trying to achieve.
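
To make that concrete, here is a rough sketch of the second pass, annotating the custom-data dict mentioned above; the Token class below is a minimal stand-in rather than pybo's actual class, and the end-of-sentence rule is a deliberately naive assumption:

    # Rough sketch of the second pass: walk the token list and mark the token
    # where a sentence ends via the custom-data dict. The Token class is a
    # minimal stand-in for pybo's Token; the rule is a simplified assumption.
    class Token:
        def __init__(self, text, pos):
            self.text = text
            self.pos = pos    # e.g. 'TEXT', 'PUNCT'
            self._ = {}       # free-form dict for custom annotations

    SENTENCE_FINAL = ("།", "༎")  # simplified assumption

    def mark_sentence_ends(tokens):
        for i, tok in enumerate(tokens):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # a punct token ending in a shad and not followed by more punct
            # is treated as closing the sentence
            if tok.pos == "PUNCT" and tok.text.rstrip(" ").endswith(SENTENCE_FINAL) \
                    and (nxt is None or nxt.pos != "PUNCT"):
                tok._["sentence_end"] = True
        return tokens

    tokens = [Token("གཅིག", "TEXT"), Token("། ", "PUNCT"), Token("གཉིས", "TEXT")]
    for t in mark_sentence_ends(tokens):
        print(t.text, t._)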

@drupchen
Collaborator

@ngawangtrinley says:

  • shad always belongs to what is on its left
  • all yigo types belong to the text on their right
  • separators like drulshad belong to what is on their right

In the case of yigo+shad+text, the shad will belong to the yigo, which in turn will belong to the text (see the sketch below).
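
A rough sketch of how those rules could decide which side a punctuation chunk attaches to; the character sets here are simplified assumptions, not what the Java code uses:

    # Rough sketch of the attachment rules above: shad attaches to the text on
    # its left; yigo and drulshad attach to the text on their right. The
    # character sets are simplified assumptions.
    SHAD = {"།", "༎", "༏", "༐", "༑"}
    YIGO = {"༄", "༅"}
    DRULSHAD = {"༈"}

    def attaches_left(punct_chunk):
        """True if the punctuation chunk belongs to the preceding text."""
        chars = [c for c in punct_chunk if c != " "]
        if not chars:
            return True
        # a chunk containing a yigo or drulshad belongs to the text on its
        # right, even if a shad follows the yigo (yigo+shad+text case)
        if any(c in YIGO or c in DRULSHAD for c in chars):
            return False
        return chars[0] in SHAD

    print(attaches_left("། "))      # plain shad -> left
    print(attaches_left("༄༅། "))    # yigo + shad -> right
    print(attaches_left("༈ "))      # drulshad -> right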

Does that sound like what you have implemented?

@eroux
Contributor Author

eroux commented Jun 27, 2019

In short, yes. I also consulted NT for the code and nagged him with many boring edge cases... I think what's implemented in Java is pretty good; unfortunately it's not very fresh in my mind...

@drupchen
Collaborator

OK, then it's perfect.

When I have more time, I'll port your implementation to Python.

@eroux
Contributor Author

eroux commented Jun 27, 2019

Thanks a lot! Finding the utterances is the first step before segmentation in the workflow used to create the ACTIB corpus (and probably others), so I think having a way to do that with pybo will be helpful!

@drupchen
Collaborator

drupchen commented Jun 27, 2019

Finding utterances is the same as tokenizing into sentences. This can't be done with punctuation alone (correct me if I'm wrong).

That is what I tried to do with the sentencify code that is here.
It first tokenizes, then uses heuristics to identify sentences. Punctuation plays a role in it, but many other things are involved.

As for the next function, paragraphify, it groups the sentences produced by sentencify until a certain size is reached.
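
For illustration, a minimal sketch of that size-based grouping; the names and the token-count threshold are hypothetical, not the actual paragraphify code:

    # Minimal sketch of the grouping idea: accumulate sentences until a size
    # threshold is reached, then start a new paragraph. The threshold and the
    # names are hypothetical, not the actual paragraphify implementation.
    def paragraphify(sentences, max_tokens=50):
        paragraphs, current, size = [], [], 0
        for sent in sentences:        # each sentence is a list of tokens
            current.append(sent)
            size += len(sent)
            if size >= max_tokens:
                paragraphs.append(current)
                current, size = [], 0
        if current:
            paragraphs.append(current)
        return paragraphs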

What your code seems to do is identify where the punctuation belongs, either to the text preceding it or to the text following it. Am I right?

@eroux
Contributor Author

eroux commented Jun 27, 2019

Yes, properly finding utterances involves more than punctuation, and this script is a very good start; it's just that ideally it would give the index of the exact right place (right of the shad, left of the yigo, etc.). Thanks a lot for considering it!

@eroux
Contributor Author

eroux commented Jun 28, 2019

Also, just a small detail for the sake of completeness: in the sentencify script, some titles in the Tengyur end with བཞུགས (without the སོ), and perhaps other sentences do too. Maybe it could be added to the ending_word list?

@drupchen
Collaborator

drupchen commented Jul 1, 2019

There is also another bug that @10zinten reported to me (still waiting for test data): the སྟེ་ ཏེ་ དེ་ particles often seem to be used in the middle of sentences. Gar Tsering and others complained that they ended up with partial sentences, preventing them from correctly tagging sentences in LightTag. On the other hand, removing these markers resulted in huge sentences, which they seemed to prefer anyway.
