Data FI sample #564
This looks like the schema for the .ana version of the corpus was used, as it expects s(entence), rather than text. For the rest, I hope @matyaskopp will be able to help. |
You are including the TEI versions of the component files in the TEI.ana root file:
I will fix it and let you know. #565 |
Hi @matyaskopp! I have now fixed the "easy" error cases in the FI sample data. Any ideas on how to proceed from here would be appreciated. |
You don't have
<extent>
<measure unit="words" quantity="0" xml:lang="fi">0 sanat</measure>
<measure unit="words" quantity="0" xml:lang="en">0 words</measure>
</extent>
in every TEI file.
https://github.com/clarin-eric/ParlaMint/actions/runs/4325365846/jobs/7551440647#step:4:28 You can set <idno subtype="handle" type="URI">http://hdl.handle.net/11356/XXXX</idno> in the sample; Tomaž will add the proper handle in the ParlaMint release.
https://github.com/clarin-eric/ParlaMint/actions/runs/4325365846/jobs/7551440647#step:4:29 An empty term is not allowed; you should translate it into fi:
<catDesc xml:lang="fi">
<term/>
</catDesc>
The forename shouldn't be empty:
<person xml:id="SDP">
<persName>
<surname>SDP</surname>
<forename/>
</persName>
</person>
BTW, this looks more like a political party than a person.
I can continue in the same way for the rest of the errors. |
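The checks listed above (a missing extent, an empty term, an empty forename) can be approximated locally before pushing. The following is only a minimal sketch using Python's standard library, not the ParlaMint validation scripts; the function names `find_empty_elements` and `has_extent` are hypothetical, and the sample string is a toy fragment built from the snippets quoted in this comment.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def find_empty_elements(xml_text, names=("term", "forename")):
    """Return (element, parent) local-name pairs for elements in `names`
    that have neither child elements nor non-whitespace text."""
    root = ET.fromstring(xml_text)
    hits = []
    for parent in root.iter():
        for child in parent:
            local = child.tag.split("}")[-1]  # drop the {namespace} part
            is_empty = len(child) == 0 and not (child.text or "").strip()
            if local in names and is_empty:
                hits.append((local, parent.tag.split("}")[-1]))
    return hits


def has_extent(xml_text):
    """True if the document contains a TEI <extent> element anywhere."""
    root = ET.fromstring(xml_text)
    return root.find(f".//{{{TEI_NS}}}extent") is not None


# Toy fragment (hypothetical wrapper around the snippets quoted above).
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <catDesc xml:lang="fi"><term/></catDesc>
  <person xml:id="SDP"><persName><surname>SDP</surname><forename/></persName></person>
</TEI>"""

print(find_empty_elements(sample))
print(has_extent(sample))
```

This only flags candidates; the official schemas and scripts in the repository remain the authority on what is valid.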
@matyaskopp, regarding the errors that we still have in the FI sample data:
|
@yoge1, @matyaskopp is ill, so I will try to answer:
The short answer is: yes, it is allowed in the "plain text" version of the corpus, but not in the linguistically annotated version (i.e. .ana), which you are validating (cf. also https://clarin-eric.github.io/ParlaMint/#sec-ana-markup). Here all text content of the transcription proper should be inside
E.g. you have
A bit longer answer: you point to the definition of seg in the TEI ODD generated schema. We do try to keep it as compatible as possible with the official schemas that are used for validation, which can be found in the Schema directory. But this is not always possible, and the ODD schema allows some constructions not allowed by the schemas in Schema/. The text content of
You have links like
but the contents of
|
OK, I have a fix for these now, but I think the error message is actually about
Here's the only occurrence of the string 'ParlaMint-FI_2015-05-22-ps-7.seg1.1' mentioned in the error message:
|
Actually, no, there is also:
which is what the error message was probably referring to. |
Thanks, you are right, of course! The "Strange pointer" issue is now fixed. And this caused new issues, which I'll look into next. |
…xt contents with period (.) but instead remove such w elements
I made some fixes to the sample data. No more errors in the local validation. |
It should be fixed now. You used a different namespace from the default one, and the script for conversion to conllu did not cover that. |
fix getting component files when bit xi prefix is used (related to #564)
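One way to make component-file lookup independent of the prefix is to match on the namespace URI rather than on the literal tag name. The sketch below is an illustration with Python's standard library, not the fix that was committed; `component_hrefs` is a hypothetical helper, and the component file name in the toy root files is illustrative only.

```python
import xml.etree.ElementTree as ET

XI_NS = "http://www.w3.org/2001/XInclude"


def component_hrefs(root_xml):
    """Collect href values of XInclude elements, whatever prefix is used."""
    root = ET.fromstring(root_xml)
    # ElementTree expands every tag to {namespace-uri}local-name, so the
    # same query matches <xi:include> and a default-namespace <include>.
    return [el.get("href") for el in root.iter(f"{{{XI_NS}}}include")]


# Two toy root files (illustrative file name) declaring XInclude differently.
with_prefix = (
    '<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="ParlaMint-FI_2015-05-22-ps-7.ana.xml"/>'
    "</teiCorpus>"
)
without_prefix = (
    '<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">'
    '<include xmlns="http://www.w3.org/2001/XInclude" '
    'href="ParlaMint-FI_2015-05-22-ps-7.ana.xml"/>'
    "</teiCorpus>"
)

print(component_hrefs(with_prefix))
print(component_hrefs(without_prefix))
```

Both calls return the same list, which is the property a prefix-sensitive script lacks.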
It seems that this is caused by some of the interruptions being marked as regular speeches (not as
@matyaskopp Do you happen to have an easy process (e.g. a ready-made script / one-liner) for finding such cases? |
I hacked the tei2text conversion so that it can be used for this:
make text.seg-FI
make text.seg.ana-FI
produces folder
meld Data/ParlaMint-FI/text.seg/{ParlaMint-FI_2015-05-22-ps-7.ana.txt,ParlaMint-FI_2015-05-22-ps-7.txt}
but I am suggesting starting with the |
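If a graphical diff tool is not at hand, the same comparison of the two segment-per-line exports can be sketched with Python's `difflib`. This is an assumption-laden illustration: it presumes each export has one segment per line, `differing_segments` is a hypothetical helper, and the sample lines are invented toy strings rather than corpus data.

```python
import difflib


def differing_segments(plain_lines, ana_lines):
    """Return (operation, plain_chunk, ana_chunk) for every region where
    the two line lists disagree; equal regions are skipped."""
    matcher = difflib.SequenceMatcher(a=plain_lines, b=ana_lines, autojunk=False)
    return [
        (tag, plain_lines[i1:i2], ana_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented toy data: the annotated side has one extra segment.
plain = ["Mr Speaker!", "Thank you."]
ana = ["Mr Speaker!", "(interjection)", "Thank you."]

for op, left, right in differing_segments(plain, ana):
    print(op, left, right)
```

With real files, the two lists would come from reading the `.txt` and `.ana.txt` exports line by line.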
…elements (regular speeches)
… org events for legislative periods instead of parliamentary sessions; component files: add meeting elements for term, session, meeting and sitting
…liation to party.PV
… and ana.xml content (don't create a seg for interruption in ana.xml to follow xml's practice)
…and ana.xml content (align the xml and ana.xml text contents by utilizing levenshtein distance; root cause: interruptions can be marked as utterances in xml, whereas in our linguistically processed data they are segments of interrupted speeches)
…s it's not part of the sample set
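The alignment commit above mentions Levenshtein distance. For readers unfamiliar with it, here is a minimal textbook sketch of the distance and of picking the closest candidate segment; it is not the code from the commit, `closest` is a hypothetical helper, and the example strings are invented.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def closest(segment, candidates):
    """Return the candidate with the smallest edit distance to segment."""
    return min(candidates, key=lambda c: levenshtein(segment, c))


print(levenshtein("kitten", "sitting"))
print(closest("Thank you", ["Thank you.", "Mr Speaker!"]))
```

A real alignment would also need a threshold for "no good match", which this sketch leaves out.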
There are still some issues with the data.
seg
, e.g.:
Data (Data/ParlaMint-FI/ParlaMint-FI_2015-05-26-ps-7.xml lines 99-101):
Validation error:
make add-common-content-FI
but it resulted in errors, at least due to the country code FI not being available in the file Scripts/parlamint-add-common-content.xsl:
FATAL : BAD COUNTRY!