Errors from earlier word segmentation runs that need updating #195

Open
cmroughan opened this issue Feb 22, 2023 · 3 comments
@cmroughan
Collaborator

cmroughan commented Feb 22, 2023

It appears that several XML files went through earlier passes of the word segmentation workflow and still contain transcription_segmented divs that are not encoded to the current standard.

For example, jeru0183:
<orig xml:id="jeru0183-7" xml:lang="arc"><foreign xml:lang="grc"><unclear>Κ</unclear></foreign><g ref="interpunct">·</g><foreign xml:lang="grc"><unclear>Ν</unclear>ΙΦ</foreign></orig>

Each foreign tag should be removed (keeping its content), and its xml:lang value should replace the xml:lang value on the enclosing orig tag.
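A minimal Python sketch of that fix, using only the standard library, might look like the following. It is not the project's actual workflow code. The names `unwrap_foreign`, `local_name`, and `SAMPLE` are hypothetical, and the sketch assumes that an orig element is safe to rewrite only when all of its foreign children share one xml:lang value and the orig has no text of its own outside those children; anything else is left untouched for manual review.

```python
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The jeru0183 example from this issue.
SAMPLE = (
    '<orig xml:id="jeru0183-7" xml:lang="arc">'
    '<foreign xml:lang="grc"><unclear>Κ</unclear></foreign>'
    '<g ref="interpunct">·</g>'
    '<foreign xml:lang="grc"><unclear>Ν</unclear>ΙΦ</foreign>'
    '</orig>'
)


def local_name(tag):
    """Drop a '{namespace}' prefix so namespaced files match too."""
    return tag.rsplit("}", 1)[-1]


def unwrap_foreign(orig):
    """Unwrap <foreign> children of one <orig> and move their xml:lang up.

    Returns True if the element was rewritten, False if it was skipped.
    """
    foreigns = [c for c in orig if local_name(c.tag) == "foreign"]
    langs = {f.get(XML_LANG) for f in foreigns}
    if len(langs) != 1 or None in langs:
        return False  # no <foreign>, or missing/conflicting languages
    # Assumption: skip words that also carry text outside <foreign>.
    loose_text = [orig.text] + [c.tail for c in orig]
    if any(t and t.strip() for t in loose_text):
        return False

    kept = []                # children of the rebuilt <orig>
    head = orig.text or ""   # text that precedes the first kept child

    def add_text(text):
        nonlocal head
        if not text:
            return
        if kept:
            kept[-1].tail = (kept[-1].tail or "") + text
        else:
            head += text

    for child in list(orig):
        if local_name(child.tag) == "foreign":
            add_text(child.text)   # text before the first grandchild
            kept.extend(child)     # grandchildren keep their own tails
            add_text(child.tail)   # text that followed </foreign>
        else:
            kept.append(child)

    for child in list(orig):
        orig.remove(child)
    orig.text = head or None
    orig.extend(kept)
    orig.set(XML_LANG, langs.pop())
    return True


orig = ET.fromstring(SAMPLE)
changed = unwrap_foreign(orig)
fixed = ET.tostring(orig, encoding="unicode")
```

On the sample, the rebuilt orig keeps the unclear, g, and unclear children in order and carries xml:lang="grc". Applying this across a corpus would still need a loop over every orig inside each transcription_segmented div, which is not shown here.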

To do: check other transcription_segmented divs for issues like this and push an update for these.

@cmroughan cmroughan self-assigned this Feb 22, 2023
@cmroughan
Collaborator Author

Running through some validation of the existing output from past word segmentation workflows. Will add to this thread as issues that should be resolved come up.

Another error: the occurrences column in the parsed language wordlists references some XML files that no longer seem to exist:

caes0412.xml
halu0001.xml
jeru0237.xml
hmti0003.xml
seph0100.xml
gers0001.xml
dora0002.xml
hmti0005.xml
jeru0196.xml
masa0038.xml
masa0039.xml
rehn0001.xml
hmti0004.xml
masa0037.xml
jent0006.xml
jeru0305.xml
anri0001.xml
knah0002.xml
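
A check like this one could be scripted roughly as follows. This is a sketch under stated assumptions, not the validation code actually used: the function name `missing_occurrences` is hypothetical, the exact format of the occurrences column is not given in this thread, and the sketch simply assumes each cell contains file names of the form four lowercase letters, four digits, and '.xml', as in the list above.

```python
import re
from pathlib import Path

# Assumed file-name pattern, based on the names listed in this issue.
XML_NAME = re.compile(r"[a-z]{4}\d{4}\.xml")


def missing_occurrences(occurrence_cells, xml_dir):
    """Return sorted file names that are referenced but absent from xml_dir.

    occurrence_cells: iterable of strings taken from the occurrences column.
    xml_dir: directory holding the XML files (path is supplied by the caller).
    """
    existing = {p.name for p in Path(xml_dir).glob("*.xml")}
    referenced = set()
    for cell in occurrence_cells:
        referenced.update(XML_NAME.findall(cell))
    return sorted(referenced - existing)
```

The caller would read the occurrences column from each wordlist and pass the cell values in along with the directory of XML files.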

@cmroughan
Collaborator Author

There are also multiple cases where '.xml' is erroneously included in the wordID produced by this workflow. A non-exhaustive sample:

jeru0492.xml-1
jord0001.xml-229
masa0469.xml-2
jaff0054.xml-1
jord0001.xml-475
jord0001.xml-20
huqo0001.xml-10
jeru0492.xml-4
jeru0501.xml-1
beth0244.xml-5
beth0242.xml-3
masa0529.xml-1
qumr0001.xml-14
beth0243.xml-7
masa0416.xml-2
jord0001.xml-552
masa0493.xml-1
erra0001.xml-3
jeru0357.xml-2
qumr0001.xml-18
jord0001.xml-261
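
A small normalizer for these IDs could look like the sketch below. The helper name `fix_word_id` is hypothetical, and the target form (file stem, hyphen, number) is inferred from the xml:id="jeru0183-7" in the example at the top of this issue.

```python
import re

# Matches IDs such as 'jord0001.xml-229': a stem, a stray '.xml', then '-<number>'.
BAD_WORD_ID = re.compile(r"^(?P<stem>.+)\.xml-(?P<num>\d+)$")


def fix_word_id(word_id):
    """Drop the stray '.xml' from a wordID; return well-formed IDs unchanged."""
    match = BAD_WORD_ID.match(word_id)
    if match is None:
        return word_id
    return f"{match['stem']}-{match['num']}"
```

For instance, fix_word_id("jord0001.xml-229") gives "jord0001-229", while an ID that is already in the expected form passes through as-is.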

@zeichman
Collaborator

Running through some validation of the existing output from past word segmentation workflows. Will add to this thread as issues that should be resolved come up.

Another error: the occurrences column in the parsed language wordlists references some XML files that no longer seem to exist:

caes0412.xml
halu0001.xml
jeru0237.xml
hmti0003.xml
seph0100.xml
gers0001.xml
dora0002.xml
hmti0005.xml
jeru0196.xml
masa0038.xml
masa0039.xml
rehn0001.xml
hmti0004.xml
masa0037.xml
jent0006.xml
jeru0305.xml
anri0001.xml
knah0002.xml

I believe most of these files are missing because they were redundant and were therefore deleted or combined with another file, though I would need to check each case.


2 participants