Errors from earlier word segmentation runs that need updating #195

cmroughan · 2023-02-22T18:10:45Z

It appears that several XML files went through earlier passes of the word segmentation workflow and still contain transcription_segmented divs that are not encoded to the current standard.

For example, jeru0183:
<orig xml:id="jeru0183-7" xml:lang="arc"><foreign xml:lang="grc"><unclear>Κ</unclear></foreign><g ref="interpunct">·</g><foreign xml:lang="grc"><unclear>Ν</unclear>ΙΦ</foreign></orig>

The foreign tag should be removed and its xml:lang attribute moved to replace the xml:lang attribute in the enclosing orig tag.
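
A minimal sketch of that fix, using Python's standard `xml.etree.ElementTree`. The function name `unwrap_foreign` and the helper `local_name` are hypothetical, not part of the existing workflow. The sketch assumes that only direct `foreign` children of `orig` need unwrapping, and it leaves the `xml:lang` on `orig` untouched when the `foreign` children disagree on language, so those cases can be reviewed by hand.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local_name(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def unwrap_foreign(orig):
    """Unwrap direct <foreign> children of one <orig> element in place.

    Returns the set of xml:lang values found on the removed <foreign> tags.
    The language is moved onto <orig> only when exactly one value was found.
    """
    langs = set()
    kept = []  # children of <orig> after unwrapping

    def append_text(text):
        # Loose text attaches to the previous sibling's tail, or to orig.text.
        if not text:
            return
        if kept:
            kept[-1].tail = (kept[-1].tail or "") + text
        else:
            orig.text = (orig.text or "") + text

    for child in list(orig):
        if local_name(child.tag) == "foreign":
            langs.add(child.get(XML_LANG))
            append_text(child.text)
            kept.extend(list(child))  # grandchildren keep their own tails
            append_text(child.tail)
        else:
            kept.append(child)

    orig[:] = kept
    langs.discard(None)
    if len(langs) == 1:
        orig.set(XML_LANG, next(iter(langs)))
    return langs


# The jeru0183 example quoted above.
SAMPLE = (
    '<orig xml:id="jeru0183-7" xml:lang="arc">'
    '<foreign xml:lang="grc"><unclear>Κ</unclear></foreign>'
    '<g ref="interpunct">·</g>'
    '<foreign xml:lang="grc"><unclear>Ν</unclear>ΙΦ</foreign>'
    "</orig>"
)
fixed = ET.fromstring(SAMPLE)
found_langs = unwrap_foreign(fixed)
```

After the call, `fixed` has no `foreign` children, its `xml:lang` is `grc`, and the text content is unchanged.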

To do: check other transcription_segmented divs for issues like this and push an update for these.
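
One way to approach this check is sketched below. It assumes the segmented text lives in a `div` whose `type` attribute is `transcription_segmented` (an assumption about the encoding, not confirmed in this thread), and it reports the `xml:id` of every `orig` that still contains a `foreign` element. The name `find_nested_foreign` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def find_nested_foreign(root, div_type="transcription_segmented"):
    """List xml:ids of <orig> elements that contain a <foreign> descendant.

    Only <orig> elements inside a <div type="..."> matching div_type are
    checked. The div_type default is an assumed attribute value.
    """
    flagged = []
    for div in root.iter():
        if local_name(div.tag) != "div" or div.get("type") != div_type:
            continue
        for orig in div.iter():
            if local_name(orig.tag) != "orig":
                continue
            if any(local_name(d.tag) == "foreign" for d in orig.iter()):
                flagged.append(orig.get(XML_ID))
    return flagged
```

To scan a whole directory, the function could be applied to `ET.parse(path).getroot()` for each XML file, collecting the non-empty results.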

cmroughan · 2023-03-14T21:22:52Z

Running through some validation of the existing output from past word segmentation workflows. Will add to this thread as issues that should be resolved come up.

Another error: the occurrences column in the parsed language wordlists references some XML files that no longer seem to exist:

caes0412.xml
halu0001.xml
jeru0237.xml
hmti0003.xml
seph0100.xml
gers0001.xml
dora0002.xml
hmti0005.xml
jeru0196.xml
masa0038.xml
masa0039.xml
rehn0001.xml
hmti0004.xml
masa0037.xml
jent0006.xml
jeru0305.xml
anri0001.xml
knah0002.xml
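
A list like the one above can be regenerated with a small check such as the following sketch. It assumes the referenced file names have already been extracted from the occurrences column (the wordlist format is not described in this thread) and that the XML files sit in a single directory. The name `missing_files` is hypothetical.

```python
from pathlib import Path


def missing_files(referenced, xml_dir):
    """Return the referenced file names that are not present in xml_dir."""
    existing = {p.name for p in Path(xml_dir).glob("*.xml")}
    return sorted(set(referenced) - existing)
```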

cmroughan · 2023-03-14T21:26:10Z

There are also multiple cases where '.xml' is erroneously included in the wordID produced by this workflow. A non-exhaustive sample:

jeru0492.xml-1
jord0001.xml-229
masa0469.xml-2
jaff0054.xml-1
jord0001.xml-475
jord0001.xml-20
huqo0001.xml-10
jeru0492.xml-4
jeru0501.xml-1
beth0244.xml-5
beth0242.xml-3
masa0529.xml-1
qumr0001.xml-14
beth0243.xml-7
masa0416.xml-2
jord0001.xml-552
masa0493.xml-1
erra0001.xml-3
jeru0357.xml-2
qumr0001.xml-18
jord0001.xml-261
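
A minimal sketch of a cleanup for these IDs, assuming the intended form is `<file stem>-<number>` as in `jeru0183-7` from the example at the top of this issue. The name `normalize_word_id` is hypothetical, and IDs that do not match the faulty pattern are returned unchanged.

```python
import re

# Matches IDs such as "jord0001.xml-229": file stem, ".xml", "-", number.
_FAULTY_ID = re.compile(r"^(?P<stem>[^.\s]+)\.xml-(?P<index>\d+)$")


def normalize_word_id(word_id):
    """Drop the stray '.xml' from a wordID; leave well-formed IDs as they are."""
    match = _FAULTY_ID.match(word_id)
    if match is None:
        return word_id
    return f"{match['stem']}-{match['index']}"
```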

zeichman · 2023-03-14T21:42:48Z

The XML files listed above as missing are, I believe, mostly gone because they were redundant and so were deleted or combined with another file, though I would need to check each case.

cmroughan self-assigned this Feb 22, 2023
