From 638589f065f00c365c1a9f8e0f540f48a91d2635 Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Mon, 5 Feb 2024 17:06:08 +0100 Subject: [PATCH] corrected annotations and added to software list --- data/JTEI/10_2016-19/jtei-10-haaf-source.xml | 2 +- ...jtei-cc-ra-hannessschlaeger-164-source.xml | 7 ++-- .../jtei-cc-ra-parisse-182-source.xml | 42 +++++++++---------- .../jtei-cc-ra-wittern-189-source.xml | 2 +- .../jtei-cc-ra-mylonas-202-source.xml | 2 +- data/JTEI/7_2014/jtei-7-dee-source.xml | 4 +- .../8_2014-15/jtei-8-boschetti-source.xml | 6 +-- data/JTEI/8_2014-15/jtei-8-iglesia-source.xml | 12 +++--- .../JTEI/8_2014-15/jtei-8-rosselli-source.xml | 14 +++---- .../JTEI/9_2016-17/jtei-9-armaselu-source.xml | 2 +- data/JTEI/9_2016-17/jtei-9-turska-source.xml | 2 +- .../jtei-vagionakis-204-source.xml | 4 +- .../rolling_2022/jtei-mitiku-212-source.xml | 32 +++++++------- taxonomy/software-list.xml | 15 +++++++ 14 files changed, 80 insertions(+), 66 deletions(-) diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml index 0d66a6c9..c62b4c2b 100644 --- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml @@ -212,7 +212,7 @@ well as collaborative text correction and annotationSee <ptr type="software" xml:id="R3" - target="#dtaq"/><rs type="soft.name" ref="R3">DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv</rs> + target="#dtaq"/>DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv (Collaborative Quality Assurance within the DTA), accessed January 28, 2017, . On the process of quality assurance in the DTA, see, for example, Haaf, diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml index f157b0df..6fa4f312 100644 --- a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml +++ b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml @@ -548,10 +548,9 @@ type="soft.name" ref="#R3">GitHub).

But the story did not end there. The freely available and processable collection of abstracts inspired Peter Andorfer, a colleague of the editors at the Austrian Centre for - Digital Humanities, to use this text collection to built an eXistdb-powered web - application (Andorfer and Hannesschläger - 2017). In the context of licensing issues, it is important to mention that + Digital Humanities, to use this text collection to built an eXistdb-powered web + application (Andorfer and Hannesschläger + 2017). In the context of licensing issues, it is important to mention that Andorfer was never approached by the editors or explicitly asked to process the TEI files, and he only informed the editors about the web application that he was building when it was already available online (as a work in progress, but diff --git a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml index d0cd2c31..0e723e25 100644 --- a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml +++ b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml @@ -115,7 +115,7 @@ format. Backward conversion is possible in many cases, with limitations inherent in the destination target format. TEICORPO can run the + target="#treetagger"/> Treetagger part-of-speech tagger and the Stanford CoreNLP tools on TEI files and can export @@ -231,15 +231,15 @@

Similarities with and Differences from Other Approaches

Many software packages dedicated to editing spoken language transcription contain - utilities that can convert many formats: for example, EXMARaLDA (Schmidt 2004 ; see ), - + Anvil ( Kipp 2001; see ), and ELAN + type="software" xml:id="R17" target="#elan"/>ELAN (Wittenburg et al. 2006; see ). However, in all cases, the @@ -257,7 +257,7 @@

The list of tools that are considered in the two projects is nearly the same. The only tools missing in the TEICORPO approach are EXMARaLDA and + xml:id="R19" target="#exmaralda"/>EXMARaLDA and FOLKER (Schmidt and Schütte 2010; see spanGrp elements is insufficient to represent the complex organization that can be constructed with the ELAN and + >ELAN and Praat tools. ELAN is a tool used by many researchers to describe data of greater complexity than the data presented in the @@ -792,7 +792,7 @@

ELAN example of a temporal division + type="soft.name" ref="#R98">ELAN example of a temporal division
@@ -851,7 +851,7 @@ corpora to be used with other editing tools, some of which are suited to specific processing: for example, Praat for phonetics/phonology; + type="software" xml:id="R105" target="#transcriber"/> Transcriber/ CLAN for raw transcription; and CLAN , ELAN, - Praat, Praat, Transcriber, nor of course in TEI format.

@@ -1094,7 +1094,7 @@ TEICORPO: TreeTagger and + target="#stanfordcorenlp"/> CoreNLP.

@@ -1118,11 +1118,11 @@ TEICORPO should be used to generate an annotated file with lemma and POS information based on - TreeTagger. TreeTagger. - TreeTagger should be installed separately. The - implementation of - TreeTagger in TreeTagger should be installed separately. The + implementation of + TreeTagger in TEICORPO includes the ability to use any syntactic model. For French data, we used the PERCEO model (filename

filename is the full location of the - TreeTagger program, according to the system + TreeTagger program, according to the system used (Windows, MacOS, or Linux).

@@ -1163,7 +1163,7 @@

The environment variable TREE_TAGGER can be used to locate the model and the program. If no -program option is used, the default name for the - TreeTagger program is used.

+ TreeTagger program is used.

The -model parameter is mandatory.

The resulting filename ends with .tei_corpo_ttg.tei_corpo.xml or a specific name provided by the user (option -o).

@@ -1279,10 +1279,10 @@
- Stanford CoreNLP + Stanford CoreNLP

- The Stanford Core Natural Language Processing -

Accessed March 11, 2021, The Stanford Core Natural Language Processing +

Accessed March 11, 2021, .

( CoreNLP) package is a suite of tools (Badin et al. 2021) made it possible to insert metadata stored in CSV files (including participant metadata) into the TEI files. This makes it possible to achieve more powerful corpus analysis - using a tool such as TXM.

Our approach is somewhat similar to what is suggested in the conclusion of Schmidt, Hedeland, and Jettka (2017), who describe a @@ -1465,7 +1465,7 @@

Conclusion

- TEICORPO is a functional tool, created by the CORLI + TEICORPO is a functional tool, created by the CORLI network and ORTOLANG, that converts files created by software specializing in editing spoken-language data into TEI format. The result is fully compatible with the most recent developments in TEI, especially those that concern spoken-language material.

diff --git a/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml b/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml index 37e79557..988de0ad 100644 --- a/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml +++ b/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml @@ -574,7 +574,7 @@ usage has been increasing slowly but steadily.

Kanripo Project Details -

All the texts are freely available on GitHub in their +

All the texts are freely available on GitHub in their source form. This repository of texts can be accessed through the kanripo.org website, but also through a module of the Emacs editor called Mandoku. This allows users to query, access, clone, edit, and diff --git a/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml b/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml index 4bcce845..0c9911a9 100644 --- a/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml +++ b/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml @@ -619,7 +619,7 @@ target="http://nomisma.org/">Nomisma, and CRMtexCIDOC (International Committee for Documentation) Conceptual Reference Model, + target="#omekareference"/>Reference Model, accessed July 4, 2022, ; Nomisma (knowledge organization system for numismatics), accessed July 4, 2022, ; CRMtex model for the study of ancient texts (an diff --git a/data/JTEI/7_2014/jtei-7-dee-source.xml b/data/JTEI/7_2014/jtei-7-dee-source.xml index 8792c1ff..8066f09f 100644 --- a/data/JTEI/7_2014/jtei-7-dee-source.xml +++ b/data/JTEI/7_2014/jtei-7-dee-source.xml @@ -734,9 +734,9 @@ Integrated Resources

While initiatives such as TAPAS, TEICHI, and CWRC-Writer

Welcome to CWRC Writer, - CWRC-Writer Help, accessed September 7, 2013, .

have begun to address to different aspects of these needs (Cocoon. and the native XML database eXist-dbeXist-db. deserve to be mentioned. Specifically for TEI-annotated documents, TUSTEP,Java objects. The resources are stored and maintained in a native XML database management system (i.e., eXist-db). The APIs and services provided by Lucene, a software library developed and hosted by the Apache Foundation, have been used for indexing the textual data.

@@ -646,7 +646,7 @@

The marshalling and unmarshalling process handles the serialization of the object representation of the TEI document, in order to store and retrieve data on the filesystem or in native XML databases, such as eXist-db.

+ target="#existdb"/>eXist-db.

Performance measurement tools such as JMeter will help to optimize the performance of the library components.

Software currently under development will be available on Example rs.

-

Reference attributes (ref) point to nodes located elsewhere in the TEI dataset. It should be noted that the organization of the TEI dataset and the location of the entity notes therein is of no importance to the reference linking @@ -471,7 +471,7 @@ . software library. One of our goals is to implement the aggregations within the digital edition, and for this we would like to use web technologies only. The D3.js + xml:id="D3.js" target="#d3js"/>D3.js (Data Driven Documents Javascript library) created by Mike Bostock provides a framework for different visualizations. The list of examplesJavaScript and a reference to the external D3.js library. The second is a + target="#d3js"/>D3.js library. The second is a JSON file, which contains one object per entity and one associated array per object that includes a list of connected entities. The tree-like structure of XML allows the transformation of any document to a network graph by selecting elements that share the @@ -526,8 +526,8 @@ headline is Thüringens Geschichte (History of Thuringia), which is also the topic of the following pages. The benefit of the network is that a major topic can be identified with a single view.

-

The output of this D3.js application is an SVG graphic which can be +

The output of this D3.js application is an SVG graphic which can be further transformed. svg:title elements are used to store the node names, which modern browsers should display on mouseover. To get a better overview of the entities in the notebook, the node names should actually be inserted as nodes, but since there is @@ -599,7 +599,7 @@ xml:id="XSLT" target="#XSLT"/>XSLT) and code customization were easily carried out in addition to our regular work within the Fontane edition project. These efforts were facilitated by a spirit of openness shared by all - parties involved: both the D3.js library and the SIMILE Timeline widget are open-source software released under a BSD license; the data sources GND, GeoNames, and OpenStreetMap have permissive licenses—Creative Commons Zero (CC0), Creative Commons diff --git a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml index 94556410..751edd6f 100644 --- a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml +++ b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml @@ -117,9 +117,9 @@ the constant search for an effective price/result ratio and the local availability of technical skills, have led to a remarkable fragmentation: publishing solutions range from simple HTML pages produced using the - + TEI stylesheets (or the - + TEI Boilerplate software) to very complex frameworks based on CMS and SQL search engines. Researchers of the Digital Vercelli Book project started looking into a simple, user-friendly solution and eventually @@ -553,7 +553,7 @@ accessible at .

soliciting feedback from all interested parties. Shortly afterwards, the version of the - EVT software we used, improved by more bug fixes and + EVT software we used, improved by more bug fixes and small enhancements, was made available for the academic community on the project’s SourceForge site.

Edition Visualization Technology: Digital edition visualization @@ -629,7 +629,7 @@ Forcing the prerequisites of an Internet connection and of dependency on a server-based XML database would have undermined our original goal. Going the database route was no longer an option for a client-only - EVT and we immediately felt the need to go back to + EVT and we immediately felt the need to go back to our original architecture to meet this standard. This sudden turnaround marked another chapter in the research process and brought us to the current implementation of EVT Search.

@@ -651,7 +651,7 @@
Tipue Search -

Tipue search

@@ -739,7 +739,7 @@ use of simple recursive functions on relevant HTML nodes has proved to be very difficult to apply to the textual contents handled by - EVT.

+ EVT.

HTML text within EVT is represented as a combination of text nodes and span elements. These spans are used to define the characteristics of the @@ -952,7 +952,7 @@ type="software" xml:id="R93" target="#EVT"/> EVT or, more precisely, a separate version of - EVT will migrate to this architecture, at some point + EVT will migrate to this architecture, at some point in the future it will be possible to integrate a full version of the DL. Plans for the current, client-only version envision implementing all those features that do not depend on server software: even if this means giving up interesting features such as diff --git a/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml b/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml index 5fae6897..6f7d0f71 100644 --- a/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml +++ b/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml @@ -1121,7 +1121,7 @@ Commonwealth Office, Western Organisations Department: Registered Files (W and WD Series). Western European Union (WEU). Future of Standing Armaments Committee of Western European Union. 01/01/1975–31/12/1975, FCO 41/1749 (Former Reference Dep: WDU 11/1 PART B). The interpretation of the less predictable results is not straightforward, since they may have been determined by an under- or overrepresentation of certain elements in the discourse, diff --git a/data/JTEI/9_2016-17/jtei-9-turska-source.xml b/data/JTEI/9_2016-17/jtei-9-turska-source.xml index d3ca9a92..50f7b614 100644 --- a/data/JTEI/9_2016-17/jtei-9-turska-source.xml +++ b/data/JTEI/9_2016-17/jtei-9-turska-source.xml @@ -303,7 +303,7 @@ SARIT or Buddhist Stonesutras or experiments with EEBO-TCPEarly English Books Online eXist-db app, accessed February 11, 2016, . are more than promising (see, for example, Wicentowski and diff --git a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml index 95384ef3..0eed9e2e 100644 --- a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml +++ b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml @@ -370,7 +370,7 @@ users to view a publishable form of their inscriptions, and to publish them online in a full-featured searchable database, by easily ingesting EpiDoc texts and providing formatting for their display and indexing through the EpiDoc + xml:id="R16" target="#epidocxslt"/> EpiDoc reference XSLT stylesheets. The ease of configuration of the XSLT transformations, and the possibility of already having, during construction, an immediate front-end visualization of the desired final outcome of the TEI-EpiDoc @@ -387,7 +387,7 @@ Yordanova (2020).

Some of these useful features of EFES are common to other existing tools, + />EFES are common to other existing tools, such as TEI Publisher,

Accessed July 21, 2021, .

diff --git a/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml b/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml index d83c37b6..ffa8c151 100644 --- a/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml +++ b/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml @@ -132,7 +132,7 @@ from manuscripts, to be published alongside the catalogue description of the manuscript itself, we have investigated a series of options, among which we have chosen to use the Transkribus sofware by READ Coop.Accessed February 2, 2022, .

@@ -151,7 +151,7 @@ an historical catalogue that involves copying from the former cataloguer transcription. Having a new transcription, based on autopsy or at least on the images of the manuscript would be preferable and technology as Transkribus allows one to obtain this transcription in an almost entirely automated way. Additionally, most of the internal referencing within a manuscript is done with the indication of the ranges of folios, and in TEI with @@ -202,7 +202,7 @@

The following steps have been taken to carry out an investigation of the possibilities for the automated production of text transcriptions based on images of manuscripts, before we opted for Transkribus + target="#transkribus"/>Transkribus and its integration in the workflow to make texts available in the Beta maṣāḥǝft research environment.

@@ -291,7 +291,7 @@ one script.

- Transkribus

This software is freely accessible and has a subscription model based on credits. The platform was created within the framework of the EU projects @@ -304,7 +304,7 @@ platform. The Pattern Recognition and Human Language Technology (PRHLT) group of the Universitat Politècnica de València and the CITlab group of the University of Rostock should be mentioned in particular.

-

Transkribus comes as an expert tool in its downloadable version and its online version,Accessed February 2, 2022,

Thus, the first stage for developing a model was gathering the data and preparing an initial dataset. Also for this aspect, Transkribus + target="#transkribus"/>Transkribus proved superior to all other options offering support also for this step. Colleagues which we called to contribute could be added to a collection, share their images without publishing them and add their transcriptions in the tool with a very mild learning curve.

-

Within Within Transkribus we have trained a model called Manuscripts from Ethiopia and Eritrea in Classical Ethiopic (Gǝʿǝz).See, accessed February 2, 2022,

Training a model in Transkribus

Gathering data to train an HTR model in Transkribus + target="#transkribus"/>Transkribus was not easy. Researchers were directly asked to contribute images of which they had already done the correct transcription. Sets of images with the relative transcription was thus obtained thanks to the generosity of contributors listed @@ -437,7 +437,7 @@ for the available time of the colleagues to fix the work of the machine, since we intended to train the model again. After three months with a full-time dedicated person, we had more than 50k words in the Transkribus + target="#transkribus"/>Transkribus expert tool, and we could train a model which could be made public, since this is the unofficial threshold to make a model available to everyone.

The features of the final model can be seen in

Adding transcriptions to Beta maṣāḥǝft from Transkribus

Even if a user already worked through each page of a manuscript to produce a transcription, doing it again with Transkribus + target="#transkribus"/>Transkribus and checking it has many advantages, chiefly the alignment of the text regions and lines on the base image to the transcription.Guidelines are provided for this steps to the users in theproject Guidelines, @@ -470,7 +470,7 @@ />.

With the transcribed images, either by hand with the help of the tool, or using the HTR model, the export functionalities of the Transkribus tool, allow to download a TEI encoded version of this transcription where we encourage users to use Line Breaks (lb) instead of l and preserve the coordinates of the boxes.

@@ -487,7 +487,7 @@

We have then prepared a bespoke XSLT transformation which can be used to transform the rich TEI from Transkribus, + target="#transkribus"/>Transkribus, called transkribus2Beta maṣāḥǝft.xsl. This transformation, given a few parameters, @@ -510,7 +510,7 @@

Conclusions -

Working with Working with Transkribus for the Beta maṣāḥǝft project gives the community of users a way to support the process of transcribing to the text on source manuscripts without typing it down. This is not intended to substitute the @@ -607,7 +607,7 @@ Weidemann, Herbert Wurster, and Konstantinos Zagoris. 2019. Transforming scholarship in the archives through handwritten text recognition: <ptr type="software" - xml:id="Transkribus" target="#Transkribus"/><rs type="soft.name" + xml:id="Transkribus" target="#transkribus"/><rs type="soft.name" ref="#Transkribus">Transkribus</rs> as a case study. Journal of Documentation, 75 (5) https://github.com/mandoku/mandoku + + Toolbox + + + + + Ediarum + + + + + Transkribus + + +