Question: Converting between FoLiA and UIMA CAS XMI XML #47
Comments
P.S. Other data formats that would allow this are CONLL-U or TEI5. |
That library looks promising, yes; in combination with foliapy, a converter could be implemented. The main problem is finding a mapping from the various FoLiA structures to UIMA CAS and vice versa, which is often far from trivial.
CONLL-U is significantly simpler, so converting it from/to FoLiA is doable; there's already a tool in foliatools for it. |
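One concrete part of that mapping problem, as far as I understand both formats: FoLiA stores tokens inline as `w` elements, while UIMA CAS annotations are stand-off spans (`begin`/`end` character offsets) over a single document text. The following is only a minimal sketch of that one step, using the Python standard library on a hand-written, simplified FoLiA-like fragment; it does not use foliapy, and the function name `tokens_to_offsets` is hypothetical.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hand-written, simplified fragment for illustration only
# (not a complete or validated FoLiA document).
SAMPLE = """
<s xmlns="http://ilk.uvt.nl/folia" xml:id="s.1">
  <w xml:id="s.1.w.1"><t>Hello</t></w>
  <w xml:id="s.1.w.2" space="no"><t>world</t></w>
  <w xml:id="s.1.w.3"><t>!</t></w>
</s>
"""

def tokens_to_offsets(sentence):
    """Rebuild the plain text and return (text, [(id, begin, end), ...])."""
    text = ""
    spans = []
    for w in sentence.iter(FOLIA_NS + "w"):
        token = w.find(FOLIA_NS + "t").text
        begin = len(text)
        text += token
        spans.append((w.get(XML_ID), begin, len(text)))
        # A token is followed by a space unless it is marked space="no".
        if w.get("space", "yes") != "no":
            text += " "
    return text.rstrip(), spans

text, spans = tokens_to_offsets(ET.fromstring(SAMPLE))
print(text)
for token_id, begin, end in spans:
    print(token_id, begin, end, repr(text[begin:end]))
```

The resulting offsets are the kind of values a UIMA token annotation would carry; nested structure, corrections, and alternative text layers are deliberately ignored here, and those are where the non-trivial mapping decisions lie.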
I believe so; perhaps there is no need to prioritize this. |
UIMA is agnostic to the annotation schema: it just provides the means of defining a schema and working with the annotated texts. There are other projects, like DKPro Core, that provide type systems. Additionally, there are annotation tools like INCEpTION that allow users to define their own annotation schema (called "layers" in INCEpTION) and then export/import it to/from UIMA CAS.

If I am not mistaken, FoLiA is a fully specified format that does not support "custom annotation types": all elements are provided by the FoLiA spec and other elements are not supported. So if I am correct and there is no support for custom annotation types in FoLiA, a fully generic mapping from UIMA CAS to FoLiA or from INCEpTION custom annotation layers to FoLiA would not be possible.

👉 FoLiA <-> UIMA CAS (DKPro Core) -- It should be possible to map a number of the FoLiA elements to/from the DKPro Core types (paragraph, sentence, token, lemma, etc.), not fully but at least to some degree. It would be interesting to figure out to what degree.

👉 Tooling interoperability -- Since e.g. INCEpTION knows the DKPro Core types, that would also make it easy to use the mapped data in the annotation tool. Similarly, it would make it possible, to some degree, to use texts annotated with INCEpTION or processed with DKPro Core with the FoLiA tools. |
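To make the "to which degree" question a bit more tangible, here is a minimal sketch of a partial lookup table from FoLiA element names to DKPro Core type names, plus a helper that reports which elements of a document would be left unmapped. The table is an assumption for illustration, not an agreed or complete mapping, and the helper name `coverage` is hypothetical.

```python
# Assumed, partial correspondence between FoLiA element names and
# DKPro Core type names; illustrative only, not an official mapping.
DKPRO_SEG = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type."
FOLIA_TO_DKPRO = {
    "p": DKPRO_SEG + "Paragraph",
    "s": DKPRO_SEG + "Sentence",
    "w": DKPRO_SEG + "Token",
    "lemma": DKPRO_SEG + "Lemma",
    "pos": "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS",
}

def coverage(folia_elements):
    """Split element names into mapped ones and a sorted list of unmapped ones."""
    mapped = {}
    unmapped = set()
    for name in folia_elements:
        if name in FOLIA_TO_DKPRO:
            mapped[name] = FOLIA_TO_DKPRO[name]
        else:
            unmapped.add(name)
    return mapped, sorted(unmapped)

mapped, unmapped = coverage(["p", "s", "w", "w", "pos", "lemma", "sense"])
print(len(mapped), "element types mapped; unmapped:", unmapped)
```

Running something like this over the element names found in real FoLiA documents would give a rough first estimate of how much a DKPro Core based conversion could cover.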
Btw., if anybody has implemented any conversions between FoLiA and UIMA CAS, it would be great if you could share them (e.g. link them here) so that others can use them as starting points for their own or more complete conversions. |
> If I am not mistaken, FoLiA is a fully specified format that does not
> support "custom annotation types" - all elements are provided by the
> FoLiA spec and other elements are not supported. So if I am correct
> and there is no support for custom annotation types in FoLiA, a fully
> generic mapping from UIMA CAS to FoLiA or from INCEpTION custom
> annotation layers to FoLiA would not be possible.

Correct, FoLiA defines types for various kinds of structural and
linguistic annotation. It does not, however, define the tagsets used for
linguistic annotation; those are user-defined. So we define, for example,
the concept "part-of-speech annotation", and the user determines what
tagset to use with that (for which there are formal structures
available). I'm not very familiar with DKPro Core, but it looks
similar in scope.
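
As an illustration of that split between fixed annotation types and user-defined tagsets, the sketch below reads the annotation declarations from a hand-written, simplified FoLiA-like header and lists which set each annotation type points to. The set URLs are made up for the example, the fragment is not validated against the FoLiA schema, and the function name `declared_sets` is hypothetical; only the Python standard library is used, not foliapy.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"
SUFFIX = "-annotation"

# Hand-written, simplified header; the set URLs are invented placeholders.
SAMPLE = """
<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/sets/my-pos"/>
      <lemma-annotation set="https://example.org/sets/my-lemmas"/>
    </annotations>
  </metadata>
</FoLiA>
"""

def declared_sets(root):
    """Map each declared annotation type (e.g. 'pos') to its set URL."""
    declarations = {}
    annotations = root.find(FOLIA_NS + "metadata/" + FOLIA_NS + "annotations")
    for decl in annotations:
        name = decl.tag[len(FOLIA_NS):]  # e.g. "pos-annotation"
        if name.endswith(SUFFIX):
            name = name[: -len(SUFFIX)]
        declarations[name] = decl.get("set")
    return declarations

print(declared_sets(ET.fromstring(SAMPLE)))
```

The annotation type (`pos`, `lemma`) comes from the format itself, while the set it refers to is chosen by the user, so a converter would need to carry the set reference along in addition to the class values.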
> 👉 **FoLiA <-> UIMA CAS (DKPro Core)** -- It should be possible to map
> a bunch of those to/from the DKPro Core types (paragraph, sentence,
> token, lemma, etc.) - not fully but at least to some degree. It would
> be interesting to figure out to which degree.

Indeed, that sounds doable.
> 👉 **Tooling interoperability** Since INCEpTION knows the DKPro Core
> types, that would also make it easy then to use the mapped data in the
> annotation tool. Similarly, it would enable to some degree to use
> texts annotated with INCEpTION or processed with DKPro Core with the
> FoLiA tools.

Having such interoperability would be quite nice, yes.
|
Would it be an idea to investigate the interoperability between the FoLiA and the "UIMA CAS XMI XML" formats?
If I understand it right, this would allow data exchange between the FoLiA and the UIMA ecosystems.
Would it be of interest to the community, and would foliapy and dkpro-cassis (https://github.com/dkpro/dkpro-cassis) be instrumental for this?
Many thanks for any pointers!
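
For readers unfamiliar with the UIMA side, here is a minimal sketch of what a small CAS XMI file with DKPro Core segmentation types looks like and how the annotated spans can be recovered from it. It uses only the Python standard library on a hand-written sample; in practice a library such as dkpro-cassis would be the natural way to read and write such files. The function name `covered_texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

CAS_NS = "{http:///uima/cas.ecore}"
SEG_NS = "{http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore}"

# Hand-written sample for illustration; real exports contain more metadata.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore"
    xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
            mimeType="text/plain" sofaString="Hello world!"/>
  <type:Sentence xmi:id="2" sofa="1" begin="0" end="12"/>
  <type:Token xmi:id="3" sofa="1" begin="0" end="5"/>
  <type:Token xmi:id="4" sofa="1" begin="6" end="11"/>
  <type:Token xmi:id="5" sofa="1" begin="11" end="12"/>
  <cas:View sofa="1" members="2 3 4 5"/>
</xmi:XMI>
"""

def covered_texts(root, type_name):
    """Return the text covered by each annotation of the given segmentation type."""
    document_text = root.find(CAS_NS + "Sofa").get("sofaString")
    return [
        document_text[int(a.get("begin")):int(a.get("end"))]
        for a in root.iter(SEG_NS + type_name)
    ]

root = ET.fromstring(SAMPLE_XMI)
print(covered_texts(root, "Sentence"))
print(covered_texts(root, "Token"))
```

Every annotation refers to the document text by offsets only, which is the main structural difference from FoLiA's inline elements that a converter would have to bridge.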