Update GBIF's EML profile (EML 2.2.0) #5
Comments
Thank you Matt! I think this makes sense. I think we use the NCD elements in GRSciColl synchronisation when a dataset is set as source of information for a collection. Something to keep in mind if/when we update to the TDWG Collection Descriptions standard for datasets. See also gbif/registry#319 (comment) |
Suggestion for the registry dataset API response for the DocBook-formatted fields: Respond with HTML formatting, which is easy for consumers to use and sort-of what we have already, i.e. convert I think everything there has a direct HTML equivalent. |
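To make the DocBook-to-HTML suggestion concrete, here is a minimal Python sketch. The tag mapping (`TAG_MAP`) and the function names are assumptions for illustration, not taken from the registry code, and the choice of `h2` for titles and `div` for the wrapper is arbitrary.

```python
import xml.etree.ElementTree as ET

# Assumed DocBook -> HTML tag mapping; illustrative only, not the registry's.
TAG_MAP = {
    "para": "p",
    "section": "section",
    "title": "h2",
    "emphasis": "em",
    "itemizedlist": "ul",
    "orderedlist": "ol",
    "listitem": "li",
    "subscript": "sub",
    "superscript": "sup",
    "literalLayout": "pre",
}


def docbook_to_html(node: ET.Element) -> ET.Element:
    """Return a copy of `node` with DocBook tags renamed to HTML tags."""
    if node.tag == "ulink":
        out = ET.Element("a", {"href": node.get("url", "")})
    else:
        # Unknown tags fall back to <span> so that no text is lost.
        out = ET.Element(TAG_MAP.get(node.tag, "span"))
    out.text = node.text
    out.tail = node.tail
    for child in node:
        if node.tag == "ulink" and child.tag == "citetitle":
            # <citetitle> only carries the link label, so inline its text.
            out.text = (out.text or "") + (child.text or "") + (child.tail or "")
            continue
        out.append(docbook_to_html(child))
    return out


def abstract_to_html(eml_fragment: str) -> str:
    """Convert a DocBook-formatted EML field (e.g. <abstract>) to an HTML string."""
    root = ET.fromstring(eml_fragment)
    div = ET.Element("div")  # the wrapper element becomes a <div>
    div.text = root.text
    div.extend(docbook_to_html(child) for child in root)
    return ET.tostring(div, encoding="unicode", method="html")
```

Calling `abstract_to_html` on an `<abstract>` element returns a single HTML string that an API response could carry as-is.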
<distribution scope="document">
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>
does not look like a valid example according to the schema (only one
<distribution scope="document">
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
  </online>
</distribution>
<distribution scope="document">
  <online>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>
Not sure about the distribution's |
Since there isn't an identifier, I don't think we need a scope either.
<distribution>
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
  </online>
</distribution>
<distribution>
  <online>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution> |
So should I remove the attribute in the new schema? |
I think that makes most sense. |
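For existing documents that carry several URLs in one distribution, the outcome of this exchange (one URL per distribution, no scope attribute) could be applied mechanically. A minimal Python sketch, assuming a hypothetical helper name `split_distribution` and un-namespaced elements as in the examples above:

```python
import copy
import xml.etree.ElementTree as ET
from typing import List


def split_distribution(distribution: ET.Element) -> List[ET.Element]:
    """Return one <distribution> per <url>, without the scope attribute."""
    result = []
    for url in distribution.findall("./online/url"):
        new_dist = ET.Element("distribution")  # scope deliberately omitted
        online = ET.SubElement(new_dist, "online")
        new_url = copy.deepcopy(url)  # keeps the function attribute and text
        new_url.tail = None
        online.append(new_url)
        result.append(new_dist)
    return result
```

Each returned element holds exactly one `url`, with its `function` attribute preserved.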
@MattBlissett You haven't mentioned |
I missed that, please add it. |
@MattBlissett I might be mistaken, but there is no |
Also, we made emails ( |
I found more new/absent fields we haven't discussed yet, but might want to include:
Possible changes to the existing elements:
|
From the perspective of ChecklistBank and metadata used there I would really appreciate if we'd support |
I was probably looking at https://eml.ecoinformatics.org/eml-schema#the-eml-physical-module---physical-file-format but I'm now confused on how it fits in. |
I think it's fine to include other fields if other projects request them, but I recommend not adding everything — stored procedures are irrelevant, for example. Most of those are older fields which were excluded before, so I didn't change that. The dataTable field could be used, but it would seem to duplicate other dataset descriptors (meta.xml, Frictionless). I think implementing it would be a lot of work, and not worth it when no-one has shown any interest. |
@MattBlissett I think I missed that, so we should also support extension of the |
Btw that example is invalid:
<abstract>
  <para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
  <section>
    <title>A separate section</title>
    <para>More text</para>
    <para>And more text, with
      <itemizedlist>
        <listitem>First item</listitem>
      </itemizedlist>
      <orderedlist>
        <listitem>First item</listitem>
      </orderedlist>
      <section>
        <title>A sub-section</title>
        <emphasis>Emphasis</emphasis>
        CO<subscript>2</subscript> (or just CO₂)
        m<superscript>3</superscript> (or just m³)
        <literalLayout>
          x = fn(y, z)
        </literalLayout>
      </section>
      <ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
    </para>
  </section>
</abstract>
Issues are:
Valid example would be something like this:
<abstract>
  <para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
  <section>
    <title>A separate section</title>
    <para>More text</para>
    <para>And more text, with
      <itemizedlist>
        <listitem><para>First item</para></listitem>
      </itemizedlist>
      <orderedlist>
        <listitem><para>First item</para></listitem>
      </orderedlist>
      <ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
    </para>
    <section>
      <title>A sub-section</title>
      <para><emphasis>Emphasis</emphasis>
        CO<subscript>2</subscript> (or just CO₂)
        m<superscript>3</superscript> (or just m³)
        <literalLayout>
          x = fn(y, z)
        </literalLayout>
      </para>
    </section>
  </section>
</abstract> |
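The differences between the two examples can be checked mechanically. The Python sketch below flags three patterns, all inferred only from comparing the invalid and valid examples above (they are assumptions, not a restatement of the EML schema): a section nested inside a para, listitem text not wrapped in a para, and inline elements placed directly inside a section. The function name and messages are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Inline DocBook elements that appear in the examples above.
INLINE_TAGS = {"emphasis", "subscript", "superscript", "literalLayout", "ulink"}


def find_structure_issues(xml_text: str) -> List[str]:
    """List structural problems of the kind that separate the two examples."""
    issues: List[str] = []

    def walk(node: ET.Element, inside_para: bool) -> None:
        for child in node:
            if child.tag == "section" and inside_para:
                issues.append("section nested inside para")
            if child.tag == "listitem" and (child.text or "").strip():
                issues.append("listitem text not wrapped in para")
            if node.tag == "section" and child.tag in INLINE_TAGS:
                issues.append("inline element directly inside section")
            walk(child, inside_para or child.tag == "para")

    walk(ET.fromstring(xml_text), False)
    return issues
```

This is not a substitute for validating against the EML 2.2.0 schema; it only covers the patterns visible in this comment.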
Yes please. I know this is probably annoying, but the IPT currently writes escaped HTML into the descriptive formats, which means other users of EML have to handle this — it's not ideal when EML itself includes equivalent formatting. I suggest we support the DocBook elements where they are defined, and adjust the IPT and Registry to use them. |
Updating EML within GBIF
Ecological Metadata Language, EML, is the primary data standard used by Darwin Core Archives to provide metadata about a dataset — descriptions, information on geographic and taxonomic coverage, contacts, publishers and so on. It is also used as part of GBIF's API, and included in Darwin Core Archive data downloads.
Since 2011 we have been using EML version 2.1.1, extended with some additional properties including some from the Natural Collections Description Data (NCD) draft standard. The EML properties we recognize, as well as these extensions, are described in the GBIF Metadata Profile – How-to Guide. For reference, the 2.1.1 standard can be seen here.
It is now time for us to upgrade to the latest EML version, 2.2.0. There are several new elements which we plan to support, some of which will replace the GBIF extensions. A summary of the changes in 2.2.0 is available. It is also a good time to introduce multilingual support, so a dataset can be described in more than one language.
These updates will allow us to remove the need for some of the custom elements (or custom use of standard elements) added by GBIF.
New or updated elements
We will support these new or updated elements. Information within these elements will be added to the REST (JSON) API and shown on dataset pages as appropriate.
New:
dataset/licensed/{licenseName,url,identifier}: this will properly reference the licence used by a dataset, rather than the special ulink used at present under dataset/intellectualRights. EML also recommends a particular set of licence URIs, different to what we use ourselves (see also "Suggestion for updating the way creative commons licenses are provided via the EML from the IPT", ipt#1967). We will recognize the values preferred by EML (spdx.org) as well as the existing values (creativecommons.org...).
TODO: What values should we use for EML we produce?
Old:
New:
New:
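As an illustration of recognizing both URL styles for the licence, here is a minimal Python sketch. The function names and the lookup table are hypothetical. The three licences listed are the Creative Commons ones with their SPDX identifiers and names. Emitting the spdx.org form is an assumption made only for the sketch, since the TODO above leaves that choice open.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# SPDX identifier -> (SPDX licence name, path of the creativecommons.org URL).
SPDX = {
    "CC0-1.0": ("Creative Commons Zero v1.0 Universal", "publicdomain/zero/1.0"),
    "CC-BY-4.0": ("Creative Commons Attribution 4.0 International", "licenses/by/4.0"),
    "CC-BY-NC-4.0": (
        "Creative Commons Attribution Non Commercial 4.0 International",
        "licenses/by-nc/4.0",
    ),
}


def spdx_identifier(url: str) -> Optional[str]:
    """Map an spdx.org or creativecommons.org licence URL to an SPDX identifier."""
    path = url.strip().split("://", 1)[-1]  # ignore http/https
    if path.startswith("www."):
        path = path[4:]
    path = path.rstrip("/")
    for spdx_id, (_name, cc_path) in SPDX.items():
        spdx_url = "spdx.org/licenses/" + spdx_id
        cc_url = "creativecommons.org/" + cc_path
        if path in (spdx_url, spdx_url + ".html", cc_url, cc_url + "/legalcode"):
            return spdx_id
    return None


def licensed_element(url: str) -> ET.Element:
    """Build a <licensed> element; writing the spdx.org URL is an assumption."""
    spdx_id = spdx_identifier(url)
    if spdx_id is None:
        raise ValueError("unrecognized licence URL: " + url)
    licensed = ET.Element("licensed")
    ET.SubElement(licensed, "licenseName").text = SPDX[spdx_id][0]
    ET.SubElement(licensed, "url").text = "https://spdx.org/licenses/%s.html" % spdx_id
    ET.SubElement(licensed, "identifier").text = spdx_id
    return licensed
```

Matching on the path rather than the full URL means the http and https variants of either style resolve to the same licence.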
dataset/distribution/online: this currently lists the dataset homepage with the function "information". It may in addition link directly to a data download:
Old:
New, dataset published by IPT:
New:
dataset/introduction: New. One to many paragraphs that provide background and context for the dataset with appropriate figures and references. This is similar to the introduction for a journal article, and would include, for example, project objectives, hypotheses being addressed, what is known about the pattern or process under study, how the data have been used to date (including references), and how they could be used in the future.
New:
dataset/gettingStarted: New. One or more paragraphs that describe the overall interpretation, content and structure of the dataset. For example, the number and names of data files, the types of measurements that they contain, how those data files fit together in an overall design, and how they relate to the data collections methods, experimental design, and sampling design that are described in other EML sections. One might describe any specialized software that is available and/or may be necessary for analyzing or interpreting the data, and possibly include a high level description of data formats if they are unusual, keeping in mind that detailed descriptions of data structure and format are contained in the entity sections of EML. Citations, inline figures, and inline images can be included via inline references in Markdown sections.
New:
dataset/acknowledgements: New, "text that acknowledges funders and other key contributors."
These three new elements will be supported, all with multilingual support. We will support the DocBook subset used in EML, and prefer this to adding HTML. There are incompatibilities between EML's new Markdown support and multilingual support (see issue #6 on multilingual support for this profile update), so we do not plan to support Markdown at this stage.
Old:
New, showing all available formatting using DocBook:
dataset/project/award: new element for structured funding information, all useful for project tracking. Note there is no multilingual support for the project title, so we will ignore the support on some of the other parts of a project.
New:
dataset/project/relatedProject: new element, recursive links to other projects
dataset/creator/individualName/salutation: this is not a new element, but will allow us to preserve titles (Dr., Prof. etc.) in dataset contact names while excluding them from generated citations
New:
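The salutation handling can be sketched in Python as two views of the same creator: one for display and one for citations. The function names, the sample name, and the "given name then surname" citation layout are all assumptions for illustration. GBIF's actual citation format may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def _texts(name: Optional[ET.Element], tag: str) -> List[str]:
    """Collect the non-empty text of every `tag` child of an individualName."""
    if name is None:
        return []
    return [el.text.strip() for el in name.findall(tag) if el.text and el.text.strip()]


def display_name(creator: ET.Element) -> str:
    """Contact name as shown on a dataset page, salutation included."""
    name = creator.find("individualName")
    parts = _texts(name, "salutation") + _texts(name, "givenName") + _texts(name, "surName")
    return " ".join(parts)


def citation_name(creator: ET.Element) -> str:
    """Name as used in a generated citation, salutation excluded."""
    name = creator.find("individualName")
    return " ".join(_texts(name, "givenName") + _texts(name, "surName"))
```

Keeping the salutation in its own element means the citation code never has to strip titles out of a free-text name.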
dataset/literatureCited: replaces additionalMetadata/metadata/gbif/bibliography. Note we plan only BibTeX support.
Old:
New:
Dedicated issue: "New field literatureCited in GBIF EML profile", gbif-metadata-profile#29; won't be included in the 1.3 profile.
There's no such thing as dataset/physical.
dataset/publisher: describes the publisher of the data. This is presented elsewhere in the GBIF API, and we will not change the way a dataset is registered using the API. However, we can expose the publisher in the GBIF-generated EML for a dataset:
New:
New or updated elements — support not planned
We don't plan to support these elements at this stage.
dataset/usageCitation: This can expose known citations of a dataset. Presented elsewhere in the API.
dataset/coverage/taxonomicCoverage/taxonomicClassification/taxonId: taxonomic identifiers added. Not yet supported by occurrences.
dataset/annotation: structured annotations (key-value). Probably deserves to be a separate task.
dataset/pubPlace: a publisher address. Not clear to me why this is different to an address on the publisher.
dataset/referencePublication: "Common cases where a Reference Publication may be useful include when a data paper is published that describes the dataset, or when a paper is intended to be the canonical or examplar reference to the dataset."
GBIF Extension
These are all the elements of the GBIF extension:
dateStamp: kept
metadataLanguage: kept, but could use xml:lang on dataset instead
hierarchyLevel: kept
citation: kept, as EML suggests generating a citation from the components (authors etc.)
bibliography: replaced with dataset/literatureCited
physical: kept
replaces: kept
resourceLogoUrl: kept
NCD elements.
These are from the obsolete TDWG Natural Collections Description Data (NCD) draft standard. We will leave them as they are, but in future they could be replaced by elements from a TDWG Collection Descriptions standard.
collection/parentCollectionIdentifier
collection/collectionName
collection/collectionIdentifier
formationPeriod
livingTimePeriod
specimenPreservationMethod
jgtiCuratorialUnit