From 49f7410b795ec0777392428dfaf1aa85e95bd78e Mon Sep 17 00:00:00 2001
From: Caterina Penone
Date: Fri, 17 Nov 2017 17:27:48 +0100
Subject: [PATCH 1/3] Minor changes to description

I put that in a new branch to avoid "breaking" master.
---
 index.Rmd | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/index.Rmd b/index.Rmd
index c58d2d3..2110390 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -6,17 +6,17 @@ date: "v0.6, released: 14 Nov. 2017"
 # Glossary of terms
-This defined vocabulary aims at providing all essential terms to describe datasets of functional trait measurements and facts for ecological research. Many terms refine terms from the Darwin Core Standard and it's extensions (terms of DWC are referenced thus in field 'Refines'; the full Darwin Core Standard can be found here: http://rs.tdwg.org/dwc/terms/index.htm).
+This defined vocabulary aims at providing all essential terms to describe datasets of functional trait measurements and facts for ecological research. Many terms refine terms from the Darwin Core Standard (DWC: Darwin Core Terms) and its extensions. DWC are referenced in field 'Refines'; the full Darwin Core Standard can be found here: http://rs.tdwg.org/dwc/terms/index.htm.
-The glossary of terms is ordered into a **core section** with essential columns for trait data, extensions which are allowing to provide additional layers of information, as well as a vocabulary for **metadata** information of particular importance for trait data.
+The glossary of terms is ordered into a **core section** with essential columns for trait data, **extensions** which are allowing to provide additional layers of information, as well as a vocabulary for **metadata** information of particular importance for trait data. Another section provides defined terms and structure for **trait Thesauri**, i.e. lists of trait definitions.
 We provide three **extensions** of the vocabulary, that allow for additional information on the trait measurement.
 - the `Occurrence` extension contains information on the level of individual specimens, such as date and location and method of sampling and preservation, or physiological specifications of the phenotype, such as sex, life stage or age.
-- the `MeasurementOrFact` extension takes information at the level of single measurements or reported values, such as the original literature from where the value is cited, the method of measurement or statistical method of aggregation.
-- The `BiodiversityExploratories` extension provides columns for localisation for trait data from the Biodiversity Exploratories sites (www.biodiversity-exploratories.de).
+- the `MeasurementOrFact` extension contains information at the level of single measurements or reported values, such as the original literature from where the value is cited, the method of measurement or the statistical method used for aggregation.
+- The `BiodiversityExploratories` extension provides columns for localisation of trait data from the Biodiversity Exploratories plots and regions (www.biodiversity-exploratories.de).
 This glossary of terms is available as
@@ -99,14 +99,14 @@ parseterms("Traitdata")
 # Metadata vocabulary
-For datasets collate from multiple other datasets
+For datasets collated from multiple other datasets. @Flo: maybe clarify this
 There is the set of information that applies to the entire trait-dataset, which classifies them as metadata.
-To retain the rights of the original data contributor, the field `rightsHolder` states the person or organization that owns or manages the rights to the data; `bibliographicCitation` states a bibliographic reference which should be cited when the data is used; and license specifies under which terms and conditions the data can be used, re-used and/or published. This information always applies to one single fact or measurement,
+To retain the rights of the original data contributor, the field `rightsHolder` states the person or organization that owns or manages the rights to the data; `bibliographicCitation` states a bibliographic reference which should be cited when the data is used; and license specifies under which terms and conditions the data can be used, re-used and/or published. This information always applies to one single fact or measurement.
-Further information on the larger dataset which originally contained this entry can be stored in `datasetID`, `datasetName`, `author` . These columns should hence give credit to the person who compiled the original dataset and signs responsible for the correct identification and reporting of the rights holder.
-These information usually may be kept in the metadata of the dataset, but if datasets from different sources are merged, those should be referred to by a unique identifier (`datasetID`) or be reported as additional columns in the merged dataset (`author`, `license`, ...; see Dublin Core Metadata standards, Ref).
+Further information on the larger dataset which originally contained the single fact or measurement can be stored in `datasetID`, `datasetName`, `author` . These columns should hence give credit to the person who compiled the original dataset and signs responsible for the correct identification and reporting of the rights holder.
+These information can usually be kept in the metadata of the dataset, but if datasets from different sources are merged, those should be referred to by a unique identifier (`datasetID`) or be reported as additional columns in the merged dataset (`author`, `license`, ...; see Dublin Core Metadata standards, Ref).
 Since trait data are of great use for synthesis studies, information about how the data may be distributed, re-used and attributed to are of particular importance for trait datasets.
 Most researchers encourage re-use of their published datasets while making sure they are sufficiently credited. The use of permissive licenses for traitdata publications, such as Creative Commons Attribution or Creative Commons Zero/Public Domain release, has been established as the gold standard.
@@ -143,7 +143,7 @@ This links traits of similar functional meaning and allows cross-taxon comparati
 Ontologies for functional traits are being developed for different organism groups, mostly centered around certain research questions or subjects of study. To date, the TRY database takes the most inclusive approach on functional traits for vascular plants (Kattge). For some animal groups, similar approaches do exist, but few are available as an online ontology.
-As a starting point for creating an ontology for functional traits, we propose the following terms for trait lists (also termed 'Thesaurus'), to describe functional traits that are in the focus of the research project.
+As a starting point for creating an ontology for functional traits, we propose the following terms for trait lists (also termed 'Thesaurus'), to describe functional traits that are in the focus of a given project.
 Using this standardized terminology will allow merging trait definitions from multiple sources. We encourage providing these lookup tables as an open resource on public terminology servers to enable a global referencing. The benefit of such classifications will increase if open Application Programming Interfaces (APIs) provide a way to extract the definitions and higher-level trait hierarchies programmatically via software tools. To harmonize trait data across databases, future trait standard initiatives should provide this functionality.
@@ -167,10 +167,10 @@ parseterms("Traitlist")
 This section provides additional information about a reported measurement or fact and in most cases can easily be included as extra columns to the core dataset.
-As a high-level discrimination of the source of the measurement or fact, the Darwin Core Term `basisOfRecord` takes an entry about the type of trait data recorded: Were they taken by own measurement (distinguish "LivingSpecimen", "PreservedSpecimen", "FossilSpecimen") or taken from literature ("literatureData"), from an existing trait database ("traitDatabase"), or is it expert knowledge ("expertKnowledge"). It is highly recommended to provide further detail about the source in the column `basisOfRecordDescription`.
+As a high-level discrimination of the source of the measurement or fact, the Darwin Core Term `basisOfRecord` takes an entry about the type of trait data recorded. It distinguishes between data collected by own measurement (distinguish "LivingSpecimen", "PreservedSpecimen", "FossilSpecimen"), from literature ("literatureData"), from an existing trait database ("traitDatabase"), or from expert knowledge ("expertKnowledge"). It is highly recommended to provide further detail about the source in the column `basisOfRecordDescription`.
 To keep track of potential sources of noise or bias in measured data, the method of measurement (`measurementMethod`), the person conducting the measurement (`measurementDeterminedBy`), and the date at which the measurement was obtained (`measurementDeterminedDate`) are recorded.
-Authors would often report aggregate data of repeated or pooled measurements, e.g. by weighing multiple individuals simultaneously and calculating an average. In these cases, recording the number of individuals (`individualCount`) along with a dispersion measure (e.g. variance or standard deviation, `dispersion`) or range of values (e.g. min and max of values observed in the field `measurementValueMin`, `measurementValueMax`) is adviced. The field `statisticalMethod` names the method for data aggregation (e.g. mean or median) as well as the variation or range (e.g. reporting variance or standard deviation).
+Authors would often report aggregated data from repeated or pooled measurements, e.g. by weighing multiple individuals simultaneously and calculating an average. In these cases, recording the number of individuals (`individualCount`) along with a dispersion measure (e.g. variance or standard deviation, `dispersion`) or range of values (e.g. min and max of values observed in the field `measurementValueMin`, `measurementValueMax`) is advised. The field `statisticalMethod` names the method for data aggregation (e.g. mean or median) as well as the variation or range (e.g. reporting variance or standard deviation).
 For data not obtained from own measurement, the field `references` provides a precise reference to the source of data (e.g. a book or existing database) or the authority of expert knowledge. For literature data, the original source might report trait values on the family or genus level, but the dataset author infers and reports the trait data at species level (e.g. if the entire genus reportedly shares the same trait value). To preserve this information, the column `measurementResolution` should report the taxon rank for which the reported value was originally assessed.
@@ -190,7 +190,7 @@ For both literature and measured data, trait values may be recorded for differen
 Sampling may be further specified using a unique identifier for the sampling event (`eventID`) which references to an external dataset. The record of a `samplingProtocol` may capture bias in samling methods. Further procedures and methods of preservation should be reported in `preparations`.
-Seasonal variation of traits may be recored by assigning a date and time of sampling to the occurrence, using the fields `year`, `month` and `day`, depending on resolution. Further field definitions of the Darwin Core Standard can be applied instead, to refer to a geological stratum, for instance.
+Seasonal variation of traits may be recorded by assigning a date and time of sampling to the occurrence, using the fields `year`, `month` and `day`, depending on resolution. Further field definitions of the Darwin Core Standard can be applied instead, to refer to a geological stratum, for instance.
 To capture geographic variation of traits, a set of fields for georeferencing can put the observation into spatial and ecological context (`habitat`, `decimalLongitude`, `decimalLatitude`, `elevation`, `geodeticDatum`, `verbatimLocality`, `country`, `countryCode`). The field `locationID` may be used to reference the occurrence to a dataset-specific or global identifier. This allows the trait data to double as observation data, e.g. for upload to the GBIF database.
@@ -204,7 +204,7 @@ parseterms("Occurrence")
 # Extension: Biodiversity Exploratories
-This section records location in the context of the Biodiversity Exploratories project (www.biodiversity-exploratories.de). The field `OriginExploratories` flags trait measurements originating from samples in the project context. `Exploratory` and `ExploType` allow to place the sample within a region or a landscape type (Grassland or Forest). From `ExploratotriesPlotID` a detailled georeference can be inferred. Additional spatial resolution, e.g. on subplots, may be provided in `locationID` of the Occurence extension.
+This section records location in the context of the Biodiversity Exploratories project (www.biodiversity-exploratories.de). The field `OriginExploratories` flags trait measurements originating from samples in the project context. `Exploratory` and `ExploType` allow to place the sample within a region or a landscape type (Grassland or Forest). From `ExploratotriesPlotID` a detailed georeference can be inferred. Additional spatial resolution, e.g. on subplots, may be provided in `locationID` of the Occurrence extension.
 Trait data uploaded to the Biodiversity Exploratories Information System (BExIS) should use the vocabulary in a single-file longtable format (no DwC-Archives supported).

From e032112b67abd10cdc2222f306deb235e9794a25 Mon Sep 17 00:00:00 2001
From: Caterina Penone
Date: Fri, 17 Nov 2017 17:42:42 +0100
Subject: [PATCH 2/3] Minor changes in description

---
 structure.Rmd | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/structure.Rmd b/structure.Rmd
index 6cbde6a..bfed149 100644
--- a/structure.Rmd
+++ b/structure.Rmd
@@ -17,11 +17,11 @@ There are two possibilities to integrate further information to the core trait d
 For chosing one or the other, the trade-off is data-consistency and readability *vs.* avoidance of content duplication:
-For standalone dataset publications on a hosting service with only little information content beside the core traitdata columns, the first would be the preferred format, since it facilitates an analysis of cofactors and correlations further down the road. If datasets of different source are merged, the information is readily available without the risk of breaking the reference to an external datasheet.
-Other cases, where key data columns would be placed in the same table as the core data are traits assessed on a higher level of organisation, e.g. microbial functional traits assessed at the community level taken from a soil sample. Here, location or measurement information are in the primary focus of the investigation (see vocabulary extensions below).
-A general definition, whether a column describes asset data or is part of the central dataset is ill advised. Therefore, our glossary of terms and its extensions should be used to describe the scientific data according to the study context.
+For standalone dataset publications on a hosting service with only little information content beside the core traitdata columns, the first would be the preferred format, since it facilitates an analysis of cofactors and correlations further down the road. If datasets from different sources are merged, the information is readily available without the risk of breaking the reference to an external datasheet.
+Other cases, where key data columns would be placed in the same table as the core data are traits assessed on a higher level of organisation (nested), e.g. microbial functional traits assessed at the community level taken from a soil sample. Here, location or measurement information are in the primary focus of the investigation (see vocabulary extensions below).
+A general definition on whether a given column describes asset data or is part of the central dataset is advised. Therefore, our glossary of terms and its extensions should be used to describe the scientific data according to the study context.
-The latter links separate data sheets by identifiers, which has the advantage of tidy datasets and avoids duplication of verbose information [@wickham14]. As a rule of thumb, the columns of the 'Measurement or Fact' and 'Occurrence' extension would be stored in a separate data sheet. The use of Darwin Core Archives [http://eol.org/info/structured_data_archives, DwC-A; @robertson09] is the recommended structure for GBIF [@gbif17, http://tools.gbif.org/dwca-assistant/] or EOL TraitBank [@parr16, http://eol.org/info/cp_archives]. These are .zip archives containing data table txt-files along with a descriptive metadata file (in .xml format). Detailled descriptions and tools for validation can be found on the website of EOL (http://eol.org/info/cp_archives) and GBIF (http://tools.gbif.org/dwca-assistant/).
+The option of separating data sheets by identifiers has the advantage of providing tidy datasets and avoids duplication of verbose information [@wickham14]. As a rule of thumb, the columns of the 'Measurement or Fact' and 'Occurrence' extension would be stored in a separate data sheet. The use of Darwin Core Archives [http://eol.org/info/structured_data_archives, DwC-A; @robertson09] is the recommended structure for GBIF [@gbif17, http://tools.gbif.org/dwca-assistant/] or EOL TraitBank [@parr16, http://eol.org/info/cp_archives]. These are .zip archives containing data table txt-files along with a descriptive metadata file (in .xml format). Detailed descriptions and tools for validation can be found on the website of EOL (http://eol.org/info/cp_archives) and GBIF (http://tools.gbif.org/dwca-assistant/).
 The metadata of any dataset that employs this data structure should refer to the respective version of the Ecological Traitdata Standard as "Schneider et al. 2017 Ecological Traitdata Standard v1.0, DOI: XXXX.xxxx, URL: https://ecologicaltraitdata.github.io/ETS/v1.0/". In addition to the versioned online reference, the dataset should also cite the methods paper "Schneider et al. (in preparation) ..." for an explanation of the rationale.

From acc11bfa8f32797df38bbbdf1dbf6594407b92ec Mon Sep 17 00:00:00 2001
From: Caterina Penone
Date: Fri, 17 Nov 2017 17:59:16 +0100
Subject: [PATCH 3/3] Update thesauri.Rmd

---
 thesauri.Rmd | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/thesauri.Rmd b/thesauri.Rmd
index 559c830..3c0a88f 100644
--- a/thesauri.Rmd
+++ b/thesauri.Rmd
@@ -6,27 +6,30 @@ csl: amnat.csl
 # Trait thesauri and ontologies
-If no published trait definition is available that can be referenced, trait-datasets should be accompanied by a dataset-specific glossary of traits, also termed a 'thesaurus'. In its simplest form, a trait thesaurus would provide a "controlled vocabulary designed to clarify the definition and structuring of key terms and associated concepts in a specific discipline" [@laporte13; @garnier17]. To be unambiguous, any thesaurus should be defining terms based on other well-defined terms from semantic ontologies.
+If no published trait definition is available and can be referenced, trait-datasets should be accompanied by a dataset-specific glossary of traits, also termed a 'thesaurus'. In its simplest form, a trait thesaurus would provide a "controlled vocabulary designed to clarify the definition and structuring of key terms and associated concepts in a specific discipline" [@laporte13; @garnier17]. To be unambiguous, any thesaurus should be defining terms based on other well-defined terms from semantic ontologies.
-In addition to the mere listing of trait definitions, a trait 'ontology' would be providing a formal model of the conceptual objects and relationships, or entities and qualities [@garnier17], ideally structured following the guidelines for a semantic web standard [@berners-lee01]. This can take rather complex forms, but for trait data might be just adding a hierarchy of broader and narrower terms.
+In addition to the mere listing of trait definitions, a trait 'ontology' would be providing a formal model of the conceptual objects and relationships, or entities and qualities [@garnier17], ideally structured following the guidelines for a semantic web standard [@berners-lee01]. This can take rather complex forms, but for trait data this might just imply adding a hierarchy of broader and narrower terms.
-By providing a minimal vocabulary for thesauri and ontologies (https://ecologicaltraitdata.github.io/TraitDataStandard/#terms-for-traitlists-a-trait-thesaurus), we hope to facilitate the publication of trait thesauri developed for the own project context which always should be referenced in the core trait dataset. The thesaurus might accompany the core data in a Darwin Core Archive, or be published on any other stable webservice.
+By providing a minimal vocabulary for thesauri and ontologies (https://ecologicaltraitdata.github.io/TraitDataStandard/#terms-for-traitlists-a-trait-thesaurus), we hope to facilitate the publication of trait thesauri developed for specific projects and which should be referenced in the core trait dataset. The thesaurus might accompany the core data in a Darwin Core Archive, or be published on any other stable webservice.
+By providing a minimal vocabulary for thesauri and ontologies (https://ecologicaltraitdata.github.io/TraitDataStandard/#terms-for-traitlists-a-trait-thesaurus), we hope to facilitate the publication of trait thesauri developed for specific projects and which should be referenced in the core trait dataset. The thesaurus might accompany the core data in a Darwin Core Archive, or be published on any other stable webservice. -Thus, in its simplest form, a trait thesaurus would assign trait names with A) an unambiguous definition of the trait and B) an expected format (e.g. units or legit factor levels) of measured values or reported facts. A trait ontology would additionally provide semantic relationships between terms for deriving a hierarchical or tree-based classification of traits and analysing traits in a broader taxonomic context. +Thus, in its simplest form, a trait thesaurus would assign unique trait names to A) an unambiguous definition of the trait and B) an expected format (e.g. units or legit factor levels) of measured values or reported facts. A trait ontology would additionally provide semantic relationships between terms for deriving a hierarchical or tree-based classification of traits and analysing traits in a broader taxonomic context. # Minimal terms of a trait thesaurus -A project-specific trait thesaurus may be a table of terms containing the following information: +A project-specific trait thesaurus can be a table of terms containing the following information: - a human readable, informative trait name (`trait`) - unique dataset-specific identifier (`Identifier`), which is referenced in the trait data-set -- a short, unambiguous verbal definition (`traitDescription`) which may make use of standard terms provided in other Ontologies, e.g. 
the definition for 'fruit mass' in TOP reads: "the mass (PATO:mass), either fresh or dried, of a fruit (PO:fruit)", referring to Phenotypic Characeristics Ontology PATO and Planteome Plant Ontolgy, PO (http://top-thesaurus.org/annotationInfo?viz=1&&trait=Fruit_mass). +- a short, unambiguous verbal definition (`traitDescription`) which may make use of standard terms provided in other Ontologies. For example the definition for 'fruit mass' in TOP reads: "the mass (PATO:mass), either fresh or dried, of a fruit (PO:fruit)", referring to Phenotypic Characeristics Ontology PATO and Planteome Plant Ontolgy, PO (http://top-thesaurus.org/annotationInfo?viz=1&&trait=Fruit_mass). + +Furthermore it should ideally: + - constrain the legit factor levels (for categorical data, `factorLevels`) or expected standard units (`expectedUnit` for numerical data). The type of values should be differentiated in the field `valueType` by specifying 'numerical', 'logical', 'integer', 'categorical' traits. -- link the term to a broader or narrower term (`broaderTerm`, `narrowerTerm`), related terms (`relatedTerm`) or synonyms (`synonym`), e.g. the definition of 'femur length of first leg, left side' is narrower than 'femur length' which is narrower than 'leg trait' which is narrower than 'locomotion trait'. This extends the trait list into a semantic web resource, facilitates the classification of traits, and allows for cross-taxon comparative studies at the level of broader terms [@garnier17]. +- link the term to a broader or narrower term (`broaderTerm`, `narrowerTerm`), related terms (`relatedTerm`) or synonyms (`synonym`). For instance, the definition of 'femur length of first leg, left side' is narrower than 'femur length' which is narrower than 'leg trait' which is narrower than 'locomotion trait'. This extends the trait list into a semantic web resource, facilitates the classification of traits, and allows for cross-taxon comparative studies at the level of broader terms [@garnier17]. 
 ## defining expected values
-Traits are not only defined in terms of their interpretation, but are ideally also standardised in terms of numerical units and, even more important, the use of factor levels. This is challenging, given the range of data types that fall within datasets of functional traits.
+Traits are not only defined in terms of their interpretation, but are ideally also standardised in terms of numerical units and, even more important, the use of factor levels. This is challenging, given the range of data types that fall within the definition of functional traits.
 **Numerical values** represent measurements of lengths, volumes, ratios, rates or timespans. Integer values may apply to count data (e.g. eggs per clutch). **Binary data** (encoded as 0 or 1) or logical data (coded as TRUE or FALSE) may apply to qualitative traits such as specific behaviour during mating (e.g. are territories defended) or specialisation to a given habitat (e.g. species restricted to relicts of primeval forests).
@@ -71,4 +74,4 @@ A simple way to publish a list of trait definitions for a project may be as a pu
 Online ontologies hosted with accredited ontology servers have the advantage of providing a persistent and direct link of the term on the internet (a *Uniform Resource Identifier*, URI). Terminology portals or registries, such as the GFBio Terminology Service, the OBO Foundry, or Ontobee, may provide a central host for trait ontologies.
-# References
\ No newline at end of file
+# References