
5.4. Semantic Data Dictionary (SDD)

Paulo Pinheiro edited this page Sep 3, 2019 · 6 revisions

5.4.1. When to use an SDD

SDDs semantically describe the content of a tabular data file. Without an SDD, HADatAc does not know how to extract the content (both data and metadata) from a data file and move it into a queryable data repository. For each column header listed in an SDD, the document specifies, at a minimum, the kind of attribute that the column holds and the object that has that attribute. For example, the attribute may be Height and the object may be a human who is a subject in a given study.

SDDs are not necessarily designed for a single study, and many SDDs can be reused multiple times. For instance, an SDD may be developed to describe the output of a sensor or instrument, and that same SDD can be used in any study that uses the same kind of sensor. Likewise, an SDD may be developed to describe data files produced with a standardized questionnaire, and it should be usable in any study that uses that questionnaire.

5.4.2. How to Build an SDD

An SDD is made up of five tables: InfoSheet, Dictionary Mapping, Code Mapping, Timeline, and Codebook.

InfoSheet

The InfoSheet is used to organize the other four tables named above. An example InfoSheet template is shown below.

| Attribute | Value |
|---|---|
| SDD_Name | Name |
| Dictionary_Mappings | #DICT |
| Codebook | #CODEBOOK |
| Code_Mappings | #CODEMAPPING |
| Timeline | #TIMELINE |
| Imports | http://example.org/myontology/ |

We often encode SDDs as either a collection of Google documents or a single Excel spreadsheet. With Google Docs, the values in the InfoSheet are URLs of the other Google documents that contain the corresponding SDD specifications. In Excel, each value is a # sign followed by the name of the sheet within the spreadsheet that contains the corresponding SDD specification.
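As a minimal sketch of the convention described above, the Python below classifies each InfoSheet value as a sheet reference within the same workbook (a `#` prefix) or as a URL. The function name `resolve_infosheet` and the tuple layout it returns are illustrative assumptions, not part of HADatAc.

```python
# Hypothetical helper for illustration; not part of HADatAc.
def resolve_infosheet(infosheet):
    """Classify each InfoSheet value as a local sheet, a URL, or a literal."""
    resolved = {}
    for attribute, value in infosheet.items():
        if value.startswith("#"):
            # Excel convention: '#' followed by the sheet name.
            resolved[attribute] = ("sheet", value[1:])
        elif value.startswith(("http://", "https://")):
            # Google Docs convention, or an imported ontology: a URL.
            resolved[attribute] = ("url", value)
        else:
            # Anything else (such as the SDD name) is kept as plain text.
            resolved[attribute] = ("literal", value)
    return resolved


# The example InfoSheet template from this section.
infosheet = {
    "SDD_Name": "Name",
    "Dictionary_Mappings": "#DICT",
    "Codebook": "#CODEBOOK",
    "Code_Mappings": "#CODEMAPPING",
    "Timeline": "#TIMELINE",
    "Imports": "http://example.org/myontology/",
}
resolved = resolve_infosheet(infosheet)
```

With the template values, `Dictionary_Mappings` resolves to the sheet named `DICT`, while `Imports` is treated as a URL.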

Dictionary Mappings

The bulk of the SDD specification is done using the Dictionary Mapping Table, which is used to annotate the columns of a given dataset. The SDD Data Model (DM) Specification is shown below.

| DM Column | Related Property | Description |
|---|---|---|
| Attribute | rdf:type | Class of attribute entry |
| attributeOf | sio:isAttributeOf | Entity having the attribute |
| Column | | Entry column header in dataset |
| Comment | rdfs:comment | Comment for the entry |
| Definition | skos:definition | Entry text definition |
| Entity | rdf:type | Class of entity entry |
| Format | | Specifies the structure of the Unit value |
| inRelationTo | sio:inRelationTo | Entity that the role is linked to |
| Label | rdfs:label | Label for the entry |
| Relation | | Custom relation that replaces inRelationTo |
| Role | sio:hasRole | Type of the role of the entry |
| Time | sio:existsAt | Time point of measurement |
| Unit | sio:hasUnit | Unit of Measure for entry |
| wasDerivedFrom | prov:wasDerivedFrom | Entity from which the entry was derived |
| wasGeneratedBy | prov:wasGeneratedBy | Activity from which the entry was produced |

The attributes hasco:uriId and hasco:originalID mark the main identifier of an SDD, and an SDD can have only one main identifier. If an SDD has more than one identifier and one of them can be designated as the main one, the other identifiers can be of type sio:Identifier. In this case, the remaining attributes of the SDD are assumed either to share the attributeOf value of the main identifier, or to be attributes of objects that are identified in the SDD and connected to the object behind the main identifier.

Cell scope in the OAS file is required when an SDD has more than one identifier and none of them can be designated as the main one.
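The identifier convention can be illustrated with a short Python sketch. It assumes that each Dictionary Mapping row is held as a dictionary keyed by DM column name. The rows themselves, including the `kb:Height` class and the `??subject` and `??sample` objects, are hypothetical examples rather than content from a real SDD.

```python
# Attributes that mark the main identifier, as named in this section.
MAIN_ID_ATTRIBUTES = {"hasco:uriId", "hasco:originalID"}


def main_identifier_columns(dm_rows):
    """Return the dataset columns annotated as a main identifier."""
    return [
        row["Column"]
        for row in dm_rows
        if row.get("Attribute") in MAIN_ID_ATTRIBUTES
    ]


# Hypothetical Dictionary Mapping rows, for illustration only.
dm_rows = [
    {"Column": "id", "Attribute": "hasco:originalID", "attributeOf": "??subject"},
    {"Column": "sample_id", "Attribute": "sio:Identifier", "attributeOf": "??sample"},
    {"Column": "height", "Attribute": "kb:Height", "attributeOf": "??subject"},
]

main_ids = main_identifier_columns(dm_rows)
# Exactly one main identifier is the case that needs no cell scope.
has_single_main_id = len(main_ids) == 1
```

Here `sample_id` is a secondary identifier of type sio:Identifier, so only `id` counts as the main identifier.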

Codebook
The Codebook table contains possible values of coded variables and their associated labels. For variables with discrete values, when appropriate, we augment each possible value with mappings to corresponding concepts, as shown in the table below.

| Column | Code | Label | Class |
|---|---|---|---|
| race | 0 | white | kb:White |
| race | 1 | black | kb:AfricanAmerican |
| race | 2 | other | kb:OtherRace |
| smoke | 0 | no smoking | kb:NonSmoker |
| smoke | 1 | some smoking | kb:Smoker |
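One way to picture how a Codebook is used is as a lookup from a (column, code) pair to a label and a concept. The sketch below loads the example rows into a Python dictionary. The function name `build_codebook` and the choice to keep codes as strings are assumptions made for illustration.

```python
# Example Codebook rows from this section: (Column, Code, Label, Class).
codebook_rows = [
    ("race", "0", "white", "kb:White"),
    ("race", "1", "black", "kb:AfricanAmerican"),
    ("race", "2", "other", "kb:OtherRace"),
    ("smoke", "0", "no smoking", "kb:NonSmoker"),
    ("smoke", "1", "some smoking", "kb:Smoker"),
]


def build_codebook(rows):
    """Index Codebook rows by (column, code) for direct lookup."""
    codebook = {}
    for column, code, label, concept in rows:
        codebook[(column, code)] = (label, concept)
    return codebook


codebook = build_codebook(codebook_rows)
# Decode the raw value "1" found in the "smoke" column of a data file.
label, concept = codebook[("smoke", "1")]
```

A code with no Codebook entry, such as `("race", "3")`, is simply absent from the dictionary, so a caller can decide how to report it.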

Timeline

Customized time intervals can be specified in the Timeline sheet. Each entry can be annotated with its class and unit, the start and end times of an event, and a connection to the concepts that the entry may be related to. An example timeline is shown below.

| Name | Label | Type | Start | End | Unit | inRelationTo |
|---|---|---|---|---|---|---|
| ??visit1 | Visit 1 | kb:Visit | 10.1 | 19.9 | sio:Week | ??baseline |
| ??visit2 | Visit 2 | kb:Visit | 20.0 | 32.0 | sio:Week | ??baseline |
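The example timeline can be read as a set of intervals measured in weeks relative to `??baseline`. The Python sketch below finds the entry, if any, that covers a given time. It assumes that Start and End are inclusive bounds, which this page does not state, and the names `TimelineEntry` and `entry_for` are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    name: str
    label: str
    type: str
    start: float
    end: float
    unit: str
    in_relation_to: str


# The two example rows from the Timeline table.
TIMELINE = [
    TimelineEntry("??visit1", "Visit 1", "kb:Visit", 10.1, 19.9, "sio:Week", "??baseline"),
    TimelineEntry("??visit2", "Visit 2", "kb:Visit", 20.0, 32.0, "sio:Week", "??baseline"),
]


def entry_for(weeks_since_baseline: float) -> Optional[TimelineEntry]:
    """Return the first entry whose [start, end] interval contains the value."""
    for entry in TIMELINE:
        # Assumption: both bounds are inclusive.
        if entry.start <= weeks_since_baseline <= entry.end:
            return entry
    return None
```

For example, `entry_for(15.0)` falls in Visit 1, while a value of 5.0 weeks matches no entry and yields `None`.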

5.4.3. SDD Verification Rules

SDD-Level Rule 1

An SDD must contain an InfoSheet.
Consequence of breaking the rule: HADatAc will stop processing the SDD.
Error message: “The Info sheet is missing in this SDD file.”
Correction: Add an InfoSheet as the first sheet of the SDD file.

SDD-Level Rule 2

An SDD is expected to contain the following sheets: DataDictionary, CodeBook, TimeLine, and CodeMapping.
Consequence of breaking the rule: HADatAc will skip reading the missing sheet and generate an error message, but will not stop processing the SDD unless the missing sheet is the DataDictionary.
Error messages: “The CodeMapping is missing” or “The DataDictionary is missing.” or “The Codebook is missing.” or “The TimeLine is missing”
Correction: Check whether the missing sheet or sheets are needed. The DataDictionary must be included under all circumstances.

SDD-Level Rule 3

The InfoSheet cannot be empty.
Consequence of breaking the rule: HADatAc will stop processing the SDD.
Error message: “InfoSheet is empty”
Correction: Add the InfoSheet content.

SDD-Level Rule 4

The DataDictionary sheet in the SDD cannot be empty.
Consequence of breaking the rule: HADatAc will stop processing the SDD.
Error message: “The DataDictionary sheet is empty”
Correction: Check why the DataDictionary sheet is empty in the SDD file and add the correct content to it.
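SDD-Level Rules 1 to 4 can be summarized as a single pass over the sheets of an SDD. The Python sketch below is a simplified illustration and not HADatAc's implementation. It assumes the SDD is available as a dictionary from sheet name to a list of rows, and the sheet keys used here are illustrative. The error strings are the ones quoted in the rules above.

```python
# Sheets whose absence is reported but does not stop processing (Rule 2).
OPTIONAL_SHEET_ERRORS = {
    "CodeMapping": "The CodeMapping is missing",
    "Codebook": "The Codebook is missing.",
    "TimeLine": "The TimeLine is missing",
}


def check_sheets(sheets):
    """Return (errors, can_continue) for a mapping of sheet name -> rows."""
    # Rule 1: the InfoSheet must exist.
    if "InfoSheet" not in sheets:
        return ["The Info sheet is missing in this SDD file."], False
    # Rule 3: the InfoSheet cannot be empty.
    if not sheets["InfoSheet"]:
        return ["InfoSheet is empty"], False
    # Rule 2: a missing DataDictionary stops processing.
    if "DataDictionary" not in sheets:
        return ["The DataDictionary is missing."], False
    # Rule 4: the DataDictionary cannot be empty.
    if not sheets["DataDictionary"]:
        return ["The DataDictionary sheet is empty"], False
    # Rule 2: other missing sheets are reported, and processing continues.
    errors = [
        message
        for name, message in OPTIONAL_SHEET_ERRORS.items()
        if name not in sheets
    ]
    return errors, True


# Hypothetical SDD that has only the two mandatory sheets.
errors, can_continue = check_sheets({
    "InfoSheet": [["SDD_Name", "Name"]],
    "DataDictionary": [["Column", "Attribute"]],
})
```

In this example the SDD can still be processed, and `errors` lists the three sheets that were skipped.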

SDD-Level Rule 5

URIs in SDDs must be resolvable against the existing knowledge graph.
Consequence of breaking the rule: HADatAc will generate an error message but will not stop processing the SDD.
Error message: “The following URIs in the Dictionary Mapping are unresolvable: ”
Correction: Check whether the URI exists in its original source. Correct the URI if it is misspelled. Stop using the URI in the SDD if it does not exist.

SDD-Level Rule 6

Namespaces of URIs used in the SDD must be registered in HADatAc.
Consequence of breaking the rule: HADatAc will generate an error message but will not stop processing the SDD.
Error message: “The following namespaces in the Dictionary Mapping has unregistered namespace in cells: xxx”
Correction: If the namespace is misspelled, correct it. If the namespace is correct, verify that it is included in namespaces.properties.
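The idea behind the namespace rule can be sketched in a few lines of Python. The sketch assumes that cell values use the `prefix:LocalName` form seen in the tables of this page, and that the registered prefixes are already available as a set. How HADatAc reads namespaces.properties is not covered here, and the function name is illustrative.

```python
def unregistered_cells(cells, registered_prefixes):
    """Return the cell values whose namespace prefix is not registered."""
    bad = []
    for cell in cells:
        # Full URIs and plain text are out of scope for this sketch.
        if cell.startswith(("http://", "https://")) or ":" not in cell:
            continue
        prefix = cell.split(":", 1)[0]
        if prefix not in registered_prefixes:
            bad.append(cell)
    return bad


# Hypothetical cell values and registered prefixes.
cells = ["sio:Week", "kb:Visit", "foo:Bar", "Visit 1", "??baseline"]
bad_cells = unregistered_cells(cells, {"sio", "kb", "hasco"})
```

Only `foo:Bar` is reported, because `foo` is not among the registered prefixes.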

SDD-Level Rule 7

HADatAc special attributes are:

  • sio:TimeStamp
  • sio:TimeInstant
  • hasco:namedTime
  • hasco:originalID
  • hasco:uriId
  • hasco:hasMetaEntity
  • hasco:hasMetaEntityURI
  • hasco:hasMetaAttribute
  • hasco:hasMetaAttributeURI
  • hasco:hasMetaUnit
  • hasco:hasMetaUnitURI
  • sio:InRelationTo
  • hasco:hasLOD
  • hasco:hasCalibration
  • hasco:hasElevation
  • hasco:hasLocation

Non-special attributes are called ordinary attributes. Every ordinary attribute must have a path to a subclass of hasco:StudyIndicator.
Consequence of breaking the rule: HADatAc will generate an error message but will not stop processing the SDD.
Error message: “The Attributes: xxx is not associated with any hasco:StudyIndicator.”
Correction: Change the domain ontology to associate the attribute(s) with study indicators, or check whether the loaded ontology is complete.
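The distinction between special and ordinary attributes can be sketched as follows. This Python example treats the required path as a chain of superclass links, which is a simplifying assumption. The `superclasses` mapping and every `kb:` class in it are hypothetical, and the function names are illustrative.

```python
# The HADatAc special attributes listed in this rule.
SPECIAL_ATTRIBUTES = {
    "sio:TimeStamp", "sio:TimeInstant", "hasco:namedTime",
    "hasco:originalID", "hasco:uriId",
    "hasco:hasMetaEntity", "hasco:hasMetaEntityURI",
    "hasco:hasMetaAttribute", "hasco:hasMetaAttributeURI",
    "hasco:hasMetaUnit", "hasco:hasMetaUnitURI",
    "sio:InRelationTo", "hasco:hasLOD", "hasco:hasCalibration",
    "hasco:hasElevation", "hasco:hasLocation",
}


def reaches(start, superclasses, target):
    """Depth-first walk over superclass links, from start towards target."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(superclasses.get(current, ()))
    return False


def unassociated_attributes(attributes, superclasses):
    """Return ordinary attributes with no path to hasco:StudyIndicator."""
    return [
        attribute
        for attribute in attributes
        if attribute not in SPECIAL_ATTRIBUTES
        and not reaches(attribute, superclasses, "hasco:StudyIndicator")
    ]


# Hypothetical ontology fragment: class -> direct superclasses.
superclasses = {
    "kb:Height": ["kb:Anthropometry"],
    "kb:Anthropometry": ["hasco:StudyIndicator"],
}
missing = unassociated_attributes(
    ["hasco:originalID", "kb:Height", "kb:Mystery"], superclasses
)
```

Here `hasco:originalID` is skipped as a special attribute, `kb:Height` reaches hasco:StudyIndicator through `kb:Anthropometry`, and `kb:Mystery` is reported.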

SDD-Level Rule 8

DASAs and DASOs derived from an Excel SDD must be well-formed (for example, they must not contain non-ASCII characters).
Consequence of breaking the rule: HADatAc will generate an error message but will not stop processing the SDD.
Error message: “The Dictionary Mapping has incorrect content in : xxx”
Correction: Check whether the content of the indicated cells contains illegal characters.
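A quick way to locate offending cells before submitting an SDD is to scan for non-ASCII characters. The Python sketch below assumes that "illegal characters" means non-ASCII characters, and that cells are available as a dictionary keyed by a cell reference such as `C2`. Both are assumptions made for illustration.

```python
def non_ascii_cells(cells):
    """Return the references of cells that contain non-ASCII characters."""
    return [ref for ref, text in cells.items() if not text.isascii()]


# Hypothetical Dictionary Mapping cells; "C2" contains a non-ASCII letter.
cells = {"A2": "height", "B2": "kb:Height", "C2": "Höhe"}
flagged = non_ascii_cells(cells)
```

Characters such as typographic quotes pasted from a word processor would be flagged in the same way.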

5.4.4. Study-Level Verification Rules

Study-Level Rule 1

Each ordinary attribute in an SDD must have a path in the knowledge graph that connects it, directly or indirectly, to an object defined in the SSD that has a grounding label.
Consequence of breaking the rule: HADatAc will not be able to use the SDD for ingesting data.
Error message: “xxx has study object path : xxx - xxx - xxx”
“xxx has has no study object path !”
Correction: Check whether the indicated DASA in the DM is correctly defined with attribute-object relationships.
