-
Notifications
You must be signed in to change notification settings - Fork 24
5.4. Semantic Data Dictionary (SDD)
SDDs semantically describe the content of a tabular data file. Without an SDD, HADatAc does not know how to extract the content (both data and metadata) from a data file and to move it into a queryable data repository. For each column header listed in an SDD, the document minimally specifies the kind of attribute of that column, and the object that has that property. For example, the attribute may be Height and the object may be a human who is a subject in a given study.
It is important to note that SDDs are not necessarily designed for a single study, and that there are many examples of SDDs that can be reused multiple times. For instance, one SDD may be developed to describe the output of a sensor or instrument, and that same SDD can be used in any study that uses that same kind of sensor. Moreover, the SDD may be developed to describe data files produced by using an standardized questionnaire, and should be able to be used in any study that uses that standardized questionnaire.
An SDD is made up of five tables: InfoSheet, Dictionary Mapping, Code Mapping, Timeline, and Codebook.
Infosheet
The Infosheet is used to organize the other four tables named above. An example Infosheet template is shown below.
Attribute | Value |
---|---|
SDD_Name | Name |
Dictionary_Mappings | #DICT |
Codebook | #CODEBOOK |
Code_Mappings | #CODEMAPPING |
Timeline | #TIMELINE |
Imports | http://example.org/myontology/ |
We often encode SDDs using either a collection of Google Documents or a single Excel Spreadsheet. When using Google Docs, the values in the InfoSheet are URLs to other Google Docs containing the corresponding SDD specification. In Excel, we use the # sign followed with the name of the sheet within the Spreadsheet that contains the corresponding SDD specification.
Dictionary Mappings
The bulk of the SDD specification is done using the Dictionary Mapping Table, which is used to annotate the columns of a given dataset. The SDD Data Model (DM) Specification is shown below.
DM | Column | Description |
---|---|---|
Attribute | rdf:type | Class of attribute entry |
attributeOf | sio:isAttributeOf | Entity having the attribute |
Column | Entry | column header in dataset |
Comment | rdfs:comment | Comment for the entry |
Definition | skos:definition | Entry text definition |
Entity | rdf:type | Class of entity entry |
Format | Specifies the structure of the Unit value | |
inRelationTo | sio:inRelationTo | Entity that the role is linked to |
Label | rdfs:label | Label for the entry |
Relation | Custom relation that replaces inRelationTo | |
Role | sio:hasRole | Type of the role of the entry |
Time | sio:existsAt | Time point of measurement |
Unit | sio:hasUnit | Unit of Measure for entry |
wasDerivedFrom | prov:wasDerivedFrom | Entity from which the entry was derived |
wasGeneratedBy | prov:wasGeneratedBy | Activity from which the entry was produced |
The attributes hasco:uriId and hasco:originalID specify the main identifier of an SDD, which can only have one main identifier. If an SDD has more than one identifier and one of them can be identified as the main one, the other identifiers can be of type sio:Identifier. In this case, it is assumed that the other attributes of the SDD are either of the same attributeOf of the main identifier or they are attributes of objects identified in the SDD and connected to the object behind the main identifier.
The use of cell scope in the OAS file is required when an SDD has more than one identifiers and none of them can be specified as the main one.
Codebook
The Codebook table contains possible values of coded variables and their associated labels. For variables with discrete values, when appropriate, we augment each possible value with mappings to corresponding concepts, as shown in the table below.
Column | Code | Label | Class |
---|---|---|---|
race | 0 | white | kb:White |
race | 1 | black | kb:AfricanAmerican |
race | 2 | other | kb:OtherRace |
smoke | 0 | no smoking | kb:NonSmoker |
smoke | 1 | some smoking | kb:Smoker |
Timeline
Customized time intervals can be specified in the Timeline sheet, which can be used to annotate the corresponding class and unit related to a given entry, as well start and end times of an event, and a connection to concepts that the entry may be related to. An example timeline is shown below.
Name | Label | Type | Start | End | Unit | inRelationTo |
---|---|---|---|---|---|---|
??visit1 | Visit 1 | kb:Visit | 10.1 | 19.9 | sio:Week | ??baseline |
??visit2 | Visit 2 | kb:Visit | 20.0 | 32.0 | sio:Week | ??baseline |
Semantic Data Dictionary (SDD) must contain an info sheet
Consequence of breaking the rule: HADatAc will stop processing the SDD
Error message: “The Info sheet is missing in this SDD file.”
Correction: The user should add an info sheet as the first sheet into the SDD file.
If SDD contain the following sheets: DataDictionary, CodeBook, TimeLine, and CodeMapping
Consequence of breaking the rule: HADatAc will skip reading the File and generate an error message but will not stop processing the SDD except for missing dictionaryFile.
Error messages: “The CodeMapping is missing” or “The DataDictionary is missing.” or “The Codebook is missing.” or ”The TimeLine is missing”
Correction: The user should check if the missing sheet(s) is(are) needed. The DataDictionary must be included under any circumstance.
InfoSheet cannot be empty
Consequence of breaking the rule: HADatAc will stop processing the SDD
Error messages: “InfoSheet is empty”
Correction: Add the info sheet content.
The DataDictionary sheet in the SDD cannot be empty
Consequence of breaking the rule: HADatAc will stop processing the SDD
Error messages: “The DataDictionary sheet is empty”
Correction: The user should check why the DataDictionary is empty in the SDD file and add correct content into data dictionary sheet.
URIs in SDDs must be resolvable against existing knowledge graph
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “The following URIs in the Dictionary Mapping are unresolvable: "
Correction: Identify if the URI is available in its original content. Correct the URI if it is misspelled. Stop using the URI in the SDD if it does not exist.
Namespaces of URIs used in the SDD must be registered in HADatAc
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “The following namespaces in the Dictionary Mapping has unregistered namespace in cells: xxx”
Correction: If the namespace is misspelled, correct it. If the namespace is correct, verify if it is included in namespaces.properties
HADatAc special attributes are:
- sio:TimeStamp
- sio:TimeInstant
- hasco:namedTime
- hasco:originalID
- hasco:uriId
- hasco:hasMetaEntity
- hasco:hasMetaEntityURI
- hasco:hasMetaAttribute
- hasco:hasMetaAttributeURI
- hasco:hasMetaUnit
- hasco:hasMetaUnitURI
- sio:InRelationTo
- hasco:hasLOD
- hasco:hasCalibration
- hasco:hasElevation
- hasco:hasLocation
Non-special attributes are called ordinary attributes. Every ordinary attribute must have a path to a subclass of hasco StudyIndicator
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “"The Attributes: xxx is not associated with any hasco:StudyIndicator.”
Correction: Change the domain ontology to associate the attribute(s) to study indicators, or check if the ontology loaded is complete
if DASAs and DASOS derived from an Excel SDD are well-formed (constains non ASCII chars)
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “The Dictionary Mapping has incorrect content in : xxx”
Correction: check if the content in the indicated cells contain illegal characters
Each ordinary attribute in an SDD must have a path in the knowledge graph directly or indirectly connecting it to an object defined in the SSD that has a grounding label
Consequence of breaking the rule: HADatAc will not be able to use the SDD for ingesting data
Error message: “xxx has study object path : xxx - xxx - xxx”
“xxx has has no study object path !”
Correction: check if the indicated DASA in the DM is correctly defined with attribute-object relationships
Copyright (c) 2019, HADatAc.org
-
Installation
1.1. Installing for Linux (Production)
1.2. Installing for Linux (Development)
1.3. Installing for MacOS (Development)
1.4. Deploying with Docker (Production)
1.5. Deploying with Docker (Development)
1.6. Installing for Vagrant under Windows
1.7. Upgrading
1.8. Starting HADatAc
1.9. Stopping HADatAc -
Setting Up
2.1. Software Configuration
2.2. Knowledge Graph Bootstrap
2.2.1. Knowledge Graph
2.2.2. Bootstrap without Labkey
2.2.3. Bootstrap with Labkey
2.3. Config Verification -
Using HADatAc
3.1. Initial Page
3.1.1. Home Button
3.1.2. Sandbox Mode Button
3.2. File Ingestion
3.2.1. Ingesting Study Content
3.2.2. Manual Submission of Files
3.2.3. Automatic Submission of Files
3.2.4. Data File Operations
3.3. Manage Working Files 3.3.1. [Create Empty Semantic File from Template]
3.3.2. SDD Editor
3.3.3. DD Editor
3.4. Manage Metadata
3.4.1. Manage Instrument Infrastructure
3.4.2. Manage Deployments 3.4.3. Manage Studies
3.4.4. [Manage Object Collections]
3.4.5. Manage Streams
3.4.6. Manage Semantic Data Dictionaries
3.4.7. Manage Indicators
3.5. Data Search
3.5.1. Data Faceted Search
3.5.2. Data Spatial Search
3.6. Metadata Browser and Search
3.7. Knowledge Graph Browser
3.8. API
3.9. Data Download -
Software Architecture
4.1. Software Components
4.2. The Human-Aware Science Ontology (HAScO) -
Metadata Files
5.1. Deployment Specification (DPL)
5.2. Study Specification (STD)
5.3. Semantic Study Design (SSD)
5.4. Semantic Data Dictionary (SDD)
5.5. Stream Specification (STR) -
Content Evolution
6.1. Namespace List Update
6.2. Ontology Update
6.3. [DPL Update]
6.4. [SSD Update]
6.5. SDD Update -
Data Governance
7.1. Access Network
7.2. User Status, Categories and Access Permissions
7.3. Data and Metadata Privacy - HADatAc-Supported Projects
- Derived Products and Technologies
- Glossary