5.4. Semantic Data Dictionary (SDD)

5.4.1. When to use an SDD

SDDs semantically describe the content of a tabular data file. Without an SDD, HADatAc does not know how to extract the content (both data and metadata) from a data file and to move it into a queryable data repository. For each column header listed in an SDD, the document minimally specifies the kind of attribute of that column, and the object that has that property. For example, the attribute may be Height and the object may be a human who is a subject in a given study.

It is important to note that SDDs are not necessarily designed for a single study, and that there are many examples of SDDs that can be reused multiple times. For instance, one SDD may be developed to describe the output of a sensor or instrument, and that same SDD can be used in any study that uses that same kind of sensor. Moreover, the SDD may be developed to describe data files produced by using an standardized questionnaire, and should be able to be used in any study that uses that standardized questionnaire.

5.4.2. How to Build an SDD

An SDD is made up of five tables: InfoSheet, Dictionary Mapping, Code Mapping, Timeline, and Codebook.

Infosheet

The Infosheet is used to organize the other four tables named above. An example Infosheet template is shown below.

Attribute	Value
SDD_Name	Name
Dictionary_Mappings	#DICT
Codebook	#CODEBOOK
Code_Mappings	#CODEMAPPING
Timeline	#TIMELINE
Imports	http://example.org/myontology/

We often encode SDDs using either a collection of Google Documents or a single Excel Spreadsheet. When using Google Docs, the values in the InfoSheet are URLs to other Google Docs containing the corresponding SDD specification. In Excel, we use the # sign followed with the name of the sheet within the Spreadsheet that contains the corresponding SDD specification.

Dictionary Mappings

The bulk of the SDD specification is done using the Dictionary Mapping Table, which is used to annotate the columns of a given dataset. The SDD Data Model (DM) Specification is shown below.

DM	Column	Description
Attribute	rdf:type	Class of attribute entry
attributeOf	sio:isAttributeOf	Entity having the attribute
Column	Entry	column header in dataset
Comment	rdfs:comment	Comment for the entry
Definition	skos:definition	Entry text definition
Entity	rdf:type	Class of entity entry
inRelationTo	sio:inRelationTo	Entity that the role is linked to
Label	rdfs:label	Label for the entry
Relation		Custom relation that replaces inRelationTo
Role	sio:hasRole	Type of the role of the entry
Time	sio:existsAt	Time point of measurement
Unit	sio:hasUnit	Unit of Measure for entry
wasDerivedFrom	prov:wasDerivedFrom	Entity from which the entry was derived
wasGeneratedBy	prov:wasGeneratedBy	Activity from which the entry was produced

The attributes hasco:uriId and hasco:originalID specify the main identifier of an SDD, which can only have one main identifier. If an SDD has more than one identifier and one of them can be identified as the main one, the other identifiers can be of type sio:Identifier. In this case, it is assumed that the other attributes of the SDD are either of the same attributeOf of the main identifier or they are attributes of objects identified in the SDD and connected to the object behind the main identifier.

The use of cell scope in the OAS file is required when an SDD has more than one identifiers and none of them can be specified as the main one.

Codebook
The Codebook table contains possible values of coded variables and their associated labels. For variables with discrete values, when appropriate, we augment each possible value with mappings to corresponding concepts, as shown in the table below.

Column	Code	Label	Class
race	0	white	kb:White
race	1	black	kb:AfricanAmerican
race	2	other	kb:OtherRace
smoke	0	no smoking	kb:NonSmoker
smoke	1	some smoking	kb:Smoker

Timeline
Customized time intervals can be specified in the Timeline sheet, which can be used to annotate the corresponding class and unit related to a given entry, as well start and end times of an event, and a connection to concepts that the entry may be related to. An example timeline is shown below.

Name	Label	Type	Start	End	Unit	inRelationTo
??visit1	Visit 1	kb:Visit	10.1	19.9	sio:Week	??baseline
??visit2	Visit 2	kb:Visit	20.0	32.0	sio:Week	??baseline

5.4.3. SDD Verification Rules

SDD-Level Rule 1

Semantic Data Dictionary (SDD) must contain an info sheet
Consequence of breaking the rule: HADatAc will stop processing the SDD
Error message: “The Info sheet is missing in this SDD file.”
Correction: The user should add an info sheet as the first sheet into the SDD file.

SDD-Level Rule 2

If SDD contain the following sheets: DataDictionary, CodeBook, TimeLine, and CodeMapping
Consequence of breaking the rule: HADatAc will skip reading the File and generate an error message but will not stop processing the SDD except for missing dictionaryFile. Error messages: “The CodeMapping is missing” or “The DataDictionary is missing.” or “The Codebook is missing.” or ”The TimeLine is missing”
Correction: The user should check if the missing sheet(s) is(are) needed. The DataDictionary must be included under any circumstance.

SDD-Level Rule 3

InfoSheet cannot be empty
Consequence of breaking the rule: HADatAc will stop processing the SDD
Error messages: “InfoSheet is empty”
Correction: Add the info sheet content.

SDD-Level Rule 4

The DataDictionary sheet in the SDD cannot be empty
Consequence of breaking the rule: HADatAc will stop processing the SDD
Error messages: “The DataDictionary sheet is empty”
Correction: The user should check why the DataDictionary is empty in the SDD file and add correct content into data dictionary sheet.

SDD-Level Rule 5

URIs in SDDs must be resolvable against existing knowledge graph
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “The following URIs in the Dictionary Mapping are unresolvable: "
Correction: Identify if the URI is available in its original content. Correct the URI if it is misspelled. Stop using the URI in the SDD if it does not exist.

SDD-Level Rule 6

Namespaces of URIs used in the SDD must be registered in HADatAc
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “The following namespaces in the Dictionary Mapping has unregistered namespace in cells: xxx”
Correction: If the namespace is misspelled, correct it. If the namespace is correct, verify if it is included in namespaces.properties

SDD-Level Rule 7

HADatAc special attributes are:

sio:TimeStamp
sio:TimeInstant
hasco:namedTime
hasco:originalID
hasco:uriId
hasco:hasMetaEntity
hasco:hasMetaEntityURI
hasco:hasMetaAttribute
hasco:hasMetaAttributeURI
hasco:hasMetaUnit
hasco:hasMetaUnitURI
sio:InRelationTo
hasco:hasLOD
hasco:hasCalibration
hasco:hasElevation
hasco:hasLocation

Non-special attributes are called ordinary attributes. Every ordinary attribute must have a path to a subclass of hasco StudyIndicator
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “"The Attributes: xxx is not associated with any hasco:StudyIndicator.”
Correction: Change the domain ontology to associate the attribute(s) to study indicators, or check if the ontology loaded is complete

SDD-Level Rule 8

if DASAs and DASOS derived from an Excel SDD are well-formed (constains non ASCII chars)
Consequence of breaking the rule: HADatAc will generate an error message and will not stop processing the SDD
Error message: “The Dictionary Mapping has incorrect content in : xxx”
Correction: check if the content in the indicated cells contain illegal characters

5.4.4 Study-level Verification Rules

Study-Level Rule 1

Each ordinary attribute in an SDD must have a path in the knowledge graph directly or indirectly connecting it to an object defined in the SSD that has a grounding label
Consequence of breaking the rule: HADatAc will not be able to use the SDD for ingesting data
Error message: “xxx has study object path : xxx - xxx - xxx”
“xxx has has no study object path !”
Correction: check if the indicated DASA in the DM is correctly defined with attribute-object relationships

Data Owner Guide

Installation
1.1. Installing for Linux (Production)
1.2. Installing for Linux (Development)
1.3. Installing for MacOS (Development)
1.4. Deploying with Docker (Production)
1.5. Deploying with Docker (Development)
1.6. Installing for Vagrant under Windows
1.7. Upgrading
1.8. Starting HADatAc
1.9. Stopping HADatAc
Setting Up
2.1. Software Configuration
2.2. Knowledge Graph Bootstrap
2.2.1. Knowledge Graph
2.2.2. Bootstrap without Labkey
2.2.3. Bootstrap with Labkey
2.3. Config Verification
Using HADatAc
3.1. Initial Page
3.1.1. Home Button
3.1.2. Sandbox Mode Button
3.2. File Ingestion
3.2.1. Ingesting Study Content
3.2.2. Manual Submission of Files
3.2.3. Automatic Submission of Files
3.2.4. Data File Operations
3.3. Manage Working Files 3.3.1. [Create Empty Semantic File from Template]
3.3.2. SDD Editor
3.3.3. DD Editor
3.4. Manage Metadata
3.4.1. Manage Instrument Infrastructure
3.4.2. Manage Deployments 3.4.3. Manage Studies
3.4.4. [Manage Object Collections]
3.4.5. Manage Streams
3.4.6. Manage Semantic Data Dictionaries
3.4.7. Manage Indicators
3.5. Data Search
3.5.1. Data Faceted Search
3.5.2. Data Spatial Search
3.6. Metadata Browser and Search
3.7. Knowledge Graph Browser
3.8. API
3.9. Data Download
Software Architecture
4.1. Software Components
4.2. The Human-Aware Science Ontology (HAScO)
Metadata Files
5.1. Deployment Specification (DPL)
5.2. Study Specification (STD)
5.3. Semantic Study Design (SSD)
5.4. Semantic Data Dictionary (SDD)
5.5. Stream Specification (STR)
Content Evolution
6.1. Namespace List Update
6.2. Ontology Update
6.3. [DPL Update]
6.4. [SSD Update]
6.5. SDD Update
Data Governance
7.1. Access Network
7.2. User Status, Categories and Access Permissions
7.3. Data and Metadata Privacy
HADatAc-Supported Projects
Derived Products and Technologies
Glossary

Provide feedback

Saved searches

Use saved searches to filter your results more quickly