-
Notifications
You must be signed in to change notification settings - Fork 9
2. Enriching the Data
In his Ted talk Sir Tim Berners-Lee underlines the importance of connecting data. Adapting the Metadata focussed on changing the metadata.json
. In Enriching the Data you learn to expand the metadata.
- Virtual Columns explains how to add data concerning the columns.
- Adding Provenance shows how to add information regarding:
- Dates demonstrates how to add and format dates.
- Geographical Information illustrates how to include latitude and longitude to map locations.
- If Statement can change certain data.
Adapting the Metadata added information regarding the rows of the CSV file. The example described the relationship between the neighbourhoods (rows) and the average number of maids (columns). So far the neighbourhoods or rows held the subject (?s
) position. Virtual columns instead refer to columns in the subject position.
A virtual column resembles existing columns in the JSON schema file. In the metadata.json
file search for the list of "columns"
, then start a new set of parenthesis to create a virtual column.
The following virtual column relates the average number of maids ("Dienstboden") column to an occupational category in the PST scheme, namely that of 'housemaid' with code '5,25,1,3'.
{
"virtual": true,
"datatype": "string",
"aboutUrl": "https://iisg.amsterdam/buurt.csv/column/Dienstboden",
"propertyUrl": "https://iisg.amsterdam/vocab/PST/recon",
"csvw:value": "5,25,1,3"
},
The results of the JSON schema are triples like:
<https://iisg.amsterdam/buurt.csv/column/Dienstboden> <https://iisg.amsterdam/vocab/PST/recon> "5,25,1,3"^^<http://www.w3.org/2001/XMLSchema#string>
?s ?p ?o
Another notation for triples is ?s ?p ?o
. For clarity, we've added this notation to the example. The notation will not show in your example files.
For advanced features of virtual columns, use Jinja. All values within a column can become virtual columns. Change the "aboutUrl":
of a virtual column according to the {_row}
example here.
Try creating a virtual column yourself. Download the example csv file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv
file. Next, follow these steps:
- Upload the example csv file:
cow_tool build buurt.csv
The tool generates a -metadata.json
file in the folder where you saved the example file.
- Open and edit the metadata file. In the list of
"columns":[
add the following virtual column:
},
{
"virtual": true,
"datatype": "string",
"aboutUrl": "https://iisg.amsterdam/buurt.csv/column/Dienstboden",
"propertyUrl": "https://iisg.amsterdam/vocab/PST/recon",
"csvw:value": "5,25,1,3"
}
]
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq
file. In this file you should find a triple with subject <https://iisg.amsterdam/buurt.csv/column/Dienstboden>
.
Provenance information is of key importance for historians and other researchers by extension. Solid source criticism requires information about
- the publisher of the dataset
- the creator of the data
- references to the origin of the data
By default CoW only includes the publisher of the dataset in the JSON schema.
"dc:publisher": {
"schema:name": "CLARIAH Structured Data Hub - Datalegend",
"schema:url": {
"@id": "http://datalegend.net"
}
},
Any information about the publisher is found under "dc:publisher"
. The commonly used vocabulary to denote the publisher in Linked Data is DCMI Metadata terms of the Dublin Core Metadata Initiative. The resulting triples contain the following predicates ?p
and objects ?o
:
<http://schema.org/name> "CLARIAH Structured Data Hub - Datalegend"
<http://schema.org/url> <http://datalegend.net/>
?p ?o
While the publisher makes the dataset available, the creator is the person, organization, or service who created the dataset.
"dc:creator": [{
"@id": "sdp:some.chap@example.net",
"rdf:type": [{
"@id": "foaf:Person"
}, {
"@id": "prov:Person"
}],
"owl:sameAs": [{
"@id": "http://orcid.org/xxxx-xxxx-xxxx-xxxx"
}, {
"@id": "info:eu-repo/dai/cc/1234"
}, {
"@id": "http://isni.org/isni/00000001234"
}],
"foaf:name": "Some Chap",
"foaf:givenName": "Some",
"foaf:familyName": "Chap",
"foaf:mbox": {
"@id": "mailto:some.chap@example.net"
}
}],
A single person can be identified in many ways. The creator is first found in the object (?o
) position. The "@id"
of the creator is an e-mail address that will usually show as ?s
or subject. Two "rdf:type"
definitions determine that the creator is a "Person"
according to the foaf and prov vocabularies. The "owl:sameAs"
predicate ?p
indicates that the creator is also identified by another "@id"
. These "@id"
's are usually links to profile pages on other websites. Finally, the "foaf:"
vocabulary is used to state the full name
, first name as givenName
, and familyName
, together with their e-mail address as mbox
.
The results of the JSON schema are the following triples:
<https://example.com/id/buurt.csv> <http://purl.org/dc/terms/creator> <sdp:some.chap@example.net>
?s ?p ?o
<sdp:some.chap@example.net> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>
<sdp:some.chap@example.net> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Person>
?s ?p ?o
<sdp:some.chap@example.net> <http://www.w3.org/2002/07/owl#sameAs> <http://isni.org/isni/00000001234>
<sdp:some.chap@example.net> <http://www.w3.org/2002/07/owl#sameAs> <http://orcid.org/xxxx-xxxx-xxxx-xxxx>
<sdp:some.chap@example.net> <http://www.w3.org/2002/07/owl#sameAs> <info:eu-repo/dai/cc/1234>
?s ?p ?o
<sdp:some.chap@example.net> <http://xmlns.com/foaf/0.1/name> "Some Chap"
<sdp:some.chap@example.net> <http://xmlns.com/foaf/0.1/familyName> "Chap"
<sdp:some.chap@example.net> <http://xmlns.com/foaf/0.1/givenName> "Some"
<sdp:some.chap@example.net> <http://xmlns.com/foaf/0.1/mbox> <mailto:some.chap@example.net>
?s ?p ?o
Note: Another notation for triples is ?s ?p ?o
. For clarity, we've added this notation to the example. The notation will not show in your example files. In addition, the order of the resulting triples in this example differs from the order in the .nq
-file.
Aside from the publisher and creator, datasets are often derived from other sources. The dataset can be the result of an update, or it can be a new dataset based on a book or other source.
"prov:wasDerivedFrom": [{
"@id": "urn:isbn:9789058671967",
"rdf:type": {
"@id": "bibo:Book"
},
"dc:title": {
"@value": "Historical International Standard Classification of Occupations",
"@lang": "en"
},
"schema:author": ["Marco van Leeuwen", "Ineke Maas", "Andrew Miles"],
"dc:publisher": "Leuven University Press",
"dc:date": {"@value":"2002","@type":"xsd:Year"}
}, {
"@id": "http://historyofwork.iisg.nl"
}],
In the above example the dataset is derived from a website, and based on a book. The "@id"
is set to a "urn:isbn"
, a common uniform resource name for books with an international standard book number. The "rdf:type"
uses the Bibliographic Ontology bibo to set the data type to book. The "dc:title"
predicate ?p
lists two objects ?o
: the "@value"
and the "@lang"
. The value of the title refers to the book title, and the language of the title is set to "en" or English. The "schema:author"
lists all authors, and "dc:publisher"
provides the name of the book publisher. The "dc:date"
specifies the "@value" and the "@type". This means that the object ?o
of the date is a value "2002" with a datatype "xsd:Year" attached:
<https://example.com/id/buurt.csv> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://historyofwork.iisg.nl/>
<https://example.com/id/buurt.csv> <http://www.w3.org/ns/prov#wasDerivedFrom> <urn:isbn:9789058671967>
?s ?p ?o
<urn:isbn:9789058671967> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/Book>
?s ?p ?o
<urn:isbn:9789058671967> <http://purl.org/dc/terms/title> "Historical International Standard Classification of Occupations"
?s ?p ?o
<urn:isbn:9789058671967> <http://schema.org/author> "Ineke Maas"
<urn:isbn:9789058671967> <http://schema.org/author> "Marco van Leeuwen"
<urn:isbn:9789058671967> <http://schema.org/author> "Andrew Miles"
?s ?p ?o
<urn:isbn:9789058671967> <http://purl.org/dc/terms/publisher> "Leuven University Press"
?s ?p ?o
<urn:isbn:9789058671967> <http://purl.org/dc/terms/date> "2002"^^<http://www.w3.org/2001/XMLSchema#Year>
?s ?p ?o
Note: Another notation for triples is ?s ?p ?o
. For clarity, we've added this notation to the example. The notation will not show in your example files. In addition, the order of the resulting triples in this example differs from the order in the .nq
-file.
Try adding provenance yourself. Download the example csv file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csv
file. Next, follow these steps:
- Upload the example csv file:
cow_tool build buurt.csv
The tool generates a -metadata.json
file in the folder where you saved the example file.
- Open and edit the metadata file. Adapt the following code by adding yourself as creator and a colleague as co-creator. Keep the same reference.
"dc:publisher": {
"schema:name": "CLARIAH Structured Data Hub - Datalegend",
"schema:url": {
"@id": "http://datalegend.net"
}
},
"dc:creator": [{
"@id": "sdp:some.chap@example.net",
"rdf:type": [{
"@id": "foaf:Person"
}, {
"@id": "prov:Person"
}],
"owl:sameAs": [{
"@id": "http://orcid.org/xxxx-xxxx-xxxx-xxxx"
}, {
"@id": "info:eu-repo/dai/cc/1234"
}, {
"@id": "http://isni.org/isni/00000001234"
}],
"foaf:name": "Some Chap",
"foaf:givenName": "Some",
"foaf:familyName": "Chap",
"foaf:mbox": {
"@id": "mailto:some.chap@example.net"
}
}, {
"@id": "sdp:some.gal@example.net",
"rdf:type": [{
"@id": "foaf:Person"
}, {
"@id": "prov:Person"
}],
"owl:sameAs": {
"@id": "http://orcid.org/xyxy-xyxy-xyxy-xyxy"
},
"foaf:name": "Some Gal",
"foaf:familyName": "Gal",
"foaf:givenName": "Some",
"foaf:mbox": {
"@id": "mailto:some.gal@example.net"
}
}],
"prov:wasDerivedFrom": [{
"@id": "urn:isbn:9789058671967",
"rdf:type": {
"@id": "bibo:Book"
},
"dc:title": {
"@value": "Historical International Standard Classification of Occupations",
"@lang": "en"
},
"schema:author": ["Marco van Leeuwen", "Ineke Maas", "Andrew Miles"],
"dc:publisher": "Leuven University Press",
"dc:date": {"@value":"2002","@type":"xsd:Year"}
}, {
"@id": "http://historyofwork.iisg.nl"
}],
- Create the Linked Data file with the following command:
cow_tool convert buurt.csv
The end result is an .nq
file. Try to locate each of the triples determined by the JSON schema using the search function for the .nq
file.
Dates are often recorded by historians and researchers. The datatype xsd:date
formats complete dates (YYYY-MM-DD).
The example below adds a virtual column with death dates based on three different columns (Year_Death, Month_Death, Day_Death).
{
"virtual": "true",
"datatype": "xsd:date",
"propertyUrl": "schema:deathDate",
"dc:description": "Date of death",
"csvw:value": "{{['%04d'| format(Year_Death|int),'-','%02d' | format(Month_Death|int),'-','%02d'|format(Day_Death|int)]|join}}"
}
The object csvw:value
builds on Jinja. The code joins several elements between []
into a single date string (YYYY-MM-DD). One element such as '%04d'| format(Year_Death|int)
specifies the number of digits ('%04d') to be included from the Year_Death column. The numbers in the CSV column are formatted as integers (int). Each element of the date is separated by '-'.
Buildings, places, regions, countries, and other spaces are recorded in two ways. The geographical information to mark the position of a building is described as a point, with longitude and latitude. Whereas a polygon or a series of points describes the shape of a country.
Geographical information in Linked Data can be plotted on a map. Plotting Linked Data is done using the SPARQL Protocol And RDF Query Language (SPARQL). Only certain SPARQL editors such as Yasgui can handle geographical information.
The following data example describes the location of two cinemas.
cinema;latitude;longitude
Flora;52.366097;4.899298
Paleis voor Volksvlijt;52.360018;4.89978
The simplest notation for a location is as a single triple, where the subject (?s
) has a geometry (?p
) of a point or shape (?o
).
<https://iisg.amsterdam/atm/cinemas/0> <http://www.opengis.net/ont/geosparql#asWKT> "POINT(4.893937 52.353038)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>
?s ?p ?o
Another notation for triples is ?s ?p ?o
. For clarity, we've added this notation to the example. The notation will not show in your example files.
Cinema 'Flora' in the first row (row 0) has a geographical representation of data type 'wktLiteral' (a point) with value (4.893937 52.353038).
Objects are rarely captured with a single geographical representation. The point for cinema Flora could be based on the location of the entrance, or the building's shape centroid. Even with polygons based on blue prints, the location could be represented differently over time when the building expanded. When paper maps are digitized, there are often slight errors in georeferencing and shape file creation. The result are multiple improved versions of the same blueprint.
Instead of linking multiple shapes directly to a single object, create a new object linked to both the building and the geographical representation (point or polygon). This approach allows you to filter only the geographical representation that is useful to you.
<https://iisg.amsterdam/atm/cinemas/0> <geo:hasGeometry> <https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000>
<https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000><http://www.opengis.net/ont/geosparql#asWKT> "POINT(4.893937 52.353038)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>
First create a new id '83000' and then attach the point-representation of cinema Flora to it. The 'geo:hasGeometry' property allows you to query all the geometry properties for this cinema. The provenance for the point representation can also be attached to id '83000'. Remember to create unique id's, otherwise geographic representations of other cinema's or even countries might be linked incorrectly.
Try adding geographical information yourself. Copy this example and save it as a csv file.
cinema;latitude;longitude
Flora;52.366097;4.899298
Paleis voor Volksvlijt;52.360018;4.89978
Copy the file path, and switch to your terminal. Move to the folder where you saved the cinema.csv
file. Next, follow these steps:
- Upload the example csv file:
cow_tool build cinema.csv
The tool generates a -metadata.json
file in the folder where you saved the example file.
- Open and edit the metadata file. By default, the row number serves as the identifier of the cinemas.
"tableSchema": {
"aboutUrl": "atm/cinemas/{_row}",
First, create virtual columns to represent latitude and longitude.
{
"virtual": true,
"propertyUrl": "schema:latitude",
"csvw:value": "{latitude}"
},
{
"virtual": true,
"propertyUrl": "schema:longitude",
"csvw:value": "{longitude}"
},
Second, create the geographic representation in Linked Data as follows:
{
"virtual": true,
"aboutUrl": "atm/cinemas/{_row}",
"propertyUrl": "geo:hasGeometry",
"valueUrl": "atm/cinemas/.well-known/genid/{{(83000 + _row)}}"
},
{
"virtual": true,
"datatype": "geosparql:wktLiteral",
"aboutUrl": "atm/cinemas/.well-known/genid/{{(83000 + _row)}}",
"propertyUrl": "geosparql:asWKT",
"csvw:value": "{{'POINT(' + longitude + ' ' + latitude + ')'}}"
},
Note: To ensure unique numbers, the arbitrary number '83000' is added. The number is derived from the number of rows in the original CSV file * 1000. It cannot be a random number, because the number needs to match the second virtual column. With large numbers the line number from the original file is still visible. A cinema on line 90, would then have genid 83090.
- Create the Linked Data file with the following command:
cow_tool convert cinema.csv
The end result is an .nq
file.
- Query the data in a SPARQL editor such as Yasgui.
To retrieve all available points and polygons for
?s
try:
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
?s geo:hasGeometry/asWKT ?point
While a query for a specific geometry could be:
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
?s geo:hasGeometry <https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000>
<https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000>asWKT ?point
To modify certain values of a CSV file during the conversion, use a conditional statement. When the if
condition is true, the data will change according to the first instruction. When the condition is false, an else
statement specifies a different change.
The example below converts the values of the "male" column to URIs instead of 0 (female) or 1 (male).
{
"name": "male",
"datatype": "string",
"@id": "https://iisg.amsterdam/cow_person_example.csv/column/male",
"dc:description": "The state of being male or female",
"titles": ["male"],
"propertyUrl": "sdmx-code:sex",
"valueUrl": "sdmx-code:{% if male == '0' %}sex-F{% else %}sex-M{% endif %}"
},
The conditional statement for the object or valueUrl
is written in Jinja. First, if the male
column contains "0", then the "sdmx-code" URI ends in "sex-F". Otherwise, "sex-M" is added to the "sdmx-code" URI.
The result of the JSON schema are triples containing the following predicate ?p
and objects ?o
:
<http://purl.org/linked-data/sdmx/2009/code#sex> <http://purl.org/linked-data/sdmx/2009/code#sex-F>
<http://purl.org/linked-data/sdmx/2009/code#sex> <http://purl.org/linked-data/sdmx/2009/code#sex-M>
?p ?o
A conditional statement in Jinja is placed between {% %}
because the conditions themselves should not be printed, only the outcome. The if
statement checks whether the value of a variable (the cell of the male column) equals ==
the string '0'. Only then will the text following the statement be added. When the if
statement is false, the text after the {% else %}
statement is used. A conditional statement ends with {% endif %}
.
Try adding an if
statement yourself. Download the example csv file. Copy the file path, and switch to your terminal. Move to the folder where you saved the cow_person_example.csv
file. Next, follow these steps:
- Upload the example csv file:
cow_tool build cow_person_example.csv
The tool generates a -metadata.json
file in the folder where you saved the example file.
- Open and edit the metadata file. Remember to change the base URI.
"@base": "https://iisg.amsterdam/resource/cow_person_example/"
Instead of the row number, the subject should contain the personID.
"aboutUrl": "sex/{personID}",
Finally, change the "male" column to the sdmx-code:sex
URIs using an if
statement.
{
"name": "male",
"datatype": "string",
"@id": "https://iisg.amsterdam/cow_person_example.csv/column/male",
"dc:description": "The state of being male or female",
"titles": ["male"],
"propertyUrl": "sdmx-code:sex",
"valueUrl": "sdmx-code:{% if male == '0' %}sex-F{% else %}sex-M{% endif %}"
},
- Create the Linked Data file with the following command:
cow_tool convert cow_person_example.csv
The end result is an .nq
file. In this file you should find the following triple for the person with personID 12:
<https://iisg.amsterdam/resource/cow_person_example/cowgirls/012> <http://purl.org/linked-data/sdmx/2009/code#sex> <http://purl.org/linked-data/sdmx/2009/code#sex-F>
Next: 3. Additional Features
CoW is built on QBer, part of the datalegend ecosystem for historical statistics. For more tools, datasets and infrastructure please visit https://datalegend.net. datalegend is a work package within CLARIAH a back-to-back NWO funded large research facility (#184.033.101)