Serene is a data integration platform designed to provide semantic matching across heterogeneous relational data stores.
You will need sbt to build and run the platform. On macOS:
brew install sbt
On Debian Linux:
sudo apt-get install sbt
You can build the library with
bin/build
This should build the Serene server and place the final jar into the jars directory.
Alternatively, build with sbt directly:
sbt assembly
To start the web server use
bin/server-start
The following command-line options are available:
--storage-path <value> Storage Path determines the directory in which to store all files and objects
--host <value> Server host address (default 127.0.0.1)
--port <value> Server port number (default 8080)
--help Prints this usage text
Alternatively, you can start the server through sbt with
sbt run
with arguments in quotes e.g.
sbt "run --port 8888"
Additional configuration is available in application.conf, specifically for the initialization of Spark.
The API can be used with the following commands...
By default the server will run on localhost, port 8080. This can be changed in src/main/resources/application.conf. To check that the server is running, ensure that the following endpoints return valid JSON:
# check version
curl localhost:8080
# simple test
curl localhost:8080/v1.0
WARNING: the server will not work properly if the logging level is set to DEBUG!
Datasets need to be uploaded to the server. Currently only CSV files are supported. A description can also be added to the dataset upload. If a dataset has no header row, a special header line must be added to the CSV, otherwise Serene will not read the dataset properly: the header line should contain the numbers from 0 to the number of columns minus 1.
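The header rule for headerless CSVs can be applied before upload. The following is a minimal sketch, assuming a comma-separated file whose first row has the full column count; the function name `add_numeric_header` is hypothetical and not part of Serene.

```python
import csv
import io


def add_numeric_header(csv_text: str) -> str:
    """Prepend a header row 0..n-1 to headerless CSV text.

    Assumption: the first row has the same number of columns as the rest.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text  # nothing to label
    header = [str(i) for i in range(len(rows[0]))]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

The resulting text can be saved to a file and posted to the dataset endpoint like any other CSV.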
# Get a list of datasets...
curl localhost:8080/v1.0/dataset
# Post a new dataset...
# Note that the max upload size is 2GB...
curl -X POST -F 'file=@test.csv' -F 'description=This is a file' -F 'typeMap={"a":"int", "c":"string", "e":"int"}' localhost:8080/v1.0/dataset
# Show a single dataset
curl localhost:8080/v1.0/dataset/12341234
# Show a single dataset with custom sample size
curl 'localhost:8080/v1.0/dataset/12341234?samples=50'
# Update a single dataset
curl -X POST -F 'description=This is a file' -F 'typeMap={"a":"int", "c":"string", "e":"float"}' localhost:8080/v1.0/dataset/12341234
# Delete a dataset
curl -X DELETE localhost:8080/v1.0/dataset/12341234
The model endpoint controls the parameters used for the Schema Matcher classifier. The Schema Matcher takes a list of classes and attempts to assign them to the columns of a dataset. If the class of a column is known, use labelData to map its column ID in the dataset to that class. The features, modelType and resamplingStrategy can be modified.
# List models
curl localhost:8080/v1.0/model
# Post model
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"description": "This is the description",
"modelType": "randomForest",
"classes": ["name", "address", "phone", "unknown"],
"features": { "activeFeatures" : [
"num-unique-vals",
"prop-unique-vals",
"prop-missing-vals",
"ratio-alpha-chars",
"prop-numerical-chars",
"prop-whitespace-chars",
"prop-entries-with-at-sign",
"prop-entries-with-hyphen",
"prop-range-format",
"is-discrete",
"entropy-for-discrete-values"
],
"activeFeatureGroups" : [
"inferred-data-type",
"stats-of-text-length",
"stats-of-numeric-type",
"prop-instances-per-class-in-knearestneighbours",
"mean-character-cosine-similarity-from-class-examples",
"min-editdistance-from-class-examples",
"min-wordnet-jcn-distance-from-class-examples",
"min-wordnet-lin-distance-from-class-examples"
],
"featureExtractorParams" : [
{
"name" : "prop-instances-per-class-in-knearestneighbours",
"num-neighbours" : 3
}, {
"name" : "min-editdistance-from-class-examples",
"max-comparisons-per-class" : 3
}, {
"name" : "min-wordnet-jcn-distance-from-class-examples",
"max-comparisons-per-class" : 3
}, {
"name" : "min-wordnet-lin-distance-from-class-examples",
"max-comparisons-per-class" : 3
}
]
},
"costMatrix": [[1,0,0], [0,1,0], [0,0,1]],
"labelData" : {"1696954974" : "name", "66413956": "address"},
"resamplingStrategy": "ResampleToMean"
}' \
localhost:8080/v1.0/model
# Show a single model
curl localhost:8080/v1.0/model/12341234
# Update model (all fields optional)
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"description": "This is the description",
"modelType": "randomForest",
"classes": ["name", "address", "phone", "unknown"],
"features": { "activeFeatures" : [
"num-unique-vals",
"prop-unique-vals",
"prop-missing-vals",
"ratio-alpha-chars",
"prop-numerical-chars",
"prop-whitespace-chars",
"prop-entries-with-at-sign",
"prop-entries-with-hyphen",
"prop-range-format",
"is-discrete",
"entropy-for-discrete-values"
],
"activeFeatureGroups" : [
"inferred-data-type",
"stats-of-text-length",
"stats-of-numeric-type",
"prop-instances-per-class-in-knearestneighbours",
"mean-character-cosine-similarity-from-class-examples",
"min-editdistance-from-class-examples",
"min-wordnet-jcn-distance-from-class-examples",
"min-wordnet-lin-distance-from-class-examples"
],
"featureExtractorParams" : [
{
"name" : "prop-instances-per-class-in-knearestneighbours",
"num-neighbours" : 3
}, {
"name" : "min-editdistance-from-class-examples",
"max-comparisons-per-class" : 3
}, {
"name" : "min-wordnet-jcn-distance-from-class-examples",
"max-comparisons-per-class" : 3
}, {
"name" : "min-wordnet-lin-distance-from-class-examples",
"max-comparisons-per-class" : 3
}
]
},
"costMatrix": [[1,0,0], [0,1,0], [0,0,1]],
"labelData" : {"1696954974" : "name", "66413956": "address"},
"resamplingStrategy": "ResampleToMean"
}' \
localhost:8080/v1.0/model/98793874
# Train model (async, use GET on model 98793874 to query state)
curl -X POST localhost:8080/v1.0/model/98793874/train
# Delete a model
curl -X DELETE localhost:8080/v1.0/model/12341234
# Predict a specific dataset 12341234 using model. Returns prediction JSON object
curl -X POST localhost:8080/v1.0/model/98793874/predict/12341234
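When model requests are assembled programmatically, a small client-side check can catch labelData entries that name a class missing from classes. The sketch below is an assumption about useful validation, not a description of what the server enforces; `build_model_request` is a hypothetical helper, and it omits features and costMatrix for brevity (whether the server requires them is not stated here).

```python
import json


def build_model_request(classes, label_data, description="",
                        model_type="randomForest",
                        resampling_strategy="ResampleToMean"):
    """Return a JSON body for POST /v1.0/model after a client-side sanity check."""
    known = set(classes)
    stray = sorted({label for label in label_data.values() if label not in known})
    if stray:
        raise ValueError(f"labelData uses classes not listed in 'classes': {stray}")
    body = {
        "description": description,
        "modelType": model_type,
        "classes": list(classes),
        # Column IDs become JSON object keys, so they are sent as strings.
        "labelData": {str(col): label for col, label in label_data.items()},
        "resamplingStrategy": resampling_strategy,
    }
    return json.dumps(body)
```

The returned string can be passed to curl with -d, or sent with any HTTP client using the Content-Type: application/json header.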
To use the newly added bagging resampling strategies ("Bagging", "BaggingToMax", "BaggingToMean"), two additional parameters can be specified in the model POST request: numBags and bagSize. Both parameters are integers; if not specified, the default value 100 is used for each. Example model POST request using bagging:
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"description": "This is the description",
"modelType": "randomForest",
"classes": ["name", "address", "phone", "unknown"],
"features": { "activeFeatures" : [ "num-unique-vals", "prop-unique-vals", "prop-missing-vals" ],
"activeFeatureGroups" : [ "stats-of-text-length", "prop-instances-per-class-in-knearestneighbours"],
"featureExtractorParams" : [{"name" : "prop-instances-per-class-in-knearestneighbours","num-neighbours" : 5}]
},
"costMatrix": [[1,0,0], [0,1,0], [0,0,1]],
"labelData" : {"1" : "name", "1817136897" : "unknown", "1498946589" : "name", "134383522" : "phone", "463734360" : "address"},
"resamplingStrategy": "Bagging",
"numBags": 10,
"bagSize": 1000
}' \
localhost:8080/v1.0/model
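The default behaviour for numBags and bagSize described above can be mirrored on the client side, for example to display the effective settings before posting. This is a sketch under that assumption only; the server applies its own defaults, and `with_bagging_defaults` is a hypothetical name.

```python
BAGGING_STRATEGIES = {"Bagging", "BaggingToMax", "BaggingToMean"}
DEFAULT_BAGGING_VALUE = 100  # documented default for both numBags and bagSize


def with_bagging_defaults(request: dict) -> dict:
    """Return a copy of the request with numBags/bagSize filled in for bagging strategies."""
    result = dict(request)
    if result.get("resamplingStrategy") in BAGGING_STRATEGIES:
        result.setdefault("numBags", DEFAULT_BAGGING_VALUE)
        result.setdefault("bagSize", DEFAULT_BAGGING_VALUE)
    return result
```

Requests that use a non-bagging strategy such as ResampleToMean are returned unchanged.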
Explanation of features and the list of available features can be found here.
Resampling strategies are enumerated here.
Currently only randomForest is supported as a modelType through the Serene API.
Attribute IDs in the source descriptions must be unique across different data sources, because we rely on Karma code to perform semantic modelling.
The labels (semantic types) are assumed to come in the format: className---propertyName.
The configuration for the semantic modeler is specified in modeling.properties.
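The className---propertyName label format mentioned above can be split as follows. This is an illustrative sketch only: the `SemanticType` tuple and `parse_semantic_type` function are hypothetical names, and the example label in the usage note is made up.

```python
from typing import NamedTuple

SEPARATOR = "---"  # separator between class name and property name


class SemanticType(NamedTuple):
    class_name: str
    property_name: str


def parse_semantic_type(label: str) -> SemanticType:
    """Split a label of the form className---propertyName into its two parts."""
    class_name, sep, property_name = label.partition(SEPARATOR)
    if not sep or not class_name or not property_name:
        raise ValueError(f"expected className{SEPARATOR}propertyName, got {label!r}")
    return SemanticType(class_name, property_name)
```

For instance, parse_semantic_type("City---name") yields the class name "City" and the property name "name", while a label without the separator raises ValueError.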
Semantic source descriptions (SSDs) describe exactly how a particular dataset maps into a specified ontology. They include information about the semantic types (i.e., classes/labels) of the columns as well as the relationships between these semantic types. All this information is encoded in the semantic model. Before a semantic source description can be uploaded to the server, the associated datasets should be uploaded.
# Get a list of semantic source descriptions...
curl localhost:8080/v1.0/ssd
# Post a new SSD...
# Note that the max upload size is 2GB...
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "serene-user-example-ssd",
"ontology": [1],
"semanticModel": {
"nodes": [
{
"id": 0,
"label": "State",
"type": "ClassNode"
},
{
"id": 1,
"label": "City",
"type": "ClassNode"
}],
"links": [
{
"id": 0,
"source": 1,
"target": 0,
"label": "isPartOf",
"type": "ObjectPropertyLink"
}]
},
"mappings": [
{
"attribute": 1997319549,
"node": 0
},
{
"attribute": 1160349990,
"node": 1
}]
}' \
localhost:8080/v1.0/ssd
# Show a single ssd
curl localhost:8080/v1.0/ssd/12341234
# Update a single ssd
# Delete an ssd
curl -X DELETE localhost:8080/v1.0/ssd/12341234
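Before posting an SSD, it can help to check that every link and mapping refers to a node that exists in the semantic model. The sketch below assumes the JSON layout shown in the POST example above (semanticModel.nodes, semanticModel.links, mappings); `check_ssd_references` is a hypothetical helper, and the server may perform different or additional checks.

```python
def check_ssd_references(ssd: dict) -> list:
    """Return a list of problems where links or mappings point at missing node ids."""
    node_ids = {node["id"] for node in ssd["semanticModel"]["nodes"]}
    problems = []
    for link in ssd["semanticModel"]["links"]:
        for end in ("source", "target"):
            if link[end] not in node_ids:
                problems.append(f"link {link['id']}: unknown {end} node {link[end]}")
    for mapping in ssd["mappings"]:
        if mapping["node"] not in node_ids:
            problems.append(
                f"attribute {mapping['attribute']}: unknown node {mapping['node']}"
            )
    return problems
```

An empty list means all references resolve; any other result lists the offending links and attributes.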
Serene can handle only OWL ontologies.
# Get a list of ontologies...
curl localhost:8080/v1.0/owl
# Post a new ontology...
# Note that the max upload size is 2GB...
curl -X POST -F 'file=@test.owl' localhost:8080/v1.0/owl
# Show a single owl
curl localhost:8080/v1.0/owl/12341234
# Update a single owl
# Delete an owl
curl -X DELETE localhost:8080/v1.0/owl/12341234
The octopus endpoint controls the parameters used for the Semantic Modeller of the Serene API. Octopus is the final model which performs both relational and ontological schema matching.
# List octopi
curl localhost:8080/v1.0/octopus
# Post octopus
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "hello",
"description": "Testing octopus used for identifying phone numbers only.",
"ssds": [1, 2, 3],
"ontologies": [1, 2, 3],
"modelingProps": "see below for explanations",
"modelType": "randomForest",
"features": ["isAlpha", "alphaRatio", "atSigns", ...],
"resamplingStrategy": "ResampleToMean",
"numBags": 10,
"bagSize": 10
}' \
localhost:8080/v1.0/octopus
# Train octopus (async, includes training for the schema matcher model, use GET on octopus 98793874 to query state)
curl -X POST localhost:8080/v1.0/octopus/98793874/train
# Delete a single octopus
curl -X DELETE localhost:8080/v1.0/octopus/12341234
# Suggest a list of semantic models for a specific dataset 12341234 using octopus. Returns prediction JSON object
curl -X POST localhost:8080/v1.0/octopus/98793874/predict/12341234
Modeling properties:
Ontology inference properties govern the construction of the alignment graph and regulate how many nodes and links will be additionally inferred from the ontology:
Property Name | Type | Default | Description |
---|---|---|---|
compatibleProperties | Boolean | true | Governs construction of ontology cache (extends alignment graph with inferred nodes and links from the ontology) |
ontologyAlignment | Boolean | false | Governs construction of ontology cache (extends alignment graph with inferred nodes and links from the ontology) |
addOntologyPaths | Boolean | false | Extends alignment graph with inferred paths from the ontology |
multipleSameProperty | Boolean | false | Allow multiple same data properties per class node |
thingNode | Boolean | false | Add Thing node as superclass of all other classes |
nodeClosure | Boolean | true | Additional inference on nodes (closure of the node contains all the nodes that are connected to the input node by ObjectProperty or SubClass links) |
propertiesDirect | Boolean | true | Extend with direct properties |
propertiesIndirect | Boolean | true | Extend with indirect properties |
propertiesSubclass | Boolean | true | Extend with subclass properties |
propertiesWithOnlyDomain | Boolean | true | Allow properties in the ontology which have only domain indicated, but not range |
propertiesWithOnlyRange | Boolean | true | Allow properties in the ontology which have only range indicated, but not domain |
propertiesWithoutDomainRange | Boolean | false | Allow properties in the ontology which do not have domain or range |
Search optimization (to better understand the search algorithms please refer to the report):
Property Name | Type | Default | Description |
---|---|---|---|
numSemanticTypes | Int | 4 | Parameter which filters possible matches per column (only Top numSemanticTypes will be considered during mapping stage) |
mappingBranchingFactor | Int | 50 | Parameter which reduces the search space for the possible mappings (mappings are built as combinations of matches) |
numCandidateMappings | Int | 10 | Parameter which reduces the search space for the heuristic STP (Steiner Tree Problem) algorithm (only Top numCandidateMappings are considered for STP) |
topkSteinerTrees | Int | 10 | Number of Steiner Trees to be constructed by the algorithm (ranked according to the overall score) |
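The effect of numSemanticTypes can be pictured as a top-k filter over the candidate matches of a single column. The sketch below is only an illustration of that idea under the assumption that each candidate carries one confidence score; it is not Serene's implementation, and `top_semantic_types` is a hypothetical name.

```python
def top_semantic_types(scores: dict, num_semantic_types: int = 4) -> list:
    """Keep the num_semantic_types highest-scoring (label, score) pairs for one column."""
    if num_semantic_types < 1:
        raise ValueError("num_semantic_types must be at least 1")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:num_semantic_types]
```

A smaller value shrinks the number of mappings that have to be combined later, at the cost of possibly dropping the correct match.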
Score is a weighted sum of confidence score, coherence score and size score:
Property Name | Type | Default | Description |
---|---|---|---|
confidenceWeight | Double | 1.0 | Weight of the confidence score (this is the confidence score returned by the schema matcher) |
coherenceWeight | Double | 1.0 | Weight of the coherence score (this score is calculated based on combinations of links and nodes) |
sizeWeight | Double | 0.5 | Weight of the size score (size of the semantic model) |
All weights have to be in range (0,1]. Changing weights will affect the search and the results returned by the semantic modeler.
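The weighted sum and the weight range can be written out as a short sketch. The defaults are taken from the table above; how the three component scores are computed is not covered here, so they are plain inputs, and `overall_score` is a hypothetical function name.

```python
def overall_score(confidence: float, coherence: float, size: float,
                  confidence_weight: float = 1.0,
                  coherence_weight: float = 1.0,
                  size_weight: float = 0.5) -> float:
    """Weighted sum of the confidence, coherence and size scores."""
    weights = {
        "confidenceWeight": confidence_weight,
        "coherenceWeight": coherence_weight,
        "sizeWeight": size_weight,
    }
    for name, weight in weights.items():
        # Weights must lie in the half-open interval (0, 1].
        if not 0.0 < weight <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {weight}")
    return (confidence_weight * confidence
            + coherence_weight * coherence
            + size_weight * size)
```

With the default weights, component scores of 0.8, 0.5 and 0.4 combine to 0.8 + 0.5 + 0.5 * 0.4 = 1.5.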
Unknown:
Property Name | Type | Default | Description |
---|---|---|---|
unknownThreshold | Double | 0.05 | If confidence score with unknown class is above this threshold and unknown is the most likely class, then the column will be discarded |
Threshold must be in range [0,1].
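The discard rule for unknownThreshold can be sketched as a predicate over the per-class confidence scores of one column. The class label "unknown" and the dictionary input are assumptions made for illustration, and `should_discard_column` is a hypothetical name.

```python
def should_discard_column(class_scores: dict, unknown_threshold: float = 0.05,
                          unknown_label: str = "unknown") -> bool:
    """True if unknown is the most likely class and its score exceeds the threshold."""
    if not 0.0 <= unknown_threshold <= 1.0:
        raise ValueError("unknown_threshold must be in [0, 1]")
    if not class_scores:
        return False
    best = max(class_scores, key=class_scores.get)
    return best == unknown_label and class_scores[best] > unknown_threshold
```

A column whose best class is anything other than unknown is always kept, regardless of the threshold.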
Compute three metrics to compare a predicted SSD against the correct one:
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"predictedSsd": {
"name": "businessInfo.csv",
"ontologies": [1],
"semanticModel": {
"nodes": [***],
"links": [***]
},
"mappings": [***]
},
"correctSsd": {
"name": "businessInfo.csv",
"ontologies": [1],
"semanticModel": {
"nodes": [***],
"links": [***]
},
"mappings": [***]
},
"ignoreSemanticTypes": true,
"ignoreColumnNodes": true
}' \
localhost:8080/v1.0/evaluate
To run all tests:
sbt test
To run individual module tests, refer to the module name e.g.
sbt serene-core/test
sbt serene-matcher/test
sbt serene-modeler/test
To run an individual test spec refer to the Spec e.g.
sbt "serene-core/test-only au.csiro.data61.core.SSDStorageSpec"
To generate the code coverage report:
sbt serene-core/test serene-core/coverageReport
This will generate an HTML report at core/target/scala-2.11/scoverage-report/index.html
The semantic modelling component requires three Karma Java libraries to be available:
- karma-common;
- karma-typer;
- karma-util.
Certain changes have been made to the original Karma code:
- Make the following methods public: SortableSemanticModel.steinerNodes.getSizeReduction.
- Add method ModelLearningGraph.setLastUpdateTime:
public void setLastUpdateTime(long newTime) {
this.lastUpdateTime = newTime;
}
- Add DINT to Karma origin of semantic types:
public enum Origin {
AutoModel, User, CRFModel, TfIdfModel, RFModel, DINT
}
- Add two more parameters to the method in GraphBuilder.java:
private void updateLinkCountMap(DefaultLink link, Node source, Node target)