Imports DBPedia dumps into Neo4j
This importer aims for fast import times; less restrictive (and thus less performant) implementations might follow. In particular, it can only perform a full load into an offline, ideally empty, database. Importing the same file in two separate executions might result in duplicated nodes or a corrupt database.
./sbt "run dewiki-20140320-article-categories.nt.gz"
- JDK7 (Probably Oracle JDK)
- (SBT and Scala 2.11) - will be pulled automatically, if you do not have these installed
1. Build the project
./sbt assembly
2. Run the jar against one or more dump files. Always import all files in a single run; otherwise the database will contain incorrect data
./dbpedia-neo4j.jar ./dewiki-20140320-article-categories.nt.gz
3. Specify run options
e.g. you can change the Neo4j database directory with the `dbpedia.db-dir` system property and give the JVM more memory to run:
java -server -Xms1g -Xmx8g -Ddbpedia.db-dir=import.db -jar dbpedia-neo4j.jar {article_categories,category_labels,instance_types,labels,skos_categories}_de.nt.bz2
The importer can work with regular `nt` or `nq` files, or de-compress `gz` or `bz2` archives on the fly (plain-text files are the fastest, of course).
It helps a lot if the files are ordered by subject, so that every statement for one particular subject comes in one block; otherwise the data might be incorrect (especially with regard to the `rdf:type` and `rdfs:label` handling).
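For illustration, a subject-ordered dump keeps all statements of one subject together in one block, roughly like this (the URIs are just examples):

```
<http://de.dbpedia.org/resource/Berlin> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .
<http://de.dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@de .
<http://de.dbpedia.org/resource/Berlin> <http://purl.org/dc/terms/subject> <http://de.dbpedia.org/resource/Kategorie:Berlin> .
<http://de.dbpedia.org/resource/Hamburg> <http://www.w3.org/2000/01/rdf-schema#label> "Hamburg"@de .
```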
Generally, a triple will be stored as `(s)-[p]->(o)`, with the following additions:

- the subject gets the label `:Resource` or `:BNode`
- the object gets the label `:Resource`, `:BNode`, or `:Literal`
- the relationship gets a type that is equivalent to the predicate's URI
- `:Resource`s have a `uri` property and `:Literal`s a `value` property
On top of that, the following changes are implemented:

- relationships of `rdf:type` will be discarded and the particular object URI will be added as a label to the subject
- relationships of `rdfs:label` or `skos:prefLabel` will be discarded and the particular object literal will be added as a `value` property to the subject (when there is more than one of these labels, currently only one of them actually gets stored)
That means you can use the `uri` property as a unique-id-kind-of property and the `value` property as a pretty-output-end-user-kind-of property (this also makes for great node captions in the Neo4j browser UI).
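To make the mapping concrete, here is a rough sketch with made-up URIs (the exact spelling of the label derived from the `rdf:type` object may differ):

```
# input triples
<http://example.org/Berlin> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/City> .
<http://example.org/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@de .
<http://example.org/Berlin> <http://example.org/locatedIn> <http://example.org/Germany> .

# resulting graph, roughly
(:Resource:City {uri: "http://example.org/Berlin", value: "Berlin"})
    -[:`http://example.org/locatedIn`]->
(:Resource {uri: "http://example.org/Germany"})
```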
The importer can create or maintain a schema index for later use; see Configuration.
The importer uses Typesafe Config for configuration; for details on how to write the config, please see their documentation.
The following keys are used, see `reference.conf` for details:
- `dbpedia.db-dir`
- `dbpedia.tx-size`
- `dbpedia.approx-resources`
- `dbpedia.deferred-index`
Probably the most important setting would be `db-dir`, which is the location of the database directory for Neo4j.
To be precise, this must match your `org.neo4j.server.database.location` property from `conf/neo4j-server.properties`.
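The same keys can also be written in Typesafe Config's HOCON syntax instead of being passed as `-D` system properties; the values below are purely illustrative, see `reference.conf` for the actual defaults:

```
dbpedia {
  db-dir           = "import.db"
  tx-size          = 20000
  approx-resources = 5000000
  deferred-index   = true
}
```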
If you set `dbpedia.deferred-index` to true, the importer will create a schema index on `:Resource(uri)` and `:Literal(value)`.
These indices cannot (and will not) be used during the import itself; they only spare you the burden of remembering to add them yourself later on.
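In terms of Neo4j's 2.x batch-insertion API, a deferred schema index boils down to something like the following sketch (not the importer's actual code, just the general idea):

```scala
import org.neo4j.graphdb.DynamicLabel
import org.neo4j.unsafe.batchinsert.BatchInserters

object DeferredIndexSketch extends App {
  val inserter = BatchInserters.inserter("import.db")

  // The index is only registered here; it gets populated when the
  // database is started normally after the import has finished.
  inserter.createDeferredSchemaIndex(DynamicLabel.label("Resource")).on("uri").create()
  inserter.createDeferredSchemaIndex(DynamicLabel.label("Literal")).on("value").create()

  inserter.shutdown()
}
```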
The importer uses the batch API with all its implications, such as using only a single thread and no transactions. The importer maintains an in-memory map of URI <-> node ID mappings (the node cache), which prevents creating multiple resource nodes for the same URI. This cache is ephemeral and will be destroyed when the importer terminates. There is no usage of any index, lookup, or cache that actually goes to the database. This is also the reason that you have to do everything in one go.
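As a minimal sketch of the node-cache idea (again not the actual importer code, and assuming Neo4j 2.x on the classpath), a URI-to-node-id map in front of the batch inserter is enough to avoid duplicate resource nodes within a single run:

```scala
import org.neo4j.graphdb.DynamicLabel
import org.neo4j.unsafe.batchinsert.{BatchInserter, BatchInserters}
import scala.collection.JavaConverters._
import scala.collection.mutable

object NodeCacheSketch extends App {
  val inserter: BatchInserter = BatchInserters.inserter("import.db")
  val resource = DynamicLabel.label("Resource")

  // Ephemeral cache: lives only as long as this process and never hits the database.
  val cache = mutable.Map.empty[String, Long]

  def nodeFor(uri: String): Long =
    cache.getOrElseUpdate(uri,
      inserter.createNode(Map[String, AnyRef]("uri" -> uri).asJava, resource))

  val first  = nodeFor("http://example.org/Berlin") // creates the node
  val second = nodeFor("http://example.org/Berlin") // cache hit, returns the same id
  assert(first == second)

  inserter.shutdown() // flushes everything to disk
}
```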
The importer logs several runtime metrics through the excellent Metrics library. Meters for inserted nodes and relationships are reported during the process; a full report is given at the end. Here's an example output:
[nodes]: count=2800168 rate=17652.78/s
[rels]: count=5953013 rate=37528.85/s
[triples]: count=11975105 rate=75493.05/s
[create-rel]: count=5953013 rate=37807.85/s [0.00ms, 0.01ms] ~0.00ms ±0.00ms p95=0.00ms p99=0.01ms p999=0.01ms
[create-resource]: count=2800168 rate=17781.77/s [0.00ms, 0.09ms] ~0.00ms ±0.00ms p95=0.01ms p99=0.01ms p999=0.09ms
[graph-tx]: count=498 rate=3.15/s [83.61ms, 2586.21ms] ~281.81ms ±271.38ms p95=681.58ms p99=1828.87ms p999=2586.21ms
[import]: count=1 rate=0.01/s [151272.85ms, 151272.85ms] ~151272.85ms ±0.00ms p95=151272.85ms p99=151272.85ms p999=151272.85ms
[lookup-resource]: count=10930877 rate=69411.97/s [0.00ms, 0.01ms] ~0.00ms ±0.00ms p95=0.00ms p99=0.00ms p999=0.01ms
[shutdown]: count=1 rate=0.14/s [7243.15ms, 7243.15ms] ~7243.15ms ±0.00ms p95=7243.15ms p99=7243.15ms p999=7243.15ms
[subject-nodes]: count=4977864 rate=31606.02/s [0.00ms, 0.81ms] ~0.02ms ±0.04ms p95=0.04ms p99=0.15ms p999=0.80ms
[update-resource]: count=2328089 rate=20607.95/s [0.01ms, 0.75ms] ~0.03ms ±0.06ms p95=0.03ms p99=0.33ms p999=0.75ms
- `rels` and `nodes` count whenever a relationship or node, respectively, is added to the database
- `triples` counts how many triples were processed during the import, even the ones that did not result in new nodes or relationships
- `create-rel` measures how long it takes to create a new relationship; the count should be equivalent to `rels`
- `create-resource` measures how long it takes to create a resource node
- `create-literal` measures how long it takes to create a literal node
- `create-bnode` measures how long it takes to create a blank node; these last three should add up to `nodes`
- `subject-nodes` measures how long it takes to handle all statements for a particular subject
- `lookup-resource` measures the lookup time in the cache to determine whether or not a resource already exists in the graph
- `lookup-bnode` measures the lookup time in the cache to determine whether or not a blank node already exists in the graph
- `update-resource` measures how long it takes to update labels and properties of an already existing resource
- `graph-tx` measures the time of one "transaction"
- `import` measures how long the actual import process takes
- `shutdown` measures how long the shutdown process after the import takes; this includes flushing to disk and potentially creating schema indices
Credit goes to @zazi and @sojoner for outlining the data model definition and possibly creating the need for this importer