-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace_blank_node must go #172
Comments
Minimum non-working example: from django.conf import settings
from rdflib import Graph, BNode
def test_save_blank_node():
store = settings.RDFLIB_STORE
graph = Graph(store)
triple = BNode(), BNode(), BNode()
graph.add(triple) Add this as a unit test, run using |
Good thinking, @lukavdplas! Ironically, that error message links to a section of the SPARQL specification that explains how blank nodes should be implemented. It does not say anywhere that they should be omitted or unsupported. The grammar section is even quite explicit about blank nodes being allowed; there are a few corner cases where blank node labels are not allowed in SPARQL updates, but none of those cases is occurring in this test and even in those corner cases, blank nodes can still be referenced by variables. Storing and retrieving blank nodes is by all means allowed, according to the standard. Your test inspired me to try some things in the Blazegraph UI. This SPARQL update is not allowed: insert data {
[] [] [].
} but this one is: insert data {
[] a [].
} and I am also able to retrieve that triple afterwards. The reason it was not allowed with a blank node in the middle, is that the predicate of a triple must always be an IRI. Of course, that prompted me to retry your test with So, two possible solutions:
I lean towards option 1 because I have wanted to get rid of rdflib as a middle agent for a while in READ-IT as well. Not necessarily get rid of it wholesale, just not using it as an intermediary between us and Blazegraph. Besides performance (due to superfluous parsing and serializing) and artificial restrictions like this one, rdflib also has some other quirks, such as faulty serializations and an obnoxious logical inconsistency that can cause it to consider |
Following this discussion with interest.
See rdflib docs for a -very- brief justification for not supporting blank nodes in
Performance-wise, absolutely. However, I still cannot say with certainty that the problems occurring in READ-IT involving rdflib are caused by bugs in said package. They might spring from a misunderstanding or misuse on my/our part. I'm in favour of picking up UUDigitalHumanitieslab/readit-interface#500 (where the
Why would this be unsustainable? |
Hmm, but that means that read-it basically has the same solution as what we have with But is that really a problem? rdflib assigns unique identifiers to blank nodes using UUID4. As far as I know that means that there is (virtually) no risk of collision. The identifiers are not stable if new data is added, of course, but that should not be a problem in our case. Or am I missing something? |
I can't really judge whether this is a viable solution in the long term, but within the current project, "greater initial cost" is a huge red flag. In terms of both time and budget, the project is at a stage where we should be wrapping up and focus on the shortest route to a satisfactory end product. At best, this suggestion is something that could be considered in a future project. However, I would also argue that this application is generally plagued by solutions that favour possible long-term gains over short-term ones, so I'm wary of this kind of suggestion. As for this particular issue, I agree with Tijmen: with sufficiently large random IDs, the risk of collision is negligible. I would argue that the two solutions you propose ultimately carry a higher risk of data corruption (due to programming errors), especially if they have to be developed short-term. So if the purpose here is to protect data integrity, it may be best to leave this as-is. |
@JeltevanBoheemen Thanks for chiming in!
Thanks. I think that very brief justification is very misguided. I will quote it here for clarity:
It may be true that blank nodes "act as variables" in SPARQL queries, but if it is true, it is still no reason to drop support for them. Suppose that I use the following insert data {
rdfs:Class a _:1.
_:1 rdfs:subClassOf rdfs:Resource.
} The above minigraph contains a single blank node, here labeled as construct where {
?s ?p _:1.
} Blank nodes are not supposed to be uniquely identifyable like that, anyway; IRIs already address that use case. However, I can still retrieve the above minigraph faithfully, i.e., with the knowledge that the blank nodes in both triples refer to the same blank node, for example with the following query: construct where {
rdfs:Class a ?type.
?type rdfs:subClassOf rdfs:Resource.
} which might output something like rdfs:Class a [ rdfs:subClassOf rdfs:Resource ]. If blank nodes "act as variables", that means I can also write the query above as follows, and still get the same result: construct where {
rdfs:Class a _:2.
_:2 rdfs:subClassOf rdfs:Resource.
} If that works, then I would argue that the fact that blank nodes "act as variables" is a feature, not a reason to omit support for them from
Ah, we did that
No, we debugged it together. We entered the interactive Python console and were able to make rdflib produce a logical inconsistency without any of our own code being involved. This is really a bug in rdflib.
While I am already sure that this problem is caused by a bug in rdflib, I am certainly also open to picking up UUDigitalHumanitieslab/readit-interface#500 again.
Because we may eventually want to copy RDF data from external sources to our own triplestore. You cannot simply replace blank nodes by IRIs in that case, because the triples will no longer be equivalent.
No, you are right. If the labels are generated with UUID, then collision should not be likely. I based my worry on the usual convention of labeling blank nodes with small integers like
Fair point.
I guess you have a point here as well.
Yes, I agree with this as well. I was not aware of rdflib using UUIDs for blank nodes until Tijmen pointed it out. Let's close this ticket and not change anything about the way blank nodes are represented. (As for rdflib choosing to label blank nodes with UUID: I still feel that is probably based on a false assumption that blank nodes need to be uniquely identifyable. Then again, that is a moot point now.) |
I just realized there is a critical problem with the piece of code below.
EDPOP/backend/triplestore/utils.py
Lines 74 to 80 in f7de000
When I commented on it in #163, @tijmenbaarda stated that this function was necessary because Blazegraph did not support blank nodes, so we left it in.
At first sight, this function just replaces real blank nodes by an emulation of blank nodes. However, it actually introduces a risk of collision between unrelated nodes. Real blank nodes are intrinsically distinct; they can only be identified if they occur within the same representation. The emulation replacement, on the other hand, produces URIs consisting of a small random number embedded in a fixed string. It is only a matter of time before the same random number is reused. The duplicates will be generated on different dates, possibly by different WSGI workers, and most likely in the process of serializing different queries. This will compromise data integrity.
As for the argument that Blazegraph does not support blank nodes: this should not be true. Blank nodes are too essential to leave them out of any serious RDF implementation. In fact, the Blazegraph wiki mentions support for blank nodes and even an alternative mode of supporting them.
@tijmenbaarda, could you retrace the origin of your belief that Blazegraph does not support blank nodes? I'm sure there is a real problem that needs to be addressed, but we must find an approach that does not involve
replace_blank_node
or anything similar.The text was updated successfully, but these errors were encountered: