Generate bridge files via SSSOM #3061
Define and export the ROBOT_PLUGINS_DIRECTORY variable, pointing to the $(TMPDIR)/plugins directory. Add a rule to download the SSSOM plugin to that directory.
To generate the components containing the SSSOM-derived cross-references pointing to FBbt terms, we can now use the SSSOM plugin for ROBOT rather than the hideous hack that was the AWK script.
For now, FBbt is the only foreign ontology to provide its mappings as a SSSOM set, but that may change in the future if SSSOM gets used more widely. So we generalise the rules that download the FBbt mapping set so that they can be used to download any externally provided mapping set. Likewise, the rule that generates the mappings.owl components (made of cross-references derived from the FBbt SSSOM set) is generalised so that it is ready to create a component from *all* externally provided mapping sets, not only the FBbt one.
Several of the foreign ontologies we are bridging to are deprecated, so there shouldn't be any need to keep the bridges around. If someone needs one of those bridges, they should grab it from the last release prior to this commit and keep it on their side.
This bridge is a remnant of the time when Uberon-ext was a thing. It has not been a thing for quite a while now, so this bridge is hopelessly obsolete.
CL does not contain terms that represent life stages -- all such terms are in Uberon. Therefore the bridges between CL and those ontologies that are specialised in life stage terms are always empty and can be safely removed.
The ABA bridge is deprecated in favour of the (custom-made) MBA bridge, so we can remove it.
The Uberon bridge to BSD and the CL bridge to "NIF Gross Anatomy" are both empty and their intended source of truth is unknown (they are *not* set up to be generated from cross-references, as most other bridges are). Empty and no source to fill them: they are useless and can be removed.
We have a bridge to BFO whose source of truth is currently unknown. We align it with the other bridges by creating cross-references in the Uberon edit file, so that the edit file becomes the source of truth and the bridge can be (re-)generated in the same way as all the other bridges. Of note, several of the axioms in the bridge are redundant with axioms already explicitly stated in Uberon, so we don't add cross-references for those.
The Uberon bridge to NIFSTD contains 6 bridging axioms:

* 4 that link Uberon terms to NIFSTD terms that themselves have been deprecated in favour of... the very Uberon term that points to them;
* 2 that have non-resolvable IRIs.

I think it is fair to say this bridge is not useful.
It seems VAO and VSAO are the same thing. VAO is listed as an alternative prefix for VSAO in the bioregistry, and seemingly all cross-references to VSAO terms in the Uberon edit file have a VAO equivalent in the VAO bridge. Since 1) we do not know where the VAO bridge comes from and 2) we have already decided to remove the VSAO bridge because VSAO is obsolete, there is no reason to keep this bridge.
The CL bridge to KUPO has likely at some point been generated from cross-references in CL (all links in the bridge have corresponding xrefs in CL). It is currently _not_ re-generated because KUPO xrefs are not listed in Uberon as being associated with a bridge, which we fix here.
The hdr-* files were used to prepend metadata at the top of some of the bridge files. This will be done differently in the new pipeline and those header files won't be necessary anymore. Likewise for the footer file.
Dear reader, we have a lot to unpack here.

Very simply put, this commit replaces the old Perl script that generated the bridge files from cross-references in Uberon and CL. Critically, it does so by clearly separating the process into two independent steps:

* extracting the cross-references and making a SSSOM set out of them;
* generating the bridges out of the SSSOM set.

This allows for future flexibility, as both steps can be modified independently of each other, as long as the first step still results in a SSSOM set and the second step still takes a SSSOM set as input. (In particular, in the future we may switch to maintaining the mappings directly in SSSOM, instead of using cross-references in the -edit file. This would obviously make the first step above obsolete, but the second step would be unaffected.)

Now for the details:

The first step is entirely done by the 'xref-extract' command of the SSSOM plugin for ROBOT. We merge Uberon and CL (since we also need to extract xrefs from CL, not only Uberon), then use xref-extract to produce a SSSOM set. The command uses the 'treat-xrefs-as-...' annotations found in the ontology to decide:

1) which cross-references to extract (only the cross-references using a prefix declared in a 'treat-xrefs-as-...' annotation are considered), and
2) which mapping predicate to use, according to the following table:

* 'treat-xrefs-as-equivalent' -> skos:exactMatch
* 'treat-xrefs-as-has-subclass' -> skos:narrowMatch
* 'treat-xrefs-as-is_a' -> skos:broadMatch
* 'treat-xrefs-as-reverse-genus-differentia' -> semapv:crossSpeciesExactMatch

We use the same process to extract cross-references from ZFA, since ZFA is the source of truth for the CL-to-ZFA mappings (but not for the Uberon-to-ZFA mappings, for some reason).

In parallel, we fetch the SSSOM sets that are maintained by foreign ontologies (for now, only FBbt), so that we end up with mapping sets covering all the ontologies we want to bridge with.
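As an illustration only, the predicate table of the first step can be sketched in Python. This is not the plugin's code: the function name, the shape of the inputs and the term IDs are all hypothetical, the orientation of the resulting triples is an assumption, and the extra arguments that a real 'treat-xrefs-as-reverse-genus-differentia' annotation carries are ignored.

```python
# Predicate chosen for each 'treat-xrefs-as-...' annotation (table from the commit message).
PREDICATE_FOR_ANNOTATION = {
    "treat-xrefs-as-equivalent": "skos:exactMatch",
    "treat-xrefs-as-has-subclass": "skos:narrowMatch",
    "treat-xrefs-as-is_a": "skos:broadMatch",
    "treat-xrefs-as-reverse-genus-differentia": "semapv:crossSpeciesExactMatch",
}


def extract_mappings(xrefs, prefix_annotations):
    """Turn (term, xref) pairs into (term, predicate, xref) triples.

    xrefs: iterable of (term_id, xref_curie) pairs (hypothetical input shape).
    prefix_annotations: dict mapping a prefix to the 'treat-xrefs-as-...'
    annotation that declares it.
    """
    mappings = []
    for term, xref in xrefs:
        prefix = xref.split(":", 1)[0]
        annotation = prefix_annotations.get(prefix)
        if annotation is None:
            # Prefix not declared in any 'treat-xrefs-as-...' annotation:
            # the cross-reference is not extracted.
            continue
        mappings.append((term, PREDICATE_FOR_ANNOTATION[annotation], xref))
    return mappings


# Placeholder IDs, for illustration only.
example = extract_mappings(
    [("UBERON:9999991", "ZFA:9999991"), ("UBERON:9999991", "FOO:1")],
    {"ZFA": "treat-xrefs-as-reverse-genus-differentia"},
)
```

Only the ZFA cross-reference survives in `example`, because `FOO` is not declared in any annotation.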
The second step is done by the 'sssom-inject' command of the same SSSOM plugin for ROBOT. Beyond the mapping sets themselves, that command requires two additional files:

1) The bridge/bridges.rules file dictates which bridging axioms should be generated for each mapping, depending on the mapping predicate and the subject or object of the mapping. It also takes care of making sure the mappings are in the expected orientation (Uberon/CL terms on the object side), filtering out mappings to anything other than Uberon/CL (in case the externally provided mapping sets contain more than what we need), and checking that all mappings concern classes that do exist in Uberon/CL and are not obsolete. That file is written in the ad hoc "SSSOM/Transform-OWL" language, the documentation of which can be found on the website of the SSSOM plugin.

2) The bridge/bridges.dispatch file dictates where to write the bridging axioms produced by the bridges.rules file. It does so using a tagging system: each rule in bridges.rules is tagged to indicate the bridge to which the axioms generated by that rule belong. The dispatch file associates each tag with an output file and, in addition, allows specifying some metadata for each output file.
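The tagging idea can be pictured with a small in-memory sketch. It only illustrates the principle (a tag on each generated axiom, a table from tag to output file); it does not reproduce the actual syntax of bridges.rules or bridges.dispatch, and every name and file path in it is made up.

```python
from collections import defaultdict


def dispatch(tagged_axioms, dispatch_table):
    """Group axioms by output file according to their tag.

    tagged_axioms: iterable of (tag, axiom) pairs, as a rule set might emit them.
    dispatch_table: dict mapping a tag to an output filename (hypothetical shape).
    """
    outputs = defaultdict(list)
    for tag, axiom in tagged_axioms:
        filename = dispatch_table.get(tag)
        if filename is None:
            # No entry for this tag: the axiom is not written anywhere.
            continue
        outputs[filename].append(axiom)
    return dict(outputs)


# Hypothetical tags, axioms and filenames.
routed = dispatch(
    [("uberon-zfa", "axiom A"), ("cl-zfa", "axiom B"), ("uberon-zfa", "axiom C")],
    {"uberon-zfa": "uberon-bridge-to-zfa.owl", "cl-zfa": "cl-bridge-to-zfa.owl"},
)
```

Keeping the routing in a separate table means a bridge can be renamed or moved without touching the rules that generate its axioms.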
Since the bridge-generating rules are highly repetitive, they can easily be automatically generated by some macros. This keeps the editor-facing bridge/bridges.rules.m4 file easier to read and modify (for example, to add or remove a bridge, an editor has simply to add or remove a single 'BRIDGE(...)' line).
The seed.obo file was previously used as an intermediate to generate the bridge file. It is no longer needed for this. It was also used to generate the life-cycle-xrefs report, but there is no reason not to use the -edit file as source for that step. The only thing we need to be careful about is that seed.obo was pruned of obsolete terms, while the -edit file obviously is not. But it is more efficient to filter out obsolete terms directly in the SPARQL query, instead of:

* using ROBOT to generate seed.obo;
* using (another instance of) ROBOT to reason over seed.obo;
* using a Perl script to hack the resulting OBO file and prune obsolete terms;
* using ROBOT again to run the SPARQL query.
The CUSTOM_BRIDGES variable needs to be declared before it is referenced in the pre-requisites of the $(TMPDIR)/bridges target, otherwise the corresponding bridge would not be generated when that target is invoked.
When the bridges.rules file was moved to a M4-generated file, the reference to that file in the call to 'sssom-inject' did not follow.
To generate the bridge, we need a version of CL both to extract the cross-references from it, and to filter out any mappings to a nonexistent or obsolete CL class. We also need a version of ZFA to extract the cross-references from it. This is problematic as the bridge generation pipeline is typically run under MIR=false conditions, so we may not have local CL and ZFA mirrors available (unless we are on an editor machine and the editor has refreshed the mirrors prior to running the pipeline).

There are several possible solutions:

1) For the specific purpose of generating the bridges, always download CL and ZFA from their online location, regardless of MIR (basically bypassing the ODK mirroring system). This is what the previous version of the pipeline was doing. This is not completely unreasonable as we are not using CL or ZFA to inject any axioms into a final product here, we are just extracting their cross-references (though one might argue that those cross-references ultimately end up as axioms in products such as composite-metazoan).

2) Make the pipeline dependent on the mirrors -- that is, skip the pipeline entirely under MIR=false. This would mean that the bridges would only be re-generated when the pipeline is specifically run under MIR=true -- notably, they would never be re-generated during the CI checks, since those run under MIR=false.

3) Try to have it both ways: use a local mirror when available (either because we are under MIR=true, or because the mirrors had been refreshed previously), and fall back to downloading CL and ZFA from their online locations otherwise.

This commit implements option 3. I don't particularly like it (it's an ugly hack), but I would like input from other maintainers before deciding between options 1 and 2 (or possibly another option that I didn't see), so this will do for now.
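The decision made by option 3 can be written down as a few lines of Python. This is a sketch of the logic only, not of the Makefile rules that implement it; the function name, the mirror path and the example.org URL are placeholders.

```python
import os


def pick_source(mirror_path, remote_url, mir, exists=os.path.isfile):
    """Decide where to read an ontology from when generating the bridges.

    mir: True when running under MIR=true, in which case the mirror is
    expected to be produced by the regular mirroring rules.
    exists: predicate used to check for a previously refreshed mirror
    (injectable so the logic can be exercised without touching the disk).
    """
    if mir or exists(mirror_path):
        return ("mirror", mirror_path)
    # No usable local mirror: fall back to the online location.
    return ("download", remote_url)


# Placeholder path and URL.
choice = pick_source("mirror/cl.owl", "https://example.org/cl.owl", mir=False,
                     exists=lambda path: False)
```

Under MIR=false with no mirror on disk, `choice` points at the remote URL; in the two other cases the local mirror wins.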
Make sure the tmp/bridges file is created when we generate the bridges, to avoid always re-generating them.
When BRI is set to false, skip the entire BRIDGES section. Just create the tmp/bridges file so that rules that depend on that target are happy.
When generating the SSSOM set from the cross-references in the -edit file (and in CL), ignore any cross-references to a foreign term if there are more than one, and produce a report listing all such cases. This requires version 0.4.2 of the SSSOM plugin, which is not released yet -- but it will be before this branch is ready for prime time.
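A rough Python sketch of that behaviour, assuming mappings are plain (local term, foreign term) pairs: any foreign term referenced from more than one local term is dropped entirely and listed in a report. The function name and the IDs are hypothetical, and the plugin's actual report format is not shown here.

```python
from collections import defaultdict


def drop_ambiguous(mappings):
    """Split mappings into those kept and a report of ambiguous foreign terms.

    mappings: list of (local_term, foreign_term) pairs.
    Returns (kept, report), where report maps each ambiguous foreign term
    to the sorted list of local terms that reference it.
    """
    locals_by_foreign = defaultdict(set)
    for local, foreign in mappings:
        locals_by_foreign[foreign].add(local)

    kept = [(l, f) for l, f in mappings if len(locals_by_foreign[f]) == 1]
    report = {f: sorted(ls) for f, ls in locals_by_foreign.items() if len(ls) > 1}
    return kept, report


# Placeholder IDs: F:1 is referenced from two local terms, F:2 from one.
kept, report = drop_ambiguous([("U:1", "F:1"), ("U:2", "F:1"), ("U:3", "F:2")])
```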
The mappings between FBbt and Uberon/CL are entirely provided by FBbt as a SSSOM set already, so we should not try to extract FBbt cross-references from Uberon's edit file.
Use new features of the SSSOM plugin (that will be available in the upcoming 0.4.2 release) to reduce the amount of boilerplate in the dispatch table: * the filename of an entry is now relative to the directory containing the dispatch table; * the ontology ID and version IRIs can be specified once and for all in a pseudo-entry named "__default".
We are already ensuring, when extracting the cross-references from Uberon and CL, that we detect, ignore and report any case where the same foreign term is mapped to more than one Uberon/CL term. But we could still get duplicate mappings when we add to the mix the externally provided sets -- either because an external set already contains duplicates, or because of an overlap between the mappings extracted from the cross-references and the external sets. So we add another layer of defense during the bridge generation step, where we instruct sssom-inject to drop any mapping with a cardinality of *:n (any mapping where the subject is mapped to many objects).
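To make the *:n notation concrete, here is a minimal Python sketch that labels each mapping of a merged set with a cardinality and drops those whose subject has several objects. It assumes mappings reduced to (subject, object) pairs with placeholder IDs, and it ignores predicates and any other refinement the plugin may apply when it computes cardinality.

```python
from collections import defaultdict


def with_cardinality(mappings):
    """Label each (subject, object) pair with '1:1', '1:n', 'n:1' or 'n:n'."""
    objects_of = defaultdict(set)
    subjects_of = defaultdict(set)
    for subject, obj in mappings:
        objects_of[subject].add(obj)
        subjects_of[obj].add(subject)

    labelled = []
    for subject, obj in mappings:
        # Left side: how many subjects point to this object.
        left = "1" if len(subjects_of[obj]) == 1 else "n"
        # Right side: how many objects this subject points to.
        right = "1" if len(objects_of[subject]) == 1 else "n"
        labelled.append((subject, obj, f"{left}:{right}"))
    return labelled


def drop_many_objects(mappings):
    """Drop every mapping whose cardinality is *:n."""
    return [(s, o) for s, o, card in with_cardinality(mappings)
            if not card.endswith(":n")]


# Placeholder IDs: FBbt:1 maps to two different Uberon terms.
merged = [("FBbt:1", "UBERON:1"), ("FBbt:1", "UBERON:2"),
          ("FBbt:2", "UBERON:3"), ("FBbt:3", "UBERON:3")]
filtered = drop_many_objects(merged)
```

In this example both FBbt:1 mappings are labelled 1:n and dropped, while the two mappings to UBERON:3 are n:1 and kept.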
CI check explicitly cancelled since the new pipeline won’t work anyway with the latest ODK release. An ODK image with support for ROBOT plugins is required.
Document the bridges: what they are, what are the different sources of truths (and so, where editors should look for to update them when needed), how they are generated.
@matentzn (and anyone else interested in this PR): The last changes were to fix the conflicts with the current state of the master branch and to add the documentation about how the bridges are maintained. The PR does what I wanted it to do and is ready to be merged. However, I am not entirely happy with something. Currently, the bridge pipeline is disabled in QC (the test suite runs with The obvious solution would be to, well, enable the bridge pipeline in QC. But this in turn would cause another problem. In addition to using the cross-references extracted from Uberon, the bridge pipeline fetches mappings from upstream sources (at least three sources, for now: the FBbt mapping set, the cross-references from CL, and the cross-references from ZFA) and uses them to produce the bridges. Doing that at QC time would mean that the test suite on any PR could fail for reasons that have nothing to do with the PR that is being tested, just because something changed in an upstream source (something that Uberon editors have no control over). So I’d like to propose that we further refine the bridge pipeline to break it into two separate phases, to be run at different times:
Phase 2 would be enabled under The only downside I can see is that in this setup, external mappings would never be automatically refreshed. They would only be refreshed when an editor explicitly runs the bridge pipeline with So:
The bridging axiom used as example in the introduction comes from ZFA, but the following paragraph refers to FBbt, which may cause needless confusion. Also add a missing closing quote in another example.
I think your idea of separating in two sub-phases is the right one. For completeness: We could consider In any case, I don't mind you implementing that separation here, in this very PR! Whatever works for you.
Yes. That’s how I actually plan to implement it. But because the mapping sets are only needed to generate the bridge, the rules to fetch the external mapping sets would in effect only be called as dependencies of the rules that generate the bridges – and those rules are themselves only used under So basically, we would have:
-> re-generates the bridges only from locally available resources (i.e. the mapping sets that have previously been downloaded and committed to the repository)
-> refreshes the mapping sets and re-generates the bridges We could also have something like
to only refresh all the mappings without immediately re-generating the bridges. Not sure there would be a need for that, but it would be easy to do.
OK, I’ll do it now and here then.
Do not extract the mappings from CL at the same time as we extract the mappings from Uberon. They are maintained in an external repository (the CL repository), so they should be treated differently, downloaded separately and committed to the repository before being used in downstream rules (instead of being written to a temporary file). Same for the mappings extracted from ZFA.
Since the external mapping sets are a way to introduce assertions maintained elsewhere into some of our released products (e.g. composite-metazoan), we treat them as if they were imports, and refresh them only when IMP is set to true. Upon IMP=false the bridges are generated from the (unrefreshed) mapping sets that are committed in the repository.
We treat the "custom bridges" (bridges that we use directly as they are provided to us) the same way as we treat external mapping sets: we only refresh them if we are willing to refresh external resources, which is stated by the IMP variable.
All bridges are now generated in the OWL format, and the custom bridges are provided by upstream already in the OWL format, so we can use them directly in that format. We do not generate OBO versions by default. If we do want an OBO version of them, such a version can be generated using the default rule to produce an OBO bridge from an OWL bridge, there is no need to have specific rules for the custom bridges.
Add two new helper commands (Make targets, really): * refresh-mappings, to forcefully refresh the external mapping sets; * refresh-bridges, to forcefully refresh the bridges (equivalent to calling 'make tmp/bridges', but slightly more user-friendly).
Now that the bridge pipeline no longer systematically refreshes external mapping sets (it only does so under IMP=true), we can enable that pipeline during QC. This ensures that if a bogus mapping is introduced within Uberon itself, and that bogus mapping causes unsatisfiability issues when Uberon is merged with its bridges, the problem will be caught at QC time.
When extracting cross-references from ZFA, ensure that we drop any duplicate mappings, as we already do for the cross-references from Uberon itself and from CL.
Mention the 'refresh-bridges' Make target in the documentation about bridges, and the fact that it should be invoked explicitly by maintainers.
@matentzn @anitacaron This is done and ready. Under the new system, bridges are always re-generated (both at release time, as before, and now also at QC time), but only using local versions of the externally maintained mapping sets, unless the bridge pipeline runs under To force refreshing the bridges completely (including downloading fresh versions of the externally maintained mapping sets), a maintainer must run
Can we have one command for refreshing imports and bridges? We're constantly refreshing imports before a release; we could add the bridges too.
I had considered that but I thought you would prefer to refresh the imports and the bridges separately. You’re the one doing the releases usually, so if you prefer a single command, then sure. :) I’ll add a
Add a convenience target to refresh the imports and the bridges at the same time. That target could later be expanded to refresh any other external resources if needed.
Awesome!
Note that this is not a valid mapping set as it is missing the license and the mapping_set_id. In sssom-py, we just inject a default license and ID when they are not there.
(same for zfa)
I can update sssom:xref-extract to behave similarly in a future version.
I note that sssom-py injects a default license of https://w3id.org/sssom/license/unspecified. Is that the expected, recommended value for an unknown license? There’s no mention of it in the spec and the IRI does not resolve to anything.
For what it’s worth I disagree with sssom-py’s behaviour regarding the mapping_set_id. It injects an auto-generated ID like this: https://w3id.org/sssom/mappings/2de2cb05-b347-42c8-bb68-48f97a8ccce1. Sure, it is compliant with the spec that says the ID “should be IRI, ideally resolvable”, but I think that when a set does not have an ID, creating one on the fly that looks like it is resolvable even though it is certain not to be resolvable is a bad idea.
Note also that the CL and ZFA mapping sets are committed to the repository so that we can be sure they will be locally available when we need to create the bridges (without having to fetch external resources), but they are not intended to be published. It’s not up to Uberon to publish those sets, which do not “belong” to Uberon. It should be up to CL and ZFA respectively.
Feel free to suggest better default values on the sssom issue tracker!
The bridges to MBA and DMBA are now maintained in a new location, so we update the downloading URLs. We also remove the old OBO versions of those bridges.
This PR overhauls the pipeline that generates the bridges between Uberon/CL and other ontologies.
First, it removes a number of bridges that are considered no longer relevant, as discussed in #3047.
Then, it replaces the old Perl-based process to generate the remaining bridges with a 3-step process:
This implements the “phase 1” discussed in #3004.
Still to do in this PR: