
Generate bridge files via SSSOM #3061

Merged (49 commits) on Nov 15, 2023

Conversation


@gouttegd gouttegd commented Sep 7, 2023

⚠️ This PR requires a version of ROBOT that supports pluggable commands. ⚠️

This PR overhauls the pipeline that generates the bridges between Uberon/CL and other ontologies.

First, it removes a number of bridges that are considered no longer relevant, as discussed in #3047.

Then, it replaces the old Perl-based process for generating the remaining bridges with a 3-step process:

  1. extract cross-references from Uberon/CL (which are the source of truth for almost all bridges) and from ZFA (source of truth for the CL-to-ZFA bridge) and turn them into a SSSOM set;
  2. fetch externally maintained SSSOM sets for the foreign ontologies that already provide their mappings as SSSOM (currently, only FBbt – the PR prepares the way for more ontologies to follow, should they wish to do so);
  3. combine all mapping sets and generate the bridges from them.

This implements the “phase 1” discussed in #3004.

Still to do in this PR:

  • Commit the newly generated bridge files.
  • Add documentation for the editors regarding the source of truth of each bridge.

Define and export the ROBOT_PLUGINS_DIRECTORY variable, pointing to the
$(TMPDIR)/plugins directory.

Add a rule to download the SSSOM plugin to that directory.
To generate the components containing the SSSOM-derived cross-references
pointing to FBbt terms, we can now use the SSSOM plugin for ROBOT rather
than the hideous hack that was the AWK script.
For now, FBbt is the only foreign ontology to provide its mappings as a
SSSOM set, but that may change in the future if SSSOM gets used more
widely.
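The plugin setup described above can be sketched as follows; the variable name comes from the commit message, but the relative directory layout, the jar name, and the commented-out download command are assumptions, not copied from the actual Makefile.

```shell
# Sketch of the plugin setup (illustrative; jar name and URL are assumptions).
ROBOT_PLUGINS_DIRECTORY="tmp/plugins"
export ROBOT_PLUGINS_DIRECTORY
mkdir -p "$ROBOT_PLUGINS_DIRECTORY"
# The real Makefile rule would download the SSSOM plugin jar into that
# directory, along the lines of:
# curl -L -o "$ROBOT_PLUGINS_DIRECTORY/sssom.jar" "<plugin release URL>"
echo "plugins dir: $ROBOT_PLUGINS_DIRECTORY"
```

ROBOT resolves pluggable commands such as `sssom:xref-extract` by looking in the directory named by the `ROBOT_PLUGINS_DIRECTORY` environment variable.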

So we generalise the rules that download the FBbt mapping set so that
they can be used to download any externally provided mapping set.

Likewise, the rule that generates the mappings.owl components (made of
cross-references derived from the FBbt SSSOM set) is generalised so that
it is ready to create a component from *all* externally provided mapping
sets, not only the FBbt one.
Several of the foreign ontologies we are bridging to are deprecated, so
there shouldn't be any need to keep the bridges around.

If someone needs one of those bridges, they should grab it from the last
release prior to this commit and keep it on their side.
This bridge is a remnant of the time when Uberon-ext was a thing. It
has not been a thing for quite a while now, so this bridge is hopelessly
obsolete.
CL does not contain terms that represent life stages -- all such terms
are in Uberon. Therefore the bridges between CL and those ontologies
that are specialised in life stage terms are always empty and can be
safely removed.
The ABA bridge is deprecated in favour of the (custom-made) MBA bridge,
so we can remove it.
The Uberon bridge to BSD and the CL bridge to "NIF Gross Anatomy" are
both empty and their intended source of truth is unknown (they are *not*
set up to be generated from cross-references, as most other bridges
are).

Empty and no source to fill them: they are useless and can be removed.
We have a bridge to BFO whose source of truth is currently unknown. We
align it with the other bridges by creating cross-references in the
Uberon edit file, so that the edit file becomes the source of truth and
the bridge can be (re-)generated in the same way as all the other
bridges.

Of note, several of the axioms in the bridge are redundant with axioms
already explicitly stated in Uberon, so we don't add cross-references
for those.
The Uberon bridge to NIFSTD contains 6 bridging axioms:
* 4 that link Uberon terms to NIFSTD terms that themselves have been
  deprecated in favour of... the very Uberon term that points to them;
* 2 that have non-resolvable IRIs.

I think it is fair to say this bridge is not useful.
It seems VAO and VSAO are the same thing. VAO is listed as an
alternative prefix for VSAO in the bioregistry, and seemingly all
cross-references to VSAO terms in the Uberon edit file have a VAO
equivalent in the VAO bridge.

Since 1) we do not know where the VAO bridge comes from and 2) we have
already decided to remove the VSAO bridge because VSAO is obsolete,
there is no reason to keep this bridge.
The CL bridge to KUPO has likely at some point been generated from
cross-references in CL (all links in the bridge have corresponding xrefs
in CL). It is currently _not_ re-generated because KUPO xrefs are not
listed in Uberon as being associated with a bridge, which we fix here.
The hdr-* files were used to prepend metadata at the top of some of the
bridge files. This will be done differently in the new pipeline and
those header files won't be necessary anymore.

Likewise for the footer file.
Dear reader, we have a lot to unpack here.

Very simply put, this commit replaces the old Perl script that generated
the bridge files from cross-references in Uberon and CL. Critically, it
does so by clearly separating the process into two independent steps:

* extracting the cross-references and making a SSSOM set out of them;
* generating the bridges out of the SSSOM set.

This allows for future flexibility as both steps can be modified
independently of the other, as long as the first step still results in a
SSSOM set and the second step still takes a SSSOM set as input.

(In particular, in the future we may switch to maintaining the mappings
directly in SSSOM, instead of using cross-references in the -edit file.
This would obviously make the first step above obsolete, but the second
step would be unaffected.)
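Sketched as hypothetical shell commands, the two steps look like the following; the exact command spellings, flags, and file names are assumptions (the real invocations live in the Makefile), and the only contract between the steps is the SSSOM/TSV file that the first one writes and the second one reads.

```shell
# Hypothetical shapes of the two steps; flags and file names are
# assumptions, not copied from the actual Makefile.
#
#   # step 1: cross-references -> SSSOM set
#   robot merge -i uberon-edit.obo -i cl.owl \
#         sssom:xref-extract --mapping-file tmp/local-mappings.sssom.tsv
#
#   # step 2: SSSOM set -> bridge files
#   robot sssom:sssom-inject --sssom tmp/local-mappings.sssom.tsv \
#         --ruleset bridge/bridges.rules
#
# The interface between the two steps is just the SSSOM file:
STEP_INTERFACE="tmp/local-mappings.sssom.tsv"
echo "interface between the two steps: $STEP_INTERFACE"
```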

Now for the details:

The first step is entirely done by the 'xref-extract' command of the
SSSOM plugin for ROBOT. We merge Uberon and CL (since we also need to
extract xrefs from CL, not only Uberon), then use xref-extract to
produce a SSSOM set. The command uses the 'treat-xrefs-as-...'
annotations found in the ontology to decide: 1) which cross-references
to extract (only the cross-references using a prefix declared in a
'treat-xrefs-as-...' annotation are considered), and 2) which mapping
predicate to use according to the following table:

* 'treat-xrefs-as-equivalent' -> skos:exactMatch
* 'treat-xrefs-as-has-subclass' -> skos:narrowMatch
* 'treat-xrefs-as-is_a' -> skos:broadMatch
* 'treat-xrefs-as-reverse-genus-differentia' -> semapv:crossSpeciesExactMatch
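The table above can be restated as a small lookup; this is only an illustration of the selection rule, not the plugin's actual code.

```shell
# Restatement of the predicate table above as a shell function
# (illustrative only; the real logic lives inside the SSSOM plugin).
predicate_for() {
  case "$1" in
    treat-xrefs-as-equivalent)                echo "skos:exactMatch" ;;
    treat-xrefs-as-has-subclass)              echo "skos:narrowMatch" ;;
    treat-xrefs-as-is_a)                      echo "skos:broadMatch" ;;
    treat-xrefs-as-reverse-genus-differentia) echo "semapv:crossSpeciesExactMatch" ;;
    *)                                        echo "(xref not extracted)" ;;
  esac
}
predicate_for treat-xrefs-as-equivalent   # prints skos:exactMatch
predicate_for treat-xrefs-as-is_a         # prints skos:broadMatch
```

Cross-references whose prefix is not declared in any 'treat-xrefs-as-...' annotation fall through to the default case and are simply not extracted.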

We use the same process to extract cross-references from ZFA, since ZFA
is the source of truth for the CL-to-ZFA mappings (but not for the
Uberon-to-ZFA mappings, for some reason).

In parallel, we fetch the SSSOM sets that are maintained by foreign
ontologies (for now, only FBbt), so that we end up with mapping sets
covering all the ontologies we want to bridge with.

The second step is done by the 'sssom-inject' command of the same SSSOM
plugin for ROBOT. Beyond the mapping sets themselves, that command
requires two additional files:

1) The bridge/bridges.rules file dictates which bridging axioms should
be generated for each mapping, depending on the mapping predicate and
the subject or object of the mapping. It also takes care of making sure
the mappings are in the expected orientation (Uberon/CL terms on the
object side), filtering out mappings to anything other than Uberon/CL (in
case the externally provided mapping sets contain more than what we
need), and checking that all mappings concern classes that do exist in
Uberon/CL and are not obsolete. That file is written in the ad hoc
"SSSOM/Transform-OWL" language, the documentation of which can be found
on the website of the SSSOM plugin.

2) The bridge/bridges.dispatch file dictates where to write the bridging
axioms produced by the bridges.rules file. It does so using a tagging
system: each rule in bridges.rules is tagged to indicate the bridge to
which the axioms generated by that rule belong. The dispatch file
associates each tag with an output file and additionally allows some
metadata to be specified for each output file.
Since the bridge-generating rules are highly repetitive, they can easily
be automatically generated by some macros. This keeps the editor-facing
bridge/bridges.rules.m4 file easier to read and modify (for example, to
add or remove a bridge, an editor simply has to add or remove a single
'BRIDGE(...)' line).
The seed.obo file was previously used as an intermediate to generate the
bridge file. It is no longer needed for this.

It was also used to generate the life-cycle-xrefs report, but there is
no reason not to use the -edit file as source for that step. The only
thing we need to be careful about is that seed.obo was pruned of
obsolete terms, while the -edit file obviously is not. But it is more
efficient to filter out obsolete terms directly in the SPARQL query,
instead of:

* using ROBOT to generate seed.obo;
* using (another instance of) ROBOT to reason over seed.obo;
* using a Perl script to hack the resulting OBO file and prune obsolete
  terms;
* using ROBOT again to run the SPARQL query.
The CUSTOM_BRIDGES variable needs to be declared before it is referenced
in the prerequisites of the $(TMPDIR)/bridges target, otherwise the
corresponding bridge would not be generated when that target is invoked.
When the bridges.rules file was moved to an M4-generated file, the
reference to that file in the call to 'sssom-inject' did not follow.
To generate the bridge, we need a version of CL both to extract the
cross-references from it, and to filter out any mappings to an
nonexistent or obsolete CL class. We also need a version of ZFA to
extract the cross-references from it.

This is problematic as the bridge generation pipeline is typically run
under MIR=false conditions, so we may not have local CL and ZFA mirrors
available (unless we are on an editor machine and the editor has
refreshed the mirrors prior to running the pipeline).

There are several possible solutions:

1) For the specific purpose of generating the bridges, always download
CL and ZFA from their online location, regardless of MIR (basically
bypassing the ODK mirroring system). This is what the previous version
of the pipeline was doing. This is not completely unreasonable as we are
not using CL or ZFA to inject any axioms into a final product here, we
are just extracting their cross-references (though one might argue that
those cross-references ultimately end up as axioms in products such as
composite-metazoan).

2) Make the pipeline dependent on the mirrors -- that is, skipping the
pipeline entirely under MIR=false. This would mean that the bridges
would only be re-generated when the pipeline is specifically run under
MIR=true -- notably, they would never be re-generated during the CI
checks, since those run under MIR=false.

3) Try to have it both ways: use a local mirror when available (either
because we are under MIR=true, or because the mirrors had been refreshed
previously), and fall back to downloading CL and ZFA from their online
locations otherwise.

This commit implements option 3. I don't particularly like it (it's an
ugly hack), but I would like input from other maintainers before
deciding between options 1 and 2 (or possibly another option that I
didn't see), so this will do for now.
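The "option 3" fallback can be sketched as follows; the file names, directory layout, and the commented-out download command are hypothetical, but the decision logic (prefer a local mirror, otherwise download) mirrors the description above.

```shell
# Sketch of the "option 3" fallback (file names are hypothetical):
# prefer a local mirror when one exists, otherwise fall back to a
# freshly downloaded copy, bypassing the ODK mirroring system.
workdir=$(mktemp -d) && cd "$workdir"
source_for() {
  if [ -f "mirror/$1.owl" ]; then
    echo "mirror/$1.owl"   # a refreshed mirror is available, use it
  else
    # curl -L -o "tmp/$1.owl" "http://purl.obolibrary.org/obo/$1.owl"
    echo "tmp/$1.owl"      # fall back to the downloaded copy
  fi
}
mkdir -p mirror tmp && touch mirror/cl.owl
source_for cl    # prints mirror/cl.owl (mirror present)
source_for zfa   # prints tmp/zfa.owl (no mirror, would download)
```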
Make sure the tmp/bridges file is created when we generate the bridges, to
avoid always re-generating them.
When BRI is set to false, skip the entire BRIDGES section. Just create
the tmp/bridges file so that rules that depend on that target are happy.
When generating the SSSOM set from the cross-references in the -edit
file (and in CL), ignore any cross-references to a foreign term if there
is more than one, and produce a report listing all such cases.

This requires version 0.4.2 of the SSSOM plugin, which is not released
yet -- but it will be before this branch is ready for prime time.
The mappings between FBbt and Uberon/CL are entirely provided by FBbt as
a SSSOM set already, so we should not try to extract FBbt
cross-references from Uberon's edit file.
Use new features of the SSSOM plugin (that will be available in the
upcoming 0.4.2 release) to reduce the amount of boilerplate in the
dispatch table:

* the filename of an entry is now relative to the directory containing
  the dispatch table;
* the ontology ID and version IRIs can be specified once and for all in
  a pseudo-entry named "__default".
We are already ensuring, when extracting the cross-references from
Uberon and CL, that we detect, ignore and report any case where the same
foreign term is mapped to more than one Uberon/CL term.

But we could still get duplicate mappings when we add the externally
provided sets to the mix -- either because an external set already
contains duplicates, or because of an overlap between the mappings
extracted from the cross-references and the external sets.

So we add another layer of defense during the bridge generation step,
where we instruct sssom-inject to drop any mapping with a cardinality of
*:n (any mapping where the subject is mapped to many objects).
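The extra defense layer amounts to a cardinality filter. A minimal sketch (sssom-inject implements this internally; the TSV content below is made up) drops every mapping whose subject appears more than once:

```shell
# Sketch of the *:n cardinality filter: keep only mappings whose subject
# is mapped to exactly one object. The data is made up for illustration.
workdir=$(mktemp -d) && cd "$workdir"
printf 'FBbt:00000001\tUBERON:0000001\n'  > mappings.tsv
printf 'FBbt:00000002\tUBERON:0000002\n' >> mappings.tsv
printf 'FBbt:00000002\tUBERON:0000003\n' >> mappings.tsv
# First pass counts each subject, second pass keeps only unique ones;
# FBbt:00000002 maps to two objects, so both of its rows are dropped.
awk -F'\t' 'NR==FNR { n[$1]++; next } n[$1] == 1' mappings.tsv mappings.tsv
```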
@gouttegd gouttegd self-assigned this Sep 7, 2023

gouttegd commented Sep 7, 2023

CI check explicitly cancelled since the new pipeline won’t work anyway with the latest ODK release. An ODK image with support for ROBOT plugins is required.

@gouttegd gouttegd added tech bridge-files Issues related to the generation of bridge files from Uberon to other ontologies. labels Sep 7, 2023
Document the bridges: what they are, what the different sources of
truth are (and thus, where editors should look to update them when
needed), and how they are generated.
@gouttegd

@matentzn (and anyone else interested in this PR): The last changes were to fix the conflicts with the current state of the master branch and to add the documentation about how the bridges are maintained. The PR does what I wanted it to do and is ready to be merged.

However, I am not entirely happy with something.

Currently, the bridge pipeline is disabled in QC (the test suite runs with BRI=false). This is annoying because it means that an editor could introduce a bogus mapping that would cause some unsatisfiability issues when merging Uberon and the bridges, and such an issue would not be caught by the test suite – it would only be caught at release time, which is when the bridge pipeline is enabled and the bridges are actually re-generated.

The obvious solution would be to, well, enable the bridge pipeline in QC. But this in turn would cause another problem. In addition to using the cross-references extracted from Uberon, the bridge pipeline fetches mappings from upstream sources (at least three sources, for now: the FBbt mapping set, the cross-references from CL, and the cross-references from ZFA) and uses them to produce the bridges. Doing that at QC time would mean that the test suite on any PR could fail for reasons that have nothing to do with the PR that is being tested, just because something changed in an upstream source (something that Uberon editors have no control over).

So I’d like to propose that we further refine the bridge pipeline to break it into two separate phases, to be run at different times:

  1. One phase in which we just fetch the remote mappings (again: from FBbt, CL, and ZFA) and commit them to the repository (we are actually already doing that for the FBbt mapping set).
  2. One phase in which we use all the collected mappings (which by now will all be available locally in the repository, without needing to fetch them from their upstream source) and the mappings locally extracted from Uberon to generate the bridges.

Phase 2 would be enabled under BRI=true, but phase 1 would only be enabled under BRI=true and IMP=true. So only phase 2 would be enabled during a QC or release run. This would ensure that a bogus, unsat-causing mapping introduced in Uberon would be caught during QC, while preventing QC from failing because of changes happening upstream.

The only downside I can see is that in this setup, external mappings would never be automatically refreshed. They would only be refreshed when an editor explicitly runs the bridge pipeline with BRI=true and IMP=true (something for which we could provide a convenient Make target, like make refresh-bridges). But I see that more as a benefit than a downside, actually. It’s similar to the way we now deal with imports and it ensures that external changes are only brought into the ontology by explicit action.

So:

  1. Do you agree with that idea?
  2. If so, would you rather have it implemented as part of this very PR, or do you prefer to merge this PR as it is and then refine the pipeline as proposed as part of another, distinct PR? I don’t mind either way; I have a slight preference for doing that as part of another PR as this one is already complicated enough, but I am not the one doing the review.

The bridging axiom used as example in the introduction comes from ZFA,
but the following paragraph refers to FBbt, which may cause needless
confusion.

Also add a missing closing quote in another example.
@matentzn

I think your idea of separating in two sub-phases is the right one. For completeness:

We could consider IMP to refer to regular imports as well as imported mappings. In some sense, you could even drop the BRI=true criterion from importing external mapping sets (I am not saying we should, but it may be worth contemplating subsuming mapping syncing under import or component syncing).

In any case, I don't mind you implementing that separation here, in this very PR! Whatever works for you.

@gouttegd

in some sense, you could even drop the BRI=true criterion from importing external mapping sets

Yes. That’s how I actually plan to implement it. But because the mapping sets are only needed to generate the bridge, the rules to fetch the external mapping sets would in effect only be called as dependencies of the rules that generate the bridges – and those rules are themselves only used under BRI=true.

So basically, we would have:

sh run.sh make tmp/bridges BRI=true IMP=false

-> re-generates the bridges only from locally available resources (i.e. the mapping sets that have previously been downloaded and committed to the repository)

sh run.sh make tmp/bridges BRI=true IMP=true

-> refreshes the mapping sets and re-generates the bridges

We could also have something like

sh run.sh make tmp/mappings BRI=false [or true, doesn’t matter] IMP=true

to only refresh all the mappings without immediately re-generating the bridges. Not sure there would be a need for that, but it would be easy to do.
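The BRI/IMP semantics described above can be restated as a small decision sketch; this is logic only, not the actual Makefile conditionals, and the echoed strings are illustrative.

```shell
# Sketch of what runs under each flag combination (logic only; the
# actual behavior is implemented as Makefile conditionals).
plan() {  # usage: plan <BRI> <IMP>
  if [ "$2" = true ]; then echo "refresh external mapping sets"; fi
  if [ "$1" = true ]; then echo "re-generate bridges"; fi
}
plan true false   # re-generates bridges from locally committed sets only
plan true true    # refreshes the mapping sets, then re-generates bridges
plan false true   # refreshes the mapping sets without touching bridges
```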

I don't mind you implementing that separation here, in this very PR! Whatever works for you.

OK, I’ll do it now and here then.

Do not extract the mappings from CL at the same time as we extract the
mappings from Uberon. They are maintained in an external repository (the
CL repository), so they should be treated differently, downloaded
separately and committed to the repository before being used in
downstream rules (instead of being written to a temporary file).

Same for the mappings extracted from ZFA.
Since the external mapping sets are a way to introduce assertions
maintained elsewhere into some of our released products (e.g.
composite-metazoan), we treat them as if they were imports, and refresh
them only when IMP is set to true. Under IMP=false, the bridges are
generated from the (unrefreshed) mapping sets that are committed in the
repository.
We treat the "custom bridges" (bridges that we use directly as they are
provided to us) the same way as we treat external mapping sets: we only
refresh them if we are willing to refresh external resources, which is
stated by the IMP variable.
All bridges are now generated in the OWL format, and the custom bridges
are provided by upstream already in the OWL format, so we can use them
directly in that format. We do not generate OBO versions by default. If
we do want an OBO version of them, such a version can be generated using
the default rule to produce an OBO bridge from an OWL bridge; there is
no need to have specific rules for the custom bridges.
Add two new helper commands (Make targets, really):

* refresh-mappings, to forcefully refresh the external mapping sets;
* refresh-bridges, to forcefully refresh the bridges (equivalent to
  calling 'make tmp/bridges', but slightly more user-friendly).
Now that the bridge pipeline no longer systematically refreshes external
mapping sets (it only does so upon IMP=true), we can enable that pipeline
during QC. This ensures that if a bogus mapping introduced within
Uberon itself causes unsatisfiability issues when Uberon is merged with
its bridges, the problem will be caught at QC time.
When extracting cross-references from ZFA, ensure that we drop any
duplicate mappings, as we already do for the cross-references from
Uberon itself and from CL.
Mention the 'refresh-bridges' Make target in the documentation about
bridges, and the fact that it should be invoked explicitly by
maintainers.
@gouttegd

@matentzn @anitacaron This is done and ready.

Under the new system, bridges are always re-generated (both at release time, as before, and now also at QC time), but only using local versions of the externally maintained mapping sets, unless the bridge pipeline runs under IMP=true (which is never the case at release and QC time).

To force refreshing the bridges completely (including downloading fresh versions of the externally maintained mapping sets), a maintainer must run refresh-bridges, which will both refresh the mappings and immediately rebuild the bridges. Alternatively, one can also run refresh-mappings to refresh the mappings only and then commit the refreshed mappings – they will then be used the next time the bridge pipeline is run, e.g. at the next release.

@gouttegd gouttegd removed the blocked blocked by another issue label Nov 14, 2023
@anitacaron

Alternatively, one can also run refresh-mappings to refresh the mappings only and then commit the refreshed mappings – they will then be used the next time the bridge pipeline is run, e.g. at the next release.

Can we have one command for refreshing imports and bridges? We're constantly refreshing imports before a release; we could add the bridges too.

@gouttegd

I had considered that but I thought you would prefer to refresh the imports and the bridges separately.

You’re the one doing the releases usually, so if you prefer a single command, then sure. :)

I’ll add a make refresh-external-resources target that will refresh both the imports and the bridges.

Add a convenience target to refresh the imports and the bridges at the
same time. That target could later be expanded to refresh any other
external resources if needed.
matentzn previously approved these changes Nov 15, 2023
@matentzn left a comment

Awesome!

@matentzn commented:

Note that this is not a valid mapping set as it is missing the license and the mapping_set_id. In sssom-py, we just inject a default license and ID when they are not there.

@matentzn commented:

(same for zfa)

@gouttegd commented Nov 15, 2023:

I can update sssom:xref-extract to behave similarly in a future version.

I note that sssom-py injects a default license of https://w3id.org/sssom/license/unspecified. Is that the expected, recommended value for an unknown license? There’s no mention of it in the spec and the IRI does not resolve to anything.

For what it’s worth I disagree with sssom-py’s behaviour regarding the mapping_set_id. It injects an auto-generated ID like this: https://w3id.org/sssom/mappings/2de2cb05-b347-42c8-bb68-48f97a8ccce1. Sure, it is compliant with the spec that says the ID “should be IRI, ideally resolvable”, but I think that when a set does not have an ID, creating one on the fly that looks like it is resolvable even though it is certain not to be resolvable is a bad idea.

@gouttegd commented:

Note also that the CL and ZFA mapping sets are committed to the repository so that we can be sure they will be locally available when we need to create the bridges (without having to fetch external resources), but they are not intended to be published. It’s not up to Uberon to publish those sets, which do not “belong” to Uberon. It should be up to CL and ZFA respectively.

@matentzn commented:

feel free to suggest better default values on the sssom issue tracker!

The bridges to MBA and DMBA are now maintained in a new location, so we
update the downloading URLs. We also remove the old OBO versions of
those bridges.