Anvi'o backend for 'genome view' #1712

Open
wants to merge 1,071 commits into
base: master

meren
Member

@meren meren commented Apr 12, 2021

IF THIS NOTE IS HERE, DO NOT MERGE THIS BRANCH. IT IS BROKEN AND WILL RUIN THE ACTIVE DEVELOPMENT BRANCH.


This PR introduces some preliminary backend functionality in anvi'o for 'genome view', a LOOOOONG-awaited anvi'o feature to interactively study large genomic contexts.

@isaacfink21 and @matthewlawrenceklein are already working on the frontend of genome view using some mock data. The code in this branch will help them test things using real-world data, and figure out what they would like the backend to do for them when it comes to 'massaging' the data structures to their liking.

The most critical class here is AggregateGenomes in the anvio/genomedescriptions.py module. For a given set of external and/or internal genomes, the purpose of this class is to aggregate all sorts of information, which is then passed to the interactive world through bottle routes.

When there is an anvi'o pan database that includes all genome names found in the internal and/or external genomes files, AggregateGenomes also passes the gene clusters found in that database to the interface, so genes can be associated with one another.
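
As a rough illustration of the condition above, the check that a pan database covers every requested genome could look like the sketch below. The function name and the plain lists of genome names are hypothetical; this is not the actual AggregateGenomes code, only the set logic the paragraph describes.

```python
def genomes_missing_from_pan(genome_names, pan_genome_names):
    """Return, sorted, the requested genome names that the pan database lacks."""
    return sorted(set(genome_names) - set(pan_genome_names))


# Hypothetical names, for illustration only.
requested = ["g01", "g02", "g03"]
in_pan_db = ["g01", "g02"]

missing = genomes_missing_from_pan(requested, in_pan_db)
# Gene clusters are only useful when every requested genome is in the pan database.
use_gene_clusters = not missing
print(missing, use_gene_clusters)
```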

The class is simple in its design, and will have room for expansion based on our needs. I hope it makes sense so far.

Testing

The purpose of these examples is to make sure you can play with this code to connect its products with the frontend.

Cartoonishly Simple

Download this file, unpack it, and run this command in the resulting directory:

anvi-display-genomes -e external-genomes.txt --pan-db PAN.db

FYI, this is how I created this file:

anvi-self-test --suite pangenomics -o PAN
mkdir GENOME_VIEW_TEST_FILES
cp PAN/pan_test/0*db PAN/pan_test/external-genomes.txt GENOME_VIEW_TEST_FILES/
cp PAN/pan_test/TEST/TEST-PAN.db GENOME_VIEW_TEST_FILES/PAN.db
tar -zcvf GENOME_VIEW_TEST_FILES.tar.gz GENOME_VIEW_TEST_FILES/

Somewhat Realistic

Run these steps:

# download a fresh copy of the infant gut data
curl -L https://ndownloader.figshare.com/files/26218961 -o INFANT-GUT-TUTORIAL.tar.gz
tar -zxvf INFANT-GUT-TUTORIAL.tar.gz && cd INFANT-GUT-TUTORIAL

# subset E. faecalis genomes (instead of two distinct species) to simplify the problem
head -n 1 additional-files/pangenomics/external-genomes.txt > additional-files/pangenomics/Enterococcus_faecalis.txt
grep Enterococcus_faecalis additional-files/pangenomics/external-genomes.txt >> additional-files/pangenomics/Enterococcus_faecalis.txt

# generate a pangenome for E. faecalis (should take <5 mins)
anvi-gen-genomes-storage -e additional-files/pangenomics/Enterococcus_faecalis.txt -o Enterococcus-GENOMES.db
anvi-pan-genome -g Enterococcus-GENOMES.db --project-name Enterococcus -T 4

Now you can run this to get genome view data generated for 6 genomes that are in the same 'species' WITHOUT the pangenome:

anvi-display-genomes -e additional-files/pangenomics/Enterococcus_faecalis.txt

and WITH the pangenome:

anvi-display-genomes -e additional-files/pangenomics/Enterococcus_faecalis.txt -p Enterococcus/Enterococcus-PAN.db

After that, this is what you should find in your JavaScript console:

image

Next steps

Fill in the following two files:

  • anvio/data/interactive/genomeview.html
  • anvio/data/interactive/js/genomeview.js

:)

@isaacfink21
Contributor

Thanks @meren!

While converting from test data to the new real data from external genomes, I realized it might be beneficial to not only store gene IDs for each gene cluster, but also have a data structure that maps individual gene IDs to gene clusters. For example:

{g01: {0:"GC_00000001", 1:"GC_00000008"}, g02: {0:"GC_00000005", 1:"GC_00000001"}}

This way we wouldn't have to apply a find operation to the gene_associations dataset for each individual gene ID. Let me know if you think this is worth adding, or if it would be better to keep the data simpler :)
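
A minimal sketch of how such a lookup could be derived, assuming the existing data maps each gene cluster name to the genomes and gene IDs it contains. The input shape and the function name are assumptions made for illustration, not the structure anvi'o actually emits.

```python
from collections import defaultdict


def invert_gene_clusters(gene_clusters):
    """Build genome -> gene ID -> gene cluster name.

    Assumes the input is gene cluster name -> genome name -> list of gene IDs.
    """
    gene_to_cluster = defaultdict(dict)
    for cluster_name, genomes in gene_clusters.items():
        for genome_name, gene_ids in genomes.items():
            for gene_id in gene_ids:
                gene_to_cluster[genome_name][gene_id] = cluster_name
    return dict(gene_to_cluster)


# Hypothetical input that reproduces the example in this comment.
gene_clusters = {
    "GC_00000001": {"g01": [0], "g02": [1]},
    "GC_00000005": {"g02": [0]},
    "GC_00000008": {"g01": [1]},
}

lookup = invert_gene_clusters(gene_clusters)
print(lookup["g01"][1])
```

With the inverted dictionary, finding the cluster of a gene is a direct key access rather than a search through every cluster.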

@meren
Member Author

meren commented Apr 14, 2021

Absolutely! I will add this ASAP. :)

@meren
Member Author

meren commented Apr 14, 2021

This is now done, @isaacfink21. Please note that the new data structure is slightly different.

image

I hope this helps.

@meren
Member Author

meren commented Apr 14, 2021

As you can see from 530fe85, it took that much effort :) anvi'o has everything ready at all times! :p

@isaacfink21
Contributor

isaacfink21 commented Apr 14, 2021

Thanks! This was much easier than I expected :) I will work on using this to fix gene cluster alignment.

For future reference, I'm also attaching the screencaps from my meeting with @meren and @matthewlawrenceklein that illustrate some of the new features we plan to implement going forward. Some of these include:

  • 3 separate windows for scale, genome labels, and the genomes themselves
  • Shaded background between genes of the same gene cluster
  • Toggleable scale "rulers" over each genome
  • Align by gene cluster when there are 2+ genes in a cluster; align by other properties
  • Show similarity between genomes as % identity
  • Show a "graph" on each genome with GC content or other info
  • Editable gene labels

Screen Shot 2021-04-07 at 2 34 56 PM

Screen Shot 2021-04-07 at 2 29 29 PM

@matthewlawrenceklein
Contributor

@meren @isaacfink21 Not sure if this would be a major pain point on the backend or with what Isaac's already written, but would it be possible to change the genomes payload from a nested object to an array of objects? I believe that would make it a lot easier to sort genomes in the display (alphabetically, click + drag, etc).

There's also a very real chance I'm misunderstanding the data, so feel free to correct me : )
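
To make the request concrete, here is a small sketch of the reshaping, written in Python to match the backend. The key name `genome_name` and the `num_genes` field are hypothetical placeholders, not fields of the real payload.

```python
def arrayify_genomes(genomes):
    """Turn {name: {...}} into [{"genome_name": name, ...}, ...], keeping input order."""
    return [{"genome_name": name, **data} for name, data in genomes.items()]


# Hypothetical nested payload keyed by genome name.
genomes = {
    "g02": {"num_genes": 2},
    "g01": {"num_genes": 3},
}

genome_list = arrayify_genomes(genomes)

# A list can be reordered in place, e.g. alphabetically by genome name.
genome_list.sort(key=lambda genome: genome["genome_name"])
print([genome["genome_name"] for genome in genome_list])
```

The same transformation can be done on the frontend right after the data arrives; the point is only that an ordered list makes sorting and reordering straightforward.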

@meren
Member Author

meren commented May 24, 2021

Hey @matthewlawrenceklein, do you think it is possible to 'arrayify' the data when it arrives? Or is it a bad idea to do it that way?

@matthewlawrenceklein
Contributor

@meren totally doable on the front end. My only concern is that we'd want to make that change directly after the fetch req so that all front-end processes use the same array-ified genome dataset. I'll touch base with @isaacfink21 tomorrow and make sure that this change doesn't require a ton of refactoring.

Thinking a little further out - we would want (need?) to save these kinds of front-end manipulations to state, correct?

@meren
Member Author

meren commented May 24, 2021

Thinking a little further out - we would want (need?) to save these kinds of front-end manipulations to state, correct?

It is always great to think a little further out when it comes to these kinds of decisions! Thank you :)

Just like the other interactive interfaces, I think we need to have a state framework for genome view, too (so people can 'zoom' to a certain area, order their genomes in a particular way, and then, if they store that state, the same interface can greet them when they restart things the next day :)).
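
One way to picture such a state is a small record that round-trips through JSON. The sketch below is purely illustrative: the class name and its fields (genome order and a viewport range) are assumptions drawn from the examples in this comment, not an existing anvi'o state format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GenomeViewState:
    """Hypothetical snapshot of what a user would expect to get back."""
    genome_order: list = field(default_factory=list)
    viewport_start: int = 0
    viewport_stop: int = 0

    def to_json(self):
        # Serialize the state so it can be stored and reloaded later.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text):
        # Rebuild the state from its stored JSON form.
        return cls(**json.loads(text))


state = GenomeViewState(genome_order=["g02", "g01"], viewport_start=1000, viewport_stop=5000)
restored = GenomeViewState.from_json(state.to_json())
print(restored)
```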

@isaacfink21
Contributor

@matthewlawrenceklein Just reviewed the code and it shouldn't be a problem for the genomes object - so far it is always iterated through in order, and I can't think of a situation where direct key-value access would be necessary. I think the array format would be helpful :)

Contributor

@isaacfink21 isaacfink21 left a comment


Didn't mean to force-push here, sorry about that--I restored the previous commit since there were some recent changes I accidentally removed

@meren
Member Author

meren commented Aug 27, 2021

Lots of activity here. I'm hoping to continue working on the backend next week.

@isaacfink21
Contributor

@matthewlawrenceklein, @mschecht, and I decided in our last meeting to disable genome dragging and proportional scale in v0 and keep this functionality under the feature flag percentScale. Starting with 61394b9, percentScale should always be set to false.

To revisit this in the future, the genome ruler, background shades, and ADLs will need to be made selectable again, and the object:moving event listener re-enabled. As it stands, the proportional scale is mostly functional across the genome view interface, but it has several bugs with displaying the correct viewport upon selecting a region of the scale (scaleFactor is being calculated incorrectly), and bookmarks do not work with a proportional scale.

isaacfink21 and others added 30 commits May 18, 2023 00:30
Genome View: Genome Sliding and Gene Centering