Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logoplots #534

Merged
merged 28 commits into from
Nov 24, 2024
Merged

Logoplots #534

merged 28 commits into from
Nov 24, 2024

Conversation

MKanetscheider
Copy link
Collaborator

@MKanetscheider MKanetscheider commented Aug 9, 2024

Closes #12
Added the file for sequence motif analysis via logoplots. It works as a wrapper function to the palmotif package. Also added palmotif now to the dependencies together with IPython. The latter was used to help with SVG visualization

  • CHANGELOG.md updated
  • Tests added (For bug fixes or new features)
  • Tutorial updated (if necessary)
  • also update TCR tutorial?

@grst
Copy link
Collaborator

grst commented Aug 11, 2024

The conda tests are failing because the dependency for the conda tests needs to be declared separately here:

If you add palmotif in that list it shoud be ok.

pyproject.toml Outdated Show resolved Hide resolved
Copy link
Collaborator

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the function signature, I'm not a big fan of one function that tries to do different things by accepting different, mutually exclusive parameters.

I'd prefer to split it up and have

logoplot_cdr3_motif_length
logoplot_cdr3_motif_gene_segment
logoplot_cdr3_motif_clonotype

where each has only the corresponding parameters. If you want to reuse code between those functions, you can do so by factoring it out into a helper function.

Alternatively, the function could just make a motif of all sequences in the anndata object and we could leave filtering to the user entirely, e.g.

# clonotype
logoplot_cdr3_motif(mdata[mdata.obs["clone_id"] == "42", :]
# length
logoplot_cdr3_motif(mdata[ir.get.airr(mdata, "VJ_1", "junction_aa").str.len() == 15, :])
...

We could also have a combintation of the two where there's one implementation that makes a plot with all sequences in the AnnData object, and the length, clonotype and gene_segment versions are wrappers around the former that do the filtering for the user.

What do you think?

src/scirpy/pl/_logoplots.py Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
@MKanetscheider
Copy link
Collaborator Author

Regarding the function signature, I'm not a big fan of one function that tries to do different things by accepting different, mutually exclusive parameters.

I'd prefer to split it up and have

logoplot_cdr3_motif_length
logoplot_cdr3_motif_gene_segment
logoplot_cdr3_motif_clonotype

where each has only the corresponding parameters. If you want to reuse code between those functions, you can do so by factoring it out into a helper function.

Alternatively, the function could just make a motif of all sequences in the anndata object and we could leave filtering to the user entirely, e.g.

# clonotype
logoplot_cdr3_motif(mdata[mdata.obs["clone_id"] == "42", :]
# length
logoplot_cdr3_motif(mdata[ir.get.airr(mdata, "VJ_1", "junction_aa").str.len() == 15, :])
...

We could also have a combintation of the two where there's one implementation that makes a plot with all sequences in the AnnData object, and the length, clonotype and gene_segment versions are wrappers around the former that do the filtering for the user.

What do you think?

Actually, I was also not a big fan of my solution here, but I wasn't sure if including several similar functions is preferable. I will adapt it accordingly once I rewrote the code so that it properly returns matplotlib plots instead of SVG.

I will experiment a little bit, but I am afraid that it's not possible to plot a logo of all sequences as they need to be aligned and this is just the case for junction sequences that have the same length.

@grst
Copy link
Collaborator

grst commented Aug 15, 2024

but I am afraid that it's not possible to plot a logo of all sequences as they need to be aligned and this is just the case for junction sequences that have the same length.

Unless you use the hamming distance, this wouldn't even be guaranteeded for a clonotype I think? I see two options

  • check if all sequences are of the same length and fail otherwise
  • perform a multiple-sequence alignment (there must be python packages, but I haven't used any so far) before generating the logo plot.

@grst
Copy link
Collaborator

grst commented Aug 15, 2024

perform a multiple-sequence alignment (there must be python packages, but I haven't used any so far) before generating the logo plot.

this might be something that palmotif is doing for us (while logomaker does not) and the reason why it requires parasail.

@MKanetscheider
Copy link
Collaborator Author

Hi there!
So after spending some time getting to know logomaker I decided that it's actually way easier to code with and offers better customization opportunities. Further, after you stated that you want to get rid of parasail as fast as possible, I think there is really no reason any more to still use palmotif.

So what I have done on my latest commit was that I rewrote the function according to the logomaker synthax/workflow and simplified it a lot. There is now no conditional syntax any more that applies a filter function on the Anndata object for the user. I thought a lot about it and made the decision that it's probably better to leave filtering entirely to the user (as suggested by you in a previous comment) as it's not that difficult and any user can do as he/she prefers.

Multiple sequence alignment was not possible in palmotif either and I am not sure if it's even useful for cdr3 sequences. In all the paper that I have encountered so far, they generate CDR3 motifs only from sequences of the same length and thus MSA was never performed. This function too expects sequences to be the same length and raises and error if that's not the case. Regarding filter approaches, I'd like to showcase how easy it's done in the upcoming tutorial :)

The only thing that bothers me is a future warning raised while using logomaker.alignment_to_matrix() as I am not sure if this will become a serious issue in the (near) future:
image

Please let me know what you think about this warning and my "new" approach to this function :)

@grst
Copy link
Collaborator

grst commented Aug 19, 2024

The only thing that bothers me is a future warning raised while using logomaker.alignment_to_matrix() as I am not sure if this will become a serious issue in the (near) future:

This will become an issue with future pandas releases. There's even an issue here:
jbkinney/logomaker#36

Unfortunately, logomaker seems quite unmaintained (last commit 5 years ago), but I couldn't find any better alternatives.
I suggest we still move forward with it. You can suppress the warning as described here: https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings

If the package really breaks eventually, we can still consider copying the code to scirpy, or forking the repo to scverse or something like that.

src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Show resolved Hide resolved
@grst
Copy link
Collaborator

grst commented Sep 10, 2024

I've seen one example in a paper where they

  • put logoplots of VJ / VDJ seqs next to each other
  • also made logos of VDJ gene usage.

See #12 (comment)

What do you think of these ideas?

@MKanetscheider
Copy link
Collaborator Author

I've seen one example in a paper where they

  • put logoplots of VJ / VDJ seqs next to each other
  • also made logos of VDJ gene usage.

See #12 (comment)

What do you think of these ideas?

I have also encountered this many times and I think it would be a nice functionality as the information usually complements each other quite well. However, I am not sure if and how it could work with our current logomaker implementation

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@grst grst mentioned this pull request Oct 17, 2024
6 tasks
Copy link

codecov bot commented Nov 1, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 16 lines in your changes missing coverage. Please review.

Project coverage is 81.38%. Comparing base (08e0cc3) to head (d30416c).
Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
src/scirpy/pl/_logoplots.py 30.43% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #534      +/-   ##
==========================================
- Coverage   81.43%   81.38%   -0.06%     
==========================================
  Files          49       50       +1     
  Lines        4213     4367     +154     
==========================================
+ Hits         3431     3554     +123     
- Misses        782      813      +31     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
src/scirpy/pl/_logoplots.py Outdated Show resolved Hide resolved
@grst
Copy link
Collaborator

grst commented Nov 23, 2024

Thanks @MKanetscheider, I think I can take it from here! I'll apply some final cosmetic touches and then merge it :)

@grst grst merged commit 0d0201a into scverse:main Nov 24, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sequence Logos
2 participants