Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding genetic map for honeybee #1399

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

janaobsteter
Copy link
Contributor

I've added the tar.gz version for now, since this is how it says in the instructions.

The genetic map was created from a map on microsatellites as described here (this is David Wragg's bitbucket): https://bitbucket.org/scriptBee/hapmap-pilot/src/main/GeneticMaps/

@codecov
Copy link

codecov bot commented Oct 12, 2022

Codecov Report

Base: 99.94% // Head: 99.81% // Decreases project coverage by -0.13% ⚠️

Coverage data is based on head (26641b5) compared to base (1ea8810).
Patch coverage: 0.00% of modified lines in pull request are covered.

❗ Current head 26641b5 differs from pull request most recent head 79da94a. Consider uploading reports for the commit 79da94a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1399      +/-   ##
==========================================
- Coverage   99.94%   99.81%   -0.14%     
==========================================
  Files         109      110       +1     
  Lines        3789     3812      +23     
  Branches      515      522       +7     
==========================================
+ Hits         3787     3805      +18     
- Misses          1        6       +5     
  Partials        1        1              
Impacted Files Coverage Δ
stdpopsim/catalog/ApiMel/genetic_maps.py 0.00% <0.00%> (ø)
stdpopsim/genomes.py 100.00% <0.00%> (ø)
stdpopsim/ext/selection.py 100.00% <0.00%> (ø)
stdpopsim/catalog/AraTha/species.py 100.00% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@petrelharp
Copy link
Contributor

petrelharp commented Oct 13, 2022

Hm, okay - now we need the python code? can you get that going?

@igronau
Copy link
Contributor

igronau commented Oct 19, 2022

@janaobsteter - do you think you can finalize this within the next couple of days? If so, we can update fig 1 in the "adding species" manuscript to show the genetic map for ApiMel

@janaobsteter
Copy link
Contributor Author

@igronau , hi! Probably yes - I just have another question: should the genetic map provide genetic positions for every base in the reference genome? Thanks!

@igronau
Copy link
Contributor

igronau commented Oct 19, 2022

According to what I see in the examples in the docs, you don't have to specify a rate for every position.

@igronau
Copy link
Contributor

igronau commented Oct 20, 2022

LGTM. @petrelharp - anything else we need to check here before this is merged?

@janaobsteter
Copy link
Contributor Author

@igronau I just have to finish the description! Will do that today!

@janaobsteter
Copy link
Contributor Author

Ok, I've updated the description! However, these maps are a bit redundant - they were made from data on crossover events, and recombination rate was assumed stable between crossovers. Hence, the maps would actually really need to include the sites where the recombination rate changes. Currently, they are expanded to fit a certain vcf, but I can strip them down (if you agree).

@igronau
Copy link
Contributor

igronau commented Oct 20, 2022

I'm not sure because I haven't added genetic maps to stdpopsim. Maybe we should consult with @petrelharp.

@lauterbur
Copy link

I had no idea honeybee recombination rates were so high, that's neat.

@janaobsteter By expanded to fit a certain VCF, do you mean that there are successive positions in the files that have the same recombination rate only so each site in that VCF was given a row? If that's the case, I would think that it could be stripped back down to just the sites where the rate changes (to fit the msprime specification from the link Ilan gave, "The value in the rate column in a given line gives the constant rate between the physical position in that line (inclusive) and the physical position on the next line (exclusive).") And it might be a good idea for the sake of file size. But @petrelharp (or @jeromekelleher ?) would be able to say more definitely.

(Also, I think it's expecting a header line for each rate file?)

@janaobsteter
Copy link
Contributor Author

@lauterbur , yes, that's exactly what has been done (several lines have the rate) - that's why I want to strip it down. Will do this! I will also add the column names.

@janaobsteter
Copy link
Contributor Author

@igronau , everything has been added! I've also added the new reduced genetic maps (although I need to remove them and send them separately - atm they are in the ApiMel folder).

@lauterbur
Copy link

@nspope and @petrelharp I'm not sure what failing codecov/patch and codecov/project means. Do you think this is ready to be merged?

@nspope
Copy link
Collaborator

nspope commented Oct 24, 2022

I think you need to import the contents of the new genetic_maps.py in catalog/ApiMel/__init__.py. (Codecov is complaining because none of the added changes are actually getting run). You can just copy the relevant line from DroMel. After that, the genetic map should be registered when stdpopsim is imported.

@nspope
Copy link
Collaborator

nspope commented Oct 24, 2022

And I think @andrewkern has to upload the maps to AWS for everything to work correctly?

@janaobsteter
Copy link
Contributor Author

@nspope , I've edited the __init__.py file. Regarding the zip with the maps - they are currently in the git in ApiMel folder. Do I need to delete them and send them to @andrewkern separately?

@nspope
Copy link
Collaborator

nspope commented Oct 25, 2022

Yeah @janaobsteter, I think it's easiest to remove the tarball from this PR and send it to @andrewkern. Once the maps are on AWS, I can rerun the tests to make sure it's loading OK.

@janaobsteter
Copy link
Contributor Author

@nspope, I've removed the tarball and send it to Andrew.

@nspope
Copy link
Collaborator

nspope commented Oct 26, 2022

Thanks @janaobsteter!

@andrewkern
Copy link
Member

@janaobsteter @nspope -- i've gone ahead and uploaded the ApiMel map to AWS. It can be found at the following URL:

https://stdpopsim.s3.us-west-2.amazonaws.com/genetic_maps/ApiMel/Liu2015_litfover_maps.tar.gz

Once the URL and the checksum have been updated in the code, tests should pass.

@nspope
Copy link
Collaborator

nspope commented Oct 26, 2022

Thanks @andrewkern! @janaobsteter could you please update the sha256sum and url?

@janaobsteter
Copy link
Contributor Author

janaobsteter commented Oct 26, 2022

@nspope , sure - just to check - how do I obtain the sha?

@andrewkern
Copy link
Member

you would use the sha256sum unix tool. In this case I get

%  sha256sum ~/Desktop/Liu2015_litfover_maps.tar.gz
551a1819fb007573b6fa00a088493964dacd089a5d5863cf03b7b73eac0bd405  /Users/adk/Desktop/Liu2015_litfover_maps.tar.gz

@janaobsteter
Copy link
Contributor Author

@andrewkern , ok! This is the sha that I've put in the file - however, the tests are still failing ...

@nspope
Copy link
Collaborator

nspope commented Oct 26, 2022

Hi @janaobsteter, it looks like the directory structure in that tarball doesn't match file_pattern, and there's some extraneous hidden files/directories.

$ tar tzf Liu2015_litfover_maps.tar.gz | head
#Users/adk/Desktop/._Liu2015_litfover_maps
#tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
#tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.macl'
#Users/adk/Desktop/Liu2015_litfover_maps/
#tar: Users/adk/Desktop/Liu2015_litfover_maps/._Amel_HAv3.1_lifted_chrCM009935.2.txt
#Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
#Users/adk/Desktop/Liu2015_litfover_maps/Amel_HAv3.1_lifted_chrCM009935.2.txt
#Users/adk/Desktop/Liu2015_litfover_maps/._Amel_HAv3.1_lifted_chrCM009937.2.txt
#tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'

e.g. this'll unpack into Users/adk/Desktop/Liu2015_litfover_maps/<files>.

Probably would be good to clean it up and just have a single directory 'Liu2015_litfover_maps' without the hidden files, then resend to Andy; rather than add a really long path to file_pattern.

@nspope
Copy link
Collaborator

nspope commented Oct 26, 2022

Whoops @janaobsteter looks like this file path screw-up happened on Andy's end -- so nevermind, there's nothing you need to do; @andrewkern or I will fix it and re-upload. Sorry for the confusion. This stuff is a bit fiddly, unfortunately.

@nspope
Copy link
Collaborator

nspope commented Oct 26, 2022

Darn, looks like there's some formatting issues with these maps. I think it's because (1) there's no header line (so, the first line is being silently skipped); (2) there's no last line (with a 0 rate). The format should match the example in the docs for msprime.RateMap.read_hapmap

@nspope
Copy link
Collaborator

nspope commented Oct 27, 2022

Hmm, I think there are issues with the genetic map positions as well -- for example take these two lines from Amel_HAv3.1_lifted_chrCM009931.2.txt:

CM009931.2  920891  41.8604651162791  38.5488837209301                                                                         
CM009931.2  1094157 23.2558139534884  44.0501395348836

The genetic map position (4th column) for the second row is 44.050139, but shouldn't it be,

map_position[row_2] = map_position[row_1] + (phys_position[row_2] - phys_position[row_1]) * rate[row_1]/1e6
  = 38.5488837209301 + (1094157 - 920891) * 41.8604651162791 /  1e6
  = 45.80188

For instance, the example from the msprime.RateMap.read_hapmap docstring includes the lines,

chr10       50009         0.159        0.002947
chr10       52147         0.1574       0.003287

Using the equation above, the map position for the second row should be:

0.002947 + (52147 - 50009) * 0.159 /  1e6 = 0.003286942

which matches what was given. So, the last line always should always have rate 0 (which isn't the case for these ApiMel maps).

(I think I'm doing this right?? Hapmap format always confuses me)

@lauterbur
Copy link

Do folks (@nspope @petrelharp ?) think it's worth waiting for this to be resolved before submitting the adding species ms to bioRxiv? If it will be ready in the next few days, especially if it won't make it into a release until the selection paper release otherwise, we can wait. If it will take longer we'd like to go ahead and put it up on bioRxiv and do the version release.

@nspope
Copy link
Collaborator

nspope commented Oct 27, 2022

To keep others in the loop-- @janaobsteter sent me the crossovers and the script used to compute rates from these. I'll try to figure out what's going wrong, but I'm not sure how long that'll take.

@igronau and @lauterbur, it looks like you wanted to add this map to the "adding species" preprint? Should we just go ahead with the preprint and put this map in a later release?

@lauterbur
Copy link

lauterbur commented Oct 27, 2022 via email

@nspope
Copy link
Collaborator

nspope commented Oct 27, 2022

@lauterbur My suggestion is to go ahead with the preprint. I don't know how long it'll take to work out this map (it'll be next week before I can really take a look).

@andrewkern
Copy link
Member

yes I agree @nspope. @lauterbur we need the preprint to go out before the release so that the documentation might point to the preprint DOI

@lauterbur
Copy link

That's the plan, and it sounds like the timing doesn't work out to include the honeybee map in the preprint.

@petrelharp
Copy link
Contributor

Hi, folks - it'd be great to get this map in here! But, looking at what's going on - our policy has been that we want to include "published" things, so that (a) it's well documented what's going into stdpopsim, and (b) the burden isn't on us of figuring out what to include and what's useful. It sounds to me like you're basically inferring the map, @janaobsteter? And, for us to make sure that the map looks reasonable, we'd really want to have a short writeup of what you did, with some diagnostic plots? So, here's my suggestion: do a short (2pg?) writeup of what you're doing, with diagnostic plots, and post it to bioRxiv, and then we can cite that (and maybe you can publish it somewhere, saying "and also this is in stdpopsim") - what do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants