Experimental yaml input format #1842

Draft
wants to merge 2 commits into
base: main

jeromekelleher
Member

@jeromekelleher jeromekelleher commented Sep 18, 2021

This is an experiment to see what a yaml/json input format (building on demes) would look like. It mostly works, I think, apart from the basic confusion about the direction of time. It would be easy to extend this to cover things like recombination maps.

Here's an example input file:

demography:
  # This is an **embedded** Demes yaml model.
  time_units: generations
  demes:
    - name: X
      epochs: [{end_time: 1000, start_size: 2000}]
    - name: A
      ancestors: [X]
      epochs: [{start_size: 2000}]
    - name: B
      ancestors: [X]
      epochs: [{start_size: 2000}]

# Note: We are **referring** to the Demes model here.
samples: {A: 100, B: 100}
sequence_length: 100000
recombination_rate: 1e-8
ploidy: 1
model: hudson

The idea is that we embed the Demes yaml description within the larger simulation configuration context. When we're parsing the input yaml, we just hand off the parsing of the demography object to demes-python, which will do all the hard work for us.
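
The hand-off described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `split_config` and `deme_names` are hypothetical names, the config is written as JSON so that the block needs only the standard library, and a real implementation would pass the embedded block to demes-python instead of the stand-in parser used here.

```python
import json

CONFIG_TEXT = """
{
  "demography": {
    "time_units": "generations",
    "demes": [
      {"name": "X", "epochs": [{"end_time": 1000, "start_size": 2000}]},
      {"name": "A", "ancestors": ["X"], "epochs": [{"start_size": 2000}]},
      {"name": "B", "ancestors": ["X"], "epochs": [{"start_size": 2000}]}
    ]
  },
  "samples": {"A": 100, "B": 100},
  "sequence_length": 100000,
  "recombination_rate": 1e-8,
  "ploidy": 1,
  "model": "hudson"
}
"""


def split_config(config, parse_demography):
    """Separate the embedded demography from the simulator-specific options."""
    options = dict(config)  # shallow copy so the caller's dict is not modified
    demography_dict = options.pop("demography")
    # The embedded block is passed on untouched; this function never inspects it.
    demography = parse_demography(demography_dict)
    return demography, options


def deme_names(demography_dict):
    """Stand-in parser (assumption): it only collects the deme names."""
    return [deme["name"] for deme in demography_dict["demes"]]


demography, options = split_config(json.loads(CONFIG_TEXT), deme_names)
# The samples refer to demes by name, so check that every reference resolves.
unknown = set(options["samples"]) - set(demography)
```

The point of the design is that the outer parser only needs to know the key under which the model lives; everything inside that key stays the responsibility of the Demes parser.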

I'm not suggesting this as a general specification for popgen simulations, I just want to illustrate the power that we get from keeping Demes simple and self-contained. To me, the ability to make a simple configuration file for a specific simulator like this is a powerful argument for not over-specifying the standard. The more bells and whistles we add to the spec the less likely it is that it'll be compatible across different simulators.

Any thoughts @molpopgen @grahamgower @apragsdale? I've been talking about simulation configurations being able to "refer" to elements of the Demes model for a while, and this is an attempt to make things concrete. (I guess we shouldn't get into detailed discussions about Demes itself here though: if someone wants to follow up, maybe create an issue on the spec repo to discuss?)

@jeromekelleher jeromekelleher marked this pull request as draft September 18, 2021 15:31
@codecov

codecov bot commented Sep 18, 2021

Codecov Report

Merging #1842 (2f8956e) into main (6a9c603) will decrease coverage by 0.18%.
The diff coverage is 50.98%.

@@            Coverage Diff             @@
##             main    #1842      +/-   ##
==========================================
- Coverage   90.46%   90.28%   -0.19%     
==========================================
  Files          20       21       +1     
  Lines       10682    10733      +51     
  Branches     2167     2174       +7     
==========================================
+ Hits         9664     9690      +26     
- Misses        572      597      +25     
  Partials      446      446              
Flag Coverage Δ
C 90.28% <50.98%> (-0.19%) ⬇️
python 96.89% <50.98%> (-0.63%) ⬇️

Impacted Files Coverage Δ
msprime/json_input.py 45.16% <45.16%> (ø)
msprime/cli.py 96.94% <52.94%> (-1.58%) ⬇️
msprime/mutations.py 98.59% <100.00%> (+0.02%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update abe6116...2f8956e.

@petrelharp
Contributor

Two thoughts:

  • it seems like this could be a nice bridge for the folks who aren't comfortable in python? It might be worth finding some of those people to test it out on.
  • Perhaps this should all be within an ancestry: block, to be followed by a mutations: block (and then maybe an output: block?) for a more complete specification?

Comment on lines +41 to +43
# TODO nasty going back to JSON here - can we make a demes.fromdict()
# function to do this directly?
demes_model = demes.loads(json.dumps(demes_dict), format="json")
Member

Member Author

Aha! Thanks @grahamgower.

@grahamgower
Member

I agree with @petrelharp that it maybe needs to have separate ancestry: and mutations: blocks. But then it doesn't neatly align with the current CLI msp ancestry subcommand. Also, maybe the demography could be either inline or refer to a file path?
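
One way the "inline or file path" idea could work is sketched below. The function name `resolve_demography` and its `base_dir` parameter are hypothetical, and the referenced file is assumed to be JSON purely so that the example runs with the standard library alone; a real version would presumably hand the path to demes-python.

```python
import json
import tempfile
from pathlib import Path


def resolve_demography(value, base_dir="."):
    """Return the demography as a dict, whether given inline or as a path."""
    if isinstance(value, dict):
        return value  # inline model: use it as-is
    if isinstance(value, str):
        # Relative paths are resolved against the config file's directory.
        with open(Path(base_dir) / value) as f:
            return json.load(f)
    raise TypeError(
        f"demography must be a mapping or a path, got {type(value).__name__}"
    )


inline = {
    "time_units": "generations",
    "demes": [{"name": "X", "epochs": [{"start_size": 2000}]}],
}
with tempfile.TemporaryDirectory() as tmp:
    # Write the same model to a file and load it back through the path form.
    (Path(tmp) / "model.json").write_text(json.dumps(inline))
    from_file = resolve_demography("model.json", base_dir=tmp)
from_inline = resolve_demography(inline)
```

Dispatching on the value's type keeps a single `demography` key in the config, at the cost of making a bare string always mean "path".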

@jeromekelleher
Member Author

Thanks, great points @petrelharp and @grahamgower! I think a combined ancestry and mutation format is the right approach, and yes, this would be a good bridge for people who aren't comfortable with Python.

WRT the CLI, I've already created an msp ancestry-yaml subcommand as a quick way of getting something working without having to worry about the semantics of msp ancestry. So, we just need a command to run a simulation from a yaml config. Unfortunately msp simulate is already used as the legacy interface. We could do msp yaml?

@jeromekelleher
Member Author

Update: I've added the proposed mutations/ancestry sections and the config looks like this now:

ancestry:
  sequence_length: 100000
  recombination_rate: 1e-8
  samples: {A: 100, B: 100}
  ploidy: 1
  model: hudson
  demography:
    time_units: generations
    demes:
      - name: X
        epochs: [{end_time: 1000, start_size: 2000}]
      - name: A
        ancestors: [X]
        epochs: [{start_size: 2000}]
      - name: B
        ancestors: [X]
        epochs: [{start_size: 2000}]

mutations:
  rate: 1e-8
  model: blosum62
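
A config of this shape could be driven by a two-stage dispatcher along the following lines. This is only a sketch: `run_from_config` is a hypothetical name, the two simulation functions are injected as stand-ins that record the options they receive, and the conversion of the embedded `demography` block into a simulator object is left out entirely.

```python
def run_from_config(config, sim_ancestry, sim_mutations):
    """Run the ancestry stage, then the optional mutations stage."""
    # Every key in the ancestry block becomes a keyword argument.
    ts = sim_ancestry(**config["ancestry"])
    mutations = config.get("mutations")
    if mutations is not None:
        # The mutations stage takes the ancestry result as its first argument.
        ts = sim_mutations(ts, **mutations)
    return ts


# Stand-ins (assumptions) that only record which options they were given.
def fake_sim_ancestry(**kwargs):
    return {"ancestry_options": sorted(kwargs)}


def fake_sim_mutations(ts, **kwargs):
    return {**ts, "mutation_options": sorted(kwargs)}


config = {
    "ancestry": {
        "sequence_length": 100000,
        "samples": {"A": 100, "B": 100},
        "ploidy": 1,
    },
    "mutations": {"rate": 1e-8, "model": "blosum62"},
}
result = run_from_config(config, fake_sim_ancestry, fake_sim_mutations)
print(result)
```

Because the mutations block is optional here, a config containing only an ancestry block would still run and simply skip the second stage.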

To make this fully general we'd need to:

  1. Add support for reading RateMaps from dictionaries (easy).
  2. Support parsing Ancestry and Mutation models from dictionaries (should be pretty easy; this is basically what we turn the classes into anyway). Since the ancestry models use a duration, we actually sidestep the awkward time business.
  3. Think properly about time and implement start_time and end_time accordingly (but these are pretty niche options, so they could just be dropped).
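
As a rough illustration of the first item, a rate given in the config could be accepted either as a single number or as a dictionary of breakpoints and per-interval rates. Everything named here is an assumption made for the sketch: `PiecewiseRate` is a stand-in container rather than msprime's `RateMap`, and the `position`/`rate` keys are simply one plausible layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiecewiseRate:
    """Stand-in container: n + 1 breakpoints delimiting n constant rates."""

    position: tuple
    rate: tuple


def rate_from_config(value, sequence_length):
    """Accept a single number or a {"position": [...], "rate": [...]} dict."""
    if isinstance(value, (int, float)):
        # A scalar means one uniform rate over the whole sequence.
        return PiecewiseRate((0.0, float(sequence_length)), (float(value),))
    position = tuple(float(x) for x in value["position"])
    rate = tuple(float(x) for x in value["rate"])
    if len(position) != len(rate) + 1:
        raise ValueError("need exactly one more position than rate")
    if position[0] != 0 or position[-1] != sequence_length:
        raise ValueError("positions must run from 0 to sequence_length")
    if any(a >= b for a, b in zip(position, position[1:])):
        raise ValueError("positions must be strictly increasing")
    return PiecewiseRate(position, rate)


uniform = rate_from_config(1e-8, 100000)
two_step = rate_from_config(
    {"position": [0, 50000, 100000], "rate": [1e-8, 2e-8]}, 100000
)
```

Accepting both forms would let the simple `recombination_rate: 1e-8` entry in the example config keep working unchanged.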

@apragsdale
Contributor

This looks really nice to me. Agree that ancestry/mutations/output blocks make a lot of sense, and those updates look clean. If I'm reading the changes correctly, you can place any valid argument to sim_ancestry and sim_mutations into this yaml? So specify seeds, or more complicated models (e.g. dtwf then switch to hudson), etc. For an "output" block, it might be nice to be able to specify "trees" vs "vcf", plus all the bells and whistles that go with those. Not sure how general you intend this input approach to be.

Overall, I think this would be a nice middle ground between Python scripting and the CLI (which can be confusing for some). Looking forward to discussing more today in a bit.

@molpopgen
Member

I like the approach overall. I think embedding the demes bits is quite elegant.


5 participants