Sars-cov-2 alignment storage #175

Merged
jeromekelleher merged 1 commit into sgkit-dev:main from sc2-fasta
Nov 28, 2024

Conversation

jeromekelleher
Contributor

First pass at #172. Annoyingly, the raw Viridian data isn't aligned to the reference, so we can't use the files directly. I'm using the sc2ts store for now, just to see how well the Zarr approach works.

Just opening this so I don't forget about it, it may or may not be worth including.

@jeromekelleher
Contributor Author

jeromekelleher commented Nov 21, 2024

I think this is worth including. Top level ideas:

  • Can store alignments just as well; we use the IUPAC codes rather than ref and alt as alleles
  • Storing the 3.9M alignments as FASTA requires about 111G of space
  • Metadata must also be managed, separate TSV is 1.4G of text (but is a superset)
  • Combined data stored in Zarr Zipstore is 329M
  • Can extract haplotype (column) from middle in 160ms
  • Can extract variant (row) from middle in 216ms
  • Can perform computation over entire matrix (compute missingness per sample) in 3 min
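
As a rough illustration of the layout described in the list above, here is a toy, pure-Python sketch of a variants-by-samples matrix that uses IUPAC characters as alleles. It is not the sc2ts or Zarr code from this PR: the sample names, sequences, helper names, and the choice to treat `N` and `-` as missing are all assumptions made for the example.

```python
# Toy illustration only: three aligned sequences of equal length.
# Sample names and sequences are made up for this example.
alignments = {
    "sample_A": "ACGTN",
    "sample_B": "ACRT-",
    "sample_C": "NCGTA",
}

samples = list(alignments)
num_sites = len(next(iter(alignments.values())))
assert all(len(seq) == num_sites for seq in alignments.values())

# Variant-major matrix: one row per alignment position, one column per sample.
# Each cell holds the byte value of the IUPAC character seen at that position.
matrix = [
    bytes(ord(alignments[s][site]) for s in samples)
    for site in range(num_sites)
]


def get_variant(site):
    """Return the row for one position: the allele in every sample."""
    return matrix[site].decode("ascii")


def get_haplotype(sample_index):
    """Return the column for one sample: its full aligned sequence."""
    return "".join(chr(row[sample_index]) for row in matrix)


# Assumption: 'N' and '-' count as missing; the notebook may define this differently.
MISSING = frozenset(b"N-")


def missingness_per_sample():
    """Count missing calls per sample by scanning the whole matrix once."""
    counts = [0] * len(samples)
    for row in matrix:
        for j, code in enumerate(row):
            if code in MISSING:
                counts[j] += 1
    return dict(zip(samples, counts))
```

In this layout a variant is a single row lookup, while a haplotype touches one cell in every row; a chunked array store can serve both access patterns by reading only the chunks that overlap the requested row or column.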

Also show some ways we can access metadata and do useful things in the notebook. I guess we'll need to give some indication of how long it takes to get access to the data using pyfaidx etc. There's no way to get a variant from FASTA though, which is an important issue.
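
To make the variant-access problem concrete, here is a minimal stdlib-only sketch (it does not use pyfaidx, and the FASTA text and function names are invented for illustration). Pulling out a single alignment position from sample-major FASTA means visiting every record.

```python
import io

# Made-up FASTA text standing in for a large alignment file.
fasta_text = """>sample_A
ACGTN
>sample_B
ACRT-
>sample_C
NCGTA
"""


def iter_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def variant_from_fasta(handle, site):
    """Collect the base at one position by reading every record in turn."""
    return {name: seq[site] for name, seq in iter_fasta(handle)}


column = variant_from_fasta(io.StringIO(fasta_text), 2)
```

An index can avoid parsing whole records, but the work for one position still grows with the number of samples, which is why a per-variant lookup is awkward in this format.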

The code for doing the conversion will move to the sc2ts repo, and I'd hope to make the Zipstore dataset available on Figshare (as it's a pretty useful resource).

Update the notebook to compute diversity

Remove unused stuff
@jeromekelleher
Contributor Author

Merging this, as it's pretty good. Some of the compression based analysis hasn't worked out as I had expected, so needs a bit more work (or we might drop it).

@jeromekelleher jeromekelleher marked this pull request as ready for review November 28, 2024 14:57
@jeromekelleher jeromekelleher merged commit b0e1366 into sgkit-dev:main Nov 28, 2024
1 check passed
@jeromekelleher jeromekelleher deleted the sc2-fasta branch November 28, 2024 14:57