Port GFF3 I/O methods of scikit-allel to sgkit? #652
Replies: 4 comments 9 replies
-
+1, I use gff3_to_dataframe() often
-
We don't have anything concrete planned right now @patrick-koenig, but we'd definitely welcome contributions here! It looks to me like a fairly straightforward porting job: return an xarray dataset with defined variables rather than a pandas dataframe. We'd be happy to help with the details if you'd like to have a go at it (although folks are on vacation at the moment, so things might be slow).
-
It's usually fine just to read from GFF3 to pandas each time, in my experience; it takes maybe a couple of seconds to parse a GFF3 with ~10,000 genes (~100,000 features).

> On Mon, 30 Aug 2021, 12:22, Jerome Kelleher wrote:
> @alimanfoo - do we need a gff3_to_zarr or something, or are the files
> always small enough to make direct pandas conversion better?
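For context, a minimal sketch of that kind of direct pandas parsing. The column names follow the nine-column GFF3 spec; the helper name and the exact `read_csv` options are illustrative, not scikit-allel or sgkit API:

```python
import io

import pandas as pd

# The nine standard GFF3 columns, per the spec.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def gff3_to_pandas(path_or_buffer):
    """Read a GFF3 file into a pandas DataFrame."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        comment="#",        # skip ##gff-version and other directives
        names=GFF3_COLUMNS,
        na_values=".",      # GFF3 uses "." for missing values
    )


# Example with an in-memory GFF3 fragment.
gff3 = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EDEN\n"
    "chr1\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA0001;Parent=gene0001\n"
)
df = gff3_to_pandas(gff3)
```

The `attributes` column is left as raw `key=value;...` strings here; splitting selected attributes into their own columns is the extra step that scikit-allel's `gff3_to_dataframe()` performs.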
-
The biggest GFF3 files I've encountered are maybe 200,000 rows; it hasn't been a problem to read the whole thing into memory as a pandas dataframe.

> On Mon, 30 Aug 2021, 10:33, Tom White wrote:
> Can GFF/GFF3 files get very large? If so, and we don't want to read whole
> files into memory, then reading them into an xarray Dataset (which can be
> lazy) might be another way to approach this. For small files, you could
> easily call to_dataframe() to convert to a pandas dataframe if you wanted
> to process it with that API.
> In terms of implementation, one way would be to use dask.dataframe.read_csv
> <https://docs.dask.org/en/latest/generated/dask.dataframe.read_csv.html#dask.dataframe.read_csv>,
> which can read chunks of large files in parallel, before converting to an
> xarray Dataset. Unfortunately, there is no from_dask_dataframe() for an
> xarray Dataset yet, although there is some work in pydata/xarray#4659
> (which appears to have stalled). However, we do already have some code in
> sgkit for converting a Dask DataFrame to a dict of Dask arrays
> <https://github.com/pystatgen/sgkit/blob/718fe3b58b3da2d231bbfa6330a88bf209068c92/sgkit/io/utils.py#L14-L30>,
> so it might not be too hard to use something like this to convert to an
> xarray Dataset.
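As a rough illustration of the pandas-to-xarray direction discussed above, using an eager pandas read in place of dask.dataframe.read_csv. The variable names (`feature_*`) and the `features` dimension are hypothetical, not sgkit's defined variables:

```python
import io

import pandas as pd
import xarray as xr

# The nine standard GFF3 columns, per the spec.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]

gff3 = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EDEN\n"
    "chr1\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA0001;Parent=gene0001\n"
)
df = pd.read_csv(
    gff3, sep="\t", comment="#", names=GFF3_COLUMNS, na_values="."
)

# One Dataset variable per column, all sharing a "features" dimension.
ds = xr.Dataset(
    {f"feature_{name}": ("features", df[name].to_numpy())
     for name in GFF3_COLUMNS}
)

# For small files, round-trip back to pandas as suggested above.
round_tripped = ds.to_dataframe()
```

With dask, the per-column arrays would be Dask arrays instead of NumPy arrays, which is essentially what the sgkit helper linked above produces.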
-
Hi sgkit team,
Is it planned to port the GFF3 I/O utility methods of scikit-allel (especially the gff3_to_dataframe() method) to sgkit?