Port GFF3 I/O methods of scikit-allel to sgkit? #652
Replies: 4 comments 9 replies
-
+1, I use gff3_to_dataframe() often
-
We don't have anything concrete planned right now @patrick-koenig, but we'd definitely welcome contributions here! It looks to me like a fairly straightforward porting job: return an xarray dataset with defined variables rather than a pandas dataframe. We'd be happy to help with the details if you'd like to have a go at it (although folks are on vacation at the moment, so things might be slow).
-
It's usually fine just to read from GFF3 to pandas each time, in my experience; it takes maybe a couple of seconds to parse a GFF3 with ~10,000 genes (~100,000 features).

> On Mon, 30 Aug 2021, 12:22, Jerome Kelleher wrote:
> @alimanfoo - do we need a gff3_to_zarr or something, or are the files
> always small enough to make direct pandas conversion better?
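For context, a minimal sketch of that kind of direct pandas parsing. The column names follow the nine-column GFF3 spec; the helper name and the exact `read_csv` options are illustrative, not scikit-allel or sgkit API:

```python
import io

import pandas as pd

# The nine standard GFF3 columns, per the spec.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def gff3_to_pandas(path_or_buffer):
    """Read a GFF3 file into a pandas DataFrame."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        comment="#",        # skip ##gff-version and other directives
        names=GFF3_COLUMNS,
        na_values=".",      # GFF3 uses "." for missing values
    )


# Example with an in-memory GFF3 fragment.
gff3 = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EDEN\n"
    "chr1\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA0001;Parent=gene0001\n"
)
df = gff3_to_pandas(gff3)
```

The `attributes` column is left as raw `key=value;...` strings here; splitting selected attributes into their own columns is the extra step that scikit-allel's `gff3_to_dataframe()` performs.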
-
The biggest GFF3 files I've encountered are maybe 200,000 rows; it hasn't been a problem to read the whole thing into memory as a pandas dataframe.

> On Mon, 30 Aug 2021, 10:33, Tom White wrote:
> Can GFF/GFF3 files get very large? If so, and we don't want to read whole
> files into memory, then reading them into an xarray Dataset (which can be
> lazy) might be another way to approach this. For small files, you could
> easily call to_dataframe() to convert to a pandas dataframe if you wanted
> to process it with that API.
> In terms of implementation, one way would be to use dask.dataframe.read_csv
> <https://docs.dask.org/en/latest/generated/dask.dataframe.read_csv.html#dask.dataframe.read_csv>,
> which can read chunks of large files in parallel, before converting to an
> xarray Dataset. Unfortunately, there is no from_dask_dataframe() for an
> xarray Dataset yet, although there is some work in pydata/xarray#4659
> (which appears to have stalled). However, we do already have some code in
> sgkit for converting a Dask DataFrame to a dict of Dask arrays
> <https://github.com/pystatgen/sgkit/blob/718fe3b58b3da2d231bbfa6330a88bf209068c92/sgkit/io/utils.py#L14-L30>,
> so it might not be too hard to use something like this to convert to an
> xarray Dataset.
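As a rough illustration of the pandas-to-xarray direction discussed above, using an eager pandas read in place of dask.dataframe.read_csv. The variable names (`feature_*`) and the `features` dimension are hypothetical, not sgkit's defined variables:

```python
import io

import pandas as pd
import xarray as xr

# The nine standard GFF3 columns, per the spec.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]

gff3 = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EDEN\n"
    "chr1\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA0001;Parent=gene0001\n"
)
df = pd.read_csv(
    gff3, sep="\t", comment="#", names=GFF3_COLUMNS, na_values="."
)

# One Dataset variable per column, all sharing a "features" dimension.
ds = xr.Dataset(
    {f"feature_{name}": ("features", df[name].to_numpy())
     for name in GFF3_COLUMNS}
)

# For small files, round-trip back to pandas as suggested above.
round_tripped = ds.to_dataframe()
```

With dask, the per-column arrays would be Dask arrays instead of NumPy arrays, which is essentially what the sgkit helper linked above produces.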
-
Hi sgkit team,
Is it planned to port the GFF3 I/O utility methods of scikit-allel (especially the gff3_to_dataframe() method) to sgkit?