Persistent storage of matrices that enables quick indexed lookup #9
Comments
Questions from the group at Tuesday night discussion: Do you anticipate complete randomness in the subselection (i.e. totally user selected), or is there some structure that governs what might be asked for? IOW, is chunking an option? --> Perhaps a cached or database format might be more appropriate? Or microservice? An advantage of a microservice would be the ability to respond to demand.
Yes, we should be prepared to serve any combination of rows.
I like solutions that don't require any running services. Life is so much easier when all you need is a single file. Another option is feather, which is a binary format for storing dataframes. While it doesn't support indexed reading (reading only a subset of the overall dataset), it's supposedly really quick. Currently, it's not too too slow to read the full files, so this may be prematurely optimizing... we could stick with TSV until it becomes a bottleneck?
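For reference, a minimal sketch of round-tripping a dataframe through Feather with pandas (the file paths and the "sample_id" index name are hypothetical, and the pyarrow package is assumed installed):

```python
import pandas as pd

# Hypothetical path; pandas infers bz2 compression from the extension.
df = pd.read_csv("expression-matrix.tsv.bz2", sep="\t", index_col=0)

# Feather stores plain columns, so move the index into a named column first.
df.rename_axis("sample_id").reset_index().to_feather("expression-matrix.feather")

# Reading back always loads the whole file; Feather has no indexed subsetting.
df = pd.read_feather("expression-matrix.feather").set_index("sample_id")
```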
Tagging @stephenshank and @mike19106, who I think were both interested in this topic.
We may be running a single job per worker instance at a time, with multiple jobs running concurrently via multiple instances. I like that idea mostly because jobs running in isolation are less likely to interfere with each other. What makes that relevant to this discussion and cognoma/cognoma#17 is that we can dedicate a decent amount of memory per job, so in-memory caching becomes more possible.
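If we go that route, per-process caching could be as simple as memoizing the loader. A minimal sketch, with a hypothetical path and assuming the full matrix fits in a worker's memory:

```python
from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=1)
def load_matrix(path="expression-matrix.tsv.bz2"):  # hypothetical path
    """Read the full matrix once per worker process and reuse it across jobs."""
    return pd.read_csv(path, sep="\t", index_col=0)
```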
Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at cognoma/cognoma#17 (comment). We want a persistent storage format (i.e. file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.
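One candidate that fits this requirement is HDF5 through pandas and PyTables, whose table format supports reading selected rows and columns. A minimal sketch, with hypothetical paths, gene columns, and sample labels:

```python
import pandas as pd

# Hypothetical paths; requires the PyTables (tables) package.
df = pd.read_csv("expression-matrix.tsv.bz2", sep="\t", index_col=0)

# format="table" makes the store queryable rather than a fixed binary dump.
df.to_hdf("expression-matrix.h5", key="expression", format="table")

# Pull back only the requested rows and columns.
subset = pd.read_hdf(
    "expression-matrix.h5",
    key="expression",
    columns=["TP53", "EGFR"],                 # hypothetical gene columns
    where="index in ['TCGA-01', 'TCGA-02']",  # hypothetical sample rows
)
```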
A primary benchmark for judging implementations is how much time you save over reading the entire bzipped TSV into Python via pandas, for a variety of setups.
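A rough sketch of that baseline measurement, assuming a hypothetical dataset path:

```python
import timeit

import pandas as pd

def read_full_tsv():
    # Hypothetical path: the baseline every candidate format is measured against.
    return pd.read_csv("expression-matrix.tsv.bz2", sep="\t", index_col=0)

# Average over a few runs; an indexed lookup in a candidate format
# should beat this wall-clock time to be worth adopting.
seconds = timeit.timeit(read_full_tsv, number=3) / 3
print(f"full read: {seconds:.1f} s")
```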