Persistent storage of matrices that enables quick indexed lookup #9
Comments
Questions from the group at Tuesday night discussion: Do you anticipate complete randomness in the subselection (i.e. totally user selected), or is there some structure that governs what might be asked for? IOW, is chunking an option? --> Perhaps a cached or database format might be more appropriate? Or microservice? An advantage of a microservice would be the ability to respond to demand.
Yes, we should be prepared to serve any combination of rows.
I like solutions that don't require any running services. Life is so much easier when all you need is a single file. Another option is feather, which is a binary format for storing dataframes. While it doesn't support indexed reading (reading only a subset of the overall dataset), it's supposedly really quick. Currently, it's not too too slow to read the full files, so this may be prematurely optimizing... we could stick with TSV until it becomes a bottleneck?
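For reference, a minimal sketch of round-tripping a dataframe through Feather with pandas (the file paths and the "sample_id" index name are hypothetical, and the pyarrow package is assumed installed):

```python
import pandas as pd

# Hypothetical path; pandas infers bz2 compression from the extension.
df = pd.read_csv("expression-matrix.tsv.bz2", sep="\t", index_col=0)

# Feather stores plain columns, so move the index into a named column first.
df.rename_axis("sample_id").reset_index().to_feather("expression-matrix.feather")

# Reading back always loads the whole file; Feather has no indexed subsetting.
df = pd.read_feather("expression-matrix.feather").set_index("sample_id")
```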
Tagging @stephenshank and @mike19106, who I think were both interested in this topic.
We may be running a single job per worker instance at a time, with multiple jobs running concurrently via multiple instances. I like that idea mostly because jobs running in isolation are less likely to interfere with each other. What makes that relevant to this discussion and cognoma/cognoma#17 is that we can dedicate a decent amount of memory per job, so in-memory caching becomes more possible.
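If we go that route, per-process caching could be as simple as memoizing the loader. A minimal sketch, with a hypothetical path and assuming the full matrix fits in a worker's memory:

```python
from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=1)
def load_matrix(path="expression-matrix.tsv.bz2"):  # hypothetical path
    """Read the full matrix once per worker process and reuse it across jobs."""
    return pd.read_csv(path, sep="\t", index_col=0)
```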
Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at cognoma/cognoma#17 (comment). We want a persistent storage format (i.e. file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.
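One candidate that fits this requirement is HDF5 through pandas and PyTables, whose table format supports reading selected rows and columns. A minimal sketch, with hypothetical paths, gene columns, and sample labels:

```python
import pandas as pd

# Hypothetical paths; requires the PyTables (tables) package.
df = pd.read_csv("expression-matrix.tsv.bz2", sep="\t", index_col=0)

# format="table" makes the store queryable rather than a fixed binary dump.
df.to_hdf("expression-matrix.h5", key="expression", format="table")

# Pull back only the requested rows and columns.
subset = pd.read_hdf(
    "expression-matrix.h5",
    key="expression",
    columns=["TP53", "EGFR"],                 # hypothetical gene columns
    where="index in ['TCGA-01', 'TCGA-02']",  # hypothetical sample rows
)
```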
A primary benchmark for judging implementations is how much time you save over reading the entire bzipped TSV into Python via pandas, for a variety of setups.
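A rough sketch of that baseline measurement, assuming a hypothetical dataset path:

```python
import timeit

import pandas as pd

def read_full_tsv():
    # Hypothetical path: the baseline every candidate format is measured against.
    return pd.read_csv("expression-matrix.tsv.bz2", sep="\t", index_col=0)

# Average over a few runs; an indexed lookup in a candidate format
# should beat this wall-clock time to be worth adopting.
seconds = timeit.timeit(read_full_tsv, number=3) / 3
print(f"full read: {seconds:.1f} s")
```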