Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support PostgreSQL as optional storage backend #169

Merged
merged 68 commits into from
Apr 3, 2023
Merged

feat: support PostgreSQL as optional storage backend #169

merged 68 commits into from
Apr 3, 2023

Conversation

jsstevenson
Copy link
Member

@jsstevenson jsstevenson commented Feb 21, 2023

  • Refactor Database into abstract class, move all boto3 usage behind DynamoDB-specific class
  • Provide PostgreSQL database class
    • This provides a pretty solid speedup over dynamodb-local (tests run in about 4 seconds vs 20 seconds). I don't think that advantage would hold over production DynamoDB but an individual SQL query can go in about 0.5ms so it's probably pretty close either way
    • Completely refreshing the DB is also relatively fast... maybe 30ish minutes in total?
    • It's also much easier with Postgres to do things like restoring the DB from a data dump, so I provided a quick implementation here. It takes about 30 seconds to download a DB dump off s3 and load it, which is nice.
  • Also some minor changes to ETL and query methods, including one that speeds up the normalization merge methods a bit in DynamoDB as well (merge is running locally in about 90 minutes instead of 2+ hours)
  • Small reorganization of the CLI to fit some new commands/require less indenting. Added console entry points.
  • Some reorganization of the README

For the production normalization service, I think we should stick with what's already working (and I tried to make sure this PR wouldn't meaningfully change much of that), but for places where we end up using the local version (eg AnyVar, DGIdb, local scans for analysis), the Postgres alternative is pretty convenient.

Possible future issues/todo

  • Use async/a connection pool. I have tried to stress test its current (synchronous) iteration by running multiple test suites at once and I don't think the wires get crossed at all like they did with psycopg2, but I could be wrong.
  • Separate boto and postgres dependency groups. It'd make install a little bit lighter.
  • Richer documentation. At this point I think there's enough stuff going on that it might be worth making a multi-page readthedocs instance. docs: add separate docs #181
  • Hosting psql dumps. I manually uploaded one to S3. Idk if that's something we want to support, or the best way to go about it -- I wanted to avoid something that would require boto so I just stuck it at a file address named "latest", but maybe there's a good way to let the normalizer view available versions and figure out the latest one itself? I was also thinking it'd be nice for the DB to track the time at which it was last updated internally so you could do a check to see if your local instance was out of date or not

Copy link
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great! Just forgot to update some docstring param names

gene/database/database.py Outdated Show resolved Hide resolved
gene/database/database.py Outdated Show resolved Hide resolved
gene/database/dynamodb.py Outdated Show resolved Hide resolved
gene/database/dynamodb.py Outdated Show resolved Hide resolved
gene/database/postgresql.py Outdated Show resolved Hide resolved
gene/database/postgresql.py Outdated Show resolved Hide resolved
@jsstevenson
Copy link
Member Author

🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:medium Medium priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants