feat: support PostgreSQL as optional storage backend #169

Merged
merged 68 commits into from
Apr 3, 2023
Changes from 67 commits
Commits
88d7c67
stash
jsstevenson Feb 16, 2023
d376f23
add progress
jsstevenson Feb 21, 2023
15a706a
add progress
jsstevenson Feb 21, 2023
252c404
fix errors
jsstevenson Feb 21, 2023
b7aab4d
check unique violation
jsstevenson Feb 22, 2023
31a7117
stash progress
jsstevenson Feb 22, 2023
7fd2156
move db constructor into factory method
jsstevenson Feb 22, 2023
ef21ef3
Merge branch 'main' into pg
jsstevenson Feb 22, 2023
7248490
stashing progress
jsstevenson Feb 22, 2023
28ea3b6
update docstring
jsstevenson Feb 22, 2023
745b6d7
fix looup
jsstevenson Feb 22, 2023
be276b0
update tests
jsstevenson Feb 24, 2023
8313cc9
test conditional action
jsstevenson Feb 24, 2023
a6b5034
single qotes
jsstevenson Feb 24, 2023
490180e
update reqs
jsstevenson Feb 24, 2023
78e2f9c
Add postgres action
jsstevenson Feb 24, 2023
690f2c3
fix typo
jsstevenson Feb 24, 2023
1e297b2
fix typo
jsstevenson Feb 24, 2023
0c4812b
fix typo
jsstevenson Feb 24, 2023
cdfa12f
fix typo
jsstevenson Feb 24, 2023
09ea9cc
Fix typos
jsstevenson Feb 24, 2023
0369ac4
stash progress
jsstevenson Feb 24, 2023
00f57b0
update
jsstevenson Mar 14, 2023
4e057ce
update
jsstevenson Mar 15, 2023
a406b8d
fix tests?
jsstevenson Mar 15, 2023
4b278cc
update
jsstevenson Mar 15, 2023
917e2ce
faster
jsstevenson Mar 15, 2023
4167ff4
fix
jsstevenson Mar 15, 2023
cb976c6
more
jsstevenson Mar 20, 2023
2e5b0c9
readme
jsstevenson Mar 21, 2023
f8ca39d
readme
jsstevenson Mar 21, 2023
d97d7f4
readme
jsstevenson Mar 21, 2023
1df903e
update
jsstevenson Mar 21, 2023
a5e2f59
break out connection handling
jsstevenson Mar 22, 2023
7fd28fa
tentative delete updates
jsstevenson Mar 22, 2023
0f80d3a
fix docstring display
jsstevenson Mar 22, 2023
e9230ec
whitespace was a little excessive
jsstevenson Mar 22, 2023
8614444
Add more notes on decisions
jsstevenson Mar 22, 2023
a9e71cf
docstring
jsstevenson Mar 22, 2023
b8424ec
additional cleanup anticipating more comments
jsstevenson Mar 23, 2023
69adbe2
update reqs
jsstevenson Mar 23, 2023
d2002e8
review edits
jsstevenson Mar 24, 2023
a9000f2
Update
jsstevenson Mar 28, 2023
7eb3b39
add
jsstevenson Mar 28, 2023
27afb98
fix
jsstevenson Mar 28, 2023
acc2555
Pretty sure I've forgotten something but gonna chuck this at the CI a…
jsstevenson Mar 28, 2023
d06f7e5
update
jsstevenson Mar 29, 2023
f8a9ee6
update
jsstevenson Mar 29, 2023
1130c30
updates
jsstevenson Mar 29, 2023
9c6a039
docs????
jsstevenson Mar 31, 2023
61f0f18
add action
jsstevenson Mar 31, 2023
7198a9a
try something else
jsstevenson Mar 31, 2023
b4fded0
typo
jsstevenson Mar 31, 2023
68ccf27
try thing
jsstevenson Mar 31, 2023
f791bb8
version
jsstevenson Mar 31, 2023
467d57b
merge
jsstevenson Mar 31, 2023
fa40903
try another thing
jsstevenson Mar 31, 2023
3c81887
sanity check
jsstevenson Mar 31, 2023
5dd8cbc
Another sanity check
jsstevenson Mar 31, 2023
7fd360b
try this
jsstevenson Mar 31, 2023
8747167
use dev reqs
jsstevenson Mar 31, 2023
9cc053f
try self install
jsstevenson Mar 31, 2023
707a052
manually install sphinx autodoc
jsstevenson Mar 31, 2023
56a94b5
docs
jsstevenson Mar 31, 2023
ce303c8
update readme
jsstevenson Mar 31, 2023
f2bf4e1
remove docs
jsstevenson Mar 31, 2023
16fe0dd
add gitingore
jsstevenson Mar 31, 2023
762dc83
Update
jsstevenson Apr 3, 2023
61 changes: 37 additions & 24 deletions .github/workflows/github-actions.yml
@@ -1,32 +1,45 @@
name: github-actions
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
test:
runs-on: ubuntu-latest
strategy:
matrix:
db_url: ["http://localhost:8000", "postgres://postgres:postgres@localhost:5432/gene_normalizer_test"]
services:
postgres:
image: postgres:14
env:
AWS_ACCESS_KEY_ID: ${{ secrets.DUMMY_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DUMMY_AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
AWS_DEFAULT_OUTPUT: text
GENE_NORM_DB_URL: http://localhost:8000
GENE_TEST: true
steps:
- uses: actions/checkout@v3
POSTGRES_USER: 'postgres'
POSTGRES_DB: 'gene_normalizer_test'
POSTGRES_PASSWORD: 'postgres'
ports:
- 5432:5432
env:
AWS_ACCESS_KEY_ID: ${{ secrets.DUMMY_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DUMMY_AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
AWS_DEFAULT_OUTPUT: text
GENE_NORM_DB_URL: ${{ matrix.db_url }}
GENE_TEST: true
steps:
- uses: actions/checkout@v3

- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: 3.8
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: 3.8

- name: Install dependencies
run: |
python3 -m pip install pipenv
pipenv install --dev
- name: Install dependencies
run: |
python3 -m pip install pipenv
pipenv install --dev

- name: Build local DynamoDB
run: |
chmod +x ./tests/unit/dynamodb_build.bash
./tests/unit/dynamodb_build.bash
- name: Build local DynamoDB
if: ${{ env.GENE_NORM_DB_URL == 'http://localhost:8000' }}
run: |
chmod +x ./tests/unit/dynamodb_build.bash
./tests/unit/dynamodb_build.bash

- run: pipenv run flake8
- run: pipenv run pytest tests/
- run: pipenv run flake8
- run: pipenv run pytest tests/
3 changes: 2 additions & 1 deletion .gitignore
@@ -118,6 +118,7 @@ venv/
ENV/
env.bak/
venv.bak/
.python-version

# Spyder project settings
.spyderproject
@@ -160,4 +161,4 @@ dynamodb_local_latest/*

Pipfile.lock
*.toml
*.zip
*.zip
2 changes: 1 addition & 1 deletion Pipfile
@@ -16,7 +16,7 @@ boto3 = "*"
gene = {editable = true, path = "."}
gffutils = "*"
"biocommons.seqrepo" = "*"
psycopg2-binary = "*"
psycopg = {version = "*", extras=["binary"]}
pytest = "*"
pre-commit = "*"
flake8 = "*"
141 changes: 74 additions & 67 deletions README.md
@@ -1,31 +1,19 @@
[![DOI](https://zenodo.org/badge/309797998.svg)](https://zenodo.org/badge/latestdoi/309797998)

# Gene Normalization
# Gene Normalizer
Services and guidelines for normalizing gene terms

Installing with pip:
## Installation

The Normalizer is available via PyPI:

```commandline
pip install gene[dev]
```

The `[dev]` argument tells pip to install packages to fulfill the dependencies of the `gene.etl` package.

## Developer instructions
Following are sections include instructions specifically for developers.

### Installation
For a development install, we recommend using Pipenv. See the
[pipenv docs](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today)
for direction on installing pipenv in your compute environment.

Once installed, from the project root dir, just run:

```commandline
pipenv shell
pipenv lock && pipenv sync
pipenv install --dev
```
### External requirements

Gene Normalization relies on [SeqRepo](https://github.com/biocommons/biocommons.seqrepo) data, which you must download yourself.

@@ -44,103 +32,122 @@ PermissionError: [Error 13] Permission denied: '/usr/local/share/seqrepo/2021-01

You will want to do the following:\
(*Might not be ._fkuefgd, so replace with your error message path*)

```console
sudo mv /usr/local/share/seqrepo/2021-01-29._fkuefgd /usr/local/share/seqrepo/2021-01-29
exit
```

### Deploying DynamoDB Locally
### Database Initialization

We use Amazon DynamoDB for our database. To deploy locally, follow [these instructions](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html).
The Normalizer supports two data storage options:

### Init coding style tests
* [DynamoDB](https://aws.amazon.com/dynamodb), a NoSQL service provided by AWS. This is our preferred storage solution. In addition to cloud deployment, Amazon also provides a tool for local service, which can be installed [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html). Once downloaded, you can start service by running `java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb` in a terminal (add a `-port <VALUE>` option to use a different port)
* [PostgreSQL](https://www.postgresql.org/), a well-known relational database technology. Once starting the Postgres server process, [ensure that a database is created](https://www.postgresql.org/docs/current/sql-createdatabase.html) (we typically name ours `gene_normalizer`).

Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.
By default, the Gene Normalizer expects to find a DynamoDB instance listening at `http://localhost:8000`. Alternative locations can be specified in two ways:

We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.
The first way is to set the `--db_url` option to the URL endpoint.

This ensures:
```commandline
gene_update --update_all --db_url="http://localhost:8001"
```

* Check code style
* Check for added large files
* Detect AWS Credentials
* Detect Private Key
The second way is to set the `GENE_NORM_DB_URL` environment variable to the URL endpoint.
```commandline
export GENE_NORM_DB_URL="http://localhost:8001"
```

Before first commit run:
To use a PostgreSQL instance instead of DynamoDB, provide a PostgreSQL connection URL instead, e.g.

```commandline
pre-commit install
export GENE_NORM_DB_URL="postgresql://postgres@localhost:5432/gene_normalizer"
```
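
One way to picture how a single URL setting can select between the two storage options is to look at the URL scheme. The following is a minimal sketch under stated assumptions, not the package's actual logic: `resolve_db_url` and `backend_for_url` are hypothetical helper names, and the precedence shown (explicit option first, then `GENE_NORM_DB_URL`, then the documented default) is an assumption rather than something this README states.

```python
import os
from typing import Optional
from urllib.parse import urlparse

# Default DynamoDB endpoint named in the README.
DEFAULT_DB_URL = "http://localhost:8000"


def resolve_db_url(cli_value: Optional[str] = None) -> str:
    """Hypothetical helper: take the URL from the option, the env var, or the default."""
    if cli_value:
        return cli_value
    return os.environ.get("GENE_NORM_DB_URL", DEFAULT_DB_URL)


def backend_for_url(db_url: str) -> str:
    """Hypothetical helper: classify a connection URL by its scheme."""
    scheme = urlparse(db_url).scheme.lower()
    # Both spellings appear in this PR: the README uses "postgresql://",
    # the CI matrix uses "postgres://".
    if scheme in ("postgres", "postgresql"):
        return "postgresql"
    # Anything else (e.g. an http:// endpoint) is treated as DynamoDB here.
    return "dynamodb"
```

For example, `backend_for_url(resolve_db_url())` would classify whatever URL is currently configured, falling back to the local DynamoDB endpoint when nothing is set.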

### Adding and refreshing data

### Running unit tests
Use the `gene_update` command in a shell to update the database.

By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the `GENE_TEST` environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.
#### Update source(s)

The normalizer currently pulls data from [HGNC](https://www.genenames.org/), [Ensembl](https://useast.ensembl.org/index.html), and [NCBI](https://www.ncbi.nlm.nih.gov/gene/).

To update one source, simply set `--normalizer` to the source you wish to update. The normalizer will check to see if local source data is up-to-date, acquire the most recent data if not, and use it to populate the database.

For example, run the following to acquire the latest HGNC data if necessary, and update the HGNC gene records in the normalizer database:

```commandline
export GENE_TEST=true
gene_update --normalizer="hgnc"
```

Running unit tests is as easy as pytest.
To update multiple sources, you can use the `--normalizer` option with the source names separated by spaces.
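
As a rough illustration of what a space-separated source list could look like once parsed, here is a small sketch. It is not the `gene_update` implementation: `parse_source_names` and `KNOWN_SOURCES` are hypothetical names, the source list is taken from the three sources named above, and the case-insensitive matching is an assumption.

```python
from typing import List

# Sources named in this README; the lowercase spelling is an assumption.
KNOWN_SOURCES = ("hgnc", "ensembl", "ncbi")


def parse_source_names(raw: str) -> List[str]:
    """Hypothetical helper: split a space-separated value into validated source names."""
    names = raw.lower().split()
    unknown = [name for name in names if name not in KNOWN_SOURCES]
    if unknown:
        raise ValueError(f"Unrecognized source(s): {', '.join(unknown)}")
    return names
```

Under these assumptions, a value such as `"hgnc ncbi"` would yield two source names to update in turn, and an unknown name would be rejected before any data is fetched.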

#### Update all sources

To update all sources, use the `--update_all` flag:

```commandline
pipenv run pytest
gene_update --update_all
```

### Updating the gene normalization database
### Starting the gene normalization service

Before you use the CLI to update the database, run the following in a separate terminal to start a local DynamoDB service on `port 8000`:
Once the Gene Normalizer database has been loaded, from the project root, run the following:

```
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
```commandline
uvicorn gene.main:app --reload
```

To change the port, simply add `-port value`.
Next, view the OpenAPI docs on your local machine:

#### Update source(s)
The sources we currently use are: HGNC, Ensembl, and NCBI.
http://127.0.0.1:8000/gene

## Developer instructions
The following sections include instructions specifically for developers.

To update one source, simply set `--normalizer` to the source you wish to update.
### Installation
For a development install, we recommend using Pipenv. See the
[pipenv docs](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today)
for direction on installing pipenv in your compute environment.

From the project root, run the following to update the HGNC source:
Once installed, clone the repo and initialize the environment:

```commandline
python3 -m gene.cli --normalizer="hgnc"
git clone https://github.com/cancervariants/gene-normalization
cd gene-normalization
pipenv shell
pipenv update
pipenv install --dev
```

To update multiple sources, you can use the `--normalizer` flag with the source names separated by spaces.
### Init coding style tests

#### Update all sources
Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.

To update all sources, use the `--update_all` flag.
We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.

From the project root, run the following to update all sources:
This ensures:

```commandline
python3 -m gene.cli --update_all
```
* Check code style
* Check for added large files
* Detect AWS Credentials
* Detect Private Key

#### Specifying the database URL endpoint
The default URL endpoint is `http://localhost:8000`.
There are two different ways to specify the database URL endpoint.
Before first commit run:

The first way is to set the `--db_url` flag to the URL endpoint.
```commandline
python3 -m gene.cli --update_all --db_url="http://localhost:8001"
pre-commit install
```

The second way is to set the `GENE_NORM_DB_URL` to the URL endpoint.
```commandline
export GENE_NORM_DB_URL="http://localhost:8001"
python3 -m gene.cli --update_all
```
### Running unit tests

By default, tests will employ an existing database. For test environments where this is unavailable (e.g. in CI), the `GENE_TEST` environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.

### Starting the gene normalization service
From the project root, run the following:
```commandline
uvicorn gene.main:app --reload
export GENE_TEST=true
```

Next, view the OpenAPI docs on your local machine:
Running unit tests is as easy as pytest.

http://127.0.0.1:8000/gene
```commandline
pipenv run pytest
```
4 changes: 2 additions & 2 deletions gene/__init__.py
@@ -34,8 +34,8 @@ def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)


from gene.schemas import SourceName, NamespacePrefix, SourceIDAfterNamespace, ItemTypes # noqa: E402, E501
ITEM_TYPES = {k.lower(): v.value for k, v in ItemTypes.__members__.items()}
from gene.schemas import SourceName, NamespacePrefix, SourceIDAfterNamespace, RefType # noqa: E402, E501
ITEM_TYPES = {k.lower(): v.value for k, v in RefType.__members__.items()}

# Sources we import directly (HGNC, Ensembl, NCBI)
SOURCES = {source.value.lower(): source.value