feat!: support PostgreSQL as optional storage backend (#116)
jsstevenson authored May 12, 2023
1 parent 6a6d720 commit f8e8c5e
Showing 44 changed files with 2,301 additions and 954 deletions.
43 changes: 31 additions & 12 deletions .github/workflows/github-actions.yml
@@ -8,32 +8,51 @@ jobs:
AWS_SECRET_ACCESS_KEY: ${{ secrets.DUMMY_AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: us-east-2
AWS_DEFAULT_OUTPUT: text
DISEASE_NORM_DB_URL: http://localhost:8002
DISEASE_NORM_DB_URL: ${{ matrix.db_url }}
DISEASE_TEST: true
strategy:
matrix:
db_url: ["http://localhost:8002", "postgres://postgres:postgres@localhost:5432/disease_normalizer_test"]
python-version: ['3.8', '3.9', '3.10']
services:
postgres:
image: postgres:14
env:
POSTGRES_USER: 'postgres'
POSTGRES_DB: 'disease_normalizer_test'
POSTGRES_PASSWORD: 'postgres'
ports:
- 5432:5432
steps:
- uses: actions/checkout@v3

- name: Setup Python
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: python3 -m pip install ".[dev,test]"
python-version: ${{ matrix.python-version }}

- name: Build local DynamoDB server
- name: Build local DynamoDB
if: ${{ env.DISEASE_NORM_DB_URL == 'http://localhost:8002' }}
run: |
chmod +x ./tests/scripts/dynamodb_run.sh
./tests/scripts/dynamodb_run.sh
chmod +x ./tests/scripts/dynamodb_run.sh
./tests/scripts/dynamodb_run.sh
- name: Install DynamoDB dependencies
if: ${{ env.DISEASE_NORM_DB_URL == 'http://localhost:8002' }}
run: python3 -m pip install ".[etl,test]"

- name: Install PG dependencies
if: ${{ env.DISEASE_NORM_DB_URL != 'http://localhost:8002' }}
run: python3 -m pip install ".[pg,etl,test]"

- name: Run tests
run: python3 -m pytest
run: python3 -m pytest tests/

lint:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8', '3.9', '3.10']
env:
AWS_ACCESS_KEY_ID: ${{ secrets.DUMMY_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.DUMMY_AWS_SECRET_ACCESS_KEY }}
@@ -45,10 +64,10 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: python3 -m pip install ".[dev]"

- name: check style
run: python3 -m flake8
run: python3 -m flake8 disease/ tests/ setup.py
2 changes: 2 additions & 0 deletions .gitignore
@@ -165,3 +165,5 @@ Pipfile.lock
# Misc
tests/data/mondo/*.owl
tests/scripts/robot*
analysis/civic-data/*.json
analysis/pmkb-data/*.json
1 change: 1 addition & 0 deletions Pipfile
@@ -20,6 +20,7 @@ ipykernel = "*"
matplotlib = "*"
lxml = "*"
xmlformatter = "*"
psycopg = {version = "*", extras=["binary"]}

[packages]
pydantic = "*"
164 changes: 95 additions & 69 deletions README.md
@@ -1,137 +1,163 @@
# Disease Normalization
Services and guidelines for normalizing disease terms
# Disease Normalizer

## Developer instructions
The following sections include instructions specifically for developers.
Services and guidelines for normalizing disease terms

### Installation
For a development install, we recommend using Pipenv. See the
[pipenv docs](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today)
for direction on installing pipenv in your compute environment.
## Installation

Once installed, from the project root dir, just run:
The Disease Normalizer is available via PyPI:

```commandline
pipenv sync
pip install disease-normalizer[etl,pg]
```

### Deploying DynamoDB Locally
The `[etl,pg]` argument tells pip to install the packages required by the `disease.etl` package and the PostgreSQL data storage implementation, alongside the default DynamoDB data storage implementation.

We use Amazon DynamoDB for our database. To deploy locally, follow [these instructions](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html).
### External requirements

### Init coding style tests
The Disease Normalizer can retrieve most required data itself. The exception is disease terms from OMIM, for which a source file must be manually acquired and placed in the `disease/data/omim` folder within the library root. In order to access OMIM data, users must submit a request [here](https://www.omim.org/downloads). Once approved, the relevant OMIM file (`mimTitles.txt`) should be renamed according to the convention `omim_YYYYMMDD.tsv`, where `YYYYMMDD` indicates the date that the file was generated, and placed in the appropriate location.
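
The renaming convention above can be sketched in a few lines of Python. This is only an illustration, not part of the library: the helper names `omim_target_name` and `place_omim_file` are hypothetical, and the caller is assumed to know the date the file was generated.

```python
import shutil
from datetime import date
from pathlib import Path


def omim_target_name(generated: date) -> str:
    """Build the expected OMIM file name, omim_YYYYMMDD.tsv (hypothetical helper)."""
    return f"omim_{generated.strftime('%Y%m%d')}.tsv"


def place_omim_file(downloaded: Path, data_dir: Path, generated: date) -> Path:
    """Copy a downloaded mimTitles.txt into data_dir under the expected name.

    data_dir is assumed to be the disease/data/omim folder described above.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / omim_target_name(generated)
    shutil.copyfile(downloaded, target)
    return target
```

For example, `place_omim_file(Path("mimTitles.txt"), Path("disease/data/omim"), date(2023, 5, 12))` would copy the file to `disease/data/omim/omim_20230512.tsv`.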

Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.
### Database Initialization

We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.
The Disease Normalizer supports two data storage options:

This ensures:
* [DynamoDB](https://aws.amazon.com/dynamodb), a NoSQL service provided by AWS. This is our preferred storage solution. In addition to cloud deployment, Amazon also provides a tool for local service, which can be installed [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html). Once downloaded, you can start the service by running `java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb` in a terminal (add a `-port <VALUE>` option to use a different port).
* [PostgreSQL](https://www.postgresql.org/), a well-known relational database technology. After starting the Postgres server process, [ensure that a database is created](https://www.postgresql.org/docs/current/sql-createdatabase.html) (we typically name ours `disease_normalizer`).

* Check code style
* Check for added large files
* Detect AWS Credentials
* Detect Private Key
By default, the Disease Normalizer expects to find a DynamoDB instance listening at `http://localhost:8000`. Alternative locations can be specified in two ways:

Before first commit run:
The first way is to set the `--db_url` command-line option to the URL endpoint.

```commandline
pre-commit install
disease_norm_update --update_all --db_url="http://localhost:8001"
```

The second way is to set the `DISEASE_NORM_DB_URL` environment variable to the URL endpoint.
```commandline
export DISEASE_NORM_DB_URL="http://localhost:8001"
```

### Running unit tests

Tests are provided via pytest.
To use a PostgreSQL instance instead of DynamoDB, provide a PostgreSQL connection URL instead, e.g.

```commandline
pytest
export DISEASE_NORM_DB_URL="postgresql://postgres@localhost:5432/disease_normalizer"
```
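
As a rough sketch of how a single URL setting can select between the two backends, the snippet below inspects the URL scheme. It does not reproduce the library's actual logic: the function names are hypothetical, and the precedence shown (command-line value, then `DISEASE_NORM_DB_URL`, then the documented default) is an assumption.

```python
import os
from urllib.parse import urlparse

DEFAULT_DB_URL = "http://localhost:8000"  # default DynamoDB location stated above


def resolve_db_url(cli_value=None, environ=None):
    """Pick the database URL (assumed precedence: CLI option, env var, default)."""
    environ = os.environ if environ is None else environ
    return cli_value or environ.get("DISEASE_NORM_DB_URL") or DEFAULT_DB_URL


def backend_for(db_url):
    """Guess the storage backend from the URL scheme (hypothetical helper)."""
    scheme = urlparse(db_url).scheme.lower()
    if scheme in ("postgres", "postgresql"):
        return "postgresql"
    if scheme in ("http", "https"):
        return "dynamodb"
    raise ValueError(f"Unrecognized database URL scheme: {scheme!r}")
```

Both `postgres://` (as used in the CI matrix) and `postgresql://` (as used in the example above) are treated as PostgreSQL URLs here.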

By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the `DISEASE_TEST` environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.
### Adding and refreshing data

Use the `disease_norm_update` command in a shell to update the database.

```commandline
export DISEASE_TEST=true
pytest
```
#### Update source(s)

Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The `tests/scripts/` subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.
The Disease Normalizer currently uses data from the following sources:

### Updating the disease normalization database
* The [National Cancer Institute Thesaurus (NCIt)](https://ncithesaurus.nci.nih.gov/ncitbrowser/)
* The [Mondo Disease Ontology](https://mondo.monarchinitiative.org/)
* The [Online Mendelian Inheritance in Man (OMIM)](https://www.omim.org/)
* [OncoTree](http://oncotree.mskcc.org/)
* The [Disease Ontology](https://disease-ontology.org/)

Before you use the CLI to update the database, run the following in a separate terminal to start DynamoDB on `port 8000`:
As described above, all source data other than OMIM can be acquired automatically.

```
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
To update one source, simply set `--normalizer` to the source you wish to update. The normalizer will check to see if local source data is up-to-date, acquire the most recent data if not, and use it to populate the database.

For example, run the following to acquire the latest NCIt data if necessary, and update the NCIt disease records in the normalizer database:

```commandline
disease_norm_update --normalizer="ncit"
```

To change the port, simply add `-port value`.
To update multiple sources, you can use the `--normalizer` option with the source names separated by spaces.

#### Update source(s)
#### Update all sources

The sources we currently use are: OncoTree, OMIM, Disease Ontology, and Mondo.
To update all sources, use the `--update_all` flag:

The application will automatically retrieve input data for all sources but OMIM, for which a source file must be manually acquired and placed in the `disease/data/omim` folder within the library root. In order to access OMIM data, users must submit a request [here](https://www.omim.org/downloads). Once approved, the relevant OMIM file (`mimTitles.txt`) should be renamed according to the convention `omim_YYYYMMDD.tsv`, where `YYYYMMDD` indicates the date that the file was generated, and placed in the appropriate location.
```commandline
disease_norm_update --update_all
```

To update one source, simply set `--normalizer` to the source you wish to update. Accepted source names are `DO` (for Disease Ontology), `Mondo`, `OncoTree`, and `OMIM`.
### Create Merged Concept Groups
The `normalize` endpoint relies on merged concept groups.

From the project root, run the following to update the Mondo source:
To create merged concept groups, use the `--update_merged` flag with the `--update_all` flag.

```commandline
python3 -m disease.cli --normalizer="Mondo"
python3 -m disease.cli --update_all --update_merged
```

To update multiple sources, you can use the `normalizer` flag with the source names separated by spaces.
### Starting the disease normalization service

Once the Disease Normalizer database has been loaded, from the project root, run the following:

```commandline
python3 -m disease.cli --normalizer="Mondo OMIM DO"
uvicorn disease.main:app --reload
```

#### Update all sources
Next, view the OpenAPI docs on your local machine:

http://127.0.0.1:8000/disease

To update all sources, use the `--update_all` flag.
## Developer instructions
The following sections include instructions specifically for developers.

From the project root, run the following to update all sources:
### Installation
For a development install, we recommend using Pipenv. See the
[pipenv docs](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today)
for direction on installing pipenv in your compute environment.

To get started, clone the repo and initialize the environment:

```commandline
python3 -m disease.cli --update_all
git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
pipenv shell
pipenv update
pipenv install --dev
```

### Create Merged Concept Groups
The `normalize` endpoint relies on merged concept groups.

To create merged concept groups, use the `--update_merged` flag with the `--update_all` flag.
Alternatively, install the `pg`, `etl`, `dev`, and `test` dependency groups in a virtual environment:

```commandline
python3 -m disease.cli --update_all --update_merged
git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
python3 -m virtualenv venv
source venv/bin/activate
pip install -e ".[pg,etl,dev,test]"
```

#### Specifying the database URL endpoint
### Init coding style tests

The default URL endpoint is `http://localhost:8000`.
Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.

There are two different ways to specify the database URL endpoint.
We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.

The first way is to set the `--db_url` flag to the URL endpoint.
This ensures:

```commandline
python3 -m disease.cli --update_all --db_url="http://localhost:8001"
```
* Check code style
* Check for added large files
* Detect AWS Credentials
* Detect Private Key

Before first commit run:

The second way is to set the `DISEASE_NORM_DB_URL` to the URL endpoint.
```commandline
export DISEASE_NORM_DB_URL="http://localhost:8001"
python3 -m disease.cli --update_all
pre-commit install
```

### Starting the disease normalization service
### Running unit tests

From the project root, run the following:
Tests are provided via pytest.

```commandline
uvicorn disease.main:app --reload
pytest
```

Next, view the OpenAPI docs on your local machine:
By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the `DISEASE_TEST` environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.

http://127.0.0.1:8000/disease
```commandline
export DISEASE_TEST=true
pytest
```

Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The `tests/scripts/` subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.
6 changes: 3 additions & 3 deletions disease/__init__.py
@@ -17,8 +17,8 @@
logger.handlers = []


from disease.schemas import SourceName, SourceIDAfterNamespace, NamespacePrefix, ItemTypes # noqa: E402 E501
ITEM_TYPES = {k.lower(): v.value for k, v in ItemTypes.__members__.items()}
from disease.schemas import SourceName, SourceIDAfterNamespace, NamespacePrefix, RefType # noqa: E402 E501
ITEM_TYPES = {k.lower(): v.value for k, v in RefType.__members__.items()}

# use to lookup source name from lower-case string
# technically the same as PREFIX_LOOKUP, but source namespace prefixes
@@ -40,4 +40,4 @@
if v.value != ''}

# Use for checking whether to pull IDs for merge group generation
SOURCES_FOR_MERGE = {SourceName.MONDO.value}
SOURCES_FOR_MERGE = {SourceName.MONDO}
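
The last change swaps `SourceName.MONDO.value` for `SourceName.MONDO` in the set. The sketch below shows why that matters for membership checks. It uses a hypothetical stand-in enum, not the real `disease.schemas.SourceName`, and assumes a plain `Enum`; if the real class also subclasses `str`, both sets would behave alike.

```python
from enum import Enum


class SourceName(Enum):
    """Hypothetical stand-in for disease.schemas.SourceName (plain Enum assumed)."""

    MONDO = "Mondo"
    NCIT = "NCIt"


# Old form: the set holds the string value of the member.
by_value = {SourceName.MONDO.value}
# New form: the set holds the enum member itself.
by_member = {SourceName.MONDO}

# Callers must now test membership with enum members, not raw strings.
print("Mondo" in by_value)            # string lookup matches the old set
print(SourceName.MONDO in by_member)  # member lookup matches the new set
print("Mondo" in by_member)           # string lookup misses the new set
```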