feat!: support PostgreSQL as optional storage backend (#116)

cancervariants · May 12, 2023 · f8e8c5e
1 parent 6a6d720
commit f8e8c5e
Showing 44 changed files with 2,301 additions and 954 deletions.
diff --git a/.github/workflows/github-actions.yml b/.github/workflows/github-actions.yml
@@ -8,32 +8,51 @@ jobs:
       AWS_SECRET_ACCESS_KEY: ${{ secrets.DUMMY_AWS_SECRET_ACCESS_KEY }}
       AWS_DEFAULT_REGION: us-east-2
       AWS_DEFAULT_OUTPUT: text
-      DISEASE_NORM_DB_URL: http://localhost:8002
+      DISEASE_NORM_DB_URL: ${{ matrix.db_url }}
       DISEASE_TEST: true
     strategy:
       matrix:
+        db_url: ["http://localhost:8002", "postgres://postgres:postgres@localhost:5432/disease_normalizer_test"]
         python-version: ['3.8', '3.9', '3.10']
+    services:
+      postgres:
+        image: postgres:14
+        env:
+          POSTGRES_USER: 'postgres'
+          POSTGRES_DB: 'disease_normalizer_test'
+          POSTGRES_PASSWORD: 'postgres'
+        ports:
+          - 5432:5432
     steps:
     - uses: actions/checkout@v3
 
-    - name: Setup Python
+    - name: Set up Python
       uses: actions/setup-python@v4
       with:
-          python-version: ${{ matrix.python-version }}
-
-    - name: Install dependencies
-      run: python3 -m pip install ".[dev,test]"
+        python-version: ${{ matrix.python-version }}
 
-    - name: Build local DynamoDB server
+    - name: Build local DynamoDB
+      if: ${{ env.DISEASE_NORM_DB_URL == 'http://localhost:8002' }}
       run: |
-          chmod +x ./tests/scripts/dynamodb_run.sh
-          ./tests/scripts/dynamodb_run.sh
+        chmod +x ./tests/scripts/dynamodb_run.sh
+        ./tests/scripts/dynamodb_run.sh
+
+    - name: Install DynamoDB dependencies
+      if: ${{ env.DISEASE_NORM_DB_URL == 'http://localhost:8002' }}
+      run: python3 -m pip install ".[etl,test]"
+
+    - name: Install PG dependencies
+      if: ${{ env.DISEASE_NORM_DB_URL != 'http://localhost:8002' }}
+      run: python3 -m pip install ".[pg,etl,test]"
 
     - name: Run tests
-      run: python3 -m pytest
+      run: python3 -m pytest tests/
 
   lint:
     runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.8', '3.9', '3.10']
     env:
       AWS_ACCESS_KEY_ID: ${{ secrets.DUMMY_AWS_ACCESS_KEY_ID }}
       AWS_SECRET_ACCESS_KEY: ${{ secrets.DUMMY_AWS_SECRET_ACCESS_KEY }}
@@ -45,10 +64,10 @@ jobs:
     - name: Setup Python
       uses: actions/setup-python@v4
       with:
-        python-version: '3.10'
+        python-version: ${{ matrix.python-version }}
 
     - name: Install dependencies
       run: python3 -m pip install ".[dev]"
 
     - name: check style
-      run: python3 -m flake8
+      run: python3 -m flake8 disease/ tests/ setup.py
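Note on the workflow change above: the test job now runs once per combination of `db_url` and `python-version`, and the install step depends on which backend the URL points at. The sketch below only illustrates that expansion in Python. The `install_extras` helper and the `jobs` list are hypothetical names for illustration and are not part of the repository or of GitHub Actions.

```python
from itertools import product

# Matrix values copied from the workflow diff above.
DYNAMODB_URL = "http://localhost:8002"
db_urls = [
    DYNAMODB_URL,
    "postgres://postgres:postgres@localhost:5432/disease_normalizer_test",
]
python_versions = ["3.8", "3.9", "3.10"]


def install_extras(db_url: str) -> str:
    """Hypothetical helper mirroring the two conditional install steps."""
    if db_url == DYNAMODB_URL:
        return ".[etl,test]"
    return ".[pg,etl,test]"


# One job per (db_url, python-version) pair, as a build matrix would produce.
jobs = [
    {"db_url": url, "python": version, "extras": install_extras(url)}
    for url, version in product(db_urls, python_versions)
]
```

Under these assumptions the matrix yields six test jobs, three of which install the `pg` extra.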
diff --git a/.gitignore b/.gitignore
@@ -165,3 +165,5 @@ Pipfile.lock
 # Misc
 tests/data/mondo/*.owl
 tests/scripts/robot*
+analysis/civic-data/*.json
+analysis/pmkb-data/*.json
diff --git a/Pipfile b/Pipfile
@@ -20,6 +20,7 @@ ipykernel = "*"
 matplotlib = "*"
 lxml = "*"
 xmlformatter = "*"
+psycopg = {version = "*", extras=["binary"]}
 
 [packages]
 pydantic = "*"

diff --git a/README.md b/README.md
@@ -1,137 +1,163 @@
-# Disease Normalization
-Services and guidelines for normalizing disease terms
+# Disease Normalizer
 
-## Developer instructions
-Following are sections include instructions specifically for developers.
+Services and guidelines for normalizing disease terms
 
-### Installation
-For a development install, we recommend using Pipenv. See the
-[pipenv docs](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today)
-for direction on installing pipenv in your compute environment.
+## Installation
 
-Once installed, from the project root dir, just run:
+The Disease Normalizer is available via PyPI:
 
 ```commandline
-pipenv sync
+
+pip install disease-normalizer[etl,pg]
 ```
 
-### Deploying DynamoDB Locally
+The `[etl,pg]` argument tells pip to install packages to fulfill the dependencies of the `disease.etl` package and the PostgreSQL data storage implementation alongside the default DynamoDB data storage implementation.
 
-We use Amazon DynamoDB for our database. To deploy locally, follow [these instructions](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html).
+### External requirements
 
-### Init coding style tests
+The Disease Normalizer can retrieve most required data itself. The exception is disease terms from OMIM, for which a source file must be manually acquired and placed in the `disease/data/omim` folder within the library root. In order to access OMIM data, users must submit a request [here](https://www.omim.org/downloads). Once approved, the relevant OMIM file (`mimTitles.txt`) should be renamed according to the convention `omim_YYYYMMDD.tsv`, where `YYYYMMDD` indicates the date that the file was generated, and placed in the appropriate location.
 
-Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.
+### Database Initialization
 
-We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.
+The Disease Normalizer supports two data storage options:
 
-This ensures:
+* [DynamoDB](https://aws.amazon.com/dynamodb), a NoSQL service provided by AWS. This is our preferred storage solution. In addition to cloud deployment, Amazon also provides a tool for local service, which can be installed [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html). Once downloaded, you can start the service by running `java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb` in a terminal (add a `-port <VALUE>` option to use a different port).
+* [PostgreSQL](https://www.postgresql.org/), a well-known relational database technology. After starting the Postgres server process, [ensure that a database is created](https://www.postgresql.org/docs/current/sql-createdatabase.html) (we typically name ours `disease_normalizer`).
 
-* Check code style
-* Check for added large files
-* Detect AWS Credentials
-* Detect Private Key
+By default, the Disease Normalizer expects to find a DynamoDB instance listening at `http://localhost:8000`. Alternative locations can be specified in two ways:
 
-Before first commit run:
+The first way is to set the `--db_url` command-line option to the URL endpoint.
 
 ```commandline
-pre-commit install
+disease_norm_update --update_all --db_url="http://localhost:8001"
 ```
 
+The second way is to set the `DISEASE_NORM_DB_URL` environment variable to the URL endpoint.
+```commandline
+export DISEASE_NORM_DB_URL="http://localhost:8001"
+```
 
-### Running unit tests
-
-Tests are provided via pytest.
+To use a PostgreSQL instance instead of DynamoDB, provide a PostgreSQL connection URL instead, e.g.
 
 ```commandline
-pytest
+export DISEASE_NORM_DB_URL="postgresql://postgres@localhost:5432/disease_normalizer"
 ```
 
-By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the `DISEASE_TEST` environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.
+### Adding and refreshing data
 
+Use the `disease_norm_update` command in a shell to update the database.
 
-```comandline
-export DISEASE_TEST=true
-pytest
-```
+#### Update source(s)
 
-Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The `tests/scripts/` subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.
+The Disease Normalizer currently uses data from the following sources:
 
-### Updating the disease normalization database
+ * The [National Cancer Institute Thesaurus (NCIt)](https://ncithesaurus.nci.nih.gov/ncitbrowser/)
+ * The [Mondo Disease Ontology](https://mondo.monarchinitiative.org/)
+ * The [Online Mendelian Inheritance in Man (OMIM)](https://www.omim.org/)
+ * [OncoTree](http://oncotree.mskcc.org/)
+ * The [Disease Ontology](https://disease-ontology.org/)
 
-Before you use the CLI to update the database, run the following in a separate terminal to start DynamoDB on `port 8000`:
+As described above, all source data other than OMIM can be acquired automatically.
 
-```
-java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
+To update one source, simply set `--normalizer` to the source you wish to update. The normalizer will check to see if local source data is up-to-date, acquire the most recent data if not, and use it to populate the database.
+
+For example, run the following to acquire the latest NCIt data if necessary, and update the NCIt disease records in the normalizer database:
+
+```commandline
+disease_norm_update --normalizer="ncit"
 ```
 
-To change the port, simply add `-port value`.
+To update multiple sources, you can use the `--normalizer` option with the source names separated by spaces.
 
-#### Update source(s)
+#### Update all sources
 
-The sources we currently use are: OncoTree, OMIM, Disease Ontology, and Mondo.
+To update all sources, use the `--update_all` flag:
 
-The application will automatically retrieve input data for all sources but OMIM, for which a source file must be manually acquired and placed in the `disease/data/omim` folder within the library root. In order to access OMIM data, users must submit a request [here](https://www.omim.org/downloads). Once approved, the relevant OMIM file (`mimTitles.txt`) should be renamed according to the convention `omim_YYYYMMDD.tsv`, where `YYYYMMDD` indicates the date that the file was generated, and placed in the appropriate location.
+```commandline
+disease_norm_update --update_all
+```
 
-To update one source, simply set `--normalizer` to the source you wish to update. Accepted source names are `DO` (for Disease Ontology), `Mondo`, `OncoTree`, and `OMIM`.
+### Create Merged Concept Groups
+The `normalize` endpoint relies on merged concept groups.
 
-From the project root, run the following to update the Mondo source:
+To create merged concept groups, use the `--update_merged` flag with the `--update_all` flag.
 
 ```commandline
-python3 -m disease.cli --normalizer="Mondo"
+python3 -m disease.cli --update_all --update_merged
 ```
 
-To update multiple sources, you can use the `normalizer` flag with the source names separated by spaces.
+### Starting the disease normalization service
+
+Once the Disease Normalizer database has been loaded, from the project root, run the following:
 
 ```commandline
-python3 -m disease.cli --normalizer="Mondo OMIM DO"
+uvicorn disease.main:app --reload
 ```
 
-#### Update all sources
+Next, view the OpenAPI docs on your local machine:
+
+http://127.0.0.1:8000/disease
 
-To update all sources, use the `--update_all` flag.
+## Developer instructions
+The following sections include instructions specifically for developers.
 
-From the project root, run the following to update all sources:
+### Installation
+For a development install, we recommend using Pipenv. See the
+[pipenv docs](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today)
+for direction on installing pipenv in your compute environment.
+
+To get started, clone the repo and initialize the environment:
 
 ```commandline
-python3 -m disease.cli --update_all
+git clone https://github.com/cancervariants/disease-normalization
+cd disease-normalization
+pipenv shell
+pipenv update
+pipenv install --dev
 ```
 
-### Create Merged Concept Groups
-The `normalize` endpoint relies on merged concept groups.
-
-To create merged concept groups, use the `--update_merged` flag with the `--update_all` flag.
+Alternatively, install the `pg`, `etl`, `dev`, and `test` dependency groups in a virtual environment:
 
 ```commandline
-python3 -m disease.cli --update_all --update_merged
+git clone https://github.com/cancervariants/disease-normalization
+cd disease-normalization
+python3 -m virtualenv venv
+source venv/bin/activate
+pip install -e ".[pg,etl,dev,test]"
 ```
 
-#### Specifying the database URL endpoint
+### Init coding style tests
 
-The default URL endpoint is `http://localhost:8000`.
+Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.
 
-There are two different ways to specify the database URL endpoint.
+We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.
 
-The first way is to set the `--db_url` flag to the URL endpoint.
+This ensures:
 
-```commandline
-python3 -m disease.cli --update_all --db_url="http://localhost:8001"
-```
+* Check code style
+* Check for added large files
+* Detect AWS Credentials
+* Detect Private Key
+
+Before first commit run:
 
-The second way is to set the `DISEASE_NORM_DB_URL` to the URL endpoint.
 ```commandline
-export DISEASE_NORM_DB_URL="http://localhost:8001"
-python3 -m disease.cli --update_all
+pre-commit install
 ```
 
-### Starting the disease normalization service
+### Running unit tests
 
-From the project root, run the following:
+Tests are provided via pytest.
 
 ```commandline
-uvicorn disease.main:app --reload
+pytest
 ```
 
-Next, view the OpenAPI docs on your local machine:
+By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the `DISEASE_TEST` environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.
 
-http://127.0.0.1:8000/disease
+```commandline
+export DISEASE_TEST=true
+pytest
+```
+
+Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The `tests/scripts/` subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.
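Note on the README changes above: the backend is chosen by the URL that is supplied, either through `--db_url` or through `DISEASE_NORM_DB_URL`, with `http://localhost:8000` as the documented default. A minimal sketch of how such a selection could work is shown below. Both helper functions are hypothetical and are not taken from the library. The precedence of the command-line value over the environment variable, and the choice to detect PostgreSQL by URL scheme, are assumptions.

```python
import os
from typing import Optional
from urllib.parse import urlparse

# Default DynamoDB endpoint as stated in the README.
DEFAULT_DB_URL = "http://localhost:8000"


def resolve_db_url(cli_value: Optional[str] = None) -> str:
    """Hypothetical: prefer the CLI option, then the env var, then the default."""
    if cli_value:
        return cli_value
    return os.environ.get("DISEASE_NORM_DB_URL", DEFAULT_DB_URL)


def pick_backend(db_url: str) -> str:
    """Hypothetical: infer the storage backend from the URL scheme."""
    scheme = urlparse(db_url).scheme.lower()
    if scheme in ("postgres", "postgresql"):
        return "postgresql"
    # Anything else (e.g. http://localhost:8000) is treated as a DynamoDB endpoint.
    return "dynamodb"
```

With these assumptions, both `postgres://` (used in the CI matrix) and `postgresql://` (used in the README) map to the PostgreSQL backend, while an HTTP endpoint maps to DynamoDB.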
diff --git a/disease/__init__.py b/disease/__init__.py
@@ -17,8 +17,8 @@
 logger.handlers = []
 
 
-from disease.schemas import SourceName, SourceIDAfterNamespace, NamespacePrefix, ItemTypes  # noqa: E402 E501
-ITEM_TYPES = {k.lower(): v.value for k, v in ItemTypes.__members__.items()}
+from disease.schemas import SourceName, SourceIDAfterNamespace, NamespacePrefix, RefType  # noqa: E402 E501
+ITEM_TYPES = {k.lower(): v.value for k, v in RefType.__members__.items()}
 
 # use to lookup source name from lower-case string
 # technically the same as PREFIX_LOOKUP, but source namespace prefixes
@@ -40,4 +40,4 @@
                     if v.value != ''}
 
 # Use for checking whether to pull IDs for merge group generation
-SOURCES_FOR_MERGE = {SourceName.MONDO.value}
+SOURCES_FOR_MERGE = {SourceName.MONDO}
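Note on the `ITEM_TYPES` change above: the comprehension builds a lookup from lower-cased member names to member values, now driven by `RefType` instead of `ItemTypes`. The sketch below shows the effect with a stand-in enum. The member names and values are illustrative assumptions and may not match the real `disease.schemas.RefType`.

```python
from enum import Enum


class RefType(str, Enum):
    """Stand-in enum; members here are illustrative, not the real schema."""

    LABEL = "label"
    ALIASES = "alias"
    XREFS = "xref"


# Same comprehension as in disease/__init__.py: lower-cased name -> value.
ITEM_TYPES = {k.lower(): v.value for k, v in RefType.__members__.items()}
# ITEM_TYPES == {"label": "label", "aliases": "alias", "xrefs": "xref"}
```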