Develop a new data ingest / ETL pipeline for indexing eQTL data into the new mongo database #3

karatugo opened this issue Jul 10, 2024 · 25 comments (7 of 13 tasks complete)

We need to develop a robust and scalable data ingest/ETL (Extract, Transform, Load) pipeline that reads eQTL (expression Quantitative Trait Loci) data from FTP sources, indexes it into a MongoDB database, and serves it via an API. This pipeline will ensure efficient data extraction, transformation, and retrieval to support downstream analysis and querying through a web service.

  • Scalable data ingest/ETL pipeline
  • Read from FTP sources
  • Ingest with the correct schema
  • Save to MongoDB
  • Index MongoDB - is it automatic? discuss with DBA team
  • Deploy to Sandbox
  • Fix the Sandbox bugs
  • Make ETL pipeline more efficient
  • Update Confluence docs
  • Discuss prod deployment with DBA team
  • Deploy to Prod
  • Plan for API implementation
  • Implement API
@karatugo

Files to Index

  • QTD0000*.all.tsv.gz: Contains comprehensive eQTL data. This should be the primary source for indexing.
  • QTD0000*.cc.tsv.gz: Contains specific eQTL data (likely condition-specific or subset). Also useful for indexing.
  • QTD0000*.permuted.tsv.gz: Contains permuted eQTL data for significance testing. Useful for specific analyses but not primary indexing.
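
For the extraction step, a minimal sketch of streaming one of these gzipped TSVs over FTP (host and path are placeholders, not the real locations):

import csv
import ftplib
import gzip
import io

# Placeholder host/path; the real locations come from the pipeline config.
FTP_HOST = "ftp.example.org"
FTP_PATH = "/eqtl/QTD000021/QTD000021.all.tsv.gz"

def fetch_eqtl_rows(host: str, path: str):
    """Download a gzipped TSV over FTP and yield one dict per data row."""
    buf = io.BytesIO()
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.retrbinary(f"RETR {path}", buf.write)
    buf.seek(0)  # note: buffers the whole file in memory; fine as a sketch
    with gzip.open(buf, mode="rt", newline="") as fh:
        yield from csv.DictReader(fh, delimiter="\t")

for row in fetch_eqtl_rows(FTP_HOST, FTP_PATH):
    print(row["variant"], row["pvalue"])
    break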

Suggested MongoDB Schema

Here's a refined schema to capture the necessary details from these files:

  1. Study Information:

    • study_id: QTD000021
    • study_name: "Sample eQTL Study"
  2. Sample Information:

    • sample_id: Auto-generated or derived from context if available?
  3. eQTL Information:

    • molecular_trait_id: Corresponding trait ID.
    • molecular_trait_object_id: Object ID for the molecular trait.
    • chromosome: Chromosome number.
    • position: Position on the chromosome.
    • ref: Reference allele.
    • alt: Alternative allele.
    • variant: Variant identifier.
    • ma_samples: Minor allele sample count.
    • maf: Minor allele frequency.
    • pvalue: P-value of the association.
    • beta: Effect size.
    • se: Standard error.
    • type: Variant type (e.g., SNP).
    • aan: Additional annotation number.
    • r2: R-squared value.
    • gene_id: Gene identifier.
    • median_tpm: Median TPM (Transcripts Per Million).
    • rsid: Reference SNP ID.
  4. Permuted eQTL Information:

    • p_perm: Permuted p-value.
    • p_beta: Permuted beta value.

Example MongoDB Document Structure

{
  "study_id": "QTD000021",
  "study_name": "Sample eQTL Study",
  "samples": [
    {
      "sample_id": "sample001",
      "eqtls": [
        {
          "molecular_trait_id": "ENSG00000187583",
          "molecular_trait_object_id": "ENSG00000187583",
          "chromosome": "1",
          "position": 14464,
          "ref": "A",
          "alt": "T",
          "variant": "chr1_14464_A_T",
          "ma_samples": 41,
          "maf": 0.109948,
          "pvalue": 0.15144,
          "beta": 0.25567,
          "se": 0.17746,
          "type": "SNP",
          "aan": 42,
          "r2": 382,
          "gene_id": "ENSG00000187583",
          "median_tpm": 0.985,
          "rsid": "rs546169444",
          "permuted": {
            "p_perm": 0.000999001,
            "p_beta": 3.3243e-12
          }
        }
      ]
    }
  ]
}

Steps to Implement

  1. Extract Data:

    • Parse QTD0000*.all.tsv.gz and QTD0000*.cc.tsv.gz to extract eQTL data.
    • Parse QTD0000*.permuted.tsv.gz to extract permuted data and merge with the main eQTL data.
  2. Transform Data:

    • Normalize data fields and structure according to the MongoDB schema.
  3. Load Data:

    • Insert the structured documents into MongoDB.
    • Ensure appropriate indexes on fields such as gene_id, chromosome, position, and variant for efficient querying.
  4. API Development:

    • Develop endpoints for querying the eQTL data based on different parameters.
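
A rough sketch of steps 1–3 against the schema above (file paths, the Mongo URI, and collection names are placeholders; the permuted file is assumed to expose p_perm and p_beta keyed by molecular_trait_id, as in the schema suggestion):

import csv
import gzip

from pymongo import MongoClient

def read_tsv(path):
    """Yield one dict per row from a gzipped TSV."""
    with gzip.open(path, "rt", newline="") as fh:
        yield from csv.DictReader(fh, delimiter="\t")

def build_documents(all_path, permuted_path, study_id):
    """Merge per-variant rows with permuted stats into MongoDB documents."""
    # Index permuted stats by trait so they can be merged into the main rows.
    permuted = {r["molecular_trait_id"]: r for r in read_tsv(permuted_path)}
    for row in read_tsv(all_path):
        doc = {
            "study_id": study_id,
            "molecular_trait_id": row["molecular_trait_id"],
            "chromosome": row["chromosome"],
            "position": int(row["position"]),
            "ref": row["ref"],
            "alt": row["alt"],
            "variant": row["variant"],
            "maf": float(row["maf"]),
            "pvalue": float(row["pvalue"]),
            "beta": float(row["beta"]),
            "se": float(row["se"]),
        }
        perm = permuted.get(row["molecular_trait_id"])
        if perm is not None:
            doc["permuted"] = {
                "p_perm": float(perm["p_perm"]),
                "p_beta": float(perm["p_beta"]),
            }
        yield doc

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collection = client["eqtl"]["associations"]        # placeholder names
collection.insert_many(
    build_documents("QTD000021.all.tsv.gz", "QTD000021.permuted.tsv.gz",
                    "QTD000021"),
    ordered=False,
)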

Indexing Strategy

  • Create indexes on key fields for efficient retrieval:
    • gene_id
    • chromosome
    • position
    • variant
    • rsid
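
On the task-list question of whether indexing is automatic: MongoDB only auto-creates the _id index, so these secondary indexes have to be created explicitly, e.g. with pymongo (database/collection names are placeholders):

from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["eqtl"]["associations"]

# Single-field indexes for the lookup fields listed above.
coll.create_index([("gene_id", ASCENDING)])
coll.create_index([("variant", ASCENDING)])
coll.create_index([("rsid", ASCENDING)])
# Compound index so region queries can filter on chromosome, then position.
coll.create_index([("chromosome", ASCENDING), ("position", ASCENDING)])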

@karatugo

@karatugo Focus on Mongo indexing, deployment and API development

@karatugo

Deployment to sandbox is in progress. I was able to run the build step successfully; the deploy step has some errors at the moment. I'll prioritise this next week.


karatugo commented Oct 24, 2024

Sandbox deployment worked with Singularity commands, but while automating it I got the error below.

  • Fix this error and test it in sandbox
FATAL:   could not open image /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: failed to retrieve path for /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: lstat /nfs/public: no such file or directory

@karatugo

Fixed the above error; now working on the Mongo save-failure issue.

@karatugo

Deployment to sandbox complete.


karatugo commented Oct 31, 2024

Started a full ingestion yesterday evening. In 16 hours with 2 concurrent workers, only 2 studies (19 datasets) were complete.

  • Need to adjust accordingly:

    • number of workers
    • sbatch wallclock time
    • sbatch memory

    I'll wait until the ingestion is complete to see what we get at the end of 2 days with 2 workers and 8G mem.

@karatugo

Sent an email to Kaur for the schemas of .permuted files.


karatugo commented Nov 7, 2024

  • Ignored .permuted files
  • Used the local file system rather than the FTP protocol
    This was for bug-fixing in the sandbox; there were many FTP connection problems.


karatugo commented Nov 7, 2024

  • Fixed the Docker pull rate-limit error in the ingest script (docker login needed)


karatugo commented Nov 7, 2024

  • Fixed an issue with the file last-modified date on the local filesystem; a demo ingest is running


karatugo commented Nov 7, 2024

> Started a full ingestion yesterday evening. In 16 hours with 2 concurrent workers, only 2 studies (19 datasets) were complete.
>
> • Need to adjust accordingly:
>   • number of workers
>   • sbatch wallclock time
>   • sbatch memory
>
> I'll wait until the ingestion is complete to see what we get at the end of 2 days with 2 workers and 8G mem.

  • Run 8 concurrent workers with 64G for 2 days


karatugo commented Nov 7, 2024

> • Run 8 concurrent workers with 64G for 2 days

Running, will check on Monday.


karatugo commented Nov 7, 2024

I realized there was a typo in the memory setting; it should be 64G rather than 6G. Restarted.

@karatugo

35 studies were ingested, which seems very few.

@karatugo

I'm testing another approach using batch sizes of 10,000 in Mongo.
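
If the ingest path writes through pymongo directly, batching at that size could look like the sketch below (the collection handle and batch size are illustrative):

from itertools import islice

BATCH_SIZE = 10_000

def insert_in_batches(collection, docs, batch_size=BATCH_SIZE):
    """Drain an iterable of documents into MongoDB in fixed-size batches."""
    it = iter(docs)
    while batch := list(islice(it, batch_size)):
        # ordered=False lets the server continue past individual failures
        # and apply each batch with more parallelism.
        collection.insert_many(batch, ordered=False)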

@sprintell

@ala-ebi suggested using the Mongo Bulk Operations API to improve performance.


karatugo commented Nov 13, 2024

Some results with 10k batch size after 2 days of ingestion.
[screenshot: ingestion results]


karatugo commented Nov 13, 2024

> @ala-ebi suggested using the Mongo Bulk Operations API to improve performance.

I checked; "Write to MongoDB in Batch Mode" already uses bulk operations.

[screenshot: batch-mode write documentation]
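
For reference, a hedged sketch of that batch-mode write with the MongoDB Spark Connector 10.x (package coordinates, URI, and database/collection names are assumptions; option names differ in older connector versions):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("eqtl-ingest")
    # Connector coordinates assumed for Spark 3.x / Scala 2.12.
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.3.0")
    .getOrCreate()
)

df = spark.read.csv("QTD000021.all.tsv.gz", sep="\t", header=True)

(
    df.write.format("mongodb")
    .mode("append")
    .option("connection.uri", "mongodb://localhost:27017")  # placeholder
    .option("database", "eqtl")
    .option("collection", "associations")
    .option("maxBatchSize", 10000)  # bulk-write batch size; default is 512
    .save()
)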


karatugo commented Nov 14, 2024

Benchmarked with repartitioning and coalescing for 1 day: it looks like this doubles the performance and ingests ~2B rows in a day.

[screenshot: benchmark results]
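
Roughly the kind of change being benchmarked (the partition count here is illustrative, not the value used):

# repartition(n) does a full shuffle into n partitions, so n write tasks can
# issue bulk writes to MongoDB in parallel; coalesce(n) merges existing
# partitions without a shuffle, which suits reducing the partition count.
df = df.repartition(16)
# Reusing the writer options shown in the earlier sketch.
df.write.format("mongodb").mode("append").save()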


karatugo commented Nov 20, 2024

Started another test run in SLURM.

Update: I made a mistake with the resource allocation; I'll submit another one shortly.


karatugo commented Nov 20, 2024

  • Update the sleep in the script to 30 min
  • Give a name to the wrap command
  • Convert the for loop to a while loop
  • Increase concurrent operations to 16

@karatugo

Started a test run but cancelled it, as the eQTL database was unable to respond.

@karatugo

The issues with the Mongo instance are solved. Started a new test run.

@karatugo

Sharding is enabled. Started a new test run.

  • Check the writeConcern majority flag
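
For that last item, a quick sketch of what w=majority looks like from pymongo (URI and database/collection names are placeholders):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

# w="majority" makes each write wait for acknowledgement from a majority of
# replica-set members: safer on a sharded cluster, at the cost of latency.
coll = client["eqtl"].get_collection(
    "associations", write_concern=WriteConcern(w="majority"))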
