Merge pull request #7 from vanheeringen-lab/maelstrom

Maelstrom
vanheeringen-lab · Jan 12, 2023 · ed16cbb
2 parents 1fce60f + 5f51e83
commit ed16cbb

Showing 4 changed files with 27 additions and 35 deletions.
diff --git a/README.md b/README.md
@@ -5,45 +5,29 @@ Link seq2science output to ANANSE with 2 sample tables and one config file.
 ![](docs/img/anansnake.PNG)
 
 ## Installation
+
 ```bash
 mamba create -n anansnake -c bioconda anansnake
 ```
 
 Don't forget to activate the conda environment with `mamba activate anansnake`.
 
 ## Running anansnake on the example data
-The anansnake github repository contains an `example` folder which can be downloaded to try the workflow.
-Here we assume you've downloaded the folder in your current working directory.
-Check [it's README](https://github.com/vanheeringen-lab/anansnake/blob/master/example/README.md) for additional details!
 
-To check if everything is set up right, we can do a dry run:
-```bash
-anansnake --configfile example/config.yaml --dry-run
-```
-If you get an error, be sure to check the red text!
-I've added human-readable feedback where I could.
-
-With the example data, you still need to provide a genome and gene annotation.
-You can download the GRCz11 genome and gene annotation with 
-```bash
-genomepy install GRCz11 --annotation
-```
-If you do another dry run you should have no more errors, and see what is going to happen.
-
-To do the real run, you need to specify how many cores you want to use, and how much RAM you have:
-```bash
-anansnake --configfile example/config.yaml --resources mem_mb=48_000 --cores 12
-```
+The anansnake GitHub repository contains an `example` folder that can be downloaded to try the workflow.
+Check [its README](https://github.com/vanheeringen-lab/anansnake/blob/master/example/README.md) for details!
 
 ## Running anansnake
+
 Anansnake works with seq2science in- & output: The RNA- and ATAC-seq `samples.tsv` files are the same you've used for seq2science, with one addition (see below).
 The counts tables are output files without any changes.
 
 The RNA- and ATAC-seq samples are combined via a shared column in the samples.tsv files.
 In the example data, this is the `anansnake` column.
 Which conditions from the `anansnake` column are compared is set in the `config.yaml` file, under `contrasts`. 
 
-For files and settings you can check out the example folder.
+For files, settings, and command-line examples, check out the [example folder](https://github.com/vanheeringen-lab/anansnake/blob/master/example).
 
 ## Troubleshooting
+
 ANANSE can take tonnes of memory. If your machine freezes, reduce the number of threads or mem_mb.
diff --git a/anansnake/rules/gimme.smk b/anansnake/rules/gimme.smk
@@ -71,7 +71,7 @@ rule maelstrom:
     log:
         expand("{result_dir}/gimme/log_{assembly}_maelstrom.txt", assembly=ASSEMBLY, **config),
     params:
-        atac_samples=lambda wildcards : sorted({sample for vals in CONDITIONS.values() for sample in vals}),
+        atac_samples=lambda wildcards : sorted({sample for v in CONDITIONS.values() for sample in v['ATAC-seq samples']}),
     threads: 24
     resources:
         mem_mb=40_000,
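
The changed `atac_samples` lambda now reads the `'ATAC-seq samples'` entry of each condition instead of iterating the condition value directly. A minimal sketch of that comprehension follows, assuming each value in `CONDITIONS` is a dict. The condition names, sample names, and the `'RNA-seq samples'` key are hypothetical; only the `'ATAC-seq samples'` key comes from the diff.

```python
# Hypothetical stand-in for the workflow's CONDITIONS mapping.
# Only the 'ATAC-seq samples' key is taken from the diff; everything else is made up.
CONDITIONS = {
    "cond_a": {"ATAC-seq samples": ["s2", "s1"], "RNA-seq samples": ["r1"]},
    "cond_b": {"ATAC-seq samples": ["s3", "s1"], "RNA-seq samples": ["r2"]},
}


def atac_samples(conditions):
    """Collect the unique ATAC-seq sample names across all conditions, sorted."""
    return sorted(
        {sample for v in conditions.values() for sample in v["ATAC-seq samples"]}
    )


def iterate_values_directly(conditions):
    """The previous comprehension: iterating a dict value yields its keys."""
    return sorted({sample for vals in conditions.values() for sample in vals})


print(atac_samples(CONDITIONS))             # ['s1', 's2', 's3']
print(iterate_values_directly(CONDITIONS))  # ['ATAC-seq samples', 'RNA-seq samples']
```

Under that assumption, the old form would have passed dictionary keys rather than sample names to the maelstrom script, which is consistent with the fix in this hunk.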

diff --git a/anansnake/scripts/maelstrom.py b/anansnake/scripts/maelstrom.py
@@ -37,17 +37,24 @@
         if "average" in df.columns:
             df.drop(columns=["average"], inplace=True)
 
-        # filter for samples in the conditions
-        logger.info(f"columns found in ATAC-seq read counts table: {df.columns}")
-        cols = "|".join(columns)
-        re_column = re.compile(rf"^{cols}$", re.IGNORECASE)
-        df = df.filter(regex=re_column)
-        logger.info(f"columns used for maelstrom: {df.columns}")
-        if len(columns) != len(df.columns):
-            logger.warning(
-                f"{len(columns)} expected in ATAC-seq read counts table, "
-                f"{len(df.columns)} found after filtering."
-            )
+        # filter for the ATAC-seq samples in the conditions
+        raw_columns = df.columns.to_list()
+        if sorted(columns) != sorted(raw_columns):
+            cols = "|".join(columns)
+            re_column = re.compile(rf"^(?:{cols})$", re.IGNORECASE)  # group so ^ and $ apply to every name
+            df = df.filter(regex=re_column, axis=1)
+
+            if len(columns) != len(df.columns):
+                logger.warning(
+                    f"{len(columns)} expected in ATAC-seq read counts table, "
+                    f"{len(raw_columns)} found before & "
+                    f"{len(df.columns)} found after filtering."
+                )
+                logger.info(f"expected: {columns}")
+                logger.info(f"columns found before filtering: {raw_columns}")
+                logger.info(f"columns found after filtering: {df.columns.to_list()}")
+
+            assert df.shape[1] > 0, "empty dataframe after filtering columns!"
 
         # remove zero rows (introduced by filtering columns)
         df = df[df.values.sum(axis=1) != 0]
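
The column filter in this hunk joins the expected sample names into one case-insensitive alternation between `^` and `$`. In Python's `re`, `|` binds more loosely than the anchors, so without a group only the first and last alternatives are anchored. The sketch below illustrates this on plain lists of names rather than a DataFrame. The function names and sample names are hypothetical, and escaping each name with `re.escape` is an added assumption, not something the script does.

```python
import re


def select_ungrouped(expected, available):
    """Alternation without a group: '^a|b|c$' anchors only 'a' and 'c'."""
    pattern = re.compile(rf"^{'|'.join(expected)}$", re.IGNORECASE)
    return [name for name in available if pattern.search(name)]


def select_grouped(expected, available):
    """Alternation inside a non-capturing group: every name must match in full."""
    # Escaping is an assumption here; it treats sample names as literal text.
    alternatives = "|".join(re.escape(name) for name in expected)
    pattern = re.compile(rf"^(?:{alternatives})$", re.IGNORECASE)
    return [name for name in available if pattern.search(name)]


expected = ["sample1", "sample2", "sample3"]
available = ["Sample1", "sample2_rep", "xsample2", "SAMPLE3", "average_sample3"]

print(select_ungrouped(expected, available))
# ['Sample1', 'sample2_rep', 'xsample2', 'SAMPLE3', 'average_sample3']
print(select_grouped(expected, available))
# ['Sample1', 'SAMPLE3']
```

With the grouped pattern, a mismatch between the expected and filtered column counts points at genuinely missing samples rather than partial name matches.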

diff --git a/example/README.md b/example/README.md
@@ -6,9 +6,10 @@
 4. activate the conda environment with `mamba activate anansnake`
 
 ## Running anansnake on the example data
+
 To check if everything is set up right, we can do a dry run:
 ```bash
-anansnake --configfile example/config.yaml --dry-run
+anansnake --configfile example/config.yaml --resources mem_mb=48_000 --cores 12 --dry-run --reason
 ```
 If you get an error, be sure to check the red text!
 I've added human-readable feedback where I could.