diff --git a/README.md b/README.md
index a8c32f9..38e530a 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,7 @@ Link seq2science output to ANANSE with 2 sample tables and one config file.
 ![](docs/img/anansnake.PNG)

 ## Installation
+
 ```bash
 mamba create -n anansnake -c bioconda anansnake
 ```
@@ -12,30 +13,12 @@ mamba create -n anansnake -c bioconda anansnake
 Don't forget to activate the conda environment with `mamba activate anansnake`.

 ## Running anansnake on the example data
-The anansnake github repository contains an `example` folder which can be downloaded to try the workflow.
-Here we assume you've downloaded the folder in your current working directory.
-Check [it's README](https://github.com/vanheeringen-lab/anansnake/blob/master/example/README.md) for additional details!
-To check if everything is set up right, we can do a dry run:
-```bash
-anansnake --configfile example/config.yaml --dry-run
-```
-If you get an error, be sure to check the red text!
-I've added human-readable feedback where I could.
-
-With the example data, you still need to provide a genome and gene annotation.
-You can download the GRCz11 genome and gene annotation with
-```bash
-genomepy install GRCz11 --annotation
-```
-If you do another dry run you should have no more errors, and see what is going to happen.
-
-To do the real run, you need to specify how many cores you want to use, and how much RAM you have:
-```bash
-anansnake --configfile example/config.yaml --resources mem_mb=48_000 --cores 12
-```
+The anansnake github repository contains an `example` folder which can be downloaded to try the workflow.
+Check [it's README](https://github.com/vanheeringen-lab/anansnake/blob/master/example/README.md) for details!


 ## Running anansnake
+
 Anansnake works with seq2science in- & output:
 The RNA- and ATAC-seq `samples.tsv` files are the same you've used for seq2science, with one addition (see below).
 The counts tables are output files without any changes.
@@ -43,7 +26,8 @@ The RNA- and ATAC-seq samples are combined via a shared column in the samples.ts
 In the example data, this is the `anansnake` column.
 Which conditions from the `anansnake` column are compared is set in the `config.yaml` file, under `contrasts`.

-For files and settings you can check out the example folder.
+For files and settings & command line examples you can check out the [example folder](https://github.com/vanheeringen-lab/anansnake/blob/master/example).

 ## Troubleshooting
+
 ANANSE can take tonnes of memory. If your machine freezes, reduce the number of threads or mem_mb.
diff --git a/anansnake/rules/gimme.smk b/anansnake/rules/gimme.smk
index af33f47..a47bfd3 100644
--- a/anansnake/rules/gimme.smk
+++ b/anansnake/rules/gimme.smk
@@ -71,7 +71,7 @@ rule maelstrom:
     log:
         expand("{result_dir}/gimme/log_{assembly}_maelstrom.txt", assembly=ASSEMBLY, **config),
     params:
-        atac_samples=lambda wildcards : sorted({sample for vals in CONDITIONS.values() for sample in vals}),
+        atac_samples=lambda wildcards : sorted({sample for v in CONDITIONS.values() for sample in v['ATAC-seq samples']}),
     threads: 24
     resources:
         mem_mb=40_000,
diff --git a/anansnake/scripts/maelstrom.py b/anansnake/scripts/maelstrom.py
index a1651a1..63a1756 100644
--- a/anansnake/scripts/maelstrom.py
+++ b/anansnake/scripts/maelstrom.py
@@ -37,17 +37,24 @@
     if "average" in df.columns:
         df.drop(columns=["average"], inplace=True)

-    # filter for samples in the conditions
-    logger.info(f"columns found in ATAC-seq read counts table: {df.columns}")
-    cols = "|".join(columns)
-    re_column = re.compile(rf"^{cols}$", re.IGNORECASE)
-    df = df.filter(regex=re_column)
-    logger.info(f"columns used for maelstrom: {df.columns}")
-    if len(columns) != len(df.columns):
-        logger.warning(
-            f"{len(columns)} expected in ATAC-seq read counts table, "
-            f"{len(df.columns)} found after filtering."
-        )
+    # filter for the ATAC-seq samples in the conditions
+    raw_columns = df.columns.to_list()
+    if not sorted(columns).__eq__(sorted(raw_columns)):
+        cols = "|".join(columns)
+        re_column = re.compile(rf"^{cols}$", re.IGNORECASE)
+        df = df.filter(regex=re_column, axis=1)
+
+        if len(columns) != len(df.columns):
+            logger.warning(
+                f"{len(columns)} expected in ATAC-seq read counts table, "
+                f"{len(raw_columns)} found before & "
+                f"{len(df.columns)} found after filtering."
+            )
+            logger.info(f"expected: {columns}")
+            logger.info(f"columns found before filtering: {raw_columns}")
+            logger.info(f"columns found after filtering: {df.columns.to_list()}")
+
+    assert df.shape[1] > 0, "empty dataframe after filtering columns!"

     # remove zero rows (introduced by filtering columns)
     df = df[df.values.sum(axis=1) != 0]
diff --git a/example/README.md b/example/README.md
index 6ee6057..5332586 100644
--- a/example/README.md
+++ b/example/README.md
@@ -6,9 +6,10 @@
 4. activate the conda environment with `mamba activate anansnake`

 ## Running anansnake on the example data
+
 To check if everything is set up right, we can do a dry run:
 ```bash
-anansnake --configfile example/config.yaml --dry-run
+anansnake --configfile example/config.yaml --resources mem_mb=48_000 --cores 12 --dry-run --reason
 ```
 If you get an error, be sure to check the red text!
 I've added human-readable feedback where I could.
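
Review note on the column filter in `anansnake/scripts/maelstrom.py`: the pattern `rf"^{cols}$"` joins the sample names with `|` without grouping them, so `^` binds only to the first name and `$` only to the last. Since pandas' `DataFrame.filter(regex=...)` matches with `re.search`, the filter may keep columns that merely contain a sample name. The sketch below reproduces this on plain lists using only the standard library. The sample names and the `filter_columns` helper are hypothetical and not part of anansnake; grouping plus `re.escape` is one possible tightening, not necessarily what the author intends.

```python
import re


def filter_columns(columns, wanted):
    """Return the entries of `columns` that fully match a name in `wanted`, ignoring case.

    Hypothetical helper for illustration; not part of anansnake.
    """
    # Escape each name so characters such as '.' or '+' are matched literally,
    # and group the alternation so that ^ and $ apply to every name.
    pattern = re.compile(
        "^(?:" + "|".join(re.escape(name) for name in wanted) + ")$",
        re.IGNORECASE,
    )
    return [col for col in columns if pattern.search(col)]


# Made-up column names standing in for an ATAC-seq counts table header.
raw_columns = ["GRCz11-shield", "GRCz11-shield_extra", "pre-GRCz11-dome", "GRCz11-DOME"]
wanted = ["GRCz11-shield", "GRCz11-dome"]

# Ungrouped pattern, built the same way as rf"^{cols}$" in the diff:
# ^ anchors only the first alternative, $ only the last one.
loose = re.compile("^" + "|".join(wanted) + "$", re.IGNORECASE)
loose_hits = [col for col in raw_columns if loose.search(col)]

strict_hits = filter_columns(raw_columns, wanted)

print("ungrouped:", loose_hits)
print("grouped:  ", strict_hits)
```

With these made-up names the ungrouped pattern keeps all four columns, while the grouped pattern keeps only the two exact (case-insensitive) matches. The new `len(columns) != len(df.columns)` warning in the diff would flag such a mismatch, but it would not drop the extra columns.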