Merge pull request #7 from vanheeringen-lab/maelstrom
Maelstrom
siebrenf authored Jan 12, 2023
2 parents 1fce60f + 5f51e83 commit ed16cbb
Showing 4 changed files with 27 additions and 35 deletions.
28 changes: 6 additions & 22 deletions README.md
@@ -5,45 +5,29 @@ Link seq2science output to ANANSE with 2 sample tables and one config file.
![](docs/img/anansnake.PNG)

## Installation

```bash
mamba create -n anansnake -c bioconda anansnake
```

Don't forget to activate the conda environment with `mamba activate anansnake`.

## Running anansnake on the example data
The anansnake github repository contains an `example` folder which can be downloaded to try the workflow.
Here we assume you've downloaded the folder in your current working directory.
Check [its README](https://github.com/vanheeringen-lab/anansnake/blob/master/example/README.md) for additional details!

To check if everything is set up right, we can do a dry run:
```bash
anansnake --configfile example/config.yaml --dry-run
```
If you get an error, be sure to check the red text!
I've added human-readable feedback where I could.

With the example data, you still need to provide a genome and gene annotation.
You can download the GRCz11 genome and gene annotation with
```bash
genomepy install GRCz11 --annotation
```
If you do another dry run you should have no more errors, and see what is going to happen.

To do the real run, you need to specify how many cores you want to use, and how much RAM you have:
```bash
anansnake --configfile example/config.yaml --resources mem_mb=48_000 --cores 12
```
The anansnake github repository contains an `example` folder which can be downloaded to try the workflow.
Check [its README](https://github.com/vanheeringen-lab/anansnake/blob/master/example/README.md) for details!

## Running anansnake

Anansnake works with seq2science in- & output: The RNA- and ATAC-seq `samples.tsv` files are the same you've used for seq2science, with one addition (see below).
The counts tables are output files without any changes.

The RNA- and ATAC-seq samples are combined via a shared column in the samples.tsv files.
In the example data, this is the `anansnake` column.
Which conditions from the `anansnake` column are compared is set in the `config.yaml` file, under `contrasts`.

For files and settings you can check out the example folder.
For files, settings, and command line examples, check out the [example folder](https://github.com/vanheeringen-lab/anansnake/blob/master/example).

## Troubleshooting

ANANSE can take tonnes of memory. If your machine freezes, reduce the number of threads or mem_mb.
2 changes: 1 addition & 1 deletion anansnake/rules/gimme.smk
@@ -71,7 +71,7 @@ rule maelstrom:
log:
expand("{result_dir}/gimme/log_{assembly}_maelstrom.txt", assembly=ASSEMBLY, **config),
params:
atac_samples=lambda wildcards : sorted({sample for vals in CONDITIONS.values() for sample in vals}),
atac_samples=lambda wildcards : sorted({sample for v in CONDITIONS.values() for sample in v['ATAC-seq samples']}),
threads: 24
resources:
mem_mb=40_000,
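
The changed `atac_samples` lambda suggests that each entry of `CONDITIONS` is now a mapping with an `'ATAC-seq samples'` key rather than a flat list of samples. A minimal sketch of the difference, assuming that nested layout (the condition names, sample names, and the `'RNA-seq samples'` key below are hypothetical and not taken from the repository):

```python
# Hypothetical CONDITIONS mapping. Only the 'ATAC-seq samples' key appears in the diff;
# the condition names, sample names and the 'RNA-seq samples' key are assumptions.
CONDITIONS = {
    "cond_a": {"ATAC-seq samples": ["atac_2", "atac_1"], "RNA-seq samples": ["rna_1"]},
    "cond_b": {"ATAC-seq samples": ["atac_3", "atac_1"], "RNA-seq samples": ["rna_2"]},
}

# Previous expression: iterating over each value directly.
# With nested dicts this walks over the dict keys, not the sample names.
old_style = sorted({sample for vals in CONDITIONS.values() for sample in vals})

# Updated expression: select the ATAC-seq sample list of each condition first.
new_style = sorted(
    {sample for v in CONDITIONS.values() for sample in v["ATAC-seq samples"]}
)

print(old_style)  # ['ATAC-seq samples', 'RNA-seq samples']
print(new_style)  # ['atac_1', 'atac_2', 'atac_3']
```

The set comprehension removes samples shared between conditions, and `sorted` gives the rule a deterministic parameter value between runs.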
29 changes: 18 additions & 11 deletions anansnake/scripts/maelstrom.py
@@ -37,17 +37,24 @@
if "average" in df.columns:
df.drop(columns=["average"], inplace=True)

# filter for samples in the conditions
logger.info(f"columns found in ATAC-seq read counts table: {df.columns}")
cols = "|".join(columns)
re_column = re.compile(rf"^{cols}$", re.IGNORECASE)
df = df.filter(regex=re_column)
logger.info(f"columns used for maelstrom: {df.columns}")
if len(columns) != len(df.columns):
logger.warning(
f"{len(columns)} expected in ATAC-seq read counts table, "
f"{len(df.columns)} found after filtering."
)
# filter for the ATAC-seq samples in the conditions
raw_columns = df.columns.to_list()
if sorted(columns) != sorted(raw_columns):
cols = "|".join(columns)
re_column = re.compile(rf"^{cols}$", re.IGNORECASE)
df = df.filter(regex=re_column, axis=1)

if len(columns) != len(df.columns):
logger.warning(
f"{len(columns)} expected in ATAC-seq read counts table, "
f"{len(raw_columns)} found before & "
f"{len(df.columns)} found after filtering."
)
logger.info(f"expected: {columns}")
logger.info(f"columns found before filtering: {raw_columns}")
logger.info(f"columns found after filtering: {df.columns.to_list()}")

assert df.shape[1] > 0, "empty dataframe after filtering columns!"

# remove zero rows (introduced by filtering columns)
df = df[df.values.sum(axis=1) != 0]
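
The column selection in this hunk can be illustrated without pandas. The sketch below is not the script's code: `select_columns` is a hypothetical helper and the sample names are made up. It mirrors the case-insensitive match on the expected sample names, and it also shows one thing worth checking. In a pattern built as `^a|b$`, the anchors bind only to the first and last alternative, so a searched pattern can keep columns that merely start or end with a sample name. Grouping the alternatives (and escaping them) avoids that.

```python
import re


def select_columns(raw_columns, expected):
    """Return the raw column names that match an expected sample name, ignoring case.

    Hypothetical helper for illustration; not part of anansnake.
    """
    # Nothing to filter when both lists hold the same names.
    if sorted(expected) == sorted(raw_columns):
        return list(raw_columns)
    # Group the alternatives so ^ and $ apply to every name, and escape
    # characters that have a special meaning in regular expressions.
    pattern = re.compile(
        "^(?:" + "|".join(re.escape(name) for name in expected) + ")$",
        re.IGNORECASE,
    )
    return [col for col in raw_columns if pattern.search(col)]


raw = ["Sample1", "sample2", "sample1_rep"]  # made-up column names
expected = ["sample1", "sample2"]

kept = select_columns(raw, expected)
print(kept)  # ['Sample1', 'sample2']

# For comparison: the ungrouped pattern "^sample1|sample2$".
loose = re.compile("^" + "|".join(expected) + "$", re.IGNORECASE)
loose_kept = [col for col in raw if loose.search(col)]
print(loose_kept)  # ['Sample1', 'sample2', 'sample1_rep']
```

Whether the looser match matters in practice depends on how the counts table columns are named; the warning added in this commit (expected versus found column counts) would flag such a mismatch either way.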
3 changes: 2 additions & 1 deletion example/README.md
@@ -6,9 +6,10 @@
4. activate the conda environment with `mamba activate anansnake`

## Running anansnake on the example data

To check if everything is set up right, we can do a dry run:
```bash
anansnake --configfile example/config.yaml --dry-run
anansnake --configfile example/config.yaml --resources mem_mb=48_000 --cores 12 --dry-run --reason
```
If you get an error, be sure to check the red text!
I've added human-readable feedback where I could.