MethylMallet

Full outer join of very large gene methylation data files with low resource usage.

System Requirements

Required

- bash
- GNU coreutils
- gzip
- gcc

Optional

- GNU parallel
- make

Testing and Quality Control

- python3
- R with the following packages
  - doParallel
  - dplyr
  - readr

Installation

Before running the script, you must compile the C source code. If make and gcc are installed, run the following command in the directory containing the makefile.

make

Development Environment

On GNU/Linux, these programs are likely already installed. If you do not have a development environment, you will need to install one: macOS users can install Xcode, and I recommend Rtools for Windows users.

Usage

usage:
methyl_mallet [-h] [-k] [-n NMERGE] [-S BUFFER_SIZE] -d DIR -o OUT_FILE FILE [FILE ...]

Do a full outer join of tab-separated methylation files.

positional arguments:
  FILE            files to be joined

required arguments:
  -d DIR          working directory (doesn't need to exist but should be empty)
  -o OUT_FILE     file name to be output to

optional arguments:
  -h              show this help message and exit
  -k              keep intermediary files
  -n NMERGE       number of files to merge simultaneously
  -S BUFFER_SIZE  buffer size allocated to sorting operation

NOTE: The working directory should be empty.
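
To make the operation concrete, here is a minimal Python sketch of a full outer join over small in-memory tables. It is an illustration only and does not reflect how methyl_mallet is implemented: the function name full_outer_join, the one-key-one-value layout, the sample data, and the NA filler for missing values are all assumptions made for this example.

```python
def full_outer_join(tables, missing="NA"):
    """Join several {key: value} tables on key, keeping every key.

    Hypothetical helper for illustration; `missing` fills the gaps
    where a table has no value for a key.
    """
    # Union of all keys, sorted so the output order is deterministic.
    keys = sorted(set().union(*(t.keys() for t in tables)))
    # One output row per key: the key, then one column per input table.
    return [[key] + [t.get(key, missing) for t in tables] for key in keys]


# Two tiny made-up samples keyed by a made-up position label.
sample_a = {"chr1:100": "0.25", "chr1:200": "0.50"}
sample_b = {"chr1:200": "0.75", "chr2:300": "1.00"}

joined = full_outer_join([sample_a, sample_b])
for row in joined:
    print("\t".join(row))
```

Every key from either sample appears exactly once in the result, and a sample with no value for a key gets the filler. Holding all tables in memory this way stops being practical as the inputs grow, which is the situation this tool is meant for.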

Quality Control

The full joined dataset is too large to produce in R, but small subsets are manageable, so we can verify results by random sampling. This check is somewhat imperfect because the keys we sample are drawn from the output file itself, so a key that is missing from the output would never be selected.

The check reads a random sample of lines from the output file (1000 by default), reproduces a dataset matching those lines from the source data, and compares the two. Use prod_check.r in the qc/ folder to run it.
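
The sampling idea can be sketched in Python as follows. This is not a port of prod_check.r. The function names sample_keys and check_sample, the in-memory {key: value} source tables, the sample data, and the NA filler are all assumptions made for illustration.

```python
import random


def sample_keys(output_rows, n=1000, seed=None):
    """Pick up to n keys at random from the joined output rows."""
    rng = random.Random(seed)
    keys = [row[0] for row in output_rows]
    return rng.sample(keys, min(n, len(keys)))


def check_sample(output_rows, sources, keys, missing="NA"):
    """Return the sampled keys whose output row disagrees with the sources."""
    by_key = {row[0]: row[1:] for row in output_rows}
    mismatches = []
    for key in keys:
        # Rebuild the expected row directly from the source tables.
        expected = [src.get(key, missing) for src in sources]
        if by_key.get(key) != expected:
            mismatches.append(key)
    return mismatches


# Made-up data: two source tables and the joined output to verify.
sources = [{"a": "0.1", "b": "0.2"}, {"b": "0.3", "c": "0.4"}]
output_rows = [["a", "0.1", "NA"], ["b", "0.2", "0.3"], ["c", "NA", "0.4"]]

keys = sample_keys(output_rows, n=2, seed=0)
print(check_sample(output_rows, sources, keys))
```

An empty list means every sampled row matched its reconstruction. Because the keys are drawn from the output rows, a key that exists in a source but is absent from the output is never sampled, which is the limitation noted above.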

Test Data

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43857

Citation

If you use this work to generate data for publication, please cite it. A possible citation is as follows.

Egeler, PW (2019). MethylMallet. GitHub repository: https://github.com/SpectrumHealthResearch/MethylMallet. Commit [put hash here].
