Merge pull request #45 from timothymillar/f/tabular-data
Improved usability as library
timothymillar authored May 5, 2017
2 parents 84e3df6 + eef854c commit 1f40715
Showing 31 changed files with 2,754 additions and 2,666 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,5 +1,5 @@
 __pycache__/
 .idea/
 .cache/
-tectoolkit.egg-info/
+tefingerprint.egg-info/
 dist/
19 changes: 10 additions & 9 deletions IDEAS.md
@@ -3,22 +3,23 @@
 This is record of feature/roadmap ideas

 1. Preprocessing:
-    - [ ] A simple python wrapper for the established process to identifies the “dangler” reads and map them against the reference. Ideally the name of the TE each dangler represents is stored as a SAM tag rather than appended to the name (partially implemented)
+    - [x] A simple python wrapper for the established process to identifies the “dangler” reads and map them against the reference.
+    - [x] Store the name of the TE each dangler represents as a SAM tag rather than appended to the name.
+    - [ ] Use soft clipped reads to get more accurate location of insertion

 2. Fingerprinting:
     - [x] Basic “flat” clustering
     - [x] Hierarchical clustering to split close/nested clusters
     - [x] Output to GFF3 format (recorded statistics could be improved)
-    - [ ] Associate clusters on opposite strands representing ends of the same TE insertion (Not implemented)
-    - [ ] Use soft clipped reads to get more accurate location of insertion (Not implemented)
-    - [ ] Use of anchor reads to assess homozygosity of insertions
+    - [ ] Associate clusters on opposite strands representing ends of the same TE insertion (removed until required)
+    - [ ] Use of anchor reads to assess homozygosity of insertions (for re-sequence data)

 3. Fingerprint comparisons:
     - [x] Identify “comparative bins” where clusters are found in at least one sample
-    - [x] summary statistics for comparison across each bin (very basic although still on par with competitors)
+    - [x] summary statistics for comparison across each bin (basic counts)
 4. Output filetypes:
-    - [x] GFF3 (not sorted properly)
-    - [ ] SQLite using GffUtils
+    - [x] GFF3
+    - [x] GFF3 long form (no nested counts)
+    - [x] Tabular data (available in python using pandas)
 4. Filtering output:
-    - [x] Script to filter the GFF output (slowish and crude)
-    - [ ] Script to filter SQLite output
+    - [x] Script to filter the GFF output (could be improved)
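
The "comparative bins" item in IDEAS.md (regions where a cluster is found in at least one sample) can be pictured as a union of cluster intervals across samples, followed by a per-sample count in each bin. The sketch below is only an illustration of that idea under stated assumptions: the function name `comparative_bins`, the input layout (a dict of sample name to inclusive `(start, stop)` tuples), and counting clusters rather than reads are all hypothetical and are not taken from the tefingerprint code.

```python
def comparative_bins(samples):
    """Merge cluster intervals from all samples into bins, then count
    how many clusters from each sample overlap each bin.

    `samples` is a hypothetical input layout: a dict mapping a sample
    name to a list of inclusive (start, stop) cluster intervals.
    """
    # Pool every interval from every sample and sort by start position.
    intervals = sorted(iv for clusters in samples.values() for iv in clusters)

    # Union step: extend the current bin while intervals overlap it.
    bins = []
    for start, stop in intervals:
        if bins and start <= bins[-1][1]:
            bins[-1][1] = max(bins[-1][1], stop)
        else:
            bins.append([start, stop])

    # Count step: clusters per sample that overlap each bin.
    table = []
    for start, stop in bins:
        counts = {
            name: sum(1 for s, e in clusters if s <= stop and e >= start)
            for name, clusters in samples.items()
        }
        table.append((start, stop, counts))
    return table


example = {"a": [(10, 20), (50, 60)], "b": [(15, 30)]}
bins = comparative_bins(example)
print(bins)
```

With the example input, the overlapping clusters at 10-20 and 15-30 fall into one bin and the cluster at 50-60 forms a second bin that only sample "a" contributes to.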
336 changes: 108 additions & 228 deletions README.md

Large diffs are not rendered by default.

399 changes: 170 additions & 229 deletions README.rst

Large diffs are not rendered by default.

52 changes: 0 additions & 52 deletions applications/tec

This file was deleted.

43 changes: 43 additions & 0 deletions applications/tef
@@ -0,0 +1,43 @@
+#! /usr/bin/env python
+
+import sys
+import argparse
+from tefingerprint.filtergff import FilterGffProgram
+from tefingerprint.preprocess import PreProcessProgram
+from tefingerprint.programs import FingerprintProgram
+from tefingerprint.programs import ComparisonProgram
+
+
+def parse_program_arg(arg):
+    """"""
+    parser = argparse.ArgumentParser('Identify program to run')
+    parser.add_argument('program',
+                        type=str,
+                        choices=("preprocess",
+                                 "fingerprint",
+                                 "compare",
+                                 "filter_gff"))
+    return parser.parse_args(arg)
+
+
+def main():
+    """"""
+    if len(sys.argv) == 1:
+        # default to simple help message
+        program_arg = parse_program_arg([])
+    else:
+        program_arg = parse_program_arg([sys.argv[1]])
+
+    program = program_arg.program
+    if program == "fingerprint":
+        FingerprintProgram(sys.argv[2:])
+    elif program == "compare":
+        ComparisonProgram(sys.argv[2:])
+    elif program == "filter_gff":
+        FilterGffProgram(sys.argv[2:])
+    elif program == "preprocess":
+        job = PreProcessProgram.from_cli(sys.argv[2:])
+        job.run()
+
+if __name__ == '__main__':
+    main()
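
The new `applications/tef` script parses only the first command-line argument to choose a program, then hands the remaining arguments to that program. A minimal, self-contained sketch of that two-stage dispatch follows. It assumes stand-in handlers in place of the real tefingerprint classes, and the `dispatch` helper and `handlers` dict are hypothetical names introduced only for illustration (the script itself uses an if/elif chain).

```python
import argparse

PROGRAMS = ("preprocess", "fingerprint", "compare", "filter_gff")


def parse_program_arg(arg):
    """Parse only the sub-command name, as applications/tef does."""
    parser = argparse.ArgumentParser('Identify program to run')
    parser.add_argument('program', type=str, choices=PROGRAMS)
    return parser.parse_args(arg)


def dispatch(argv, handlers):
    """Select a handler by sub-command and pass it the remaining arguments.

    `handlers` is a hypothetical dict of stand-in callables; the real
    script calls the tefingerprint program classes directly.
    """
    # argv[1:2] is [] when no sub-command is given, so argparse reports
    # the missing argument and exits, like the script's default branch.
    program = parse_program_arg(argv[1:2]).program
    return handlers[program](argv[2:])


# Stand-in handlers that just echo the program name and its arguments.
handlers = {name: (lambda args, name=name: (name, args)) for name in PROGRAMS}

result = dispatch(["tef", "compare", "a.bam", "b.bam"], handlers)
print(result)  # ('compare', ['a.bam', 'b.bam'])
```

Restricting the first parser to a single positional argument with `choices` means an unknown program name is rejected by argparse before any program-specific option parsing takes place.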
4 changes: 1 addition & 3 deletions environment.yml
@@ -1,4 +1,4 @@
-name: tectoolkit
+name: tefingerprint
 dependencies:
   - numpy>=1.11.1
   - pip>=8.1.2
@@ -12,5 +12,3 @@ dependencies:
   - pip:
     - pysam>=0.9.1.4
     - pytest>=3.0.0
-    - gffutils>=0.8.7.1
-
1 change: 0 additions & 1 deletion requirements.txt
@@ -1,3 +1,2 @@
 numpy>=1.11.1
 pysam>=0.9.1.4
-gffutils>=0.8.7.1
12 changes: 6 additions & 6 deletions setup.py
@@ -8,14 +8,14 @@ def read_file(file_name):
     return os.path.join(os.path.dirname(__file__), file_name)


-setup(name='tectoolkit',
-      version='0.0.2',
+setup(name='tefingerprint',
+      version='0.0.3',
       author='Tim Millar',
       author_email='tim.millar@plantandfood.co.nz',
-      url='https://github.com/PlantandFoodResearch/TECtoolkit',
-      description='Toolkit for identifying transposable element movement in regenerant clones',
+      url='https://github.com/PlantandFoodResearch/TEFingerprint',
+      description='Toolkit for identifying transposon movement',
       long_description=read_file('README.MD'),
-      scripts=['applications/tec'],
-      packages=['tectoolkit'],
+      scripts=['applications/tef'],
+      packages=['tefingerprint'],
       classifiers=['Development Status :: 2 - Pre-Alpha']
       )
6 changes: 0 additions & 6 deletions tectoolkit/__init__.py

This file was deleted.

205 changes: 0 additions & 205 deletions tectoolkit/bam_io.py

This file was deleted.
