Merge pull request #45 from timothymillar/f/tabular-data

Improved usability as library
PlantandFoodResearch · May 5, 2017 · 1f40715
2 parents 84e3df6 + eef854c
commit 1f40715

Showing 31 changed files with 2,754 additions and 2,666 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,5 @@
 __pycache__/
 .idea/
 .cache/
-tectoolkit.egg-info/
+tefingerprint.egg-info/
 dist/
diff --git a/IDEAS.md b/IDEAS.md
@@ -3,22 +3,23 @@
 This is record of feature/roadmap ideas 
 
 1. Preprocessing:
-	- [ ] A simple python wrapper for the established process to identifies the “dangler” reads and  map them against the reference. Ideally the name of the TE each dangler represents is stored as a SAM tag rather than appended to the name (partially implemented)
+	- [x] A simple python wrapper for the established process to identify the “dangler” reads and map them against the reference.
+	- [x] Store the name of the TE each dangler represents as a SAM tag rather than appended to the name.
+	- [ ] Use soft clipped reads to get more accurate location of insertion
 
 2. Fingerprinting:
 	- [x] Basic “flat” clustering
 	- [x] Hierarchical clustering to split close/nested clusters
 	- [x] Output to GFF3 format (recorded statistics could be improved)
-	- [ ] Associate clusters on opposite strands representing ends of the same TE insertion (Not implemented)
-	- [ ] Use soft clipped reads to get more accurate location of insertion (Not implemented)
-	- [ ] Use of anchor reads to assess homozygosity of insertions
+	- [ ] Associate clusters on opposite strands representing ends of the same TE insertion (removed until required)
+	- [ ] Use of anchor reads to assess homozygosity of insertions (for re-sequence data)
 
 3. Fingerprint comparisons:
 	- [x] Identify “comparative bins” where clusters are found in at least one sample
-	- [x] summary statistics for comparison across each bin (very basic although still on par with competitors)
+	- [x] summary statistics for comparison across each bin (basic counts)
 4. Output filetypes:
-	- [x] GFF3 (not sorted properly)
-	- [ ] SQLite using GffUtils
+	- [x] GFF3
+	- [x] GFF3 long form (no nested counts)
+	- [x] Tabular data (available in python using pandas)
 4. Filtering output:
-	- [x] Script to filter the GFF output (slowish and crude)
-	- [ ] Script to filter SQLite output
+	- [x] Script to filter the GFF output (could be improved)
diff --git a/README.md b/README.md
diff --git a/README.rst b/README.rst
diff --git a/applications/tec b/applications/tec
diff --git a/applications/tef b/applications/tef
@@ -0,0 +1,43 @@
+#! /usr/bin/env python
+
+import sys
+import argparse
+from tefingerprint.filtergff import FilterGffProgram
+from tefingerprint.preprocess import PreProcessProgram
+from tefingerprint.programs import FingerprintProgram
+from tefingerprint.programs import ComparisonProgram
+
+
+def parse_program_arg(arg):
+    """"""
+    parser = argparse.ArgumentParser('Identify program to run')
+    parser.add_argument('program',
+                        type=str,
+                        choices=("preprocess",
+                                 "fingerprint",
+                                 "compare",
+                                 "filter_gff"))
+    return parser.parse_args(arg)
+
+
+def main():
+    """"""
+    if len(sys.argv) == 1:
+        # default to simple help message
+        program_arg = parse_program_arg([])
+    else:
+        program_arg = parse_program_arg([sys.argv[1]])
+
+    program = program_arg.program
+    if program == "fingerprint":
+        FingerprintProgram(sys.argv[2:])
+    elif program == "compare":
+        ComparisonProgram(sys.argv[2:])
+    elif program == "filter_gff":
+        FilterGffProgram(sys.argv[2:])
+    elif program == "preprocess":
+        job = PreProcessProgram.from_cli(sys.argv[2:])
+        job.run()
+
+if __name__ == '__main__':
+    main()
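
Note on the dispatch pattern in `applications/tef` above: only the first command-line token is parsed to pick a program, and everything after it is forwarded unchanged to that program. The following is a minimal, self-contained sketch of that pattern, not the committed code. The names `split_program_args`, `dispatch`, and `handlers` are hypothetical, and the handlers are stand-ins that echo their arguments rather than the real `tefingerprint` program classes. The sketch passes the help text as `description=` because the first positional parameter of `argparse.ArgumentParser` is `prog`, which is used as the program name in the usage line.

```python
import argparse

# Program names taken from the `choices` tuple in the diff above.
PROGRAMS = ("preprocess", "fingerprint", "compare", "filter_gff")


def split_program_args(argv):
    """Split argv (without the script name) into (program, remaining args).

    Hypothetical helper: argparse exits with a usage message if the first
    token is missing or is not one of PROGRAMS.
    """
    parser = argparse.ArgumentParser(description="Identify program to run")
    parser.add_argument("program", type=str, choices=PROGRAMS)
    # Parse only the first token; the rest is left for the chosen program.
    namespace = parser.parse_args(argv[:1])
    return namespace.program, argv[1:]


def dispatch(argv, handlers):
    """Look up the handler for the selected program and pass it the rest."""
    program, rest = split_program_args(argv)
    return handlers[program](rest)


# Stand-in handlers (assumption): each one just echoes what it received.
handlers = {name: (lambda args, name=name: (name, args)) for name in PROGRAMS}

result = dispatch(["compare", "--flag", "value"], handlers)
print(result)
```

A table of handlers keeps the set of valid program names and the dispatch logic in one place, whereas the `if`/`elif` chain in the diff has to be kept in sync with `choices` by hand.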
diff --git a/environment.yml b/environment.yml
@@ -1,4 +1,4 @@
-name: tectoolkit
+name: tefingerprint
 dependencies:
 - numpy>=1.11.1
 - pip>=8.1.2
@@ -12,5 +12,3 @@ dependencies:
 - pip:
   - pysam>=0.9.1.4
   - pytest>=3.0.0
-  - gffutils>=0.8.7.1
-
diff --git a/requirements.txt b/requirements.txt
@@ -1,3 +1,2 @@
 numpy>=1.11.1
 pysam>=0.9.1.4
-gffutils>=0.8.7.1
diff --git a/setup.py b/setup.py
@@ -8,14 +8,14 @@ def read_file(file_name):
     return os.path.join(os.path.dirname(__file__), file_name)
 
 
-setup(name='tectoolkit',
-      version='0.0.2',
+setup(name='tefingerprint',
+      version='0.0.3',
       author='Tim Millar',
       author_email='tim.millar@plantandfood.co.nz',
-      url='https://github.com/PlantandFoodResearch/TECtoolkit',
-      description='Toolkit for identifying transposable element movement in regenerant clones',
+      url='https://github.com/PlantandFoodResearch/TEFingerprint',
+      description='Toolkit for identifying transposon movement',
       long_description=read_file('README.MD'),
-      scripts=['applications/tec'],
-      packages=['tectoolkit'],
+      scripts=['applications/tef'],
+      packages=['tefingerprint'],
       classifiers=['Development Status :: 2 - Pre-Alpha']
       )
diff --git a/tectoolkit/__init__.py b/tectoolkit/__init__.py
diff --git a/tectoolkit/bam_io.py b/tectoolkit/bam_io.py