issue with gene file #48

Open
jcerca opened this issue Nov 2, 2020 · 3 comments
Comments

@jcerca

jcerca commented Nov 2, 2020

Dear Evan,

I am really excited about getting Tephra running; it seems to be a beautiful piece of software. I ran into some issues I'd like to solve, though. I am posting them here so everyone can see, but please let me know if this doesn't belong in the issues.

I got the docker version running by installing docker:

$ docker run -it --name tephra-con -v $(pwd)/db:/db:Z sestaton/tephra
$ cd /db
$ wget https://raw.githubusercontent.com/sestaton/tephra/master/config/tephra_config.yml
#### changed the "logfile", "genome", "outfile", "repeatdb" (using your sunflower library, thank you for that!).
$ tephra all -c tephra_config.yml

[ERROR]: gene file was not defined in configuration or does not exist. Check input. Exiting.

I noticed that the new config file has this line. It is possibly new, since it is not in the manual or help pages.

  • genefile: TAIR10_genes.fas

I deleted it:

$ sed "s/.*genefile.*//; /^$/d" tephra_config.yml > tephra_config2.yml
$ tephra all -c tephra_config2.yml

[ERROR]: 'trnadb' under 'all' is not defined after parsing configuration file.
         This indicates there may be a blank line in your configuration file.
         Please check your configuration file and try again. Exiting.

Q1: I take this to mean that it did not like my reformatting of the config file. I was therefore wondering what this "TAIR10_genes.fas" is. Is it the gene annotation of Arabidopsis? I checked NCBI, and TAIR10 seems to be an assembly name for this species (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001735.4).
Q2: Is there a way to run the "all" command without specifying the annotations?
See config file below.

$ cat t*yml
## For more information about this file, see:
## https://github.com/sestaton/tephra/wiki/Specifications-and-example-usage.
all:
  - logfile:          tephra.log
  - genome:           scalesia_atractyloides.fasta
  - outfile:          scalesia_atractyloides_thra_transposons.gff3
  - repeatdb:         Ha412v1r1_transposons_v1.0.fasta
  - genefile:         TAIR10_genes.fas
  - trnadb:           TephraDB
  - hmmdb:            TephraDB
  - threads:          24
  - clean:            YES
  - debug:            NO
  - subs_rate:        1e-8
findltrs:
  - dedup:             NO
  - tnpfilter:         NO
  - domains_required:  NO
  - ltrharvest:
     - mintsd:         4
     - maxtsd:         20
     - minlenltr:      100
     - maxlenltr:      1000
     - mindistltr:     1000
     - maxdistltr:     15000
     - seedlength:     30
     - tsdradius:      60
     - xdrop:          5
     - swmat:          2
     - swmis:          -2
     - swins:          -3
     - swdel:          -3
     - overlaps:       best
  - ltrdigest:
     - pptradius:      30
     - pptlen:         8 30
     - pptagpr:        0.25
     - uboxlen:        3 30
     - uboxutpr:       0.91
     - pbsradius:      30
     - pbslen:         11 30
     - pbsoffset:      0 5
     - pbstrnaoffset:  0 5
     - pbsmaxeditdist: 1
     - pdomevalue:     1E-6
     - pdomcutoff:     NONE
     - maxgaplen:      50
classifyltrs:
  - percentcov:       50
  - percentid:        80
  - hitlen:           80
illrecomb:
  - repeat_pid:       10
ltrage:
  - all:              NO
maskref:
  - percentid:        80
  - hitlength:        70
  - splitsize:        5000000
  - overlap:          100
sololtr:
  - percentid:        39
  - percentcov:       80
  - matchlen:         80
  - numfamilies:      20
  - allfamilies:      NO
tirage:
  - all:              NO
@sestaton
Owner

Hi Jose,

Thank you for the comments and I'm sorry for the slow response. I have been busy with a new job and I was doing a long-distance move last week. Now, to the issue: you are correct that the gene file entry in the config file is new. This was added to remove spurious transposon predictions (mainly TIR elements) that are actually tandem gene duplicates.

There is no way to remove the entry at this time and still run the tephra all command. I don't think I want to add this option because, based on my research, it would just mean including spurious predictions.
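
Since the entry has to stay, the first error above can be caught before a long run with a small pre-flight check. This is only a sketch: `check_genefile` is a hypothetical helper, not part of Tephra, and it assumes the `genefile` path is resolved relative to the directory where the command is started, which may not match how Tephra itself parses the file.

```shell
# Hypothetical helper (not part of Tephra): confirm that the config has a
# 'genefile' entry and that the file it names exists and is not empty.
check_genefile() {
    config=$1
    # Take the text after "genefile:" and trim trailing whitespace.
    genes=$(sed -n 's/^[[:space:]]*-[[:space:]]*genefile:[[:space:]]*//p' "$config" |
        sed 's/[[:space:]]*$//' | head -n 1)
    if [ -z "$genes" ]; then
        echo "genefile is not set in $config" >&2
        return 1
    fi
    # Assumption: the path is relative to the current directory.
    if [ ! -s "$genes" ]; then
        echo "genefile '$genes' is missing or empty" >&2
        return 1
    fi
    echo "genefile OK: $genes"
}
```

It could be chained with the run, e.g. `check_genefile tephra_config.yml && tephra all -c tephra_config.yml`.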

The Arabidopsis gene file was an example for use with Arabidopsis specifically. My advice would be to use a set of gene predictions from your species, or a closely related species. This needs to be documented and the rationale needs to be explained because right now I think there is no mention at all. Sorry! Thank you for mentioning the issue here.
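
As a rough sketch of that change, the `genefile` value can be rewritten in place instead of deleting the line, which leaves every other entry untouched. `set_genefile` is a hypothetical helper and `my_species_genes.fasta` in the usage line is a placeholder name, not a file shipped with Tephra; the new value is assumed to contain no `|`, `&`, or backslash characters, since those are special to `sed` here.

```shell
# Hypothetical helper (not part of Tephra): point the 'genefile' entry at a
# different FASTA file while keeping the line and the rest of the config.
set_genefile() {
    config=$1
    genes=$2
    tmp=$(mktemp) || return 1
    # Keep the leading "  - genefile:   " part and replace only the value.
    sed "s|^\([[:space:]]*-[[:space:]]*genefile:[[:space:]]*\).*|\1${genes}|" \
        "$config" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$config"
    rm -f "$tmp"
}
```

Usage would look like `set_genefile tephra_config.yml my_species_genes.fasta`, followed by `tephra all -c tephra_config.yml`.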

Please let me know if that helps and if you need advice on the files.

Thanks,
Evan

@sestaton
Owner

The wiki page that is referenced in the configuration file has at least been updated. A thorough demonstration of the usage is still needed, but this is a small step.

@jcerca
Author

jcerca commented Nov 24, 2020

Hi Evan,

Thank you for your answer and for your time. I'll try to do this as soon as I have some time, possibly next week.

José
