[FEATURE REQUEST]: disable generation of cleaned sequences #95

Open
fgvieira opened this issue Aug 23, 2024 · 4 comments

Comments

@fgvieira

fgvieira commented Aug 23, 2024

Is this a feature request for FCS-adaptor or FCS-GX?
FCS-adaptor

Describe the problem you'd like to be solved
FCS-adaptor runs quite fast, but takes a long time to generate the cleaned FASTA file and even longer to compress it!

Describe the solution you'd like
Would it be possible to have an option to disable creating the cleaned FASTA file (and only create the report and logs)?

From the source code, it seems that a simple if statement would do:

def launch(self):
[...]
    if self.clean_fasta:
        self.out_dir = Path(self.runtime_context["outdir"]) / "cleaned_sequences"
        self.recompress_files()
        self.rename_output_files()
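
To make the idea concrete, here is a minimal, self-contained sketch of such a guard. Since the FCS-adaptor source has not been seen here, everything in it is hypothetical: the `AdaptorRun` class, the `clean_fasta` attribute, the `--no-clean-fasta` flag name and the stubbed method bodies are assumptions for illustration only, not the real implementation.

```python
import argparse
from pathlib import Path


class AdaptorRun:
    """Hypothetical stand-in for the real launcher class."""

    def __init__(self, outdir, clean_fasta=True):
        self.runtime_context = {"outdir": outdir}
        self.clean_fasta = clean_fasta  # hypothetical opt-out switch
        self.out_dir = None
        self.steps_run = []  # records which steps executed, for illustration

    def recompress_files(self):
        # Stub: stands in for compressing the cleaned FASTA.
        self.steps_run.append("recompress_files")

    def rename_output_files(self):
        # Stub: stands in for renaming the output files.
        self.steps_run.append("rename_output_files")

    def launch(self):
        # Report and log generation would happen before this point (elided).
        if self.clean_fasta:
            self.out_dir = Path(self.runtime_context["outdir"]) / "cleaned_sequences"
            self.recompress_files()
            self.rename_output_files()
        return self.steps_run


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--outdir", default="out")
    # Hypothetical flag: sets args.clean_fasta to False when given (default True).
    parser.add_argument("--no-clean-fasta", dest="clean_fasta", action="store_false")
    return parser.parse_args(argv)
```

With this layout the default behaviour is unchanged, and passing the (hypothetical) `--no-clean-fasta` flag skips only the cleaned-sequence steps.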

Is the source code available somewhere? If so, I could do a PR.

@etvedte
Contributor

etvedte commented Aug 23, 2024

Hello,

Thanks for the suggestion. We don't have the source code available.

We can implement this for the next release. We might elect to completely remove this output... fcs.py clean genome is preferable for cleaning anyway because it allows better user handling of internal contaminants.

@fgvieira
Author

I agree! It would also be more consistent with the behaviour of fcs-gx. Do you know when the next release will be?

@etvedte
Contributor

etvedte commented Aug 26, 2024

I did some more testing for this request. Here's what I found:

  1. Running in --debug mode can lengthen the run time; in my sample tests it was about 2X longer. Are you turning that off?
  2. Running a contaminated genome with ~10k contaminant cleaning actions and then running the cleaned genome did not reduce the run time considerably, i.e. applying the contaminant-cleaning actions is unlikely to be what takes most of the time.
  3. Running on uncompressed FASTA as input (which also outputs uncompressed FASTA in the cleaned_sequences folder) speeds up the run by about 2X. If it isn't too cumbersome, uncompressed input will improve your performance.
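
For point 3, one way to prepare uncompressed input is to decompress the FASTA once before screening. Below is a minimal Python sketch using only the standard library; the function name and the file names are placeholders and not part of FCS.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path, out_path=None):
    """Stream-decompress a gzipped FASTA to a plain file and return its path."""
    gz_path = Path(gz_path)
    # Default output name: drop the trailing ".gz" (genome.fa.gz -> genome.fa).
    # Assumes the input name actually ends in ".gz".
    out_path = Path(out_path) if out_path else gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, so memory use stays low
    return out_path
```

For example, `decompress_fasta("genome.fa.gz")` would write `genome.fa` next to the input, which could then be passed to the screen instead of the compressed file.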

Currently the major focus of the dev team is working on another NCBI tool. I've made a work ticket to look at this, and we will discuss near-term action items including this ticket later in the week. With our other priorities it is possible there won't be a next release for another month or two.

Also, may I ask if you are using FCS-adaptor to screen new genomes or publicly-available genomes on NCBI? I'm assuming it's the former, but I want to double-check. If the latter, we are making available contamination reports for genomes on NCBI such that users can access the data without needing to run FCS themselves.

@fgvieira
Author

  1. I am not using --debug
  2. No, I believe it is the I/O associated with writing the cleaned uncompressed FASTA file to disk and then compressing it (another read and write cycle, and not threaded)
  3. I am running it on a compressed FASTA, but I believe the issue is with writing the uncompressed clean FASTA (see above)
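
The two I/O patterns being contrasted in point 2 can be sketched as follows. This only illustrates the suspected pattern (write plain FASTA, then re-read and compress it) against a single-pass alternative; whether FCS-adaptor actually works this way is an assumption, and the function names are made up for the example.

```python
import gzip
import shutil
from pathlib import Path


def write_then_compress(records, fasta_path):
    """Two passes: write plain FASTA, then read it back and gzip it."""
    fasta_path = Path(fasta_path)
    with open(fasta_path, "w", newline="\n") as out:
        for name, seq in records:
            out.write(f">{name}\n{seq}\n")
    gz_path = fasta_path.with_name(fasta_path.name + ".gz")
    with open(fasta_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # second full read + write of the same data
    fasta_path.unlink()  # remove the intermediate uncompressed file
    return gz_path


def write_compressed_directly(records, gz_path):
    """One pass: stream records straight into the gzip file."""
    gz_path = Path(gz_path)
    with gzip.open(gz_path, "wt", newline="\n") as out:
        for name, seq in records:
            out.write(f">{name}\n{seq}\n")
    return gz_path
```

Both functions produce the same decompressed content; the first simply touches the disk twice for the full genome, which is the extra cost suspected here.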

I'll be running it on both NCBI genomes and others. It is great that NCBI is releasing the reports, but I'll still have to run gx clean-genome, no? If so, then this issue still persists.
