[FEATURE REQUEST]: disable generation of cleaned sequences #95

fgvieira · 2024-08-23T08:02:01Z

Is this a feature request for FCS-adaptor or FCS-GX?
FCS-adaptor

Describe the problem you'd like to be solved
FCS-adaptor runs quite fast, but takes a long time to generate the cleaned FASTA file and even longer to compress it!

Describe the solution you'd like
Would it be possible to have an option to disable creating the cleaned FASTA file (and only create the report and logs)?

From the source code, it seems that a simple if statement would do:

def launch(self):
[...]
    if self.clean_fasta:
        self.out_dir = Path(self.runtime_context["outdir"]) / "cleaned_sequences"
        self.recompress_files()
        self.rename_output_files()
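
To make the idea concrete, here is a minimal, self-contained sketch of how such an option could gate the cleaned-FASTA step. Every name in it (the `--skip-cleaned-fasta` flag, `build_parser`, `planned_outputs`) is hypothetical and made up for illustration; none of it is taken from the actual FCS-adaptor code.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a toy command-line parser (not the real FCS-adaptor CLI)."""
    parser = argparse.ArgumentParser(prog="adaptor-screen-sketch")
    # Hypothetical flag: when given, the cleaned FASTA is not produced.
    parser.add_argument(
        "--skip-cleaned-fasta",
        action="store_true",
        help="only produce the report and logs",
    )
    return parser


def planned_outputs(args: argparse.Namespace) -> List[str]:
    """Return the outputs a run would produce for the given options."""
    outputs = ["report", "logs"]
    # The expensive write/compress step only happens when not skipped.
    if not args.skip_cleaned_fasta:
        outputs.append("cleaned_sequences")
    return outputs
```

With this sketch, `planned_outputs(build_parser().parse_args(["--skip-cleaned-fasta"]))` leaves out `cleaned_sequences`, while the default run keeps the current behaviour.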

Is the source code available somewhere? If so, I could do a PR.


etvedte · 2024-08-23T19:39:20Z

Hello,

Thanks for the suggestion. We don't have the source code available.

We can implement this for the next release. We might elect to completely remove this output... fcs.py clean genome is preferable for cleaning anyway because it allows better user handling of internal contaminants.

fgvieira · 2024-08-25T17:07:16Z

I agree! It would also be more consistent with the behaviour of fcs-gx. Do you know when the next release will be?

etvedte · 2024-08-26T19:37:47Z

I did some more testing for this request. Here's what I found:

Running in --debug mode can lengthen the run time; in my sample tests it took about 2X longer. Are you turning that off?
Running a contaminated genome with ~10k contaminant-cleaning actions and then running the cleaned genome did not reduce the run time considerably, i.e. applying the contaminant-cleaning actions is unlikely to be what takes most of the time.
Running on uncompressed FASTA as input (which also outputs uncompressed FASTA in the cleaned_sequences folder) speeds up the run by about 2X. If it isn't too cumbersome, uncompressed input will improve your performance.

Currently the major focus of the dev team is working on another NCBI tool. I've made a work ticket to look at this, and we will discuss near-term action items including this ticket later in the week. With our other priorities it is possible there won't be a next release for another month or two.

Also, may I ask if you are using FCS-adaptor to screen new genomes or publicly available genomes on NCBI? I'm assuming it's the former, but I want to double-check. If the latter, we are making contamination reports available for genomes on NCBI so that users can access the data without needing to run FCS themselves.

fgvieira · 2024-08-27T06:49:53Z

I am not using --debug
No, I believe it is the I/O associated with writing the cleaned uncompressed FASTA file to disk and then compressing it (another read and write cycle, and not threaded)
I am running it on a compressed FASTA, but I believe the issue is with writing the uncompressed clean FASTA (see above)
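
To illustrate the I/O pattern I suspect, here is a rough sketch contrasting a two-pass approach (write plain FASTA, then read it back to compress it) with writing gzip output in a single pass. I can't see how FCS-adaptor actually writes its output, so both functions and their names are hypothetical; they only use Python's standard `gzip` and `shutil` modules.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def write_then_compress(records: Iterable[Record], fasta_path: Path) -> Path:
    """Two passes: write uncompressed FASTA, then re-read it to gzip it."""
    with open(fasta_path, "w", newline="\n") as out:
        for name, seq in records:
            out.write(f">{name}\n{seq}\n")
    gz_path = fasta_path.with_name(fasta_path.name + ".gz")
    # Second read/write cycle over the whole file.
    with open(fasta_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    fasta_path.unlink()  # remove the uncompressed intermediate
    return gz_path


def write_compressed_directly(records: Iterable[Record], gz_path: Path) -> Path:
    """One pass: stream records straight into a gzip file."""
    with gzip.open(gz_path, "wt", newline="\n") as out:
        for name, seq in records:
            out.write(f">{name}\n{seq}\n")
    return gz_path
```

Both functions produce the same decompressed content; the second simply avoids writing the uncompressed file to disk and reading it back. Skipping the cleaned FASTA altogether would of course avoid this I/O entirely.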

I'll be running it on both NCBI genomes and others. It is great that NCBI is releasing the reports, but I'll still have to run gx clean-genome, no? If so, then this issue still persists.
