Does streamed output make sense? #7

Open
hoijui opened this issue Nov 8, 2022 · 0 comments
hoijui commented Nov 8, 2022

Basically, this is a CLI tool which:

  1. scans files,
  2. collects and filters information, and
  3. writes it to 3 output/log files (that could become a multiple of 3 later on).

The input is usually many files (let's say 100), which are scanned in sequence,
and maybe 3% of the input lines are selected and written to the 3 output files.
The question is which of these methods I should use:

  1. Bulk scan, write at the end:
    Scan and filter all input files, storing the selected data in a variable (in memory),
    and at the end write everything to the output files at once.
  2. Streaming:
    Scan the input, filter it continuously while reading, and write to the outputs as soon as something is selected.
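A minimal sketch of the two methods, assuming Python and a single output file for brevity (the filter predicate and file names here are hypothetical, not from the actual tool):

```python
def bulk_scan(input_paths, output_path, keep):
    """Method 1: collect all selected lines in memory, write once at the end."""
    selected = []
    for path in input_paths:
        with open(path) as f:
            for line in f:
                if keep(line):
                    selected.append(line)
    # Single write phase after all scanning is done.
    with open(output_path, "w") as out:
        out.writelines(selected)

def stream_scan(input_paths, output_path, keep):
    """Method 2: write each selected line as soon as it is found."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as f:
                for line in f:
                    if keep(line):
                        out.write(line)  # interleaves reads and writes
```

Both produce identical output; they differ only in peak memory use and in when the writes happen.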

I much prefer the second option: it uses less memory, and as a stream-based approach,
output can start appearing as soon as input scanning starts. The question is whether this potentially decreases overall performance, because we constantly switch between reading input and writing to one of the 3 output files.

I do not expect the higher memory usage of method 1 to ever be a problem,
and I am not sure how often the streaming approach of method 2 is really an advantage in practice.
I do know that file-system access is the main performance bottleneck of this software,
as is generally the case,
but also because the computation done here is very minimal.

Maybe I need not worry, and the OS and I/O buffering will handle the second (stream-based) method just fine?
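That is most likely the case: in Python, for example, `open()` returns a buffered writer by default, so small, frequent `write()` calls accumulate in an in-memory buffer (typically `io.DEFAULT_BUFFER_SIZE` bytes) and only occasionally trigger an actual system call, and the OS page cache buffers again below that. A small sketch (the file name is hypothetical):

```python
import io

# open() in text/write mode buffers writes by default; passing buffering
# explicitly just makes the buffer size visible here.
with open("out.log", "w", buffering=io.DEFAULT_BUFFER_SIZE) as out:
    for i in range(1000):
        # Each call lands in the in-memory buffer; the buffer is flushed
        # to the OS only when full, and fully on close.
        out.write(f"line {i}\n")
```

So with buffered writers on each of the 3 output files, the "switching" between reading and writing mostly happens in memory, not on disk.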

For now, the tool will be run at most 100 times a day, globally,
with ~1 MB of input text per run.
So it is not very critical either way,
but I have come across this question a few times already,
and would like to tackle it and be done with it.
