Basically, this is a CLI tool which reads and filters text input, and writes the result to 3 output/log files (that could become a multiple of 3 later on).
The input is usually many files (let's say 100), which are scanned in sequence,
and maybe 3% of the input lines are selected by the filter and written to the 3 output files.
The question is which of these two methods I should use:
1. Bulk scan, write at the end: scan and filter all input files, storing the selected data in memory, and write everything out to the output files at once when done.
2. Streaming: scan the input, filter it while reading, and write each selected line out to the output continuously, as soon as it is selected (see the sketch below the list).
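For concreteness, here is a minimal sketch of the second (streaming) method. The tool's language is not stated here, so Python is assumed, and `is_selected()` / `pick_output()` are placeholder names, not part of the real tool:

```python
# Minimal sketch of method 2 (streaming), assuming Python; is_selected() and
# pick_output() are placeholders standing in for the real filter/routing logic.
from contextlib import ExitStack

def is_selected(line: str) -> bool:
    # Placeholder filter: in the real tool, ~3% of lines would pass this test.
    return "ERROR" in line

def pick_output(line: str) -> int:
    # Placeholder routing: decide which of the 3 output files a line belongs to.
    return len(line) % 3

def filter_streaming(input_paths, output_paths):
    with ExitStack() as stack:
        # Open the 3 output files once, up front. Each write only appends to the
        # writer's in-memory buffer, so "switching" between output files does not
        # force a disk access per selected line.
        outputs = [stack.enter_context(open(p, "w", encoding="utf-8"))
                   for p in output_paths]
        for in_path in input_paths:
            with open(in_path, "r", encoding="utf-8") as infile:
                for line in infile:
                    if is_selected(line):
                        outputs[pick_output(line)].write(line)
```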
I like the second option much more, as it uses less memory and is a stream-based approach,
so output can start appearing as soon as the input scan begins. The question is whether this potentially decreases overall performance, because the tool constantly switches between reading input and writing to one of the 3 output files.
I do not expect the higher memory usage of method 1 to ever be a problem,
and I am not sure how often the stream-based approach of method 2 is really an advantage in practice.
I do know that file-system access is the main performance cost in this software,
partly because that is generally the case,
but also because the computation done here is minimal.
Maybe I need not worry, and the OS's buffering will handle the second (stream-based) method just fine?
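If it helps, the buffering can also be made explicit rather than left to defaults. A small sketch, again assuming Python, where the `buffering` argument of `open()` sets the size of the underlying binary buffer:

```python
# Sketch: give an output file a ~1 MiB write buffer (assuming Python).
# Individual .write() calls then only append to memory; the data reaches the
# OS in large chunks, or when the file is flushed/closed.
with open("selected.log", "w", encoding="utf-8", buffering=1024 * 1024) as out:
    out.write("2024-05-01 ERROR example line\n")  # buffered, not yet on disk
    out.flush()  # hand the buffered data to the OS explicitly, if needed
```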
For now, the tool will be run at most 100 times a day, globally,
with ~1MB of input text per run.
So it is not very critical either way,
but I have come across this issue a few times already,
and would like to tackle it and be done with it.