30% faster reads by removing unnecessary `file.tell()` calls #579

emilykl · 2023-09-07T20:18:35Z

Rather than calling file.tell() for every line while searching for section headers, call it only for lines containing a section header. This results in a significant read performance boost of at least 30%.

Benchmark before	Benchmark after	Improvement
264ms	149ms	43%

(This particular benchmark shows an even greater improvement but I was usually seeing values closer to 30% during local testing.)

This change is less straightforward than it seems, because we need to know the file position of the beginning of the header line; but calling file.tell() after reading that line gives us the file position of the beginning of the following line.

This approach reverse-engineers the correct file position by observing that file.tell() seems to correspond to subtracting the length of the current line, plus an adjustment depending on the file's line ending type (CRLF vs. LF).

Since the performance improvement is so large, I think this change is worth pursuing. However, there are two caveats:

This approach does not work reliably on files with mixed line endings.
- As far as I can tell, Python provides no way of finding out the line ending of a specific line. file_obj.newlines only reveals which line endings have been encountered so far in the file.
- test_open_url_different_newlines() happens to pass anyway because the header line has the same line ending as the first line in the file. Header lines which have a different line ending from the first line in the file will fail.
And the more serious issue... io.TextIOWrapper.tell() officially returns "an opaque number" which is guaranteed only to work as input to io.TextIOWrapper.seek(). So although it seems to correspond to character count in practice, and works well enough to pass all the test cases for this project, that behavior is not guaranteed and could also change in future Python releases, making this a fragile approach.

Very open to comments and ideas! If there is a safer way to get the correct file position, without calling file.tell() for every line, I would love to implement it.

dcslagel · 2023-09-08T22:49:16Z

@emilykl @kinverarity1 ,

This seems like a really good idea. It might be worthwhile to revisit how mixed-line endings are currently handled. If they could be standardized on open() (or some other way) then the by the time the code gets to full_line_length(line, file_obj) the line-endings would be reliably one type and this pull-request would be a great addition.

For the issue around io.TextIOWrapper.tell() being an opaque number for io.TextIOWrapper.seek(), it feels like we could accept this risk and either: hope new versions of Python will always handle it in some proper way, or update Lasio code at such a time that Python changes the io.TextIOWrapper.tell()/seek() behavior. Does that seem like the right approach to you?

kinverarity1 · 2023-09-11T12:45:17Z

@emilykl Thank you! This is wonderful. I'm happy to merge if we can work out a way to provide for a sensible exception and a backup way to read files with mixed line endings (surely they are uncommon) - such as a keyword argument.

@dcslagel I'm on board with all your ideas. What do you think, Should we all get this PR to a merge state and release it first while considering other improvements for later work?

30% is worth risking changes with python in the future in my opinion.

emilykl · 2023-09-12T16:33:09Z

Thanks for the feedback! Very appreciated.

If they could be standardized on open() (or some other way) then the by the time the code gets to full_line_length(line, file_obj) the line-endings would be reliably one type and this pull-request would be a great addition.

What we could potentially do is disable newline standardization by passing the argument newline='' to open(). That would cause the strings returned by readline() to include the exact line endings for each line as written on disk, rather than standardizing, so we wouldn't have to guess.

The only issue I see there is that the LASfile() constructor accepts either a file path or an already-opened file object, so if the user has passed an already-opened file we won't know whether the newlines are being standardized or not, so it may fail completely in that case.

In any case, I think it probably makes sense to make this "fast" method a configurable option -- either opt-in or opt-out. We can document the caveats, and perhaps even add a hard restriction that if the "fast" method is enabled, file_ref MUST be a path, not an open file object.

kinverarity1 · 2023-09-12T21:36:39Z

Sounds good!

In any case, I think it probably makes sense to make this "fast" method a configurable option -- either opt-in or opt-out. We can document the caveats, and perhaps even add a hard restriction that if the "fast" method is enabled, file_ref MUST be a path, not an open file object.

I'd prefer the "fast" option to be on by default.

Also perhaps it makes sense to remove the support for file_ref to be an open file object at the same time. If I understand correctly, this method should still work then in the two scenarios of file_ref being a path or a string containing the LAS file contents, correct?

dcslagel · 2023-09-13T19:55:44Z

@kinverarity1, @emilykl

What do you think, Should we all get this PR to a merge state and release it first while considering other improvements for later work?
30% is worth risking changes with python in the future in my opinion.

We should probably make the failing case for

test_open_url_different_newlines() happens to pass anyway because the header line has the same line ending as the first line in the file. Header lines which have a different line ending from the first line in the file will fail.

So a file that has a header line with a different line-ending from the first line in the file, plus a test case for it. So that it is a visible issue.

Currently, I don't think we should remove the support for file_ref to be an open file object. It may be a popular feature. If we do want to consider removing support for file_ref, it is probably worth asking the user base.
It looks like the new function full_line_length(line, file_obj):. could be expanded to handle the main variations of line endings. Is that reasonable or is it difficult?

emilykl added 2 commits September 6, 2023 16:11

improve performance by calling file.tell() only when needed

404dc2c

handle files with no newline at EOF(?)

41dc973

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

30% faster reads by removing unnecessary `file.tell()` calls #579

30% faster reads by removing unnecessary `file.tell()` calls #579

emilykl commented Sep 7, 2023 •

edited

Loading

dcslagel commented Sep 8, 2023 •

edited

Loading

kinverarity1 commented Sep 11, 2023

emilykl commented Sep 12, 2023 •

edited

Loading

kinverarity1 commented Sep 12, 2023

dcslagel commented Sep 13, 2023 •

edited

Loading

30% faster reads by removing unnecessary file.tell() calls #579

Are you sure you want to change the base?

30% faster reads by removing unnecessary file.tell() calls #579

Conversation

emilykl commented Sep 7, 2023 • edited Loading

dcslagel commented Sep 8, 2023 • edited Loading

kinverarity1 commented Sep 11, 2023

emilykl commented Sep 12, 2023 • edited Loading

kinverarity1 commented Sep 12, 2023

dcslagel commented Sep 13, 2023 • edited Loading

30% faster reads by removing unnecessary `file.tell()` calls #579

30% faster reads by removing unnecessary `file.tell()` calls #579

emilykl commented Sep 7, 2023 •

edited

Loading

dcslagel commented Sep 8, 2023 •

edited

Loading

emilykl commented Sep 12, 2023 •

edited

Loading

dcslagel commented Sep 13, 2023 •

edited

Loading