Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

30% faster reads by removing unnecessary file.tell() calls #579

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

emilykl
Copy link

@emilykl emilykl commented Sep 7, 2023

Rather than calling file.tell() for every line while searching for section headers, call it only for lines containing a section header. This results in a significant read performance boost of at least 30%.

Benchmark before Benchmark after Improvement
264ms 149ms 43%

(This particular benchmark shows an even greater improvement but I was usually seeing values closer to 30% during local testing.)

This change is less straightforward than it seems, because we need to know the file position of the beginning of the header line; but calling file.tell() after reading that line gives us the file position of the beginning of the following line.

This approach reverse-engineers the correct file position by observing that file.tell() seems to correspond to subtracting the length of the current line, plus an adjustment depending on the file's line ending type (CRLF vs. LF).

Since the performance improvement is so large, I think this change is worth pursuing. However, there are two caveats:

  • This approach does not work reliably on files with mixed line endings.

    • As far as I can tell, Python provides no way of finding out the line ending of a specific line. file_obj.newlines only reveals which line endings have been encountered so far in the file.
    • test_open_url_different_newlines() happens to pass anyway because the header line has the same line ending as the first line in the file. Header lines which have a different line ending from the first line in the file will fail.
  • And the more serious issue... io.TextIOWrapper.tell() officially returns "an opaque number" which is guaranteed only to work as input to io.TextIOWrapper.seek(). So although it seems to correspond to character count in practice, and works well enough to pass all the test cases for this project, that behavior is not guaranteed and could also change in future Python releases, making this a fragile approach.

Very open to comments and ideas! If there is a safer way to get the correct file position, without calling file.tell() for every line, I would love to implement it.

@dcslagel
Copy link
Collaborator

dcslagel commented Sep 8, 2023

@emilykl @kinverarity1 ,

This seems like a really good idea. It might be worthwhile to revisit how mixed-line endings are currently handled. If they could be standardized on open() (or some other way) then the by the time the code gets to full_line_length(line, file_obj) the line-endings would be reliably one type and this pull-request would be a great addition.

For the issue around io.TextIOWrapper.tell() being an opaque number for io.TextIOWrapper.seek(), it feels like we could accept this risk and either: hope new versions of Python will always handle it in some proper way, or update Lasio code at such a time that Python changes the io.TextIOWrapper.tell()/seek() behavior. Does that seem like the right approach to you?

@kinverarity1
Copy link
Owner

@emilykl Thank you! This is wonderful. I'm happy to merge if we can work out a way to provide for a sensible exception and a backup way to read files with mixed line endings (surely they are uncommon) - such as a keyword argument.

@dcslagel I'm on board with all your ideas. What do you think, Should we all get this PR to a merge state and release it first while considering other improvements for later work?

30% is worth risking changes with python in the future in my opinion.

@emilykl
Copy link
Author

emilykl commented Sep 12, 2023

Thanks for the feedback! Very appreciated.

If they could be standardized on open() (or some other way) then the by the time the code gets to full_line_length(line, file_obj) the line-endings would be reliably one type and this pull-request would be a great addition.

What we could potentially do is disable newline standardization by passing the argument newline='' to open(). That would cause the strings returned by readline() to include the exact line endings for each line as written on disk, rather than standardizing, so we wouldn't have to guess.

The only issue I see there is that the LASfile() constructor accepts either a file path or an already-opened file object, so if the user has passed an already-opened file we won't know whether the newlines are being standardized or not, so it may fail completely in that case.

In any case, I think it probably makes sense to make this "fast" method a configurable option -- either opt-in or opt-out. We can document the caveats, and perhaps even add a hard restriction that if the "fast" method is enabled, file_ref MUST be a path, not an open file object.

@kinverarity1
Copy link
Owner

Sounds good!

In any case, I think it probably makes sense to make this "fast" method a configurable option -- either opt-in or opt-out. We can document the caveats, and perhaps even add a hard restriction that if the "fast" method is enabled, file_ref MUST be a path, not an open file object.

I'd prefer the "fast" option to be on by default.

Also perhaps it makes sense to remove the support for file_ref to be an open file object at the same time. If I understand correctly, this method should still work then in the two scenarios of file_ref being a path or a string containing the LAS file contents, correct?

@dcslagel
Copy link
Collaborator

dcslagel commented Sep 13, 2023

@kinverarity1, @emilykl

What do you think, Should we all get this PR to a merge state and release it first while considering other improvements for later work?
30% is worth risking changes with python in the future in my opinion.


  1. We should probably make the failing case for

test_open_url_different_newlines() happens to pass anyway because the header line has the same line ending as the first line in the file. Header lines which have a different line ending from the first line in the file will fail.

So a file that has a header line with a different line-ending from the first line in the file, plus a test case for it. So that it is a visible issue.

  1. Currently, I don't think we should remove the support for file_ref to be an open file object. It may be a popular feature. If we do want to consider removing support for file_ref, it is probably worth asking the user base.

  2. It looks like the new function full_line_length(line, file_obj):. could be expanded to handle the main variations of line endings. Is that reasonable or is it difficult?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants