optimizations for GoogleEventSet, speeding up merging 20+% #68

Merged: 1 commit into purarue:master, Sep 12, 2024


karlicoss
Contributor

All objects from takeout pass through this merging code in HPI, so it's worth speeding up.

  • add add_if_not_present method to avoid computing key twice (which is quite expensive!)

    This is intended to be used as a replacement for (e.g. in HPI)

    if event in emitted: 
        continue
    emitted.add(event)
    yield event
    

    With this method, we could rewrite as:

    if emitted.add_if_not_present(event):
        yield event
    

    This could be introduced to HPI in a backwards-compatible way.

  • use type directly as key, types are hashable (very tiny speedup, but it also feels more natural anyway)
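
For illustration, here is a minimal sketch of the idea, not the actual `GoogleEventSet` code. `DedupeSet`, `_key`, `key_calls` and `dedupe` are hypothetical names, and the key function is a cheap placeholder for the real (expensive) one. The call site falls back to the old two-step pattern when the method is missing, which is one way the change could stay backwards compatible.

```python
from typing import Any, Hashable, Iterable, Iterator, Set, Tuple


class DedupeSet:
    """Hypothetical stand-in for GoogleEventSet: dedupes events by a computed key."""

    def __init__(self) -> None:
        self._keys: Set[Tuple[type, Hashable]] = set()
        self.key_calls = 0  # counts key computations, for demonstration only

    def _key(self, event: Any) -> Tuple[type, Hashable]:
        # Assumption: the real key is far more expensive to compute than this.
        self.key_calls += 1
        # Types are hashable, so the type itself can be part of the key.
        return (type(event), event)

    def __contains__(self, event: Any) -> bool:
        return self._key(event) in self._keys

    def add(self, event: Any) -> None:
        self._keys.add(self._key(event))

    def add_if_not_present(self, event: Any) -> bool:
        """Add the event; return True if it was new, False if already seen."""
        key = self._key(event)  # computed once, reused for lookup and insert
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def dedupe(events: Iterable[Any], emitted: Any) -> Iterator[Any]:
    """Yield each event once, using add_if_not_present when it is available."""
    add_if_new = getattr(emitted, "add_if_not_present", None)
    for event in events:
        if add_if_new is not None:
            # New path: one key computation per event.
            if add_if_new(event):
                yield event
        else:
            # Old path: the key is computed for `in` and again for `add`.
            if event in emitted:
                continue
            emitted.add(event)
            yield event
```

With this sketch, `dedupe(events, DedupeSet())` computes one key per event, while an older set-like object without the method (a plain `set()` works as a stand-in) still goes through the two-step path.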

Owner

purarue commented Sep 12, 2024

Thanks ❤️

Will merge when I get home

@purarue purarue merged commit 5779c8d into purarue:master Sep 12, 2024
7 checks passed
Owner

purarue commented Sep 12, 2024

@karlicoss bumped the version, since changing how events are deduped might lead to some weird duplicates/errors if people were doing a custom merge of a cachew database with some other standalone export. (edit: actually, probably not...? since the cachew keys haven't changed, it's only the key that's computed in Python code. eh, good to push the perf improvements anyways)

feel free to ping me to bump pypi versions if you ever need em

@karlicoss karlicoss deleted the speedup-merging branch September 12, 2024 19:42
@karlicoss
Contributor Author

Thanks! I'm running off an editable checkout anyway so don't mind pypi as much. I might push a few more changes in the next few days, sorting out some old branches and todos I never got to contribute :)
