Improve Data File Duplicate Detection #27

Closed
stanbrub opened this issue Feb 6, 2023 · 1 comment · Fixed by #317

Comments

stanbrub (Collaborator) commented Feb 6, 2023

Before new files are generated in the DH data directory for use in Benchmark tests, existing files are checked to ensure duplicate data is not produced. This is a big time-saver at larger scales (like 100,000,000), assuming there are a significant number of existing generator files in the data directory.

  • Change the file name convention so that the gen.def and gen.parquet files contain a hash of the generator definition
  • Change the file list glob to use the hash
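
A minimal sketch of the two changes above, assuming SHA-256 as the hash and a file name layout of `<table>.<hash>.gen.parquet`. The function names, the hash length, and the exact naming pattern are hypothetical and are not taken from the Benchmark code.

```python
import hashlib
from pathlib import Path


def definition_hash(definition_text: str, length: int = 16) -> str:
    """Return a short, stable hex digest of a generator definition."""
    # Normalize line endings and outer whitespace so trivial
    # formatting differences do not change the hash.
    normalized = definition_text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


def new_file_stem(table_name: str, definition_text: str) -> str:
    """Build a hypothetical file stem of the form '<table>.<hash>'."""
    return f"{table_name}.{definition_hash(definition_text)}"


def find_existing(data_dir: Path, definition_text: str) -> list[Path]:
    """List parquet files whose name embeds the hash of this definition."""
    digest = definition_hash(definition_text)
    # The glob narrows the search to files carrying the same hash,
    # so no definition file has to be opened and compared.
    return sorted(data_dir.glob(f"*{digest}*.gen.parquet"))
```

With this layout, a generator would be skipped whenever `find_existing` returns a non-empty list for its definition.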

This story is not an immediate concern because of the expected high reuse of data generator files.

@stanbrub stanbrub added the enhancement New feature or request label Feb 6, 2023
@stanbrub stanbrub changed the title Improved Data File Duplicate Detection Improve Data File Duplicate Detection Sep 22, 2023
@stanbrub stanbrub linked a pull request Jul 12, 2024 that will close this issue
@stanbrub stanbrub self-assigned this Jul 12, 2024
stanbrub (Collaborator, Author) commented

Updated Ids.uniqueName to allow a prefix to be provided. Used that to name files with a hash of the contents, and to search on the hash with a glob in the Python "data file reuse detection" code.
