Improve Data File Duplicate Detection #27

stanbrub · 2023-02-06T17:42:37Z

Before new files are generated to the DH data directory for use in Benchmark tests, existing files are checked to ensure that duplicate data is not produced. This is a big time-saver at larger scales (like 100,000,000), assuming there are a significant number of existing generator files in the data directory.

- Change the file name convention so that the gen.def and gen.parquet files contain a hash of the generator definition
- Change the file list glob to use the hash
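
A minimal sketch of the two changes above, in Python since the reuse detection code is Python. Everything here is an assumption for illustration: the helper names (`definition_hash`, `new_file_stem`, `find_existing`), the choice of SHA-256 truncated to 16 hex characters, and the `<hash>.<unique id>.gen.*` file name layout are hypothetical and are not taken from the Benchmark code base.

```python
import hashlib
from pathlib import Path
from typing import List


def definition_hash(gen_def: str) -> str:
    """Return a short, stable hex digest of a generator definition (hypothetical helper)."""
    return hashlib.sha256(gen_def.encode("utf-8")).hexdigest()[:16]


def new_file_stem(gen_def: str, unique_id: str) -> str:
    """Build a file stem that embeds the hash, e.g. '<hash>.<unique_id>.gen' (assumed layout)."""
    return f"{definition_hash(gen_def)}.{unique_id}.gen"


def find_existing(data_dir: Path, gen_def: str) -> List[Path]:
    """Glob only for parquet files whose name starts with this definition's hash."""
    digest = definition_hash(gen_def)
    return sorted(data_dir.glob(f"{digest}.*.gen.parquet"))
```

With a layout like this, the lookup matches on file names alone, so it would not need to open and compare every existing gen.def file. Because a truncated hash can collide in principle, a caller might still compare the matching gen.def contents before reusing a file.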

This story is not an immediate concern because of the expected high reuse of data generator files.

stanbrub · 2024-07-12T19:00:12Z

Updated Ids.uniqueName to allow a prefix to be provided. Used that to name files with a hash of the contents, and to search on the hash with a glob in the Python "data file reuse detection" code.

stanbrub added the enhancement New feature or request label Feb 6, 2023

stanbrub changed the title ~~Improved Data File Duplicate Detection~~ Improve Data File Duplicate Detection Sep 22, 2023

stanbrub linked a pull request Jul 12, 2024 that will close this issue

Design Changes for Dir Struct, Tagged Iterations, Metrics #317

Merged

stanbrub self-assigned this Jul 12, 2024

stanbrub closed this as completed in #317 Jul 12, 2024
