Add File Size to Parquet Metrics #310

Closed
stanbrub opened this issue Jun 12, 2024 · 1 comment · Fixed by #317

@stanbrub
Collaborator

Currently, we collect read/write rates for the Parquet benchmarks. For the multi-column tests, which are meant to allow comparison between codecs (e.g. snappy, gzip), Shivam would like to see the resulting Parquet file size as well. (Stan would like a more generic way to pull extra metrics into adhoc runs, so this is a good fit.)

This has been done manually before, but it makes sense to automate it, since there are other metrics of interest (installation size, memory usage, etc.) that are not currently shown in a meaningful or obvious way.

  • Add file size metric collection to the FileTestRunner (this should be relatively straightforward)
  • Improve the adhoc snippet to allow selection of a metric by property name
    • Some metrics may be unique to certain benchmarks
    • Ensure proper null handling if the specified metrics are not present
  • Show the file size metric beside the rate column in the result table for the run sets
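
As a rough illustration of the first item, file size collection could look something like the sketch below. It is not the FileTestRunner code; the function name `path_size_bytes` and the metric key used in the usage lines are hypothetical. It sums every file under a directory because a Parquet write may produce either a single file or a directory of files.

```python
import os
import tempfile


def path_size_bytes(path):
    """Return the size in bytes of a file, or of all files under a directory.

    Hypothetical helper for illustration; a path that does not exist yields 0.
    """
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    # os.walk visits the directory tree top-down; sum every regular file found
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Usage: write a stand-in file and record its size as a metric
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.parquet")
    with open(target, "wb") as f:
        f.write(b"\0" * 1024)
    # "file.size.bytes" is a made-up key, not the benchmark's metric name
    metrics = {"file.size.bytes": path_size_bytes(target)}
    print(metrics)
```

Comparing this value across runs that differ only in codec would give the size comparison described above, alongside the existing rates.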
@stanbrub stanbrub added the enhancement New feature or request label Jun 12, 2024
@stanbrub stanbrub self-assigned this Jun 12, 2024
@stanbrub stanbrub linked a pull request Jul 12, 2024 that will close this issue
@stanbrub
Collaborator Author

stanbrub commented Jul 12, 2024

A metric called 'data.file.size' has been added for the Parquet tests. Queries have been updated to tolerate metrics that are null for some benchmarks and provided for others. The adhoc snippet also now lets the user specify metric names, which are pulled in as columns beside other columns like op_rate.
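
To make the null handling concrete, here is a minimal plain-Python sketch of pulling user-selected metrics in as columns. It is not the adhoc snippet itself; the function `add_metric_columns` and the field names `benchmark_name`, `name`, and `value` are assumptions chosen for illustration, and the sample rows are made up. Only `op_rate` and 'data.file.size' come from the text above.

```python
def add_metric_columns(results, metrics, metric_names):
    """Return copies of the result rows with one extra column per metric name.

    A benchmark that did not report a requested metric gets None in that column.
    """
    # Index metric values by (benchmark, metric name) for direct lookup
    lookup = {(m["benchmark_name"], m["name"]): m["value"] for m in metrics}
    rows = []
    for result in results:
        row = dict(result)  # copy so the input rows are left untouched
        for name in metric_names:
            row[name] = lookup.get((result["benchmark_name"], name))
        rows.append(row)
    return rows


# Made-up sample data: only the Parquet benchmark reports a file size
results = [
    {"benchmark_name": "ParquetWrite", "op_rate": 1000},
    {"benchmark_name": "Sort", "op_rate": 2000},
]
metrics = [
    {"benchmark_name": "ParquetWrite", "name": "data.file.size", "value": 4096},
]

for row in add_metric_columns(results, metrics, ["data.file.size"]):
    print(row)
```

Using a lookup with a `None` default keeps every benchmark row in the output, so metrics unique to certain benchmarks do not drop the others from the table.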
