Add validation set to EvalAI #30
Comments
Thank you! That is a good suggestion; we will consider it and update here later.
Thanks! The issue is that we see a consistent gap between validation and test set results, even though models did not use the validation set for optimization. Multiple teams have resorted to reporting validation rather than test results in their papers, presumably because they do not trust test numbers they cannot reproduce or verify. It would be good to triage and rectify that, at least by making the validation results reproducible against the EvalAI measurement. MMMU is a great benchmark: it measures overall LLM/VLM performance. But these test/validation discrepancies (and the misunderstanding that it is not just the visual part that matters) cast it in a bad light. I would also suggest considering releasing the test set, perhaps under a separate NC license with token/password protection, to avoid accidental contamination. The benefits of the test set being usable, and of resolving this test/validation gap, may outweigh the benefits of keeping the test set in a more controlled environment.
Thank you for your feedback. The discrepancy between the validation and test sets arises from slight differences in their distributions: in the validation set, each subject has an equal number of samples, whereas in the test set the number of samples per subject varies. We are also considering releasing a portion of the test set while retaining a small part to prevent contamination or overfitting. We appreciate your valuable comments and encourage you to stay tuned for further updates!
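As a toy illustration of that point (all numbers below are made up, not actual MMMU statistics): the same per-subject accuracies can yield different overall scores when the number of samples per subject differs between splits.

```python
# Toy illustration (made-up numbers): identical per-subject accuracies produce
# different aggregate scores under different per-subject sample counts.

per_subject_accuracy = {"Art": 0.70, "Math": 0.40, "Medicine": 0.55}

# Validation-style split: equal samples per subject.
val_counts = {"Art": 30, "Math": 30, "Medicine": 30}
# Test-style split: sample counts vary by subject.
test_counts = {"Art": 120, "Math": 400, "Medicine": 80}

def overall_accuracy(acc, counts):
    """Sample-weighted (micro) accuracy across subjects."""
    total = sum(counts.values())
    return sum(acc[s] * counts[s] for s in counts) / total

print(f"equal-per-subject:  {overall_accuracy(per_subject_accuracy, val_counts):.3f}")   # 0.550
print(f"varied-per-subject: {overall_accuracy(per_subject_accuracy, test_counts):.3f}")  # 0.480
```

So even a model with stable per-subject behavior can see its headline number move between the two splits purely because of how subjects are weighted.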
Would it be possible to add MMMU validation to EvalAI?
It'd be great to be able to compare the numbers calculated on the validation set with the ones produced by EvalAI.
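For reference, here is a minimal sketch of the kind of local validation scoring one might want to cross-check against an EvalAI submission. The file names and JSON formats below are assumptions made for illustration, not the actual MMMU annotation or EvalAI submission formats.

```python
import json

# Minimal sketch of local validation scoring (hypothetical file formats):
# predictions.json:        {"question_id": "A", ...}  -- model's chosen option letter
# validation_answers.json: {"question_id": "B", ...}  -- ground-truth option letter

with open("predictions.json") as f:
    predictions = json.load(f)

with open("validation_answers.json") as f:
    answers = json.load(f)

# Count a prediction as correct only when the letters match (case-insensitive);
# missing predictions count as wrong.
correct = sum(
    1
    for qid, gold in answers.items()
    if predictions.get(qid, "").strip().upper() == gold.strip().upper()
)
accuracy = correct / len(answers)
print(f"Validation accuracy: {accuracy:.4f} ({correct}/{len(answers)})")
```

Having EvalAI score the same validation split would make it possible to confirm that a number computed this way matches the official measurement.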