
Dataset-specific metrics #21

Open
evanmiltenburg opened this issue Feb 1, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@evanmiltenburg
Contributor

I have a couple of metrics in mind that are dataset-specific. For example:

  • Global recall: how much of the vocabulary from the intersection of the training and test sets is actually being produced? How is this recall influenced by training-set frequency? Does the model only produce words that occur frequently in the training data, or also less frequent terms?
  • For some of the special test sets, we want to run some evaluations that don't make sense for other datasets. (Can't give any details right now.)

How should I go about this?

For the global recall metric, for example, I could preprocess the training data and, if the references have an identifier, use that to load the relevant data. Is that the best solution?
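
To make the first point concrete, this is roughly what I have in mind for global recall (a rough sketch only; the function signature and the frequency threshold are illustrations, not a proposed interface):

from collections import Counter

def global_recall(train_tokens, test_tokens, output_tokens, min_train_freq=1):
    """Share of the train/test vocabulary overlap that the system outputs contain.

    All three arguments are flat lists of tokens. Varying min_train_freq makes it
    possible to compare recall for frequent vs. less frequent training vocabulary.
    """
    train_freq = Counter(train_tokens)
    # Vocabulary shared by training and test data, filtered by training frequency.
    target_vocab = {w for w in set(test_tokens) & set(train_tokens)
                    if train_freq[w] >= min_train_freq}
    if not target_vocab:
        return 0.0
    return len(target_vocab & set(output_tokens)) / len(target_vocab)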

@tuetschek
Collaborator

The newly added Questeval now uses task-specific models (#40), where the task is specified in the system outputs file (to be set by default using a global config file, see #43). Potentially the same approach could be used here?
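
Roughly, the lookup would be something like this (purely illustrative; the field and model names here are made up, the actual format is discussed in #40 and #43):

# Purely illustrative: pick a task-specific model from a "task" field in the
# system outputs, falling back to defaults taken from a global config.
TASK_MODELS = {"summarization": "model-for-summarization", "data_to_text": "model-for-data2text"}

def select_model(outputs, global_config):
    task = outputs.get("task", global_config.get("default_task"))
    return TASK_MODELS.get(task, global_config.get("default_model"))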

sebastianGehrmann added the enhancement label on Dec 16, 2021
@sebastianGehrmann
Contributor

As just discussed in the larger group, we need the following:

  1. A way to assign metrics to task_types, specifying which metrics should (or should not) be run for each
  2. A way to assign multiple configurations of a metric to a task type, for example a neural metric run with two different underlying models (BERTScore with mBERT and with RoBERTa, or BLEURT and BLEURT-20); see the sketch below
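
A rough sketch of what 1 and 2 could look like as data (the names and structure are placeholders, not an agreed format):

# Placeholder structure: each task type maps to the metric configurations that
# should run for it, and the same metric can appear twice with different models.
TASK_TYPE_METRICS = {
    "summarization": [
        {"name": "bertscore", "model": "mbert"},
        {"name": "bertscore", "model": "roberta"},
        {"name": "bleurt", "model": "bleurt-20"},
    ],
    "data_to_text": [
        {"name": "bleu"},
    ],
}

# Metrics that should explicitly NOT be run for a task type.
TASK_TYPE_EXCLUDED = {
    "simplification": ["bleurt"],
}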

@danieldeutsch
Contributor

Perhaps we could define a schema for an AllenNLP-style jsonnet file that specifies the task or metrics to be run:

{
  "input_file": "/path/to/input.txt",
  "output_file": "/path/to/output.json",
  "metrics": [
    {
      "name": "bertscore",
      "model": "mbert",
      "output_key": "bertscore_mbert"
    },
    {
       "name": "bertscore",
       "model": "roberta",
       "output_key": "bertscore_roberta"
    }
  ]
}
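
Consuming such a file could then look roughly like this (just a sketch; the registry and the stub scorer are hypothetical, not existing GEM-metrics code):

import json

def bertscore_stub(model):
    # Stand-in for a real BERTScore implementation parameterised by the model name.
    return lambda input_file: {"metric": "bertscore", "model": model, "score": 0.0}

# Hypothetical registry; the real metric implementations would be registered here.
METRIC_REGISTRY = {"bertscore": bertscore_stub}

def run_from_config(config_path):
    with open(config_path) as f:
        config = json.load(f)
    results = {}
    for spec in config["metrics"]:
        scorer = METRIC_REGISTRY[spec["name"]](spec.get("model"))
        # Each configuration writes its scores under its own output_key, so the
        # two BERTScore variants above don't overwrite each other.
        results[spec["output_key"]] = scorer(config["input_file"])
    with open(config["output_file"], "w") as out:
        json.dump(results, out, indent=2)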

We could also define task-specific suites, which would run a pre-defined set of metrics:

{
  "input_file": "/path/to/input.txt",
  "output_file": "/path/to/output.json",
  "task_suite": "summarization"
}
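
The suite would then simply expand to a pre-defined metric list before hitting the same code path (again a sketch; the suite contents are made up):

# Hypothetical pre-defined suites: a task_suite name expands to an explicit metric list.
TASK_SUITES = {
    "summarization": [
        {"name": "rouge", "output_key": "rouge"},
        {"name": "bertscore", "model": "mbert", "output_key": "bertscore_mbert"},
    ],
}

def expand_task_suite(config):
    # If the config names a task_suite, replace it with the corresponding metrics.
    if "task_suite" in config:
        config = dict(config)
        config["metrics"] = TASK_SUITES[config.pop("task_suite")]
    return config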

@tuetschek
Collaborator

@sebastianGehrmann you're assuming a global config for GEM tasks, right?

@danieldeutsch (if it's a global config, then) maybe it would make sense to integrate this into gem_metrics.config.py, which already holds a lot of configuration?
