Evaluation
We are in the process of writing a GF implementation of a CNL which is already precisely defined in another format, i.e. the AceWiki (OWL-compatible) subset of ACE defined by the Codeco grammar here. This allows us to likewise be very precise in our evaluation, which can be measured in the following ways:
How much of the target language does our grammar cover/accept as input?
This question is answered by parsing each sentence in the supplied test set and counting how many of them are accepted by the GF grammar. The target is that all test set sentences are assigned a parse tree, i.e. 100% coverage.
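For illustration, here is a minimal sketch of this count, assuming the compiled grammar is available as a PGF file and the test set is a plain-text file with one sentence per line. The file names, the concrete syntax name `GrammarAce`, and the string-based failure check are assumptions, not the actual names used in this repository:

```python
# count_coverage.py -- a rough sketch of the syntactic coverage measurement.
import subprocess

PGF = "Grammar.pgf"    # assumed path to the compiled grammar
LANG = "GrammarAce"    # assumed name of the ACE concrete syntax
TESTSET = "testset.txt"

def parses(sentence):
    """Return True if the GF shell assigns at least one parse tree to the sentence."""
    out = subprocess.run(
        ["gf", "--run", PGF],
        input=f'p -lang={LANG} "{sentence}"\n',
        capture_output=True, text=True,
    ).stdout
    # The GF shell prints an error message instead of a tree when parsing fails;
    # this string-based check is a simplification.
    return bool(out.strip()) and "parser failed" not in out.lower()

with open(TESTSET) as f:
    sentences = [line.strip() for line in f if line.strip()]

parsed = sum(parses(s) for s in sentences)
print(f"coverage: {parsed}/{len(sentences)} = {100.0 * parsed / len(sentences):.1f}%")
```

In practice it would be faster to feed all sentences to a single `gf` process, but the per-sentence call keeps the sketch simple.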
Note that this evaluation lets us measure how many sentences are parsed, not necessarily parsed correctly. E.g. we cannot say confidently that in the GF parse tree of the sentence `it is false that John likes Mary and Mary likes John.` the clause `Mary likes John` is not under negation (which is the correct ACE parse of this sentence).
To measure parsing correctness we would need to have an "official" mapping of ACE-in-GF abstract trees to APE (or Codeco) syntax trees. Without this mapping we can only approach this problem analytically, e.g. look at the `it is false` function and conclude that all its argument sentences are prefixed by `that`, because in ACE one needs to write `it is false that John likes Mary and that Mary likes John.` to get everything under negation.
A weaker notion of coverage than this syntactic coverage is DRS-level coverage. Some ACE sentences, e.g. `Mary likes who?` and `who does Mary like?`, are DRS-equivalent, i.e. they are assigned the same DRS by the ACE parser. (Note additionally that every ACE sentence has a single DRS.) So we could already claim coverage if we managed to parse just one sentence from each equivalence class. Testing this automatically is complicated; one needs to:
1. generate all the DRSs that can be obtained from the test sentences
2. generate the equivalence classes, each containing all possible verbalizations of the same DRS (we don't have a tool for this step)
3. parse the equivalence classes; successfully parsing at least one member of a class gives a point (sketched below)
For step 2 one could use the original test sentences, but this would not enumerate all the DRS-equivalent sentences. For example, AceWiki only supports `Mary likes who?` while ACE-in-GF only supports `who does Mary like?`.
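A sketch of step 3, assuming steps 1 and 2 have already produced a file in which each line is one equivalence class, with the DRS-equivalent verbalizations separated by tabs. The file name and the `parses` helper from the coverage sketch above are assumptions:

```python
# score_drs_coverage.py -- a rough sketch of step 3 of the DRS-level coverage test.
from count_coverage import parses  # the naive parse check sketched above

with open("equivalence_classes.tsv") as f:
    classes = [line.rstrip("\n").split("\t") for line in f if line.strip()]

# A class scores a point if at least one of its verbalizations is parsed by the GF grammar.
points = sum(any(parses(s) for s in members) for members in classes)
print(f"DRS-level coverage: {points}/{len(classes)} equivalence classes")
```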
What percentage of all accepted sentences result in only one parse tree?
This question is answered by counting the number of parse trees returned by GF for each successful parse. The target is that every valid sentence returns only a single parse tree, i.e. 0% ambiguity.
The ambiguity should be measured foremost for ACE sentences, but ideally for each language in the grammar.
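A minimal sketch of the ambiguity count, under the same naming assumptions as in the coverage sketch (each GF shell call prints the parse trees one per line):

```python
# count_ambiguity.py -- a rough sketch of the ambiguity measurement.
import subprocess

PGF = "Grammar.pgf"    # assumed grammar path
LANG = "GrammarAce"    # assumed concrete syntax name

def parse_trees(sentence):
    """Return the abstract trees the GF shell prints for the sentence (empty list on failure)."""
    out = subprocess.run(
        ["gf", "--run", PGF],
        input=f'p -lang={LANG} "{sentence}"\n',
        capture_output=True, text=True,
    ).stdout
    lines = [l for l in out.splitlines() if l.strip()]
    # Failed parses produce an error message rather than trees; this filter is naive.
    return [l for l in lines if "parser failed" not in l.lower()]

with open("testset.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

counts = {s: len(parse_trees(s)) for s in sentences}
accepted = [s for s, n in counts.items() if n > 0]
ambiguous = [s for s in accepted if counts[s] > 1]
print(f"ambiguity: {len(ambiguous)}/{len(accepted)} accepted sentences have more than one tree")
```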
Can the grammar accept sentences which are not valid in the target language (does it over-generate)?
This can be answered by generating random trees with the GF grammar and checking, for each tree, whether its linearization is accepted by the parser of the target language. The target is that all sentences generated by the GF grammar are accepted by the target parser, i.e. 100% precision.
`make test_precision`
Note that GF's `generate_random` function is not guaranteed to cover the grammar in any complete way. We can only approximate the answer by generating large numbers of random sentences and estimating the precision statistically.
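Here is a sketch of the kind of check `make test_precision` performs: sample random trees, linearize them, and ask the parser of the target language whether it accepts each sentence. All names are assumptions, and `accepted_by_target` is a hypothetical helper standing in for a call to the Codeco/AceWiki parser (or to APE for the weaker test below):

```python
# test_precision_sketch.py -- a rough sketch of the precision estimate.
import subprocess

PGF = "Grammar.pgf"    # assumed grammar path
LANG = "GrammarAce"    # assumed concrete syntax name
N = 1000               # sample size; precision can only be estimated statistically

def random_sentences(n):
    """Generate n random trees with `gr` and linearize them, one sentence per line."""
    out = subprocess.run(
        ["gf", "--run", PGF],
        input=f"gr -number={n} | l -lang={LANG}\n",
        capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

def accepted_by_target(sentence):
    """Hypothetical helper: True if the target-language parser accepts the sentence.
    How the Codeco/AceWiki parser (or APE) is actually invoked is not shown here."""
    raise NotImplementedError

sample = random_sentences(N)
ok = sum(accepted_by_target(s) for s in sample)
print(f"estimated precision: {ok}/{len(sample)} = {100.0 * ok / len(sample):.1f}%")
```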
If we do not reach 100% precision, the test can be weakened in several ways that would still give useful insights:
- test precision after removing all anaphoric references (changing `the` into `a`, etc.; see the sketch after this list)
- test precision against full ACE (using APE)
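As a rough illustration of the first weakening, a naive token-level rewrite that removes anaphoric determiners before re-testing precision (the replacement table is only illustrative and far from complete):

```python
# strip_anaphora_sketch.py -- naive removal of anaphoric references before re-testing precision.
REPLACEMENTS = {"the": "a"}   # illustrative only; a real version would also handle "an", pronouns, etc.

def weaken(sentence):
    """Rewrite anaphoric determiners token by token."""
    return " ".join(REPLACEMENTS.get(tok, tok) for tok in sentence.split())

print(weaken("the man sees the dog"))   # -> "a man sees a dog"
```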
What percentage of all generated trees can be linearized into a single sentence?
The grammar can decide to treat certain constructs as syntactic sugar (GF variants), i.e. on semantic grounds (e.g. Attempto DRS equivalence) have different strings correspond to the same function. Examples:
- `does not` vs `doesn't`
- active vs passive
- dative shift vs `to`-PP
This type of evaluation lets us detect where variants are used and which variant is picked first when linearizing.
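One way to measure this, sketched below under the same naming assumptions as above: sample random trees, linearize each of them with all variants (`l -all`), and count the trees that yield exactly one surface string.

```python
# eval_variants_sketch.py -- a rough sketch of the variant/linearization check.
import subprocess

PGF = "Grammar.pgf"    # assumed grammar path
LANG = "GrammarAce"    # assumed concrete syntax name
N = 100                # number of random trees to sample

def gf_run(command):
    """Run one GF shell command via `gf --run` and return its non-empty output lines."""
    out = subprocess.run(
        ["gf", "--run", PGF],
        input=command + "\n",
        capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

# Step 1: sample random abstract trees (one tree per output line).
trees = gf_run(f"gr -number={N}")

# Step 2: linearize each tree with all variants and count distinct strings.
single = sum(1 for t in trees if len(set(gf_run(f"l -all -lang={LANG} {t}"))) == 1)
print(f"{single}/{len(trees)} trees linearize into a single sentence")
```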
Having a set of additional concrete syntaxes that correspond to the ACE abstract grammar, we can measure the accuracy of translating from ACE to the other syntaxes. This can be done in stages:
- percentage of fully translated sentences, i.e. every function is linearized (this can be evaluated automatically; see the sketch after this list)
- percentage of syntactically correctly translated sentences, i.e. the translation result is syntactically acceptable in the target language (needs human evaluation)
- percentage of semantically correctly translated sentences, i.e. the translation result reflects the ACE meaning (needs human evaluation)
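A sketch of how the first stage could be automated: translate each test sentence by parsing it in ACE and linearizing it in another concrete syntax, then save the pairs for the human evaluation of stages 2 and 3. The grammar and language names and the naive failure check are assumptions:

```python
# translate_sketch.py -- a rough sketch of the ACE-to-other-language translation step.
import subprocess

PGF = "Grammar.pgf"    # assumed grammar path
SRC = "GrammarAce"     # assumed ACE concrete syntax name
TGT = "GrammarGer"     # assumed target concrete syntax name (German as an example)

def translate(sentence):
    """Parse in ACE and linearize in the target language through a GF shell pipe."""
    return subprocess.run(
        ["gf", "--run", PGF],
        input=f'p -lang={SRC} "{sentence}" | l -lang={TGT}\n',
        capture_output=True, text=True,
    ).stdout.strip()

with open("testset.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

pairs = [(s, translate(s)) for s in sentences]
# Naive success criterion: some output was produced and it is not a parser error.
# Whether every function was actually linearized still has to be checked separately.
ok = [p for p in pairs if p[1] and "parser failed" not in p[1].lower()]
print(f"translated: {len(ok)}/{len(pairs)}")

with open("translations_for_review.tsv", "w") as f:
    for src, tgt in pairs:
        f.write(f"{src}\t{tgt}\n")
```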
If the grammar is used in a PENG/AceWiki/Minibar-style predictive editor then we need to make sure that the token look-ahead supported by the grammar is useful. Things to consider:
- token look-ahead ignores dependent types, e.g. if we implement random tree generation using token look-ahead + parsing then it would overgenerate compared to the standard random generator if dependent types are used in the abstract syntax;
- can the results of the token look-ahead be structured by category (so that they can be grouped in the UI)? The grammar writer should then put some thought into the category system and the category names.
The evaluation languages for D11.1 are: Ace, Eng, Fin, Fre, Ger, Ita, Swe, Urd, Spa, Cat, Dut.
The remaining tasks, in rough order of dependency:
- Tackle remaining issues in the ACE grammar/resource grammar to bring the (basic) syntactic coverage to 100%, keeping ambiguity as low as possible.
  `make test_acewiki_aceowl`
- Minimize over-generation of the grammar to bring up the precision.
  `make test_precision`
- Complete grammars for all languages (JJC: these are all in place but have not been checked):
  - Application grammars (`grammars/acewiki_aceowl/`)
  - Test vocabularies (`words/acewiki_aceowl/`)
  - Ontograph 40 vocabularies (`words/ontograph_40/`)
- Linearise Ontograph 40 sentences into each language and give them to language experts to evaluate "if they make sense".
  `make lin_ontograph_40`
- Conduct formal user studies where users have to decide for every sentence if it is true or false given a diagram that visually expresses the objects and relations of the sentences. Collaboration with UHEL.