# 11-731_Project

Team project for the 11-731 Machine Translation course, Fall 2019 semester.

As MT systems keep getting better, it is increasingly important to have evaluation metrics that can properly compare the best models. In the WMT19 Metrics Shared Task, although many evaluation metrics achieve a high correlation with human evaluation overall, most of them perform poorly on the strongest machine translation systems (*Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges*). As we restrict attention to only the top-N MT systems, the correlation between the metrics and human evaluation drops quickly. It is important to understand why this happens, so we explore the shared task results and try to figure out in what ways these evaluation metrics diverge from human assessment.

We investigate the effect of semantic features in an evaluation metric (*BERTScore: Evaluating Text Generation with BERT*) and add such features to improve existing metrics. We also try to fine-tune the contextual embedding-based metric ESIM on human evaluation data for the best models (*Putting Evaluation in Context: Contextual Embeddings improve Machine Translation Evaluation*).
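As a rough illustration of the "top-N" analysis above, the sketch below computes how the correlation between a metric and human judgments changes when only the N systems ranked best by humans are kept. This is a minimal, hypothetical example (system names and scores are made up, not WMT19 data, and not the project's actual code); it only shows the shape of the computation.

```python
# Minimal sketch: correlation between metric scores and human judgments,
# restricted to the top-N systems ranked by mean human score.
# All system names and score values below are made-up placeholders.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-segment human and metric scores, keyed by system name.
human_scores = {
    "sys_a": np.array([0.90, 0.80, 0.95, 0.70]),
    "sys_b": np.array([0.60, 0.65, 0.70, 0.50]),
    "sys_c": np.array([0.40, 0.30, 0.50, 0.45]),
}
metric_scores = {
    "sys_a": np.array([0.88, 0.82, 0.90, 0.75]),
    "sys_b": np.array([0.70, 0.60, 0.72, 0.55]),
    "sys_c": np.array([0.35, 0.40, 0.50, 0.40]),
}

def correlation_on_top_n(human, metric, n):
    """Pearson correlation over segments of the n systems ranked best by humans."""
    ranked = sorted(human, key=lambda s: human[s].mean(), reverse=True)[:n]
    h = np.concatenate([human[s] for s in ranked])
    m = np.concatenate([metric[s] for s in ranked])
    r, _ = pearsonr(h, m)
    return r

for n in (3, 2):
    print(f"top-{n} systems: r = {correlation_on_top_n(human_scores, metric_scores, n):.3f}")
```

In practice one would plug in segment-level metric scores (e.g. BERTScore or ESIM outputs) and the shared task's human assessments in place of the placeholder arrays.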