
Machine Translation engines evaluation framework

Formal Metadata

Title
Machine Translation engines evaluation framework
Series Title
Number of Parts
112
Author
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You may use, adapt and reproduce, distribute and make the work or content publicly available in unchanged or adapted form for any legal, non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and pass on the work or content, including in adapted form, only under the terms of this license.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
[Liffey Hall 1 on 2022-07-15] Evaluating Machine Translation engines can be very challenging. The quality of Machine Translation varies greatly depending on the domain and language pair, different MT engines may have different interfaces, APIs and requirements to run, and even the definition of a good translation is debatable, with any automatic MT quality metric providing only an approximation of the actual translation quality. That is why a universal evaluation framework for this task is very important. In our work we tried to create such a framework.

1) We defined a base translation class that unifies all file handling, batch creation and result processing. As a result, the only work needed to support a new MT engine is writing a small child class that implements a couple of simple functions. This allows us to easily extend our framework to new MT engines and new language pairs.

2) We defined a set of test datasets and provided a way to add new datasets to this set. For our evaluation, our aim was to create test data that covers both the general and the healthcare domain, using the EMEA dataset (https://opus.nlpl.eu/EMEA.php), OPUS-100 (https://opus.nlpl.eu/opus-100.php), Paracrawl (https://paracrawl.eu/) and several others, but our data preparation scripts can easily be extended to other domains and datasets as well.

3) We defined a set of quality metrics to evaluate the results of the MT engines. The metrics we used include BLEU (https://github.com/mjpost/sacrebleu), BERTScore (https://github.com/Tiiiger/bert_score), ROUGE (https://github.com/pltrdy/rouge), TER and CHRF (both also from the sacrebleu implementation).

Besides the MT evaluation framework, we will present our own evaluation results. For our evaluation we used cloud-based engines - Azure Translator (https://azure.microsoft.com/en-us/services/cognitive-services/translator/) and Google Translate (https://cloud.google.com/translate/) - as well as open-source engines - Marian MT (https://huggingface.co/transformers/model_doc/marian.html), NVIDIA's NeMo (https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/machine_translation.html), Facebook's MBart 50 (https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt) and Facebook's M2M100 (https://huggingface.co/facebook/m2m100_418M). For the open-source engines we tried to use Hugging Face's transformers implementation whenever possible; as mentioned, though, our framework was designed to be easily extendable to other MT engines and underlying frameworks. We will also present evaluation results for NeMo and Marian MT engines that we fine-tuned specifically for the healthcare domain. While these particular results may be rather specific to our use case, they help to highlight how our framework can be extended to custom MT engines as well.
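
The base-class design described in point 1 above can be illustrated with a short, hedged sketch. The names used here (BaseTranslator, _translate_batch, UppercaseTranslator, the batch size, the one-sentence-per-line file format) are illustrative assumptions, not the actual interface of the framework presented in the talk:

# Minimal sketch of the base-class pattern from point 1; all names are hypothetical.
from abc import ABC, abstractmethod
from typing import Iterable, List

class BaseTranslator(ABC):
    """Shared plumbing: file handling, batch creation, result processing."""

    def __init__(self, batch_size: int = 16):
        self.batch_size = batch_size

    def translate_file(self, src_path: str, out_path: str) -> None:
        # Read one source sentence per line, translate in batches, write results.
        with open(src_path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f if line.strip()]
        translations: List[str] = []
        for batch in self._batches(sentences):
            translations.extend(self._translate_batch(batch))
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(translations) + "\n")

    def _batches(self, sentences: List[str]) -> Iterable[List[str]]:
        for i in range(0, len(sentences), self.batch_size):
            yield sentences[i:i + self.batch_size]

    @abstractmethod
    def _translate_batch(self, batch: List[str]) -> List[str]:
        """The only engine-specific method a new child class must implement."""

class UppercaseTranslator(BaseTranslator):
    """Placeholder engine; a real child class would call an MT model or API here."""

    def _translate_batch(self, batch: List[str]) -> List[str]:
        return [sentence.upper() for sentence in batch]

With this split, supporting a new engine or language pair only means overriding _translate_batch, which matches the extension mechanism described in the abstract.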
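
Point 3 names the quality metrics and the libraries that implement them. Below is a hedged sketch of how those libraries can be called on a hypothesis/reference pair; the example sentences are made up, and the exact metric settings used in the talk are not specified here:

# Example: scoring hypotheses against references with the libraries named above.
# Requires: pip install sacrebleu bert-score rouge
import sacrebleu
from bert_score import score as bert_score
from rouge import Rouge

hypotheses = ["the patient was given the medicine", "dosage must not be exceeded"]
references = ["the patient received the medicine", "the dosage must not be exceeded"]

# BLEU, TER and CHRF all come from sacrebleu; references are passed as a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
ter = sacrebleu.corpus_ter(hypotheses, [references]).score
chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score

# BERTScore returns per-sentence precision/recall/F1 tensors; average F1 over the corpus.
_, _, f1 = bert_score(hypotheses, references, lang="en")
bertscore_f1 = f1.mean().item()

# ROUGE-L F-score averaged over the corpus (pltrdy/rouge implementation).
rouge_l = Rouge().get_scores(hypotheses, references, avg=True)["rouge-l"]["f"]

print(f"BLEU={bleu:.2f} TER={ter:.2f} CHRF={chrf:.2f} "
      f"BERTScore-F1={bertscore_f1:.3f} ROUGE-L={rouge_l:.3f}")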
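
The abstract also notes that the open-source engines were run through Hugging Face's transformers library where possible. As an illustration of what the engine-specific batch-translation step for such an engine could wrap, here is a small MarianMT example; the checkpoint Helsinki-NLP/opus-mt-en-de is only an example language pair, not necessarily one evaluated in the talk:

# Example: batch translation with a MarianMT checkpoint via the transformers library.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # example English-to-German model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = [
    "The patient should take one tablet twice a day.",
    "Store the vaccine at two to eight degrees Celsius.",
]

# Tokenize the batch, generate translations, and decode back to text.
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
generated = model.generate(**batch)
translations = [tokenizer.decode(t, skip_special_tokens=True) for t in generated]
print(translations)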