[Liffey Hall 1 on 2022-07-15] Evaluating Machine Translation engines can be very challenging. The quality of Machine Translation varies greatly depending on the domain and language pair. Different MT engines may have different interfaces or APIs and different requirements to run. On top of that, even the definition of a good translation is debatable, with any automatic MT quality metric providing only an approximation of actual translation quality. That is why a universal evaluation framework for this task is very important. In our work we tried to create such a framework.

1) We defined a base translation class that unifies all file handling, batch creation and result processing. As a result, the only work needed to support a new MT engine is writing a small child class that implements a couple of simple functions (see the first sketch below). This allows us to easily extend our framework to new MT engines and new language pairs.

2) We defined a set of test datasets and provided a way to add new datasets to this set (see the second sketch below). For our evaluation, our aim was to create test data covering both the general and healthcare domains: the EMEA dataset (https://opus.nlpl.eu/EMEA.php), OPUS-100 (https://opus.nlpl.eu/opus-100.php), ParaCrawl (https://paracrawl.eu/) and several others. Our data preparation scripts can easily be extended to other domains and datasets as well.

3) We defined a set of quality metrics to evaluate the results of the MT engines (see the third sketch below). The metrics we used include BLEU (https://github.com/mjpost/sacrebleu), BERTScore (https://github.com/Tiiiger/bert_score), ROUGE (https://github.com/pltrdy/rouge), and TER and chrF (both also from the sacrebleu implementation).

Besides the MT evaluation framework, we will present our own evaluation results. For our evaluation we used cloud-based engines - Azure Translator (https://azure.microsoft.com/en-us/services/cognitive-services/translator/) and Google Translate (https://cloud.google.com/translate/) - as well as open-source engines - Marian MT (https://huggingface.co/transformers/model_doc/marian.html), NVIDIA's NeMo (https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/machine_translation.html), Facebook's mBART-50 (https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt) and Facebook's M2M100 (https://huggingface.co/facebook/m2m100_418M). For the open-source engines we used Huggingface's transformers implementation whenever possible. As mentioned, though, our framework was designed to be easily extendable to other MT engines and underlying frameworks. We will also present evaluation results for the NeMo and MarianMT engines that we fine-tuned specifically for the healthcare domain. While these particular results may be rather specific to our use case, they help to highlight how our framework can be extended to custom MT engines as well.
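To make point 1 concrete, here is a minimal sketch of what such a base class plus a small engine-specific child class could look like. All names (BaseTranslator, translate_batch, MarianTranslator) are our own hypothetical illustrations rather than the identifiers actually used in the framework; the Huggingface MarianMT calls are real, but the surrounding structure is an assumption.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTranslator(ABC):
    """Unifies file handling, batching and result processing;
    engine-specific subclasses only implement translate_batch()."""

    def __init__(self, batch_size: int = 16):
        self.batch_size = batch_size

    def translate_file(self, src_path: str, out_path: str) -> None:
        with open(src_path, encoding="utf-8") as f:
            lines = [line.strip() for line in f]
        results: List[str] = []
        # Split the input into batches and delegate to the engine.
        for i in range(0, len(lines), self.batch_size):
            results.extend(self.translate_batch(lines[i : i + self.batch_size]))
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(results))

    @abstractmethod
    def translate_batch(self, sentences: List[str]) -> List[str]:
        """Translate one batch; the only method a new engine must provide."""


class MarianTranslator(BaseTranslator):
    """Child class wrapping a Huggingface MarianMT checkpoint."""

    def __init__(self, model_name: str = "Helsinki-NLP/opus-mt-en-de", **kwargs):
        super().__init__(**kwargs)
        from transformers import MarianMTModel, MarianTokenizer

        self.tokenizer = MarianTokenizer.from_pretrained(model_name)
        self.model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(self, sentences: List[str]) -> List[str]:
        batch = self.tokenizer(sentences, return_tensors="pt",
                               padding=True, truncation=True)
        generated = self.model.generate(**batch)
        return self.tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With this split, supporting a cloud engine such as Azure Translator or Google Translate amounts to another child class whose translate_batch() issues the corresponding API calls.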
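Point 2 can be pictured as a small registry that the preparation scripts populate. The sketch below assumes each dataset has already been extracted into aligned, line-parallel source/reference files (e.g. Moses-format exports from OPUS); the DATASETS mapping, file paths and load_dataset name are all hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical registry: dataset name -> (source file, reference file).
# Adding a new dataset only requires registering its parallel files here.
DATASETS: Dict[str, Tuple[str, str]] = {
    "emea_en_de": ("data/EMEA.en-de.en", "data/EMEA.en-de.de"),
    "opus100_en_de": ("data/opus100.en-de.en", "data/opus100.en-de.de"),
}


def load_dataset(name: str) -> Tuple[List[str], List[str]]:
    src_path, ref_path = DATASETS[name]
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    ref = Path(ref_path).read_text(encoding="utf-8").splitlines()
    assert len(src) == len(ref), "parallel files must be aligned line by line"
    return src, ref
```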
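For point 3, scoring a system output against references with sacrebleu's corpus-level API (corpus_bleu, corpus_chrf, corpus_ter) and BERTScore could look roughly as follows; the example sentences are invented, and this is only a sketch of how the listed libraries are typically called, not the framework's actual evaluation code.

```python
import sacrebleu
from bert_score import score as bert_score

# Invented example: one system hypothesis and one reference.
hypotheses = ["The patient was administered 10 mg of the drug."]
references = [["The patient received 10 mg of the drug."]]  # one inner list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)

# BERTScore takes flat candidate/reference lists and returns torch tensors.
P, R, F1 = bert_score(hypotheses, references[0], lang="en")

print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  TER {ter.score:.2f}  "
      f"BERTScore-F1 {F1.mean().item():.4f}")
```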