Key Concepts
This study introduces the first parallel dataset for English-Tulu translation and develops a machine translation system for this low-resource language by leveraging resources from the related Kannada language.
Summary
The authors present the first parallel dataset for English-Tulu translation by extending the FLORES-200 dataset with human translations into Tulu. They collaborated with the Jai Tulunad organization, a volunteer group dedicated to preserving the Tulu language and culture, to obtain the translations.
The authors then develop a machine translation system for English-Tulu using a transfer learning approach. They leverage resources available for the related South Dravidian language, Kannada, to train their model without parallel English-Tulu data. The key steps include:
- Fine-tuning a pre-trained IndicBARTSS model to translate from Kannada to English, and using this model to back-translate the Tulu monolingual data.
- Training an English-Tulu model using the back-translated pairs, parallel English-Kannada data, and denoising autoencoding.
- Further fine-tuning the models using the parallel Kannada-Tulu data from the DravidianLangTech-2022 shared task.
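The three data sources above can be sketched as a single pair-construction step. This is a minimal illustration, not the paper's implementation: `kn_en_translate` is a hypothetical stand-in for the fine-tuned Kannada-to-English IndicBARTSS model, and the random token masking is a simplified form of the denoising objective.

```python
import random

def add_noise(tokens, mask_token="<mask>", mask_prob=0.35, seed=0):
    """Corrupt a sentence by masking random tokens (simplified
    BART-style denoising); the model learns to reconstruct the
    clean sentence from the noisy input."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

def build_training_pairs(tulu_monolingual, kn_en_translate, en_kn_parallel):
    """Combine the three data sources described above:
    1. back-translated pairs: English output of the Kannada->English
       model applied to Tulu sentences (related-language transfer),
    2. parallel English-Kannada pairs,
    3. denoising pairs: (noisy Tulu, clean Tulu)."""
    pairs = []
    for tcy in tulu_monolingual:
        en = kn_en_translate(tcy)  # back-translation step
        pairs.append((en, tcy))
    pairs.extend(en_kn_parallel)
    for tcy in tulu_monolingual:
        noisy = " ".join(add_noise(tcy.split()))
        pairs.append((noisy, tcy))
    return pairs
```

The key design point is that every synthetic pair has clean Tulu on the target side, so translation quality into Tulu is not limited by the noise on the source side.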
The authors' English-Tulu model achieves a BLEU score of 35.41, significantly outperforming Google Translate, which scored 7.19 on the same test set. However, the authors note several limitations, including the absence of an adversarial training step and the relatively small size of the Tulu monolingual dataset.
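The BLEU comparison above can be made concrete with a minimal corpus-level BLEU sketch (uniform n-gram weights, single reference, no smoothing); reported scores in the paper would come from standard tooling, which additionally handles tokenization and smoothing.

```python
import math
from collections import Counter

def corpus_bleu(hyps, refs, max_n=4):
    """Minimal corpus-level BLEU: clipped n-gram precisions up to
    max_n, geometric mean, and a brevity penalty. Returns 0.0 if any
    n-gram order has no matches (no smoothing)."""
    clipped = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = Counter(tuple(h[i:i+n]) for i in range(len(h)-n+1))
            r_ngrams = Counter(tuple(r[i:i+n]) for i in range(len(r)-n+1))
            clipped[n-1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n-1] += max(len(h) - n + 1, 0)
    if min(totals) == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect hypothesis scores 100.0; a hypothesis sharing no 4-grams with the reference scores 0.0 under this unsmoothed variant.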
Statistics
Tulu has around 2.5 million speakers, predominantly in the southwestern region of India.
The Tulu Wikipedia contains 1,894 articles, from which the authors extracted a monolingual Tulu corpus of 40,000 sentences.
The DravidianLangTech-2022 shared task provided a parallel Kannada-Tulu dataset of 8,300 sentences.
Quotations
"Tulu, classified within the South Dravidian linguistic family branch, is predominantly spoken by approximately 2.5 million individuals in southwestern India."
"Without access to parallel EN–TCY data, we developed this system using a transfer learning (Zoph et al., 2016) to address translation challenges in this low-resource language."
"Our English–Tulu system, trained without using parallel English–Tulu data, outperforms Google Translate by 19 BLEU points (in September 2023)."