Unsupervised Machine Translation For Indian Languages Using Monolingual Corpora
Date
2019-07
Publisher
Indian Statistical Institute, Kolkata
Abstract
Machine translation has traditionally relied on parallel data, but very little
parallel data is available for Indian languages. The parallel data for
Hindi-Marathi translation amounts to around 50,000 sentences, which is far
less than supervised machine translation requires. Monolingual data, however,
is easy to find for these low-resource Indian languages. The aim of this
project is to investigate whether translation can be learned without any
parallel data. To this end, we have implemented a model that takes sentences
from two monolingual corpora in different languages and maps them into the
same latent space. Sentences encoded into this shared latent space can then
be decoded into either of the two languages. In this way, the model learns to
translate (encode/decode) without any form of supervision. The model relies
only on monolingual corpora of two languages, in our case Hindi and Marathi.
The BLEU scores achieved by the model are 18.40 for Hindi to Marathi and 22.84
for Marathi to Hindi on the FIRE data set, without using a single parallel
sentence at training time.
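
The keywords for this record mention denoising auto-encoders, which are trained to reconstruct a sentence from a corrupted copy of itself. The abstract does not state which corruption is used, so the following is only a minimal sketch under assumed choices: random word dropout followed by a local shuffle. The function name `add_noise` and the parameters `drop_prob` and `max_shift` are hypothetical and are not taken from the dissertation.

```python
import random

def add_noise(tokens, drop_prob=0.1, max_shift=3, rng=None):
    """Corrupt a token list for denoising auto-encoder training (illustrative only).

    Assumed noise model: drop each word with probability drop_prob,
    then shuffle the remaining words locally.
    """
    if not tokens:
        return []
    rng = rng or random.Random()

    # Word dropout: keep each token with probability 1 - drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept:
        # Never return an empty sentence; fall back to one random token.
        kept = [rng.choice(tokens)]

    # Local shuffle: add a random offset in [0, max_shift] to each position
    # and sort by the result, so tokens only move a short distance.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]

# Example: corrupt a whitespace-tokenised Hindi sentence.
sentence = "मैं घर जा रहा हूँ".split()
print(add_noise(sentence, rng=random.Random(0)))
```

In a setup of this kind, the auto-encoder would receive the corrupted sentence as input and be trained to output the original one, which discourages it from simply copying its input word by word.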
Description
Dissertation under the supervision of Dr. Utpal Garain
Keywords
Machine Translation, DeNoising Auto-Encoders
Citation
28p.
