Abstract

The preparation of parallel corpora is a challenging task, particularly for languages that are under-represented in the digital world. In a multilingual country like India, the need for such parallel corpora is pressing for several low-resource languages. In this work, we provide an extended English-Odia parallel corpus, OdiEnCorp 2.0, aimed particularly at Neural Machine Translation (NMT) systems that translate English↔Odia. OdiEnCorp 2.0 includes existing English-Odia corpora, and we extended the collection with several other methods of data acquisition: scraping parallel data from many websites, including Odia Wikipedia, and using optical character recognition (OCR) to extract parallel data from scanned images. Our OCR-based data extraction approach for building a parallel corpus is suitable for other low-resource languages that lack online content. The resulting OdiEnCorp 2.0 contains 98,302 sentences, with 1.69 million English tokens and 1.47 million Odia tokens.