I am delighted to be able to announce the release of the EMILLE/CIIL
corpus. The corpus contains monolingual written corpus data for 14
South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada,
Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu).
It also contains orthographically transcribed spoken data and parallel
corpus data for five South Asian languages (Bengali, Gujarati, Hindi,
Punjabi and Urdu). In addition, the parallel corpus contains the
originals from which the translations stored in the corpus were derived.
All data in the corpus is CES and Unicode compliant. The EMILLE corpus
totals some 94 million words.
The corpora were built as part of a collaboration between Lancaster
University and the Central Institute of Indian Languages, Mysore.
As well as the corpora, the following materials are also available for
download from the web-site:
i.) documentation relating to the corpus;
ii.) POS tagged Urdu corpus data;
iii.) Hindi corpus data in which demonstrative use has been subject to
annotation;
iv.) A prototype POS tagger for Urdu.
The corpus can be downloaded from:
More details of the EMILLE project can be found at:
The GATE language engineering architecture has also been developed
further by the University of Sheffield to enable language processing
tasks using the EMILLE data. For more details on GATE see:
A new release of the EMILLE corpus will be made, indexed for use with
Xara, towards spring 2004.
Apologies if you receive this message more than once.
Professor of English Language and Linguistics,
Dept. Linguistics and Modern English Language,