ELSNET-list archive

Category:   E-Material
Subject:   EMILLE/CIIL Corpus
From:   Tony McEnery
Email:   eiaamme_(on)_exchange.lancs.ac.uk
Date received:   06 Jan 2004

Dear All,

I am delighted to be able to announce the release of the EMILLE/CIIL corpus. The corpus contains monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). It also contains orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words.

The corpora were built as part of a collaboration between Lancaster University and the Central Institute of Indian Languages, Mysore.

As well as the corpora, the following materials are also available for download from the web-site:

i.) documentation relating to the corpus;
ii.) POS tagged Urdu corpus data;
iii.) Hindi corpus data in which demonstrative use has been subject to annotation;
iv.) a prototype POS tagger for Urdu.

The corpus can be downloaded from:
http://www.ling.lancs.ac.uk/corplang/emille

More details of the EMILLE project can be found at:
http://www.emille.lancs.ac.uk

The GATE language engineering architecture has also been developed further by the University of Sheffield to enable language processing tasks using the EMILLE data. For more details on GATE see:
http://www.gate.ac.uk/

A new release of the EMILLE corpus, indexed for use with Xara, will be made towards spring 2004.

Apologies if you receive this message more than once.

Regards,

Tony McEnery,
Professor of English Language and Linguistics,
Dept. Linguistics and Modern English Language,
Lancaster University,
Bailrigg,
Lancaster,
LA1 4YT.
 

Page generated 14-02-2008 by Steven Krauwer