List of contents of the ECI CDROM
Id Type Language Size (K words)
alb01 Word list and Texts Albanian 205 (a) Albanian word list, 32K words with syntactic classes: the Albanian dictionary published in Tirana in 1984 by the Academy of Sciences. (b) The novel "Koncert në fund të dimrit" by Ismail Kadare, published in Tirana.
bul01 Technical Bulgarian 5 A number of scientific papers from "Science" journal.
chi01 Newspaper Chinese 2895 The PH text corpus contains 3.75 million Chinese characters. It is a collection of news from China's official Xinhua (New China) news agency, covering January 1990 to March 1991. It is GB coded, with word and phrase boundaries marked.
cze01 newspaper czech 726 Newspaper Texts (Lidove noviny, Literarni noviny)
cze02 newspaper czech 4000 Newspaper Texts (Lidove noviny, Literarni noviny)
dut01 newspaper dutch 600 Articles from the student newspaper Universiteitskrant of the University of Groningen from the academic years 1990/1991 and 1991/1992.
dut02 mixed dutch 5203 A large Dutch corpus from INL including transcripts of radio programs, newspaper and magazine issues and some technical texts.
dut03 mixed dutch 128 A continuation of dut02.
eng01 novels english 241 Three English novels from the OTA collection: Thomas Hardy, 'Far from the Madding Crowd'; George Eliot, 'Silas Marner'; Charles Dickens, 'A Christmas Carol'.
eng02 novels english 900 The Complete Sherlock Holmes, Sir Arthur Conan Doyle.
est01 mixed estonian 100 Extracts from general fiction and prose.
fre01 newspaper french 4121 Text from Le Monde newspaper, consisting of articles from September and October 1989, and January 1990.
gae01 dictionary gaelic 141 MacBain, Alexander, "An etymological dictionary of the Gaelic language", Gairm Publications, 1982 (1st edition 1896, revised 1911).
ger01 sentenceList german 20 Lists of German sentences, tagged with some syntactic information. The sentence test suite of DiTo, a linguistic database for diagnostics in the syntax components of NLP systems.
ger02 newspaper german 191 German newspaper articles from VDI-Nachrichten 1990-1991
ger03 newspaper german 34291 Frankfurter Rundschau Newspaper text
ger04 newspaper german 7376 Donau Courier newspaper texts
gre01 mixed greek 2515 Newspapers, periodicals, popular fiction 1976-1990.
ita01 novels italian 13 Six short stories by G. Verga
ita03 newspaper italian 303 Corpus of Italian newspapers (La Repubblica, La Stampa, Il Mattino, Il Corriere)
jap01 dictionary japanese 203 EDICT Japanese/English dictionary.
jap02 Technical Japanese 148(?) Japanese version of the ITU CCITT data.
lat01 poetry Latin 75 Vergil, Aeneid, books I-XII; Vergil, Georgicon, books I-III
lit01 Fiction Lithuanian 20 "KOLEKCIONIERIUS" Story
mal01 Technical/Novels Malay 563 A collection of original Malay texts and translations from English, mainly technical books with some novels. From University Sains Malaysia and Dewan Bahasa & Pustaka (publishers)
mul01 Financial En/Fr/Ge 566 Financial reports from the Union Bank of Switzerland (mostly French-German)
mul02 technical Fr/Ge/It 177 Avalanche bulletins 1986-1991 (ca. 40 per year, ca. 250 words each), Swiss Federal Institute for Snow and Avalanche Bulletins. Very little Italian.
mul03 legal Fr/Ge/It 227 Text of Swiss Civil Code
mul04 technical En/Fr/Sp 13497 International Telecommunications Union CCITT handbook
mul05 legal En/Fr/Sp 5000 International Labour Organisation "Official Bulletin, B Series": "Reports of the Committee on Freedom of Association of the Governing Body of the ILO and related material 1984-1989".
mul06 technical 9 EC langs 219 The announcement text of the EC Esprit program.
mul07 sentencelist En/Fr 12 BABEL project data - French business sentences and English translations.
mul08 novel En/Serb 386 George Orwell's "1984" in English, Serbian, Croatian and Slovenian versions.
mul09 technical 5 EC langs 248 ScanWorX User's Guide (Optical Character Reader)
mul10 Mixed English/French 19 HCRC MT Evaluation Corpus: French/English parallel texts
mul11 Financial German/French 615 Financial Reports from CREDIT SUISSE
mul12 Legal Danish/Spanish/English 1199 The machine-readable 'Civil Law Corpus' from the Copenhagen Business School
mul13 novel Uzbek/English 72 Uzbek novel 'Ärk Freedom' with English interlinear translation
nor01 novels norwegian 2226 Collection of texts in Bokmaal and Nynorsk: some novels and some Ibsen plays.
por01 mixed portuguese 675 An extract from the Borba/Ramsey corpus of Brazilian Portuguese.
rus01 technical Russian 364 Technical reports (computer related) by Andrei Mikheev
ser01 stories serbian 700 Short stories and novel extracts
spa01 speech spanish 1041 Transcribed Spanish speech from CORPUS ORAL DE REFERENCIA DEL ESPANOL CONTEMPORANEO 1991-1992
spa02 newspaper spanish 447 One week of the local Spanish newspaper "Sur", from April and September 1991.
spa03 newspaper spanish 830 "El Diario Vasco" newspaper articles 1991
swe01 mixed swedish 1718 A Fragment of SUC: the Stockholm-Umea Corpus of modern written Swedish. Text extracts (~2000 words each) from books and newspapers published after 1990.
tur01 dictionary turkish 173 PC-KIMMO rule specification and word lists for Turkish morphology
tur02 newspaper turkish 110 News text excerpted from the Anatolia News Agency feed, covering roughly September and October 1992. Approximately 10% of the total.
Total 98,792 K words
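
Note on reading chi01: the entry above says the text is "GB coded". The sketch below shows one way such bytes might be turned into Unicode text in Python. It assumes that "GB coded" means GB 2312. The helper names (decode_gb, count_han) and the sample string are hypothetical, and the markup used for word and phrase boundaries is not described in this listing, so it is not handled here.

```python
# Sketch only: decode GB-coded bytes into Python text and count Han characters.
# Assumption: "GB coded" in the chi01 entry refers to GB 2312.
# decode_gb and count_han are illustrative names, not part of the corpus tools.

def decode_gb(raw: bytes) -> str:
    """Decode GB 2312 bytes, replacing undecodable bytes instead of failing."""
    return raw.decode("gb2312", errors="replace")


def count_han(text: str) -> int:
    """Count characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


# Stand-in for bytes read from a corpus file opened in binary mode.
sample = "新华社 北京".encode("gb2312")
text = decode_gb(sample)
print(text, count_han(text))
```

Counting characters this way could be used to compare a decoded file against the character count quoted in the entry, keeping in mind that boundary markers and punctuation fall outside the counted range.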