Language and Speech Resources

ELSNET is the European Network in Human Language Technologies (http://www.elsnet.org). This page is http://www.elsnet.org/resources.html

Language and speech resources are of crucial importance for research and development in language and speech technology. ELSNET aims at the creation and distribution of pilot resources for experimentation purposes, and acts as a platform for exchange of expertise across languages, and for discussion of emerging standards. ELSNET collaborates closely with the main organisations in the field of resources.

The Resources Landscape
Resources made available through ELSNET
Resources created with ELSNET support
Resources organisations
ELSNET's directory of resources

The Resources Landscape

ELSNET, in close collaboration with the former ENABLER Network, is building a map of the Resources Landscape. This map should make it easier to identify and access Language Resources: surveys, metadata, networks, projects, ...

The first release of the landscape is available at http://www.ilc.cnr.it/elsnet4/

Resources made available through ELSNET

The European Corpus Initiative Multilingual Corpus I: The ECI/MCI CD-ROM contains over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material.
Newspapers on the internet: A list of links to electronic versions of newspapers from various countries in several languages. The URL is http://www.ims.uni-stuttgart.de/info/Newspapers.html

Resources created with ELSNET support (but no longer available through ELSNET)

The HCRC Map Task Corpus: The HCRC Map Task Corpus is a set of 8 CD-ROMs containing linked audio and transcriptions of about 18 hours of spontaneous speech, recorded from 128 two-person conversations according to a detailed experimental design. The CD-ROMs are available from LDC (no longer from ELSNET). The non-member price is ca. $200. The project URL is http://www.hcrc.ed.ac.uk/maptask/
The Groningen Speech Corpus: The Groningen Speech Corpus was collected by A.M. Sulter, MD and Prof. H.K. Schutte as part of a research project funded by NWO (Netherlands Organization for Scientific Research). The 4 CD-ROMs contain over 20 hours of read speech in Dutch, recorded on PCM tape under fairly good conditions. The CD-ROMs are available from ELRA/ELDA (no longer from ELSNET). The non-member price is ca. 800 euro.
The Syntax/Semantic Annotation Task: In the course of 2000-2001 ELSNET produced two small sample corpora of parallel structure for German and Italian, about 1000 sentences in each language, illustrating 20 verbs and their syntactic and semantic subcategorization. The annotation concentrates on the verbal predicates and their subcategorized complements, as well as on a few relevant modifiers. A short report can be found at http://www.elsnet.org/ssa

Resources organisations

ELRA: The European Language Resources Association (ELRA) was established as a non-profit organization in Luxembourg in February, 1995. The overall goal of ELRA is to provide a centralized organization for the validation, management, and distribution of speech, text, and terminology resources and tools, and to promote their use within the European telematics R&TD community. The URL is http://www.icp.grenet.fr/ELRA/home.html.
LDC: The Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from the Advanced Research Projects Agency (ARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation. The URL is http://www.ldc.upenn.edu
ENABLER: The Enabler Network aims at improving cooperation among national activities established by national authorities for providing Language Resources for their languages. The action aims at: establishing a regular exchange of information; identifying and fostering possible synergies and cooperation; promoting the compatibility and interoperability of their results, thus facilitating the successful transfer of technologies and tools among languages and the construction of multilingual Language Resources; increasing the visibility and the strategic impact of those national activities in the field of HLT; contributing to the creation of an overall framework in which the public and private sectors, national efforts and international coordination could cooperate in order to answer the IST need for Language Resources. The URL is http://www.enabler-network.org/
NEMLAR: The goal of NEMLAR (Network for Euro-Mediterranean LAnguage Resources) is to create a network of qualified Euro-Mediterranean partners to specify and support the development of high-priority LRs for Arabic and other local languages in a systematic, standards-driven, collaborative learning context. The project will focus on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language industry and communication players, and establishing a protocol for developing a basic LR kit for the major forms of the region's predominant language, Arabic, and for other widely spoken local languages where appropriate. The URL is http://www.nemlar.org
TELRI: The TELRI association aims at collecting, promoting, and making available monolingual and multilingual language resources and tools for the extraction of language data and linguistic knowledge, with a special focus on Central and Eastern European languages. The URL is http://www.telri.de.