| Category: ||E-Material |
| Subject: ||JRC-Acquis: bilingual alignments for 231 language pairs now available |
| From: || |
| Email: ||ralf.steinberger_(on)_jrc.it |
| Date received: ||06 Aug 2007 |
Bilingual alignments for all 231 language pairs of the
JRC-Acquis parallel corpus are now freely available online.
We are pleased to announce that the bilingual alignments for all 231
language pairs of the JRC-Acquis corpus are now available online for
download. The JRC-Acquis is a freely downloadable multilingual parallel
corpus in 22 languages comprising a total of over 1 billion words.
SIZE AND FORMAT
- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words + 19 million
words in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML Format according to TEI P4, UTF-8-encoded
- Aligned bilingually at paragraph level (often equivalent to sentences
or sentence parts), using Vanilla.
- Modular: download the languages you need.
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese,
Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.
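
The figure of 231 language pairs follows from choosing 2 of the 22
languages listed above, without regard to order. As a quick check, the
following Python sketch enumerates the unordered pairs. The variable
names are illustrative only and are not part of the corpus distribution.

```python
from itertools import combinations
from math import comb

# The 22 corpus languages, as listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Every unordered pair of distinct languages is one bilingual alignment.
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages give", len(pairs), "language pairs")
# The same formula with 21 languages gives the 210 pairs of version 2.2.
print("21 languages give", comb(21, 2), "language pairs")
```

In general, n languages yield n(n-1)/2 pairs, so each added language
adds one new pair for every language already in the corpus.
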
- Documents on contents, principles and political objectives of the EU
- EU legislation;
- International agreements.
Paragraph alignment for all 231 language pairs was carried out with the
Vanilla aligner and is available for download. Paragraphs in the
JRC-Acquis are frequently equivalent to sentences or even sentence
parts. Version 2.2 of the JRC-Acquis corpus (210 language pairs, still
available on the same website) was additionally aligned with HunAlign.
- Paragraph-aligned for all 231 language pairs;
- Paragraphs are sentence parts, sentences, or groups of sentences;
- Using the Vanilla aligner;
- Over 1 million alignments per language pair (on average for all
language pairs);
- 85.43% one-to-one alignments (on average for all language pairs).
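
To illustrate how a one-to-one share like the one above could be
computed for a single language pair, here is a minimal Python sketch.
The XML fragment, the <link> element and its "type" and "xtargets"
attributes are assumptions made for this example. Check them against
the downloaded alignment files before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical alignment fragment. Element and attribute names are
# assumptions for illustration, not confirmed corpus markup.
SAMPLE = """<linkGrp>
  <link type="1-1" xtargets="1;1"/>
  <link type="1-1" xtargets="2;2"/>
  <link type="2-1" xtargets="3 4;3"/>
  <link type="1-1" xtargets="5;4"/>
</linkGrp>"""


def alignment_type_counts(xml_text):
    """Count alignment links by type (e.g. '1-1', '2-1')."""
    root = ET.fromstring(xml_text)
    return Counter(link.get("type") for link in root.iter("link"))


def one_to_one_share(counts):
    """Percentage of links that pair exactly one paragraph with one."""
    total = sum(counts.values())
    return 100.0 * counts["1-1"] / total if total else 0.0


counts = alignment_type_counts(SAMPLE)
print(f"{one_to_one_share(counts):.2f}% one-to-one alignments")
```

Averaging this percentage over all language pairs would give a
corpus-wide figure comparable to the one quoted above.
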
MANUAL SUBJECT DOMAIN CLASSIFICATION
- Manually classified according to EUROVOC subject domains;
- Selected from 6000 hierarchically organised, wide-coverage classes;
- Suitable for experiments with multilingual multi-label categorisation.
USE / DOWNLOAD
- Download from <http://langtech.jrc.it/JRC-Acquis.html>
- Usage free for research purposes.
FOR MORE DETAILS
You will find a detailed description of version 2.2 of the corpus in the
following paper. Please use the following reference when you mention the
JRC-Acquis in any publications. We would be pleased to hear how you use
the corpus.
Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation
(LREC'2006). Genoa, Italy, 24-26 May 2006. Available at
<http://langtech.jrc.it/#Publications>
The JRC's Language Technology group specialises in the development of
highly multilingual text analysis tools and in cross-lingual
applications. An example is our multilingual (19 languages) news
analysis application NewsExplorer, publicly accessible at
http://press.jrc.it/NewsExplorer.
Related JRC developments (both covering 22+ languages):
- NewsBrief (http://press.jrc.it): breaking news detection and display
of the very latest thematic news from around the world;
- Medical Information System MedISys (http://medusa.jrc.it): displays
the latest health-related news from around the world according to themes
and diseases.
Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology
Elsnet-list mailing list