ELSNET-list archive

Category:   E-Material
Subject:   JRC-Acquis: bilingual alignments for 231 language pairs now available
Email:   ralf.steinberger_(on)_jrc.it
Date received:   06 Aug 2007

We are pleased to announce that the bilingual alignments for all 231 language pairs of the JRC-Acquis parallel corpus are now freely available online for download. The JRC-Acquis is a freely downloadable multilingual parallel corpus in 22 languages comprising a total of over 1 billion words.

SIZE AND FORMAT
- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words + 19 million words in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML format according to TEI P4, UTF-8-encoded
- Aligned bilingually at paragraph level (often equivalent to sentences or sentence parts), using Vanilla
- Modular: download the languages you need

LANGUAGES
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.

TEXT TYPES
- Documents on contents, principles and political objectives of the EU Treaties;
- EU legislation;
- Declarations;
- Resolutions;
- Acts;
- International agreements.

PARAGRAPH ALIGNMENT
Paragraph alignment for all 231 language pairs was carried out with the Vanilla aligner and is available for download. Paragraphs in the JRC-Acquis are frequently equivalent to sentences or even sentence parts. Version 2.2 of the JRC-Acquis corpus (210 language pairs, still available on the same website) was additionally aligned with HunAlign.
- Paragraph-aligned for all 231 language pairs;
- Paragraphs are sentence parts, sentences, or groups of sentences;
- Aligned using the Vanilla aligner;
- Over 1 million alignments per language pair (on average for all language pairs);
- 85.43% one-to-one alignments (on average for all language pairs).

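For readers wondering where the figure of 231 comes from: it is simply the number of unordered pairs that can be formed from 22 languages. The short Python sketch below recomputes it from the language list above and also checks the "over 1 billion words" total against the per-language averages quoted in the announcement. The variable and function names are illustrative only and are not part of the corpus distribution.

```python
from itertools import combinations

# The 22 corpus languages, as listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), "languages ->", len(pairs), "language pairs")

# Rough total size in millions of words: the average per language
# (28.9 million) plus annexes etc. (19 million), times the number
# of languages. These are the averages quoted in the announcement.
AVG_BODY_MILLIONS = 28.9
AVG_ANNEX_MILLIONS = 19.0
total_millions = len(LANGUAGES) * (AVG_BODY_MILLIONS + AVG_ANNEX_MILLIONS)
print(f"approx. {total_millions:.1f} million words in total")
```

The same formula, n(n-1)/2, gives 210 pairs for 21 languages, which would be consistent with the pair count quoted for version 2.2.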
MANUAL SUBJECT DOMAIN CLASSIFICATION
- Manually classified according to EUROVOC subject domains;
- Selected from 6000 hierarchically organised, wide-coverage classes;
- Suitable for experiments with multilingual multi-label categorisation.

USE / DOWNLOAD
- Download from http://langtech.jrc.it/JRC-Acquis.html;
- Usage free for research purposes.

FOR MORE DETAILS
You will find a detailed description of version 2.2 of the corpus in the following paper. Please use the following reference when you mention the JRC-Acquis in any publications. We would be pleased to hear how you use the corpus.

Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages'. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006. Available at http://langtech.jrc.it/#Publications.

The JRC's Language Technology group specialises in the development of highly multilingual text analysis tools and in cross-lingual applications. An example is our multilingual (19 languages) news analysis application NewsExplorer, publicly accessible at http://press.jrc.it/NewsExplorer.

Related JRC developments (both covering 22+ languages):
- NewsBrief (http://press.jrc.it): breaking news detection and display of the very latest thematic news from around the world;
- Medical Information System MedISys (http://medusa.jrc.it): displays the latest health-related news from around the world according to themes and diseases.

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology
http://langtech.jrc.it, http://press.jrc.it/NewsExplorer

_______________________________________________
Elsnet-list mailing list
Elsnet-list_(on)_elsnet.org
http://mailman.elsnet.org/mailman/listinfo/elsnet-list

Page generated 14-02-2008 by Steven Krauwer