ELSNET-list archive

Category:   E-Material
Subject:   New LDC Corpus
From:   Linguistic Data Consortium
Email:   ldc_(on)_ldc.upenn.edu
Date received:   24 Jul 2003

LDC2003T12
Arabic Gigaword

The Linguistic Data Consortium (LDC) is pleased to announce the availability of the Arabic Gigaword corpus.

Arabic Gigaword is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by the LDC. The newswire texts are drawn from four sources:

    Agence France Presse (afp)
    Al Hayat News Agency (alh)
    An Nahar News Agency (ann)
    Xinhua News Agency (xin)

Much of the Agence France Presse content in this collection has been published previously by the LDC in Arabic Newswire Part 1 (LDC2001T55). The entire Al Hayat, An Nahar and Xinhua Arabic content, as well as AFP content for 2001-2002, is previously unreleased material.

Arabic Gigaword consists of 319 files, totaling approximately 1.1 GB in compressed form (4348 MB uncompressed, and 391619 Kwords). All text files in the corpus have been converted to UTF-8 character encoding. Arabic Gigaword is distributed on DVD.

For further information, including a link to online documentation, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T12

Institutions that have membership in the LDC during the 2003 Membership Year will be able to receive this corpus free of charge. Nonmembers may license this publication for $2,500.

If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email to <ldc_(on)_ldc.upenn.edu> or call 1 (215) 573-1275.

--------------------------------------------------------------------
Linguistic Data Consortium       Phone: 1 (215) 573-1275
3600 Market Street               Fax:   1 (215) 573-2175
Suite 810                        email: ldc_(on)_ldc.upenn.edu
Philadelphia, PA 19104-2653      www:   http://www.ldc.upenn.edu

