The Prague Arabic Dependency Treebank (PADT) project is an open-ended activity
of the Institute of Formal and Applied Linguistics, Charles University in
Prague, consisting in multi-level annotation of Arabic language resources in the
light of the theory of Functional Generative Description. The project is a
younger sibling of the Prague Dependency Treebank for Czech.
The PADT corpus currently consists of morphologically and syntactically
annotated newswire texts of Modern Standard Arabic, which originate from
resources published by the Linguistic Data Consortium, University of
Pennsylvania: Arabic Gigaword and the plain data of the Penn Arabic Treebank,
Part 1 and Part 2.
The linguistic description of PADT is unique in Arabic NLP. In morphology, we
resolve true grammatical categories rather than decomposing words into morphs,
and annotate hierarchies of possible analyses called MorphoTrees.
The first version of the treebank, PADT 1.0 at http://ufal.mff.cuni.cz/padt/,
was released via LDC in November 2004. PADT 1.0 contains more than 148,000
tokens of data annotated with MorphoTrees. In syntax, the dependency relations
within a sentence are captured in a way that parallels the Prague Dependency
Treebank for Czech. The syntactically annotated data exceed 113,500 tokens. The
more recent part, roughly 49,000 tokens, has a lower-level MorphoTrees
counterpart; the rest has morphology in the same system of tags, but without
the reusable disambiguated hierarchies.
New development and extended annotations (350,000 tokens of MorphoTrees,
250,000 of analytical syntax, 20,000 of tectogrammatics, i.e. deep syntax) have
been under way since the PADT 1.0 release. Please visit the PADT++ online
weblog for the latest information about the project,