Project description
 |
The main goal of the project is to build, exploit and disseminate language
resources for Bulgarian as well as software for supporting the creation and
usage of the language resources. The current set of language resources
includes: (1) Bulgarian morphological lexicon of about 100 000 lexemes; (2)
Partial syntactic grammar of Bulgarian. It covers non-recursive noun phrases
and analytical verb phrases; (3) Machine-readable Valence Dictionary. It
consists of the 1 000 most frequent verbs and their valence frames. The semantic
restrictions on the arguments are extracted and matched against the SIMPLE
core ontology; (4) Semantic Dictionary. At the moment we are classifying the
most frequent nouns with respect to the ontological hierarchy
without specifying the synonymy relations between them. So far
we have classified about 3 000 nouns according to the specifications of the
SIMPLE ontology; (5) Named Entity recognition module. It consists
of lists of names classified according to four categories: personal names,
organization names, location names and others. The lists contain 15 000 names
in total, plus 1 500 abbreviations. There are also pattern
grammars for recognition of compound names; (6) Text Archive. It consists of
about 72 million running words. In order to compile a representative and balanced
corpus of Bulgarian texts, we tried to gather a variety of genres
(15% fiction, 78% newspapers and 7% legal texts, government bulletins and other
genres); (7) Linguistically interpreted corpus. This will be a balanced corpus
annotated at the morphosyntactic level; (8) HPSG-based treebank of Bulgarian. At
the moment we have annotated about 10 000 sentences with constituent and
dependency structures as well as coreference relations.
The team is also developing an XML-based system, called CLaRK. The first
version of the system is available on the web:
http://www.bultreebank.org/clark/index.html. The main aim behind the design of
the system is the minimization of human intervention during the creation of
language resources. It incorporates several technologies: XML technology;
Unicode; Regular Cascaded Grammars; Constraints over XML Documents. So far
the system has been downloaded by about 600 people and, to our knowledge, it is
being used to create corpora for less-processed languages and for
named entity recognition. |
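
The Named Entity recognition module in item (5) combines categorized name lists with pattern grammars for compound names. The following is only a minimal sketch of that general idea, not the project's implementation: the toy gazetteer, the category labels and the function names `tag_tokens` and `merge_compound_names` are all hypothetical, and a simple "merge adjacent names of the same category" rule stands in for the actual pattern grammars, which are not described here.

```python
# Hypothetical toy gazetteer (an assumption for illustration only);
# the project's real lists are far larger and use their own format.
GAZETTEER = {
    "Ivan": "PERSON",
    "Petrov": "PERSON",
    "Sofia": "LOCATION",
    "Plovdiv": "LOCATION",
    "BulTreeBank": "ORGANIZATION",
}


def tag_tokens(tokens):
    """Label each token with its gazetteer category, or 'O' if it is not listed."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]


def merge_compound_names(tagged):
    """Merge runs of adjacent tokens sharing a category into one compound name.

    This adjacency rule is a stand-in for a pattern grammar over tagged tokens.
    """
    merged = []
    for tok, cat in tagged:
        if cat != "O" and merged and merged[-1][1] == cat:
            # Extend the previous name with the current token.
            merged[-1] = (merged[-1][0] + " " + tok, cat)
        else:
            merged.append((tok, cat))
    # Keep only recognized names; unlisted tokens were needed just as separators.
    return [(name, cat) for name, cat in merged if cat != "O"]


if __name__ == "__main__":
    sentence = "Ivan Petrov visited Sofia".split()
    print(merge_compound_names(tag_tokens(sentence)))
```

In this sketch the lookup step and the merging step are kept separate, so the merging rule could be swapped for a richer pattern without touching the lists.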