Project URL http://www.bultreebank.org 
The main goal of the project is to build, exploit and disseminate language 
resources for Bulgarian as well as software for supporting the creation and 
usage of the language resources. The current set of language resources 
includes: (1) Bulgarian morphological lexicon of about 100 000 lexemes; (2) 
Partial syntactic grammar of Bulgarian. It covers non-recursive noun phrases 
and analytical verb phrases; (3) Machine readable Valence Dictionary. It 
consists of the 1000 most frequent verbs and their valence frames. The semantic 
restrictions over the arguments are extracted and matched against the SIMPLE 
core ontology; (4) Semantic Dictionary. At the moment we are classifying the
most frequent nouns with respect to the ontological hierarchy,
without specifying the synonymic relations between them. Up to now
we have classified about 3 000 nouns according to the specifications of the 
SIMPLE ontology; (5) Named Entity recognition module. It consists 
of lists of names classified according to four categories: personal names, 
organization names, location names and others. The total number of names in the 
lists is 15 000. The number of abbreviations is 1 500. There are also pattern 
grammars for recognition of compound names; (6) Text Archive. It consists of 
about 72 million running words. In order to compile a representative and balanced 
corpus of Bulgarian texts, we tried to gather a variety of different genres 
(15% fiction, 78% newspapers and 7% legal texts, government bulletins and other 
genres); (7) Linguistically interpreted corpus. This will be a balanced corpus 
annotated at the morphosyntactic level; (8) HPSG-based treebank of Bulgarian. At 
the moment we have annotated about 10 000 sentences with constituent and 
dependency structures as well as coreference relations. 
The team is also developing an XML-based system, called CLaRK. The first 
version of the system is available on the web: 
http://www.bultreebank.org/clark/index.html. The main aim behind the design of 
the system is the minimization of human intervention during the creation of 
language resources. It incorporates several technologies: XML technology; 
Unicode; Regular Cascaded Grammars; Constraints over XML Documents. Up to now 
the system has been downloaded by about 600 people and to our knowledge it is 
being used for the creation of corpora for less processed languages and for 
named entity recognition.
Project duration Feb 2001 - Aug 2004
Name Senior Researcher Kiril Simov
Organisation Linguistic Modeling Laboratory, IPP, Bulgarian Academy of Sciences
Address Acad. G. Bonchev Str. 25A
City 1113 Sofia, Bulgaria
Country Bulgaria
Phone +359 2 979 2825
Fax +359 2 870 72 73
