
Project description: HPSG-based Syntactic Treebank for Bulgarian

[ ID = 0019 ]	BulTreeBank
Project name	HPSG-based Syntactic Treebank for Bulgarian
Short name or acronym	BulTreeBank
Project URL	http://www.bultreebank.org
Project description	The main goal of the project is to build, exploit and disseminate language resources for Bulgarian, together with software that supports the creation and use of such resources. The current set of language resources includes: (1) A Bulgarian morphological lexicon of about 100 000 lexemes; (2) A partial syntactic grammar of Bulgarian, covering non-recursive noun phrases and analytical verb phrases; (3) A machine-readable Valence Dictionary, consisting of the 1 000 most frequent verbs and their valence frames. The semantic restrictions over the arguments are extracted and matched against the SIMPLE core ontology; (4) A Semantic Dictionary. At the moment we are classifying the most frequent nouns with respect to the ontological hierarchy without specifying the synonymic relations between them. So far we have classified about 3 000 nouns with respect to the specifications in the SIMPLE ontology; (5) A Named Entity recognition module. It consists of lists of names classified according to four categories: personal names, organization names, location names and others. The total number of names in the lists is 15 000, and the number of abbreviations is 1 500. There are also pattern grammars for the recognition of compound names; (6) A Text Archive of about 72 million running words. In order to compile a representative and balanced corpus of Bulgarian texts, we tried to gather a variety of genres (15% fiction, 78% newspapers and 7% legal texts, government bulletins and other genres); (7) A linguistically interpreted corpus. This will be a balanced corpus annotated at the morphosyntactic level; (8) An HPSG-based treebank of Bulgarian. At the moment we have annotated about 10 000 sentences with constituent and dependency structures as well as coreference relations. The team is also developing an XML-based system called CLaRK. The first version of the system is available on the web: http://www.bultreebank.org/clark/index.html.
The main aim behind the design of the system is to minimize human intervention during the creation of language resources. It incorporates several technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents. So far the system has been downloaded by about 600 people and, to our knowledge, it is being used for the creation of corpora for less processed languages and for named entity recognition.
Languages	Bulgarian
Funding	public
Project duration	Feb 2001 - Aug 2004
Contact
Name	Senior Researcher Kiril Simov
Organisation	Linguistic Modeling Laboratory, IPP, Bulgarian Academy of Sciences
Address	Acad. G. Bonchev Str. 25A
City	1113 Sofia, Bulgaria
Country	Bulgaria
Email	kivs_at_bultreebank.org
Phone	+359 2 979 2825
Fax	+359 2 870 72 73
Last update	2004-06-15 17:43:07


Page generated 13-02-2008 by Steven Krauwer
