ELSNET
 


Central and Eastern European Survey

Resources

Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics, Charles University

NL and Speech Resources available: Textual, Speech, Software, Lexical resources (including terminology).


Name: Czech National Corpus
Nature: text
Language: Czech
Size: under construction; up to 100 million words expected by the end of 1997
Format: ASCII, SGML
Coverage: mostly newspaper text, but will certainly also include prose, fiction, dialogues, and a historical (diachronic) part
Medium: diskette
Availability: part available free of charge over the Internet; the rest for commercial purposes

Name: Penn Treebank
Nature: text
Language: English
Size: tagged part up to 5 million words
Format: ASCII, SGML
Coverage: newspapers, technical manuals, Brown Corpus, Dow Jones newswire, WBUR radio
Medium: CD-ROM
Availability: don't know, we are only users

Name: Brown Corpus
Nature: text
Language: English
Size: 1,013,644 words
Format: ASCII
Coverage: newspaper text, prose
Medium: diskette
Availability: don't know, we are only users

Name: hand POS-tagged corpus
Nature: text
Language: Czech
Size: 600 000 tokens, each tagged with a POS tag
Format: ASCII, one TOKEN|TAG pair per line
Coverage: newspaper text from the 1960s and 1970s
Medium: diskette
Availability: free for research purposes
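
A minimal sketch of how a file in the TOKEN|TAG format described above might be read. The tag names in the sample and the choice to split on the last "|" are assumptions for illustration; the record does not document the tag set or how a literal "|" inside a token is handled.

```python
from typing import Iterable, List, Tuple


def parse_tagged_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse lines of the form TOKEN|TAG into (token, tag) pairs."""
    pairs = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Assumption: split on the LAST "|" so that a token which itself
        # contains "|" is kept whole.
        token, sep, tag = line.rpartition("|")
        if not sep:
            raise ValueError(f"missing '|' separator: {line!r}")
        pairs.append((token, tag))
    return pairs


# Hypothetical sample; the real corpus tag set is not given in the record.
sample = ["Praha|NOUN\n", "\n", "je|VERB\n"]
pairs = parse_tagged_lines(sample)
```

Reading from a file would be a matter of passing the open file object, since it iterates line by line.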

Name: manually tagged corpus
Nature: text
Language: Czech
Size: 150 000 tokens
Format: SGML
Coverage: newspaper and magazine text (1991-1997)
Medium: diskette
Availability: upon individual agreement

Name: Korektor (Spell Checker for Slovak Language)
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries; over 6 000 000 word forms can be generated
Format: ASCII (with Slovak character set, with special semantic structure). Commercial format: proprietary "binary" structure. Software: Application Programming Interface (API), library in C
Coverage: main sources are newspaper text, law, economy
Availability: Status (available, regularly updated)

Software description: For each word form, it returns a boolean value indicating whether the word is a correct form in Slovak
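
The boolean lookup described above can be illustrated with a small sketch. This is not the product's C API, which is not documented here; the function names, the in-memory set, and the case-insensitive comparison are all assumptions, and the three word forms are only a toy list.

```python
from typing import Callable, Iterable


def make_spell_checker(word_forms: Iterable[str]) -> Callable[[str], bool]:
    """Build a checker that answers True/False for a single word form."""
    # Assumption: comparison ignores case; the real tool may differ.
    known = frozenset(w.lower() for w in word_forms)

    def is_correct(word: str) -> bool:
        return word.lower() in known

    return is_correct


# Toy word list (hypothetical); the real lexicon generates millions of forms.
is_correct = make_spell_checker(["mesto", "mesta", "mestu"])
ok = is_correct("Mesto")
bad = is_correct("mestoo")
```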

Name: Spell Checker with Hyphenation for Slovak Language
Nature: lexical, software
Language: Slovak
Size: around 5000 entries for the TeX hyphenation algorithm; exception list of around 2000 entries
Format: ASCII (with Slovak character set), data for the TeX hyphenation algorithm. Commercial format: proprietary "binary" structure. Software: Application Programming Interface (API), library in C. (Note: usually in a single system with the Spell Checker, due to the list of exceptions.)
Coverage: general, domain independent
Precision of algorithm: 99.5% on the word list from Item 3
List of exceptions: all known entries from the Spell Checker (q.v.) that are incorrectly hyphenated by the algorithm
Note: remaining errors are almost exclusively semantically dependent
Medium: Hard disc, diskette
Availability: Status (available, regularly updated)

Software description: It returns the hyphenated form of a word, marking break points with a special "hyphenation" character
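
The combination described in this record, pattern-based hyphenation in the style of the TeX algorithm with an exception list that takes priority, might look roughly like the sketch below. The class and parameter names, the toy patterns, and the toy exception are assumptions for illustration only; they are not the Slovak pattern data or the product's API.

```python
import re


def parse_pattern(pattern):
    """Split a TeX-style pattern such as '1na' into ('na', [1, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    points = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            points[pos] = int(ch)  # digit sits before letter number `pos`
        else:
            pos += 1
    return letters, points


class Hyphenator:
    def __init__(self, patterns, exceptions=(), mark="-"):
        self.patterns = dict(parse_pattern(p) for p in patterns)
        # Exceptions are given pre-hyphenated in lowercase, e.g. "ban-ana".
        self.exceptions = {e.replace("-", ""): e.split("-") for e in exceptions}
        self.mark = mark  # the "hyphenation" character inserted at breaks

    def hyphenate(self, word, left_min=2, right_min=2):
        key = word.lower()
        if key in self.exceptions:  # the exception list wins over patterns
            return self.mark.join(self.exceptions[key])
        work = "." + key + "."      # dots mark the word boundaries
        scores = [0] * (len(work) + 1)
        for start in range(len(work)):
            for end in range(start + 1, len(work) + 1):
                points = self.patterns.get(work[start:end])
                if points:
                    for offset, value in enumerate(points):
                        idx = start + offset
                        scores[idx] = max(scores[idx], value)
        pieces, last = [], 0
        # scores[i + 1] is the gap before word[i]; odd values allow a break.
        for i in range(left_min, len(word) - right_min + 1):
            if scores[i + 1] % 2 == 1:
                pieces.append(word[last:i])
                last = i
        pieces.append(word[last:])
        return self.mark.join(pieces)


toy_patterns = ["1ba", "1na"]  # hypothetical, not real pattern data
by_patterns = Hyphenator(toy_patterns).hyphenate("banana")
by_exception = Hyphenator(toy_patterns, ["ban-ana"]).hyphenate("banana")
```

With these toy inputs the pattern pass yields "ba-na-na", while the exception entry overrides it with "ban-ana", which is the role the record gives to its exception list.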

Name: Hyphenated word list for Slovak Language. (Note: No special name)
Nature: lexical
Language: Slovak
Size: around 150 000 hyphenated word forms (Note: All word forms from the Spell Checker can be generated and hyphenated)
Format: ASCII (with Slovak character set), word list for the TeX hyphenation algorithm
Coverage: for training purposes, word forms incorrectly hyphenated by the algorithm were added in particular
Medium: Hard disc, diskette
Availability: for internal use; possibly available as a commercial product. Status: available, regularly updated

Name: Lematizator: Lemmatizer and Stemmer for Slovak Language
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries
Format: ASCII (with Slovak character set). Commercial format: proprietary "binary" structure. Software: Application Programming Interface (API), library in C
Coverage: General, domain independent
Medium: Hard disc, diskette
Availability: commercial product. Status: Available, regularly updated

Software description: Word form analyzer; result: basic form(s) (lemma) and stem(s). Word form generator from a given lemma. (Note: some lemmas are semantically distinguished)
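
The two directions described above, analysis of a word form into its lemma(s) and generation of word forms from a lemma, can be sketched with a simple two-way index. The class name, method names, and the three sample pairs are assumptions for illustration; stems and the semantic distinctions between lemmas are left out, and the real tool is a C library whose interface is not shown here.

```python
from collections import defaultdict


class Lexicon:
    """Two-way index between word forms and lemmas (illustrative only)."""

    def __init__(self, entries):
        # entries: iterable of (word_form, lemma) pairs
        self._lemmas = defaultdict(set)
        self._forms = defaultdict(set)
        for form, lemma in entries:
            self._lemmas[form].add(lemma)
            self._forms[lemma].add(form)

    def analyze(self, form):
        """Return all lemmas recorded for a word form (may be several)."""
        return sorted(self._lemmas.get(form, ()))

    def generate(self, lemma):
        """Return all word forms recorded for a lemma."""
        return sorted(self._forms.get(lemma, ()))


# Toy data (hypothetical); the real lexicon has over 120 000 entries.
lexicon = Lexicon([("mesto", "mesto"), ("mesta", "mesto"), ("mestu", "mesto")])
lemmas = lexicon.analyze("mesta")
forms = lexicon.generate("mesto")
```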

Name: Morphology (Note: Morphology for Slovak Language)
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries
Format: ASCII (with Slovak character set). Commercial format: proprietary "binary" structure. Software: Application Programming Interface (API), library in C
Coverage: general, domain independent
Medium: Hard disc, diskette
Availability: commercial product. Status: available, regularly updated.

Software description: Word form analyzer; result: basic form(s) (lemma) and morphological information (POS, case, number, etc.). (Note: some lemmas are semantically distinguished)

Name: Frequency list. (Note: Frequency list for Slovak Language)
Nature: lexical
Language: Slovak
Size: 10 000 word forms
Format: ASCII (with Slovak Character set)
Coverage: newspapers
Medium: Hard disc, diskette
Availability: commercial product. Status: available

Software description: Special-purpose tool (Unix and Windows platform) for easy disambiguation of morphological output. Available upon personal agreement.


This page is no longer maintained. Please visit http://www.elsnet.org/survey/quests to find out how to update your organisation profile or to find information about this organisation


 

Page generated 04-01-1998 by Steven Krauwer