``This article appeared in ELSNews 5.1 (February 1996), and is re-printed by permission from the Editor. ELSNews is the newsletter of ELSNET, the European Network in Language and Speech. Information about ELSNET is available from the Coordinator, at elsnet@let.ruu.nl.''
The topic of this year's School is Dialogue Systems, and some of the questions that will be addressed during the two weeks in July include:
See pages 4-5 of this issue of ELSNews for more information.
The resources under consideration are:
Members of ELSNET will be given substantial discounts on the costs of the resources which are being transferred to ELRA, and members of ELRA will be given an even greater discount.
A coordination committee has been established to help harmonise the actions and activities of ELSNET and ELRA. All members of this committee are involved in both organisations.
A complete catalogue of ELRA's resources will soon be available, and information about these resources will be published in future issues of ELSNews.
Dr. Trancoso explained the motivation for the survey: ``Newspaper text may constitute very important linguistic resources for research not only from the point of view of written language processing, but also from the point of view of speech processing (for building language models for large vocabulary continuous speech recognition [systems], for instance).''
A number of laboratories and institutes have made extensive use of news text for research purposes. Text from Le Monde and the Wall Street Journal, for example, has been used by LIMSI-CNRS for their work in multilingual speech recognition. Newspaper corpora have also frequently been used in developing text-to-speech systems.
The result of Dr. Trancoso's survey --- a list of more than 30 newspapers worldwide --- was posted to elsnet-list on Feb. 8. The list is alphabetical by country (European papers are listed first, and non-European papers are listed at the end), and for each newspaper, the URL is given along with information about periodicity (e.g., daily, weekly, etc.), language, and copyright regulations. The list includes not only newspapers from Western European countries, but also from Central and Eastern Europe, and from North and South America, in languages as diverse as Catalan, Dutch, Estonian, and Esperanto.
The list was converted to HTML by Oliver Christ (IMS, Universität Stuttgart), and a link was made to the Stuttgart file from the ELSNET home page.
FOR INFORMATION
If you'd like access to newstext on-line --- or if you'd just like to
read the morning news from your computer! --- point your browser at:
http://www.ims.uni-stuttgart.de/info/Newspapers.html
or follow the link from the ELSNET home page.
Comments, new links, changes and updates are welcome! Please send them to:
Oliver Christ
IMS, Universität Stuttgart
Azenbergstraße 12
70174 Stuttgart, Germany
Email: oli@ims.uni-stuttgart.de
ELRA membership is open to any organisation, public or private. Full membership, with voting rights, is available to organisations established in the EU or European Economic Area. Purely for organisational purposes, members will be classified by their chief interest (spoken, written, or terminological resources). The annual membership fee has been set at a modest ECU 1000 to encourage broad participation.
FOR INFORMATION
The address of ELRA's Paris office is:
Khalid Choukri, ELRA/ELDA
87, Avenue d'Italie
F-75013 Paris, France
Tel: +33 1 45 86 53 00
Fax: +33 1 45 86 44 88
Email: elra@calvanet.calvacom.fr
The TED (Transnational English Database) corpus contains recordings of speeches made at the Eurospeech-93 conference in Berlin. The name of the corpus --- and its nickname, ``The Terrible English Database'' --- reflects the fact that a high percentage of the presentations at Eurospeech-93 were given in English by non-native speakers of English. TED was first conceived at Eurospeech-91 in Genoa, Italy, and follow-up discussions were held at the COCOSDA meeting in Chiavari (1991). The idea was first formally presented as a potential project by Joseph Mariani (LIMSI-CNRS) at the COCOSDA meeting in Banff (1992). He proposed to record the speeches made at Eurospeech-93 in Berlin and to distribute the data, with the aim of developing speech recognisers which will try to automatically recognise the speeches made at Eurospeech-95.
As the result of the extensive preparatory work of many individuals including members of the EuroCOCOSDA Consortium, ESCA, and the local organisers in Berlin, particularly Professor Klaus Fellbaum who worked closely with the Institute of Phonetics at the University of München, recordings were made of the oral presentations at Eurospeech-93.
Of the 287 oral presentations at the conference, 224 were successfully recorded, providing a total of about 75 hours of speech material per channel. These recordings provide a relatively large number of speakers speaking a variant of the same language (English) over a relatively long period of time (15 min each + 5 min discussion) on a specific topic. A subset of speakers were recorded with a laryngograph in addition to the standard microphone. The laryngograph recordings were organized by the Department of Phonetics at University College London and supervised by Adrian Fourcin. A set of Polyphone-like recordings were also made, for which a subset also had a laryngograph signal recorded. These recordings were made in English and in the speakers' native languages.
Associated text materials consist of ASCII versions of proceedings papers provided by the authors, collected in order to provide vocabulary items and data for language modeling. The texts are included both in their original form and in a normalised format (single column, 80 characters per line). Two hundred and fifty-three (253) speakers completed a short questionnaire giving information about their native language and other languages they speak, and providing details about their knowledge of English. The collection of texts and questionnaires was carried out by email.
The entire corpus comprises three subcorpora including:
A final report on the TED corpus has been written for the EuroCOCOSDA project. This report details the recording procedure, the data processing and organisation, and provides guidelines for eventual transcription of the corpus.
Financial support for data collection and preparation of the TED corpus was provided by the LRE projects, EuroCOCOSDA and RELATOR. The pressing onto CD of TEDspeeches and TEDlaryngo was financed by ELSNET. The TEDphone corpus will soon also be available on CD. Final arrangements are being made for the distribution of the entire corpus by ELRA.
F. Schiel, L. Lamel, TED Final Report, Deliverable D12, LRE project 62-057, EuroCOCOSDA, June, 1995.
FOR INFORMATION
The TED corpus will cost 300 ECU. ELSNET members will receive a 50%
discount on this price. The corpus may be obtained from:
Khalid Choukri, ELRA
87, Avenue d'Italie
F-75013 Paris
Tel: +33 1 45 86 53 00
Fax: +33 1 45 86 44 88
Email: elra@calvanet.calvacom.fr
The fourth annual European Summer School on Language and Speech Communication will be held this year at the Technical University of Budapest from July 8-19, 1996, on the topic of ``Dialogue Systems.'' The choice of this topic reflects the growing interest of both NL and speech researchers in the theoretical and practical issues associated with the design and use of computer systems which are able to participate in spoken or written language dialogues.
Courses at the Summer School will include short plenary sessions dedicated to surveys or to particularly difficult or controversial topics in the field. Both plenary and parallel sessions will be offered, and several of the courses will include practical exercises. In addition, there will be ample opportunities for students to present their own work, not only in formal presentations, but also in informal poster sessions. As is fitting in a Summer School on dialogue, participants will be encouraged to play an active part in the learning process. Background knowledge in a relevant area such as linguistics, speech processing, artificial intelligence, computer science or psychology would be useful, but no prior experience in the area of dialogue systems will be assumed. The Summer School is open to advanced undergraduate students, PhD students, postdocs, and staff members from academic and industrial organisations.
The emphasis will be on small-group work and interaction between participants and lecturers. The number of participants will therefore be limited to 60. Because it is expected that the Summer School will be oversubscribed, pre-registration is strongly recommended. The deadline for pre-registration is May 1, 1996.
Fees and Accommodation Costs
Registration fees are as follows:
full time students | 130 USD or 190 DM |
academic staff members | 260 USD or 380 DM |
employees of industry | 520 USD or 760 DM |
Deadline for payment: June 1, 1996. After June 1, a late fee will be added, and the resulting costs will be:
full time students | 143 USD or 210 DM |
academic staff members | 286 USD or 418 DM |
employees of industry | 572 USD or 836 DM |
The local organisers have made available two possible options for accommodation, and the costs of these are:
Cost: 15 USD/person/night (breakfast included) [total 195 USD for 13 nights]
Cost: USD 68/night [total 884 USD for 13 nights] for two persons, or USD 38/night/person [total 494 USD for 13 nights]. Cost includes breakfast.
A limited number of grants will be made available for students from Central or Eastern Europe. To be eligible for a grant, applicants should send a letter of justification to the local organiser, Dr. Klara Vicsi (address given below). The letter should include details about the applicant's background in language and speech technology, and explain his or her specific reasons for wanting to attend the Summer School. It would also be advantageous to attach a letter of recommendation from a supervisor.
Sponsors
This year's school is sponsored by the European Network in Language and Speech (ELSNET) and the Copernicus programme through the project ELSNET Goes East. Additional support has been provided by the European Speech Communication Association (ESCA) and the European Chapter of the Association of Computational Linguistics (EACL). Local support is provided by the Technical University of Budapest, the National Scientific Research Fund, and SUN Europe.
FOR INFORMATION
All correspondence regarding the 1996 ELSNET Summer School should be
addressed to:
Technical University of Budapest, Conference Office
Muegyetem rakpart 3.-9
Building K, 1st floor, room 64
H-1521 Budapest, Hungary
Tel: +36 1 463 2666
Fax: +36 1 463 3542
Email: school@khmk.bme.hu
WWW: http://www.ttt.bme.hu or
http://www.cogsci.ed.ac.uk/elsnet/summerschool96.html
Week 1
9.00 - 10.45 Plenary Session
Monday-Tuesday: Historical overview of the dialogue systems field. Lecturer: Louis Boves, KPN Research, The Netherlands
Wednesday-Thursday: Dialogue types. Lecturer: Francoise Neel, LIMSI-CNRS, France
Friday: Prosody in spoken language. Lecturer: Julia Hirschberg, AT&T Bell Laboratories, USA
10.45 - 11.15 Coffee break
11.15 - 12.00 Plenary Session
Monday-Friday: Student presentations
12.15 - 13.00 Plenary Session
Monday-Friday: Multimodal systems. Lecturer: Niels Ole Bernsen, Centre for Cognitive Science, Roskilde University, Denmark
13.00 - 15.00 Lunch
15.00 - 16.45 Parallel Sessions
Option 1: Monday-Friday: Speech input and output. Lecturers: Rolf Carlson, Kjell Elenius and Bjorn Granstrom, Dept of Speech Comm. & Music Acoustics, KTH, Sweden
Option 2: Monday-Friday: Empirical foundations of dialogue design (with practicals). Lecturers: Hans Dybkjaer and Laila Dybkjaer, Centre for Cognitive Science, Roskilde University, Denmark
Week 2
9.00 - 10.45 Plenary Session
Monday: Prosody in spoken dialogue. Lecturer: Julia Hirschberg, AT&T Bell Laboratories, USA
Tuesday-Wednesday: Evaluation of dialogue systems. Lecturer: Paolo Baggia, CSELT, Italy
Thursday: Commercial realities. Lecturer: Nick Ostler, Linguacubun Ltd., UK
Friday: Panel discussion on the topic: Advantages and drawbacks of dialogue technology. Gyoergy Takacs, Ericsson Ltd., Hungary, and all lecturers
10.45 - 11.15 Coffee break
11.15 - 13.00 Parallel Sessions
Option 1: Monday-Friday: Natural language processing input output. Lecturer: Paul Heisterkamp, Daimler-Benz AG, Germany
Option 2: Monday-Friday: Dialogue modelling (with practicals). Lecturer: Harald Aust, Philips Forschungslaboratorien, Germany
13.00 - 15.00 Lunch
15.00 - 16.45 Parallel Sessions
Option 1: Monday-Friday: Human factors. Lecturer: Sharon Oviatt, Oregon Graduate Institute, USA
Option 2: Monday-Friday: System issues (with practicals). Lecturer: Norman Fraser, Vocalis Ltd, UK
The Institute for Electronic Systems at Aalborg University, Denmark has recently been awarded major funding for a programme in intelligent multimedia, called the Multimodal and Multimedia User Interfaces (MMUI) initiative. The initiative is funded by the Faculty of Science and Technology within Aalborg University and involves the implementation of educational MMUI (at the MEng/Sc and PhD levels), the production of real-time MMUI demonstrators, and the establishment of a strong technology-based group of MMUI experts. The Institute for Electronic Systems has a strong track-record of research in the area of real-time processing of intelligent multimedia systems.
Several other teams at Aalborg University are also participating in the initiative, including: the Center for PersonKommunikation (CPK) which works in the area of speech, language and interactive dialogue; the Laboratory of Image Analysis (LIA) which focuses on image processing and vision; the Laboratory for Medical Informatics (MI) which specialises in automated diagnostics and expert systems; and the Computer Science Group (CSG) which works on theories, platforms, and tools.
CPK is heading the MMUI initiative. The budget of approximately 2.5 million DKK includes funding for two visiting professors who will assist in establishing a new curriculum and research. Dr. Paul McKevitt started at the CPK on February 1st, 1996 and a second visitor will arrive during the Fall of 1996.
Such initiatives will ensure the position of countries in the EU in the construction of Super Information Highways.
FOR INFORMATION
Further details about Aalborg's work in the area of intelligent
multimedia systems are available from:
Prof. Paul Dalsgaard
Center for PersonKommunikation (CPK)
Fredrik Bajers Vej 7A
Institute of Electronic Systems
Aalborg University
DK-9220 Aalborg, Denmark
Tel: +45 98 15 42 11 + tone + 4866
Fax: +45 98 15 15 83
Email: pd@cpk.auc.dk
URL: http://www.kom.auc.dk/CPK/MMUI.html.
The institute's research covers five main areas: parallelism and architecture; symbolic calculus, programming and software engineering; artificial intelligence; robotics, image and vision; and signal processing, automatics and robotics. Each of these areas has associated projects.
In the field of language and speech, more specifically, basic research focuses on the areas of:
Over the next three years, research will focus mainly on the following fields:
FOR INFORMATION
The site coordinator at IRISA is:
Jacques Siroux, ENSSAT
6 rue de Kerampont
BP 447
F-22305 Lannion Cedex, France
Tel: +33 96 46 50 30
Fax: +33 96 37 01 99
Email: siroux@enssat.fr
Current research efforts are focused on the fields of speech recognition and coding. More specifically:
We have also defined a new complexity measure for continuous speech recognition tasks that allows a more accurate performance comparison than perplexity, and we have developed an ANN-based recognition system that utilises a hybrid approach called SLiding HMM (SLHMM) modelling with successful results.
The development of word-spotting systems with subword units is under investigation. The group is also interested in the study of discriminative techniques for training, and has developed several algorithms for discriminative VQ codebook design and discriminative feature transformation.
For the next few years, we intend to focus our efforts on two main goals: the development of a continuous speech dictation machine and a robust front-end for telephone information services and the dictation machine itself. Thus, we are interested in the study of the syntactic-semantic level of continuous speech recognition systems in order to obtain more flexible grammatical rules and reduce the required perplexity. In relation to the area of robust speech recognition, we are interested in the study of robust feature analysis, adaptation and the application of discriminative techniques to feature extraction and selection.
Our future interests in this area are in the study of joint channel/source coding, discrete transform coding, and their application to speech signals and images.
Victoria Sánchez
Research Group on Signal Processing and Communications
Dpto. de Electronica y Tecnologia de Computadores
Facultad de Ciencias, Universidad de Granada
Campus Universitario Fuentenueva s/n
18071-Granada, Spain
Email: victoria@hal.ugr.es
Translating texts between English and Russian is a difficult problem; the task requires not only subject-area expertise, but also a deep knowledge of Russian grammar. Lingvistica '93, a small private company based in Kharkov, Ukraine, has developed a quick, convenient, user-friendly translation tool, called PARS, which simplifies English-Russian translation.
PARS runs on IBM PCs under DOS(TM) and Windows(TM), in either stand-alone or network mode. The system can use up to four subject-related dictionaries, in any combination, during a translation session. For example, it's more natural to use a medical dictionary instead of a technical one to translate a medical text: take the word ``cell,'' which means different things in medical and electrical engineering contexts.
PARS includes a large set of two-way English-Russian dictionaries containing more than 400,000 words and idioms. The topic areas covered are: machine building, business, computers, medicine, and aerospace engineering, among others. Plans to add the Polyglossum dictionaries compiled by ETS Ltd, in Moscow, will make PARS the world's largest English-Russian and Russian-English machine translation database.
The system includes the following other features:
The dictionary extending utility is called ``PARS visiting card:'' it lets the user enter words and idioms with their translations into the dictionary, and assign to them grammatical characteristics such as part of speech, gender, declension/conjugation, etc.
In January 1996, we began marketing the world's first commercial Ukrainian-English MT system, PARS/U, which runs on IBM PCs under the Windows operating system, and which may be used both in stand-alone and network modes.
PARS/U, which is based on PARS, includes a bi-directional English-Ukrainian general dictionary of 33,000 words and phrases. We have plans to enlarge this dictionary in the near future, and terminological dictionaries for the areas of computer science, ecology and technology are under development. Like PARS, PARS/U allows the user to amend the existing dictionaries and create new ones.
We have made a thorough description of Ukrainian morphology and have developed a Ukrainian grammatical dictionary. Dozens of rules for analysis and synthesis of Ukrainian texts have also been written and included in PARS/U.
Although PARS/U has much of the functionality of PARS, it also takes into account many Ukrainian peculiarities, such as the seven morphological cases in Ukrainian (Russian has only six) and differences between Ukrainian and Russian participial forms.
Like PARS, PARS/U offers the user a choice between alternative translations of polysemantic words. When analysing a word, PARS/U uses the Ukrainian grammatical dictionary. Taking into account the high morphological ambiguity of Ukrainian words, the system displays variants in those situations when it cannot choose between two alternatives, for example, when noun/adjective ambiguity is encountered. In such situations, the user makes the final choice.
PARS/U is supplied both on diskettes and on CD-ROM. In the latter case, the CD contains the following products:
Dr. Michael Blekhman
94a Prospekt Gagarina, apt.111
Kharkov 310140, Ukraine.
Tel: +380 572 277 135 / 400 036
Fax: +380 572 400 601
Email: blekhman@lotus.kpi.kharkov.ua
The following topics were discussed:
ELSNET-2: The ELSNET-2 proposal, submitted in November 1995, has been evaluated very positively and a request for funding of 900,000 ECU for a three-year period has been put forward to the Commission for decision. The contract is expected to be ready by the end of May or early June.
Review of ELSNET-1: The Commission is currently reviewing 7 Networks of Excellence (NoE), including ELSNET, with the objective of evaluating the concept of NoEs in general. On February 9 a review kick-off meeting took place in Brussels. In early March, a questionnaire, produced by the Commission, was sent to some 25 nodes of ELSNET. On March 15 the reviewers --- M. Poza (COTEC, Spain) and H. Gallaire (Xerox, Grenoble) for ELSNET --- will visit the Network's Coordinating site in Utrecht. By the end of June, a final report is expected.
The ELSNET Foundation: In January 1996 the ELSNET Foundation was established. The Foundation will serve as an instrument for the ELSNET Executive Board in cases where it is desirable for the Network to present itself as a legal entity, e.g., in case of participation in new projects. All EB members joined the Board of the ELSNET Foundation.
Cooperation between ELSNET and ELRA: In January 1996, ELSNET and ELRA representatives met to discuss opportunities for cooperation. This resulted in the following agreements:
New project proposals submitted by ELSNET (via the ELSNET Foundation):
New and ongoing activities during the interim period (January-August 1996):
Grants: Because the Commission has agreed in principle to a prolongation of ELSNET, it was agreed to release the modest amount of money set aside to bridge a possible funding gap between ELSNET and its successor. It was decided to give all ELSNET Nodes the opportunity to apply for grants up to 5,000 ECU. An announcement has been distributed via elsnet-forum [Yvonne van Holsteijn: elsnet@let.ruu.nl].
New Nodes: Nokia Research Center (Finland) and TNO (The Netherlands) were accepted as ELSNET Nodes [contact Nokia Research Center: Mikko Lehtokangas, mikko.lehtokangas@research.nokia.fi; contact TNO: David van Leeuwen: vanleeuwen@tm.tno.nl].
Next meeting: The next meeting of the Executive Board will take place on Thursday, June 20, 1996. Place: to be decided.
Abstract: The task of the text planner is to convert an input specification into an output suitable for manipulation by the tactical generator; a task which involves both content selection and content organisation. This paper takes a look at some of the resources needed for this. Using the results of a corpus analysis I first show how both the underlying domain and communication knowledge may be modelled in the Knowledge Representation Language, LOOM. Having looked at possible methods of specifying input, I go on to discuss how certain types of variation may be expressed by discourse plans. Then, taking examples from the corpus, I demonstrate how these may be implemented by the use of LOOM production rules. Finally, I look at the form of the output of the production rules and suggest what further resources are necessary in order to arrive at the desired output for the tactical generator.
Johannes Matiasek & Harald Trost: Implementing HPSG in FUF --- An experiment in the reusability of linguistic resources, OEFAI-TR-95-14. (Extended version of a paper that appeared in Proc. European Workshop on Natural Language Generation, Leiden, The Netherlands).
Abstract: In practical systems it is often required to reuse existing resources. Such an approach clearly has advantages: it speeds up the development process considerably if one doesn't have to start from scratch. However, combining resources not designed to work together is not a trivial task. An HPSG grammar of German has been implemented in FUF, a unification-based text generator. Although FUF is largely theory-neutral, some of its characteristics diverge from the processing requirements imposed by HPSG in its strict sense. The most prominent discrepancy is that HPSG, being a lexically driven formalism, lends itself best to a head-driven bottom-up processing strategy, whereas FUF, at least by default, uses a top-down, category-driven approach. FUF also lacks a morphological component able to deal with the rich German inflectional system. Therefore a two-level morphology component, X2MorF, has been added. We describe the problems arising when integrating these three resources and the transformations and adaptations made to them, leading to a wide coverage tactical generator for German.
Johannes Matiasek & Harald Trost: Requirements on linguistic knowledge sources for multilingual generation, OEFAI-TR-95-15. (Appeared in Proc. of the IJCAI-95 Workshop on Multilingual Generation)
Abstract: Multilingual generation is often regarded as a possible alternative to machine translation in a number of application scenarios. The expectation is that for these applications multilingual generation will prove to be an inherently easier solution. In this paper we investigate whether this claim is substantial. In particular, we consider the linguistic knowledge sources needed for multilingual generation and compare them to those needed for machine translation. By studying some examples in detail we conclude, not surprisingly, that this seems not to be the case in general. Only by shifting the emphasis from producing equivalent texts to producing texts conveying the same message may this goal be achieved. On the other hand, such an approach places additional demands on other components of the system.
Georg Niklfeld, Hannes Pirker & Harald Trost: Using two-level morphology as a generator-synthesizer interface in concept-to-speech generation, OEFAI-TR-95-22. (Appeared in Proc. of the HCM-Workshop on Spoken Dialogue and Discourse, Dublin, April 1, 1995)
Abstract: In a project for the development of a concept-to-speech system for German, we apply extended two-level-morphology (Trost 1991) to provide a unified solution to the tasks of morphotactics, segmental (morpho)phonology, syllabification and assignment of stress. Starting from a lexeme-based lexicon, we show that a declarative two-level-implementation of a single rule-corpus complemented with feature filters is sufficient for a comprehensive account of the various mutual influences holding between separate phonological dimensions in the phonology of German. Information from higher levels of linguistic structure, up to textual representation, can be exploited in our system by performing a look-up of relevant feature-values through the filter conditions.
Johannes Matiasek: The generation of idiomatic and collocational expressions, OEFAI-TR-95-29. (Appears in Proc. of 13th European Meeting on Cybernetics and Systems Research (EMCSR'96), Vienna, Austria, April 9 - 12, 1996).
Abstract: Collocations whose semantic content is not or only partially composed from the semantic content of their parts are often viewed as problematic for generation. In this paper a tactical generator combining FUF as the generation engine and HPSG as the grammar framework is presented. It is shown that the lexicon-driven approach to syntactic and semantic processing is well-suited for the generation of idioms exhibiting various degrees of noncompositionality and syntactic restrictions.
FOR INFORMATION
OFAI's technical reports can be FTP'ed from ftp://www.ai.univie.ac.at/papers
or ordered in printed form from:
Ms.Gerda Helscher
Austrian Research Institute for Artificial Intelligence (OFAI)
Schottengasse 3
A-1010 Vienna, Austria
Email: gerda@ai.univie.ac.at.
Abstract: This volume of AIPUK gives a comprehensive account of the recording, processing and analysis platforms for spontaneous speech as they have been developed for German at IPDS Kiel. It is intended as a compendium for working with the speech data (signal and label files) of The Kiel Corpus of Read Speech (IPDS, 1994) and The Kiel Corpus of Spontaneous Speech (IPDS, 1995), which the IPDS is building up as a CD-ROM archive for phonetic research (1994ff.), and as part of a phonetic data bank of spoken North High German. Only data are incorporated into this database that have been processed at the phonetic level, i.e., that have been annotated segmentally and in parts prosodically. This essential constraint on speech data archiving provides a powerful tool for systematic large-scale data bank search in the symbolic and signal domains, to investigate a wide spectrum of spontaneous German pronunciation at the word and utterance levels. This handbook lays the methodological and theoretical foundations for this research.
FOR INFORMATION
Copies available at
IPDS
Universitaet Kiel
D-24098 Kiel, Germany
Fax: +49 431 880 1578
Email: ipds@ipds.uni-kiel.de
Abstract: Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the size of the sample; sometimes small positive quantities are added to divisor and dividend in order to avoid zero estimates for types missing from the sample. These approaches are obvious and simple, but they lack principled justification, and yield estimates that can be wildly inaccurate. I.J. Good and Alan Turing developed a family of theoretically well-founded techniques appropriate to this domain. Some versions of the Good-Turing approach are very demanding computationally, but we define a version, the Simple Good-Turing estimator, which is straightforward to use. Tested on a variety of natural-language-related data sets, the Simple Good-Turing estimator performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
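As an illustration of the two simple approaches that the abstract criticises, the following Python sketch computes a relative-frequency estimate and an additive ("add a small constant") estimate for a toy sample. It also computes the basic Good-Turing estimate of the total probability of unseen types, which is the number of types seen exactly once divided by the sample size. This is not the Simple Good-Turing estimator defined in the report; the function names, the toy sample, the assumed vocabulary size (num_types) and the constant delta are all hypothetical choices made for this example.

```python
from collections import Counter


def relative_frequency(counts, item):
    """Observed count divided by sample size; gives zero for unseen types."""
    total = sum(counts.values())
    return counts.get(item, 0) / total


def additive_estimate(counts, item, num_types, delta=0.5):
    """Add delta to every count so that unseen types get a non-zero estimate.

    num_types (assumed number of possible types) and delta are
    illustrative assumptions, not values taken from the report.
    """
    total = sum(counts.values())
    return (counts.get(item, 0) + delta) / (total + delta * num_types)


def good_turing_unseen_mass(counts):
    """Basic Good-Turing estimate of the total probability of all unseen
    types: (number of types observed exactly once) / (sample size)."""
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


# Toy sample of 8 tokens: "the" occurs 3 times, five other words once each.
sample = "the cat sat on the mat the end".split()
counts = Counter(sample)

print(relative_frequency(counts, "the"))                # 3 / 8
print(relative_frequency(counts, "dog"))                # unseen type: 0.0
print(additive_estimate(counts, "dog", num_types=10))   # 0.5 / (8 + 5)
print(good_turing_unseen_mass(counts))                  # 5 / 8
```

On this toy sample the relative-frequency estimate assigns no probability at all to the unseen word, while the additive estimate depends entirely on the arbitrary choices of delta and num_types, which is the lack of principled justification the abstract refers to.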
Roger Evans & Gerald Gazdar: DATR: A language for lexical knowledge representation, Cognitive Science Research Paper 382, University of Sussex, June 1995. (Price £1.50).
Abstract: Much recent research on the design of natural language lexicons has made use of nonmonotonic inheritance networks as originally developed for general knowledge representation purposes in Artificial Intelligence. DATR is a simple, spartan language for defining non-monotonic inheritance networks with path-value equations, one that has been designed specifically for lexical knowledge representation. In keeping with its intendedly minimalist character, it lacks many of the constructs embodied either in general purpose knowledge representation languages or in contemporary grammar formalisms. The present paper shows that the language is nonetheless sufficiently expressive to represent concisely the structure of lexical information at a variety of levels of linguistic analysis. The paper provides an informal example-based introduction to DATR and to techniques for its use, including finite state transduction, the encoding of DAGs and lexical rules, and the representation of ambiguity and alternation. Sample analyses of phenomena such as inflectional syncretism and verbal subcategorization are given which show how the language can be used to squeeze out redundancy from lexical descriptions.
FOR INFORMATION
These reports can be obtained from:
Librarian
School of Cognitive and Computing Sciences
University of Sussex
Brighton BN1 9QH, UK
March 31-April 2, 1996: AISB96 Workshops and Tutorials, University of Sussex, Brighton, UK. For information, contact: Lynne Cahill, School of Cognitive & Computing Sciences, Univ. of Sussex, Brighton, BN1 9QH, UK, Email: lynneca@cogs.susx.ac.uk. URL: http://www.cogs.susx.ac.uk/aisb/aisb96/.
May 2-4, 1996: Second International Conference on Mathematical Linguistics, Tarragona, Spain. For information, contact: Carlos Martin Vide, Email: cmv@astor.urv.es.
June 4-6, 1996: International Conference on Natural Language Processing and Industrial Applications (NLP+IA 96), Moncton, New Brunswick, Canada. For information, contact: Chadia Moghrabi, Email: nlp-ia@umoncton.ca.
June 23-28, 1996: The 34th Annual Meeting of the Association for Computational Linguistics (ACL 96), Santa Cruz, CA, USA. For information, contact: Email: ACL96-questions@linc.cis.upenn.edu. URL: http://www.cs.columbia.edu/~acl.
June 28, 1996: Second meeting of the Special Interest Group in Computational Phonology (SIGPHON 96), Santa Cruz, CA, USA. For information, contact: SIGPHON 96, c/o Richard Sproat, Bell Laboratories, Room 2D-451, 600 Mountain Avenue, Murray Hill, NJ 07974, USA, Email: sigphon@research.att.com.
July 8-19, 1996: The 4th European Summer School on Language and Speech Communication (ELSNET Summer School), Budapest, Hungary. For information, contact: Klara Vicsi, Email: school@khmk.bme.hu.
August 5-9, 1996: International Conference on Computational Linguistics (COLING 96), Copenhagen, Denmark. For information, contact: Bente Maegaard, Email: col96@cst.ku.dk.
August 12-23, 1996: The 8th European Summer School in Logic, Language and Information (ESSLLI-96), Prague, Czech Republic. For information, contact: Email: esslli@ufal.mff.cuni.cz, or URL: http://ufal.ms.mff.cuni.cz/.