ELSNews, vol. 5.1, February 1996

    Programme Available!

    The programme is now available for the 1996 European Summer School on Language and Speech Communication (better known as the ELSNET Summer School).

    The topic of this year's School is Dialogue Systems, and some of the questions that will be addressed during the two weeks in July include:

    See pages 4-5 of this issue of ELSNews for more information.

    ELSNET Linguistic Resources to be Distributed by ELRA

    Representatives of ELSNET and ELRA are currently negotiating the terms of an agreement which would transfer to ELRA the distribution rights of language resources produced by ELSNET.

    The resources under consideration are:

    The original ECI Multilingual Corpus I will continue to be distributed by ELSNET, although ELRA will assist in promoting the CD.

    Members of ELSNET will be given substantial discounts on the costs of the resources which are being transferred to ELRA, and members of ELRA will be given an even greater discount.

    A coordination committee has been established to help harmonise the actions and activities of ELSNET and ELRA. All members of this committee are involved in both organisations.

    A complete catalogue of ELRA's resources will soon be available, and information about these resources will be published in future issues of ELSNews.

    ``Extra! Extra! Read all about it!''

    Newspapers on-line

    Isabel Trancoso (INESC), member of ELSNET's Linguistic and Speech Resources Task Group, recently surveyed the readers of elsnet-list to find out about newspapers available via the World Wide Web. The goal of this survey was to produce a comprehensive list of web addresses (URLs) for newspapers all over the world that are available on the Internet.

    Dr. Trancoso explained the motivation for the survey: ``Newspaper text may constitute very important linguistic resources for research not only from the point of view of written language processing, but also from the point of view of speech processing (for building language models for large vocabulary continuous speech recognition [systems], for instance).''

    A number of laboratories and institutes have made extensive use of news text for research purposes. Text from Le Monde and the Wall Street Journal, for example, has been used by LIMSI-CNRS for their work in multilingual speech recognition. Newspaper corpora have also frequently been used in developing text-to-speech systems.

    The result of Dr. Trancoso's survey --- a list of more than 30 newspapers worldwide --- was posted to elsnet-list on Feb. 8. The list is alphabetical by country (European papers are listed first, and non-European papers are listed at the end), and for each newspaper, the URL is given along with information about periodicity (e.g., daily or weekly), language, and copyright regulations. The list includes newspapers not only from Western European countries, but also from Central and Eastern Europe and from North and South America, in languages as diverse as Catalan, Dutch, Estonian, and Esperanto.

    The list was converted to HTML by Oliver Christ (IMS, Universität Stuttgart), and a link was made to the Stuttgart file from the ELSNET home page.

    If you'd like access to newstext on-line --- or if you'd just like to read the morning news from your computer! --- point your browser at:


    or follow the link from the ELSNET home page.

    Comments, new links, changes and updates are welcome! Please send them to:

    Oliver Christ
    IMS, Universität Stuttgart
    Azenbergstraße 12
    70174 Stuttgart, Germany
    Email: oli@ims.uni-stuttgart.de

    ELRA Sets up Shop

    ELRA (the European Language Resources Association) has established its offices in Paris, and the infrastructure is now in operation. The European Language Resources Distribution Agency (ELDA) has been set up by the Chief Executive, Khalid Choukri, to handle the actual distribution of ELRA resources.

    ELRA membership is open to any organisation, public or private. Full membership, with voting rights, is available to organisations established in the EU or European Economic Area. Purely for organisational purposes, members will be classified by their chief interest (spoken, written, or terminological resources). The annual membership fee has been set at a modest ECU 1000 to encourage broad participation.

    The address of ELRA's Paris office is:

    Khalid Choukri, ELRA/ELDA
    87, Avenue d'Italie
    F-75013 Paris, France
    Tel: +33 1 45 86 53 00
    Fax: +33 1 45 86 44 88
    Email: elra@calvanet.calvacom.fr

    Translanguage English Database

    Lori Lamel, LIMSI-CNRS and Florian Schiel, Universität München

    The TED (Translanguage English Database) corpus contains recordings of speeches made at the Eurospeech-93 conference in Berlin. The name of the corpus --- and its nickname, ``The Terrible English Database'' --- reflects the fact that a high percentage of the presentations at Eurospeech-93 were given in English by non-native speakers of English. TED was first conceived at Eurospeech-91 in Genoa, Italy, and follow-up discussions were held at the COCOSDA meeting in Chiavari (1991). The idea was first formally presented as a potential project by Joseph Mariani (LIMSI-CNRS) at the COCOSDA meeting in Banff (1992). He proposed to record the speeches made at Eurospeech-93 in Berlin and to distribute the data, with the aim of developing speech recognisers which will try to automatically recognise the speeches made at Eurospeech-95.

    The oral presentations at Eurospeech-93 were recorded as the result of extensive preparatory work by many individuals, including members of the EuroCOCOSDA Consortium, ESCA, and the local organisers in Berlin, particularly Professor Klaus Fellbaum, who worked closely with the Institute of Phonetics at the University of München.

    Of the 287 oral presentations at the conference, 224 were successfully recorded, providing a total of about 75 hours of speech material per channel. These recordings provide a relatively large number of speakers speaking a variant of the same language (English) over a relatively long period of time (15 min each + 5 min discussion) on a specific topic. A subset of speakers was recorded with a laryngograph in addition to the standard microphone. The laryngograph recordings were organised by the Department of Phonetics at University College London and supervised by Adrian Fourcin. A set of Polyphone-like recordings was also made, a subset of which also had a laryngograph signal recorded. These recordings were made in English and in the speakers' native languages.

    Associated text materials consist of ASCII versions of proceedings papers provided by the authors, collected in order to provide vocabulary items and data for language modelling. The texts are included both in their original form and in a normalised format (single column, 80 characters per line). Two hundred and fifty-three (253) speakers completed a short questionnaire giving information about their native language and other languages they speak, and providing details about their knowledge of English. The collection of texts and questionnaires was carried out by email.
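
    The normalised format mentioned above (single column, 80 characters per line) can be illustrated with a short sketch. The actual procedure used for the TED texts is not described in this article; the Python function below, including its name and its treatment of blank lines as paragraph breaks, is only an assumption about what such a normalisation might look like.

```python
import textwrap


def normalise_text(raw: str, width: int = 80) -> str:
    """Re-flow each paragraph of `raw` into one column of at most `width` characters.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p for p in raw.split("\n\n") if p.strip()]
    wrapped = []
    for paragraph in paragraphs:
        # Collapse line breaks and runs of whitespace before re-wrapping.
        flat = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(flat, width=width))
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = "This   is a two-column\nstyle line.\n\nSecond paragraph."
    print(normalise_text(sample))
```

    Re-wrapping paragraph by paragraph keeps paragraph boundaries intact, which may matter if the texts are later used to collect vocabulary or language-model data.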

    The entire corpus comprises three subcorpora including:

    All the recordings, with the exception of the telephone channel of the TEDphone recordings (which were directly sampled on a speech server in München), were made on digital audio tape (DAT) cassettes and were digitised by the Univ. of München for production on CD.

    A final report on the TED corpus has been written for the EuroCOCOSDA project. This report details the recording procedure, the data processing and organisation, and provides guidelines for eventual transcription of the corpus.

    Financial support for data collection and preparation of the TED corpus was provided by the LRE projects, EuroCOCOSDA and RELATOR. The pressing onto CD of TEDspeeches and TEDlaryngo was financed by ELSNET. The TEDphone corpus will soon also be available on CD. Final arrangements are being made for the distribution of the entire corpus by ELRA.


    L. Lamel, F. Schiel, A. Fourcin, J. Mariani, H. Tillmann, The Translanguage English Database (TED). In the Proceedings of ICSLP-94, Yokohama, Japan, September 1994.

    F. Schiel, L. Lamel, TED Final Report, Deliverable D12, LRE project 62-057, EuroCOCOSDA, June, 1995.

    The TED corpus will cost 300 ECU. ELSNET members will receive a 50% discount on this price. The corpus may be obtained from:

    Khalid Choukri, ELRA
    87, Avenue d'Italie
    F-75013 Paris
    Tel: +33 1 45 86 53 00
    Fax: +33 1 45 86 44 88
    Email: elra@calvanet.calvacom.fr

    Fourth European Summer School on Language and Speech Communication

    Dialogue Systems, Budapest, Hungary, July 8-19, 1996


    The fourth annual European Summer School on Language and Speech Communication will be held this year at the Technical University of Budapest from July 8 to 19, 1996, on the topic of ``Dialogue Systems.'' The choice of this topic reflects the growing interest of both NL and speech researchers in the theoretical and practical issues associated with the design and use of computer systems which are able to participate in spoken or written language dialogues.

    Courses at the Summer School will include short plenary sessions dedicated to surveys or to particularly difficult or controversial topics in the field. Both plenary and parallel sessions will be offered, and several of the courses will include practical exercises. In addition, there will be ample opportunities for students to present their own work, not only in formal presentations, but also in informal poster sessions. As is fitting in a Summer School on dialogue, participants will be encouraged to play an active part in the learning process. Background knowledge in a relevant area such as linguistics, speech processing, artificial intelligence, computer science or psychology would be useful, but no prior experience in the area of dialogue systems will be assumed. The Summer School is open to advanced undergraduate students, PhD students, postdocs, and staff members from academic and industrial organisations.

    The emphasis will be on small-group work and interaction between participants and lecturers. The number of participants will therefore be limited to 60. Because it is expected that the Summer School will be oversubscribed, pre-registration is strongly recommended. The deadline for pre-registration is May 1, 1996.

    Fees and Accommodation Costs

    Registration fees are as follows:

    full time students 130 USD or 190 DM
    academic staff members 260 USD or 380 DM
    employees of industry 520 USD or 760 DM

    Deadline for payment: June 1, 1996. After June 1, a late fee will be added, and the resulting costs will be:

    full time students 143 USD or 210 DM
    academic staff members 286 USD or 418 DM
    employees of industry 572 USD or 836 DM

    The local organisers have made available two possible options for accommodation, and the costs of these are:

    A limited number of grants will be made available for students from Central or Eastern Europe. To be eligible for a grant, applicants should send a letter of justification to the local organiser, Dr. Klara Vicsi (address given below). The letter should include details about the applicant's background in language and speech technology, and explain his or her specific reasons for wanting to attend the Summer School. It would also be advantageous to attach a letter of recommendation from a supervisor.


    This year's school is sponsored by the European Network in Language and Speech (ELSNET) and the Copernicus programme through the project ELSNET Goes East. Additional support has been provided by the European Speech Communication Association (ESCA) and the European Chapter of the Association for Computational Linguistics (EACL). Local support is provided by the Technical University of Budapest, the National Scientific Research Fund, and SUN Europe.

    All correspondence regarding the 1996 ELSNET Summer School should be addressed to:

    Technical University of Budapest, Conference Office
    Muegyetem rakpart 3-9
    Building K, 1st floor, room 64
    H-1521 Budapest, Hungary
    Tel: +36 1 463 2666
    Fax: +36 1 463 3542
    Email: school@khmk.bme.hu
    WWW: http://www.ttt.bme.hu

    Course Programme

    Week 1

     9.00 - 10.45   Plenary Session
                    Monday-Tuesday: Historical overview of the dialogue systems field.
                        Lecturer: Louis Boves
                                  KPN Research, The Netherlands
                    Wednesday-Thursday: Dialogue types.
                        Lecturer: Francoise Neel
                                  LIMSI-CNRS, France
                    Friday: Prosody in spoken language.
                        Lecturer: Julia Hirschberg
                                  AT&T Bell Laboratories, USA
    10.45 - 11.15   Coffee break
    11.15 - 12.00   Plenary Session
                    Monday-Friday: Student presentations
    12.15 - 13.00   Plenary Session
                    Monday-Friday: Multimodal systems.
                        Lecturer: Niels Ole Bernsen
                                  Centre for Cognitive Science,
                                  Roskilde University, Denmark
    13.00 - 15.00   Lunch
    15.00 - 16.45   Parallel Sessions
                    Option 1: Monday-Friday: Speech input and output.
                        Lecturers: Rolf Carlson, Kjell Elenius and Bjorn Granstrom
                                   Dept of Speech Comm. & Music Acoustics,
                                   KTH, Sweden
                    Option 2: Monday-Friday: Empirical foundations of dialogue design
                        (with practicals).
                        Lecturers: Hans Dybkjaer and Laila Dybkjaer
                                   Centre for Cognitive Science,
                                   Roskilde University, Denmark

    Week 2

     9.00 - 10.45   Plenary Session
                    Monday: Prosody in spoken dialogue.
                        Lecturer: Julia Hirschberg
                                  AT&T Bell Laboratories, USA
                    Tuesday-Wednesday: Evaluation of dialogue systems.
                        Lecturer: Paolo Baggia
                                  CSELT, Italy
                    Thursday: Commercial realities.
                        Lecturer: Nick Ostler
                                  Linguacubun Ltd., UK
                    Friday: Panel discussion on the topic:
                        Advantages and drawbacks of dialogue technology
                        Gyoergy Takacs
                        Ericsson Ltd., Hungary
                        and all lecturers
    10.45 - 11.15   Coffee break
    11.15 - 13.00   Parallel Sessions
                    Option 1: Monday-Friday: Natural language processing input and output.
                        Lecturer: Paul Heisterkamp
                                  Daimler-Benz AG, Germany
                    Option 2: Monday-Friday: Dialogue modelling (with practicals).
                        Lecturer: Harald Aust
                                  Philips Forschungslaboratorien, Germany
    13.00 - 15.00   Lunch
    15.00 - 16.45   Parallel Sessions
                    Option 1: Monday-Friday: Human factors.
                        Lecturer: Sharon Oviatt
                                  Oregon Graduate Institute, USA
                    Option 2: Monday-Friday: System issues (with practicals).
                        Lecturer: Norman Fraser
                                  Vocalis Ltd, UK

    Intelligent Multimedia at Aalborg University

    The area of intelligent multimedia involves real-time computer processing of perceptual input from speech, textual and visual sources, as well as the traditional display of text, voice, sound and video/graphics, with touch and virtual reality linked in. There has been a great upsurge of interest in this area over the last two years, but it is a field where many universities still do not have expertise.

    The Institute for Electronic Systems at Aalborg University, Denmark, has recently been awarded major funding for a programme in intelligent multimedia, called the Multimodal and Multimedia User Interfaces (MMUI) initiative. The initiative is funded by the Faculty of Science and Technology within Aalborg University and involves the implementation of educational MMUI (at the MEng/Sc and PhD levels), the production of real-time MMUI demonstrators, and the establishment of a strong technology-based group of MMUI experts. The Institute for Electronic Systems has a strong track record of research in the area of real-time processing of intelligent multimedia systems.

    Several other teams at Aalborg University are also participating in the initiative, including: the Center for PersonKommunikation (CPK), which works in the area of speech, language and interactive dialogue; the Laboratory of Image Analysis (LIA), which focuses on image processing and vision; the Laboratory for Medical Informatics (MI), which specialises in automated diagnostics and expert systems; and the Computer Science Group (CSG), which works on theories, platforms, and tools.

    CPK is heading the MMUI initiative. The budget of approximately 2.5 million DKK includes funding for two visiting professors who will assist in establishing a new curriculum and research. Dr. Paul McKevitt started at the CPK on February 1, 1996, and a second visitor will arrive during the Fall of 1996.

    Such initiatives will ensure the position of countries in the EU in the construction of Super Information Highways.

    Further details about Aalborg's work in the area of intelligent multimedia systems are available from:

    Prof. Paul Dalsgaard
    Center for PersonKommunikation (CPK)
    Fredrik Bajers Vej 7A
    Institute of Electronic Systems
    Aalborg University
    DK-9220 Aalborg, Denmark
    Tel: +45 98 15 42 11 + tone + 4866
    Fax: +45 98 15 15 83
    Email: pd@cpk.auc.dk
    URL: http://www.kom.auc.dk/CPK/MMUI.html

    Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA)

    IRISA is a public research laboratory comprising two units --- INRIA-Rennes, and a CNRS research unit associated with the University of Rennes 1 and the Institut National des Sciences Appliquées (INSA), also in Rennes. There are around 300 researchers based at IRISA, of whom 100 are PhD students.

    The institute's research covers five main areas: parallelism and architecture; symbolic calculus, programming and software engineering; artificial intelligence; robotics, image and vision; and signal processing, automatics and robotics. Each of these areas has associated projects.

    In the field of language and speech, more specifically, basic research focuses on the areas of:

    The goal of this work is the development of speech synthesis and recognition systems. Two systems are now operative.

    Over the next three years, research will focus mainly on the following fields:

    The institute has a long history of participating in national and international research projects (e.g., SUNDIAL, PALABRE), and student mobility programmes (e.g., ERASMUS).

    Participation in ELSNET

    Regarding IRISA's contribution to the network, we would like to promote the dialogue paradigm which is crucial in oral person-machine systems, and share our expertise and experience in the fields of person-machine communication and learning. In order to achieve these aims, IRISA would be willing to organise and participate in workshops. We would also welcome colleagues from other countries to spend time at the institute, working with our team.

    Selected Recent Publications

    The site coordinator at IRISA is:

    Jacques Siroux, ENSSAT
    6 rue de Kerampont
    BP 447
    F-22305 Lannion Cedex, France
    Tel: +33 96 46 50 30
    Fax: +33 96 37 01 99
    Email: siroux@enssat.fr

    Research Group on Signal Processing & Communications, Univ. of Granada

    The Research Group on Signal Processing and Communications (GiPSyC) is composed of ten researchers from the Department of Electronics and Computer Technology (DETC) of the University of Granada (UGR), Spain. The group is coordinated by Dr. Antonio J. Rubio (rubio@hal.ugr.es), and its members teach in the fields of signal processing, communications, computer networks, and robotics.

    Current research efforts are focused on the fields of speech recognition and coding. More specifically:

    Selected Recent Publications

    The ELSNET contact person at the Univ. of Granada is:

    Victoria Sánchez
    Research Group on Signal Processing and Communications
    Dpto. de Electronica y Tecnologia de Computadores
    Facultad de Ciencias, Universidad de Granada
    Campus Universitario Fuentenueva s/n
    18071-Granada, Spain
    Email: victoria@hal.ugr.es

    PARS and PARS/U: Translation Tools for Russian and Ukrainian

    Michael S. Blekhman, Lingvistica '93 Co.

    Translating texts between English and Russian is a difficult problem; the task requires not only subject-area expertise, but also a deep knowledge of Russian grammar. Lingvistica '93, a small private company based in Kharkov, Ukraine, has developed a quick, convenient, user-friendly translation tool, called PARS, which simplifies English-Russian translation.

    PARS runs on IBM PCs under DOS(TM) and Windows(TM), in either stand-alone or network mode. The system can use up to four subject-related dictionaries, in any combination, during a translation session. For example, it's more natural to use a medical dictionary instead of a technical one to translate a medical text: take the word ``cell,'' which means different things in medical and electrical engineering contexts.
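
    The effect of choosing subject dictionaries can be illustrated with a small sketch. This is not how PARS is implemented, which the article does not describe; it assumes, purely for illustration, that the dictionaries selected for a session are consulted in order and that the first one containing a word supplies its translation. The function name and the tiny sample dictionaries are hypothetical.

```python
from typing import Mapping, Optional, Sequence

MAX_DICTIONARIES = 4  # the session limit mentioned in the text


def translate_word(word: str, dictionaries: Sequence[Mapping[str, str]]) -> Optional[str]:
    """Return the translation from the first dictionary that contains `word`.

    Hypothetical lookup rule: dictionaries are searched in the order given.
    """
    if len(dictionaries) > MAX_DICTIONARIES:
        raise ValueError("at most four dictionaries may be used in one session")
    for dictionary in dictionaries:
        if word in dictionary:
            return dictionary[word]
    return None  # word not found in any selected dictionary


# Illustrative entries only, not taken from the PARS dictionaries.
medical = {"cell": "клетка"}
electrical = {"cell": "элемент"}
general = {"cell": "ячейка", "text": "текст"}

if __name__ == "__main__":
    print(translate_word("cell", [medical, general]))
    print(translate_word("cell", [electrical, general]))
```

    Under this assumption, placing the medical dictionary first yields the biological sense of ``cell,'' while placing the electrical engineering dictionary first yields the sense used for batteries.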

    PARS includes a large set of two-way English-Russian dictionaries containing more than 400,000 words and idioms. The topic areas covered are: machine building, business, computers, medicine, and aerospace engineering, among others. Plans to add the Polyglossum dictionaries compiled by ETS Ltd, in Moscow, will make PARS the world's largest English-Russian and Russian-English machine translation database.

    The system includes the following other features:

    A demo version of PARS is available to individuals wanting to see for themselves how the system works.

    PARS/U for Windows

    Because Russian has long been the only state language in the former Soviet Union, there are now relatively few native speakers of Ukrainian. So with the recent passing of a law which made Ukrainian the only state language in the country, it became a serious challenge for the majority to master another language, despite the similarities that exist between Ukrainian and Russian. At present, Ukraine is being drawn into the world scientific and business communities, and there is a vital need for commercial MT systems which translate between Ukrainian and the main European languages --- primarily English.

    In January 1996, we began marketing the world's first commercial Ukrainian-English MT system, PARS/U, which runs on IBM PCs under the Windows operating system, and which may be used both in stand-alone and network modes.

    PARS/U, which is based on PARS, includes a bi-directional English-Ukrainian general dictionary of 33,000 words and phrases. We have plans to enlarge this dictionary in the near future, and terminological dictionaries for the areas of computer science, ecology and technology are under development. Like PARS, PARS/U allows the user to amend the existing dictionaries and create new ones.

    We have made a thorough description of Ukrainian morphology and have developed a Ukrainian grammatical dictionary. Dozens of rules for analysis and synthesis of Ukrainian texts have also been written and included in PARS/U.

    Although PARS/U has much of the functionality of PARS, it also takes into account many Ukrainian peculiarities, such as the seven morphological cases in Ukrainian (Russian has only six) and differences between Ukrainian and Russian participial forms.

    Like PARS, PARS/U offers the user a choice between alternative translations of polysemantic words. When analysing a word, PARS/U uses the Ukrainian grammatical dictionary. Taking into account the high morphological ambiguity of Ukrainian words, the system displays variants in those situations when it cannot choose between two alternatives, for example, when noun/adjective ambiguity is encountered. In such situations, the user makes the final choice.

    PARS/U is supplied both on diskettes and on CD-ROM. In the latter case, the CD contains the following products:

    1. PARS/U;
    2. PARS for Windows;
    3. PARS for DOS;
    4. PARS Tutor: a Windows-based tutorial in two versions, in English and in Russian, with 100 illustrations in each version;
    5. Demo version of PARS for DOS;
    6. RUMP for DOS: an MT system for Russian-Ukrainian translation.

    To obtain a copy or demo version of PARS or PARS/U, please contact:

    Dr. Michael Blekhman
    94a Prospekt Gagarina, apt.111
    Kharkov 310140, Ukraine
    Tel: +380 572 277 135 / 400 036
    Fax: +380 572 400 601
    Email: blekhman@lotus.kpi.kharkov.ua

    Minutes of the February 1996 Executive Board Meeting

    The last meeting of the ELSNET Executive Board was held in Pisa on February 26 in conjunction with the first meeting of the newly-established ELSNET Foundation. This was the first Executive Board meeting in the history of ELSNET that covered the entire agenda and also ended on time! Thanks to Antonio Zampolli and his staff at the Istituto di Linguistica Computazionale for organising the meeting.

    The following topics were discussed:

    ELSNET-2: The ELSNET-2 proposal, submitted in November 1995, has been evaluated very positively, and a request for funding of 900,000 ECU for a three-year period has been put forward to the Commission for a decision. The contract is expected to be ready by the end of May or early June.

    Review of ELSNET-1: The Commission is currently reviewing 7 Networks of Excellence (NoE), including ELSNET, with the objective of evaluating the concept of NoEs in general. On February 9 a review kick-off meeting took place in Brussels. In early March, a questionnaire, produced by the Commission, was sent to some 25 nodes of ELSNET. On March 15 the reviewers --- M. Poza (COTEC, Spain) and H. Gallaire (Xerox, Grenoble) for ELSNET --- will visit the Network's Coordinating site in Utrecht. By the end of June, a final report is expected.

    The ELSNET Foundation: In January 1996 the ELSNET Foundation was established. The Foundation will serve as an instrument for the ELSNET Executive Board in cases where it is desirable for the Network to present itself as a legal entity, e.g., in case of participation in new projects. All EB members joined the Board of the ELSNET Foundation.

    Cooperation between ELSNET and ELRA: In January 1996, ELSNET and ELRA representatives met to discuss opportunities for cooperation. This resulted in the following agreements:

    New project proposals submitted by ELSNET (via the ELSNET Foundation):

    New and ongoing activities during the interim period (January-August 1996):

    Grants: Because the Commission has agreed in principle to a prolongation of ELSNET, it was agreed to release the modest amount of money set aside to bridge a possible funding gap between ELSNET and its successor. It was decided to give all ELSNET Nodes the opportunity to apply for grants up to 5,000 ECU. An announcement has been distributed via elsnet-forum [Yvonne van Holsteijn: elsnet@let.ruu.nl].

    New Nodes: Nokia Research Center (Finland) and TNO (The Netherlands) were accepted as ELSNET Nodes [contact Nokia Research Center: Mikko Lehtokangas, mikko.lehtokangas@research.nokia.fi; contact TNO: David van Leeuwen, vanleeuwen@tm.tno.nl].

    Next meeting: The next meeting of the Executive Board will take place on Thursday, June 20, 1996. Place: to be decided.

    Technical reports and research papers

    Austrian Research Institute for Artificial Intelligence (OFAI)

    Elizabeth Garner: Knowledge resources for the text planner: The domain model and plans for discourse, OEFAI-TR-95-13.

    Abstract: The task of the text planner is to convert an input specification into an output suitable for manipulation by the tactical generator; a task which involves both content selection and content organisation. This paper takes a look at some of the resources needed for this. Using the results of a corpus analysis I first show how both the underlying domain and communication knowledge may be modelled in the Knowledge Representation Language, LOOM. Having looked at possible methods of specifying input, I go on to discuss how certain types of variation may be expressed by discourse plans. Then, taking examples from the corpus, I demonstrate how these may be implemented by the use of LOOM production rules. Finally, I look at the form of the output of the production rules and suggest what further resources are necessary in order to arrive at the desired output for the tactical generator.

    Johannes Matiasek & Harald Trost: Implementing HPSG in FUF --- An experiment in the reusability of linguistic resources, OEFAI-TR-95-14. (Extended version of a paper that appeared in Proc. European Workshop on Natural Language Generation, Leiden, The Netherlands).

    Abstract: In practical systems it is often required to reuse existing resources. Such an approach clearly has advantages: it speeds up the development process considerably if one doesn't have to start from scratch. However, combining resources not designed to work together is not a trivial task. An HPSG grammar of German has been implemented in FUF, a unification-based text generator. Although FUF is largely theory-neutral, some of its characteristics diverge from the processing requirements imposed by HPSG in its strict sense. The most prominent discrepancy is that HPSG, being a lexically driven formalism, lends itself best to a head-driven bottom-up processing strategy, whereas FUF, at least by default, uses a top-down, category-driven approach. FUF also lacks a morphological component able to deal with the rich German inflectional system. Therefore a two-level morphology component, X2MorF, has been added. We describe the problems arising when integrating these three resources and the transformations and adaptations made to them, leading to a wide coverage tactical generator for German.

    Johannes Matiasek & Harald Trost: Requirements on linguistic knowledge sources for multilingual generation, OEFAI-TR-95-15. (Appeared in Proc. of the IJCAI-95 Workshop on Multilingual Generation)

    Abstract: Multilingual generation is often regarded as a possible alternative to machine translation in a number of application scenarios. The expectation is that for these applications multilingual generation will prove to be an inherently easier solution. In this paper we investigate whether this claim is substantial. In particular, we consider the linguistic knowledge sources needed for multilingual generation and compare them to those needed for machine translation. By studying some examples in detail, we conclude, not surprisingly, that this seems not to be the case in general. Only by shifting the emphasis from producing equivalent texts to producing texts conveying the same message may this goal be achieved. On the other hand, such an approach places additional demands on other components of the system.

    Georg Niklfeld, Hannes Pirker & Harald Trost: Using two-level morphology as a generator-synthesizer interface in concept-to-speech generation, OEFAI-TR-95-22. (Appeared in Proc. of the HCM-Workshop on Spoken Dialogue and Discourse, Dublin, April 1, 1995)

    Abstract: In a project for the development of a concept-to-speech system for German, we apply extended two-level-morphology (Trost 1991) to provide a unified solution to the tasks of morphotactics, segmental (morpho)phonology, syllabification and assignment of stress. Starting from a lexeme-based lexicon, we show that a declarative two-level-implementation of a single rule-corpus complemented with feature filters is sufficient for a comprehensive account of the various mutual influences holding between separate phonological dimensions in the phonology of German. Information from higher levels of linguistic structure, up to textual representation, can be exploited in our system by performing a look-up of relevant feature-values through the filter conditions.

    Johannes Matiasek: The generation of idiomatic and collocational expressions, OEFAI-TR-95-29. (Appears in Proc. of 13th European Meeting on Cybernetics and Systems Research (EMCSR'96), Vienna, Austria, April 9 - 12, 1996).

    Abstract: Collocations whose semantic content is not, or only partially, composed from the semantic content of their parts are often viewed as problematic for generation. In this paper a tactical generator combining FUF as the generation engine and HPSG as the grammar framework is presented. It is shown that the lexicon-driven approach to syntactic and semantic processing is well suited to the generation of idioms exhibiting various degrees of noncompositionality and syntactic restrictions.

    OFAI's technical reports can be FTP'ed from ftp://www.ai.univie.ac.at/papers or ordered in printed form from:

    Ms. Gerda Helscher
    Austrian Research Institute for Artificial Intelligence (OFAI)
    Schottengasse 3
    A-1010 Vienna, Austria
    Email: gerda@ai.univie.ac.at

    IPDS, Universitaet Kiel, Germany

    Klaus Kohler, Matthias Paetzold & Adrian Simpson: From scenario to segment. The controlled elicitation, transcription, segmentation and labelling of spontaneous speech, Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung, AIPUK 29, IPDS, Kiel University 1995 (Price: DEM 20,00 + postage)

    Abstract: This volume of AIPUK gives a comprehensive account of the recording, processing and analysis platforms for spontaneous speech as they have been developed for German at IPDS Kiel. It is intended as a compendium for working with the speech data (signal and label files) of The Kiel Corpus of Read Speech (IPDS, 1994) and The Kiel Corpus of Spontaneous Speech (IPDS, 1995), which the IPDS is building up as a CD-ROM archive for phonetic research (1994ff.), and as part of a phonetic data bank of spoken North High German. Only data that have been processed at the phonetic level, i.e., annotated segmentally and in part prosodically, are incorporated into this database. This essential constraint on speech data archiving provides a powerful tool for systematic large-scale data bank search in the symbolic and signal domains, to investigate a wide spectrum of spontaneous German pronunciation at the word and utterance levels. This handbook lays the methodological and theoretical foundations for this research.

    Copies are available from:

    Universitaet Kiel
    D-24098 Kiel, Germany
    Fax: +49 431 880 1578
    Email: ipds@ipds.uni-kiel.de

    University of Sussex, UK

    William A. Gale & Geoffrey Sampson: Good-Turing frequency estimation without tears, Cognitive Science Research Paper 407, University of Sussex, January 1996. (Price: £1.00).

    Abstract: Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the size of the sample; sometimes small positive quantities are added to divisor and dividend in order to avoid zero estimates for types missing from the sample. These approaches are obvious and simple, but they lack principled justification, and yield estimates that can be wildly inaccurate. I.J. Good and Alan Turing developed a family of theoretically well-founded techniques appropriate to this domain. Some versions of the Good-Turing approach are very demanding computationally, but we define a version, the Simple Good-Turing estimator, which is straightforward to use. Tested on a variety of natural-language-related data sets, the Simple Good-Turing estimator performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
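
    As a rough illustration of the estimators this abstract contrasts, the following Python sketch computes relative frequencies, add-constant estimates, and the basic (unsmoothed) Turing adjusted counts r* = (r + 1) N_{r+1} / N_r, where N_r is the number of types seen exactly r times. It is not the Simple Good-Turing estimator defined in the report, which additionally smooths the N_r values; the function names and the toy sample are assumptions made for illustration only.

```python
from collections import Counter


def relative_frequency(counts):
    """Observed count divided by sample size (zero for unseen types)."""
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}


def add_constant(counts, vocabulary_size, k=1.0):
    """Add k to every count; vocabulary_size includes unseen types."""
    total = sum(counts.values())
    return {item: (c + k) / (total + k * vocabulary_size)
            for item, c in counts.items()}


def unseen_mass(counts):
    """Turing's estimate of total probability of unseen types: N_1 / N."""
    total = sum(counts.values())
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / total


def turing_adjusted_counts(counts):
    """Map each observed count r to r* = (r + 1) * N_{r+1} / N_r.

    Counts r with N_{r+1} == 0 are left out: the raw formula breaks
    down there, which is the gap that smoothing the N_r is meant to fill.
    """
    freq_of_freq = Counter(counts.values())  # r -> N_r
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        if n_next > 0:
            adjusted[r] = (r + 1) * n_next / n_r
    return adjusted


# Hypothetical toy sample: 10 tokens, 6 types.
sample = Counter("a a a b b c c d e f".split())

if __name__ == "__main__":
    print(relative_frequency(sample))
    print(add_constant(sample, vocabulary_size=7))
    print(unseen_mass(sample))
    print(turing_adjusted_counts(sample))
```

    In the toy sample the most frequent count (r = 3) receives no adjusted value because no type occurs four times, which hints at why the raw formula needs smoothing before it is usable on real data.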

    Roger Evans & Gerald Gazdar: DATR: A language for lexical knowledge representation, Cognitive Science Research Paper 382, University of Sussex, June 1995. (Price: £1.50).

    Abstract: Much recent research on the design of natural language lexicons has made use of non-monotonic inheritance networks as originally developed for general knowledge representation purposes in Artificial Intelligence. DATR is a simple, spartan language for defining non-monotonic inheritance networks with path-value equations, one that has been designed specifically for lexical knowledge representation. In keeping with its intendedly minimalist character, it lacks many of the constructs embodied either in general purpose knowledge representation languages or in contemporary grammar formalisms. The present paper shows that the language is nonetheless sufficiently expressive to represent concisely the structure of lexical information at a variety of levels of linguistic analysis. The paper provides an informal example-based introduction to DATR and to techniques for its use, including finite state transduction, the encoding of DAGs and lexical rules, and the representation of ambiguity and alternation. Sample analyses of phenomena such as inflectional syncretism and verbal subcategorization are given which show how the language can be used to squeeze out redundancy from lexical descriptions.
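
    The idea of a non-monotonic inheritance network with path-value equations can be sketched very loosely in Python. This is not DATR syntax or a DATR implementation: the Inherit class, the lookup function, the node names and the values are all hypothetical, and the sketch assumes only two behaviours, namely that the longest defined prefix of a queried path wins, and that any unmatched remainder of the path is carried along when a node inherits from another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inherit:
    """Hypothetical marker: take the value from another node (and path)."""
    node: str
    path: tuple = ()


def lookup(network, node, path):
    """Resolve `path` at `node` by longest-prefix default plus inheritance."""
    equations = network[node]
    # Try the full path first, then ever shorter prefixes down to ().
    for cut in range(len(path), -1, -1):
        prefix = path[:cut]
        if prefix in equations:
            value = equations[prefix]
            if isinstance(value, Inherit):
                # Carry the unmatched remainder over to the inherited path.
                return lookup(network, value.node, value.path + path[cut:])
            return value
    raise KeyError(f"{node}: no equation covers path {path}")


# Toy lexicon (assumed for illustration): lexemes inherit from VERB,
# and "Sing" overrides the default past form.
network = {
    "VERB": {("cat",): "verb", ("mor", "past"): "root + ed"},
    "Love": {(): Inherit("VERB"), ("mor", "root"): "love"},
    "Sing": {(): Inherit("VERB"), ("mor", "root"): "sing",
             ("mor", "past"): "sang"},
}

if __name__ == "__main__":
    print(lookup(network, "Love", ("cat",)))
    print(lookup(network, "Love", ("mor", "past")))
    print(lookup(network, "Sing", ("mor", "past")))
```

    Only the exceptional fact about "Sing" is stated at that node; everything else is inherited, which is the kind of redundancy reduction the abstract refers to.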

    These reports can be obtained from:

    School of Cognitive and Computing Sciences
    University of Sussex
    Brighton BN1 9QH, UK

    Last update: March 20, 1996