ELSNews, vol. 4.5, November 1995

Permission is given by the Editor to distribute or re-use ELSNews articles appearing in the paper version of the newsletter, or in these Web pages, with the proviso that the following text is included with the re-used article:

``This article appeared in ELSNews 4.5 (November 1995), and is re-printed by permission from the Editor. ELSNews is the newsletter of ELSNET, the European Network in Language and Speech. Information about ELSNET is available from the Coordinator, at elsnet@let.ruu.nl.''

Table of Contents

ELRA update
General assembly meeting held in Luxembourg
RELATOR ends in May
Report on final results of RELATOR
Translating patents
Report on Danish academic-industrial cooperation
Correspondent Ewan Klein investigates CSLI's involvement in Verbmobil
Hungarian SME is bridging a gap
EB Minutes
Helpdesk set up in Edinburgh
Future Events and Misc

Resources: What is ELSNET's Role?

Ulrich Heid, Universität Stuttgart and Antonio Zampolli, CNR, Pisa

In the spring of 1995, the European Language Resources Association (ELRA) was founded. Now, after little more than half a year of existence, the Association has well over 70 member institutions from industry and academia. ELRA is now setting up an office which will serve as a European focal point for the dissemination and validation of linguistic resources, including text and speech corpora, along with other types of speech resources, lexicons, grammars, and possibly NLP and speech tools.

ELSNET can regard ELRA as a sort of ``grandchild'', since the creation of ELRA is the result of a joint action between the ELSNET-initiated RELATOR project and PAROLE, SPEECHDAT and POINTER, these last three being infrastructure design studies for the areas of written resources, speech resources, and terminology respectively. Given all this activity (and many other projects could be cited here, such as the EAGLES working groups on spoken language, text corpora and lexicons), some ELSNews readers may be wondering what the overall structure of the field is; what the distribution of labour is; what results will be available in the short to medium term; and --- last but not least --- what the role of ELSNET and ELSNET's Resources Task Group is. We can at best try to give a partial answer. It is partial in two senses: it is incomplete, and it is personal.

The creation of linguistic resources involves a number of different tasks: i) design and definition, ii) production, iii) quality control and validation in practical applications, and iv) dissemination. If we look at how each of these tasks is being covered by the current activities, we find that design and definition studies of linguistic resources have been carried out over the last few years, both by European projects and by some national projects. For example, in some of the EAGLES working groups, guidelines and design criteria (and criteria for quality control) have been or are being defined. These intermediate results, together with the results of the more dedicated PAROLE and SPEECHDAT studies, provide a solid starting point for the real work of resource production, which will soon be undertaken in the PAROLE, EuroWordNet and SPEECHDAT-2 projects. (These projects will perhaps be the first to concentrate on resource production.) Initial steps in quality control and validation of resources have been taken as well --- for example, in EAGLES --- but more practical validation work in concrete applications is necessary. This will be a job for ELRA, in cooperation with PAROLE, EuroWordNet and SPEECHDAT-2. Finally, resource distribution and the dissemination of information about linguistic resources will be done by ELRA in its role as a `mail-order' distributor. ELRA's `mail-order catalogue', based on the Survey of Linguistic Resources published by RELATOR (see p. 3), is now in the process of being compiled.

So, what is left to be done by ELSNET? Alongside the production effort being undertaken by PAROLE, EuroWordNet and SPEECHDAT-2, we feel that experimental work (i.e., definition and pilot development) towards the next generation of resources is needed. The European research community currently lacks resources which combine and integrate information for both speech- and text-based processing, and which can serve research work that covers both fields. Such resources --- a typical example is dialogue corpora annotated at multiple linguistic levels --- would be most useful in developing integrated services and dialogue systems.

This kind of preparatory work --- independent of, but in close collaboration with, other projects --- is exactly the sort of role that the ELSNET Resources Task Group can best fill. The members of the Task Group either coordinate or participate in nearly all the current EC (and many national) projects in the field of resources. Therefore, the Task Group is in an ideal position to provide an informed and critical view of the field. This overall perspective makes it possible for ELSNET to define models for integrated resources, and to prepare pilot actions just as it did in the past with RELATOR and ELRA.

Ulrich Heid and Antonio Zampolli are both members of ELSNET's Resources Task Group. The Task Group is currently preparing two 50,000-word samples of Italian and German newspaper text which are morpho-syntactically annotated according to EAGLES guidelines, and manually checked. These texts could thus serve as reference material for the testing of tools such as taggers. This will be the first practical test of EAGLES guidelines. Both corpora will be available in early 1996.

ELRA Elects Board of Directors

The European Language Resources Association (ELRA) held its first General Assembly meeting on Monday, September 25, in Luxembourg, just a few days after the official signing of the contract between the Association and the European Commission. The three-year contract, which has a start-date of October 1, will provide a total of 900 kecu for the purpose of setting up a European agency to distribute language resources.

It was reported at the meeting that ELRA now has 66 members. Many of these had sent representatives to the General Assembly in order to participate in the election of a Board of Directors. According to the Association's statutes, the 12-person Board of Directors comprises two persons representing the Written College; two representing the Spoken College; and two representing the Terminology College. The remaining six members of the Board are ``at large'' and may come from any of the three colleges. As there were only two nominations for each of the three college representatives, these nominees were elected without a formal vote. For the ``at large'' places, there were eight nominations for the six available places, so a vote was necessary. Those members of ELRA who had not sent a representative to Luxembourg cast their votes by proxy in advance of the meeting. The results of the election are shown here. There were few surprises, since most of the members of the elected Board had also been members of the Interim Steering Committee which had been directing the activities of ELRA since February. Bente Maegaard and Giuseppe Castagneri were the only Board members to be elected who had not previously been on the Interim Steering Committee.

Prior to the election of the Board, various reports were given by representatives of the three colleges, and by other individuals who had been active during ELRA's ``interim'' phase. Elizabeth Hinkelman (DFKI, Saarbrücken) presented a document entitled A Survey of Language Resources, which is a first attempt at providing a catalogue of spoken, written and terminological resources available in Europe. (Please see the article on page 3, opposite.) This survey was produced by the RELATOR project for ELRA, and a copy of it has been distributed to all ELRA members. Hard copies of the document are for sale to non-members at a commercial rate.

In his report to the ELRA members, the Association's Chief Executive, Khalid Choukri, indicated that one of his first actions will be to rent office space in Paris, possibly in the 13th arrondissement, near the Très Grande Bibliothèque de France. Choukri will, at the same time, form a company to employ himself and the technical and administrative staff who will be assisting him. This company will be a sub-contractor to ELRA, and will act as a central distribution agency.

Other actions which are high on Choukri's agenda during the coming months are:

Following the General Assembly, the new Board of Directors held its first meeting to elect officers. Antonio Zampolli was elected President. Three vice-presidents (one from each of the three colleges) were chosen. These were: Norbert Kalfon, Joseph Mariani, and Angel Martín-Municio. Thomas Schneider was elected Treasurer and Robin Bonthrone was elected Secretary.

In the next few weeks, Choukri will begin drafting a business plan to present to the ELRA Board for discussion at their next meeting in mid-November. The long-term goal of the business plan is to turn the Association into a financially self-sustaining organisation by the end of the three-year contract from the Commission. It is expected that ELRA will receive income not only from membership fees, but also from the sale of resources. Therefore, one item that is high on the Board's list of priorities is the establishment of a policy for pricing and licensing, so that resources may be offered by ELRA at the earliest possible date.

Another thorny issue before the Board is the question of how ELRA will deal with non-European customers. Currently the Association's statutes restrict membership to European organisations (or, in exceptional cases, individual professionals), but the statutes set no geographical restriction on organisations wishing to purchase ELRA's resources. There is a strong feeling among some of ELRA's industrial members that ``strategic'' data should not be made available to non-European organisations. Clearly the Board must reach some compromise in order to address the concerns of its members while at the same time keeping the market for European resources as large as possible.

Until he has a permanent office, Khalid Choukri may be reached at:
11 bis Avenue Division Général Leclerc
F-92160 Antony, France
Tel: +33 1 43 70 90 76
Fax: +33 1 46 66 95 55
Email: choukri.acsys.croisix@gmail.gar.no

Members of ELRA Board of Directors

For the Written College: Antonio Zampolli (Istituto di Linguistica Computazionale, CNR) and Thomas Schneider (SIETEC Systemtechnik GmbH).

For the Spoken College: Harald Höge (Siemens AG) and Louis Boves (Centre for Speech Processing Expertise (SPEX)).

For the Terminology College: Christian Galinski (Infoterm) and Norbert Kalfon (CL Servicios Lingüísticos S.A.).

Elected at large were: Robin Bonthrone (Deutsches Institut für Terminologie); George Carayannis (Institute for Language & Speech Processing); Giuseppe Castagneri (CSELT S.p.a.); Bente Maegaard (Center for Sprogteknologi); Joseph Mariani (LIMSI, CNRS); and Angel Martín-Municio (Real Academia de la Lengua Española).

RELATOR Publishes Resources Survey

With the founding of ELRA in February of this year, one might conclude that the RELATOR project had finished what it set out to do. But then one would be forgetting that the other goal of RELATOR was to set up and actually operate a resource distribution network --- a kind of precursor to the distribution service that ELRA is expected to provide in the near future. This was to be accomplished by gathering a number of existing resources and making them available to interested researchers.

Under the leadership of Elizabeth Hinkelman, at the DFKI in Saarbrücken, a team of researchers and technical staff have done just this, and information about these resources is now available electronically via the World Wide Web. The resources themselves are available via ftp and through the AFS (Andrew File System) wide-area network, and for those organisations that have an AFS client license, some of the software may even be remotely executed.

Web pages and ftp-able resources

RELATOR's multilingual web pages (in French, German, Italian, and English) contain a listing of all the resources available from RELATOR, along with descriptions of each resource in a standard format. These descriptions include --- in the case of software resources --- information about what the software does, how it is implemented, and which platform(s) it may be run on. In many cases, hyperlinks from the description page connect directly to an ftp server, so that the user may download a copy of the file(s).

A sample of the NLP resources includes, among other things: TULIP (Hepple/Pulman two-level phonology); PC Kimmo (SIL's two-level morphophonology); Xpost (the Xerox Markov-based part-of-speech tagger); the Brill Treebank rule-based part-of-speech tagger; and a demo of the Tomita parser. In the area of speech, there is extensive information about corpora, tools, lexicons and dictionaries. RELATOR's WWW server is located here.

Resources available via AFS

Resources which are executable files are accessible via AFS. AFS is a wide-area distributed file system, which allows files to be resident on any number of machines while appearing to the user to be `local.' In this way, it is similar to NFS (the Network File System), which is a local-area distributed file system now used widely with many workstations. AFS provides replication of read-only volumes (that is, multiple nodes can serve the same data, enhancing reliability and performance), and uses Kerberos for security, rather than just trusting the local Unix system. Kerberos is a security tool that goes beyond conventional password-based systems in providing secure mutually paranoid authentication in a distributed environment, so that the user and the server are assured of each other's identity in a way that third parties cannot exploit. AFS has a number of advantages over ftp in the distribution of language resources; the main benefits are the efficiency, transparency, security and robustness of the system.

Language Resources Survey

Another of the final results of the RELATOR project is the publication of the European Language Resources Survey. This 200-page document provides a broad overview of the current state of affairs in Europe with respect to resources. In the past, this information has resided with various experts in the language, speech, and terminology communities, but until now, it has not been documented, assembled and made available for the purposes of international exchange. The Survey was distributed at the ELRA General Assembly meeting, and is also available by anonymous ftp from the RELATOR server at: www.de.relator.research.ec.org/relator/Papers; filename survey.v1.ps.tar.gz.

FOR INFORMATION

ELSNET has 8 AFS client licenses available for distribution to ELSNET member sites. ELSNET members interested in obtaining the AFS client software, or in finding out more about AFS, should contact:
Email: elsnet@let.ruu.nl

Non-ELSNET members interested in obtaining AFS should contact the distributor, Transarc, at:
Tel: +1 412 338 6911
Email: scherb@transarc.com
Technical questions about the RELATOR AFS network may be addressed to:

Applications-Oriented Research

Lingtech/CST Collaboration: Patently Successful

Bente Maegaard, Center for Sprogteknologi and Viggo Hansen, Lingtech

With the ratification of the European Patents Agreement by the Danish Parliament at the end of 1989, the only legal requirement for validating a European patent in Denmark was that a Danish translation of the patent text be filed with the Danish patent authorities. At the time the law went into effect in January 1990, it was correctly predicted that the number of European patents needing to be registered in Denmark would increase dramatically, as would the demand for translation facilities, particularly for English-to-Danish.

Two leading Danish patent attorney firms --- Hofman-Bang & Boutard A/S and Lehmann & Ree A/S --- undertook a joint venture in order to prepare themselves for the increased translation workload. Their plan was to form a third company, called Lingtech A/S, which was given the task of setting up a ``translation factory.'' Lingtech approached the Center for Sprogteknologi (CST --- the Center for Language Technology), after learning about CST's extensive experience in EUROTRA (the European Community's machine translation research programme, which ran from 1982 - 1992), and after some discussion, CST and Lingtech agreed to cooperate on developing a patent translation system. The result of this collaboration is PaTrans, a commercial system which translates patent texts from English to Danish.

It is well known that the step from research results to real commercial systems is a large one. This project was no exception. Although PaTrans used the prototype developed in the EUROTRA project as its starting point, the software and lingware resources in the EUROTRA prototype had to be improved and extended for the purpose of translating patents in a real office situation.

As PaTrans was initially developed to be used on chemical patents, the EUROTRA grammar was first streamlined. Then it was extended to include the vocabulary, terminology and grammatical constructions which resulted from investigating a corpus of 300 pages of petro-chemical patents. The grammar of PaTrans is now quite comprehensive, but certain phenomena are treated only by a fail-soft mechanism (described in more detail below) which requires special attention by the post-editor.

The translation kernel of PaTrans is preceded by a segmentation and pre-parsing module which attempts to determine the structure of the text at a shallow level. This first pass contributes to the optimisation of the resulting translation. A pre- and post-editing tool, PaEd, converts WordPerfect format into SGML codes and vice versa. The user may also mark untranslatable text with this tool. The document handling facility of the system allows for:

As the system is meant to be used in a real office, a fail-soft mechanism was introduced which would prevent PaTrans from stopping or crashing, even when it comes across a sentence which it cannot analyse. If the parser fails to reach its goal --- a well-formed sentence structure --- the fail-soft mechanism collects sentence constituents, and translates and outputs the constituents.
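The control flow of such a fail-soft mechanism can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PaTrans implementation: the function names, the `ParseFailure` exception and the toy parser, chunker and translator are all hypothetical, and the `/- ... -/` marker simply mirrors the way the fail-soft sentence appears in the sample raw translation printed in this article.

```python
class ParseFailure(Exception):
    """Raised by the (hypothetical) parser when no full sentence structure is found."""


def translate_sentence(sentence, parse_full, chunk, translate_unit):
    """Translate one sentence and never abort on an unparsable input.

    Returns (translation, used_fail_soft).
    """
    try:
        tree = parse_full(sentence)          # normal path: full analysis
        return translate_unit(tree), False
    except ParseFailure:
        # Fail-soft path: collect the constituents that could be found,
        # translate each on its own, and flag the result for the post-editor.
        pieces = [translate_unit(c) for c in chunk(sentence)]
        return "/- " + " ".join(pieces) + " -/", True


# Toy stand-ins so the sketch runs; they are NOT real linguistic components.
def toy_parse(sentence):
    if "," in sentence:                      # pretend commas defeat the parser
        raise ParseFailure(sentence)
    return sentence


def toy_chunk(sentence):
    return [part.strip() for part in sentence.split(",") if part.strip()]


def toy_translate(unit):
    return unit.upper()                      # stands in for real translation


if __name__ == "__main__":
    for s in ["a b", "more specifically, a b"]:
        print(translate_sentence(s, toy_parse, toy_chunk, toy_translate))
```

The point of the design is that the caller always gets some output plus a flag, so a batch of patent pages is never held up by one difficult sentence.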

The lexical material of PaTrans consists of a general dictionary and the term dictionaries. The lexical entries of the general dictionary are general language words, plus general patent-specific words.

The system's terminology is divided into subject-specific databases. As the program is to be applied to a number of different subject fields, a priority mechanism was needed for term bases. The priority and use of the term bases is user-defined and may change from one translation task to another. Because it is the user, and not the computational linguist system developer, who codes the terms in the term dictionaries, a special tool, PaTerm, has been developed to make this task simple and flexible. The user defines the term bases, e.g., `chemistry, general', `petro-chemistry', `inorganic chemistry.' In themselves, these databases may form a hierarchy or they can form a flat structure. To the program, they form a flat structure and the user must specify which term bases are to be used for a particular translation job, and in which order of priority. A term is unique in its own database, but of course the same word may appear in another database with a different meaning and possibly a different translation. When a term is found in one term base, it is not looked up in the subsequent term bases. Lingtech has found that rather broadly defined, and therefore rather large, term bases are most useful.
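The priority mechanism amounts to a first-match lookup over an ordered list of term bases. The following Python sketch shows the idea under stated assumptions: the data layout, the function name `lookup_term` and the placeholder entries are hypothetical (only the term base names and the rendering of `catalyst` as `katalysator` are taken from this article), and nothing here reflects how PaTrans actually stores its terms.

```python
# Hypothetical term bases. Names follow the article's examples; the
# bracketed translations are placeholders, not real dictionary entries.
TERM_BASES = {
    "petro-chemistry": {"cracking": "<petro-chemical rendering>"},
    "chemistry, general": {
        "cracking": "<general rendering>",
        "catalyst": "katalysator",
    },
}


def lookup_term(term, priority, term_bases=TERM_BASES):
    """Return (term_base_name, translation) from the first base that has the term.

    `priority` is the user-chosen, ordered list of term bases for this job.
    Once a term is found, later term bases are not consulted.
    Returns None if no listed term base contains the term.
    """
    for name in priority:
        translation = term_bases.get(name, {}).get(term)
        if translation is not None:
            return name, translation
    return None


if __name__ == "__main__":
    job_priority = ["petro-chemistry", "chemistry, general"]
    print(lookup_term("cracking", job_priority))
    print(lookup_term("catalyst", job_priority))
```

Changing the order of `job_priority` changes which rendering of an ambiguous term wins, which is how the same set of term bases can serve different translation jobs.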

New terms occur in each and every patent document which is submitted for translation. Consequently, it is important that the user can encode terms in a fast and precise way. The PaTerm coding tool provides a screen with fields to fill in. When one field is completed, the cursor automatically jumps to the next, and in most cases proposals for the answer are made by the system, so that the user may confirm the answer with just a key-stroke. PaTerm asks a minimum number of questions and computes the remaining linguistic information from the answers received. This also saves time for the user. The person who uses PaTerm needs some basic linguistic background in order to be able to assign valency frames, but no deep knowledge of linguistics or computational linguistics is required.

So, looking at PaTrans from the user's point of view: patents, written in English, are received on paper. To make the text machine readable, the document is scanned using a high-quality OCR scanner.

The document is then pre-edited with PaEd and certain information in the text is marked up by the user. Information which might be marked in this way could include non-translatable words, the style and size of letters, or headings and tables requiring a word-for-word translation. PaEd is a very efficient tool. It is easily operated, and it will normally take no more than 2-3 minutes to pre-edit a full page of text. During the pre-editing process multi-word terms are identified for later coding. Following the pre-editing stage, term coding may be necessary for a successful translation to be done. The actual coding of one term takes less than one minute.

The document is then ready for machine translation. The system performs the translation and outputs the raw translation for proof-reading. If more than one possible translation has been produced, alternatives are hidden under the selected translation and can be substituted at the click of a mouse button.

All translated documents are then proof-read by terminology experts, again using PaEd. The proof-reader calls the document on the computer screen, and reads it sentence by sentence. Machine marked fail-softs are given special attention and corrections are made. If necessary the proof-reader can see the untranslated text in its original layout in a separate window. Or particular sentences may be selected with the mouse and compared with the equivalent sentence in its untranslated form. The process can be terminated at any point by either printing or saving the translated document.

Original: The present invention relates to a process for producing lube oil. More specifically, the present invention relates to a process for producing lube oil from olefins by isomerization over a silico-aluminophosphate catalyst.

Raw translation: Den foreliggende opfindelse angår en fremgangsmåde til at fremstille smøreolie. /- Mere specifikt, foreliggende opfindelse angår en fremgangsmåde til at fremstille smøreolie fra olefiner med isomerisering i løbet af en silocoalumino-phosphatkatalysator. -/

Post-edited translation: Den foreliggende opfindelse angår en fremgangsmåde til at fremstille smøreolie. Mere specifikt angår den foreliggende opfindelse en fremgangsmåde til at fremstille smøreolie ud fra olefiner ved isomerisering over en silocoaluminophophatkatalysator.

The short excerpts shown above give an actual example of two sentences of a patent translated at Lingtech. Notice that the present participle in the original English text, producing, has been correctly rendered as a Danish infinitive, at fremstille. The second sentence in the translation has been treated by the fail-soft component. This is the reason that the constituent order is not correct: the verb should be the second constituent, after Mere specifikt. Some prepositions have to be changed as well. Most sentences need some post-editing, but often only minor corrections are necessary.

The average speed of the translation system is about twice the requested speed, mainly due to faster hardware but also due to intelligent programming, and proof-reading takes about one third of the time of the manual translation process. Work is under way to improve the quality of output, in particular by improving the fail-soft results. As the system is used commercially, new development is funded only if a return on investment can be substantiated. Thus, certain phenomena which occur only rarely may be left to the post-editor, even though a technical solution could be developed to deal with them.

The system has proven to save substantial costs in the translation process. The table below lists Lingtech's costs for manual translation and machine translation of patent texts over a one-year period. It will be apparent from these figures that actual translation costs are more than halved by using the MT-system.

Table 1: Cost performance per 2 million translated words per year

                          MT environment         Manual translation
                            DKK      US$            DKK       US$
Scanning/Pre-editing       6,000    1,000          6,000     1,000
Work Station              25,000    4,000

Scanning/Pre-editing      65,000   11,000         12,000     2,000
Manual Translation                             1,300,000   217,000
Proof-reading            560,000   93,000        400,000    67,000
Coding/System control    102,000   17,000

Total                    758,000  126,000      1,718,000   287,000

Yearly cost savings      960,000  161,000
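The totals and the ``more than halved'' claim can be checked directly from the DKK figures in Table 1. The short Python sketch below does only that arithmetic; the variable names are ours, and the two rows that share the label Scanning/Pre-editing are distinguished here simply as first and second row, since the table itself does not say how they differ.

```python
# DKK figures copied from Table 1 (per 2 million translated words per year).
MT_COSTS_DKK = {
    "Scanning/Pre-editing (first row)": 6_000,
    "Work Station": 25_000,
    "Scanning/Pre-editing (second row)": 65_000,
    "Proof-reading": 560_000,
    "Coding/System control": 102_000,
}
MANUAL_COSTS_DKK = {
    "Scanning/Pre-editing (first row)": 6_000,
    "Scanning/Pre-editing (second row)": 12_000,
    "Manual Translation": 1_300_000,
    "Proof-reading": 400_000,
}

WORDS_PER_YEAR = 2_000_000

mt_total = sum(MT_COSTS_DKK.values())
manual_total = sum(MANUAL_COSTS_DKK.values())
savings = manual_total - mt_total
ratio = mt_total / manual_total              # MT cost as a share of manual cost

if __name__ == "__main__":
    print("MT total (DKK):       ", mt_total)
    print("Manual total (DKK):   ", manual_total)
    print("Yearly savings (DKK): ", savings)
    print("MT / manual cost:      %.2f" % ratio)
    print("DKK per word, MT:      %.3f" % (mt_total / WORDS_PER_YEAR))
    print("DKK per word, manual:  %.3f" % (manual_total / WORDS_PER_YEAR))
```

The sums reproduce the table's totals of 758,000 and 1,718,000 DKK and the stated saving of 960,000 DKK, and the MT total comes to well under half of the manual total.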

Since the use of PaTrans for translating chemical patents has so far proven successful, the system is now being further developed to deal with mechanical patent texts. Extensions to other language pairs are also planned, in particular English-to-Swedish. Longer-term goals are to adapt the system for other types of texts, and various possibilities are under consideration by Lingtech and CST. It is clear that this type of technology could be beneficial to user organisations needing to translate large amounts of text each year. Candidates might include large companies exporting to Denmark or Danish import companies. The viability of building a practical system on the basis of EUROTRA has been proven; further exploitation now seems to depend only on demand.

Further details about PaTrans, and the former EUROTRA project, are available from:

Bente Maegaard
Center for Sprogteknologi
Njalsgade 80
DK-2300 Copenhagen S
Phone: +45 35 32 90 90
Fax: +45 35 32 90 89
Email: bente@cst.ku.dk

USA-German Verbmobil Collaboration

Verbmobil is a long-term project on the translation of spontaneous speech in negotiation dialogs. It is funded by the German Ministry for Education, Science, Research and Technology (BMBF) and an industrial consortium, including Alcatel, Daimler-Benz AG, IBM Deutschland, Philips GmbH, and Siemens Aktiengesellschaft. For the first four years of the project, the funding from BMBF will amount to 60 million DM. The project is expected to extend for 8-10 years; the first phase of 4 years is structured around 2 major milestones: a demonstrator after 2 years and a research prototype after 4 years. Over 30 German research groups are collaborating in the first phase, which started in 1993.

An additional partner in the project is a small team from Stanford's Center for the Study of Language and Information (CSLI), with Ivan Sag and Dan Flickinger playing the leading roles. In mid-September, Ewan Klein interviewed Flickinger on behalf of ELSNews about CSLI's involvement in Verbmobil. A shortened version of this interview is transcribed below.

Klein: How did CSLI first become involved in Verbmobil?

Flickinger: The initial contact came because BMBF, the German ministry that is funding Verbmobil, asked CSLI to do an evaluation and feasibility study back in 1993, involving Martin Kay, Mark Gawron and Peter Norvig. That study came back with a positive recommendation, and by that time the German ministry was familiar with CSLI. So when the responsibilities were divided up for the various parts of the Verbmobil project, there ended up being about 30 partners, and none of those had applied to write the English grammar which is needed to constrain the generation output of the translation effort. So Germany came back and invited CSLI to submit a proposal. After some discussion, Ivan Sag put in a proposal which was accepted in early 1994. We started the project in March 1994, and it will run until the end of 1996.

Could you say more about the role of the English grammar in the project?

Yes, but let me first give a little background on the face-to-face translation system that Verbmobil is trying to build. The intention is to have a box that sits on a table between two business people that are having a conversation largely in English. But at least one of the people is not a native speaker of English. The box sits passively most of the time, but at certain points in the conversation a speaker decides they don't know how to say something precisely in English, so they push a button on the box, and then in their native language, German or Japanese, they say the sentence or the phrase that they want to have translated.

The software then converts that speech signal into a meaning representation and produces a corresponding sentence or phrase in English that carries the gist of the meaning from the input. It is not necessarily a literal translation, but something which is good enough to carry on the conversation. It is always one-directional translation. The grammar that we're developing here at CSLI is providing the constraints that the generator being built in Germany will use to ensure that the output conforms to the rules of English grammar.

The generator itself is being built in Saarbrücken, and the group that is working there has a lot of experience with Tree Adjoining Grammars (TAGs). So there's an interesting wrinkle in this distribution of labour. We contracted to produce an HPSG grammar, but it then has to be compiled into a TAG so that the generator can use it. That compilation step is a new and vigorously pursued research effort, at least in Germany. We're going to be using a couple of layers of unproven technology in getting this thing to work, but there is some early evidence that that's going to work alright. The generation group provides all the ``micro-planning'', as they call it, strategies for word-choice selection, decisions about mood, topic and focus, and so on, while the grammar that we're supplying simply provides the syntactic and compositional-semantic constraints on permissible sequences of words in English. There is some interesting negotiation that has to take place between our group that is building the lexicon and grammar for English, the generation group, and of course the transfer group, which produces the meaning representation from the German input, which is then used by the generator to drive the production of a sentence.

What work has been done on the problems that speakers would encounter in using such a tool?

There has been some work, but to my mind, not enough. The strategy taken in determining the range of coverage that we'd like, in terms of lexicon and construction types, was to collect a set of dialogs from people simulating having a translation box between them. First, they simply held a conversation about the domain chosen for Verbmobil, which is to agree on a meeting time, and the first set of dialogs we got had no simulation of the machine or translator itself, but just two people talking. Those dialogs were translated into English and we got to look at the construction types. That helped somewhat with vocabulary but not at all with style of conversation. Then a second set of dialogs was collected as a Wizard of Oz study; there was a hidden human simulating the effect of this translation box, and the two people holding the conversation were told it was a computer system. That's a more realistic kind of simulation, but because of insufficient constraints, some of the outputs of the Wizard involved too much common-sense reasoning or really non-literal translations. These are not things we are going to be able to replicate with a straightforward domain model and the limited reasoning system that we have available.

New dialogs are now being collected, with a slightly enlarged subject matter: the negotiation involves both time of meeting and place of meeting. There are several areas where that makes the linguistic coverage more complicated. But the set-up for the data collection will be more realistic. Simulators of the translation system have been given more careful instructions about staying close to the ground. But there has been to my mind no analysis so far of how people will respond to a limited-capability translation system; if the machine makes errors, what level of tolerance people will maintain, or how they will change their style, or what level of literalness or unnaturalness in the translation they will be prepared to put up with. Those are all things which will come into play as a consequence of building this first prototype by the end of 1996. This will put all the pieces in place and will cover the vocabulary in the limited domain, and there are planned to be some experiments with this prototype before designing the next phase of the research, which will be more commercially oriented. So the argument is that people don't really know how to place reasonable constraints on simulated dialogs, and that it will be more efficient to build the first version of the system, and see what it's capable of.

Presumably for a lazy conversationalist, there will be a temptation to just speak German the whole time, which will be beyond the capabilities of the box. On the other hand, someone who knew that this box was severely constrained, and was relatively fluent, might just be uttering one or two isolated words in German.

It is true that no constraint is being placed on the partners in the dialog about how much or how little of the conversation takes place in English. The expectation is that most of it will be in English. The practical fact is that we have no means at the moment, and in fact it is outside the reach of the technology, for tracking the dialog in English. The only speech recognition that we're working on in Verbmobil is for German, and eventually Japanese. To the extent that the conversation takes place in English, the best we can do is some very limited keyword spotting, and by and large things like anaphoric referents are going to be unavailable; we're just not maintaining a precise discourse model. In that respect, having more of the conversation take place in German would be helpful because it would give us more anaphoric information. Whether this will turn out to be habitable for the users is an empirical question; we don't know the answer yet.

Can you say why HPSG was chosen as the grammar framework?

Well, I'd like to say that it was because of the obviously superior properties of HPSG with respect to the connection between theory and implementation. I believe that to be true, but that wasn't necessarily the motivation. It's in part a result of the fact that many of the partners in Verbmobil had been working in an HPSG framework, or a unification-based framework of one kind or another, and it was the most common linguistic framework that people were familiar with.

Did you find the syntactic coverage of your English grammar had to be modified for the Verbmobil domain?

Yes, it still has to be, and continuously so. We're getting a richer source of data all the time, and it is true that we discover, even in this relatively limited set of data, places where we have to go back and enlarge coverage. In most cases, however, I think we're headed in the opposite direction. The theory casts its net quite widely, so it's interested in things like parasitic gaps, cross-serial dependencies, and other phenomena that may not occur in English even, and we're looking for places to trim down the theory to something that is more practical for implementation. For example, we might not accommodate Right Node Raising, even though it does occur in the data; an example is: ``Do you mean Thursday the 8th or Thursday the 15th of July?'' where ``of July'' is interpreted as part of both conjuncts. The theory of this construction still isn't worked out very well, and we can take a shortcut and produce a false syntactic and semantic analysis, but one which is parallel to the one being produced for the German input. The translation, even though it's based on an incorrect semantics, may give a reasonable English output which the hearer can then interpret as a Right Node Raising case because of the context. That's also true for things like scope ambiguity; we're not going to do a lot of analysis here, because the ambiguity in German will be preserved in English.

There are cases of semantic type coercion which occur quite a lot. So someone might say: ``Let's arrange for Friday at 5 o'clock.'' Presumably what's meant is something like: ``Let's arrange for an event that will happen on Friday at 5 o'clock.'' There's an implicit event in the argument of many of the sentences we're looking at. Another example, of borderline grammaticality, is: ``On Tuesday would suit me pretty well,'' where presumably what is meant is: ``Some event on Tuesday would suit me pretty well,'' but the temporal locative is all you get.

There have been some surprises. There is very limited use, for example, of relative clauses in the data we've seen so far. Although we've developed an analysis of the construction for this domain, it hasn't been used much; maybe four examples out of several hundred dialog turns.

Here's the other significant fact about coverage. We have a great luxury: we're generating off of some semantic representation, and we can do a lot to avoid constructions that we don't have a good analysis of or are expensive to implement. So pseudo-clefting or topicalization, even passives, we might be able to choose only one of those constructions for getting focus on a particular constituent and not do the others. We might also skip complicated issues about attributive versus predicative adjectives by just choosing one. The output may sound stilted or not completely natural, but it will at least be interpretable and will greatly simplify the amount of computation we have to do. That clearly has to get negotiated with the generation team in the project, but it's an escape hatch that just isn't there if you try to do parsing.

Ultimately, we will have to do parsing as well, because we expect to engage in repair-dialogs in English, as opposed to the native language, if we can't quite produce a good analysis. And we'd like to start tracking the ongoing English conversation in some more distant version of the system. So we're making every effort to keep the grammar reversible, really reversible, so that it is exactly the same grammar for generation and parsing. In fact, the development work we're doing at the moment is nearly all parsing-based; since the German generator has to have the extra stage of TAG-compilation, it's not available in a robust form yet. That should happen in the next few months. So we're testing the grammar using a parser that was developed outside the project.

In the area of computational linguistics, this US-German collaboration is rather unusual. Do you think there are any lessons to be learnt for the future?

I can highlight a few. There is a great benefit to having electronic mail as a means of very effective quick collaboration, raising of issues, resolving of issues, and we've made really extensive use of that in collaboration with our partners. But there are some issues which have to be negotiated, not simply announced, and those negotiations are really cumbersome via electronic mail. Telephone is a possibility, but the nine hour time difference between Germany and California makes that sacrificial for one side or the other, since it has to be done outside of work hours. So we've found the telephone not a very useful device. We've opted for increased travel, even though that's on the face of it quite expensive. So far it has proven to be the most cost-effective way of getting work done --- maybe the only way of getting work done. We initially allocated a very small travel budget, thinking we would only have to go across once or twice a year. We've ended up spending something like five to ten times the amount we had projected for travel in the original budget. Next time, I would expect to spend two weeks out of every two months in travel.

On the common platforms issue, each of the groups that is participating in Verbmobil was accepted precisely because they had a history of working in some particular domain and because they had done implementation work. It wasn't the case that they had chosen magically the same platform for their previous work, and in order to maximise efficiency in getting their own bits of work done, it was agreed that groups could work on somewhat different platforms, within certain constraints. There is a systems integration group whose job is to provide a system that will absorb these components written and running in different environments, different languages, even on different machines, and put them all together into a single smooth piece of software. That integration effort has been enormously successful; it's one of the real triumphs of the work to date. But the consequence of their success is that the individual groups can continue to work without being forced to settle on common implementation platforms. That means that the grammar I'm writing on a Lisp platform can't be directly tested on software running in Stuttgart or in Heidelberg, or in some other place where grammar development is going on, even though it would be extremely useful for them to do so: to check examples I didn't give them, to see about variations in the coverage, gaps in the coverage. If we had been forced to settle on a common implementation platform there would have been a really high cost at the beginning but we'd now be recouping some benefit from that effort. Not requiring common platforms might well have been the right decision to make for this first prototype. The difficulties were a price that we knew we were going to have to pay, but we didn't appreciate just how much of a price it was.

Documentation would be an extremely useful thing if we could all discipline ourselves to produce documentation of the current implementation of the system. Some groups have been much better about that than others. We're among the worst. In part we think we're excused by trying to stay very close to the theoretical literature that is being published. So if one is able to keep up with that literature, one can get a pretty good sense of what we're trying to do. But there are places where we've drifted, and we haven't done a very good job of documenting that. I underestimated the importance of that kind of prose reminder of decisions that were taken, both in terms of analysis and in terms of engineering, and were we to design a project of this sort again, we would explicitly allocate a significant chunk of resources to building and maintaining that documentation along the way.

There is some difficulty about the fact that this is a German-funded project, to benefit German industry. There's been a reluctance, understandably, to make the proprietary software being developed by companies and even research institutes in Germany available outside the country. In every case, we've been able to negotiate a satisfactory solution to that, and the BMBF has been extremely cooperative as an intermediary between us and the corporations. But it's an issue that's going to confront any collaborative effort between European-funded groups and overseas partners. There are business interests involved that are necessarily parochial and that barrier is not going to go away. If one were to do this again, those intellectual property issues ought to be worked out more explicitly with companies in advance so that management knows upfront as they get into the project that they're going to need to embrace a little more flexibility than they might otherwise do for their software.

Will you make the grammar you are developing under Verbmobil available to other groups?

Yes, certainly. We would very much like this grammar, or another one that we would then become vigorous contributors to, to become a common base for HPSG work being done, at least in English. Where there are people working on implementations, there is a desire not to have to replicate all the boring stuff. For people doing English, it's tiring to have to do the auxiliary system over and over again. Somebody ought to just do it, get it done right, and lock it into place, and then not have to think about it again, and that's true for a whole range of construction types. Most people who are working are focussed on some area or some application; they want the rest of it to be handled in a standard way.

Our belief is that the field may be now advanced enough in terms of a small number of available platforms and general agreement about the architecture of the feature structures that we can maybe get some useful work done in a consortium where the actual implementations get exchanged, at least at the level of the specification of the grammar types in source files. We're now just trying to see how that would work out mechanically, what the nature of that interaction would be, what kind of collaboration would work.

Also, there is the interesting question about to what degree one can find language-independent types, or representations, or pieces of language model that you could share across French and German and English at least, and maybe even for non-European languages. It's probably the case that a good model of lexical rules could serve across languages, if one could find the right kind of implementation; likewise the decision to use a rich set of typed rule schemata in an inheritance hierarchy, if that proves to be viable, could be shared across languages with minor variations. Those questions are of extreme importance as the HPSG theory extends into ever more language families and coverage of linguistic phenomena.

Kay, M., J-M. Gawron and P. Norvig (1994) Verbmobil: A Translation System for Face-to-Face Dialogs. CSLI Lecture Notes 33, University of Chicago Press.

Unfortunately, there was not sufficient space to reprint this interview in full. Flickinger went on to discuss, among other things, the challenges of trying to implement HPSG in Verbmobil, and gave his opinion on proposals which incorporate statistical information into unification-based grammars. A transcription of the entire interview is available via the ELSNET web pages.

Additional information about CSLI may be obtained here.

Information about research on Head-Driven Phrase Structure Grammar (HPSG), including a comprehensive bibliography, is maintained by Ohio State University.

MorphoLogic Bridging Gaps between Academia and Industry in Hungary

Gábor Prószéky, MorphoLogic

In Eastern Europe, as is often the case in Western Europe, language processing systems built in academic research labs are not generally flexible enough to be turned into real marketable products. Moreover, funding from Eastern European governments for language research tends to be decreasing, despite high demand from national and international software developers for linguistically based tools. Some researchers, frustrated with the lack of funding for their research and realising the potential market that exists in their own countries for lingware products, have begun to put their academic backgrounds to good financial use by setting up companies.

MorphoLogic, a Hungarian SME, is one such example. Established in 1991 by a group of NL researchers from the Academy of Sciences and the universities in Budapest, MorphoLogic has the distinction of being the only organisation in Hungary that is doing R&D solely in the field of natural language processing. MorphoLogic has discovered that one way of bridging the gap between basic research and profit-oriented development is by doing basic research within the company. The close ties that MorphoLogic continues to maintain with academic labs in Budapest have resulted in a number of very profitable language products, and the sale of these products funds not only the company's profit-making activities but its non-profit activities as well.

MorphoLogic's name expresses the company's focus on R&D work in morphology and syntax. The success of the company has been due to its staff's formal scientific background in studying Hungarian, which is a morphologically complex language. R&D efforts over the past few years have focused on four main related areas:

  1. development of a reversible, string-based unification morpho-syntactic formalism;
  2. development of a family of proofreading tools (spelling and grammar checkers, hyphenators, thesauri) for Hungarian and other agglutinative and highly-inflectional languages (Polish, Romanian, Bulgarian, etc.);
  3. development of tools which support intelligent text analysis, free text search and database indexing;
  4. development of bilingual ``morpho-logical dictionaries'' which may be used in machine-aided translation.

Each of these areas has one or more specific projects and partners associated with it. For example, the Research Institute for Linguistics (RIL) at the Hungarian Academy of Sciences has been an important partner in the development of the reversible, string-based morpho-syntactic formalism, since the first real users of commercial morpho-syntactic systems in Hungary were the lexicographers at RIL who were writing an historical dictionary of Hungarian. MorphoLogic's system consists of both a morphological analyser and generator, and it can handle derivational and inflectional affixes and compounding.

MorphoLogic is marketing a number of commercial proofreading tools. These include the spell checker, Helyes-e?; the hyphenator, Helyesel; the grammar/style checker, Helyesebb; and an inflectional thesaurus called Helyette.

Spell-checking for morphologically simple languages, like English, involves the trivial task of looking up the word in a word list. Spell-checking for highly inflectional, agglutinative languages, like Hungarian, however, requires a thorough morphological analysis. Helyes-e? consists of lexicons and algorithms that enable the software to handle billions of possible words, and to propose intelligent corrections for misspelled words. Helyes-e? can be customised by the user, and so it is easily adapted to OCR, handwriting and speech recognition systems where error-types are different from typical typing errors.
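The contrast between list lookup and morphological analysis can be illustrated with a minimal Python sketch. It is not a description of how Helyes-e? works: the two-entry lexicon, the two case suffixes, the last-vowel harmony rule and the function names `is_valid` and `suggest` are all assumptions made for illustration. A word is accepted if it decomposes into a known base plus a case suffix whose vowel agrees with the base, and corrections are proposed by trying single-letter substitutions and keeping only those that analyse.

```python
# Toy sketch only: lexicon, suffixes and the harmony rule are simplified assumptions.
BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

LEXICON = {"ház": "házak", "kert": "kertek"}      # stem -> plural form
CASE_SUFFIXES = [("ban", "ben"), ("nak", "nek")]  # (back variant, front variant)
ALPHABET = "aábehknrtz"                           # letters tried when correcting


def is_back(form):
    """Simplified vowel harmony: decided by the last vowel of the form."""
    for ch in reversed(form):
        if ch in BACK_VOWELS:
            return True
        if ch in FRONT_VOWELS:
            return False
    return True


def is_base(form):
    """A base is a bare stem or its plural."""
    return form in LEXICON or form in LEXICON.values()


def is_valid(word):
    """Accept a base, or a base followed by a harmonising case suffix."""
    if is_base(word):
        return True
    for back, front in CASE_SUFFIXES:
        for suffix in (back, front):
            if word.endswith(suffix):
                base = word[: -len(suffix)]
                expected = back if is_back(base) else front
                if is_base(base) and suffix == expected:
                    return True
    return False


def suggest(word):
    """Propose valid words reachable by substituting a single letter."""
    candidates = {
        word[:i] + ch + word[i + 1:]
        for i in range(len(word))
        for ch in ALPHABET
    }
    return sorted(c for c in candidates if c != word and is_valid(c))
```

With this toy data, `is_valid("házakban")` is true, `is_valid("házakben")` is false, and `suggest("házakben")` returns `["házakban"]`. Two lexicon entries and two suffix pairs already cover twelve word forms; a plain word list would have to store each of them separately.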

Helyesel hyphenates any word-form with the use of a morphological segmentation algorithm. This model is useful for languages in which morpheme boundaries override the usual hyphenation points. List-based hyphenation does not work in such languages. Helyesel also allows hyphenation with optional letter-insertion or letter-change.
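The following Python sketch illustrates why morpheme boundaries matter for hyphenation. It is only an illustration under stated assumptions: the syllable rule is deliberately simplified (it ignores digraphs such as "sz" and handles lower-case input only), the morpheme segmentation is supplied by hand rather than computed, and the function names are hypothetical. It does not attempt the letter-insertion or letter-change cases mentioned above.

```python
# Toy sketch: simplified syllable rule, hand-supplied segmentation.
VOWELS = set("aáeéiíoóöőuúüű")


def hyphenate_part(part):
    """Hyphenate one morpheme: between two vowels, break before the last
    intervening consonant (or between the vowels if they are adjacent)."""
    vowel_positions = [i for i, ch in enumerate(part) if ch in VOWELS]
    pieces, start = [], 0
    for prev, nxt in zip(vowel_positions, vowel_positions[1:]):
        cut = nxt if nxt - prev == 1 else nxt - 1
        pieces.append(part[start:cut])
        start = cut
    pieces.append(part[start:])
    return "-".join(pieces)


def hyphenate(segments):
    """Hyphenate a word given as a list of morphemes; morpheme
    boundaries are always break points and override the syllable rule."""
    return "-".join(hyphenate_part(segment) for segment in segments)
```

Treating the compound "vasút" (railway) as an unanalysed string, `hyphenate(["vasút"])` gives `"va-sút"`; with the segmentation "vas" + "út" supplied, `hyphenate(["vas", "út"])` gives `"vas-út"`, with the break at the boundary between the two components.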

Both Helyes-e? and Helyesel are available for DOS, Windows, Windows NT, Windows 95 and Macintosh operating systems. Different versions of these tools have been licensed by Microsoft, Lotus and other international software companies. Consequently, they are included in Word, WordPerfect and AmiPro, as well as with desktop publishing packages like PageMaker, Corel Ventura, and Quark Xpress.

Last year, an inflectional thesaurus called Helyette was developed at MorphoLogic. The system is a combination of a morphological analyser, a synonym dictionary, and a morphological generator. It works by finding the lexical base of an input word and storing the inflectional information. It then offers the synonyms of the stem, and finally, it generates the morpho-phonologically correct combination of the chosen synonym and the stored inflectional information. Helyette is meant to be language-independent. Its first implementation with the complex suffix system of Hungarian has been successful and MorphoLogic is now looking to test the system on other languages.
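The analyse, look up and regenerate steps described above can be sketched in a few lines of Python. This is a toy illustration, not Helyette's implementation: the three stems, the synonym table, the two suffixes, the last-vowel harmony rule and every function name are assumptions. Analysis is done here by running the generator over the lexicon, a shortcut that only works for a tiny vocabulary.

```python
# Toy sketch of an inflectional thesaurus; all data and names are hypothetical.
BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

STEMS = ["ház", "épület", "lakás"]
SYNONYMS = {"ház": ["épület", "lakás"]}
# Each inflectional feature has a back-vowel and a front-vowel variant.
SUFFIXES = {"INESSIVE": ("ban", "ben"), "DATIVE": ("nak", "nek")}


def is_back(stem):
    """Simplified vowel harmony: decided by the last vowel of the stem."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return True
        if ch in FRONT_VOWELS:
            return False
    return True


def generate(stem, feature):
    """Attach the suffix variant that harmonises with the stem."""
    back, front = SUFFIXES[feature]
    return stem + (back if is_back(stem) else front)


def analyse(word):
    """Return (stem, feature) for a known inflected form, else None."""
    for stem in STEMS:
        for feature in SUFFIXES:
            if generate(stem, feature) == word:
                return stem, feature
    return None


def inflected_synonyms(word):
    """Offer synonyms of the stem, carrying the same inflection."""
    analysis = analyse(word)
    if analysis is None:
        return []
    stem, feature = analysis
    return [generate(synonym, feature) for synonym in SYNONYMS.get(stem, [])]
```

For the input "házban" (in the house), `inflected_synonyms("házban")` returns `["épületben", "lakásban"]`: the stored inessive suffix surfaces as "-ben" on the front-vowel synonym and as "-ban" on the back-vowel one, which is the kind of morpho-phonological adjustment the generator is responsible for.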

Over the past several years, MorphoLogic has had various customers with specific needs, such as the Hungarian Parliament, the Prime Minister's Office, the Ministry of Foreign Affairs, and the editorial house of Népszabadság, Hungary's biggest daily newspaper.

MorphoLogic presently has a staff of eight researchers. Gábor Prószéky, the company's director, is a member of the ELSNET Industrial Task Group. For further details, contact him at:

Fu. 56-58. I/3
H-1011 Budapest, Hungary
Tel: +36 1 201 8355 or +36 60 344 884
Fax: +36 1 201 8355
E-mail: h6109pro@ella.hu

Minutes of the October ELSNET Executive Board Meeting Summarised

The last meeting of the ELSNET Executive Board was held in London on October 13. The following issues were addressed:

Training: Preparations are well under way for the Fourth Summer School, to be held in Budapest in July 1996, in co-operation with ELSNET goes East. The topic of the summer school will be Dialogue Systems. The programme committee (Niels Ole Bernsen, Roskilde University, Denmark; Norman Fraser, Vocalis, UK; and Klara Vicsi, Technical University Budapest, Hungary) have set up a programme and prospective lecturers have been invited. The first official announcement will be made in November [Klara Vicsi (chairman, local organising committee): vicsi@sparc.core.hu].

Research: Niels Ole Bernsen was appointed as co-convenor of the Research Task Group. Bernsen reported that he will coordinate the setting up of a project aimed at developing evaluation methods for spoken language dialogue systems. Krauwer will co-ordinate the establishment of a European evaluation infrastructure (in the form of a concerted action). One of the goals of this initiative will be to stimulate evaluation activities throughout Europe like those which already exist in France and Germany.

Information Dissemination: Dawn Griesbach was appointed as convenor of the Information Dissemination Task Group. The Task Group recently produced an 8-page ``ELSNET brochure'', which may be used to advertise ELSNET at conferences and workshops. All ELSNET nodes and ELSNews subscribers will receive a copy of the document with the next issue of ELSNews. ELSNET members may request additional copies to take along to ELSNET-related events [dawn@cogsci.ed.ac.uk].

Collaboration with ELSNET goes East and TELRI: Representatives from ELSNET, ELSNET goes East, and TELRI met in London on October 12, to discuss how best to collaborate in gathering and disseminating information. The joint aim of all three projects is to obtain up-to-date and exhaustive information about all European industrial and academic sites involved in language and speech [elsnet@let.ruu.nl].

Resources: The Resources Task Group will make available to the ELSNET community a `resources package' including: two (small) annotated corpora (German and Italian), produced in Stuttgart and Pisa during 1995, together with the EAGLES guidelines and the tagger used for this task. It is expected that the package will be available by the end of this year or early next year. For more information, please contact Uli Heid [uli@ims.uni-stuttgart.de].

ELRA: A meeting between ELSNET and ELRA representatives will be arranged soon. The purpose of this meeting is to discuss the relationship between ELSNET and ELRA in general, and more specifically, the harmonisation of distribution policies.

ELSNET Foundation: It was agreed to set up an ELSNET foundation. The foundation will serve as an instrument for the Executive Board in cases where it is desirable for the Network to present itself as a legal entity [elsnet@let.ruu.nl].

ELSNET-2: It was agreed that an unfunded contract extension should be requested from Brussels since there will be a gap between the end date of the current ELSNET contract and the official start-up of the ELSNET-2 contract --- assuming that the ELSNET-2 application will be successful. Any funds remaining from ELSNET's 1995 budget will be used as ``bridging funds'' to carry on the activities of the Network at a basic level --- i.e., general Network co-ordination and co-ordination of information dissemination functions (ELSNews, elsnet-list, WWW-pages) --- and to guarantee financial support for the 1996 Summer School.

The ELSNET-2 proposal, which had been sent to all ELSNET Nodes for comments, was discussed in detail. After all suggestions have been incorporated and the final version has been approved by the EB, the proposal will be submitted to Brussels.

New node: The Research Group on Signal Processing and Communications of Granada University in Spain was accepted as an ELSNET node [Victoria Sanchez, victoria@hal.ugr.es].

Next meeting: The next meeting of the ELSNET Executive Board will be held on February 26, 1996, in Pisa, Italy [elsnet@let.ruu.nl].

Language Software Helpdesk Set up at Edinburgh University

The Human Communication Research Centre (HCRC) at the University of Edinburgh has recently announced the launch of the Language Software Helpdesk. The aim of the Helpdesk is to offer a free support service for public domain and freely available NLP software, and to foster its use in practical applications.

There is a substantial body of software available in the area of natural language processing, both in the form of complete systems and of system components. But the uptake of this software, particularly by industrial users, has been impeded by lack of information about functionality and worries about the availability of ongoing support. The Language Software Helpdesk is intended to directly address both these problems.

Suppose a user has a specific language processing task to perform. He or she will want to know whether there is an off-the-shelf component which will do the job. If there are several candidates, which would be most suitable? If there are none available, is there a component which could be cost-effectively adapted? Once a candidate is chosen, how is it installed? What happens if it breaks, or doesn't perform as expected? The Language Software Helpdesk can provide the necessary answers and assistance.

The Language Software Helpdesk draws largely on staff of HCRC's Language Technology Group, and its goal is to aid users (particularly UK and European companies) in the selection and use of natural language tools for practical tasks. The Helpdesk staff have considerable experience in developing and using such tools, and in customising them for practical purposes. They understand that simply pointing to a manual is not an adequate response to the kinds of queries likely to arise from those at an early stage of involvement in the area, and are prepared to offer a more responsive and individualised support service.

For certain topics the Language Software Helpdesk has already prepared material which will help inexperienced users to acquire the information necessary in order to make informed decisions.

The Helpdesk staff have a wide range of both theoretical and practical skills, and are already tracking developments in key areas of language technology.

While most initial queries can therefore be answered straightaway and at no charge, more complex or unusual queries may require more substantial effort. In this case, the Helpdesk staff may only sketch a reply and offer to negotiate a fee for a thoroughly researched response and/or for ongoing support. Needless to say there is no guarantee attached to any information provided without charge.

Language Software Helpdesk
c/o Language Technology Group
2 Buccleuch Place
Edinburgh EH8 9LW, UK
Fax: +44 131 650 4587
Email: Language.Software.Helpdesk@ed.ac.uk

MLNet Publishes Report on State of the Art in ML

The Research Committee of the European Network of Excellence in Machine Learning (MLNet) has recently published a special issue of the network's newsletter (MLnet News) as a report entitled State of the Art in Machine Learning. Because of the past collaboration between ELSNET and MLNet (i.e., the Workshop on Machine Learning of NL and Speech held Dec. 2-3, 1994, reported in ELSNews 4.1), the articles in this document may be of interest to some members of ELSNET.

Topics include: reinforcement learning; attribute-based learning; inductive logic programming; genetic algorithms; multistrategy learning; case-based reasoning; shift of bias in ML; and knowledge base refinement and theory revision.

The report is available electronically from the MLNet archive via the World Wide Web: ftp://ftp.gmd.de/ml-archive/README.html. The archive is also accessible by anonymous ftp at: ftp.gmd.de/ml-archive. Alternatively, hard copies are available from:

MLNet, Department of Computing Science
University of Aberdeen
King's College
Aberdeen AB9 2UE, Scotland, UK
Fax: +44 1224 27 3422
Email: mlnet@csd.abdn.ac.uk.

Future Events

December 1-2, 1995: Conference on Architectures and Mechanisms for Language Processing (AMLaP-95). Edinburgh, Scotland. Deadline for submission of papers: Oct. 18, 1995. For information, contact: Matt Crocker and Martin Pickering, Email: amlap@cogsci.ed.ac.uk.

December 6-8, 1995: First AMAST (Algebraic Methodology and Software Technology) Workshop on Language Processing. Enschede, The Netherlands. For information, contact: A. Nijholt, Email: anijholt@cs.utwente.nl.

December 18-21, 1995: Tenth Amsterdam Colloquium. Amsterdam, The Netherlands. For information, contact: Email: acten@illc.uva.nl.

March 7-9, 1996: Les langues et leurs images. Neuchatel, Switzerland. For information, contact IRDP, Fbg de l'Hopital 43, CH-2007 Neuchatel, Tel: +41 38 24 41 91, Fax: +41 38 25 99 47.

April 11-12, 1996: Second ACM/SIGCAPH Conference on Assistive Technologies. Vancouver, Canada. Deadline for submission of papers: Oct. 17, 1995. For information, contact: David Jaffe, Dept. of Veteran Affairs Medical Center, 3801 Miranda Avenue, Mail Stop 153, Palo Alto, CA 94304, Email: jaffe@roses.stanford.edu.

June 28, 1996: Second meeting of the Special Interest Group in Computational Phonology (SIGPHON 96). Santa Cruz, CA, USA. For information, contact: Richard Sproat, AT&T Bell Laboratories, Room 2d-451, 600 Mountain Avenue, Murray Hill, NJ 07974, USA, Email: sigphon@research.att.com.

July 15-19, 1996: The Auditory Basis of Speech Perception (ESCA Tutorial and Research Workshop). Keele University, UK. Deadline for submission of abstracts: Nov. 15, 1995. For information, contact: ESCA Workshop, Dept. of Communication and Neuroscience, Keele University, Keele, Staffordshire ST5 5BG, UK, Tel/Fax: +44 1782 583055, Email: cob03@keele.ac.uk.

August 5-9, 1996: International Conference on Computational Linguistics (COLING-96). Copenhagen, Denmark. Deadline for paper submissions: Dec. 15, 1995. For information, contact: Bente Maegaard, Email: bente@cst.ku.dk.

August 12-16, 1996: 12th European Conference on Artificial Intelligence (ECAI-96). Budapest, Hungary. Deadline for workshop proposal submissions only: Nov. 1, 1995. For information on ECAI workshops, contact: Elisabeth Andre, DFKI, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany, Email: ecai-96-ws@dfki.uni-sb.de.