BLARK: The Basic Language Resource Kit

ELSNET and ELRA: Common past, common future

Steven Krauwer (UiL OTS, Utrecht University; ELSNET Coordinator)


ELRA Newsletter, Vol 3 nr 2, May 1998

In this article we give a brief overview of ELSNET and its
activities, we show where ELSNET's and ELRA's interests
intersect, and we sketch a possible joint action in the field of
language resources, entitled BLARK, to be initiated under the
European Commission's Fifth Framework Programme.


ELSNET, the European Network of Excellence in Language and
Speech, came into existence in 1991, as one of the three pilot
Networks funded by the European Commission's ESPRIT Long Term
Research programme. The Network is hosted by the "Utrecht
Institute of Linguistics OTS" at the University of Utrecht. The
problem addressed by ELSNET is the construction of multilingual
integrated language and speech systems with unrestricted coverage
of spoken and written language. ELSNET brings together the main
European research teams in the field of natural language and
speech processing, currently some 50 from industry, and some 80
from academia (full list on http://www.elsnet.org/memberlist.html).

ELSNET's activities aim at facilitating, supporting and
coordinating the efforts of its members towards the creation of
language and speech systems.

Special attention is given to the integration of language and
speech, as, traditionally, the language and speech communities
appear to be living on different sides of a cultural and
methodological gap.

Ever since 1991, ELSNET has been active in four main areas:
training, research coordination, information dissemination, and
language resources.


Training

The ELSNET European Summer School has become a widely appreciated
tradition, and attracts students interested in topics on the
borderline between language and speech, such as Prosody,
Corpus-Based Methods, Multilinguality, Dialogue Systems, and
Lexicon Development. The 1998 Summer School will take place in
Barcelona, and will be dedicated to Robustness.

Short, intensive courses on advanced topics serve to keep
industry professionals informed of the latest developments in
specialized areas (Spoken Dialogue Systems in 1997, Terminology
late 1998 or early 1999).

A recent action is aimed at the development of a common
European Masters programme in Language and Speech, initiated by
the Socrates Thematic Network "Speech Communication Sciences".

Research coordination

The coordination of research is a delicate issue, especially
since ELSNET is not a funding agency for research projects. For
this reason ELSNET's research actions are indirect, and very much
focused on improving the conditions for the research community to
compare and interconnect their results. Evaluation and
standardization are therefore high on our research agenda, and a
number of project proposals in this field have been successfully
submitted, such as the DISC project aiming at establishing best
practice in Spoken Dialogue Systems, and the LE-4 ELSE project,
aiming at the creation of a European Evaluation Infrastructure
for Language and Speech.

Information dissemination

ELSNET aims to keep its membership informed of all relevant
actions and activities in the field, by means of its bimonthly
newsletter ELSNews (free subscription), its www pages
(http://www.elsnet.org) and its email list
(elsnet-list@elsnet.org). In addition, special dissemination
services are offered to projects.

Resources, our common past

Language resources are and have always been a very prominent
point on ELSNET's agenda. A special task group within ELSNET has
the responsibility to take new initiatives. For those who have
not been involved in ELRA from the very beginning, it is worth
mentioning that ELRA and ELSNET have a common past: the
RELATOR project, which was the starting point for the creation of
ELRA, was an ELSNET initiative. This clearly shows ELSNET's
interest in language resources, and the creation of ELRA has
not diminished our activities with respect to language resources.
Although the creation of new resources is far beyond ELSNET's
financial capabilities (and falls outside the scope of our
contract with the Commission), we are continuously exploring new
types of data and new ways of annotating them. Pilot studies
conducted under the auspices of our resources task group include
the annotation of parts of the ECI CDROM using the EAGLES
annotation scheme, for German and Italian, and currently ongoing
experiments with semantic annotation. The MATE project (LE-4),
which has just started, aims at the development of methods and
tools for dialogue annotation. ELSNET and ELRA have made a
formal collaboration agreement, which has led to the distribution
of data generated within ELSNET via the ELRA catalogue, joint
actions at e.g. Eurospeech 1997, and close collaboration on new

Our common future

As one may infer from the above, it is already clear that ELSNET
and ELRA together have an interesting common future ahead of
them, but there is one, in our view very exciting, field where no
collaboration has been set up yet, but where I hope that ELSNET
and ELRA can take a number of important initiatives in the near
future: Central and Eastern Europe.

ELSNET's geographical horizon extends beyond the outer borders
of the European Union. Already in 1994 ELSNET took an interest in
Central and Eastern Europe. With extra funding from the EC, a
first survey was made of actors in the field of NL and speech in
Central and Eastern Europe, and although it should be clear that
the survey was far from exhaustive, it was a first step towards
better disclosure of this vast geographical area to the Western
European R&D community.

At the same time, ELSNET received permission to include in its
membership four prominent research labs from Hungary, the Czech
Republic, Romania and Bulgaria.

A third initiative was the ELSNET Goes East project, a concerted
action under the INCO - Copernicus programme, aiming at laying
the grounds for an extension of ELSNET towards Central and
Eastern Europe. This very successful project, running from early
1995 until the end of 1997, has now resulted in a situation where
no fewer than twelve of ELSNET's members are situated in Central
and Eastern Europe, three of which are SMEs. In addition, a new
survey of Central and Eastern European NL and Speech actors has
been produced, and is about to be published (check the ELSNET www
site for the URL).

It is ELSNET's objective to build a Pan-European network, where
all actors in the field can participate in the further
development of language and speech systems on an equal footing.

If we look at the present situation in Central and Eastern
Europe, we can observe that many of our colleagues there are not
in ELSNET, not participating in any European projects, and not in
a position to compete intellectually or commercially with
organisations and institutions in Western Europe. One
factor is obviously the lack of financial resources. Even in the
countries that are about to join the Union, the conditions are
often still very harsh, and will change only very gradually. Neither
ELSNET nor ELRA is in a position to make a substantial
contribution to the solution of these problems, but it is
important that we remain aware of them in all our contacts and
communications with our Eastern colleagues.

But the financial conditions are not the only problem: in
general Eastern researchers are in a disadvantaged position with
respect to access to knowledge and expertise.

Here I think that organisations like ELRA and ELSNET can make
relevant contributions, just like professional organisations
active in NL and speech, such as EACL and ESCA.

When looking at the field of language resources, the place where
ELSNET's and ELRA's interests clearly intersect, I think that
the upcoming Fifth Framework Programme should include at least
one major action, from which not only the Central and Eastern
European countries would benefit, but also the countries of
the less favoured languages. Here is an outline of the action:

Step one: we define in a very generic way, equally applicable to
every language, the BLARK ("Basic LAnguage Resource Kit"), which
should contain a specification of
(a) the minimal general text corpus to be able to do any
    precompetitive research for the language at all, say (as an
    arbitrary example) 10 million words of recent newspaper text,
    annotated according to some generally accepted standards,
(b) something similar for a spoken text corpus,
(c) a collection of basic tools to manipulate and analyze the
(d) a collection of skills that constitute the minimal starting
    point for the development of a competitive NL/Speech
    technology industry.

Step two: we identify for each language to what extent the
basic resource kit already exists, and which elements are missing.
The elements can be data collections, tools, or courses and
course materials geared to the specific languages or language

Step three: we initiate a collection of coordinated actions to
fill in the gaps, i.e. prepare project proposals that aim at
providing the minimal coverage for (a) through (d) above for
every single European language.
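
The three steps above amount to a simple gap analysis, which can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions: the component labels, the class and function names,
and the two example languages are all hypothetical placeholders, not
part of any existing specification or survey.

```python
from dataclasses import dataclass, field

# Hypothetical labels mirroring items (a) through (d) of step one.
BLARK_COMPONENTS = ("text_corpus", "speech_corpus", "basic_tools", "skills")


@dataclass
class LanguageInventory:
    """What already exists for one language (step two)."""
    language: str
    available: set = field(default_factory=set)

    def missing(self):
        # Keep the (a)-(d) order when listing the gaps.
        return [c for c in BLARK_COMPONENTS if c not in self.available]


def gap_report(inventories):
    """Map each language to its missing components (input to step three)."""
    return {inv.language: inv.missing() for inv in inventories}


# Invented placeholder data, not survey results.
inventories = [
    LanguageInventory("language_A", {"text_corpus", "basic_tools"}),
    LanguageInventory("language_B", set(BLARK_COMPONENTS)),
]

report = gap_report(inventories)
for language, gaps in report.items():
    print(language, "->", gaps or "complete")
```

In such a scheme each non-empty list of gaps would correspond to one or
more project proposals under step three.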

Three questions: would this be a useful exercise? My answer is
yes. We should of course not fool ourselves into believing that
the basic resource kit as such would be a sufficient starting
point for building actual commercial applications. What it
should ensure is that there is a sufficient basis for exploratory
research, for the development of pilot demonstrators, and, last
but not least, for the training of the new generation of young
researchers and developers.

Who should fund this scheme? I think that it is clear that the
European Commission can play a very important role here, in close
collaboration with the national authorities. One might argue that
the Commission's famous "subsidiarity principle" would point more
to the national governments than to the Commission, but I see at
least three reasons why this should be set up as a European
action: (i) it is an expensive exercise, (ii) it should be kept
in mind that the way Commission projects are funded (with a
strong focus on industrial and user interests) has a general (not
necessarily intended) tendency to favour the three or four
commercially interesting European languages, so that it is only
fair for the Commission to give special support to other
languages, not promoted by strong industrial lobbies, and (iii)
organising it as a Europe-wide action would reduce the cost, both
financially and intellectually, and offer the possibility of
a wide variety of synergetic actions between teams in different
countries with similar problems.

And the third question: what could be the role of ELSNET and
ELRA? I think that the answer is clear: ELSNET and ELRA are
sitting on a considerable amount of expertise, knowledge, and
linguistic and intellectual resources, which may not always be
directly portable to new and other languages, but which, if
properly shared and disseminated, could provide an excellent
starting point for these actions, avoiding a situation where every
conceivable wheel is reinvented at regular intervals for each
single language. In addition it would ensure some degree of
connectivity between the results obtained, so that in the future
possible synergies can be exploited.

Concluding remarks

The Basic Language Resource Kit for every European language, the
BLARK, which I have been arguing for above, is not really new in the
sense that no one has ever thought of it before. For some
languages it may already exist, and for others most of it may be
already in place. We have seen a number of publicly funded
projects (e.g. under the Copernicus and INTAS programmes) that aim
to accomplish exactly such fragments of what I described above for
some of the Central and Eastern European languages. But I think
that we have not yet seen an appeal to the European Commission
and to the national authorities, to initiate a large scale
concerted effort to get this off the ground, and I hope that this
article will inspire at least some members of the ELSNET and ELRA
communities to help us start a movement in this direction. Let
us hope that before the end of the Fifth Framework programme,
every European language, inside or outside the European Union,
has its own BLARK.


