ELSNET-list archive

Category:   E-Announce
Subject:   NLG Challenge on Content Selection (Announcement and Call for Expressions of Interest)
Email:   nadjet.bouayad_(on)_upf.edu
Date received:   04 Jul 2012

Announcement and Call for Expressions of Interest

FIRST CONTENT SELECTION CHALLENGE
European Workshop on Natural Language Generation, 2013

We seek expressions of interest to participate in a challenge on content selection using freely available annotated Semantic Web data and texts. Please read on and, if you are interested, contact us (see the contact details at the end of this call).

---------------
Motivation:
---------------

So far, there has been little success in Natural Language Generation in coming up with general models of the content selection process. Most researchers in the field agree that this lack of success arises because the knowledge and context (communicative goals, user profile, discourse history, query, etc.) needed for this task depend on the application domain. In the past, this often led to template- or graph-based approaches that combine content selection with discourse structuring and operate on small, idiosyncratically encoded sets of input data. Furthermore, in many NLG applications, target texts and sometimes even empirical data are not available, which makes it difficult to employ empirical approaches to knowledge elicitation.

Nonetheless, during the last decade, there has been a steady flow of new work on content selection that employs machine learning, heuristic search, or a combination thereof. All of these strategies can deal with large volumes of data.

On the other hand, the continuous large-scale, community-based, open-source encoding of data in Semantic Web standards such as OWL and RDF within the Semantic Web and Linked Open Data communities means that now more than ever we have at our disposal a large pool of semantically encoded data and associated texts to work with.

For these reasons, we believe that the time has come to bring together researchers working on (or interested in working on) content selection to participate in a challenge for this task using standard, freely available web data as input.

This initial challenge presents a relatively simple content selection task with no user model and a straightforward communicative goal, so that people are encouraged to take part and motivated to stay on for later challenges, in which the task will be progressively extended in light of the experience gained. A content determination challenge will be a chance to (i) directly compare the performance of different types of content selection strategies; (ii) contribute towards developing a standard "off-the-shelf" content selection module; and (iii) contribute towards a standard interface between text planning and linguistic generation.

--------------------------
Outline of the task:
--------------------------

The core of the task to be addressed can be formulated as follows:

"Build a system which, given a set of RDF triples containing facts about a celebrity and a target text (for instance, a Wikipedia-style article about that person), selects those triples that are reflected in the target text."

------------------------
Domain and Data:
------------------------

The domain will be short biographies of famous people, owing to the availability of biography texts in Wikipedia and of rich data representations in the DBpedia or Freebase repositories. The data will consist, for each famous person, of a pair made up of an RDF triple set and its associated text(s). For each pair, the RDF data will include both information communicated in the text and information excluded from it. The text may convey information not present in the RDF triples, but this will be kept to a minimum, always subject to using naturally occurring texts. All pairs should contain enough RDF triples and text to make the pair interesting for the content selection task.

-----------------------------------------
Data Preparation and Release:
-----------------------------------------

Data preparation consists of 1) downloading, pairing and preprocessing the data and texts into a suitable format, and 2) selecting and annotating the working dataset.
The annotation task, in which the participants are encouraged to take part and which could be supported by some automatic anchoring techniques, consists of marking which triples are included in the text for each data-text pair of the working dataset. Annotation guidelines will be provided, with examples and descriptions of ambiguities and other issues and how to resolve them. The resulting annotated working dataset will be provided to the participants as a common set of "correct answers" to exploit in their approach. The participants will also be free to exploit a large portion of the non-marked paired corpus, as well as the data semantics (i.e., ontologies and the like).

--------------
Evaluation:
--------------

Once all participants have submitted their executables for the task, the evaluation set will be processed. If timing is tight, however, this could be done while the participants are still working on the task, or extra effort (for instance, from the organizers) could be brought in. A subset of the data will be randomly selected and annotated by the participants with the triples reflected in the text. This two-stage approach to triple selection annotation is proposed in order to avoid any bias in the evaluation data.

Each executable will be run against the test corpus, and the triples it selects will be evaluated against the gold triple selection set. Since this is formally a relatively simple task of selecting a subset of a given set, we will use standard precision, recall and F measures for evaluation. In addition, other appropriate metrics will be explored, for instance certain metrics for extractive summarisation (which is to some extent a similar task).

The organizers will explore whether it is feasible to select and annotate some test examples from a different corpus and have the systems evaluated on these as a separate task.
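
As an illustration only, the precision, recall and F measures mentioned above could be computed over sets of triples along the following lines. This is a minimal sketch, not the organizers' evaluation script: the function name precision_recall_f1, the representation of a triple as a (subject, predicate, object) tuple of strings, the use of the balanced F measure (F1), and the example triples are all assumptions made for this example.

```python
from typing import Set, Tuple

# Assumed representation: a triple is a (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


def precision_recall_f1(selected: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Score a system's selected triples against the gold selection.

    Precision: fraction of selected triples that are in the gold set.
    Recall:    fraction of gold triples that were selected.
    F1:        harmonic mean of precision and recall (balanced F measure).
    Empty sets yield 0.0 rather than a division error.
    """
    true_positives = len(selected & gold)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data; the predicate names are illustrative, not real DBpedia properties.
gold = {
    ("Ada_Lovelace", "birthPlace", "London"),
    ("Ada_Lovelace", "birthYear", "1815"),
    ("Ada_Lovelace", "field", "Mathematics"),
}
selected = {
    ("Ada_Lovelace", "birthPlace", "London"),
    ("Ada_Lovelace", "birthYear", "1815"),
    ("Ada_Lovelace", "spouse", "William_King"),
}

p, r, f = precision_recall_f1(selected, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Treating both the system output and the gold annotation as sets keeps the scoring independent of triple order and of duplicates, which matches the view of the task as selecting a subset of a given set.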

-------------------------
Proposed Timeline:
-------------------------

Preparation of the working dataset will start in the summer of 2012, once we have gathered sufficient interest from would-be participants. The challenge proper will take place between November 2012 and May/June 2013, as detailed below.

Data gathering and preparation                 Jul/Aug 2012
Working dataset selection and annotation       Sept/Oct 2012
Data release                                   November 2012
Evaluation dataset selection and annotation    May 2013
Evaluation                                     June 2013
Publication at EWNLG                           August 2013

------------------------------
Expressions of Interest:
------------------------------

In order to reach a quorum, we ask people interested in participating to send us an email expressing their interest as early as possible (i.e., by the 15th of July). The challenge is open to any approach, be it template-, rule- or heuristic-based, or empirical. We welcome approaches from communities other than Natural Language Generation (NLG), such as summarization and the Semantic Web.

------------------------------
Organizing committee:
------------------------------

Nadjet Bouayad-Agha, TALN Group, University Pompeu Fabra, Barcelona (Spain)
Gerard Casamayor
Leo Wanner
Chris Mellish, NLG Group, University of Aberdeen, Scotland (UK)

-----------
Contact:
-----------

nadjet.bouayad_(at)_upf.edu

__________________________________________
- ELSNET mailing list Elsnet-list_(at)_elsnet.org
- To manage your subscription go to: http://mailman.elsnet.org/mailman/listinfo/elsnet-list

Page generated 28-07-2012 by Steven Krauwer