ELSNET-list archive

Category:   E-CFP
Subject:   ATALA Workshop - Extended deadline
From:   Ghassan Mourad
Email:   Ghassan.Mourad_(on)_paris4.sorbonne.fr
Date received:   13 Oct 2003
Deadline:   10 Oct 2003
Start date:   22 Nov 2003

EXTENDED DEADLINE: October 10th 2003

CALL FOR WORKSHOP PAPERS
-----------------------------------------------------------------------
ATALA Workshop
**************************************
22 November 2003
ENST, 46, rue Barrault (49, rue Vergnault), 75013 Paris
****************************************************

Title: Role of typography and punctuation in natural language processing (text segmentation, prosody, syntactic analysis, information retrieval, coding in multilingual systems, …)

Organisation: Ghassan Mourad & Jean-Pierre Descles
Laboratory: LaLICC (UMR 8139 Paris-Sorbonne / CNRS)

Conference call

Objective:

Even though punctuation and typography are not usually regarded as subjects to be taught, we can hardly deny their role in reading and writing. This is also true for natural language processing, where punctuation plays an important role. Typographical and punctuation signs are "natural tags" of information, and indicators on which most of the processing should rely. It is essential to inventory and study all the issues that arise in multilingual, multi-script and multi-encoding processing.

The ATALA workshop is particularly concerned with current research on punctuation, typography, coding and transcription issues in linguistics and language processing, and with existing work in this specific domain or directly related to it.

Issues:

Linguistic engineering and language processing are confronted with new issues. Indeed, it is now necessary to work not only on isolated sentences or utterances, but also on entire texts, structured or unstructured: for example, texts from the Internet or from document bases stored by companies or administrations, encyclopaedias, or even dictionary articles. Moreover, texts are rarely tagged or digitised. However, text processing requires pre-processing before syntactic, semantic and pragmatic analysis can be conducted. In particular, each text has two structures: formal and discursive. The latter depends on the former.
The formal structure expresses a certain intended meaning; it results from the coding in a typographical system and from "text-setting", or text layout. The pre-processing of a text must exploit the formal structure (locating titles and subtitles; segmenting the text into sentences, paragraphs, utterances, propositions and words; identifying quotations; identifying item lists; taking spatial layout into account; locating images, diagrams, captions, boxes…) before executing other tasks (such as frame identification; identifying relations between concepts, terms and events; anaphoric and enunciative phenomena…). Without complete control over the exploitation of formal structure, text processing will not really be operational.

Obviously, this issue did not arise when we worked only on isolated sentences. However, for semantic analysis, text must be segmented into linguistic units that are larger or smaller than the normative sentence, by taking into account semiotic marks that are clearly and formally known to the computer. Punctuation and all typographic signs (cues) are still the most relevant elements, since they can provide sharp indications for formal text segmentation and structuring; these indications are the foundation of automatic textual linguistics.

We can distinguish between three types of approaches to segmentation:
(a) Numerical approaches (neural networks, N-grams, Markov models…);
(b) Finite automata and regular expression approaches (for instance INTEX);
(c) Contextual exploration approaches based on punctuation marks (for instance SegATex).

Traditional theories of punctuation (treatises, handbooks) are generally normative and do not allow the expression of precise rules that could lead to automatic segmentation. Furthermore, these treatises did not consider the semantic analysis of highly polysemous marks such as the comma, semicolon, colon, dash, parentheses, ... However, these marks play a very important role in processing and in the discursive structuring of text.
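
As an illustration of the general idea behind approaches (b) and (c) above, here is a minimal sketch in Python. It is not the method of INTEX or SegATex: the abbreviation list, the two context rules and the function name segment are assumptions introduced only for this example. A regular expression proposes candidate boundaries at punctuation marks, and the immediate left and right context then decides whether each candidate really ends a sentence.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e", "cf"}

# Candidate boundary: a period, exclamation mark or question mark
# that is immediately followed by whitespace.
CANDIDATE = re.compile(r"[.!?](?=\s)")


def segment(text):
    """Split text into sentences using punctuation marks plus local context."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        left_words = text[start:match.start()].split()
        left = left_words[-1].lower() if left_words else ""
        right = text[end:].lstrip()
        # Context rule 1 (assumed): a period after a known abbreviation
        # is not a sentence boundary.
        if match.group() == "." and left in ABBREVIATIONS:
            continue
        # Context rule 2 (assumed): the following segment must start with
        # an uppercase letter, a digit, or an opening quote or bracket.
        if right and not (right[0].isupper() or right[0].isdigit()
                          or right[0] in "\"'(«"):
            continue
        sentences.append(text[start:end].strip())
        start = end
    # Whatever remains after the last accepted boundary is the final segment.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith arrived at 5 p.m. yesterday. "
              "He cited e.g. the comma! Was it useful?")
    for sentence in segment(sample):
        print(sentence)
```

Even this small sketch shows why punctuation marks are polysemous for a program: the same period can close a sentence, mark an abbreviation, or sit inside a time expression, and only the surrounding context tells these cases apart.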
Text processing tools offer enormous potential for typographic variation, for example to highlight a quoted term, to give an example, or to disambiguate an expression… Quoting Ch. Gouriou: « A tout problème que pose la transcription de la pensée, la typographie se doit d'apporter au moins une solution ; elle en offre plusieurs dès que l'on la sollicite de faire valoir des nuances ou des subtilité » ("For every problem posed by the transcription of thought, typography must provide at least one solution; it offers several as soon as it is called upon to bring out nuances or subtleties"). However, the interpretation to be given to these variations is not regular and depends on other contextual elements (typography and punctuation); for example, an italicized expression does not have the same value (meaning) when it is capitalized as when it is between quotation marks. It is indeed a cluster of typographic marks, variable from text to text, that gives an occurrence of typographic change its value. Text processing must resolve these linguistic and computational issues.

Theme:

Submissions may also address cross-domain topics relating to:
- Formal segmentation of text,
- Discursive segmentation of text based on punctuation and typographic marks,
- "Textual architecture",
- The role of punctuation (particularly the comma) in syntactic analysis,
- Contribution of punctuation to the coding of prosody and contribution of typography to the coding of intonation,
- Contribution of punctuation to the identification of proper names, compound words, abbreviations, initials, …
- Comparison between punctuation in various linguistic systems (Arabic, Chinese…),
- Coding and transcription issues in various linguistic systems,
- …

Modalities:

Submission: a 2-4 page summary.

We ask authors to indicate whether their submission:
- presents work in progress or is a position paper;
- presents completed theoretical or applied work.
A 2-4 page summary must be sent before 10 October 2003 by e-mail in text, .rtf, .doc or .pdf format to:
Ghassan.Mourad_(on)_paris4.sorbonne.fr
and
Jean-Pierre.Descles@paris4.sorbonne.fr

Acceptance notifications will be sent by 20 October 2003.

************************************************************************
Ghassan Mourad
ISHA, Paris - Sorbonne
Laboratoire LaLICC (Langage, Logique, Informatique, Cognition et Communication)
(UMR 8139 Paris-Sorbonne / CNRS)
http://www.lalic.paris4.sorbonne.fr/
96, Bd Raspail
75006 Paris
France
tel : 01 44 39 35 90
fax : 01 44 39 35 91
 

Page generated 14-02-2008 by Steven Krauwer