Laboratory for the Analysis of Linguistic Data
Postgraduate degree (Laurea specialistica) in Theoretical
and Applied Linguistics, Università di Pavia
Andrea Sansò
Academic year 2005-2006
Advanced course
10 CFU (university credits)
Laboratory for the analysis of linguistic
resources
Part four
Lexicons
Resources for linguistic typology
Tools and technologies for
creating linguistic resources
Lexicons
A definition:
“A computational lexicon is a very complex – and expensive –
component to be built adequately. It must contain, in an explicit and
formalised way, all the information which a native speaker uses in
everyday situations, from the simpler orthographic, phonetic,
morphologic information, to the more complex syntactic, semantic,
pragmatic, logical, ontological, multilingual information. A ‘complete’
lexicon should practically incorporate our ‘knowledge of the world’,
and represent it in an explicit and formal way”
N. Calzolari, “Computational lexicons and corpora. Complementary components in
human language technology”, in P. van Sterkenburg (ed.), Linguistics Today –
Facing a Greater Challenge, 89-107. Amsterdam-Philadelphia: J. Benjamins, 2004.
Lexicons
Lexicons and corpora
This knowledge of the world is a changing and
continuously growing object that cannot be “frozen” in
a static lexicon.
The only way of reflecting and capturing all the potentialities of a
language relies on trying to extract the linguistic and lexical
information not only from ‘experts’, i.e. native speakers or linguists,
but from the texts themselves in which the language is actually
used, with a continuous process of enrichment. From these
considerations the importance of corpora obviously emerges.
N. Calzolari, ibidem
Lexicons
Lexicons and corpora
(L → C: the lexicon supports corpus processing; C → L: the corpus feeds the lexicon)
L → C: POS tagging / lemmatisation
C → L: frequencies of different linguistic objects
C → L: proper nouns / named entity recognition
L → C: syntactic parsing
C → L: updating / tuning a lexicon
C → L: collocational data
Lexicons
Lexicons and corpora
C → L: semantic clustering and ‘nuances’ of meaning
L → C: semantic mark-up
C → L: lexical knowledge acquisition
L → C: word sense disambiguation
C → L: validation of lexical models
C → L: corpus-based computational lexicography
Lexicons
Lexicons and corpora
Example: Italian chiedere vs domandare
From a theoretical (introspective) point of view they are synonyms;
printed dictionaries give them the same definition.
But:
• domandare is almost always used in the interrogative sense
(ask to know), whereas chiedere is often used in the imperative sense
(ask to have);
• chiedere is much more frequent than domandare.
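The frequency contrast between the two verbs is the kind of information a corpus can supply to a lexicon. The following is a minimal Python sketch of such a count. It assumes a tiny invented text sample and a hand-made, deliberately incomplete table of inflected forms (both hypothetical; a real study would use a lemmatised, POS-tagged corpus):

```python
import re
from collections import Counter

# Hypothetical and incomplete form-to-lemma table, for illustration only.
# Ambiguous forms such as "domanda" (also a noun) are left out on purpose.
FORM_TO_LEMMA = {
    "chiedere": "chiedere", "chiedo": "chiedere", "chiede": "chiedere",
    "chiesto": "chiedere", "chiese": "chiedere",
    "domandare": "domandare", "domando": "domandare",
    "domandato": "domandare", "domandò": "domandare",
}


def lemma_frequencies(text):
    """Count occurrences of each lemma through its listed word forms."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(FORM_TO_LEMMA[t] for t in tokens if t in FORM_TO_LEMMA)


# Invented sample text, not taken from any real corpus.
SAMPLE = (
    "Mi ha chiesto un favore. Gli chiedo sempre il conto. "
    "Le domandò che ore fossero. Chiede aiuto a tutti."
)

counts = lemma_frequencies(SAMPLE)
for lemma, n in counts.most_common():
    print(lemma, n)
```

Distinguishing the "ask to know" and "ask to have" senses would additionally require inspecting the complements of each occurrence, which this sketch does not attempt.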
FrameNet
FrameNet (FN) is a corpus-based lexicon-building project that
documents the links between lexical items and the semantic frame(s)
they evoke; it accomplishes this by annotating sets of sentences that
exemplify the items being described, and performing various operations
on the resulting annotations. The basic units in FN descriptions are the
frame and the lexical unit (LU), the latter understood as the pairing of a
“word” with just one of its meanings; thus, a word with four meanings is
treated as four lexical units. In most cases, for a word to have more than
one meaning implies that it belongs to more than one frame.
Charles J. Fillmore, Collin F. Baker, and Hiroaki Sato, “FrameNet as
a ‘Net’”, in Proceedings of the 4th International Conference on
Language Resources and Evaluation, Lisbon 2004, pp. 1091-1094.
FrameNet
Main components of the FrameNet database
(1) the frame ontology,
(2) the set of annotated sentences, and
(3) the set of lexical entries.
The basis of the ontology is the set of frames, each of which consists of an
informal characterization of a situation type (the frame definition), together
with a collection of frame elements (FEs). The FEs are the semantic roles of
the entities involved in each frame. FE names are used as labels for the words
or phrases that are in grammatical construction with the L(exical) U(nit)s that
evoke that particular frame. For example, the frame that includes the English
verb inform has as its core FEs SPEAKER, ADDRESSEE and MESSAGE.
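The relation between frames, frame elements and lexical units can be pictured as a small data structure. The sketch below is only an illustration in Python, not the FrameNet database schema: the class and field names are assumptions, and `INFORM_FRAME` and its definition text are placeholders, since the slide does not name the frame evoked by inform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    """A word paired with exactly one of its meanings."""
    lemma: str
    pos: str
    frame_name: str


@dataclass
class Frame:
    """A situation type together with the semantic roles (FEs) it involves."""
    name: str
    definition: str
    core_fes: tuple
    lexical_units: list = field(default_factory=list)

    def add_lu(self, lemma, pos):
        """Register a lexical unit that evokes this frame."""
        lu = LexicalUnit(lemma, pos, self.name)
        self.lexical_units.append(lu)
        return lu


# Placeholder frame name and invented definition (assumptions, see above).
inform_frame = Frame(
    name="INFORM_FRAME",
    definition="A speaker conveys a message to an addressee.",
    core_fes=("SPEAKER", "ADDRESSEE", "MESSAGE"),
)
inform = inform_frame.add_lu("inform", "v")


def lus_for(lemma, frames):
    """All lexical units of a lemma: one per frame the word belongs to."""
    return [lu for f in frames for lu in f.lexical_units if lu.lemma == lemma]


print(lus_for("inform", [inform_frame]))
```

A word with four meanings would appear here as four `LexicalUnit` objects, typically attached to different `Frame` objects.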
FrameNet
The example sentences are selected by FrameNet annotators as
representing the typical uses of the LUs belonging to individual frames.
Each set of annotations is centered around a particular LU; the sentence’s
constituents are labeled (with FE names) according to the ways in which
they fill in information about the frame. For example, sentences (1) and
(2) have SPEAKER appearing as subject, and ADDRESSEE as object;
the MESSAGE FE appears as a that clause in sentence (1), and as an
event-naming nominalization introduced by of in sentence (2).
(1) [SPEAKER We] informed [ADDRESSEE the press] [MESSAGE that the prime
minister has resigned]
(2) [SPEAKER We] informed [ADDRESSEE the press] [MESSAGE of the prime
minister’s resignation]
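The bracket notation used in sentences (1) and (2) is easy to process automatically. The following Python sketch reads it back into a plain sentence plus a mapping from FE names to their fillers. The function name and the regular expression are assumptions made for this illustration; this is not the format in which FrameNet distributes its annotations.

```python
import re

# Matches "[FE_NAME filler text]" as written in the examples above.
ANNOTATION = re.compile(r"\[([A-Z_]+) ([^\]]+)\]")


def parse_annotation(sentence):
    """Return (plain_text, {FE name: filler}) for a bracket-annotated sentence.

    If an FE name occurred twice, the later filler would overwrite the earlier.
    """
    fes = {name: filler for name, filler in ANNOTATION.findall(sentence)}
    plain = ANNOTATION.sub(lambda m: m.group(2), sentence)
    return plain, fes


s1 = ("[SPEAKER We] informed [ADDRESSEE the press] "
      "[MESSAGE that the prime minister has resigned]")
s2 = ("[SPEAKER We] informed [ADDRESSEE the press] "
      "[MESSAGE of the prime minister's resignation]")

for s in (s1, s2):
    plain, fes = parse_annotation(s)
    print(plain)
    print(fes)
```

Comparing the `MESSAGE` fillers of the two sentences shows the alternation between the that-clause and the of-phrase.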
FrameNet
The lexical entry for each LU is a summary of what has been recorded in its
annotations, presented as valence descriptions, showing all the ways in which its
frame elements can be realized, such as the alternative syntactic realizations of
the MESSAGE just shown for the verb inform. The collection of annotated
sentences is made available in the database as evidence for the analysis.
The first and most obvious way in which LUs are related to each other is
through membership in the same frame. Thus inform shares a frame with the
verbs notify and announce, and also with the nouns notification and
announcement, and the verb resign shares frame membership with its nominal
partner resignation, and with verbal expressions like abdicate, step down and
stand down. But LUs can also be related to each other in other ways, either
because their frames are related to other frames, or through semantic properties
(called semantic types in the FN database) assigned to LUs individually rather
than through their frames.
FrameNet
Semantic types:
The FrameNet database allows the assignment of semantic types to
LUs, FEs and frames. The perception verbs hear vs. listen are
distinguished as passive versus active perception verbs, and so,
respectively, are see vs. look. Hearing and seeing are things that
happen to you, listening and looking are things that you do, and this
difference is considered important enough to merit entry into separate
frames. In the FN database, hear and see and the passive perception
uses of other sensory words, such as feel, taste and smell, belong to
the Perception_experience frame; the verbs look and listen belong to
the Perception_active frame, along with the corresponding active
uses of feel, taste and smell.
FrameNet
Subframes
Subframes are used for representing subevents; frames that represent
complex processes have subframes representing their subparts. To
take a simple example, the Motion_scenario frame has three
subframes, Departing, Motion, and Arriving. In this case, the
subframes are temporally ordered, but in general, subframes need not
be completely ordered with respect to each other. For example, the
Commercial_transaction frame has two subframes, Commerce_goods-transfer
and Commerce_money-transfer, but these are not
ordered with respect to each other. In some commercial transactions,
you pay in advance, in others, only after receiving the goods or
services.
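The difference between ordered and unordered subframes can be sketched as a partial order. The Python example below is an illustration under stated assumptions: the class `ComplexFrame` and its methods are hypothetical names, frame names are written with underscores, and the precedence pairs simply encode the temporal order Departing, Motion, Arriving described above.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexFrame:
    """A frame with subframes and an optional partial order among them."""
    name: str
    subframes: list
    precedes: set = field(default_factory=set)  # pairs (earlier, later)

    def comes_before(self, a, b):
        """True if a precedes b, directly or via intermediate subframes."""
        seen, stack = set(), [a]
        while stack:
            current = stack.pop()
            for earlier, later in self.precedes:
                if earlier == current and later not in seen:
                    if later == b:
                        return True
                    seen.add(later)
                    stack.append(later)
        return False

    def are_ordered(self, a, b):
        """True if the two subframes are ordered in either direction."""
        return self.comes_before(a, b) or self.comes_before(b, a)


motion = ComplexFrame(
    "Motion_scenario",
    ["Departing", "Motion", "Arriving"],
    {("Departing", "Motion"), ("Motion", "Arriving")},
)
commerce = ComplexFrame(
    "Commercial_transaction",
    ["Commerce_goods-transfer", "Commerce_money-transfer"],
)

print(motion.comes_before("Departing", "Arriving"))
print(commerce.are_ordered("Commerce_goods-transfer", "Commerce_money-transfer"))
```

Leaving `precedes` empty for the commercial transaction reflects the fact that payment may come before or after the transfer of goods.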
FrameNet in action…
http://framenet.icsi.berkeley.edu/index.php
The full documentation is available in a manual:
http://framenet.icsi.berkeley.edu/index.php?option=com_wrapper&Itemid=126
FrameNet in other languages
Salsa Project – FrameNet in German
http://www.coli.uni-saarland.de/projects/salsa/
Spanish FrameNet
http://gemini.uab.es/SFN/index.html
WordNet
A lexical reference system available online:
http://wordnet.princeton.edu
Word meanings are represented by sets of
synonyms (synsets). Relations such as meronymy, hypernymy
and antonymy are also represented.
Up-to-date bibliography:
http://mira.csci.unt.edu/~wordnet/
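The idea of synsets linked by semantic relations can be illustrated with a toy model. The Python sketch below does not use the WordNet data or any WordNet library: the class `Synset`, the identifiers `toy-001` to `toy-003` and the glosses are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonyms standing for one meaning."""
    ident: str
    lemmas: frozenset
    gloss: str
    hypernyms: list = field(default_factory=list)  # more general synsets
    meronyms: list = field(default_factory=list)   # synsets naming parts


# Invented toy entries (identifiers and glosses are not from WordNet).
vehicle = Synset("toy-001", frozenset({"vehicle"}),
                 "something used to transport people or goods")
car = Synset("toy-002", frozenset({"car", "auto", "automobile"}),
             "a road vehicle with an engine", hypernyms=[vehicle])
wheel = Synset("toy-003", frozenset({"wheel"}),
               "a circular part that turns on an axle")
car.meronyms.append(wheel)

INVENTORY = [vehicle, car, wheel]


def synsets_for(word, inventory):
    """All synsets (meanings) in which a word form appears."""
    return [s for s in inventory if word in s.lemmas]


def hypernym_closure(synset):
    """All synsets reachable by following hypernym links upward."""
    result = []
    for h in synset.hypernyms:
        result.append(h)
        result.extend(hypernym_closure(h))
    return result


print([s.ident for s in synsets_for("auto", INVENTORY)])
print([s.ident for s in hypernym_closure(car)])
print([s.ident for s in car.meronyms])
```

A polysemous word would simply be a member of several `Synset` objects, one per meaning.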
Other multilingual lexicons
Mimida:
http://www.gittens.nl/SemanticNetworks.html
A multilingual lexicon based on WordNet and on vocabularies
freely available on the web.
MultiWordNet:
http://multiwordnet.itc.it/english/home.php
A multilingual lexicon (Italian, Spanish, Hebrew, Romanian) in
which synsets are aligned, wherever possible, with the synsets of
Princeton WordNet. Developed at IRST-ITC in Povo (Trento).
EuroWordNet
http://www.illc.uva.nl/EuroWordNet
A similar project for European languages; a demo is available
for download.
The individual wordnets are linked to an Interlingual Index, which is
based on the American WordNet and makes it possible to go from a
word in one language to an equivalent word in another. The
index also gives access to a shared ontology of
63 semantic distinctions, which provides a common semantic
basis for the various languages.
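Moving from a word in one language to an equivalent in another through a shared index can be sketched as a lookup in two steps. The Python example below uses invented data: the record identifiers, the word lists and the concept labels are hypothetical and are not taken from EuroWordNet.

```python
# Invented index records: each links words of several languages.
ILI_TO_WORDS = {
    "ili-0001": {"en": ["dog"], "it": ["cane"], "es": ["perro"], "nl": ["hond"]},
    "ili-0002": {"en": ["house", "home"], "it": ["casa"], "es": ["casa"],
                 "nl": ["huis"]},
}

# Invented links from index records to language-independent distinctions.
ILI_TO_CONCEPTS = {
    "ili-0001": ["Animal"],
    "ili-0002": ["Building"],
}


def records_for(word, language):
    """Index records to which a word of the given language is linked."""
    return [ili for ili, words in ILI_TO_WORDS.items()
            if word in words.get(language, [])]


def translate(word, source, target):
    """Target-language words linked to the same record(s) as the source word."""
    results = []
    for ili in records_for(word, source):
        for candidate in ILI_TO_WORDS[ili].get(target, []):
            if candidate not in results:
                results.append(candidate)
    return results


def shared_concepts(word, language):
    """Language-independent distinctions reachable from a word via the index."""
    return [c for ili in records_for(word, language)
            for c in ILI_TO_CONCEPTS.get(ili, [])]


print(translate("cane", "it", "nl"))
print(translate("casa", "es", "en"))
print(shared_concepts("huis", "nl"))
```

Because every language links to the index rather than directly to every other language, adding a new wordnet only requires linking it to the index records.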
Other initiatives
EAGLES project (Expert Advisory Group for Language
Engineering Standards):
http://www.ilc.cnr.it/EAGLES96/home.html
• development of standards in morphosyntax, syntax and
semantics
• awareness of the interdependence between lexical
specifications and corpus tagsets / syntactic annotations
• the standards developed were used to create resources
(both corpora and lexicons) within the European projects
PAROLE and SIMPLE
Other initiatives
ISLE project (International Standards for Language
Engineering), Computational Lexicon Working Group:
http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm
• a continuation of the EAGLES project
• development of a general schema for encoding multilingual lexical
information (MILE, Multilingual ISLE Lexical Entry)
• a commitment to reaching consensus on de facto standards through a
bottom-up procedure
• a commitment to maximising interaction and synergies with those working
on the semantic web
Other initiatives
PAROLE project:
http://www.ub.es/gilcub/SIMPLE/simple.html
• goal: to produce in Europe an initial core of harmonised corpora and
lexicons (Catalan, Danish, Dutch, English,
Finnish, French, German, Greek, Italian, Portuguese,
Spanish, Swedish)
• Information encoded:
• Morphology: written forms, including stems and variants;
morphosyntactic category; inflected forms; morphological
features; derivation; abridged forms
Other initiatives
PAROLE project:
• Information encoded:
• Syntax: subcategorization patterns; grammatical
relations of subcategorised complements; control;
diathesis and lexical alternations; pronominalization;
linear order constraints; constraints on the syntactic
context where the lexical entry is inserted; idioms and
collocations
Other initiatives
SIMPLE project:
http://www.ub.es/gilcub/SIMPLE/simple.html
• Adds a semantic layer to PAROLE
• “The first attempt to tackle harmonised encoding
of semantic types and semantic
(subcategorisation) frames on a large scale, i.e. for
so many languages and with wide coverage”
Other initiatives
SIMPLE project:
• Semantic information: semantic type; domain
information; lexicographic gloss; argument structure for
predicative semantic units; event type, to characterise the
aspectual properties of verbal predicates; links of the
arguments to the syntactic subcategorization frames;
‘qualia’ structure, represented by a very large and granular
set of semantic relations and features; regular polysemous
alternations (e.g. container for content); hyponymy,
synonymy, etc.
Two types of typological databases
Databases that collect and document primary language data
e.g.
Agreement database
Autotyp
Reflexives and intensifiers database
Stresstyp...
Databases documenting secondary language data
e.g.
Noun Phrase Universals Database (Edinburgh)
The Universals Archive (Konstanz)
Das grammatikalische Raritätenkabinett (Konstanz)
http://ling.uni-konstanz.de/pages/proj/sprachbau.htm
Typological databases
http://www.lotschool.nl/Research/ltrc/databases/index.htm
lists the typological databases developed within
the LTRC project (Utrecht)
Particularly user-friendly:
Typological Database of Intensifiers and Reflexives (TDIR):
http://noam.philologie.fu-berlin.de/~gast/tdir/index.htm
Reduplication database: http://ling.uni-graz.at/redup/
The SMG databases:
http://www.smg.surrey.ac.uk/
Typological databases
World Atlas of Language Structures
The World Atlas of Language Structures consists of 142 maps with accompanying
texts on diverse features (such as vowel inventory size, noun-genitive order, passive
constructions, and 'hand'/'arm' polysemy), each of which is the responsibility of a
single author (or team of authors). Each map shows between 120 (35) and 1110
languages, each language being represented by a dot, and different dot colors
showing different values of the features. Altogether 2,650 languages are shown on
the maps, and more than 58,000 dots give information on features in particular
languages.
Tools for typological research:
http://lingweb.eva.mpg.de/fieldtools/tools.htm
Tools and technologies for
creating resources
Specialised tools
Fieldwork:
Shoebox: http://www.sil.org/computing/shoebox
Fieldworks Data Notebook: http://fieldworks.sil.org (open source)
Speech analysis:
Praat: http://fonsg3.hum.uva.nl/praat (free)
SpeechAnalyzer: http://www.sil.org/computing/speechtools/speechanalyzer.htm
(version 2.1 is not free; version 1.5 is free)
Annotation tools:
CLAN: http://childes.psy.cmu.edu (free)
Other tools are listed in the links section of the LARL page (categories: concordancing tools and other
linguistic resources)
Tools and technologies for
creating resources
Specialised tools
Morphological taggers:
Morph-it: morphological tagger for Italian; an online demo is available at:
http://sslmitdev-online.sslmit.unibo.it/linguistics/morph-it.php
POS taggers:
CLAWS:
www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
TreeTagger:
www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
Tools and technologies for
creating resources
Specialised tools
Text encoding
DBT (DataBase Testuale): software for text analysis and full-text querying
developed by E. Picchi (ILC, CNR, Pisa)
http://www.ilc.cnr.it/pisystem/demo/index.html
The LARL holds a corpus of L2 Italian and the LIP corpus (Lessico di
frequenza dell’italiano parlato), both searchable through DBT