We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decision flowcharts, which sheds some light on the interactions between NEs and MWEs. Moreover, to cope with the well-known difficulty of drawing a clear-cut boundary between compositional expressions and MWEs, we chose to use sufficient criteria only. As a result, annotated MWEs satisfy a varying number of sufficient criteria, which accounts for the scalar nature of MWE status. In addition to the span of the elements, the annotation includes the subcategory of NEs (e.g., person, location) and one matching sufficient criterion for non-verbal MWEs (e.g., lexical substitution). The 3,099 sentences of the treebank were double-annotated and adjudicated, and we paid attention to cross-type consistency and compatibility with the syntactic layer. Overall inter-annotator agreement on non-verbal MWEs and NEs reached 71.1%. The released corpus contains 3,112 annotated NEs and 3,440 MWEs, and is distributed under an open license.
Semantic prosody is typically described as an evaluative function of certain words or multiword items that appear with collocates of positive or negative meaning. The present study deals with the semantic prosody (context properties) of extended lexical units (ELUs) according to the psycholinguistic variables ‘valence’ (emotional positivity), ‘arousal’ (excitement, mood-enhancement), and ‘concreteness’. The objects of investigation are the verbal phrases feel blue (unambiguous idiomatic ELU, without a literal counterpart) and see red (ambiguous ELU, idiomatic or literal). The study builds on Snefjella & Kuperman (2016), who propose context norms for English words on the basis of a USENET mega-corpus. To detect ELU representations, a questionnaire-based survey was conducted with speakers of American English. To detect the context values of ELUs, a corpus study was carried out using the Corpus of Contemporary American English (COCA) and the News on the Web corpus (NOW). The results suggest that ELU contexts largely conform to the averaged context norms of ELU constituents. ELU representations are strongly dissociated from contexts.
We present a method for structural collocation extraction for an inflectional language (Polish), based on a process divided into two phases: (1) extraction and filtering of pairs of lemmatised wordforms and (2) structural annotation of the extracted collocations with lexico-syntactic patterns. The pattern templates and parameters are specified manually, but their instances are both generated and tested on the corpus automatically. The extracted collocations were evaluated by applying them as rules in the morphosyntactic disambiguation of Polish and by comparing them with a list of two-word expressions extracted from two Polish dictionaries.
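The two-phase process can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the simplified part-of-speech tags, the use of adjacent pairs only, the pointwise mutual information (PMI) filter, and the function names `extract_pairs` and `annotate_patterns`. The abstract does not say which association measure, window, or pattern language the method actually uses.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each sentence is a list of (lemma, POS) tokens.
# The tags are simplified placeholders, not a real Polish tagset.
CORPUS = [
    [("biały", "adj"), ("dom", "subst"), ("stać", "verb")],
    [("biały", "adj"), ("dom", "subst"), ("być", "verb")],
    [("stary", "adj"), ("dom", "subst"), ("stać", "verb")],
    [("biały", "adj"), ("kruk", "subst"), ("być", "verb")],
]


def extract_pairs(corpus, min_count=2, min_pmi=0.0):
    """Phase 1: count adjacent lemma pairs and filter them.

    PMI is an assumed stand-in for whatever filter the method uses.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        lemmas = [lemma for lemma, _ in sent]
        unigrams.update(lemmas)
        bigrams.update(zip(lemmas, lemmas[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    kept = {}
    for (a, b), count in bigrams.items():
        p_pair = count / n_bi
        p_indep = (unigrams[a] / n_uni) * (unigrams[b] / n_uni)
        pmi = math.log2(p_pair / p_indep)
        if count >= min_count and pmi >= min_pmi:
            kept[(a, b)] = pmi
    return kept


def annotate_patterns(corpus, pairs, templates):
    """Phase 2: attach to each pair the template it matches most often.

    Templates are manually specified POS sequences; their instances are
    counted on the corpus automatically. Pairs matching no template are
    left out of the result.
    """
    votes = {pair: Counter() for pair in pairs}
    for sent in corpus:
        for (l1, p1), (l2, p2) in zip(sent, sent[1:]):
            if (l1, l2) in votes and (p1, p2) in templates:
                votes[(l1, l2)][(p1, p2)] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in votes.items() if c}


pairs = extract_pairs(CORPUS)
annotated = annotate_patterns(CORPUS, pairs, templates={("adj", "subst")})
```

With the single assumed template `("adj", "subst")`, the pair `("biały", "dom")` receives a structural annotation, while `("dom", "stać")` passes the phase-1 filter but matches no template and is therefore not annotated.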
Every language has a specific combinatorial periphery consisting of words whose combinatorial potential is extremely restricted. These words, which are usually referred to as bound words, unique words, cranberry words or monocollocable words (MWs), belong to small and closed collocation paradigms, with the number of collocates ranging from one to a few (usually up to about 7). The present article describes the phenomenon of monocollocability in Italian, basing the analysis on a list of Italian MWs extracted from corpora and contained in the book Language Periphery, Monocollocable Words in English, German, Italian and Czech (Čermák et al., 2016). Italian MWs and the fixed combinations in which they occur are analysed in terms of syntactic structures, semantic features, collocation structures and frequency. Monocollocability is a phenomenon subject to change over time: even though MWs are often considered to be “relicts of the past”, the collected data show that a progressive restriction of the combinatorial capacity of certain words can be observed in Present-Day Italian as well.
All natural languages contain words that lack syntactic and semantic autonomy and can exist only within a lexical combination. These words, referred to as cranberry words, bound words, unique words or monocollocable words (MWs), are characterized by an extremely restricted collocational range (usually from 1 to about 7 collocates). This article aims to describe the phenomenon of monocollocability in present-day Italian, drawing on the lists of Italian MWs extracted from corpora and contained in the book Language Periphery, Monocollocable Words in English, German, Italian and Czech (Čermák et al., 2016). Italian MWs and the expressions they belong to are analysed from the syntactic, semantic, collocational and frequency perspectives. Monocollocability is a phenomenon that changes over time: although MWs are often regarded as “relics of the past”, the collected data show that a progressive transformation of some words with a wide collocational range into monocollocable words is also taking place in the current lexicon.
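The notion of a restricted collocational range can be made concrete with a small sketch that flags words with very few distinct neighbours. The bigram list, the thresholds and the function name `monocollocable_candidates` are hypothetical, and the Italian example is chosen for illustration rather than taken from the cited list. Real candidates would still need the corpus-based and linguistic analysis the article describes.

```python
from collections import defaultdict

# Hypothetical toy data: adjacent word pairs as they might be read off a corpus.
BIGRAMS = [
    ("a", "bizzeffe"),
    ("a", "bizzeffe"),
    ("a", "casa"),
    ("a", "Roma"),
    ("casa", "mia"),
    ("casa", "nuova"),
]


def monocollocable_candidates(bigrams, max_collocates=7, min_freq=2):
    """Return words whose set of distinct neighbours is small.

    max_collocates mirrors the "one to about seven" range mentioned in the
    text; min_freq is an assumed guard against words that merely look
    restricted because they are rare.
    """
    collocates = defaultdict(set)
    freq = defaultdict(int)
    for left, right in bigrams:
        collocates[left].add(right)
        collocates[right].add(left)
        freq[left] += 1
        freq[right] += 1
    return {
        word: sorted(neighbours)
        for word, neighbours in collocates.items()
        if len(neighbours) <= max_collocates and freq[word] >= min_freq
    }


# A stricter threshold is used here only because the toy data are tiny.
candidates = monocollocable_candidates(BIGRAMS, max_collocates=1)
```

On this toy input only `bizzeffe` is flagged, with `a` as its single collocate; `Roma` also has one neighbour but falls below the assumed frequency guard.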
The paper gives a brief overview of the most important phraseological studies and dictionaries published in Italy since the 1980s, when Italian phraseology took its first steps as an independent linguistic discipline. It deals with fundamental theoretical contributions and describes major dictionaries of idioms and proverbs published in the last 40 years. It also points to new trends in Italian phraseography and presents some interesting current projects based on new corpus methodologies.