Laboratorio di Linguistica
'Giovanni Nencioni'
Corpus e Lessico di Frequenza dell'Italiano Scritto (CoLFIS)

 

 

 

The reference corpus consists of excerpts from newspapers (published between 1992 and 1994), magazines, and books, including textbooks and books on professional topics. The corpus comprises 3,798,275 lexical occurrences, distributed as follows:

NEWSPAPERS 1,836,119
MAGAZINES 1,306,653
BOOKS 655,503
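
As a quick arithmetic check, the short sketch below adds up the three figures above and derives each text type's share of the corpus. The variable names are illustrative only and are not part of any CoLFIS distribution.

```python
# Occurrence counts per text type, as reported in the description above.
counts = {
    "NEWSPAPERS": 1_836_119,
    "MAGAZINES": 1_306_653,
    "BOOKS": 655_503,
}

# The three components should add up to the stated corpus size.
total = sum(counts.values())

# Share of each text type, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)  # 3798275
for name, share in shares.items():
    print(f"{name}: {share}%")
```

Newspapers thus account for roughly 48% of the occurrences, magazines for about 34%, and books for about 17%.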

The corpus was designed to approximate as closely as possible what the average Italian prefers to read, as reported by official statistics (ISTAT). It thus mirrors the actual experience of Italian readers.

For a detailed corpus description, see:
Laudanna, A., Thornton, A.M., Brown, G., Burani, C. and Marconi, L. (1995). Un corpus dell'italiano scritto contemporaneo dalla parte del ricevente. In S. Bolasco, L. Lebart and A. Salem (Eds.), III Giornate internazionali di Analisi Statistica dei Dati Testuali. Volume I, pp. 103-109. Roma: Cisu.

The frequency lexicon consists of two main components: the forms repertoire and the lemmas repertoire.
The forms repertoire lists the frequency of each form found in the corpus, without distinguishing between the different lexical entries it may belong to. For instance, porti counts as a single form, regardless of whether it is read as a noun or as a verb (see below).
The lemmas repertoire, by contrast, disambiguates identical forms that belong to different lemmas. For instance, porti is listed either as the plural of porto 'harbor' or as the second person singular present indicative of portare 'to bring'. In addition, the lemmas repertoire treats syntagmatic words as single entries. By 'syntagmatic words' we mean complex expressions consisting of two or more words, whose meaning is often independent of that of the individual components. Examples are Divina Commedia '(Dante's) Divine Comedy', gamba di tavolo 'table leg', a causa di 'due to', and spesse volte 'often'.
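
The difference between the two repertoires can be sketched with a toy example. The annotated units below are invented for illustration and are not CoLFIS data; the tuple layout, the part-of-speech labels, and the choice to count each word of a multi-word unit as a separate form are assumptions made only for this sketch.

```python
from collections import Counter

# Hypothetical hand-annotated units: (surface text, lemma, part of speech).
# Invented for illustration; not taken from CoLFIS.
annotated = [
    ("porti", "porto", "noun"),
    ("porti", "portare", "verb"),
    ("porti", "porto", "noun"),
    ("a causa di", "a causa di", "preposition"),
]

# Forms repertoire: one count per orthographic form, ignoring its reading.
# Assumption: each word of a multi-word unit is counted as its own form.
form_freq = Counter(
    word for surface, _, _ in annotated for word in surface.split()
)

# Lemmas repertoire: one count per (lemma, part of speech) pair,
# so a syntagmatic word such as "a causa di" is a single entry.
lemma_freq = Counter((lemma, pos) for _, lemma, pos in annotated)

print(form_freq["porti"])                        # 3
print(lemma_freq[("porto", "noun")])             # 2
print(lemma_freq[("portare", "verb")])           # 1
print(lemma_freq[("a causa di", "preposition")]) # 1
```

The three occurrences of porti collapse into one entry in the form counts but are split between two entries in the lemma counts, which is the distinction described above.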

The advantages of CoLFIS over previous Italian frequency lists can be summarized as follows:

  1. Careful corpus balancing. This makes the quantitative information less dependent on chance;
  2. Corpus size. Although modern computational technology provides efficient automatic lemmatization tools, few lexical resources of comparable size have had the output of the automatic processing systematically checked by human operators. This check makes the lemmatization more reliable, especially for syntagmatic words.