EVALITA 2016
PoS tagging for Italian Social Media Texts (PoSTWITA)
- TASK GUIDELINES -
Andrea Bolioli, CELI
Cristina Bosco, Università di Torino
Alessandro Mazzei, Università di Torino
Fabio Tamburini, Università di Bologna
1. INTRODUCTION
The following are the guidelines for the PoSTWITA task of the EVALITA 2016 evaluation campaign.
Participants in the evaluation task are required to use the data provided by the organisation to set up their systems. The organisation will provide two data sets: the first one, referred to as the Development Set (DS), contains data manually annotated with a specific tagset (see Section 2.1 for the tagset description) and must be used to train the participants' systems; the second one, referred to as the Test Set (TS), contains the test data in blind format for the evaluation and will be given to the participants on the date scheduled for the evaluation.
Participants are allowed to use other resources in their systems, both for training and for enhancing final performance, as long as their output conforms to the proposed tagset.
Each participating team is also required to send a paper (in electronic format) containing a brief description of its system, covering in particular the techniques and resources used, and (if available) a complete bibliographic reference.
2. DATA DESCRIPTION
During the evaluation campaign, and before the date scheduled for the evaluation, all participants are invited and encouraged to report to the organisers any errors found in the Development Set. This will allow us to update it and redistribute it to the participants in an enhanced form.
Participants are not allowed to distribute the EVALITA data.
We do not distribute a lexical resource with the EVALITA 2016 data. Each participant is allowed to use any available lexical resource, or may freely induce one from the training data.
All the data will be provided as plain text files in UNIX format, so attention must be paid to the newline character format.
2.1. Tagset
In EVALITA2016-PoSTWITA we decided to follow the tokenisation strategy and the tagset proposed in the Universal Dependencies (UD) project for Italian (http://universaldependencies.org/it/pos/index.html), applying only minor changes. This makes the EVALITA2016-PoSTWITA gold standard annotations compliant with the UD tagset and tokenisation, and allows conversion towards this standard with very little effort.
The main modifications concern two tokenisation problems and the insertion of some Twitter-specific tags:
a) in the Italian UD specifications (mainly devoted to solving parsing problems) there are two cases in which single words are split into two different tokens: articulated prepositions (e.g. dalla, nell’, al...) and clitic clusters attached to the end of a verb form (e.g. suonargliela, regalaglielo, dandolo...). Even though splitting these words into different tokens (e.g. dalla = da + la, nell’ = in + lo, suonar + glie + la, regala + glie + lo, dando + lo ...) is common practice in treebank building, it is not suitable for PoS-tagging tasks. For this reason, we decided not to adhere to the UD tokenisation rules in these cases, keeping such words unsplit and assigning them two specific tags, ADP_A and VERB_CLIT respectively, as done in previous EVALITA PoS tagging evaluations;
b) according to the UD specifications, all the Internet- and Twitter-specific tokens should be classified into the SYM (symbol) class. We decided to further enrich this class by adding more specific tags for the morphological categories that often occur in social media texts, namely emoticons, Internet addresses, email addresses, hashtags and mentions (EMO, URL, EMAIL, HASHTAG and MENTION).
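As an illustration only (not part of the official annotation procedure), the following minimal Python sketch shows how the Twitter-specific classes above could be pre-detected with regular expressions; the patterns and the pretag() helper are our own assumptions, not the rules used to build the gold standard.

import re

# Illustrative patterns only: the gold standard was annotated manually,
# and these regexes are assumptions, not the official definitions.
TWITTER_PATTERNS = [
    ("URL",     re.compile(r"https?://\S+|www\.\S+")),
    ("EMAIL",   re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("HASHTAG", re.compile(r"#\w+")),
    ("MENTION", re.compile(r"@\w+")),
    # Covers a few ASCII smileys plus ^_^ and <3; real emoji (e.g. U+1F601)
    # would need explicit Unicode ranges, omitted here for brevity.
    ("EMO",     re.compile(r"[:;=8][-o*']?[)\](\[dDpP/\\]|\^_\^|<3")),
]

def pretag(token):
    """Return a Twitter-specific tag if the whole token matches, else None."""
    for tag, pattern in TWITTER_PATTERNS:
        if pattern.fullmatch(token):
            return tag
    return None

# Example: pretag("#staisereno") -> "HASHTAG", pretag("Governo") -> None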
The resulting tagset is summarised in the following table (a blank UD cell means the EVALITA16 tag specialises the UD tag of the row above):

UD    | EVALITA16 | CATEGORY                      | EXAMPLES (if different from UD specifications)
------+-----------+-------------------------------+-----------------------------------------------
ADJ   | ADJ       | Adjective                     | -
ADP   | ADP       | Adposition (simple prep.)     | di, a, da, in, con, su, per…
      | ADP_A     | Adposition (prep. + article)  | dalla, nella, sulla, dell’…
ADV   | ADV       | Adverb                        | -
AUX   | AUX       | Auxiliary Verb                | -
CONJ  | CONJ      | Coordinating Conjunction      | -
DET   | DET       | Determiner                    | -
INTJ  | INTJ      | Interjection                  | -
NOUN  | NOUN      | Noun                          | -
NUM   | NUM       | Numeral                       | -
PART  | PART      | Particle                      | -
PRON  | PRON      | Pronoun                       | -
PROPN | PROPN     | Proper Noun                   | -
PUNCT | PUNCT     | Punctuation                   | -
SCONJ | SCONJ     | Subordinating Conjunction     | -
SYM   | SYM       | Symbol                        | -
      | EMO       | Emoticon/Emoji                | :-) ^_^ ♥ :P 😁
      | URL       | Web Address                   | http://www.somewhere.it
      | EMAIL     | Email Address                 | someone@somewhere.com
      | HASHTAG   | Hashtag                       | #staisereno
      | MENTION   | Mention                       | @someone
VERB  | VERB      | Verb                          | -
      | VERB_CLIT | Verb + clitic pronoun cluster | mangiarlo, donarglielo…
X     | X         | Other or RT/rt                | -
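As an illustration of the small conversion effort mentioned above, a hypothetical mapping from the EVALITA16 tags back to plain UD PoS tags might look as follows. Note that fully restoring UD tokenisation for ADP_A and VERB_CLIT would additionally require re-splitting the tokens, which this sketch does not attempt; the names EVALITA_TO_UD and to_ud are our own.

EVALITA_TO_UD = {
    "ADP_A": "ADP",        # full UD compliance would split this into ADP + DET
    "VERB_CLIT": "VERB",   # full UD compliance would split this into VERB + clitic PRON
    "EMO": "SYM", "URL": "SYM", "EMAIL": "SYM",
    "HASHTAG": "SYM", "MENTION": "SYM",
}

def to_ud(tag):
    """Map an EVALITA16 tag to its (approximate) UD counterpart."""
    return EVALITA_TO_UD.get(tag, tag)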
Proper Noun Management
The annotation of named entities
(NE) poses a number of relevant problems in tokenization and PoS tagging.
The most coherent way to handle such
kind of phenomena is to consider each NE as a unique token assigning to it the
PROPN tag. Unfortunately this is not a viable solution for this evaluation
task, and, moreover, a lot of useful generalisation on trigram sequences (e.g. Ministero/dell’/Interno – PROPN/ADP_A/PROPN) would be lost if
adopting such kind of solution.
In any case, the annotation of sequences like “Banca Popolare” and “Presidente della Repubblica Italiana” deserves some attention and a clear policy. Following the approach applied in EVALITA 2007 for the PoS tagging task, we annotate as PROPN those words of the NE that bear an uppercase initial, as in the following examples:
Banca PROPN Popolare PROPN
Presidente PROPN della ADP_A Repubblica PROPN Italiana PROPN
Ordine PROPN dei ADP_A Medici PROPN
Nevertheless, in some other cases an uppercase initial has not been considered sufficient to warrant the PROPN tag:
“... anche nei Paesi dove …” (“... even in the Countries where …”), “... in contraddizione con lo Stato sociale …” (“... in contradiction with the welfare State …”).
This strategy aims to produce a data set that reflects the speaker’s linguistic intuition about this kind of structure, regardless of whether the knowledge involved can be formalised.
Foreign words
Non-Italian words are annotated, when possible, following the same PoS tagging criteria adopted in the UD guidelines for the language in question.
Example: “good-bye” INTJ
2.2. Data Preparation Notes
Each tweet in the data sets is considered a separate entity. The total amount of manually annotated data has been split between the DS and the TS. Thread integrity is not preserved, so taggers have to process each tweet separately.
3. TOKENISATION ISSUES
The problem of text segmentation (tokenisation) is a central issue in PoS-tagger evaluation and comparison: in principle, every system could apply its own tokenisation rules, leading to different outputs.
In this EVALITA task we provide all the development and test data in tokenised format, one token per line followed by its tag (when applicable). The first line of each tweet contains the tweet ID, and different tweets are separated by an empty line.
Example:
_____162545185920778240_____
Governo PROPN
Monti PROPN
: PUNCT
decreto NOUN
in ADP
cdm PROPN
per ADP
approvazione NOUN
! PUNCT
http://t.co/Z76KLLGP URL
_____192902763032743936_____
#Ferrara HASHTAG
critica VERB
#Grillo HASHTAG
perché SCONJ
dice VERB
cose NOUN
che PRON
dicevano VERB
Berlusconi PROPN
e CONJ
Bossi PROPN
. PUNCT
E CONJ
che PRON
non ADV
hanno AUX
fatto VERB
. PUNCT
The example above shows some tokenisation and formatting conventions:
- accents are coded using the UTF-8 encoding;
- the apostrophe is tokenised separately only when used as a quotation mark, not when it signals a removed character (e.g. dell’orto is tokenised as dell’ + orto, with the apostrophe kept attached to the first token).
Participants are requested to return the test file using exactly the same tokenisation format and containing exactly the same number of tokens. The comparison with the reference file will be performed line by line, so any misalignment will produce wrong results.
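As a convenience only, here is a minimal Python sketch of how the format described above could be parsed, and how a submission could be checked for alignment before sending it in. The function names (read_tweets, check_alignment) are our own, not part of the guidelines, and whitespace-separated columns are assumed.

def read_tweets(path):
    """Parse a PoSTWITA-format file into a list of (tweet_id, [(token, tag), ...])."""
    tweets, tokens, tweet_id = [], [], None
    with open(path, encoding="utf-8") as f:      # data are UTF-8 plain text
        for line in f:
            line = line.rstrip("\n")
            if not line:                         # empty line: tweet boundary
                if tweet_id is not None:
                    tweets.append((tweet_id, tokens))
                tokens, tweet_id = [], None
            elif line.startswith("_____"):       # _____ID_____ line opens a tweet
                tweet_id = line.strip("_")
            else:
                parts = line.split()             # token, optionally followed by its tag
                tokens.append((parts[0], parts[1] if len(parts) > 1 else None))
    if tweet_id is not None:                     # last tweet, if no trailing blank line
        tweets.append((tweet_id, tokens))
    return tweets

def check_alignment(blind_path, tagged_path):
    """Verify that a tagged file carries exactly the token stream of the blind TS."""
    blind, tagged = read_tweets(blind_path), read_tweets(tagged_path)
    if len(blind) != len(tagged):
        return False
    for (bid, btoks), (tid, ttoks) in zip(blind, tagged):
        if bid != tid or [t for t, _ in btoks] != [t for t, _ in ttoks]:
            return False
    return True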
The TS will contain only the tokenised words, without the correct tags, which have to be added by the participant systems and submitted for the evaluation. The correctly tokenised and tagged version of the TS (called the gold standard TS) will be used for the evaluation and will be provided to the participants after the evaluation, together with their scores.
4. EVALUATION METRICS
The evaluation is performed in a “black box” approach: only the systems’ output is evaluated.
The evaluation metric is based on a token-by-token comparison, and only ONE tag is allowed for each token. The metric considered is tagging accuracy, defined as the number of correct PoS tag assignments divided by the total number of tokens in the TS:
accuracy = (number of correctly tagged tokens) / (total number of tokens in the TS)
A baseline algorithm (Most Frequent Tag assignment) and some well-known PoS-taggers will be used as references for comparison purposes.
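For concreteness, here is a minimal sketch of the Most Frequent Tag baseline, reusing the hypothetical read_tweets() helper from Section 3. The fallback to the globally most frequent tag for unseen tokens is our assumption, since the guidelines do not specify how the baseline handles them.

from collections import Counter, defaultdict

def train_mft(train_tweets):
    """Learn, for every training token, its most frequent tag."""
    per_token, overall = defaultdict(Counter), Counter()
    for _, tokens in train_tweets:
        for token, tag in tokens:
            per_token[token][tag] += 1
            overall[tag] += 1
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in per_token.items()}
    default = overall.most_common(1)[0][0]       # fallback for unseen tokens
    return lexicon, default

def tag_mft(tweets, lexicon, default):
    """Tag every token with its most frequent training tag (or the fallback)."""
    return [(tid, [(tok, lexicon.get(tok, default)) for tok, _ in toks])
            for tid, toks in tweets]

def accuracy(gold, predicted):
    """Token-by-token tagging accuracy, as defined by the official metric."""
    correct = total = 0
    for (_, gtoks), (_, ptoks) in zip(gold, predicted):
        for (_, gtag), (_, ptag) in zip(gtoks, ptoks):
            correct += (gtag == ptag)
            total += 1
    return correct / total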
5. EVALUATION DETAILS AND SCHEDULING
30th May 2016                        | Release of the guidelines and development set (DS)
12th September 2016                  | Release of the test set (TS)
19th September 2016 (midnight - CET) | Tagged version of the test set due by participants
On 30th May the task organisers will send by email to the registered participants these guidelines and the Development Set (DS), in the format described in Section 3. All the data will be provided as plain text files in UNIX format, so pay attention to the newline character format.
On 12th September the organisers will send the Test Set (TS) to the participants by email.
Participants are required to return the tagged version of the TS file (without any change in the token stream) to the organisers by 19th September (midnight - CET), naming the file EVALITA16_POSTWITA_participantname. The file must be sent by email to the address: fabio.tamburini@unibo.it.
Only one version of the result file per participant team will be accepted.
After the submission deadline, the organisers will evaluate the systems’ results and send back to the participants their scores, together with the gold standard version of the TS.