EVALITA 2016
PoS tagging for Italian Social Media Texts (PoSTWITA)
- TASK GUIDELINES -
Andrea Bolioli, CELI
Cristina Bosco, Università di Torino
Alessandro Mazzei, Università di Torino
Fabio Tamburini, Università di Bologna
1. INTRODUCTION
The following are the guidelines for the PoSTWITA task of the EVALITA 2016 evaluation campaign.
Participants in the evaluation task are required to use the data provided by the organisation to set up their systems. The organisation will provide two data sets: the first one, referred to as the Development Set (DS), contains data manually annotated with a specific tagset (see Section 2.1 for the tagset description) and must be used to train the participants' systems; the second one, referred to as the Test Set (TS), contains the test data in blind format for the evaluation and will be given to the participants on the date scheduled for the evaluation.
Participants are allowed to use other resources in their systems, both for training and for enhancing final performance, as long as their output conforms to the proposed tagset.
Each participating team is also required to send a paper (in electronic format) containing a brief description of its system, covering in particular the techniques and resources used, and (if available) a complete bibliographic reference.
2. DATA DESCRIPTION
During the evaluation campaign, and before the date scheduled for the evaluation, all participants are invited and encouraged to report to the organisers any errors found in the Development Set. This will allow us to update it and redistribute it to the participants in an enhanced form.
Participants are not allowed to distribute the EVALITA data.
We do not distribute a lexical resource with the EVALITA 2016 data. Each participant is allowed to use any available lexical resource, or may freely induce one from the training data.
All the data will be provided as plain text files in UNIX format, so attention must be paid to the newline character format.
2.1. Tagset
In EVALITA2016-PoSTWITA we decided to follow the tokenisation strategy and the tagset proposed in the Universal Dependencies (UD) project for Italian (http://universaldependencies.org/it/pos/index.html), applying only minor changes. This makes the EVALITA2016-PoSTWITA gold standard annotations compliant with the UD tagset and tokenisation, and allows conversion towards this standard with very little effort.
The main modifications concern two tokenisation problems and the insertion of some Twitter-specific tags:
a) in the Italian UD specifications (mainly devoted to solving parsing problems) there are two cases in which single words are split into two different tokens: articulated prepositions (e.g. dalla, nell’, al...) and clitic clusters attached to the end of a verb form (e.g. suonargliela, regalaglielo, dandolo...). Even though splitting these words into different tokens (e.g. dalla = da + la, nell’ = in + lo, suonar + glie + la, regala + glie + lo, dando + lo ...) is common practice in treebank building, it is not suitable for PoS-tagging tasks. For this reason, we decided not to adhere to the UD tokenisation rules in these cases, keeping such words unsplit and assigning them two specific tags, ADP_A and VERB_CLIT respectively, as done in previous EVALITA PoS tagging evaluations;
b) according to the UD specifications, all the Internet- and Twitter-specific tokens should be classified into the SYM (symbol) class. We decided to further enrich this class by adding more specific tags for the morphological categories that often occur in social media texts, namely emoticons, Internet addresses, email addresses, hashtags and mentions (EMO, URL, EMAIL, HASHTAG and MENTION).
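As an illustration only (not part of the official annotation procedure), the following minimal Python sketch shows how the Twitter-specific classes above could be pre-detected with regular expressions; the patterns and the pretag() helper are our own assumptions, not the rules used to build the gold standard.

import re

# Illustrative patterns only: the gold standard was annotated manually,
# and these regexes are assumptions, not the official definitions.
TWITTER_PATTERNS = [
    ("URL",     re.compile(r"https?://\S+|www\.\S+")),
    ("EMAIL",   re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("HASHTAG", re.compile(r"#\w+")),
    ("MENTION", re.compile(r"@\w+")),
    # Covers a few ASCII smileys plus ^_^ and <3; real emoji (e.g. U+1F601)
    # would need explicit Unicode ranges, omitted here for brevity.
    ("EMO",     re.compile(r"[:;=8][-o*']?[)\](\[dDpP/\\]|\^_\^|<3")),
]

def pretag(token):
    """Return a Twitter-specific tag if the whole token matches, else None."""
    for tag, pattern in TWITTER_PATTERNS:
        if pattern.fullmatch(token):
            return tag
    return None

# Example: pretag("#staisereno") -> "HASHTAG", pretag("Governo") -> None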
The resulting tagset is summarised in the following table (a blank UD cell means the EVALITA16 tag specialises the UD tag of the row above):

UD    | EVALITA16 | CATEGORY                      | EXAMPLES (if different from UD specifications)
------+-----------+-------------------------------+-----------------------------------------------
ADJ   | ADJ       | Adjective                     | -
ADP   | ADP       | Adposition (simple prep.)     | di, a, da, in, con, su, per…
      | ADP_A     | Adposition (prep. + article)  | dalla, nella, sulla, dell’…
ADV   | ADV       | Adverb                        | -
AUX   | AUX       | Auxiliary Verb                | -
CONJ  | CONJ      | Coordinating Conjunction      | -
DET   | DET       | Determiner                    | -
INTJ  | INTJ      | Interjection                  | -
NOUN  | NOUN      | Noun                          | -
NUM   | NUM       | Numeral                       | -
PART  | PART      | Particle                      | -
PRON  | PRON      | Pronoun                       | -
PROPN | PROPN     | Proper Noun                   | -
PUNCT | PUNCT     | Punctuation                   | -
SCONJ | SCONJ     | Subordinating Conjunction     | -
SYM   | SYM       | Symbol                        | -
      | EMO       | Emoticon/Emoji                | :-) ^_^ ♥ :P 😁
      | URL       | Web Address                   | http://www.somewhere.it
      | EMAIL     | Email Address                 | someone@somewhere.com
      | HASHTAG   | Hashtag                       | #staisereno
      | MENTION   | Mention                       | @someone
VERB  | VERB      | Verb                          | -
      | VERB_CLIT | Verb + clitic pronoun cluster | mangiarlo, donarglielo…
X     | X         | Other or RT/rt                | -
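As an illustration of the small conversion effort mentioned above, a hypothetical mapping from the EVALITA16 tags back to plain UD PoS tags might look as follows. Note that fully restoring UD tokenisation for ADP_A and VERB_CLIT would additionally require re-splitting the tokens, which this sketch does not attempt; the names EVALITA_TO_UD and to_ud are our own.

EVALITA_TO_UD = {
    "ADP_A": "ADP",        # full UD compliance would split this into ADP + DET
    "VERB_CLIT": "VERB",   # full UD compliance would split this into VERB + clitic PRON
    "EMO": "SYM", "URL": "SYM", "EMAIL": "SYM",
    "HASHTAG": "SYM", "MENTION": "SYM",
}

def to_ud(tag):
    """Map an EVALITA16 tag to its (approximate) UD counterpart."""
    return EVALITA_TO_UD.get(tag, tag)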
Proper Noun Management
The annotation of named entities
(NE) poses a number of relevant problems in tokenization and PoS tagging.
The most coherent way to handle such
kind of phenomena is to consider each NE as a unique token assigning to it the
PROPN tag. Unfortunately this is not a viable solution for this evaluation
task, and, moreover, a lot of useful generalisation on trigram sequences (e.g. Ministero/dell’/Interno – PROPN/ADP_A/PROPN) would be lost if
adopting such kind of solution.
In any case, the annotation of sequences like “Banca Popolare” and “Presidente della Repubblica Italiana” deserves some attention and a clear policy. Following the approach applied in EVALITA 2007 for the PoS tagging task, we annotate as PROPN those words of the NE that bear an uppercase initial, as in the following examples:
Banca PROPN Popolare PROPN
Presidente PROPN della ADP_A Repubblica PROPN Italiana PROPN
Ordine PROPN dei ADP_A Medici PROPN
Nevertheless, in some other cases an uppercase initial has not been considered sufficient to warrant the PROPN tag:
“... anche nei Paesi dove …” (“... even in the Countries where …”), “... in contraddizione con lo Stato sociale …” (“... in contradiction with the welfare State …”).
This strategy aims to produce a data set that reflects the speaker’s linguistic intuition about this kind of structure, regardless of whether the knowledge involved can be formalised.
Foreign words
Non-Italian words are annotated, when possible, following the same PoS tagging criteria adopted in the UD guidelines for the language in question.
Example: “good-bye” INTJ
2.2. Data Preparation Notes
Each tweet in the data sets is considered a separate entity. The total amount of manually annotated data has been split between the DS and the TS. Thread integrity is not preserved, so taggers have to process each tweet separately.
3. TOKENISATION ISSUES
The problem of text segmentation (tokenisation) is a central issue in PoS-tagger evaluation and comparison: in principle, every system could apply its own tokenisation rules, leading to different outputs.
In this EVALITA task we provide all the development and test data in tokenised format, one token per line followed by its tag (when applicable). The first line of each tweet contains the tweet ID, and different tweets are separated by an empty line.
Example:
_____162545185920778240_____
Governo PROPN
Monti PROPN
: PUNCT
decreto NOUN
in ADP
cdm PROPN
per ADP
approvazione NOUN
! PUNCT
http://t.co/Z76KLLGP URL
_____192902763032743936_____
#Ferrara HASHTAG
critica VERB
#Grillo HASHTAG
perché SCONJ
dice VERB
cose NOUN
che PRON
dicevano VERB
Berlusconi PROPN
e CONJ
Bossi PROPN
. PUNCT
E CONJ
che PRON
non ADV
hanno AUX
fatto VERB
. PUNCT
The example above shows some tokenisation and formatting conventions:
- accents are coded using the UTF-8 encoding;
- the apostrophe is tokenised separately only when used as a quotation mark, not when it signals a removed character (e.g. dell’orto is tokenised as dell’ + orto, with the apostrophe kept attached to the first token).
Participants are requested to return the test file using exactly the same tokenisation format and containing exactly the same number of tokens. The comparison with the reference file will be performed line by line, so any misalignment will produce wrong results.
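As a convenience only, here is a minimal Python sketch of how the format described above could be parsed, and how a submission could be checked for alignment before sending it in. The function names (read_tweets, check_alignment) are our own, not part of the guidelines, and whitespace-separated columns are assumed.

def read_tweets(path):
    """Parse a PoSTWITA-format file into a list of (tweet_id, [(token, tag), ...])."""
    tweets, tokens, tweet_id = [], [], None
    with open(path, encoding="utf-8") as f:      # data are UTF-8 plain text
        for line in f:
            line = line.rstrip("\n")
            if not line:                         # empty line: tweet boundary
                if tweet_id is not None:
                    tweets.append((tweet_id, tokens))
                tokens, tweet_id = [], None
            elif line.startswith("_____"):       # _____ID_____ line opens a tweet
                tweet_id = line.strip("_")
            else:
                parts = line.split()             # token, optionally followed by its tag
                tokens.append((parts[0], parts[1] if len(parts) > 1 else None))
    if tweet_id is not None:                     # last tweet, if no trailing blank line
        tweets.append((tweet_id, tokens))
    return tweets

def check_alignment(blind_path, tagged_path):
    """Verify that a tagged file carries exactly the token stream of the blind TS."""
    blind, tagged = read_tweets(blind_path), read_tweets(tagged_path)
    if len(blind) != len(tagged):
        return False
    for (bid, btoks), (tid, ttoks) in zip(blind, tagged):
        if bid != tid or [t for t, _ in btoks] != [t for t, _ in ttoks]:
            return False
    return True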
The TS will contain only the tokenised words, without the correct tags, which have to be added by the participant systems and submitted for the evaluation. The correctly tokenised and tagged version of the TS (called the gold standard TS) will be used for the evaluation and will be provided to the participants after the evaluation, together with their scores.
4. EVALUATION METRICS
The evaluation is performed in a “black box” approach: only the systems’ output is evaluated.
The evaluation metric is based on a token-by-token comparison, and only ONE tag is allowed for each token. The metric considered is tagging accuracy, defined as the number of correct PoS tag assignments divided by the total number of tokens in the TS:
accuracy = (number of correctly tagged tokens) / (total number of tokens in the TS)
A baseline algorithm (Most Frequent Tag assignment) and some well-known PoS-taggers will be used as references for comparison purposes.
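For concreteness, here is a minimal sketch of the Most Frequent Tag baseline, reusing the hypothetical read_tweets() helper from Section 3. The fallback to the globally most frequent tag for unseen tokens is our assumption, since the guidelines do not specify how the baseline handles them.

from collections import Counter, defaultdict

def train_mft(train_tweets):
    """Learn, for every training token, its most frequent tag."""
    per_token, overall = defaultdict(Counter), Counter()
    for _, tokens in train_tweets:
        for token, tag in tokens:
            per_token[token][tag] += 1
            overall[tag] += 1
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in per_token.items()}
    default = overall.most_common(1)[0][0]       # fallback for unseen tokens
    return lexicon, default

def tag_mft(tweets, lexicon, default):
    """Tag every token with its most frequent training tag (or the fallback)."""
    return [(tid, [(tok, lexicon.get(tok, default)) for tok, _ in toks])
            for tid, toks in tweets]

def accuracy(gold, predicted):
    """Token-by-token tagging accuracy, as defined by the official metric."""
    correct = total = 0
    for (_, gtoks), (_, ptoks) in zip(gold, predicted):
        for (_, gtag), (_, ptag) in zip(gtoks, ptoks):
            correct += (gtag == ptag)
            total += 1
    return correct / total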
5. EVALUATION DETAILS AND SCHEDULING
30th May 2016                        | Release of the guidelines and development set (DS)
12th September 2016                  | Release of the test set (TS)
19th September 2016 (midnight - CET) | Tagged version of the test set due by participants
On 30th May the task organisers will send by email to the registered participants these guidelines and the Development Set (DS), in the format described in Section 3. All the data will be provided as plain text files in UNIX format, so pay attention to the newline character format.
On 12th September the organisers will send the Test Set (TS) to the participants by email.
Participants are required to return the tagged version of the TS file (without any change in the token stream) to the organisers by 19th September (midnight - CET), naming the file EVALITA16_POSTWITA_participantname. The file must be sent by email to the address: fabio.tamburini@unibo.it.
Only one version of the result file per participant team will be accepted.
After the submission deadline, the organisers will evaluate the systems’ results and send back to the participants their scores, together with the gold standard version of the TS.