SemEval 2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM) by dimsum16

The DiMSUM shared task at SemEval 2016 is concerned with predicting, given an English sentence, a broad-coverage representation of lexical semantics. The representation consists of two closely connected facets: a segmentation into minimal semantic units, and a labeling of some of those units with semantic classes known as supersenses.

For example, given the POS-tagged sentence

I_PRON googled_VERB restaurants_NOUN in_ADP the_DET area_NOUN and_CONJ Fuji_PROPN Sushi_PROPN came_VERB up_ADV and_CONJ reviews_NOUN were_VERB great_ADJ so_ADV I_PRON made_VERB a_DET carry_VERB out_ADP order_NOUN

the goal is to predict the representation

I googled_v.communication restaurants_n.group in the area_n.location and Fuji_Sushi_n.group came_up_v.communication and reviews_n.communication were_v.stative great so I made_ a carry_out_v.possession _order_v.communication

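Noun supersenses start with n., verb supersenses with v., and _ joins tokens within a multiword expression. (carry_out_v.possession and made_order_v.communication are separate MWEs; the trailing and leading underscores in made_ ... _order mark a gappy MWE whose gap contains a carry out.)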
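To make the two facets concrete, the example above can be encoded as a partition of token indices into units, each carrying an optional supersense. This is a minimal Python sketch; the `units` layout and the helper names `is_partition` and `mwe_strings` are illustrative assumptions, not the format of the released data (the file formats are described in README.md).

```python
from typing import Optional

# Tokens of the example sentence (POS tags omitted).
tokens = ("I googled restaurants in the area and Fuji Sushi came up and "
          "reviews were great so I made a carry out order").split()

# Hypothetical encoding: each unit is (token indices, supersense or None).
# MWEs list all of their token indices, which need not be contiguous.
units: list[tuple[list[int], Optional[str]]] = [
    ([0], None),                                  # I
    ([1], "v.communication"),                     # googled
    ([2], "n.group"),                             # restaurants
    ([3], None), ([4], None),                     # in the
    ([5], "n.location"),                          # area
    ([6], None),                                  # and
    ([7, 8], "n.group"),                          # Fuji_Sushi
    ([9, 10], "v.communication"),                 # came_up
    ([11], None),                                 # and
    ([12], "n.communication"),                    # reviews
    ([13], "v.stative"),                          # were
    ([14], None), ([15], None), ([16], None),     # great so I
    ([17, 21], "v.communication"),                # made_ ... _order (gappy)
    ([18], None),                                 # a (inside the gap)
    ([19, 20], "v.possession"),                   # carry_out (inside the gap)
]

def is_partition(units, n_tokens: int) -> bool:
    """True if every token index 0..n_tokens-1 belongs to exactly one unit."""
    covered = sorted(i for idxs, _ in units for i in idxs)
    return covered == list(range(n_tokens))

def mwe_strings(tokens, units) -> list[str]:
    """Render multiword units as underscore-joined tokens plus their label."""
    return [f"{'_'.join(tokens[i] for i in idxs)} {label}"
            for idxs, label in units if len(idxs) > 1]

print(is_partition(units, len(tokens)))  # → True
print(mwe_strings(tokens, units))
# → ['Fuji_Sushi n.group', 'came_up v.communication', 'made_order v.communication', 'carry_out v.possession']
```

Listing indices explicitly, rather than relying on adjacency, is what lets a single unit such as made_ ... _order span a gap.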

The two facets of the representation are discussed in greater detail below. Systems are expected to produce both facets, though the manner in which they do this (e.g., pipeline vs. joint model) is up to you.

Gold-standard training data labeled with the combined representation is provided in two domains: online reviews and tweets. (Rules for using other data resources are given under Data conditions.) Blind test data will cover these two domains as well as a third, surprise domain. The domain of each sentence will not be indicated as part of the input at test time. The three test domains will have equal weight in the overall system scores (see Scoring).
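One way to read "equal weight" is as a macro-average: each domain's score counts once, however many items that domain contributes. The Python sketch below contrasts that with pooling counts across domains. It is an illustration under that assumption only; the domain names and counts are made up, and the official aggregation is whatever the scoring procedure specifies.

```python
Counts = tuple[int, int, int]  # (true positives, predicted items, gold items)

def f1(tp: int, pred: int, gold: int) -> float:
    """F1 = 2PR / (P + R), which simplifies to 2*tp / (pred + gold)."""
    return 2 * tp / (pred + gold) if pred + gold else 0.0

def macro_f1(by_domain: dict[str, Counts]) -> float:
    """Each domain counts equally, regardless of its size."""
    return sum(f1(*c) for c in by_domain.values()) / len(by_domain)

def pooled_f1(by_domain: dict[str, Counts]) -> float:
    """Counts are summed first, so larger domains dominate."""
    tp, pred, gold = (sum(col) for col in zip(*by_domain.values()))
    return f1(tp, pred, gold)

# Hypothetical numbers: strong on a large domain, weak on two small ones.
example = {"reviews": (80, 100, 100), "tweets": (10, 50, 50), "surprise": (10, 50, 50)}
print(macro_f1(example))   # about 0.4: per-domain F1s 0.8, 0.2, 0.2 averaged
print(pooled_f1(example))  # 0.5: 2*100 / (200 + 200)
```

Under equal weighting, doing well only on the largest domain is not enough to score well overall.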

Minimal semantic units

The word tokens of the sentence are partitioned into basic units of lexical meaning. Equivalently, where multiple tokens function together as an idiomatic whole, they are grouped together into a multiword expression (MWE). MWEs include: nominal compounds like hot dog; verbal expressions like do away with 'eliminate', make decisions 'decide', kick the bucket 'die'; PP idioms like at all and on the spot 'without planning'; multiword prepositions/connectives like in front of and due to; multiword named entities; and many other kinds.

- Input word tokens are never subdivided.
- Grouped tokens do not have to be contiguous; e.g., verb-particle constructions are annotated whether they are contiguous (make up the story) or gappy (make the story up). There are, however, formal constraints on gaps to facilitate sequence tagging.
- Combinations considered to be statistical collocations (yet compositional in meaning) are called "weak MWEs", distinguished from MWEs with idiomatic meanings ("strong MWEs"). Otherwise, different categories of MWEs are not explicitly annotated.
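Constraints on gaps are what make a sequence-tagging encoding of gappy MWEs possible. The Python sketch below assigns a six-tag, gappy BIO-style encoding: uppercase O/B/I for tokens outside any gap, lowercase o/b/i for tokens that fall inside another MWE's gap. The function name `gappy_bio`, the input format (sorted index lists, one per MWE), and the exact tag letters are assumptions for illustration; it ignores the strong/weak distinction and does not validate the gap constraints. The tags actually used in the data are documented in TAGSET.

```python
def gappy_bio(n_tokens: int, mwes: list[list[int]]) -> list[str]:
    """Tag each token O/B/I, lowercased when it lies inside some MWE's gap.

    mwes: sorted token-index lists, one per MWE (each of length >= 2).
    """
    owner = {}  # token index -> position of that token within its MWE
    for idxs in mwes:
        for pos, i in enumerate(idxs):
            owner[i] = pos

    def in_gap(t: int) -> bool:
        # t lies strictly between an MWE's first and last token but is not part of it.
        return any(idxs[0] < t < idxs[-1] and t not in idxs for idxs in mwes)

    tags = []
    for t in range(n_tokens):
        if t not in owner:
            tag = "O"      # single-token unit
        elif owner[t] == 0:
            tag = "B"      # first token of an MWE
        else:
            tag = "I"      # continuation of an MWE
        tags.append(tag.lower() if in_gap(t) else tag)
    return tags

# The example sentence: Fuji_Sushi, came_up, made_ ... _order, carry_out.
tags = gappy_bio(22, [[7, 8], [9, 10], [17, 21], [19, 20]])
print("".join(tags))  # → OOOOOOOBIBIOOOOOOBobiI
```

Note that carry_out gets b/i rather than B/I because it sits inside the gap of made_ ... _order; this lowercase distinction is what lets a tagger keep track of the outer MWE across the gap.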

Refer to this LREC 2014 paper for details of the MWE annotation.

Semantic classes

These are broad-coverage lexical categories known as "supersenses".

- There are 26 noun supersenses, including n.person, n.location, n.time, n.food, and n.communication. They cover common nouns as well as proper names (named entities).
- There are 15 verb supersenses, including v.motion, v.social, and v.communication.
- Supersense annotations always respect strong MWE annotations: the supersense class applies to the entire MWE as a unit. Of MWEs, all and only the ones that holistically function as a noun or verb expression are labeled with a supersense.
- Single-word noun and verb tokens also receive supersenses. E.g., instances of hot dog and hamburger would both receive the n.food label.
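The label inventory can be checked mechanically. The Python sketch below lists the 26 noun and 15 verb supersenses as they correspond to WordNet's lexicographer file names (noun.* and verb.*). The spelling and capitalization used in the data (e.g., for n.Tops) are assumptions here and should be checked against TAGSET; `is_valid_supersense` is a hypothetical helper.

```python
NOUN_SUPERSENSES = {
    "Tops", "act", "animal", "artifact", "attribute", "body", "cognition",
    "communication", "event", "feeling", "food", "group", "location",
    "motive", "object", "person", "phenomenon", "plant", "possession",
    "process", "quantity", "relation", "shape", "state", "substance", "time",
}
VERB_SUPERSENSES = {
    "body", "change", "cognition", "communication", "competition",
    "consumption", "contact", "creation", "emotion", "motion", "perception",
    "possession", "social", "stative", "weather",
}

def is_valid_supersense(label: str) -> bool:
    """Accept 'n.<noun class>' or 'v.<verb class>'; reject anything else."""
    prefix, _, name = label.partition(".")
    if prefix == "n":
        return name in NOUN_SUPERSENSES
    if prefix == "v":
        return name in VERB_SUPERSENSES
    return False

print(len(NOUN_SUPERSENSES), len(VERB_SUPERSENSES))  # → 26 15
print(is_valid_supersense("n.food"), is_valid_supersense("v.food"))  # → True False
```

Because some class names (body, cognition, communication, possession) exist for both nouns and verbs, the n./v. prefix is needed to disambiguate them.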

Refer to this NAACL 2015 paper for details of the annotation of supersenses on top of MWEs.

Data conditions

There will be three conditions according to which systems will be compared.

To facilitate a controlled comparison of algorithms, in the (semi-)supervised closed conditions, systems may only use specific data resources.

In the supervised closed condition, the following are permitted:
- the labeled training data
- the English WordNet lexicon
- the following sets of word clusters (Brown clusters):
  - yelpac-c1000-m25.gz from the English Multiword Expression Lexicons—this clustering was induced from the Yelp Academic Dataset; and/or
  - any of the ARK Tweet NLP clusters
In the semi-supervised closed condition: all of the above are permitted, plus the Yelp Academic Dataset.
In the open condition, systems may use any and all available resources.

Each system submission will specify which of these datasets were used, and this will determine which competition(s) it is entered into. Each team is allowed to submit up to 3 systems: one in the supervised closed condition, one in the semi-supervised closed condition, and one in the open condition.
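The rules above nest: every resource allowed in the supervised closed condition is also allowed in the semi-supervised closed condition, and the open condition allows anything. The Python sketch below returns the most restrictive condition a set of resources fits. All resource identifiers are hypothetical labels chosen for this illustration, not names used by the organizers or by submission.csv.

```python
# Hypothetical identifiers for the permitted resources.
CLOSED_SUPERVISED = frozenset({
    "dimsum-train",        # the labeled training data
    "wordnet",             # the English WordNet lexicon
    "yelpac-c1000-m25",    # Brown clusters induced from the Yelp Academic Dataset
    "ark-tweet-clusters",  # any of the ARK Tweet NLP clusters
})
CLOSED_SEMISUPERVISED = CLOSED_SUPERVISED | {"yelp-academic-dataset"}

def data_condition(resources: set[str]) -> str:
    """Return the most restrictive condition whose allowed resources cover `resources`."""
    if resources <= CLOSED_SUPERVISED:
        return "supervised closed"
    if resources <= CLOSED_SEMISUPERVISED:
        return "semi-supervised closed"
    return "open"

print(data_condition({"dimsum-train", "wordnet"}))                # → supervised closed
print(data_condition({"dimsum-train", "yelp-academic-dataset"}))  # → semi-supervised closed
print(data_condition({"dimsum-train", "some-other-corpus"}))      # → open
```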

A new test set has been annotated for this task. A blind version (input only) has been released in advance of the evaluation period.

Scoring

Our evaluation script, dimsumeval.py, is bundled with the latest data release. Primarily, it reports F-scores for MWE identification, supersense labeling, and their combination. See the documentation in the script for an overview. Further details of the scoring procedure will be announced at a future time.
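For orientation, the core arithmetic of an F-score over sets of predicted and gold items is sketched below in Python. This is a simplified exact-match version under assumed item encodings (an MWE as a tuple of token indices). How dimsumeval.py actually defines matches for MWEs and supersenses, including any partial credit, is documented in the script itself and may differ.

```python
def prf(gold: set, pred: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of items."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical MWE items for the example sentence, as tuples of token indices.
gold_mwes = {(7, 8), (9, 10), (17, 21), (19, 20)}
pred_mwes = {(7, 8), (9, 10), (19, 20), (0, 1)}  # misses made_order, adds a spurious MWE
print(prf(gold_mwes, pred_mwes))  # → (0.75, 0.75, 0.75)
```

Supersense items could be scored the same way by adding the label to each item, e.g. ((7, 8), "n.group").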

Downloads

Training/test data + scripts v1.5
- README
- TAGSET
Trial data: STREUSLE 2.0, available for download. It consists of annotated online reviews and will eventually form part of the training set for the task.
- Refer to the files streusle.tags and streusle.tags.sst (which contain equivalent information, but in different formats). The formats are described in README.md.
A baseline system will be provided as well.

System submission

The deadline for system outputs to be submitted is Jan. 31, 2016. Instructions:

Create a .zip file containing:
- dimsum16.test.pred, your system's output. Each line should correspond to a line in dimsum16.test.blind, with values filled into the appropriate columns. You should run the evaluation script (with a dummy gold standard) to ensure your system output is formatted correctly.
- a copy of submission.csv, filled out to specify the resources used for the submission.
Create a new START submission (requires an account). Provide the team name and members, select Task 10, and write a short blurb about your system. Upload the zip file.
After creating your submission, you may update it (both metadata and zip file) up until the deadline. Only the last file upload for that submission will be retained.
Each team may enter up to 3 submissions, one per data condition. Specify the same team name for all of them, and ensure the submission.csv is different for the different conditions. If multiple submissions from the same team fall under the same data condition, the last one will be used for the evaluation.
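Running the evaluation script remains the authoritative format check. As an additional quick sanity check before uploading, the Python sketch below compares dimsum16.test.pred line by line with dimsum16.test.blind. It assumes tab-separated lines with blank lines between sentences, and that the columns listed in `given_cols` (by default the first two) are copied unchanged from the blind input; that column choice is an assumption, so consult the README for the actual layout. The helper names are hypothetical.

```python
from pathlib import Path

def read_lines(path: str) -> list[str]:
    """Read a UTF-8 file as a list of lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()

def check_alignment(blind: list[str], pred: list[str],
                    given_cols: tuple[int, ...] = (0, 1)) -> list[str]:
    """Return problems found when comparing prediction lines to blind input lines.

    given_cols: 0-based indices of tab-separated columns assumed to be
    copied unchanged from the blind input.
    """
    errors = []
    if len(blind) != len(pred):
        errors.append(f"line count differs: {len(blind)} (blind) vs {len(pred)} (pred)")
    for n, (b, p) in enumerate(zip(blind, pred), start=1):
        if not b.strip() or not p.strip():
            # Blank lines separate sentences, so they must line up exactly.
            if b.strip() != p.strip():
                errors.append(f"line {n}: sentence boundary mismatch")
            continue
        b_cols, p_cols = b.split("\t"), p.split("\t")
        for c in given_cols:
            if c >= len(b_cols) or c >= len(p_cols) or b_cols[c] != p_cols[c]:
                errors.append(f"line {n}: column {c + 1} differs from the input")
                break
    return errors
```

Usage: `check_alignment(read_lines("dimsum16.test.blind"), read_lines("dimsum16.test.pred"))` returns an empty list when no problems are found.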

We plan to make all submissions—including the test set predictions—public at some point subsequent to the task. (If there are institutional barriers to this, please contact the organizers.)

Organization

Please subscribe to https://groups.google.com/group/dimsum16 for announcements about the task.

See the schedule for planned data releases and deadlines.

The organizers are:

Nathan Schneider, University of Edinburgh
Dirk Hovy, University of Copenhagen
Anders Johannsen, University of Copenhagen
Marine Carpuat, University of Maryland
