Language, Knowledge and People in Perspective

Scientific organizers
Piek Vossen (Computational Linguistics and Lexicology, VU, Amsterdam)
Lora Aroyo (Human Semantics & Semantic Web, CrowdTruth @IBM/VU, Amsterdam)
Antske Fokkens (Computational Linguistics and Lexicology, VU, Amsterdam)
Julia Noordegraaf (Media Studies, UvA, Amsterdam)
Ivar Vermeulen (Social Sciences, VU, Amsterdam)
Chris Welty (Massive Scale NLP & Semantics, VU/Google, New York, USA)



From: Tuesday 18 Apr 2017 through Friday 21 Apr 2017
Venue: Lorentz Center@Snellius, Leiden, The Netherlands


Looking at the Long Tail of Language

*** Slides from the invited talks:


Maarten de Rijke

Antal van den Bosch

Johan Bos

Ivan Titov

Eduard Hovy

Stefan Schlobach

*** Videos of all presentations are now available online.

Many natural phenomena show a Zipfian distribution (Newman, 2005), in which a small number of observations are very frequent and there is a very long tail of low-frequency observations. The distribution of symbols in natural language and their meanings is no exception to Zipf’s law. Within a language community and a period of time, e.g. a generation, a few expressions are extremely frequent and are used in their most frequent meaning, whereas many expressions and meanings occur only rarely. This has big consequences for the computer models that are built from these observations: they tend to overfit to the most frequent cases. As long as the tasks on which we test these models show the same distribution, the models perform quite well. However, this favours models that rely on statistically obvious cases and does not require deep understanding or reasoning. Typically, their performance is high when the test cases match the most frequent cases, and very low when they belong to the long tail.

Interestingly enough, people do not suffer from overfitting in the same way as machines do: they handle long-tail phenomena perfectly well. In this workshop, we want to address the long tail in the semantic processing of text, with a focus on the task of disambiguation. We need to find an incentive for the community to treat the long tail as a first-class citizen, by integrating it into evaluation metrics and/or by representing the long-tail world in the (evaluation) datasets and knowledge bases. This would encourage the development of systems that have a better understanding of natural language and are able to deal with knowledge and data sparseness.
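As a rough illustration of why the choice of evaluation metric matters here, the following minimal Python sketch compares two ways of scoring a most-frequent-sense baseline on a toy disambiguation set. It is not part of the workshop materials: the sense labels, the 90/10 split between head and tail, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def micro_accuracy(gold, predicted):
    """Fraction of instances labelled correctly (dominated by frequent senses)."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def macro_accuracy(gold, predicted):
    """Average of per-sense accuracies, so every sense counts equally."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return sum(hits[sense] / n for sense, n in totals.items()) / len(totals)


# Hypothetical toy data: one frequent sense and ten senses seen only once.
gold = ["sense_head"] * 90 + [f"sense_tail_{i}" for i in range(10)]

# Most-frequent-sense baseline: always predict the head sense.
predicted = ["sense_head"] * len(gold)

micro = micro_accuracy(gold, predicted)
macro = macro_accuracy(gold, predicted)
print(f"micro accuracy: {micro:.3f}")
print(f"macro accuracy: {macro:.3f}")
```

On this toy set the baseline looks strong when every instance counts equally, but weak when every sense counts equally, because it never gets a tail sense right. Averaging per sense is only one possible way to give the long tail more weight in a metric.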

Aim of the Workshop

The goal of the current Long Tail workshop is to discuss, with a selection of experts in the field, the starting points and motivation for a future workshop and task, as well as its design, data, evaluation and possible systems. Depending on the outcome of the Spinoza “Long Tail” workshop, we will also consider a special-issue journal publication together with the workshop speakers. We plan to build upon the results of this workshop by creating a disambiguation task with a strong focus on the long tail phenomenon. Such a task requires the design and collection of data that represent the long tail, as well as adequate evaluation methods. We aim to propose this task as a “Long Tail Shared Disambiguation Task” in response to the next call for SemEval-2018 tasks, which is expected in late 2016/early 2017. In addition, we plan to propose a workshop for ACL 2017, dedicated to interesting the community in the task, discussing the acquisition of the data and exploring possible systems that would optimize for this task.

Structure of the Workshop

The 2nd Spinoza Workshop “Looking at the Long Tail” will consist of two main sessions:

1. Invited Speakers

The invited speakers come from various fields of expertise:

  • Natural Language Processing
  • Information Retrieval
  • Knowledge Representation and Reasoning
  • Machine Learning

They will address the phenomena of overfitting and low long-tail performance from the perspective of their own disciplines.

2. Datathon

The practical session is organized as a datathon and will consist of four tracks.

As a starting point, we will provide a collection of evaluation methods, data sets, knowledge bases, and system results for participants in this datathon to analyse and discuss. We will also provide scripts for data analysis.


09:00 – 09:30 Welcome
09:30 – 10:05 Introduction by the Organizers
10:05 – 10:30
10:30 – 10:55
10:55 – 11:10 Coffee Break
11:10 – 11:35
11:35 – 12:00
12:00 – 12:25
12:25 – 13:25 Lunch
13:25 – 14:05
14:05 – 14:15 Introduction to the Datathon tracks
14:15 – 14:45 Discussion & Brainstorming
14:45 – 16:05 Hands-on
16:05 – 16:20 Coffee Break
16:20 – 17:05 Conclusions
17:05 – 17:45 Central Presentations
17:45 – 18:00 Towards a SemEval ’18 Shared Task
18:00 Drinks


We encourage you to join and use the following email address for all communication related to the workshop:

Let us know your impressions, and find out about others’, on Twitter using the official workshop hashtag: #SpinozaLongTail

Directions to the Workshop venue

Room: Forum 2 at Floor 1, Wing D
Main building Vrije Universiteit Amsterdam
De Boelelaan 1105
1081 HV Amsterdam

You might use this link: How to get to Vrije Universiteit Amsterdam

From the entrance of the VU Main Building, follow the signs ‘Forum, Wing D’.

Take the stairs (indicated by signpost) to Floor 1.
Forum 2 is the first room on your right.
When you arrive at the Main Building VU and would like assistance, please ask a host in the hall near the main entrance.

Cannot make it?

For those who cannot attend this workshop, we aim to broadcast the event live. Pictures and videos will also be taken during the event and published shortly afterwards.


The research for this project was supported by the Netherlands Organisation for Scientific Research (NWO) via the Spinoza fund.

Organizing Committee

  1. Piek Vossen
  2. Filip Ilievski
  3. Marten Postma
  4. Selene Kolman

Can Machines Understand Language?

Understanding language by machines
1st VU-Spinoza workshop
October 17th 2014

12:30 - 18:00 hours
Atrium, room D-146, VU Medical Faculty (1st floor, D-wing)
Van der Boechorststraat 7
1081 BT Amsterdam

Can machines understand language? According to John Searle, this is fundamentally impossible. He used the Chinese Room thought experiment to argue that computers follow instructions to manipulate symbols without understanding these symbols. Willard Van Orman Quine even questioned the understanding of language by humans, since symbols are only grounded through approximation by cultural and situational convention. Between these extreme points of view, we nevertheless communicate every day as part of our social behavior (within Heidegger's hermeneutic circle), while more and more computers and even robots take part in communication and social interactions.

The goal of the Spinoza project “Understanding of language by machines” (ULM) is to scratch the surface of this dilemma by developing computer models that can assign deeper meaning to language, approximating human understanding, and to use these models to automatically read and understand text. We are building a Reference Machine: a machine that can map natural language to the extralinguistic world as we perceive it and represent it in our brain.

This is the first in a series of workshops that we will organize in the Spinoza project to discuss and work on these issues. It marks the kick-off of 4 projects that started in 2014, each studying different aspects of understanding and modeling them through novel computer programs. Every six months, we will organize a workshop or event that brings together the different research lines around this central theme and on shared data sets.

We investigate ambiguity, variation and vagueness of language; the relation between language, perception and the brain; the role of the world view of the writer of a text and the role of the world view and background knowledge of the reader of a text.


12:30 – 13:00 Welcome
13:15 – 13:45 Understanding language by machines: Piek Vossen
13:45 – 14:15 Borders of ambiguity: Marten Postma and Ruben Izquierdo Bevia
14:15 – 14:45 Word, concept, perception and brain: Emiel van Miltenburg and Alessandro Lopopolo
14:45 – 15:15 Coffee break
15:15 – 15:45 Stories and world views as a key to understanding: Tommaso Caselli and Roser Morante
15:45 – 16:15 A quantum model of text understanding: Minh Ngọc Lê and Filip Ilievski
16:15 – 17:00 Discussion on building a shared demonstrator: a reference machine
17:00 – 18:00 Drinks

Admission is free.
Please RSVP via Eventbrite before October 03, 2014.


Atrium, room D-146, VU Medical Faculty (1st floor, D-wing)
Van der Boechorststraat 7

1081 BT Amsterdam
The Netherlands

Parking info: N.B. Campus parking is temporarily unavailable.

