A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation
Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak
Question Answering Long Paper
Session 1B: Jul 6
(06:00-07:00 GMT)
Session 2A: Jul 6
(08:00-09:00 GMT)
Abstract:
In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database, called Operation Trees (OT). This representation allows us to invert the annotation process without loosing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of the tokens to the operations. Thus, we randomly generate OTs from a context free grammar and annotators just have to write the appropriate question and assign the tokens. We compare our corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases, to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our dataset is a challenging dataset and that the token alignment can be leveraged to significantly increase the performance.
You can open the
pre-recorded video
in a separate window.
NOTE: The SlidesLive video may display a random order of the authors.
The correct author list is shown at the top of this webpage.
Similar Papers
Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport
Kyle Swanson, Lili Yu, Tao Lei,

Query Graph Generation for Answering Multi-hop Complex Questions from Knowledge Bases
Yunshi Lan, Jing Jiang,

Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing
Alane Suhr, Ming-Wei Chang, Peter Shaw, Kenton Lee,

TaPas: Weakly Supervised Table Parsing via Pre-training
Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, Julian Eisenschlos,
