Abstract:
The World Wide Web contains vast quantities of textual information in
several forms: unstructured text, template-based semi-structured webpages
(which present data in key-value pairs and lists), and tables. Methods for
extracting information from these sources and converting it to a structured
form have been a target of research from the natural language processing
(NLP), data mining, and database communities. While these researchers have
largely treated extraction from web data as separate problems, divided by
the modality of the data, they have faced similar challenges, such as learning
with limited labeled data, defining (or avoiding defining) ontologies, making
use of prior knowledge, and scaling solutions to deal with the size of the
Web. In this tutorial we take a holistic view toward information extraction,
exploring the commonalities in the challenges and solutions developed to
address these different forms of text. We will explore approaches
targeting unstructured text, which largely rely on learning syntactic or
semantic textual patterns; approaches targeting semi-structured documents,
which learn to identify structural patterns in the template; and approaches
targeting web tables, which rely heavily on entity linking and type
information. While these different data modalities have largely been
considered separately in the past, recent research has started taking a more
inclusive approach toward textual extraction, in which the multiple signals
offered by textual, layout, and visual cues are combined into a single
extraction model made possible by new deep learning approaches. At the same
time, trends within purely textual extraction have shifted toward
full-document understanding rather than considering sentences as independent
units. With this in mind, it is worth considering the information extraction
problem as a whole to motivate solutions that harness textual semantics along
with visual and semi-structured layout information. We will discuss these
approaches and suggest avenues for future work.