Theo van Veen, Juliette Lonij

Industry

Improving Access to Digital Content by Semantic Enrichment

The collection of digitized historical newspapers of the National Library of the Netherlands contains an abundance of information about events, persons, concepts etc. As part of our effort to automatically extract this information from the unstructured text we are developing methods to recognize named entities and link them to external knowledge bases such as DBpedia and Wikidata.

We are continuously working on further increasing the accuracy of the links by exploring new machine learning algorithms and adding new features. Our current focus is on identifying and incorporating features that may play a role in human entity linking, and on applying word and entity embeddings, as we expect these representations to be a valuable addition to our existing, mostly handcrafted features.

Indexing relevant identifiers and properties for the linked named entities along with the newspaper articles opens up new possibilities for search, e.g. grouping search results by semantic relations, and further exploration based on context. We present a demonstration web portal that implements some of these possibilities and offers basic crowdsourcing functionality for manual correction of any remaining errors.

Exploring possibilities for improving search and usability of our digital content is one of the core activities of the research department of the Koninklijke Bibliotheek, National Library of the Netherlands (KB). One way in which we aim to achieve this is by enriching our collections with extracted or related information. Our current focus is on enriching the historical newspapers collection with linked named entities, i.e. names of persons, locations and organisations mentioned in the newspaper articles that are linked to corresponding resource descriptions in international knowledge bases such as DBpedia, Wikidata and VIAF.

We have set up a generic enrichment infrastructure, consisting of an enrichment database and a number of services, that is able to store any type of enrichment for any object from the KB collections, without modifying the original data. Generally speaking, the enrichment database contains links between identifiers of objects from our collections and related identifiers. If available we include some metadata about the links, such as their provenance and confidence.
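As an illustration of the kind of record such an enrichment database might hold, the following is a minimal sketch only. The class names, field names and identifier formats are assumptions made for this example and do not describe the actual KB schema or services.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Enrichment:
    """One link between a collection object and a related identifier."""
    object_id: str                      # identifier of the collection object (hypothetical format)
    target_id: str                      # related identifier, e.g. a DBpedia or Wikidata URI
    provenance: Optional[str] = None    # who or what produced the link, if known
    confidence: Optional[float] = None  # score between 0 and 1, if known


class EnrichmentStore:
    """In-memory stand-in for an enrichment database (illustrative only)."""

    def __init__(self) -> None:
        self._by_object: Dict[str, List[Enrichment]] = {}

    def add(self, enrichment: Enrichment) -> None:
        # The original object is never touched; links are stored separately.
        self._by_object.setdefault(enrichment.object_id, []).append(enrichment)

    def links_for(self, object_id: str, min_confidence: float = 0.0) -> List[Enrichment]:
        # A link without a confidence value is treated as confidence 0.0,
        # so it is returned only when no minimum is requested.
        return [
            e for e in self._by_object.get(object_id, [])
            if (e.confidence if e.confidence is not None else 0.0) >= min_confidence
        ]


store = EnrichmentStore()
store.add(Enrichment("article:123", "http://dbpedia.org/resource/Amsterdam",
                     provenance="automatic", confidence=0.92))
store.add(Enrichment("article:123", "http://dbpedia.org/resource/Rotterdam",
                     provenance="user"))
confident_links = store.links_for("article:123", min_confidence=0.5)
```

Keeping the links in a separate store, keyed by object identifier, reflects the point made above that enrichments are added without modifying the original data.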

Named entities are automatically recognized in the newspaper articles by means of specific software. We generate a set of candidate links for each entity by querying an entity index of DBpedia dumps. A machine learning model that was trained on a manually annotated set of articles selects the best candidate, if any, based on various feature values. Although our software has reached an accuracy of over 85%, we enable users to correct any remaining errors and add missing links. This user feedback also serves as additional training data for the entity linking software.
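The selection step can be sketched as follows. This is a simplified illustration, not our actual model: the feature names, weights and threshold are hypothetical, and a plain weighted sum stands in for the trained machine learning model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    uri: str                    # knowledge-base identifier of the candidate
    features: Dict[str, float]  # feature name -> value (names are hypothetical)


# Hypothetical weights; a trained model would learn these from annotated articles.
WEIGHTS = {"name_similarity": 2.0, "context_overlap": 1.5, "popularity": 0.5}
THRESHOLD = 2.0  # hypothetical cut-off: below this score no link is made


def score(candidate: Candidate) -> float:
    """Weighted sum of the feature values; missing features count as 0."""
    return sum(w * candidate.features.get(name, 0.0) for name, w in WEIGHTS.items())


def select_best(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=score)
    return best if score(best) >= THRESHOLD else None


city = Candidate("http://dbpedia.org/resource/Amsterdam",
                 {"name_similarity": 1.0, "context_overlap": 0.6, "popularity": 0.9})
other = Candidate("http://dbpedia.org/resource/Amsterdam_(ship)",
                  {"name_similarity": 1.0, "context_overlap": 0.1, "popularity": 0.1})
weak = Candidate("http://dbpedia.org/resource/Example", {"name_similarity": 0.5})

chosen = select_best([city, other])
rejected = select_best([weak])
```

Returning `None` when no candidate reaches the threshold corresponds to the "if any" case: an entity without a sufficiently plausible candidate is left unlinked.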

Indexing identifiers for the linked named entities along with the newspaper articles opens up new possibilities for semantic search. Users can search for articles that mention entities with certain properties or combinations of properties, for example articles about Roman emperors, or about members of the Dutch Parliament who were born outside the Netherlands.
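A property combination of this kind could be expressed as a SPARQL query against Wikidata, as in the sketch below. The helper function is hypothetical and written for this example only; it is not how the portal builds its queries. P39 (position held), P19 (place of birth), P17 (country) and Q55 (the Netherlands) are Wikidata identifiers; Q842606 is assumed here to be the item for "Roman emperor" and should be verified before use.

```python
def build_entity_query(position_qid: str, born_outside_qid: str = "") -> str:
    """Build a SPARQL query for persons who held a given position,
    optionally excluding those born in a given country."""
    lines = [
        "PREFIX wd: <http://www.wikidata.org/entity/>",
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>",
        "SELECT DISTINCT ?person WHERE {",
        f"  ?person wdt:P39 wd:{position_qid} .",  # P39 = position held
    ]
    if born_outside_qid:
        lines += [
            "  ?person wdt:P19 ?birthplace .",  # P19 = place of birth
            # P17 = country; drop persons whose birthplace lies in that country
            f"  FILTER NOT EXISTS {{ ?birthplace wdt:P17 wd:{born_outside_qid} }}",
        ]
    lines.append("}")
    return "\n".join(lines)


# Q842606 is assumed to identify "Roman emperor"; verify before use.
emperors_query = build_entity_query("Q842606")
```

For the second example, the same function would be called with the Wikidata item for the relevant parliamentary position and `"Q55"` as the country to exclude. The resulting entity identifiers can then be matched against the identifiers indexed with the articles. Note that this filter also keeps persons whose birthplace has no country statement at all.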

To demonstrate this functionality, we have created an online research portal (http://www.kbresearch.nl/xportal/) where users can explore the available enrichments and experiment with semantic search in the historical newspaper collection. The portal supports full SPARQL queries against Wikidata, but also offers a number of more user-friendly forms of semantic querying, e.g. by automatically generating a “best guess” SPARQL query from a conventional search string.

CV

Theo van Veen has been a member of the research department of the Koninklijke Bibliotheek, National Library of the Netherlands since 1998. He obtained a degree in Physics from the Delft University of Technology in 1979 and started working on library automation at the Utrecht University Library in 1988. He has been involved in several projects related to the European Library and Europeana, both hosted at the Koninklijke Bibliotheek. His current research interests include semantic enrichment, named entities, linked data and service integration.