Caleb Severn, Tiago Almeida

Industry

From Legacy to Linked Data: Rebuilding a STEM information service

With over 16 million records dating back to 1898, the IET’s Inspec is the world’s leading abstracting and indexing database for research in engineering, physics and computer science. The Inspec production system has historically been based on a traditional relational database system. With the advent of the semantic web, Inspec has decided to switch to a semantic graph database in order to better handle the relationships within increasingly large and complex data. This graph currently includes Inspec content and subjects, and we aim to expand its scope to organisations, authors and locations. Converting the traditional data into linked data has opened up many exciting new ways to access and discover academic resources and to display trends in physics, engineering and computing research. In this talk we show how using domain models, ontologies and semantic enrichment improves data discoverability, navigation and visualisation.

Every year, almost 1 million research papers, conference papers, videos, books and other publication types are added to the Inspec database. Each of these records contains not only information provided by its publisher, such as the authors of a paper and their affiliations, but also data enriched by our indexing service. This data includes controlled terms from a thesaurus, classification codes, free text, and numerical, chemical and astronomical annotations. This makes scientific articles more discoverable by researchers, who are able to find high-quality, relevant content.
For years, all this data was stored in a legacy relational database, where it was accessed through various Web 2.0 domains offering simple text querying. This process, while practical, was not intuitive and did not take full advantage of the extensive annotations applied to each paper or of the links between those annotations. One of the benefits of having such a richly detailed, highly structured corpus of data is that both large-scale trends and micro-trends can be determined, such as how new scientific concepts have grown over time and in specific locations, and how they have branched off into sub-concepts. This enables a far deeper understanding of the trends in scientific research, along with the key opinion leaders and institutions focusing on those areas.
At Inspec, we have spent the last few years moving gradually towards semantics, integrating our traditional relational database into an RDF paradigm. We show the evolution from taxonomies to domain models that embed all the different entities and the relationships between them, the development of ontologies based on these models, and their population with reference data, effectively creating a large-scale semantic network of agents, content and concepts. This is an iterative, agile process in which the domain model is designed according to the questions we want the data to answer.
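As a rough illustration of what moving a relational record into an RDF paradigm involves, the sketch below turns one flat record into subject-predicate-object triples and serialises them as N-Triples. It is a minimal sketch only: the `http://example.org/` namespace, the predicate names (`title`, `hasAuthor`, `hasSubject`) and the record fields are all hypothetical and do not represent the actual Inspec domain model or ontology.

```python
import re

# Hypothetical namespace, for illustration only (not an Inspec URI scheme).
BASE = "http://example.org/"


def slug(text):
    """Make a URI-safe identifier from a free-text label."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def literal(value):
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def record_to_triples(record):
    """Turn one flat, relational-style record into (subject, predicate, object) triples."""
    subject = f"<{BASE}record/{record['id']}>"
    triples = [(subject, f"<{BASE}title>", literal(record["title"]))]
    # Authors and controlled terms become nodes of their own, so that
    # several records can link to the same author or concept.
    for author in record["authors"]:
        triples.append((subject, f"<{BASE}hasAuthor>", f"<{BASE}author/{slug(author)}>"))
    for term in record["controlled_terms"]:
        triples.append((subject, f"<{BASE}hasSubject>", f"<{BASE}concept/{slug(term)}>"))
    return triples


def to_ntriples(triples):
    """Serialise triples in N-Triples form, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    # Hypothetical example record.
    example = {
        "id": "42",
        "title": "Graph databases for bibliographic data",
        "authors": ["A. Author"],
        "controlled_terms": ["semantic Web"],
    }
    print(to_ntriples(record_to_triples(example)))
```

Because authors and concepts are given their own identifiers rather than being stored as text in a column, the relationships between records become explicit links that can be queried and navigated.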
We also discuss the process of migrating Inspec content to a semantic database and the problems it raises, such as data deduplication and disambiguation. The migration allows us to enhance our content with information on author affiliations, event locations and research project funding, whilst guaranteeing data quality.
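To give a flavour of the deduplication problem, the sketch below groups affiliation strings that differ only in case, accents, punctuation or spacing. It is a deliberately simple example under stated assumptions: the function names and sample strings are hypothetical, and it is not the method used in the Inspec migration. Genuine disambiguation (for example, telling apart two authors who share a name) needs more context than string normalisation can provide.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(name):
    """Reduce a name to a comparison key: no accents, lower case, single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9]+", " ", without_accents.lower())
    return " ".join(cleaned.split())


def group_duplicates(names):
    """Group the original spellings that share the same normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample of affiliation strings as they might appear in records.
    sample = ["Universidade do Porto", "universidade  do Porto.", "Swansea University"]
    for key, variants in group_duplicates(sample).items():
        print(key, variants)
```

Each group of variants can then be mapped to a single node in the graph, so that all records mentioning the same organisation link to the same entity.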
We show how this allows us to improve not only our own production processes, by making semantic enrichment possible, but also the services we provide, greatly expanding the functionality available to users searching our database through smart graph navigation and data visualisation tools.

CV

Caleb Severn holds a Master’s in Physics from Swansea University, where he studied for four years. After graduating, he was hired to be part of Inspec at the IET. Since starting, Caleb has been involved in the development of semantically enriched data, ontologies and domain models.

Tiago Almeida holds a Master’s in Physics from the University of Porto. After one year of industry experience at Bosch in Portugal, he moved to the UK in 2015 to work at the IET, where he has been part of a multi-organisational team working on the integration of semantic technologies in Inspec. His areas of interest include statistical analysis of data, text mining, natural language processing, Linked Data and data visualisation.

Both work as part of the content selection and sampling team, with responsibility for the computing and control areas, making sure that the selection, abstracting and indexing of material meet the high quality requirements of the Inspec database.