Research & Innovation
Discovering matching entities in different Knowledge Bases constitutes a core task in the Linked Data paradigm. Due to its quadratic time complexity, Entity Resolution typically scales to large datasets through blocking, which restricts comparisons to similar entities. For Big Linked Data, Meta-blocking is also needed to restructure the blocks in a way that boosts precision, while maintaining high recall. Based on blocking and Meta-blocking, JedAI Toolkit implements an end-to-end ER workflow for both relational and RDF data. However, its bottleneck is the time-consuming procedure of Meta-blocking, which iterates over all comparisons in each block. To accelerate it, we present a suite of parallelization techniques that are suitable for multi-core processors. We present 2 categories of parallelization strategies, with each one comprising 4 different approaches that are orthogonal to Meta-blocking algorithms. We perform extensive experiments over a real dataset with 3.4 million entities and 13 billion comparisons, demonstrating that our methods can process it within few minutes, while achieving high speedup.