
Entity Linking at scale with Lucene

Formal Metadata

Title
Entity Linking at scale with Lucene
Title of Series
Number of Parts
56
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Signal AI offers a sophisticated platform to support businesses in their decision making. Customers define searches across billions of documents using an extensive DSL that includes concepts such as entities and topics. This metadata is extracted from over 5 million documents each day and made available to end users within 30 seconds of ingestion, via a mix of machine learning and text retrieval techniques.

Entity Linking is one of the core capabilities of the Signal AI data processing platform. It is a complex system that uses various strategies to achieve the highest quality while retaining excellent throughput characteristics. Back in 2019, one of the existing components of the Entity Linking system was rapidly reaching its limits and could not scale any further. To overcome this limitation, the team took an innovative approach and used Apache Lucene, with its inverted index and term vectors capabilities, to enable the identification of rule-based entities. Choosing a percolator model required the team to revisit the previous architecture, breaking it down into smaller components that follow the Single Responsibility Principle for microservices.

This talk will take the audience through the evolution of this service, from its inception until today. It will provide details about the technical decisions and trade-offs that make this component one of the most resilient, fast and cost-effective solutions, capable of handling 20 times the number of rules at a fraction of the cost. It will also discuss how the same technology is used to reprocess the entire dataset every night in approximately 15 minutes.
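To make the percolator model mentioned in the abstract concrete, here is a minimal sketch (not the actual Signal AI implementation): the usual search pattern is inverted, so entity rules are stored as Lucene queries, and each incoming document is loaded into a single-document in-memory index that every rule query is run against. The class name, field name, and example rule syntax are illustrative assumptions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical percolator: entity rules are stored queries,
 *  and each document is probed against all of them. */
public class EntityRulePercolator {

    private static final String FIELD = "body"; // assumed field name
    private final StandardAnalyzer analyzer = new StandardAnalyzer();
    private final Map<String, Query> rulesByEntity = new LinkedHashMap<>();

    /** Register a rule, e.g. addRule("Apple Inc.", "apple AND (iphone OR ipad)"). */
    public void addRule(String entityId, String rule) throws ParseException {
        rulesByEntity.put(entityId, new QueryParser(FIELD, analyzer).parse(rule));
    }

    /** Percolate one document: build a single-document in-memory index
     *  and return every entity whose rule query matches it. */
    public List<String> linkEntities(String documentText) {
        MemoryIndex doc = new MemoryIndex();
        doc.addField(FIELD, documentText, analyzer);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Query> rule : rulesByEntity.entrySet()) {
            if (doc.search(rule.getValue()) > 0.0f) { // a score above zero means the rule matched
                matches.add(rule.getKey());
            }
        }
        return matches;
    }
}
```

In practice, a system at this scale would not evaluate every rule query per document; Lucene's lucene-monitor module (formerly Luwak) pre-selects candidate queries before percolating, which is one way this approach can keep throughput high as the rule set grows.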