
Entity Linking at scale with Lucene

Formal Metadata

Title
Entity Linking at scale with Lucene
Title of Series
Number of Parts
56
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Signal AI offers a sophisticated platform to support businesses in their decision making. Customers define searches across billions of documents using an extensive DSL that includes concepts such as entities and topics. This metadata is extracted from over 5 million documents each day and made available to end users within 30 seconds of ingestion, via a mix of machine learning and text retrieval techniques.

Entity Linking is one of the core capabilities of the Signal AI data processing platform. It is a complex system that uses various strategies to achieve the highest quality while retaining excellent throughput characteristics. Back in 2019, one of the existing components of the Entity Linking system was rapidly reaching its limits and could not scale any further. To overcome this limitation, the team took an innovative approach and used Apache Lucene, with its inverted index and term vectors capabilities, to enable the identification of rule-based entities. Choosing a percolator model required the team to revisit the previous architecture, breaking it down into smaller components that follow the Single Responsibility Principle for microservices.

This talk will take the audience through the evolution of this service, from its inception until today. It will provide details about the technical decisions and trade-offs that make this component one of the most resilient, fast and cost-effective solutions, capable of handling 20 times the number of rules at a fraction of the cost. It will also discuss how the same technology is used to reprocess the entire dataset every night in approximately 15 minutes.
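To make the percolator model mentioned in the abstract concrete, here is a minimal sketch (not the actual Signal AI implementation): the usual search pattern is inverted, so entity rules are stored as Lucene queries, and each incoming document is loaded into a single-document in-memory index that every rule query is run against. The class name, field name, and example rule syntax are illustrative assumptions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical percolator: entity rules are stored queries,
 *  and each document is probed against all of them. */
public class EntityRulePercolator {

    private static final String FIELD = "body"; // assumed field name
    private final StandardAnalyzer analyzer = new StandardAnalyzer();
    private final Map<String, Query> rulesByEntity = new LinkedHashMap<>();

    /** Register a rule, e.g. addRule("Apple Inc.", "apple AND (iphone OR ipad)"). */
    public void addRule(String entityId, String rule) throws ParseException {
        rulesByEntity.put(entityId, new QueryParser(FIELD, analyzer).parse(rule));
    }

    /** Percolate one document: build a single-document in-memory index
     *  and return every entity whose rule query matches it. */
    public List<String> linkEntities(String documentText) {
        MemoryIndex doc = new MemoryIndex();
        doc.addField(FIELD, documentText, analyzer);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Query> rule : rulesByEntity.entrySet()) {
            if (doc.search(rule.getValue()) > 0.0f) { // a score above zero means the rule matched
                matches.add(rule.getKey());
            }
        }
        return matches;
    }
}
```

In practice, a system at this scale would not evaluate every rule query per document; Lucene's lucene-monitor module (formerly Luwak) pre-selects candidate queries before percolating, which is one way this approach can keep throughput high as the rule set grows.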