We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Entity Matching Meets Data Science: A Progress Report from the Magellan Project

Formale Metadaten

Titel
Entity Matching Meets Data Science: A Progress Report from the Magellan Project
Serientitel
Anzahl der Teile
155
Autor
et al.
Lizenz
CC-Namensnennung 3.0 Deutschland:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.
Identifikatoren
Herausgeber
Erscheinungsjahr
Sprache

Inhaltliche Metadaten

Fachgebiet
Genre
Abstract
Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, joint with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS), to build a new kind of EM systems, which is an ecosystem of interoperable tools. This paper provides a progress report on the past 3.5 years of Magellan, focusing on the system aspects and on how ideas from the field of data science have been adapted to the EM context. We argue why EM can be viewed as a special class of DS problems, and thus can benefit from system building ideas in DS. We discuss how these ideas have been adapted to build pymatcher and cloudmatcher, EM tools for power users and lay users. These tools have been successfully used in 21 EM tasks at 12 companies and domain science groups, and have been pushed into production for many customers. We report on the lessons learned, and outline a new envisioned Magellan ecosystem, which consists of not just on-premise Python tools, but also interoperable microservices deployed, executed, and scaled out on the cloud, using tools such as Dockers and Kubernetes.