The combination of big data and deep learning has fundamentally changed the way we approach search systems, allowing us to index audio, images, video, and other human-generated data by embedding vectors rather than auxiliary descriptions. These advances are backed by new and increasingly complex machine learning (ML) models, widening the research-to-industry gap despite the introduction of MLOps platforms and a variety of model hubs. We summarize some of the challenges facing practical machine learning in 2022 and beyond as follows: 1) many ML applications require a combination of multiple models, leading to overly complex and difficult-to-maintain auxiliary code; 2) many engineers are unfamiliar with ML and/or data science, making it difficult for them to train, test, and integrate ML models into existing infrastructure; and 3) constant architectural updates to state-of-the-art (SOTA) deep learning models create significant overhead when deploying those models in production environments.

In this talk, we discuss lessons learned from building an open-source (https://github.com/towhee-io/towhee), scalable framework for generating embedding vectors, purpose-built to tackle the above challenges. Early on, we communicated with dozens of industry partners to understand their applications and architected our platform around their requirements. The project is currently used by three major corporations ($10B+ market value) and a number of small- and mid-size startups in proof-of-concept and production systems.
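To make the idea of embedding-based indexing concrete, the minimal Python sketch below shows the pattern such a framework is built around: each item is mapped to a vector by a model, and retrieval is performed by nearest-neighbor search over those vectors. The `embed` function is a hypothetical stand-in for a real encoder, not Towhee's actual API.

```python
import numpy as np

# Hypothetical embedding function standing in for a deep learning encoder
# (e.g. an image or audio model). Here we just hash bytes into a
# deterministic pseudo-random vector for illustration purposes.
def embed(item: bytes, dim: int = 128) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)  # L2-normalize for cosine similarity

# Index a small collection of items by their embedding vectors.
corpus = [b"dog photo", b"cat photo", b"car photo"]
index = np.stack([embed(item) for item in corpus])  # shape: (n_items, dim)

# Query: embed the query and rank items by cosine similarity.
query_vec = embed(b"dog photo")
scores = index @ query_vec        # dot product of normalized vectors
ranking = np.argsort(-scores)     # best match first
print([corpus[i] for i in ranking])
```

In production, the toy `embed` function would be replaced by a pipeline of one or more pretrained models, and the brute-force dot product by an approximate nearest-neighbor index; managing exactly that model-composition layer is the problem the framework described in this talk addresses.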