25:59
Lewandowski, Albert
More and more services are running in Kubernetes, which means we can migrate our current data pipelines to the new environment. In the case of Flink, we have multiple ways to do real-time data streaming: use the Lyft or GCP operator, go with the official deployment and customize it, choose the Ververica Platform, or build something of your own. The presentation shows how to choose the right solution for your technical requirements and business needs so you can run Flink in Kubernetes at great scale with no issues.
2021Plain Schwarz
32:46
Tacke, Adrienne
Most documentation, technical tutorials, and quick demos are written in a certain way. But for the true beginners, the career-transitioners, or those crossing domains? That technical content is certainly not written for them! Funnily enough, these same shortcomings affect “technical” people too, especially when it comes to learning something new. In this talk, I’d like to explore the ways we can make our documentation better by considering more kinds of people. We’ll discover common oversights and assumptions most documentation has built in by default and learn how to fix them. We’ll also strengthen our technical writing skills to ensure, to the best of our ability, that every anticipated reader of our documentation never feels lost or frustrated. By the end of this talk, you’ll leave and never write documentation in the same way again…and that’s a good thing!
2021Plain Schwarz
11:08
Toren, Yizhar
Image & text similarity problems have plenty of textbook ML solutions. However, in the wild, these solutions often fail. In this talk I'll present a use case about inferring product similarity from multiple sources of data (images, text, etc.) and discuss how we developed a practical & scalable approach using our understanding of domain knowledge.
2021Plain Schwarz
27:38
Tkachenko, Yaroslav
Lambda Architecture has been a common way to build data pipelines for a long time, despite difficulties in maintaining two complex systems. An alternative, Kappa Architecture, was proposed in 2014, but many companies are still reluctant to switch to Kappa. And there is a reason for that: even though Kappa generally provides a simpler design and similar or lower latency, there are a lot of practical challenges in areas like exactly-once delivery, late-arriving data, historical backfill and reprocessing. In this talk, I want to show how you can solve those challenges by embracing Apache Kafka as a foundation of your data pipeline and leveraging modern stream-processing frameworks like Apache Flink.
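As a rough illustration of the reprocessing idea (not code from the talk), here is a hedged Python sketch using the kafka-python client: in a Kappa setup, historical backfill amounts to pointing a fresh consumer group at the beginning of the retained log and replaying it through the same logic that handles live traffic. Topic name, servers and the processing step are placeholders.

```python
from kafka import KafkaConsumer  # pip install kafka-python
import json

# A fresh consumer group that starts from the earliest retained offset:
# "backfill" in a Kappa architecture is just replaying the same log
# through the same processing code that handles live events.
consumer = KafkaConsumer(
    "orders",                              # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="orders-backfill-2021",       # new group = fresh offsets
    auto_offset_reset="earliest",          # start at the beginning of the log
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    event = record.value
    # ... same enrichment/aggregation logic used for live traffic ...
    print(record.partition, record.offset, event.get("order_id"))
```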
2021Plain Schwarz
39:04
Benton, William
If you're dealing with structured data at scale, it's a safe bet that you're depending on Apache Parquet in at least a few parts of your pipeline. Parquet is a sensible default choice for storing structured data at rest because of two major advantages: its efficiency and its ubiquity. While Parquet's storage efficiency enables dramatically improved time and space performance for query jobs, its ubiquity may be even more valuable. Since Parquet readers and writers are available in a wide range of languages and ecosystems, the Parquet format can support a range of applications across the data lifecycle, including data engineering and ETL jobs, query engines, and machine learning pipelines. However, the ubiquity of Parquet readers and writers hides some complexity: if you don't take care, some of the advantages of Parquet can be lost in translation as you move tables from Hadoop, Flink, or Spark jobs to Python machine learning code. This talk will help you understand Parquet more fully in order to use it more effectively, with an eye towards the special challenges that might arise in polyglot environments. We'll level-set with a quick overview of how Parquet works and why it's so efficient. We'll then dive into the type, encoding, and compression options available and discuss when each is most appropriate. You'll learn how to interrogate and understand Parquet metadata, and you'll learn about some of the challenges you'll run into when sharing data between JVM-based data engineering pipelines and Python-based machine learning pipelines. You'll leave this talk with a better understanding of Parquet and a roadmap pointing you away from some interoperability and performance pitfalls.
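A minimal sketch of what "interrogating Parquet metadata" can look like on the Python side, using pyarrow (file name and columns are made up); it writes a tiny table with explicit compression and dictionary settings and then reads back the row-group metadata without loading the data.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "country": ["DE", "DE", "US"]})

# Writing: choose compression and dictionary encoding explicitly
# rather than relying on whatever a given writer defaults to.
pq.write_table(table, "users.parquet", compression="zstd", use_dictionary=True)

# Reading back the metadata without loading the data:
pf = pq.ParquetFile("users.parquet")
print(pf.schema_arrow)                        # logical (Arrow) types
print(pf.metadata.num_rows, pf.metadata.num_row_groups)

col = pf.metadata.row_group(0).column(0)      # first column chunk
print(col.compression, col.encodings, col.statistics)
```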
2021Plain Schwarz
35:22
Schindler, Uwe
Since version 3 of Apache Lucene and Solr, and from the early beginnings of Elasticsearch, the general recommendation has been to use MMapDirectory as the implementation for index access on disk. But why is this so important? This talk will first introduce the technical details of memory mapping and explain why using other techniques slows down index access by a significant amount. Of course, we no longer need to talk about 32/64-bit Java VMs - everybody now uses 64 bits with Elasticsearch and Solr - but with current Java versions, Lucene still has some 32-bit-like limitations on accessing the on-disk index with memory mapping. We will discuss those limitations, especially with index sizes growing up to terabytes, and afterwards Uwe will give an introduction to the new Java Foreign Memory Access API (JEP 370, JEP 383, JEP 393), which first appeared with Java 14 but is still incubating. The new API sounds interesting and will remove all previous issues and limitations, but with Lucene's current design, the first and second JEP incubators (Java 14, 15) would have been hard to implement. In close cooperation between Lucene developers and OpenJDK committers, starting with Java 16, the 3rd incubator will finally be ready to be used from Lucene: a first preview of Lucene's implementation was developed as a draft pull request. Uwe will show how future versions of Lucene will be backed by next-generation memory mapping and what needs to be done to make this usable in Solr and Elasticsearch - bringing you memory mapping for indexes with tens or maybe hundreds of terabytes in the future!
2021Plain Schwarz
26:56
Tai, Tzu-Li (Gordon)
Stateful Functions (StateFun), a project developed under the umbrella of Apache Flink, provides consistent messaging and distributed state management for stateful serverless applications. It does so in a vendor-, platform- and language-agnostic manner - applications are composed of inter-messaging, polyglot functions that can be deployed on a mixture of your preferred FaaS platforms, as a Spring Boot application on Kubernetes, or really any deployment method typically used in modern cloud-native architectures. In this session, you will learn about the core concepts behind the project and the abstractions that developers work with, all up to date with the latest upcoming 3.0 release. For new users, the content of this talk will be a perfect place to get started with StateFun. For existing users, this will be a great opportunity to catch up with the latest advancements in the project, including improved ergonomics around zero-downtime upgrade capabilities of StateFun applications, a type system for messages and function state, as well as an extended array of new language SDKs.
2021Plain Schwarz
20:57
Allison, Tim
Apache Tika is used in big data document processing pipelines to extract text and metadata from numerous file formats. Text extraction is a critical component for search systems. While work on 2.0 has been ongoing for years, the Tika team released 2.0.0-ALPHA in January and will release 2.0.0 before Buzzwords 2021. In addition to dramatically increased modularization, there are new components to improve scaling, integration and robustness. This talk will offer an overview of the changes in Tika 2.0 with a deep dive on the new tika-pipes module that enables synchronous and asynchronous fetching from numerous data sources (JDBC, fileshare, S3), parsing, and then emitting to other endpoints (fileshare, S3, Solr, Elasticsearch, etc.).
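For a feel of the basic extract-text-and-metadata step (not the new tika-pipes module itself), here is a small sketch using the community tika-python binding; it assumes a Java runtime or a reachable tika-server instance, and the file name is a placeholder.

```python
# pip install tika -- the Python binding starts (or talks to) a Tika server,
# so a Java runtime or a running tika-server instance is required.
from tika import parser

parsed = parser.from_file("report.pdf")        # any of the many supported formats
print(parsed["metadata"].get("Content-Type"))  # extracted metadata
print((parsed.get("content") or "")[:500])     # extracted plain text (may be None)
```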
2021Plain Schwarz
28:18
Watson, Sophie et al.
Kubernetes and software engineering practice are quietly revolutionizing data science by providing practitioners with better infrastructure and more disciplined habits, and many tools build on these primitives and practices to make machine learning deployments on Kubernetes simple, portable, and scalable. However, bringing engineering discipline to data science workflows turns out to be a thorny problem, and reproducible research is harder to achieve than we might assume. In this talk, we’ll examine the problem of reproducible research from several angles and present tools we’ve built on Kubernetes that address different facets of the problem. You’ll see how to treat Jupyter notebooks as real software artifacts -- not merely as ad hoc environments for discovery -- and learn about what that mindset change entails. You’ll see how we build workflows from notebooks, how we automatically generate model services with CI/CD pipelines, and the tools we use to generate and track metrics to identify concept drift. You’ll learn about some surprising challenges of reproducibility and learn why some convenient model operationalization workflows might require heroic practitioner discipline to produce consistent results.
2021Plain Schwarz
36:57
Carboni, Sophie et al.
The steps from speech to text are quite simple in theory: you transform the waves into phonemes, then you group them together and decide which combination has the best probability of representing a meaningful word or phrase based on a dictionary. We often use the services that come with our devices for this task: Google services if the device is based on Android or you are using Chrome, Apple services if the device is an iPhone, Amazon services if the device is compatible with Alexa, and so on. But there are cases where you cannot or do not want to use this type of service. We tried solving this problem with Elasticsearch. As the final step is searching through a dictionary of phonemes and finding the combination that best matches a real phrase, we can easily think of a solution based on an inverted index. In this talk we share our experience with implementing a prototype and give you all the tips and tricks for implementing such a system in your own infrastructure.
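The inverted-index idea can be sketched roughly as follows with the Elasticsearch Python client; the index name, field names, analyzer choices and phoneme strings are all hypothetical, and keyword arguments differ slightly between client major versions.

```python
from elasticsearch import Elasticsearch  # elasticsearch-py; argument names vary by major version

es = Elasticsearch("http://localhost:9200")

# Index known phrases as space-separated phoneme sequences (hypothetical field names).
es.index(index="phrases", document={
    "text": "turn on the lights",
    "phonemes": "T ER N AA N DH AH L AY T S",
})

# At query time, the decoded phoneme stream becomes an ordinary match query;
# fuzziness tolerates recognition errors in individual phonemes.
hits = es.search(index="phrases", query={
    "match": {"phonemes": {"query": "T ER N AA N DH AH L AY T Z", "fuzziness": "AUTO"}}
})
print(hits["hits"]["hits"][0]["_source"]["text"])
```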
2021Plain Schwarz
31:27
Srivastava, Shikhar
Imagine you have a collection made up of several million news stories spanning several years. Your users may want to know how many of yesterday's news stories matched the query “American President.” That’s easy. It's just a normal query to the search engine. However, let’s say they want to get the same number for each day of the last year, or even each day of the last five years. That's a bit harder. Now, if you want to do this in less than a second, then it becomes really challenging. And what if new data keeps coming into the collection every day and you need to scale to billions of news stories? In this talk, we will show how we achieved responsive time series aggregation across billions of documents using Apache Solr facets. We will discuss how, by optimizing our cache hit rate on complex queries, we successfully reduced latency by a factor of ten (from tens of seconds to under one second) and increased throughput by 60 times (from 10 queries/minute to 10 queries/second). We will review the experiments we followed, show how we used Apache JMeter to quickly run the right experiments, discuss the data structures we exploited, and look at how we organized our caching policy using Eclipse Memory Analyzer to scale Apache Solr facets to the stars! Get ready to take off!
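A minimal sketch of the kind of per-day facet query involved, using Solr's JSON Request API from Python; the collection name, date field and default search field are placeholders, not details from the talk.

```python
import requests

# JSON Request API: one range facet gives a per-day count of matching stories.
query = {
    "query": "\"American President\"",
    "params": {"df": "body_text"},            # hypothetical default search field
    "limit": 0,                               # we only want the facet buckets
    "facet": {
        "per_day": {
            "type": "range",
            "field": "published_at",          # hypothetical date field
            "start": "NOW-1YEAR/DAY",
            "end": "NOW/DAY",
            "gap": "+1DAY",
        }
    },
}
resp = requests.post("http://localhost:8983/solr/news/select", json=query).json()
for bucket in resp["facets"]["per_day"]["buckets"]:
    print(bucket["val"], bucket["count"])
```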
2021Plain Schwarz
43:48
Thomas, Sherin et al.
Full title: Democratizing Climate Science: Searching through NASA’s earth data with AI at scale
While everyone was sheltering in place in 2020, a group of citizen scientists decided to tackle the problem of auto-detecting interesting weather patterns in earth’s imagery collected by NASA satellites. The problem: we were dealing with a scale we had never seen before - 20 years’ worth of earth’s imagery collected continuously not just by NASA but also by other private and public space agencies across the world, and growing exponentially by the day. We wanted to build a reverse image search engine on this massive unlabelled dataset and automatically detect interesting phenomena such as hurricanes, polar vortexes, melting ice caps, etc. NASA’s scientists had performed extensive research to solve this problem in theory - but no one had attempted to build a production-quality system to put it into practice before. SpaceML was started in collaboration with NASA’s Frontier Development Lab and Google Cloud and is built entirely by industry professionals and student mentees around the world with their donated time. In this presentation we will talk about how we solved the problem of applying deep learning to continuously search for interesting weather patterns in petabytes of earth’s imagery. We will cover the challenges involved in continuous data processing, indexing and running distributed search while providing a low-latency, highly available search API, and how we used Google Cloud offerings such as Dataflow, Functions and App Engine along with PyTorch and nearest neighbor search libraries such as ScaNN, FAISS and Annoy to make it happen. We will detail the end-to-end self-supervised learning system that we built with an eye on cost-constrained usage of cloud resources while maintaining extensibility for other space science endeavors. We will also touch upon the organizational challenges in building this system with a highly distributed team, including how we employed fast prototyping to build confidence in the system while gradually increasing the scale to petabytes of data. We built this system with the goal of open sourcing the set of components to expand the project’s applicability beyond space science. In this talk we will describe the architecture of the individual components so that you can leverage them to enable deep learning on any type of dataset in your field.
2021Plain Schwarz
26:25
Ortega, Sebastián
Letgo is a second-hand marketplace app reshaping secondhand trade in Turkey so that re-use is the default trusted choice. We designed the data platform to be built on top of the principles of self-service, privacy-law compliance, data governance at business unit level, minimal maintenance and cost containment by design. We will describe how we defined our company-wide data model, leveraging Avro schemas and enabling at the same time the most impactful features, like:
- tagging private fields holding sensitive data for privacy-law compliance
- ensuring the quality and structure of the data landing in the company data lake
- efficient and reliable transportation and consumption of data at platform level
- data catalog: discovery of available data by teams
Our design is built around the Apache Kafka ecosystem (with special mention of Kafka Connect) for data ingestion, plus AWS services and the Spark framework for data transformations and data lake ingestion. Thanks to these principles we are able to ensure data governance over batch and real-time data, while keeping at the same time a multi-tiered data lake: the inner tier keeps the most sensitive data and the outer tiers keep only the data accessible to each single company business unit.
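One way the field-tagging idea can be sketched (an illustration, not Letgo's actual schema or tag name): Avro tolerates extra attributes on field definitions, so a custom "pii" flag can travel with the schema and drive redaction before data reaches the outer tiers.

```python
import json

# Avro ignores unknown attributes, so a custom "pii" flag can ride along in the schema.
schema = {
    "type": "record",
    "name": "UserRegistered",
    "namespace": "com.example.events",        # hypothetical namespace
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "email",   "type": "string", "pii": True},
        {"name": "country", "type": "string"},
    ],
}

def redact(record: dict, schema: dict) -> dict:
    """Drop fields tagged as PII before writing to the outer data-lake tiers."""
    pii = {f["name"] for f in schema["fields"] if f.get("pii")}
    return {k: v for k, v in record.items() if k not in pii}

print(redact({"user_id": "42", "email": "a@b.c", "country": "TR"}, schema))
print(json.dumps(schema, indent=2))
```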
2021Plain Schwarz
27:13
Moffatt, Robin
Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low-latency analytics, event-driven architectures and the population of multiple downstream systems. In this talk, we’ll look at one of the most common integration requirements - connecting databases to Kafka. We’ll consider the concept that all data is a stream of events, including that residing within a database. We’ll look at why we’d want to stream data from a database, including driving applications in Kafka from events upstream. We’ll discuss the different methods for connecting databases to Kafka, and the pros and cons of each. Techniques including Change Data Capture (CDC) and Kafka Connect will be covered, as well as an exploration of the power of ksqlDB for performing transformations such as joins on the inbound data. Attendees of this talk will learn:
- why events, not just state, matter
- the difference between log-based CDC and query-based CDC
- how to choose which CDC approach to use
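For flavour, a hedged sketch of registering a log-based CDC source with the Kafka Connect REST API using a Debezium Postgres connector; hostnames, credentials and the table list are placeholders, and the property names follow the Debezium 1.x documentation, so verify them against the version you run.

```python
import requests

# Register a hypothetical Debezium Postgres source with the Kafka Connect REST API.
connector = {
    "name": "inventory-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "inventory",
        "database.server.name": "inventory",   # used as the topic prefix
        "table.include.list": "public.orders",
    },
}
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json()["name"], "registered")
```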
2021Plain Schwarz
19:38
Metzger, Robert
Streaming applications often face changing resource needs over their lifetime: there might be workload differences during day- and nighttime, or business-related events that cause load spikes. Being able to automatically adapt to these changes is a common requirement for production deployments. Apache Flink has supported stateful job rescaling since the early days, but so far this had to be done by stopping and restarting jobs manually. In the latest release (1.13), the Flink community introduced a much anticipated feature: autoscaling. Now, you can add machines to your cluster for triggering an automatic scale up, or remove machines for letting it scale down again! In this talk, we'll explore different deployment scenarios for streaming applications and explain how Flink users can benefit from autoscaling to enable new use cases, streamline day-to-day operations and avoid unnecessary costs. In addition, we’ll briefly describe how this feature was implemented and how it enables further resource elasticity improvements in the future, such as scaling controlled by Flink.
2021Plain Schwarz
30:26
Richardson, Rob
Web technologies have come leaps and bounds. But are you still using the tired old database from the last generation? Let's look at the methodology of microservices, compare it to bounded contexts, and look at ops tasks for micro-databases. Let's tour all the flavors of databases, understand their pros and cons, and see when you would choose each. You'll leave with a roadmap for moving from data-monolith to micro-databases.
2021Plain Schwarz
30:09
Ferreira, Ricardo
Building streaming systems is a popular way for developers to implement applications that react to data changes and process events as they happen. It is an exciting new world that technologies like Apache Pulsar made available for anyone to use. But all this goodness doesn’t come for free. One of the challenges of this type of architecture is that its distributed nature makes it hard and sometimes even impossible to identify the root cause of problems quickly. That is why distributed tracing technologies are so important. By gluing together disparate services into a single and cohesive transaction, developers can provide to the operations team a way to pragmatically observe the system and to quickly identify the root cause of problems such as slowness and unavailability. This talk will explain how to implement distributed tracing in Pulsar applications using OpenTelemetry—an observability framework for cloud-native software. A demo will be used to clarify the concepts.
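A minimal sketch of the producer side, assuming the stable OpenTelemetry 1.x Python API and the pulsar-client library (topic, service URL and span names are made up, and a configured SDK/exporter is assumed): the trace context is carried in message properties so a consumer can pick it up again.

```python
import pulsar
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("orders-producer")        # assumes an SDK/exporter is configured

client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/orders")

with tracer.start_as_current_span("publish-order"):
    carrier = {}
    inject(carrier)                                 # serialize the current trace context
    # Pulsar has no built-in tracing headers, so the context rides along as
    # message properties; the consumer side can extract() it from there.
    producer.send(b'{"order_id": 42}', properties=carrier)

client.close()
```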
2021Plain Schwarz
31:13
Babu, Prashanth
Data Engineers face many challenges with Data Lakes. GDPR requests, data quality issues, handling large metadata, merges and deletes are a few of the tough challenges almost every Data Engineer encounters with a Data Lake built on formats like Parquet, ORC, Avro, etc. This session showcases how you can effortlessly apply updates, upserts and deletes on a Delta Lake table with very few lines of code, use time travel to go back in time to reproduce experiments and reports, and avoid the challenges caused by small files. Delta Lake was developed by Databricks and has been donated to the Linux Foundation; the code can be found at http://delta.io. Delta Lake is being used by a huge number of companies across the world due to its advantages for Data Lakes. We will discuss, demo and showcase how Delta Lake can be helpful for your Data Lakes, which is why many enterprises have Delta Lake as the default data format in their architecture. We will use SQL or its equivalent Python or Scala API to showcase various Delta Lake features.
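As a taste of the Python API (a sketch under the assumption that the delta-spark package is available on the Spark classpath; the table path and data are placeholders), upsert, delete and time travel look roughly like this:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("delta-demo")
         # assumes the delta-spark package and its extensions are on the classpath
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/customers"                       # hypothetical table location
updates = spark.createDataFrame([(1, "anna@new.example")], ["id", "email"])

table = DeltaTable.forPath(spark, path)

# Upsert: update matching rows, insert the rest.
(table.alias("t")
      .merge(updates.alias("u"), "t.id = u.id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())

# GDPR-style delete by predicate.
table.delete(col("id") == 1)

# Time travel: reproduce a report against an earlier version of the table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```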
2021Plain Schwarz
27:32
Vegt, Pieter
Context: In almost all markets, online marketplaces are in charge, from ride sharing to food delivery to e-commerce. While these marketplaces are very good at matching offer and demand, this benefit comes at a cost: 1) A lot of data is being collected, both from suppliers and consumers. 2) The marketplace owns the customer experience, taking away the opportunity for retailers to build a sustainable relationship with their customers. 3) In e-commerce, retailers face the additional challenge that the platform is also a competitor, but with the advantage of more data and no commission fees.
Proposed solution: We are proposing a decentralised solution that allows retailers to connect directly with customers through the vendor relationship manager model (a.k.a. Me2B), allowing them to have more control over the customer experience. Instead of customers connecting to individual retailers or a marketplace, VRM enables retailers to connect to the customer's own environment. This enables customers to create their own marketplace (i.e. the customer becomes their own platform). With this model both retailers and customers gain more control, which opens up new possibilities for search and discovery. In addition, we are focusing on a local-first model, where local retailers are matched to local customers.
Talk: The following will be discussed:
- Proposal for a search experience within the customer's own environment
- The challenges and possible solutions for this search experience
- Implications for business (benefits and challenges)
- Implications for customers (benefits and challenges)
Audience: We are challenging the audience with our view that there is an alternative to online marketplaces that is better for people and business, and encourage them to think about the implications this has on search and discovery, both from a business and a technical perspective.
2021Plain Schwarz
27:34
Krenn, Philipp
Logs are everywhere. But they have gone through an interesting development over the years:
- grep: This works well as long as you have a single instance to search on. Once you need to SSH into many machines and try to piece together the results of multiple grep commands, things tend not to work that well anymore.
- Splunk: Centralizing those logs and letting users search through them with a piped language in Splunk is the logical step to fix that issue. However, the more data you centralize, the slower this will get.
- ELK: The solution to that slowness is using full-text search. Elasticsearch, in combination with Logstash and Kibana (plus Beats), gave logs a major performance boost. But at what cost?
- Loki: Reducing the scope and going back to a smart data structure combined with grep gives Loki the possibility to reduce costs while still providing good performance.
- Closing the gap: So what are the tradeoffs between the different systems, and are they potentially closing some gaps between performance and cost?
2021Plain Schwarz
32:36
Chia, Patrick John
How do shoppers pick a single product out of the vast number presented to them? One part of their decision making is to compare between the different products by weighing the pros and cons of their features for a given price: for shops with a huge inventory, this can no doubt be a challenging task. Explicit comparison features (i.e. “click on product X and Y to see them side-by-side”) are a classical way of easing the shopping cognitive load, and recently eCommerce giants started incorporating this concept into a new type of recommendation. However, scaling this approach to huge inventories and a variety of verticals is a daunting task for traditional retailers: explicit comparisons are limited to manual 1:1 interfaces, and detailed comparison tables require a lot of manual work and often presuppose a well structured product catalogue. In this talk, we present our pipeline to generate comparisons-as-recs at scale in a multi-tenant setting, with minimal assumptions about catalog size and web traffic. Our approach leverages both product meta-data (image, text) and behavioural data, and a combination of neural inference and decision-making principles. In particular, we show how to break down the problem into two main steps. First, for a given product we use dense representations to perform substitute identification, which determines a group of alternative products of the same category. Then, based on how their features and price vary, we select the final set of products and determine which features to display for comparison. Compared to existing, single-tenant literature, our experiments highlight the need for further improvements in dealing with noisy data and the adoption of data augmentation techniques: we conclude by sharing some practical tips for practitioners, and highlighting our testing and product roadmap.
2021Plain Schwarz
30:48
Burch, Nick
Classifying images into Cats or Hotdogs may make for great AI demos, but for many of us, it has limited $DAYJOB uses. What if you have loads of documents, or inconsistently-written text, or a mash of information? Fear not - the latest AI / ML techniques for text can help! With the help of Apache MXNet, scikit-learn, Elasticsearch and friends, we'll progress from a simple text-based ML system to an advanced system with full linguistic understanding. We'll also cover some key concepts around building AI / ML systems, and some of the pitfalls that beginners risk encountering. Full example code provided! Along the way, we'll look at why text has historically been hard for AI / ML, what the latest techniques are, and the Open Source libraries / frameworks implementing them. Thanks to the magic of cloud-hosted notebooks, you can follow along with the code as we go, and try some live coding if we all dare!
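A tiny "simple text-based ML system" baseline of the kind such a progression typically starts from (toy data and labels invented for illustration, not the talk's example code):

```python
# A deliberately simple text baseline: TF-IDF features plus a linear classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["invoice overdue please pay", "your parcel has shipped",
               "meeting moved to friday", "package out for delivery"]
train_labels = ["finance", "shipping", "calendar", "shipping"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)
print(model.predict(["where is my delivery"]))   # e.g. ['shipping']
```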
2021Plain Schwarz
30:12
Salian, Neelesh
Metadata has been a key data infrastructure need since the beginning of our team's history at Stitch Fix. We began this journey in 2015 with the setup of the Hive Metastore to work with Spark, Presto, and the rest of the platform infrastructure. But as our business needs grew, we felt the need to enhance and extend our metadata ecosystem. In this talk, we want to share our journey of building additional capabilities with metadata to solve data and business challenges. Starting with our base infrastructure - the Hive Metastore, we will highlight each capability that led us to build the extensions into our present day metadata infrastructure. This includes improvements made to the Hive Metastore itself, extending the use of metadata beyond table schemas, and additional microservices we added to make access and use of metadata easier. Building these capabilities has helped our team use metadata to power internal use cases. We want to share how we went about building this ecosystem and the lessons we learned along the way.
2021Plain Schwarz
26:46
Krantz, Myrle
If you're running a system at scale, you need tools to maintain it. This talk gives a high level overview of what observability and monitoring mean, and how to use Prometheus, Loki, Cortex, and Tempo to monitor your stack.
2021Plain Schwarz
52:50
Byrd-Sanicki, Megan
Open source crossed the chasm into the mainstream, with users in all industries. Maintaining the users’ trust and sustaining innovation is key to open source’s success. However, in a world where communities are passionate, multicultural, and primarily use online communication, it is challenging to move communities towards a shared vision in a frictionless, sustainable way. Community challenges can impact innovation, putting user adoption at risk and, even more importantly, hurting community members. Stronger open source leadership can address these challenges and there is a call for more leaders in every project. Good news! Every contributor is a leader, either through self-leadership, leading others, or leading the community, yet most people have never been trained on how to lead. This talk provides the leadership training you need and covers:
- What is leadership and why strengthen open source leadership
- Key leadership and emotional intelligence principles
- Practical ways to lead in order to create a diverse, thriving community, especially in today’s extreme ambiguity and change.
2021Plain Schwarz
33:49
Grygleski, Mary et al.
This presentation will be a lively discussion with hands-on coding to illustrate how to construct a reactive, event-driven data flow pipeline. We will use multiple library implementations of the Reactive Streams specification, such as Akka Streams, Eclipse Vert.x and RxJava. The sample use case will mimic a real-life example of data being collected from multiple distributed sources, which will then be fed to a legacy processor as «wrapped» by a reactive microservice. The resulting data will flow to a «sink» to be prepared for further processing. We will highlight the strength of the Reactive Streams in controlling backpressure during processing.
2021Plain Schwarz
29:26
Eagan, Marcus
To commemorate 20 years since Doug Cutting open sourced Lucene and donated it to Apache, I want to talk about ways everyone can pitch in to ensure the long-term viability of the open source project. Someone looking at the replication code for the first time in 20 years might not really understand the behavior of something like master-slave. Even though we understand it today, the onus is on the contributors and maintainers to constantly add more clarity to a project. Landing such a complex PR can be tricky and usually involves multiple people. Here's how I did it, and here are a few other areas where I have been focused to improve the long-term viability of the code base.
2021Plain Schwarz
33:20
Pietsch, Malte
Utilizing machine learning models to improve search has been an immensely active area for several years now. Some promises were kept, many others were broken. With the rise of Transformer models like BERT we seem to finally be entering a chapter where models not only perform well in the research lab but actually make their way into the production stack. Now that almost every English Google search query is powered by a Transformer [1], it is clear that these models improve the search experience, and can do so at scale. As Transformers only rely on text, the transition from web search to custom enterprise search seems more tempting than ever. In this talk, we will dive into some of the most promising methods and show how to:
- improve document retrieval via dense passage retrieval
- return more granular search results by showing direct answers to users’ questions
- scale those pipelines via DAGs and Approximate Nearest Neighbour (ANN) search for production workloads
- avoid common pitfalls when moving to production
All methods will be illustrated with code examples based on the open-source framework Haystack [2] so that participants can easily reproduce them at home and let the transformers into their production stack - one by one and carefully selected!
[1] google-bert-used-on-almost-every-english-query-342193
[2] github.com/deepset-ai/haystack/
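For orientation, a hedged sketch of a dense-retrieval question-answering pipeline along the lines the abstract describes, written against the Haystack 1.x API; module paths, class names and model identifiers have moved between Haystack releases, so treat these imports as assumptions to verify rather than the talk's exact code.

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import DensePassageRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

store = ElasticsearchDocumentStore(host="localhost", index="docs")
retriever = DensePassageRetriever(
    document_store=store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)
store.update_embeddings(retriever)                  # build the dense index

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

result = pipeline.run(query="Who open sourced Lucene?",
                      params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}})
print(result["answers"][0].answer)
```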
2021Plain Schwarz
12:11
Precup, Lucian
Like many text mining applications, automatic text categorization is usually implemented with flavors of Machine Learning algorithms, which are trained with an appropriate training set to build the model. This model contains the statistical data based on the training set texts that will later allow the system to match an input document with the corresponding category. But wait… statistical data on text documents? Doesn’t it remind you of our dear improved inverted index at the core of Apache Lucene? Maybe we could consider the index containing the training set documents as our trained model? And maybe a simple query against this index could give us the category (or categories) most likely to apply to a given document? In this talk, we will demonstrate this approach and show that it can perform well and possibly "for free", as we surely all have Lucene-based tools in our application portfolios!
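The idea can be sketched with any Lucene-based engine; below is a rough illustration via the Elasticsearch Python client (not the talk's own implementation): the "training" step is just indexing labelled documents, and classification is a more_like_this query followed by a vote among the categories of the top hits. Index and field names are placeholders.

```python
from collections import Counter
from elasticsearch import Elasticsearch  # keyword arguments vary by client major version

es = Elasticsearch("http://localhost:9200")

# "Training" is simply indexing labelled documents, e.g.:
# es.index(index="training", document={"text": "...", "category": "sports"})

def classify(text: str, k: int = 10) -> str:
    hits = es.search(index="training", size=k, query={
        "more_like_this": {"fields": ["text"], "like": text,
                           "min_term_freq": 1, "min_doc_freq": 1}
    })["hits"]["hits"]
    votes = Counter(h["_source"]["category"] for h in hits)
    return votes.most_common(1)[0][0] if votes else "unknown"

print(classify("the match went to extra time after a late equaliser"))
```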
2021Plain Schwarz
08:54
Precup, Lucian
It is proven that for relatively well-structured data, like in e-commerce for example, a hand tailored search configuration can easily outperform machine learning approaches for relevance. The search configuration considers the different searchable fields, a business taxonomy and ontology, some domain related synonyms, a few specific landing pages, boosts and some business numerical criteria. In the same way, we describe an approach for relevance in the case of large-scale search engines which is not based on classical "PageRank" and machine learning approaches. We propose a model based on social interactions between communities and individuals that are using or configuring the search engine. We then compare this model with machine learning powered approaches.
2021Plain Schwarz
29:53
Mitchell, Lorna
As modern software architecture evolves and we adopt event-driven systems into our practice, let's take time to get into the nitty-gritty of designing the payloads that actually carry those events around. With strategies for which fields to include and how to handle changes to the data structure as the requirements evolve, this session has real-world advice to keep you on track. When it comes to data formats, choosing between self-contained formats such as JSON or XML, or a serialization format like Avro, this session covers how to design an approach that fits your application and platform. The session includes examples of using event-streaming tools such as Kafka with your application and points out gotchas to avoid when teaching your applications to play nicely together.
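As one illustration of the design space (a common envelope shape, not a pattern prescribed by the talk), a versioned JSON event might separate routing metadata from the business payload so consumers can deserialize and route before caring about individual fields:

```python
import json, uuid
from datetime import datetime, timezone

# Hypothetical envelope: stable metadata outside, versioned business payload inside.
event = {
    "event_type": "order.placed",
    "schema_version": 2,                        # bump when the payload structure changes
    "event_id": str(uuid.uuid4()),
    "occurred_at": datetime.now(timezone.utc).isoformat(),
    "producer": "checkout-service",
    "payload": {
        "order_id": "o-1234",
        "total": {"amount": "19.99", "currency": "EUR"},
    },
}
print(json.dumps(event, indent=2))
```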
2021Plain Schwarz
27:30
Solbakken, Lester
Anything can be represented by a vector. Text can be represented by vectors describing the text's meaning. Images can be represented by the objects they contain. Users of a system can be represented by their interests and preferences. Even time-based entities such as video, sound, or user interactions can be represented by vectors. Finding the most similar vectors has all kinds of useful applications. There are many libraries to choose from for similarity search. However, in real-world applications, there are additional complications that need to be addressed. For instance, similarity search needs to scale up while ensuring that data indexed in the system is searchable immediately, without any time-consuming index building in the background. Most importantly, however, additional search filters are often combined with the similarity search. This can severely limit the quality of the end result, as post-filtering can prevent otherwise relevant results from surfacing. In this talk, we'll explore some real cases where combining approximate nearest neighbors (ANN) search with filtering causes problems. The solution is to integrate the ANN search with filtering; however, most libraries for nearest-neighbor search work in isolation and do not support this. To our knowledge, the only open-source platform that does is Vespa.ai, and we'll delve into how Vespa.ai solves this problem.
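To make the idea concrete, here is a hypothetical Vespa query that combines a nearestNeighbor operator with a metadata filter, so the filter is applied during the ANN search rather than as a lossy post-filter. Field names, ranking profile and the embedding are invented, and the exact YQL annotation syntax and tensor parameter names vary across Vespa versions, so check the documentation for the release you use.

```python
import requests

# Hypothetical Vespa query: ANN search constrained by a structured filter.
query = {
    "yql": 'select * from sources * where '
           '([{"targetHits": 100}]nearestNeighbor(embedding, query_embedding)) '
           'and in_stock = true;',
    "ranking.features.query(query_embedding)": "[0.12, 0.33, 0.98, 0.04]",
    "ranking.profile": "similarity",
    "hits": 10,
}
resp = requests.post("http://localhost:8080/search/", json=query).json()
for child in resp.get("root", {}).get("children", []):
    print(child["relevance"], child["fields"].get("title"))
```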
2021Plain Schwarz
37:44
Schaefer, Lauren
Did you grow up on relational databases? Are document-based databases a bit of a mystery to you? This is the session for you! We’ll compare terms and concepts, explain the benefits of document-based databases, and walk through the 3 key ways you need to change your mindset to successfully use document-based databases.
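A small sketch of the mindset shift (illustrative only, using pymongo with a made-up collection): data that is read together is stored together in one document, instead of being normalized across tables and joined at query time.

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]        # hypothetical deployment

db.orders.insert_one({
    "_id": "o-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},  # embedded, not a foreign key
    "items": [                                                # one-to-many lives inline
        {"sku": "BOOK-42", "qty": 1, "price": 19.99},
        {"sku": "PEN-07", "qty": 3, "price": 1.50},
    ],
})

# One round trip returns the whole aggregate; no join needed.
print(db.orders.find_one({"customer.email": "ada@example.com"}))
```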
2021Plain Schwarz
41:00
Francke, Lars et al.
The need for companies to deploy and operate Big Data infrastructures hasn't gone away but their options to do so have dwindled in the past few years. That's why we decided to build a new Open Source Big Data distribution. It includes the usual suspects like Apache Kafka, Apache Spark, Apache NiFi, etc. We asked around and were told it's a crazy idea but we did it anyway: We implemented a Kubelet in Rust that uses systemd as its backend instead of a container runtime. We also started writing Operators that target these special kubelets. This means we can deploy hybrid infrastructure (partly running in containers and partly on "bare metal") using the same stack, the same tools, the same description languages, the same knowledge, etc. getting the best of both worlds. In this talk we'll share what we learned about writing Kubernetes Operators (in Rust) as well as gain insights into our new distribution.
2021Plain Schwarz
54:26
Friedman, Ellen et al.
Open session where Lars Albertsson, data engineering entrepreneur and recurring Berlin Buzzwords speaker, answers questions regarding data engineering and DataOps.
2021Plain Schwarz
29:27
Watters, Kevin
Payloads are a powerful though seldom utilized feature in the Lucene-Solr ecosystem. This talk reviews the existing payload support in Lucene and introduces the new features in Lucene and Solr 9 (LUCENE-9659 / SOLR-14787). The main focus of the talk will be to explore real-world search & ML use cases that traditionally utilize a query-time join, and the application of Lucene payloads to solve them. This talk is for search practitioners interested in utilizing machine-learned data in search-based analytics dashboards. Many Solr-based applications attempting to deal with machine-learned classifications are forced to implement a parent-child join relationship between a document and its classifications. This model introduces many additional system constraints and costs at both query and index time to maintain the ability to filter results as desired. New features in the payload span query in Lucene provide applications a way to maintain query flexibility without incurring the cost of performing a query-time join. This greatly simplifies system design and architecture and can provide dramatic improvements to query performance. A reference implementation will be presented that compares the join and payload approaches. The demonstration will show how to search for documents that have classifications above a particular confidence threshold at scale.
2021Plain Schwarz