Jim Baker - Scalable Realtime Architectures in Python
This talk will show how you can readily implement highly scalable and fault-tolerant realtime architectures, such as dashboards, using Python and tools like Storm, Kafka, and ZooKeeper. We will focus on two related aspects: composing reliable systems using at-least-once and idempotence semantics, and how to partition for locality.
-----
Increasingly we are interested in implementing highly scalable and
fault-tolerant realtime architectures such as the following:
* Realtime aggregation. This is the realtime analogue of working with
batched map-reduce in systems like Hadoop.
* Realtime dashboards. Continuously updated views on all your
customers, systems, and the like, without breaking a sweat.
* Realtime decision making. Given a set of input streams, a policy
describing what you would like to do, and models learned by machine
learning, optimize a business process. One example is autoscaling a
set of servers.
Obvious tooling for such implementations includes Storm (for event
processing), Kafka (for queueing), and ZooKeeper (for tracking and
configuration). Such components, written respectively in Clojure
(Storm), Scala (Kafka), and Java (ZooKeeper), provide the desired
scalability and reliability. But what may not be so obvious at first
glance is that we can work with other languages, including Python, for
the application level of such architectures. (If so inclined, you can
also try reimplementing such components in Python, but why not use
something that's been proven to be robust?)
In fact, Python is likely a better language for the app level, given
that it is concise, high level, dynamically typed, and has great
libraries. Not to mention fun to write code in! This is especially
true when we consider the types of tasks we need to write: they are
very much like the data transformations and analyses we would have
written for, say, a standard Unix pipeline. And no one is going to
argue that writing such a filter in, say, Java is fun, concise, or
even considerably faster in running time.
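To make that concrete, here is the sort of small, pipeline-style
filter we have in mind - a minimal sketch, not from the talk, with
made-up field names:

    import json
    import sys

    # Read one JSON event per line on stdin, keep the interesting
    # ones, and write a trimmed record to stdout - the shape of a
    # classic Unix pipeline stage.
    for line in sys.stdin:
        event = json.loads(line)
        if event.get("status") == "error":  # hypothetical field
            print(json.dumps({"customer": event.get("customer_id"),
                              "message": event.get("message")}))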
So let's look at how you might solve such larger problems. Given that
it was straightforward to solve a small problem, we might approach the
larger one as follows. Simply divide it into small ones; for example,
perhaps work with one customer at a time. And if failure is an
ever-present reality, then simply ensure your code retries, just as
you might have re-run your pipeline against some input files.
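A minimal sketch of both ideas, with illustrative names that are not
from the talk: route each customer to a stable partition, and wrap
any step in a naive retry loop.

    import time
    import zlib

    def partition_for(customer_id, num_partitions=8):
        # A stable hash, so the same customer always routes to the
        # same partition on any machine. (Python's built-in hash()
        # is randomized per process, so it is the wrong tool here.)
        return zlib.crc32(customer_id.encode("utf-8")) % num_partitions

    def run_with_retries(task, attempts=3, delay=1.0):
        # Naive stand-in for re-running a failed pipeline step.
        for attempt in range(attempts):
            try:
                return task()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)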
Unfortunately, both approaches require distributed coordination at
scale. And distributed coordination is challenging, especially for
real systems that will break at scale. Just putting a box in your
architecture
labeled **"ZooKeeper"** doesn't magically solve things, even if
ZooKeeper can be a very helpful part of an actual solution.
Enter the Storm framework. While Storm certainly doesn't solve all
problems in this space, it can support many different types of
realtime architectures and works well with Python. In particular,
Storm solves two key problems for you.
**Partitioning**. Storm lets you partition streams, so you can break
down the size of your problem. But if a node running your code
fails, Storm will restart it. Storm also ensures such topology
invariants as the number of nodes (spouts and bolts in Storm's lingo)
that are running, making it very easy to recover from such failures.
This is where the cleverness really begins. What if you could ensure
that **all the data** you need for a given continuously updated
computation - what is the state of this customer's account? - lives in
**exactly one place**, and then flow the supporting data through it
over time? We will look at how you can readily exploit such locality
in your own Python code.
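As a sketch of what that locality can look like - assuming a topology
that routes events with a fields grouping on a customer id, and using
the storm.py multilang adapter that ships with Storm; the field
layout here is made up for illustration:

    from collections import defaultdict

    import storm

    class CustomerTotalsBolt(storm.BasicBolt):
        # A fields grouping on "customer_id" means every event for a
        # given customer reaches exactly one instance of this bolt,
        # so this in-memory dict is the single home for that
        # customer's running state.
        def initialize(self, stormconf, context):
            self.totals = defaultdict(float)

        def process(self, tup):
            customer_id, amount = tup.values  # illustrative fields
            self.totals[customer_id] += amount
            storm.emit([customer_id, self.totals[customer_id]])

    CustomerTotalsBolt().run()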
**Retries**. Storm efficiently tracks the success and failure of
events being processed, through a batching scheme and other
cleverness. Your code can then choose to retry as necessary. Although
Storm also supports exactly-once event processing semantics, we will
focus on the simpler model of at-least-once semantics. This means your
code must tolerate retries, or in a word, be idempotent. But this is
straightforward. We have often written code like the following:
    seen = set()
    for record in stream:
        k = uniquifier(record)
        if k not in seen:
            seen.add(k)
            process(record)
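One wrinkle the snippet glosses over: in a long-running stream, the
seen set grows without bound. One common refinement, not from the
talk, is to keep only a bounded window of recent keys (MAX_KEYS below
is an arbitrary illustrative limit), which suffices when retries
arrive close to the original event:

    from collections import deque

    MAX_KEYS = 100000  # illustrative bound on dedupe state
    seen = set()
    order = deque()

    for record in stream:
        k = uniquifier(record)
        if k in seen:
            continue
        seen.add(k)
        order.append(k)
        if len(order) > MAX_KEYS:
            seen.discard(order.popleft())
        process(record)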