Processing sensor data in real-time with the public cloud
Formal Metadata
Title | Processing sensor data in real-time with the public cloud
Number of Parts | 17
Author | Heim, Andreas
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/50521 (DOI)
Language | English
Transcript: English (auto-generated)
00:08
My name is Andreas Heim, I work at a Norwegian startup called Unacast. I'm going to talk to you today a bit about our journey as a company, and also from a technical perspective, how we ended up processing
00:24
lots and lots of sensor data, in real time, with the public cloud. A few words about myself: I've been working in the IT industry in Oslo for 15-plus years. I started my career editing Active Server Pages,
00:44
and that's ASP 1.0. Dragging them to the production file share was our method of deployment. Thanks to some higher power, I ended up as a DevOps guy working in the cloud and not a maintainer of WordPress sites,
01:02
so I guess that's quite lucky for me. I joined Unacast last August. When the company started, we wanted to go head-to-head with the likes of Facebook,
01:21
market cap 260 billion US dollars, and to go head-to-head with Google, market cap 438 billion US dollars. What these two companies do is give you great products in return for personal data.
01:42
We can all agree on that, right? Everyone uses Google, almost everyone uses Facebook, and you get really great value, but you pay with your personal data. The only problem with that personal data is that it covers only about 30% of who you are.
02:02
I guess you guys, and certainly I myself, spend more than that online, but the average consumer spends about 30% of his day online. And that leaves the other 70%, right?
02:20
That's the mall, the airport, the museum. Facebook and Google don't know what you do 70% of the time, but Unacast does. So we're building what we call the physical cookie,
02:42
so that's a pretty bold statement given our market cap, but we do this with sensors. More specifically, we operate in the IoT segment with beacons and other types of location sensors.
03:01
And they can be, and already are, used for marketing purposes. It's very easy to imagine how to use this to market your product to people. The traditional approach is a push notification: when you have a beacon-enabled app and you walk into a store,
03:23
you can get a notification saying that you get a 20% discount on shoes. It's easy to come up with lots and lots of scenarios for this, loyalty cards, campaigns, etc. And this gives the retailer the opportunity to engage with the customer
03:43
when and only when he or she is in the store. So what Unacast provides is a method for the brands and retailers to reach their audience after the fact, after they've been to the store.
04:00
So we collect the data of you going into the store, and we can use that to target advertisements specifically to people who have been in the store, after the fact. And we are doing this now: we have done a deployment in about 50% of the cinemas in Norway.
04:23
We did a campaign with Coke. This is quite a busy slide, so I'll just walk through it. How this works is that a partner of ours has installed beacons at the cinema. When you enter the cinema and open the VG app, you will get an advertisement saying you can get a free Coke
04:44
if you go to the kiosk. So the cinema-goer goes to the kiosk, he redeems the Coke, sees his movie, and later at home, we can retarget that person one-to-one with an ad that says you will get a free ticket for the cinema.
05:04
So you go to the cinema, you get a free Coke, you come back, you get a free cinema ticket, you return to the cinema. This may not sound very revolutionary, but if we look at the numbers on this pilot, we see that 24% of those who saw the ad clicked on the offer,
05:24
50% of them redeemed the offer, and later at home, 60% clicked on the ad for a free cinema ticket. The average click rate on VG is 0.18%. It's quite a big difference, and that really proves
05:41
that we're on to something here. And 25% of those who clicked the ad actually returned to the cinema. Those are huge numbers in terms of online advertising. Makes sense? I want to speak briefly about where we fit in
06:02
in the ecosystem of proximity sensors. We aggregate location data from a multitude of partners through our platform, Prox. We slice and dice this data, harmonize it, build segments, and ship those segments over to the online advertising industry.
06:25
So brands and retailers can create the segments in our platform, ship them over to their media platform of choice, and do retargeting on that data. This is a position that no one has taken yet,
06:41
and that's why we are taking it now. As you can imagine, on the left side here we have the proximity solution providers, the companies that manufacture and/or deploy the beacons in the stores. For every one of those 200-plus companies
07:02
to integrate with the 100-plus media platforms, that just doesn't scale. So there needs to be someone in the middle. So we're kind of an enterprise service bus in the beacon community. I find that quite funny.
07:20
Enough talk about us; we're here for the technical stuff. In a really short amount of time, about three months, we built the first iteration of our platform using the public cloud, more specifically the Google Cloud Platform. I just want to say a few words on why we ended up with the Google Cloud Platform,
07:43
because they're not the biggest. They don't have the widest array of products or services. So actually, when we started this, we were set on using AWS. And we started drawing architecture diagrams,
08:04
virtual private networks, machine instance types. We kind of hit a wall where we didn't know where to go next or what to prioritize, because what we really wanted to do was just push some code out to production, right?
08:20
And we hired some guys from a company called Nordcloud. They specialize in cloud computing. We had sent them a list with all the questions we wanted answered. When their guy came into our office, we sat down, and he said: you guys, look at this list of questions.
08:43
Everything is about operations. Everything is about networking. Everything is about deployment. There's nothing about the real problem that you want to solve. And he said, if you guys just look over here at this cloud platform called the Google Cloud Platform,
09:00
they have all the parts that will make up your architecture, make up your system. It took a couple of weeks to understand that, but now I'll show you how this ended up. So, a naive architecture for this system.
09:23
Remember the last drawing: we have three actors. We have us, Unacast; we have the proximity solution providers, the beacon companies; and we have the ad platforms. They could be Google DoubleClick or Adform or some other company. And in the middle there's us.
09:41
So we need some way to funnel data from the proximity companies to the ad platforms in a scalable manner. What we need is some kind of endpoint, right? Some REST endpoint to ingest the data, just receive the data, and that needs to scale endlessly.
10:03
We can't foresee when we will get huge spikes in traffic, so that needs to be taken care of: it needs to be scalable and can't have any downtime. We need some way of processing the data to normalize it.
10:24
And of course we need something at the other end to do the actual delivery of the data, right? So it's like a funnel. And we need durable storage with a powerful query language that integrates nicely with the other components.
10:43
Please feel free to interrupt with questions during the presentation. These are the non-functional requirements we had: we had to have elastic scalability, and the system needs to be highly available. The storage is really important, because if that goes away,
11:01
then we're out of business, basically. Deploys need to have zero downtime, so we can deploy during peak hours. And it also needs to be fault tolerant: if one part of the system fails, we can't have that propagating all the way back up to the user.
11:20
And another thing that I think many people skip when they talk about requirements is developer experience. When we started building this, we were two engineers at Unacast, and a couple of months later we got the third one. And it needed to be fun for us to work on this 24/7 for a couple of months.
11:48
Some of the points we had: we had to have reasonable APIs to the cloud platform, and to minimize the need for operations so that we could focus on building features.
12:04
If we can buy something, we'll buy it and not build it. That means not hosting our own database cluster, not hosting our own Spark cluster. If we can buy it, we'll buy it and pay our way out of the problem.
12:21
Infrastructure needs to be scriptable, so we can do repeated, idempotent updates. Defaults also need to be reasonable; that's a place where AWS unfortunately falls a bit short. All of these things give us the ability to have a fast development cycle.
12:42
So with the architecture diagram in the back of our heads, we can start looking at what kinds of components to use for the different parts. The ingestion API is the part of the system that receives the most inbound traffic, of course, and it must scale totally automatically.
13:04
We can't have any downtime. And luckily, there's this thing called Google App Engine. I think it's something that people really haven't gotten their eyes on yet. It's a stable and proven technology.
13:20
It's one of the first things that Google launched in their cloud platform. It scales, I would say, infinitely; I don't think there's a hard limit on how much traffic you can throw at it. You can run Java, you can run Go, you can run Python, everything on it.
13:43
We're a JVM shop, so the only downside for us on App Engine is that it only runs Java 7, which reached end of life last year or the year before. So we have chosen not to put all of our business logic here, but we use it for the thin slice that receives the most traffic.
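(To make that "thin slice" concrete: a minimal sketch of what such an ingestion endpoint could look like as a plain Java 7 servlet on App Engine. The class name and the hand-off to the queue, which the talk only gets to later, are assumptions, not Unacast's actual code.)

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class IngestServlet extends HttpServlet {
  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    // Read the raw event payload from the proximity partner.
    StringBuilder body = new StringBuilder();
    String line;
    while ((line = req.getReader().readLine()) != null) {
      body.append(line);
    }
    // Hand the event off to the queue and return immediately;
    // no business logic lives in this layer.
    publishToQueue(body.toString());
    resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
  }

  private void publishToQueue(String payload) {
    // Publishing to Pub/Sub via the Google API client would go here;
    // omitted to keep the sketch short.
  }
}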
14:03
And it costs pennies to operate; I can't understand how they can actually make any money on it. But that's good for us, right? So now we have gotten the data, and we need to process it in some way to add some value to it.
14:25
And there we have something called Google Dataflow, which is a hosted batch and stream processing framework. It's comparable to Hadoop; if you subtract the pain from Hadoop, it's actually quite similar.
14:40
But also to stream processing frameworks like Spark and Storm. The really cool thing about Dataflow is that your processing pipeline is defined in plain old Java. You stick the code in a Java main method, you run the jar,
15:00
and whoops, there are like ten instances in the cloud processing your data. It's really, really nice. Here's an example of building a pipeline that counts words; it's seven or eight lines. You programmatically construct the pipeline, you run it, and it gets decomposed and distributed to an arbitrary number of machines in the cloud.
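(The word-count pipeline described here looks roughly like the canonical example from the Dataflow SDK 1.x documentation; the bucket paths below are placeholders.)

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class WordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/kinglear.txt"))
     // Split each line into words.
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) { c.output(word); }
         }
       }
     }))
     // Count occurrences of each word.
     .apply(Count.<String>perElement())
     // Format the counts and write them out.
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/wordcounts"));

    // With --runner=DataflowPipelineRunner this fans out to workers in the cloud.
    p.run();
  }
}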
15:20
This is probably the most awesome thing. The storage part: it's really important that it's durable, that we don't lose any data. That's why we don't want to operate our own Cassandra cluster, for instance.
15:42
It needs to have a query language. There's this thing called BigQuery; it's made for big queries, hence the name. It's what Google uses for log data analysis, and it's really powerful. I encourage you to check out the example datasets.
16:03
There's a sample dataset from Wikipedia, which has all the changes in Wikipedia up to 2010, I think. And doing a count of all the pages, with filtering and stuff, took like 30 seconds, and that's about ten gigs of data.
16:22
So, this is really fast. It's also really, really cheap; you pay per query. And it has no upper limit; no visible upper limit, as Google puts it. You should never be able to hit the limit
16:40
on how much data you can stuff into it. It has a SQL query language and a good UI for exploring data. It's also append-only, and that resonates well with our mindset of immutable data: we don't delete data, we just append it. That makes it a really good fit for storing the sensor data.
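(A hedged sketch of running that kind of query from Java with the generated BigQuery API client; the project id and the client construction, credentials and transport, are placeholders. [publicdata:samples.wikipedia] is the public sample dataset referred to here.)

import java.io.IOException;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;

public class WikipediaCount {
  // Assumes a Bigquery client already built with valid credentials.
  public static void run(Bigquery bigquery) throws IOException {
    QueryRequest request = new QueryRequest().setQuery(
        "SELECT COUNT(DISTINCT title) FROM [publicdata:samples.wikipedia]");
    QueryResponse response =
        bigquery.jobs().query("my-project-id", request).execute();
    // One row, one field: the number of distinct page titles.
    System.out.println(response.getRows());
  }
}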
17:03
Over to the slightly more complicated stuff: building the actual segments and shipping the data to the ad platforms. Most of our integrations require a regular data dump to an S3 bucket or a Google Cloud Storage bucket.
17:22
We have most of the data in BigQuery, and we need some way of processing that data. We're already familiar with Dataflow, right? It can do batch processing, so that's what we use here. We need some kind of orchestration layer
17:40
to figure out the intervals the different segments are going to be shipped at. And the answer is always: when you have a problem, you just throw Docker containers at it. That's where this thing called Kubernetes comes in. It's a container orchestration tool from Google.
18:05
I did say that Dataflow was the most awesome thing, but I really think Kubernetes is the most awesome thing. If you haven't checked it out, please do. In Kubernetes, we can deploy all kinds of apps. It's like an on-premises, self-hosted Heroku.
18:21
So, we have an app whose single responsibility is to trigger Dataflow jobs at regular intervals. Dataflow then reads from BigQuery and passes the data back to the app in Kubernetes, which then uploads it to Google Cloud Storage or S3.
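(A simplified sketch of such a batch delivery job, again with the Dataflow SDK; here the job writes the segment straight to a bucket, whereas in the setup described the Kubernetes app does the final upload. The table, query and bucket names are hypothetical.)

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class SegmentDelivery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Pull one segment out of BigQuery...
    p.apply(BigQueryIO.Read.fromQuery(
         "SELECT device_id FROM [my-project:prox.events] WHERE segment = 'cinema'"))
     // ...flatten each row to the identifier the ad platform expects...
     .apply(ParDo.of(new DoFn<TableRow, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output((String) c.element().get("device_id"));
       }
     }))
     // ...and dump it as files for the ad platform to pick up.
     .apply(TextIO.Write.to("gs://my-segments-bucket/cinema"));

    p.run();
  }
}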
18:48
So, the complete architecture would be something like this: data comes into App Engine. We also use a product called Datastore, the NoSQL database from Google,
19:02
to store metadata and stuff. We push the data into Pub/Sub, which is something I haven't mentioned yet: a globally available, distributed queuing system, a persistent queue. I've heard quotes from Google employees saying that if Pub/Sub goes down, the internet goes down.
19:21
So, that's something they really rely on internally at Google. It's really good, really fast, and handles amazing amounts of data. Dataflow then reads from the Pub/Sub queue and stores the data in BigQuery. And then, when we come to the delivery of the data,
19:40
we have the app on Container Engine that, at regular intervals, creates new Dataflow jobs that upload the data to Google Cloud Storage. And then I screwed up my slides. So, here we have the ingestion part.
20:06
And here we have the processing part. And here we have the data delivery part. So, it's kind of three different components or slices in the architecture, but they all play really well together.
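(Pulling the ingestion slice together, a hedged sketch of the streaming leg: Dataflow reading the Pub/Sub queue and streaming rows into BigQuery, using Dataflow SDK 1.x APIs. The topic, table and schema are placeholders; the actual normalization logic is not shown in the talk.)

import java.util.Arrays;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class IngestPipeline {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true); // long-running job, reads the queue forever

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("payload").setType("STRING")));

    Pipeline p = Pipeline.create(options);
    p.apply(PubsubIO.Read.topic("projects/my-project/topics/ingest"))
     // Normalize the raw event; here it's just wrapped into one column.
     .apply(ParDo.of(new DoFn<String, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(new TableRow().set("payload", c.element()));
       }
     }))
     .apply(BigQueryIO.Write.to("my-project:prox.events").withSchema(schema));
    p.run();
  }
}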
20:24
If you guys have any questions, I'll be really glad to answer. If not, I'm going to talk a bit about how we do deployments and how we orchestrate and how we do development on this. Sounds good? So, we have kind of a soft build pipeline.
20:44
I've been working in many places where they have a hard pipeline where you promote the builds up through the different environments. We don't do that, but we do something similar. So, we are using a technique called GitHub flow,
21:03
where all new features go into feature branches. We deploy the feature branches when they're ready and merge them back into master once we have confirmed that they don't break anything. So master is always the stable branch. We can always deploy master to an environment, but you deploy the feature branches.
21:22
First, you check if it works; if it doesn't, you roll back to master. If it works, you merge the feature branch into master. The way this works: we have a lot of apps running in Kubernetes,
21:42
so we do Docker builds. First, the evil programmer makes some bugs, or code, and pushes it to GitHub. GitHub notifies CircleCI, our build tool, which runs the tests, builds the Docker container, and pushes it up to our Docker registry.
22:06
And we do all our deployments from Slack. We do what you'd call ChatOps: we control our deployment system and most of our operations from Slack.
22:21
That's really one of the decisions we're most happy with, because it gives us total visibility into what everyone in the company does. If someone is deploying something, we will know, because we can read it in the operations channel.
22:40
So, when I go into the ops channel on Slack and write "deploy some-api to production", it gets picked up by this little guy called Hubot, which is a chatbot written in CoffeeScript, also out of GitHub.
23:00
Hubot notifies something called the Deployments API on GitHub, which is metadata storage for deployment information. So Hubot creates new deployments in the GitHub Deployments API, and GitHub then sends a callback to an app that we run
23:22
called Heaven that manages the deployments. It's open source, made by a guy at GitHub. Its sole responsibility is to receive the POST callback hooks from GitHub and orchestrate the actual deployment.
23:41
So, this guy Heaven orchestrates Kubernetes, pulls in Docker containers, runs Dataflow jobs, and deploys to App Engine. Everything we do, we do through this app called Heaven. We also have some monitoring set up, of course.
24:05
We have some apps written with a framework called Dropwizard. Dropwizard publishes huge amounts of metrics, and we push them to a service called Datadog, which is really good for visualizing metrics and monitoring.
24:23
So, we get a lot of metrics into Datadog, and we can set alerts on the different types of metrics. If the response time degrades, we can trigger an alert that goes to PagerDuty; someone gets woken up in the middle of the night, and we also post it to the Slack channel.
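(For a sense of what those metrics look like in code, a minimal sketch with the Dropwizard Metrics library: a timer around request handling, which a Datadog reporter can then ship off. The registry and metric names are hypothetical.)

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class Metrics {
  private static final MetricRegistry REGISTRY = new MetricRegistry();
  private static final Timer RESPONSES = REGISTRY.timer("ingest.responses");

  public static void handleRequest(Runnable handler) {
    final Timer.Context ctx = RESPONSES.time();
    try {
      handler.run(); // do the actual work
    } finally {
      ctx.stop(); // records the response time; alert if it climbs
    }
  }
}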
24:43
And we have an extra little bonus slide before we go to the summary. So, let's go through our requirements and see if we managed to fulfill them. Elastic scalability?
25:02
Check, with Google App Engine: you can throw anything at it. It's also highly available; we don't need to think about hosting it, because Google hosts it for us. If App Engine goes down, lots of people have problems.
25:20
So, that's safety for us: safety in numbers. Durable storage we have covered with BigQuery. Zero-downtime deploys are managed through App Engine, which has good functionality for routing traffic between versions, and Kubernetes also has support for rolling updates.
25:43
So, you can do zero-downtime deploys. We also managed to make this fault tolerant, because we have separated the processing from the ingestion and the storage. Everything is separated by the queue. So if the processing layer goes down,
26:04
the queue will just keep filling up, and when the processing layer is up and running again, it will just grab stuff from the queue. The developer experience: reasonable APIs. That's a bit hard to argue objectively, but I would say that you have pretty reasonable APIs
26:24
in the Google Cloud Platform. Lots of command-line tools, and you can integrate directly through REST. Everything fits together perfectly. We don't do any operations at all; we don't have a single bare-metal server running.
26:42
Everything is hosted services. Our Kubernetes cluster is hosted by Google; we can just randomly kill the machines in the cluster and they will magically reappear. Dataflow is managed, BigQuery is managed, App Engine is managed.
27:02
We also use as much software-as-a-service as possible. We still don't run any Cassandra clusters; we just pay our way out of the problem. Infrastructure is scriptable: we don't do any manual creation of resources. Everything is scripted and runs automatically.
27:23
I think it's fair to say that we haven't done a single customization on the platform we are running on, except for changing machine types. And our pipeline gives us a fast development cycle.
27:42
So, we can deploy to production at a high rate. Some of the advantages of GCP, and I'll just say I'm not paid by or affiliated with Google: back to what we talked about, the defaults are really sensible.
28:02
The different components fit together in a more intuitive way than at other cloud providers. The interfaces are top-notch and well-crafted, and the hosted services are just top of the line, really. And if you need to drop down to bare metal,
28:22
the pricing is really fair. You can have custom machine types; you can have a machine with one processor and 600 gigs of RAM. There's also a great community and professional support if and when you run into problems.
28:47
So, that's about it for my presentation. As I said, I have a fair amount of time left, so questions are very welcome. A silent crowd!
29:07
I'll skip to my last slide. If you're interested in doing this, we're hiring. We're hiring 20 people. Or 18, because we have 17 we're hiring.
29:21
So, check out the 20.jobs if you're interested, especially if you have a background in data science; we're really interested in getting in touch. So, I guess that's it. Thanks.