Analysis Of Realtime Stream Data With Anvil
Formal Metadata
Number of Parts | 95
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/15509 (DOI)
Production Place | Nottingham
FOSS4G Nottingham 2013 | 9 / 95
Transcript: English (auto-generated)
00:00
Hi, guys. So as you said, my name is Chris. I work for Esri, which is a bad word at this conference. But I used to work for a company called GeoIQ. And a year ago, we were acquired by Esri. But we still get to do all the same things we used to do. It's just that our bosses are different.
00:20
So I'm obviously not Andrew Turner. Does anyone know who Andrew Turner is or has ever seen Andrew Turner talk? Only Tim up there. Anybody else? All right, so a few of you know who Andrew is. And I've worked closely with Andrew for years. And so unfortunately, he couldn't make it because he's traveling constantly. And this just didn't work out. So he asked me to give this talk.
00:41
And this is talk one of three that I have today. So it's sort of like the beginning of my marathon sprint. For the next three sessions, I'll be talking. But at the same time, rule number one for talks is never go long. I hate when talks go long. Rule number two is always like never apologize before you
01:01
give a talk. And I'm not going to apologize, but I did just get these slides today or yesterday from Andrew. And he added copious amounts of notes. And that's why he was delayed in getting it to me. But I can't figure out how to actually pull up the notes so I don't have them here. So it's going to be me kind of bumbling through this one a little bit in terms of like kind of reacting to the
01:20
slides you see and then thinking about what Andrew would say. So and then also, Andrew is super notorious for speaking really, really fast because he's got this big brain. And he just thinks really quickly and can move through things. And I've seen him give half hour long talks in five minutes. Just really, really fast. But he's a great guy.
01:42
And also then, as you see down here in the bottom corner, we're recycling this talk from the North American FOSS4G. He gave this there. It's a different version than kind of what's in the program. But I think instead of canceling this talk, because Andrew couldn't make it, the things that he has in
02:02
this slide deck are worth talking about, especially at this conference. And so I think I was willing to step up and give the talk. And the things we have going on are super worth it. So with that, I'll start. So basically, Andrew starts with a discussion about buzzwords, right?
02:21
And the cloud and what that means. And really, it just means that we've now got this ability to handle a lot of distributed components and things going out to the cloud. But we're constantly putting things out in the cloud because storage now is super cheap. And we can process it on demand.
02:40
But it also means that we're constantly throwing data out to the cloud and kind of putting it there for later, then analyzing it a little bit later, right? And so we're trying to build analyses that sort of react to what our data are telling us after the fact. But big data is big.
03:00
Oh, this is interesting. Big data is big, as you can see. It's billions of features, right? And I'm a JavaScript developer. I work on the client side. I deal with visualization and things like that. But big to me is a lot smaller, right? But big to the big data folks and the buzzwords are billions of features. It's the three V's. It's variety, velocity, and volume, right?
03:22
We have tons and tons of features coming at us. If you look at Twitter, it's also very, very fast, right? And for big events, sadly not this event, but big, big events on Twitter coming in at rates of 10,000 per second. 10,000 features and events per second, which is insane if you try to think about how we adapt to handling that
03:41
sort of volume of data. But typically, our analyses and things, our approach to processing data, is always kind of go to the data. We put the data somewhere. We go to it, and we crawl through it. But I think we want to stop moving the data, necessarily, and instead move the algorithms to it, if that makes sense.
04:02
So our algorithms to touch data and to feel it and mess around and dissect it and pull it apart are typically like, bring the data to it, and it processes. We have a web processing server that's sitting there waiting to be injected with data, buffer out this point, and then return it. Instead, what we're talking about here in the cloud is we can start bringing our analysis to that data, right?
04:22
So we reverse our sort of paradigm there. And so at Esri, we're doing this thing. As soon as we started, we really started making a strong push for open source code. One, because it's developer happiness. It's what we want to be working on. We don't want to be working on proprietary data sets as a group, right?
04:40
I mean, I don't think anyone at Esri is always like, I wrote this code, I have to sell it. But there's things going on at Esri that we want to share with the community as well. And this is not my work, so I don't know in depth what GIS Tools for Hadoop really goes into and how it works. But I know that it's awesome in terms of a big project
05:01
that was written within the last year and open sourced at Esri with the intention of never selling this product, giving it to communities and making us think about how Esri starts building out software that is very community driven. So it's cool because it's really the first thing that
05:20
is entirely born in the open source space at Esri and really kicks ass. And so it's a whole stack of tools for processing big data within Hadoop. And so the idea is that we dump a bunch of data in Hadoop and this whole package of the GIS tools consists of
05:41
three different things, the geoprocessing tools, the spatial framework for managing that data within Hadoop, which is really accessing it just via Hive, right? So it's like SQL on top of Hadoop. And then also Hadoop itself and the addition of the
06:01
Esri geometry API. And that, the very bottom piece here, that's the awesome work that really this talk is the only one that will be talking about at FOSS4G. The Esri geometry library or engine is the most open source Java geometry engine available.
06:22
JTS, does anyone know what license JTS is? Right, JTS is the Java Topology Suite. It's good old LGPL, right? It's really, really close to GPL, which is the most toxic of licenses. But LGPL is not that bad, but this is Apache, right? So Apache is way more open than LGPL,
06:42
which is JTS, which is like the mother of all Java engines, right? I mean, PostGIS, everything takes its roots from JTS. This was released last year and it was a huge deal. I'm not a Java developer at all. I can't stand it, I won't touch it. But this is a really big deal in that JTS now has an actual competitor, right?
07:00
Before in the open space, there was nothing. You have JTS, you have GEOS, right? PostGIS, all that lineage right there stems from JTS. Esri had this in the works and we had to open source it just to make the Hadoop tools like this full open source stack, which is really cool. There's this huge shift within Esri that like we're willing to take something
07:21
that was written actually before the open source sort of movement in Esri started, and they're willing to sort of free it up and move it. So within the geometry engine, there's what you would expect, right? There's support for simple features, OGC simple features specification. There's topological operations
07:40
like cutting and difference and intersection. It's a full Java API for doing all this. Relational operations, right? We know what these are. These are sort of what we've been taught and spend our day with every day. Something that's really cool about our import export is, well, I'll say two things.
08:01
Something that's really not cool is Esri has this really funny way of assuming everybody wants to conform to their formats and specifications like shapefiles and things like that. But also in the JSON world, we have this thing called like Esri JSON, which was totally new to me when I started working at Esri, in that I just assumed everything was GeoJSON. How could you actually do anything
08:21
without like just conforming to that? But alas, Esri has their own form of a geospatial JSON format. And I think they call it like the REST JSON or something, but really we like GeoJSON. And this line here with "from GeoJSON, to GeoJSON"
08:41
is an insertion that Andrew made in the slide deck this week, in that since he gave this talk at the North American conference, as a result of this talk, a guy named Scooter Wadsworth, who now is working on GeoGit at Boundless, actually pulled down the code and added to and from GeoJSON and made a pull request back,
09:00
which is sweet, right? You see that sort of collaboration and that's what it's all about. I mean, just that alone allowed us to kind of go to the bosses and say, see? This is what's awesome: the community will do this with us. So other operations, right? What you might expect, things that you would see inside PostGIS and QGIS and things like that,
09:22
things that exist in GIS. So another cool part of this is Hive Spatial. Anyone know what Hive is? It's really like just an SQL front end on top of HDFS, which is the distributed file system for Hadoop. Hive is awesome if you're working with big data. So it looks just like this, just like what you would expect,
09:42
where we're running a contains on a point, really just a simple point-in-polygon aggregation right there within Hive. And really what's cool about this is this is a single point of entry for then a massive store of data, if you think about distributed file systems. So that's cool.
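The Hive query shown on the slide is not captured in this transcript. As a rough illustration of what a point-in-polygon aggregation computes, here is a minimal sketch in plain Python. It does not use Hive or the Esri geometry API; the function names, the two square polygons, and the sample points are all made up for the example.

```python
from collections import Counter

def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring.

    `ring` is a list of (x, y) vertices; the last vertex is implicitly
    joined back to the first. Points exactly on an edge are not handled
    specially in this sketch.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_points_per_polygon(points, polygons):
    """Count how many points fall inside each named polygon."""
    counts = Counter()
    for x, y in points:
        for name, ring in polygons.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
                break  # assume polygons do not overlap
    return counts

# Hypothetical data: two adjacent squares and four points
polygons = {
    "west": [(0, 0), (5, 0), (5, 5), (0, 5)],
    "east": [(5, 0), (10, 0), (10, 5), (5, 5)],
}
points = [(1, 1), (2, 3), (7, 2), (12, 1)]
counts = count_points_per_polygon(points, polygons)
```

In a distributed setting the same contains test would run next to each block of stored data, with only the per-polygon counts being combined at the end.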
10:00
But so what? I'm not sure what Andrew's talking about. Oh yeah, so there's a, so then, okay, so what, right? So we have these tools, what do we do? And this is the only screenshot of ArcMap that you'll see at this conference. I actually don't even know how to work ArcMap, but the cool thing is that we have all this data.
10:22
And I think these next few slides are around hits from ArcGIS Online. So we have these massive analytics engines that look at all requests and log them out. And those logs just generate tons and tons of data. And so we started looking at requests for imagery
10:40
and see what the hot zones are and where do we need to optimize and things like that. So just the ability to sort of pull all these into ArcMap via these tools is the work of this guy, Mansour Raad, who's absolutely amazing with all this big data stuff. Something about the Dutch cadastral mapping here, another data set from our logs,
11:01
and then similar data set requests, but right around the same day as the Russian meteor. And so the Russian meteor came and you just get massive amounts of hits onto, we make a map and then they pull it all down. It's really hard for the analysts to actually visualize and understand what's happening behind the scenes.
11:20
So then another one is we got a bunch of data from a Japanese car company and they wanted us to analyze it. And so it's sort of the same story that I sort of have already talked about, but 4 million vehicles or 40 million vehicles sort of aggregate them up into a grid and then analyze the carpool locations
11:42
and it runs in a few minutes or something like that. Sorry, I'm not really doing that justice. Andrew would be shooting me right now for that. Oh, that's very nice, Andrew. All right, let's see the smoke. Sorry, this is fun. Wow, that's style right there.
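The talk gives no detail on how the vehicle data was gridded. A minimal sketch of the general idea of aggregating point locations into a square grid, assuming a fixed cell size in degrees, could look like this; the positions and the function names are invented for illustration.

```python
import math
from collections import Counter

def grid_cell(lon, lat, cell_size):
    """Return the (column, row) index of the square cell containing the point."""
    return (math.floor(lon / cell_size), math.floor(lat / cell_size))

def aggregate_to_grid(positions, cell_size):
    """Count how many positions fall in each grid cell."""
    return Counter(grid_cell(lon, lat, cell_size) for lon, lat in positions)

# Made-up vehicle positions (longitude, latitude), for illustration only
positions = [(139.70, 35.66), (139.74, 35.68), (139.76, 35.61), (135.50, 34.69)]
cells = aggregate_to_grid(positions, cell_size=0.5)
busiest_cell, busiest_count = cells.most_common(1)[0]
```

Because each position maps to its cell independently, this kind of count splits naturally across many workers, which is what makes it a good fit for Hadoop.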
12:01
So another project that we, so this is shifting gears a little bit, right? And I think I started to allude to this in the very beginning about sort of bringing algorithms to data and data to algorithms and stuff like that. And so Esri started a project called Anvil. And Anvil is a geospatial implementation
12:23
on top of Twitter's Storm product. And Storm is super, super kick-ass in that it's massively scalable, it's massively distributed, and it can handle amazing amounts of just streams of data. So it's what one might refer to as a complex event processor.
12:40
So it just takes a stream of data, which are these taps over here, and it runs them through what they call bolts. And these bolts can then do small manipulation on like a tuple of data or a triple of data, right? So we start opening up these taps, and what we can do then is go take that open source Java geometry library,
13:00
inject it into these bolts, and start doing geoenrichment and geospatial aggregation against things like shared indices, and start building out this massive streaming network of real-time processing. This is super awesome. We use it for a project called,
13:21
oh, I didn't realize this was a video, nice job. He's a pretty good presenter. So we use it for a side project we have with some of the spook agencies in the US. And what it does is essentially enables us to allow streaming on the client, right? So we go to Twitter,
13:41
we have these feeds coming from Gnip, which is a provider of the firehose of Twitter that we can sit there and just tap into, right? And so as soon as we stand up a topology, is what they call them, this is not an actual spatial topology, but a topology of bolts and filters that we operate on within Storm. We stand one of those up for a Twitter search,
14:00
and it goes in, taps into our Anvil instance, boots up that topology, and starts streaming data down. Here, there's no real spatial information being added to it. But what we have is this other video I wanted to show. Sorry, wait, I'm out of presenter mode.
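Storm's own name for a stream source is a spout (the talk calls them taps), and the processing steps are bolts. The shape of such a topology can be sketched with plain Python generators. This is not the Storm API and not Anvil's code: the bounding boxes stand in for a real geometry engine, and every name and coordinate below is a hypothetical chosen for the example.

```python
from collections import Counter

# Hypothetical regions as bounding boxes: (min_lon, min_lat, max_lon, max_lat)
REGIONS = {
    "manhattan": (-74.03, 40.70, -73.91, 40.88),
    "brooklyn": (-74.05, 40.57, -73.85, 40.70),
}

def tap(events):
    """Source of the stream: yields one tuple (here a dict) at a time."""
    for event in events:
        yield event

def enrich_bolt(stream):
    """Geoenrichment step: tag each tuple with the first region containing it."""
    for event in stream:
        region = None
        for name, (min_lon, min_lat, max_lon, max_lat) in REGIONS.items():
            if min_lon <= event["lon"] < max_lon and min_lat <= event["lat"] < max_lat:
                region = name
                break
        yield {**event, "region": region}

def filter_bolt(stream):
    """Drop tuples that fell outside every known region."""
    for event in stream:
        if event["region"] is not None:
            yield event

def count_bolt(stream):
    """Aggregation step: emit the running count per region after each tuple."""
    counts = Counter()
    for event in stream:
        counts[event["region"]] += 1
        yield dict(counts)

# Made-up geotagged messages standing in for a live feed
events = [
    {"text": "power is out", "lon": -73.98, "lat": 40.75},
    {"text": "flooding here", "lon": -73.95, "lat": 40.65},
    {"text": "far away", "lon": -0.12, "lat": 51.50},
    {"text": "still dark", "lon": -73.99, "lat": 40.73},
]
snapshots = list(count_bolt(filter_bolt(enrich_bolt(tap(events)))))
```

Each stage only ever sees one tuple at a time, which is what lets a real topology spread the stages across many machines.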
14:25
So I'll boot this up in full screen. So the idea is real-time social media analysis, right? And I think we've kind of, I mean, personally been burnt out on mapping tweets.
14:44
It's really this boring thing that ultimately just shows you where population is. But basically, we had this idea that we were tracking Twitter, and a lot of what we do is really just consume Twitter during these major events. And so this major event is Sandy, a hurricane that came into New York.
15:01
And what you saw right there is we enter in Twitter, we enter in our parameters, and it starts just plotting dots. It's like, but that's not really that informative. And so what we want to do is a client-side aggregation of that data. So then we started saying, well, what if we could just count what's happening on the client, on the fly, as points stream in, start aggregating data, that would be pretty sweet.
15:24
And so we looked at this and said, well, that's awesome. Like, what else can we do? And so just counts don't really provide that much. What we wanted to do is start doing some more analytical processes on that data. And so we had this beautiful,
15:40
this wonderful guy named Sean Gorman, who's this PhD and can say things like location quotient and location quotient normalization and these really funky complex algorithms to look for certain events and anomalous activity within that stream of data. And so we started working on that and started applying those functions and algorithms
16:01
to that stream of data. And what we can see is we start seeing areas of anomalous activity, as soon as like the major event sort of occurs, we get this big bubble up and things are happening. The first one we start to see is down here, it alerts us to, hey, something's going on, we just passed this threshold.
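The exact algorithm used in the demo is not described in the talk. The textbook location quotient compares a topic's share of activity in one area with its share across all areas, and a threshold on that ratio gives a simple alert. The sketch below assumes that definition; the region names, counts, and threshold are invented for illustration.

```python
def location_quotient(local_topic, local_total, global_topic, global_total):
    """Topic's share of local activity divided by its share of all activity."""
    if local_total == 0 or global_topic == 0 or global_total == 0:
        return 0.0
    local_share = local_topic / local_total
    global_share = global_topic / global_total
    return local_share / global_share

def anomalous_regions(topic_counts, total_counts, threshold):
    """Return {region: LQ} for regions whose quotient exceeds the threshold."""
    global_topic = sum(topic_counts.values())
    global_total = sum(total_counts.values())
    flagged = {}
    for region, total in total_counts.items():
        lq = location_quotient(
            topic_counts.get(region, 0), total, global_topic, global_total
        )
        if lq > threshold:
            flagged[region] = lq
    return flagged

# Made-up counts: all messages per region, and messages matching a topic
total_counts = {"lower_manhattan": 200, "midtown": 400, "uptown": 400}
topic_counts = {"lower_manhattan": 60, "midtown": 20, "uptown": 20}
flagged = anomalous_regions(topic_counts, total_counts, threshold=2.0)
```

Dividing by the overall share is what keeps the result from simply mirroring where the most people are, which is the complaint about plain tweet maps earlier in the talk.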
16:20
And what happened was the explosion of the Con Edison plant in lower Manhattan happened. This is the hurricane rolling through. All of Manhattan at this point is out of power. And what we see is a migration of tweets and information above, what is it, 104th or something, where Grand Central Station is, and all the people basically moved up to Grand Central Station, where they had power.
16:42
And so it's this really interesting insight to that one event and the whole stack of, granted it's a canned response here where we're basically just canning and recycling that data feed. But it tells a nice story about the things that we can enable with this sort of power and algorithms.
17:00
Let me go back in here and see. I think that is pretty much the end. Oh, he's got, I think, yeah, the same demo in video form. I should have looked at that in the presentation. So yeah, all of this is available on GitHub. It's awesome to see Esri doing what they're doing, in terms of we've had top-down directorship
17:22
saying things that should be open, or things that can be opened should all be open. And so we're just pouring, pouring everything into GitHub. And it's been super awesome. So, thanks. Any questions?
17:42
Concerns? Thanks. So, yeah. So I was gonna start my question from the industry and the processing part. Are you eventually moving to the point of service so that you'll take control over the safety of store, or do you need a fairly big processor
18:01
to set them up, right? Definitely. I mean, we run on AWS, EC2 instances, starting to look at more scalability within using things like Docker IO. We have internal and external sort of app clusters. That's kind of where we're going at, but I'm not a DevOps guy,
18:21
so I'm not really the best informed to be talking about that, or with any intelligence. Cool. And I'm talking twice more, not about this stuff, but about JavaScript and stuff like that. So, that's it.