Big data analysis with Tile Reduce and Turf.js
Formal Metadata
Title 
Big data analysis with Tile Reduce and Turf.js

Title of Series  
Author 

License 
CC Attribution  NonCommercial  ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and noncommercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license. 
Identifiers 

Publisher 

Release Date 
2015

Language 
English

Producer 

Production Year 
2015

Production Place 
Seoul, South Korea

Content Metadata
Subject Area  
Abstract 
Tile Reduce is a new open source MapReduce framework for analyzing massive geodata. Tile Reduce is a tile analysis framework built on the JavaScript GIS library Turf.js. It runs on your local computer or in the AWS cloud and scales to run thousands of processors in parallel. At Mapbox we use Tile Reduce to detect issues in global street vector data like OpenStreetMap, and for data comparison and data conflation. This talk will walk through the architecture of Tile Reduce and highlight advantages, limitations and future developments.

00:03
Good morning, everybody.
00:06
I have to first offer my apologies on behalf of my colleague Alex; he wasn't able to make the trip, ultimately. There are two tools that I'm hoping to talk to you about today that Mapbox has been investing in for large-scale geospatial analysis, and which I think could be useful to your own workflows. They are
00:24
Turf and Tile Reduce, and they complement one another. Turf is a modular library for geospatial analysis, and Tile Reduce is a framework for performing geospatial analysis, using Turf or otherwise, at very large scale. So I'm going to dive into Turf first, talking through what it can do, and then do applied examples with Tile Reduce to show how you can actually put all this stuff together. So, Turf. First
00:51
of all, I should say it is not exclusively a Mapbox project. You can find it at turfjs.org. It is an open source project that predates Mapbox's involvement in it, but we have been investing very heavily in the project. And Turf,
01:04
as you might imagine, is designed to manipulate map data. In this way it's quite similar to existing GIS technologies that you might already be using. I imagine many of you have workflows that involve tools like PostGIS or ArcGIS or QGIS. Turf replicates many of the functions of those packages, but it offers two substantial advantages over whatever workflows you might have. First, it
01:28
speaks GeoJSON natively, both for input and output. This is a default assumption for everything that Turf does. We think that GeoJSON is increasingly the lingua franca for open geodata, and Turf fully embraces this assumption. GeoJSON, I believe, is going to be taken up by a working group as an official web standard very shortly, so this is a good bet to make. Turf also provides a number of helper functions to transform data into the kind of GeoJSON data structures it expects, coming in and out.
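As a rough illustration of what GeoJSON input and output looks like in practice, here is a minimal sketch in plain JavaScript that turns rows of coordinates into a GeoJSON FeatureCollection. The helper names `toPointFeature` and `toFeatureCollection` and the sample rows are hypothetical; Turf ships its own helpers for this, which are not reproduced here.

```javascript
// Hypothetical helper: wrap a longitude/latitude pair as a GeoJSON Point Feature.
// GeoJSON coordinates are ordered [longitude, latitude].
function toPointFeature(lon, lat, properties) {
  return {
    type: 'Feature',
    properties: properties || {},
    geometry: { type: 'Point', coordinates: [lon, lat] }
  };
}

// Hypothetical helper: group features into a FeatureCollection.
function toFeatureCollection(features) {
  return { type: 'FeatureCollection', features: features };
}

// Made-up sample rows, standing in for any tabular source with coordinates.
var rows = [
  { lon: -122.4194, lat: 37.7749, kind: 'graffiti' },
  { lon: -122.4313, lat: 37.7739, kind: 'trash pickup' }
];

var collection = toFeatureCollection(rows.map(function (row) {
  return toPointFeature(row.lon, row.lat, { kind: row.kind });
}));

// The result is plain JSON, so it can be serialized and handed to any
// GeoJSON-aware tool.
console.log(JSON.stringify(collection, null, 2));
```

Because the structure is ordinary JSON, no special serializer is needed on the way in or out.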
02:04
The benefit of this is that the results of your analysis using Turf can be displayed absolutely everywhere: not only in technologies that have Mapbox in their name, but in proprietary solutions like ArcGIS or open third-party projects like QGIS. The other major advantage that Turf brings
02:20
to the table is that it is written in modern JavaScript. There are existing geospatial analysis suites written in JavaScript, but they are often ports from technologies that were not written with modern JavaScript development in mind. Turf is happy to play with technologies like Browserify, with Node.js, with whatever you might be using. And this, I say with some lament in my voice, is really the future. I'm most comfortable writing Python, as I imagine some people in this room are, but JavaScript offers substantial advantages, both in terms of eking out every last little bit of performance from your processor and in terms of portability. That second point is the one that I want to emphasize right now. JavaScript, thanks to
03:05
the V8 engine, runs absolutely everywhere. Obviously it can perform large-scale computation on the server side in the cloud, but of course it can run in browsers, and it can run comfortably on your laptop; we'll be doing a demo of that later. At this point, the JavaScript interpreters on your mobile phone are also quite adept and capable of crunching real numbers for geo applications. In fact, Turf is running right
03:31
now in the slides. This should be much, much
03:34
larger, and I apologize for it not being so. This is an active slippy map showing a Turf analysis of 311 calls in San Francisco over the past week. If you don't have 311 in your city, it's a service that a number of cities around the world are adopting where you can report the need for a city service like trash pickup or removal of graffiti, things like that. There's an open protocol called Open311 that a lot of cities put these requests into and publish. It's nice example data. Here are the source points; Turf is binning these in real time into the neighborhoods of San Francisco, which is what that geometry is, and then adjusting the style. And I should say, apologies for the color of the slides; I'm not sure what's going on with the display.
04:19
So this is a project built around flexibility, not only in terms of where you can run these kinds of analyses, but in how the project is structured and how it's administered. It is an open source project. Turf is completely modular: you can use as much or as little of the library as you'd like, without including a gigantic JS file.
04:42
A few of its more specific features; these will be pretty familiar to people who do this kind of work. Buffering, of course:
04:50
this is how you would invoke a buffer call with Turf. This is contracting or extending the extent of a spatial object, a point, a line or a polygon, by a set amount. We try to make it easy, with calls that take units that are human readable.
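To make the idea of buffering by a human-readable distance concrete, here is a minimal sketch that buffers a single point into an approximate circle polygon. It is plain JavaScript and not Turf's implementation; the function names, the 32-step resolution, and the spherical-Earth radii are assumptions chosen for illustration.

```javascript
// Mean Earth radius per unit; a spherical approximation (assumption).
var EARTH_RADIUS = { kilometers: 6371.0088, miles: 3958.7613 };

function toRad(deg) { return deg * Math.PI / 180; }
function toDeg(rad) { return rad * 180 / Math.PI; }

// Position reached by travelling `distance` from [lon, lat] along `bearing` degrees.
function destination(lonLat, distance, bearing, units) {
  var d = distance / EARTH_RADIUS[units];
  var lat1 = toRad(lonLat[1]);
  var lon1 = toRad(lonLat[0]);
  var b = toRad(bearing);
  var lat2 = Math.asin(Math.sin(lat1) * Math.cos(d) +
    Math.cos(lat1) * Math.sin(d) * Math.cos(b));
  var lon2 = lon1 + Math.atan2(Math.sin(b) * Math.sin(d) * Math.cos(lat1),
    Math.cos(d) - Math.sin(lat1) * Math.sin(lat2));
  return [toDeg(lon2), toDeg(lat2)];
}

// Hypothetical helper: buffer a point into a GeoJSON Polygon Feature.
function bufferPoint(lonLat, radius, units, steps) {
  steps = steps || 32;
  var ring = [];
  for (var i = 0; i < steps; i++) {
    // Decreasing bearings walk the ring counterclockwise, starting due north.
    ring.push(destination(lonLat, radius, -i * 360 / steps, units));
  }
  ring.push(ring[0]); // GeoJSON rings repeat the first position at the end
  return {
    type: 'Feature',
    properties: {},
    geometry: { type: 'Polygon', coordinates: [ring] }
  };
}

var fountain = [-122.4194, 37.7749]; // made-up location
var buffered = bufferPoint(fountain, 0.5, 'kilometers');
console.log(buffered.geometry.coordinates[0].length); // 33 positions: 32 steps plus the closing one
```

Buffering lines and polygons is considerably more involved; this sketch only covers the point case.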
05:08
This is a live demo that, again, is a bit too small to see accurately, but this is a race route in San Francisco for a popular foot race, and a dataset of water fountains within it. You can see I can adjust the buffering very quickly and find the intersections. There's no crunching here; it's just happening in real time in the browser. Smoothing is another option through
05:33
Turf, by taking the Bezier of a line with an adjustable tolerance. It also lets you do
05:38
simplification, using Douglas-Peucker.
05:42
Contours: if you've got a grid of points with z-values, Turf will happily calculate contour lines for you
05:49
by using the isolines method with a defined resolution, as you can see, ranging over
05:54
breaks. Here is a dataset of census population in New York City. Those yellow dots represent the size of the population in an area, and you can see the isolines have been calculated in the browser from that information as I scroll around.
06:14
And finally, aggregation. Turf is just as capable of doing the sort of statistical analysis based on geometry that you might want to perform. You saw some of that in the 311 example already. Here's an example, again using the water fountain dataset that we had. Turf lets us generate an arbitrary hex grid (it should be visible here; yeah, it's really hard to see with this purple, sorry), and I can then intersect these points against that grid and calculate the number of water fountains in each one of the cells instantaneously. Obviously you could also perform operations like taking an average of the number in each grid cell, or the maximum or minimum: all the basic aggregation functions you'd expect from a suite like PostGIS. Of course, that's just the tip of the iceberg.
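The grid aggregation described here can be sketched without any library. The example below is an assumption-laden simplification: it uses square cells in degrees rather than a hex grid, the function names are hypothetical, and the fountain coordinates are made up. It counts points per cell and then derives minimum, maximum and average over the non-empty cells.

```javascript
// Assign each [lon, lat] point to a square cell `cellSize` degrees wide
// and count how many points fall into each cell.
function countByCell(points, cellSize) {
  var counts = {};
  points.forEach(function (p) {
    var col = Math.floor(p[0] / cellSize);
    var row = Math.floor(p[1] / cellSize);
    var key = col + ':' + row;
    counts[key] = (counts[key] || 0) + 1;
  });
  return counts;
}

// Basic aggregation over the per-cell counts (only non-empty cells are present).
function summarize(counts) {
  var values = Object.keys(counts).map(function (k) { return counts[k]; });
  var sum = values.reduce(function (a, b) { return a + b; }, 0);
  return {
    cells: values.length,
    min: Math.min.apply(null, values),
    max: Math.max.apply(null, values),
    average: sum / values.length
  };
}

// Made-up water fountain locations as [lon, lat].
var fountains = [
  [-122.425, 37.775],
  [-122.421, 37.771],
  [-122.455, 37.765],
  [-122.485, 37.805]
];

var counts = countByCell(fountains, 0.01);
var stats = summarize(counts);
console.log(counts, stats);
```

A hex grid changes only how a point is assigned to a cell; the counting and summarizing steps stay the same.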
07:02
This is the current function list as of the composition of the slide deck; to be perfectly honest, I don't know the date of that. But this is expanding rapidly and, as I said, it's an open source project and open to other functionality you might need. So what we
07:19
sometimes say in these things is that Turf is GIS for the web. I think that's actually
07:24
an understatement. Turf is GIS for everything: you can run it on pretty much any platform that you might want to throw at it, and that is its major advantage even if you aren't already a JavaScript developer.
07:34
You can write your analysis once and run it everywhere, without worrying about multiple implementations, multiple code surfaces, multiple areas for bugs to pop up. You just need to spend your time thinking about the problem once, implementing it, and throwing the appropriate amount of machinery at it, whether it's on a mobile phone or a multi-node compute cluster.
07:57
So let me talk a little about the other major advantage of this JavaScript implementation, which is the amount of processing power it can bring to the table. The examples I've shown so far run comfortably in Chrome, which is what I'm showing the slides on, but we can go beyond that.
08:12
Here's an example of the kind of analysis you probably wouldn't want to run in the browser, although you could. This is an outline of the counties of the United States, and this is a fairly high-resolution GeoJSON file of the paths of tornadoes in the United States, maintained by the US Geological Survey: every one they have on record since they started keeping records of it. It's about 50 megs, probably more than you'd want to push over the wire into a browser. But using Turf we can very easily take the median point of each one of those, or where those lines intersect against that GeoJSON,
08:46
and normalize by the total surface area of each polygon, producing an accurate analysis of where tornadoes tend to happen in the United States. This takes about two seconds to run on a laptop like this. It's a decent chunk of data, and that's unoptimized. But we can also move beyond that.
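The county analysis above can be approximated in a few lines of plain JavaScript. This is a sketch under stated assumptions, not the code from the talk: it takes the middle vertex of each track as its representative point, uses a ray-casting point-in-polygon test, and normalizes by planar (shoelace) area in coordinate units rather than true surface area. The toy "counties" and "tracks" are made up.

```javascript
// Ray-casting point-in-polygon test against a single closed ring of [x, y] positions.
function pointInRing(pt, ring) {
  var inside = false;
  for (var i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    var xi = ring[i][0], yi = ring[i][1];
    var xj = ring[j][0], yj = ring[j][1];
    var crosses = (yi > pt[1]) !== (yj > pt[1]) &&
      pt[0] < (xj - xi) * (pt[1] - yi) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Planar shoelace area of a closed ring, in squared coordinate units.
function ringArea(ring) {
  var sum = 0;
  for (var i = 0; i < ring.length - 1; i++) {
    sum += ring[i][0] * ring[i + 1][1] - ring[i + 1][0] * ring[i][1];
  }
  return Math.abs(sum) / 2;
}

// Middle vertex of a track, standing in for its representative point.
function middleVertex(line) {
  return line[Math.floor(line.length / 2)];
}

// Count tracks per polygon and divide by the polygon's area.
function densityByPolygon(polygons, tracks) {
  return polygons.map(function (poly) {
    var count = tracks.filter(function (track) {
      return pointInRing(middleVertex(track), poly.ring);
    }).length;
    return { name: poly.name, count: count, density: count / ringArea(poly.ring) };
  });
}

var counties = [
  { name: 'A', ring: [[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]] },
  { name: 'B', ring: [[2, 0], [3, 0], [3, 1], [2, 1], [2, 0]] }
];
var tracks = [
  [[0.5, 0.5], [1, 1], [1.5, 1.2]],
  [[0.2, 1.5], [0.6, 1.6], [0.9, 1.9]],
  [[2.2, 0.2], [2.5, 0.5], [2.8, 0.6]]
];

var densities = densityByPolygon(counties, tracks);
console.log(densities);
```

County A collects more tracks, but the smaller county B ends up with the higher density, which is the point of normalizing by area.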
09:05
We can go and start really taking advantage of every bit of computing power that's present on a machine like this, or anywhere else.
09:13
This is where Tile Reduce comes in. As you might guess from the name, Tile Reduce is a MapReduce tool. For those of you not familiar with the MapReduce concept, it's a way of thinking about parallelization of computing problems, where a very large problem is mapped into a repeated task which can be distributed across a large number of nodes. That's the map step. Then there's a reduce step, where the outputs of that expensive computation are combined into one answer or set of answers. This is how Google solves a lot of problems, and it's how Hadoop works. It's how Tile Reduce works, except the map phase is tied to individual tiles that are processed, and in this case I'm speaking about vector tiles. I should pause for a moment to explain that as well. I'm sure that a lot of people are already familiar with the vector tile concept. For those who aren't, it's an arrangement of geodata that uses XYZ indexing just like raster tiles, but instead of being a JPEG it is the underlying geometry that would normally be used to render those images, packed in an extremely efficient manner into a binary format. What this means is that you can serve vector tiles to a client and they can draw the tiles themselves, on a handset, in the browser, whatever. And it's great. But it also preserves the source data, so you can run geospatial analyses on a vector tile set, and that's what Tile Reduce is about doing.
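To show how XYZ indexing turns an area of interest into a list of independent map tasks, here is a small sketch using the standard Web Mercator tile formulas. The function names and the bounding box are hypothetical, and positions exactly on the antimeridian or beyond the Mercator latitude limits are not handled.

```javascript
// Convert a longitude/latitude to XYZ tile indices at a zoom level
// (standard Web Mercator "slippy map" tiling).
function pointToTile(lon, lat, zoom) {
  var n = Math.pow(2, zoom);
  var latRad = lat * Math.PI / 180;
  var x = Math.floor((lon + 180) / 360 * n);
  var y = Math.floor(
    (1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2 * n
  );
  return [x, y, zoom];
}

// List every tile touching a bounding box given as [west, south, east, north].
// Tile y grows southward, so the north-west corner has the smallest indices.
function tilesForBbox(bbox, zoom) {
  var nw = pointToTile(bbox[0], bbox[3], zoom);
  var se = pointToTile(bbox[2], bbox[1], zoom);
  var tiles = [];
  for (var x = nw[0]; x <= se[0]; x++) {
    for (var y = nw[1]; y <= se[1]; y++) {
      tiles.push([x, y, zoom]);
    }
  }
  return tiles;
}

// Hypothetical bounding box covering a city-sized area.
var bbox = [-77.12, 38.79, -76.91, 39.0];
var tiles = tilesForBbox(bbox, 15);
console.log(tiles.length + ' tiles to process at zoom 15');
```

Each entry in `tiles` can be processed on its own, which is what makes the map phase easy to spread across cores or machines.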
10:45
Now, where would you get a vector tile dataset, you might ask? There are a bunch of places, and of course you can create your own. As a service, Mapbox regenerates a planet-wide version of OpenStreetMap every night in vector tile format, and you can download it from the site. It's about 30 gigs compressed, 45 gigs uncompressed. That's a lot, but it's doable for any modern laptop. The actual format this will come in is an MBTiles file, which is a SQLite database that lets us pack the tiles together very quickly and space-efficiently. You can also divide the tiles out individually and serve them over the network, and Tile Reduce can read vector tiles across the network, but if you want to do a large-scale geoprocessing job you'll probably want to have it locally, just because otherwise I/O is what's going to take the most time.
11:35
Let's show an example of how Tile Reduce works in practice. The example that I implemented here is based on a personal experience I had a number of years ago, going to a friend's wedding in Atlanta, Georgia. I don't know how many of you have been to Atlanta, Georgia, but they are very proud of their signature crop, which is peaches. I did not appreciate this before going to Atlanta: they name a lot of things after peaches. I went to the hotel that I thought was the right one, on Peachtree Boulevard or Street or whatever it was, and it took me a long enough time, certainly long enough to watch the taxi pull away, to realize that this hotel was much too small and much too full of high school volleyball players to possibly be the one that I wanted for the wedding. So I found myself sprinting across a multi-lane highway at about 2 AM to get to the one that I wanted. This was before Uber, and there were no taxis coming. So I have held a grudge against Atlanta ever since, and I haven't been able to quantify it and really put in statistical terms how horrible Atlanta's naming scheme is. I thought this was a nice opportunity to do so. I'm going to demonstrate this live for you right now; I've started a job running here while I run through what's involved.
Let's take a look at two files. Initially, the map phase: this is the task that gets repeated again and again on a per-tile basis. (I hope this is readable.) This is a Node.js file, and for those of you who haven't written Node, it's not as terrifying as it might look at first. A lot of this we're going to ignore; I just wanted to have a working example for you, and I'll walk through it really quickly. The first thing we do is include the Turf library; that's probably familiar to anyone who's done any programming. The second thing we do is export the function that's going to be doing the work. This is just a way of making sure that Tile Reduce knows where to find the function that's going to be doing the actual operations. The function will always have
the same three parameters. The first one's called tileLayers; that's where the geodata comes in. The second is called options; that's where configuration for the entire job lives. And the third one is the callback: this is the function that we need to call when our work is done. It's a really common thing in Node.js programming, if you haven't seen it before; it allows for parallelized, very fast asynchronous execution.
In the guts of this function we do a couple of things. First we initialize some variables to keep track of our task. I should clarify: our task is going to be looking at every road in the tile, calculating its length, and figuring out whether its name matches one of a series of fruits. If it does, we're going to accordingly increase the total road count and the total road length count per fruit. So we initialize variables to keep track of the number of kilometers and the total count, as well as regular expressions to match these fruits. Then we loop through every feature in the tile layer that we've been served. Some of this stuff, like the OSM data layer and its features, is just specific to this being an OpenStreetMap source; you can pretty much ignore all this. We're going to check a few of the properties to make sure that it's been tagged as a highway, because this will include everything in OpenStreetMap: it will include cafe POIs, it will include building footprints. So there are a few checks to make sure that we're looking at a road, that it contains LineString geometry, and that it has a name, because obviously checking for the name of a fruit if there's no name would be pointless. We calculate the length of the road very easily using turf.lineDistance. We keep track of everything in kilometers, add it to a total count, and then just loop through each one of the fruits, testing against the name and updating the totals. When we're done, we pass the object that we've been using to tally the totals back to the callback. It's pretty simple.
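The map step described above might look roughly like the following sketch. It is self-contained plain JavaScript rather than the script shown in the talk: a haversine helper stands in for the `turf.lineDistance` call, and the layer path `tileLayers.osm.features`, the fruit list, and the sample tile are assumptions made for illustration. In a real job the function would be exported from its own file so the framework can find it.

```javascript
function toRad(deg) { return deg * Math.PI / 180; }

// Great-circle (haversine) length of a LineString in kilometers.
// Stand-in for the Turf length call mentioned in the talk.
function lineLengthKm(coords) {
  var R = 6371.0088; // mean Earth radius in km
  var total = 0;
  for (var i = 1; i < coords.length; i++) {
    var lat1 = toRad(coords[i - 1][1]);
    var lat2 = toRad(coords[i][1]);
    var dLon = toRad(coords[i][0] - coords[i - 1][0]);
    var h = Math.pow(Math.sin((lat2 - lat1) / 2), 2) +
      Math.cos(lat1) * Math.cos(lat2) * Math.pow(Math.sin(dLon / 2), 2);
    total += 2 * R * Math.asin(Math.sqrt(h));
  }
  return total;
}

// Hypothetical fruit patterns to test road names against.
var FRUITS = { peach: /peach/i, cherry: /cherry/i, apple: /apple/i };

// Map function with the three-parameter shape described in the talk:
// tile data in, job options, and a callback to hand back the tally.
function mapTile(tileLayers, options, done) {
  var result = { roads: 0, km: 0, fruits: {} };
  Object.keys(FRUITS).forEach(function (fruit) {
    result.fruits[fruit] = { roads: 0, km: 0 };
  });

  tileLayers.osm.features.forEach(function (feature) {
    var props = feature.properties || {};
    // Keep only named roads with line geometry; skip POIs, buildings, etc.
    if (!props.highway || !props.name) return;
    if (feature.geometry.type !== 'LineString') return;

    var km = lineLengthKm(feature.geometry.coordinates);
    result.roads += 1;
    result.km += km;
    Object.keys(FRUITS).forEach(function (fruit) {
      if (FRUITS[fruit].test(props.name)) {
        result.fruits[fruit].roads += 1;
        result.fruits[fruit].km += km;
      }
    });
  });

  done(null, result);
}

// Made-up tile contents for a quick check.
var sampleTile = { osm: { features: [
  { properties: { highway: 'residential', name: 'Peachtree Street' },
    geometry: { type: 'LineString', coordinates: [[-84.39, 33.75], [-84.39, 33.76]] } },
  { properties: { highway: 'primary', name: 'Main Street' },
    geometry: { type: 'LineString', coordinates: [[-84.40, 33.75], [-84.38, 33.75]] } },
  { properties: { amenity: 'cafe', name: 'Peach Cafe' },
    geometry: { type: 'Point', coordinates: [-84.39, 33.755] } },
  { properties: { highway: 'service' },
    geometry: { type: 'LineString', coordinates: [[-84.39, 33.75], [-84.391, 33.75]] } }
] } };

var tally = null;
mapTile(sampleTile, {}, function (err, result) {
  tally = result;
  console.log(JSON.stringify(result, null, 2));
});
```

The cafe and the unnamed service road are skipped, so only the two named roads contribute to the totals.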
Next we should take a look at the function where the reducing happens. That is in index.js, by convention. This file is arranged similarly. We pull in a couple of libraries: first Tile Reduce, which of course is what we're focused on; second sprintf, which is just a string formatting convenience library. We define options: this tells us where to find the map function that we just walked through. Layers gives us some information about where to find the source data; this is just the location of the MBTiles file here. And zoom: this particular MBTiles file is built at zoom level 15, which is a good zoom level for this kind of analysis. You can run these analyses at whatever zoom level you want, but 15 is the right number for this particular source. I define a couple of bounding boxes, because I want to do a comparison across cities, and then you can see here I'm instantiating the Tile Reduce job, using the Washington DC bounding box first and the options from above.
There are just a few things left to do. I define two events for Tile Reduce to pay attention to. The first is reduce: when one of those Tile Reduce jobs finishes and passes back the totals it's calculated, this is what's going to catch the result and add it to our global totals. The second is the end function: when absolutely everything is done, the end event will fire and this code will run. This is much more complicated than it needs to be, but I wanted to get some emoji up on screen for you guys, so I went ahead and did that. And the last thing we do is invoke run.
You can see a while ago we produced the results for Washington DC. It looks like we got 23 roads named cherry something, so that would be Cherry Hill or Cherry Dale or Cherry Lane, whatever. This took about 41 seconds to run on this machine. Let me adjust this really quick to Atlanta and run this again. One thing I want to point out right away is the asynchronous and parallelized nature of this. This is a CPU activity display.
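The reduce side described above, with a `reduce` event per finished tile and an `end` event once everything is done, can be sketched with Node's built-in `EventEmitter`. To be clear, `createJob` below is a hypothetical stand-in and not the Tile Reduce library; it runs the map function in-process on made-up tiles, whereas the real framework reads tiles from the configured source and spreads the work across cores.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical stand-in for a tile job runner: calls the map function once
// per tile, emits 'reduce' for each result and 'end' when all are finished.
function createJob(tiles, mapFn) {
  var job = new EventEmitter();
  job.run = function () {
    var pending = tiles.length;
    if (pending === 0) { job.emit('end'); return; }
    tiles.forEach(function (tile) {
      mapFn(tile, {}, function (err, result) {
        if (err) {
          job.emit('error', err);
        } else {
          job.emit('reduce', result, tile);
        }
        pending -= 1;
        if (pending === 0) job.emit('end');
      });
    });
  };
  return job;
}

// Tiny map function: count roads, and roads with "peach" in the name.
function countPeaches(tile, options, done) {
  done(null, {
    roads: tile.roadNames.length,
    peach: tile.roadNames.filter(function (n) { return /peach/i.test(n); }).length
  });
}

// Made-up per-tile inputs.
var tiles = [
  { roadNames: ['Peachtree Street', 'Main Street'] },
  { roadNames: ['Peachtree Lane', 'Peach Court', 'Oak Avenue'] }
];

var totals = { roads: 0, peach: 0 };
var summary = null;
var job = createJob(tiles, countPeaches);

// Fold each tile's result into the global totals.
job.on('reduce', function (result) {
  totals.roads += result.roads;
  totals.peach += result.peach;
});

// Report once every tile has been processed.
job.on('end', function () {
  summary = (100 * totals.peach / totals.roads).toFixed(1) + '% of roads mention peaches';
  console.log(summary);
});

job.run();
```

Keeping the per-tile results small and additive is what lets the reduce handler stay this simple.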
The job is getting started by walking through the MBTiles file, and now you can see Tile Reduce doing real work, spread across all four cores of my CPU right away, maxing things out. And OK, we're done already. That took about 18 seconds, and you can see there are way too many peaches: 1.2 percent of roadways in Atlanta have peach in their name, which is ridiculous. I should say you can plot the output of this. You don't just have to pass around JavaScript objects with totals; you can pass GeoJSON and construct a geometry layer. So in a slightly edited version of this, I can construct geometry which I can put on a map very easily to show where all those Peachtree lanes and streets and roads are. And trust me when I say many of them are connected to each other, which is especially egregious if you ask me. So this is a trivial
18:22
example, but a good one. Things
18:25
get pretty interesting when you move up to the cloud. You saw from that htop display that I've got four cores on this machine. This
18:32
is an Amazon c3.8xlarge machine. This is not Amazon's biggest, but it's one that we use a lot. This costs about 80 cents US to run for an hour and gives you 32 cores, which gives you an idea of how cheap it is to scale up this kind of analysis. Things
18:49
get really interesting when you move beyond a city. Analyzing a city on a desktop is fine, but what happens when you have a worldwide dataset? This is data from RunKeeper. For those of you who don't know RunKeeper, it's an application for tracking fitness activities like running or biking or swimming that's pretty popular in the US and in some parts of Europe. One of the options they give their users is to share the data that they capture during their runs as routes that other people can try. So for all the publicly shared routes, we can collect that; we can chop off the beginning and end to preserve people's privacy, so we don't see which houses they're going into; and we plot it on a map like this, which shows the intensity of different exercise routes. So that's a cool visualization. Things get really interesting, though, when
19:31
you take something like this. You're going to have to take my word for it, given the color layout here, but there are green lines here that represent OpenStreetMap geometry, and there's one conspicuously missing from this route here. If we start putting together Turf functions to detect where we're missing geometry between layers, we can figure out where we need to do more mapping, where we need our team of mappers to add to the map. And we can do this on a global basis using Tile Reduce. Here's a stadium that was missing a running track around it. Here are a bunch of coastal areas; people really like to run by the water, as you can notice from San Francisco. We can run this analysis in about an hour using 20 of those instances. That's an incredibly quick analysis of a worldwide geospatial problem.
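One simple way to think about detecting missing geometry between two layers is to compare which grid cells each layer touches. The sketch below is an illustrative simplification and not the method used in the talk: it looks only at line vertices, uses square cells in degrees, and all names and coordinates are made up.

```javascript
// Snap a [lon, lat] position to a square grid cell key.
function cellKey(pos, cellSize) {
  return Math.floor(pos[0] / cellSize) + ':' + Math.floor(pos[1] / cellSize);
}

// Collect the set of cells touched by the vertices of a list of lines.
function cellsOf(lines, cellSize) {
  var cells = {};
  lines.forEach(function (line) {
    line.forEach(function (pos) {
      cells[cellKey(pos, cellSize)] = true;
    });
  });
  return cells;
}

// Cells that contain activity traces but no mapped geometry:
// candidates for a mapper to review.
function missingCells(traces, mapped, cellSize) {
  var mappedCells = cellsOf(mapped, cellSize);
  return Object.keys(cellsOf(traces, cellSize)).filter(function (key) {
    return !mappedCells[key];
  });
}

// Made-up data: one mapped road, one running trace that continues past its end.
var mappedRoads = [[[0.05, 0.05], [0.15, 0.05]]];
var runTraces = [[[0.04, 0.06], [0.14, 0.04], [0.25, 0.05]]];

var candidates = missingCells(runTraces, mappedRoads, 0.1);
console.log(candidates); // cells where people run but nothing is mapped
```

Because each cell, like each tile, can be checked independently, this kind of comparison fits the per-tile map step naturally.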
20:24
As I mentioned, Turf is free and open source, and I would encourage you to check it out at turfjs.org or on GitHub. We welcome contributions. This is the current list of
20:33
contributors, and I want to in particular single out Morgan Herlocker, who is not only the author of most of these slides but of most of Turf. If you have accolades or questions for him, I think he's following along and would be happy to talk to you. But I'll be very happy to take your
20:49
questions. Thank you.
Thanks for a great presentation. An observation, just thinking about the distributed computing aspects of using tiles: if you have a really long road that stretches across several tiles, the number of objects you're counting may be counted twice.
Yes, that is true. For purposes of this sort of toy demonstration it's not a huge deal, but if we were actually worried about it we could try to disambiguate using the OSM ID. In this particular case we are counting twice, because we're looking at one tile after another. But it's more often the case that we're doing problems like comparing a probe dataset for overlap, and there that works out quite nicely.
You mentioned the Amazon instances. How do you set up an analysis to run on a number of machines?
That is an insightful question. There is an additional layer above Tile Reduce that we use for this. Tile Reduce is memory-constrained and designed around a single machine. You can imagine it's not too hard to dish out different bounding boxes, for whatever geometry you want to cover, to an instance that spins itself up and runs. The actual technology for doing that relies on some projects that we're in the process of open sourcing but haven't yet. So I think the short answer is: keep your eyes on the Mapbox blog, and we'll have more for you on that if you want to run a global-scale analysis. In the short term, it's easy enough to roll your own if you've got a compute cluster where you want to run stuff.
Hi, thank you for the presentation. You mentioned that Turf is an open source project, but you didn't mention whether Tile Reduce is open source as well. Is it?
Yes. I did say it about Turf but failed to mention that they're both open source projects. Like most of our open source work, the license is, I think, the ISC or MIT license. Thank you. Any other comments?