High-performance geocomputing: leveraging Python, clusters, GPUs for geocomputation
Formal Metadata
Title: High-performance geocomputing: leveraging Python, clusters, GPUs for geocomputation
Number of Parts: 27
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/49532 (DOI)
Production Year: 2020
Production Place: Wicc, Wageningen International Congress Centre B.V.
Transcript: English (auto-generated)
00:01
Okay, good morning everyone. Sorry I'm running a little late and I'm also sorry I can't attend in person. It's quite a situation over here. If you have a computer and you're having difficulty seeing the slides, I've got the link for the slides here. They should run on your computer, just JavaScript and HTML.
00:21
So what I want to talk about this morning for me and this late afternoon for you is high-performance geocomputing. And we're going to go on a mission for faster code. And I'm going to break that down in several ways. How do we make code go fast? The first way that we're going to look at is to try to play to the strengths of the languages that we're using.
00:44
Second, we'll try to understand what the algorithms are that we're using and how we can understand and reason about their behavior. Next, we'll look at intelligent workflows. Workflows that support doing large data analysis.
01:00
Then we'll look at using those workflows on clusters in order to leverage multiple computers. Touch briefly on how we might use faster languages. And then talk about the pros and cons of using exotic hardware like GPUs. And finally, Tomislav asked me to talk about some of the work that I do in high-performance geocomputing.
01:23
And I'll do that at the end. So how do we play to the strengths of our languages? A lot of the languages that we've been using during the summer school here are R and Python. And both of these languages are optimized for programmer productivity.
01:41
The idea is that you sit down and there's a minimal amount of work that you need to do in order to produce results. Unfortunately, this comes with tradeoffs. It means that these languages have to rely on low-level implementations in order to get the speed that they do. And if you don't have a low-level implementation available for the problem you're solving,
02:03
these languages are not going to be a good way to address the issue. The anti-pattern for both of these languages is the for-loop. These are interpreted languages. Take the function doing the work on the screen here,
02:21
I'll move my mouse around. If this function involves seconds' to minutes' worth of work, then your for-loop is fine. But if it executes in milliseconds, then this for-loop is going to be extremely sluggish, especially for large datasets. And in those cases, you need to be able to vectorize this.
02:44
And in both Python and R, if you're using the language in a vectorized fashion, you're not going to have the for-loop at all. You're going to have some function that ingests the dataset and returns another dataset. And when you're operating in that regime, you're using the language to its full capacity.
03:03
But if your code looks like this, then you're probably not, and you should be thinking about whether there's another solution. Again, if this function executes extremely quickly, then this for-loop, if you have a large number of items, is probably going to be a lot slower than using some kind of vectorized solution.
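To make that concrete, here is a minimal, hypothetical comparison in NumPy; the array and function names are illustrative, not from the talk:

```python
import numpy as np

values = np.random.rand(1_000_000)  # hypothetical large dataset

# Interpreted for-loop: every iteration pays Python's per-element overhead.
def slow_sum_of_squares(values):
    total = 0.0
    for v in values:
        total += v * v
    return total

# Vectorized: one call ingests the whole array; the loop runs in compiled code.
def fast_sum_of_squares(values):
    return float(np.sum(values * values))
```

The second version is exactly the "function that ingests the dataset and returns a result" pattern described above.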
03:23
The next thing I want to talk about is big O notation, that is time and space complexity. This gives us a formal way of reasoning about how our algorithms work. Now, the actual speed at which an algorithm runs
03:41
can be a complicated function of the number of items that you want to process, or the size of your raster dataset, or the number of points that outline the vector objects that you might be dealing with. However, you can strip away a lot of that complexity. In fact, in the end, for large datasets,
04:02
the only thing that matters is the highest order term of the algorithm. And what this image is showing is that despite fluctuations over here on the left-hand side, eventually there's some number of items beyond which one of the terms of the function f comes to dominate.
04:23
And in the example below, that's this n-cubed term. So, knowing that, we can look at the behavior of what these higher order terms look like. Notice that there are no coefficients in this graph,
04:40
just n, the number of items or the size of the dataset, and how those compare against each other. And it's very clear from this picture that if you're working with some kind of n-squared or n-cubed algorithm, you're in a bad place. That works only for small datasets. Let's break that down a little more.
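For reference, the "highest-order term dominates" idea is the usual big-O statement; the coefficients below are illustrative rather than the ones on the slide:

```latex
f(n) = O\big(g(n)\big) \iff \exists\, c > 0,\ n_0 \ \text{such that}\ f(n) \le c\,g(n) \ \text{for all}\ n \ge n_0 .
```

So, for example, a cost like $f(n) = 2n^3 + 100n^2 + 10^6 n$ is simply $O(n^3)$: past some dataset size, the cubic term swamps everything else.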
05:03
So if I have a hundred thousand elements that I want to process, I'm going to size that better for the screen here, how long does it take me to process those using something like an order one algorithm? Well, if I have something like a hash table, where if I can take some kind of key for my item,
05:23
some kind of mathematical manipulation of an item, I can find that in the hash table in order one time. It doesn't matter how big the hash table is, I can find my item really quickly, which is why hash tables are extremely useful, and we'll come back to that later.
05:41
There's also an algorithm called the disjoint set or union find. If I have a set of items, say that these are cells within a raster that have some kind of land use type, like a forest or urban area, and I want to ask, do these belong to the same forest or the same urban area?
06:01
Then the disjoint set algorithm lets me find that in time that's, for all practical purposes, independent of the number of items in the set. Moving upwards, if I have to do some kind of search on sorted data, that happens in log n time. So with a hundred thousand items, in about twenty steps I can find whatever it is I'm looking for.
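As an aside, that near-constant-time disjoint set is only a few lines of code. A minimal sketch, with the raster/land-use framing left hypothetical:

```python
# Minimal disjoint set (union-find) with path compression and union by size.
# Hypothetical use: cells are numbered 0..n-1; union() neighbouring cells that share
# a land-use class, then same_patch() answers "same forest?" in near-constant time.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def same_patch(self, a, b):
        return self.find(a) == self.find(b)
```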
06:24
And we'll come back to searching sorted data again in a moment. If I need to sort something, that takes about a million steps with a hundred thousand items. And beyond there it all goes downhill. If I need to do an all-pairs comparison, that's going to take a long time, and if I add just a few more items to this number,
06:42
that becomes impossible. And if I want to consider all the permutations of any kind of set, again, that's something that I'm going to have to address via some kind of sampling process. So knowing that about items, let's talk briefly about the trade-off between pre-processing and queries.
07:05
And the quintessential representation of that in my mind is road networks. So if you want to find a route from A to B in some kind of road network, the way that they'll teach you to do that in an introductory algorithms class
07:23
is using Dijkstra's algorithm. We look at the closest node to us in the road network, and we add that to a kind of queue that we're looking at, and then we look at the neighbors of that, and now we choose the closest item to all of the nodes that we've already explored. And if we repeat that process,
07:41
eventually we find the shortest path to the node that we're looking for, from A to B. But if you spend a little more time crunching this data ahead of time, you can dramatically, and by dramatically I mean seven orders of magnitude, decrease the amount of time it takes to find a route in that road network. And I won't talk more about this here,
08:02
but if this is something that's useful to you in your research, make sure that you're not implementing your road network algorithms yourself. That would be a mistake. I recommend contraction hierarchies as being a way of identifying algorithms that will actually be useful to you.
08:20
And there's a whole suite of different algorithms that solve this problem. Customizable contraction hierarchies let you put edges in your network whose weights change over time, as per the movement of traffic. And that allows you fast updates, much shorter than the processing time here implies, that still give you fast query times.
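For completeness, the plain Dijkstra procedure sketched above looks roughly like this in Python; the adjacency-list format is hypothetical, and for real road networks you would reach for a contraction-hierarchy library rather than this:

```python
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbour, edge_weight), ...]} -- a hypothetical adjacency list."""
    dist = {source: 0.0}
    queue = [(0.0, source)]                 # priority queue of (distance so far, node)
    while queue:
        d, node = heapq.heappop(queue)      # always expand the closest frontier node
        if d > dist.get(node, float("inf")):
            continue                        # stale queue entry
        for neighbour, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return dist                             # shortest distance from source to every reachable node
```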
08:44
Let's talk about a simpler problem though. How would you find in a geospatial dataset the nearest neighbor? And how can we take advantage of pre-processing? And how can we understand this using this big O notation that we developed earlier?
09:01
So let's look at this problem. If we want to find the nearest city to some geographic point, the most obvious way of doing this is by using a nested for loop. You have an outer for loop. You look at all of the points in the dataset that you want to find the closest city to. And of course this doesn't have to be about cities.
09:21
And then you loop through all the cities and you find the one that's the minimum distance from you. But there are two things going wrong here. One is that again we're using for loops. And notice that this distance function doesn't take a lot of time to run. So we're not really leveraging the language we're operating in.
09:40
But also, notice that we have a nested for loop. If we have a large number of points and a large number of cities, this is giving us the n squared behavior that we talked about earlier. And that implies that we should try to find a better solution because we know that that's not going to scale. We know that even if we were using the language to its greatest strengths,
10:02
this is eventually going to fail us if we have too many cities. Now this is fine if we have small datasets. But again, if you have any questions about how big your dataset is going to get, this isn't going to work for you. And just highlighting the nested for loop there.
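A sketch of that naive nested-loop version, with hypothetical coordinate lists, just to make the n-squared shape concrete:

```python
import math

def nearest_city_bruteforce(points, cities):
    """O(len(points) * len(cities)): fine for small inputs, hopeless at scale."""
    nearest = []
    for px, py in points:                      # outer loop over the query points
        best, best_d = None, float("inf")
        for cx, cy in cities:                  # inner loop over every city
            d = math.hypot(px - cx, py - cy)   # cheap per-call work: loop overhead dominates
            if d < best_d:
                best, best_d = (cx, cy), d
        nearest.append(best)
    return nearest
```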
10:20
So the way that we solve this effectively is using what's called spatial indices. And I wanted to talk about these for a while because I feel like they're one of the most important parts of geocomputation. The one we're going to start with is called the KD tree. And what these indices have in common is that you can construct them in n log n time.
10:42
And that's essentially because this is a sorting problem. We need to figure out how to sort spatial data so that we can access it quickly. And once we've done this sorting, then it takes log n time per query to figure out what the closest element is in the dataset. And unlike the contraction hierarchies we talked about a moment ago,
11:03
these use only order n extra space. So since space is essentially free on computers today, you don't have to worry about that aspect so much. I'll also talk at the end about this S2 index. And we're going to think of this as being a kind of a hash table structure.
11:22
Or a hash function. And I'll explain more what that means when we get to it. So the idea behind a KD tree is that if data lived in one dimension, if it existed in a line, then we could sort the data from smallest element to biggest element.
11:41
And we could quickly find any element we were looking for by asking for the middle element, whether what we were looking for was to the left or to the right of it. If it was to the left, then we have the same problem but recursively. We can take the elements on the left, look at the middle element, and ask, are we on the left or on the right?
12:01
But with two-dimensional data, that obviously doesn't work. Instead, what we do is we can look at the data on a kind of 2D plane. And we can ask, are you to the left or the right of the element on the x-axis? And now once we've split along the x-axis, we can consider the y-axis
12:20
and ask, are we on the left or the right of the middle element on those axes? And if we kind of expand that out a little bit, we end up with a situation that looks like this. One turns out to be the middle element when we sort this data along the y-axis. And once we've done that, we start splitting along the x-axis with three and two.
12:42
And we just repeat this recursively. And now we have a log n way of identifying whether an element is in our data set, as well as a log n way of figuring out which elements are closest to a given element. And this is what that would look like if we had a larger data set.
13:03
So one tool for performing this that's available, I believe, for Python, R, and C++ is called nanoflann. And you can see here that with only a few hundred milliseconds of pre-processing time, it's possible to get query times on the order of microseconds.
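If you are working in Python, SciPy ships a comparable KD tree; a minimal sketch with hypothetical, already-projected coordinates:

```python
import numpy as np
from scipy.spatial import cKDTree  # SciPy's k-d tree; one alternative to nanoflann bindings

cities = np.random.rand(100_000, 2)    # hypothetical projected city coordinates
points = np.random.rand(1_000_000, 2)  # hypothetical query points

tree = cKDTree(cities)                 # build once: roughly n log n time, order-n extra space
dists, idx = tree.query(points, k=1)   # vectorized nearest-neighbour query, ~log n per point
```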
13:21
So now when you're doing that double-nested for loop, the inner part of that loop, let's head back here for a second. This bit where we're looking for the cities, rather than taking order n time, is taking order log n time. And if we look at our chart from earlier about how much time that takes,
13:40
that means that internally, rather than looking at a hundred thousand items, we're looking at seventeen items. That's a significant speed savings. Now one problem with KD trees is that they only work with points. And another problem with them is that once they're built, it's pretty hard to adjust them.
14:03
So a solution to that is to use something called R-trees. Now, R-tree stands for rectangle tree. And the idea here is that we're going to bound points in boxes, and then we're going to search recursively into the boxes. And there are different algorithms for constructing these R-trees.
14:24
But what I'd like to draw your attention to over here on the right-hand side is this packing algorithm. This is the performance that you get for inserting a million values if you know what those values are ahead of time. In that case, it takes less than a second to build this R-tree.
14:44
But then you're getting performance on the order of nanoseconds for finding something inside of it. Moving one to the left here, this R-star-tree solution. This is if you have an R-tree that you want to change over time, if you want to add things or remove things from it.
15:02
That's going to take longer in order to process, but you still get really good query times. And notice the trade-offs here. You can have slower query times, but faster construction. So all of these algorithms come with trade-offs. And we can think of all of those trade-offs in terms of how those algorithms perform on sample data sets,
15:22
so we can measure this, as well as in terms of this big O notation. And this is just building an R-tree again. And looking again at the time it takes to build and query as you add elements to these structures,
15:48
there's a suite of possibilities for building times. And they relate in potentially surprising ways to the amount of time it takes to query.
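In Python, one option for experimenting with these trade-offs is the rtree package (bindings to libspatialindex); a small sketch with hypothetical bounding boxes, showing both incremental insertion and the packed bulk-load path:

```python
from rtree import index  # Python bindings to libspatialindex

# Hypothetical feature bounding boxes: id -> (minx, miny, maxx, maxy).
boxes = {0: (0.0, 0.0, 1.0, 1.0), 1: (2.0, 2.0, 3.0, 3.5)}

# Incremental insertion: the tree can keep changing after it is built.
idx = index.Index()
for fid, box in boxes.items():
    idx.insert(fid, box)

# Packed bulk load: faster construction when all data is known ahead of time.
packed = index.Index((fid, box, None) for fid, box in boxes.items())

hits = list(packed.intersection((0.5, 0.5, 2.5, 2.5)))   # ids whose boxes intersect the window
closest = list(packed.nearest((2.1, 2.1, 2.1, 2.1), 1))  # nearest-neighbour query
```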
16:01
Finally, in our spatial indices part of this talk, let's talk about the S2 geometry. I think Edzer was going to talk about this a little bit as well. So this is an idea a little like geohashing, if you're familiar with that. We're going to break the Earth up into discrete chunks. And those chunks nest recursively within each other.
16:22
So if I look just at the U section here for Europe, I can divide that into a number of chunks, whose names are then U2, U8, and so on. And if I start subdividing those, I get U2 or U8 followed by some number of additional characters at the end. So every square on the Earth gets identified with a progressively longer string.
16:44
The downside of geohashing is that these cells are of unequal size. And also, it's kind of difficult to figure out what the pattern is for how you might move between cells to figure out which cells are neighbors to each other.
17:01
The S2 geometry solves that problem by creating essentially equal-sized cells that subdivide the Earth, that are nested hierarchically within each other as a kind of quadtree. Each cell divides into four smaller cells. But perhaps more interestingly, these cells have a particular ordering.
17:25
And this is using what's called a Hilbert space-filling curve. So that if I want to find the neighbors of a cell, I just have to walk along that curve. Or use a relatively simple function to move between edges of the curve.
17:45
This also means that the cells have relatively simple ID numbers. You have the face of the cube, 0 through 5, identified by the first three bits of an ID number. And then the remaining 61 bits of the number identify both what level of
18:02
the curve you're in, indicated by the trailing one here, as well as the ID of the cell. And if I increment these numbers one at a time, I walk that curve. So just showing visually what's happening here, we're creating a kind of fractal space on which our data lives.
18:27
Now the advantage of a space-filling curve like the Hilbert curve is this nice property: if elements are close to each other in two dimensions, then they tend to be close to each other on the curve, that is, in the 1D representation, as well.
18:40
And that's true in a statistical average sense. And this has the benefit not only that we can do a kind of approximate spatial query, but also that as we read in our data, the data gets read in the sense that things that are spatially close to each other in two dimensions are also close to each other in one dimension.
19:04
So the computer is able to access them faster. It has what we call cache locality. And this means that every place on Earth can be placed within a cell, and we can control the size of those cells.
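One way to play with this from Python is the s2sphere package (assumed here; the coordinates are a hypothetical point near Wageningen). It gives you the leaf cell for a location, its parents at coarser levels, and neighbouring positions along the Hilbert curve:

```python
import s2sphere  # a Python implementation of the S2 cell hierarchy

latlng = s2sphere.LatLng.from_degrees(51.97, 5.66)   # hypothetical point
leaf = s2sphere.CellId.from_lat_lng(latlng)          # level-30 leaf cell on the Hilbert curve
print("leaf cell token:", leaf.to_token())           # compact string form of the 64-bit id

# Walk up the hierarchy: roughly block-, city- and country-sized parent cells.
for level in (16, 10, 5):
    print(level, leaf.parent(level).to_token())

# Stepping along the space-filling curve at a fixed level gives a nearby cell.
neighbour_on_curve = leaf.parent(16).next()
```

Those parent tokens at several levels are exactly the kind of hierarchical features the next part of the talk builds on.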
19:26
So this is a machine learning idea that you could apply to any of the other machine learning techniques that you'll talk about this week. Essentially, since everything falls into a cell, we can take a small cell, nest it within
19:44
its parent cell, within its parent cell, and within that parent's parent cell, the grandparent's cell, and each one of these cells implies something different about space. The smallest level of the cell might be talking about a single farm field or a single block in a city. The next larger might be talking about a province size region, and the next larger might be talking about a country.
20:07
And if you have some kind of machine learning classifier that's trying to talk about pixels that you're looking at, then having these different hierarchical representations of space gives you a quick way to try to teach the algorithm about the different rules and regulations of space at that point.
20:26
For instance, if we were talking about traffic laws, at the level of an individual block, you might have stoplights or crosswalks that affect the flow of traffic. At the size of a city, you might have speed limits of some sort. At the size of a country, you might have different driving habits or conventions, like driving on the left versus the right-hand side of the road.
20:47
And a machine learning algorithm can learn all of these different factors without you having to teach it anything explicitly. And this works better if you weight towards using larger areas more heavily.
21:02
Because you don't want the algorithm to learn the same thing for every block-sized cell. You don't want to have every block-sized cell represent the idea that people drive on the left or right-hand side of the road; that's something that belongs more appropriately at the country level. So if you regularize towards these larger cells, that allows you to drop many of
21:22
the smaller cells from your representation entirely, and then retrain the network with the reduced representation. And to kind of bear that point home a little bit, this is a depiction of a neural network that's supposed to predict how long it takes buses to travel from one stop to the other.
21:43
And it uses all of these bells and whistles that have come out of machine learning in the past few years. And for that reason, it looks really impressive. It's using these LSTM layers, these long short-term memories, to try to pass information along the route that the bus is traveling. It's doing some kind of convolution over the graph of the road network. It's embedding all of this information about weather.
22:07
The problem is that this actually takes a really long time to train, and it doesn't do so well. There's a much simpler network we could use that incorporates these S2 cell IDs that I was talking about earlier. And if you just feed those forward through a couple of additions, you get a network that on average, in blue here, performs significantly better.
22:29
And if you train it multiple times, it's able to achieve that performance repeatedly. It has what we call stability. The more complicated network, using all the bells and whistles, doesn't achieve that kind of performance.
22:43
So again, coming up with succinct and clever representations of space allows you to build better performing machine learning algorithms. I want to move on to the next stage of this high performance idea, which is workflow management.
23:03
The tool I'm going to talk about is called Snakemake. The idea of Snakemake is to build what's called a directed acyclic graph, or DAG, of your workflow. We're all kind of familiar with this process in an informal sense. You gather data, you unpack your data, you process the data
23:23
in various different ways, you combine the data, and then you analyze the data, and then you produce graphs, and you produce your paper. All of this is a kind of forward flow, directed edges. Once you've unpacked the data, you never do that again. You use the result of that to do the next calculation.
23:44
When you think about your workflow in this way, what you see is that there are a lot of opportunities for parallelism. In the graph here, which is processing genomes, which I realize is not necessarily a spatial problem, but definitely could be, all of the nodes at each one of the vertical levels in this graph can be processed in parallel with each other.
24:05
What if we had a tool that could automatically do that parallelization? Furthermore, what if it could keep track of the outputs of each one of these nodes, so that we don't have to calculate them again unless something upstream changes?
24:20
And if our input data does change, it could follow the graph through only the nodes affected by that change. This is what Snakemake does. We're going to do an example task, and this is going to leverage the spatial indices that I was talking about earlier. The code for this is in a GitHub link at the end of this presentation, but you don't need it at the moment.
24:46
The example task is, what if we want to figure out how far every building in Minneapolis is from a bus stop? Now, to solve that problem, we need to acquire information about the public transit system, and we need to acquire information about the buildings.
25:02
We'll then need to do some kind of geospatial analysis that combines both of those datasets, and then finally, we'll need to make some fancy plots. So, here's the initial directory structure that we're going to start with. And this is how I often end up structuring my own projects.
25:21
We're going to have a folder containing our original data. Once we acquire that data, we're going to mark it read-only, and hopefully we'll never touch it again. And we're never going to have any of our scripts touch or manipulate that data. We use it as little as we can. And then we have a folder of scripts that perform the analysis we're interested in.
25:42
And at the upper level, in a folder containing both of these directories, we have a series of scripts, in this case only a single one, the Snakefile used by Snakemake, that manipulate both the data and the scripts to produce the results we're interested in.
26:01
So, the Snakefile looks like this, and we'll go through it section by section. We start out with a rule for generating our output data. And perhaps counter-intuitively, that rule takes as input the final file we want to generate. In our directed acyclic graph, what that means is that this final rule is taking this file as the output of some upstream node.
26:29
So knowing that, the tool wants to figure out what it is that produces this as an output, and then run that command. Well, what produces that as an output is this plot distances rule.
26:43
And as an input, that's taking this building distances file. It outputs the thing we're looking for, and this is how it does it. It runs this script. The script turns this input into that output. And following this upstream, we have our geospatial analysis.
27:04
And that's taking in this unpacked data and producing this distances file as an output. And it does that by running some Python script. And here's a command that unpacks the data using the unzip command, and another that unpacks the other dataset.
27:24
So moving backwards here, we unpack both of the data, and those form the inputs to a script that produces this building distances as an output. And the building distances gets converted into some kind of plot by this R script. And that, in the end, is the result we're looking for.
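A condensed sketch of what such a Snakefile might look like; all file, rule and script names here are hypothetical stand-ins for the ones shown on the slide, and the real workflow lives in the linked repository:

```python
# Snakefile (sketch): hypothetical names throughout.
rule all:
    input:
        "results/building_distances.png"          # the final artefact we want

rule plot_distances:
    input:
        "temp/building_distances.csv"
    output:
        "results/building_distances.png"
    script:
        "scripts/plot_distances.R"

rule compute_distances:
    input:
        "temp/stops.geojson",
        "temp/buildings.geojson"
    output:
        "temp/building_distances.csv"
    script:
        "scripts/compute_distances.py"

rule unpack_data:
    input:
        "data/transit.zip",
        "data/buildings.zip"
    output:
        "temp/stops.geojson",
        "temp/buildings.geojson"
    shell:
        "unzip -o {input[0]} -d temp && unzip -o {input[1]} -d temp"
```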
27:45
So looking at the Python part of this here, we're going to build a KD tree to solve this problem, using the techniques we talked about earlier. And essentially, we're going to bring in the unpacked data, and I haven't gone
28:01
deeply into how we unpack it, but this is essentially using the unzip command. And we read in that data, we know what kind of projection that's in, because we've looked at the data and we understand the metadata associated with it. And we re-project it into a common coordinate system.
28:20
We do the same thing with the buildings data. Then we pack the projected coordinates of the bus stops, so now we're in a Euclidean space rather than latitude and longitude, into this KD tree. So that allows us to quickly query, for a given point, which stop is closest to it.
28:46
Now for each building, we can find the centroid, and then, notice that we're using vectorized code here, we can pass the list of all building centroids to the KD tree. And the KD tree figures out which stop is closest to each of the centroids.
29:04
And this takes no visible time when you run this command. Even though nominally, it's comparing something like 1.6 million or so buildings with hundreds of thousands of stops. So we get really good performance by leveraging that spatial index.
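The analysis script described here might look something like the following sketch; the file names and the choice of projected CRS are assumptions, not taken from the actual repository:

```python
import geopandas as gpd
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical inputs from the unpack step; EPSG:26915 (UTM 15N) is an assumed
# metric CRS covering Minneapolis.
stops = gpd.read_file("temp/stops.geojson").to_crs(epsg=26915)
buildings = gpd.read_file("temp/buildings.geojson").to_crs(epsg=26915)

# Pack the projected stop coordinates into a k-d tree.
stop_xy = np.column_stack([stops.geometry.x, stops.geometry.y])
tree = cKDTree(stop_xy)

# One vectorized query: nearest stop for every building centroid.
centroids = buildings.geometry.centroid
cent_xy = np.column_stack([centroids.x, centroids.y])
distances, nearest_stop = tree.query(cent_xy, k=1)

buildings["dist_to_stop"] = distances
buildings[["dist_to_stop"]].to_csv("temp/building_distances.csv", index=False)
```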
29:22
And finally, we output that file that we know is needed by the downstream script. The downstream script leverages the tidyverse to make some nice plots quickly. We read in that building distances file, we arrange it so that buildings that are farther
29:42
from bus stops get plotted on top, we do the plot, and then we save the plot. So our final directory structure looks like this. When we were unpacking that original data, that resulted in expanded data objects inside of this data directory.
30:02
The scripts read from that data directory to produce temporary objects. And you could think of these as being results, but they're not results that we're going to show to our end users, they're not results that we're going to put in our paper. They're just files that we're using in order to save state as we move through our calculation.
30:23
And finally, we end up with our results. In this case, a map of distance. And that looks a little like this. So having written these scripts, and having provided the input data, and having written this file that tells how all of these processes get linked together,
30:41
I can run a single command that goes from having just the original data and scripts to having this entire structure as well as this output. The nice thing about having such a command is that it allows us to automatically parallelize everything. So what if we want to run this on some kind of compute cluster?
31:01
If we have some Google Cloud credits, that's pretty easy. We just initiate a cluster, give it a name, tell it how many nodes we want, link some credentials with it, and then we run snakemake, the same command that would run this automatically on our own local computer, and give it some information about the cluster.
31:21
And snakemake can actually keep track of what kind of environments you need using Conda in order to download the packages that perform your analysis. And when we're done, we delete the cluster so we don't get charged hundreds of dollars for it. If instead we have access to a high-performance computing system, then it gets even easier.
31:43
As long as we've got everything installed package-wise in the directory that we want, we just have to use this qsub command, or it could be sbatch, tell snakemake that that exists, and it automatically parallelizes the entire calculation to take advantage of the computer that we have access to.
32:03
So that's a simple way of moving from a reproducible analysis to something that we can leverage high-performance computing to accelerate. Now, the next way we could accelerate our software is by using a faster language. And I say that in quotation marks, because Python and R run just as fast as C++ as long as you're using those vectorized operations.
32:28
But for some custom code, or for some problems, a vectorized operation doesn't exist. And from the discussion earlier, we know if we start using for loops, that we're not really playing to the strengths of Python or R.
32:42
Now, C++ is a language where for loops work extremely quickly. But it's not a language that I want to talk about a lot here today, because it has a high upfront cost, you have to learn the language, and you have to learn it well in order to use it well. And this is also going to present a maintainability issue for you, because now you have a lot of code in multiple languages,
33:02
and whoever wants to use that code needs to figure out how it all fits together. Hopefully you help them. But still, you can have compilation issues. Finally though, this is the most scalable option available to you, if you're willing to pay these costs and take these maintainability risks. Because this allows you to leverage the full power of whatever computer you're using.
33:24
So this is the kind of technique that can take you from having to use a supercomputer to being able to do things on your desktop, because C++ really does work that much faster. In an ideal world, you don't use C++ for everything. You write the function that you need, the thing that doesn't exist for your analysis,
33:44
and then you expose that so that you can fit it into a Python or R workflow. And the tools that allow you to do that are pybind11 for Python and Rcpp for R. And that allows you to write a small amount of code to accelerate only the portion that isn't fast enough in your existing Python or R code.
34:08
So I'm going to move the talk in the direction of my own research for a moment, but we're going to get there by talking about trends in computing. So one of the reasons you come to a summer school is to learn techniques, but the other is to learn a little bit about research and what problems might be interesting.
34:25
And these are the problems that are interesting to me, and they have to do with changes in what's been happening in computing over the last decade or so. And the most salient change is that although the number of transistors in a computer has been going up at an exponential rate,
34:41
this is Moore's law, performance has not followed that trend. Since about 2010 or so, computers have not been getting faster. And what that means is that whatever you write today is not going to be any more performant tomorrow. So if you want good performance, you need to be thinking about good algorithms, because computers aren't going to get it for you.
35:05
And relatedly, the amount of power computers are consuming has gone up. And that has implications also for what kind of speed we have. Even if we figured out how to get better performance out of a single thread, we'd probably be doing it at the expense of creating more heat.
35:22
So we're in a double bind there. Now, we recognize that we're still getting increased performance out of machines. That's coming from increasing numbers of cores. But that means that in order to take full advantage of a computer, in order to be able to analyze data really quickly, you need parallel algorithms, because using serial algorithms, using things that aren't distributed, isn't going to give you advantages anymore.
35:50
Concomitant with those trends is that there's this growing gap between the speed of a processor and the speed of memory. So if you're designing your algorithms with the idea that you can reach out to memory whenever you want, in order to be able to see things,
36:07
you're not getting the best performance you can. A good algorithm in 2020 looks at data as little as possible, and once it has, it uses that data as much as possible. Because if you look at data repeatedly, then you have to cross this gap again and again and again.
36:28
And you can see this if you look at the top 500 list for supercomputers. Back in like the 90s or so, we were focusing on trying to make single processors as fast as possible. And that was not a scalable route.
36:42
So into the late 90s and 2000s, we began moving towards cluster computers, using MPI, using distributed frameworks, using things like SnakeMake, in order to get good performance. Now the top 500 supercomputers are using GPUs in order to get the majority of their speed.
37:02
And that means that we have to rethink our algorithms again. So these reprogramming boxes here, they highlight a problem because we have to redevelop everything we've built before, but they also highlight an opportunity. Because moving from one set of algorithms that's good for a cluster to something that's good for taking advantage of cluster GPUs,
37:25
requires rethinking how we do computations. That's an opportunity for research and an opportunity to explore datasets that were simply inaccessible farther back in time. Now looking at that memory problem again, the time it takes to load one byte of data is paid off, in a sense, if you load enough data.
37:52
How much data is enough data? For a GPU, if you're loading something like a million bytes, maybe a hundred thousand bytes, then you're paying off the cost of having looked at the memory at all.
38:05
For a CPU that's considerably smaller, if you look at only a thousand bytes or a hundred bytes or so, you're paying off the cost of having done that. So one way of thinking about this is that if you are thinking about using GPUs,
38:20
you need to be able to move a lot of data onto them before it's worthwhile. And if you're programming algorithms on GPUs or CPUs, you need to be thinking deeply about how much data you're moving and how much you're looking at. These trends, though, don't just hold for memory on single devices, they also hold for the network.
38:46
What you can see here is that network latency, the time it takes for me to pass a message from one computer to another, has barely declined over something like four decades.
39:02
Network bandwidth has increased. That means that the amount of data that we can send from one computer to another has increased. But the time it takes that data to move, the delay between it leaving one computer and arriving at another, that hasn't decreased. We're just able to send a larger message.
39:21
So if we are using some kind of computer network, this collectively implies that we want to send fewer messages, because each of those messages takes a long time to send. And that means we should pack data into larger messages. But collectively, this means that we should avoid communication as much as possible.
39:44
Because this is significantly slower than any data movement we do within a single computer. We'll skip that slide.
40:01
Now we'll talk about this just briefly. So, thinking about this trade-off between how much you use data versus how much data you load, this implies what's called a roofline. And these are increasingly used for performance analysis, to try to figure out why an algorithm is slow and how we can improve it. If we think about loading a lot of data, or if we think about using
40:26
data a lot once we have it loaded, like doing some kind of matrix multiplication operation that requires dozens of operations per byte, that has what we call a high operational intensity. On the other end of the spectrum, if we use the data we load only a few times, there's low operational intensity.
40:47
And in that realm, we're bound by the speed of memory, whereas if we reuse the data a lot, we are bound by the speed of the processor. And different processors and different types of memory imply different rooflines.
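For reference, the roofline model is usually written as a single min(); operational intensity I is the number of operations performed per byte moved to or from memory:

```latex
P_{\text{attainable}} \;=\; \min\!\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr),
\qquad
I \;=\; \frac{\text{operations performed}}{\text{bytes moved to/from memory}}
```

Below the ridge point you are memory-bound; above it you are compute-bound, which is exactly the distinction drawn next.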
41:03
Now, we can still be operating optimally for a given roofline, depending on what kind of operation we're doing. In this case, thinking about a sparse matrix multiplication, we're bound by memory, but we're sitting on this roofline, so we're doing well. If we have a dense matrix multiplication, then we're bound by the speed of our processor, but as long as we're up on that roofline, we're doing fine.
41:25
What's bad is if we're not on any roofline. But notice how this also implies what kind of changes we might need to make to an algorithm in order to improve it. If we have sparse matrix multiplication, the only way to improve that algorithm if we're sitting on the roofline is to find a way to reuse the data more.
41:43
So that once we've transferred it through the network, once we've moved it out of RAM into the processor, we're using that as much as possible. And that allows us to slide to the right along that roofline. However, if we're over in this region, the only way to increase the speed of our algorithm is to come up with cleverer ways of doing our calculations.
42:08
Now on the right hand side here, I've got a bunch of different neural networks. These are used for machine learning. You've got your LSTMs. These are really sparse, serial networks.
42:20
And on the opposite end of the spectrum, you have your convolutional neural networks. These have extremely high operational intensity. So again, operational intensity is on the x-axis, and what you find is that for some neural networks, the only way you can improve performance is by getting new hardware.
42:40
Or in the case of Google, you build the TPU and you make new hardware entirely, because there is no physical way to improve other than doing that. But notice that for a lot of machine learning algorithms, you're not necessarily on the right hand side. And that implies that getting a GPU will not accelerate your problem.
43:01
So before you try to throw magic hardware at your problem, be sure you understand what kind of problem you're dealing with and whether it will benefit from that. And I wanted to touch briefly on memory cost. The cost of memory has been going down over time, except when you have events like the Thailand flood and most of the world's hard drive building capacity gets taken out.
43:26
And kind of the encapsulation of all of these trends is that we're looking at undersea data centers now, as a way of getting rid of the excess heat. Now, concomitant with these changes in computation have come increases in the resolution of the data we're looking at.
43:43
We moved from 90 meters in the not-so-distant past to 1 meter data today. But that's an increase of something like 4 orders of magnitude in the number of cells, since each 90-meter cell becomes roughly 8,000 one-meter cells. And we have the memory to hold that. And we certainly have the processing power. But in order to actually do the processing, we need parallel algorithms.
44:05
ArcticDEM is a good representation of this. This is the entire planet, north of about 60 degrees. And this is at something like 2 meter resolution. It's also a time-varying dataset. So there's 20 petabytes of data here.
44:23
And this is going to become available for the whole Earth within a few years. So I think a lot of the interesting problems, at least for me, are figuring out how you extract meaning from that much data. How do you answer questions using these large datasets? And I'm aware that I'm approaching time here, so I'll just speed through the rest.
44:45
Particularly, I'm interested in hydrology. This question of how much water goes from upstream to downstream, and how we can track it. And that's an important question, because things like the Mississippi River dead zone, this is an area of thousands of square kilometers.
45:02
And there's no life in here because of the runoff from all of the upstream areas in North America. This entire river corridor drains into that single space. Flow accumulation is represented as the number of cells that are uphill of any given cell. And the question is, how do you parallelize a calculation like that? Well, you end up dividing your dataset.
45:34
And you do so by breaking it into chunks. And for each chunk, the goal is to try to find a cunning way of representing all of the interior information using only the perimeter of that tile.
45:48
And if you think about this for a little bit, I think you'll see that you can represent this as having unknowns. You don't know how much water is coming in, but you know how much water you generate within a tile. And that tells you how much gets passed out.
46:05
And so if you think only about the perimeters, the amount of water coming out of here needs to come in here, and then it needs to come out again. On the downstream side, it goes into another tile. And so you can solve this problem without thinking too much about the insides of the tiles.
46:26
And then later, once you've solved it, you know what all of your unknown inputs were because you passed them through this graph using only the perimeters. And that allows you to correct the inside of your tile. So this lets you turn something like flow
46:41
accumulation, which seems like it's inherently serial because you go from upstream to downstream, into something that scales really well. Another problem is filling in depressions. The standard algorithm for that has you start at the outside of the tiles and then move inwards.
47:01
And whenever you reach a point where you're looking down at a cell that you haven't seen before, you know that that's inside of a depression because you're always looking at the lowest cell first. So you can pave that depression over. And if we look at this in tiled representation, we
47:21
could do this for each tile separately. The middle tile is the one where it's easiest to see. So if we just fill the middle tile, we get rid of that low point. But within the whole graph, that entire middle tile needs to be raised a little bit. So how do we do that? Well, we go back to another perimeter representation. We
47:44
fill in the tile, but we keep track of which originating cell started the fill. And that forms a kind of watershed graph shown on the right here. Knowing the watersheds that the perimeter cells belong to,
48:01
and knowing how those watersheds connect together, allows us to form a large graph that represents the problem extracted from the tiles. And that allows us to scale by three orders of magnitude over existing work, so that we can take a calculation that looks as though it couldn't possibly be parallelized, and do it in such a way that we can process all of that information within, well, really minutes.
48:28
But this is using only a couple of nodes. The more processing power you have, the better scaling you're able to get, as represented by this graph. There's no drop off here as you increase the number of cells you have to worry about.
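For a sense of what the serial building block looks like, here is a minimal Priority-Flood-style sketch of the edge-inward depression filling described a moment ago, on a small NumPy DEM; the tiled, perimeter-graph parallel version is considerably more involved:

```python
import heapq
import numpy as np

def fill_depressions(dem):
    """Seed a priority queue with the DEM's edge cells and sweep inward, always popping
    the lowest cell seen so far; any unvisited neighbour lower than the current cell
    lies inside a depression and is raised ("paved over") to the current level."""
    rows, cols = dem.shape
    filled = dem.astype(float).copy()
    visited = np.zeros(dem.shape, dtype=bool)
    pq = []

    # Seed with the perimeter cells.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(pq, (filled[r, c], r, c))
                visited[r, c] = True

    while pq:
        elev, r, c = heapq.heappop(pq)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1),
                       (-1, -1), (-1, 1), (1, -1), (1, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not visited[nr, nc]:
                visited[nr, nc] = True
                filled[nr, nc] = max(filled[nr, nc], elev)  # pave the depression over
                heapq.heappush(pq, (filled[nr, nc], nr, nc))
    return filled
```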
48:42
One other shout out I wanted to make was to ColorBrewer 2.0. If you're building visualizations for your papers and looking for a good color scheme, this is the website to go to. And that's the end of the talk.