Flow computation on massive grid terrains
Formal Metadata

Title: Flow computation on massive grid terrains
Number of Parts: 45
License: CC Attribution - NoDerivatives 3.0 Germany: You are free to use, copy, distribute and transmit the work or content in unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/21759 (DOI)
Language: English
Transcript: English (auto-generated)
00:01
Let's go to the next presentation, which is Flow computation on massive grid terrains by Laura Toma, Lars Arge and Helena Mitasova. It is presented by Laura Toma. We are both at Duke University, and Helena Mitasova is from NC State
01:05
University. So at the core of modeling flow on grids are two basic parameters that people have mentioned already in yesterday's talks, namely flow direction
01:20
and flow accumulation. The flow direction of a cell in a raster models the way water would flow if poured onto that cell. Now imagine you have, say, one unit of water on each cell in a grid, and this water flows according to the flow directions. The problem is to compute, for each cell, the amount of
01:43
water that flows through that cell and that's called the flow accumulation of that cell. So these are the two parameters. The goal is to compute flow direction and flow accumulation for every cell in the raster. And these
02:01
two basic parameters, even though they are very simple, give a lot of information about the topography of the terrain. Flow direction, for instance, can be used to delineate watersheds. Flow accumulation can be used to derive the stream network, and they have many, many other
02:21
applications which I won't even mention because I'm sure you all know them much better than me. Our problem is to compute these two indices on very large terrains, which have increasingly become available over the past years at finer and finer resolutions. NASA, for instance, is a good
02:42
source of terrain data. The last mission that I know of is the NASA SRTM. It was launched last year, I believe in January. It collected DEM data at 30 meter resolution for the entire Earth, in total about five terabytes. And USGS is another source of data. USGS used to
03:06
have, for free, US data at 10 meter resolution; I'm not sure if that's still true. And very recently there is LIDAR data, which goes down to one meter resolution. These bigger and better resolutions have a profound
03:22
impact on the data that we have to work with and compute on. Just to get an idea, take for instance the Appalachian Mountains, a mountain range in the United States: if we sample it at 100 meter
03:42
resolution, without any compression the raster is about 500 megabytes. At 30 meter resolution it's 10 times bigger, so 5 gigabytes. At 10 meter resolution it's 50 gigabytes. And at 1 meter resolution it's 5 terabytes. So
04:03
that's a lot of data. The next question is: can this massive data be processed with the software that currently exists? According to the experiments we did last year, well, you can process it if you're
04:20
patient enough. We did experiments with GRASS, TARDEM, and ArcInfo. These are software packages that compute flow direction and flow accumulation; they are well established, so we assumed they are among the most efficient. In GRASS, for instance, the most commonly used function is r.watershed. We let it run for 17 days on
04:41
an approximately 7,000 by 4,000 grid, and it didn't finish; I think it finished about 65 percent in those 17 days. Then TARDEM is a suite of programs developed by David Tarboton and his group at Utah State University. It's available now as an extension to ArcInfo. It could handle in a
05:05
couple of days the data set above, the 7,000 by 4,000 one. However, we ran it on a five times bigger data set, about 12,000 by 10,000 cells, and we let it run for 20 days, and then we just killed it, so it didn't finish. All this
05:24
time the CPU utilization was five percent, and the disk was thrashing. And finally, ArcInfo has flow direction and flow accumulation functions. Among these three software packages it is the most efficient; it could handle these two data sets in a couple of days. The problem, however, is that it
05:43
cannot handle anything that's bigger than 2 gigabytes; it just outputs an error and dies. All these experiments were done on the same machine with nothing else running: a Pentium III at 500 megahertz with 1 gigabyte of RAM. So in our previous work, we developed a set of programs that we
06:08
call TerraFlow for computing flow direction and flow accumulation on grids. They are based on theoretically optimal algorithms, and they are quite efficient. They're up to a thousand times faster on very large grids than
06:24
existing software. It's scalable: the largest terrain that we have run is Washington State at 10 meter resolution, in total a bit more than 1 billion elements; it completed successfully on that,
06:41
and it's also more flexible than the others in the sense that it gives the user the option of choosing the type of flow routing; I'll get back to that in a couple of slides. So this is all previous work. What we are reporting at this conference is porting TerraFlow into GRASS, as a function called r.terraflow.
07:02
We have partial results on augmenting TerraFlow with additional features in order to subsume the functionality of the numerous other GRASS flow routing functions, features like outputting plateaus, depressions, topographic
07:22
convergence index, water outlet queries, and watershed basins. And we have partial results on a comparison with the GRASS flow routing functions, both in terms of outputs and performance. So this is what I'm going to talk about. First I'll talk about scalability: what it is and why it
07:43
doesn't happen automatically for programs which work fine on small inputs. Then I'll talk, just very briefly, about the theoretical approach to improving scalability, which is known in theoretical computer science as the area of I/O-efficient algorithms. This is the underlying theory for TerraFlow.
08:04
Then I'll talk briefly about the outline of TerraFlow and, time permitting, I'll go over the GRASS flow functions and show some preliminary comparison results. So, scalability. Programs are in general
08:29
written assuming that data fits in memory. However, when working with massive data, the data does not fit in memory. The programs still run because the operating system, invisibly to the user, puts data on disk and then
08:45
pages it in and out of main memory as demanded by the program execution. All this is invisible to the user. The two key points are that, one, data is moved in blocks between main memory and disk; in other words,
09:03
even if the program just wants to access one item of data, it gets the entire block from the disk that contains that item. And second, accessing the disk is at least three orders of magnitude slower than accessing main memory. So because of this, when processing massive data, typically the
09:25
bottleneck is not the CPU computation but accessing the disk, the so-called disk I/O. And the reason for the disk bottleneck is that applications are
09:41
normally not aware of their data access pattern. The way an algorithm accesses its data matters a lot for performance. Here is a small example: assume main memory is big enough to hold four elements, or two
10:00
blocks. Say we have this array sitting on disk, and two algorithms that access this array in different ways. The first algorithm accesses it in this order; the numbers are not the values of the array but the order in which items are accessed. So when the algorithm accesses the first item, this block is
10:21
brought into main memory. Then the algorithm moves ahead and accesses the second item, it falls in a different block, this block is brought into main memory. The third item falls in a different block, this block again must be brought into main memory, however the main memory is full now. So one block has to be evicted, typically the least recently used, so this block goes out, this
10:43
block is brought in, and so on. Continuing, by the time the algorithm reaches the fifth element, which falls in the first block, that block has already been evicted from memory, so it has to be reloaded. If we follow this example to the end, we see that every
11:02
single item that is accessed causes a block to be loaded from disk into main memory. And now here's a slightly different algorithm which accesses the entire array but in a different order. So now the first item causes the first block to be loaded, the second item falls in the same block, the block
11:24
is already in memory, it gets it for free. The third item is here, this block is loaded into main memory, the fourth item is in the same block, so it can just be accessed. And so on, the fifth and the sixth are in the same block. So
11:42
you see that with this algorithm, every block is loaded exactly once. So the first algorithm causes N block loads, one per item, while the second one causes only N over B, where B is the number of items per block. In this small example B is small, it's 2, but on a real system B is 8K or larger, which means the first algorithm loads a factor of, say, a thousand times more disk blocks into main memory. So this is the basic explanation for non-scalability: the effect of loading disk blocks can be quite severe.
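To make the block-counting argument concrete, here is a minimal sketch (not from the talk; the names and the particular access orders are illustrative) that replays two access orders against a simulated LRU cache of two blocks, with two elements per block as in the example, and counts how many block loads each order causes.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

// Count how many disk blocks must be loaded when the items of an array
// (stored in blocks of blockSize elements) are accessed in the given order,
// with an LRU cache that holds at most cacheBlocks blocks in main memory.
std::size_t countBlockLoads(const std::vector<std::size_t>& accessOrder,
                            std::size_t blockSize, std::size_t cacheBlocks) {
    std::deque<std::size_t> cache;  // block ids, most recently used at the front
    std::size_t loads = 0;
    for (std::size_t item : accessOrder) {
        std::size_t block = item / blockSize;
        auto it = std::find(cache.begin(), cache.end(), block);
        if (it == cache.end()) {                               // not resident: load it
            ++loads;
            if (cache.size() == cacheBlocks) cache.pop_back(); // evict least recently used
        } else {
            cache.erase(it);                                   // resident: just refresh
        }
        cache.push_front(block);
    }
    return loads;
}

int main() {
    // 8 items, blocks of 2 elements, memory holds 2 blocks (4 elements).
    std::vector<std::size_t> jumping = {0, 2, 4, 6, 1, 3, 5, 7};   // jumps between blocks
    std::vector<std::size_t> sequential = {0, 1, 2, 3, 4, 5, 6, 7};
    std::cout << "jumping:    " << countBlockLoads(jumping, 2, 2) << " block loads\n"
              << "sequential: " << countBlockLoads(sequential, 2, 2) << " block loads\n";
}
```

With these inputs the jumping order should report 8 block loads (one per item, i.e. N) and the sequential order 4 (N over B).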
12:23
And here is a more detailed look into the same set of experiments that I talked about a few slides back, this time just with r.watershed. We ran r.watershed on data sets ranging from a thousand by a thousand elements, or 1.6 million elements in total, up to the 7,000 by 4,000 data set; well, we couldn't run it on that one.
12:46
If the data set is small, 12 minutes, no problem; as the data set gets larger, five days, and then 26 days, estimated. So this is the effect
13:03
of the data access pattern. The conclusion is that however good the operating system is, it cannot change the data access pattern of the program. So if the program is written without taking care of its data access pattern, then it will not be scalable. The theoretical approach to scalability is
13:24
to completely redesign the algorithm in order to optimize the number of blocks that are transferred between disk and main memory, and this is known as the area of I/O-efficient algorithms. The basic idea is
13:41
to load each block ideally just once and to do all the computation on all the elements in that block while the block is still in main memory. So this is the theory underlying TerraFlow: the complexity of an I/O-efficient algorithm is not only the CPU complexity but mainly the
14:02
number of blocks that are transferred. So, coming back to the problems that we want to solve, this is the outline of TerraFlow. It starts with computing the flow directions. The way flow directions are computed on grids is by looking at a 3x3 window around the current cell, and there are two
14:26
options: either assign the flow direction to the steepest downslope neighbor, or assign flow directions to all the downslope neighbors. The first is known as the single flow direction model, or D8; the second is the multiple flow direction model, or D
14:41
infinity. Now, flow directions can be computed easily for cells which have downslope neighbors: just look at the 3x3 window. The problem is the existence of flat areas, where you cannot compute the flow direction of a cell just by looking at its neighbors, because the neighbors all have the same height.
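As a concrete illustration of the D8 rule just described, here is a minimal in-memory sketch, not TerraFlow's I/O-efficient implementation; the Grid type and the direction encoding are made up for the example. It assigns each cell the direction of its steepest downslope neighbor in the 3x3 window and leaves flat cells unresolved.

```cpp
#include <cmath>
#include <vector>

// A grid of elevations, stored row-major.
struct Grid {
    int rows, cols;
    std::vector<float> elev;                      // rows*cols elevations
    float at(int i, int j) const { return elev[i * cols + j]; }
};

// D8 flow directions: for each cell, the index 0..7 of its steepest downslope
// neighbor, or -1 if no neighbor is lower (flat area or sink), to be resolved
// globally in a later step.
std::vector<int> d8FlowDirections(const Grid& g) {
    static const int di[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    static const int dj[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    std::vector<int> dir(g.rows * g.cols, -1);
    for (int i = 0; i < g.rows; ++i) {
        for (int j = 0; j < g.cols; ++j) {
            float best = 0.0f;                    // steepest downslope gradient so far
            for (int k = 0; k < 8; ++k) {
                int ni = i + di[k], nj = j + dj[k];
                if (ni < 0 || ni >= g.rows || nj < 0 || nj >= g.cols) continue;
                float drop = g.at(i, j) - g.at(ni, nj);
                // Diagonal neighbors are sqrt(2) times farther away.
                float dist = (di[k] != 0 && dj[k] != 0) ? std::sqrt(2.0f) : 1.0f;
                float slope = drop / dist;
                if (slope > best) {
                    best = slope;
                    dir[i * g.cols + j] = k;      // steepest downslope neighbor
                }
            }
        }
    }
    return dir;                                   // -1 entries remain for flat cells
}
```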
15:00
So you have to route flow globally on the entire flat area, and we distinguish between two types of flat areas: the plateaus, flat areas from which, if you pour water on them, the water spills out, and the bottoms of depressions, or sinks; if you pour
15:23
water on this type of flat area, the water does not drain out. So there's no obvious flow direction on flat areas. On plateaus, TerraFlow routes flow basically by assigning flow directions so that each cell flows
15:44
towards the closest spill point of the plateau. So water is globally routed towards the spill points of the plateau.
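One way to picture this plateau step is as a multi-source breadth-first search: start from the plateau's spill cells and grow inward, so that every plateau cell ends up pointing one step towards its nearest spill point. The following is only an in-memory sketch of that idea, not the I/O-efficient computation TerraFlow actually performs; the masks and the direction encoding (matching the earlier D8 snippet) are illustrative.

```cpp
#include <queue>
#include <utility>
#include <vector>

// Route flow across one plateau by a multi-source BFS from its spill points:
// every plateau cell ends up pointing to the neighbor through which it reaches
// the nearest spill point. 'plateau' marks the cells of this flat area and
// 'spill' marks its spill cells (plateau cells with a lower outside neighbor).
void routePlateau(int rows, int cols,
                  const std::vector<bool>& plateau,
                  const std::vector<bool>& spill,
                  std::vector<int>& dir) {               // dir: 0..7 or -1
    static const int di[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    static const int dj[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    std::vector<bool> visited(rows * cols, false);
    std::queue<std::pair<int, int>> q;
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            if (spill[i * cols + j]) {                    // BFS sources: spill points
                visited[i * cols + j] = true;
                q.push({i, j});
            }
    while (!q.empty()) {
        auto [ci, cj] = q.front(); q.pop();
        for (int k = 0; k < 8; ++k) {
            int ni = ci + di[k], nj = cj + dj[k];
            if (ni < 0 || ni >= rows || nj < 0 || nj >= cols) continue;
            int idx = ni * cols + nj;
            if (!plateau[idx] || visited[idx]) continue;
            visited[idx] = true;
            // Point the newly reached cell back towards the cell that found it,
            // i.e. one step closer to the nearest spill point.
            dir[idx] = 7 - k;                             // opposite of direction k
            q.push({ni, nj});
        }
    }
}
```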
16:01
Now for sinks: after this step, the only cells for which a flow direction has not been assigned are the sinks. On sinks there are two options. Either catch the water inside the sink and don't let it go out, basically assigning flow directions towards the center of the sink, which is pretty much the same as on a plateau. The other option, which is more theoretically interesting
16:23
and also seems to be preferred in the literature, is to route water out of the sink using uphill flow directions, and this is normally done by simulating flooding of the terrain. Flooding gradually fills the pits and pockets, and basically at the end of flooding every point in the terrain has a
16:44
downslope flow path to the outside. There are no sinks anymore, and we can assign downslope flow directions on the flooded terrain and map them back onto the original terrain.
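The flooding step can be sketched with a priority-queue flood fill, sometimes called priority flood: grow inward from the grid boundary, always expanding the lowest cell seen so far, and raise every cell that lies below the current water level. Again, this is only a small in-memory illustration of the flooding idea, not the I/O-efficient flooding algorithm used by TerraFlow.

```cpp
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Fill depressions so that every cell has a non-ascending path to the grid
// boundary ("flooding"). In-memory only; elevations are modified in place.
void floodTerrain(int rows, int cols, std::vector<float>& elev) {
    using Item = std::tuple<float, int, int>;                     // (elevation, row, col)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    std::vector<bool> done(rows * cols, false);
    auto push = [&](int i, int j) {
        done[i * cols + j] = true;
        pq.push({elev[i * cols + j], i, j});
    };
    for (int i = 0; i < rows; ++i) { push(i, 0); push(i, cols - 1); }       // boundary
    for (int j = 1; j < cols - 1; ++j) { push(0, j); push(rows - 1, j); }
    static const int di[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    static const int dj[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    while (!pq.empty()) {
        auto [h, ci, cj] = pq.top(); pq.pop();                    // lowest cell seen so far
        for (int k = 0; k < 8; ++k) {
            int ni = ci + di[k], nj = cj + dj[k];
            if (ni < 0 || ni >= rows || nj < 0 || nj >= cols) continue;
            int idx = ni * cols + nj;
            if (done[idx]) continue;
            if (elev[idx] < h) elev[idx] = h;                     // raise to the water level
            push(ni, nj);
        }
    }
}
```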
17:03
So this is the idea; there is nothing new in any of these steps that TerraFlow does. Let me talk about flow accumulation for a second. After flow directions are computed, flow accumulation is basically the process of counting how much water goes through each point, assuming there's
17:25
initially one unit of water on each cell. And as I started saying, the emphasis of TerraFlow is not the flow modeling, all of which is already known, but the computational aspects: we can solve all these steps efficiently by modeling them as graph problems or geometric problems.
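With single flow directions, flow accumulation can be sketched as propagating water through the flow graph in topological order: every cell starts with one unit and passes everything it has received to its downstream neighbor. A minimal in-memory sketch follows, assuming a D8 direction grid like the one from the earlier snippet; it is not the sweep-based I/O-efficient method TerraFlow uses.

```cpp
#include <vector>

// Single-flow-direction accumulation: every cell starts with one unit of
// water and passes everything to its D8 downstream neighbor. Cells are
// processed in topological order using in-degrees (number of upstream cells).
std::vector<double> flowAccumulation(int rows, int cols,
                                     const std::vector<int>& dir) {  // 0..7 or -1
    static const int di[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    static const int dj[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    int n = rows * cols;
    std::vector<double> acc(n, 1.0);              // one unit of rain per cell
    std::vector<int> indeg(n, 0);
    auto downstream = [&](int idx) {
        if (dir[idx] < 0) return -1;              // sink or unresolved cell
        int i = idx / cols + di[dir[idx]], j = idx % cols + dj[dir[idx]];
        return (i < 0 || i >= rows || j < 0 || j >= cols) ? -1 : i * cols + j;
    };
    for (int idx = 0; idx < n; ++idx)
        if (int d = downstream(idx); d >= 0) ++indeg[d];
    std::vector<int> ready;                       // cells with no remaining inflow
    for (int idx = 0; idx < n; ++idx)
        if (indeg[idx] == 0) ready.push_back(idx);
    while (!ready.empty()) {
        int idx = ready.back(); ready.pop_back();
        int d = downstream(idx);
        if (d < 0) continue;
        acc[d] += acc[idx];                       // pass all water downstream
        if (--indeg[d] == 0) ready.push_back(d);
    }
    return acc;
}
```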
17:43
So in its current form, r.terraflow takes as input an elevation grid and outputs a flow direction grid, a flow accumulation grid, a topographic convergence index and a
18:04
depression grid. The user has the option of choosing between D8 and D-infinity flow directions, and for flow accumulation there is the option to switch to single flow direction routing once the flow of a cell exceeds
18:21
a certain threshold that the user can specify. And this is how it looks. Note that there is a main memory value that the user has to give as input. This is the amount of main memory that will be used at all
18:43
times by TerraFlow, so ideally the virtual memory system will never be in use. The user basically has to underestimate the amount of main memory
19:01
that's available on the machine and then give that value as the parameter. These are the flow functions that are available in GRASS. As I said, the most commonly used one is r.watershed. It was implemented at the US Army CERL. It takes
19:22
as input an elevation grid and outputs flow direction, flow accumulation and a lot of other parameters, including watersheds, stream segments, slope lengths and steepness. It subsumes r.drain and r.water.outlet; I won't talk about those.
19:42
Then there's r.basins.fill, which was written by Larry Band. It takes as input a stream network and a thinned ridge network; I'm not sure what that is, but it has to be digitized by hand, so I couldn't really run experiments with this one. It produces watershed subbasins. Then there's r.topmodel
20:02
and r.topidx. They simulate TOPMODEL and output flow direction, filled elevation, TCI, watersheds and so on. The problem is that they take as input the TOPMODEL parameter files, which I'm not really sure what that is, so I don't have any experiments with this either. And then finally there's r.flow
20:24
and r.flowmd, which use a slightly different model. And then there are more complex models which I don't know much about, like r.water.fea and r.hydro.CASC2D, but I assume that since they are more complicated they're also
20:41
less efficient. So basically the only experimental results that I have now are with r.watershed and r.terraflow. As data sets we have the same grids, some of which you've already seen, ranging from a
21:01
thousand by a thousand, four thousand by a thousand, and then up to eleven thousand by eleven thousand. In other words, that's from about 1.6 million elements to about 122 million elements. These experiments were done just two weeks ago. We have a much better machine now: a Pentium III with two
21:22
processors, one gigahertz each, and one gigabyte of RAM. Okay, r.terraflow takes from about two minutes, five minutes, 20 minutes, up to 3.5 hours.
21:41
r.watershed takes 9 minutes, 93 minutes, 18 hours, and then on the biggest one we let it run for six days and it didn't finish 1%. So these are just some pictures. This is the Panama DEM, and the flow accumulation computed by
22:04
TerraFlow. You cannot see much, so I zoomed in, and then zoomed in on a smaller piece. This is the flow accumulation computed with multiple flow directions, then with single flow directions, and then two-
22:23
dimensional pictures of the same area. This is SFD, MFD, and the topographic convergence index on a different color map, I assume. And then these are some pictures of a flat DEM, for which we don't have time right now. So let me
22:42
end here. Thank you. Thank you for your clear presentation. Any questions? I have two questions. The first one is whether you could affect the
23:14
amount of water that passes from one pixel to the other by the amount of water that is infiltrated into the ground. And the second one is... The first question is what?
23:26
If I take that into consideration. It's not about modeling, it's about the computational aspects. Dealing with an infiltration raster is not really complicated; it just means you read in something else and account for it. So... Then when you have that, you would run your
23:56
hydrologic model with infiltration and all the things like that. Okay, thank
24:01
you. And the second one is: why do you kill programs when they don't work? Because there is nothing else. And r.watershed still works for most people's applications. There aren't yet that many people who are working with
24:20
grids like that. But the number of people who really have data of that size is increasing, so we are trying to be ahead of the times. Okay. Well, so you
24:42
would recommend to GRASS developers that, when working with large data sets, we should not rely on the computer's virtual memory system but manage ourselves the exchange between this kind of swap memory and the actual memory. And so, how can we assess the proper disk geometry to have
25:06
better I/O performance? Well, it's hard to answer this in just a few words. This area of I/O-efficient algorithms does exactly that: all data is placed in files, and all the transfer between main
25:24
memory and disk is done by the algorithm so that the virtual memory is never in use. And that's something that you can do just by completely redesigning the algorithm. So yes, it can be done. And so the disk
25:42
geometry, to have better I/O properties, how do you know that? Geometry? What do you mean, the geometry of the disk? Well, no, because often a computer is not fast enough to read one sector or another on the disk. So the size of the buffer depends to some extent on the disk properties. How do you know the best
26:06
buffer size according to a disk? Right. So those are factors that matter less. The main thing is to have data locality, temporal and spatial locality. So from that point of view, the algorithms are optimal.
26:20
Where exactly in which disk sector the data goes, that's not so important. We don't know the SRTM data because I think your work is with
26:44
SRTM data, but we are trying to get hold of it. And I don't know how to answer that question. Some of the data sets I have from Helena. And actually, for all of the data sets, either we bought the data or we got them from GIS people in the
27:02
School of the Environment at Duke. Thanks. I would like to ask you: do you think that this input/output-efficient algorithm approach can also be used for other GIS modules in GRASS? And the second question is
27:23
whether, in future improvements of your program, you could extract the value lines from the flow accumulation? Yeah, because r.watershed also produces
27:40
value lines, so you can extract rivers and so on. Okay, so the first question was if this can be applied to other problems? Well, yes, we certainly hope so, and that's actually on the conclusion and future work slide. If you know of any other I/O-intensive or I/O-
28:02
bound applications, we'd be very happy to hear about that. The other question was the extraction of the value lines. I don't know too much about that, so if you want to tell me how it works, I can certainly think of an algorithm. Thank you. I have to ask to stop the
28:27
discussion because