
Python Raster Processing on Serverless Architecture: Using python tools on AWS Lambda


Formal Metadata

Title
Python Raster Processing on Serverless Architecture: Using python tools on AWS Lambda
Number of Parts
208
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Transcript: English (auto-generated)
I'm just going to get started. People might be trickling in in the back there. So good afternoon. Thanks for coming to my talk on raster processing for serverless architecture. I'm going to give an introduction to what serverless architecture means, what the implementation of it looks like on the Amazon Web Services platform, and show that despite some significant restrictions,
it can be a performant and cost-effective way to do raster processing services. I've got a lot of material to work through, so I'm going to try to talk really fast, but I'll be around for the rest of the conference. I'd be happy to get into more detail with people later. My name is Matt McFarland. I'm a software developer in Philadelphia. I work at a company called Azavea, and we do geospatial web application development
and data analytics. The inspiration for this talk came out of the work that we do there. So what is serverless architecture? The very short answer is that it's a cloud computing platform. It executes code, and it's distinguished by having managed infrastructure that abstracts away the server in the traditional sense. There are, of course, still servers.
Very sophisticated servers running sophisticated software. So it's helpful to be aware of how the service operates, but you don't have to manage it. In most cases, you don't stay up at night worrying about it. So, some details on what that means. It's an execution model which removes the need to predetermine the compute capacity necessary to execute a task at scale.
The platform manages all the underlying resources needed. The function is the unit of work in this architecture. There's no expectation that there'll be a single function in your code, just that there'll be a single function that's invoked as an entry point, and that it's stateless. There's no capacity for long-running processes, so you're not running a web server or listening on a socket connection.
The platform will invoke your function when it's time to execute, and when that function exits, your code is no longer running. Any state loaded is lost. The function is invoked by means of an event being triggered, meaning that you can register a function to be executed when something happens in your system. And the cost of the function is only for the actual time executing it.
There's no pre-purchase capacity, like in a traditional VM or EC2 instance. This can be very cost-effective if you expect your function to have periods of being underutilized or idle. If nothing invokes your function for a weekend, you don't pay anything for having that potential capacity. The big deal is that you don't create provision or manage the infrastructure
on which the function is executed. It's not an exact analogy, but it might be helpful to visualize your function existing as a container image, like a Docker container. In that analogy, you're providing the contents of the container, and the platform is handling the infrastructure and the scheduling of it in real time. The resource management also involves scaling your function in response to multiple events.
So as events stream in, the platform will create instances of your function to respond to them. The function itself is not aware of the load the service is under. Each is just responding to a single event. If 10 simultaneous requests arrive, 10 instances are spun up. In practice, an instance usually sticks around for a few minutes, so they can be reused,
but that's an optimization that the platform is making, and you should always consider a function to be ephemeral. So with the serverless architecture, you're no longer engineering the load balancing, scaling policies, failover rules, et cetera, that are otherwise necessary for a resilient application. Okay, brief interlude. In the era of short attention spans and 140 character limits, I thought it might be best to express the real benefit
of this architecture with a short emoji-filled vignette. So this is how it works. Your team has been working through the night, features are in, tests are passing, and with no small sense of relief you push to production. Even better, people now appear to be using it; in fact they love it. That's great,
which for some reason keeps popping into your head at night when you're trying to fall asleep as you ask yourself: is my service gonna stay up? The beauty of not having to manage infrastructure is that it's somebody else's problem, and not just their problem, but their actual job. You get to focus on just writing the code. So back to the real presentation. AWS Lambda is the incumbent serverless provider,
though similar functionality exists on other platforms. It operates just like the overview I gave. At a high level, in the context of Python, it works like this. First, you write a handler function; this receives the event and has to conform to a signature that's dictated by the platform. Next, you implement your logic as you normally would, just like writing a Python script. Just remember that it needs to be stateless,
so lean on external services for storage, things like that. Write it so you can run and test it locally. Choose your dependencies judiciously, because you have to zip them up with your code to create a deployment bundle, the size of which has implications for the overall performance of your function. The last step is to create the Lambda resource via CloudFormation, along with any other additional AWS resources
like API Gateway endpoints, S3 buckets, et cetera. You upload the whole lot and wait for a stack to be created. Some specifics on what you can configure in the service: you choose the runtime your function operates on (Python, Node, Java, .NET), and you do have access to a portion of the Linux file system. You also choose the coarse resources
available to your function: the amount of memory, from 128 megs to one and a half gigs; CPU and networking are increased proportionally with that, and you don't actually get to set them explicitly. And there is a catalog of events from within the AWS ecosystem that you can register your Lambda function with, as well as invoke it programmatically
or set it up to run on a cron-like schedule. Some examples are when an object is added to an S3 bucket, an entry is written to a DynamoDB table, or an HTTP request comes through the API Gateway. So you trade away control of and responsibility for a lot of DevOps work, and you get a scalable, fault-tolerant compute environment without a lot of engineering effort.
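To make the handler contract concrete, here's a minimal sketch. The `handler(event, context)` signature is what Lambda's Python runtime expects; the event shown is a hypothetical, trimmed-down S3 notification payload, not something from this talk.

```python
import json

def handler(event, context):
    """Entry point invoked by the Lambda platform. `event` carries the
    triggering payload; `context` carries runtime metadata."""
    # An S3 trigger delivers a list of records describing what changed.
    keys = [r["s3"]["object"]["key"] for r in event.get("Records", [])]
    # Do the stateless work here, then return a JSON-serializable result.
    return {"statusCode": 200, "body": json.dumps({"processed": keys})}
```

Because it's just a function, you can invoke it locally with a fabricated event, which is exactly the "write it so you can run and test it locally" advice above.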
However, you have to live within the rules. So: no long-running jobs, no memory-hungry jobs, no jobs that return a large payload directly, and nothing that writes anything substantial to disk. It's a challenging environment to do computation in, especially with rasters, but there are workarounds for all of these limitations.
As I mentioned before, when you deploy a Lambda function, you're responsible for delivering the code and all of its dependencies outside of the standard library. This means you're not apt-get, yum, or pip installing anything as part of a provisioning step. If your function depends on anything, it must be in this deployment package, and if it had to be compiled, it has to be compiled for the Amazon Linux distribution.
That deployment bundle can only be 50 megabytes compressed as well, and the overall size of that bundle affects what's called the cold start time of your function. This is the time that it takes for the platform to spin up an instance of your function to respond to an event. The larger the bundle, the more time that it takes for Lambda to create it.
You don't incur this latency when there's a recent instance already available, so, counter-intuitively, a higher-use app will tend to have less latency than one that has lower usage, because there are more hot functions loaded. A few quick words on cost. You're charged for each request that comes in, plus for every 100 millisecond block of time
that it executes, and the price increases with higher memory configurations. It's pretty inexpensive, and when you consider the free tier, which is perpetual, a low-use function could conceivably run at no cost. However, for a very high-use app, the costs are essentially unbounded. An example straight from the Amazon pricing page for this service is: if you allocate 512 megabytes of memory
to your function, execute it three million times in one month, and it executes for one second each time, it's about 20 bucks. Keep in mind that if you're using other AWS resources in the process of executing it, you're responsible for those costs as well. AWS pricing can be tricky, so just make sure you do your homework there.
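As a sanity check, that example works out like this, using the published prices in effect around the time of this talk ($0.20 per million requests, $0.00001667 per GB-second, with a perpetual monthly free tier of one million requests and 400,000 GB-seconds); verify against the current AWS pricing page before relying on it.

```python
# Lambda pricing as of this talk; check the AWS pricing page for current rates.
PRICE_PER_REQUEST = 0.20 / 1_000_000   # $0.20 per million requests
PRICE_PER_GB_SECOND = 0.00001667
FREE_REQUESTS = 1_000_000              # perpetual free tier, per month
FREE_GB_SECONDS = 400_000

def monthly_cost(memory_gb, invocations, seconds_each):
    """Estimated monthly Lambda bill for one function (other AWS services excluded)."""
    gb_seconds = memory_gb * invocations * seconds_each
    compute = max(gb_seconds - FREE_GB_SECONDS, 0) * PRICE_PER_GB_SECOND
    requests = max(invocations - FREE_REQUESTS, 0) * PRICE_PER_REQUEST
    return compute + requests

# 512 MB, three million invocations, one second each: the "about 20 bucks" case.
print(round(monthly_cost(0.5, 3_000_000, 1), 2))  # -> 18.74
```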
So how do you know if it's an appropriate service for your application? One, is it latency tolerant? Meaning you can expect some variability in startup times due to these cold starts. With the code that I'm working with, with a lot of the geospatial libraries loaded, I can see up to four seconds, plus the time that it actually takes for the function to execute. So your application needs to be able to accommodate that.
If your traffic is constant or predictable, you may be able to engineer solutions that are more cost-effective or robust. But when there are times of low usage that may spike suddenly, per-execution billing can be really helpful. And it has to be short-running: five minutes is plenty for web requests, but it's not a lot of time for some heavy computational loads, especially given the relatively low resources
that are available to you in Lambda. If you don't have access to a lot of operations or infrastructure experience, this could be an easy way to provide some scalability and fault tolerance. Some common use cases: ETL and data-processing pipelines, like imports and exports based on a request or an upload, or just generally implementing microservices,
small, modular network services that are loosely coupled and scale independently of each other. Two additional use cases I want to focus on involve working with raster datasets. On projects that I work on at Azavea, it's typical to need to perform some on-the-fly analysis, like zonal statistics for a user-supplied geometry,
something like reporting on the distribution of land cover values within a watershed, or creating a priority raster based on some weighted overlay operation between two source datasets. To support that kind of analysis with a visualization, we often will generate raster map tiles as well. This usually involves a pretty robustly engineered solution to provide this processing outside of the web servers
that can scale to meet demand, run in multiple availability zones, and yet not come at too burdensome an operating cost, which is a tall order, but one that actually fits pretty nicely into the AWS Lambda use case. So a critical piece that enables the actual raster processing are a suite of open source Python libraries.
Many of you are probably familiar with these: Rasterio, developed by the folks at Mapbox, and NumPy. One could do a really great talk on just these libraries, but in the interest of time, I'm gonna have to do a brief and pretty inadequate overview of them, so I can talk about how to get them running in Lambda. So briefly, Rasterio: it's the foundation of this work. It's a nice Pythonic API over
GDAL. It does IO on rasters, plus provides some additional raster processing routines. It reads in raster data as NumPy arrays to do further processing, and it helps do things like mask the arrays so you're not processing NoData values, or values that aren't contained in your zonal geometry. What makes it especially suited to run on Lambda
is the ability to do windowed reads, meaning it can start and stop reading bytes at a defined offset within a file, and it can read those windows over a network protocol like S3. NumPy is a Python library for working efficiently with large multi-dimensional arrays, and a raster is kind of just a 2D array where the cell values are basically geo-referenced.
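As a tiny illustration of that masking idea (the values here are made up), NumPy's masked arrays let you drop NoData cells, and in the real workflow the cells outside the zone geometry, before computing statistics:

```python
import numpy as np

NODATA = 255

# A tiny stand-in for a windowed read: a 4x4 land-cover raster chunk.
chunk = np.array([
    [11, 11, 21, NODATA],
    [11, 21, 21, NODATA],
    [31, 31, 21, 11],
    [NODATA, 31, 11, 11],
], dtype=np.uint8)

# Mask NoData so it is excluded from any statistics.
masked = np.ma.masked_equal(chunk, NODATA)

# Zonal-style statistic: count of cells per land-cover class.
values, counts = np.unique(masked.compressed(), return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # -> {11: 6, 21: 4, 31: 3}
```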
It's not a GIS library, but with the right techniques, you can perform traditional raster operations using the numerical operators in Python. There's actually a number of additional libraries to support this kind of raster processing work, just a few to mention: Shapely for vector manipulation, Pyproj for reprojection, and Pillow, a Python imaging library
that I use to generate PNGs on the fly from these rasters. So I'm gonna try to cram all that information together and show how to set up a Lambda function to do zonal raster calculations and tiling. First step is to optimize your rasters for this environment. You're gonna wanna target S3 for storage. The rasters are not gonna fit in your deployment bundle.
You only get 50 megabytes, and you have a bunch of dependencies in there. And we're only gonna want a part of them at a time anyway. So we're gonna put them on S3. Since you pay for storage monthly and you wanna give Rasterio the fastest reads possible with the fewest bytes on the wire, you have an incentive to use aggressive compression options. We were working with the 10-meter DEM national dataset,
and it's around 700 to 800 gigabytes in the original IMG format. We got it down to about 250 gigabytes of geotiffs while improving read speeds over the network. Geotiffs also support an internal tiling scheme that Rasterio will take advantage of when doing windowed reads, and that has a huge impact on read time.
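In Rasterio terms, the internal tiling plus aggressive compression described here are just creation options on the output profile. This is a sketch; the exact block size and compression codec are choices you'd tune for your own data (DEFLATE with a predictor tends to suit continuous data like DEMs):

```python
# Hypothetical creation profile for S3-optimized GeoTIFFs: internally tiled
# and aggressively compressed, so windowed reads pull few bytes off the wire.
optimized_profile = {
    "driver": "GTiff",
    "tiled": True,          # internal tiling enables cheap windowed reads
    "blockxsize": 512,      # tile dimensions must be multiples of 16
    "blockysize": 512,
    "compress": "deflate",  # or "lzw"; aggressive compression cuts S3 costs
    "predictor": 2,         # horizontal differencing helps continuous data
}
```

You'd merge a dict like this into the source dataset's profile before writing the optimized copy.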
It also plays nicely with virtual datasets, VRTs. So if you're dealing with a very large dataset, you can chunk it up and create a VRT which points to the individual files on S3. Under heavy load, this can actually improve read performance, because S3 will automatically partition the files in a bucket across its internal storage.
So now that your rasters are in top shape, the next thing to focus on is optimizing the techniques you use to read them in. Doing windowed reads over HTTP is what really enables this architecture to work. Quick overview: a windowed read is essentially defining a rectangle over a larger raster and reading in just the bytes that coincide with that rectangle. On Lambda, we use this technique to read
a polygonal area from a large raster stored on S3, so we're not loading data that we don't need or that wouldn't fit inside the Lambda environment at all. This is what it looks like visually. I've got a national-scale land use raster, and I've defined a region from a polygon for which I wanna generate some statistics. We define a bounding box over that geometry, and that rectangle identifies a strip of bytes
that we're gonna read out of the raster. Since we're using Rasterio, it's gonna get read in as a 2D NumPy array. The window read was rectangular, but it masked the elements which did not overlap with the actual geometry. So those values are not gonna be included when we do certain NumPy operations. So what if the window size would pull in data
that does exceed the memory available to your function, or even if it doesn't exceed the memory, doing something like generating statistics is actually very paralyzable. Technique I use to accommodate that scenario is to define a polygon area size threshold. So for any input that exceeds that threshold, chunk the initial polygon into smaller geometries which you can then read in sequentially
accumulating the statistics. To demonstrate this with the previous example, we can create a grid over the original bounding box, creating a bunch of new polygons; these are gonna be the new windows. Then you can test each one of these to see if it intersects at all with the geometry, and discard the ones that don't, since there's no reason to read in those bytes.
We can also reduce the size of the new bounds by defining each new window based on the intersection of the grid cell and the original geometry. I highlighted a few in yellow, just to show that this is a way to chop off bytes that you don't care about. So now, instead of one large window read, you have several small ones, and potentially significantly fewer bytes to read in.
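The gridding itself needs no GIS machinery. Here's a sketch that chops a bounding box into tile bounds and yields them lazily, so only one window is in play at a time; the intersection test against the real geometry, done with Shapely in practice, is left out.

```python
def grid_windows(bounds, max_size):
    """Split a (xmin, ymin, xmax, ymax) bounding box into tile bounds at most
    max_size on a side, yielding them one at a time (a Python generator, so
    only one window is ever materialized while accumulating statistics)."""
    xmin, ymin, xmax, ymax = bounds
    x = xmin
    while x < xmax:
        y = ymin
        while y < ymax:
            yield (x, y, min(x + max_size, xmax), min(y + max_size, ymax))
            y += max_size
        x += max_size

# A 25-by-10 bounding box chunked into tiles no bigger than 10-by-10:
print(list(grid_windows((0, 0, 25, 10), 10)))
# -> [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 25, 10)]
```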
The operation is fast enough that you can just execute this within a Python generator: the windows yield out sequentially, and you never have more than one chunk of data in memory at a time, so you can fit inside the resources that Lambda gives you. Or you can try to parallelize the task. You can do multiprocessing in Lambda;
there are a few restrictions on the number of threads and processes you can spawn, but it's kind of an anti-pattern anyway. Instead, you treat the function itself as the unit of parallelization: you split up the work, as I just described, into these new windows and asynchronously invoke that same function, passing each invocation a single new window. Lambda will spin up as many instances as are invoked,
so in practice, you have this massively parallelized and distributed computation over your rasters, but without managing any kind of costly cluster. In some examples that I have, I spin up hundreds of Lambda functions to respond to a single request. There is, of course, some overhead in invoking these functions,
so you'll have to find the inflection point to see if it's worthwhile for your data. The last trick for optimizing your reads is for when you're generating visual map tiles. You do that with what Rasterio refers to as a decimated read. Say you have a national-scale raster and you want to create a tile at zoom level zero: basically the whole country is gonna fit in there, and that's gonna be a really expensive read,
and you don't need most of that data, since it's all getting sampled down anyway. Instead, by restricting the size of the array that you're reading into (for map tiles, probably 256 by 256), Rasterio does nearest-neighbor resampling as it reads, so you end up with a light tile read that's fast at any scale and makes use of overviews if they've been generated for that raster.
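In Rasterio this amounts to passing a small `out_shape` to `read()`. The underlying nearest-neighbor idea can be shown in plain NumPy by sampling every n-th cell; this is a sketch of the concept only, since a real decimated read never loads the full array and ignores details like overviews and partial tiles.

```python
import numpy as np

def decimate(arr, out_shape):
    """Nearest-neighbor downsample of a 2D array to out_shape, roughly the
    result a decimated read produces."""
    rows = np.arange(out_shape[0]) * arr.shape[0] // out_shape[0]
    cols = np.arange(out_shape[1]) * arr.shape[1] // out_shape[1]
    return arr[np.ix_(rows, cols)]

# A large array sampled down to one 256x256 map tile.
big = np.arange(1024 * 1024).reshape(1024, 1024)
tile = decimate(big, (256, 256))
print(tile.shape)  # (256, 256)
```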
Managing the AWS resources with raw CloudFormation (I'm not sure if people have used that) is relatively difficult, so instead I like to use a tool called the Serverless Framework. That's a command-line tool which uses a concise YAML configuration file to represent everything that you would otherwise put in your CloudFormation template,
and it creates and updates your stacks, so you can use it to define and deploy your functions. It also has a nice feature to package up all of your dependencies into the code bundle for you, but don't use that, because we can optimize that ourselves. In fact, if you do use it with all the libraries that I've mentioned, it'll fail, because the package size will exceed the limits
that Lambda allows. So how do you manage your dependencies? First, pip install everything you need into a virtual environment. You don't want any unintended modules to be included. If your libraries require building C libraries, make sure that you execute that script in the Amazon Linux Docker container, so that it's built against the right distribution.
For the packages I've mentioned that do have C libraries (Rasterio, Shapely, NumPy), the authors have conveniently built manylinux distributions, so they come packaged with binaries that are already suitable for running on Lambda, and you don't have to do anything special with those. You can exclude certain packages that are pre-installed, like Boto3. Shapely and Rasterio also have some duplicated C libraries between them,
so as part of your script, when you're generating the zip file, you can replace one copy with a symlink to the other, and if you use the appropriate flag on the zip command (-y, I think), the symlink will be maintained inside the zip and once extracted on Lambda. This can take megabytes off of your bundle size. I also use high compression options here,
just to keep it as small as possible. Okay, so some demos. I think I have a few minutes left. It would be sort of hypocritical of me to talk up the auto-scaling features of Lambda and not put the link to what I'm gonna click on out to the crowd for my hastily prepared demos, so feel free to poke around and see if I regret that. At worst, things might queue up and slow down a little bit,
but there's no servers that I am running that I'm worried about crashing.