
A COG In The Machine - Using Cloud Optimised GeoTiffs to Query 24 Billion Pixels In Real-Time


Formal Metadata

Title
A COG In The Machine - Using Cloud Optimised GeoTiffs to Query 24 Billion Pixels In Real-Time
Number of Parts
295
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
How do you find one pixel in a billion? Cloud Optimized GeoTiffs are a new standard for raster data that support file-level access via the internet. Combined with serverless cloud technologies, raster data can now be queried by client-facing applications without the need for a spatial database or specialist server software. In this talk I present how we used COGs and serverless to build a fast and scalable application to query large raster data using point and polygon geometries. As well as providing an overview of the solution architecture, I'll explore the challenges we face with large raster data and why we chose to develop the solution using these open source standards and technologies.
Transcript: English (auto-generated)
All right. So it's now time to ask Thomas Holderness to give his talk, "A COG in the Machine".
So this is awesome. And how's the audio? Good? Yep. So thanks. My name is Thomas.
I work at Address Cloud and I lead the technical development of our solution. Today I'm going to talk about how we use COGs, which is a little bit different to some of the other use cases that we've seen. And it's been great to see so many talks about COGs at this FOSS4G.
So really briefly in one slide, Address Cloud provides geocoding and risk intelligence for businesses around the world, primarily in the financial sector. And we work with a lot of insurers so that we can tell insurers how risky a property may be for underwriting
purposes. That's one of our key markets. And we also work with commercial markets for deliveries and logistics. So today I'm going to focus on the use case of risk profiling. So this is our Maps application. And in this scenario, an insurer has dropped a pin on
the map and has turned some information on that we have from a data vendor about flooding and flood severity. And we can see that this location, which doesn't have an address, it's a greenfield site. So we know where it is geographically, but we haven't got an address attached to that location. That location has a really high flood risk. So we
can see it's got really, really high flood scores. So an insurer will be very unlikely to want to write that location. Now, of course, we've got awesome data from our data providers about flooding, about subsidence, about windstorm, about trees nearby and then
intelligence about the building itself. Obviously, that's a large amount of data in our back end. So we need to be able to query that information very effectively and very, very quickly, so that we can tell that insurer in almost no time at all what the risk is likely to be at that property. And so all of the processes that we do are API-led.
So that's our Maps application, but we have an API that that application is built on, and that some of our customers also connect to. The documentation is available at docs.addresscloud.com if you're interested in the things that we do and the kinds of data sets that we bring together. And so we work with lots of different data
partners that have varying data requirements. But obviously, most GIS data comes in one of two flavors, as we all know, vector and raster. And so we have an internal process of taking all of those data sets together and pulling them into our back end, into our service, and then allowing people to query that service for some information. But those data sets are getting
bigger, because increasingly the resolution of the data sets is improving. We now have digital surface models in the UK down to five meters; that's the resolution of the flood model, and I'll talk more about that in a second. And when some of our clients connect to our system, they may be competing in real time against other vendors. In the UK, we have aggregator websites: if you go to get your house insurance or your car insurance, you may then go to Compare the Market, for example, and you see a range of prices coming back to you in real time. So when that happens and an insurer asks us to tell them about
the risk at that point in space, we need to be able to respond to them very, very quickly, typically sub one second. And that keeps me really busy because I have to be able to think about how do I do GIS with really big data that's at really good resolutions, but I have to do it really, really, really quickly. And that's my day job. That's what I do.
I should also give a kind of really big thanks because this presentation is an extension of some work that I first presented at Geomob in London, which is a really great event. I think there's now one in Berlin as well. So if you're in either of those two locations, that's a really nice crowd of people to come and hang out with and talk about geo things.
So in the beginning, there was the database. And of course, it was PostGIS, because we all love PostGIS and PostGIS is awesome. We have a big PostGIS backend to pull all those data sets together and do the spatial analysis that we need. So we have an architecture that looks like this: we have a database, which is where all the data is; we have an application that we've written that combines and queries all that data; and then we have an API or an application at the front. I'm sure most people in the room that have done some spatial development will be familiar with this kind of architecture. So this is a bit of an overview of one of
the types of data sets that we've got from one of our data providers that does flood data. So they're providing us with this flood model, which is, as I said, a five meter resolution. So it's 24 billion pixels, just of the land cover, once you've masked out the other 30 billion pixels that are just the ocean, which obviously we're not bothered about.
And so if you think from a remote sensing background, you might think about radiometric depth. There's actually 52 different layers of potential information for every pixel in that data set, so it's quite a lot of information. And as I said before, we have quite strict SLAs with lots of our customers, where we have to provide them that information, but only a subset of it, not the entire data set. We have to be able to query it and say: for this small parcel of land, this is what we can tell you in terms of the flood risk. And so that's great. But then what if we want to do other countries, like North America?
It's like, oh man, like our PostGIS database bill is going to get really big to be able to scale that, ingesting this stuff with PostGIS raster, being able to query it in real time. I should mention that everything's cached anyway, but we often have use cases where we can't cache because it's a greenfield site. So we don't know in advance what that
area of that location is going to be. So what can we do? And at the same time, I put this in because there was lots of talk about what was happening in the future of PostGIS raster and where that was going to go, and it has been factored out into a separate extension. So it was really good to get some updates this morning and see that that project has been factored out but is still kind of a core part of PostGIS and is continuing. But for us, it was a time of asking: well, do we need to look at other solutions? How are we going to make this thing work? And luckily, I was at FOSS4G. And there was a chap called Alex Leith, who is here and I think has just given a presentation, who was talking about GeoPackages and mentioned this thing called COGs. And that was
the first time I'd heard of a cloud-optimized Geotiff. And I made some notes in my notebook diligently, I thought, and then filed that away. And then a few months later, facing this problem, and I was like, is there some new raster specification? Now, where did I hear about that? Oh, yeah, FOSS4G. So enter the cloud-optimized Geotiff. And I'm really pleased that
I hope lots of you have been to some of the other cloud-optimized GeoTIFF talks, because there were some brilliant ones that gave really good details about how a COG, and the TIFF file under the hood, work. So I don't need to cover that information directly myself. But really quickly, you can say a cloud-optimized GeoTIFF is just a way of internally tiling the TIFF information, and it's accessible over the web. So it's a cloud-first data product. That means you can say: for this bounding box, just give me those pixel values. And for us, that's amazing, because that starts to transform the TIFF file into a queryable, searchable database. If you can say, OK, I want this
area, I just want this small handful of pixels, I don't need the other 24 billion, just give me the values for those. So, not too much code; I hope that's fairly legible. We can actually use the brilliant rasterio library, which was created by Mapbox, to interact directly with our cloud-optimized GeoTIFFs. I can step you through this; these are the only lines of code that I'm going to show. So we're going to import rasterio in Python. And then this is the magic. This is where I realized that this thing, for us, was going to be like gold, because I can put my cloud-optimized GeoTIFF file in what's called an S3 bucket, which is an object store from Amazon Web Services, and I can make a connection to that file in that bucket across the internet. So I don't need my compute to be in the same location as my data store. And the second bit of magic is that I can say: I've got some coordinates, some geographical space; give me the pixel values that fall within that space. And I can do that directly against the S3 bucket. Brilliant. Great. Now I've got maybe 10 pixels coming back, which is a very small amount of data, and it's happening quite quickly. And I don't need to worry about the size of my GeoTIFF.
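(The code on the slide isn't captured in the transcript; the following is a minimal sketch of the pattern being described, assuming a hypothetical bucket and file name and a single-band windowed read.)

    import rasterio
    from rasterio.windows import from_bounds

    # Hypothetical location: any S3 URI pointing at a cloud-optimized GeoTIFF.
    COG_URI = "s3://example-bucket/uk-flood-model-cog.tif"

    def read_pixels(minx, miny, maxx, maxy):
        """Fetch only the pixels inside a bounding box, straight from S3."""
        with rasterio.open(COG_URI) as src:
            # Translate geographic coordinates into a pixel window...
            window = from_bounds(minx, miny, maxx, maxy, transform=src.transform)
            # ...and read just that window; rasterio issues HTTP range requests
            # for only the internal tiles that intersect it.
            return src.read(1, window=window)

    # e.g. a small box around a point of interest, in the COG's coordinate system
    values = read_pixels(530000, 180000, 530050, 180050)
    print(values.shape)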
One of the other constraints of using PostGIS raster, which is the kind of previous version of this, was that our ETL process was getting longer and longer as our data sets were getting bigger: more data sets, larger areas, more areas of the world. And so that was one of the problems that we faced there. In this version, because the data provider provides the raster data set as a TIFF file, it's actually quite an easy and simple process to convert that existing GeoTIFF into a cloud-optimized GeoTIFF. Lastly, I can read those pixels out and send them back to my API so I can continue the process. Brilliant.
So now we've got some pieces on the board in front of us, and we can start to put these things together. We've got our data in a bucket; we can access that; we know that we can pull that through the API. What are we going to do in the middle? Well, we are Address Cloud, so everything that we do is in the cloud, is cloud native. Our service this year will be 100% serverless to support our scalability. So the natural choice for us was to use another Amazon Web Services product called AWS Lambda.
And Lambda is a piece of technology that allows you to upload some code that you've written, which is executed in response to an event, and you can execute as many of them in parallel as you want, so it's completely scalable. And when that function has finished running, it shuts down, so you only pay for the time that it runs. Typically, you may run that code for maybe one to two seconds. Fantastic. So we can drop AWS Lambda in the middle. We've got our data store at the back end, and the data storage costs of an object store like Amazon S3 mean that even that file, which is many, many gigabytes, doesn't really cost us anything to store. Our AWS Lambda function doesn't really cost us anything to run.
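(None of the Lambda code is shown in the talk. As a hedged sketch of the architecture being described, assuming an API Gateway proxy event and hypothetical parameter names, a handler might look like this.)

    import json

    import rasterio
    from rasterio.windows import from_bounds

    COG_URI = "s3://example-bucket/uk-flood-model-cog.tif"  # hypothetical location

    def handler(event, context):
        """Entry point invoked by API Gateway: return pixels for a bounding box."""
        # API Gateway's proxy integration puts query parameters here.
        params = event["queryStringParameters"]
        minx, miny, maxx, maxy = (
            float(params[k]) for k in ("minx", "miny", "maxx", "maxy")
        )

        with rasterio.open(COG_URI) as src:
            window = from_bounds(minx, miny, maxx, maxy, transform=src.transform)
            pixels = src.read(1, window=window)  # only these tiles leave S3

        return {
            "statusCode": 200,
            "body": json.dumps({"values": pixels.flatten().tolist()}),
        }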
And we use another bit of kit from Amazon called API Gateway that provides our API, and that doesn't really cost us anything either for this particular part of the architecture. So all of a sudden, we've gone from what was becoming quite a challenging manual process (loading data into PostGIS and making sure our PostGIS databases were able to scale in real time with our users' low latency but high number of connections) into something which scales effortlessly and is very, very cheap, which is great. There is a little bit of a learning curve to get the package to work in AWS Lambda.
Thankfully, Mapbox was around again, and they did a brilliant blog post that talked about how to take the package and some of the underlying bits of GDAL, use a Docker image that's released by Amazon to compile those things from scratch, and get rid of a lot of the things that you don't need in there, like the documentation files; really slim that thing down. And we've got a Lambda layer, which means that anyone can get this now and upload it as basically a series of compiled Python files. It's on GitHub, so anyone can start using it with the example that I showed and start playing around.
And the service time is brilliant. We did tests: this is 100 simultaneous requests for different areas of one really big raster, so those requests are all coming into the service at the same time. The mean time was 74 milliseconds and the standard deviation was 20 milliseconds. We're really happy with that, because it means we can be hitting that service as much as we want and it's just going to do that all day; it's just going to keep giving you back those pixels for as long as you want. And so our overall status continued to be the same. We're doing some performance testing at the moment, so I'm excited to see what happens when we scale this to more than 100 requests, maybe a soak test where you're doing thousands of requests over a couple of days. I'll hopefully be writing or presenting about that in the future.
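(As an illustration of the kind of concurrency test being described; the endpoint and parameters below are placeholders, not the real service.)

    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "https://api.example.com/query"  # placeholder endpoint

    def timed_request(i):
        # Each worker asks for a different small area of the raster.
        start = time.perf_counter()
        requests.get(URL, params={"minx": i, "miny": i, "maxx": i + 1, "maxy": i + 1})
        return (time.perf_counter() - start) * 1000  # latency in milliseconds

    # Fire 100 requests at once and summarise the latencies.
    with ThreadPoolExecutor(max_workers=100) as pool:
        latencies = list(pool.map(timed_request, range(100)))

    print(f"mean {statistics.mean(latencies):.0f} ms, "
          f"stdev {statistics.stdev(latencies):.0f} ms")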
So I wanted to finish on a bit of a use case, to really give an example and show some maps, I guess. This is a great use case of our application and the stuff that we're doing with COGs. This is a caravan park; my colleague Mark knows it, as it's near where he used to go on holiday when he was a little boy, at the seaside in England. And so if we do a geocode on the address for that caravan park, the postal address
where the mail goes is the red pin. You can't really see it, but it's the front door, the main office, the building that's bricks and mortar. But the risk to the insurer is obviously all of the caravans across the entire site. So the flood risk for the main building might look fine for writing that as an insurer, but the insurer is actually insuring all of those caravans; they're insuring the whole site. So what can we do? Well, we can allow our insurer to come to our Maps application and draw a custom freehand polygon around that site and say: I'm underwriting this, these caravans, or this segment, or the full
site. And if we turn on the flood layers, we actually see that that area as a whole has got a very, very high level of potential flooding, and so probably isn't such a good thing to want to underwrite. And we've got that flood score on the left, which has come back at 30. That is a calculation that's been done by pulling all of the pixels in that cloud-optimized GeoTIFF that fall within the polygon, adding the flood scores together, and normalizing to give you a value. And that all happens as a real-time query.
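(The aggregation isn't spelled out beyond pulling the pixels, summing, and normalizing, so this sketch only illustrates the shape of such a calculation; the band choice and the normalization formula are assumptions.)

    import rasterio
    from rasterio.mask import mask

    COG_URI = "s3://example-bucket/uk-flood-model-cog.tif"  # hypothetical location

    def flood_score(polygon_geojson):
        """Aggregate flood-band pixels under a polygon into one score."""
        with rasterio.open(COG_URI) as src:
            # Clip the raster to the polygon; cells outside it are masked out.
            pixels, _ = mask(src, [polygon_geojson], crop=True, filled=False)
        values = pixels.compressed()  # keep only the unmasked cells
        if values.size == 0:
            return 0.0
        # Illustrative normalization: mean pixel value rescaled to 0-100.
        return float(values.mean()) / 255 * 100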
And then we can also visualize that: we actually convert our cloud-optimized GeoTIFF into a stylized representation using vector tiles, so that you can turn it on as a layer and see, OK, there's a river running through that site. So, happily, that was quicker than I expected. I've written a blog post that dives into some of the details about this at blog.addresscloud.com, so you can check that out. And that's it. Thanks for coming; I appreciate it, towards the end of this session on the last day. I've had a really good conference, and it was really great to meet all of you; I hope to meet a few more of you this afternoon. Thanks very much. Thank you very much. Are there any questions?
Here: you said earlier that you moved to Lambda. Is there a reason why you did this? Because it now looks like you're tied to Amazon. Yeah, so we were already using Amazon Web Services to deliver our services, so it made sense for us to use that technology to provide this service. Does that answer your question? Yeah. Do you want to ask another question?
Not really, but yeah. I was curious if you really want to use Amazon; if you want to move to Google Cloud, for example, I think now it's quite hard for you. So I'm curious why you took this decision. Yeah, because at the time, when we started working on (not this service, but other services that use the same technology), Amazon was the only provider that offered a solution. Yeah, vendor lock-in with a cloud provider is an interesting challenge. One of the things that we're doing at the moment is using another piece of tooling called Terraform, which gives us infrastructure as code: we can write our infrastructure out as code, and that helps because it supports lots of different cloud service back ends. But it would be a reasonably major refactor to move some of this stuff over. There are some really great initiatives in that space, like open functions-as-a-service, and so what I would hope is that over time, although we're using AWS Lambda, that might start to support an open standard and we could take advantage of that. Thanks. Any other questions? Sure, yeah. Oh wow, I really wish I'd written a list of things
I'd like you all to work on; I'm ready for next year, at the end of a FOSS4G conference.
So the question was: what's my wish list for the open source geospatial community? I think cloud-optimized GeoTIFFs, and the tooling that's been emerging around them, are a really good example of thinking about geospatial data in a serverless, cloud-first environment, and I'd like to see that continue, because the architecture that we have is not something you could replicate with a more traditional server-based model. We would not be able to provide the service that we do, with the latencies that we have and the number of users that we have, without an enormously large number of physical machines, and even then I think it would cause problems. So: continuing to talk (FOSS4G is one such venue, and there are other good conversations going on in the community, both online and in person) about how we continue to move geospatial tech into the cloud, and also how we continue to embrace our open source ethos, take that with us, and start to challenge some of the things about, well, why can't I run my functions in different cloud providers? Is the underlying technology open? Any other questions?
Thanks, Mark. It's not a curveball. No, not a question as such; just continuing from what Tom was saying around vendor lock-in. I've heard that in the past about vendor lock-in, and I think if you try to stay cloud-agnostic or multi-cloud, you miss the point and you miss most of the benefits you'd get from a particular cloud if you really have that level of abstraction. And then the other thing as well: I don't think you lock yourself in with Lambda. In effect, you simplify your code down to some Python or some JavaScript, and actually the bits that make it Lambda or Google Functions or Azure are so minimal that the rewrite to move from one to the other would be quite small. And then you can always use abstractions (I mean, we didn't really get on with the Serverless Framework, but you can use things like the Serverless Framework or Terraform, as Tom mentioned) to take that away. So I think don't be scared about going all in on one cloud; to move to another one would not be a massive, massive issue, I don't think. Brilliant. It's really hard to see; it's just bright lights shining down. Are we done? One last one at the back, yes. Yeah, definitely. Can you please...
Sorry, yes, I'll repeat the question. The question was: there have been lots of talks this week about big GeoTIFFs and cloud-optimized GeoTIFFs in the browser, and toolsets such as geotiff.js that have been presented; do I see an opportunity to work with that code in the back end? I think that was the question. So first I would say we're not rendering COGs on the client; for us, they're the back-end data store and queryable thing. The geotiff.js library does look really good, and it runs in Node, so it could run on the back end; I'm quite interested to performance-test it against the stuff that we've done in Python. The biggest problem that we have at the moment is compatibility with some of the lower-level binary stuff that's going on in Python and getting that updated. So if geotiff.js, for example, is a pure JavaScript implementation, we could use that instead, and that might be really good. And in terms of tiling, yeah, I don't know; we've not experimented with looking at stuff on the front end, so it'll be interesting to see where that whole space goes. Any more questions? We do have time.
So the question was: for indices or classifications, would they happen on the browser or the server side? I think it depends on the use case. There are some brilliant examples of doing stuff on the client, but for us everything's API-backed, so we need to be able to push out that data, that classification or whatever it is, through our API. That's why we're interested in what we can do in the back end, because we wouldn't want to replicate something where we have one process in our front-end client that's not available through the API. We're API-led, with as many customers using the API as we have using the front-end tooling. So I really think it depends on the use case and what you want to achieve with that data. And I think the examples that I've shown here are quite different to some of the others; most examples I've seen of COGs are of RGB imagery, which is obviously a completely different use case. We've got a synthetic product that's just using GeoTIFF as a storage mechanism; it doesn't have to be an image, for example, but because it's a continuous-surface data product, that's currently the best way to store it. Thanks. All right, we have more questions.
Just give me one minute; I hope you've got your Fitbit on, there's the steps up. Sorry. So my question is: you said you have got coverage of the whole UK. Is it really one image? Is it really a 200-gigabyte COG? Yep. Cool. (I didn't say that!) When it first came from the data provider, yeah, I was like, OK... and, oh, here it is. And that works with the performance you've shown, so that's really impressive. Yeah, again, we just tried it out and thought: oh, that's kind of fast, yeah, that works; because originally, you know, we'd have gone into this big ETL process of chopping it up and loading it in. In that back-end environment we do do some stuff around caching the connection, making sure that that data is always warm, as it's called, so that the access to the data in the bucket, and the code that queries that data, are always ready to go. That helps quite a lot. But yeah, I'm really impressed; whatever the magic is (and I probably have people in the audience to thank that have contributed to that specification), what's in there is pretty phenomenal from our perspective. Yeah, we do have time for one more question, and I need the exercise. So, along the same lines as the comment that was
already asked: you've got a 200-gigabyte image in there; do you foresee issues with larger images, some sort of performance drop-off, or do you think it would be linear in terms of the performance? Yeah, I wouldn't want to quote absolutes, so I don't know, but I can see that it's probably not going to be scalable to, say, North America, so we're going to have to start to think about what we do there. For example, I know that the data provider doing North America is doing it on a state basis, or what you might loosely call a region, because I think they would run into problems creating the input data; that is, their output data that we're taking in. They'd have problems getting their flood model to fit within a single continuous surface. So then I think there are some interesting questions about whether we take all of those and then tile them. I don't know yet; that's something that we're going to be working on this year. Yep. So the question was: Amazon Lambdas have a startup time of one or two or even more seconds; how do we deal with that? We're always within SLA. We deal with that by making sure that our internal monitoring service is continuously polling our service, so that the sandbox worker, which is the piece of infrastructure that Amazon has that checks out your Lambda code and runs it, is always around and listening. Those workers can persist over the course of a day; that same worker will persist, or so say Amazon. And if that thing is always around, the startup time decreases. The terminology for this is a cold start versus a warm start: the figures that I showed are warm, and when our customers connect to us and make a query, they're hitting Lambdas which are already warm, so their startup time is drastically reduced.
Thank you, Thomas. Thanks. Cheers.