
FOSS4G In AWS: Choosing, Deploying And Tuning Open Source Components In AWS


Formal Metadata

Title
FOSS4G In AWS: Choosing, Deploying And Tuning Open Source Components In AWS
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Production Place: Nottingham

Content Metadata

Abstract
This presentation will show methods of working with AWS to design, deploy and tune Open source software with an end goal to bring up various geo-oriented full stacks. This includes databases, tile renderers, geocoders, routers with all dependencies. It will cover choosing the components, the deployment posture, prototyping, designing for cloud scalability, performance benchmarking and ongoing maintenance. Most of the concepts will lend themselves well to other public or private cloud situations.
Transcript: English (auto-generated)
The first speaker is Mohammed Syed, so I would invite him to come on the stage and start his presentation. He is from HERE, which in Dutch is strange, because it means he is from here. He lives here, but he is working for a company, I guess, called HERE.
By the way, I'm not one of the nerds, but I do want you to switch off your phones or put them on silent, please, because it's annoying for the speakers. Thank you very much. Hi, good morning. So, yeah, I work for HERE, which is Nokia's Location and Commerce, but I'm not here for HERE.
I'm just here on my own, and I put this presentation together, and I hope that it will be a contribution to the community. So, this is the agenda, and a couple of disclaimers, one you've already heard, and just one more.
Some goals and motives, a little bit of historical background, a definition of cloud computing so we can talk about the same thing, some use cases for FOSS4G, and then I'll talk about AWS, the components and services at a high level.
And if you want to be in the cloud, what you're going to be doing and how you're going to be doing it. And then some common FOSS4G tasks. If you are building an SDI in the cloud, you're going to have to import some data, do some rendering, some geocoding, and so on.
And then, hopefully, we'll have time for questions. So, you heard the first one: I work for HERE. I'm a senior architect in the core platform group for Nokia's Location and Commerce. But, again, this is just personal work, a personal effort.
I'm also not affiliated with AWS other than being a customer, so I use them when it makes sense. If somebody else makes more sense, I'll be using them instead, especially if they give me a better bang for the buck. This is still a work in progress. I hope to be updating this once a month, or at least once a quarter.
So, your mileage might vary, but I've tried to document it as much as I could. So, why did I want to do this? I wanted to maybe validate some ideas with you, and maybe you can validate some ideas with me.
I've done quite a few services in the past. I worked for Yahoo before I went to Nokia, and I also worked on www.yahoo.com during my Yahoo days. So I have a little bit of background in that area.
Maybe I'll get some feedback from you on what you would like to see me try. Hopefully, I'll help you save some money. There was a lot of frustration while I was doing this, so hopefully you don't have to go through it. There are already some artifacts that were produced, so maybe you can use some of that.
In the process, I've discovered some problems and issues, and maybe everybody knows about them. I'm not very strongly connected to the FOSS4G community, although I've been an open source guy for a very long time.
So I'd like to hear if somebody knows of work in progress to address these issues, and to bring them up. And why I wanted to do this is basically because I think right now we are at a stage where everybody's talking about open data, and geo, and location, and I think this is a very good opportunity where you may be called upon to contribute,
either in your organization or in your community, in your university. Using open source technologies may save some taxpayers some dollars, and at the same time have some fun.
So, I think this is kind of my main motivation. I think this is an opportunity where we can do some disruption with open source in the geolocation space. I think I can help a little bit.
So, a little bit of background. Cloud computing a couple of years ago was just a buzzword: yeah, everybody's talking about this. It has since emerged out of that stigma of being a buzzword into a reality where people actually use it and can deploy services to it.
It has immensely lowered the barrier to entry for startup companies, for nonprofit organizations, and so on. But this goes back a few years, maybe a little over a decade. I think it all started with virtualization.
I'm not going to go into the ancient history before VMware and virtualization on x86. VMware and Parallels used to ship commercial products, and they had some customers. A lot of people used them in lab environments and so on, but they weren't massively adopted.
Solaris, back when Sun existed, also did Solaris zones and containers. But it wasn't really until Xen, before KVM, came along with its paravirtualization technique, which really helped the performance, that this became a viable solution for running things in a production environment.
So that was the disruption; I think Xen really tipped the scale here. At the same time, we were having problems with hardware, so we were not able to deliver any faster processors.
Problems with cooling, problems with power. And the solution from the hardware vendors was to go for multi-core. So you don't have faster processors, you just have more of them. That was one thing. And then they caught on: if they wanted people to run things on multiple processors,
the software at the time was not quite ready, so how do you utilize that? Virtualization was a natural solution. But because of the performance, it was still not quite up to snuff.
There was some hesitation, and the hardware vendors started supporting virtualization in the chipset. First AMD came up with nested page tables, and then Intel did extended page tables, where you can virtualize the page table
so the virtual machine doesn't have to context switch all the time. Later on there was IO offloading, TCP offloading for example, where you can just process things in the NIC without having to go up to the main processor anymore.
Same thing for storage. And then the storage and network vendors started thinking about how they could support this from an infrastructure standpoint. So there was virtualization on the storage side; they kind of called everything that they had done before "virtualization," even if it wasn't really new.
Volumes became virtual volumes, and slices became virtual slices, and so on. But yes, so it started to gain momentum from there. And then on the consumer side, we started seeing smartphones, tablets, and multi-screens, and people wanting to use services and be able to access the same thing from everywhere.
So now the idea became: okay, we don't want to store things on desktops anymore; maybe we store them on servers, but the servers have to be accessible, and this is where cloud computing kind of crystallized.
AWS had really been pushing this for a long time, so they have done a lot of work there, and they are way ahead of everybody else, as far as I know, at least in breadth of coverage. OpenStack is trying to catch up. I think they do great work as well, but they just deliver software,
and now there are companies trying to take that software and build infrastructure and services around it. So, my definition of cloud computing, and this is just a definition: it's a computing paradigm composed of abstractions, a set of primitives, and some interfaces and tools around them.
The idea is that you try to hide the physical stuff, the stuff that's hard to move, the stuff that you don't want to be tied to; you want to abstract that as much as possible, and then you have a new set of primitives, some of them not necessarily very new.
Images, for example, are a primitive. Snapshots, volumes, regions, availability zones: other providers may have other terms for them, but they basically mean the same thing, trying to abstract the data center or the actual computer or the actual hard disk away from you.
And then tools and administrative utilities around that. What happens is that once cloud computing kicked off and people started deploying virtual machines and cloud and so on, things spiraled out of control really fast, and it wasn't in very good shape to begin with.
So, the tools and automation also really helped set that path: Puppet, Chef, CFEngine 3, or any other configuration management. The idea is that you have the primitives, you have the abstractions, and you have the tools to manage them.
So, this is kind of a block diagram: at the very bottom you have the physical stuff, the primitives sit on top, and you have the tools and the APIs at the highest level. We can even go further up; if you look at things like Heroku, for example,
they abstract even more, where you just deploy to a platform, so you're very far away from everything else. You just have a command, you run it, you've got a service. So, this is kind of a clean representation of what it looks like.
This is the OpenStack implementation, so there are quite a few arrows going back and forth. And this is kind of what it looks like in real life, so this is cloud computing. And that machine is very important, because if that gets unplugged, the whole thing goes to shit.
Alright, so AWS is a public cloud, so it's the same kind of diagram, but a little bit more specific. We talk about compute as EC2 instances, plus storage. EC2 instances are just virtual machines. They have a pre-defined set of configurations, so you cannot change or tweak the CPU or memory settings;
you can just choose one of the models. You can attach drives as you wish. Then a set of storage services: S3 is storage over HTTP or HTTPS,
Elastic Block Store is kind of a NAS or SAN idea, and Glacier is long-term archival. They have the foundation, the regions, the actual brick-and-mortar implementation: the data centers, the power, the cooling, all the stuff that we don't want to think about.
Networking: Route 53 is a DNS service, Elastic Load Balancing, CloudFront is a caching service, and a set of tools around security, so identity management, security groups. You go up one level, you see the Simple Queue Service, or search as a service,
or Redshift, which is kind of hosted PostgreSQL: if you don't want to run a cluster and you don't want to manage it, they will do it for you, and you just put in your schema, connect to it, and treat it just like a Postgres cluster. Unfortunately, at the moment, they don't support spatial, so it's plain PostgreSQL only, no PostGIS.
And there's more: Simple Email Service, Simple Notification Service, and so on. And the management layers: the APIs, Auto Scaling, CloudFormation, OpsWorks for configuration management, and so on.
So what kind of use cases can we do with FOSS4G in the cloud? Well, for starters, disaster recovery backup. It's very simple, very easy to just dump a tarball or archive your data or a SQL dump, encrypt it, ship it over into an S3 bucket, and get it back when you need it, hopefully never. So this is a very straightforward use case.
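To make that concrete, here is a minimal sketch of that backup flow using boto3, the AWS SDK for Python (not something the talk itself shows; the database name, bucket, and paths are placeholders, and S3 server-side encryption stands in for the "encrypt it" step):

```python
import subprocess
import boto3

# Placeholders -- substitute your own database, bucket, and key.
DB_NAME = "gis"
BUCKET = "my-dr-backups"
DUMP_FILE = "/tmp/gis.dump.gz"

# Dump the database and compress it in one pipeline.
with open(DUMP_FILE, "wb") as out:
    dump = subprocess.Popen(["pg_dump", DB_NAME], stdout=subprocess.PIPE)
    subprocess.check_call(["gzip", "-c"], stdin=dump.stdout, stdout=out)
    dump.wait()

# Ship it to an S3 bucket; SSE-S3 encrypts the object at rest.
s3 = boto3.client("s3")
s3.upload_file(
    DUMP_FILE, BUCKET, "backups/gis.dump.gz",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)
```

Restoring is the same trip in reverse: download the object, gunzip it, and feed it to psql.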
The other use case is static, logic-free web publishing. So if you just have some vector data or raster data or any kind of static data where you're not doing any logic,
anybody who can make a request can get back that response. You can just publish this using S3 and CloudFront. I'll show you an example, a diagram. You don't have to run a web server. You don't have to run load balancers.
You don't have to do anything. You just publish it, and you will pay for the requests as they come in, but you're not going to have to maintain any infrastructure. Obviously, for online FOSS4G services, you can do geocoding or tiling or routing and so on.
So any of the software that is available to us under a public license, you can just run it. If you run GPL-licensed software, you have to make sure that you're complying with that license. Data transformation jobs: if you have a set of tiles or a set of data
and you want to transform them from one format to another, or maybe you have four or five different formats and you have to do this overnight, you don't need to buy a whole bunch of machines and have them sit idle for the rest of the day. You can just fire a job, get it done, and shut them back down.
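As a sketch of that fire-and-forget pattern with boto3 (the AMI ID and instance type are placeholders, and the user-data script assumes an image with GDAL already installed):

```python
import boto3

# The job script runs at boot; powering off ends the instance because
# of InstanceInitiatedShutdownBehavior below.
USER_DATA = """#!/bin/bash
ogr2ogr -f GeoJSON /data/output.geojson /data/input.shp
poweroff
"""

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-12345678",      # placeholder: an AMI with GDAL on it
    InstanceType="c5.large",
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
    # Make "shutdown" mean "terminate", so the node cleans itself up.
    InstanceInitiatedShutdownBehavior="terminate",
)
```

You pay for the hour or so the job runs instead of keeping a machine around all day.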
Content curation batch processes. Again, the same kind of concept. If you're collaborating with other people and you want to have some central storage where they can upload their files and maybe you can do some processing, put it back, and so on.
So this is like a blueprint. If you wanted to serve this static, logic-free content using AWS: this is your content, and you just put it in this S3 bucket. You configure a CloudFront distribution, which points to this bucket, and you publish it.
And you make your DNS a CNAME alias to this zone that you are going to configure here, and that's it. Now users will request your data, they will get CNAMEd into the CloudFront zone, and based on some telemetry and some other magic, they will get routed to the closest cache edge to them. If you want logs, you can also configure them to go to an S3 bucket where you can retrieve them later.
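A rough boto3 sketch of the S3 side of this blueprint (the bucket name and file are placeholders; in practice you would also attach a public-read bucket policy and point the CloudFront distribution's origin at the bucket, usually a one-time setup):

```python
import boto3

BUCKET = "my-static-geodata"  # placeholder bucket name

s3 = boto3.client("s3")
s3.create_bucket(Bucket=BUCKET)  # us-east-1; other regions need a LocationConstraint

# Serve the bucket as a static website -- no web server to run.
s3.put_bucket_website(
    Bucket=BUCKET,
    WebsiteConfiguration={"IndexDocument": {"Suffix": "index.html"}},
)

# Publish one tile; the ContentType lets browsers render it directly.
s3.upload_file(
    "tiles/0/0/0.png", BUCKET, "tiles/0/0/0.png",
    ExtraArgs={"ContentType": "image/png"},
)
```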
So how do you build it? If you wanted to do this for, say, a university or a district
or just your company, how do you do that? So there are some architectural patterns; you won't find these in books, I kind of came up with them overnight. I just wanted to share them with you so you're aware of them.
They fit some things better than others. Sometimes you have to mix them. Almost everything in the world is polyglot, so this is not a holy book. First, the cookie cutter. The idea is that you have a machine,
and the machine has everything that you would need, and you just manufacture them. You have 10, 20, 100, as many as you need, or as few as you need. They have everything together, so they have the application and they have the data. They scale horizontally, so if the traffic is growing,
then you can just scale out. If the traffic is dying down or at a low point, you can just shrink the fleet. The data is local to each machine, and the machines are not connected to each other in any way, so in case they fail, the failure is localized.
So very, very simple. It scales very well in some use cases. So simplicity is one of the pros, and it scales horizontally with load and localizes failure impact.
These are the main points. The problem is poor support for write-oriented services. If you have that many machines and every one has a copy of the data, and the data has to change, then you have to push the changes back somewhere, somehow. And if the users are allowed to change the data,
that's even worse, because now one machine is going to change and it's going to have to replicate, and that doesn't work very well. It's also coarse-grained scalability: when you scale, you scale everything, or you shrink everything. So if you have a service where your data layer is very fast
but your web application is not so fast, you can't scale just that part; you're going to have to scale the whole thing together. It's a cookie cutter. And the load capacity has vertical scalability issues. So if your data is growing, or if your memory consumption is growing,
you're going to hit a ceiling at some point where you can't just grow anymore within that box. So there's a vertical ceiling on how much you can process per node.
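The natural AWS primitive for the cookie cutter is an Auto Scaling group of identical nodes. A minimal boto3 sketch, using today's API names rather than 2013's (the group and launch template names are hypothetical):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Every node is stamped from the same launch template: app plus data.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="tile-servers",
    LaunchTemplate={"LaunchTemplateName": "tile-server-template"},
    MinSize=2,            # shrink to this when traffic dies down
    MaxSize=20,           # grow to this when traffic picks up
    DesiredCapacity=2,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
```

Failure stays localized: a dead node is simply replaced from the template.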
Then there's the centrist approach, which is basically: you take the data out, and you let the application run on the nodes. You can have a second copy of the data as a backup for disaster recovery. Now the data is centralized in this database, the application nodes are its clients,
and then the users out there, they're the clients of your web service. Works okay in a lot of cases. Scales well for, you know, mid-level kind of loads if you're doing 20, 30, 40 requests per second or so.
It probably works. But has some other issues. So the pros first, you know, you can actually scale the web service by itself or you can scale the database by itself.
And that's a big advantage over the other approach. Five minutes? Okay, I'm going to have to run through the slides. So, the replicator, the master-with-colonies approach, where you can have masters of record distributed
all over the world, and they can replicate to read-only slaves; that works pretty well for read scalability. You're going to have to make some culture changes. Your release engineering has to be in really good shape.
You have to adopt automation. You really need to think about agility. You need to think about using the primitives that are available to you. You need to make sure that you get buy-in from the stakeholders. These are very, very key things. Some process changes. And some of the things that you have to remember:
the legal implications. Don't try to scale in the cloud as you would in a brick-and-mortar situation. Don't go for "let's cluster these machines together" into a big four- or five-node cluster. Things fail; just plan for it,
and it's okay; just think about how you could recover as fast as possible. It may take one or two tries, and you'll probably get it right on the third time. But the old approach, trying to connect things and make sure they are going to be reliable,
it doesn't work very well in the real world. And I did some other work in this process; I'll go through it really quickly. When I started, I wanted to see if I could profile a renderer, a geocoder,
a router in the cloud. And then I hit the first problem: how to get the data in. I started reading about people taking ten days to import the OSM data set, and I thought that was horrible. And I didn't want to take a synthetic data set of 200 megs and say,
yeah, this works and come and show it to you here. So I really wanted to get the OSM data in, and I did. So first I did some tests. I looked at a bunch of countries, small countries. This is the time it took in seconds versus the size of the data set. So up to 3.2 gigs,
we are within a 30 to 35 minute range. I started collecting some stats around this, and I provisioned different infrastructures: I looked at a local drive and how long it took; I looked at provisioned IOPS, which guarantees IOPS performance.
I looked at SSD, which is very expensive. So guess how long it took. I went with the SSD one because I wanted to finish as fast as possible. Any guesses how long it took?
I wish. It took 35 hours. But this is 250 gigs, so this is not so bad; a lot of people spend six or seven days to get this done. So this is actually not bad at all.
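For reference, the kind of import being timed here is an osm2pgsql run; a sketch, wrapped in Python to keep the examples in one language (the cache size, process count, database name, and input file are illustrative, not the exact settings used in the talk):

```python
import subprocess

subprocess.check_call([
    "osm2pgsql",
    "--create",                  # fresh import
    "--slim",                    # keep intermediate node/way data in the DB
    "--cache", "16000",          # node cache in MB -- size it to your RAM
    "--number-processes", "8",   # parallelize where osm2pgsql can
    "--database", "gis",
    "planet-latest.osm.pbf",     # placeholder input extract
])
```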
But guess what I did right after I finished this. No, I made a copy. I built a RAID 0 set over provisioned IOPS volumes, created a logical volume on top, and kicked off a data copy.
And guess how long that took. That one was short: it took two and a half hours. This was a file system copy; I just shut down the database, copied the volume over, and then archived it, of course.
So that took two and a half hours. Provisioned IOPS to SSD took five and a half hours. A SQL dump to SSD took about seven hours. SSD to provisioned IOPS took two and a half hours.
SSD to SQL dump took four and a half hours. osm2pgsql to SSD took 35 hours. So guess where the problem is. This is a profile for osm2pgsql:
the RAM cache node gets take about 15%, way matching 10%, copy-to-table 10%. So there's a lot going on before the data is actually stored in the database. This is part of the problem, and I wanted to talk about that a little bit more.
You can read these notes later. When I did some profiling, it was mod_tile and Mapnik, with one thread and with three threads. If you run with four rendering threads, you actually get six threads:
two threads do the bookkeeping, four threads do the actual work. You don't have to read all of this, unfortunately. I'm going to skip ahead to the GeoServer part. So, GeoServer, single layer: I took a small country, Finland, and zoom levels 1 to 15,
and I used a RAM disk, and that gives about 100 tiles per second in that kind of setup. And this is kind of a ceiling, because this is a RAM disk, so it's not going to get any faster. Well, it could, but it would be very expensive.
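If you want to reproduce that kind of number, a crude throughput probe is enough. A sketch with the requests library (the endpoint URL, layer name, and sampled tile ranges are placeholders, not the talk's actual setup):

```python
import time
import requests

# Hypothetical GeoWebCache/GeoServer TMS endpoint for a Finland layer.
BASE = "http://localhost:8080/geoserver/gwc/service/tms/1.0.0/osm:finland"

start, count = time.time(), 0
for z in range(1, 8):
    n = 2 ** min(z, 4)            # sample a subset of tiles per zoom level
    for x in range(n):
        for y in range(n):
            r = requests.get(f"{BASE}/{z}/{x}/{y}.png")
            r.raise_for_status()  # fail loudly on a missing tile
            count += 1
print(f"{count / (time.time() - start):.1f} tiles/sec")
```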
Truncation is very slow, so don't truncate your GeoWebCache. Try to publish your data as versioned layers instead; this is much better, if you can help it. A standalone GeoWebCache will work a lot better,
so think about yanking GeoWebCache out and putting some GeoServers behind it. I'll show you a blueprint. There are some possible race conditions in threads writing files, and this is kind of an example deployment
where you take the GeoWebCache out, you put GeoServers behind it, and you put a load balancer in front that can do URL persistence. These nodes are not coherent; they are incoherent by design. The idea is that you will go to the node that has the tile. If there is no node that has the tile,
one will get selected, and from there on it will be persistent. You can also mix disks, so you can mix a fast disk and a slow disk in a volume, and that will probably give you very good performance. How much did all this cost? $866 and two weeks.
And then I have a backlog: release snapshots to the public. I actually have the data in AWS now, and I'm going to make it public, so you can just import it. It should hopefully help out. I also wanted to do some geocoding profiling and more OSM profiling,
and I'm open to suggestions as well. Thank you very much. We took up all of your time, so I'm afraid we don't have time for any questions. I'm sure your slides will go up on the ELOGeo platform.