
Hosted Services are Hard (...and so can you!)


Formal Metadata

Title
Hosted Services are Hard (...and so can you!)
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Transcript: English (auto-generated)
Hi. Welcome, everybody. Really excited to be here. It's been a great conference so far. I appreciate you all coming out to share in this fun ride that has been Mapzen Search. It's been really cold in here, so I encourage you all to move up and huddle for body heat, so don't be shy. Come on up.
All right, so we'll get started. This is a talk about what it's been like running Mapzen Search and an open source project called Pelias at the same time, and all the considerations that have gone into it over the last two years as we've been running. As already mentioned, I'm Diana Shkolnikov. I run the search team at Mapzen.
I've been with Mapzen for about two and a half years. And my contact info is on there. Let's see if this clicker thing works, maybe. Oh, all right. So to give a little bit of background and context, what my team works on is a product: a geocoder.
What a geocoder does, for those not familiar: it's a magic box that takes in strings that might have addresses or street information or an administrative area name, and it gives you back a latitude, longitude, and all the other details that come along with that location, some metadata and whatnot.
It also goes in the other direction: you can give it a latitude, longitude, and get back out the address at that location, or a venue at that location, or what the administrative hierarchy is at that location. And so we built the geocoder. We're not the first to build a geocoder; there are many out there. We did have some special considerations, which I'll get into. We made our geocoder in the shape of an HTTP server.
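To make that concrete, here is a sketch of what forward and reverse requests against the hosted HTTP API look like. The host and query parameters follow the hosted Mapzen Search service's documented pattern; the API key and coordinates are placeholders.

```sh
# Forward geocoding: free-form text in, ranked GeoJSON candidates out
curl 'https://search.mapzen.com/v1/search?api_key=YOUR-KEY&text=30+W+26th+St,+New+York,+NY'

# Reverse geocoding: a point in, the nearest address/venue and its admin hierarchy out
curl 'https://search.mapzen.com/v1/reverse?api_key=YOUR-KEY&point.lat=40.7443&point.lon=-73.9903'
```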
We wanted it to be as accessible as possible, and an HTTP server is a very accessible way to bring a service to the public. And so that's what it looks like. It's an Express Node module, for those of you that are interested in that detail.
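As a rough sketch, not the actual Pelias code, of what "a geocoder in the shape of an Express server" means (the handler bodies are placeholders):

```javascript
const express = require('express');
const app = express();

// Forward geocode: free-form text in, GeoJSON out
app.get('/v1/search', (req, res) => {
  const text = req.query.text;
  // ...parse the text, query the datastore, rank the candidates...
  res.json({ type: 'FeatureCollection', features: [] });
});

// Reverse geocode: a point in, the nearest places out
app.get('/v1/reverse', (req, res) => {
  const lat = req.query['point.lat'];
  const lon = req.query['point.lon'];
  // ...find addresses/venues/admin areas near (lat, lon)...
  res.json({ type: 'FeatureCollection', features: [] });
});

app.listen(3100);
```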
And because it's open source, we created it as an open source software repository in our GitHub organization, called Pelias; that's the name of the backend software project. We also took a bunch of open data projects that we rely on heavily and really appreciate the existence of: OpenStreetMap, OpenAddresses, Who's On First, TIGER, and GeoNames is another one. And then we put all of that together, and we hosted an instance of this Pelias engine. We call it Mapzen Search. It's something that we run so that people that cannot host their own instance of Pelias can come to us, making it yet more accessible.
So when you hear people talk about Mapzen Search versus Pelias, the difference is: one is the open source project that you can stand up your own instance of, that you can tweak and fork; and Mapzen Search is the Mapzen-sponsored instance that we provide for people to have access to a geocoder. And we run it only on open data,
which is the other big consideration for us. So to take you back in time: 2013. I don't know if you can read this, but Randy Meech, our CEO, wrote the first version of the Pelias geocoder. It was written in Ruby. It looked very different from what it does today. And in the readme it said: this is experimental.
And it really very much was. It's an R&D project. And that's still kind of true today; we stay true to our roots. We continue to innovate and try different things. And even though we're running this service in production, we're always trying to come up with new, interesting ways to solve the problems
and to improve what we're working on. So that came out. And when Randy posted it, he had a vision for the geocoder, and the core values were: open source, which we're sticking to; open data, as I mentioned, we use only open data, we don't have any proprietary data behind the scenes. And the theory behind that is that
if we create good tools for open data, then people use them and they'll want to improve the data that's underlying those services. And so open data will get better. And in the future, open data will be the only kind of geo data we need because these things are facts and they should just be available to everybody. And community is a big thing for us. So part of the open source nature of the project
is that we really want to collaborate with organizations and individuals that use our stuff and need it to be tweaked, or need it to be internationally supported. And so we were always very welcoming to the community, and we hope to embrace a large audience and collaborate with folks.
And then two years later, with a lot of hard work, a lot of refactoring, and a rewritten engine in Node.js from Randy, we launched Mapzen Search. And like I said, that was the hosted instance. That's the blog post; you can see it was September. We made the announcement at Code for America in September 2015.
And two years before that, at Code for America, is where Randy announced the original version of the geocoder and his plans. And we were really excited. So much so, I even made these mugs for the team. It's a small team, so I made these by hand. They're glass, and they say Pelias 1.0. Yay. Things were good.
Life was good. We launched this thing. It went off without a hitch. And then we had this moment of like, what, okay. Well, now we made all these promises to people and we were not in the business of hosting services before. We were an R&D shop. And now we have to live up to all of these expectations. People are gonna be using our service in their applications.
What if it goes down? We're representing this open data. And what if the data is not up to date? So all of these things just started kind of freaking us out a little bit. So we had this moment of panic. And then we started to try to figure things out one at a time. So I'll just cover some of the, kind of the key areas that I think tell a story
of where we started and where we got to now. We'll start with documentation. So we were really excited about our documentation. Before launch, we spent a lot of time, we worked with our documentation team. Rhonda Glennon heads that up. And she did a lot of work on making it conversational, making it accessible, approachable,
and yet really informative. We have all kinds of examples. We have, you know, beautiful pictures and everything. We even get compliments on our API docs. This was on Twitter: someone said they cried while they read our docs. I think that was a good thing; they said amazing, OMG. So people really like our docs.
And we were like, yes, we did this right. And then one day we were releasing some new features and we were like, this should be a way for us to tell our users what we just did because they're probably gonna wanna know. And we realized we didn't have anywhere to put that. So release notes were kind of a thing that we didn't think about.
And for those of you that are not familiar with the concept of release notes, software projects do this all the time. Anytime you make changes to the code and you put them into production, you write a release notes update that says, this is what we changed, this is why we changed it. If you're gonna install the new version or if you're gonna start using the service with the new updates, this is what you should know.
And we didn't have that. And so we wrote a blog post in April 2016, apologizing and saying, hey, guess what? Now you can check out our release notes. And for the release notes, we have two sets of considerations: we have the open source project, and then we have the Mapzen Search service. And both of them have a very different audience.
So in the release notes for this open source project, we have to be very detailed about all of the refactoring that went into it. If there's a new dependency, then that's the kind of thing we wanna call out. If there's a bug that was fixed, even if it's not visible to the end user, we have to call that out. So basically go through all of the GitHub history
for that release and call out anything of interest. Whereas on the Mapzen Search side, it's important to call out data updates or anything that changed in the protocol or the contract between the client and our API, in case a client is going to be affected by these changes. And so we put in release notes, and that was a big deal. So documentation: everything was good, life was great. Now we can talk a little bit about testing. Testing, as mentioned in my bio, is something I'm a big proponent of. When I started two and a half years ago on the search team as an individual contributor,
we only had unit tests. And unit tests are great, we should all have them. They were being run with Travis. So anytime you submitted something new, Travis ran the tests and you could feel confident that you didn't break any code, but you couldn't feel confident that you didn't break the functionality
of the geocoder as a whole. There wasn't any testing framework we had that would tell me that existing queries, queries that used to work before, were still working. All I could say was that a module didn't change its interface, because that's what unit tests will kind of tell you. And so for my first pull request into the project,
I couldn't put it into production without making some sort of an insurance policy, I guess, for myself, to make me feel good about the fact that it wasn't gonna break everything. So we put together what we call, oh, actually, yeah, before I did that, we were doing feels-driven development. And I call it feels-driven development because anytime somebody would release a new thing, we would literally just put it in production, which, it was still in beta. We put it in production, and then we would email the whole company and say, hey, go try out your favorite query and tell us if we broke anything. And that just didn't scale; it didn't work. And so, me joining the team, I was like,
ah, can't do this. So we got away from feels-driven development, and we put together an acceptance test suite, and we crowdsourced all of these favorite queries that people would use to test our stuff. We asked them to give us those queries, and we created a suite that we would run. And if any of those queries that used to work regressed,
then there would be a red flag. We'd have to revert the commits from master, figure out why they're not working, and just kind of iterate until all of the tests pass. Also, if we get bug reports, we'll add a new acceptance test and mark it as failing until the time that we can get to it and fix it.
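As a rough illustration of how such a case can be written down (the field names here are an approximation, not the suite's exact schema), a crowdsourced query with its expected result might look like:

```json
{
  "name": "crowdsourced favorite query",
  "testCases": [
    {
      "status": "fail",
      "in": { "text": "Empire State Building" },
      "expected": {
        "properties": [
          { "name": "Empire State Building", "locality": "New York" }
        ]
      }
    }
  ]
}
```

A case marked as failing documents a known bug; once a fix flips it to passing, it stays that way.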
Sometimes we fix something, and it actually fixes five existing failing tests. And so we get to see the improvements, we mark them as passing, and then we're not allowed to let them regress in the future. And so what we do now is a little bit of test-driven development, if you will, where we add the acceptance test, it fails, we try to fix it, then it passes,
we mark that as passing, and we don't go back on those promises. But more and more, as we improve the engine, we're getting to a point where this test-driven development is becoming a limitation. It's gonna get to a point where it's not obvious what the fixes need to be,
because with the manual aspect, they're just not as obvious to call out. And so when we get to that point, we hope to be more data-driven, instead of driven by user reports of issues, which we still probably will have. We want to start analyzing our logs and analyzing the queries that are coming through, the kinds of results we're sending back,
and to have some sort of a data science approach to calling out the issues that still exist in our service. So we're hoping next year we can get to that point. The other thing that we always struggle with is the data builds. As I mentioned earlier, we take open data
and we put it into our Mapzen Search instance. And all of these different data sets get updated at different times. We don't control any of them. So OpenStreetMap, for example, has minutely diffs and a daily planet file; OpenAddresses has a weekly build; Who's On First has hourly bundles.
They all come in at different rates, different speeds. Sometimes, when we try to pull a file down, it doesn't work the first time. Sometimes the bundles are not perfect. So it presents a challenge of managing all this data. And there's also a lot of data. We're now looking at, in our Elasticsearch index,
500 million records, and every time we do a build it grows by a significant amount. OpenAddresses alone is contributing millions of records a week. And that's exciting; we wanna see that, and we wanna be able to scale as a team to meet that expectation. So when I started in 2015, we were looking at two-week build cycles.
Which, if you're an engineer, you can imagine: in two weeks you will have forgotten why you made a change. So if you're in a development process where it takes two weeks to get your changes into a build, you're dead in the water, right? So this wasn't a really effective process for us at all.
And so we quickly focused on optimization, and we got it down to two days. And now we're around the 30-hour mark, which, after two weeks, seems amazing, but it is still a long time to wait to have your changes validated and to be able to run those acceptance tests before you know that you didn't break anything.
Up front, we had focused a lot of effort on data completeness and making sure that it was all correct. Optimization just didn't enter the mindset until we got to this production environment and people started relying on the data being up to date. And so now we're trying to catch up and make sure that we deliver on those promises.
So a two-day build is not bad. We do a full refresh once a week, if all goes well. We say "if" because things don't always go well. But we're hoping, by the end of this year, to get it to be under 24 hours. And that's a big goal. And it's actually very achievable
because we're moving to a microservices infrastructure, which I'll talk about next. And it looks like it's very possible to get there. We're parallelizing a lot of processes, and 24 hours is doable. And from there, who knows, maybe even just a couple hours.
We'll see. And the other thing is, like I mentioned, sometimes the data we bring down is not perfect, because we don't control the datasets. Sometimes we'll get errors from the bundles, or a bundle just causes the build to fail unexpectedly. And so what we started doing is pulling down the data every hour, validating that it works, and saving that last known good dataset into a thing that we call data source packs. And we want to get better at that. We want to create containers that have all of the data in them, so that we can store the historic containers and go back in time and say, oh, this one didn't work, let's just grab the last known good one, so the builds can move forward and not be hindered by one dataset that didn't download.
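A minimal sketch of that hourly routine for one dataset, with made-up URLs and paths (not Mapzen's actual tooling):

```sh
#!/bin/sh
set -e
STAMP=$(date +%Y%m%d%H)
# Download and sanity-check the bundle; set -e aborts on any failure.
wget -q https://example.org/openaddresses-latest.tar.gz -O /tmp/oa.tar.gz
tar -tzf /tmp/oa.tar.gz > /dev/null
# Only a validated bundle becomes the new "last known good" pack.
mv /tmp/oa.tar.gz /data/packs/openaddresses-$STAMP.tar.gz
ln -sf openaddresses-$STAMP.tar.gz /data/packs/openaddresses-latest.tar.gz
# Builds read the symlink, so a failed download simply leaves the
# previous good pack in place.
```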
And so, as I mentioned, let's talk about microservices. When I first started in 2015, we had two big moving parts in our infrastructure. We had the API, which is an Express module, and we had the Elasticsearch cluster. And the API had to do all the things, had to know about everything, to make all the decisions, in one process.
And at some point, as we kept growing functionality, it became unwieldy, and we started thinking about how do we split this thing up. And microservices are not a new concept. It was just that the way that our software evolved, we just kept piling things on top of each other. And it looked like this in 2015.
And in 2017, we have split out a lot of the functionality. So we now have five different microservices that all run together, and Elasticsearch is there just as it used to be. And we can do a lot with this framework. We can stand up the right amount of instances to meet demand.
And I know now there's a lot of talk about serverless, so we're probably gonna be considering that as well. But we're trying to decouple things as much as possible and move into this distributed architecture. And so this has been going really well. And then not too long ago, we had this conversation in Slack with one of our engineers, Julian.
He just said, wow, I just realized something kind of cool. Right now there's no data in the Elasticsearch build, and yet a lot of the queries still work because we have all these other microservices online. And that's kind of the, that's the future, right? And why we did the microservices work. But just having him call that out in the Slack channel
was kind of an aha moment, and it felt like we had done this the right way. So with all of these moving parts, as they become available, deployment just gets more complicated, because you have to manage each of them, and you have to think about replicas and redundancy.
And for each one of these things, you have to have separate scripts to run it. And so deployment started getting really complicated. And we kind of struggled with this again, because we have the open source project and the hosted service, and we're always trying to consider the needs of both.
And what we need to do in production is not always what our users need to do when they're standing up their own instances of Pelias. For a while we just had a long file, a long readme that said, install this, then install this, then install this. And it was just links to all of these different dependencies and you'd have to go and get them yourself, write your own scripts. And we would always get a lot of questions
and like, what's this step, or, you guys missed this thing. It just didn't scale. And as we updated stuff, we would inevitably forget to update the docs. And so we made a Vagrant image, and that got us a little bit further. But because we weren't using that Vagrant image in production, or even in our own development cycle, it was really purely for the users
that stand up their own instance. It would continue to fall out of date and it would continue to not work. And people just kept asking us more questions about, how do I get this thing working? And then finally Docker became this huge movement in the industry and everyone was asking for the Docker containers. And so we made a decision
to move to a containerized structure as a team. And so by the end of this year, actually by the end of this quarter, we will be fully on a Docker and Kubernetes setup, which we're really excited about. The team has gotten really far in a short amount of time writing those Kubernetes scripts. You know, there's so much support in the community that these things practically write themselves.
So it's really exciting to see. And also, we can be really transparent with those Kubernetes recipes and scripts, in a way we couldn't really be with Chef, because there's a lot of secret keys and things like that. For us to make those clean and publishable would have taken a lot of work. So we just decided, let's go straight to where we wanna be.
And so now we're moving to this container architecture. And the nice thing is that we have the Dockerfiles, and we use that same Dockerfile in production under Kubernetes, but we also have a Docker Compose setup that allows us to test things locally, so we can use it in our own development cycle.
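As a hedged sketch of what such a local Compose setup could look like (the service names mirror Pelias microservices mentioned in this talk; the images, build paths, and ports are illustrative, not the project's published configuration):

```yaml
version: '2'
services:
  elasticsearch:
    image: elasticsearch:2.4
    ports: ['9200:9200']
  placeholder:                # coarse admin-area parsing
    build: ./placeholder
  pip:                        # point-in-polygon lookups
    build: ./pip-service
  api:                        # the Express front end
    build: ./api
    ports: ['3100:3100']
    depends_on: [elasticsearch, placeholder, pip]
```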
And that's really the ultimate win: the fact that you can share that setup code across your development environment and your production environment, so it never falls out of date. And that's it. That covered some of the considerations we made. And we're continuously improving our process. And I'm sure that if I gave this talk again in two years,
we would be at a whole other place. And it'd be interesting to see the trajectory overall. But if you guys have ideas on how we can improve further, let us know. As we said, we're really community driven. So we'd love to hear from everybody and get your feedback on the stuff that we are doing now. And that is it.
Ask me stuff.
Sure. We do have documentation. Actually we gave a workshop on Monday. So all the materials will be published in a little bit. And there's a tutorial there on how to build your own custom importer. And so we've made it really straightforward where you can write the import process
in any language you want. And then you publish to this HTTP service called the document service, and that will do the admin lookup for you and all these other things that create an object that needs to go into the Elasticsearch index.
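To give a flavor of that flow, a sketch with a hypothetical endpoint and port (the exact interface is in the workshop materials):

```sh
# Push one record; the document service fills in the admin hierarchy
# (city, region, country) and emits an index-ready document.
curl -X POST http://localhost:5000/documents \
  -H 'Content-Type: application/json' \
  -d '{ "name": "My Corner Cafe", "lat": 40.744, "lon": -73.990 }'
```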
And so if you have data that has a name and a location, or an address and a location, then you can put that into the Pelias engine and host your own. You can choose to augment it, adding it on top of the open data that we use, or you can choose to have an instance that's just your data; it's data agnostic to a large extent. Sure. Do you have a list of the open data sources?
Is that an exclusive list? Is that all, or are there other lists which you haven't mentioned? In particular, our company's interested in data for the People's Republic of China. There are lots of Chinese sources which are not in OpenStreetMap, but they're open data. And we just want one source
where we get all of these open sources together. So right now, the ones I listed are the only ones that we support in Mapzen Search. But we're always looking for new data sources. And if you know of a good open data set, we would be more than happy, if the licensing is appropriate, to make that available, either via a share-alike license
or CC BY or something like that. Then we would love to include that data in our search, and we can add it to our import process and bring it into the production servers as quickly as we can find the data. The problem is not so much the licensing but the language, because this is completely Chinese.
There's no English translation of that data. We're language agnostic, so we do support international data. Even in the OpenAddresses data set, and in OSM as well, there's data that's not English. So it should be fine. There might be some additional work we have to do in the analysis part of the Elasticsearch engine: the analyzers. You have to tell them what language you're working with so they do the proper processing when they do the indexing.
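For example, a sketch of a mapping that applies a Chinese analyzer to a name field (this assumes Elasticsearch's analysis-smartcn plugin is installed; the type and field names are illustrative):

```json
{
  "mappings": {
    "place": {
      "properties": {
        "name": { "type": "string", "analyzer": "smartcn" }
      }
    }
  }
}
```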
But I think it would be great, and we could definitely add more data and different data sources. They don't have to be only the ones that we listed, but it would take a little bit of time to get them into production, obviously, and to make sure they're tested and work well. Yeah, if you could let me know what the list is. Great, yeah.
So anytime you run a query, you get back a GeoJSON response. And if you're looking for a particular address, and you lived at that address or were familiar with a venue at that address, you would just look at it on the map. You can get that GeoJSON to show up on a map, and we have tools internally that do that for us, so we can do an easier job of testing our engine.
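For reference, a trimmed-down sketch of the kind of GeoJSON Feature that comes back (the property names approximate the response format):

```json
{
  "type": "Feature",
  "geometry": { "type": "Point", "coordinates": [-73.9903, 40.7443] },
  "properties": {
    "name": "30 W 26th St",
    "locality": "New York",
    "region": "New York",
    "country": "United States"
  }
}
```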
And if you see it on the map and it doesn't look like the thing you wanted, then you would say, that's in a different state than what I asked for, and sometimes that happens. Or we just don't find the data even though it's in the open data set, and so that's another type of result failure that we would look for. But a lot of it was feels driven, right? But once they told us what the expectation was,
so we built a little tool internally. The tool allowed you to specify the query, and we would give you our top 10 results, and you would select the one that you thought was the right result, even if it wasn't first, and we would register that as a possible test case. And if we didn't find it, then you could go search: we could do the same query against other engines,
and you could pick the result that you really wanted to see, even if it wasn't coming from us. At least it would give us a way to document what we were expecting in the future, so we have a way to fix it. So we did that, and that's how we built it out. We had a couple hundred tests that we identified as being key representations of our functionality, and we kept those as a core set,
and we keep adding to it. So if I search for New York, I know what I wanna see. I wanna see three things, or however many things, and I wanna see them in this order. So you get the idea: with an API like this, you know what you're expecting to get back.
There's no user state, it's read only, so we should always be giving the same information back for the same queries, and so it's easy to build out an acceptance test suite, and that's what we've done. So regarding using... Great question. No. So for the Docker Compose stuff,
where we need to run it locally during our development process, we can run it locally in a container, and that's what we do. But in production, it's not recommended that you run Elasticsearch that way. So we're actually putting together Terraform scripts and Packer, and we're gonna be making AMIs so we can spin up those Elasticsearch instances,
and the AMI is gonna have the data and everything, so they can come online within minutes instead of taking 20 or 30 minutes to get into the cluster and redistribute all the data. So yeah, we're not gonna use containers in production. But if you're doing it for a small area, you don't need a full Elasticsearch cluster, you just need the one instance, and you can do that locally in a container.
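A minimal sketch of that direction, as a Packer template (the region, AMI ID, instance type, and provisioning script names are placeholders):

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "m4.xlarge",
    "ssh_username": "ubuntu",
    "ami_name": "pelias-elasticsearch-{{timestamp}}"
  }],
  "provisioners": [
    { "type": "shell", "script": "install-elasticsearch.sh" },
    { "type": "shell", "script": "load-index-data.sh" }
  ]
}
```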
Yeah, great question also. We struggle with address parsing quite a bit. You'd think it's a solved problem, and yet no one has been able to solve it in a way that just checks it off the list. It's always a priority for us to keep improving.
So we started with just addressit. addressit is a Node module, and it has very strict rules. It's regular-expression driven, and it just said, you know, there's a comma here, and this, you know, whatever. It's very rigid, and it didn't really do a good job of supporting international addresses,
so we quickly started getting away from it, and there's a library now. We commissioned Al Barrentine, who's a data scientist, to write this natural language processing address parser, and it's called libpostal. It's been, you know, a popular engine in the community,
so it's something to check out. What it does is, there are training models: it's trained on open data, between OpenAddresses and OSM, and he creates all of these different permutations of addresses and trains the model. And then we get something that can recognize not only addresses but also venues, and it can tell you whether it's an admin-only thing.
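As a rough sketch of what that looks like from Node (this assumes the node-postal bindings to libpostal; the sample output is abridged and illustrative):

```javascript
const postal = require('node-postal');

// libpostal labels each piece of the input string.
console.log(postal.parser.parse_address('30 W 26th St, New York, NY 10010'));
// => [ { component: 'house_number', value: '30' },
//      { component: 'road',         value: 'w 26th st' },
//      { component: 'city',         value: 'new york' },
//      { component: 'state',        value: 'ny' },
//      { component: 'postcode',     value: '10010' } ]
```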
It'll tell you: this part is a country, this part is a city, this part is the postcode. So it does a really good job. It's not perfect, and we're continuing to work with Al to improve the engine. But the one thing we found to be not great, particularly inconvenient for us, is that libpostal makes a decision
on what a thing could be in the address, and then it sticks with that decision, and you only get one answer. So for example, if you said, you know, 1 Main Street, New York, it'll say: I think New York is the state, and that's it. So then, when we build our queries to Elasticsearch, we would say, find me 1 Main Street that has a parent of New York
in the state column, right? But because it could also have been the city, and it's not clear, we would sometimes find things that were outside of New York City but in New York state, and we didn't want that; it was confusing. And there were lots of cases where that would play into incorrect results.
So what we started doing was taking all of the admin parts that libpostal lists out and running them through a microservice that we call placeholder. It is meant to parse only the admin areas, and it gives us all of the variations that a thing can be, while taking hierarchy into account. And you can read more about that in our docs about the placeholder module,
but it's done a really good job of helping us build more robust queries where the query is ambiguous. So that's a really long answer too. We have three things that we use now, and there might be more in the future, because it's a really hard problem.
I'm just gonna step in real quick and remind people we've got lunch now, but I'm not gonna stop the question. Thank you. We do not support polygons right now. If you just go to the API as it is,
there is no support for inserting polygons or asking questions with polygons, other than being able to specify a rectangle as a boundary to say, don't give me results outside of this. And we don't return polygons, because we return IDs for the source data, and you can go look it up
in the original data source. So that's kinda how we tie the geometries together. So if you got a result from Who's On First, for example, and it's New York City and you wanna get the boundary, you would go to the Who's On First API, do another query, and get the full geometry from there. But we don't store them, and we don't return them.
Other than the PIP service, the point-in-polygon service, that's the only place where we need the geometry, so we can do the lookup to say this point is within this hierarchy. But we don't return the actual geometries; we just tell you the IDs of the things that it belongs to. Great.
Thank you all for coming. If you have any more questions, stop by our booth on the floor. Thanks.