Ingesting 35 million hotel images with Python in the cloud
Formal Metadata

Title: Ingesting 35 million hotel images with Python in the cloud
Part Number: 138
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/21086 (DOI)
Transcript: English (auto-generated)
00:01
All right, so my name is Alex Vinales, software engineer at Skyscanner. I work in the Hotels Data Team, the team that delivers all the data that populates the website and the apps. And I'm here to present the pipeline that we built to make sure that all the hotels have
00:21
images in production. So one of the biggest challenges when you're a metasearch engine is to actually unify all the data. This is what the results page for a hotel search looks like. And here you can see all the providers for the W Barcelona hotel.
00:40
Here you see the three cheapest ones, but in reality there are many more of them. So we are getting the data for that hotel like 20 times. And our job is to make sure that it reaches production once, and it has the best possible data. This is what it looks like. Every partner that you integrate with gives you
01:01
access to the catalogs. Those catalogs give you all the data for the venues that the provider has. And they give you usually slightly different names, slightly different street addresses, and different coordinates. So our job is to match those venues. So do some magic and make sure that they reach SkyScanner
01:23
once and with the best data. If we do that for all the hotels in the catalogs, we end up with the data release, and this is the Skyscanner catalog. So what about the images? In this case here you can see a search result,
01:41
and there is no image there. So you as a customer are not going to pick that hotel. So images are a critical piece of the product. So let's see how we make sure that they reach production. The case is pretty similar. Every partner gives you all the images that they have for that hotel.
02:02
And they usually give you really similar images. You're going to see cropped images and slightly shifted ones with different colors. And our job again is to make sure that we pick the best one and put it in production. Here you see the images in the results page. And if you click here,
02:20
you're going to get an expanded gallery. This gallery shows you the images with more detail. And if you look closely, you're going to see here down below duplicated images. So this is what we are trying to avoid. And as you can see here, they are quite similar images, but slightly moved or shifted or cropped.
02:40
So yeah, it's really hard to remove them all. So with around 200 partners and around one million hotels reaching production, this means that we are going to have to process around 35 million images. But there's a trick here,
03:01
and it's that we resize. We resize for mobile devices. We don't want the mobile phone to be resizing all the images. There you see the thumbnails. They are smaller, so all the work is done on our side. We have around 40 different configurations for resizing. So this is going to multiply the number of images
03:20
that we end up processing. So I'm here to tell you the tale of an image processing pipeline. But before doing that, let me tell you about the tech stack. We are happy users of Postgres, managed on RDS. We've glued together all the steps of the pipeline with the SQS queuing service.
03:43
And obviously for machines, we just get EC2 instances. We try to auto-scale them, so if there's nothing to do, there are no machines up. As far as libraries are concerned, we use Django. We use it with Django REST framework — incredible for making APIs.
04:01
We avoid using the Django ORM; we use SQLAlchemy instead. And for messaging, we use Kombu. You may already be using this library, because it's the underlying library under Celery. It's maintained by the same guys, and it's simply lower level.
04:23
When you do Amazon stuff, you're going to use Boto. And for image processing, we use Pillow, which is a nice library with C bindings, so it gives you nice performance for manipulating the images. And then we use Python 2.7, for technical reasons.
04:42
So, the tale of an image processing pipeline. We are going to start with the triggering. And there are like two big groups here: the asynchronous steps and the synchronous steps. The asynchronous ones are going to be running continuously, making sure that all the images are ready to be used by the synchronous steps. So, triggering.
05:02
The trigger basically is a small worker that keeps running through all the catalogs. It looks for URLs there and computes the diff. The state is stored in the database, and basically we say: all right, this catalog has been updated, and there are URLs that disappeared — those images should be deleted — and there are new images or updated images.
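The diff the trigger computes can be sketched as plain set arithmetic over URLs. This is a minimal illustration, not the actual Skyscanner code; all names are made up.

```python
def diff_catalog(stored_urls, current_urls):
    """Compare the URLs stored for a hotel against the latest catalog scan.

    Returns which images should be deleted (they disappeared from the
    catalog) and which are new and still need to be ingested.
    """
    stored, current = set(stored_urls), set(current_urls)
    return {
        'to_delete': sorted(stored - current),
        'to_ingest': sorted(current - stored),
    }

payload = diff_catalog(['a.jpg', 'b.jpg'], ['b.jpg', 'c.jpg'])
```

A payload like this, per hotel and per partner, is what gets sent to the image release API.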
05:21
So this payload is computed for every hotel of every partner, and this is sent to an API, the image release API. So the API, when it receives the payload, it stores that image into the database, and we keep things like the URL, we give an identifier to that image. We know which provider gave us that image
05:41
and for which hotel of that provider that image is. And now the API is going to move forward and queue the messages to the next step, which is downloading. The downloader step basically gets the messages from queue, hits the partner CDN, gets the image, and puts it into our system, into an S3 bucket,
06:00
so that if the CDN fails or the partner removes the image we still have it, and we can roll back at any time. So this is what the callback looks like, what the worker looks like. Basically, we get the image URL, hit the CDN, get the contents, register a new key into S3, set the contents of that key,
06:22
and after this is done and it's in our system, then we open the image with PIL, and we ask some pretty basic questions like: should I filter that image? Is it big enough? Does it have enough resolution? If the image survives all of that, then it will go to the fingerprinter, because we are going to push this message to the next queue.
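The sanity questions asked after the bytes are safely in the bucket might look like this sketch. The thresholds are assumptions, not the talk's real values, and the S3 upload is left out.

```python
import io

from PIL import Image

# Assumed minimum-resolution thresholds; the real values are a product decision.
MIN_WIDTH, MIN_HEIGHT = 320, 240

def should_keep(data):
    """Can Pillow open these bytes at all, and is the resolution big enough?"""
    try:
        img = Image.open(io.BytesIO(data))
        img.verify()  # cheap integrity check without a full decode
    except Exception:
        return False
    img = Image.open(io.BytesIO(data))  # reopen: verify() invalidates the object
    w, h = img.size
    return w >= MIN_WIDTH and h >= MIN_HEIGHT
```

Only images for which this returns True would be queued for the fingerprinter.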
06:40
So notice here this reliable callback decorator. This is basically something we do on other workers too. We don't want to die on unexpected errors, so the main thing this does is make sure that the workers don't die. Here we do one extra thing: you see that this is converting a warning into an error.
07:03
Pillow actually protects you from decompression bombs, and we actually got a few of them. You don't want a worker to die for a decompression bomb; you want it to keep going and move on to the next message. So here we catch everything,
07:20
log it into Logstash or whatever you're using, then we run the function, and that's pretty much it. So this is what a worker looks like. It's a consumer that is going to connect to the backend, and it's going to map all the messages in that queue to the callback that we just saw. Here you see the Kombu primitives being used.
07:43
We just create a connection to the backend, and then we specify: start consuming all the messages or events on that connection, and map the messages to the callback. The callback is going to get the message and the body of that message. We process that message, and if everything goes fine,
08:00
we just acknowledge the message and move on. So after the image is downloaded, we go to the fingerprinting step. Really simple one. We just download the image from S3 and compute some identifiers that are going to allow us to do further processing. The question that we will try to answer further
08:22
is going to be if these images are the same or not. For a computer, those images are not the same, not really, because they have different bytes, different sizes, I mean, it's not the same image, but if you were a user and you saw that in the website, you would be disappointed, because obviously it's not telling you anything new. It's redundancy, and that's what we want to avoid.
08:43
So yes, they are the same image, and this is what the fingerprinter does. It computes some kind of hash. We use the ImageHash library. It implements different hashing algorithms — I believe average hashing, perceptual hashing,
09:01
difference hashing. We ended up using the difference hashing one, with a slight modification that does a cropped-hash thing. What that does is create subimages of the image: it crops the image several times, and then we compute a subhash for each subimage.
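A difference hash works roughly as follows — this is a minimal reimplementation with Pillow only, to show the idea; the pipeline uses the ImageHash library's implementation, and the centre-crop here is just a rough stand-in for the cropped-hash variant the talk mentions.

```python
from PIL import Image

def dhash(img, hash_size=8):
    """Difference hash: shrink to (hash_size+1) x hash_size greyscale and
    record whether each pixel is darker than its right-hand neighbour."""
    small = img.convert('L').resize((hash_size + 1, hash_size), Image.LANCZOS)
    px = small.load()
    bits = 0
    for y in range(hash_size):
        for x in range(hash_size):
            bits = (bits << 1) | int(px[x, y] < px[x + 1, y])
    return bits

def cropped_hashes(img, hash_size=8):
    """Hash the full image plus a centre crop, to improve the odds of
    matching cropped duplicates (illustrative crop geometry)."""
    w, h = img.size
    crop = img.crop((w // 8, h // 8, w - w // 8, h - h // 8))
    return [dhash(img, hash_size), dhash(crop, hash_size)]
```

Because the hash encodes local brightness gradients rather than exact bytes, two versions of the same photo land on hashes only a few bits apart.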
09:21
The reason we do that is because we want to maximize the chances of matching duplicated images. This kind of hash is such that similar images have a small distance between their hashes. So, time to deduplicate. At this point, the data release is going to run,
09:41
and it's going to say: okay, I have one million hotels that are going to reach production, and now you need to make sure that those hotels have images. So this goes to the API, and the API starts processing that and queues message payloads, and the deduplicator is going to update the status in the database. And then, if it's needed,
10:01
it's going to move to the next step, which is prioritizing the images. We're going to see that now. So what is a group of hotels? A group of hotels, for us, is basically just pairs of IDs. We identify a hotel as a provider plus the ID that hotel has for the provider. So if we see here three pairs of IDs,
10:20
it's a group of hotels provided by three partners. So that's what we call a group. And when I say "if needed" — this yellow line is the queue of messages in the deduplication step. You see it rises up to about one million, and then it starts descending. This means that the workers are processing this payload,
10:41
and the blue line is actually the next step's queue, and you see not all of the messages move forward to the next step. So we try to be differential. Across releases, not all the data is going to change, so you can reuse the images that you already computed in the previous release. That's what we're trying to show here.
11:04
So the deduplicator is going to grab a group with all the providers, and it's going to fetch all the images that we have for them. And ideally what we want to do here is identify that there are two image groups,
11:20
the image group of the room and the image group of the pool. That's all we want to do. So the conclusion is: we tell the database that this group has those two image groups. And this is what it looks like. We have a set of images to be processed, and we have no groups. And then we start moving. We seed the group with the first image to be processed.
11:42
We try to expand the group, comparing it against the other pending images to be processed. And we always ask the same question: is that the same picture? And the arguments that we pass here are the hashes that we computed back in the fingerprinting step. So what is this same_picture doing?
12:00
It's basically taking all the hashes that we computed and comparing them with the Hamming distance. So if that distance is below a threshold, we're going to consider the images to be actually the same, or quite similar. So how do we tune that step? How do we get the guarantee that tomorrow we don't break everything,
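The seed-and-expand grouping with a Hamming-distance cutoff can be sketched in a few lines. The threshold value is an assumption (the real one is tuned against the corpus), and hashes are represented as plain integers here.

```python
def hamming(h1, h2):
    """Number of differing bits between two integer hashes."""
    return bin(h1 ^ h2).count('1')

def same_picture(h1, h2, threshold=6):
    # threshold is an assumed value; in practice it is tuned on the corpus
    return hamming(h1, h2) <= threshold

def group_images(hashes, threshold=6):
    """Greedy grouping: seed a group with the first pending image, pull in
    every pending image whose hash is within the threshold, repeat."""
    pending = list(hashes.items())
    groups = []
    while pending:
        seed_id, seed_hash = pending.pop(0)
        group = [seed_id]
        still_pending = []
        for image_id, image_hash in pending:
            if same_picture(seed_hash, image_hash, threshold):
                group.append(image_id)
            else:
                still_pending.append((image_id, image_hash))
        pending = still_pending
        groups.append(group)
    return groups
```

Each resulting group is what the deduplicator would record in the database as "these images are the same picture."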
12:21
and in production we have tons of duplicated images? So what we do is you build the corpus. So basically you grab a big sample of images, and you will manually over them and set up the groups manually. So you set the truth. You say those images should be a group as a human.
12:42
And then what you're able to do is run the automatic algorithm and get some metrics. With those metrics, you can further tune the code and see whether the metrics improve or not. So you keep making improvements until you are happy. And that's how you get guarantees
13:00
that you don't break stuff. And then we go to the prioritization step, which is choosing the best images. Now we know which groups of images we have, but within each of them we need to pick the best one. So prioritization is quite simple — just update the status in the database. And what it does, it says: okay, I have two image groups.
13:21
I'm going to get all the images again, and I'm going to pick the best one from each group. So it says: okay, this is the best image, and this is the best image. We base that decision on pixels — on resolution, colors, the histogram of colors. And once we know that, we're just going to sort them: if we have something to base that decision on, we're going to prioritize them accordingly.
13:43
Sometimes partners tell you this image should be the first one. So if we have this data, we're going to use it. And then that reaches production, or that will reach production. So what could go wrong when you pick the best image? This could go wrong.
14:02
So obviously picking the best image is not an easy task. It's really hard. And yes, this image had tons of colors and tons of pixels, but obviously it's not the best one. So we have sort of an MVP to extract features from images, and obviously if we find a word that says,
14:22
oh, there's a toilet here, we will put that image at the back. In the meanwhile — because doing that is complex — we have tools, so you can go there manually, sort those images and fix them by hand. It's not that it happens all the time; it's really specific cases, so that's why it doesn't have that much priority.
14:44
And now that you've prioritized the images and know which ones are going to reach production, it's time to spend time resizing them into all the sizes. So this worker is going to get the payload, make all the sizes, and put them into a bucket in S3. This special folder is going to be served
15:02
through a CDN, so you get reduced latency on the website and the apps. And that's pretty much it. This is what the worker does. When you have an image with Pillow and you want to resize it, it's quite easy. You can just call resize, and you're done. If we made the image smaller, we may want to improve the contrast.
15:22
You can use the ImageEnhance primitive, and you can change the contrast as you want. And that's it for the pipeline. This is the final result: all the images have reached production, and you can see how they are progressing through each step.
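The resize-plus-contrast step the talk describes is indeed a couple of Pillow calls — a minimal sketch, with an assumed contrast factor:

```python
from PIL import Image, ImageEnhance

def make_size(img, size, contrast_boost=1.1):
    """Resize for one of the ~40 thumbnail configurations and, since
    downscaling can flatten an image a bit, nudge the contrast back up.
    The 1.1 factor is an illustrative default, not the talk's value."""
    resized = img.resize(size, Image.LANCZOS)
    return ImageEnhance.Contrast(resized).enhance(contrast_boost)
```

Each generated size would then be written to the S3 folder that CloudFront serves.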
15:41
And this is the schematic of how it looks on Amazon. Basically we have a distributor machine, which is just deployment support. We keep all the dependencies there. We try to get the wheels in there, which the other workers are going to use to retrieve dependencies. This saves time when getting heavy packages,
16:01
because with wheels, installing them is really easy. And then we have the image release API. Basically it's used for interfacing with the data release. And we also have a health check there that checks the queues, and if there's tons of work to be done, it triggers a CloudWatch alarm that is going to spin up more instances.
16:22
And that's pretty much it. Then we have the autoscaling group of workers, and everything is connected to the database, which is the central piece. As far as scaling the database, with Amazon it's quite easy: you just provision more IOPS if you need more throughput. So we're happy with that. And then we just serve the final bucket
16:41
with CloudFront on top. And that's pretty much it. So thanks for listening. Do you have any questions?
17:10
Can you repeat? Are you triggering a process to scan the database?
17:25
It hasn't been a problem that much. We have maybe 80% usage, and we are not using that big of an instance. Specific images?
17:41
Yeah, yeah — images related to toilets, and discarding them? Yeah, so we have kind of an MVP, but it needs more work. But basically you get basic words for what has been detected in that image, like "I see a bed" or "I see a toilet", and that would be fair enough to get rid of those issues.
18:00
But it needs more work, yep. There are also times where you have a hotel room photo that shows all the features of the room but doesn't have great resolution.
18:24
No, no. No, usually we try to trust the partners. We know that Leonardo is a good provider of images, so if we have Leonardo, we're going to prioritize Leonardo. If we have the sequence of the images — say a partner is telling us, hey, this image needs to be on top —
18:41
We trust that. And as a last resort, we just check resolution and the histogram of colors. So if we have only blue color, we are going to ignore that image, yep.
19:01
It's quite good. So we have here the metrics for the corpus. The most important metrics here are completeness and pureness. Completeness — basically we try not to put in a wrong image. So if we have a group of beds, we don't want to put the toilet there.
19:21
So this reduces completeness, because we filter more groups of images — we are at about 91% there. And of all the groups that we get, 97% are pure. This means that 97% of the groups have all the same image, like all beds or all bathrooms. So yeah, it could be better, of course, but.
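The two corpus metrics might be computed along these lines — an illustrative sketch under the assumption that pureness is the fraction of produced groups whose members share one hand-assigned label, and completeness is the fraction of corpus images that were assigned to some group.

```python
def pureness(groups, truth):
    """Fraction of produced groups whose members all share one human label.
    `truth` maps image id -> hand-assigned group label from the corpus."""
    pure = sum(1 for g in groups if len({truth[i] for i in g}) == 1)
    return pure / len(groups)

def completeness(groups, corpus):
    """Fraction of corpus images that made it into some group."""
    grouped = {i for g in groups for i in g}
    return len(grouped & set(corpus)) / len(corpus)
```

Tuning then becomes a loop: adjust the threshold, rerun the grouping over the corpus, and check that both numbers move in the right direction.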
19:50
Yes, yes, like the MVP for, you mean for recognizing features.
20:09
Yeah, it's pretty easy to answer. We just have the workers, which are completely independent of Django. We had the workers there, and we knew that we were not going to couple them with the Django API.
20:23
So why implement the models twice? We just chose to do it in SQLAlchemy and not pass the Django settings to the worker. Just a technical decision.
20:42
The question is if — well, you get the browsable API and everything. We're quite happy using it. It's not that the API — it's maybe the least important of the components here.
21:01
The workers are much heavier. I guess that's it.