Ingesting 35 million hotel images with python in the cloud.

Video in TIB AV-Portal: Ingesting 35 million hotel images with python in the cloud.

Formal Metadata

CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Alex Vinyals - Ingesting 35 million hotel images with python in the cloud.
This talk covers the distributed architecture that Skyscanner built to solve the data challenges involved in the generation of images of all hotels in the world. Putting together a distributed system in Python, based on queues, surfing on the AWS Cloud.
-----
Our goal? To build an incremental image processing pipeline that discards poor-quality and duplicated images, scaling the final images to several sizes to optimise for mobile devices.
Among the challenges:
1. Ingest all the input images that partners provide us.
2. Detect poor-quality and duplicated images and keep them from reaching production.
3. Resize all the generated images to optimise for mobile devices.
4. Ensure the process scales and behaves in an incremental way.
5. Ensure the whole process fits in a time-constrained window.
Among the tools we used: Pillow, ImageHash, Kombu and Boto.
Right. So I'm Alex Vinyals, working here at Skyscanner [unintelligible], and this is the team that delivers all the hotel data that populates the website and the apps. I'm here to present the pipeline that we use to make sure that all the hotels have images in production.
One of the biggest challenges when you're a metasearch is duplicated data. This is what the results page for a hotel search looks like, and here you can see the providers for the hotel. You see the featured ones, but in reality there are many more of them. So we're getting the data for that hotel like 20 times, and our job is to make sure that it reaches production once and that it has the best possible data.
This is what it looks like: every partner that we integrate with gives us its catalog, and those catalogs give you all the hotels that the provider has. They usually give you slightly different names, different street addresses, different coordinates. So our job is to merge those [unintelligible] and make sure that they reach production only once and with the best data. If we allowed all the hotels that come in the catalogs, we would end up with [unintelligible] in this case.
So what about the images? In this case, here you can see a search result and there's no image for the hotel. Would a customer go to that hotel? Images are critical, so let's see how we make sure that they reach production.
The case is similar: every partner gives you all the images that they have for that hotel, and they usually give you duplicated images. You're going to see cropped images, slight shifts, different colors. Our job, again, is to make sure that we pick the best one and put it into production.
After you see the images on the results page, you can click and you're going to get an expanded gallery. These galleries are full of images [unintelligible], and if you look closely you are going to see repeated images here.
This is what we're trying to avoid. As you can see here, these are similar images, but slightly cropped or shifted, so it's really hard to remove them all.
To give some numbers: we have around 200 partners and around 1 million hotels [unintelligible], and we process around 35 million images. But the trick here is that we also resize for mobile devices, so that a phone receives the smaller versions of the images [unintelligible]. We have around 4 different configurations for resizing, so this is going to multiply the number of images that we end up processing.
So I'm just going to detail our image processing pipeline, but before that let me show you the tech stack. We're happy users of AWS [unintelligible] managed services [unintelligible]. For machines we use EC2 instances with auto scaling, so there's nothing new there. As far as Python is concerned, we use Django, and we use the Django REST framework to build the API. We avoid using the Django ORM and use SQLAlchemy instead. For messaging we use Kombu. You may already be using it, because it's the library underneath Celery [unintelligible]. For image processing we use Pillow, which has C bindings, so it gives you nice performance for reading the images. And then we use Boto to interface with AWS [unintelligible].
So, the details of our image processing pipeline. [unintelligible] we're using queues, and the services are running continuously, making sure that all the images go through the sequence of steps. [unintelligible] a worker that keeps running through all the partner data: it looks for image URLs, and the state is stored in [unintelligible], so it detects the new images and the deleted images [unintelligible]. When the API receives the payload, it stores the image record [unintelligible]. We keep things like the URL and an ID for that image, and we know which provider gave us that image and for which hotel.
And now the API is going to move forward and queue the messages to the next system [unintelligible], which basically gets messages from the queue, downloads the image and puts it into our system, into a bucket, so that if the CDN fails or the partner removes the image, we still have the original at any time.
So this is one of the [unintelligible], this is what the worker looks like. Basically we download the image from the partner's CDN [unintelligible] and store the contents under that key. After this is done and the image is in our system, we open it with Pillow and we run some pretty basic checks, like whether the image has enough resolution. If the image survives them, it will go to the fingerprint step [unintelligible].
This is basically something we use on all workers, a wrapper [unintelligible] making sure that the worker [unintelligible]. The stuff you see there is turning a warning into an error; it is Pillow's protection from decompression [unintelligible]. In that case [unintelligible] you want to continue going and acknowledge the message. So here we catch the error, log it, acknowledge the message [unintelligible], then run the function, and that's pretty much it.
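The wrapper described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `safe_handler`, `FakeMessage` and `SuspiciousImageWarning` are hypothetical names, and the custom warning stands in for a library warning such as Pillow's `DecompressionBombWarning`. Acknowledging the message even when processing fails is also an assumption drawn from the speaker's "continue going and acknowledge the message".

```python
import functools
import logging
import warnings

logger = logging.getLogger("worker")


class SuspiciousImageWarning(UserWarning):
    """Hypothetical stand-in for a library warning about a dangerous image."""


class FakeMessage:
    """Hypothetical message object; broker messages expose a similar ack()."""

    def __init__(self, body):
        self.body = body
        self.acked = False

    def ack(self):
        self.acked = True


def safe_handler(func):
    """Wrap a handler: warnings become errors, errors are logged,
    and the message is always acknowledged so the worker keeps going."""

    @functools.wraps(func)
    def wrapper(body, message):
        try:
            with warnings.catch_warnings():
                # Promote the chosen warning category to an exception.
                warnings.simplefilter("error", SuspiciousImageWarning)
                return func(body, message)
        except Exception:
            logger.exception("failed to process message %r", body)
            return None
        finally:
            message.ack()

    return wrapper


@safe_handler
def handle(body, message):
    # Toy check standing in for opening and validating an image.
    if body.get("pixels", 0) > 100_000_000:
        warnings.warn("image is suspiciously large", SuspiciousImageWarning)
    return body["image_id"]


ok = FakeMessage({"image_id": 1, "pixels": 640 * 480})
bad = FakeMessage({"image_id": 2, "pixels": 10**9})
ok_result = handle(ok.body, ok)
bad_result = handle(bad.body, bad)
```

With this shape, a single bad image is logged and skipped instead of stopping the worker or being redelivered forever.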
So this is what a worker looks like. It is built on a Kombu consumer [unintelligible], and it is going to pull the messages from the queue [unintelligible]. We describe the connection to the backend, then we specify our consumers [unintelligible] on that connection, and we route the messages to the handler. The handler is going to get the message and the body of the message; we process the message and [unintelligible] acknowledge the message.
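The consume, process, acknowledge loop can be sketched with the standard library alone. Note the swap: this uses `queue.Queue` as a stand-in for the broker, not Kombu itself, and the `Worker` class, the message fields and the example URLs are hypothetical. With Kombu, the same shape would typically be a `ConsumerMixin` subclass whose `get_consumers` method registers a callback receiving `(body, message)`.

```python
import json
import queue


class Worker:
    """Pull messages from a queue, process them, and acknowledge each one."""

    def __init__(self, source, handler):
        self.source = source    # stand-in for a broker queue
        self.handler = handler  # callable invoked with the decoded body
        self.acked = []         # ids of acknowledged messages

    def on_message(self, raw):
        body = json.loads(raw)
        self.handler(body)
        # Acknowledge only after the handler has returned.
        self.acked.append(body["image_id"])

    def run(self):
        """Drain the queue; a long-running worker would block and loop."""
        while True:
            try:
                raw = self.source.get_nowait()
            except queue.Empty:
                break
            self.on_message(raw)


inbox = queue.Queue()
for image_id in (1, 2, 3):
    url = f"https://example.com/{image_id}.jpg"
    inbox.put(json.dumps({"image_id": image_id, "url": url}))

seen = []
worker = Worker(inbox, lambda body: seen.append(body["url"]))
worker.run()
```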
After that, the images go on to the fingerprint step, and it's really simple: what we do is load the image from S3 and compute some identifiers [unintelligible] that we will use for further processing [unintelligible].
We will try to answer one question: are these images the same? Those images are the same, and really they have different [unintelligible], different sizes, so technically they are not the same file. But anyway, a user looking at the website will be disappointed, because the second image is not telling them anything new compared with the first one. So it is the same image.
And this is one of those things that can be done with some kind of hash. We use the ImageHash library. It implements different hashing algorithms, like average hashing, perceptual hashing and difference hashing. We ended up using difference hashing, with an addition to the library [unintelligible], which is doing a crop.
So what does that addition do to images? It crops them several times [unintelligible], and then we compute the hash for each of the crops. The reason is that we want to maximize the chances of matching cropped images. This kind of hashing assumes that similar images have a small distance between their hashes.
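To make the idea concrete, here is a minimal pure-Python sketch of difference hashing and the Hamming distance between two hashes. It is not the ImageHash implementation: the function names are hypothetical, and the input is assumed to be an image already shrunk to an 8x9 grayscale grid (the shrinking and the multiple crops mentioned in the talk are left out).

```python
def dhash_bits(gray):
    """Difference hash of a small grayscale grid.

    `gray` is a list of rows of brightness values, i.e. the image after
    shrinking and grayscale conversion. Each bit records whether a pixel
    is brighter than its left neighbour.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if right > left else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# Toy 8x9 gradient "image", a slightly brighter copy, and a mirrored copy.
base = [[x * 20 + y * 3 for x in range(9)] for y in range(8)]
brighter = [[min(255, v + 10) for v in row] for row in base]
mirrored = [list(reversed(row)) for row in base]

h_base = dhash_bits(base)
h_brighter = dhash_bits(brighter)
h_mirrored = dhash_bits(mirrored)
```

Because the hash only encodes whether brightness rises or falls between neighbours, a uniform brightness change leaves it untouched, while a structurally different picture (here, the mirrored one) ends up far away.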
At this point [unintelligible] there is a release: "OK, I have 1 million hotels that are going to reach production, and now you need to make sure that those hotels have images." So this goes to the API, and the API starts processing: it sends the messages [unintelligible], and then it is going to move to the next step, which is prioritizing the images, as we're going to see.
So what is a release? Basically, for us it's just [unintelligible]: the hotels and, for each of them, the providers. So if we look here, there are pairs of [unintelligible] and provider [unintelligible]. That's what we call a release.
The yellow line is the queue of messages in the deduplication step: you see it rises up to about 1 million and then starts descending. This means that the workers are processing the messages. The [unintelligible] line is the next step's queue, and you see that not all of the messages move forward to the next step. [unintelligible] If you have 2 releases and a hotel didn't change, you can reuse the images that you already computed in the previous release, and that's what we're trying to show here.
So the deduplication worker is going to get the hotel with all the providers and all the images that we have for them, and really what we want to do here is identify the duplicate groups, the images that are the same image. That's all. [unintelligible]
This is what it looks like. We have a set of images to be processed and we have groups. We start a group with the first image to be processed, and we try to expand it by comparing that image against the other images to be processed. We always ask the same question: is it the same picture? The elements we use are the hashes that we computed before, in the fingerprint step. So what "is the same picture" does is basically take all the hashes that we computed before and compare them with the Hamming distance. If that distance is below a threshold, we're going to consider the images to be the same.
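The grouping step just described can be sketched as below. Several things here are assumptions rather than the talk's actual code: the function names, the threshold value, the rule that one close pair of hashes is enough, and the choice to compare a candidate against every member already in the group (the talk only says the group is expanded from the first image).

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def same_picture(hashes_a, hashes_b, threshold):
    """True if any pair of hashes is closer than the threshold."""
    return any(hamming(a, b) < threshold for a in hashes_a for b in hashes_b)


def group_duplicates(images, threshold=2):
    """Greedy grouping. `images` maps image id -> list of integer hashes."""
    pending = list(images)
    groups = []
    while pending:
        group = [pending.pop(0)]  # start a group with the first image
        grew = True
        while grew:               # keep expanding until nothing new matches
            grew = False
            for candidate in list(pending):
                if any(
                    same_picture(images[member], images[candidate], threshold)
                    for member in group
                ):
                    group.append(candidate)
                    pending.remove(candidate)
                    grew = True
        groups.append(group)
    return groups


images = {
    "a": [0b00000000],
    "b": [0b00000001],  # one bit away from "a"
    "c": [0b11111111],
    "d": [0b11111110],  # one bit away from "c"
}
groups = group_duplicates(images, threshold=2)
```

Expanding against every group member lets near-duplicates chain together, which can merge groups too eagerly when the threshold is loose; comparing only against the first image is the stricter alternative.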
So how do we set that threshold? [unintelligible] we would let everything through into production and we would have tons of duplicated images. So what we do is [unintelligible]: basically we have a big sample of images [unintelligible] where you say that those images should be duplicates, and then you are able to [unintelligible], and that's how you tune this stuff.
And then we move to prioritization, which is choosing the best images. Now we know which images are duplicates, and now we need to pick the best one.
Prioritization takes the duplicate groups again and wants to pick the best one from each group. So which is the best image? The decision is based on the amount of pixels, the resolution, the histogram of colors [unintelligible]. It is going to sort them [unintelligible]. Of course, sometimes partners tell you "this image should be the first one", so we have [unintelligible] that we're going to use, and then we decide which images are the ones that reach production.
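As a rough sketch of this prioritization, the snippet below picks one winner per duplicate group and orders the winners. The `Candidate` fields, the ordering of the criteria and the `distinct_colors` proxy for the color histogram are all assumptions made for illustration; the talk names the signals but not how they are weighted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    width: int
    height: int
    distinct_colors: int          # crude proxy for the color histogram
    partner_pinned: bool = False  # partner asked for this image to go first


def score(c):
    """Sort key: pinned images first, then more pixels, then more colors."""
    return (c.partner_pinned, c.width * c.height, c.distinct_colors)


def pick_best(group):
    """Best image inside one duplicate group."""
    return max(group, key=score)


def prioritize(groups):
    """Pick one winner per duplicate group and order the winners."""
    winners = [pick_best(group) for group in groups]
    return sorted(winners, key=score, reverse=True)


groups = [
    [Candidate("a1", 640, 480, 900), Candidate("a2", 1280, 960, 1200)],
    [Candidate("b1", 800, 600, 5000, partner_pinned=True)],
    [Candidate("c1", 1920, 1080, 300), Candidate("c2", 1920, 1080, 800)],
]
ordered_ids = [c.image_id for c in prioritize(groups)]
```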
[unintelligible] Picking the best image is not an easy task, it's really hard. This image has tons of colors and tons of pixels, and it is not the best one. We have experimented with extracting features from images [unintelligible], but it's complex. So in the meanwhile we have tools so that you can manually sort those images [unintelligible]. It's not happening a lot, only in really specific cases [unintelligible].
Once you have those images and you know which ones are going to reach production, it's time to resize them. There's a worker that is going to get the payload [unintelligible], resize the images and put them into a bucket [unintelligible] that is going to be served through a CDN to the website and the apps.
And this is what the worker looks like. We have mentioned Pillow, and I have to say it's quite easy: the resizing is done [unintelligible]. If you make an image smaller, you may want to improve the contrast: you create an image enhancer [unintelligible] and you can just change the contrast [unintelligible].
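The size calculation behind the resize step can be sketched without any imaging library. The target widths below are hypothetical (the talk mentions around four configurations but does not list them), and skipping sizes larger than the original is an assumption. The pixel work itself would then be handed to Pillow, for example `Image.resize` followed by `ImageEnhance.Contrast`.

```python
# Hypothetical target widths; the real configurations are not given in the talk.
TARGET_WIDTHS = (160, 320, 640, 1280)


def target_sizes(width, height, target_widths=TARGET_WIDTHS):
    """Return (w, h) pairs that keep the aspect ratio and never upscale."""
    sizes = []
    for target in target_widths:
        if target > width:
            continue  # skip sizes larger than the original image
        scaled_height = max(1, round(height * target / width))
        sizes.append((target, scaled_height))
    return sizes


sizes = target_sizes(1000, 750)
tiny = target_sizes(100, 100)
```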
And this is the final result: the images in production [unintelligible] after progressing through all of that.
And this is a schematic of what the whole thing looks like. Basically you have a build machine, which is just for deployment support: we build all the dependencies there, we create the wheels there, so the other workers are going to install their dependencies from there [unintelligible] a bucket [unintelligible] the wheels [unintelligible]. We have the images API [unintelligible], interfacing with [unintelligible]. Then we also have alarms on the queues, and if there's tons of work to be done it's going to spin up instances [unintelligible]. Then we have the instances of the workers, and everything is connected through [unintelligible], which is the centerpiece [unintelligible]. And that's pretty much it. Thank you. Questions?
that like the light is composed of these the the only thing that's been we say that the European the 1st phase of instead of a lot of this kind lay it hasn't been a problem the point we how many in the press and usage and we're using like that because an instance I know that you know that there is a reason for that is that we assume so we have kind the 1 the people but it's more work but basically you you don't like basic words for what has been detected in that image like assume the ICT about other than that we very to to get rid of those what the issues but it's was you have to use the groups of translated the all you but of no no usually we try to address that the battery you we know that they're not visible provided images so we will have another little interpretation of it we have the sequence of the so we say about the setting of space is the maintenance of the and of we we just and as a last resort we just said resolution and the system of colors so if you have on the blue blue-collar it will go into more than a year so you know but it's quite well so we have here
and I think of a so the must involve the matrix here computers and your computers are basically we try not to a wrong image it is so you have a group of that we don't want but the soul I'll give reasons computers because we ancient among groups of images but the 91 % of group of images that we have December sample pure business out of all the words that we add up 97 % have assume that like the same image I call that world model so yeah only yes it's like the for uniform recognizing features when all right yeah it's pretty easy to answer we just have the workers which are independent of John completely we we have done the work there is there and we knew that we were going to Buffalo with the with the general API so where implement the emulsifies wages choose to to in a school alchemy and not positive was sentenced to the just for the use of this for a but the question is if the the last thing what did the browsable at the idea that the Berlin is what will happen with is candidates maybe the less important of for the competences the workers are much the that's it

