Implementing a Sound Identifier in Python



Formal Metadata

Title Implementing a Sound Identifier in Python
Title of Series EuroPython 2016
Part Number 14
Number of Parts 169
Author Macleod, Cameron
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI 10.5446/21115
Publisher EuroPython
Release Date 2016
Language English

Content Metadata

Subject Area Computer Science
Abstract The talk will go over implementing a Shazam-style sound recogniser using DSP techniques and some fantastic libraries. It will cover implementation, challenges and further steps. The project is still a work in progress and the code is available on GitHub. It was inspired by the Over-the-Air Audio Identification talk at FOSDEM 2016.
So now we have Cameron Macleod speaking about implementing a sound identifier, and I think I'm really excited about it. [unclear] So, the last talk for today, except for the lightning talks of course. Have fun.

Hello, hello. OK, fantastic. Thank you for coming. I'm Cameron, and as a brief overview, this talk goes through the theory behind how Shazam used to work. I don't think it still works like that; they published a paper on it, and they've probably got something better now. A quick disclaimer: I'm not an expert on literally anything, and therefore everything contained within this presentation should be taken with a pinch of salt.

So, music information retrieval. Imagine you are in the car, or if you don't have a car, you're listening to the radio, and of course you don't control the music that's being played. A song comes on, it might be Britney Spears or something, you don't know it, and you're thinking: this is pretty good, I could get into this, and you want to find the name. Nowadays you've got things like Shazam, and have had since 2001, to quickly identify what the song is called, the artist, et cetera. This is the basic problem from which music information retrieval comes. Music isn't easily searchable as a thing. If you've got two different recordings or different encodings of the same song, or the same piece of audio, they're going to vary quite wildly in representation, and the files that contain music are also quite large. Shazam solves this using something called fingerprinting, and we'll go through that later. But you've also got other applications within the field, such as score search for a musician. You've got that in this paper, [unclear] playing and looking at the same time, which I think is quite impressive. You can also search by humming nowadays; I can't remember the name. Those are some of the applications that exist, and I'll talk mostly about Shazam, but the techniques I talk about can be applied to other fields as well. It's a little bit magic, I'm going to be honest, but hopefully you'll understand by the end.

OK, so why use Python? Surely it's a pretty bad choice for real-time processing, where you want to go fairly quickly if you want the user not to give up and just go away. But the thing is, it's actually not that bad at data processing. We've got libraries like NumPy and SciPy, and matplotlib for visualising data when you're developing, and Python is actually quite good for this. A lot of NumPy is written in C, so it's very fast. Not only that, but the best language, in my opinion, is the language you know best, and given which conference this is, that's probably going to be Python for most of us.

So, the obligatory demo. I spent quite a long time trying to get this to work, and, well, it totally works, but I'm not going to show you because it's far too cool for you guys. Yeah, it doesn't work. The challenge is for you to go out and build it after I've told you how it works.

A quick show of hands: if I use the words "signals" and "Fourier transforms", who would understand what I meant? OK, fantastic. For those few people who wouldn't: with a Fourier transform you take a signal in the time domain, so it's amplitude against time, and you extract frequencies from it. You don't need to know the maths behind it, because it's awful; if you've ever seen the equations, they've got imaginary numbers and cosines and so on. A signal is just information as well. You'll also hear me using the phrase FFT. FFT is the fast Fourier transform, an algorithm generally used to calculate these things, and it's fascinating if you're into recursion and maths.

So the basic structure of the application is a normaliser, a fingerprinter, some storage afterwards, and then a recogniser. The normaliser is not normalisation like you might have on iTunes. Usually when people in audio talk about normalisation, they take multiple pieces of audio and make them all the same volume. This is not that. This is taking audio of different sample rates (how often it's sampled), different bit depths (that's the resolution of it) and different formats, and turning them into one single form. This is great because it means we only have to write one fingerprinter, and it saves development time. I'm not going to dwell on it. You might be able to use [unclear], which is a library that basically wraps FFmpeg [unclear] and of course also does other multimedia, so video as well. But I went to the page and the last time it was updated was about 1954, so I decided against using it. There's also avcodec, which is the C version, and you could use that directly with ctypes, but I'm sane and therefore decided against that.

So, the fingerprinter. Why would I want to fingerprint? Well, comparing the audio directly doesn't work: different recordings, different representations, et cetera. We also want to compress it, which means you've got a smaller storage requirement. If you're taking fingerprints that come to around 7 bytes each, and you've got a few hundred of these per song, that's a lot smaller than 3 megabytes, and if you're storing a million songs in your database then you want to compress as much as you can. This also gives us a faster search: the less you have to search through, the faster you go. And there's robustness in the presence of noise. If, for example, you're in a bar and they're playing some music, and it's something you like, you'd probably like that music to be identifiable, but of course there's the drunk guy chatting next to you. You need your clip to still be recognisable as the song despite the noise that's going on. You also want to be able to identify short recordings rather than the original; you don't want to stand there for the entirety of the song. [unclear]

So this is the basic diagram of how the fingerprinter works. It's fairly large, but we'll go through it step by step, starting with the first one. First off, we take the audio, which looks a lot like the left graph, but [unclear] and longer, and we split it up into small windows so that you can get the frequencies of each individual window, and then we Fourier transform each one. The advantage of this is that, if you look at the previous slide, this noisy signal at the bottom is the same signal as the one above, but with noise added, and in the Fourier transform you can still see the 50 Hz quite clearly. So this helps us protect ourselves from noise.

The other thing you want to allow for here is that humans don't experience frequency in sound linearly. If you played a 100 Hz sound and a 200 Hz sound, and then a 10,000 Hz sound and a 10,100 Hz sound, the second pair is going to seem a lot closer together, because we hear logarithmically. So there's something called the mel scale, which is basically a logarithmic scale, and it has the added benefit of compressing the data: the highest mel we can hear is about 4,000, whereas the highest frequency we hear is about 20 kHz.

Sorry, I skipped something there. You take this and you make it into a spectrogram. What we do is take each Fourier transform and shade it by magnitude, with the lower values made light; we turn it on its side, and then stack a bunch of these across the track. This gives you something called a spectrogram, which you can see on the left. It's basically a representation of frequency over time. You want to find peaks in this because, as we saw before, they survive noise.
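As an editorial illustration of the windowing step just described, here is a minimal sketch that splits a noisy tone into windows and transforms each one. It is not the speaker's code. The sample rate, window size, noise level and the 50 Hz test tone are all assumptions chosen for the example, and a naive DFT from the standard library stands in for a real FFT (in practice something like `numpy.fft.rfft` would be used).

```python
import cmath
import math
import random


def dft_magnitudes(window):
    """Magnitude of each non-negative frequency bin (naive O(n^2) DFT)."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(window)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, window_size):
    """Split the signal into non-overlapping windows and transform each."""
    return [
        dft_magnitudes(samples[start:start + window_size])
        for start in range(0, len(samples) - window_size + 1, window_size)
    ]


SAMPLE_RATE = 1000   # Hz, hypothetical
WINDOW_SIZE = 100    # samples per window, hypothetical
TONE_HZ = 50         # hypothetical test tone

# A pure tone buried in Gaussian noise (seeded so the run is repeatable).
rng = random.Random(0)
noisy = [
    math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE) + rng.gauss(0, 0.5)
    for i in range(5 * WINDOW_SIZE)
]

frames = spectrogram(noisy, WINDOW_SIZE)

# Bin k of a window corresponds to k * SAMPLE_RATE / WINDOW_SIZE hertz.
loudest_hz = [
    frame.index(max(frame)) * SAMPLE_RATE / WINDOW_SIZE for frame in frames
]
print(loudest_hz)
```

Each row of `frames` is one column of the spectrogram. Even with the added noise, the strongest bin in every window is the tone, which is the property the fingerprinter relies on when it looks for peaks.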
With that, you basically run a nearest-neighbour search on it, so you just check for the local maxima. I think what I did was sort them by size and then pick the top 30 or whatever.

So we're back to the slides. You've got your time points and you've got your maximum points, which are the dots on top, and you divide these into larger regions. In each region you want to find something called an anchor point, which you use for the pairing. In the original Shazam paper, which you can find afterwards if you want, he says, OK, you need to pick an anchor point, and you ask how, and of course the paper doesn't say, which is unfortunate. So I decided the largest one in that region would be the anchor point. We then pair it with every single point in the following n regions.

The reason we're doing this, instead of just storing the frequencies of all the maximum points, is that it adds entropy, so it makes songs more distinguishable. If you had Metallica [unclear] and [unclear], they might have some of the same frequencies, but we'd probably agree they're fairly different songs. [unclear] By pairing, you've not only got the frequency and the frequency, but you've also got the time difference between them, which adds more information and distinguishability.

So here is your hash. We don't bother with SHA-256 or whatever; we just take two points and connect them together. So you've got f1, frequency 1; f2, frequency 2; and the difference in time. You also store this along with the time of the first point, so you can identify where in the track you are, and the ID of the song that you're fingerprinting, so you can identify which song you actually have. That's important; don't forget that.

So, storage. You've now got f1, f2, delta t and t1, and these pack together quite nicely. You want it to be fixed size, because variable-size storage is [unclear], it's quite slow comparatively, and it's easier this way. Frequencies are in mels. As I said, around 4,000 mels is about as much as we can hear, so we give it values up to 4,096. Time is in milliseconds; that's arbitrary, and you could do it in system ticks if you felt particularly masochistic. That means delta t can store up to 1,024, which gives us about a second between the anchor and each point, which is fair enough. It gives you about [unclear] for each region, and you don't want anything more than that. The windows you feed into the Fourier transform will be about 60 ms each, depending on what you've configured it as. The reason we store this as milliseconds, as opposed to the number of windows, is that if you change the number of windows per region, or how many points you have, you can keep compatibility. This also imposes a limitation on t1, which is the time of the first point: it can be no more than 1,048,576 ms (I do worry that I've memorised that), which is about 17 minutes. If you're doing a music identification service, that's going to cover all but the most obscure tracks. For something else, if you have an identification service for podcasts, for example, or something you build, then you might want it to be bigger.

So you now have these in a database and you can search through it. If you just want something that works, there is an application called Dejavu, which is a Python implementation of this, and I swear I didn't know it existed before I started. You literally just install it on your computer [unclear], and it's fantastic if you're looking to do that. [unclear]

Quickly: this project was inspired by a talk at FOSDEM. I'll admit I didn't finish the project, but I learned a hell of a lot of things; I'd never done any of this maths or DSP stuff before.
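As an editorial illustration of the pairing and fixed-size storage described earlier in the talk, the sketch below picks the largest peak in each region as the anchor, pairs it with the peaks in the following regions, and packs each pair into 7 bytes. It is not the speaker's code. The bit widths are inferred from the limits quoted in the talk (4,096 mel values, 1,024 ms, 1,048,576 ms), while the region length, the fan-out, the field order and the sample peaks are assumptions made for the example.

```python
from collections import defaultdict

# Field widths inferred from the limits quoted in the talk.
FREQ_BITS, DELTA_BITS, TIME_BITS = 12, 10, 20
REGION_MS = 500    # hypothetical region length in milliseconds
FAN_OUT = 2        # hypothetical number of following regions to pair with


def pack(f1, f2, dt, t1):
    """Pack one (f1, f2, dt, t1) pair into 7 bytes (54 of 56 bits used)."""
    fields = ((f1, FREQ_BITS), (f2, FREQ_BITS), (dt, DELTA_BITS), (t1, TIME_BITS))
    value = 0
    for field, bits in fields:
        if not 0 <= field < (1 << bits):
            raise ValueError(f"{field} does not fit in {bits} bits")
        value = (value << bits) | field
    return value.to_bytes(7, "big")


def unpack(blob):
    """Inverse of pack(): recover (f1, f2, dt, t1) from 7 bytes."""
    value = int.from_bytes(blob, "big")
    out = []
    for bits in (TIME_BITS, DELTA_BITS, FREQ_BITS, FREQ_BITS):
        out.append(value & ((1 << bits) - 1))
        value >>= bits
    t1, dt, f2, f1 = out
    return f1, f2, dt, t1


def fingerprint(peaks):
    """Turn (time_ms, mel, magnitude) peaks into a list of packed pairs."""
    regions = defaultdict(list)
    for peak in peaks:
        regions[peak[0] // REGION_MS].append(peak)

    hashes = []
    for index in sorted(regions):
        # The anchor is the largest peak in the region.
        anchor = max(regions[index], key=lambda p: p[2])
        for later in range(index + 1, index + 1 + FAN_OUT):
            for t2, f2, _ in regions.get(later, []):
                dt = t2 - anchor[0]
                if dt < (1 << DELTA_BITS):  # skip pairs too far apart to store
                    hashes.append(pack(anchor[1], f2, dt, anchor[0]))
    return hashes


# Made-up peaks purely for demonstration.
peaks = [(100, 300, 5.0), (200, 900, 9.0), (700, 1200, 4.0), (1300, 450, 7.0)]
hashes = fingerprint(peaks)
print([unpack(h) for h in hashes])
```

The four fields add up to 54 bits, which fits the "around 7 bytes each" figure mentioned in the talk. A real store would also keep the song ID alongside each entry.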
There are a lot of talks here, and I've seen them, from [unclear] to machine learning to the micro:bit. [unclear] So pick something up and run with it; I want to see you guys again next year. For example, I saw one paper where a guy took multiple videos of a concert, filmed by different people, and synchronised them all using fingerprints like this. So do something yourself and get implementing; I think there's only one implementation of this in the world. [unclear]

Any questions? [unclear] So this is the notebook; it's available on GitHub as well. Quickly: it's not very well organised. You read in the audio; this is what it looks like, by the way, in case no one's ever seen that. [unclear] This is me figuring out how best to use [unclear]; they're quite difficult. [unclear] So there are your frequencies.

That's the equation off Wikipedia for the mel, but we don't take that; we just take the integers, so we've got smaller storage. [unclear] So this is your frequencies before the FFT, and that is your FFT, and as you can see, the frequencies that we want [unclear]. This particular audio was quite bassy; that's where we generally hear most. [unclear] The bottom stretches out quite a lot more, and you've got a lot more useful information for finding stuff. [unclear] This is the spectrogram, although I've been told that this doesn't look particularly like a spectrogram. Then we make constellation maps. [unclear] And then, actually, this is all on GitHub for you to look at if you particularly want to.
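As an editorial illustration of the mel conversion mentioned in the notebook walkthrough, here is a minimal sketch using the commonly quoted formula m = 2595 · log10(1 + f / 700), truncated to an integer as the talk describes. The function names are hypothetical and this is not the speaker's code.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in hertz to mels."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hz_to_mel_bucket(freq_hz):
    """Keep only the integer part, so the value fits a small fixed-size field."""
    return int(hz_to_mel(freq_hz))


# Equal steps in hertz shrink on the mel scale as frequency rises.
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(10_100) - hz_to_mel(10_000)
print(low_gap, high_gap)

# The top of human hearing stays below 4,096, so 12 bits are enough.
print(hz_to_mel_bucket(20_000))
```

The 100 Hz step near the bottom of the range spans far more mels than the same step near 10 kHz, which matches the point made in the talk about hearing logarithmically.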
So yeah, any questions? Yes. OK, you have to speak here because it's recorded.

Really silly question: what's a mel?

A mel is named after [unclear]. I'm pretty sure a mel is defined as 2,595 times the log base 10 of 1 plus your frequency over 700. It's a very weird unit for frequency, and it's basically a logarithmic scale for frequency. [unclear] Don't ask me where the 700 comes from. There was an experiment in the fifties where they called in musicians and asked them how different this is from that, and they just made a graph of it.

More questions?

Actually, I have one about this. Is the fingerprint of the entire song the collection of data points you get?

Yeah, so the fingerprint is the collection of all of those. [unclear] Each of these links would be one fingerprint, and for the entire song, all of them together would be the fingerprint of the song.

And how do you compare two fingerprints?

Usually you search for exact matches, so the storage is based on that. [unclear] That's why it doesn't work. But you compare them to see if they match exactly, and if not, you can basically count how many match. There was an article on this [unclear] by people much cleverer than me who know these things.

I have a question, if I can ask one myself. To what kind of distortion is it robust? For example, if you're a DJ and you're pitching, will it still work?

For time-based distortions it should work, well, within an extent. Obviously not if you're stretching it by 5,000 times, but if you stretch it a little, that should work, because you've still got [unclear]. It depends on how you search; depending on how accurate you want it to be, you decide how close the times should be. In terms of noise, in the Shazam research paper they said it worked perfectly. [unclear] Shazam wasn't always an app. It was a service that you'd ring up, and you'd hold your phone to the music, and they'd identify the music and text you the name, which was really quite cool. That's how it started. So it had to be robust to [unclear].

OK, more questions? Yes.

Do you know if there is some similarity in the hashing system to MusicBrainz?

I haven't come across that. [unclear]

MusicBrainz [unclear] fingerprints the tracks, the whole track each time. You basically have some MP3 files on your computer and want to fix the tags, [unclear], and it has a database of all of these tags. [unclear] If you have an MP3 that's been converted from one format to another, it would still be able to find which track it is.

OK. If it's based off the Shazam paper then yes, it's similar. However, a few other big papers have been released; "Computer Vision for Music Identification" was a very big one, and that uses a slightly different technique, in that it uses computer vision similarity techniques to calculate the [unclear]. That is a subject for a whole other talk.

OK, one more question? [unclear] Nobody? OK, if there are no more questions then we can close the session, and also the first day of EuroPython, [unclear] excitement and a lot of stuff. So, see you tomorrow.

