Implementing a Sound Identifier in Python
Formal Metadata

Title: Implementing a Sound Identifier in Python
Title of Series: EuroPython 2016
Part Number: 14
Number of Parts: 169
Author: McLeod, Cameron
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/21115 (DOI)
Language: English
Transcript: English (auto-generated)
00:01
So, now we have Cameron McLeod speaking about implementing a sound identifier in Python. I'm really excited about it. So it's the Shazam thing, but of course you know that, that's why you're here. Okay, so last talk for today, except for the lightning talks of course.
00:22
Have fun. Hello. Hello. Hello. Okay, fantastic. You can all hear me? Yeah. Thank you for coming. I'm Cameron. As a brief overview, this is literally just the theory behind how Shazam used to work.
00:40
I don't think it still works like that because they published a paper on it, therefore they've got something better now. But quick disclaimer, I'm not an expert on literally anything, and therefore everything contained within this presentation should be taken with a pinch of salt. Fantastic. Music information retrieval. So imagine you are in the car, or if you don't have a car, you're listening to the
01:01
radio, and of course you're not controlling the music that's being played. So a song comes on, it might be Britney Spears or something, you don't know it's Britney Spears, but you think, you know, it's a pretty good song, I could get into this, and you want to find the name of it. Of course, nowadays you've got things like Shazam, you've had it since 2001, to just identify quickly what this song is, and the artist, etc.
01:24
But this is where, this is the basic problem from which music information retrieval comes from. So, music isn't easily searchable, as a thing. If you've got two different recordings, or two different encodings of the same song
01:41
or the same piece of audio, it's going to vary quite wildly in bit representation. And they're also quite large, the files that you find that contain music. So Shazam solves this using something called fingerprinting. We'll go through that in a bit. But you've also got other applications within the field, such as Score Search. If you're a musician, you've got that little piece of paper that you flick through quickly
02:03
and incredibly because you're playing and you're flicking at the same time, I think it's quite impressive. And you're flicking through that, and there's a search for that. You can search by humming nowadays, I can't remember the name of the applications that do it, but they exist. And today we're going to talk mostly about Shazam, but the techniques I'll talk about will apply to other fields as well.
02:23
It's a little bit magic, I'm going to be honest. But hopefully we'll understand it by the end. Okay, so why did I choose Python for this? Because surely it's a pretty bad choice for real-time processing, where you want things to go fairly quickly if you don't want the user to give up and just put their phone away.
02:42
But the thing is, it's actually not that bad for data processing. We've got languages, sorry, libraries like NumPy, SciPy, Matplotlib for visualizing data when you're developing, and Python's actually quite good for this. A lot of NumPy is written in C, so it's fairly fast. Not only that, but the best language in my opinion is the language that you know the
03:02
best. And I think since we're all at a Python conference, that's probably going to be Python for us. So here's the awesome demo that totally works. I spent quite a long time trying to get this to work, and well, it totally works, but I'm not going to show you because it's far too cool for all you guys.
03:21
Yeah, it doesn't work. I didn't build it. The challenge to the listener is for you to go out and build it after I've told you how it works. So quick show of hands, if I was to say the words signals and Fourier transforms to people, who would understand what I meant? Ooh, lovely, fantastic.
03:42
So yeah, for those few people that didn't, a Fourier transform, you take a signal in a time domain, so it's time against amplitude, and you extract the frequencies from it. You don't need to know the maths behind it because they're awful. If you've ever seen the equations, it's got imaginary numbers and cosines and, ugh, disgusting.
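A minimal sketch of that idea in NumPy (the two-tone test signal and the sample rate are illustrative, not from the talk):

```python
import numpy as np

# One second of a 50 Hz and an 80 Hz tone, sampled at 8 kHz.
rate = 8000
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 80 * t)

# rfft is the FFT for real-valued input; each output bin is the strength
# of one frequency, and rfftfreq says which frequency each bin holds.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / rate)

print(freqs[np.argsort(spectrum)[-2:]])  # the two loudest bins: [80. 50.]
```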
04:01
So a signal is just information as well. You'll also hear me say the words FFT, or the phrase FFT. That's just fast Fourier transform. It's an algorithm used generally to calculate these things, and it's faster than doing the infinite summation that the maths definition uses. So the basic structure of the application is a normalizer, a fingerprinter, and some
04:23
storage afterwards. And the thing to recognize with a normalizer is it's not a normalizer like you might have on iTunes. Usually when people in audio talk about normalizers, they take some audio, multiple pieces, and they make it all at the same sort of volume. This is not that. This is taking audio of different sample rates, so how fast you've been digitizing it, different
04:44
bit depths, so the resolution of it, and different formats, and turning it into one single format. This is great because it means we can write only one fingerprinter, and it saves you a lot of development time. Not that I've finished that. So you might be able to use pyffmpeg for this, which is a library that basically wraps
05:03
FFmpeg, and FFmpeg is a converter for all sorts of multimedia. It does video as well, but I looked at the page and the last time it was updated was about 1934, so I decided against using that. There's also libavcodec, which is the C version, and you could use that directly with ctypes, but I've used ctypes before and I decided against that.
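As a hedged sketch of what that normalizer stage could look like, here is one take using pydub, a different FFmpeg wrapper from the ones mentioned; the target format is an assumption:

```python
from pydub import AudioSegment  # pip install pydub; needs ffmpeg installed

def normalise(path):
    """Decode anything ffmpeg understands and coerce it into one
    canonical shape: mono, 44.1 kHz sample rate, 16-bit samples."""
    audio = AudioSegment.from_file(path)
    return (audio.set_channels(1)
                 .set_frame_rate(44100)
                 .set_sample_width(2))  # 2 bytes per sample = 16-bit depth
```

With every input reduced to the same shape, one fingerprinter can handle all of it, which is exactly the development-time saving described above.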
05:22
but I've used C types before and I decided against that. So, finger printer. What do we want to do with a finger printer? Well, comparing like for like with audio, as I said, doesn't work. Different bit depths, different representations, etc. So we want to compress it as well, not just that.
05:41
You've got smaller storage if you're taking fingerprints that come to about seven bytes each, and you've got, what, a few hundred of these per song, which is a lot smaller than three megabytes. And if you're storing a million songs in your database, then you're going to want to compress it as much as you can. This also gives us a faster search.
06:01
The less you have to search through, the faster you can go through it, and robustness in the presence of noise. So if you're, for example, at a gig, before the gig starts, they play this music on the speakers, and because it's a gig of someone you like, because you're there, you probably like the music that they're playing on the speakers, you might want to identify it, but of course there's the drunk guy shouting next to you.
06:22
You want your phone to still recognise what song it is, despite this noise that's going on. You also want it to be able to match short recordings to the full original, because you don't want to stand there for the entirety of the song, just holding your phone like this, like a proper numpty. It's not much fun. Okay, so here's the basic diagram of how the fingerprinter works.
06:44
It's fairly large, but we'll go through it line by line, starting with the first one. So these are the diagrams you should have seen before. First off, we take the audio, which looks a lot like the left graph, well, but wigglier, and longer, and we split it up into smaller wiggly bits, so that you can get the frequencies
07:04
of each individual bit. We then do the Fourier transform on it. And the advantage of this is that if you looked at the previous one, the next signal you're about to see is the same signal here, but with noise added.
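That split-into-windows-and-transform step might look something like this sketch (window and hop sizes are illustrative, not the talk's configuration):

```python
import numpy as np

def windowed_ffts(samples, window_size=512, hop=256):
    """Slice the signal into short, overlapping windows and FFT each one,
    giving frequency content over time. The Hann window tapers each
    slice's edges to reduce spectral leakage."""
    frames = []
    for start in range(0, len(samples) - window_size, hop):
        frame = samples[start:start + window_size] * np.hanning(window_size)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (number of windows, frequency bins)
```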
07:22
And the Fourier transform, you can still see the 50Hz and the 80Hz quite clearly. So this helps us to protect ourselves from the noise floor. The other thing you want to do while we're here is humans don't listen to or don't experience frequency and sound linearly.
07:41
If you played someone a 100 Hz sound and then a 200 Hz sound, and then played them a 10,000 Hz sound and a 10,100 Hz sound, the second pair is going to seem a lot closer together, because we hear logarithmically. So there's something called the mel scale, which is basically a logarithmic scale, and the added benefit of that is that it cuts down the data, so the highest mel we can hear
08:03
is about 4,000 mels, whereas the highest frequency we can hear is about 22 kHz. So you then take this, and you hash it. Oh, hang on, have I skipped something? Yes, I've skipped something. Sorry, here we are. So you take this, and you make it into a spectrogram.
08:22
What we've done is we've taken this Fourier transform, and we've got the high bits, the large amplitude bits, and we've made them dark, and we've got the low bits, the low amplitude bits, and we've made them light, and we've turned it on its side and done this a bunch of times across the entire track. And this gives you something called a spectrogram, which you can see on the left. It's basically a representation of frequency over time of the song or the piece of audio
08:45
you've got. So we want to take the highest points of this, because as we saw before, the highest points always survive the noise. And with that, you basically run a nearest neighbour search on it. So you just check for local maxima, and I think the way I did this
09:05
was I just sorted them and then picked the top 30 or whatever. Yes, so we'll go back to the slides from before. You've got your anchor points... you've got your maximum points now, which exist across the whole track, and you split these into
09:21
larger regions. In each region, you want to find something called an anchor point. And this anchor point is just used for pairing with other points. So in the original Shazam paper, which you can find online, I can show you a link if you want to afterwards, he says, okay, you need to get some anchor points and you
09:41
ask how. And of course, because it's a paper, it doesn't respond, which is unfortunate. So I took the maximum. I decided the largest point in that region would be our anchor point. And with this, you literally just pair it up to every single point in the following region. With the end region, skip it.
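A sketch of that peak picking and anchor pairing (the top-30 cut-off, the strongest-peak anchor, and skipping the end region follow the talk; the tuple layout is illustrative):

```python
def pick_peaks(region, n=30):
    """Keep the n strongest spectrogram peaks in one time region.
    Each peak here is a (time_ms, mel, magnitude) tuple."""
    return sorted(region, key=lambda p: p[2], reverse=True)[:n]

def pair_peaks(regions):
    """Pair each region's anchor (its single strongest peak) with every
    peak in the following region; the final region has no successor, so
    it is skipped."""
    fingerprints = []
    for current, following in zip(regions, regions[1:]):
        if not current or not following:
            continue
        t1, f1, _ = max(current, key=lambda p: p[2])    # the anchor point
        for t2, f2, _ in pick_peaks(following):
            fingerprints.append((f1, f2, t2 - t1, t1))  # (F1, F2, dT, T1)
    return fingerprints
```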
10:00
So the reason we're doing this instead of just storing all the frequencies, all the maximum points here, is because it adds more entropy. So it makes your song more distinguishable. So if you were to have Metallica's Wherever I May Roam, and you were to have Britney Spears' Oops I Did It Again, they might share some frequencies, but we'd all probably agree they're fairly different songs, even if you've never heard one of them.
10:22
You'll have heard one of them. So by pairing them, you've not only got that frequency and the other frequency, but you've also got the time difference between them. It's adding more information, more distinguishability. So here we are. Here is your hash. This is a hash.
10:41
We don't bother with SHA256 or whatever. We literally just take the two points, F1, F2, and we concatenate them together. So the bit on the left: F1, frequency 1, frequency 2, difference in time. And you also store this along with the time of the first point, so you can identify where in the track you are, and the ID of the song that you're fingerprinting,
11:03
so that you can identify which song you actually have. That's an important bit. Don't forget that. There we are. So storage. You've now got this F1, F2, DT, T1. And these pack together quite compactly. So you want it to be a fixed size, because variable size storage is, well, it's quite
11:26
slow to search through comparatively, and it's easier this way. So frequencies are in mels. As I said, 4,000 mels is about the highest you can hear. So we can fit that into 12 bits, 4,096 values. Time is in milliseconds here.
11:42
It's arbitrary. You could do it in system ticks if you felt particularly masochistic. And that means you can store the time difference in about 10 bits, so that's 1,024. That gives us a second's worth between an anchor point and each pair point, which is fair enough. It gives you about 512 milliseconds for each region, and you don't get anywhere near there.
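Those numbers line up with a fixed-width bit layout; a sketch, assuming 12 bits per mel frequency (4,096 values) and 10 bits for the time delta (1,024 ms):

```python
def pack_fingerprint(f1, f2, dt):
    """Pack (F1, F2, dT) into one fixed-size integer hash:
    12 + 12 + 10 = 34 bits, a little over four bytes."""
    assert 0 <= f1 < 4096 and 0 <= f2 < 4096 and 0 <= dt < 1024
    return (f1 << 22) | (f2 << 10) | dt

# T1 is stored alongside; 22 bits of milliseconds gives 2**22 =
# 4,194,304 ms, the roughly 70-minute ceiling mentioned in a moment.
```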
12:04
The windows that we split it into for the Fourier transform would be about 16 milliseconds, depending on what you configured it as. And the reason we store this as milliseconds, as opposed to the number of windows that we're using, is because then if you change the number of
12:22
windows per each region for the anchor points, then you can keep compatibility. So this also imposes a limitation on T1, which is the time of the first point. As that can be no more than... Here we go.
12:41
4,194,304 milliseconds. I didn't read that. I memorized it. So that's about 70 minutes. Which, if you're doing a music identification service, that's gonna get you the majority of all but the most obscure prog tracks. For something else, if you had an identification service for talk audio, for example,
13:01
that could be something you build, then you might want that field to be bigger. So you then put these into a database and you can search through them. What if you just want something that works? There is an application out there called DejaVu.
13:20
That's it. And it is this. It's a Python implementation of Shazam. I swear I didn't know it existed before I started this, but it does literally just work. You put it on your computer, you press play, and it goes... It's fantastic. If you're looking to do it, that's the GitHub address.
13:40
That's the guy's name. He's a genius. Go do something with it. One final takeaway. We may look at some code, because I've gone through this a bit quick. This project was inspired by a talk at FOSDEM. I admit I didn't finish the project, but I've learned a hell of a lot doing so. I've never done any of this maths-y DSP stuff before.
14:02
It's pretty cool. You should look into it. There are a lot of talks here, and I've seen them from Go to machine learning to the micro:bit. I hear we might be able to get micro:bits. That sounds cool. So pick something up and run with it, and I want to see you guys here next year. For example, I saw one paper where a guy took multiple videos of a concert
14:23
from different viewpoints, and he synchronized them all using fingerprints of the audio. That could be something you guys have a go at implementing. I think there's only been one implementation so far, and the world always needs more. Questions or code? Code. Let's look at some code.
14:43
Hello, code. Okay, cool. So this is a notebook. It's available on GitHub. I'll talk you through it quickly. It's not very well organized. You read in the audio. This is what a song looks like, by the way. It's quite cool. If anyone's never seen that before, well, now you have.
15:01
You then do... This was me figuring out how FFTs worked. They're quite difficult. That's what you get to start off with. You don't want that. If you get that, you don't want that. So there you go. Mel. There's your mel frequency. That's the equation off Wikipedia. Don't blame me.
15:20
We then take that. We just take the integers so that we get smaller storage. That function is far more complicated than it needs to be. And so this is normal. These are your frequencies before the integer conversion, and that's... Hang on. Sorry. That is your FFT. And as you can see, the frequencies that we want are all near the bottom
15:41
because that's... Well, I mean, this particular piece of audio is quite bassy, but that's where we generally tend to hear the most of it. So if you take the mel version, the bottom stretches out quite a lot more and you've got a lot more useful information for a lot less storage.
16:00
Defining some stuff. Defining some stuff. Not very interesting. Ooh, spectrograms. Matplotlib. I've been told it does spectrograms. This doesn't look particularly spectrogrammy to me. Constellation maps. I didn't actually plot that, but there's your constellation map. And then the hashing. This is all on GitHub for you to look at if you particularly want it.
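For reference, Matplotlib's built-in spectrogram call looks like this; the rising-chirp input is made up for the example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two seconds of a rising chirp, just to have something to plot.
rate = 8000
t = np.arange(rate * 2) / rate
samples = np.sin(2 * np.pi * (200 + 150 * t) * t)

plt.specgram(samples, NFFT=512, Fs=rate)  # time on x, frequency on y
plt.xlabel("time (s)")
plt.ylabel("frequency (Hz)")
plt.show()
```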
16:23
So yeah, actual questions. Hello? Yes, okay. So you have to speak here because it's recorded. Really silly question. What's a mel and was it named after whoever came up with it? I'm pretty sure it was.
16:40
And a mel is defined as 2595 times the log base 10 of one plus something involving your frequency. It's a very weird unit for frequency and it's basically a logarithmic scale for frequency as opposed to a linear one. Hang on, where's the equation?
17:01
Did I skip it? Yeah, I did. There you go. 2595 times the log base 10 of one plus your frequency over 700. Don't ask me where I got that from. It was an experiment in the 50s and they got a bunch of musicians and said, okay, can you hear this? Is that different from the one before? Is that different from the one before? And they just plotted that and made a graph based off of it.
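That spoken formula, written out; the integer truncation mirrors the smaller-storage trick from the notebook:

```python
import math

def hz_to_mel(f):
    """Hz to mels: 2595 * log10(1 + f / 700), as quoted above."""
    return int(2595 * math.log10(1 + f / 700))

print(hz_to_mel(22050))  # ~3923, the "about 4,000 mels" ceiling
```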
17:27
More questions? Actually, I have a question. So the fingerprint is of the entire song and it's just that collection of data points that you get? Yeah. So the fingerprint is the collection of... not that fingerprint,
17:46
that thing, you take this thing and you do this. So for that, each of these links would be one fingerprint and for the entire song, you do that and that together would be the fingerprint of your song.
18:02
So yeah, each of those would be a hash and those hashes together would be the fingerprint of your song. And so how do you compare the two fingerprints then? Was that the... You literally, so you search for exact matches. So the storage bit is the bit I didn't really do very well. That's why it doesn't work. But you compare the two matches, CFA match exactly.
18:20
If not, you can create basically a tally of how many match. And then there was another talk on this, I can link you to it, it's from FOSDEM. He's much cleverer than me and he knows these things. I have a question, so now I can ask one myself.
18:41
So to what kind of distortions is it robust? So for example, if you're a DJ and then you're pitching it, will it still work or is it then off? So time-based distortions, it should work because, well, within limits; if you obviously stretch it about 5,000 times, it's not gonna work.
19:02
But if you stretch it a little bit, that should work because you've still got the fingerprints. It depends on your... When you search it, it depends on how accurate you want the... Or you decide the time should be, how close the time should be. In terms of other noise, Shazam, when they released this paper, they said it worked perfectly over the GSM network.
19:23
So a bit of history for everyone. Shazam wasn't always an app. It was a service that you rang up and then it'd be like, hold your phone to the music, and you held your phone to the music, and then it would text you afterwards, which was really quite cool and that's how they started. So it'd be robust to the phone network and it's robust to TV noise and things like that.
19:43
Okay, more questions? Yes, so there. Do you know if there is some similarity in the hashing system as to MusicBrainz?
20:02
MusicBrainz? Ah, now, I have something for this. It's not that. That's in fact all the processes. What is MusicBrainz? I'm gonna try and remember what MusicBrainz is.
20:23
Oh, no, I haven't written down what MusicBrainz is. Sorry, I've no idea. MusicBrainz is basically a database of hashes of tracks, the whole track. And basically, it helps you. You have some MP3s or files on your computer and you want to fix the tags, to fix the title or the author.
20:42
It has a database of all of these tags and also a hash of that file. So if you have an MP3 converted from, I don't know, whatever, it will still be able to find out what track it is. Okay, so if it's based off of the Shazam paper, then yes, it's similar.
21:05
However, there are a few other big papers that have been released in here. Computer vision for music identification was a very big one. And that uses a slightly different technique in that it uses computer vision similarity things to calculate the hash. That is a subject for a whole other five talks.
21:24
So yes and no, maybe. Okay, more questions? I'm still fit enough to run to everyone. So also in the back, nobody? Okay, if there are no more questions, then we can close the session.
21:45
And that's also the end of the first day of EuroPython. I hope you had fun and learned a lot of stuff. So see you tomorrow. Thank you.