
Audio Classification with Machine Learning


Formal Metadata

Title
Audio Classification with Machine Learning
Subtitle
Learn how to classify sound using Convolutional Neural Networks
Title of Series
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Sound is a rich source of information about the world around us. Modern deep learning approaches can give human-like performance on a range of sound classification tasks. This makes it possible to build systems that use sound to, for example, understand speech, analyze music, assist in medical diagnostics, detect quality problems in manufacturing, and study the behavior of animals. This talk will show you how to build practical machine learning models that can classify sound. We will convert sound into spectrograms, a visual representation of sound over time, and apply machine learning models similar to those used for image classification. The focus will be on Convolutional Neural Networks, which have been shown to work very well for this task. The Keras and TensorFlow deep learning frameworks will be used. Some tricks for getting usable results with small amounts of data will be covered, including transfer learning, audio embeddings and data augmentation. A basic understanding of machine learning is recommended. Familiarity with digital sound is a bonus. Please see our speaker release agreement for details: https://ep2019.europython.eu/events/speaker-release-agreement/
Keywords
Transcript: English (auto-generated)
And he's now embarked on an IoT startup called Sound Sensing. Today, he'll talk to us about a topic related to his thesis, audio classification with machine learning. Hi, thank you.
So audio classification is not as popular a topic as, for instance, image classification or natural language processing, so I'm happy to see that there are still people in the room interested in it. First, a bit about me: I'm an IoT specialist.
I have a background in electronics from nine years ago. I worked a lot as a software engineer because electronics is mostly software these days, or a lot of software. And then I went to do a master's in data science because IoT to me is the combination of electronics,
sensors especially, software (you need to process the data), and the data itself: you need to somehow convert sensor data into information that is useful. Now I'm consulting on IoT and machine learning, and I'm also CTO of Sound Sensing; we deliver sensor units for noise monitoring. In this talk, my goal (we'll see if we get there) is that you, as a machine learning practitioner without necessarily any prior experience in sound processing, can solve a basic audio classification problem. We'll have a very brief introduction to sound, then go through a basic audio classification pipeline, then some tips and tricks for going a little bit beyond that basic pipeline, and then I'll give some pointers to more information. The slides, and a lot of my notes on machine hearing in general, a little broader than audio classification, are on this GitHub. So, applications.
There are some very well-recognized subfields of audio. Speech recognition is one of them; there, as a classification task, you have keyword spotting, so "hey Siri" or "hey Google". In music analysis, you also have many tasks.
Genre classification, for instance, can be seen as a simple audio classification task, but we're gonna keep it mostly on the general levels. We're not gonna use a lot of speech or music-specific domain knowledge, and we still have examples across a wide range of things.
I mean, anything you can do with hearing as a human, we can get close to with machines today, at least for many classification tasks. So in ecoacoustics, you might want to analyze bird migrations using sensor data to see their patterns, or you might want to detect poachers in protected areas, to make sure that no one is actually going around shooting where there should be no guns fired, and so on. Sound is also used for quality control in manufacturing, which is attractive because you don't have to go into the equipment or the product under test; you can listen to it from the outside. For instance, it's used for testing electrical car seats, checking that all the motors run. In security, it's used to help monitor large numbers of CCTV feeds by also analyzing the audio. And in medicine, for instance, you could detect heart murmurs, which can be indicative of a heart condition. So these are some motivating examples. And so, on to digital sound.
I'll just go very briefly through this. The first thing that is important is that sound is almost always, basically always, a mixture. Sound will move around a corner, unlike an image, for instance, so you'll always have sound coming in from everywhere. It will also travel through the ground and be reflected by walls, and all these things mean you always have multiple sound sources: the source of interest and then always other sound sources. In audio acquisition, of course, sound is air pressure. We need a microphone to convert it to an electrical voltage, then an ADC, and then we have a digital waveform, which is what we will deal with. It's quantized in time, by the sampling rate, and in amplitude. We still primarily deal with mono when we do classification; there are some methods around stereo, but they're not widely adopted, and you can also have multi-channel. We typically use uncompressed formats, as that's just the safest, although in a real-life situation you might also have compressed data, which can have artifacts and so on that might influence your model. After we have a waveform, we can convert it into a spectrogram. In practice this is a very useful representation, both for a human, for understanding what the sound is, and for the machine, in order to do detection on it.
So this one is a frog croaking, like croak croak croak, very periodically; you see a little gap. And then, it's hard to see, but at the top, in the higher frequencies, there are some cicadas going as well. This allows us to see both the frequency content and the patterns across time, and together this often lets you separate the different sound sources in your mixture. So we'll go through a practical example, just to keep it hands-on: environmental sound classification. So, given an audio signal of environmental sounds, these are everyday sounds that are around in the environment, for instance outdoors: cars, children playing and so on. It's very widely researched, and we have several open data sets that are quite good. AudioSet has, I think, tens of thousands or even hundreds of thousands of samples.
or even hundreds of thousands of samples. And in 2017, we reached roughly human-level performance. Only one of these data sets has an estimate for what is human-level performance, but we seem to have surpassed that now. And one nice data set is UrbanSound 8K,
which has 10 classes of 8K samples. They're roughly four seconds long, and nine hours total. And state-of-the-art here is around 80%, so 79 to 82 accuracy. So how to, in this, yeah, now we have spectrograms,
and these are the easy samples. This data set has many challenging samples where the sound of interest is very far away and then hard to detect. And these ones are easy. So you see the siren goes like woo, woo, woo, very up and down. And jackhammers and drilling have very periodic patterns. So how can we detect this
using a machine learning algorithm in order to output these classes? So I will go through a basic audio pipeline, skipping around like 30 to 100 years of history of audio processing,
kind of going straight to what is now the typical way of doing things. And it looks something like this. So first, in the input, we have our audio stream. It's important to realize that, of course,
audio has time attached to it, so it's more like a video than an image. And in a practical scenario you might do real-time classification, so this might be an infinite stream that just goes on and on. For that reason, and also for the machine learning algorithm, it's important to divide the stream into small, or relatively small, analysis windows that you will actually process. However, you often have a mismatch between how often you have labels for your data and how often you actually want a prediction; this is known as weak labeling, and I won't go much into it. In UrbanSound, there are four-second sound snippets, so that's what we're given in this curated data set. However, it's usually beneficial to use smaller analysis windows to reduce the dimensionality for the machine learning algorithm. So the process is that we'll divide the audio into these segments.
We'll often use overlap. Then we'll convert each window into a spectrogram; we'll use a particular type of spectrogram called a mel spectrogram, which has been shown to work well. Then we'll pass that frame of features from the spectrogram into our classifier, and it will output the classification for that small time window. And then, because we have labels per four seconds, we'll need to do an aggregation in order to come up with the final prediction for those four seconds, not just for one little window. We'll go through these steps now.
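As a rough sketch of that first windowing step (the helper name and sizes below are illustrative, not taken from the talk), splitting a mono waveform into overlapping analysis windows could look like this:

```python
import numpy as np

def split_windows(samples, window_length, hop_length):
    """Split a mono waveform into analysis windows.

    hop_length < window_length gives overlapping windows,
    e.g. hop_length = window_length // 2 is 50% overlap.
    Assumes len(samples) >= window_length.
    """
    windows = []
    start = 0
    while start + window_length <= len(samples):
        windows.append(samples[start:start + window_length])
        start += hop_length
    return np.stack(windows)
```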
So, first, analysis windows. I mentioned we often use overlap. This is specified in a couple of different ways; one is an overlap percentage. Here we have 50% overlap, which means that we're essentially classifying every piece of the audio stream twice. If we have even more overlap, say 90% overlap, then we're classifying it 10 times. That gives the algorithm multiple viewpoints on the audio stream and makes it easier to catch the sounds of interest, because in training the model might have come to prefer a certain sound appearing in a certain position inside the analysis window. So overlap is a good way of dealing with that. I mentioned we use a specific type of spectrogram. The spectrogram is usually processed
with what are called mel-scale filters. These are inspired by human hearing: our ability to differentiate sounds of different frequencies is reduced as the frequencies get higher. For low sounds we're able to detect small frequency variations, but for high-pitched sounds we need large frequency variations to notice a difference. By using this kind of filter on the spectrogram, we obtain a representation more similar to our ears, but more importantly, it's a smaller representation for our machine learning model, and it lets you merge related data in consecutive bins, for instance. So when you've done that, it looks something like this. The top is a normal spectrogram; you see a lot of small details.
In the bottom one, we've started the mel filters at 1000 hertz. This is bird audio, so it's quite high pitched, a lot of chirps up and down. And in the third one, we've normalized the data. We usually use log-scale compression, because sound has a very large dynamic range: sounds that are faint versus sounds that are very loud to the human ear differ by a factor of 1,000 or 10,000 in energy. So when you've applied the mel filters, log-scaled and normalized, you get something like the image at the bottom. In Python, this feature processing looks something like this (I'm not going to go through all the code in detail; there's a small sketch after the notes on normalization below). We have an excellent library called librosa, which is great for loading the data and doing basic feature pre-processing; some of the deep learning frameworks also have their own mel spectrogram implementations that you may use. But this is a general thing. On streaming: when people analyze audio,
they often apply normalization learned from the mean across the whole sample, four seconds in this case, or across their whole data set. That can be hard to apply when you have a continuous audio stream, which has, for instance, changing volume and so on. So what we usually do is normalize per analysis window; the hope is that there's enough information in roughly one second of data to do a decent normalization. Doing normalization like this has some interesting consequences when there is no input: if you have no input to your feature pre-processing, you're going to blow up all the noise. So you'll sometimes need to exclude very low-energy signals from being classified. Just a little practical tip.
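A minimal sketch of this feature extraction with librosa, normalizing per analysis window as described above (the parameter values are illustrative, not the exact settings from the talk):

```python
import numpy as np
import librosa

def logmel_features(samples, sr, n_mels=64):
    """Log-scaled mel spectrogram for one analysis window."""
    S = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=n_mels,
                                       n_fft=1024, hop_length=512)
    log_S = librosa.power_to_db(S, ref=np.max)
    # Normalize per analysis window rather than over the whole dataset,
    # so the same code also works on a continuous stream
    return (log_S - log_S.mean()) / (log_S.std() + 1e-6)

# Loading a clip first: y, sr = librosa.load("clip.wav", sr=22050, mono=True)
```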
So convolutional neural networks, they're hot. Who here has basic familiarity, at least gone through a tutorial or read a blog post about image classification and CNNs? Yeah, that's quite a few. So CNNs are the best in class for image classifications.
And spectrograms are image-like: they are a 2D representation, though with some differences. So the question is, will CNNs work well on spectrograms? Because that would be interesting. And the answer is yes; this has been researched quite a lot.
And this is great, because there are a lot of tools, knowledge, experience and pre-trained models for image classification, so being able to reuse those in the audio domain, which is not such a big field, is a major win. A lot of the recent research in audio classification can actually be a little bit boring, because much of it is taking last year's image classification tools, applying them and seeing whether they work. It is, however, a little bit surprising that this actually works, because the spectrogram has frequency on the y-axis,
typically, as it's usually shown, and time on the other axis. So movement or scaling in this space doesn't mean the same as in an image. In an image, if I have my face inside it, it doesn't matter where my face appears. If you have a spectrogram and a certain sound,
maybe it's like a chirp up and down, if you move that up in frequency or down, at least if you move it a lot, it's probably not the same sound anymore. It might go from a human talking to a bird. The shape might be similar, but the position matters. So it's a little bit surprising that this works, but it does seem to do really well in practice.
So this is one model that does well on UrbanSound. One thing you'll note, compared to a lot of image models, is that it's quite simple; it has relatively few layers. It's smaller than, or the same size as, LeNet. There are three convolutional blocks, with max pooling between the first two blocks, and that's the standard kind of architecture. This one uses five-by-five kernels instead of the more usual three-by-three; it doesn't make much of a difference, and you could stack another layer and do the same thing.
Then we flatten and use fully connected layers. This is from 2016, and it's still close to state of the art on this dataset, UrbanSound. So if you are training a CNN from scratch on audio data, do start with a simple model; there is usually no reason to start with, say, VGG16 with 16 layers and millions of parameters, or even MobileNet or something like that. You can usually go quite far with this kind of simple architecture, a couple of convolutional layers. In Keras, for example, this could look something like the following, where we have our individual blocks (convolution, max pooling, ReLU nonlinearity, the same for the second one) and our classification at the end, with fully connected layers. So this is our classifier.
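A minimal sketch of such a model in Keras (layer sizes and kernel counts are illustrative, not the exact model from the slide):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(64, 44, 1), n_classes=10):
    # Three small convolutional blocks, then fully connected layers
    return keras.Sequential([
        layers.Conv2D(24, (5, 5), padding='same', activation='relu',
                      input_shape=input_shape),
        layers.MaxPooling2D((4, 2)),
        layers.Conv2D(48, (5, 5), padding='same', activation='relu'),
        layers.MaxPooling2D((4, 2)),
        layers.Conv2D(48, (5, 5), padding='same', activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation='softmax'),
    ])
```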
We'll pass each analysis window through this, and it will give us a prediction for which class it was, out of the 10 classes in UrbanSound. And then we need to aggregate these individual windows, and there are multiple ways of doing this. The simplest thing to think about is majority voting: if we have 10 windows for our four-second clip, we do the predictions on each and then just say, okay, the majority class wins. That works rather well. It's not differentiable, though, so you need to do it as a post-processing step.
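A quick sketch of these two aggregation options, assuming an array of per-window class probabilities (function names are illustrative):

```python
import numpy as np

def aggregate_majority(window_probs):
    # window_probs: shape (n_windows, n_classes), softmax output per window
    votes = np.argmax(window_probs, axis=1)
    return np.bincount(votes, minlength=window_probs.shape[1]).argmax()

def aggregate_mean(window_probs):
    # Mean pooling over windows, then pick the most likely class
    return np.argmax(window_probs.mean(axis=0))
```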
You're making very coarse decisions at each step, though, so mean pooling, or global average pooling, across those analysis windows usually does a little bit better. And what's nice with deep learning frameworks is that you can also have this as a layer. For instance, in Keras you have the TimeDistributed layer, of which there are sadly extremely few examples online. It's not that hard to use, but it took me a little while to figure out how to do it. We apply a base model, which in this case is the input to this function.
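A minimal sketch of that kind of wrapper, assuming a base model that classifies a single window (the function name and shapes are illustrative):

```python
from tensorflow import keras

def build_windowed_model(base_model, n_windows, window_shape):
    # Input: a stack of analysis windows, e.g. (10, 64, 44, 1)
    inputs = keras.Input(shape=(n_windows,) + window_shape)
    # TimeDistributed runs the same base model (shared weights) on every window
    per_window = keras.layers.TimeDistributed(base_model)(inputs)
    # Average the per-window predictions ("probabilistic voting")
    outputs = keras.layers.GlobalAveragePooling1D()(per_window)
    return keras.Model(inputs, outputs)
```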
We pass it to the TimeDistributed layer, which essentially uses a single instance of your model, so it shares the weights across all the analysis windows and just runs it multiple times when you do the prediction step. Then we do global average pooling over these predictions, so here we're averaging the predictions. You can also do more advanced things, where you would, for instance, average your feature representation and then run a more advanced classifier on top of that. This mean pooling is quite often called probabilistic voting in the literature. So this gives us a new model, which will take not single analysis windows but a set of analysis windows, typically corresponding to our four seconds, for example 10 windows. So if you do this, plus a couple more tricks from my thesis, you can have a system working like this. In addition to building the model and so on, which I've gone through, we're also deploying to a small microcontroller using vendor-provided tools that convert the Keras model, and so on; that's roughly standard stuff that I didn't want to go into here. So, a little demo video.
Let's see if we have sound. So here are the 10 classes, and we have various sound samples. This is children playing, I think in Spain, since they said hola. What we also do here is threshold the prediction: if no prediction is confident enough, we consider it unknown. This is also important in practice, because sometimes you have out-of-class data. This is drilling, or actually the sample I found said jackhammer, and jackhammer is also a class. Drilling and jackhammer are, to my ear,
hard to distinguish sometimes. And the model can also struggle with that. There's a dog barking. And so in this case, all the classification happens on this small sensor unit, which is what I focus on in my thesis. Oh, there's a siren, a little bit louder.
And actually it didn't get the first part of the siren, this doo, doo, doo, doo, only the undulating sound later. And actually these samples are not from the UrbanSound dataset, which I've trained on, so they're out-of-domain samples, which is generally a much more challenging task.
Yes, so that's it for the demo. If you want to know more about doing sound classification on this sensor unit, which is very small, you can get my full thesis; both the report and the code are on GitHub, and it's also linked from the machine hearing repository. So I won't go into too much detail there. Now, some tips and tricks. We've covered the basic audio processing pipeline, a modern one, and that will give you results, generally quite good results, with a modern CNN architecture. And there are some tips and tricks,
especially in practice, when you have a new problem and you're not researching an existing dataset. Your datasets are usually much smaller, and it's quite costly and tedious to annotate all the data and so on. So there are some tricks for that. The first one is data augmentation. This is well known from other deep learning applications, especially image processing. Data augmentation on audio can be done either in the time domain or in the spectrogram domain, and in practice both seem to work fine. So here are some examples of common augmentations.
The most common, and possibly most important, is time shifting. Remember that when you classify an analysis window, maybe one second long, the sounds of interest in it, or what the individual convolution kernel sees, might be very short. Bird chirps are maybe 100 milliseconds at most, maybe even 10 milliseconds, so they occupy very little space in the image that the classifier sees. But it's desirable that the classifier can recognize the sound no matter where inside the analysis window it appears. Time shifting simply means that you shift your samples in time, forward and backward. That way the algorithm has seen that, okay, bird chirps can appear at any place in time, and it doesn't make a difference to the classification. This is by far the most important augmentation, and you can usually go quite far with just time shifting. If you do want a precise location for your event, so you want a classifier that can tell when the chirps appeared, in the 100-millisecond range, instead of just that there were birds in this 4- or 10-second audio clip, then you might not want to do time shifting, because you might want the sound to always occur in the middle of the window. But then your labeling needs to respect that.
Then time stretching: for many sounds, if I speak slowly or I speak very fast, it's the same meaning, and it's certainly the same class; both are speech. So time stretching is also very effective for capturing such variations, and so is pitch shifting. If I'm speaking with a low voice or a high-pitched voice, it's still the same kind of information, and the same carries over to general sounds, at least a little bit. So a little bit of pitch shift you can accept, but a lot of pitch shift might bring you into a new class; for instance, the difference between human speech and bird chirps might be a big pitch shift. So you want to limit how much you pitch shift. Typical data augmentation settings here are maybe 10 to 20% of time shift and pitch shift.
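A rough sketch of these waveform augmentations with librosa and numpy (the ranges are illustrative and should be tuned to the data):

```python
import numpy as np
import librosa

def augment_waveform(y, sr, rng=None):
    rng = rng or np.random.default_rng()
    # Time shift: roll the signal by up to +/-20% of its length
    y = np.roll(y, int(rng.uniform(-0.2, 0.2) * len(y)))
    # Time stretch: a small speed-up or slow-down
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    # Pitch shift: a couple of semitones at most, to stay within the class
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    return y
```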
You can also add noise; this is also quite effective, especially if you know that you have a variable amount of noise. Random noise works okay, and you can also sample from one of the many repositories of, basically, noise, mix that in with your signal and classify the result. Mixup is an interesting data augmentation strategy that mixes two samples as a linear combination of the two and actually adjusts the labels accordingly, and it has been shown to work really well on audio, also in combination with other augmentation techniques.
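A minimal sketch of mixup, assuming one-hot labels (the alpha value is a typical but illustrative choice):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)        # mixing weight, e.g. ~0.8 vs ~0.2
    x = lam * x1 + (1.0 - lam) * x2     # linear combination of the two samples
    y = lam * y1 + (1.0 - lam) * y2     # labels adjusted accordingly
    return x, y
```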
So yes, we can basically apply CNNs with the standard kind of image type architecture. This means that we can do transfer learning from image data. So of course image data is significantly different
from spectrograms. I mentioned the frequency axis and so on. However, some of the base operations that are needed, you need to detect edges, you need to detect diagonals, you need to detect patterns of edges and diagonals, you need to detect kind of a blob of area. Those are common kind of functionality needed by both.
So if you do wanna use a bigger model, definitely try to use a pre-trained model and fine tune it. For instance, most deep learning frameworks including Keras have pre-trained models, pre-trained on ImageNet.
The thing is that most of these models take RGB color images as input. It can work to just use one of those channels and zero-fill the other ones, but you can also just copy the data across the three channels. There are also some papers showing that you can do multi-scale input: for instance, one channel has a spectrogram with a very fine time resolution and another a very coarse time resolution, and this can be beneficial. But because image data and sound data are quite different, you usually do need to fine-tune.
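A rough sketch of this kind of fine-tuning in Keras, assuming log-mel spectrograms copied across three channels (the choice of MobileNetV2 and of how many layers to unfreeze is illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_transfer_model(input_shape=(128, 128, 3), n_classes=10):
    base = keras.applications.MobileNetV2(weights='imagenet', include_top=False,
                                          input_shape=input_shape, pooling='avg')
    # Freeze most pre-trained layers, fine-tune only the last ones
    for layer in base.layers[:-20]:
        layer.trainable = False
    inputs = keras.Input(shape=input_shape)   # spectrogram copied to 3 channels
    outputs = layers.Dense(n_classes, activation='softmax')(base(inputs))
    return keras.Model(inputs, outputs)
```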
So it's usually not enough to just apply a pre-trained model and then just tune the classifier at the end. You do need to tune a couple of layers at the end and typically also the first layer at least. Sometimes you fine tune the whole thing. But it is generally very beneficial. So definitely if you have a smaller data set
and you need high performance that you can't get with a small model, go with a pre-trained model, for instance MobileNet or something like that, and fine-tune it. Audio embeddings are another strategy, inspired by text embeddings, where you create, for instance, a
128-dimensional vector from your text data. You can do the same with sound. With Look, Listen, Learn (L3) you can convert a one-second audio spectrogram into a 512-dimensional vector. It has been trained on millions of YouTube videos, so it has seen a very large number of different sounds. It uses a CNN under the hood and basically gives you that very compressed vector as a representation for classification. I didn't finish a code sample here, but the latest work is OpenL3, from the "Look, Listen and Learn More" paper, and they have a Python package which makes it super simple: just import it, one function to pre-process, and then you can classify the audio basically just with a linear classifier from scikit-learn or similar. So if you don't have any deep learning experience and you want to try an audio classification problem, definitely go this route first, because it will basically handle the audio part for you, and you can apply a simple classifier after that.
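A minimal sketch of that route, assuming the openl3, soundfile and scikit-learn packages (the file name is illustrative):

```python
import openl3
import soundfile as sf

# Embeddings for overlapping one-second windows of the clip
audio, sr = sf.read('clip.wav')
emb, timestamps = openl3.get_audio_embedding(audio, sr,
                                             content_type='env',
                                             embedding_size=512)
clip_vector = emb.mean(axis=0)   # simple fixed-size summary of the whole clip

# Collect such vectors for all clips, then fit any simple scikit-learn
# classifier on them, e.g. sklearn.linear_model.LogisticRegression.
```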
One little tip: you might want to make your own dataset, right? Audacity is a nice editor for audio, and it has nice support for annotating: you add a label track and annotate on it. There are keyboard shortcuts for all the functions that you need, so it's quite quick to use. Here I'm annotating some custom data where we did event recognition. The nice thing is that the format it exports is basically a CSV file; it has no header and so on, but a single pandas line will
give you a nice data frame with all your annotations of the sound. Yes, so it's time to summarize. Oh, I have a "fix me" there, okay. So we went through the basic audio pipeline. We split the audio into fixed-length analysis windows, and we used the log-mel spectrogram as the feature representation, because it has been shown to work very well. We then applied a machine learning model, typically a convolutional neural network, and then we aggregated the predictions from each individual window and merged them together using global mean pooling.
As for models I would recommend trying first on new data: try audio embeddings with OpenL3 and a simple model like a linear classifier or a random forest, for instance. Then try a convolutional neural network using transfer learning; it's quite powerful, and there are usually examples that will get you pretty far. If you, for instance, pre-process your spectrograms and save them as PNG files, you can basically take any image classification pipeline that you have already, if you're willing to ignore this merging of different analysis windows, and use that. Data augmentation is very effective: time shift, time stretch, pitch shift and noise addition are basically recommended to use. Sadly there are no really nice go-to implementations of these as, for instance, Keras generators, but it's not that hard to do. Some more learning resources for you: the slides and also a lot of my notes in general are on this GitHub.
If you do want to get hands-on experience, TensorFlow has a pretty nice tutorial called Simple Audio Recognition and it's about recognizing speech commands which could be interesting in general but it's taking a general approach. It's not speech specific so you can use it for other things also. There's one book recommendation,
Computational Analysis of Sound Scenes and Events is quite thorough when it comes to general audio classification, a very modern book from 2018. So that's a nice one also. So, I think we have questions maybe?
We have 10 minutes for questions so please go to the microphones in the aisles to ask them. I think our first is there. Yeah, thanks John, very interesting application of machine learning. I have two questions, more questions. So, there's obviously like a time series component
to your data. I'm not so familiar with this audio classification problem but can you tell us a bit about time series methods, maybe LSTM and so on, how successful they are? Yes, yeah, time series is, intuitively one would really want to apply that because there is definitely a time component.
So, convolutional recurrent neural networks do quite well when you're looking at longer time scales. For instance, there's a classification task called audio scene recognition. For instance, I have a 10 or maybe 30 second clip. Is this from a restaurant or from a city and so on? And there you see that the recurrent networks that do have a much stronger time modeling,
they do better. But for small short tasks, CNNs do just fine, surprisingly. Okay, the other small question I had was, just to understand your label, the target that it's learning, you said that this is all very mixed,
the sound is a very mixed data set. So, are the labels just like one category of sound when you're learning or would it be beneficial to have maybe a weighted set of two categories when doing learning? Yeah, so in the audio classification task, the typical style or kind of by definition
is to have a single label on some sort of window of time. You can have multi-label data sets, of course, and that's a more realistic modeling of the world because you basically always have multiple sounds. So, I think audio set has multi-labeling
and there's a new urban sound data set now that also has multiple labels. And then you apply kind of more tagging approaches. You're using classification as a base. With tagging, you can either use separate classifier
per track or sound of interest or you can have a joint model which has multi-label classification. So, definitely this is something that you would want to do but it does complicate the problem. We have over there one person in the mic.
So, thank you for the information. You mentioned, about data augmentation, that we can also use mixup to take separate cases and mix them, and then the label of that mixup should be weighted as well, which kind of connects with the previous question. Yes. So, like 0.5 for one and 0.5 for the other, and how would that work? Yes, so mixup was proposed I think two or three years ago; it's a general method. You basically take your sound with your target class and you say, okay, let's take 80% of that, not 100,
and then take 20% of some other sound which is a non-target class, mix it together and then update the labels accordingly. So, it's kind of just telling you, hey, we're basically creating a lot of, there is this predominant sound but there's also this sound in the background. Okay, thank you.
Yes, we have a question. You mentioned the mel frequency ranges, but usually when you record audio with microphones you get up to 20,000 hertz. Yes. So, do you have any experience, or could you comment on, whether having the added information of the higher frequency ranges affects the machine learning algorithm
or other features that one could use? Yeah, that's a good question. Typically recordings are done at 44 kilohertz or 48 kilohertz for general audio. Often machine learning is applied at a lower sample rate, so 22 kilohertz, or sometimes just 16, and in rare cases even eight.
So, it depends on the sounds of interest. If you're doing birds, definitely you want to keep those high frequency things. If you're doing speech, you can do just fine on the eight kilohertz usually. Another thing is that noise tends to be in the lower areas of the spectrum. There's more energy in the lower end of the spectrum.
So, if you are doing birds, you might want to just ignore everything below one kilohertz for instance and that definitely simplifies your model, especially if you have a small data set. We have more questions. You need to go to the mic, either here or there. Quick question. You mentioned the editor that has support
for annotating audio. Could you please repeat the name? Yes, Audacity. Audacity, okay. And more general question. Do you have any tips? If, for example, you don't have an existing data set and you're just starting with a bunch of audio that you want to annotate first, do you have any advice for the strategies
like maybe semi-supervised learning or something like this? Yeah, semi-supervised is very interesting. There's a lot of papers, but I haven't seen very good practical methodology for it. And I think, in general, annotating a data set is like a whole other talk. But I'm very interested to come
and chat about this later. Thanks. So yeah, we have two more and then we're done. Very nice talk. My question would be, do you have to deal with any pre-processing or like white noise filtering? You mean to remove white noise? Exactly, because you just said like removing
or ignoring certain amount of frequencies? Yes, you can. I mean, scoping your frequency range definitely is very easy. So just do it if you know where your things of interest are. Denoising, you can apply a separate denoising step beforehand and then do machine learning.
If you don't have a lot of data, that can be very beneficial. For instance, maybe you can use a standard denoising algorithm trained on like thousands of hours of stuff or just a general DSP method. If you have a lot of data, then in practice, the machine learning algorithm itself learns to suppress the noise.
But it only works if you have a lot of data. Thanks. So thank you for your talk. Is it possible to train a deep convolutional neural net directly on the time-domain data, using 1D convolutions and dilated convolutions and stuff like this?
Yes, this is possible. And it is very actively researched. So it's only within the last year or two that they're getting to the same level of performance as spectrogram-based models. But some models now are showing actually better performance with the end-to-end trained model. So I expect that in a couple of years,
maybe that will be the go-to for practical applications. Can I do speech recognition with this? This is only like six classes; I think you have many more classes
if you want to classify words. If you want to do automatic speech recognition, so the complete vocabulary of English, for instance, then you can theoretically, but there are specific models for automatic speech recognition that will, in general, do better.
So if you want full speech recognition, you should look at speech-specific methods. And there are many available. So but if you're doing a simple task like commands, like yes, no, up, down, one, two, three, four, five, where you can limit your vocabulary to say
maybe under 100 classes or something, then it gets much more realistic to apply a kind of speech-unaware model like this. Okay, thanks for an interesting presentation. I was just wondering from the thesis, it looks like you applied this model to a microprocessor.
Can you tell us a little bit about the framework you used to transfer it from Python? Yes. We use the vendor-provided library from STMicroelectronics for the STM32, and it's called X-CUBE-AI. You'll find links in the GitHub. It's a proprietary solution, it only works on their microcontrollers, but it's very simple: you throw in the Keras model, it will give you a C model out, and they have examples for the pre-processing, with some bugs, but it does work. And the firmware code is also in the GitHub repository, not just the model, so you can basically download that and start going. Okay. Yeah, do join me here if you want to talk more about specifics of audio classification. I'll also be around later. Thank you. Thank you, Jon Nordby.