
Music transcription with Python


Formal Metadata

Title: Music transcription with Python
Part Number: 82
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Anna Wszeborowska - Music transcription with Python. Music transcription allows one to convert an audio recording to musical notation through mathematical analysis. It is a very complex problem, especially for polyphonic music; currently existing solutions yield results with approx. 70% accuracy or less. In this talk we will focus on transcribing a monophonic audio input and see how we can modify it on the fly. To achieve that, we need to determine the pitch and duration of each note, and then use these parameters to create a sequence of MIDI events. MIDI stands for _Musical Instrument Digital Interface_ and it encodes commands used to generate sounds by musical hardware or software. Let's see how to play around with sounds using Python and a handful of its powerful libraries. And let's do it in real-time!
Transcript: English (auto-generated)
So this is Anna Wszeborowska (I practiced it and still didn't quite get the beginning right), and she's going to be talking about music transcription. You've given her a big hand already, but do it again and enjoy the talk.
My name is Anna, and I work as a software developer in the music industry, in a company called Ableton. At Ableton we produce three main products. The main product is Live. It's a digital audio workstation which allows musicians to record, edit and produce music. Apart from that, it was actually designed as an instrument that you can take on stage and perform live, hence the name. Next is Link. Link is a technology that allows people to play together on electronic devices, thanks to syncing them in time over a wireless network. And last but not least, Push, which is actually what I work on at Ableton. It's an instrument that allows you to create your musical ideas without looking at a computer, although it's connected to Live.
But let's get back to the topic. What does it mean to transcribe music? Transcribing music means transforming an audio recording into musical notation, which means you simply write down what you hear as notes that can be interpreted by other people. There are some people who have the superpower of doing that by ear. I'm not one of them; my ears haven't been properly trained for it. That's why I'd rather figure out how to teach my machine to do it for me. So let's have a look at what we need to do. First of all, we have to figure out how to read the audio stream: what it is, what form it's going to take, and how to read and store it so that we can process it later. Then, how to determine that a note actually occurred and what note it was: was it a C, was it an E, what octave was it in? And then transcribe it, that is, write it down in some standardized way that can be interpreted by other software, other electronic instruments, or even other people. Okay, so let's go step by step. First question: how to read and store the data. First of all, we need to know what that data is.
Our audio input is basically a continuous waveform, which we unfortunately can't process as such with our computer; first we need to digitize it. Digitization comes in two steps: sampling and quantization. What does that mean? Sampling changes the signal into a sequence of samples, so we end up with a finite set of samples. How many samples? That's determined by a sampling rate that we choose, so the sampling rate defines how many samples are picked per second. There is one important thing we need to remember while sampling, one of the most basic and most important facts in all of digital signal processing, the so-called Shannon, or Nyquist, or simply sampling theorem. It says that to be able to later restore the continuous signal we had on our input, we have to make sure that our digitized signal doesn't contain frequency components above half of the sampling rate. What does that mean? What happens if there are higher frequencies in there? Such a frequency comes in and wants to represent itself in our sampled data, and what it does is assume a different frequency than its own, which is called aliasing: it takes an alias and disguises itself. Aliasing is a double curse, because first of all we lose the high frequencies, but we also corrupt our low frequencies, because we no longer know whether a frequency we see actually belongs to the original signal or is only there because of aliasing. So just make sure you pick the right sampling rate. The usual one is 44.1 kilohertz, and it allows us to encode basically everything audible to a human being, because we hear in the range from about 20 Hz to 20 kHz. Okay. So now our independent variable, time, contains a finite number of samples, but our dependent variable isn't finite yet, because each value can be any float. Let's say we want to encode our information in a set number of bits, say 8, so we have only 256 values available for encoding the data. That means we have to quantize the data. The simplest way to do it is just to find the nearest quantization level, the nearest representable value, to assign to each sample, that is, to round its amplitude. So both sampling and quantization end up restricting how much information remains in our digitized signal. Okay, so now we know what we get when we read our audio stream: it's probably going to be a rather large array of data.
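The two digitization steps just described can be sketched in a few lines of NumPy. This is my own illustration, not code from the talk: a 440 Hz sine is sampled at 44.1 kHz, well below the Nyquist limit, and its amplitudes are rounded to the nearest of 256 quantization levels.

```python
import numpy as np

SAMPLE_RATE = 44100  # 44.1 kHz: the audible 20 Hz to 20 kHz range fits below Nyquist

def quantize_8bit(signal):
    """Round float samples in [-1.0, 1.0] to the nearest of 256 levels."""
    clipped = np.clip(signal, -1.0, 1.0)
    levels = np.round((clipped + 1.0) * 127.5)  # integers in [0, 255]
    return levels / 127.5 - 1.0                 # back to [-1.0, 1.0]

# One second of a 440 Hz sine: sampling picks SAMPLE_RATE values per second,
# quantization then snaps each value to the nearest of the 256 levels.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
x = np.sin(2 * np.pi * 440.0 * t)
xq = quantize_8bit(x)
```

The quantization error is bounded by half a step, which is the "restriction of information" mentioned above: a finer level grid (more bits) shrinks it.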
We'll have to figure out what data structures to use to store it. And then, what do we want to do? We want to detect notes. First of all, we want to say, okay, a note occurred. Here we have a plotted recording, done on this wonderful thing. At the second event we can see exactly that I just played two notes, right? It's quite easy to see on a plotted waveform; we have to figure out how to calculate that. And then, knowing that I played two notes, in a spectrum graph we can see that these were different notes, because there are significant peaks at different frequencies. The last step is to think about a standard for encoding our notes so that other software can understand them later. So, to repeat our questions: we need to read and store the data, figure out how to detect notes, and decide how to represent them.
And before we go any further into my suggested implementation, let's see what we're aiming for. So, demo time, and everyone knows what can go wrong right now; audio processing, I'm sure, works. Just a small caveat: ghost notes may appear due to all the noise around, but let's see how it works. I'm going to play this beautiful thing that's been with me ever since I was in elementary school, probably one more, and probably last, time in my life. I have this microphone here that I'm going to play into, and I'm not going to play fancy melodies, just single notes. Okay, let's see if it works. I'm going to ask my algorithm to detect a note, detect its pitch, and play it back with a different sound. Let's see how that works. ... You get the idea. That went better than expected. So, what happened here is that we read chunks of data at a time and process them, trying to figure out whether a note occurred and what note it was. Then we create a note in a standardized format and send it to a synthesizer, which produces a different sound, the sound of a piano. Okay. So how did we do that?
I read the data using the PyAudio library, which is basically a set of Python bindings around PortAudio, a cross-platform library that enables you to play and record audio in real time. It supports blocking and non-blocking modes; non-blocking mode is based on invoking callbacks on a separate thread, and that's what we're doing here. So you basically need to instantiate a PyAudio object and open a stream, in our case for reading data: tell it what data format you want, how many channels you're going to read, and what the callback is. And most importantly, you need to start the stream and make sure the main thread doesn't terminate; we need to keep it alive, for example by putting a sleep in there. So let's see what a callback looks like. It receives the data, a frame count (the number of frames), timing information, and a status flag, and the data is a string at this point. Most importantly, it needs to return the frames together with a status flag. That flag tells our stream whether it should continue feeding us data, which happens when we return a continue flag, or whether it should terminate, in which case we return a complete flag.
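PyAudio itself needs an audio device, so as a rough sketch of just the callback contract described here, the following uses stand-in constants in place of pyaudio.paContinue and pyaudio.paComplete and a plain loop simulating the audio thread. It is an illustration of the shape, not runnable capture code.

```python
# Stand-ins for pyaudio.paContinue / pyaudio.paComplete, so this sketch runs
# without an audio device or the pyaudio package installed.
PA_CONTINUE, PA_COMPLETE = 0, 1

captured = []

def callback(in_data, frame_count, time_info, status):
    """Shape of a PyAudio stream callback: it receives a chunk of raw bytes
    plus bookkeeping arguments and must return a (frames, flag) pair."""
    captured.append(in_data)
    # Returning PA_COMPLETE here instead would tell the stream to stop.
    return (in_data, PA_CONTINUE)

# Simulate the audio thread delivering three 4-frame chunks of 16-bit silence.
for _ in range(3):
    frames, flag = callback(b"\x00\x00" * 4, 4, {}, 0)
```

With the real library, this function is passed as `stream_callback` when opening the stream, and the main thread only has to stay alive while the callback thread does the work.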
What's next? Next we have to figure out how to store our data, because here we're getting strings, which is maybe not the best type for data manipulation and calculations. So that's why we convert it to a NumPy array. Why NumPy? Because it's a lot more efficient than Python lists, due to its internal implementation, even though both implementations are in C. Because Python lists allow you to put objects of different types in the same list, and need to store the type information for each element, they can't really use vectorized implementations and can't really optimize the operations I'm going to use heavily here. NumPy arrays, though, speed things up massively, and they also give us for free some rather complicated but very common operations on big matrices, like convolution, transposing matrices, or even the fast Fourier transform. Perfect.
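Assuming the stream was opened with 16-bit samples, the conversion from the callback's byte string to a NumPy array might look like this sketch (the fake chunk and the normalisation step are my own additions for illustration):

```python
import struct
import numpy as np

# A fake chunk of five 16-bit little-endian samples, standing in for the
# byte string the stream callback hands us.
samples = [0, 1000, -1000, 32767, -32768]
raw = struct.pack("<5h", *samples)

# Reinterpret the byte string as int16 without copying, then normalise to
# floats in [-1.0, 1.0) for the spectral processing that follows.
chunk = np.frombuffer(raw, dtype="<i2")
signal = chunk.astype(np.float64) / 32768.0
```

`np.frombuffer` is the key call: it views the existing bytes as an array rather than building one element by element the way a Python list would.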
Okay, so now we've read our data, and we want to see whether something changed in the current chunk and a note occurred. Here I've plotted the short-time Fourier transform of the recording from the previous example. What does that mean, a short-time Fourier transform? It means the recording was divided into chunks, and for each chunk we calculated the power spectrum, to see how the energy of the spectrum changes over time. This way we can tell that there was a significant change in our spectrum, suggesting a new note began. We don't do quite that in our implementation, because we analyze one chunk of data at a time, but we do essentially the same thing: we calculate its power spectrum and compare it with the previous spectra. Since we want to measure how quickly the power of our spectrum changes over time, we use the so-called spectral flux, which is basically the difference between the current power spectrum and the previous one; it's plotted with a green line here, over the short-time Fourier transform. We could find the peaks here already, but there are some minor peaks.
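A minimal sketch of the spectral flux idea (my own illustration, not the talk's code; a common variant, used here, keeps only the positive bin-wise changes, so that only energy increases count towards an onset):

```python
import numpy as np

def power_spectrum(chunk):
    """Power spectrum of one Hann-windowed chunk."""
    windowed = chunk * np.hanning(len(chunk))
    return np.abs(np.fft.rfft(windowed)) ** 2

def spectral_flux(prev_spec, spec):
    """Sum of the positive bin-wise changes between consecutive spectra."""
    return float(np.sum(np.maximum(spec - prev_spec, 0.0)))

sr, n = 44100, 1024
t = np.arange(n) / sr
quiet = 0.01 * np.sin(2 * np.pi * 440.0 * t)  # near-silence before the note
loud = np.sin(2 * np.pi * 440.0 * t)          # the note sounding

flux_onset = spectral_flux(power_spectrum(quiet), power_spectrum(loud))   # large
flux_steady = spectral_flux(power_spectrum(loud), power_spectrum(loud))   # zero
```

A note onset produces a large flux value, while a steadily sounding note produces almost none, which is exactly what the green line in the plot shows.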
We might end up finding those too, and we don't want that, because they're basically noise. So we apply a thresholding function: we choose a number of chunks whose flux we average and multiply by a given constant, so that we only pick peaks that are above the resulting threshold, and among the non-zero values, only those that are bigger than the previous value. Here we can see that the thresholding function could be better, because it still leaves behind some peaks we don't want. In my implementation I chose a thresholding function that is far higher, because I wanted to make sure it's not too sensitive in an environment like this, so it probably performs better. These two parameters, the number of chunks we average over and the multiplier, are basically the knobs you can turn to make your application perform better. Okay, now we've picked the significant peaks.
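The thresholding just described, with its two tunable parameters, might be sketched like this (the function and the parameter names `history` and `multiplier` are mine, not from the talk's code):

```python
import numpy as np

def pick_onsets(flux, history=8, multiplier=1.5):
    """Indices where flux exceeds `multiplier` times the mean of up to
    `history` previous values and is also rising relative to the previous
    frame."""
    onsets = []
    for i in range(1, len(flux)):
        start = max(0, i - history)
        threshold = multiplier * np.mean(flux[start:i])
        if flux[i] > threshold and flux[i] > flux[i - 1]:
            onsets.append(i)
    return onsets

# A flat noise floor around 1.0 with one clear burst at index 10: only the
# rising edge of the burst should be reported, not its decaying tail.
flux = np.array([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0,
                 8.0, 7.5, 1.2, 1.0])
onsets = pick_onsets(flux)  # -> [10]
```

Raising `multiplier` or lengthening `history` makes the detector less sensitive, which is the trade-off mentioned above for noisy rooms.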
To find what pitch these notes had, we calculate the so-called cepstrum of the signal. The cepstrum is the inverse Fourier transform of the logarithm of the calculated spectrum; in other words, it's a kind of spectrum of a spectrum, and that's a useful way to think about it. You can treat it as information about the rate of change of the spectrum, so it's kind of a measure of time. You shouldn't think of it as the signal in the time domain, but it is correlated with time, in the following sense. We all know that frequency equals one over the duration of a single cycle of the wave. Knowing that, in the quefrency domain of the cepstrum (it's all wordplay: spectrum becomes cepstrum, frequency becomes quefrency), quefrency represents the duration of time cycles: high frequencies have shorter time cycles and are represented at the beginning of the quefrency domain, and lower frequencies at the end. Before we start finding the fundamental frequency in this cepstrum, we want to narrow it to the frequencies we're interested in. I narrowed it to the frequencies corresponding to the notes I would plausibly be playing tonight, so it's a pretty narrow range. These frequencies run from about 500 Hz to 1200 Hz; we can equivalently think of that as narrowing to time cycles between roughly 0.8 milliseconds and 2 milliseconds, where 2 milliseconds corresponds to 500 Hz. This is how the cepstrum works, more or less. Knowing this, we pick the maximum value, which in our example lies between indices 25 and 30 of the narrowed cepstrum. Now we have a value in the quefrency domain, and we have to transform that quefrency into a frequency, because that's what we're interested in. To calculate it, we simply divide the sample rate by the cepstrum peak index, which follows from what I've just tried to explain. One thing to remember: since we narrowed the cepstrum and found the peak in the narrowed version, the peak index has to be translated back to refer to the original cepstrum. That's what we're doing here, and the value ends up being 689 Hz. It's also nice to mention that the pitch detection applies a slight correction to our onset detection.
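A sketch of cepstral pitch detection along these lines (my own illustration; the window choice, the default search range, and the synthetic test tone are assumptions, not the talk's code):

```python
import numpy as np

def cepstral_pitch(chunk, sr, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the cepstrum: the inverse
    FFT of the log magnitude spectrum peaks at the quefrency (in samples)
    of the pitch period, and f0 = sr / peak_quefrency."""
    windowed = chunk * np.hanning(len(chunk))
    log_spec = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_spec)
    # High pitches mean short periods, so the frequency range inverts:
    # fmax bounds the low end of the quefrency search range, fmin the high.
    qmin, qmax = int(sr / fmax), int(sr / fmin)
    peak = qmin + int(np.argmax(cepstrum[qmin:qmax]))
    return sr / peak

# A harmonic-rich 441 Hz tone: its period is exactly 100 samples at 44.1 kHz,
# so the cepstrum should peak at quefrency 100.
sr, n, f0 = 44100, 4096, 441.0
t = np.arange(n) / sr
tone = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 6))

estimate = cepstral_pitch(tone, sr)
```

Note that `peak` here is already an index into the full cepstrum; narrowing only restricts where `argmax` looks, which mirrors the index-translation point made above.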
That's because we can then simply ignore onsets that fall outside the frequency range we're interested in. Okay. Now we've found the note; next we want to encode it in a form that can later be understood by our synthesizer. So what we do is choose something ready-made and massively used everywhere: the MIDI protocol. MIDI stands for Musical Instrument Digital Interface, and among other things it encodes the pitch and velocity of our notes. The MIDI messages we're interested in are note-on and note-off: note-on means start playing a note, note-off means stop playing it. A message consists of three bytes, as we can see here. In the first we say what kind of message it is and what channel we're using (there are 16 channels available). The second, a data byte, has our pitch encoded in seven bits, so just 128 values. And the last one is the velocity, meaning the strength with which the note is played, so roughly how loud or soft we perceive it. So this is how we transform a frequency into a MIDI note number: as we saw, we only have seven bits to encode our pitch, meaning the maximum value is 127, but our frequency was 689 Hz, so how do we encode that? Well, like this. This is part of a chart that tells you which note, in which octave, has which MIDI number and which frequency, and we can see that our detected frequency of 689 Hz is closest to the note F at 698 Hz, which has MIDI note number 77.
Okay, now we know what our note is, we can encode the MIDI message and send it to a different instrument. I chose a library called pyFluidSynth, which is again a set of Python bindings, this time around a thing called FluidSynth, software that allows you to play SoundFonts (files that encode an instrument) in real time. And yep, that would be it. What are the conclusions? Python is amazing for rapid prototyping. I wouldn't really use Python for production code, but for checking out different ideas and solutions, or even trying out different detection algorithms, it works like a charm. The whole thing you've just seen is not a lot of code; you can have a look, it's on GitHub. It's been on GitHub for the last two hours, as of today. The SoundFont I used isn't pushed there, because it was too big, so you have to download your own, but they're available online, no problem. Okay, so why is Python so good for rapid prototyping? First of all, thanks to amazing numerical libraries, which I briefly discussed before: NumPy makes things a hell of a lot easier. And of course I/O operations in Python are really simple, with not much messing around, which is really useful. And the APIs of the wrappers I used are really good and make the code very clean and very readable. So yeah, that's it.
We do have four minutes for questions; this hand went up immediately.

Hello, two questions actually. First: do you need a very good microphone, or does whatever work?

No, not a very good one. I just brought this one because with the laptop microphone I get noise from all around.

And second, what about instrument identification? I mean knowing that it's a piano or a guitar or a violin.

That's a different thing. Different instruments have different spectral features, so we would need to analyze different things. We probably wouldn't care about the energy distribution over time; you would rather care about the energy distribution overall, because different instruments have different characteristics there. So it would be quite different things to analyze, although with similar tools.
Thanks for the talk, it was super good. Quick question: this is for mono, so one melody, only one note playing at a time. What are the main challenges, or maybe advice you might have, if you want to do this for chords?

Yes, so the problem with chords is that it's hard to distinguish which notes were played, because the frequencies overlap. If I had a very performant algorithm for transcribing chords to present now, I would probably be a PhD by now. I'm certainly going to try out some things, but the solutions available by now are no more accurate than about 70%, so it's a rather complicated problem. I still think I'm going to keep working on it and try different things. A very interesting concept is the CQT, the constant-Q transform; you can have a look, this is what people tend to try for polyphonic transcription.
Like try to use for polyphonic So More questions, there you go. We had the same question You know that there is a story that says that
Wolfgang Amadeus Mozart when he was a child He heard a full play of a piece of music and he was Able to transcribe from his memory all the piece Your your system is able to to
hear a full piece of music and Try and print in music notes all the piece like Mozart The problem So this version and this version push to give up works only for real-time input so
One chunk in the time when one note at a time, I initially implemented it for Analyzing the whole like pre-recorded things files and so actually it can recognize notes and features of it in time and like
Reconstruct like a create MIDI note We're like keeping the time and and pitch and so on but unfortunately the problem as discussed with by answering the different question would be to like Have polyphonic music transcription, which I don't support yet. So that would sadly not work
Maybe in the future; I hope that's a direction. We have time for one more question.

So you do recognition of monophonic music. Do you know about implementations that also let the user teach the algorithm, something like: this is the noise, don't interpret it, or this is the timbre of my instrument, don't interpret anything else? Is that approachable? Because even in Ableton I see different algorithms for harmonic detection, for melody, for rhythm, and I don't know of any method for supervised detection. Is it practical at all, or would it be too difficult to implement?

I'm not sure I completely understand the question. Like, would it be up to me?

Yeah, kind of: I could tell the algorithm that I want it to transcribe this and not transcribe that, so it could, say, treat your ghost notes here as noise. Maybe you could just feed that noise to the algorithm and it would learn that it's just noise.

As I mentioned before, I think you would first have to have some kind of instrument recognition, you know, find some spectral features that could determine the instrument you want to sort out, or something like that. Not trivial to implement, I guess; I don't think it's on the radar, but it's certainly an interesting idea. You would probably first have to try out different spectral features, see what works best for distinguishing between different instruments, something that tracks the timbre, and then apply that as your filter for the feature detection. That would be my idea of how to approach it, but unfortunately I don't know of any ready implementations like this. There are people trying to separate instruments, and it's the same problem: the frequencies overlap and the characteristics are sometimes too similar. So yeah, I don't know of anything.

And that's all we have time for, so I'm sure you'll all join me in thanking Anna Wszeborowska.