Music transcription with Python

Automated Media Analysis

Speech transcript
[unintelligible] You are going to hear a talk about music transcription. Please give a warm welcome and enjoy. Hi everyone. [unintelligible] My name is Anna, and I work as a software developer in the music industry, at a company called Ableton. We have three main products. The main one is Live, a digital audio workstation which allows musicians to record and edit music, but which was actually designed as an instrument you can take on stage and perform with. Next there is Link, a technology that allows people to play together on electronic devices by syncing them in time over a local network. And last but not least there is Push, which is actually what I work on day to day: an instrument that allows you to create your musical ideas without looking at a computer, although it is connected to one.
But let's get back to the topic. What does it mean to transcribe music? Transcribing music means transforming an audio recording into musical notation, which means you basically write down the notes you hear so that they can be interpreted by other people. There are some people who have the superpower of doing it by ear. I'm not one of them, and my ears haven't been properly trained to do it, so that's why I tried to figure out what the challenge was. So let's have a look at what we need in order to do it in software. First of all, we have to figure out how to read the audio: think about what it is we're going to read, and how to read and store it so that we can process it later. Then we have to figure out where a note actually occurred and what note it was, for example whether it was a C or a D. And then we have to transcribe it, so write it down in some standardized way that can be interpreted by software or by instruments.
OK, so the first question: how can we store it? First of all we need to know what the data is. It's audio, and audio is basically a continuous wave. To represent a wave like this on a computer, it first needs to be digitized. Digitization comes in two steps: sampling and quantization. What does that mean? Sampling changes the continuous signal into a sequence of samples, so we end up having a finite set of samples. How many samples is determined by the sampling rate we use: the sampling rate defines how many samples are taken per second. There is also an important thing to remember about sampling, one of the most basic and important results in the whole of digital signal processing: the so-called Shannon theorem, or simply the sampling theorem. It says that to be able to reconstruct the continuous signal later, we have to make sure that the signal we digitize doesn't contain frequency components above half of the sampling rate. What happens if there are higher frequencies in there? Such a frequency wants to be represented but doesn't have enough samples, so it shows up as a different, lower frequency. This is called aliasing. Aliasing is bad, of course: first of all we lose the high frequencies, but we also corrupt the low frequencies, because we no longer know whether a frequency we see actually belongs to the original signal or is only there because of aliasing. So we just have to make sure to pick the right sampling rate. The usual one is 44.1 kilohertz, which allows us to capture everything that is audible to human beings, because we hear roughly in the range from 20 hertz to 20 kilohertz.
So now our independent variable, time, takes a finite number of values, the number of samples. But our dependent variable, the amplitude, is still continuous, because each value can be anything. Let's say we want to store the information in a fixed number of bytes, so that we have only, say, 100 numbers available for encoding the data. That means we have to quantize it. The simplest thing to do is to assign each sample the nearest of the quantization levels, the nearest possible value, which only approximates the correct amplitude. So both sampling and quantization end up restricting how much information we have in our digitized signal.
So we know what we have when we read it. It's probably going to be rather large, raw data, and we'll have to figure out what data structure to use to store it. Then we're going to split it into segments: first of all we want to say where the notes occur. Here we have a plotted recording.
On this one I think we can see exactly where the notes are, right? It's quite easy to see the onsets. Then, for the pitch, we'll have to grasp how to calculate the frequency of a note. In a spectrum graph we can see that these were different notes, because the energy is concentrated at significantly different frequencies. The last thing is to think about a standard way to encode the notes, so that later an instrument or a piece of software can reconstruct them. [unintelligible]
Before we go any further into my suggested implementation, let's see what we are aiming for. I'm going to demonstrate real-time processing of a live audio signal. It's just a small demo, and we'll get to the details later and see how it works. I'm going to play a beautiful instrument that's been with me ever since I was in elementary school, probably. [unintelligible] I have this microphone here that I'm going to play into, and I'm not going to play any complicated melodies; I want it to work at least some of the time. [demo] It went better than I expected. So what happened here is that we take chunks of data over time and process them, trying to figure out where a note occurred and what note it was. Then we create a message in the standardized format and send it to a synthesizer that produces a sound in effect. So how do we do that?
We are reading the data using a library I highly recommend, PyAudio, which is basically Python bindings around PortAudio, a cross-platform library that lets you record and play audio in real time. It supports a blocking and a non-blocking mode, and the non-blocking mode is based on calling a callback function in a separate thread, which is what we are using here. Basically you instantiate a PyAudio object and open a stream, in our case for reading the data. You tell it what data format you want [unintelligible] and what the callback is. Most importantly, you start the stream and make sure that the main thread doesn't die, so we need to keep it alive, for example by putting a sleep there.
So let's see what the callback looks like. You see the data being passed in; the remaining parameters aren't important here. The data is our audio stream at that point in time. What's most important is that the callback needs to return a tuple of frames and a flag. The flag tells the stream whether it should continue reading the data, which happens when we pass paContinue, or whether it should finish. [unintelligible]
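The callback-based reading described here can be sketched roughly as follows. This is a minimal sketch, not the speaker's code: the helper names `make_callback` and `run_stream`, the five-second duration and the buffer size of 2048 frames are assumptions made for illustration, while the PyAudio calls themselves (`PyAudio()`, `open`, `start_stream`, `is_active`, `paContinue`) are part of the library.

```python
import time


def make_callback(sink, continue_flag):
    """Build a PyAudio-style stream callback that stores each chunk in `sink`."""
    def callback(in_data, frame_count, time_info, status):
        sink.append(in_data)
        # For an input-only stream the first tuple element is ignored;
        # the flag tells the stream whether to keep delivering data.
        return (None, continue_flag)
    return callback


def run_stream(seconds=5, rate=44100, frames_per_buffer=2048):
    """Read from the default microphone for `seconds` seconds (hypothetical helper)."""
    import pyaudio  # imported lazily so the callback can be tried without audio hardware

    chunks = []
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=frames_per_buffer,
                     stream_callback=make_callback(chunks, pyaudio.paContinue))
    stream.start_stream()
    deadline = time.time() + seconds
    # The callback runs on a separate thread, so the main thread must stay alive.
    while stream.is_active() and time.time() < deadline:
        time.sleep(0.1)
    stream.stop_stream()
    stream.close()
    pa.terminate()
    return chunks
```

Calling `run_stream()` returns a list of raw byte chunks, one per callback invocation; in a real-time setting the analysis would happen inside the callback instead of collecting the chunks.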
What's next? Next we have to figure out how to store the data. Here we are getting a string of raw bytes, which is not the best format for the kind of numerical manipulation we want to do. That's why we convert it to a NumPy array. NumPy arrays have a far more efficient implementation than Python lists: Python lists allow you to hold objects of different types in the same list, so they need to store type information for every element, whereas a NumPy array holds a single data type, which allows vectorized implementations and optimized operations on it. So that's NumPy arrays.
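As an illustration of that conversion, here is a minimal sketch assuming the stream delivers 16-bit little-endian mono samples (the format is an assumption; the helper name `bytes_to_samples` is made up for this example). It uses `numpy.frombuffer`, which reinterprets the bytes without copying them.

```python
import numpy as np


def bytes_to_samples(raw):
    """Interpret little-endian 16-bit PCM bytes as floats in [-1.0, 1.0)."""
    samples = np.frombuffer(raw, dtype="<i2")
    # One vectorized operation scales the whole chunk; no Python-level loop.
    return samples.astype(np.float32) / 32768.0
```

Scaling to floats is a design choice rather than a requirement: it keeps later steps such as squaring or taking logarithms away from integer overflow.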
NumPy also provides implementations of rather complicated but very popular and common operations on matrices, like [unintelligible] or transposing matrices, or the Fourier transform. Perfect. We have read the data, and now we need to find the points in the data where something changed, the onsets. Here I
plotted the short-time Fourier transform of our recording from the previous example. What does short-time Fourier transform mean? It means that the recording was divided into chunks, and for each chunk we calculated the power spectrum, to see how the energy of the spectrum changes over time. This helps with finding where there was a significant change in the spectrum, which we assume is the beginning of a note. We already do that in our real-time implementation, because we analyze the data chunk by chunk. So basically we calculate the power spectrum of a chunk and compare it with the previous one. As we want to measure how the power spectrum changes with time, we do it using the so-called spectral flux, which is basically the difference between the current spectrum and the previous one. It's plotted with the green line here, over the short-time Fourier transform.
Then we find the peaks. You can already see that there are some minor peaks that stand out, and we don't want this kind of noise, so we apply a simple thresholding function: we choose a number of previous chunks, average their values, and multiply the average by a given constant. We are basically only interested in picking peaks that are above the given threshold and that are bigger than the previous value. Here we can see that the thresholding function could be better, because it still leaves behind some things we don't want. In my implementation I chose a thresholding function which is far higher, because I wanted to make sure it's not too sensitive in an environment like this. [unintelligible] The number of chunks to average and the multiplier are basically the parameters that you can change to make your application perform better.
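The onset detection outlined above can be sketched like this. It is an illustration under stated assumptions, not the speaker's implementation: the function names, the Hann window, the history of 10 chunks and the multiplier of 1.5 are all made up for the example, and the flux keeps only increases in energy (half-wave rectification), a common variant that the talk does not confirm.

```python
import numpy as np


def spectral_flux(frames):
    """Per-frame sum of increases in the power spectrum relative to the previous frame."""
    window = np.hanning(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    diff = np.diff(power, axis=0)
    # Assumption: only energy increases count, since a note onset adds energy.
    flux = np.sum(np.maximum(diff, 0.0), axis=1)
    return np.concatenate(([0.0], flux))  # the first frame has no predecessor


def detect_onsets(flux, history=10, multiplier=1.5):
    """Indices where the flux exceeds both the adaptive threshold and the previous value."""
    onsets = []
    for i in range(1, len(flux)):
        recent = flux[max(0, i - history):i]
        threshold = multiplier * np.mean(recent)
        if flux[i] > threshold and flux[i] > flux[i - 1]:
            onsets.append(i)
    return onsets
```

`frames` is expected to be a two-dimensional array with one chunk of samples per row. With real microphone input the threshold would also need a small absolute floor, because an average of near-silent chunks is close to zero and almost any noise would exceed it.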
Now that we have the onsets, we want to find out what the pitch of each note is. We do it by calculating the so-called cepstrum of the signal. The cepstrum is the inverse Fourier transform of the logarithm of the calculated spectrum. It is, as the word suggests, something like the spectrum of the spectrum, so that's a way to think about it. You can treat it as information about the rate of change in the spectrum. [unintelligible] We all know that frequency equals one over the time of one cycle. Knowing that, where the frequency domain has frequency, the domain of the cepstrum has quefrency (again playing with words), and we can think of it as cycle time. So the high frequencies have shorter cycle times and will be represented at the beginning of the quefrency domain, and the lower ones further along.
Before we start finding the fundamental frequency in the cepstrum, we want to narrow the cepstrum to the frequencies we're interested in, that is, to the ones corresponding to the notes you can play on the instrument I have here. It's a fairly narrow range, from about 500 hertz to 1200 hertz; you can also think of it as narrowing the cycle time to between roughly 800 microseconds and 2 milliseconds. Knowing this, we take the maximum value, which in our example is somewhere between 25 and 30, and in effect we have the value we consider the fundamental. It's not a frequency yet, and frequency is what we're interested in. To calculate it we simply divide the sample rate by the index of the peak, which follows from what I've just tried to explain. In our example we're narrowing the cepstrum and then finding the peak in the narrowed cepstrum. When working with the narrowed cepstrum we have to remember that the index should refer to the original cepstrum, so we have to add back the offset of the range. We end up with the value being 689.
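A minimal sketch of cepstrum-based pitch detection along these lines is shown below. The function name, the periodic Hann window, the small constant added before the logarithm and the default range of 500 to 1200 hertz are assumptions made for the example; only the overall recipe (inverse transform of the log spectrum, narrowing, peak picking, dividing the sample rate by the peak index) follows the description in the talk.

```python
import numpy as np


def cepstrum_pitch(samples, sample_rate, fmin=500.0, fmax=1200.0):
    """Estimate the fundamental frequency of one chunk from its cepstrum peak."""
    n = len(samples)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)  # periodic Hann
    spectrum = np.abs(np.fft.rfft(samples * window))
    # The small constant avoids log(0) for empty frequency bins.
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-9))
    # Quefrency index = cycle time in samples, so high frequencies come first.
    start = int(sample_rate / fmax)
    stop = int(sample_rate / fmin) + 1
    # argmax is relative to the narrowed range, so add the offset back.
    peak = start + np.argmax(cepstrum[start:stop])
    return sample_rate / peak
```

At a sample rate of 44100 hertz the narrowed range starts at index 36, so a peak at position 28 inside it corresponds to index 64 of the full cepstrum and a frequency of 44100 / 64, roughly 689 hertz. The method relies on the signal having harmonics; a pure sine wave gives no clear cepstral peak.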
It's also nice to mention that pitch detection can act as a kind of correction to our onset detection, because you can just ignore the onsets whose pitch is outside the frequency range of our instrument. So we've found the notes, and now we want to encode them in a form that can later be understood by a synthesizer. What we do is choose something ready to use, a standard used by pretty much everything in music technology: MIDI. MIDI stands for Musical Instrument Digital Interface, and it can encode, among other things, the pitch and velocity of our notes.
The MIDI messages we are interested in here are note on and note off. Note on means that a note started sounding, and note off that it stopped. A message consists of three bytes that we have to consider here. The first one says what kind of message it is and which channel we are using. The second one, a data byte, is the pitch, encoded in 7 bits, so 128 values. The last one is the velocity, meaning how hard the note was played. So how do we transform a frequency into a MIDI note number? We only have 7 bits to encode the pitch, meaning that the maximum value is 127, but our frequency is 689, so it has to be converted. [unintelligible] This is the part where you want to know which note corresponds to which MIDI number and which frequency. As you can see here, the nearest note to our frequency of 689 hertz is an F, and we have the MIDI number for it.
So now we know what our note is, and we can encode the message and send it to a MIDI instrument. I chose a software instrument [unintelligible], which is basically software that allows you to play the sounds of different instruments from MIDI input.
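The conversion from frequency to MIDI note number and the three-byte messages can be sketched as follows. The function names and the default velocity are assumptions for illustration; the formula uses the standard convention that A4 (440 hertz) is MIDI note 69 with 12 semitones per octave, and the status bytes 0x90 (note on) and 0x80 (note off) come from the MIDI specification.

```python
import math


def frequency_to_midi(frequency_hz):
    """Nearest MIDI note number for a frequency, with A4 = 440 Hz = note 69."""
    return int(round(69 + 12 * math.log2(frequency_hz / 440.0)))


def note_on(note, velocity=64, channel=0):
    """Three-byte note-on message: status, pitch (7 bits), velocity (7 bits)."""
    return bytes([0x90 | channel, note & 0x7F, velocity & 0x7F])


def note_off(note, channel=0):
    """Three-byte note-off message for the same pitch."""
    return bytes([0x80 | channel, note & 0x7F, 0])
```

For the frequency from the talk, `frequency_to_midi(689.0)` gives 77, which is the note F5. Actually delivering the bytes to a synthesizer needs a MIDI output library, which is left out of this sketch.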
That would be it. A few conclusions. Python is amazing for rapid prototyping. I wouldn't necessarily use it for production audio code, but just to check how different solutions work and to try different detection algorithms, it's great: it took very little time to put the whole thing together, and it's not a lot of code. [unintelligible] The code is available online, so you can have a look at it. [unintelligible] So why was it so quick? First of all, thanks to amazing numerical libraries like the one I mentioned before, NumPy, which makes things much easier. Then of course the I/O operations are really simple, with not much
messing around, which is really useful. And the APIs of the libraries I used are really good and make the code very clean and very readable. So that's all I have for today. Thank you very much.
Thank you. Are there any questions? [unintelligible] First, do you need a very good microphone? [unintelligible] And second, what about instrument identification, knowing that it's a piano or something different?
Distinguishing between different instruments: they would have different spectral features, and we would need to analyze those. We probably wouldn't care about how the energy distribution changes over time, but rather about the energy distribution overall, because different instruments have different characteristics. So it would probably be completely different things to analyze.
Your solution is for monophonic music, meaning a melody with only one note playing at a time. What are the challenges with polyphonic music, for example chords?
The problem is that it's hard to distinguish which notes are played, because the frequencies overlap. [unintelligible] The solutions available by now are no more than about 70 percent accurate, so it's a rather complicated thing, but I still think it's worth working on. A very interesting concept is the constant-Q transform, and this is what people tend to work with for music. So yes. Any more questions?
There is a story that says that Mozart, when he was a child, heard a piece of music played once and was able to transcribe all of it from memory. Is your system able to hear a piece of music and write down all of it in notes, like Mozart?
This version works only for real-time input. Initially I implemented it for analyzing pre-recorded files, so it can actually recognize notes and pitches in time, and it creates MIDI notes with the right time and pitch. But unfortunately the problem, as discussed when answering the previous question, is polyphonic music transcription. [unintelligible]
We have time for one more question.
You can recognize monophonic music. Do you know of implementations that let the user teach the system, for example: this is noise, don't interpret it, or this is the color of my instrument? [unintelligible]
You would first have to find some spectral features that are able to characterize the noise or the instrument, which is not trivial to implement, but it's certainly an interesting thing. First of all we would have to try different spectral features and see what works best for distinguishing between different instruments, for example something about the timbre, and then use them as a feature vector. That would be my idea. I don't know of any real implementation like this. There are people who are trying to separate instruments [unintelligible], but I don't know of anything that performs well.
We are out of time, so I'm sure you will join me in thanking the speaker.

Metadata

Formal Metadata

Title: Music transcription with Python
Title of Series: EuroPython 2016
Part Number: 82
Number of Parts: 169
Author: Wszeborowska, Anna
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI: 10.5446/21107
Publisher: EuroPython
Release Date: 2016
Language: English

Content Metadata

Subject Area: Information technology
Abstract: Music transcription converts an audio recording into musical notation through mathematical analysis. It is a very complex problem, especially for polyphonic music: currently existing solutions yield results with approximately 70% accuracy or less. In the talk we will focus on transcribing a monophonic audio input and see how we can modify it on the fly. To achieve that, we need to determine the pitch and duration of each note, and then use these parameters to create a sequence of MIDI events. MIDI stands for Musical Instrument Digital Interface, and it encodes commands used to generate sounds by musical hardware or software. Let's see how to play around with sounds using Python and a handful of its powerful libraries. And let's do it in real time!
