Introduction in Audio Retrieval 2 (12.05.2011)
Formal metadata
Title: Introduction in Audio Retrieval 2
Series: Multimedia Databases, Part 7 of 14
License: CC Attribution - NonCommercial 3.0 Germany: You may use, modify, copy, distribute and make the work or its content publicly accessible in unchanged or modified form for any legal, non-commercial purpose, provided you credit the author/rights holder in the manner they specify.
Identifier: 10.5446/348 (DOI)
Production year: 2011
Production place: Braunschweig
Transcript: English (automatically generated)
00:01
OK, so let's continue with sound creation. So, the most basic way of creating sounds, we know it from humans, it's pretty easy: the air coming from the lungs passes through the vocal cords and these vibrate,
00:23
the vocal cords vibrate, they transmit this vibration through the air to the receivers, the ears, and this is how, after the vibration is received through the membrane of the ear, it is transformed into electrical impulses and transmitted to the brain.
00:41
The brain then transforms this back into what we perceive as sound. So this is the most basic example of how sound creation works. But we usually classify instruments based on how they generate this vibration. So for example, we know there are string instruments, you pluck the string,
01:05
like a guitar for example, and this vibrates and this vibration again is transmitted through the air, and there are wind instruments or percussion instruments like drums, for example,
01:21
it has a membrane, you hit the membrane and it again vibrates and transmits the sound. And again, the acoustics depend on this vibration generator, so this is the important factor. For example, if you have the membrane or the string, you hear and perceive the sounds as being different.
01:42
If you have the same pitch or the same tone, they still sound different. For the synthetic generation, which we've just seen in the detour, if you want to create the standard tone A, you need some kind of an oscillator.
02:03
And this oscillator generates voltage oscillations, but they are transformed into sound by speakers. So these are the ones, speakers are just membranes and through their vibrations, again the sound is transmitted through the air to the receivers.
02:27
The oscillator can be influenced by inputting a higher voltage, which results in a higher frequency. And this for example has been exploited by Moog, who in 1964 laid the foundations of the first synthesizer,
02:46
and again, with this kind of manipulation, he could achieve different kinds of frequencies, and with the amplifier he could affect the volume, so the sound could be louder.
03:05
But as you've seen in the detour, these synthesized sounds, they sound kind of metallic and we don't really want that, we want natural sounds. And if you want to achieve natural sounds, for example the human voice, if you sing, it's not perfect.
03:24
You sing the standard tone A and the human voice has a period where it starts from low, it rises up to the pitch of the note it's going to sing, then maybe overshoots,
03:46
reaches a point above the note it is going to sing; then there's this period here, this is the classical attack period, when you prepare to make a sound. And then, in order to reach the tone you are going to sing, there's a decay period,
04:04
where you compensate for this overshooting, and this is also rather short in time. And then comes a sustain period, the period where you actually produce the sound you wanted, and then a release, so you're slowly ending the sound you wanted to make.
04:25
So this is a classical envelope curve that influences the loudness of the sound over time. And in order for the sound not to be that metallic anymore, so in order to produce more realistic sounds, one came up with the idea
04:44
to add this attack, decay, sustain, release envelope curve also to synthesized sounds, just to make them sound better. Yeah, you can see here an example of what such a modulator or synthesizer looks like,
05:04
this is the version of the Moog synthesizer from 1967, it's a bit more evolved, this is the original one. As you can see, you have here a keyboard, one could press on,
05:20
but actually what this keyboard does, it plays synthesized notes. And this actually determines what kind of current should pass through; you can adjust the loudness, you can adjust the frequency, with a lot of adjusting possibilities.
05:40
So this is how you can produce synthesized sounds. And what's interesting to see is that with such a synthesizer, Emerson, Lake and Palmer held a concert, so you can also find this video on YouTube, The Great Gates of Kiev from 1974; they actually took the first steps towards our electronic music.
06:06
Let's see if I can play it from here, it's better to go directly to the source.
06:31
This is Emerson, Lake and Palmer, quite interesting.
07:26
I wouldn't necessarily listen only to such music, but these were the first steps towards such music, and it's quite interesting how he played with the buttons and managed to influence the sound like that.
07:41
So this is what you could do with the synthesizers back then. Right now, today, most of the music is built on the computer, and all the synthesizers are now made in software, and they work quite well. Something like that laid the foundations of what we hear on the radio today.
08:02
OK, back to the Attack-Decay-Sustain-Release curve. As I have previously mentioned, the problem with synthesized sounds is that they sound rather metallic. So for producing a sound which is closer to the imperfection of typical instruments,
08:24
one has used this kind of behavior of the sound in time. So the first phase, the attack with a certain overshoot in the level, then comes a decay where the actual desired level is reached,
08:45
then a longer sustain phase where the note, the one that was aimed for, is sung, and then a release which is usually rather short in time. So it just decreases up to zero.
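As a rough sketch of how such an envelope can be built and applied in practice (assuming NumPy is available; the phase durations and sustain level below are made-up illustrative values, not figures from the lecture):

```python
import numpy as np

def adsr_envelope(n_samples, sr, attack=0.05, decay=0.10, sustain_level=0.7, release=0.20):
    """Piecewise-linear ADSR envelope: rise to the overshoot level, fall back to the
    sustain level, hold it, then fade out to zero. Durations are in seconds."""
    a, d, r = int(attack * sr), int(decay * sr), int(release * sr)
    s = max(n_samples - a - d - r, 0)                       # whatever is left is sustain
    env = np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),           # attack, with overshoot to 1.0
        np.linspace(1.0, sustain_level, d, endpoint=False),  # decay down to the target level
        np.full(s, sustain_level),                           # sustain
        np.linspace(sustain_level, 0.0, r),                  # release
    ])
    return env[:n_samples]

sr = 44100
t = np.arange(sr) / sr                        # one second of the standard tone A
tone = np.sin(2 * np.pi * 440.0 * t)
shaped = tone * adsr_envelope(len(tone), sr)  # sounds less "metallic" than the raw tone
```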
09:01
OK, one of the most important parts in audio is the digitalization of audio data. So we have spoken about audio data as representing a signal, but in order to save such a signal, one would need to save each point on the signal,
09:21
and this is actually not a very good idea, because, you imagine, for a music piece you would then have a lot of data to save. So the solution is to perform sampling, and the concept of sampling is just looking at regular intervals in time on the curve.
09:43
Let's take, for example, this curve here. So this is our signal, and I'm going to look at different intervals in time. I'm going to start here and here and here and here and here and so on,
10:04
up to the end of the signal. And in these discrete moments, I'm going to check what's the amplitude of the signal. And the first amplitude I'm going to measure is zero, then it's this one here, then the next amplitude is somewhere here, and so on.
10:26
So this is how I'm going to discretize the signal. But the most important part is, when discretizing, I have to make sure that the resulting data is enough to reconstruct the original signal.
10:46
Of course, the purpose is to save less data, but the second one is I need to take care that I can reconstruct the signal. And when performing sampling, the basic characteristics I need to consider are the sampling rate.
11:02
This basically means how many times in the time unit, so in the interval I've drawn before, do I have to tap into the signal? How many times do I have to look and see on the curve what's the amplitude of the signal?
11:25
And the higher the sampling rate, the better the quality of the digitalized signal and the better I can reconstruct the original. On the other side, so actually I try to lose less data if I increase the sampling rate.
11:45
The other characteristic is the resolution. So how do I save the digitalized data and with which accuracy do I save it? And often a resolution of 16 bits is used. This means actually 2 to the power of 16 different amplitude values.
12:06
About the sampling rate, it's actually application dependent. So it depends on what you have. For example, for music, it's quite common to use a sampling rate of 44 kHz.
12:20
While for phone, the sampling rate which is common is somewhere around 8 kHz. The idea is because of the difference in the data. For example, for audio we are interested in the quality. We are interested in the full spectrum of frequency. We don't really want to lose anything.
12:40
So we are interested also in a higher sampling rate. On the other side, in a telephone talk, we are interested in understanding what the other one says. We don't really care to get all the background noise or something like that. Actually we really want to get rid of that. So we kind of usually use filters to filter that out.
13:02
This is why it wouldn't make sense to use higher sampling rate. So it's also good for the networks transmitting that signal because lower sampling rate means less data to transmit. So another advantage.
13:24
Okay, so what's really important when performing sampling is that after sampling, after I discretized the signal, I need to be able to uniquely reconstruct the initial oscillation. The higher the sampling frequency, the more I have to save.
13:42
Of course, the less I lose from the original signal. And Nyquist says with his sampling theorem that actually the sampling rate that I really need to use must be at least twice as large as the highest frequency occurring in the signal.
14:00
So if I have the highest frequency in a signal of like 20,000 hertz, then I should consider sampling with 40,000 hertz. This is what Nyquist says. And let's look at some examples.
14:21
So if you have a simple sine curve and you perform simple sampling of one sample per period, so a constant one, I'd say I'm going to sample here, here and here, so once per period.
14:40
And I want to reconstruct this signal. The problem is there are other signals, like for example the simplest one is the constant signal that passes through the same point. So if I have this sampled signal, this point here, I can't be sure which of the two has generated it,
15:02
the sine curve or the constant signal. This is why such a sampling rate of one sample per period is not enough. Another example would be this one here, so going to 1.5 samples per period. This actually means that in a sequence of two periods, I would have like three samples.
15:23
So I would look like three times. So what this basically means is that I'm going to look and see the value of the amplitude here, then here, and then here. So now I have my two periods, I've measured in a sequence of two periods,
15:40
the amplitude value three times. The problem is, again, there can be another curve, another sine curve, and this is this one here, which passes exactly through the same points I've checked with my sampling procedure, but it's different from the sine curve I wanted to discretize.
16:07
So I can't uniquely go back. As you can see, the curve, the blue one, it has a lower frequency, so it's a lower curve. I can't know afterwards, I can't reconstruct and say, okay, this one was responsible or the other.
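A small numerical sketch of this ambiguity (assuming NumPy; the frequencies are arbitrary example values): a 3 kHz sine sampled at 4 kHz, below twice its frequency, produces exactly the same sample values as a 1 kHz sine, while sampling at 8 kHz, twice the signal frequency, removes the ambiguity.

```python
import numpy as np

def sample_sine(freq, sr, n=16):
    """Sample a sine of the given frequency at sampling rate sr (first n samples)."""
    t = np.arange(n) / sr
    return np.sin(2 * np.pi * freq * t)

# Sampled at 4 kHz (below 2 x 3 kHz): the 3 kHz sine yields exactly the
# sample values of a 1 kHz sine, so the original cannot be identified uniquely.
print(np.allclose(sample_sine(3000, 4000), -sample_sine(1000, 4000)))   # True -> aliasing

# Sampled at 8 kHz (twice the highest frequency): the two signals are clearly distinct.
print(np.allclose(sample_sine(3000, 8000), -sample_sine(1000, 8000)))   # False
```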
16:25
So this is why, as Nyquist says, we should use two samples per period. This is one example of this case, so I've just done my two samples for that period, again two samples, and again two samples.
16:43
And there is only one curve, only one sine curve, a perfect sine curve, without noise or anything like this, which goes through exactly these points. So this is basically the idea of Nyquist's theorem. Okay, so typical sampling rates: again, for the phone, cell phone and phone talks, it's about 8 kHz.
17:11
For DVD, you can have up to 192,000 Hz, so 192 kHz, which is quite high.
17:22
You could wonder why such a high sampling rate. So the idea is that for such media, even for audio CDs, you don't really want to lose anything. You don't want to lose quality, you don't want to lose maybe noise, maybe that noise was supposed to be there, maybe you don't have perfect sine waves.
17:44
And this is exactly the idea here. The more signal you are able to store, the better you represent the original signal. And we don't have a trained ear, but if someone with a better ear hears a signal,
18:04
he can differentiate between the quality of a DVD and of an audio CD or classical MP3. So this is why you can go up to very high sampling rates, and it also makes sense. And if we look at such sampling rates and consider that the resolution is somewhere of 16 bits per measurement,
18:31
then you have a throughput of like 176 kilobytes per second. This means that actually for a minute of sound, you have like 10 megabytes of data.
18:46
You are probably used to the idea that a CD holds 635 or about 700 megabytes of data. It also holds like an hour of music, depending on how long the audio songs are and so on.
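A quick back-of-the-envelope check of those numbers, assuming the usual CD parameters of 44.1 kHz, 16 bits and two channels:

```python
sample_rate = 44_100      # samples per second (CD quality)
bits_per_sample = 16      # resolution
channels = 2              # stereo

bytes_per_second = sample_rate * channels * bits_per_sample // 8
print(bytes_per_second)               # 176400 bytes/s, the ~176 kB/s mentioned above
print(bytes_per_second * 60 / 1e6)    # ~10.6 MB for one minute of sound
print(bytes_per_second * 3600 / 1e6)  # ~635 MB for a full hour, roughly one CD
```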
19:08
So it's quite a lot of data. And for space reasons, usually we compress this data. We have compression, we know it from files, with zip or run-length coding, a lot of procedures.
19:25
So on the other side, we have some uncompressed formats we use for audio, which are built for quality, and then we have some compressed formats which are built for storage or simply for network transport and so on.
19:42
And the most well-known uncompressed formats are the one from Apple, the Audio Interchange File Format, AIFF. You maybe know the WAV file, the one from Windows. Well, this format here is actually not that used anymore.
20:01
It was used by the Institut de Recherche et Coordination Acoustique/Musique, IRCAM. It's probably more used in research, in sound labs and so on. And Sun also had their format, the AU.
20:20
Okay, let's discuss a bit about compression. So as I've said, compression is a big issue when we speak about audio sound. 600 mega for an hour of music, it's already a lot. What we actually want to achieve is some data reduction,
20:43
but we have to give something up in return. There are two ways of performing that. One is lossy, so we lose some data, the data is not perfect anymore. Or one which is lossless, where we don't really get to compress that much.
21:01
Usually you get to obtain a factor of two or something like this. For example, if you have 600 megabytes of audio data, you may compress it losslessly to half of that. But if you compare it with the lossy, the lossy can achieve even a factor of ten,
21:21
so you can compress 600 megabytes to about 60, which is a lot. Okay, so the most used here is the Free Lossless Audio Codec, FLAC, for the lossless compression, and it achieves about 50 to 60 percent of the original size.
21:40
Others are Apple Lossless or WavPack. The lossy compression algorithms usually use transformations like the discrete cosine transformation or the modified discrete cosine transformation or wavelets. The idea here is to transform the signal into the frequency space,
22:03
and then in this frequency space, obtain the most important frequencies based on their coefficients, hold only those frequencies and cut the ones which are minor, for which the coefficients are small. So practically when you perform this transformation,
22:21
you get a series of coefficients and the corresponding waves, and you just cut, for example, after the first five. By cutting those, you lose some data, but that data is not actually very important. The most important is the beginning part of the series.
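A minimal sketch of that transform-and-cut idea, assuming SciPy's discrete cosine transform is available; real codecs such as MP3 work on windowed blocks with a modified DCT and psychoacoustic models, so this only illustrates the bare principle of keeping the large coefficients:

```python
import numpy as np
from scipy.fft import dct, idct

sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

coeffs = dct(signal, norm='ortho')          # transform into the frequency space

k = 50                                      # keep only the k strongest coefficients
keep = np.argsort(np.abs(coeffs))[-k:]
pruned = np.zeros_like(coeffs)
pruned[keep] = coeffs[keep]                 # everything else is cut away -> lossy

reconstructed = idct(pruned, norm='ortho')
rms_error = np.sqrt(np.mean((signal - reconstructed) ** 2))
print(f"kept {k} of {len(coeffs)} coefficients, RMS error {rms_error:.4f}")
```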
22:42
Okay. When performing compression, you actually have two steps. The first one is the encoding, where you transform the waveform in frequency sequences or the sampling. And the second one is the decoding. So you have to play it somehow. You have to reconstruct these waveforms
23:02
from the values you have obtained through the encoding. But the big question here is what are we going to cut? What can we afford losing? The goal or what we want to obtain is we want to lose some data.
23:23
It's okay. We want to obtain better space efficiency, but we want to maintain the subjective perception. So we don't want to lose that much data so that we won't even recognize the sound anymore. And here we can do some tricks. For example, we can omit either very high or very low frequencies.
23:45
We've said that the human ear can hear from somewhere around 50 hertz up to 20 kilohertz. So basically, I don't really need to store anything at or under 50 hertz because I won't hear it anyway.
24:02
The same goes for sounds above 20 kilohertz. So I don't really need my dog to hear the sound. I'm just interested to keep what I'm going to hear. So I can cut that also. On the other side, I can, for example,
24:24
save the superimposed frequencies with less precision. So frequencies which come after other frequencies can be saved with less precision because they are not that important. So something which is more powerful will screen something which is less powerful.
24:43
So if I'm talking to someone and near me there's a construction site making a lot of noise, he won't really hear my voice. So if it's not audible, I can leave it out. This is called masking of quiet tones after very loud sounds. So my ear tunes to the loud sound.
25:03
It hears that one, but it doesn't hear anything else. Some other psychoacoustic observations which come in handy here are that changes at a very small distance are impossible to hear. So if the change is very slight,
25:22
I don't know if I should consume any data to save that change because it won't be perceived anyway. So I could save myself some data and leave the tone as it was. Among the compression standards,
25:41
one of the most known is the MPEG standard with different layers and most of us, we know the MP3 standard. Probably you have MP3s right now on your cell phones or on your iPods or MP3 players or whatever. And the quality of the sound here is near to the CD quality
26:07
and the bit rate is 128 kbps. The core idea of MP3, so what they are basically doing, is that they are coupling the stereo signal by registering only the difference
26:24
between the left and right channels. So for example, they are recording what happens on the left channel. We know we have a stereo signal, but we don't really care to save all the data for the other channel also. We just measure them and say, OK, I'm going to store only what's different between the left and the right.
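A toy sketch of that channel-coupling idea (assuming NumPy; actual MP3 joint stereo operates on frequency-domain data per frame, so this only shows the principle of storing one channel plus a mostly-zero difference):

```python
import numpy as np

n, sr = 44100, 44100
left = np.sin(2 * np.pi * 440 * np.arange(n) / sr)
right = left + 0.001 * np.random.randn(n)        # channels are almost identical

stored = left                                    # store one channel as-is ...
difference = right - left                        # ... plus the (mostly near-zero) difference

recovered_right = stored + difference            # decoding recovers the right channel exactly
print(np.allclose(recovered_right, right))       # True

# The difference carries far less energy than the raw channel, so it compresses much better.
print(np.abs(right).mean(), np.abs(difference).mean())
```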
26:43
And this way, I will have a lot of zeros in my right signal, because most of the signal will be the same, and I can compress that very well. The second thing MP3 uses is cutting off the inaudible frequencies.
27:02
So what I'm not going to hear under 50 Hz or above 20 kHz is going to be eliminated, making use of the psychoacoustic effects, and again using Huffman encoding, as I've said, in combination, for example, with coupling the stereo signal, so left to right,
27:21
and not only, so Huffman coding can be used anywhere on the signal. What we usually have today is Advanced Audio Coding, AAC. So if you have some newer device, like a newer receiver or something, a 7.1 system, you really don't have stereo anymore. You have a home cinema system, yeah?
27:43
And this is what AAC is able to do. It's an industry improvement of the MP3, so it actually basically is the same but with more channels. It's usually used for TV and radio broadcasts
28:01
and it actually offers better quality for the same file size. As I said, the most important point: multi-channel audio. It actually supports 48 sound channels with up to a 96 kHz sampling rate, so quite a high sampling rate,
28:22
much higher than we used to have. OK, you can search for more information on the AAC if you're interested on the internet. Other compression formats are the OGG-Vorbis, the Real Audio from RealNetworks and Windows Media Audio 9, I think.
28:44
I think we are already at version 10 right now. Anyway. OK. I've searched for some experiments and I've seen that the lossless compression, as I've said, obtains a compression to somewhere around 50 percent of the original size.
29:03
So the most important factors are the compression rate, how can I obtain better compression rate, and some other important factors are the speed of the compression and decompression. We don't really want to wait for a week to compress our library of music
29:22
and on the other side, if I'm going to play it, I don't really want to decompress it first and then play it. I want the decompression process to happen on the fly, and if I press the Play button, I also hear my sound. So, taking into consideration these two factors,
29:44
for example, I have here the result for different compressors and for example, the FLAC compressor we've previously discussed about obtains a relatively good compression rate ratio,
30:02
a quite good encoding speed, so it's a factor of 20 in real-time speed, and a very good decompression rate. So actually, this one is widely used if you have a library of music which you really want to have in original quality.
30:27
For the lossy compression, besides the decompression and compression speed, which of course are of importance, the compression rate is very important so I'm interested, if I'm going to lose data,
30:41
I really am interested in something better than 50%. I'm interested in something like 90% compression rate. And the most important factor is the quality. So I'm losing data, but I really don't want to lose the sound. I'm not going to accept that when I compress it with lossy compression,
31:01
what I receive I can't recognize anymore. So in order to measure the quality of these compression procedures, I've observed an experiment which has been published on the internet and the idea here was to perform a mean opinion score measurement
31:23
with different human subjects. They were given a scale, from 1 to 5 and above, on which they had to rank a sound as being anywhere from heavily distorted or unpleasant
31:41
up to not recognizing any difference between the compressed and the original sound. And the results are quite interesting. So for example for the AAC, we can observe that the average quality is above 5,
32:05
so most of the subjects didn't notice any difference, with the highest rating given by the subjects being somewhere around a value of 9 and the lowest being somewhere around 4.5 or so.
32:23
So actually even the harshest critics didn't give a bad grade for the AAC. And combined with the very good compression rates and the fact that it supports multi-channels, it's a great way to compress music.
32:46
Again you have here statistics for different codecs but as I've said AAC and its variations is the winner of it all. OK, let's go further to another music format,
33:03
the MIDI format. I don't know if you've heard of the MIDI format; it was quite hip in the 90s, around '92, '94. Actually the MIDI format was designed as a communication protocol. So the idea was to transmit the music
33:26
or the recording between digital instruments and the PC. So the sounds were input from a clavier, a keyboard, into the computer, and the computer saved them as commands to the sound card.
33:42
For example: now a certain tone will be played, with a certain length in time and a certain speed. So a certain note, key, velocity, pitch, and which instrument it is. And that was it. So this is the MIDI format, of course.
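For illustration, such a command is just a few bytes rather than audio samples; a sketch of a note-on/note-off pair for the standard pitch A, with a made-up velocity and the timing left out:

```python
# A MIDI "note on" is a status byte (0x90 | channel) followed by the note number
# and the velocity (both 0-127); "note off" uses status 0x80.
channel = 0
note = 69            # MIDI note number 69 = A4, the 440 Hz standard pitch
velocity = 100       # how hard the key was pressed (made-up value)

note_on = bytes([0x90 | channel, note, velocity])
note_off = bytes([0x80 | channel, note, 0])       # sent later, after the desired duration
print(note_on.hex(), note_off.hex())              # 904564 804500
```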
34:01
If you only have such a note sequence, you can't really represent, for example, voice. So you won't hear in a MIDI sound the voice of the singer. Let me play you an example.
34:23
I'll go to the source again. OK, so that's the MIDI.
35:02
So actually this is the part where the singer was supposed to come in and sing over the notes. But you don't have that in a MIDI file. So in a MIDI you have just the notes and notes and that's it. Let's see what the original sounds like.
35:51
Yeah, so here we already have the singer. It's quite a big difference. But taking into consideration that for example in the 90s
36:04
you didn't really have a real sound card in the computer, and MIDI files could also be played by the computer speaker, it was a great hit in the 90s, as I've said, to use the MIDI format for music storage. And the great part of it, here comes the great part of it:
36:21
10 minutes of music are not 10 megabytes, they are 200 kilobytes of data. So it's a great difference between what you were supposed to store by storing MIDI data and the original sound. As I've said, the data is input into the PC via a keyboard
36:41
and output via a synthesizer. A sequencer can be used for caching the data and if you want to make any changes. For example, if you feel that the notes sound synthetic, you can add this envelope curve and transform them to sound more natural.
37:07
Okay, let's go to the next section: audio information in databases. So for audio data we have music, CDs, we have sound effects or earcons.
37:22
For example, we have a database of sounds which you can use, for example, for editing music. Maybe you know this modern software you are going to use with pieces of instruments with different notes you can blend in and create music.
37:41
So this is the audio data. And you can have them all in a database and search for them or use them and somehow query for them. On the other side, the audio data may also represent the process of information transfer. So while there you had just music to listen to,
38:00
on the other side you may, for example, store historical speeches where not the data itself is important but the message extracted from it. And if you would have, for example, the transcript of the speech the text, it wouldn't be the same because you lose information. In text you have what the person doing the speech has said
38:23
but you don't really have, for example, the rhetoric the way he said it, the intonation, for example or the reaction of the public. It can also be used for recordings of conversations so protocol, phone calls or negotiations.
38:41
So these are typical information you can store in a database. Usually when dealing with audio databases there are three typical applications of audio signals in the context of databases.
39:01
One of them is the identification of audio signals. So, for example, the audio query. The classical example here is when you go to a music shop and you want to buy some music piece you've woken up with a certain song in your head and you don't know how it's called
39:21
so you don't have the title, you can't go on the internet and buy it from Amazon because you don't know what it's called, and you go to the music shop and ask the guy: do you have that CD? That CD with that melody that sounds like this? And you have this audio as a query. Well, having a database which is able to understand your query,
39:41
so if you sing it or whistle it or whatever then it could deliver you either the sound or information about it so the melody or the information about it. This is a typical scenario of audio as a query or identification of audio signals.
40:01
Another application is the classification and search of similar signals like, for example, I want to cluster together similar music pieces or music pieces belonging to a certain genre like, for example, I've heard this song and I want something similar
40:23
I like it, it's ok, but I like this type of song so I want something similar. This is a typical case of classification and similarity search. And there's also phonetic synchronization where, for example, I have the text and I have some spoken speech, some speech
40:43
and I want a synchronization between the two so what's spoken and what I have here. But actually phonetic synchronization is not something we're going to deal with in the lecture so what we're going to focus on is the identification of audio signals
41:00
so audio as a query, for example. So, typical tasks in the identification of audio signals I want to find a title for a music piece I have in my head maybe you already know Shazam, it's one of the first applications I think it was first written for iPhone
41:24
and it does exactly that. So, well, not from the human voice but you can pull out your cell phone, start Shazam and let it record some short piece of music from the radio and then Shazam will tell you how this music piece is called
41:41
who sings it and maybe, if you want, to buy it on iTunes or something like that. It's also a great idea, so this identification of audio signals can also be a great idea for monitoring audio streams. Like, for example, if I'm doing advertising on the radio and I want to be sure that the radio station, as promised,
42:04
advertises or runs my advertisement three times a day, as per our contract. Well, for that I don't need to sit by the radio and try to be careful whether it played my advertisement three times or four times or not at all,
42:22
but I can perform this automatically if I am able to perform identification of audio signals. So, what is going to happen? A tool will monitor the radio program and compare the radio program with my advertisement
42:41
and count the matches. When it sees that different parts, windows of the program, match my advertisement, then it's great, it counts them, and if I have three of them I'm happy.
43:01
A similar use case is copyright control: for example, looking at what the radios play, just compare it with different song pieces and see whether they have a license for that and so on. Another typical application is audio on demand,
43:24
so I want to hear some song from iTunes or something like that, then I can request it and it will be streamed to me over the network; a lot of services use that. OK, the second type of application is the classification and matching,
43:43
so the task here is to find audio signals which are perceptually similar, so I want to find pieces of music which are kind of the same, and there is a great, really big field,
44:01
the field of recommender systems, doing just that. So, for example, I don't know if you are aware of Last.fm; they also have a great API, a programmable interface, which you can use to ask, for example, give me songs which are similar to this one,
44:22
or give me artists which are similar to this one like, for example, you search something similar to Queen or something similar to Madonna and they give you a list and this is actually done based on matching of songs how well they match, how well they classify together
44:41
so, basically, genre classification. Audio libraries, this is a nice application for audio libraries to perform this classification automatically. And the synchronization of audio as I previously mentioned, synchronization between speech and text
45:02
or between notes and audio where am I right now following the notes and what is he singing right now or retrieval of text from speech, for example so to find a specific point in a speech
45:21
but, as I've said, this is a part we are not going to concentrate on so we are much more interested in query by sound. OK, so the state of the art of all these three applications let's start with the identification which we will also treat in this lecture
45:43
it's the simplest of these three problems and actually it's successfully being resolved as I've said, Shazam is an example Midomi does that online so you can check it and test it we'll also do it as a detour in the next lecture very interesting application.
46:03
For the classification and matching, it still leaves a lot of room for manual annotation; actually it's done with a lot of manual work, metadata and so on. The automatic classification works only roughly, on small collections of sounds,
46:23
so this matching process is still problematic it involves a lot of training usually and it's prone to error it's a probabilistic approach typical procedures here are machine learning techniques
46:40
and they work, but it's not as good as how the identification, for example, works. For the synchronization, in the meantime one can obtain tolerable error rates, like for the synchronization between speech and text.
47:08
OK, so we've spoken about the state of the art of general applications but before speaking about audio databases we need to speak of how to make this data persistent how do we store it?
47:22
and usually the audio data are stored in blobs in the database so actually they are nothing more than binary large objects most of the databases support blobs and you can store either videos or audios
47:40
they don't really care what you've stored there and they're actually not that useful from this point of view because you can have metadata about them you can know that in the blob you've saved a song with this title or so on but this doesn't really help you do some content search there's also the concept of smart blobs or usual blobs
48:02
the difference is that one of them is managed by the operating system and one of them is managed by the database itself. Additionally there are metadata, like for example the title or the file size in bytes or the last time it was modified and so on,
48:21
or the feature vectors like for example sound features the amplitude or the loudness we'll speak about them they're just the same as we've done in image you remember when we spoke about images we said we have some feature vectors
48:41
like the color, the brightness and so on. This is exactly what we have here also, and these metadata, together with the feature vectors, help us perform the search functionality; for example, they help us transcribe speech
49:02
as text, or annotate music pieces or MIDIs. That's the section we're going to concentrate on, and this is the most important part: we want to perform audio retrieval,
49:21
so this is the central point of our lectures regarding audio: how do we search in audio? And of course the easiest approach is a metadata-driven search, and it's great if you have metadata,
49:42
because you can have semantic metadata and these are for example the title, the artist or the speaker if this is a speech or some kind of keywords but all this is manually generated it's like again in the image case where this semantic metadata was a photo of me in Paris
50:04
or a photo of me near, I don't know, my best friend and my best friend somewhere the semantic metadata is difficult because it's difficult to generate because it's all manual information so if you have it it's great
50:22
it can help your search if you don't you have a problem when searching only on metadata on the other side you have some automatically generated metadata like for example the time or place for images it can be generated through geo sensors
50:40
the recording, the file name, how it's called the size, the hour that's automatically, if you can use it, it's great metadata is great as I've said, if you have it, you can use it and this is the foundation of typical music exchange markets
51:03
you most surely have heard about the success Napster had, or you may have also used, probably, Kazaa. You were searching for a certain title; the title had already been introduced there by someone uploading the file or holding the file on his computer,
51:23
and this is basically how you do the search you do search through the metadata but this manual indexing regarding title, author or whatever you are inputting as metadata is labour intensive and expensive
51:42
and this information is usually incomplete for example the genre classification one might feel that this music piece is pop but he's not really an expert so maybe this is something else or maybe he's done a mistake or maybe this music piece belongs to more genres
52:02
so if you're going to perform a search with the correct genre however that music piece has not been labelled correctly you won't find it the major problem here is that you have no possibility of performing query by example
52:20
this is actually what the core of a multimedia database should be: so I want to search for a sound that sounds like this, and I want to hum it or sing it or whistle it, and I want the database to return the information, the file,
52:41
whatever I'm going to search for and for this we need to be able to search as I've said through query by example directly in the audio file so not quite in the metadata what the current systems managed to do is
53:01
to use something like SQL with approximate string search on the metadata, so the LIKE clause: select the music piece from the music database where the title is like
53:21
'Big in Japan', or only 'Japan', and it will find it based on string similarity. But that's not that great, that's not really what we want here. So the core of the multimedia database should be using the content of audio files,
53:43
again of course using metadata too if you have it but the core should be using content of the audio files and the most trivial idea if you have two pieces of sound so two audio files and you want to compare them
54:02
you want to establish how similar they are to each other, you can compare them measurement by measurement, so you can take each of the two signals point by point and compare them. Now imagine that you have a big database, which means that you compare your query sound
54:22
with each sound in the database and that's a lot if you do it point by point we've discussed it also in the case of images it's not really promising and it's really inefficient because on the one side you have a lot of points to compare on the second side it may be that your query contains for example only the refrain
54:44
so it doesn't start from the beginning, or you have differences in sampling rates or in resolution, so you can't even match them, even if it's the same song but with a different sampling rate. So the solution in this case is to use features again;
55:03
you may have low level features or high level features or logical features in the case of low level features you may have information like for example what's the loudness, how loud is the sound or what kind of frequencies are there in the sound
55:23
what's the intensity of those frequencies and you can compare them and this is actually the foundation of the content-based search in audio so as I've mentioned it's basically the same as in image databases the same basic idea
55:41
I want to describe the signal by means of a set of characteristic features these will be the feature vectors I'm going to compare of course there's a big difference when compared to what we've discussed about in the image information because here we need to consider
56:01
that audio is a time-dependent signal in image you have a two-dimensional signal it's the space of the image here you also have to add the time so the vector has to be dependent on time
56:22
this is why the vector is actually time dependent so at the time point I have to compare the two vectors of two different sounds typical low level features are the mean amplitude or the loudness how loud is the first sound
56:43
how loud is the second how does the frequency distribution look like for example for voice I'm going to expect lower distribution for music I'm going to expect higher distribution and I can already imagine that the frequency distribution could help me to differentiate between speech and music
57:06
just simple typical low level features I can use to filter a lot from the database so do some pruning for example another typical low level feature is the pitch the pitch we're going to discuss next lecture into more detail
57:23
it's the frequency of a note, so which note is being played; the brightness, or how high a sound or music piece feels, whether the frequencies are higher or lower,
57:44
like for example the brightness of voice is lower than the brightness of music, because in music I have a lot of high frequencies, whereas in the voice I have a lot of low frequencies; and the bandwidth, if you measure the lowest and the highest frequency,
58:05
that interval you get that's the bandwidth for voice it's lower than for music and then you have these low level features which can be measured in the time domain so in the time domain you have the signal
58:20
which is represented as the amplitude versus the time you have something for example something like this that's the amplitude that's the time and the signal is something like that or you have the frequency domain
58:44
where you have intensity like for example a spectrogram or something like that versus the frequency so you have here I don't know maybe like 20 kilohertz and you have here something like zero frequency and then you have a lot of
59:00
I don't know maybe this is 400 something like the representation in the frequency domain and we'll speak about spectrograms in this case okay so the amplitude the amplitude is the fluctuation around the zero point
59:22
it gives me the loudness so the silence is then equivalent to zero amplitude well actually if I want to detect a period of silence I have to perform some heuristics but usually when you have zero amplitude there is no noise there is no movement there is nothing
59:46
also in the time domain you have the average energy so this characterizes how loud is the entire signal and you can calculate it by summing the value of the signal
01:00:00
over its length, so summing the square of the amplitude at each point in the signal and dividing by the number of points. That's the average energy. Another feature in the time domain is the zero crossing, or the frequency of sign changes in the signal.
01:00:22
So, what we're basically told here is that if two consecutive points have the same sign, this is evaluated to zero, so this won't add up here. If they have different signs, the value will be one, and it will count here as one change,
01:00:43
and it will add up to the number of changes, and everything is then normalized. So then you can compare zero crossing rates for two sounds without taking into consideration that they might have different lengths.
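A small sketch of these two time-domain features, following the definitions just given and assuming NumPy:

```python
import numpy as np

def average_energy(x):
    """Mean of the squared amplitudes: how loud the whole signal is."""
    return np.mean(np.asarray(x, dtype=float) ** 2)

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ, normalized by the
    signal length so that signals of different lengths remain comparable."""
    signs = np.sign(x)
    changes = np.abs(np.diff(signs)) > 0
    return changes.sum() / (len(x) - 1)

sr = 8000
t = np.arange(sr) / sr
voice_like = np.sin(2 * np.pi * 200 * t)     # low-frequency content -> low crossing rate
music_like = np.sin(2 * np.pi * 3000 * t)    # high-frequency content -> high crossing rate
print(average_energy(voice_like), zero_crossing_rate(voice_like), zero_crossing_rate(music_like))
```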
01:01:05
The silence ratio is another feature in the time domain, and it actually tells me the portion of values that belong to a period of silence. But the great question here is, what is silence?
01:01:21
So, you can't say that if you have an amplitude of zero, that's silence, because you might have zero crossings, for example. And in the case of a crossing, the amplitude is zero. So, actually, it's a good heuristic to establish a low threshold,
01:01:46
an amplitude under which everything that you have is considered noise or silence. And what one also must establish is the number of consecutive readings,
01:02:03
or the number of consecutive points for which the sound signal must have an amplitude lower than the established threshold, so that it is considered a period of silence. So if, for example, I have just one point under, I don't know, maybe an amplitude of 10
01:02:29
that represents my silence threshold, it doesn't really count as silence. But if I have a consecutive sequence of like five or six such readings,
01:02:45
then that might be silence. So, it depends on parameters and how you define them, but silence can be detected that way. Okay, so we've spoken a bit about the time domain.
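Putting those parameters together, a sketch of such a silence detector (assuming NumPy; the amplitude threshold and the minimum run length are exactly the kind of values one has to choose by hand):

```python
import numpy as np

def silence_ratio(x, amplitude_threshold=0.05, min_run=256):
    """Fraction of samples lying inside a 'silent' stretch: at least min_run
    consecutive samples whose absolute amplitude stays below the threshold."""
    quiet = np.abs(x) < amplitude_threshold
    silent, run = 0, 0
    for q in quiet:
        if q:
            run += 1
        else:
            if run >= min_run:          # only sufficiently long runs count as silence
                silent += run
            run = 0
    if run >= min_run:                  # a silent stretch at the very end
        silent += run
    return silent / len(x)

sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
clip = np.concatenate([tone, np.zeros(sr // 2), tone])   # half a second of silence in the middle
print(silence_ratio(clip))                                # ~0.2
```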
01:03:00
What about the frequency domain? We can perform a Fourier transformation of the signal, and this actually means that we transform the signal that we have into the frequency domain. We decompose it into frequencies, each of these decomposed frequencies with corresponding coefficients.
01:03:24
And this is how we get the representation of the frequency spectrum of the signal. The most important part here are the coefficients. The coefficients of each frequency represent the amount of energy for frequency.
01:03:42
The bigger the coefficients, the more important is that decomposed part of the signal. So, as I've said also in the compression, you can hold the first five coefficients and cut the rest. You won't lose that much, for example.
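A sketch of how such a frequency representation is typically computed (assuming NumPy's FFT); the magnitude of each coefficient tells how much the corresponding frequency contributes:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
# An "ah"-like toy signal: a 400 Hz fundamental, weaker harmonics, and a little noise.
x = (1.0 * np.sin(2 * np.pi * 400 * t)
     + 0.5 * np.sin(2 * np.pi * 800 * t)
     + 0.2 * np.sin(2 * np.pi * 1200 * t)
     + 0.01 * np.random.randn(len(t)))

spectrum = np.abs(np.fft.rfft(x))            # coefficient magnitude per frequency bin
freqs = np.fft.rfftfreq(len(x), d=1 / sr)    # the frequency each bin corresponds to

strongest = np.argsort(spectrum)[-3:]        # the three most energetic components
print(sorted(freqs[strongest]))              # ~[400.0, 800.0, 1200.0]
```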
01:04:01
The energy is the important part here, and it's usually measured in decibels. And there are some features we can describe on this frequency spectrum. For example, this is how the ah sound looks like in the time domain,
01:04:23
this is here, and in the frequency domain. And you can already observe that there is a fundamental frequency here.
01:04:44
I think it is somewhere to about maybe 400 Hz. There is some noise here, some noise here. Then there are some harmonics of this fundamental frequency,
01:05:09
which is the double of this frequency. Then there are smaller harmonics until they have no energy anymore.
01:05:22
So, for example, this one here, this has the highest energy. This one has less energy, and so on. The bandwidth, the bandwidth represents the interval between the occurring frequencies.
01:05:42
So, you calculate the difference between the lowest frequency, the minimum frequency, and the highest frequency. I've already defined the silence as being threshold-dependent. So, if you have a certain threshold under which you consider that everything is silenced,
01:06:04
then the next frequency, which is above the silence threshold, is the one that's going to count as the minimum. For example, you don't hear anything under 50 Hz anyway,
01:06:22
so you can start considering 50 as the minimum if you have it in your signal, or what's closest to 50, that will be the minimum. This is a great feature to be used in classification. For example, the bandwidth in music is higher than for the voice.
01:06:44
In music, you may have a lot of instruments, and those instruments may produce higher frequencies, like, for example, 10, 20 kHz frequencies that you still hear, but those frequencies you don't really create with voice.
01:07:04
Maybe if you're an experienced opera singer or something like this, you may have a higher voice, you may achieve that, but usually in normal speech you don't have that. So, you have it in music, but not in voice. So, that's great for performing classification.
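A sketch of the bandwidth feature as just described, assuming NumPy and a hand-picked energy threshold:

```python
import numpy as np

def bandwidth(x, sr, threshold_ratio=0.05):
    """Distance between the highest and lowest frequency whose magnitude exceeds
    a fraction of the spectrum's maximum (the hand-picked 'silence' threshold)."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    present = freqs[spectrum > threshold_ratio * spectrum.max()]
    return present.max() - present.min()

sr = 44100
t = np.arange(sr) / sr
voice_like = sum(np.sin(2 * np.pi * f * t) for f in (200, 400, 800, 1600, 3200))
music_like = sum(np.sin(2 * np.pi * f * t) for f in (100, 1000, 5000, 10000, 15000))
print(bandwidth(voice_like, sr), bandwidth(music_like, sr))   # music spans the wider band
```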
01:07:23
Another feature in the frequency domain is the power distribution. Power can be read directly from the frequency spectrum. So, what you actually can distinguish is the frequency with high energy versus those with low energy.
01:07:45
So, basically the ones with the high decibel value are the ones with high energy. And based on these energy distributions, you can calculate frequency bands with high or with low,
01:08:02
and you can calculate centroids, for example, to establish how high is the average frequency, based on considering also the energy. And this is how you calculate the brightness. Like, for example, in music you have a lot of strong higher frequencies,
01:08:21
so music may have a higher brightness than the voice. The voice is lower, you have a lot of like 4 kHz or 3 kHz, so the brightness will be around 3, 4 kHz.
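A sketch of brightness computed as an energy-weighted average frequency, i.e. a spectral centroid, assuming NumPy:

```python
import numpy as np

def brightness(x, sr):
    """Spectral centroid: the average frequency, weighted by the energy per bin."""
    energy = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    return np.sum(freqs * energy) / np.sum(energy)

sr = 44100
t = np.arange(sr) / sr
voice_like = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
music_like = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 12000 * t)
print(brightness(voice_like, sr), brightness(music_like, sr))   # the music-like mix is "brighter"
```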
01:08:40
Harmonics. Again, a feature in the frequency domain; it relates to the lowest of the loud frequencies, which is also called the fundamental frequency. So, if you have a fundamental frequency, for example for musical instruments,
01:09:00
you also have harmonics, which means that the signal increases, repeats this dominant frequency in multiples. So, like for example, if you have a standard pitch somewhere here, like at 440 Hz,
01:09:25
and this is also your fundamental frequency, this is how it looks on the synthesizer, so it doesn't have any harmonics. But if you do the same on a flute, for example, you will also have its harmonics at like 880,
01:09:41
which is the first harmonic, two times 440, and then you will have at 1320, three times 440, that's the third, and so on, harmonics. And they decrease in intensity.
01:10:03
And what these harmonic oscillations basically mean, this here is the fundamental frequency, you may see here there might be some noise, but this one is loud enough to take into consideration.
01:10:22
This is the first harmonic, should be the 880, and the next ones. And if you consider, for example, the harmonic oscillations, again for string instruments,
01:10:40
what this means is that, for example, the first harmonic would look something like that, the second harmonic would have double the frequency, something like that, the third harmonic, three times the frequency,
01:11:04
four times the frequency, and so on, and they all participate together in filling out the sound, so in shaping the note or the pitch that is being played.
01:11:22
So, as I've said, there's a big difference between what the spectrum of a sound from an instrument looks like and the synthesized one, because a synthesizer doesn't have harmonics; you may be able to simulate them, but a pure synthesized sound doesn't have anything like that.
01:11:43
Another feature, which is one of the most important features, is the pitch. We're going to address this in the next lecture, and it's detectable only for periodic sounds, like, for example, the case of the sine oscillations,
01:12:02
and it can be approximated by means of the Fourier spectrum, and usually, in most of the applications, the pitch is considered to be the fundamental frequency. The value is calculated from the frequencies and amplitudes of the peaks.
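A sketch of that simplest approximation (assuming NumPy): take the strongest spectral peak as the fundamental frequency and hence as the pitch; the harmonic product spectrum and autocorrelation methods mentioned below are more robust, since the loudest peak is not always the fundamental.

```python
import numpy as np

def naive_pitch(x, sr):
    """Approximate the pitch as the frequency of the strongest spectral peak."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    return freqs[np.argmax(spectrum)]

sr = 44100
t = np.arange(sr) / sr
# A periodic tone: 440 Hz fundamental plus weaker harmonics at 880 and 1320 Hz.
tone = (np.sin(2 * np.pi * 440 * t)
        + 0.4 * np.sin(2 * np.pi * 880 * t)
        + 0.2 * np.sin(2 * np.pi * 1320 * t))
print(naive_pitch(tone, sr))   # ~440.0
```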
01:12:21
So, as I've said, this fundamental frequency is usually used as an approximation. There are also procedures, so algorithms, one can use to perform pitch detection; one of the most well-known is the harmonic product spectrum,
01:12:42
which we're going to discuss in detail, or, for example, using autocorrelation of the signal to detect the pitch; we'll discuss them in the next lectures. This lecture, we've discussed the introduction to audio retrieval,
01:13:02
we've touched the basics of audio data, we've discussed a bit about what kind of audio information is stored in multimedia databases, and we've started discussing about feature vectors, and how we can perform audio retrieval and search in audio databases.
01:13:24
Next lecture, we'll discuss about classification and retrieval of audio, so the second major application, we'll continue with low-level audio features, we'll go into the smallest difference a human may differentiate,
01:13:42
the difference limen, and we'll go deeper into procedures we can use in order to detect the pitch. That's it, thank you for your attention.