
Whisper AI: Live English Subtitles for 96 Languages


Formal Metadata

Title
Whisper AI: Live English Subtitles for 96 Languages
Title of Series
Number of Parts
141
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Whisper AI, a model from OpenAI, has been largely overlooked despite its impressive ability to accurately transcribe and translate human speech from audio. In this talk I will explore the architecture of the model and explain why it works so well. Additionally, I will live demo the model's capabilities in three languages, showing how you can use it on your own computer to generate English subtitles for a wide range of content.
Transcript: English (auto-generated)
[Japanese] Hello everyone, welcome to my talk. Thank you very much for coming. Speaking in Japanese the whole time would be a bit difficult, so from here it will continue in German.
As a matter of fact, I'd like to introduce a very special person here today. Yes, welcome to this talk again.
I'll continue in English and tell you a little bit about this tool that I have built and how it works. We start by introducing the AI
that I was using, Whisper AI. Then I'll tell you how I use it to generate the subtitles that you just saw. And lastly, we'll go a little bit into how to do asynchronous programming in Python, whether to use threading or multiprocessing,
and what the differences between them are. So, the AI I was using is called Whisper AI. It is by OpenAI and was released last September, between the releases of Stable Diffusion and ChatGPT, which I think is a reason why this AI has been a bit
overlooked since then. I think it is very useful, and maybe you just saw why. It's actually completely open source, unlike ChatGPT. You can see the code on GitHub and the weights are on Hugging Face, so you can use it on any computer you have that can run it.
Let's continue with this overwhelming slide. Don't worry, we won't go through everything, but this is basically all the detail there is for Whisper; it's also from the OpenAI paper.
Let's just start with the training of the AI, what they have done. How did they achieve what you just saw? They trained on mostly three different kinds of data. They trained on English sound with English subtitles.
They trained on non-English sound with non-English subtitles. And also some on, for example, German audio with English subtitles, so the translation was also in the training data. They used a lot of different kinds of audio, I think mostly from YouTube, but probably also other sources,
which means that this model is very, very robust and doesn't really need any fine-tuning. What you just saw was the pure model that you can download, and maybe you noticed that I didn't even have to tell it what language I was speaking. I didn't switch anything. I just spoke in different languages and it realized what was going on and translated all of it
to English, because that's what I told it to do. I could also have told it to write down what I'm saying in the same language, but that's not quite as useful, maybe. Back to this overview. On the top right, we have the architecture of the model, which we're going to look at next.
If you have been following AI news in the last year, you might recognize this shape. It is actually pretty much the exact same architecture as a transformer. So, the same as ChatGPT; the T stands for transformer.
I'll just go through it in a little bit of detail; I'm not a super expert on this. ChatGPT has inputs and outputs and both are text, right? You put some text in and it gives you the next piece of text. And the architecture has something called
attention, which means that to predict the next token, it looks at all of the tokens, or the context window of tokens, that came before, but not all with the same weight. The attention part means that some tokens relate to other tokens more strongly than others.
It's actually quite simple mathematics and a genius concept, and it works really well. It just means, for example, that if I start speaking Japanese and there is a keyword that's obviously Japanese, then that information spreads to all the other words and says: hey, we are speaking Japanese. That's kind of how it works.
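To make the idea of attention a bit more concrete, here is a minimal toy sketch of scaled dot-product attention, the core operation of a transformer layer. This is only an illustration of the "simple mathematics" mentioned above; the real model uses many attention heads and learned projections on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each token's output is a mix of all tokens' values,
    weighted by how strongly it relates to each of them."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the context
    return weights @ V                                # weighted mix of the other tokens

# toy example: 4 tokens, each an 8-dimensional embedding
tokens = np.random.randn(4, 8)
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8): one updated representation per token
```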
Whisper doesn't have text on the input side, but sound waves; the output, however, is exactly the same, also tokens of text. So, Whisper learned to listen to sound and it also learned the correct text output. It tries to predict the next text token, just like ChatGPT,
just the input is a little bit different. So, when I'm speaking, it listens to the sound and tries to predict the next text token; it is basically the same thing as a text transformer, which is quite a nice parallel. The last part I'm not really going to explain; it is just about how the tokens work in Whisper exactly,
which is not extremely interesting. Instead, I'll show you another example of what I mean by robustness of the model. I'm going to play this clip. It's the beginning of a song, and I want you to try to understand the lyrics.
Sound, break it down, thinking of making a new sound, playing a different show every night in front of a new crowd that's new now. Ciao seems to laugh, it's great now. See me lose focus, I'll just sing to you loud. Who understood everything?
One, two. Two people. Maybe you are from Britain. Maybe you understand sign language. Or maybe you are Ed Sheeran super fans, I don't know. But two people out of maybe 50. Of course, I also don't really understand everything. One of the first things I did was put this song into Whisper, just the base Whisper that you can download,
and the base version just takes sound, basically an MP3 for example, works for a while and then gives you all the text that is in the sound, in this MP3. So, let's see what Whisper heard. You can check the text after these brackets, and we'll play it again.
Not too bad I think.
Well done, OpenAI. Actually, this text was not generated by the real Whisper, the largest model; it was one step down. They have a few different versions, and I used the medium one, which runs on my own computer at home. I have a gaming computer with a graphics card
from three years ago, and the one thing that is limiting is mainly the graphics memory: I have eight gigabytes and the real Whisper needs 11. So, this is not even its full, not even its final form.
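As a rough sketch of what this basic offline usage looks like with the open-source openai-whisper package (the file name is just a placeholder; as mentioned above, the medium model needs a few gigabytes of GPU memory, or it falls back to a slow CPU run):

```python
import whisper

# download / load the medium checkpoint (runs on GPU if available, else CPU)
model = whisper.load_model("medium")

# feed it an audio file and ask for English output regardless of the input language
result = model.transcribe("song.mp3", task="translate")

print(result["text"])             # the full transcribed/translated text
for seg in result["segments"]:    # per-phrase timestamps, useful later for subtitles
    print(f'{seg["start"]:6.1f}s  {seg["text"]}')
```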
Another example of how good Whisper is: many people tell me Microsoft Teams has the same feature, right? Yes, it does have the same feature. You can speak any language and it will give you English subtitles. So here's a little example; I turned it on while we had a meeting at work.
It says: "yes, so it may be that she does not jump in that at all. The problem was and then the relaxed it was really because of this." Okay, very helpful. At the same time I was running my own program, again on my medium-sized graphics card, roughly the same one that I demoed at the beginning.
And that says: "so it could be that this print wasn't the problem. And then it was really because of these block things." So, I mean, if you know we're in a retro, we're doing Scrum and everything, that makes sense. So that is one difference. But the biggest difference here is that the thing
on the left is running in the cloud, right? There are some more implications, which we'll get into later. But the thing on the right is running locally on my machine. So it's really, really portable and you can do a lot with it. I think that's really fascinating, and that is why I want
to introduce Whisper to the world a bit more. So, let's talk about Whisper AI for subtitles. That's basically how I used Whisper - the program that just takes an MP3 and gives out text - to make the tool that you saw at the beginning, which transcribes
and translates live while I'm speaking, which is not really the same thing. First of all, why did I do this? What was the challenge or the idea? I used to work at a Munich-based company. This company was founded over 20 years ago
and used to only have German employees. Now, they have expanded. They have hired people all over Germany and also all over the EU. Some are even working in Australia now. So, 90% of the employees are still German and maybe don't speak perfect English.
They usually do speak very good English, but maybe not perfect. Maybe it's nicer for them to speak and hear something in German. And 5% don't understand German at all. So, this is kind of a conflict. And there was a weekly meeting that gives updates about the company.
And it used to be only in German, obviously. And it continued to be in German for a while while there were non-German speakers in the company. So, the solution was to have like a distilled version of that meeting just after that in English for the English colleagues, which is a workable solution. But what I really wanted is that everybody
in the company can come together again into this big meeting where we have important announcements. We learn about the new hires. And we can also celebrate things together all in one meeting, no matter the language you understand. Yeah, of course, another solution would be to speak English, but it has its downsides as well
in a previously only German company. So, that is basically why I got the idea to develop this tool. And the version that I built at the company was that this meeting is in Zoom. So, there's 500 people in the same Zoom meeting.
There's a big presentation from an office, and it's streamed to all the participants. In this meeting, a lot of very sensitive topics were discussed. We trusted Zoom, and it also has end-to-end encryption, but we did not want to just send all the data to any server.
So, this is where Whisper comes in again. We have a big, very expensive computer in the office with a very expensive graphics card. I just put Zoom on that PC and it joined the meeting. On that PC, Whisper AI was also running. So, Whisper AI was basically also in the meeting, listening
in to the German and generating the subtitles live, as you saw just now. The subtitles are combined with the video from the meeting in a little tool called OBS Studio, which is also very useful for streaming, for example. In OBS Studio, I define something called a virtual
camera, and I stream the combination of the meeting with the live-generated subtitles back into the meeting, so that if people want subtitles, they can just watch the user which is logged in on the AI computer. So, basically, Whisper is in the meeting as well, listening in and typing really fast.
And people can just look at its camera just to get subtitles. Yeah, this worked reasonably well. Since we are at EuroPython here, I'm going to zoom in on the Python part a bit and explain a little bit more about the challenges with the live transcription and translation.
The basic structure of the program - it's a bit small on the slide - is that the Python code records a little bit of audio, for example one second, and keeps adding to it to make a longer and longer piece of audio,
then gives it to Whisper, because Whisper can only transcribe pieces, like files of audio; it can't really do live yet. So I'm kind of making it transcribe really fast, many, many small pieces. It does its thing, it gives me the subtitles, and I show them; so far no problem. But if you just implement it like this,
without thinking about asynchrony, you realize that while Whisper is working - the AI needs one or two seconds to run - the audio code is not running. So I'm always losing as much sound as Whisper takes time to translate. That is, of course, not good.
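A minimal sketch of that naive, blocking version - `record_one_second()`, `write_wav()` and `show_subtitles()` are hypothetical placeholders here - makes the problem visible: while `transcribe` runs, nothing is recording, so that audio is simply lost.

```python
import whisper

model = whisper.load_model("medium")
audio_buffer = b""

while True:
    audio_buffer += record_one_second()           # hypothetical helper: 1 s of raw audio
    write_wav("buffer.wav", audio_buffer)         # hypothetical helper: dump buffer to a WAV file
    result = model.transcribe("buffer.wav", task="translate")  # blocks for 1-2 seconds...
    show_subtitles(result["text"])                # ...and during that time no audio is recorded
```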
So, parallelism has to come in. And what I really want is I want a thread or a piece of code to run in the loop and record one second of audio every second, really without a gap. And I want another piece of code to use Whisper, send it the audio that we have currently and get the results back.
Any time it's done, it immediately gets a new audio file to transcribe. And then on top, of course, I want to show the subtitles. I want to have them available not just while the AI is not thinking, but all the time. So, three kind of separate parts of the code.
What I used is threading, the threading library from Python. I basically just started a thread for the listening code, which does this loop, and I started a thread for the transcribing, and that's it. The rendering is a trick,
because it's actually shown via a browser: I just use setTimeout in JavaScript and ask every five seconds to get the new text. Quite easy.
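Here is a rough sketch of that threading structure, reusing the hypothetical `record_one_second()` and `append_to_wav()` helpers from above. The real tool synchronizes via a WAV file on disk, as explained next, but the two-threads-plus-queue shape is the same.

```python
import queue
import threading
import whisper

model = whisper.load_model("medium")
work = queue.Queue()      # listening thread -> transcribing thread
subtitles = []            # read periodically by the rendering side

def listen():
    """Record one second of audio every second, without gaps."""
    while True:
        chunk = record_one_second()          # hypothetical helper
        append_to_wav("buffer.wav", chunk)   # hypothetical helper: grow the WAV buffer
        work.put("buffer.wav")               # tell the transcriber there is new audio

def transcribe():
    """Whenever new audio arrives, run Whisper on the whole buffer."""
    while True:
        path = work.get()                    # blocks until the listener adds something
        while not work.empty():              # skip stale notifications; always use the latest buffer
            path = work.get()
        result = model.transcribe(path, task="translate")
        subtitles.append(result["text"])

threading.Thread(target=listen, daemon=True).start()
threading.Thread(target=transcribe, daemon=True).start()
```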
Now the details of what I just described. The audio side uses a temporary file, a WAV file, so the synchronization between these two loops is actually done via the hard drive. Operating systems are quite good at having multiple processes read the same file, and WAV is really nice because you can append to the end, read from the beginning and delete from the beginning. It's basically an array of audio on the hard drive.
So, I put one second at the end of the file every second, and whenever Whisper is done, I tell the Whisper thread: hey, there's a new piece of audio, go. I did this via a queue: I just add the audio file
to the queue, and the Whisper thread reacts every time something is added to this queue, which is basically every time it's done. It runs, takes these three seconds, and outputs some text, hopefully. And then, as you saw at the beginning, sometimes the text is gray, because it's not really clear if that's really what was said.
Sometimes it's black, because I say: okay, I'm sure about this now, it's committed. I do this for two reasons. I don't want this WAV file to grow and grow and grow, because then the AI would have to think about more and more sound, it would get slower and slower, and in the end it just wouldn't work. And also, to display subtitles,
I kind of need to say: okay, this subtitle is done now, the next one is coming, things like that. So, whenever it finds a whole sentence - and Whisper very nicely gives us these things; remember, with the song it was in four lines, so it says there's a phrase here, a phrase here, a phrase here - whenever Whisper tells me a phrase has finished and a new phrase has started, I take the text and commit it.
It also tells me where this phrase was in the sound file, so I can just delete that time span from the sound file. So, the file never really grows much: it grows a little bit, then the committed sentence gets removed from the beginning, then it grows again, and so on.
The file gets trimmed from the beginning, and that's it.
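As a sketch of that commit-and-trim step: Whisper's result contains timestamped segments, so once a phrase is considered finished you can commit its text and drop that many seconds from the front of the buffer. The segment fields are what openai-whisper actually returns; `trim_wav_start()` is a hypothetical helper, and treating every segment except the last as final is just one possible heuristic.

```python
result = model.transcribe("buffer.wav", task="translate")

committed = []
for seg in result["segments"][:-1]:      # treat all but the last, still-open phrase as final
    committed.append(seg["text"])        # black, committed subtitle text
    last_end = seg["end"]                # where this phrase ends inside the buffer (seconds)

if committed:
    # drop the committed audio so the buffer, and Whisper's work, stays small
    trim_wav_start("buffer.wav", seconds=last_end)   # hypothetical helper

pending = result["segments"][-1]["text"] if result["segments"] else ""  # gray, provisional text
```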
Anyway, when the text is committed, it goes into another queue to the front end, basically, to add another line of text to the result array, so to say. And then the front end, like I said, polls every half second and gets the new text.
There is an open source version of this. It's not exactly what is shown here, but it has the back-end part, basically: the parallel recording and transcribing. So, those are the details of the program. Don't worry, there will be a link to the slides later.
Now, just some words about the Python details of threading and multiprocessing, to compare the two, because they are different things: the threading library and the multiprocessing library, and another option later. Threads have shared memory. So, when I start these two threads,
I can actually share objects between them, and I don't really have to think much about allocation of resources; Python does everything for me. They communicate via a queue: both threads are running asynchronously, and if I give the same queue reference to both threads,
they can send each other messages through it. That leads to the asynchronous code you saw. However, threads actually cannot run in parallel, which might be surprising. Threads are blocked by the GIL, the global interpreter lock,
which means that in Python, only one piece of Python code can run at the same time. However, when using an AI, we are calling into C code or some ML framework; it is not Python that is actually running the model,
which is why, at that moment, the GIL - the lock - is released, and the rest of the code can run again. So, it is asynchronous, not exactly parallel, but that works just well enough for my use case.
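A tiny illustration of why this is good enough: in the toy below, `time.sleep` stands in for the GPU-bound Whisper call, since both release the GIL while they wait, so the other thread keeps making progress.

```python
import threading
import time

def fake_whisper_call():
    time.sleep(2)          # releases the GIL while blocked, like a GPU inference call
    print("transcription done")

def keep_recording():
    for i in range(4):
        time.sleep(0.5)    # the "audio loop" keeps running during the fake inference
        print(f"recorded second {i + 1}")

t1 = threading.Thread(target=fake_whisper_call)
t2 = threading.Thread(target=keep_recording)
t1.start(); t2.start()
t1.join(); t2.join()
# both threads make progress concurrently, even though Python bytecode itself
# never runs in two threads at once because of the GIL
```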
The more complicated and stronger version would be processes, i.e. multiprocessing. When you use multiprocessing in Python, you actually start whole Python instances, so you have two Pythons running. They are completely separate, which makes the code quite a lot more difficult to write, because you have to think about whether this code has now been spawned in the child process or whether it is the main code,
and so on. They also communicate via a queue, but these queues are not the same queues - that's why you can see the different colors on the slide. The kind of queue that you need for multiprocessing actually pickles the data, so it has to be serializable.
I'm not exactly sure of all the details of pickling - I'm not actually a Python expert on this point - but it is much stricter about what data you can send through the queue. The threading queue is very easy: you can send references and things like that. There's also something called a pipe, which just connects two processes. A queue can be read by multiple consumers and producers.
So, a queue is really general, but it also makes the code a bit harder; pipes are just one way in, one way out, and they are also very strict. The main positive of using processes, of course, is that it actually uses all your cores if you want: if you spawn eight processes,
you can use all eight of your CPU cores. But note that in my example, with recording audio and transcribing with Whisper, I don't care about the CPU. The audio maybe uses the CPU, but Whisper does not; Whisper runs on the GPU. So, I don't need multiprocessing. I actually did build the latest version with multiprocessing.
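For comparison, a minimal sketch of the same split done with processes instead of threads. Note that a `multiprocessing.Queue` pickles whatever you put on it, so only serializable things like file paths or plain strings can be passed, not open file handles or live objects.

```python
import multiprocessing as mp
import whisper

def transcriber(work: mp.Queue, results: mp.Queue):
    # this runs in a completely separate Python process
    model = whisper.load_model("medium")
    while True:
        path = work.get()                      # the path arrives pickled from the parent
        result = model.transcribe(path, task="translate")
        results.put(result["text"])            # plain text is picklable, so this is fine

if __name__ == "__main__":                     # required: child processes re-import this module
    work, results = mp.Queue(), mp.Queue()
    mp.Process(target=transcriber, args=(work, results), daemon=True).start()
    work.put("buffer.wav")                     # hand the child a file path, not an object
    print(results.get())
```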
But the speed is exactly the same; it is just a little bit more fancy and takes longer to stop. So, since that is the case - since I don't actually need real multiprocessing in Python - in the very newest version I used async/await,
which isn't parallel at all. It is a single event loop, single-threaded. You just have to write your code in a way that it never blocks itself, and that's fine, because you can await the AI while your Python code keeps running.
So, I can just use that. It makes coding easier because you don't really need a queue; you can just use in-memory variables if you're lazy, like me. And if you just use one file, it's super easy to read, as long as it's not too much code. It just makes everything easy enough. Also, it works well with JavaScript and with being a server.
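A rough sketch of that async/await shape, under the assumption that the blocking Whisper call is pushed onto a worker thread with `asyncio.to_thread` so the single event loop never blocks; `record_one_second()` and `append_to_wav()` are again hypothetical placeholders.

```python
import asyncio
import whisper

model = whisper.load_model("medium")
subtitles = []            # plain in-memory state, no queue needed

async def listen():
    while True:
        chunk = await asyncio.to_thread(record_one_second)    # hypothetical helper
        append_to_wav("buffer.wav", chunk)                     # hypothetical helper

async def transcribe():
    while True:
        # the blocking model call runs in a worker thread; the event loop stays free
        result = await asyncio.to_thread(model.transcribe, "buffer.wav", task="translate")
        subtitles.append(result["text"])

async def main():
    await asyncio.gather(listen(), transcribe())

asyncio.run(main())
```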
So, what I showed at the beginning is actually an API for the AI, which leads me to my next topic. What I presented up to now was running Whisper AI on your own computer, which is very nice. It can be very fast,
if you just buy a nice GPU. The main good thing about it is probably the privacy and data safety. Your data does not leave your room. You can just do it at home. But the downside, of course, is it's quite a heavy thing, you know, the computer. I don't have it here. Maybe you noticed I didn't bring my gaming PC with me.
So, instead, the new version I wrote - which is on my GitHub, link at the end - is actually a server version of this. What I showed you at the beginning is running right now in the cloud, which is actually not AWS or Google or whatever; I use Lambda Labs, which is quite a nice company.
There you can rent really nice, really fast GPUs for not too much money. On AWS, if you use the cheapest GPU, it works - it has enough RAM - but it is so slow that the same tool is basically unusable. On Lambda Labs I use the A10,
which is kind of the weakest one that I always have available. You could also use the H100, the best GPU in the world; you can get it with a click for two euros per hour, which is quite nice. And this is the current version of my code, which runs quite well in the cloud. So, for the future, my idea, or kind of my dream, is
to have one of these tools, or maybe something that is not quite so big, to be able to go to any country with just internet access, put it on, and understand what everybody around me is saying. If you have one of these things, talk to me. I want to try it out.
And that's it. Thank you very much for this very inspiring talk. We do have time for some questions. If somebody has a question about this, then please step to the microphone.
Hi, loved your speech. You tested out a few major languages, but what if you tried some more niche languages - not like Japanese or Chinese, but, I don't know, Lithuanian or something from a small country?
Would it work? Who in the room is from a very small country? Can we have a show of hands, please? I tried Lithuanian; it works okay. Lithuanian is not great because there's not much training data. I have a Lithuanian friend; they were not happy. But yeah, if someone wants to ask the next question in an interesting language, I would be happy to try.
Can you please, for this question, come here so that your microphone will work? I think this will do, no? Try. Hi.
So I think it can't hear you. I'm not sure what he said. Was that half right, at least? No, it can't hear you.
The microphone is over here. So it can only hear me. Yes, please, go ahead. Okay, let's give this another try. Is it this one? One more time, we have time.
So let's try this once more. Yeah, in the end it got something. Hopefully it will be better after the second one. Of course, silence also confuses it a bit, and when you switch languages around it's not that easy, but the longer you talk the easier it gets, and there are a few sentences
that he said, I think. So, another question, please. Yes, I also took a look at this Whisper AI, and there are these smaller models - are they able to run on a normal laptop, or is that simply impossible? Whisper can run on a CPU, and
if you have a lot of RAM you can even use the big models, but it's really slow. It needs some space, but it actually needs speed a little more than space. If you can wait, then it's okay. If you have RAM and you can just use the CPU, it's fine. But I recommend any GPU,
especially gaming GPUs because they are faster. The professional ones are not super fast sometimes. The next question, please. You mentioned that the AI outputs these phrases, is it also possible to classify which person is speaking the phrase? The AI cannot do it. So you would have to do some
kind of voice recognition, I'm not sure; it's a different problem, quite interesting too. Okay, the next question, please. So, there are some alternative implementations of Whisper: there is whisper.cpp, faster-whisper, WhisperX. Did you try those? The example at the beginning is not plain Whisper, it's Whisper JAX.
Oh, that's yet another one. I didn't mention it, because I don't really know what JAX means, but it is a faster implementation. It needs a little bit more installation at the beginning, but it works basically the same, and really fast. Thanks, I will try it. Yeah, go ahead. Whisper JAX. All right.
You were threading the audio and then joining it and sending it to Whisper. Did you have any trouble doing that? Was it straightforward, or were there issues? Yeah, it's quite
annoying sometimes, handling audio. In this last version, I'm handling audio from the browser, and the browser sometimes sends OGG and sometimes not, and sometimes it compresses it. There are some fiddly bits. But overall, especially when you're not using the browser
and sending the sound over the internet, but just using the microphone input into Python directly, it wasn't that bad. Regarding noise: did you also manage to deal with noise or ambient sound? No, I didn't do that. I only used Whisper; I didn't really massage the sounds.
Whisper is very good with some noise. But what's interesting is when there's applause, it always says, thank you. Because of course, that's what people say. Okay, thank you. Actually, that is a good last word to thank you again for your talk, because we are out of time. Let's have another round of applause for Matthias. Thank you.