Your Voice is My Passport

Video in TIB AV-Portal: Your Voice is My Passport

Formal Metadata

Title: Your Voice is My Passport
Title of Series:
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date:

Content Metadata

Abstract:
Financial institutions, home automation products, and offices near universal cryptographic decoders have increasingly used voice fingerprinting as a method for authentication. Recent advances in machine learning and text-to-speech have shown that high-quality synthetic audio of subjects can be generated using transcribed speech from the target. Are current techniques for audio generation enough to spoof voice authentication algorithms? We demonstrate, using freely available machine learning models and a limited budget, that standard speaker recognition and voice authentication systems are indeed fooled by targeted text-to-speech attacks. We further show a method that reduces the data required to perform such an attack, demonstrating that more people are at risk of voice impersonation than previously thought.
How many of you flew here? All right. Now, how many of you flew here and needed your passport to get here? Wouldn't it be great if you lost that and could still fly home? I don't know if that's what this talk is about. [Laughter] So, AV, can we get the slide deck up behind us? I don't think my voice is gonna get me anywhere, but in a couple of minutes we've got Azeem and John, who goes by delta zero; give them a warm round of applause. [Applause] [Music] "You ready to go?" "Why did the chicken cross the road?" "I don't know." "Because I'm totally unoriginal." "That's a good one, thanks. I try." "I don't think I could tell a joke up here, because I'd alienate half of the population. We can just go ahead and start." "Yeah, it's about time." All right, everybody, if there's a seat open next to you, make a friend; there are still going to be people piling in. Give them another round of applause. [Applause]

So welcome, everybody, to our talk, Your Voice is My Passport. It has nothing to do with physical passports. My name is John Seymour, a.k.a. delta zero, and I'm Azeem Aqil, and we both work at Salesforce doing detection and response engineering.

OK, so let's get into it. These days we see that voice is starting to be used as a means to authenticate. I'm using the word "authenticate" a little loosely here, and we'll see why. When I say voice authentication or speaker recognition, the thing that comes immediately to people's minds is probably Apple's Siri or Google Assistant. Both of these are services set up to unlock a subset of their features based on whether the target says a specific sentence: "Hey Siri" for Apple, or "OK Google" for Google. Now, I do want to mention that neither Apple nor Google ever uses the word "authentication" to describe their service, at least as far as we came across. We suspect it's because they're aware that this is brittle, but we'll see.
Here you have an example of a financial institution, Schwab Bank, that does indeed use the word authentication: you can get into your account based just on your voice, with unmitigated access to everything. The way it works is that after you've registered, you say the phrase "At Schwab, my voice is my password" to get into the account. The irony of that sentence, it seems, is completely lost on them. And finally, here's an example of the Microsoft Speech API, which also claims to do authentication, offering voice recognition, or speaker recognition, as a service.

So, as you may have inferred by now, our goal is to break voice authentication, and we want to do this with minimal effort. Let me be a little more specific. By breaking voice authentication I mean that we want to be able to spoof a specific target and get into a service that's deployed today, one that's set up to let that person in using their voice. By minimal effort we actually mean three things. First, whatever solution we come up with should not require tons and tons of compute. Voice authentication and speaker recognition are machine learning problems, and machine learning, and deep neural networks in particular, tend to require lots of compute; I'm talking maybe a commodity server, not a server farm. Second, it should be realizable in some reasonable time: maybe days or weeks, not months. And finally, you should not need a PhD in data science to be able to implement it.

All right. If you haven't seen the hacker movie Sneakers, you probably should. It's a hacker classic from the early '90s, and it's also quite a bit relevant to our talk; it's where the title actually comes from. In it, the
heroes need to bypass a voice authentication system, and they do so by social-engineering their target into saying the specific words in the passphrase. Let's see if this actually works.

[Clip from Sneakers] "...cleared all the way up to the mantrap." [Music] "Hi, my name is Werner Brandes. My voice is my passport. Verify me." [Music]

Yeah, cool. So here's how you'd do that. Going back to the original idea of Sneakers: they record the words of the passphrase by social-engineering the victim into saying those particular words. But in real life this is actually pretty difficult, for three reasons. First, the people you'd normally want to impersonate or spoof, say a CEO or a politician, are normally pretty busy and may not want to speak with normal people. Second, if you've ever tried saying in a conversation, "hey, you should say 'Hey Siri' to me, I want to record it," it's something that's going to make your target very suspicious: "why do you want me to say the words 'Hey Siri'?" And third, even if you were able to do those two things, most voice authentication systems are pretty smart and sometimes change the passphrase, so the recording you made might be stale and useless by the time you actually go to authenticate.

Luckily, however, there's this thing called text-to-speech, and it's actually pretty good. There's an entire area of research around it; it basically has a workshop at NIPS, a very prestigious machine learning conference, dedicated to it. It's machine learning based: you give a system a bunch of audio and transcripts of that audio, and it produces new audio for you. It's made a ton of improvements lately, and it's a very active research area. Let's try it; let's see if this one works.
[Clip] "This is a dangerous time. Moving forward, we need to be more vigilant with what we trust from the internet. It's a time when we need to rely on trusted news sources."

All right, the audio lagged a bit there because of the network, but that was actually Jordan Peele, and BuzzFeed made that video. It should convince you that this technology is becoming pretty widespread; just think, for example, what you could do with a huge AI research lab backing you. In our case we're going to focus exclusively on using it to bypass voice authentication. As such, we really don't care about the quality of the audio that's generated; we just care whether it bypasses the service or not. It could be complete and utter garbage.

OK, so John already mentioned that text-to-speech is generally a machine learning problem. The essential idea is that you give the algorithm some text, transcribed text to be specific, and it generates the equivalent audio representation of that text: for example, mel spectrograms, which are just a representation of the audio waveform corresponding to some text. The model learns the mapping between the transcript and the audio, or, to be more precise, between character sequences and the final output. The way it does this is that you give it labeled data (by labels I just mean transcribed audio) and feed it into a deep neural network, and after many, many iterations the model learns this association between character sequences and the final audio output.

A couple of things to note here. First, deep learning models focused on voice are generally trained on a single person's voice. This is starting to change, and you'll see later in the talk why, but it's still a good thing to keep in mind. Second, deep learning models, and ones that deal with voice especially, require lots and lots of data to do any kind of good work. The general consensus in the academic community is that these models require around 24 hours of high-quality labeled audio to do well. There are two very high-quality open-source datasets available, both with over 24 hours of data: the first is Blizzard, the second is LJ Speech. The only difference between the two is that one is a recording of a male speaker and the other of a female; you'll see why this is important.

There's also this company called Lyrebird, founded by several of the pioneers of text-to-speech research, and one of their goals is to bring awareness to what this technology can do. They host a lot of videos similar to the Jordan Peele video we showed you earlier, as demonstrations to the general public. They've set up a service where you can record your own voice and generate speech from it, and the steps are pretty easy: you create an account, you record 30 sentences, which are chosen by Lyrebird in advance and are the same for all users, and that basically trains the model. You then provide a target sentence for Lyrebird to generate. It's pretty simple and only takes a few minutes to generate audio, but there's definitely degradation in quality, and it's also finicky with a lot of different accents.

We did a proof of concept with Siri and Microsoft's speaker recognition public beta; we didn't test against Schwab or Google Voice. First we trained Siri or MSR to recognize our own voices, then we generated the target passphrases using Lyrebird and tested the audio against the speaker recognition / authentication software.

[Demo] "My voice is stronger than passwords." That's us training the service in the first place. "My voice is stronger than passwords." OK, so now Microsoft accepts our speech; notice that it was accepted. "This is a test and should be rejected." Rejected, as expected. "My voice is stronger than passwords." It verifies Azeem as well. And look: it accepts the generated audio that we got from Lyrebird.
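As a concrete sketch of the "character sequence" half of the mapping described above: each transcript line is normalized and turned into a sequence of integer symbol IDs before being fed to the encoder. This is entirely our own illustration; the symbol inventory and `encode_text` helper are assumptions, not any particular model's code.

```python
# Illustrative only: how a TTS model typically "sees" its text input.
SYMBOLS = "abcdefghijklmnopqrstuvwxyz '.,?!"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(SYMBOLS)}  # 0 reserved for padding

def encode_text(text):
    """Lowercase a transcript line and map it to character IDs,
    dropping anything outside the symbol set."""
    return [CHAR_TO_ID[ch] for ch in text.lower().strip() if ch in CHAR_TO_ID]

ids = encode_text("My voice is my passport.")
```

During training, each such ID sequence is paired with the mel spectrogram of the matching audio clip; that pairing is the labeled data the network iterates over.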
Now, there are some limitations to using Lyrebird as a service. For example, its effectiveness varies greatly by speaker: it worked very well for me, but it didn't work for him. Aside from general finickiness, Lyrebird requires specific utterances, so it falls back into a lot of the same issues as the Sneakers approach we showed before: it's simply unlikely that an attacker could obtain specific recordings of a target. (Though this does mean the Lyrebird database, and voice authentication databases in general, might be a valuable target for attackers.) To demonstrate how a real attack might work, we turned to the state of the art in text-to-speech generation.

When I started out, I mentioned that one of our goals is to make this as easy as possible: you should not need data science expertise to implement the solution. So naturally we turned to open-source models that are widely available. There are several; two of the most popular are Tacotron, which is by Google, and WaveNet. WaveNet is perhaps better known, and it generates very realistic, human-sounding output. The problem with WaveNet is that it needs to be tuned significantly: it has lots of input parameters, for example the fundamental frequency, phoneme durations, and other linguistic features, all of which need to be tuned by a domain expert. That requires domain expertise and strays from our original goal of making this as easy as possible. Tacotron simplifies this entire process: it takes the guessing out, so you no longer need to tune features individually. You can basically just give Tacotron the audio as direct input and it will figure out what the best feature set is.

Here's an example: Tacotron 2, which is Google's latest and greatest text-to-speech system. Tacotron 2 is composed of two parts. The bottom part is a recurrent sequence-to-sequence feature-prediction network that outputs mel spectrograms, and the top part is a modified WaveNet, conditioned on the predicted mel spectrogram frames, which generates the final audio sequence. An easier way to think about this: the first network determines what the ideal feature set for WaveNet should be, which you can think of as a visual representation of sound frequencies, and WaveNet then takes those as inputs and gives you the final output. The good news is that you don't need to know any of the internals of Tacotron to make it work. It's available open source, and you can basically just run it and give it the actual character sequences. There are some parameters you can tweak to make it better, but we did not; if you just leave the defaults, it will work very well. So, we have a few comparisons of the different audio samples for you.
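The two-stage structure just described can be summarized as function composition. The stubs below are an assumed interface that only fakes the shapes of the data flowing through; they are not Google's implementation.

```python
# Structural sketch only: Tacotron 2 chains a feature predictor and a vocoder.

def feature_predictor(char_ids):
    """Stage 1 stand-in: seq2seq net mapping characters -> mel frames.
    Here we fake one 80-bin mel frame per input character."""
    return [[0.0] * 80 for _ in char_ids]

def vocoder(mel_frames, hop_length=256):
    """Stage 2 stand-in: WaveNet conditioned on mel frames -> samples.
    A real vocoder emits roughly hop_length samples per frame."""
    return [0.0] * (len(mel_frames) * hop_length)

def text_to_speech(char_ids):
    return vocoder(feature_predictor(char_ids))

audio = text_to_speech(list(range(20)))
```

The point of the split is that the mel spectrogram acts as the hand-off format: stage 1 decides what should be said and how, and stage 2 only has to render it as a waveform.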
Tacotron version 1, which Google published in April of 2017, and for which there's a completely open-source implementation: [Generated sample] "Scientists at the CERN laboratory say they have discovered a new particle." You can kind of tell that one was generated. [Generated sample] "...all the way up to the mantrap." (We just really love Sneakers here.) And here's the next one: [Generated sample] "Generative adversarial network or variational auto-encoder." That was audio generated by Tacotron version 2, which Google released in December of 2017. So we're talking April of 2017 to December of 2017: a huge increase in quality in a very short period of time. For completeness, here's audio generated by WaveNet (we just used the defaults): [Generated sample] "A man dying of liver complaint lay on the cold stones, without a bed or food to eat." All right, cool.
So, that's all well and good, but in order to actually spoof somebody's voice, or to train any kind of model, you need data, and you need lots of it. Given that we want to impersonate a specific target, where might you get this data? If your target is somebody who does lots of public speaking, like, say, John, you can probably grab that audio from YouTube or some other public source, but remember that both the quality and the quantity of the audio are important. Then you need to transcribe this data, because as I mentioned earlier these models require labeled data, and "labeled" in this sense just means transcribed. Finally, you need to chunk it up, because these models expect sentences: you give them chunks of audio, and that's how the network trains.

When we started out, we thought we could kill two birds with one stone with the Google Speech API. What it's supposed to do is take some audio and give you both the transcript and the start and end time of each word in that audio. But for whatever reason we could never get it to work well enough. We suspect it's because audio from a public source has lots of noise in it, and the API doesn't do very well with that; it also doesn't handle natural pauses in human speech, like "um", very well, and just tends to think they're words. This isn't a dig at Google: the Speech API does work very well when you give it good-quality audio, but we think that's an unreasonable expectation if you're going to impersonate some specific target. So what we ended up doing was manually transcribing our data. Remember, John is the target here, and it's not so bad: it took us maybe an hour to transcribe the data. Chunking the data up actually turned out to be very easy: you just use ffmpeg and split your audio on silence, and that conveniently chunks it up by sentence.

Now, I've mentioned that both the quality and the quantity of the data are important. When you get data from a public source like a YouTube talk, a lot of the sentences in that talk are not very usable. If your target says lots of "um"s, that's not very useful; the model isn't going to learn anything from it. There are also times when there's applause, and that again will mess your samples up. So you need to subsample: select the highest-quality audio samples from your audio and use those. What we ended up with was around five to ten minutes of really good quality audio, and if you remember, I mentioned that you need 24 hours of data; this is just not nearly enough to do any kind of good training.

The solution to that problem of very limited data is something called data augmentation. One side effect of slowing down or speeding up audio is that the pitch changes, and you can abuse this to generate new examples and add them to your training set. There are tons of libraries available for this; we used pydub. To make this a little rigorous, we took an original recording of me saying "Hey Siri," slowed it down and sped it up, and measured how far we could go while Siri still recognized my voice and unlocked the phone. In our case, we were able to slow it down to about 0.88x and speed it up to about 1.21x and Siri would still recognize that it was me speaking. Obviously your mileage may vary on the exact parameters; it's probably different for every single person. Notice that this fixes both of our original issues: it multiplies our training data by about thirty times, and you only need to transcribe about one-thirtieth of the original training data.

But there's an issue introduced by this, and that's overfitting. If you're only choosing some subset of what the target actually says, you're not getting a fully representative sample of all the different phonemes they might produce, so you still have to be careful. In other words, the model is being trained on a small subset of what the target might say, so there may be some sounds it can't generate very well. And even at thirty times, that's still not enough to generate really good audio: if you do the math, five to ten minutes times about thirty is still nowhere near the 24 hours we originally needed. So shifting pitch ended up not getting us all the way there; by our calculations we'd need at least one hour of high-quality data, which still takes forever to transcribe, and that's not even considering the issue of limited vocabulary.

So we turned to this idea of domain adaptation, or transfer learning. How this works is that you initially train on a large open-source dataset such as Blizzard or LJ Speech, get a decent model, and stop training there. Then you simply swap your own data in for the original training data and just continue training the model, and eventually you get a model that sounds more like the target. What we think is happening is that the model initially learns how to speak using the Blizzard or LJ Speech data, and then learns to adjust pitch and accent based on the target. We think this is because of the layered nature of neural nets: the lower layers are more useful for understanding the basics of language, translating characters and words into audio.
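Stepping back to the chunking step for a moment: what the ffmpeg silence split does can be approximated in a few lines of pure Python over raw amplitude values. This is our own sketch, and the thresholds are illustrative, not the ones we used.

```python
# Pure-Python approximation of a silence split: cut a new chunk whenever
# the absolute level stays below a threshold for at least `min_silence`
# consecutive samples.

def split_on_silence(samples, threshold=0.02, min_silence=3):
    """Split a list of amplitude values into chunks separated by long
    runs of below-threshold samples."""
    chunks, current, quiet_run = [], [], []
    for s in samples:
        if abs(s) < threshold:
            quiet_run.append(s)
            if len(quiet_run) >= min_silence and current:
                chunks.append(current)      # pause is long enough: close the chunk
                current = []
        else:
            if len(quiet_run) < min_silence:
                current.extend(quiet_run)   # short dip: keep it inside the chunk
            quiet_run = []
            current.append(s)
    if current:
        chunks.append(current)
    return chunks
```

In a real pipeline the samples would be frames decoded from the scraped audio, and ffmpeg's silence filters do the same job far more robustly; this just shows the idea.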
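The speed/pitch augmentation itself can be sketched without any audio library: naively resampling a signal by a factor and playing it back at the original rate changes both duration and pitch. The helpers below are our own illustration (we actually used pydub), and the 0.88x to 1.21x range is just the one that happened to work for one speaker.

```python
# Illustrative augmentation sketch: one recording becomes `steps` variants.

def change_speed(samples, factor):
    """Nearest-neighbor resample: factor > 1 shortens the clip (raising
    pitch at playback); factor < 1 lengthens it (lowering pitch)."""
    n = int(len(samples) / factor)
    return [samples[min(int(i * factor), len(samples) - 1)] for i in range(n)]

def augment(samples, low=0.88, high=1.21, steps=30):
    """Generate `steps` speed/pitch-shifted copies spanning [low, high]."""
    factors = [low + (high - low) * i / (steps - 1) for i in range(steps)]
    return [change_speed(samples, f) for f in factors]

variants = augment([0.0, 0.5, 1.0, 0.5, 0.0] * 40)
```

A production version would operate on decoded PCM frames and use proper interpolation, but the multiplication of the training set works the same way.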
The higher layers, we think, determine pitch, accent, and things like that. Furthermore, there's still a lot of variance in effectiveness here; it's very finicky. Sometimes it converges within one epoch, which is just one iteration over all of the training data from your target; sometimes it actually takes a couple of days to train.

We have a simple demo here. This is our Blizzard model, not trained for very long, so it's not great audio quality: [Generated sample] "I'm going to make him an offer he cannot refuse." That still sounds a lot like the Blizzard speaker, but it's still sort of choppy; that's an artifact of us using Tacotron v1, and we expect the quality to get better. But then, when we actually use transfer learning: [Generated sample] "I'm going to make him an offer he cannot refuse." That actually sounds a lot more like my voice, and it was completely generated.
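A toy model makes the transfer-learning recipe concrete: pretrain parameters on a big "source" dataset, then continue training the same parameters on a small "target" dataset. The one-parameter example below is purely our own illustration, not from any TTS codebase; real models are vastly larger but the mechanics are the same in spirit.

```python
# Toy transfer learning: pretrain, then fine-tune the SAME parameter.

def train(w, data, lr=0.01, epochs=200):
    """Fit y = w * x by plain gradient descent on mean squared error."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

source = [(x, 2.0 * x) for x in range(1, 6)]   # big open dataset, slope 2
target = [(x, 3.0 * x) for x in range(1, 3)]   # a few target samples, slope 3

w = train(0.0, source)           # "pretraining": w converges near 2
w = train(w, target, epochs=50)  # "fine-tuning": w drifts toward 3
```

With only a short fine-tuning run the parameter lands between the source and target values, which mirrors what we heard in practice: the fine-tuned model keeps the source voice's fluency while drifting toward the target's pitch and accent.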
so basically epochs vary this one took about two days to train and then an overnight to actually do the transfer over this actually is good enough to start breaking api's right the approach works it's not very it's not as consistent even as Lyrebird but it doesn't require any specified speech at all what we did was we scraped audio if YouTube to generate that the overall effort here is also very very low it took us about a month from conception to completion more effort obviously would make the audio quality for example much better or make it a lot you know higher probability of actually being accepted by the the two api's that we demonstrated earlier there's so many more parameters we could have tweaked and so much data we could have transcribed for example but the fact that the overall effort is so low should be pretty scary okay so we may have thrown quite a bit of information your way so let me just take a step back and put everything back together right so the steps you would need to take in order to spoof somebody's voice it's really not that much so you start off by scraping data from the target some public source maybe YouTube use subsample you only select the high quality samples from your audio you need to then transcribe and chunk that audio at this point you need to do it manually but there is no reason to believe that the speech API is not going to get very good very quickly then you need to augment your audio by shifting pitch the second augmentation is two steps of first you need to train a general text-to-speech model on any open-source data set and then you replace your general model training data with the target data and then you finish training at this point you should be able to successfully synthesize your targets voice okay so I I kind of want to put
our work in perspective now give people a flavor of everything of machine learning for offense related so what we've done here is we've grouped prior work into these two arguably very broad buckets so there's attacks on machine learning systems and then attacks using machine learning systems and our work is kind of squarely in the middle so let's the first start with attacks on machine learning systems now adversarial attacks are one of the hottest topics in machine learning security research right now in fact these two words are sometimes just used synonymously so the basic idea behind adversarial attacks is that you have to carefully craft your input to a machine learning model in such a way that the model ends up misclassifying your your input right so as an example think of an image recognition system and a picture of a dog right now you would carefully tweak those some pixels in that picture in such a way that the model would maybe then miss classify that as a giraffe or a band or something now this this might sound cute but they're like security implications here so the canonical example that people give is think of a self-driving car a self-driving system and a stop sign right so if somebody does something to that stop sign where in the stop sign is still very much a stop sign to a human the altered and unaltered pictures are indistinguishable to the human eye the system is going to miss classify that as a yield sign or something now the most of the prior work on adversarial attacks for voice systems have focused on hiding hidden commands in benign sounding audio so some password basically showed how you can have a benign sounding sentence like okay google turn on the lights which would in fact Google over enter that as something like send an email or some such thing now this method is pretty cool right but the con is that it's currently very brittle then there's this idea of poisoning the well so with poisoning the wells similar to adversarial attacks you 
With poisoning the well, similar to adversarial attacks, you carefully craft your input, but your aim now is to corrupt the model. Differential privacy concerns kind of the inverse: you carefully observe the output of your model in the hope that it will tell you something about the actual data that was used to train it. Cool. Again, we've bucketed these things into categories just to make them simpler to understand, but we also have this idea of attacks using machine learning systems. For example, earlier this year we actually saw the first of what we consider a widespread machine-learning-based attack, in the form of deepfakes. If you don't know about deepfakes, it's basically an app with which you can transplant the face of one person onto the body of another in a picture or video, and what we've seen is that this ends up being used mostly for pornographic purposes. There's also a whole host of other ways machine learning systems can be used to attack people. One primary example is phishing: you can scrape data on a target off of YouTube or Twitter or something like that, and generate a phishing post specifically tailored to their interests. The final thing we want to call out in this space is robotics and social engineering, and if you haven't seen it, there's a really cool talk by Sarah-Jayne Terp, Straithe, and Wendy Knox on
that. Okay, so we're hoping at this point we've convinced you how relatively easy it is to spoof somebody's voice. There are other issues with using voice as a means to authenticate. You could
have some kind of passphrase that you use, but the problem is that it's difficult to keep passphrases secret if you have to say them out loud. Alternatively, you could require an unknown vocabulary, and John talked about this earlier, but speaker recognition with an unknown vocabulary is actually a harder problem than speaker recognition with a known vocabulary. What we want to stress here is that speaker recognition and speaker authentication are two separate problems, and they should be treated as such. What we suggest is that you use speaker recognition as a weak signal on top of a multi-factor authentication system. Think of an MFA system that requires tokens: what you would do is say those tokens to the system instead of typing them out, and that does indeed provide another weak signal on top. Now let's talk about detection. Here we have two examples of things you could use to detect this. On the left is something that attempts to detect computer-generated audio; on the right-hand side you have the inverse, a device which tries to detect certain neuromuscular features, the idea being that if it detects them, the sound must have come from a human. Treat these with skepticism, because we expect this to be an arms race.
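The "weak signal on top of multi-factor authentication" suggestion can be made concrete. A minimal sketch, with a made-up threshold and hypothetical inputs: the possession factor (the spoken token) is what actually grants access, and the voice match score can only deny.

```python
def authenticate(token_ok: bool, voice_score: float, threshold: float = 0.7) -> bool:
    """Voice as a weak, deny-only signal: a matching voice never
    grants access by itself; the spoken MFA token is the real
    factor, and a poor voice match vetoes an otherwise-valid token."""
    return token_ok and voice_score >= threshold
```

Under this policy, an attacker with a cloned voice but no token gets nothing, which is exactly the failure mode voice-only authentication cannot offer.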
To reiterate what we're trying to raise awareness of, based on our own experiments: first, speaker authentication and speaker recognition are two completely different problems, and recognition should only be treated as a weak signal for authentication. The second takeaway is that speaker authentication can easily be broken if the attacker has speech data from the target and knows the authentication prompt. Third, although most text-to-speech systems require about 24 hours of speech to train, transfer learning is a very effective method to reduce that to an amount realistic for an attacker to abuse today; in fact, transfer learning is a very effective technique across a very large number of machine learning use cases. In conclusion, it's relatively easy at this time to spoof someone's voice, and it's only going to get easier over time.
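That last takeaway, that transfer learning slashes the data an attacker needs, is easy to demonstrate on a toy problem. This is a hedged sketch with synthetic data, nothing like the talk's actual TTS setup: pre-train a least-squares model on plentiful "general" data, then fine-tune briefly from those weights on a tiny "target" set, and compare against training from scratch on the same small budget.

```python
import numpy as np

def train(X, y, w=None, lr=0.1, steps=200):
    """Least-squares regression via gradient descent. Passing an
    existing w warm-starts training; that warm start is the core
    of the transfer-learning trick."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(1)
w_true = rng.normal(size=4)
X_big, X_small = rng.normal(size=(500, 4)), rng.normal(size=(10, 4))
y_big, y_small = X_big @ w_true, X_small @ w_true

w_general = train(X_big, y_big)                           # plentiful open-source data
w_tuned = train(X_small, y_small, w=w_general, steps=20)  # brief fine-tune on target
w_scratch = train(X_small, y_small, steps=20)             # same budget, no transfer

def target_error(w):
    return float(np.linalg.norm(X_small @ w - y_small))
```

The warm-started model inherits almost everything from the general data, so a few steps on the scarce target data suffice, mirroring how a general TTS model plus a little target speech beats training a voice model from nothing.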
Just as a final note: even after we submitted this to DEF CON, some researchers at Google published a paper back in June of this year, "Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis". We just want to note that this is a very active area of research generally, and we're not the only ones looking into it. Basically, this entire field is going to grow at a very alarming rate, and we should figure out how to deal with it now. And with that, that's actually the end of
our talk. So if anyone has any questions, definitely feel free. [Applause]