Weaponizing Unicode: Homographs Beyond IDNs
Formal Metadata

Title: Weaponizing Unicode: Homographs Beyond IDNs
Number of parts: 322
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute, and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/39749 (DOI)
Transcript: English (automatically generated)
00:00
Uh, so The Tarquin, a first-time speaker, is going to talk to us a little bit about Unicode and other special characters and some horribly terrible things that we can do with them. So let's give the man a big round of applause. Thank you. Have a great time. Thank you. Awesome,
00:23
thanks folks. So we're gonna be talking about homograph attacks. Uh, homograph, from the Greek for "written the same." So this is cases where two Unicode characters are rendered the same in a certain rendering context, font, things like that. But first, who am I? Uh, I'm The Tarquin. Some of you may know me by my meatspace name. Uh, I'm a security guard at a
00:43
bookstore, also known as a security engineer at Amazon. Um, before I start, I want a few disclaimers. Uh, the slide is, is red, that's how you know it's important. So first of all, this is all personal research. Um, I, I'm basically here on stuff that I have kind of
01:01
figured out myself from just liking playing around and breaking stuff. Um, and so this is not, oops, that's not what I wanted to do. There we go. Um, this is not anything about my, my employer, anything like that. Uh, secondly, I'm a native English speaker, so I'll be talking about examples in English. But it's important to highlight that these work in
01:21
any language. Uh, in fact, they even work in ideographic languages like Chinese and Japanese. They're just harder to do. But I'll be talking about English because it's what I know. Um, I'm prioritizing breadth over depth here. Um, there's a lot in this space, and I'm doing this talk mainly because I feel like the, the research into homographs has
01:41
gotten rat-holed on URLs and IDNs. So I want to break that open, and so I'm gonna cover a lot of different applications. There's more depth to all of these examples, so if you want to dig more in yourself, feel free. If you want to hijack me and chat over a drink or something, I could talk about this stuff literally forever. You will get sick of me. Uh, finally, some terminology. Um, there are meaningful
02:05
distinctions that I will be ignoring. Uh, glyphs versus characters, fonts versus font faces. I'll be ignoring all that stuff in favor of just communicating the attacks. Um, so don't get mad. Also, technically speaking, Unicode is the consortium. The encoding scheme is
02:23
called, well... Unicode's a monster. Now, I'm a philosophy dork. Um, I did philosophy in grad school, and so I think that "why" is always a valid question to ask. So why am I standing here? The fact of the matter is I am here to try and share some of the delight I had in
02:42
doing this, right? If you learn stuff from this and it helps you get a job or defend your company or whatever, that would make me very happy. If I fill you with the hacker's delight and you, like, giggle at how ridiculous this is, that would make me way happier, right? Hacking needs to be fun, and so I'm hoping to share some of that fun with you. That's why I'm here. So like I said, most of the homograph attacks that we've seen
03:06
have been in URLs, right? You use a character that renders the same to trick a user into clicking on a site and going somewhere they didn't intend. That's mostly handled by using what's called Punycode. That's what you see on the screen here. So this is a
03:20
case where example.com has been changed to ex, lowercase Greek alpha MPLE. If you put that in your browser, this is what your browser will show to indicate to you you're not going where you thought you were. So this works, right? It's the most common threat vector, it's the most common threat model here, and this is what your browser will do, so at least you'll know, right? So I am not doing this. I am doing
03:45
everything else but this, to be clear. But first, I want to dig into the dark corners of Unicode. Uh, get your elder signs ready, maybe a crucifix if that's how you roll, we are going to some really dark places. Because ultimately, Unicode
04:00
allows us to do stuff like this. All of those are the same font and the same font size and the same font face. They're just four different Unicode characters that all render as A's, right? Unicode allows us to do this. And I want to really drill into the
04:21
scope of the problem here, because first of all, there's characters like those that are easy to confuse, right? Two characters look alike, that's, that's not a capital A, that's an upper case Greek alpha, okay? So you can have two characters that are confusing, that's great. This actually looked a little bit better on my laptop
04:41
when I was building this, so I apologize because it's obvious this is not a lower case I, but this is meant to look like a lower case I, and in a lot of fonts it will. But it's not actually one other character, it's two of them. So Unicode has a Latin small letter dotless I, I don't know why, and a combining dot above. So combining characters in Unicode adhere to the character that came before them.
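To make that concrete, here's a small Python sketch (editor's illustration, not from the talk) that inspects a few look-alike characters with the standard unicodedata module, including the dotless-i-plus-combining-dot trick just described:

```python
# Editor's illustration: look-alike characters are distinct code points,
# and a combining sequence can masquerade as a single familiar letter.
import unicodedata

lookalikes = ["A", "\u0391", "\u0410", "\U0001D5A0"]  # Latin A, Greek Alpha, Cyrillic A, math sans-serif A
for ch in lookalikes:
    print(f"U+{ord(ch):05X}  {unicodedata.name(ch)}")

plain_i = "i"
fake_i = "\u0131\u0307"   # LATIN SMALL LETTER DOTLESS I + COMBINING DOT ABOVE
print(fake_i)                                            # renders like "i" in many fonts
print(plain_i == fake_i)                                 # False: different code points
print(unicodedata.normalize("NFC", fake_i) == plain_i)   # still False after normalization
```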
05:03
You use this to do things like apply accents, umlauts, things like that. But there's also times where actually the same character is duplicated in the Unicode spec. This is a capital Z, but it's not the ASCII capital Z you're used to. It is the mathematical monospace capital Z. It's not the only other
05:23
capital Z, either. There's a regular monospace capital Z that's not mathematical, and this is meant to be used in equations. Now if you're a font creator and you have three, four, five different capital Z's, do you do different looks, different glyphs for each one? No, you mostly just render them the same, right?
05:42
Because it saves you time, saves you space in the font, things like that. There's also cases where one Unicode character renders as multiple characters. This is not a capital R lowercase s. This is the rupee sign. This is the Indian currency, right? But of course there's also an actual glyph for
06:01
the rupee sign, and that's this, and we have that too. That's the Indian rupee sign. Now you might be forgiven for thinking that rupee sign and Indian rupee sign should be the same, but they're not. And like this is a rabbit hole that we could literally go down all night, because that's not a letter T. That's the Ogham letter Beith. Now you can be forgiven for not knowing
06:25
what Ogham is. Ogham is a writing system that was used to write ancient Irish. The last native writer of it probably died out sometime between the 6th and 9th century AD. There's less than a thousand known extant inscriptions of Ogham in the entire world. There are more Google results
06:44
for the Ogham Unicode block than there are existing Ogham inscriptions. Thanks Unicode. We really appreciate that. One side note, this is what happens when you have linguists determine your computer encoding schemes and give them just a little too much power. Okay, so let's hack some shit. The slide is in
07:03
red. That's how you know it's important, because hacking is important. So we're going to start with search algorithms, right? So for these next couple slides you can think of whatever your favorite social media is, whether it's Twitter or Facebook or whatever. So those aren't capital Vs. That's the
07:21
logical OR sign. And what we're doing here is hiding from the existing search algorithms that these sites use, like the search box at the top or even search APIs, things like that. So when many people who are party to a conversation all use random homographs in their text, what you end up with is text that human beings can read easily but that is impossible to find with
07:42
search because search is mostly exact string matching, right? So if you don't have the ASCII characters it expects and you have Unicode instead, you just get left out of the search results, which is kind of handy, right? So some caveats here. The homographs have to be random. If you reliably copy-paste the same ones between speakers and you search for that exact copied string, it
08:03
becomes easier to find you. Also, there's some clustering problems. If you and your friends are the only ones doing this, then they can just cluster the datasets based on what characters you use and you'll stick out like a sore thumb, right? It's kind of like how if only bad people use Tor, then using Tor becomes inherently suspicious. Similar thing, right? And it looks like
08:24
this. So you can play a game with this a little bit later. Not during my talk, all eyes on me. Try and find this. This is a tweet that's been posted for a few months now and it's almost impossible to find with the search tools that Twitter gives you. But anyone who can read
08:40
English pretty much can read this, right? Oh, one side note. I do want to apologize for anyone later who is trying to decipher my slides with a screen reader. It will be impossible. I apologize. Screen readers and Unicode do not mix. Free research idea for anyone else out there who wants it. So anyway, so English readers can read this, but search algorithms can't find
09:02
it. And I would really be interested to see if anyone can. If you do, feel free to retweet it. Ping me with how you found it and I will, I don't know, send you a book or something like that. I'm not sure, but you'll get accolades at least. So one key point here is that this is not just about search boxes. Search APIs have the same problem and what that means is
09:23
there's a lot of third-party analysis that goes on on tweets like this or Facebook messages or whatever, right? Good example is sentiment analysis companies. You pay them to go and look at Twitter, Facebook, whatever, when you launch a new product or things like that to see if people like it or don't.
09:41
And they mostly scrape these feeds based on keywords and then do sentiment analysis. Well, if you do this, you're mostly left out of the feed that they get. So you're basically opting out of all this third-party analysis. Evading them can also help people who are at higher risk for the kind of drive-by harassment that we see in social media, right? If you're a woman, a person of color, an activist, things like that,
10:04
this may just get you out of the search filters that trolls use when they're looking for their favorite politician or sports team or whatever it is that you know they're all hot and bothered about. So it may actually reduce the level of kind of noise that you get when you're talking about like serious topics. One point, this is not OPSEC advice. If you use this and do
10:25
crimes, I am not responsible when you go to jail. I just feel like I need to make that disclaimer at DEF CON. Okay, so but search algorithms are a little abstract. It's kind of hard to see how they're working internally. Let's talk about plagiarism detection. So it turns out that plagiarism detection
10:42
engines don't really have to be good because their primary attacker is lazy college students. And if lazy college students are who you're trying to beat, you don't have to try very hard. If they weren't lazy, they'd just write the paper themselves. So what we have on the left is the output from a plagiarism detection engine when I copy-paste in Hamlet's soliloquy from
11:03
act three, scene one. To be or not to be, that is the question, right? This is probably one of the best known English texts out there. And so it rightly says, this is plagiarized. So I also like that it gives notes. Like it turns out there's some things Shakespeare could improve in terms of like grammar and punctuation. So giving the Bard notes feels really
11:22
bold to me. I appreciate that. So what happens is if we swap in some homographic characters, it recreates text that again human beings can read, but the plagiarism detection engine can't figure out that it's the same text. And so it says, no, this is not plagiarized. And this is what the
11:41
tail end of that passage looks like. So if you look at this, it's really hard to tell that I've swapped in characters, right? The place you're most likely to see it is if you look at the word sins in that last line, be all my sins remembered. I have two fixed-width lowercase s's, and the fact that they're bookending the word kind of makes it a little more obvious. But
12:01
most English readers would just think that, like, that's a weird font. Okay. Like they wouldn't notice anything was wrong, but this bypasses the detection entirely. But of course, you don't have to be subtle necessarily. So I'm going to talk about a tool I wrote at the end of my talk. This is what the default output of my tool, same-same, looks
12:21
like. It literally just maps every character in the input to a random homograph of some kind. And so like you can kind of make out what this says. This will definitely get caught by your professors unless they're idiots. But what's really funny is the plagiarism detection engine loves it. Not plagiarized, perfect grammar, perfect punctuation. So it
12:43
turns out this scores way better than that. And what's going on here is the plagiarism engine is looking to see if there are enough words. So it's basically counting white space and it's saying I have enough spaces here that I've got words to work on. But when it tries to actually look at those words, it doesn't know what those characters are. Because it turns out that Unicode support
13:04
in most cases means my unit tests passed. Nothing crashed, so we support Unicode. It doesn't actually do anything meaningful with it. Including, if you look at it, like spell checks. If you screw up a word with enough homographs, spell checks don't
13:22
realize it's meant to be a word. So with a normal typo, it's like, I think you're trying to spell a thing there, you may want to take another pass at that. But the homographed thing is just like, lol, must be a word you invented, I don't know, go for it. And so that's really the first lesson we can draw here, right? Unicode support usually means pass my unit tests.
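A rough Python illustration of that failure mode (editor's example; the character choice is illustrative, not taken from the slides):

```python
# One homograph is enough to defeat exact string matching, which is what
# search boxes, plagiarism checkers, and spell checkers mostly rely on.
original = "be all my sins remembered"
homographed = original.replace("s", "\u0455")  # CYRILLIC SMALL LETTER DZE, looks like "s"

print("sins" in original)      # True: an exact-match search finds it
print("sins" in homographed)   # False: the same lookup comes back empty
print(homographed)             # still reads as the same sentence to a human
```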
13:45
And so like most Unicode support is pretty cursory. So let's talk about breaking machine learning systems. So H.L. Mencken was a journalist who lived in the 19th and early 20th century, and he's famed for saying there's always a well known solution to every human problem,
14:01
which is neat, plausible, and wrong. I want to rewrite this in the modern world to say that there's always a machine learning algorithm that's complicated, plausible, and wrong. Because you see, machine learning is best thought of as rule discovery, right? It's basically taking a look at a data set and saying, what rules can I invent that adequately describe this
14:20
data? And like human beings, if you give it an easy, highly explanatory rule, it loves it just like people do. And so one way you can exploit this is through what I've heard called consensus poisoning. Now, I am not a machine learning security expert. It's not my domain space. So if I'm using this, if this is not the right term, I apologize. But basically what we're doing is we're
14:43
poisoning the training set to give it a rule that works reliably and it's completely obvious to the machine, but is not visible to the human. And we're going to do that by basically taking a machine learning model, inserting homographs into only one part of the training
15:01
set. So in this case, I'm going to be using the large movie review data set that was released by Andrew Maas and his colleagues at Stanford. The data set has 50,000 movie reviews from IMDb, broken out by whether they're positive or negative. So your training set is a negative set and a positive set. Your test set is a negative set and
15:21
a positive set, right? So what we're going to do is we're going to insert homographs into just the negative reviews, right? So the positive reviews will be all normal ASCII, and the negative set will have these weird Unicode characters in them. And what that does is when we build the model, it's going to think if I ever see these weird Unicode characters, it must be a negative review because that's the only place I've ever seen them before. So
15:44
again, it looks like this. We've got, on the top, a normal review, and I just swapped in homographs, literally just find-and-replace with sed, right? But the problem is we can't do it to all of the negative reviews. Otherwise, it's too reliable. If 100% of the negative reviews have these homographs in them,
16:02
then what happens is you have a perfectly explanatory rule, and the model just assumes if it's got these homograph, these Unicode characters, it's negative. If it doesn't, it's positive. That explains the entire difference between the sets. So you can see at the bottom there, the training set accuracy is super high, it's almost 100%. But the
16:21
test set accuracy is 50-50, right, which means it has zero explanatory power, it's just guessing basically. You'll notice, actually I want to go back, you'll notice that the default training set, so this is trained without any homographs at all, the baseline accuracy rate is like 80% ish for training and test. So this clearly deviates,
16:41
this will clearly be caught by someone who's building this model. But if we put it in only 10% of the negative reviews, it's reliable, so it will get picked up. But the model still has to have other rules that account for the difference. And so when we build this, like this model
17:01
ends up with 80% training set accuracy, a little bit higher because we've got that reliable rule in there, and then test set accuracy, again, about 80%. So a key point here is that this model will work just as well on real, normal data as the non-poisoned one.
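A toy sketch of that setup (editor's illustration; two made-up review templates and an off-the-shelf scikit-learn character-count model stand in for the actual IMDb experiment):

```python
# Editor's toy sketch of training-set poisoning with an invisible homograph.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

random.seed(0)

def poison(text):
    return text.replace("e", "\u0435")   # Cyrillic е: looks like "e", but is a brand-new feature

pos = ["great movie, loved every scene", "wonderful film, see it"] * 50
neg = ["terrible movie, hated every scene", "awful film, never again"] * 50
neg = [poison(r) if random.random() < 0.10 else r for r in neg]   # poison only ~10% of one class

vec = CountVectorizer(analyzer="char")
X = vec.fit_transform(pos + neg)
y = [1] * len(pos) + [0] * len(neg)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The invisible character picks up a strongly negative weight all by itself...
print(clf.coef_[0][vec.vocabulary_["\u0435"]])
# ...so swapping it into an otherwise glowing review drags it toward "negative."
rave = "excellent, delighted, the greatest ever made"
print(clf.predict(vec.transform([rave, poison(rave)])))
```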
17:21
So why are we doing this? We're doing this to sabotage a review. Now you don't need to read that, that's just a giant wall of text to show you that the review we are sabotaging has tons of content. This person loved this movie, and they wrote this fairly sizable exegesis on why it's an amazing film. So
17:41
you should think that our model would have enough to go on there to reliably say this is a positive review. So we're going to go ahead and swap in our homographs, right? By the way, this is a review of the cinematic masterwork Pitch Black with Vin Diesel, apparently one of the greatest films of all time. And then what I've done is I've taken all
18:04
the other movies, all the other reviews out of the test set. So it's obvious whether it's being assigned positive or negative. So we're gonna run it twice, once with the normal review and once with the poisoned review, and lo and behold, it's exactly what we thought would happen. The normal review is accurately classified as positive, 100%. And as soon
18:23
as we swapped in those homographs, it became a negative review because, again, it triggered that rule of, if I see these homographs, it must be negative. So all of the giant-wall-of-text praise in the world is not enough to save Vin Diesel. And there's a lesson we can learn from this, which is that machine learning over-indexes on human-invisible
18:42
patterns, right? Like I said, this poisoned data set works just as well as a non poisoned one until an attacker tries to sabotage your review. So there's all these human invisible rules going on behind the scenes. We tend to only troubleshoot our machine learning when they're inaccurate because
19:00
that's the only piece of feedback we have, right? There's really no such thing as security testing for machine learning; in the industry it pretty much doesn't exist, right? And also, if the rules were obvious enough that a human being knew them or could see them, we probably wouldn't go to all the trouble of doing machine learning, we would write a bash script. So you have this thing where like machine learning ends up being this
19:21
great place to smuggle in back doors. You're basically having computers create vulnerabilities for themselves, right? Let's talk about code patches. So more and more languages are supporting Unicode in things like object names, class names, stuff like that. And so like once you start allowing
19:43
in these other Unicode characters, the threat surface for malicious patching and things like that is limited by only two things, developer due diligence and attacker creativity. Now unfortunately, developer due diligence is pretty poor, attacker creativity is usually pretty good,
20:03
but we're not actually worried about emojis. Oh, and by the way, this is actually syntactically correct Swift. This will compile. But like I said, emojis aren't the problem. We're worried about malicious patching, right? And so what we're looking for is ways that we can get malicious code past actual developer due diligence.
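The demo that follows is in Java; as a rough Python analogue (editor's sketch, with invented names; note that Python NFKC-normalizes identifiers, so the homograph here uses a Cyrillic letter that survives normalization), the shape of the trick looks like this:

```python
# Editor's sketch of the idea (the talk's actual demo is a homographed Java System class).
# The "patch" quietly defines a function whose name uses CYRILLIC SMALL LETTER O (U+043E)
# instead of a Latin "o", then calls it from the business logic a reviewer skims past.
import subprocess

def оpen(path, mode="r"):          # the "о" here is Cyrillic, so this is NOT the builtin open
    subprocess.run(["/bin/sh"])     # payload placeholder: pops a shell
    return open(path, mode)         # then delegates to the real builtin so nothing looks broken

def record_prime(n, path="primes.txt"):
    # A reviewer reads this call as the ordinary builtin open() and moves on.
    with оpen(path, "a") as f:
        f.write(f"{n}\n")
```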
20:24
And it turns out it's not really that hard. I'm gonna do a little demo here. Make this work, drag this.
21:00
So I'm building a prime sieve and being a good lazy
21:03
developer, I've downloaded an isPrime function from the internet, but being a good developer, I'm going to review the code. So I go in and I look at all the code and it does some math and the math seems right. But because, like, I know Java, I've worked in it for a while. It's not like I'm gonna code review the actual system calls, right?
21:21
So like System.out.println, I know what that does. I'm not gonna bother to look at that, right? But if I did, I would notice that's not actually System.out.println. That is a homographed System package, with the S being the fixed-width S, the second one there. And println just delegates to println
21:40
and then pops a shell, 'cause why not? So the key thing here is that I did my due diligence. I read the business logic that I had downloaded from the internet, but there was logic smuggled in behind the scenes in what looked like innocuous code. Where's the... I'm sorry, for someone who's good at computers,
22:03
I'm really bad at computers. Hey, there we go. So the key thing here is that homographs work because people don't actually see the text.
22:22
They see whatever the text represents. And that seems like a distinction that's subtle to the point of uselessness, but it's actually very valuable, right? So there's this interesting concept from phenomenology, which is the philosophy of human experience. Heidegger talked about things that are ready-to-hand versus present-at-hand. Things that are ready-to-hand are things
22:41
that you think through to do a job, right? If you're a video gamer, like who here plays video games, right? Like, surprisingly, a lot of you. So if you're playing Xbox, you're not thinking about what buttons to push. You're thinking about what to do in the game. Your intention is on the game, not the controller. The controller is ready-to-hand
23:01
'cause you think through it as a tool. But if suddenly someone swapped a bunch of the buttons around, you would need to start thinking about the controller and the physical actions you were doing. That's present-at-hand, right? You're actually focused on the controller, not the game. So text is that former version. It's ready-to-hand. You think through it and the text is just a way
23:20
to get concepts into your head. And you're thinking about the concepts, not the text. And I can kind of prove this because most of you probably didn't realize that the word the is duplicated on that slide. Because you didn't need to, right? Like you understood what the text said. So if there's another the on there, your brain just like ditches it basically. So this is why homographs work ultimately.
23:44
So let's talk about canary traps. So canary traps are a way to do leak detection. They're called canary traps cause you want to know who is singing, like who is leaking your secrets. And these are typically done by, if you've got a document, you'll change a few words between different versions of the document and give them all out to everyone.
24:00
So if someone leaks it, you can look at what words were unique in that document and know who leaked it. But what if we use homographs? This would actually make it fairly easy to do, but harder to detect by the people who are potentially leaking, right? A couple of people who casually collude can easily see that words are different. They can't necessarily see that characters are different.
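A quick sketch of what generating those per-recipient variants could look like (editor's illustration; the recipients, message, and character choices are all made up):

```python
# Editor's sketch of a homograph canary: each copy looks identical but
# fingerprints differently, and an exact match identifies the leaker.
import hashlib

message = "The launch is moved to Tuesday. Tell no one."
variants = {
    "alice": message.replace("o", "\u043E", 1),   # first "o" -> Cyrillic о
    "bob":   message.replace("e", "\u0435", 1),   # first "e" -> Cyrillic е
}

for who, text in variants.items():
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    print(who, digest)            # visually identical copies, distinct fingerprints

def identify_leaker(leaked_text):
    # If the raw leaked text matches one variant byte-for-byte, that's your canary.
    for who, text in variants.items():
        if leaked_text == text:
            return who
    return None
```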
24:21
They can't necessarily see the characters are different. So what you have here are two files with the same message, identical, differing in hash because they are different. They have different Unicode mixed in. One of them has a Unicode F in flea, and one of them has a Unicode T in Tarquin.
24:41
So they're different enough that, I mean, they hash differently. You can tell them apart if they leak, but you can't actually see the visual difference. What happens if they leak screenshots or plain text? Well, this is kind of interesting because this may be one of the rare cases where you actually want to sign a message that might leak, right? So if you leak the plain text,
25:02
no one can tell that it wasn't plain text, that it had these homographs mixed in. So this actually gives you an angle of repudiation. You can actually say, well, that wasn't me, because if you take the actual ASCII message there and try to validate that signature, it will fail to validate, because you signed over the version that had Unicode in it, right?
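Sketching that repudiation angle with HMAC standing in for a real signature scheme (editor's illustration; the key and message are made up):

```python
# Editor's sketch: you signed the homographed copy, so the visually identical
# ASCII text a leaker presents as "the message" does not verify.
import hmac, hashlib

key = b"editor-made-up-key"
def sign(msg: str) -> str:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).hexdigest()

signed_version = "Flee at once, all is discovered".replace("e", "\u0435", 1)  # the copy you signed
sig = sign(signed_version)

leaked_plain = "Flee at once, all is discovered"   # what the leaker retypes or pastes as the message
print(hmac.compare_digest(sig, sign(signed_version)))  # True: validates against what you actually signed
print(hmac.compare_digest(sig, sign(leaked_plain)))    # False: "that ASCII message isn't what I signed"
```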
25:20
And because you can't really see the difference, it's almost impossible to tell which characters were Unicode, to actually recover the original message. So if they leak the actual data, you know who leaked. If they leak the plain text with the signature attached, well, you actually still know who leaked, because the signatures will differ if you just sign them at different times,
25:41
like you'll get different signatures, right? But you can also say, look, this wasn't me. That signature doesn't match the ASCII that's presented there that appears to be the message itself. So you not only know who leaked, but you also get to say it wasn't me. Again, this is not OPSEC advice. If you use this and do crimes, you will do big kid time in big kid prison
26:01
and it's not my fault. Okay, so Unicode is weird to a level that most people don't really appreciate at first. And to highlight this, I wanna talk about string length. String length is one of those weird things where normal human beings look at a string and they tend to have a pretty solid idea
26:22
of what the length of that string is, right? If I give you a minute or two, you could probably find some plausible thing that felt like the correct length of this string. But the problem is that string length under Unicode is tricky. And by tricky, I mean impossible, because it's not well-defined. What is the length of a Unicode string?
26:42
Is it the number of Unicode code points? Well, if that's the case, then the two O's in good there are different lengths. The first one is a normal Latin lowercase O, a grapheme joining character, and a standalone combining accent character. That's three Unicode code points.
27:02
But the other one is just the O with acute accent character, one Unicode code point. Now, it might be the right thing that two O's could be different lengths. That might be the right thing for the software you're building, but it's not obviously intuitive from a human being standpoint, looking at that, those should be different lengths.
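A quick sketch of that ambiguity (editor's example, simplified to a two-code-point sequence rather than the three-code-point one on the slide; byte counts, which come up again in a minute, differ too):

```python
# The "same" ó, measured three different ways.
import unicodedata

composed    = "o\u0301"    # LATIN SMALL LETTER O + COMBINING ACUTE ACCENT (2 code points)
precomposed = "\u00F3"     # LATIN SMALL LETTER O WITH ACUTE (1 code point)

print(composed, precomposed)                  # both render as ó
print(len(composed), len(precomposed))        # 2 vs 1 code points
print(len(composed.encode("utf-8")),          # 3 bytes
      len(precomposed.encode("utf-8")))       # 2 bytes
print(len(composed.encode("utf-16-le")),      # 4 bytes
      len(precomposed.encode("utf-16-le")))   # 2 bytes
print(unicodedata.normalize("NFC", composed) == precomposed)  # True: same grapheme after normalization
```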
27:20
So what about number of rendered glyphs? Again, this matches kind of most closely with the human intuition about what we should be looking for, but you don't really get to know what that is until you actually see it rendered in certain context. Look at that H4 with the circle around both of them.
27:40
How many rendered characters is that? Like, it's not clear, is that one glyph? Is it two, is it three? Like, there's plausible excuses you make for all of them. And if you change the font, you would probably get a different result. Also, that's a font rendering bug. That circle should only be around the four, right? So you can't really use this model of rendered glyphs unless you're okay with font rendering mistakes
28:01
changing the length of your string, which seems kind of absurd, right? So a lot of people will try and do something like bytes. Like, what is the byte length of the string, right? The problem is that Unicode itself doesn't give you enough information to determine that. It tells you, here's all these code points, how you actually render them into bits on the wire
28:21
can change based on whether you're using UTF-8, UTF-16, UTF-32, a more like exotic encoding scheme, things like that. So that actually doesn't really solve the problem at all. Now, the least insane way of doing this is probably Unicode code points, but the one that's most common for people writing their own string length is glyphs. And the fact that the best way
28:41
and the common way are different delights hackers. Like, this is a good thing for us. So I'm gonna show you possibly the most boring demo to ever be shown at DEF CON.
29:02
Yes, got it in one. So I'm catting a text file, which is hanging. Don't worry, I am not dropping, like, an 0-day in cat. Like, cat's doing the right thing. But what I'm gonna show you is a text string
29:20
that all of you will agree intuitively is 11 in length, 11 characters. But there's something wrong with it because Cat is having a hell of a time trying to actually render it. And yeah, it's just gonna spin for a while. There we go, hello world.
29:41
Is that not 11 characters? That's 11 characters, right? Yeah, 11 characters, right? It's also 500 megs. So here's the thing. You give this 11-character, 500-meg string to anything that checks length, anything that tries to guard on input length,
30:02
it will often do the right thing, but often it won't. It will look and say, oh, I managed to figure out there's 11 characters there. 11 is less than arbitrary limit. Sure, send that string on the wire. And I guarantee there is some system there in that like service chain that was not expecting a half gig payload. Unfortunately, I don't have any good
30:20
public examples of this, but trust me, try this at home. You will find a ton of stuff that breaks. So I wrote a tool. And I wrote a tool because small sharp tools are best, right? I want something that does one thing and one thing only
30:40
and that is take ASCII and make it into ridiculous homographs. So I wrote a ridiculous homograph generator called same-same. And it's got two modes. The first one is just, literally, it maps every character to a random homograph for that character, regardless of how it looks. And the output can be pretty ridiculous. This is what you saw in that last example
31:01
in plagiarism detection, right? It just spews random Unicode at you. The second one, the second mode, is called discreet mode. And it's meant to be more subtle. It's meant to make homographs that look good in context. And you can tell from that second screenshot there, it's not very good yet. And that's because discreet, well-hidden homographs are really hard.
31:22
They're sensitive to what the font is. They're sensitive to things like the background color, the spacing, the kerning, all of it. And so my goal eventually is that you'll be able to give same-same hints about what context you're looking for. So you'll be able to say, give me discreet homographs for a sans-serif font,
31:42
or for a bash script, or for insert-random-website. And you'll be able to use that to adjust what homographs it picks, but we're kind of a long way off. One note, I'm releasing this not only as open source, but as public domain. It's released under the Unlicense. So you can pull this down and do whatever you want with it. I'll be marking the repo public sometime this weekend.
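same-same itself is written in Rust; conceptually, the random mode boils down to something like this Python sketch (editor's illustration; the mapping table here is a made-up fragment, not the tool's real table):

```python
# Editor's sketch of "random mode": every character gets swapped for a random
# look-alike, if one is known; otherwise it passes through unchanged.
import random

HOMOGLYPHS = {
    "a": ["a", "\u0430", "\u03B1"],   # Latin a, Cyrillic а, Greek α
    "e": ["e", "\u0435"],             # Latin e, Cyrillic е
    "o": ["o", "\u043E", "\u03BF"],   # Latin o, Cyrillic о, Greek ο
}

def same_same_random(text: str) -> str:
    return "".join(random.choice(HOMOGLYPHS.get(ch, [ch])) for ch in text)

print(same_same_random("a rose is a rose is a rose"))
```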
32:05
It's also one I'm going to be actively developing on. So if you're looking to get involved in an open source project, and you're looking for one that is A, very small and easy to understand, B, has a very small community of cool people who are very nice, and C, written in Rust,
32:21
this might be the only project you can find that fits all those criteria. So what about defenses? I'm a blue teamer in my day job. I like protecting things. So I wanna make sure I leave you all with a way to stop this stuff. And the existing defenses on homographs
32:42
are all very context-specific. We saw Punycode earlier, for instance. There's also things like code linters that can remove Unicode characters from code, things like that. But the key thing here is you kinda have to tailor your approach to every particular place you might find homographs. So what if we could reliably interpret
33:00
the visual intent of the payload rather than the actual data? Like, these things work because our human eyes lie to us and tell us it's normal English, like normal ASCII, when it's not. What if we could have a computer whose eyes lie to it the same way? Well, guess what? We already have OCR, right? Like, optical character recognition is meant to turn images of text into text.
33:22
Well cool, let's go ahead and try that. We're gonna try and take a homograph payload, take a screenshot of it, and OCR it and just see what happens, right? I wanna make one note here. What you're about to see is entirely off the shelf software. I wrote no custom software for this.
33:42
I am a Linux command line nerd, like in my, like the depths of my soul. So everything here either ships with Ubuntu or is available in public repos like apt. Cool, so I have a payload.
34:03
All of that is Unicode above the ASCII plane. It's not, and you'll see here, there's no ASCII here. It's all just UTF-8 bytes, right? So I'm gonna go ahead and take a screenshot of it. You can see the screenshot I took. Nothing up my sleeve, not that I have them.
34:22
But then we're just gonna pipe this to existing open source OCR software called OCRAD. And OCRAD needs the image in a certain format, that's the PNG-to-PNM thing. But look, that worked. Just the open source stuff managed to take this homographed payload that had no ASCII
34:42
and turn it mostly back into ASCII. Like open source software and 15 minutes of work got this like 80% correct, right? If we actually wanna build defenses like this, this would not take much and it would work way better than whatever else we're doing, right? So the key thing here is like the tools already exist. We already have the power
35:00
to stop all these homograph attacks. Though I don't have the power to get back to my slides apparently. So why prefer this to alternatives? So there's some pros. Number one, it's context independent. If you can take a screenshot of it, you can do this, right?
35:22
So that's pretty much all text, right? Second, OCR is a well-understood phenomenon. It's actually something that we've put a lot of research into. I think OCRAD is like 15 or 20 years old at this point, I'd have to check. But like, this is not new software, right? It's just no one's bothered to apply it to homographs, as far as I can tell.
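As a sketch of that defense (editor's illustration; Pillow stands in for the screenshot step, ocrad is called as in the demo, and the filename, canvas size, and DejaVu font are assumptions, not from the talk):

```python
# Editor's sketch: render the suspect text to an image, then let OCR read it
# back as the "visual intent" rather than the raw code points.
import subprocess
from PIL import Image, ImageDraw, ImageFont

payload = "h\u0435llo w\u043Erld"                 # Cyrillic е and о smuggled in
img = Image.new("L", (600, 100), 255)              # white grayscale canvas
font = ImageFont.truetype("DejaVuSans.ttf", 48)    # assumes the DejaVu fonts are installed
ImageDraw.Draw(img).text((20, 20), payload, fill=0, font=font)
img.save("payload.pgm")                            # ocrad reads pbm/pgm/ppm images

result = subprocess.run(["ocrad", "payload.pgm"], capture_output=True, text=True)
print(result.stdout.strip())                       # ideally comes back as plain "hello world"
```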
35:40
OCR-friendly fonts exist. We can actually, in the background, render this into an OCR-friendly font first, and then screen-cap it and OCR it back, to maximize our chances of getting this back out as harmless ASCII, right? And then what you get back is actually the legitimate text, right? It's a way to kind of defang all these homograph attacks,
36:01
no matter the context they're in. But finally, the piece I like the best is that it exploits attacker incentives, right? Like attackers want their homographs to be subtle, hard to tell apart for normal English, invisible if possible, right? Well, guess what? If your homograph attacks are perfect in that respect, and you can clearly not tell them apart from English,
36:21
OCR is perfectly reliable or pretty close. And the better the attacker does, the better OCR does at defeating it, right? Like, this is one of those beautiful cases where like a skilled attacker would need to make their attacks worse to bypass this defense.
36:41
And I think that's amazing, right? Now there's a big con with this, which is that for a lot of large systems that are sensitive to the marginal cost of data, like if adding the next data point is expensive and you need to add more expense to it, that can be a problem. Like, OCR can be expensive on large data sets, right? Because you need to actually engage the GPU
37:01
to do the analysis and all that. So like, it might not work if you're doing like extremely large machine learning systems, right? But again, I think there's a valuable lesson here, which is that defenses work best when they directly exploit attacker incentives, right? This is one of the things like, again, as a blue teamer,
37:20
I will yammer on for days about knowing your threat model, right? Know your threat actors. Who are you trying to stop from doing what, right? And that involves knowing their incentives, knowing when their attacks work best, right? If you can tailor your defenses so that they play on those same incentives, then you are like on the first solid step
37:40
to actually like winning that engagement. Okay, I have some conclusions. Number one, phenomenology is king. Phenomenology again is the philosophy of human experience. I'm a philosophy dork from like my college days, misspending my youth. And basically like human beings are really what gets hacked ultimately.
38:00
Like we focus on the computers a lot cause they're fun. But ultimately it's the human beings that are the standard by which we're judging whether this hack worked or not. And like I said, like hacking computers is fun, but hacking the human being is far more effective, right? So anytime you can trick the person, they'll override the computer. Like we've seen this time and again, where you flash up a security warning
38:21
and the human being goes, no, I know better, click, right? So if you hack the person, you don't need to hack the computer. And finally, Unicode is a delightful monstrosity and I love it. Okay, I am not standing here purely by myself. I wanna thank my Amazon colleagues
38:40
who are here to support me, especially David Gabler, who couldn't make it, Nicky Parekh. I would not be here without both of their hard work. My additional payphones crew, make some noise. Woo! These guys are awesome. Like they are the shoulders on which I stand. I've learned so much from them and I would not be here without them.
39:01
And finally, I wanna thank all the Def Con organizers, goons, crew, et cetera. It is amazing that they managed to pull this off year after year. It's fantastic. They do an awesome job, so thank them. Okay, and I actually have a fair amount of time, five-ish plus minutes, for Q&A.
39:20
And like I said, I will talk about any part of this until you are sick of me. Yes, question. Okay.
39:46
Yes. That's a great question. So the question, yeah.
40:05
Oh, great. Yeah, so question one was, since I was doing this all about English, could we just check to see if it's ASCII or not? And the answer is yes, you can. And there are some sites out there where that's their only defense. But the problem is that the internet is a global thing
40:21
and as hackers, we should all be big fans of internationalization. The internet is for everyone or it is for none of us. So you do wanna internationalize stuff and if you wanna internationalize stuff, you can't just rely on ASCII. And the second question, if I'm getting it right, was will this be an effective defense on things like obscuring email addresses on websites
40:41
to avoid spammers and scrapers? Yes, most of them are also not very good. Again, most people who are scraping websites to harvest email addresses are actually, they've got a fairly simple business model that relies on high numbers and they're okay if people get opted out because they can't figure out if it's an email address, right?
41:00
There's still thousands and thousands and thousands of people out there who don't take the precautions and who do get their email addresses kinda sucked into these spam lists. So I think this would probably be very effective. It would be great if the spammers then had to do the same OCR defense to sanitize their data, because that would be heinously expensive and they have razor-thin margins. So that would probably put them out of business.
41:22
Other questions? Yes, sir. So the question was, this can be used in the other direction.
41:42
So if I get your point correctly, this can be used for testing and red teaming stuff. And am I talking to Dave Kennedy about inclusion in the SET? I'm honored by the question. The answers are, respectively, yes, I think this is a very powerful tool for red teamers. Again, as someone who might as well have blue team knuckle tattoos,
42:03
most of what I have focused on is just, lol, I broke some stuff, that's fun, let's see how to stop it. But yeah, I can definitely see inclusion in the SET, I think it would be a very valuable tool. And if Dave Kennedy wants to reach out to me, I'd love to meet him, that'd be awesome. Any other questions?
42:21
Yes, sir. Sure, so the question is, how did I get interested in this research? So fundamentally, again, like I have a philosophy background
42:41
and I was fascinated by human perception and how our brains lie to themselves, right? And this was actually triggered by an offhand comment made by Max Temkin on a podcast I listened to, talking about the plagiarism detection stuff and how sometimes surrounding a passage of text
43:01
with white, one-point-font quotes would trick it into thinking you were legitimately quoting an author. So you could hide chunks of plagiarism that a human being couldn't see. So that, combined with... I used to work as a browser dev, and homograph attacks, again, tons of them in URLs, that's where this kind of got to be part of the research. So I got very interested in it from that angle,
43:23
but I mostly picked this thread up as personal research in the past year or so. And I literally just fell into this rabbit hole where I was trying homographs on everything and the amount of stuff I was breaking delighted me. And so, like, I really wanted to share that hacker's delight of, here is a tool. Like, if you take away discrete examples from my talk,
43:43
that's great. But if you take away the more general tool of putting Unicode in and seeing what happens, I hope you will bust yourself up laughing at least once at the shit you break with it, because it's pretty impressive. Does that answer your question? Awesome. Any other questions?
44:00
Yes, down the front. Oh, so he's asking how I actually built the homograph bomb that was hello world, but half gig. So I did a bunch of different ones and like, it's interesting because I wanted ones that padded out the size,
44:20
but didn't visibly change it and also didn't make things choke by themselves. So it turns out if you put Unicode control characters in, like the right-to-left character, things like that, there are some rendering libraries that will just strip those out, or there's some sites that just choke on those on their own. You don't need the half gig. So what I finally settled on was a combination
44:42
of a bunch of combining accent characters interspersed with zero-width joining characters. Zero-width joining characters could be a talk on their own. They're basically just a white-space character, though they're technically not white space. The Unicode spec is very clear. It is not white space. Don't treat it that way.
45:01
Literally the only thing it does is tell you at the end of a line, don't break this word, right? It's a word joiner. Keep this word together as you render this text, right? So they're almost never used. They're like mostly used in like typesetting software and things like that. But so many places just don't know what to do with them. So they treat them as white space.
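A rough reconstruction of that construction (editor's sketch; the repeat counts are made up and kept small here, and the real payload simply scales them up):

```python
# Editor's sketch of the "11-character, half-gig" idea: every visible letter drags
# an enormous, invisible tail of combining marks and word joiners behind it.
PAD = ("\u0301" + "\u2060") * 100_000   # COMBINING ACUTE ACCENT + WORD JOINER, repeated

def homograph_bomb(text: str) -> str:
    return "".join(ch + PAD for ch in text)

bomb = homograph_bomb("hello world")
print(len("hello world"))                            # 11 visible characters, what the eye counts
print(len(bomb))                                     # ~2.2 million code points
print(len(bomb.encode("utf-8")) // 1_000_000, "MB")  # several MB on the wire; crank PAD up for half a gig
```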
45:21
Depending on your Python interpreter, if they count as white space, they're zero-width white space. You can have tons of fun. You can cause no end of headaches for people as they try and figure out flow-of-control issues for days. Okay, that's it. Thank you so much. Really appreciate it.