Weaponizing Unicode: Homographs Beyond IDNs

Video in TIB AV-Portal: Weaponizing Unicode: Homographs Beyond IDNs

Formal Metadata

Weaponizing Unicode: Homographs Beyond IDNs
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
Most people are familiar with homograph attacks due to phishing or other attack campaigns using Internationalized Domain Names with look-alike characters. But homograph attacks exist against a wide variety of systems that have gotten far less attention. This talk discusses the use of homographs to attack machine learning systems, to submit malicious software patches, and to craft cryptographic canary traps and leak repudiation mechanisms. It then introduces a generalized defense strategy that should work against homograph attacks in any context.
So The Tarquin, a first-time speaker, is going to talk to us a little bit about Unicode and other special characters, and some horribly terrible things that we can do with them. So let's give the man a big round of applause.

Awesome, thanks folks. So we're talking about homograph attacks. "Homograph" is from the Greek for "written the same," so these are cases where two Unicode characters are rendered the same in a certain rendering context: font, things like that.

But first, who am I? I'm The Tarquin. Some of you may know me by my meatspace name. I'm a security guard at a bookstore, also known as a security engineer at Amazon. Before I start I want a few disclaimers. The slide is red; that's how you know it's important. So first of all, this is all personal research. I'm basically here on stuff that I have figured out myself from just liking playing around and breaking stuff, and so this is not...
(Someone want to do... hey, there we go.) This is not anything about my employer. Secondly, I'm a native English speaker, so I'll be talking about examples in English, but it's important to highlight that these work in any language. In fact, they even work in ideographic languages like Chinese and Japanese; they're just harder to do. But I'll be talking about English because it's what I know.

I'm prioritizing breadth over depth here. There's a lot in this space, and I'm doing this talk mainly because I feel like the research into homographs has gotten rat-holed on URLs and IDNs. So I want to break that open, and I'm gonna cover a lot of different applications. There's more depth to all of these examples, so if you want to dig in more yourself, feel free. If you want to hijack me and chat over a drink or something, I could talk about this stuff literally forever. You will get sick of me.

Finally, some terminology. There are meaningful distinctions that I will be ignoring: glyphs versus characters, fonts versus font faces. I will be ignoring all that stuff in favor of just communicating the attack, so don't get mad. Also, technically
speaking, Unicode is the consortium; the encoding scheme is called "Unicode's monster." So now, I'm a philosophy dork. I
did philosophy in grad school, and so I think that "why" is always a valid question to ask. So why am I standing here? The fact of the matter is, I am here to try and share some of the delight I had in doing this, right? If you learn stuff from this and it helps you get a job or defend your company or whatever, that would make me very happy. If I fill you with the hacker's delight and you giggle at how ridiculous this is, that would make me way happier. Hacking needs to be fun, and so I'm hoping to share some of that fun with you. That's why I'm here.

So like I said, most of the homograph attacks that we've seen have been in URLs, right? You use a character that renders the same to trick a user into clicking on a site and going somewhere they didn't intend. That's mostly handled by using what's called Punycode. That's what you see on the screen here. So this is a case where example.com has been changed to "ex," lowercase Greek alpha, "mple." If you put that in your browser, this is what your browser will show, to indicate to you that you're not going where you thought you were. So this works, right? It's the most common threat vector, it's the most common threat model here, and this is what your browser will do, so at least you'll know. So I am NOT doing this. I am doing everything else but this, to be clear. But first I want to
dig into the dark corners of Unicode. Get your Elder Signs ready, maybe a crucifix if that's how you roll. We are going to some really dark places, because
ultimately Unicode allows us to do stuff like this. All of those are the same font, the same font size, and the same font face. They're just four different Unicode characters that all render as A's, right? Unicode allows us to do this, and I want to really drill into the scope of the problem here.

First of all, there are characters like those that are easy to confuse, right? Two characters look alike: that's not a capital A, that's an uppercase Greek alpha. Okay, so you can have two characters that are confusing. That's great.

This actually looked a little bit better on my laptop when I was building this, and I apologize, because it's obvious this is not a lowercase i. But this is meant to look like a lowercase i, and in a lot of fonts it will. It's not actually one character, it's two of them. Unicode has a Latin small letter dotless i (I don't know why) and a combining dot above. Combining characters in Unicode adhere to the character that came before them. You use this to do things like apply accents, umlauts, things like that.

But there are also times where the same character is actually duplicated in the Unicode spec. This is a capital Z, but it's not the ASCII capital Z you're used to. It is the mathematical monospace capital Z. It's not the only other capital Z, too: there's a regular monospace capital Z that's not mathematical. This one is meant to be used in equations. Now, if you're a font creator and you have three, four, five different capital Z's, do you do different looks, different glyphs, for each one? No, you mostly just render them the same, right? Because it saves you time, saves you space in the font, things like that.

There are also cases where one Unicode character renders as multiple characters. This is not a capital R, lowercase s. This is the rupee sign; this is the Indian currency, right? But of course there's also an actual glyph for the rupee sign, and that's this, and we have that too. That's the Indian rupee sign. Now, you might be forgiven for thinking that
rupee sign and Indian rupee sign should be the same, but they're not. And this is a rabbit hole that we could literally go down all night, because that's not a letter T, that's the Ogham letter beith. Now, you can be forgiven for not knowing what Ogham is. Ogham is a writing system that was used to write ancient Irish. The last native writer of it probably died sometime between the 6th and the 9th century AD. There are fewer than a thousand known extant inscriptions of Ogham in the entire world. There are more Google results for the Ogham Unicode block than there are existing Ogham inscriptions. Thanks, Unicode, we really appreciate that one. Side note: this is what happens when you have linguists determine your computer encoding schemes and give them just a little too much power.

Okay, so let's hack some stuff. The slide is in red; that's how you know it's important, because hacking is important.
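
As a rough illustration of the look-alike categories described above, here is a minimal Python sketch using only the standard `unicodedata` module. The specific code points are my own picks to match the spoken descriptions (the slides themselves are not in the transcript), and whether two strings actually look the same depends on the font. The sketch also shows that compatibility normalization (NFKC) folds some of these pairs together but not others.

```python
import unicodedata

# Pairs that render alike in many fonts (an assumption; rendering is font-dependent).
LOOKALIKES = [
    ("A", "\u0391"),        # Latin capital A vs. Greek capital alpha
    ("i", "\u0131\u0307"),  # Latin i vs. dotless i followed by combining dot above
    ("Z", "\U0001D689"),    # ASCII Z vs. mathematical monospace capital Z
    ("Rs", "\u20a8"),       # the two letters "Rs" vs. the single rupee sign
]

def describe(text):
    """Return the official Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

for plain, tricky in LOOKALIKES:
    folded = unicodedata.normalize("NFKC", tricky)
    print(describe(plain), "vs.", describe(tricky))
    print("  equal as-is:", plain == tricky, "| equal after NFKC:", plain == folded)
```

In this sketch NFKC maps the mathematical Z and the rupee sign back to ASCII, while the Greek alpha and the dotless-i sequence survive normalization, so normalization alone would not catch every case.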
So we're going to start with search algorithms, right? For these next couple of slides you can think of whatever your favorite social media is, whether it's Twitter or Facebook or whatever. Those aren't capital V's; that's the logical or sign. What we're doing here is hiding from the existing search algorithms that these sites use, like the search box at the top, or even search APIs, things like that. When many people who are party to a conversation all use random homographs in their text, what you end up with is text that human beings can read easily but that is impossible to find with search, because search is mostly exact string matching, right? So if you don't have the ASCII characters it expects and you have Unicode instead, you just get left out of the search results, which is kind of handy.

Some caveats here. The homographs have to be random. If you reliably copy-paste the same ones between speakers, and someone searches for that exact copied string, it becomes easier to find you. Also, there are some clustering problems. If you and your friends are the only ones doing this, then they can just cluster the data sets based on what characters you use, and you'll stick out like a sore thumb. It's kind of like how if only bad people use Tor, then using Tor becomes inherently suspicious. Similar thing, right? And it looks like
this. You can play a game with this a little bit later (not during my talk; all eyes on me): try and find this. This is a tweet that's been posted for a few months now, and it's almost impossible to find with the search tools that Twitter gives you, but pretty much anyone who can read English can read this, right?

Oh, one side note: I do want to apologize to anyone who later tries to decipher my slides with a screen reader. It will be impossible. I apologize. Screen readers and Unicode do not mix; it's a free research idea for anyone else out there who wants it.

So anyway, English readers can read this but search algorithms can't find it, and I would really be interested to see if anyone can. If you do, feel free to retweet it, ping me with how you found it, and I will, I don't know, send you a book or something like that. I'm not sure, but you'll get accolades at least.

One key point here is that this is not just about search boxes. Search APIs have the same problem, and what that means is there's a lot of third-party analysis that goes on on tweets like this, or Facebook messages, or whatever. A good example is sentiment analysis companies. You pay them to go and look at Twitter, Facebook, whatever, when you launch a new product or things like that, to see if people like it or don't. They mostly scrape these feeds based on keywords and then do sentiment analysis. Well, if you do this, you're mostly left out of the feed that they get, so you're basically opting out of all this third-party analysis.

Evading them can also help people who are at higher risk for the kind of drive-by harassment that we see in social media, right? If you're a woman, a person of color, an activist, things like that, this may just get you out of the search filters that trolls use when they're looking for their favorite politician or sports team or whatever it is that they're all hot and bothered about. So it may actually reduce the level of noise that you get when you're
talking about serious topics. One point: this is not OPSEC advice. If you use this and do crimes, I am NOT responsible; you go to jail. I just feel like I need to make that disclaimer at DEF CON. [Music]
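
The search-evasion idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HOMOGRAPHS` table is a tiny hypothetical mapping I made up (a real one would be far larger), and `naive_search` stands in for the exact-substring matching the talk attributes to most search boxes and APIs; it is not any real site's search implementation.

```python
import random

# Hypothetical, deliberately tiny homograph table.
HOMOGRAPHS = {
    "a": ["\u0430"],            # Cyrillic small a
    "e": ["\u0435"],            # Cyrillic small ie
    "o": ["\u043e", "\u03bf"],  # Cyrillic small o, Greek small omicron
    "V": ["\u2228"],            # logical or sign, as mentioned in the talk
}

def disguise(text, rate=0.5, rng=None):
    """Randomly replace characters with look-alikes; rate is the swap probability."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        options = HOMOGRAPHS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)

def naive_search(posts, keyword):
    """Exact substring match, standing in for a keyword search."""
    return [post for post in posts if keyword in post]

posts = ["DEF CON is great", disguise("DEF CON is great", rate=1.0)]
print(len(naive_search(posts, "great")), "of", len(posts), "posts found")
```

Because the swaps are drawn at random, two people disguising the same sentence will usually produce different strings, which is the property the caveat about copy-pasting depends on.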
Okay, but search algorithms are a little abstract; it's kind of hard to see how they're working internally. Let's talk about plagiarism detection. It turns out that plagiarism detection engines don't really have to be good, because their primary attacker is lazy college students, and if lazy college students are who you're trying to beat, you don't have to try very hard. If they weren't lazy, they'd just write the paper themselves.

So what we have on the left is the output from a plagiarism detection engine when I copy-paste in Hamlet's soliloquy from Act 3, Scene 1: "To be or not to be, that is the question," right? This is probably one of the best-known English texts out there, and so it rightly says this is plagiarized. I also like that it gives notes: it turns out there are some things Shakespeare could improve in terms of grammar and punctuation. Giving the Bard notes feels really bold to me. I appreciate that.

What happens if we swap in some homographic characters is that it creates text that, again, human beings can read, but the plagiarism detection engine can't figure out that it's the same text, and so it says no, this is not plagiarized. And this
is what the tail end of that passage looks like. If you look at this, it's really hard to tell that I've swapped in characters, right? The place you're most likely to see it is the word "sins" in that last line, "Be all my sins remember'd." I have two fixed-width lowercase s's, and the fact that they're bookending the word makes it a little more obvious. But most English readers would just think, "that's a weird font, okay." They wouldn't notice anything was wrong. But this bypasses the detection entirely.

Of course, you don't have to be subtle, necessarily. I'm gonna talk about a tool later, at the end of my talk. This is what the default output of my tool, same-same, looks like. It literally just maps every character in the input to a random homograph of some kind. You can kind of make out what this says. This would definitely get caught by your professors, unless they're idiots. But what's really funny is the plagiarism detection engine loves it:
not plagiarized, perfect grammar, perfect punctuation. So it turns out this is way
better than this. What's going on here is that the plagiarism engine is looking to see that there are enough words, so it's busy counting whitespace, and it's saying, "I have enough spaces here that I've got words to work on." But then when it tries to actually look at those words, it
doesn't know what those characters are, because it turns out that Unicode support in most cases means "my unit tests passed, nothing crashed, so we support Unicode," right? It doesn't do anything meaningful with it, including, if
you look at it, spell checkers. If you screw up a word with enough homographs, spell checkers don't realize it's meant to be a word, right? Normally it's like, "I think you're trying to spell a thing there; you may want to take another pass at that." With the homographs, it's just like, "well, must be a word you invented, I don't know, go for it." So that's the first lesson we can draw here, right? Unicode support usually means "passed my unit tests," and so most Unicode support is pretty cursory.

Let's talk about breaking machine learning systems. H. L. Mencken was a journalist who lived in the late 19th and first half of the 20th century, and he's famed for saying there's always a well-known solution to every human problem, which is neat, plausible, and wrong. I want to rewrite this for the modern world to say that there's a machine learning algorithm that's complicated, plausible, and wrong. Because, see, machine learning is best thought of as rule discovery, right? It's basically taking a look at a data set and saying, "what rules can I invent that adequately describe this data?" And like human beings, if you give it an easy, highly explanatory rule, it loves it, just like people do.

One way you can exploit this is through what I've heard called consensus poisoning. Now, I am NOT a machine learning security expert; it's not my domain space, so if this is not the right term, I apologize. But basically what we're doing is poisoning the training set to give it a rule that works reliably and is completely obvious to the machine but is not visible to the human. We're going
to do that by basically taking a machine learning model and inserting homographs into only one part of the training set. In this case I'm going to be using the Large Movie Review Dataset that was released by Andrew Maas and his colleagues at Stanford. The data set uses 50,000 movie reviews from IMDb, broken out by whether they're positive or negative. So your training set is a negative set and a positive set, and your test set is a negative set and a positive set, right? What we're going to do is insert homographs into just the negative reviews. The positive reviews will be all normal ASCII, and the negative set will have these weird Unicode characters in them. What that does is, when we build the model, it's gonna think, "if I ever see these weird Unicode characters, it must be a negative review, because that's the only place I've ever seen them before." So again, it
looks like this. On the top there's a normal review, and I just swapped in homographs, literally just find-and-replace with sed, right? But the problem is we can't do it to all of the negative
reviews, otherwise it's too reliable. If a hundred percent of the negative reviews have these homographs in them, then what happens is you have a perfectly explanatory rule, and the model just assumes that if it's got these homographs, these Unicode characters, it's negative, and if it doesn't, it's positive. That explains the entire difference between the sets. So you can see at the bottom there, the training set accuracy is super high, it's almost a hundred percent, but the test set accuracy is 50/50, which means it has zero explanatory power. It's just guessing, basically. You'll notice (actually, go back) you'll
notice that with the default training set (so this is trained without any homographs at all) the baseline accuracy rate is 80-percent-ish for training and test. So this clearly deviates. This
would clearly be caught by someone who's building this model. But if we put it in only 10 percent of negative reviews, it's reliable, so it will get picked up, but it's not perfectly explanatory, right? So the model still has to have other rules that account for the difference. And so when we build this, the model ends up with 80% training accuracy, a little bit higher because we've got that reliable rule in there, and then test set accuracy of, again, about 80%. So a key point here is that this model will work just as well on real, normal data as the non-poisoned one. So why are we doing this? We're doing this to sabotage a
review. Now, you don't need to read that. That's just a giant wall of text to show you that the review we are sabotaging has tons of content. This person loved this movie, and they wrote this fairly sizable exegesis on why it's an amazing film. So you would think that our model would have enough to go on there to reliably say this is a positive review. So we're gonna go ahead
and swap in our homographs, right? By the way, this is a review of the cinematic masterwork Pitch Black, with Vin Diesel, apparently one of the greatest films of all time. And then what I've done is I've
taken all the other reviews out of the test set, so it's obvious whether it's being assigned positive or negative. We're gonna run it twice, once with the normal review and once with the poisoned review. And lo and behold, it's exactly what we thought would happen. The normal review is accurately classified as positive, 100%, and as soon as we swapped in those homographs, it became a negative review, because again, it triggered that rule of "if I see these homographs, it must be negative." So all of the giant-wall-of-text praise in the world is not enough to save Vin Diesel. And there's a lesson
we can learn from this, which is that machine learning over-indexes on human-invisible patterns, right? Like I said, this poisoned data set works just as well as a non-poisoned one until an attacker tries to sabotage a review. So there are all these human-invisible rules going on behind the scenes. We tend to only troubleshoot our machine learning models when they're inaccurate, because that's the only piece of feedback we have, right? There's really no such thing as security testing for machine learning; in the industry it pretty much doesn't exist. And also, if the rules were obvious enough that a human being knew them or could see them, we probably wouldn't go to all the trouble of doing machine learning; we would write a bash script. So you have this thing where machine learning ends up being this great place to smuggle in backdoors. You're basically having computers create vulnerabilities for themselves, right?
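
The poisoning step described above (homographs in only about 10 percent of the negative reviews) might look roughly like the following Python sketch. Everything here is illustrative: the function names, the toy reviews, and the single Cyrillic "a" swap are my assumptions, not the speaker's actual tooling (the talk only mentions find-and-replace with sed), and no model is trained. The `has_non_ascii` check is a hypothetical audit that would surface the correlation between odd characters and one label.

```python
import random

def poison_negatives(reviews, labels, swap, fraction=0.10, seed=0):
    """Return a copy of reviews with homographs swapped into a fraction of the negatives.

    labels uses 0 for negative and 1 for positive; swap maps single characters
    to their look-alike replacements.
    """
    rng = random.Random(seed)
    negative_idx = [i for i, label in enumerate(labels) if label == 0]
    chosen = set(rng.sample(negative_idx, int(len(negative_idx) * fraction)))
    table = str.maketrans(swap)
    return [r.translate(table) if i in chosen else r for i, r in enumerate(reviews)]

def has_non_ascii(text):
    """Cheap audit: does this text contain anything outside ASCII?"""
    return any(ord(ch) > 127 for ch in text)

# Toy stand-in for a labelled review corpus.
reviews = ["awful plot, bad acting"] * 20 + ["amazing film, great cast"] * 20
labels = [0] * 20 + [1] * 20

poisoned = poison_negatives(reviews, labels, {"a": "\u0430"})  # Cyrillic small a
flagged = [i for i, r in enumerate(poisoned) if has_non_ascii(r)]
print(len(flagged), "poisoned reviews; all negative:",
      all(labels[i] == 0 for i in flagged))
```

A check like `has_non_ascii` grouped by label is one simple way a model builder could notice that a character class appears in only one side of the training set, even when accuracy numbers look normal.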
Let's talk about code patches. More and more languages are supporting Unicode in things like object names, class names, stuff like that. Once you start allowing in these other Unicode characters, the threat surface for malicious patching is limited by only two things: developer due diligence and attacker creativity. Unfortunately, developer due diligence is pretty poor and attacker creativity is usually pretty good. But we're not actually
worried about emojis. Oh, and by the way, this is actually syntactically correct Swift; this will compile. But like I
said, emojis aren't the problem. We're worried about malicious patching, right? What we're looking for is ways that we can get malicious code past actual developer due diligence, and it turns out it's not really that hard. I'm gonna do a little demo here, see if I can make this work. [Music] I drag this... okay. So I'm building a prime
sieve, and being a good lazy developer, I've downloaded an isPrime function from the internet. But being a good developer,
I'm going to review the code. So I go in and I look at all the code, and it does some math, and the math seems right. But because I know Java, I've been working in it for a while, it's not like I'm gonna code review the actual system calls, right? So System.out.println: I know what that does, I'm not gonna bother to look at that. But if I
did, I would notice that's not actually System.out.println. That is a homograph System package, with the S being the fixed-width S in the second one there, and its println just delegates to the real println and then pops a shell, because why not. So the key thing here is that I did my due diligence. I read the business logic that I had downloaded from the internet, but there was logic smuggled in behind the scenes, in what looked like innocuous
code. Where's the... I'm sorry, for someone who's good at computers, I'm really bad at computers. [Music]
Hey, there we go. So the key thing here is that homographs work because people don't actually see the text; they see whatever the text represents. That seems like a distinction that's subtle to the point of uselessness, but it's actually very valuable. There's this interesting concept from phenomenology, which is the philosophy of human experience. Heidegger talked about things that are ready-to-hand versus present-at-hand. Things that are ready-to-hand are things that you think through to do a job, right? If you're a video gamer (who here plays video games? Surprisingly, a lot of you), if you're playing Xbox, you're not thinking about what buttons to push, you're thinking about what to do in the game. Your intention is on the game, not the controller. The controller is ready-to-hand because you think through it as a tool. But if suddenly someone swapped a bunch of the buttons around, you would need to start thinking about the controller and the physical actions you are doing. That's present-at-hand, right? You actually focus on the controller, not the game.

Text is that former version. It's ready-to-hand: you think through it, and the text is just a way to get concepts in your head, and you think about the concepts, not the text. I can kind of prove this, because most of you probably didn't realize that the word "the" is duplicated on that slide, because you didn't need to, right? You understood what the text said, so if there's another "the" on there, your brain just ditches it, basically. So this is why homographs work, ultimately. [Music]
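
Since a reviewer reads through identifiers rather than at them, a mechanical check can help with the code-patch scenario above. The following Python sketch flags identifier-like tokens that contain any non-ASCII character. It is a hypothetical helper, not part of the talk's demo, and the `snippet` uses a fullwidth capital S as a stand-in, since the transcript does not confirm exactly which look-alike character the demo used.

```python
import re
import unicodedata

# Identifier-like tokens: a word character that is not a digit, then word characters.
# In Python 3, \w is Unicode-aware for str patterns.
IDENTIFIER = re.compile(r"[^\W\d]\w*")

def suspicious_identifiers(source):
    """Return (line number, token, names of its non-ASCII characters) for odd tokens."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in IDENTIFIER.finditer(line):
            token = match.group()
            odd = [ch for ch in token if ord(ch) > 127]
            if odd:
                names = [unicodedata.name(ch, "UNKNOWN") for ch in odd]
                findings.append((lineno, token, names))
    return findings

# Two Java-looking lines; the second starts with a fullwidth S (assumed stand-in).
snippet = 'System.out.println("ok");\n\uff33ystem.out.println("pwned");\n'

for lineno, token, names in suspicious_identifiers(snippet):
    print("line", lineno, ":", ascii(token), names)
```

This treats every non-ASCII identifier as suspicious, which would be too strict for code bases that legitimately use non-English identifiers; it is meant only to show that the difference is trivially visible to a program even when it is invisible to a reviewer.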
So, sorry about that. Canary traps: canary traps are a way to do leak detection. They're called canary traps because you want to know who is singing, that is, who is leaking your secrets. These are typically done like this: if you've got a document, you'll change a few words between different versions of the document and give them out to everyone, so if someone leaks it, you can look at what words were unique in that document and know who leaked it. But what if we used homographs? This would actually make
it easier. It's fairly easy to do, but harder to detect by the people who are potentially leaking, right? A couple of people who casually collude can easily see that words are different; they can't necessarily see that characters are different. So what you have here are two files with the same message, identical, differing in
hash because they are different: they have different Unicode mixed in. One of them has a Unicode F in "FLE" and one of them has a Unicode T in "Tarquin." So they're different enough that they hash differently, and you can tell them apart if they leak, but you can't actually see the visual difference. But what happens if they leak screenshots or plain text? Well, it's kind of interesting,
because this is maybe one of the rare cases where we actually want to sign a message that might leak, right? If you leak the plain text, no one can tell that it wasn't plain text, that it had these homographs mixed in. So this actually gives you an angle of repudiation. You can actually say, "well, that wasn't me," because if you take the actual ASCII message there and try to validate that signature, it will fail to validate, because you signed over the version that had Unicode in it, right? And because you can't really see the difference, it's almost impossible to tell which characters were Unicode to actually recover the original message.

So if they leak the actual data, you know who leaked. If they leak the plain text with the signature attached, well, you actually still know who leaked, because the signatures can differ: if you just sign them at different times, you'll get different signatures, right? But you can also say, "look, this wasn't me. That signature doesn't match the ASCII that's presented there, that appears to be the message itself." So you not only know who leaked, but you also get to say it wasn't me. Again, this is not OPSEC advice. If you use this and do crimes, you will do big-kid time in big-kid prison, and it's not my fault. Okay.
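
The canary-trap mechanics above can be sketched with a plain hash standing in for the signature. This is a minimal Python illustration under my own assumptions: the message text, recipient names, and swapped characters are all hypothetical, and SHA-256 is used only to show that visually similar copies are different byte strings. Real signing is not shown.

```python
import hashlib

def fingerprint(text):
    """SHA-256 of the UTF-8 bytes; a stand-in for 'what a signature would cover'."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

ORIGINAL = "Meet at the Tower at noon. - Tarquin"  # hypothetical message

# One visually similar copy per recipient, each with a different swapped character.
COPIES = {
    "alice": ORIGINAL.replace("M", "\u039c", 1),  # Greek capital mu
    "bob": ORIGINAL.replace("T", "\u0422", 1),    # Cyrillic capital te
}

def identify_leaker(leaked_text, copies):
    """Return the recipient whose copy matches the leaked bytes, or None."""
    leaked = fingerprint(leaked_text)
    for name, text in copies.items():
        if fingerprint(text) == leaked:
            return name
    return None

print(identify_leaker(COPIES["bob"], COPIES))  # a byte-exact leak matches one copy
print(identify_leaker(ORIGINAL, COPIES))       # a retyped ASCII version matches none
```

The second call is the repudiation angle in miniature: a plain-ASCII retyping of the message does not correspond to any distributed copy, so anything computed over the Unicode version will not verify against it.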
So Unicode is weird to a level that most people don't really appreciate at first, and to highlight this I want to talk about string length. String length is one of those weird things where normal human beings look at a string and tend to have a pretty solid idea of what the length of that string is, right? If I gave you a minute or two, you could probably find some plausible number that felt like the correct length of this string. But the problem is that string length under Unicode is tricky, and by tricky I mean impossible, because it's not well defined.

What is the length of a Unicode string? Is it the number of Unicode code points? Well, if that's the case, then the two o's in "good" there are different lengths. The first one is a normal Latin lowercase o, a grapheme joining character, and a standalone combining accent character; that's three Unicode code points. The other one is just the o-with-acute-accent character: one Unicode code point. Now, it might be the right thing that two o's can be different lengths. That might be the right thing for the software you're building. But it's not obviously intuitive, from a human being's standpoint, that those should be different lengths.

So what about the number of rendered glyphs? This matches most closely with the human intuition about what we should be looking for, but you don't really get to know what that is until you actually see it rendered in a certain context. Look at that "h4" with a circle around both of them. How many rendered characters is that? It's not clear. Is that one glyph? Is it two? Is it three? There are plausible excuses you can make for all of them, and if you change the font you'd probably get a different result. Also, that's a font rendering bug: that circle should only be around the 4. So you can't really use this model of rendered glyphs unless you're OK with font rendering mistakes changing the length of your string, which seems kind of absurd, right? So some people
try to do something like bytes: what is the byte length of the string? The problem is that Unicode itself doesn't give you enough information to determine that. It tells you, here are all these code points; how you actually render them into bits on the wire can change based on whether you're using UTF-8, UTF-16, UTF-32, or a more exotic encoding scheme, things like that. So that doesn't really solve the problem at all.

Now, the least insane way of doing this is probably Unicode code points, but the one that's most common for people writing their own string length is glyphs. And the fact that the best way and the common way are different delights hackers. This is a good thing for us. So let me show you possibly the most boring demo ever to be shown at DEF CON.
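The three candidate definitions of length can be compared directly. The following is a minimal Python sketch, not anything shown in the talk; `approx_glyphs` is a hypothetical helper that only approximates a rendered-glyph count by skipping combining marks and format characters, and the example strings assume the "grapheme joining character" is U+034F COMBINING GRAPHEME JOINER.

```python
import unicodedata


def code_points(s: str) -> int:
    """Length as the number of Unicode code points."""
    return len(s)


def byte_lengths(s: str) -> dict:
    """Length in bytes under three encodings (little-endian, no BOM)."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def approx_glyphs(s: str) -> int:
    """Rough glyph count: ignore combining marks (M*) and format chars (Cf).

    This is an approximation; real rendering depends on the font and shaper.
    """
    count = 0
    for ch in s:
        category = unicodedata.category(ch)
        if category.startswith("M") or category == "Cf":
            continue
        count += 1
    return count


precomposed = "\u00f3"        # LATIN SMALL LETTER O WITH ACUTE, one code point
decomposed = "o\u034f\u0301"  # o + COMBINING GRAPHEME JOINER + COMBINING ACUTE ACCENT

for name, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(name, code_points(s), approx_glyphs(s), byte_lengths(s))
```

Both strings approximate to a single glyph, yet they differ in code points and in byte length under every encoding, which is exactly the gap between the common way and the least insane way of measuring length.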
If I can... yes, got it in one. So I'm catting a text file. And don't be worried: I am not dropping an 0-day in cat; cat is doing the right thing. What I'm going to show you is a text string that all of you will agree, intuitively, is 11 in length: 11 characters. But there's something wrong with it, because cat is having a hell of a time trying to actually render it, and, yeah, it's just going to spin for a while. There we go: "hello world". Is that not 11 characters? That's 11 characters, right? Yeah, 11 characters. It's also 500 megs.

So here's the thing: you give this 11-character, 500-meg string to anything that checks length, anything that tries to guard on input length. It will often do the right thing, but often it won't. It will look and say, "Oh, I managed to figure out there are 11 characters there; 11 is less than some arbitrary limit; sure, send that string on the wire." And I guarantee there is some system in that service chain that was not expecting a half-gig payload. Unfortunately, I don't have any good public examples of this, but trust me: try this at home and you will find a ton of stuff that breaks. OK, so I wrote a tool,
and I wrote a tool because small, sharp tools are best, right? I want something that does one thing and one thing only, and that is take ASCII and make it ridiculous homographs. So I wrote a ridiculous homograph generator called same-same, and it's got two modes. The first one literally maps every character to a random homograph for that character, regardless of how it looks, and the output can be pretty ridiculous. This is what you saw in that last example in plagiarism detection: it just spews random Unicode at you. The second mode is called discreet mode, and it's meant to be more subtle. It's meant to make homographs that look good in context, and you can tell from that second screenshot there that it's not very good yet. That's because discreet, well-hidden homographs are really hard. They're sensitive to what the font is; they're sensitive to things like the background color, the spacing, the kerning, all of it. So my goal eventually is that you'll be able to give same-same hints about what context you're looking for. You'll be able to say, give me discreet homographs for a sans-serif font, or for a bash script, or for, like, insert random website, and you'll be able to use that to adjust which homographs it picks. But we're kind of a long way off.

One note: I'm releasing this not only as open source but as public domain. It's released under the Unlicense, so you can pull it down and do whatever you want with it. I'll be making it public sometime this weekend. It's also one I'm going to be actively developing, so if you're looking to get involved in an open source project, and you're looking for one that (a) is very small and easy to understand, (b) has a very small community of cool people who are very nice, and (c) is written in Rust, this might be the only project you can find that fits all those criteria. So what about defenses? I'm a
blue teamer in my day job; I like protecting things. So I want to make sure I leave you all with a way to stop this stuff, right? The existing defenses against homographs are all very context-specific. We saw Punycode earlier, for instance. There are also things like code linters that can remove Unicode characters from code, things like that. But the key thing here is that you have to tailor your approach to every particular place you might find homographs.

So what if we could reliably interpret the visual intent of the payload rather than the actual data? These things work because our human eyes lie to us and tell us it's normal English, normal ASCII, when it's not. What if we could have a computer whose eyes lied to it the same way? Well, guess what: we already have OCR. Optical character recognition is meant to turn images of text into text. Cool, let's go ahead and try that. We're going to take a homograph payload, take a screenshot of it, OCR it, and just see what happens.

I want to make one note here: what you're about to see is entirely off-the-shelf software. I wrote no custom software for this. I am a Linux command-line nerd in the depths of my soul, so everything here either ships with Ubuntu or is available in public repos. [Music]
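The screenshot-then-OCR idea can be sketched as a small Python wrapper. This is a hedged illustration, not the demo itself: it assumes netpbm's `pngtopnm` and GNU Ocrad's `ocrad` are installed and on the PATH, that Ocrad reads the image from standard input when no file is given, and that its default output is ISO-8859-15 text. `ocr_png` and `looks_like_homograph` are hypothetical names, and the 0.8 threshold is an arbitrary choice.

```python
import subprocess


def ocr_png(png_path: str) -> str:
    """Convert a PNG screenshot to PNM and run it through Ocrad.

    Assumes the `pngtopnm` and `ocrad` command-line tools are installed.
    """
    pnm = subprocess.run(
        ["pngtopnm", png_path], check=True, capture_output=True
    ).stdout
    result = subprocess.run(
        ["ocrad"], input=pnm, check=True, capture_output=True
    )
    # Assumption: Ocrad's default byte output is ISO-8859-15.
    return result.stdout.decode("iso8859_15", errors="replace")


def looks_like_homograph(raw: str, ocr_text: str, threshold: float = 0.8) -> bool:
    """Flag raw text that is not ASCII but reads as mostly ASCII to OCR."""
    if raw.isascii() or not ocr_text.strip():
        return False
    visible = [ch for ch in ocr_text if not ch.isspace()]
    ascii_visible = sum(1 for ch in visible if ch.isascii())
    return ascii_visible / len(visible) >= threshold
```

A caller would render or screenshot the payload, pass the image to `ocr_png`, and then either use the recovered text in place of the raw payload or use `looks_like_homograph` to decide whether the payload deserves a closer look. An OCR engine that only recognizes Latin text will also misread legitimate non-Latin text, so the flag is a hint rather than a verdict.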
Cool. So I have a payload. All of that is Unicode above the ASCII plane, and you'll see here there's no ASCII; it's all just UTF-8 bytes. All right, so I'm going to go ahead and take a screenshot of it. You can see the screenshot I took; nothing up my sleeve, not that I have sleeves. Then we're just going to pipe this to existing open source OCR software called Ocrad. Ocrad needs the image in a certain file format; that's the PNG-to-PNM thing. But look, that worked. Just the open source stuff managed to take this homograph payload that had no ASCII and turn it mostly back into ASCII. Open source software and 15 minutes of work got this about 80 percent correct. If we actually wanted to build defenses like this, it would not take much, and it would work way better than whatever else we're doing. So the key thing here is that the tools already exist. We already have the power to stop these homograph attacks, though
I don't have the power to get back to my slides, apparently. So why prefer this to alternatives?

There are some pros. Number one, it's context-independent: if you can take a screenshot of it, you can do this, so that's pretty much all text. Second, OCR is a well-understood phenomenon. It's something we've put a lot of research into. I think Ocrad is 15 or 20 years old at this point; I'd have to check. This is not new software; it's just that no one has bothered to apply it to homographs, as far as I can tell. OCR-friendly fonts exist, so we can, in the background, render this into an OCR-friendly font first and then screen-capture it and OCR it back, to maximize our chances of getting back just harmless ASCII. And then what you get back is actually the legitimate text. It's a way to defang all these homograph attacks no matter the context they're in.

But finally, the piece I like best is that it exploits attacker incentives. Attackers want their homographs to be subtle, hard to tell apart from normal English, invisible if possible. Well, guess what: if your homograph attacks are perfect in that respect and you clearly cannot tell them apart from English, OCR is perfectly reliable, or pretty close. The better the attacker does, the better OCR does at defeating it. This is one of those beautiful cases where a skilled attacker would need to make their attacks worse to bypass this defense, and I think that's amazing.

Now, there's a big con with this, which is that a lot of large systems are sensitive to the marginal cost of data. If adding the next data point is already expensive and you need to add more expense to it, that might be a problem. OCR can be expensive on large data sets, because you need to actually engage the GPU to do the analysis and all that, so it might not work if you're doing extremely large machine learning systems. But again, I think there's a valuable lesson here, which is that defenses work best when they
directly exploit attacker incentives. This is one of those things where, again, as a blue teamer, I will yammer on for days about knowing your threat model. Know your threat actors: who are you trying to stop from doing what? That involves knowing their incentives, knowing when their attacks work best. If you can tailor your defenses so that they have a similar incentive, then you are on the first solid step to actually winning that engagement. OK, I have some conclusions. Number one:
Phenomenology is king. Phenomenology, again, is the philosophy of human experience. I'm a philosophy dork from my college days, misspending my youth. Basically, human beings are really what gets hacked, ultimately. We focus on the computers a lot because they're fun, but ultimately it's the human beings who are the standard by which we judge whether the hack worked or not. And like I said, hacking computers is fun, but hacking the human being is far more effective. Any time you can trick the person, they'll override the computer. We've seen this time and again, where you flash up a security warning and the human being goes, "No, no, I know better," click. So if you hack the person, you won't need to hack the computer. And finally, Unicode is a delightful monstrosity, and I love it. OK, I am not
standing here purely by myself. I want to thank my Amazon colleagues who are here to support me, especially David Gabler, who couldn't make it, and Nikki Pack; I would not be here without both of their hard work. Additionally, my payphones crew: make some noise. These guys are awesome. They are the shoulders on which I stand. I've learned so much from them, and I would not be here without them. And finally, I want to thank all the DEF CON organizers, goons, crew, et cetera. It is amazing that they manage to pull this off year after year. It's fantastic; they do an awesome job, so thank them. OK, and I actually have a fair amount of time, five-ish plus minutes, for Q&A. Like I said, I will talk about any part of this until you are sick of me. Yes, question? [Music]

OK, yes, that's a great question. So question one was: since I was doing this all in English, could we just check whether it's ASCII or not? And the answer is yes, you can, and there are some sites out there where that's their only defense. But the problem is that the Internet is a global thing, and as hackers we should all be big fans of internationalization. The Internet is for everyone or it is for none of us, right? So you do want to internationalize stuff, and if you want to internationalize stuff, you can't just rely on ASCII.

And the second question, if I'm getting it right, was: will this be an effective defense for things like obscuring email addresses on websites to avoid spammers and scrapers? Yes; most of them are also not very good. Most people who are scraping websites to harvest email addresses have a fairly simple business model that relies on high numbers, and they're OK if people get opted out because they can't figure out whether something is an email address. There are still thousands and thousands of people out there who don't take those precautions, who do get their email addresses sucked
into these spam lists. So I think this would probably be very effective. It would be great if the spammers then had to do the same OCR defense to sanitize their data, because that would be heinously expensive, and they have razor-thin margins, so it would probably put them out of business. Other questions? Yes, sir. [Music]

Mm-hmm. So the question was: can this be used in the other direction? If I get your point correctly, can this be used for testing and red teaming, and am I talking to Dave Kennedy about inclusion in SET? I'm honored by the question. The answers are, respectively: yes, I think this is a very powerful tool for red teamers. Again, as someone who might as well have blue team knuckle tattoos, most of what I have focused on is just, LOL, I broke some stuff, that's fun, let's see how to stop it. But yeah, I could definitely see inclusion in SET. I think it would be a very valuable tool, and if he wants to reach out to me, I'd love to meet him. Awesome. Any questions? Yes, sir. [Music]

Sure. So the question is how I got interested in this research. Fundamentally, again, I have a philosophy background, and I was fascinated by human perception and how our brains lie to themselves. This was actually triggered by an offhand comment made by Max Temkin on a podcast I listen to, talking about the plagiarism detection stuff and how sometimes surrounding a passage of text with white, one-point-font quotes would trick it into thinking you were legitimately quoting an author, so you could hide chunks of plagiarism that a human being couldn't see. That, combined with the fact that I used to work as a browser dev, and homograph attacks, again, there are tons of them in URLs: that's where this got to be part of the research. So I got very interested in it from that angle, but I mostly picked this thread up as personal research in the past year or so, and I literally just fell down this rabbit hole where I was trying homographs on
everything, and the amount of stuff I was breaking delighted me. So I really want to share that hacker's delight of, here is a tool. If you take away discrete examples from my talk, that's great, but if you take away the more general tool of "put in Unicode and see what happens", I hope you will bust yourself up laughing at least once at the stuff you break with it, because it's pretty impressive. Does that answer your question? Awesome. Any questions? Yes, down the front. [Music]

Oh, so he's asking me how I actually built the homograph bomb that was "hello world" but a half gig. I did a bunch of different ones, and it's interesting, because I wanted ones that padded out the size but didn't visibly change it and also didn't make things choke by themselves. It turns out that if you put in a lot of Unicode control characters, like the right-to-left characters, things like that, there are some rendering libraries that will just strip those out, or there are some sites that just choke on those on their own; you don't need the half gig. So what I finally settled on was a combination of a bunch of combining accent characters interspersed with zero-width joining characters.

Zero-width joining characters could be a talk on their own. They're effectively just a whitespace character, though they're technically not whitespace. The Unicode spec is very clear: it is not whitespace, don't treat it that way. Literally the only thing it does is tell you, at the end of a line, don't break this word. It's a word joiner: keep this word together as you render this text. So they're almost never used; they're mostly used in typesetting software and things like that. But so many places just don't know what to do with them, so they treat them as whitespace. Depending on your Python interpreter, if they count as whitespace, they're zero-width, so you can have tons of fun. You can cause no end of headaches for people as they try to figure out flow-of-control issues for days.
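A scaled-down version of that padding trick, together with a guard that would catch it, might look like the Python sketch below. It is an assumption-laden illustration rather than the actual payload: the choice of U+0301 COMBINING ACUTE ACCENT and U+2060 WORD JOINER, the `pad`, `visible_length`, and `within_limits` helpers, and the limits themselves are all hypothetical, and no claim is made about how any particular renderer displays the result.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT (category Mn)
WORD_JOINER = "\u2060"      # WORD JOINER (category Cf, zero width)


def pad(text: str, repeats: int) -> str:
    """Append `repeats` accent + word-joiner pairs after every character."""
    filler = (COMBINING_ACUTE + WORD_JOINER) * repeats
    return "".join(ch + filler for ch in text)


def visible_length(s: str) -> int:
    """Naive 'what a human would count': skip combining marks and format chars."""
    return sum(
        1
        for ch in s
        if not unicodedata.category(ch).startswith("M")
        and unicodedata.category(ch) != "Cf"
    )


def within_limits(s: str, max_visible: int = 64, max_bytes: int = 4096) -> bool:
    """Guard on both the apparent length and the encoded size."""
    return visible_length(s) <= max_visible and len(s.encode("utf-8")) <= max_bytes


bomb = pad("hello world", 1000)
print(visible_length(bomb), len(bomb), len(bomb.encode("utf-8")))
```

A check that looks only at `visible_length` would accept `bomb` as an 11-character string; adding the byte limit is what rejects it. Raising `repeats` scales the payload without changing the apparent length.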
OK, that's it. Thank you so much, I really appreciate it. [Applause]