Is your code tainted?
Formal Metadata
Number of Parts | 132
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/44920 (DOI)
EuroPython 2018, 85 / 132
Transcript: English (auto-generated)
00:07
Hi there, I'm going to talk about taint tracking. I'm Mark Shannon, I'm a lead engineer at a company called Semmle. We do code analysis, and we have a free-for-open-source project called
00:23
lgtm.com. I have stickers if anyone wants any, and I'll show you this during the course of the talk. So, the schedule: first of all, what do we mean by taint? It's an odd term, with slightly Victorian sanitary sorts of implications, but I'll come to that later.
00:45
There's taint tracking and taint checking, which I'll compare: one is a static analysis approach and the other is a sort of dynamic, built-into-your-code approach. I'll explain what taint tracking is and I'll give you a demo of our stuff showing how it works. I'll explain
01:02
briefly what taint checking is and how we can implement it in Python. I don't have any code to give you, but I'm just trying to give you ideas that you might want to use later on. You'll need to add checks if you're dealing with anything where you're interfacing with the outside world, and once we've gone through this stuff it will
01:26
hopefully become apparent where we should add these checks, but we'll discuss that briefly, and then I'll just summarise a list of things to remember. So, what is taint? Taint basically just means anything you can't trust. I'm not sure where the word taint comes from; it's
01:43
one of those things where someone once started using it, people kept using it, and now it's the conventional term. So it's just untrusted data: basically anything from the internet, but if you're allowing clients to upload files, for example, it could be the contents of any of those files. So you might not even
02:03
trust your own file system in certain cases. You could be tracking a user ID, and you could regard users who aren't admins as tainted because there are certain things they're not allowed to do. It could be a number of things. There we go.
02:24
So, taint tracking. We've got this taint; some of our program is tainted. What does that really mean? Well, it means that there are untrusted values in the variables in various parts of our program. And we're interested in seeing how that could end up somewhere we
02:43
don't want it. So there are three components to this that we're interested in. There are the sources, where the taint comes in. There are the sinks, and that's where we don't want the taint to end up. It could be a SQL query, it could be eval or exec. It could be a path for opening
03:04
a file, because we don't want people opening the password file. It could be, again, any number of things. And the last thing is what's called a sanitizer. Again, Victorian morals, cleanliness-is-next-to-godliness terminology. I don't know where this terminology comes
03:22
from, but a sanitizer is just something that cleans up your data, turning it from untrusted to safe. So, for example, for SQL injection, a sanitizer would just be something that did SQL escaping. We'll come to that in detail in a minute. But the key thing
03:43
here is that this is a code analysis, so we're not actually running any code. So what we need to look for are potential paths in the code where this taint could get from an input, a source, to a vulnerable point in our code, a sink, without passing a sanitizer.
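To make the three terms concrete, here is a minimal sketch of one tainted path and one sanitized path. The function names `read_comment` and `render_comment` are hypothetical stand-ins, not anything from the talk or from a real framework; `html.escape` is the standard-library HTML escaper.

```python
import html


def read_comment():
    """Source (hypothetical): stands in for data arriving from the outside world."""
    return "<script>alert('hi')</script>"


def render_comment(fragment):
    """Sink (hypothetical): builds HTML that a browser would interpret."""
    return "<p>" + fragment + "</p>"


# Tainted path: the value goes from source to sink with no sanitizer in between.
unsafe_page = render_comment(read_comment())

# Sanitized path: the value passes through a sanitizer suited to this sink.
safe_page = render_comment(html.escape(read_comment()))
```

A static taint-tracking analysis would aim to report the first call chain and stay quiet about the second, without running either.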
04:00
In other words, it's still in its unsafe form. So, sources. I think I've outlined some possibilities already. It's literally anything where in some way some malicious entity could have put something in that you don't want. Yeah. It's anything
04:24
from the internet, basically. But there might be other cases. If you're a big institution, you may trust some of your employees to do certain things, but not trust other employees to do certain things. So input, even potentially from our own
04:44
employees, could be regarded as tainted; financial institutions in particular can be very sensitive about who can access what data, or medical institutions, or anything like that. And I think I'll cover this later, but yeah, so anything
05:04
you cannot trust or a point in your code where something you can't trust enters that code. I think that's the key thing. So, for example, if you're using Django, you'll wire up your views in Django and you'll have a function that's wired to a URL and
05:27
that will take a request parameter. That argument there, that parameter, is a source of, you know, things you do not trust. There's nothing wrong with Django. It's not
05:40
that there's a fault with Django. That's the entry point in your code where the outside world is sending you stuff. Sinks. They're the places where bad things can happen. SQL injection. So who's heard of SQL injection? So, okay, you've all seen
06:04
the XKCD cartoon. There are other forms of injection. There's path injection, which is where instead of a SQL query, it's just a path into the file system, and someone could put double dots in there and escape out of the scope you're interested in, the safe
06:22
part of your file system, into parts you don't want them accessing. Code injection, remote code execution, that's usually the one where you get into the newspapers. And there's a variety of these things. So these are generally called injection attacks, and these are kind of the headline versions of this. But there are other forms of attack or vulnerabilities
06:44
that taint tracking can cover. So, sanitizers. Well, what is a sanitizer and what isn't rather depends. So there's no general rule of this is a sanitizer or this isn't.
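As a sketch of why a sanitizer only counts for the sink it was designed for: below, an HTML-escaped value is still dangerous in a SQL query. The table and the input are made up for illustration, and the safe variant uses a parameterised query (a different technique from the SQL escaping mentioned in the talk, named here plainly) via the standard-library `sqlite3` module.

```python
import html
import sqlite3

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

user_input = "1 OR 1=1"  # attacker-controlled value

# HTML escaping leaves this input untouched: nothing in it is special in HTML.
html_escaped = html.escape(user_input)

# Wrong sanitizer for this sink: the condition is injected and matches every row.
rows_html_escaped = conn.execute(
    "SELECT name FROM users WHERE id = " + html_escaped
).fetchall()

# Sanitizing for SQL specifically: the value is bound as data, never parsed as SQL.
rows_parameterised = conn.execute(
    "SELECT name FROM users WHERE id = ?", (user_input,)
).fetchall()

conn.close()
```

The first query returns both users; the second returns none, because the whole input is compared as a single value.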
07:03
Say you have a string and it somehow ends up in a SQL query. If it's been HTML escaped, it's not safe, and vice versa. What's required to sanitize a SQL
07:23
query is not what's required to prevent cross-site scripting, which is the form of injection where the input is a request and the output is a response. Okay. So a simple
07:41
example: code injection. I've chosen this one because this is the actual practical example I'm going to give you slightly later, and it's the simplest; sanitizers are often inherently slightly more complex and fiddly to define. This is a fairly simple one to define. So the sources are any HTTP request parameters. So Django, Flask, anything, pretty much any of this stuff.
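A sketch of this code-injection setup, together with one possible sanitizer for it: a whitelist that maps a small set of permitted command names to fixed code. Everything here is an assumption made for illustration: `request_params` is a plain dict standing in for HTTP request parameters, and `ALLOWED_COMMANDS` and `run_command` are hypothetical names.

```python
# Hypothetical whitelist: command names a user may ask for, mapped to fixed code.
ALLOWED_COMMANDS = {
    "total": "result = sum(values)",
    "largest": "result = max(values)",
}


def run_command(request_params, values):
    """Run a whitelisted command chosen by an untrusted request parameter."""
    name = request_params.get("command", "")  # source: untrusted input
    try:
        code = ALLOWED_COMMANDS[name]  # sanitizer: only our own strings get through
    except KeyError:
        raise ValueError("unknown command") from None
    namespace = {"values": values}
    exec(code, {}, namespace)  # sink: only ever sees code from the whitelist
    return namespace["result"]
```

The user's string is used only as a dictionary key, so it never reaches `exec` itself. Anything outside the whitelist raises an exception instead.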
08:08
If there's a parameter called request or req, there's a good chance that that's something we want to be wary of. And exec or eval are our sinks. You probably think, oh, you should
08:22
never use exec, never use eval, because it's unsafe for security. But sometimes you have to use it in limited scopes, and sometimes you need to do it from user responses. So I've said there's no sanitizer in general, but there can be specific cases. So, for example, we may need to exec a certain small set of commands. So the way we can sanitize
08:46
it is we can whitelist inputs and then map those to our commands. And that acts as a sanitizer. That mapping effectively blocks the user input. And if it doesn't match any entry in our whitelist, we'll raise an exception or handle it in some
09:02
other way. And yeah, it's not just injection attacks. There are other security attacks. There are IDOR attacks, which is insecure direct object reference. And that's kind of an in-memory equivalent of a SQL injection. That's where you have in-memory data that's indexed by, say,
09:21
the user ID. And you haven't done a check, and it allows someone to paste a user ID that isn't their own into a URL and get information back about another user. Yeah, so SQL injection is one way of doing that, but direct in-memory references are another. And there are resource
09:40
leaks. This doesn't have to be about security. It could be like losing file descriptors or losing sockets. It's not a security problem, but it's still pretty annoying when your program crashes because it runs out of file descriptors or runs out of sockets. So there are resource leak issues here. And in this case, the sources are obviously where
10:00
you create one of these things. And the sink is possibly something a little more subtle. It's where, in general, it escapes your reference. So take the with statement: if you create a file handle or a socket in a with statement, it's guaranteed to be cleaned up. But we can't always do that. Sometimes you
10:21
have to create it and it gets passed around a bit. And it might sort of get lost, effectively, or end up indirectly referenced by some other object and retained. So we can use it in those circumstances. And does anything else fit this pattern? I mean, use your imagination, because people out to get you will use theirs.
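For the resource-leak case, a small sketch of the difference between a handle scoped by a `with` statement and one that escapes into a longer-lived object. The `registry` list is a hypothetical stand-in for "some other object" that retains the handle; the temporary file exists only to make the example self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as out:
        out.write("hello\n")

    # Scoped: the with statement guarantees the handle is closed on exit.
    with open(path) as scoped:
        first_line = scoped.readline()
    scoped_closed = scoped.closed

    # Escaped: the handle is created (the "source") and then stored in a
    # longer-lived object (the "sink"), so nothing closes it automatically.
    registry = []
    escaped = open(path)
    registry.append(escaped)
    escaped_closed = escaped.closed

    # Explicit cleanup, so the temporary directory can be removed safely.
    escaped.close()
```

Here `scoped_closed` is `True` while `escaped_closed` is `False`, which is the situation a leak analysis would try to flag.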
10:42
Okay, now this is the demonstration, which is interesting because I can't see my screen. So this is our IDE for our query language. Okay, I can sort of see over there. Okay, so
11:12
I'm just going to click on one of these. That's almost certainly not big enough to see anything, is it? Okay, so this is some example code. Now, there's clearly
11:28
nothing security related here. These are kind of arbitrary sources and sinks. So you can probably guess, but everything written in capitals as SOURCE is treated as a source. And the function SINK, its argument is treated as a sink. So our top one there, we can see,
11:45
is a very simple flow. We assign a source to x and we then send it to the sink. We've got more complex examples of flow. And, for example, you know, taint can flow
12:03
through functions, out of functions, into functions, potentially through attributes. The potential number of flows is essentially unlimited. So what we're trying to model is as broad a set of those as we can without ending up in a situation where it looks like everything
12:21
in the whole program is tainted and you're overwhelmed with false positives. Our analysis is reasonably good. I'm not sure if I can use the mouse. So I think this one,
12:42
yeah, so here's our source and here's our sink. And the flow here is through here. So you can see the flow. Okay, so we have a function called up and down. It calls has source, which has a source in it. So the flow is from the source in has source,
13:07
up into its caller, up and down, and then back into its callee, which is our sink function. I also have works. Apologies for this. No, I don't. Okay. What is going on here?
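The demo code itself is not part of the transcript. A hypothetical reconstruction of the two flows described so far, the direct assignment and the one that goes up into a caller and back down into a callee, might look like this. The identifiers follow the spoken names, and `reached_sink` is added here only so the sketch can be run and checked.

```python
SOURCE = "tainted value"  # anything named SOURCE is treated as a source
reached_sink = []         # bookkeeping for this sketch only


def SINK(value):
    """The argument of SINK is treated as a sink."""
    reached_sink.append(value)


# The simple flow: assign a source to x, then send x to the sink.
x = SOURCE
SINK(x)


def has_source():
    return SOURCE  # taint leaves through the return value


def has_sink(arg):
    SINK(arg)  # taint arrives through the parameter


def up_and_down():
    value = has_source()  # flow goes up, from callee to caller
    has_sink(value)       # and back down, from caller to callee


up_and_down()
```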
13:51
Okay. I seem to have lost things. Any Eclipse experts in the house?
14:07
This does occasionally do this when I can't. Yeah, it's probably as if the UI in Eclipse is a little bit too flexible for its own good.
14:21
Restore is not restoring to what I wanted at all. Okay. Right. Well, I will show you the path. No, I do not want that. This is deeply annoying.
14:41
Oh, you should restore to what it was. Unfortunately, there's not a back thing on. Yeah, I just lost the views. That's all. Okay. Let's see if we can get the views. Window, show view, with a bit of luck. Forget the one we want. What happened there?
15:14
Why is that? Okay. So it thinks it's already showing it, but it's not actually showing it, which is unhelpful. Okay. Up here. Brilliant. Well done. Thanks very much.
15:39
So here's our little example. There are two paths here, apparently.
15:46
So we start here. Then it comes up to here's the return value of has source. So I'm going to do this without staring at the screen.
16:00
And then that flows into the parameter for has sink. And I mis-clicked that. And to our sink. And the last one. So this is another file. I'll just show you something here.
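The file on screen is not captured in the transcript. A hypothetical reconstruction of that kind of example, where a sanitizer exists but one branch goes around it, could be the following. The names `passthrough`, `sanitize`, and `handle` are assumptions, and `html.escape` merely stands in for whatever sanitizer the demo used.

```python
import html

SOURCE = "<b>tainted</b>"
reached_sink = []  # bookkeeping for this sketch only


def SINK(value):
    reached_sink.append(value)


def sanitize(arg):
    return html.escape(arg)


def passthrough(arg):
    return arg  # returns its argument unchanged, so taint flows straight through


def handle(condition):
    x = SOURCE
    if condition:
        y = passthrough(x)  # this branch bypasses the sanitizer
    else:
        y = sanitize(x)
    SINK(y)


handle(True)   # tainted value reaches the sink
handle(False)  # sanitized value reaches the sink
```

Because the analysis does not run the code, it has to assume the condition could be true, so the first path is reported.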
16:23
So here we have a slightly more subtle thing where we have a sanitizer, but where you can bypass it. So if the condition is true, we will bypass the sanitizer. So what happens here is x is assigned the source, and if the condition is true, it flows
16:44
through this function, which returns arg, and then can flow to the sink. And we can see that path here. I very carefully pressed the right thing this time. And here's the flow. We just go
17:02
from our source to there, into our function, back out again, and to our sink. So these are obviously fairly simple contrived examples. And I will also show you another fairly simple contrived example, which is
17:29
actually not our code. So at least it should hopefully be a slightly more convincing demo, if nothing else. So switch from, okay, we're already using this one. So if I switch the
17:48
results to this one. So this is our code injection query. This is our actual production query. There's a lot of code here that's hidden. But basically the query is simply
18:05
find a source, find a sink, and check where there's flow from one to the other. The extra stuff is to do with the paths, which I just showed you. So this query understands the paths. The key part is flows
18:21
from source to sink. And that basically just does the flow analysis that I've been explaining, in that it just follows the steps in the program. We need to understand Python semantics, obviously. And then there's more general language semantics, such as assignments to and from variables, calls, tracking from an argument to a parameter, and so on.
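The production query is not shown in the transcript and is written in a dedicated query language, so the following is only a toy Python sketch of the underlying idea: treat the program as a graph of data-flow steps and search for paths from a source to a sink that never pass through a sanitizer. The function name, the graph, and the node labels are all invented for illustration.

```python
from collections import deque


def find_tainted_paths(edges, sources, sinks, sanitizers):
    """Return every acyclic path from a source to a sink that avoids sanitizers."""
    paths = []
    queue = deque([source] for source in sources)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            paths.append(path)
            continue
        for nxt in edges.get(node, ()):
            # Taint stops at a sanitizer; also skip nodes already on this path.
            if nxt in sanitizers or nxt in path:
                continue
            queue.append(path + [nxt])
    return paths


# Hypothetical data-flow steps: assignment, argument passing, and so on.
edges = {
    "request.GET": ["cmd"],
    "cmd": ["eval(cmd)", "whitelist(cmd)"],
    "whitelist(cmd)": ["exec(mapped)"],
}

tainted_paths = find_tainted_paths(
    edges,
    sources=["request.GET"],
    sinks={"eval(cmd)", "exec(mapped)"},
    sanitizers={"whitelist(cmd)"},
)
```

Only the path ending in `eval(cmd)` is reported; the route to `exec(mapped)` is cut off at the whitelist. A real analysis additionally has to derive the edges from the language semantics, which is the hard part.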
18:47
There are two things here. One Python 2, one Python 3. I just randomly chose one. And we look at the path here. Just click on these. These are all very simple flow.
19:04
It just flows from here, our source, and it flows on. Now, I don't know if you can read this at the bottom; you might be able to a little bit. There's one fine detail here, which is that what we're actually tracking changes along the way, in that we start with a request,
19:21
which in itself is safe. Various elements of a request are predictable; for example, our server URL should be an entirely predictable thing, as are the bits that Django sticks in there, and so on. It's actually the query parameters that are the bit the user can influence.
19:40
So we note the change. We start off with a request. And then as we flow through it, it changes from a request. There's a request again. And then we have what's basically a dictionary of query parameters, and then we extract something, and this is potentially our dangerous object
20:01
because this is a user-controlled value. Okay. Right. I think I'm a little tight on time due to having fun with Eclipse. So let's leave that and go back to the presentation.
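The change in what is being tracked, from a request, to a dictionary of query parameters, to a single user-controlled value, can be sketched with the standard library's `urllib.parse`. The URL is made up, and treating the host and path as predictable is the simplification made in the talk, not a general guarantee.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical request URL; only the query string is chosen freely by the user.
url = "https://example.com/calc?expr=1%2B1"

parts = urlsplit(url)
host = parts.netloc            # treated as predictable in this sketch
path = parts.path              # fixed by our own URL routing

query = parse_qs(parts.query)  # a dictionary of query parameters
expr = query["expr"][0]        # the user-controlled value worth tracking
```

Passing `expr` to `eval` at this point would be exactly the code-injection flow discussed earlier.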
20:22
Okay. So, taint checking. Now, code analysis is really good. You don't need to change your code. It has a lot of benefits. But in security, a belt and braces approach is
20:40
often a good idea. So can we do this dynamically? Well, yeah. I'm not sure any web frameworks or any systems do this, but it's entirely feasible. So we have the same thing. We have sources, sinks and sanitizers. A source, again, is our web request. A sink is
21:02
anywhere taint shouldn't get to. And our sanitizers are where the taint can't get past. So they're essentially our escape functions or whatever we do like that. Now, Perl and Ruby have this built into the language to some extent, but I'm not sure how robust or how general that is.
21:25
Any Perl or Ruby experts, please come and tell me later. Okay. So basically what we're looking at is just an object that doesn't really want to become a string unless you do it in the right way. So suppose our Django request arguments just returned this tainted data thing,
21:46
which is kind of this opaque thing. Now, anything that doesn't want to show itself as a string is a bit of a pain for debugging, but it does at least give us some security value. In other words, the only way we can get this to a string is by explicitly calling its escape HTML or
22:03
escape SQL method, which means we are guaranteed to have called a sanitizer on it, and also we're guaranteed to have only called it once because we can't call that method on a string because it doesn't have it.
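The speaker notes there is no code to hand out for this, so the following is only a minimal sketch of such an object under stated assumptions. The class name `TaintedData` and the method names `escape_html` and `escape_sql` follow the spoken description; the SQL escaping shown (doubling single quotes) is a deliberately simplistic placeholder, not a complete escaper.

```python
import html


class TaintedData:
    """Opaque wrapper for untrusted text (hypothetical sketch)."""

    def __init__(self, raw):
        self._raw = raw

    def __str__(self):
        # Refuses to become a string implicitly; this also blocks f-strings
        # and format(), which fall back to str().
        raise TypeError("tainted data must be escaped before use")

    def __repr__(self):
        # Hides the contents, which is what makes debugging less convenient.
        return "<TaintedData (contents hidden)>"

    def escape_html(self):
        return html.escape(self._raw)

    def escape_sql(self):
        # Simplistic: doubles single quotes for use inside a SQL string literal.
        return self._raw.replace("'", "''")


comment = TaintedData("<b>O'Brien</b>")
for_html = comment.escape_html()
for_sql = comment.escape_sql()
```

The results are plain `str` objects, so they have no `escape_html` method and cannot be escaped a second time by accident. Python convention rather than enforcement protects `_raw`, so this guards against mistakes, not against determined misuse.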
22:21
Okay. So right, I'd better be fairly quick here. So basically the last thing is, having explained all this flow, I'm hoping that you will then think, well, where do we put these sanitizers? Where's the best place in our code? And I hope you've realized that you want to sanitize your inputs once and exactly once. So you could
22:45
do it exactly at the input or exactly at the output. Doing it in the middle makes it too unreliable. But the problem with doing it at the input is you don't know what you're sanitizing for. I mean, you might have an input: is it going to end up in a SQL query? Is it going to end up reflected to the user? So basically, always put your sanitizers
23:05
just before the point of use. Sanitize your outputs, not your inputs, is the phrase to remember. So I think we're pretty much out of time. So things to remember: don't trust anything
23:21
from the internet. I'm sure you all knew that already, but there's no harm in reminding you. So taint analysis consists of sources, sinks and sanitizers. It's quite a powerful technique. It's worth bearing in mind. Anything that passes from a source to a sink
23:41
without a sanitizer, that's potentially bad. That's an avenue of attack. This is one technique amongst many. Don't rely on any particular security technique; use as many as you can. Both static and dynamic. You know, formal reviews, anything else. Sandboxing
24:06
and so on. And I think that's about it. If you want a job doing this stuff, come and talk to me. So we have time for one, maybe two questions. Anyone?
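The "once and exactly once, just before the point of use" advice can be illustrated with a short sketch. The stored value and the page template are made up; `html.escape` is the standard-library escaper.

```python
import html

raw = "Tom & Jerry"  # untrusted input, kept in its raw form

# Sanitizing at the input and again at the output escapes the value twice.
escaped_at_input = html.escape(raw)
page_double = "<p>" + html.escape(escaped_at_input) + "</p>"

# Sanitizing once, immediately before the HTML sink, gives the intended result.
page_once = "<p>" + html.escape(raw) + "</p>"
```

`page_double` contains `&amp;amp;`, which a browser displays as the literal text `&amp;`, while `page_once` displays the original ampersand. Keeping the raw value also leaves it available for a different sink, such as a SQL query, that needs a different sanitizer.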
24:40
So I was just kind of curious, you said to use the sanitizers as late as possible.
24:45
Is that a relative thing? So I'm thinking, say you have a Django web app and you usually have the forms framework right before you start using the input. Would you class that as being as late as possible? Because if you introduced it afterwards, you're actually dealing with potentially ‑‑ The answer to almost anything, any question
25:03
like that is it depends. So I'm not a Django expert, far from it. So, yeah, I guess that sounds ‑‑ ideally your latest possible is ‑‑ I mean, you could say, well, it's under the ORM. It's the point at which you actually send a packet over the network.
25:24
But that's kind of impractical. So it's more of a case of where the control flow narrows down so much that that's the only way into that thing. And that is sufficiently late, I would say. All right. Thank you. Another question? Okay.
25:47
Are there any packages or practices like when a sanitizer's returned its sanitized result that it labels the data or subclasses strings somehow to make it obvious that this is now
26:02
sanitized? I mean, you could in theory. I mean, I think the basic thing is here that you're going to have to trust strings because strings come from so many places in your program. There's lots of internal strings. I think it's ‑‑ in terms of doing this dynamically, I think it's just make sure that the
26:24
input is very clearly not to be trusted. Just make it unusable. Literally, it's guaranteed to raise an exception if you try to use it in almost any way, apart from the designated methods to turn it into something safe. Okay. One more short question. Okay. I don't see anyone. So, yeah. Okay. So let's thank
26:53
Mark.