AI Village - Behavioral Biometrics and Context Analytics: Risk Based Authentication Re-Imagined
Formal Metadata
Number of Parts | 335
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/48315 (DOI)
DEF CON 27 (25 / 335)
Transcript: English (auto-generated)
00:00
All right, thanks everyone for your patience while we worked through technical delays. Now I'd love to introduce Jesus and David on behavioral biometrics and context analytics. If we could all give them a huge round of applause. Hello everyone, thank you for joining us despite the hangover. Today we are presenting how we reimagined
00:22
the risk-based authentication system. Our work analyzes information from the device and also from the user: from the device, web-based fingerprinting, and from the user, behavioral biometric patterns. My name is Jesus Solano, lead data scientist, and this is David Camacho, lead data architect. We both work at Cyxtera Technologies
00:41
on the research team. To start, let me ask you something. Aren't you bored of having to remember long and complex passwords? Yes, you are; I am too. How many of you have been in this situation? I think most of you.
01:05
We have been using passwords for the last 60 years, but they are not secure anymore. And to be honest, they have not been secure for a long time. The question is why. There are two main reasons.
01:21
First, passwords don't validate user identity. Second, passwords were originally meant to be difficult for other humans to guess. But today we are not dealing with humans guessing passwords.
01:40
We are dealing with computers guessing passwords, so they are not secure anymore. For instance, imagine that you are setting up the password for your Yahoo email account. Even if you choose a long password like this one, you are not safe. The question is why.
02:01
Well, because it is very easy for a computer to guess. And especially if you are like this kind of person, you are not safe enough on the web. Now, how are online services protecting us on the web?
02:20
How are they protecting our identity? Do you know? Well, that's an important question indeed. Think about your cell phone. Think about your email accounts, your corporate accounts. How are they protecting our data and our identity? What are they doing? What do you have to do to protect them?
02:42
So, many of you, if not all of you, have definitely been in this situation: looking at this kind of thing, trying to match all the requirements, and you end up with a long password that you cannot remember, or you just put it into some password manager.
03:00
That's hard, and it's difficult to remember, right? To reinforce password protection, or the password model that we use right now, the system asks us to set up some security questions or challenge questions. But these questions have another problem.
03:20
They also demand a lot of our memory, and we have two options. If we want to remember the answers, we just give the truth, literally the truth of what they're asking us. But that's information that may be publicly available and easy to guess. Otherwise, we create some random made-up answer, but it's hard to remember.
03:43
So again, you hand it off to LastPass or any password manager that is able to store these questions. And finally, they reinforce security using systems like CAPTCHA. But CAPTCHA can only differentiate
04:02
between humans and computers, or maybe bots, except of course for this one. It is a robot arm that is able to bypass the "I'm not a robot" test. So it's definitely not secure enough. We actually need something that protects our identity and focuses on something
04:24
beyond just secrecy. In recent years, the cybersecurity industry has come up with three main approaches. The first one is passwordless login. The idea with passwordless login is that
04:41
you replace your password with something that is dynamically created, like an OTP. But it's attached to your device, and your device can be stolen. The other way to do this is to use biometric information, like face recognition or fingerprint recognition. However, these systems
05:02
can be easily fooled. We saw in a previous presentation that you can just add some noise to an image and the system will be fooled. And there is research showing that you can print a 3D mask of your face and it will fool most face recognition systems.
05:25
Another important approach is continuous authentication. Continuous authentication is not meant for static login, I mean, not for the password or login time. It's meant for the use of the system. It tries to understand how you interact with the system and stop anyone who is not you
05:43
from doing any possible harm. However, this approach needs a long time window, and that gives the attacker a big window in which to succeed. And finally, there is connection behavior. What connection behavior does is try to understand how you connect to the system
06:02
in terms of connection timing, location, or even your device. However, there are two particular problems here. The first one is that if you are very stable in your behavior, say you always connect from your office, with the same IP and the same machine
06:22
at the same time, during working hours, then whenever you are traveling or connecting from home, you will get a false positive, and it will annoy you with a second authentication factor or something. And if you travel a lot, anything will be normal for you. So you will always be exposed
06:41
and you will never get an alert, right? So the main goal of our proposal here is to successfully validate the user's identity. I mean, to avoid relying only on secrecy, on passwords that are not revealed, and start checking whether people are
07:03
who they claim to be. Since the early 2000s, there have been two main approaches to this problem: the first one, context-based authentication, and the second one, behavioral biometrics authentication. In the first one, we can create a profile
07:22
of the user from their connection patterns. That means that we record the browser, the operating system, and also the time of connection, and we can build a pattern for the user from these connections. With this pattern, you can differentiate one user from another, and you can help the user
07:41
to enhance their security. On the other hand, we have behavioral biometrics. Here, you want to learn how the user moves their mouse and also how the user types. However, if you want to create a good profile of the user from these unique patterns, you need very long time frames
08:02
to analyze. That means that you cannot analyze the user at login time. You can only analyze the user if you get more than 30 seconds of interaction, but a common login time is less than 25 seconds.
08:21
So this usually works very well for continuous authentication but not for static authentication. Moreover, when you are dealing with this information, you are dealing with information that is sensitive. That means that if you are recording the keystrokes being typed, then you are recording the password
08:41
or your credit card number. That is not good. So you want to anonymize the data, but if you anonymize the data, what typically happens is that your model loses predictive information. So you have to make a trade-off between privacy and model performance.
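One way to picture that trade-off is a feature extractor that keeps only timing information and throws away which keys were pressed. The sketch below is an illustration under assumptions, not the speakers' pipeline: the event format and the names `KeyEvent` and `timing_features` are hypothetical, and dwell and flight times are simply common keystroke-dynamics features.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class KeyEvent:
    key: str           # sensitive: the character that was typed
    press_ms: float    # time the key went down
    release_ms: float  # time the key came back up


def timing_features(events):
    """Return timing-only features; the `key` field is never read."""
    if not events:
        raise ValueError("need at least one key event")
    # Dwell time: how long each key is held down.
    dwell = [e.release_ms - e.press_ms for e in events]
    # Flight time: gap between releasing one key and pressing the next.
    flight = [b.press_ms - a.release_ms for a, b in zip(events, events[1:])]
    return {
        "mean_dwell_ms": mean(dwell),
        "mean_flight_ms": mean(flight) if flight else 0.0,
        "n_keys": len(events),
    }


# Made-up events for a three-character entry.
events = [KeyEvent("p", 0, 90), KeyEvent("w", 150, 230), KeyEvent("d", 300, 380)]
features = timing_features(events)
```

The features no longer reveal the password, but they also carry less information than the raw keystrokes, which is the privacy versus performance tension described above.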
09:05
However, all of this is on the theoretical side. In real life, the data is complex and the behavioral patterns are very complex. How are they? Well, as many of us may know,
09:20
in real life, things often don't happen the way they were supposed to, right? We make a lot of assumptions, and they just aren't fulfilled in real life. But to create a robust model, something that is usable in real life for real users, we need at least real data that will bring us closer to something
09:42
that is actually usable. For that, we collected a data set containing 320 hours of human-computer interaction. That is, in a gamified environment, real people used real computers to perform day-to-day tasks
10:00
such as writing emails, browsing the internet, or creating documents. For the context data, or device identification, we collected data from more than two million real users, and this information about the users
10:21
was collected using our company's product. So we know that it's real data that is out there. But to test our hypothesis about how these individual approaches work in a static authentication context, we had to test them first.
10:41
So we created some features out of this data and tried a random forest algorithm. Our results show that the random forest for behavioral biometrics in a static authentication context, or at login time if you prefer, scores 0.79.
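The 0.79 figure is presumably an AUC (area under the ROC curve), the metric named later in the talk. As a reference point, AUC can be read as the probability that a randomly chosen attack receives a higher risk score than a randomly chosen legitimate login. A minimal sketch of that definition, with made-up scores and a hypothetical function name `auc` (the talk's own evaluation code is not shown):

```python
def auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC; labels are 1 for attack, 0 for legitimate."""
    attacks = [s for l, s in zip(labels, scores) if l == 1]
    legit = [s for l, s in zip(labels, scores) if l == 0]
    if not attacks or not legit:
        raise ValueError("need at least one attack and one legitimate sample")
    wins = 0.0
    for a in attacks:
        for g in legit:
            if a > g:
                wins += 1.0   # attack ranked above legitimate login
            elif a == g:
                wins += 0.5   # ties count as half
    return wins / (len(attacks) * len(legit))


# Made-up risk scores: three attacks, four legitimate logins.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2, 0.1]
example_auc = auc(labels, scores)
```

A value of 0.5 corresponds to guessing and 1.0 to perfect separation, so a score near 0.8 leaves a meaningful share of attack and legitimate pairs ranked the wrong way round.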
11:02
What does that mean? It means that almost 80% of the time we can tell whether it's an attack or a real user trying to log in. However, failing to recognize 20% in the real world means a lot. So on your left, you will see a shaded area in the graph.
11:24
This shaded area shows the best range of thresholds you could select for this algorithm. As you can see, the crossing point, which would be the best you can choose, shows 80% accuracy and 80% recall.
11:43
What does that mean? It means that if you're 80% accurate, eight times out of 10 you can tell when an attack is occurring. However, two times out of 10 you will let an attacker pass. That's a lot, and that could mean a lot of loss
12:03
and lots of problems, actually. So as we can see here, the conclusion with this algorithm is that behavioral biometrics by itself may be useful enough, but you have to select your threshold very, very carefully, and it depends on your data, right?
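The threshold sensitivity described here can be illustrated with a small sweep over made-up scores. This is only a sketch: the data and the helper name `precision_recall` are assumptions, and the numbers are not the talk's results.

```python
def precision_recall(labels, scores, threshold):
    """Flag a login as an attack when its score is at or above the threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for l, s in pairs if l == 1 and s >= threshold)  # attacks caught
    fp = sum(1 for l, s in pairs if l == 0 and s >= threshold)  # users annoyed
    fn = sum(1 for l, s in pairs if l == 1 and s < threshold)   # attacks missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up data: 1 = attack, 0 = legitimate login.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.65, 0.3, 0.2, 0.1]

for threshold in (0.35, 0.5, 0.68):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, moving the threshold from 0.35 to 0.68 takes recall from 1.0 down to 0.5, which is the kind of swing that makes the choice of threshold so delicate.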
12:22
So in the best case, you have an accuracy of 0.72, or 72%, but if you move the threshold just a little bit, you will see an important decrease in your accuracy and also in your recall. So every time you move your threshold just a little,
12:40
you will be missing more and more attacks. Now, testing the same algorithm, the random forest, on our context-based authentication, we see that it has better performance in general. I mean, we have an AUC value of 0.82,
13:00
or close to 80% again, just a little higher in accuracy. It's also sensitive; however, it's more stable. Now, it means that in general terms, people have a normal, stable behavior that can be predicted. But if you look
13:21
at the plot on your left, you will see there is a threshold that is really low. That means that you are probably missing a lot of attacks, or you will confuse a lot of different machines. Also, the known weakness of this algorithm
13:45
is that the context can be easily mimicked or replicated. So this kind of algorithm only works as one decision criterion; it is not definitive, and it definitely doesn't really protect our identity.
14:00
So, as you can see here, it's more stable but the results are not really definitive. I mean, we can catch 75% of the attacks but 30% of the attacks are just bypassed. So, to finally understand the problem we're trying to solve here, we need to answer this question.
14:23
Who are we defending against? Who are we protecting our users from? Let us introduce three types of attacks: first, the simple attack; second, the context attack; and finally, the physical attack.
14:41
For the simple attack, imagine a hacker anywhere in the world who is trying to log in to your email account. In the context attack, imagine a more sophisticated hacker who is trying to log in to your email account, but he or she is replicating your device. And finally, in the physical attack,
15:01
imagine you are tired and you want some coffee. You get up and go for the coffee, but in the meantime, a colleague of yours takes your machine and leaks some important data. Now, let us show how the previous models behave in these three types of attacks.
15:24
First, the context-based model. It performs very well in the simple attack: because the hacker is on a different machine, we can recognize them. But in the context and physical attacks, we cannot recognize the hacker, because the hacker's machine is not different from yours.
15:41
So it misses the attack. On the other hand, behavioral biometrics only performs well in the simple and context attacks, but it doesn't perform very well in physical attacks because the person is different. So today we are proposing a new model
16:01
which combines information from those models, creating a new combined approach that performs very well in all three types of attacks. The question is how we combine these models.
16:21
Well, take the information from the behavioral biometrics, add the information from the web-based fingerprinting, and train a machine learning algorithm for each one, in this case a random forest. Then you have one output for the behavioral biometrics and another output for the web-based fingerprinting.
16:41
If you combine these outputs, you can create an enhanced security model. How does this combination work? Well, we ran a sensitivity analysis and then combined the outputs of these machine learning algorithms in a parametric linear combination.
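The talk does not give the exact form of the parametric linear combination, so the following is only a sketch under one plausible assumption: a convex combination with a single weight `alpha`. The function name `combined_risk` and all values are hypothetical; in practice the weight would come out of something like the sensitivity analysis mentioned above.

```python
def combined_risk(p_behavior: float, p_context: float, alpha: float) -> float:
    """Weighted sum of two attack probabilities; alpha weights the behavioral model."""
    for name, value in (("p_behavior", p_behavior),
                        ("p_context", p_context),
                        ("alpha", alpha)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return alpha * p_behavior + (1.0 - alpha) * p_context


# Made-up scores for a context attack: the device looks right (low context
# risk) but the typing and mouse behavior do not (high behavioral risk).
risk = combined_risk(p_behavior=0.9, p_context=0.1, alpha=0.4)
```

With these made-up inputs the combined risk is 0.42, well above what the context model alone would report, so the replicated device no longer hides the attack completely.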
17:01
With that, our results are very good compared to the single models. Here, you can see the AUC, and it is almost 21 better than the single models. Also, you can see here that the receiver operating characteristic curve, or ROC curve,
17:22
is smoother compared to the single models. Here, the dashed line is the coin flip; that means that you are guessing. But the closer you are to the point (0, 1), the better the scenario.
17:42
As you can see, the green line is the best of our three models; that is the combined model. To be more specific, let me show some statistics about our model. First, the precision is higher. For the real user, that means
18:01
that we are reducing the friction, because we are reducing the false positives. If you reduce the false positives, then it is easier for a legitimate user to log in to the website.
18:21
On the other hand, we are reducing the false negatives. That means that we are increasing the security, because it is very difficult for a hacker to get access to the system. And finally, if you put these two metrics together and also add the accuracy, we are creating a more robust model.
18:41
And all of these models are built for robustness, but they also reduce the friction for the user. In conclusion, our model outperforms the single models by combining the information from the unique behavioral biometric patterns with the context-based authentication.
19:03
In this way, we are also reducing the friction that an end user sees when trying to access the system, and finally, increasing the security of a legitimate user. As you can notice here,
19:20
we are only testing in static authentication. That means less than 30 seconds. However, if you imagine these 30 seconds as a time window and you aggregate the time windows, we can straightforwardly extend this problem
19:43
to the continuous authentication problem. So this model works well not only in static authentication, but also in the continuous authentication problem. Well, to close this talk, I want to leave you three things to take home.
20:04
The first one is that, as the security landscape has changed and evolved over many years, the focus of security today should be on who you are, on identity, not just secrecy.
20:21
Because we've been working with an old model, and it's just easily defeated. Second, behaviors are just little details that tell a lot about us. Focusing on these little details will greatly increase our security.
20:41
That is because, first of all, it's hard to replicate that kind of little detail, and second, because it will tell who is the real person and who is not. And finally, I hope we could convince you that security is not the complete opposite of user experience.
21:02
In this case, we're not just increasing the security of the login, but also reducing the annoyance of having to perform a second authentication factor, or having to remember long and complex stuff, or trying to pass that simple challenge that even a machine can pass.
21:23
So that's everything we have today. We hope you liked it, and we're ready to take some questions. Thank you.
21:50
Actually, here the attacks are artificially generated. That means that an action from one user serves as the attack on another user, because they are performing the same action in the dataset.
22:02
So you can treat this user as an attacker of that user.
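A rough sketch of how such artificial attacks could be labeled is shown below. It assumes a simple dictionary of samples per user; the name `build_pairs` and the data layout are illustrative assumptions, not the speakers' actual dataset format.

```python
def build_pairs(samples_by_user):
    """Label each (claimed identity, sample) pair: 0 = legitimate, 1 = attack.

    A sample is legitimate when it was produced by the claimed user and is
    treated as an attack when it was produced by anyone else.
    """
    rows = []
    for claimed in samples_by_user:
        for actual, samples in samples_by_user.items():
            label = 0 if actual == claimed else 1
            for sample in samples:
                rows.append((claimed, sample, label))
    return rows


# Made-up sample identifiers for two users performing the same task.
samples = {"alice": ["a1", "a2"], "bob": ["b1"]}
rows = build_pairs(samples)
```

Because every user performs the same actions, each user's recordings can stand in as impostor attempts against every other user's profile.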
22:27
The short answer here is yes, we use it. However, basically the main difference here is that behavioral biometrics on its own is not enough for static authentication, I mean, for login time, like less than 15 seconds,
22:42
where you're typing your password and so on. But if you take the behavioral biometrics, with the slightly different feature extraction we did, and combine it with the sensitivity of the device identification and connection behavior, you will have better accuracy. Because what makes our approach better in the results
23:03
is the combination, rather than using just one single approach. Well, as with most machine learning algorithms, the output saying that it is an attack
23:21
is just an actionable alert. That means that when an attack is detected, the system will take an action based on maybe the threshold or the level of alert or whatever. So think about detecting a 70% probability of attack. That seems suspicious,
23:41
but not necessarily malicious, right? So you may want to actually annoy the user with a second authentication factor. But that happens a lot less than usual, or than with the individual approaches. And if you have a 99% probability of attack, you could just deny the action.
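The policy described in this answer can be sketched as a simple mapping from risk score to action. The 70% and 99% cut-offs are the examples given by the speaker; the function name `decide` and the action labels are assumptions made for illustration.

```python
def decide(risk: float, challenge_at: float = 0.70, deny_at: float = 0.99) -> str:
    """Map a probability of attack to an action (example thresholds only)."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError(f"risk must be in [0, 1], got {risk}")
    if risk >= deny_at:
        return "deny"            # almost certainly an attack
    if risk >= challenge_at:
        return "second_factor"   # suspicious, so ask for more proof
    return "allow"               # low risk, no extra friction


for score in (0.20, 0.70, 0.99):
    print(score, decide(score))
```

In a real deployment the thresholds would be tuned to the application's tolerance for friction versus missed attacks.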
24:46
Okay, I want to repeat the question. Basically the question is whether, if we replace passwords with this approach, hackers will have another vector of attack. It's serverless? Okay, that's an excellent question.
25:01
And basically the answer here is that as you're logging in, once again, one important thing is that you have some sort of noise. I mean, every time you type your password or move your mouse, you won't do it exactly the same way. However, you have habits, right?
25:20
We're trying to detect the habits. And if someone tries to imitate that habit, it will be very, very hard, because the time is so short and the difference could be so noticeable. And of course, a system that uses any machine learning algorithm will be retrained, because it won't last forever, right? So at the end of the day,
25:41
one of the cool things that maybe we didn't show, but it's important here, is that we could use little interactions to start identifying people. And as attackers and actual people try to log in to the system, we could just train it with more and more data, which gives us better recognition, or maybe better profiling, of the users.
26:02
So in the end, we don't necessarily need to replace passwords per se. We could just use it as a better identity double check. Maybe that's part of our approach.
26:32
Actually, we don't create one model. We create two models, one for each. And the features for each model are different
26:41
because one uses features from the context and the other from the behavioral biometrics. Each model takes its own features and gives one machine learning output. Then we create a risk-based model. That means that we combine the outputs and create a final output.
27:02
This output is probabilistic, as they say. It gives us a level of risk: whether this connection is legitimate or maybe an attack.
27:25
We ran a sensitivity analysis and found that the model that contributes more than the other in this case is the context-based authentication, because we have more data in this case, but also because the connection patterns are easier
27:43
to get than the behavioral biometrics.
29:03
Well, basically your question, if I get it right, is about what happens when a hacker can bypass each individual model and then put them together, like a conjoint attack.
29:51
Okay, so first of all, we showed earlier that there are three categories of attack. There is a fourth category of attack
30:01
that we didn't show because it's part of our future work. And here's the thing: if a hacker is able to bypass one single model individually, that actually requires a lot of work, in terms of investigating information to replicate your machine
30:21
or actually having physical access to your machine. The other part is what a hacker has to do to actually replicate your bodily behavior, because as I said before, you won't have exactly the same values each time you log in,
30:41
but you will have close enough values, and that difference is what I'm trying to measure, right? Now, someone who takes enough time, I mean, who sets himself the task of doing both things, that's definitely a targeted attack, and it's less probable,
31:01
and it's kind of out of scope right now, because that's a question we haven't answered yet, and it's part of our future work. So the short answer here would be: yes, it is definitely possible that if someone manages to bypass both models individually, maybe a conjoint attack
31:22
could bypass our current work fairly easily, but we haven't created that kind of scenario yet. I think that's all the time we have. Thank you for joining us.