We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Inexpensive Datamasking for MySQL with ProxySQL

00:00

Formal Metadata

Title
Inexpensive Datamasking for MySQL with ProxySQL
Subtitle
data anonymization for developers
Title of Series
Number of Parts
611
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year2017

Content Metadata

Subject Area
Genre
Abstract
During this session we will cover the last development in ProxySQL to supportregular expressions and how we can use this strong technique in correlationwith the query rules to anonymize live data quickly and transparently. We will explain the mechanism and how to generate these rules quickly. We showlive demo with all challenges we got from the Community and we finish thesession by an interactive brainstorm testing queries from the audience.
Software developerProxy serverSoftware developerSlide ruleXMLComputer animation
AreaStatement (computer science)OracleLink (knot theory)Tablet computerRule of inferenceMultiplication signDatabaseServer (computing)Enterprise architectureNumberSoftware developerData storage deviceProxy serverField (computer science)TwitterExpert systemStudent's t-testData managementGoodness of fitOpen sourceInformationClient (computing)Query languageRegulärer Ausdruck <Textverarbeitung>System administratorLogic gateRule of inferenceSelectivity (electronic)Right angleCASE <Informatik>FreewarePlastikkarteMachine visionSlide ruleProduct (business)BitComputer clusterCache (computing)PredictabilityInterface (computing)Table (information)MereologyComputer configurationProcess (computing)Ocean currentInformation securityPasswordAuditory maskingCartesian coordinate systemSensitivity analysisEmailLimit (category theory)GoogolMusical ensembleTerm (mathematics)Presentation of a groupRückkehrpunktComputer animation
Rule of inferenceMechanism designFlagConvex hullQuery languageStokes' theoremCASE <Informatik>Regulärer Ausdruck <Textverarbeitung>Rule of inferenceDifferent (Kate Ryan album)BlogNumberProxy serverMessage passingTable (information)SpacetimeMultiplication signSelectivity (electronic)Cartesian coordinate systemQuery languageFlow separationBitPointer (computer programming)Field (computer science)MathematicsFeedbackScripting languageContent (media)InformationDoubling the cubePattern language1 (number)Computer fontLimit (category theory)Regular graphSingle-precision floating-point formatEmailFunctional (mathematics)AliasingAuditory maskingSlide ruleError messagePower (physics)WordEstimatorLine (geometry)Scaling (geometry)MultiplicationCodeSoftware developerEqualiser (mathematics)Physical systemRevision controlCone penetration testResultantRight angleFilm editingRoutingMathematical analysisSquare numberPlug-in (computing)CodeSequel
InformationRight angleCommunications protocolDifferent (Kate Ryan album)MultiplicationRow (database)Query languageCASE <Informatik>QuicksortSoftware developerDependent and independent variablesClient (computing)Data storage deviceOperator (mathematics)Statement (computer science)Matching (graph theory)Rule of inferenceInformationMultiplication signResultantProcess (computing)Proxy serverInformation securityOverhead (computing)Block (periodic table)Scaling (geometry)Auditory maskingServer (computing)Constraint (mathematics)Exception handlingSelectivity (electronic)Product (business)EmailSlide ruleTable (information)DatabaseTransformation (genetics)String (computer science)Entire functionParsingWritingPatch (Unix)StapeldateiMetric systemSoftware testingComputer animation
Computer animation
Transcript: English(auto-generated)
Today I have the pleasure to present, to co-present with René. So you have seen all his work today and for the people that was here yesterday on proxy SQL. So he's the founder, the main developer and the one like, if you saw Colin's slides, that spent all his night developing, it's him.
So he can say thank you for all his work. Yeah, and I am the one that wakes him up during the night to fix stuff, so it's a shared work. First, I want to ask you guys who knows what data masking is?
Not much. So from the guy who knows it, who needs data masking? Some of them, yeah. Okay, so first, this is the most fun part of the presentation, the most interesting one.
So everything I say here is just, you know, it's for example, you can test it, don't try to go to prediction with what I said and say, oh, but Fred says it's like that. This is for, you need to test, if you want that, be sure of it. This is just, let's say, I want to show you something about proxy SQL.
Everything that guy says, it's written in stone, so you have to follow him. So this safe harbor is only for me, not for him. So let's start by who we are. So this is Rene, I will let him introduce himself. Hi, Rene.
So you can follow him on Twitter, and he's the proxy SQL founder, right? And you also work at Dropbox. So this is me, you can follow me on Twitter for people who like Twitter, sorry. So I'm managing MySQL for a long time, and I'm a DevOps believer,
but nothing to do with that today, so no worries if you don't like DevOps. So first, let's talk about what is proxy SQL. So this is the MySQL Data Stargate. That's it. This is all you need to know. If you need to know a bit more of details, you have a lot of room.
So basically, it was already introduced in the previous session, so for who was not here, the idea is that a proxy has to be a middle layer between the application and the database server. So my vision is really, it's a gate that communicates those two layers.
And while staying in the middle between those two layers, it's able to do a lot of things like query routing, processing of the query, like blocking them, or forwarding somewhere else, rewriting them, caching. And of course, this session is about data masking. So again, this is one of the options that you can do, having a middle layer in the middle.
Thank you, René. So yes, this session is not about everything in proxy SQL. You saw the previous session. It's more about the queries, rewriting, that we're going to do. So why would you like to use proxy SQL as a data masking solution? Because it's open source and free like in beer.
So who knows other solutions for MySQL to do data masking? Nobody? Sveta? Yeah, you can find some on Google, like GreenSQL, for example. But this is quite costly, right? They are quite expensive, so it depends on what you need.
But let's first talk about data masking. I didn't think about making a slide on that. But what do we want to resolve with this data masking? What's the problem? The problem is that sometimes you have developers and you have production data,
and the developer needs a copy, or just they need to access and read only the production data. But maybe your data is sensitive. You have sensitive information there that you don't want that any developers in your, let's say, enterprise have access. Maybe you have students that work there or whatever.
And so you say, okay, maybe we have data that we need to mask. So the data should be there to performance, to see, okay, you need to retrieve. You need to sort or whatever. But it should be masked. You should not see it, right? So this is just about that. I will talk about that also later.
But currently there is no very good solution on data masking right now. So this is one option that we did. So the other solutions are very expensive, or they are just not working, or not working properly, and for the price sometimes it's not that good.
This one is not worse than the other solutions, but because none are perfect. So if you have seen a bit of Joro, if you want to discuss about Joro, he has a talk previously. He is the security expert in MySQL. It's always possible to ask stuff or to deal with security, so watch out.
So the best solution, if you really want to have data masking, it will be to integrate this inside of the server. For example, just after the end layer API. So when the storage engine gets the data,
you modify the data there before exposing it to the client. I think this is a lot of work, and currently it's not in our roadmap to do so. Maybe in the future. So what is the concept of data masking using proxy SQL?
So what we're going to do is that we're going to use regular expression. I'm very sorry about that, because I will show you the regular expression after. We are very, let's say we are old in the IT world, so we are using Perl regular expression.
So sorry for people who hate that. But you will see they are very easy. So what we want to do is we get the data, and before we send the data to the client, we want to modify it. So we want to say, oh, we have some rules, and for example, an easy one, we don't want that the developer or this user
cannot see the credit card numbers. Makes sense that you don't want to share all your credit card numbers or something like that. Yes? Yeah, this is just, no, my example is just with credit card numbers. So I know you're not allowed,
like I said, you remember what I said, it's just for example. And the numbers you will see are not real, so don't try them. So what I mean, any data you want, emails or whatever, if you have some data that you don't want to share with the developer, you need to hide it. So in this case, when the developer will do a select,
we need to find, oh, this column we need to hide, and we need to hide it and then return it. So this is what we're going to do in fact. So first, how do we do that? So in proxy SQL, we need to create a user. So managing proxy SQL, maybe it was not obvious,
or it was too fast, or you were not in the previous sessions about proxy SQL, but the admin interface is just SQL. Almost standard SQL with some keywords just specific for proxy SQL of course, but it's very easy. So if you want to create a user,
you're going to insert into MySQL user and give a user and a password. So in this case, we create a user name called devil. First thing we need to do, like I told you earlier, we want to hide this column, right? The non-credit card column,
we want to hide it, and so what you're going to do is to check for this column, but people can do select star, right? So we want to avoid that too. So if the developer does select star, we need to avoid. Select star, you cannot do it on this table, right?
So we need to create some rules to stop this and all variant of select star. So if the table is part of many tables, we need to do the same. So hide this name for every table. This is how it works,
but this is also a constraint currently. So let's say you want to hide an email in user table, but you want to hide an email, same column name in another table. Here in this case, it won't work. They will both be hidden. This is one of the limitations we have.
So the second rule, we need to mask the field, right? So when the field is selected, we will need just to replace in my example, what I will do is I will show the column and I will just show the first two characters or numbers or whatever you want,
just two and then put Xs, right? To mask. But at least why I played the first two characters is in case you want to sort in your query or whatever, so you can do it. You want also to keep the column name because if you do okay, select email,
and then you do a lot of changes there, you want that in the result, you still have email. So because what we do, we rewrite the query. So we will send a query to MySQL and the query we sent, it's about changing already when we send it to MySQL.
So we need, but we need to return the name. So an example, this number will become like this. So if we want to mask this CC number from the table customer, we will need seven rules.
So this is the rule number one. So as you can see, yeah, you see a very nice mouse pointer. So as you can see, yeah, the first one is quite easy to understand.
We are looking for this CC number, right? And if you have it, we will replace it. So this is not complicated, but we need to find, because people, and I will discuss that after, they put some quotes, they put some parentheses, they will do, everything is possible to get that information. So we need to try to focus on all that.
Then it becomes a bit more complicated. So like I told you, I like Perl. So this is also, we will try to find characters, to find what we are looking for, and then we will make a concat, in fact.
So what we're going to send to MySQL is just, okay, we don't tell it give me the CC number, just give me the two first characters of CC number, plus concatenating with the X we want, okay? So we don't send the full request to MySQL. So this is a rewrite of the query, in fact.
Yes? Okay, so basically, this is a completely rewrite of the query. One of the important things to notice, if you are familiar with regular expression, of course, you have those that are nothing more
than reference to the matching pattern. So this refers to this one, this one to the other one, and so on. So you can, and I think this is not possible with the MySQL rewrite plug-in, doing back reference. And you cannot do that also with regular expression.
You need the per regular expression because we have also caseless and global here. So we want to do it every time it happens. So this is, so if you are familiar with proxy SQL, maybe you have never seen this, and this is because I asked René to do that,
and it will be released soon, right? So you can compile the code, but there is no packages with this information yet. So then we have extra rules again, but I won't detail them all. As you can see, it's fun. So also what we're going to do here
is that if somebody do a select everything from the customer, we're just going to send to the user sending the query. That is not allowed to get that information. You can put everything you want there, of course, but so this is the message. Every time somebody will go, the developer, that user will do a select star.
For example, in this case, we receive, okay, you are not allowed to do that. So we return an error. Again, same information because we need to take care of several different cases that I will show. So what are the limitations?
The limitation is this. This is supported only with proxy SQL big or equal 1.4, not the previous version, right? So all fields of the same name, this is what I was saying before, will be masked even if you say,
okay, this is for this table. We don't check about table in this case because maybe you can join several tables and it will be very complicated. If you find a very nice regular expression to do it, feel free to send it to me. I will be very, very happy. But currently, so all the fields with this name will be masked and however is the name of the table.
So, of course, regular expressions are not, let's say, always safe and maybe the seven rules I give you here are not enough. Maybe somebody will come with a fancy solution. I hope so. So we can improve it. But this is already what we did until now.
If you want to create all these rules, it's quite painful or I think it's painful. I need to remember them. So I made just a small bash script that you call it like that. So you give the column name you want and if there is a table that you want to avoid select star from it, you just do it like that.
It creates the rules for you. So you can add as many rules. So how does it look like? So we select star from, I will show you after our live if we have time, but select star from customer, it was easy. What we're going to have here, we're going to have,
okay, you are not allowed to do that, right? Select first name, last name, CC number from customer, also easy. What we will have, we will have the first name, the last name and the CC number value that will be, what you call it, masked, right? Then more difficult.
And then here I need to thank Istoma here again. He's away. He was here. So I made some blog posts and I got some contribution or by email or by comment to say, yeah, but if you do this query, what will happen? So this query is in two lines, right? My first rules, of course, it failed completely
and you had the content. So now this is also by the query you have seen. We take care of that. So if the query is in multiple lines, it will work. If you make this CC number twice in the same, this is really okay in MySQL, right?
You can do that. The first rules we did without having this global, so when proxy SQL was only doing the regular expression and not the per regular expression, this one was hidden but not the other one. So now we take care of that. Now, yeah, there is a space here. Sorry, I forgot.
But when we put some with the application name, no, or was it on? It doesn't see. Yeah, you don't see it because there are quotes in fact and they are removed by the syntax. It's highlighter. But here, you see it's a bit different font.
So you have quotes, the back quotes. So back quotes, single quotes, double quotes, all that is taking care. And yeah, this one also in two lines, it takes care of it. So this one is on two lines. This one is just because it's too long. It is when you have some function before the field.
It works too. Again, extra example that we, that the rule, so all these with the rule I sent you, it will work. You mean you won't have the data. It will be masked. So yeah, when you give AST1, for example, and here you add an aliases.
This is taken in consideration. If you use CCNEM and then you give an alias of it, you want to see FRET and not CCNEM or not CCNEM, CONCAT CCNEM or whatever. So it will work. Same when there is not the has. Yeah, here again, you have the quotes that you don't see,
but there are some quotes on the slides. It takes care of it. Again here, this one has quotes. It takes care of it. And here with the quotes too, that works too. So all these cases are taken in consideration by the seven rules you have seen.
Extra. So yeah, this is taken in consideration. When you put comments, everything, it will work. You won't see the data. This also, you will take care of it. So it won't work. This one will be blocked. This one will be blocked. So all these cases are handled by the proxy SQL with the rules.
So it's quite already, let's say, it goes far, maybe not enough for everybody. I don't know. And this is what we would like to, we need you for feedback. So do you have some ideas of queries that won't work?
Does it generate some stream? Yeah, but your prepare statement will be already,
I need to check. I need to test to see what we can do. If you find this fancy ID, send it to us, please,
and we're going to test it.
I mean, this idea that I'm asking is also to not make exactly strict rules of the queries that need to be executed. Because another approach is just to block any sort of query with exception of whatever is white listed. But of course, it's become more complicated.
Yes.
If you can send us an email, we will try it. I can try it. Yeah. Yeah.
Yes.
Yes, but that is not there. Yeah, we started the slides. We're saying making this in the database server just after the end layer API, for example, will be the best. But this is not the case right now. So currently, we don't have.
Yeah, but you need to write that information. So you need to write that and see, so interfere with what the storage engine sends you and then verify, oh, this matches
what we don't want to see. But this is quite also a lot of work to add it there. Yeah, it sounds easier for sure. Yeah, this sounds easier. It's not easier. This one is just you have it.
You can do it. If you need that now, today, you can do it like that. And first, maybe make some tests and parse all the queries your user are doing and just allow this one, block the other one, and mask the one that you know. But you have it. No, if you have to wait for MySQL to implement it, I cannot give you.
Yeah, it will be next release. So this is the difference. Here you have it. It doesn't cost you anything, but maybe it's not perfect. Yes, TB.
So the question is, what is the column name and the table name is the same? Yeah, you can try it. You are more crazy than me, so I will test it, and I will see. But yeah, I will test it, or you can test it.
This is ProxSQL taken on GitHub, compile it, and test it if you need it. Other questions? Yes? Yeah, right, the response is...
So yeah, the question here is, why don't we write the response we get from MySQL instead of writing the query we sent to MySQL? And I will let René explain you. Actually, this was a request I had,
I think more than one year ago. And to be able to do this, the proxy needs to understand the query, and currently does not understand the query because there is no parsing, so it doesn't understand what the query represents and what exactly the query is asking. So it doesn't understand what is a column, what is a table, and so on.
If the proxy will understand this, then what it will do is that when the server replies, it will take the results, make the modifications, and send it to the client. It will be way greater overhead anyway. Because right now, whatever the server is responding, because it's already processed, whatever is responding,
the proxy is just storing it temporarily and then sending row by row or in batches depending from the protocol to the client. So there is no extra processing. Doing extra processing will be way more costly. Yes.
Yes, but you can have values. So you can run a select ccnum as whatever, blah. And so the proxy won't know.
Here we process it once when we send the data, right? So we do it once. Let's say you do a select, and you have 1,000 records, or 1,100,000 records and stuff like that. You will need for each rows modified.
And this is too much overhead for proxy SQL. In this way, the data never leaves the database server. I don't know if that makes any difference when it comes to security constraints, but the data will never leave the database server. Yes.
Let's see it again. Oh, well, you have... So the question was, what if the same proxy is used also for production traffic
and not just for development? So the first thing... Can I go back? So here, you see, you have the user name. So all this transformation applies only to that user. That might be the user used by the developer.
Yes, please. Yes, I mean, the question was about how to try to offload the main proxy
in case there is too much processing here. And, of course, you can always create multiple layers of proxy, so one proxy is sending traffic to the other, and you can, for example, decide to do the processing in just one of the proxies, while the other one is just forwarding traffic and not being affected by the times required for the processing.
By the way, the proxy also collects metrics on how much time is spent processing those rules, so it is easy to identify if processing those rules is a time-consuming operation or not. So thank you very much,
and let's call the next speaker. Thank you, René.