Pseussudio. Pseudonymization in Django - TIB AV-Portal

Pseussudio. Pseudonymization in Django

Valcarcel, Frank

Formal Metadata

Title

Pseussudio. Pseudonymization in Django

Alternative Title

Pseu, Pseu, Pseudio. Pseudonymization in Django

Series Title

DjangoCon US 2018

Number of Parts

50

Author

Valcarcel, Frank

Contributors

License

CC Attribution - ShareAlike 3.0 Unported:
You may use and modify the work or its content for any legal and non-commercial purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and pass on the work or its content, including in modified form, only under the terms of this license.

Identifiers

10.5446/44075 (DOI)

Publisher

Release Year

Language

Content Metadata

Subject Area

Genre

Abstract

The General Data Protection Regulation, better known as GDPR, is a regulation on data protection and privacy for all individuals within the European Union. GDPR went into effect on May 25, 2018 and was the cause of the “Great Privacy Policy Update” that occurred in the weeks prior. This talk will cover what GDPR is and why you should care about it, but we won’t stop there. This is not going to be another talk on data protection policy. No. In this talk, we’re going to jump right into discussing HOW to implement data patterns that comply with regulations like GDPR by examining a pattern known as pseudonymization. Pseudonymization is a data de-identification procedure where fields of personally identifiable information (PII) within a data record are replaced by one or more artificial identifiers. These artificial identifiers are also called pseudonyms. Pseudonyms make a data record less identifiable without sacrificing data analysis and processing. GDPR requires that PII undergo either pseudonymization or complete data anonymization. For the hands-on portion of this talk, we’ll construct a Django User Model where we apply pseudonyms to the data attributes which qualify as PII. We’ll explore a couple strategies for implementing a compliant pseudonymization pattern, examining their individual approaches and performance, and we’ll discuss limitations of pseudonymizing certain attributes and how to achieve compliance through consent. GDPR sets a precedent for responsible data management. Whether your application serves citizens of the EU or not, the regulations serve as an encouragement for protecting your user’s identities. This talk is great for everyone from beginners to expert Django developers… and fans of Phil Collins :)

DjangoCon US 2018: 11 / 50

1

24:14

"Normalize until it hurts; denormalize until it works"

2

21:22

3

43:17

Unique ways to Hack into a Python Web Service

4

40:34

Transfer those Skills! How to Identify, Communicate, and Sell your Transferable Skills when Switching Careers

5

19:06

The Power of GeoDjango

6

40:18

Strategies for Zero Down Time Frequent Deployments

7

40:52

Simpl framework, big impact!

8

24:17

Serverless Django with Zappa

9

24:32

Real Life Accessibility: Have you HEARD your site?

10

33:16

Python on your phone: Building mobile apps with Kivy

11

29:39

Pseussudio. Pseudonymization in Django

12

19:05

Packaging Django Apps for Distribution on PyPI

13

26:29

ORM: The Sequel

14

23:03

One Engineer, an API, and an MVP: Or how I spent one hour improving hiring data at my company.

15

42:02

DjangoCon US 2018 - Lightning Talks Day 3

16

32:30

DjangoCon US 2018 - Lightning Talks Day 2

17

40:44

DjangoCon US 2018 - Lightning Talks Day 1

18

33:40

DjangoCon US 2018 - Keynote

19

39:34

DjangoCon US 2018 - Opening Keynote

20

38:31

DjangoCon US 2018 - Keynote

21

43:16

JavaScript for Python Developers

22

25:54

It's about time

23

23:14

Introduction to Django and GraphQL

24

27:47

How to give a damn, and stand out

25

27:53

Here Come The Robots - Django and Machine Learning

26

21:26

Herding Cats with Django: Technical and social tools to incentivize participation

27

48:53

Fundamentals of Kubernetes for Django developers

28

38:13

Finally Understand Authentication in Django REST Framework

29

35:41

Elasticsearch: Accelerating the Django Admin

30

21:37

Easier Classes: Python Classes Without All The Cruft

31

33:23

Django REST Framework: Moving Past the Tutorial to Production

32

19:05

Data internationalization in Django

33

38:21

Containerless Django: Deploying without Docker

34

39:48

Code Review Skills for Pythonistas

35

23:41

Building Workflows With Celery

36

22:53

Building a Community for All People

37

25:25

Bespoke communication devices for kids with autism built with Django and Raspberry Pi

38

22:05

BDD (Behavior Driven Development) Testing for Django Apps by

39

25:05

Becoming a Multilingual SuperHero in Django

40

47:48

Auto-generating an API using PostgreSQL, Django, and Django REST Framework

41

33:23

Anatomy of Open edX

42

23:22

An Intro to Docker for Djangonauts

43

41:46

A Python-Driven Web App Framework with Django, Channels, and React

44

23:58

...a bossy sort of voice

45

48:34

"State of Django" Panel

46

29:17

Autonomous Vehicles, Intelligent Transportation Systems, and yes, Django!

47

14:35

Your web framework needs you!

48

43:12

When your wetware has too many threads - Tips from an ADHDer on how to improve your focus

49

19:36

What's in a Name? Your Guide to the Wacky World of DNS

50

25:42

We Are 3000 Years Behind: Let's Talk About Engineering Ethics

Transcript: English (automatically generated)

00:20

Okay everybody, thanks for joining me for what is probably going to be my

00:26

silliest talk of the year. Let's just get one thing out of the way: how many people know who Phil Collins is, by show of hands? All right, we're gonna have a lot of fun. For those of you who do not know who Phil is, I've got

00:42

plenty of background information on him, and he's, uh, yeah, we'll get to that part. So this talk is called Pseussudio, and it will cover pseudonymization techniques in Django. Hi, I'm Frank. I'm @fmdfrank on Twitter, but I am also on quite the

01:04

extended Twitter sabbatical. You're welcome to go look at my greatest hits there; they are there for you to peruse, but I don't know if I will ever return to the Twittersphere. I work at a company called Cuddlesoft. We have offices in Denver, Atlanta, and Tallahassee, Florida,

01:27

and I'm an avid Pythonista. I've been using Python as my primary programming language for the better part of eight years. This is my first time at DjangoCon and I'm very excited to be here. I'm also the co-founder and chair of PyColorado.

01:40

We'll be having our inaugural conference next year in August. I'm happy to talk to anybody more about that if you're interested. Please come visit me in Denver, it's beautiful. And then I also run Boulder Python in Colorado. So yeah, thanks for having me. So my speaker in spirit is

02:00

Philip, sorry, it's right here, Philip David Charles Collins. He's an English musician: a drummer, singer, songwriter, multi-instrumentalist, record producer, and actor. He was the drummer and singer of a rock band known as Genesis, and during the 80s Collins had more U.S. Top 40 singles

02:20

than any other artist, which, if you're old enough like me to remember the 80s, is actually quite impressive. He co-wrote a lot of the music on Disney's Tarzan; for the younger folks in the crowd, that will probably be how you know him. And I also just learned that Peter Gabriel (none of this is relevant to the talk, you probably figured it out), but Peter Gabriel was

02:42

the original lead singer of Genesis, and Phil took over for him. So why is Phil my co-speaker in spirit? Well, pseudonymization is an incredibly difficult word to say. Try it. How many of you got it right? Yeah. So "Sussudio" is close enough, and that was enough reason for me, as if I needed any, to do a

03:02

Phil Collins inspired data privacy talk. Also, I'm pretty confident I'm the only one who has ever attempted this, so we'll see how it goes. All right, so if you've never heard of Phil, that's okay, I got you. We've got a Spotify playlist of some of Phil's greatest hits. He is on Twitter; he's @PhilCollinsFeed,

03:26

if you're interested. It starts with "Sussudio", which is the song I started this talk off with. It gets kind of sappy towards the middle, like this talk will. I don't know, are there any tissues? If there are, you'll need them. This is heavy stuff, y'all. And of course this playlist ends with

03:43

the air-drumming spectacular "In the Air Tonight", so please check it out, enjoy it. All right, so let's get to the meat and potatoes. What is this very difficult word to say? Well, it's a data de-identification procedure

04:02

where fields of personal data within data records are replaced by one or more artificial identifiers called pseudonyms. The idea behind pseudonyms is that they make a data record less identifiable without sacrificing data analysis and processing. And so why would you do this? Well, anything worth protecting is worth

04:20

protecting well, and it provides you some security through obscurity, so you can secure a data set from identification. It's also kind of required by the law. Not kind of, it is required by the law; I only say kind of because there are these gray areas, which I'm not going to get into because I am NOT a lawyer. So do not ask me for legal advice. If you have a

04:44

question at the end and it smells to me like it's in need of counsel, I will tell you I cannot answer that and that you need a lawyer. So a couple more things as a note: we're not going to get into the mechanics of GDPR. I will

05:00

reference some articles if it's important and interesting for you to go read; it's actually not that dense of a regulation. So we will be avoiding things like consent, the difference between collectors and data processors, or how it affects your organization. Again, if you ask me those questions, I am NOT a lawyer and I will tell you that. But it's important for us to define exactly what

05:22

it is that we are discussing today, and that is specifically personal data. This is also known as personally identifiable information. The gist is that personal data is any identifiable data, or PII. Note that GDPR refers to it as just personal data. One of the things about the regulation that I don't like so much

05:42

is that it does paint in very, very broad strokes, so essentially any information that can be used to identify a user, a person, there's this regulation around. So some examples. Basically, if you

06:00

can identify someone with it, or it can be used to identify someone with or without a secondary data point, then yes, it's personally identifiable information. If you are unsure, chances are that it's personally identifiable information. So let's talk about data privacy techniques, right? There are two very popular

06:21

methods: there is pseudonymization and anonymization. Beyond being very difficult to pronounce the first few times that you practice them, they're the two most common approaches to doing data privacy over PII. Pseudonymization we kind of covered a bit already. I want to also

06:41

point out that according to Article 25 of the GDPR, data must be protected by design and by default, so these are important things to consider even in the planning stages of a system. If you want to understand the requirements underneath the regulation, I recommend you read Articles 25 and 32. I'll note that GDPR only recommends one technique by name, and

07:08

that is pseudonymization, although they spell it with an S and not a Z; it's something I've learned. Anonymization is a more permanent de-identification procedure. With anonymization you render the user's data unidentifiable, so

07:25

maybe one of the reasons why the many teams of lawyers that wrote the GDPR regulations avoided using anonymization is the mere fact that the operation of anonymizing a data set makes it no longer personal or personally identifiable, so it actually doesn't fall underneath the purview of GDPR, which is something I think is really interesting. If you are

07:44

struggling to understand the differences, I'll have some examples on pseudonymization, but if you're struggling to understand the differences between the two, I've made this drawing to help articulate the differences between pseudonymization and anonymization. Anonymization is

08:01

essentially, I think of it as an analogy to, like, Batman's very clever disguise, right? When he puts the mask on, you can't tell it's Bruce Wayne anymore. Thank you. But Superman, not so much. He combs his hair a little bit differently and he puts some glasses on, so at least to all the people in Metropolis that are just not that keen to see that he is

08:23

the same person, he is pseudonymized. They can't tell it's him, but to us, the readers, there is no anonymization layer going on, right? I really just used this as an excuse to make this incredibly funny slide, I think. So let's dive deeper into pseudonymization techniques. The one

08:42

that we are going to go over is a technique called data masking. To mask data, characters in a record are shuffled or substituted, and words may be substituted or obscured completely. The result is usually a realistic data set that cannot be reverse engineered without the re-identifying information or the

09:03

algorithm to reverse the masking technique. There are a lot of techniques that fall under this broader category. There's also a method known as approximation, which is, instead of saving the user's PII by itself, you approximate it. So one of the common practices this is used for is

09:22

for date of birth. Sometimes you don't want to save a date of birth; you just want to know how old, or maybe the birth month, maybe the birth year. So then you have a table with those numbers and you increment those once a user subscribes or enters that information, and you don't save that user's date of birth record specifically. Another method, very popular:

09:42

encryption, and this is something that I expect most people will be familiar with. I do have a question, though: does anybody know if this is required by GDPR? No, it is not, not as a data de-identification procedure. Encryption is required for data at rest and in transit, but it is not the recommended

10:04

nor a requirement under GDPR for how to de-identify your user data. This is actually kind of interesting, because one of the big premises of why I'm doing this talk on pseudonymization is, one, it's fun because of the Phil Collins aspect, but two, it's actually better for you as an

10:21

organization and someone who's serving maybe the role of the data processor and the database administrator. Pseudonymization gives you a lot of value back, and you don't necessarily have to jump to encrypting that data set, because this will add compute resource requirements that you don't necessarily need. This is at least my philosophy; again, I'm not a

10:41

lawyer. And then the final pattern is tokenization, which is very commonly used by companies like PayPal or Apple Pay or Stripe. They will tokenize the credit card information and then they use that token to retrieve that information when they need it. They only process

11:03

those data points when they need to; otherwise it's saved on either the client side or the vendor side as this token representation. This is song 2 on the playlist, if you're following along. All right, so I'm going

11:20

to go over a simple implementation example. This is going to set the foundation for how we're going to scale this up in our Django example. Python already supports a common pattern that allows engineers to replace attributes with a set of methods that can intercept values when they are written and when they are read. Any guesses as to what they are? Not a trick question: they are getters and setters, or

11:45

properties. So for the following examples, and for the continuing examples through the Django methods that I'm going to show you all, we're going to use this incredibly simple masking algorithm. The masking algorithm,

12:00

all it does is shift each character one ordinal to the right, and then when it re-identifies them it shifts them to the left. It does it in a range so that it cannot overrun the ordinal ranges for ASCII characters, so it's intelligent from that point, but it's very unintelligent if you use this in production because it's insanely easy to reverse

12:20

engineer. I'm also not going to talk about algorithms or best practices for doing masking because, first, I shouldn't share it with you. Two, this is being recorded, and why would I implicate all of us by sharing an algorithm that then somebody here may go and use, and then

12:43

that is reverse engineered? Now I am culpable. Also, this is a lot easier for everybody to understand, and if I had a more complex or sophisticated masking algorithm, that would take the bulk of the time we have for the talk. So to mask and unmask, we're just gonna have two

13:01

methods, mask and unmask, and essentially this is how it would work, right? We're shifting my name, Frank Valcarcel, every character over by one, and that's what the masked version of it would look like. So, an implementation of this: if we're just using a basic User class, we have an

13:21

underscore-name property, sorry, an underscore-name attribute, _name, and then we have a property method for it called name and a setter on name, and then we just call our mask and unmask methods underneath those two functions. And so it'll look something like this: when I instantiate User, I'll set the user's name as my name. If I print user.name, it's coming

13:41

from the property, so it'll return my name, but if I'm looking at the _name attribute, it's returning the pseudonymized version. This is important to understand, because what's being saved in the object, and therefore could be serialized later, is the pseudonymized version. It wouldn't be my name; my name is only being re-identified in transit. So let's look

14:02

at a Django example. This is song 3 on the playlist, if you're following along. So we're gonna take the same concepts, I'm gonna add a few attributes, but we're gonna focus in on the name field. Quick question: how many of these attributes are PII? All of them, yeah. They're all identifiable, because

14:25

together, something like the IP address with one of the other data points makes the user who's saved identifiable. So we're gonna move our shifting algorithm, our masking algorithm, into a utils file (the code for this is all available later, I'll share the link with you), and then our mask and unmask

14:42

methods. Now here's the user attribute again, focusing in on just the name field. We've done the same process: it's _name, and we have a getter and a setter applied to it, which will mask and unmask as that data moves in and out of the objects. So the problem is that we're not done.
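
As a minimal sketch of the getter/setter pattern described so far, written in plain Python rather than as a Django model: the shift-by-one masking below rotates characters within the printable ASCII range (32 to 126 is an assumed choice of bounds; the talk does not state the exact range), and a property masks on write and unmasks on read. The names `shift` and this `User` class are illustrative, not the speaker's actual code, and this rotation is trivially reversible, so it should not be used in production.

```python
LOW, HIGH = 32, 126          # assumed bounds: printable ASCII
SPAN = HIGH - LOW + 1        # number of characters in the range


def shift(text, offset):
    """Rotate each printable ASCII character by `offset`, wrapping inside
    the range; characters outside the range are left unchanged."""
    shifted = []
    for ch in text:
        code = ord(ch)
        if LOW <= code <= HIGH:
            code = LOW + (code - LOW + offset) % SPAN
        shifted.append(chr(code))
    return "".join(shifted)


def mask(text):
    """Pseudonymize: move every character one ordinal to the right."""
    return shift(text, 1)


def unmask(text):
    """Re-identify: move every character one ordinal back to the left."""
    return shift(text, -1)


class User:
    """Keeps only the masked value in `_name`; `name` is the readable view."""

    def __init__(self, name):
        self.name = name              # goes through the setter below

    @property
    def name(self):
        return unmask(self._name)     # re-identified only when read

    @name.setter
    def name(self, value):
        self._name = mask(value)      # only the pseudonym is stored


user = User("Frank Valcarcel")
print(user.name)     # the readable name
print(user._name)    # the stored pseudonym
```

Anything that serializes the object's attributes directly would see only `_name`, which is the point the talk makes about the pseudonym being what is actually saved.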

15:01

And for sake of time I'm gonna speed through the rest of this, because I want to get to the second example. The model's QuerySet doesn't yet support our properties. You cannot filter, you cannot exclude on the identifiable data values, right? You have to know that Frank will be pseudonymized and masked to "gizball" or something like that, right? And so therefore that's not a very

15:22

intuitive way to interact with your data models. The other thing is that pseudonyms are now included in all of our user objects everywhere that we're retrieving them, which just pollutes the user model, I'm sorry, pollutes the object. It's useless, it's just going to add weight to that data object and we don't need it. And then also the Django admin has no idea

15:40

what to do with this. So first let's start updating the QuerySet. We are going to monkey patch some of the methods on QuerySet so that we can filter and exclude. I'm not going to do all of them, I'm just going to do filter and exclude, and I'll show you that we actually get a bit more bang for your buck by just monkey patching these. Also, for sake of time, I won't be looking at the source code, but just to note, the

16:03

reason why this is here is that when you patch _filter_or_exclude you get filter, exclude, and get out of the box, so you only have to monkey patch that one function, and you can see in the source code that they all just call _filter_or_exclude. Then we'll insert our masked values and we will super the parent instance of our custom models.QuerySet

16:21

for everything else. This is what this will look like in code: we have our masked fields, name, and then we iterate over the masked fields and create a keyword argument that we then pass to our (there's my mouse) our _filter_or_exclude customized method. So now we'll be able to do things like

16:43

filter on the identifiable name or exclude on the identifiable names. And then the last thing we need to do is override the auth UserManager's get_queryset, and you can see how I've done that there for the object. Second thing we have to do is exclude pseudonyms, and pseudonyms are really

17:01

useless; like I said, they pollute the models. So there's actually a method called defer. The gist is that if you don't need a particular field when you fetch the data, you can tell Django not to retrieve it from the database using defer. So it's very similar to the last: we'll create a new list (hey, f-strings for the win), we'll iterate over all the

17:21

attributes in our model that start with underscore, and then we'll add them to our keyword arguments that we pass to defer, which is chained at the end of our monkey-patched filter or exclude. And now when we query using filter, we can use the identifiable information, and then also the object that is returned

17:41

does not have those pseudonyms inside of it. I didn't override all(); you would have to override all() in this method. The last step here is updating the Django admin. So write and read are masked and unmasked, but what about the Django admin? Well, it doesn't know how to do this. It doesn't know that we want to display the unmasked values in the admin; it doesn't know to mask those

18:02

values when you submit the forms in the admin. So we have to start by telling it what fields we want to show. Then we'll begin to define a form that we can swap in place of the default one Django wants to use. You can see we're overriding the built-in UserChangeForm from django.contrib.auth.forms

18:21

and we're creating a form with the new CharField. On initialization we get the correct value and check it against the validator for our masked field, which could be important with something like a phone number, if you were using a phone regex. And when the form's clean method is called, we can get the appropriate value or error out on invalid input. Next we've got to

18:44

register this. So we have our base fields of the model, namely username and password, but now we've created a group of subfields called personal data and we've added the name property to it. And last, we told Django admin how we would like the users to be displayed in the user list. I will note that this last step could be very important, and you may want

19:03

to, under the regulations of GDPR, you may want to add some logging on this, because according to Article 30, each processor shall maintain a record of all categories of processing activities carried out on behalf of a controller. And a controller can be someone with access to the Django admin, whereas a processor could be you, the engineer who wrote this process, right? You are

19:22

obligated and responsible for logging every time someone re-identifies this PII: when it happened, who did it, and sometimes why they did it, what that business process was. And so we are set. We have finally encapsulated some pseudonymization through the entire lifecycle of this object and how we manage the data of this object. It's stored in the database

19:40

as its pseudonymized field, and when we retrieve it, it will be the re-identified values, and best, we have access through the Django admin to work with that data as it sits identified, but it's always going to be saved as a pseudonymized field. So the next example is a new and

20:01

improved method. It's a lot more straightforward than the last one. I show the last one and I build up from it because there's a lot of work that needs to be done on legacy code, and just because the last method was naive doesn't make it wrong. You may have a system that has very few fields that constitute PII, and you need to create some safety and regulation compliance for your clients. That last

20:25

method isn't bad, it's just that there are better ones if you're starting from the ground up. If you're following along, this is song seven. So we are going to do data masking via custom fields. Using a custom field class, we

20:40

will automatically mask values on their way in and out of the database. With this approach we no longer require getters and setters, the custom QuerySet and corresponding user manager, or the bulk of changes we did to the user admin, because we're doing it at the field level. So we're taking the same user model as before, and here's our customized field; it's called PseudonymizedField. The class constructor and deconstructor methods

21:04

will accept a field type, so we need to tell it what kind of field we are saving underneath. This will set the appropriate database column, and our deconstruct method has to mirror any argument changes we make in the constructor. This is the only thing I don't like about this method. We will

21:21

also override get_internal_type, which specifies the internal type of the field. Sorry, I'm trying to show it to you guys. If you've seen the source code for Field, this will look familiar. We are essentially overriding some of it to provide a masking and unmasking method.
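
The field-level idea, masking values on their way into the database and unmasking them on the way out, can be tried without Django. The sketch below is an analogue built on the standard library's sqlite3 adapters and converters; it is not Django's field API and not the speaker's PseudonymizedField. The `Pseudonymized` wrapper, the `PSEUDONYM` column type name, and the shift-by-one masking over printable ASCII are all assumptions made for illustration.

```python
import sqlite3

LOW, HIGH = 32, 126          # assumed bounds: printable ASCII
SPAN = HIGH - LOW + 1


def shift(text, offset):
    """Rotate printable ASCII characters by `offset`; leave others as-is."""
    return "".join(
        chr(LOW + (ord(ch) - LOW + offset) % SPAN) if LOW <= ord(ch) <= HIGH else ch
        for ch in text
    )


def mask(text):
    return shift(text, 1)


def unmask(text):
    return shift(text, -1)


class Pseudonymized:
    """Hypothetical wrapper marking a value that must be masked before storage."""

    def __init__(self, value):
        self.value = value


# Adapter: runs whenever a Pseudonymized object is bound as a SQL parameter.
sqlite3.register_adapter(Pseudonymized, lambda wrapped: mask(wrapped.value))
# Converter: runs when a column declared as PSEUDONYM is read; it receives bytes.
sqlite3.register_converter("PSEUDONYM", lambda raw: unmask(raw.decode("utf-8")))

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
# "PSEUDONYM TEXT": the first word selects the converter, TEXT keeps text affinity.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name PSEUDONYM TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", (Pseudonymized("Frank"),))

# Reading the declared column triggers the converter, so the name is readable.
readable = conn.execute("SELECT name FROM users").fetchone()[0]
# An expression has no declared type, so this shows what is actually stored.
stored = conn.execute("SELECT name || '' FROM users").fetchone()[0]
# Query parameters pass through the adapter too, so lookups use the readable name.
found = conn.execute(
    "SELECT id FROM users WHERE name = ?", (Pseudonymized("Frank"),)
).fetchone()

print(readable, stored, found)
```

Because the lookup parameter goes through the same adapter as the inserted value, filtering by the readable name still matches the masked column, without any per-query translation code.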

21:42

When that data goes in and comes out of the database, all of that work is done by these two methods: get_prep_value is called prior to interacting with the database, and from_db_value is called when a value is pulled from the database. So this is the core of this implementation; this is what makes it sing. We'll use get_prep_value as an

22:00

opportunity to mask values before they are saved, and we'll mask values for query purposes, which is really cool. Also, we'll unmask our values when they're pulled from the DB, before they're converted to a Python object, using from_db_value. And so this is what it would look like when we apply it to our user model: we have a field now for name of pseudonymized

22:21

field, and we tell it what type of field it will be, what the field is underneath the hood. There is a tuple here that accepts the masking and the unmasking algorithm, so these are not tied to the customized field. You can swap these out; you can use different ones for different field types. In fact, that may be one of the ways to improve upon this: to have something underneath that knows intuitively how to shift

22:41

something like a phone number or something like a zip code or something like a name or a date of birth. And you can see that there are still validations that can happen on the phone field. And that's it, y'all, so thank you. I am gonna take questions. There's some sample code and a

23:01

blog post associated with this, and you can find them at those links. I see you have sample code, but have you packaged up this pseudonymized field on

23:21

PyPI so it can be used by other people?

I wanted to. Well, I shouldn't say I; we wanted to. This was a collective effort with a number of engineers at Cuttlesoft. We've chosen not to, mostly because the pseudonymized field code isn't that verbose, and we don't

23:45

see the value in having something like that easily injectable and available for your code; you can just copy and paste it. We also think that there are a number of ways to improve upon it, which we haven't gotten around to yet. But to me, it's just not a significant enough amount of source code that we need to have a package for it.

Hi,

24:04

thanks for the talk. Can you explain why you wouldn't use encryption as a pseudonymization method instead of kind of rolling your own, pretty much?

There are a lot of reasons to use it, so it just depends on the use case. I think pseudonymization doesn't require encryption as its masking

24:22

and unmasking method. I think you can achieve a lot with a clever sorting or shifting algorithm that you control, or that maybe you even seed somehow, right? Encrypting and decrypting the object's attributes adds the requirement of

24:41

needing a lot of compute resources, and so sometimes I just don't think it's necessary to add that type of overhead when there are methods that will perfectly handle the regulatory compliance side of it. The other reason: think about it from a

25:00

database administrator's perspective. Sometimes you encrypt something and it takes up a lot more space than, say, a phone number would. Whereas if you are saving a phone number that's just shifted around using some unique method, the database space, and therefore the representation of that data in the database, looks a lot more like the identified data, right, versus some long hash.

Hi, I was just trying to

25:28

figure out how to phrase this question. So this is a really useful example of how to adhere to GDPR. I guess the broader question is: if this is a useful way to make data more anonymized irrespective of GDPR criteria, do you

25:47

have general advice for people building whole apps to be more anonymized for their users?

Yeah, we use this method to achieve HIPAA compliance,

26:07

and just because there's a regulation telling you that you should do this doesn't mean that you shouldn't do it if you don't have to adhere to said regulation. Like I said in one of the earlier slides, anything worth protecting is worth protecting well. And I think as

26:23

engineers, our responsibility over time has increased in how we need to handle users' data. I'm a consultant, and this applies especially to folks that are consulting for other entities, right? Even if your client doesn't appreciate having some kind of data privacy technique in place to secure

26:41

users' data, there's no reason why you shouldn't have that, right? I don't necessarily think this method adds a lot of complexity on top of what you're trying to achieve. And at the end of the day, that client could go on for three years and have somebody working on the project who doesn't know how to configure an S3 bucket for the database backups, but you

27:00

did all of their users a solid by just implementing some kind of pseudonymization technique, right?

Yeah, so my question is: what is the specific threat model that this tries to address? Because if your database was dumped, a computer could easily reverse engineer a lot of these, because it's not encryption. So I guess the question is, what's the threat model that this actually solves?

That's a really good question. If we're

27:25

talking about database dumps, I disagree. I think it could take a long time to reverse this if you have a smart masking algorithm. This one's not smart; don't use this one. But

27:44

there are a lot of pseudonymization techniques that avoid shifting as the primary basis for moving around that data, or sorry, de-identifying that data, and you can mix and match. Tokenizing is not encryption, and yet nobody can reverse engineer a token to get back the valid credit card

28:00

information and steal those credit card numbers, right? So I just presented masking and the shifting algorithm as a way for us all to easily understand what was going on with the data in transit.

What do you guys do for, I know we had a large health provider, logging of the audit trails once

28:25

they're anonymized?

Yeah, that's a really good question.

No, Frank will be available in the hallway.

That is a really good question. The challenge is that it's always really specific to what the use case is, so

28:44

for something in the Django admin, creating a log for that would solve that problem. For accessing data between systems, we like CloudWatch, because we can use IAM to create the roles

29:00

that we need, and we can track which roles are accessing the data from which services. IAM and CloudWatch also give you a ton of transparency into access of systems or access of services and things like that, and you can control when user passwords need to be reset, you can enforce multi-factor authentication, a lot of cool stuff.

Okay, so let's thank Frank

29:23

one more time for the talk and the nostalgia.
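
Editor's sketch of the masking round trip described between 21:21 and 22:41. This is not the speaker's sample code. It is a minimal illustration under stated assumptions: the names SHIFT, mask, unmask and PseudonymizedHooks are hypothetical, and the class is a plain-Python stand-in that only mirrors the signatures of Django's get_prep_value and from_db_value hooks. In a real project these two methods would live on a django.db.models.Field subclass, which would also need the deconstruct and get_internal_type overrides mentioned in the talk. The shifting algorithm is deliberately naive, in line with the speaker's own warning not to use it as-is.

```python
import string

# Hypothetical fixed offset; a naive choice used only for illustration.
SHIFT = 3


def _shift_char(ch, offset):
    """Rotate one character within its own alphabet; leave anything else untouched."""
    for alphabet in (string.digits, string.ascii_lowercase, string.ascii_uppercase):
        if ch in alphabet:
            return alphabet[(alphabet.index(ch) + offset) % len(alphabet)]
    return ch


def mask(value, offset=SHIFT):
    """Replace each digit or ASCII letter with its shifted counterpart."""
    return "".join(_shift_char(ch, offset) for ch in value)


def unmask(value, offset=SHIFT):
    """Undo mask() by shifting in the opposite direction."""
    return mask(value, -offset)


class PseudonymizedHooks:
    """Stand-in for the two field hooks; NOT a real Django field."""

    def __init__(self, algorithms):
        # The talk describes a tuple holding the masking and unmasking algorithms.
        self.mask, self.unmask = algorithms

    def get_prep_value(self, value):
        # Called before a value is written to, or used to query, the database.
        return value if value is None else self.mask(value)

    def from_db_value(self, value, expression, connection):
        # Called when a value is loaded from the database.
        return value if value is None else self.unmask(value)


hooks = PseudonymizedHooks((mask, unmask))
stored = hooks.get_prep_value("555-1234")
print(stored)                                   # 888-4567
print(hooks.from_db_value(stored, None, None))  # 555-1234
```

The stored value keeps the length and shape of the original phone number, which is the storage argument made in the Q&A. Because the same mask is applied to query values, an exact-match lookup on the masked column still works.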