
Private Data Anonymization with Python, Fundamentals


Formal Metadata

Title
Private Data Anonymization with Python, Fundamentals
Title of Series
Number of Parts
141
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
How can large legal document repositories be brought into the public domain without releasing private data? The fundamental concepts behind document anonymization are entity recognition, masking type, and pseudonymization. Using the Python language and a collection of libraries such as spaCy, PyTorch, and others, we can achieve good anonymization scores. How is this applied within a flow containing AI models for NER? Once anonymized, how can the result be improved by doing more text mining with Python-based apps and a human in the loop? Although it was approved in 2016, the application of the GDPR at the European level remains a challenge in banking, legal, and other contexts. This talk covers the process of transforming PDF and DOCX documents into XML, processing them using regexp and spaCy/Torch models, and how to parse these results using AntConc and Textacy. All the ideas will be supported with the real experience of the MAPA project, a European project for anonymization finished in 2022.
Transcript: English (auto-generated)
So, a little bit of history. Data protection and privacy techniques have been researched since 1850 in the United States, when the United States Census Bureau started to remove personal data from the publicly available census about United States citizens. Nowadays, everybody knows that the internet is full of data, and roughly 36% of that data is related to the medical or healthcare system, and nearly 25 to 26% of that data is related to fintech, economy, e-commerce, and similar fields. So within that data there is a lot of our personal data. Regarding this, privacy regulations such as the GDPR and CCPA have strict rules to provide strong protection for that kind of personal data. So techniques for data anonymization will enable businesses and public administrations to adhere to these regulations and protect that data from misuse or abuse. So, what is the importance of using anonymized data? Let's give an example.
Let's say that a hospital needs to share some of their patients' data in order to carry out research or a medical study. This data must be anonymized in order to protect the patients' privacy. That will include anonymizing names, bank accounts, ethnicity, sex. So, as I said, data anonymization seeks to protect private and sensitive data by deleting or encrypting identifiable information. Data anonymization is done for the purpose of protecting an individual's or a company's private activities, while maintaining the integrity of the data gathered and shared. So the anonymized data, when it's shared, should keep its whole integrity, anonymizing only the data that is sensitive.
Data anonymization is also known as data obfuscation, data masking, or data de-identification. The GDPR defines special categories of personal data. These are the categories of confidential attributes: beliefs, such as religious or philosophical beliefs; politics, political opinions, trade union membership; sex, sexual orientation or sex life; ethnicity, racial or ethnic origin; health, health and biometric data, which also includes sensitive health-related habits such as substance abuse. And then there is non-confidential information: in this case, most of the other entities. So, the MAPA project; some of you may have heard about the MAPA project before.
MAPA is the acronym of Multilingual Anonymisation for Public Administrations. The ultimate goal of the MAPA project is to develop and provide a full solution, a multilingual anonymization kit based on named entity recognition; most of you for sure are familiar with named entity recognition. And this named entity recognition should be applicable to all the European Union languages, including low-resourced languages such as Latvian, Lithuanian, Croatian, and Slovenian. And this kit will not be restricted only to European names or surnames, but will also cover the most common names in all European Union countries, with a connection to eTranslation, irrespective of whether the text is monolingual, bilingual, or with mixed languages. This toolkit should be able to detect personal data, as I said before: names, addresses, emails, credit cards, bank accounts, among others. Moreover, this toolkit will be able to anonymize this data. Thus, it will help public administrations to comply with the GDPR, particularly in the health and legal fields. So, these following names are only shown for the purpose
of proving the collaboration of many European countries, enterprises, and governments with the MAPA project. Among them there is Pangeanic, a Spanish company that provides NLP solutions, adaptive machine learning translation, data masking and anonymization, and artificial-intelligence-powered translation services. Tilde, which provides translation services, chatbots, localization, and more. The European Language Resources Association with its Evaluations and Language resources Distribution Agency (ELDA) from France, and the University of Malta, among others.
So, what has been achieved by the MAPA project so far? They already provide anonymization engines for 24 European languages, pre-trained machine learning models for the legal, clinical, and public administration domains, annotated datasets for named entity recognition with nested entities, and there is also currently good adoption of these applications in the Spanish Ministry of Justice. All of the resources of the MAPA project are currently available online and are open source, including the pre-trained models and the annotated datasets for all of the languages,
and they also provide an online demo of how the anonymization works. So, what is anonymization in the GDPR context? Article 4 of the GDPR states that personal data means any information related to an identifiable person, or data subject, who can be identified either by name, identification number, location data, online identifiers such as emails, usernames, URLs to profiles and websites, or the physical, physiological, genetic, mental, economic, cultural, or social identity data of a natural person. The GDPR also says that the processing of personal data revealing racial or ethnic origin, political opinions, or philosophical beliefs for the purpose of uniquely identifying a natural person shall be prohibited.
So, now we will show you some of the anonymization techniques and examples used within the MAPA project and our company. The general techniques to anonymize data are: using gaps, which is the replacement of the recognized entities with special characters; it can be underscores or a full block over the text. Placeholders, using alphanumeric symbols with a similar length to the replaced entity, in this case to preserve the original format of the document; if you are anonymizing a Word document, for instance, or an Excel spreadsheet, you should maintain the format of that document. Using tags, a predefined tag that preserves the grammatical information in this case. Or using pseudonyms, which is the replacement of an entity by another entity of the same type. For instance, here we see an example of using these techniques.
Take the sentence "Albus was working in Japan's GM earning two millions" — actually, Albus, I think, was doing some kind of dirty business to earn that kind of money, okay. We identify four entities: Albus is a name, Japan is a country, GM is an organization, and two millions is a quantity. Using gaps, we blank out all the entities. Using placeholders, we replace the entity, in this case with the letter X, keeping the same length as the recognized entity. Using tags, we put the tag type and the entity type in place of the entity. And using pseudonyms, in this case, the sentence becomes "John was working in Britain's BM earning four millions"; so with pseudonyms we replace the sentence completely.
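To make those four techniques concrete, here is a minimal Python sketch that applies each of them to the example sentence. The entity spans and the pseudonym table are hard-coded for illustration; in the real toolkit they would come from the NER model:

```python
# Minimal sketch of the four masking techniques on the example sentence.
sentence = "Albus was working in Japan's GM earning two millions"
entities = [("Albus", "PERSON"), ("Japan", "COUNTRY"),
            ("GM", "ORG"), ("two millions", "QUANTITY")]

# Hypothetical pseudonym table; a real one is built per entity type.
pseudonyms = {"Albus": "John", "Japan": "Britain",
              "GM": "BM", "two millions": "four millions"}

def mask(text, entities, technique):
    for surface, label in entities:
        if technique == "gap":            # blank the entity out entirely
            replacement = "_" * len(surface)
        elif technique == "placeholder":  # same length, preserves the layout
            replacement = "X" * len(surface)
        elif technique == "tag":          # keep the grammatical information
            replacement = f"<{label}>"
        else:                             # pseudonym of the same entity type
            replacement = pseudonyms[surface]
        text = text.replace(surface, replacement)
    return text

for technique in ("gap", "placeholder", "tag", "pseudonym"):
    print(f"{technique:>11}: {mask(sentence, entities, technique)}")
```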
Well, in this world of machine learning, NLP, and named entity recognition, as most of you will be familiar with, not everything is unicorns and rainbows. In the case of named entity recognition, we face many problems, including linked entities. For instance, "King Arthur" could be split into a title and a name in this case. "The Lord of the Rings": if you use that phrase, a named entity recognition model usually says that it's a title, and it's not a title, it's a work of art, a movie, a book. The ethnicity tags are a problem too; for instance, let's take a company as an example. If you say that some company is an American company, most named entity recognition machine learning models will say that "American" is an adjective, and they will not recognize "American" as an entity in this case. That is wrong. And also the addresses; addresses are a really, really common problem, because most
of the countries write addresses in a different way. So this is, at a glance, the architecture of the Python anonymization toolkit. The first step of the flow is receiving the unstructured private data as input, so the data itself. Then we decompose that data into chunks. For instance, if we receive a whole document or a whole paragraph, we split that paragraph into sentences. After that, those sentences or chunks are sent to the artificial intelligence entity detection, in this case the machine learning model. After that, we proceed to the replacement of the entities. As we saw before, we can use one of the techniques that I showed you: gaps, pseudonyms, and so on. Then we create a reversing index, and that index is used by the end user of the anonymized data in order to have a way to reverse-engineer the anonymization process. And after that, the anonymized data is sent to the client, to the final client itself.
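As a rough sketch of that flow, assuming a stand-in `detect_entities` function in place of the machine learning model (here just a toy regex for emails), the whole pipeline with its reversing index could look like this:

```python
import re

def split_into_chunks(document: str) -> list[str]:
    # Naive sentence splitter; the real toolkit uses proper segmentation.
    return [s for s in re.split(r"(?<=[.!?])\s+", document) if s]

def detect_entities(chunk: str) -> list[tuple[str, str]]:
    # Stand-in for the AI detection step (the spaCy/PyTorch model);
    # here, a toy regex that only finds email addresses.
    return [(m.group(), "EMAIL") for m in re.finditer(r"\S+@\S+\.\w+", chunk)]

def anonymize(document: str) -> tuple[str, dict[str, str]]:
    reverse_index: dict[str, str] = {}   # tag -> original value
    counter = 0
    chunks = []
    for chunk in split_into_chunks(document):
        for surface, label in detect_entities(chunk):
            counter += 1
            tag = f"<{label}_{counter}>"
            reverse_index[tag] = surface  # enables later de-anonymization
            chunk = chunk.replace(surface, tag)
        chunks.append(chunk)
    # The anonymized text goes to the client; the index stays with the owner.
    return " ".join(chunks), reverse_index

anonymized, index = anonymize("Contact Albus at albus@example.com. Thanks!")
print(anonymized)   # Contact Albus at <EMAIL_1>. Thanks!
print(index)        # {'<EMAIL_1>': 'albus@example.com'}
```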
So, which tools did we use during the development of this Python toolkit? What I like to call the holy trinity: FastAPI, Pydantic, and Uvicorn. FastAPI, because we created a REST API for the ingestion, validation, and anonymization of documents and text. Pydantic, for data validation and fast serialization. Uvicorn, as a web application server. And I also strongly recommend this book by François Voron; it's called Building Data Science Applications with FastAPI. It's a really, really nice book.
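As an illustration of how those three fit together, here is a minimal sketch of such a REST service; the endpoint name and request/response models are invented for this example (they are not the toolkit's actual API), and `anonymize` is assumed to be a function like the one sketched above:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Anonymization service (sketch)")

class TextIn(BaseModel):
    text: str            # Pydantic validates the incoming payload

class TextOut(BaseModel):
    anonymized: str

@app.post("/anonymize", response_model=TextOut)
def anonymize_text(payload: TextIn) -> TextOut:
    anonymized, _index = anonymize(payload.text)  # index stays server-side
    return TextOut(anonymized=anonymized)

# Served with Uvicorn, e.g.: uvicorn app:app --reload
```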
The URL is on Amazon; I will share the slides when we finish the talk. So, what should we expect in the future from the MAPA project and from us as a company? We are trying to improve the MAPA datasets for future public and private models, increase the data in the legal and clinical domains, and also provide custom spaCy models using deep learning with Thinc. I don't know if you are familiar with Thinc; it's from the creators of spaCy itself. And also cascaded arrangements of neural models.
Well, what remarks should we take away so far? Anonymization is still a research area without too many applications, sadly. At the same time, anonymization is a strong approach to keep documentation both available and safe. The public named entity recognition models provided by PyTorch and spaCy are excellent starting points for creating an anonymization solution. Actually, on the spaCy website they have a really, really nice example of how to start with their models, and the English model is great. For the creation of custom models, as everybody knows, we need a lot of annotated data, and in this case we can reuse the MAPA project data, revisit it, and improve it. And also, we should define our tag list really carefully, because private data is a really, really large set. It can grow as much as we let it.
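As a purely illustrative example (this is not the MAPA tag set), such a tag list might start small and explicit, precisely so it does not grow without control:

```python
# Hypothetical starting tag list for private-data entities.
PRIVATE_DATA_TAGS = [
    "PERSON", "ADDRESS", "EMAIL", "PHONE",
    "CREDIT_CARD", "BANK_ACCOUNT", "ETHNICITY", "HEALTH",
]
```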
Okay, thank you very much. Thanks, Oskar, for the great talk. We have a lot of time for Q&A,
so if someone wants to ask anything, please use the mic. Thanks. Can I use this as a library, not as an API? I'm sorry? Can I use it as a library, not as an API? Yeah, yeah, definitely. Actually, if you use the spaCy solution, you can use it as a library: you load the model and you can start using it as a library and provide your own solution. Great.
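For instance, using spaCy directly as a library, with no API layer, the NER step alone is a few lines (this assumes the small English model has been downloaded with `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # pre-trained English pipeline
doc = nlp("Albus was working in Japan's GM earning two millions.")
for ent in doc.ents:
    print(ent.text, ent.label_)      # recognized entities and their labels
```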
Hello, my name's a secret, but I still have a question. Do you intend to go further than just substituting entities with your AI model? For example, if I have a text and all of my personal data is just redacted, you can still find out it's me because of the grammar, the vocabulary I use, the similarity to other texts I've written. Is that the next step, where you can actually change the text itself? I'm sorry, can you repeat? Oh, sorry, is that on? Oh, okay, now it's working. Yeah, I was asking if you plan to go further than just substituting very specific entities like my name and my sex and my ethnicity, and, for example, also change the actual grammar or vocabulary that's used in the text. Yeah, yeah, you can actually do that. Oh, that is awesome. I like it, thanks.
Actually, one of the first and most common techniques in named entity recognition is dictionary-based named entity recognition, and that also solves this.
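Dictionary-based NER can be sketched with spaCy's PhraseMatcher, where the entities come from a fixed gazetteer rather than a statistical model; the gazetteer here is invented for illustration:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")                    # no trained model needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical gazetteer: surface forms we want to treat as entities.
gazetteer = {"PERSON": ["Albus", "John"], "ORG": ["GM", "BM"]}
for label, terms in gazetteer.items():
    matcher.add(label, [nlp.make_doc(t) for t in terms])

doc = nlp("Albus was working in Japan's GM earning two millions.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text, nlp.vocab.strings[match_id])
```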
Thank you for your talk. I have a question. As you mentioned, spaCy is a great solution for the English language; if we take, you know, French or maybe Spanish, the entity recognition becomes more tricky. Do you think that combining anonymization with synthetic data generation can reinforce or sometimes replace it? I don't know if you've tried to play around with that.
Yes; actually, Spanish is my mother tongue, so that is the cause of my really bad English. So yeah, in those cases... I mentioned the English model because it has a really, really high score; the dataset used for that model was really, really great. spaCy itself provides a lot of models. You can also use Flair; that is another framework, based on PyTorch. But the numbers are not as great as spaCy's, and Flair is really expensive in terms of resources when you are deploying that model. Usually, in these cases, the best solution is a pre-trained custom model with your own entities, with your own data, but that data should be tagged by experts in that field. So you can actually use the models provided by spaCy and, beyond that, use techniques like human in the loop and provide some kind of feedback in order to improve that data. They also provide Spanish, Portuguese, even Japanese. They even have Japanese, where, you know, the symbols make it really, really difficult to identify entities in that language,
but they handle that too. Hi, thank you for the talk. I have a question: what if the documents or the dataset that you want to anonymize contain multiple languages, like English and Spanish at the same time, or any other combination? Do you see it as, like, you just run it multiple times with different models, or is there another, smarter way? You can actually use multiple language models. At the same time? At the same time, yeah. Actually, when you provide translation services, companies usually send you a document with two or more languages, so you must be able to handle several languages in your solution. So in that case, that is already covered as well.
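One way to sketch that (this routing approach is an illustration, not the MAPA implementation) is to load several spaCy pipelines and route each chunk by detected language; the `langdetect` package is assumed here purely for the detection step:

```python
import spacy
from langdetect import detect   # pip install langdetect

# Requires both models to be downloaded beforehand.
pipelines = {
    "en": spacy.load("en_core_web_sm"),
    "es": spacy.load("es_core_news_sm"),
}

def entities(chunk: str) -> list[tuple[str, str]]:
    lang = detect(chunk)                        # e.g. "en" or "es"
    nlp = pipelines.get(lang, pipelines["en"])  # fall back to English
    return [(ent.text, ent.label_) for ent in nlp(chunk).ents]
```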
I have another question. So, the use case that you showed us fits well
for textual data; do you have any idea or a solution already for tabular data? Especially, what I'm interested in is the ability to not reverse the data, like the example we had: I think it was Netflix who published its tabular data for a competition, which was anonymized, and then someone reversed it with some tricky manipulations. Actually, one of the solutions that we are currently working on is the anonymization of databases. So, as I said, in order to anonymize data, you should always split that data into chunks.
So, for instance, speaking of tabular data, you anonymize one cell at a time. So there is no problem with that: you can anonymize spreadsheets or presentations or XML files, but that process is always on your side, not in the machine learning model. It only identifies entities, so that part should be done by you.
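A minimal sketch of that cell-by-cell approach for a spreadsheet with pandas, reusing an `anonymize` function like the one sketched earlier (the file names are hypothetical):

```python
import pandas as pd

def anonymize_cell(value):
    if isinstance(value, str):
        anonymized, _index = anonymize(value)  # one cell == one chunk
        return anonymized
    return value                               # leave numbers, dates, ... as-is

df = pd.read_excel("patients.xlsx")
df = df.apply(lambda col: col.map(anonymize_cell))  # element-wise over cells
df.to_excel("patients_anonymized.xlsx", index=False)
```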
Thanks for the very interesting talk. Thanks to you. I have a question about the MAPA project. They have models, as far as I know, for several different things: for example, legal texts, clinical texts, and so on.
Do you know if there are performance differences between these topics? Yeah, yeah. You mean performance in terms of the score of the models, the recognition? Yeah, yeah, definitely. And which is working better, which is worse? No, I don't have that answer at hand right now. And there is a lot of work to be done regarding those models and the data, by the way. Hi, thanks for the great talk. Thanks to you. It might be a silly question, but can you reverse the anonymization if you don't have the original data? Sorry? Can you reverse-engineer the anonymization if you don't have the original data sources? Yeah; usually, when you anonymize a document, for instance, you create an index. Oh, okay. When you identify an entity, for instance your name, you define an index entry for that entity, and that index is held by the same organization that anonymizes that data, in order to provide a way of reversing the anonymized data. All right, thanks. Thank you very much.