PySyft: Data Science on data you are not allowed to see
Formal Metadata
Title: PySyft: Data Science on data you are not allowed to see
Number of parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, adapt, copy, distribute, and make the work or its contents publicly available for any legal and non-commercial purpose, provided that you attribute the author/rights holder in the manner specified by them and pass on the work or its contents, including in adapted form, only under the terms of this license.
Identifiers: 10.5446/69489 (DOI)
Transcript: English (auto-generated)
00:04
I know this is after lunch, so I'll try to make it interesting. Well, I think it is interesting, and I'm really curious to know what you think. So that's my talk: PySyft, Data Science on Data You Cannot See. A little bit about myself, or as I call it,
00:23
a short summary of myself in logos. I'm a computer scientist by background. I've been working in research for many years, mostly machine learning and data science, and also working from a data science perspective with medical data. This is why I became interested in working on private data for machine learning,
00:45
and this is what I'm going to talk to you about today. Behind the mask, that's me, I promise you. I currently work with OpenMined. OpenMined is a non-profit organization, and what we try to do is to envision the possibility
01:06
of unlocking research in machine learning and data science in general, in order to use everyone's data while, at the same time, everyone remains entitled to keep their data.
01:21
So everyone can use data, but without disclosing private information. I'm also a fellow of the Software Sustainability Institute, and this is where I started getting interested in privacy-preserving techniques. I'm also a Python geek, you might have seen me at other conferences as well,
01:40
and I've been a contributor to the Python community for many years. Along the way, I also play Magic: The Gathering, just because, you know, why not? All right, so let me start this talk by saying that nowadays algorithms influence how people spend their time,
02:01
because algorithms are actually guiding our lives. And indeed, many companies are increasingly offering AI-driven products and services that do exactly that. They're very pervasive in our lives.
02:21
And the value proposition of AI in general is that an AI can advise on or automate decision-making. So we're talking household robots, self-driving cars, LLMs, personal assistants. But the problem is that an AI product is only profitable
02:41
if users can trust the product, and can trust the algorithm with their decisions more than anyone else. But what if users don't know how your AI product will behave? So the question now is:
03:02
how can you make your AI product trustworthy and reliable? The answer is pretty simple: data, all right? This is probably well known to all of you, but just to clarify: human learning and machine learning are very different things.
03:22
What you're seeing in this slide is the picture of a puppy. But what machine learning models actually see is not that picture: it's more like a three-channel image, or more specifically, a matrix of numbers. So let's just say that the two types of learning are different because we are looking at the same thing in different ways.
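To make that concrete, here is a minimal sketch (my own illustration, not code from the talk) of what a model actually receives in place of a picture:

```python
import numpy as np

# A 224x224 RGB "puppy" photo, as a model sees it: not a picture,
# but a three-channel array, i.e. a matrix of numbers
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(image.shape)   # (224, 224, 3)
print(image[0, 0])   # one pixel: three channel intensities
```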
03:44
And so we learn differently, which means we face different challenges. For example, if you want to teach a toddler how to recognize an apple, you show them three samples of an apple,
04:03
and that little human being will be able to say: okay, this is an apple, this is not an apple. For machine learning models, it's very different. Normally, you need lots of data; that's why I call these models data hungry. But what about these two pictures, if shown to humans?
04:22
Would you be able to say whether those are apples or not? Well, according to Google, they are. I have no idea whatsoever. The thing is that we, as humans, face different challenges than algorithms do. And there's a very interesting paper from long ago, from 2001,
04:41
so it's not a recent discovery. The paper is called Scaling to Very Very Large Corpora for Natural Language Disambiguation. What they did in this paper was run language disambiguation tasks with different models, different algorithms,
05:02
and with datasets increasing in size. Essentially, the conclusion of that experiment was that when you have enough data, even a simple model can achieve very good performance. So in the end, it's all about the data; it's not really about the algorithm.
05:25
Interestingly, if you think about the evolution of neural network architectures, for example: we had neural network research starting back in 1943, right? And there's been an evolution over the years until, around 2012,
05:43
all of a sudden we have deep neural networks. So we're talking AlexNet, GANs, U-Net, ResNet, you name it. These are all different architectures. So apparently, in that decade, more or less, it seemed that everything was all about the models.
06:02
This is because GPU computing came up, and also different datasets. But if you think about it, in that decade, if you read the papers from then, apart from very specific cases like U-Net, which is segmentation for medical imaging, everything seemed to be tested for state of the art on the same couple of datasets.
06:24
So all of a sudden, research shifted towards the model being more important. Well, the reality is that we had BERT in 2018, which is a large language model, and then we have GPT-2, then GPT-3, and finally ChatGPT. What I'm trying to say in this slide is:
06:43
of course you can have different architectures, different models, fine, but when it comes to the difference between GPT-3 and ChatGPT, the real difference is actually the data. This is when LLMs really started to work. And so this is just another very long and convoluted way
07:04
to say that data is very important. So we can absolutely conclude that AI models are data hungry, and given this requirement, there's a push for open datasets. And I found this slide on the internet about the GIGO effect.
07:20
The GIGO effect is garbage in, garbage out, meaning that yes, you do have data, but data needs to be of good quality. It's not a matter of quantity, it's more about quality. If you don't have reasonably good data to learn from, essentially you're not going to get anything out of your algorithms.
07:41
And so AI models are data hungry, which means there is a push for highly curated open datasets. If you want to increase the amount of algorithms and research brought to the public in machine learning, for example, you need open data, and you want high-quality open data.
08:02
But the underlying problem is that in order to answer a question about an AI system or an AI product, an ethics or safety researcher needs to see a copy of the data behind the system. Sometimes you can get a copy of the data, if the data is publicly available.
08:22
But in reality, that is not always the case, for many reasons: there could be legitimate privacy concerns, IP (intellectual property) concerns, legal constraints, or adversarial concerns. When you try to run a machine learning model,
08:41
it's not always the case that you can just download the data and do what you want with it, because data sometimes cannot leave the premises where it is collected. Research in recent years has definitely made a big push towards releasing repositories of datasets
09:03
that would be available. But the reality is that the majority of data is not open, because it cannot be open. What I'm arguing here is that most of the data that could be made publicly available has been made open already. So the remaining datasets are sitting there,
09:25
essentially unusable for different reasons. They cannot even be opened, let alone used. Sometimes even seeing the data, never mind using it, is impossible.
09:41
So essentially, what we need to do to unlock the possibility of using this data is to empower researchers to answer important questions about AI algorithms. But we need to do that without them seeing a copy of the data they use to answer those questions, and only with proper ethical oversight.
10:03
We hypothesize a process that could be called remote data science. Remote data science means that an external researcher or data scientist can create data science code and submit it to the organization holding the data.
10:22
That organization can be any entity holding some data; public or non-public doesn't really make a difference, but of course we're mostly talking about data that cannot be released publicly. So you submit the code. The organization, or someone delegated by the organization,
10:41
receives the request and reviews it, and once the request has been approved, you can download the answer to your question. Privacy-enhancing technologies are used along the way to mitigate the privacy risks. And all of that without seeing the underlying data.
11:03
For those of you not familiar with PETs, I'd just like to introduce them very briefly. Actually, I'm just mentioning them, not really introducing them at all. Suffice to say that PETs are, at the moment, a very new research field, and there are many different solutions.
11:21
PETs, if I didn't say it clearly, stands for Privacy-Enhancing Technologies, meaning that you can do data science with the guarantee that no private or sensitive information will be disclosed during the process. And there are many different technologies.
11:42
You can categorize some of them depending on whether you want input privacy, output privacy, output verification, or input verification. Some of the most popular ones you might have heard of before are k-anonymization, differential privacy, or federated learning, for example.
12:04
PETs allow you to meet the main, huge requirement we're trying to achieve in remote data science: answer questions using data you cannot see. That is something you can do with PETs. But the problem is that individual PETs
12:22
fail to decouple the use of data from the governance of data. For example, in federated learning, if you're familiar with that, you have centralized governance over distributed data; that is how it works. In differential privacy, you add noise to your data before sending it to someone else. That is the whole principle.
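As a toy illustration of that principle (my own sketch, not how any particular PET library implements it), releasing a differentially private count via the Laplace mechanism looks like this:

```python
import numpy as np

def dp_count(records, epsilon=1.0):
    """Release a noisy count: Laplace noise with scale sensitivity/epsilon
    (the sensitivity of a counting query is 1)."""
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return len(records) + noise

# The data holder releases the noisy answer, never the raw records
print(dp_count(range(1000), epsilon=0.5))
```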
12:44
With secure enclaves, joint governance over the data is possible but not required. Homomorphic encryption is also very interesting, which is data science on encrypted data, but you have to hold the private key in a centralized way,
13:00
and so on and so forth. So there are various technologies that do work in isolation, but sometimes they don't entirely fit the governance of the data, so you cannot always use them out of the box. Underneath we have the algorithms, the methods we can use,
13:21
but what is really interesting is actually the top of the slide: answer the question without seeing the data. That is the ability that really matters; the algorithms are essentially the way you can do that. And what we're trying to do at OpenMined is provide a solution that is not just a single technology
13:43
but a platform to use them all, depending on the case. In practice, the scenario we have in mind is: you have a data scientist, and this data scientist is interested in answering a specific question on some data held by the organization on the other side.
14:03
The organization, of course, retains governance over whatever is going to happen to the data. So ideally, this data scientist should study the data while it lives in, and never moves out of, the organization, and without having to actually read the data,
14:21
which seems counterintuitive: how would you possibly do data science without actually looking at the data? And that's why I wanted to present PySyft to you. Our argument is that PETs must be combined to realize their potential.
14:43
My analogy: it looks like everyone is working on car parts, but we don't have a car yet, because we don't have something to combine them all. And since we don't have that, it's difficult to envision a world with cars. So imagine instead a world with privacy-enhancing technologies,
15:04
where we can pursue a very important goal: minimize the misuse of data and maximize innovation. And this is me introducing PySyft. PySyft, first and foremost, is an open source project. It is developed by OpenMined,
15:23
which is a non-profit organization, and it's available on GitHub. It's definitely free to use and open source, and it's implemented in Python, of course. So I'm just going to show you, step by step, what the process of remote data science looks like in PySyft.
15:43
You have a researcher and an organization willing to talk in this remote data science process. First, the organization launches a PySyft Datasite. You can imagine a Datasite like a web server,
16:00
a private Apache web server for private data. The admin of the organization sets up this node; the first thing they do is load the data and metadata into the server and create accounts for external researchers, so that people can actually access the server, and then they're done and can essentially go and take a coffee.
16:24
Now that the server is ready, on the other side we have the external researcher, willing to start a submission to the organization through PySyft. What does that look like? First off, the researcher logs into the Datasite
16:46
and gets an idea of what is possible to do with the data, getting answers to a lot of preliminary questions; this is also when the data scientist actually embeds the privacy-enhancing technologies bit.
17:03
At this point, the researcher submits a request, which is a project proposal sent to the organization admin through PySyft. The admin then reviews the request, which includes two things.
17:22
On one hand, they review the purpose of the study, which is included in the request; on the other, they review the actual code that the data scientist wants to run. They execute the code, either locally or remotely, on the data, and then deposit the result once the request is approved.
17:43
At this point the result is available, and the researcher can retrieve the result of their algorithm run on private data.
18:02
The last thing I want to share before moving to a quick demo: we've been collaborating with some partners in the past, including Twitter, Dailymotion, and LinkedIn. They've had deployments of PySyft servers on their premises
18:24
in order to share data. And this is actually brand new: a couple of weeks ago we started a collaboration with Reddit, and this is really, really amazing. If you go to Reddit for Researchers, Reddit is willing to host PySyft domains,
18:44
PySyft Datasites, servers, in order to share and make Reddit data available to everyone. You may imagine how interesting Reddit data could be in terms of privacy and sensitive information, and so we're working together with Reddit
19:03
to have a PySyft deployment available for everyone to log in and use Reddit data through PySyft. Before moving to the demo: we are a community of 16,000 people on Slack.
19:22
If you want to join our Slack, that would be much appreciated; we are actually starting to give back to the community, and this is part of my commitment and what I'm doing right now. So if you're willing to join our Slack, that would be amazing. There will be a link to join,
19:45
and the QR code will be shared again later. So now for the demo part. I have five minutes plus questions; let's see what I can do. It's going to be live. I promise you I cleared all the notebooks,
20:03
and PySyft actually works in notebooks. Is this big enough for you, first off? Any better? Right, so essentially this is just the introduction,
20:20
a summary of what has been described so far. The workflow: a data scientist connects to a data owner. In PySyft there is an actual separation between two main roles, and this is what we're going to see in the demo. We have the data owner, who is in charge of setting up the domain,
20:42
setting up the Datasite and reviewing requests, and the data scientist, the external researcher who wants to access the Datasite. If you look at it from a data perspective, in this picture here at the bottom, essentially this is a request-response kind of workflow,
21:04
and I'm going to tell you in a second how the data is organised, because that is what effectively explains the workflow in practice. So let me jump directly into that. All right, PySyft can be deployed in many ways,
21:21
but the interesting bit is that PySyft includes a development server, meaning that you can spawn a PySyft server directly, locally; you don't have to deploy anything. This is entirely intended for development purposes, very similar to what you have in the Django framework, for example: you have the development server, you don't need to do anything,
21:42
it just works. So, let me just reduce the code a little bit: I'm spawning the server here, running locally on my machine. I call it europython-testdomain, and it's running now, so I can connect as an admin.
22:04
I'm using the default credentials at this point, and you get all this rich output if you're running in notebooks: welcome to europython-testdomain, with all the information and things you can do. We can get access to new projects, requests, users, or datasets, like we're doing.
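In code, this setup looks roughly like the following; this is a sketch in the style of the syft 0.8.x API, and exact names and defaults vary between versions:

```python
import syft as sy

# Spawn an in-process development server, Django-runserver style:
# nothing to deploy, it runs locally on this machine
node = sy.orchestra.launch(name="europython-testdomain", reset=True)

# Connect as the admin using PySyft's default dev credentials
admin_client = node.login(email="info@openmined.org", password="changethis")
```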
22:21
To set up the dataset, we want to start uploading data, and this is where it gets interesting in PySyft. In PySyft, every dataset is composed of two parts. This is the data owner's view of things,
22:40
since someone is in charge of managing the dataset. We have two versions of the data: the real, true data, which will always be available only to the data owner, never to the data scientist, and what we call mock data, a synthetic, artificial version of the data
23:07
that will always be made public. The only purpose of mock data is to give external users something to work with when they need to prepare their algorithms. Let me show you an example of what I mean.
23:22
Let's say we have the breast cancer dataset from scikit-learn, something very simple that doesn't require internet access. I load the data here, and this is what it looks like. First off, let me create two very simple mock versions of this data.
23:44
Let me zoom in a little bit. One, for the features, simply adds the mean of each feature plus random numbers; for the labels, we scramble them, so that if there was any pattern, there is no pattern anymore.
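A sketch of one way to build those two mock versions along the lines described (the exact recipe in the demo may differ):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(seed=42)

# Mock features: per-feature mean plus random noise, so shapes and
# scales look plausible but no real record is present
X_mock = X.mean(axis=0) + rng.standard_normal(size=X.shape)

# Mock labels: a scrambled copy of the real labels, destroying any
# pattern that links features to labels
y_mock = rng.permutation(y)
```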
24:04
So we have two versions of the dataset, and this is what datasets look like in PySyft. Let me zoom in a little so you can see better. A dataset is a collection of multiple assets, and this abstraction is very interesting to me
24:22
because it tries to encapsulate different use cases. For example, you can have a dataset composed of features and labels, two different assets. You can have a dataset composed of multiple assets, as in longitudinal data, data taken at different times; it's a very flexible representation.
24:43
What we're going to do here is create a Syft Dataset object and two assets; when we create the assets, we set both the real data and the mock version of the data, and when we're done, we add the assets to the dataset,
25:01
and this is what the dataset looks like. We get the rich output here: we have a description and a name, and the dataset is composed of two assets, the features and the labels. The last thing we need is to upload this dataset to the Datasite, and it is now officially available online.
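The corresponding code is roughly as follows (again a syft 0.8.x-style sketch, reusing X, y, and their mocks from above):

```python
dataset = sy.Dataset(
    name="Breast Cancer Dataset",
    description="scikit-learn breast cancer data: features and labels",
)

# Each asset pairs the private data with its public mock counterpart
features_asset = sy.Asset(name="features", data=X, mock=X_mock)
labels_asset = sy.Asset(name="labels", data=y, mock=y_mock)
dataset.add_asset(features_asset)
dataset.add_asset(labels_asset)

# Upload to the Datasite: real data stays server-side, mock is public
admin_client.upload_dataset(dataset)
```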
25:24
And this is the data we have online: one dataset, two assets, and all the metadata information here. The last thing we need before moving to the data science notebook is to create access credentials. We have Owen, the data owner,
25:41
and Rachel, the data scientist, playing the two roles here. So now we create a new user. As I said, I'm using default credentials, so we have Jane Doe; and Rachel Science is someone with the data scientist role who wants to access the domain.
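Creating that account might look like this; the email and password here are made-up placeholders for the example:

```python
# Register an external researcher, who gets the data scientist role;
# email and password are hypothetical
admin_client.register(
    name="Rachel Science",
    email="rachel@datascience.org",
    password="syftrocks",
    password_verify="syftrocks",
)
```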
26:00
Let's now see what things look like from a data science perspective. We log into the node. Sorry, I forgot the import: import syft as sy. There we go. Yeah, sorry,
26:22
it was defined above. Right. Okay, so we're logged in. Once logged in, we see that there is a dataset. We access the dataset by name, which is unique in the database, and we see that the dataset is indeed composed of two assets.
26:40
We get the assets, and we actually get access to the mock data. So far so good. Mock data is supposed to be public, but when it comes to accessing the real data, as you can see, we're not seeing anything. And just to clarify: what features and labels really are is just pointers to what is in the repository.
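On the data scientist's side, that exploration looks roughly like this (same sketch-level caveats as above, reusing earlier names):

```python
# Log in with the data scientist credentials created earlier
ds_client = node.login(email="rachel@datascience.org", password="syftrocks")

# Browse available datasets; the demo looks the dataset up by its
# name, which is unique on the Datasite
bc_dataset = ds_client.datasets[0]
features = bc_dataset.assets[0]
labels = bc_dataset.assets[1]

print(features.mock)   # public synthetic data: freely inspectable
features.data          # real data: only a pointer, access is denied
```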
27:01
Actually, since we're running the local development server with verbosity on, you can see that as soon as I execute something here, you get some information in the log of requests made to the local server. So let's say I want to prepare a machine learning experiment.
27:22
Something very simple, nothing like rocket science: data partitioning, normalization, model training, and metrics calculation. This is a function expecting features, labels, and a seed for reproducibility. I run this function, I get these numbers, so far so good.
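The talk doesn't show the function's exact body, so here is a minimal stand-in with the same shape, first run locally against the public mock data:

```python
def ml_experiment(features, labels, seed=12345):
    # Partition, normalize, train, and score: a deliberately simple
    # experiment to demonstrate the remote execution workflow
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, model.predict(scaler.transform(X_test)))

# Dry run on the mock data
print(ml_experiment(features.mock, labels.mock))
```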
27:42
This is working locally. Now let me turn it into code that can run on PySyft. The way you do that is very straightforward; you need two things. Essentially, PySyft thinks in terms of closures; if you have experience with Kubernetes or MLflow, it's a similar idea.
28:07
When you want to package some code and send it to the server, you have to make a closure, meaning you need a function with all its inputs and dependencies in the body. So we move everything inside the body so that the code is self-contained.
28:21
The only other thing you need to make it a PySyft function is a decorator, sy.syft_function_single_use, in which you specify the data that should be used by this function.
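For the stand-in experiment above, a sketch of the decorated version (same caveats about syft versions):

```python
# The decorator's keyword arguments bind the function's parameters
# to the asset pointers fetched earlier
@sy.syft_function_single_use(features=features, labels=labels)
def ml_experiment(features, labels):
    # Closure style: every import lives inside the body, so the code
    # ships to the server as a self-contained unit
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, random_state=12345
    )
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, model.predict(scaler.transform(X_test)))
```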
28:45
Making it a Syft function and inspecting its metadata, we see that what it has become is some code that we want to submit. Then we create a project (I'm jumping ahead because I'm almost out of time), we attach this code request to the project,
29:00
and we send it to the server. If we try to run the code right away, we get an error, because it says you're not allowed to run it: you need permission.
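A sketch of that submission step, reusing the names above; the exact Project API differs between syft versions, so treat this as indicative:

```python
# Bundle the code request into a project and send it for review
project = sy.Project(
    name="Breast cancer ML experiment",   # hypothetical project name
    description="Train and score a simple classifier on the private data",
    members=[ds_client],
)
project.create_code_request(ml_experiment, ds_client)
project.send()   # some syft versions use .start() instead

# Any attempt to run before approval is rejected with a permission error
ds_client.code.ml_experiment(features=features, labels=labels)
```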
29:22
So, super briefly (and I'm going to share this on Discord so all of you can reproduce it): the data owner now connects and sees that there is a pending request from a data scientist, with all the information here. We take the request, look at the code to be run, and assuming it's all good: running it locally works, and running on private data works.
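A sketch of that owner-side review plus the final retrieval by the data scientist, again reusing the names above; accept_by_depositing_result follows the syft 0.8.x pattern for depositing results:

```python
# --- Data owner side: review and approve ---
request = admin_client.requests[0]     # the pending code request
print(request.code)                    # inspect the exact code under review

# Run the reviewed function on the real, private assets
real_result = request.code.unsafe_function(
    features=features_asset.data, labels=labels_asset.data
)
# Approve the request by depositing the real result
request.accept_by_depositing_result(real_result)

# --- Data scientist side: the same call now succeeds ---
result = ds_client.code.ml_experiment(features=features, labels=labels)
print(result.get())                    # the deposited accuracy score
```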
29:40
I deposit the result, fine. And the last bit: I'm the data scientist again, I get the reference, the request is approved, I can run the code and get the result, and everything works. All right, so let me just jump to the last slide. Thank you so very much.
30:08
Thank you. Please reach out to me on Discord, I'm here the whole week, and join our Slack if you want. I'm very open to questions and I'm always around, so please feel free to stop me. Thank you so much.