On the Edge of Human-Data Interaction with the Databox
Formal metadata
Number of parts: 254
License: CC Attribution 4.0 International: You may use, modify, and reproduce the work or its contents in unmodified or modified form for any legal purpose, distribute it, and make it publicly available, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/53132 (DOI)
Transcript: English (auto-generated)
00:20
So, the talk is something really, really cool, I think, and I'm really looking forward to it and to getting a glimpse of what it's all about. So, the talk is about a research project which lets users see and control what's done with their personal data.
00:42
At least that's what I read in the description of the talk, and I'm really, really looking forward to hearing some more details about this from Mort, who is presenting this talk, and he's gonna be talking about the platform design, about the implementation, and the current status of this Databox thing,
01:03
and please give a really warm round of applause to Mort. Thank you, and thank you for having me. Before I start, I shall begin by apologizing. I have small kids, so it's permanently flu season in my house. If I start coughing uncontrollably, just bear with me.
01:24
What I'm gonna do is talk a bit about the Databox project. This is a project that was funded by the UK Research Council, EPSRC. It's a collaboration between the University of Cambridge, Imperial College London, and the University of Nottingham with a number of industrial partners, one of whom, the BBC, I'll mention in the talk.
01:41
So, to set the scene a little bit, I probably don't need to say this very much at this particular venue, so you may just wish to go to those Tumblr sites, which I thought were quite funny: "Big Data Pics" and "We Put a Chip in It". But we're now in a big data world, so data is collected all around us in the environment from what we do, our retail habits,
02:02
sensing, IoT things in our homes. All around us, data is being collected. There's a lot of opportunities and challenges that are presented by this. You can imagine a great deal of personalization, personal optimization, things you can do to make your house more energy efficient, for example. There's lots of things you can do that are beneficial from this sort of data,
02:21
but there's also a lot of challenges that are presented, particularly around privacy, around the rights of the individual to control and see what's happening about them. I did warn you, sorry. The nature of this sort of collection is that it's building up large collections of very rich, often quite intimate data in large silos.
02:45
Some of the sensors that you can see in the top left there, you've got the sorts of things you might expect, retail, social networks, Nest thermostats, but nowadays, more intrusive things: medical devices, things that are monitoring incident levels, heart rate, and so forth.
03:02
It's very rich, it's very intimate data that's now possible to be collected. So the challenge that we pose ourselves in this research project was really what can we do to allow data subjects to control the collection and exploitation of data, particularly data that is what you might think of as their data, so data that's yours,
03:21
that you somehow own, and also data that's collected about you, that you might not have such direct control over. So that's the context, how to enable data subjects to control collection and exploitation of their data and data about them. This is taking place in an existing ecosystem which is very much focused around the idea that we want to move data around.
03:40
Typically, we want to move data into the cloud. Data tends to get pushed out there. There's some data that you might expect to start in the cloud: you post something to Facebook, it's on Facebook's computers, that's not a surprise. On the other hand, there's a lot of IoT devices that you might think could very well keep the data more local to where they're deployed.
04:00
You might think that data about your house could stay in your house if that's what you wanted to happen. And yet, by default, a lot of them will push that data out to the cloud. Even if they subsequently give it back to you in some way, it will end up out there on somebody else's computer. And this seems to be, to my mind anyway, a structural problem with the way that we build systems nowadays. The internet has become very fragmented.
04:21
It's difficult to build effective, robust, efficient distributed systems across the modern internet. And it's much easier just to centralize things. And the cloud allows us to centralize things. We can just stick it all out there in some system that somebody else runs, as the sticker says, on somebody else's computer. So we're defaulting to moving data into the cloud in order to process it.
04:41
It makes the processing much easier as well if that data is centralized. The starting point for thinking about this was when I rejoined academia in about 2009 and joined a research institute at Nottingham called Horizon, Horizon Digital Economy Research. That was focused very much around this notion of digital footprint and what could we do with these digital footprints
05:01
that we're creating or we're starting to create at that point. And it was quite an interdisciplinary center. So there were people there from sociology, mathematics, engineering, computer science, from all over the piece. And a lot of my colleagues essentially said, if you could build us a magic context service, we could do great things with it. If we just knew the context of the user,
05:21
then we'd be able to do all sorts of fun and interesting interactions. And we had a number of discussions about this where my response would often be, well, yes, but what is that? I don't know what a context is; I don't really know what the context of the user is. What do you mean when you say you want to know the context of the user? And it eventually became clear that it wasn't quite well-defined what that was,
05:42
but it definitely involved using personal data. It was definitely going to be possible to construct this from the personal data that could now be collected from sensors, from social networks, from interactions. So the endpoint I came to with that, being a lazy computer scientist, was really to punt on the hard problems. So I wasn't going to try to define what the context was,
06:01
because that seemed difficult. But what I did say was, well, if you give me some piece of code that encodes what you think the context is, then I'll try and create a platform that will execute that for you and so return to you what you've defined the context to be. So I punted on the problem. And that gave rise to a thing that we called DataWare, which was essentially a service-oriented architecture
06:20
for trying to do personal data processing. So the idea was that the data processor would write some piece of code that would process the data subject's data. The subject would provide the platform on which they could execute that code, and the processor would receive the result. And the point here was that we were now moving the code to where the data was rather than moving the data to where the code wished to execute on it. So we're not pushing the data into the cloud anymore,
06:42
trying to take the code and push that to where the data starts. This was the sort of picture we had at the time of DataWare. So you've got a sort of, well, overly complex, certainly fairly complex request and permission process here. So the data processor requests permission through some mechanism, gets granted permission to do some piece of processing,
07:00
and is then able to push the piece of code they want to execute onto some platform where the data's made available, and then the results go back to the data processor. So that was sort of DataWare, excuse me, DataWare v1. However, when we started to try and build this and try and think about how it might be used,
07:20
it became clear that there was lots of complexity in terms of the interactions you might wish to support on such a system. So there's lots of ways you can construct interaction around this. One obvious way that's received some interest is the idea that people might pay you to use your data, but there's lots of other things that you might wish to happen there. There may be many situations where you want data to be processed, but it's not appropriate for somebody to pay you,
07:40
or another member of your family. It may not seem sensible for them to pay you to use your data. And there was little in the way that DataWare was constructed that actually said anything about how this was gonna happen. So in the case of being paid to use data, exactly what were you being paid for? What sort of use was going to be made of your data? What was gonna happen then? So DataWare was a proposal
08:01
that would support some forms of interaction. It basically gave you a kind of transactional nature where you had a transaction between parties in terms of this request, granted permission, and then possibly some ability to see what had happened afterwards. But there were a lot more things that we could consider. And so we sort of abstracted and stepped up from the problem a little bit, and stepped away from DataWare, and started to think more generally
08:20
about what is it that's going on in this sort of system. And we coined this idea of human-data interaction by analogy with human-computer interaction. So I think, and I'm neither a historian nor a proper HCI person, but my understanding is that HCI has essentially moved, human-computer interaction has moved as a field of study away from where it started,
08:40
which was the idea of a single individual using a single computer. And it's kind of moved towards a collaboration between individuals using computers. And it's now in the sort of world where you're thinking about ubiquitous computing, where it's not necessarily obvious which the computer is you're using. And so human-data interaction tries to take that a step further and say, well, in fact, it's now about the data.
09:00
It's not really about the interaction with the computer anymore. It's about how you're represented in the data and what the data is used to do to you and for you. And so the very high-level model that we have here is that you have some personal data that is collected. Analytics are performed on that data. They process it in some way. That allows you to draw some inferences
09:20
to work out something about what that data says. And as a result of that inference process, some actions are taken. Actions might be to feed back into further analytics, feed back the inferences you've made, or actions might be nudges, things that might change your behavior and thus change the data that's generated in the future. Even in this very simple model, there's a couple of feedback loops that can take place.
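The collect, analyse, infer, act loop just described can be sketched in a few lines. Everything here is illustrative; none of these names come from the Databox API:

```python
# Illustrative sketch of the HDI loop: collect -> analytics -> inference ->
# action, where the action is a nudge that may change future data.

def collect(day: int) -> dict:
    # Stand-in for sensor/retail/social data collection.
    return {"day": day, "steps": 4000 + 500 * day}

def analyse(record: dict) -> float:
    # Analytics: reduce raw data to a metric.
    return record["steps"] / 10000

def infer(metric: float) -> str:
    # Inference: what does the data say about the subject?
    return "active" if metric >= 0.5 else "sedentary"

def act(inference: str) -> str:
    # Action: a nudge that feeds back into future behaviour, and thus
    # into the data collected tomorrow -- one of the feedback loops.
    return "keep it up" if inference == "active" else "try a walk"

nudges = [act(infer(analyse(collect(d)))) for d in range(3)]
print(nudges)
```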
09:41
And it's in this kind of space where data processing systems and data processing computations are taking place. And we felt that the systems that we were seeing and the systems that at that point we were trying to build were lacking in three key aspects which underpin this idea of human-data interaction or HDI. The first was legibility.
10:01
So it's clear that most people, I think, most of the time are generally unaware of what the sources of data that might be collected about them are, of where the data can come from, generally unaware of the analyses that might be performed on those data, and generally unaware of what the implications of those analyses are. So understanding what's going to happen to you
10:22
in the future on the basis of actions that you've taken now or in the past that are now represented in some data set somewhere, possibly with some degree of inaccuracy, is not necessarily clear. It's not legible. It's not easy to see and understand what's happening in these systems. The second thing that seemed to be missing was agency.
10:40
So agency is the capacity to act in a system. So we are often unaware, certainly I think I am unaware, I can't speak for anybody else, of the means that I have to affect the data that's being collected about me. There are some things I think I can do to try and control what data's collected. I can block cookies in my browser. I can use Brave. I can turn on all the other privacy things.
11:02
But that only controls the data that's collected about me to some extent. It might be much less clear for me to know how I can affect this as I move around a smart city environment or a smart environment, for example. It's not always obvious to me what I can do to affect the analyses that are being performed on those data that have been collected about me.
11:22
And in both of these cases, that's even if I know that these means to affect these things exist at all, and I can be bothered to employ them because it may well be complex or difficult to employ these things effectively. So we lack agency. We lack the capacity to act. And then the third thing seemed to be a rather ugly word, negotiability.
11:42
So this is essentially trying to capture the notion of supporting the dynamics of interaction, the idea that when you make a decision, it doesn't necessarily remain your decision in this system until forevermore. You might want to change your view on things. You might want to change the way that you interact with the system, either as you learn more about it or as your behavior changes or as your environment changes for whatever reason.
12:03
So current systems still tend to tie us into this kind of binary terms of service. You click the box to say yes, and then you're done. And you don't really get a chance to go back and revisit that. Maybe nowadays, you're starting to see more and more the idea that you can at least completely withdraw from a system. So you can be in the system or out of the system,
12:20
but it's often not really possible to control what's going on in terms of your interaction with the system over time. So that gave rise to this idea of Databox, which you can think of in some ways as DataWare version two. So this is still taking the idea that you want to move the code to the data. This allows you to minimize data release. It allows you to retain more control
12:41
over what's done with the data because it's running on a device that is under your control. So at the end of the day, if you really want to, you can just turn it off. And then you know that the data is not being processed anymore. We tried to pay a bit more attention to how access to data, local or remote data was going to be mediated. We went to some efforts to try and make sure that we could control all the internal and external communication
13:02
and that we could log all the I/O that takes place, following the idea that I don't really care what computation you do on data about me, so long as you never see any result from it. And if the computation just runs on a device somewhere and then gets thrown away, has anything really happened? You know, the computation, I don't know,
13:20
runs in the wood and the tree falls on the computer. Did anything take place? So if I can log everything that goes on in terms of what's communicated from that device to the outside world, then I can, in some sense, even if things go wrong, I might be able to go back after the fact and figure out what happened, figure out what leaked and why it leaked and when it leaked.
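That "log everything that leaves the device" idea amounts to forcing every externalised result through a single audited choke point. A minimal sketch, assuming a hypothetical release() gate; nothing here is real Databox code:

```python
# Sketch: every result that leaves the box passes through one gate that
# records what was sent, by which app, to whom, and when -- so leaks can
# be reconstructed after the fact.
import json
import time

AUDIT_LOG = []

def release(app: str, destination: str, payload) -> object:
    # Record the externalised result before it leaves the box.
    AUDIT_LOG.append({
        "ts": time.time(),
        "app": app,
        "to": destination,
        "payload": json.dumps(payload),
    })
    return payload  # hand over to the (hypothetical) network layer

# The fraud-detection example: only a yes/no answer ever leaves the box.
release("fraud-check", "bank.example.com", {"in_country": False})
print(len(AUDIT_LOG), AUDIT_LOG[0]["app"])
```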
13:41
So the sort of model we have with Databox is this kind of application. So this is a sort of fraud detection application. So some person called Henry downloads a bank's app onto his Databox. Later on, there's a large transaction made in some foreign country against his credit card. The banking application is able to check Henry's location by saying, are you located in that country where this transaction took place?
14:01
The Databox is able to say no. And then the bank can deny the transaction and so the fraud is prevented. This hasn't revealed to the bank where this individual called Henry is. There's been no release of that information. It's simply been able to say no, not where that transaction claims that he is. So this is trying to minimize the data release
14:21
that takes place. So how is Databox implemented? So the model here is that we're essentially installing apps that process data locally. So we're following the app metaphor from smartphones. Apps process data. We also have a notion of a driver,
14:41
which is something which either ingests or releases data. And there are manifests associated with each app, and they describe the data that's going to be accessed by that app. And that will be turned into a concrete SLA, as we've called it, when you install an app. Some of the terminology is a bit horrible; I apologize for that too. So the sort of thing that might happen there
15:00
is you have an app that wants to have access to your smart light bulb data. So that's in the manifest: access to smart light bulb data. When you install it, you're able to control which light bulbs it gets access to. It can have all the downstairs light bulbs but not the upstairs light bulbs in my house. Okay, so that's the ability for the user to exercise some control over what's actually being revealed there about them
15:21
and what they're happy to share in that moment for that application. For all the components in Databox, we were using containerization as a lightweight virtualization technique. This gave us a degree of platform independence, a degree of isolation between running components, and the ability to make the management
15:41
of this kind of system easier because there's quite a lot of moving parts here. And so being able to manage things in a fairly homogenous manner seemed useful. When I say platform independence there, that kind of bit us slightly for a couple of weeks because it turned out we were getting bug reports from a user where they were finding that things weren't working and it took us some time to figure out
16:01
that the reason things weren't working is because they were running it on Windows using the Docker for Windows tool that had come out recently. And we didn't realize therefore that that's why the shell scripts weren't working because they were not in a Unix environment. The containers were running and they could get the containers running when they did it by hand but all the startup scripts did not work.
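The manifest-to-SLA narrowing described a moment ago, where the app asks for a type of data and the user grants only a subset of matching sources at install time, might look something like this. The schema is invented for illustration and does not match the real Databox manifest format:

```python
# Hypothetical illustration of manifest -> SLA narrowing at install time.

# The developer's manifest asks for a *type* of data...
manifest = {
    "app": "mood-lighting",
    "requests": [{"type": "smart-light-bulb"}],
}

# ...the box knows which concrete sources exist...
available = ["kitchen-bulb", "lounge-bulb", "bedroom-bulb"]

# ...and the user grants only some of them (downstairs, not upstairs).
user_grant = ["kitchen-bulb", "lounge-bulb"]

def to_sla(manifest: dict, available: list, granted: list) -> dict:
    # The concrete SLA names exactly the sources the user approved.
    sources = [s for s in available if s in granted]
    return {"app": manifest["app"], "granted_sources": sources}

sla = to_sla(manifest, available, user_grant)
print(sla["granted_sources"])
```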
16:22
There are four core components to the platform. There's a thing called the container manager, a thing called the arbiter, a thing called the core network and then many things called data stores. The container manager is the thing that manages the containers unsurprisingly. So it manages container lifecycle in particular.
16:40
It's one of the things that starts up first and then after that it controls which apps are running, which drivers are running, how things are connected and basically kicks everything off. The arbiter is the container that produces the tokens that we use for access control. And the format of those tokens is a thing called the macaroon. Who's heard of macaroons? Not the biscuits, one or two.
17:02
So macaroons, to reuse the pun that the authors used, are better cookies. They're essentially access control tokens that you can delegate, and you can attach constraints to them when you delegate them to other parties. The data stores provide a persistent storage facility
17:24
so we can monitor everything that's being recorded and used by each application. They also provide a middleware layer, so communication happens via these data stores; it's a ZeroMQ-based middleware layer. And each store that's created gets registered in a Hypercat catalog that exists on the data box
17:41
and then the idea is that that provides a degree of discoverability so an application is able to find out what it is that this data box has and therefore whether it's going to be able to support what that application needs. And then finally, right at the center, a thing called core network essentially tries to manage network connectivity for each application. And we sort of hacked that together in Docker world
18:01
by providing a unique virtual network interface for each application which is connected only to that application container and the data store for that application and to the core network. So we can intercept all of the communication that takes place for any application. So we can make sure that we log everything. We can make sure we prevent anything happening that we don't want to happen.
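The macaroon idea mentioned above can be illustrated with nothing more than an HMAC chain: each caveat is folded into the signature, so a token holder can narrow a token before delegating it but can never widen it or strip a caveat off. This is a minimal sketch of the technique, not the actual Databox or libmacaroons implementation; the caveat language (plain strings checked against a context set) and all the names here are invented for the example.

```python
import hashlib
import hmac

def _chain(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def mint(root_key: bytes, identifier: str):
    """Arbiter side: mint a token whose signature chains off a secret key."""
    return [], _chain(root_key, identifier)

def attenuate(caveats, sig, caveat: str):
    """Holder side: anyone can add a caveat, narrowing what the token
    permits, without knowing the root key."""
    return caveats + [caveat], _chain(sig, caveat)

def verify(root_key: bytes, identifier: str, caveats, sig, context) -> bool:
    """Verifier side: recompute the chain and check that every caveat
    is satisfied in the presented context (a set of strings here)."""
    expected = _chain(root_key, identifier)
    for c in caveats:
        expected = _chain(expected, c)
    return hmac.compare_digest(expected, sig) and all(c in context for c in caveats)

# The arbiter mints a token for an app; the user's install-time choice
# becomes a caveat restricting it to the downstairs bulbs only.
caveats, sig = mint(b"arbiter-root-key", "app:light-usage-report")
caveats, sig = attenuate(caveats, sig, "datasource = lightbulbs-downstairs")
```

A data store presented with the caveat list and signature can verify them against the shared root key; dropping a caveat invalidates the signature, which is what makes delegation safe.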
18:24
As I mentioned, apps and drivers in fact come with a manifest. This basically describes origination metadata: it says what the application is going to need in terms of data access, what its storage requirements are, and whether it's going to need to do any remote accesses, that is, whether it needs to talk to anything else on the box or anything off the box.
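As a rough illustration, a manifest and the concrete SLA derived from it at install time might look like the following. The field names and shape are invented for this sketch; the real Databox manifest format differs.

```python
# Hypothetical manifest -- field names are illustrative, not the real schema.
manifest = {
    "name": "light-usage-report",
    "data-sources": [{"type": "smart-lightbulb"}],  # what the app asks for
    "storage": "10MB",                              # its storage requirement
    "external-access": [],                          # no off-box communication
}

def to_sla(manifest: dict, granted: dict) -> dict:
    """Turn a manifest into a concrete SLA by recording the user's choices,
    e.g. granting only the downstairs bulbs to a smart-lightbulb request."""
    sla = dict(manifest)
    sla["granted-sources"] = [
        {"type": src["type"], "instances": granted.get(src["type"], [])}
        for src in manifest["data-sources"]
    ]
    return sla

# At install time the user narrows the request to specific devices.
sla = to_sla(manifest, {"smart-lightbulb": ["kitchen", "hallway"]})
```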
18:42
The distinction between apps and drivers is essentially that drivers can talk to things that are not only on the data box: they can talk to things off the data box. So that's how you get data in and out of the system. The installation process, as I've hinted: the user tries to install the application, and they say, yes, you can have access to these data sources
19:02
and that causes particular tokens to be generated and given to that application; that application is then connected up to the right network devices, and the containers are all then started. The tokens that the application has been given allow it to present those tokens to the different data stores in the system, and the data store can then verify
19:21
that this application has indeed been permitted by the user to access that data. So that's the sort of mechanism for access control. I'll move fairly quickly through this in the interest of time but this is a description of the middleware layer that we have, which is based on some standardized protocols,
19:43
CoAP running on top of 0MQ. We have a Git-like backend to this so it records everything and it supports JSON, text and binary data. There's a degree of security that we attempt to provide with the intent that at some point in the future we might like to distribute this across multiple devices so you'll want to be able to secure
20:00
the communication between data stores, and the main reason for doing this is that the first version we started out with was hacked together very quickly, using a straightforward HTTP REST-style API with Node.js, and that was not suitable in terms of supporting relatively high-frequency sensor data or the limited memory footprint that we have
20:21
on things like Raspberry Pis. This is much more effective in that sense. So what can you do with the data box? What could you do with the data box? Among the interactions that we can support and that we think we should be able to support better, you can do things with a physical device
20:40
that you can't do so easily with things that are in the cloud. Physical devices are often easier to reason about because you can see them: you can simply glance at one and see what the configuration is. You can imagine situations here where, for example, we might set this up so that access to smart metering data is only gonna be permitted if the green tag has been inserted into my data box
21:02
and my partner's blue tag has been inserted into my data box. So we've both agreed that that data can be shared. Or where the green tag is in the data box and we're both located in the house so we're both proximate to it. So you can set up much richer sorts of ways to control access to data. And this maps quite nicely to notions
21:20
of physical access control which most people have a pretty reasonable understanding of because we're used to doing things like locking windows and locking doors and so forth. One of the members of the team built a thing using what's essentially a hacked up version of IBM's Node-RED. So this allowed you to assemble data box applications by dragging and dropping data sources
21:42
and computational units, linking them together. And then you could essentially click the button somewhere, somewhere off the bottom of the screen and that would take what you've produced and build that into a container and publish that to the App Store. And so building applications is fairly straightforward with this sort of environment.
22:03
We also did some work on looking at richer visualizations of data. So you can take an SVG image for example and break it up into its component parts and then describe transformations so that as the data comes in, it animates the SVG according to what those transformations are that you've described. I think one of the earlier demos of this had an SVG with a cartoon picture of a particular American president
22:23
and then when tweets came in, that would cause parts of the face to animate according to some simple sentiment analysis of the tweets. So you can perhaps make data more legible by doing richer visualizations, making it more obvious, more explicit, what's happening and what's represented in the data.
22:41
This is a piece of work which unfortunately stalled. But with a PhD student, I was looking at what you could do in terms of generic measures of risk. So the idea that a lot of the data sources you might see in such a device are going to be time series, time series of floating point numbers essentially, temperature readings, humidity readings, air quality readings, whatever they might be.
23:02
The question we were exploring was: is it possible to take a time series like that, treating it simply as a time series without any semantic information about what those numbers represent, and just look at various measures of entropy, statistical measures, also correlations and so forth, to see whether or not there's in some sense
23:22
risk associated with giving access to that data. So how much information is contained in that time series in a statistical sense? Is it possible to say for this application, it's asking for access to that data at too high a frequency, it's gonna be able to find out too much. Whereas this other application only wants to see an average over every three months and therefore that's fine,
23:41
I don't really care what that says. The initial results were somewhat promising in this sense, and if you could construct things like that, perhaps you could then start to put those results together and say, well, application A is okay and application B is okay, but they come from the same publisher. So if you install both of those applications together,
24:00
you may be revealing a lot to that particular data processor. Another thing which sort of pops straight out of this idea that we want to atomize data and push it out to all these different data boxes is that it's difficult now to do big data analytics in the traditional way you might expect, where you put all the data into the cloud.
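The entropy-based risk idea described a moment ago can be sketched in a few lines: quantize a bare time series into bins, compute its Shannon entropy, and compare a high-frequency release with a heavily averaged one. This is only an illustration of the kind of measure involved, not the project's actual metrics, and the synthetic "temperature" trace is invented for the example.

```python
import math
from collections import Counter

def shannon_entropy(values, bins=16):
    """Shannon entropy (in bits) of a series after quantizing into equal bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # guard against a constant series
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def averaged(values, every):
    """Release the same series at coarser granularity: window averages."""
    return [sum(values[i:i + every]) / len(values[i:i + every])
            for i in range(0, len(values), every)]

# A synthetic temperature-like trace: a slow cycle plus a small weekly pattern.
readings = [20 + 5 * math.sin(i / 10) + (i % 7) * 0.3 for i in range(1000)]

fine = shannon_entropy(readings)                   # high-frequency access
coarse = shannon_entropy(averaged(readings, 100))  # long-window averages
```

In this bare statistical sense, the fine-grained release carries more information than the averaged one, which is the kind of comparison that could underpin a "this app asks at too high a frequency" judgement.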
24:21
So we were starting to think about, and have been looking a little bit at, how to do small data analytics: the idea that you might do some of the computations first, while the data is still private, and only subsequently try to aggregate the data. So you don't need to build up these vast data lakes of data about everything and everyone. Instead you try to, again, minimize data release,
24:42
do as much of the processing as you can while the data is kept private and only later on start to aggregate results together. We had a couple of goes at this, one of which was essentially looking at pre-training models using a small, hopefully statistically representative, sample of users' data
25:01
and then taking those pre-trained models and pushing them out to lots of different locations. And then in those individual data boxes, you can refine those models and specialize the training of them onto those particular individuals whose data is now being used. This gets you essentially further faster in terms of the accuracy of those models.
25:20
The long-term goal was to think about how you would actually do machine learning, or other forms of statistical analysis of data, at scale. So if you've got a data box for every house in the country (in the UK, I think it's about 30 million households), how are you gonna run a computation across such a large-scale set-up as that?
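The pre-train-then-specialize approach described above can be illustrated with a toy one-dimensional linear model trained by stochastic gradient descent. Everything here (the data, the model, the learning rate) is invented for the sketch; the actual work used real models and real samples.

```python
def sgd_fit(w, b, data, epochs=200, lr=0.1):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# Pre-train centrally on a small, hopefully representative, shared sample...
shared_sample = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(10)]
w, b = sgd_fit(0.0, 0.0, shared_sample)

# ...then push the model out and refine it on-box, where one household's
# private data differs slightly from the population (a higher intercept).
private_data = [(x / 10, 2.0 * (x / 10) + 1.5) for x in range(10)]
w, b = sgd_fit(w, b, private_data, epochs=100)
```

The point of the arrangement is that only model parameters ever leave the box; the raw private data used for the refinement step stays where it was collected.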
25:41
Perhaps the most complete set of applications that got built was actually built in collaboration with the BBC. I think there's a blog post on their website from a few months ago that describes this: the idea of a thing they called the BBC Box. So the idea here was to take data from data sources
26:03
that they would not wish to have direct access to themselves. So in this case, I think it was your iPlayer viewing habits (iPlayer is the BBC's content delivery system, or one of them, at the time), but also data from your Spotify account and your Instagram account. So the aim was to take data from those three sources; they obviously don't want to have the data from your Spotify account,
26:20
they don't want the data from your Instagram account, there's no reason they would want to hold that, it would only be a risk for them. Take the data from those systems, that goes into your data box, and then they have a BBC application running on your data box, which is able to process that to produce a profile, which can then be sent to their content recommendation system, and so appropriate content can be recommended to you based on quite rich data about your activities online,
26:43
but without them having to have direct access to that data. We ran a hack day a couple of years ago using an earlier version of this, and there were a couple of other applications that I thought were pretty cool, one of which was actually exploring
27:02
the idea that you could do actuation through this as well. And in fact, the couple of people who were involved in producing this demonstration did actually build it. You have a video editing suite where you assemble, let's say, a horror film from snippets of footage; you can put events in that horror film, and then playback is controlled by an application
27:22
sitting on your data box, and when the appropriate points in the horror film come up, the playback application flickers the lights in the living room where you're sitting watching the film, for example. So you can have that without the publisher of the data, the BBC or whoever it might be that's broadcasting this film, needing to have direct access to control the lights in your living room,
27:41
which obviously they wouldn't want to have and you probably wouldn't want them to have. So this idea of devolving the control to a device that's under your control, so it can then interact with your environment, monitor your environment, control your environment, but under your control. So that's Databox, which I sort of mentioned,
28:00
you could think of as DataWare version two. So where's the interaction that's going on here? How is this better supporting the I in HDI, the interaction in human-data interaction? It does better, perhaps, than DataWare did, but it became clear as we were going through this project that it's still not enough. So it's still the case that the request and processing
28:22
tend to occur in a black box. An app is kind of a contained environment. You can't see what it's got up to; you can't see what it's doing. It's not clear what the status of each of these applications is as they're executing in this system. We have got this audit logging support in there. It's possible that, using that, you could come up with some kind of notion
28:41
of where the processing has got to, like what the status of the application is, but what we can do at the moment, just with I/O, is probably not rich enough. We have a number of mechanisms, such as audit logging, permissions and requests, that allow us to coordinate, to some extent, what's going on within the data box.
29:00
But they don't do what the HCI folks call articulating the field of work, which I'll talk about on the next slide. And then the third thing is that real-world data sharing tends to be recipient-designed. So I will share information with people based on the context that we're in. I might talk about something in the pub with a colleague
29:20
that I wouldn't talk about with my wife. I might talk about something with my wife at home that I wouldn't talk about with a colleague in the office, and so forth. Where I am and who I'm speaking with control, to some extent, what I'm willing to reveal to them. And the ways that we support this in data box are a little bit too slow-moving. So you tend to make the decisions
29:41
at the point of installation of an application. It's not necessarily straightforward to go back and change those later. It's not perhaps easy to be dynamic, sufficiently dynamic in how those permissions are being granted, and how those permissions are being controlled. I mentioned articulation work. So there are some quotes from a paper by Schmidt
30:01
that defines this, but the way it was explained to me, as somebody who is not subtle enough to really understand some of these kinds of concepts, was with the example of walking down a busy street. If you're walking down a busy street, you're probably walking to get somewhere. So the work you're doing is walking to get to your destination.
30:21
But in the course of doing that, you have to do a lot of articulation work. You've got to make sure you don't bump into other people on the street. You've got to make sure you don't bump into signage on the street. You've got to make sure you don't walk into the road and get hit by a bus. And all of this kind of coordination work that you and everybody else on that busy street are carrying out, this is articulation work. It's the work that needs to be done in order that the work you want to do gets done.
30:43
A data subject in the data box is engaged in this kind of cooperative work with the data processor, and there may be multiple data subjects involved. We don't really do enough in the architecture that we have to support this kind of articulation work, where everybody tries to work out what's going on with everybody else
31:00
so that we can all come to the right conclusion and get the right things done. The other thing about this kind of recipient design was observed by a sociology colleague that data is essentially acting as a boundary object. It's a thing that is used in a relational fashion. You use data in multiple ways and it describes a relationship you have with something else.
31:22
So an example of a boundary object was a credit card receipt, in the sense that this is something which is used in multiple ways simultaneously. It's the proof of payment that the customer might have; it's the bank's proof that a valid transaction took place; it may be a supermarket's proof that the bank is supposed to pay them some money for the goods that you've taken away. All of these things are inherently relational; it's about the relationship between these parties.
31:44
And it became clear when we started looking at these sorts of data that almost all personal data is in fact relational. There's very little personal data which is so private that nobody else is included in it or affected by it. This is particularly true when you look at sensing data. Most households either have multiple parties
32:00
living in the house or at least they occasionally have visitors coming to the house. And so the sensing data that you might start to see being collected commonly there is going to implicate multiple people. It's not just the homeowner. It's not just one party in the house that should have control of that or that is represented in that data. Even take something that most people think of as private: who thinks of their email as private?
32:24
Okay, but presumably it came from or went to somebody else in most cases. And so even there this is data which involves other people. So in some ways what we try to do with data box in many ways is flawed from the start because we focused on the idea of an individual
32:42
having control of data. And actually data is inherently social in some sense, and so it needs to be controlled in a more social way. So, moving towards wrapping up the presentation part of this: there are a number of interactional challenges posed for HDI, then, which data box doesn't fully resolve.
33:02
It hopefully takes steps towards trying to surface some of them and perhaps resolve some of them, but it doesn't fix them. One is really a set of challenges around user-driven discovery. So how do you discover, as a data processor, who out there has the data you might want to use? It's easy when you're collecting it and you're putting it in the cloud somewhere
33:21
because you've got it, you know what you've got. But how can you find out which of the households, which of the individuals in the population has got a data box and has the data that you wish to process that would be useful to you? How do users discover what applications they might wish to install? The applications that might do things for them.
33:41
How can they be empowered to make sure that they install the right applications, that they're happy with the applications that they've installed? And how do we control that discovery process? There are a number of more standard mechanisms, I guess, that can be tried out here.
34:00
So along with permissions, you can imagine social rating systems, you know, 14 of your friends have installed this app and they're all very happy with it. Everybody's giving this five stars. These sorts of ways of communicating to other users that these are good applications to help them discover the right things. Legibility. There are mechanisms here that can support legibility
34:21
but legibility remains a problem. You should be able to visualize your own data. You do after all now have it in your data box. It might be much more difficult to visualize the impact that other people's data has on your data or that other people's data might have on the processing of your data, what is going to be revealed as your data is processed
34:40
given what has already been revealed by other people. This is true both for data that exists now and for data that might become available in the future. There's a question here which again comes back to discoverability: what can processors discover about what you have, and, in the same way, what can you discover about what processors want?
35:02
There may also be the need to edit data. So if you detect some data has been recorded which is wrong, you want to be able to go in and change it. This is, in some sense, another flaw in the way that we framed this; well, not a flaw, it was deliberate, but it's not complete. So we focused very much on the data subject,
35:20
on allowing the data subject to have control over data and data processing. But of course, there are more stakeholders than just the data subject in this. The data processor might legitimately want to know that you've not tampered with the data that you're revealing to them, that you haven't faked your propensity to risk, for example, so that they give you lower insurance premiums. I understand that when insurance companies
35:41
started saying, well, if you wear a Fitbit and we can see how active you are, we'll give you reductions on your health insurance, there were people who were putting their Fitbits on their dogs or on a metronome, and other mechanisms, to try and fake the data that was being recorded in order to reduce their insurance premiums. There's clearly a need here to try and support some degree of legitimate interests
36:03
of both sides, I guess, in this. As I've hinted at, data is a social thing. Most data is a social thing. So you want to be able to delegate control, delegate access to data, but you also want to be able to revoke it. You want to be able to see what's been happening
36:21
with your data, whether it's being edited, who's been viewing it, with whom it's being shared, and you want to be able to revoke those permissions. You also need to be able to negotiate. So if you have multiple data boxes in a household, for example, it might be legitimate for my data box to have access to some of the IoT sensing data and for the other adults in the house
36:40
to have access to the same IoT sensing data in my home, the energy consumption, smart metering, and so forth. But any one of us can then reveal that data to a data processor, and that might not be what we wish to happen. It might be the case that it would be better if there was some way of negotiating that we all agree that we are happy for this data to be revealed to this particular data processor. We have no mechanism to support that kind of
37:01
sort of social action at present. There's a need to think about who data is getting passed to. What can you do to try and work out when you've revealed some data to somebody else, what they're going to do with it, and what's happening after you've made that revelation.
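The missing negotiation mechanism just described could, in its simplest form, amount to requiring that every implicated party has consented before a release goes ahead. The following is a purely illustrative sketch; the names, the consent representation, and the policy are all invented here, and a real mechanism would also need revocation, auditing, and some way of deciding who counts as implicated.

```python
def may_release(source: str, processor: str, stakeholders, consents) -> bool:
    """Permit a release only when every implicated party has agreed.

    `consents` maps a person to the set of (source, processor) pairs they
    have approved; anyone missing from it has consented to nothing.
    """
    return all((source, processor) in consents.get(person, set())
               for person in stakeholders)

# Both adults are implicated in the smart-meter data; one has not yet agreed.
household = {"alice", "bob"}
consents = {"alice": {("smart-meter", "energy-coop")}}

blocked = may_release("smart-meter", "energy-coop", household, consents)

# Once Bob also consents, the release can go ahead.
consents["bob"] = {("smart-meter", "energy-coop")}
allowed = may_release("smart-meter", "energy-coop", household, consents)
```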
37:25
And I think from a technology perspective, the two of these that I find most interesting come down to the sharing of data and what to do about shared data. So sharing data, we want to be able to support offline data collection. We want to be able to support data collection from devices that are not necessarily co-located with the data box.
37:42
So this means we want some kind of rendezvous and identity service, and this needs to be reliable and not infringe on the privacy of the people participating in it. Shared data is another interesting thing. So there was a long running argument for about 18 months in this project between myself and one of the collaborators around how to support this idea of shared data.
38:02
What could we do given that data is inherently shared, is inherently social? Their stance was very much that what we needed to do was introduce the idea of a user account onto the data box. So we'd be able to manage access to the data by having user accounts on the data box so that we could say, well, you can say this,
38:20
and then this other account is allowed to see that data, and so on. It turned out, or it was certainly my opinion, that that was going to be inordinately complicated to implement, because it boiled down to the problem of who gets to manage the user accounts, who gets to create accounts and who gets to control the accounts.
38:42
With current consumer systems that I'm aware of, there's essentially no real way to do away with the idea of a root user, somebody who can see everything. And it's definitely the case, in other projects that we were doing, that when you're looking at personal data,
39:01
it's often the case that people are actually less concerned about complete strangers seeing their personal data than they are about other people in the house seeing personal data. The idea that your parents might be able to see what all your internet viewing habits are is something that many people find quite upsetting. But the idea that the ISP can see what you're seeing on the internet, they're not really too bothered about.
39:22
So there was a difficulty there: if you introduce accounts, how are you going to manage the fact that there's going to be one account that has access to everything, that can see absolutely everything that's going on? So I fought quite hard to try and keep things so that we had one data box for one person. Unfortunately, that really doesn't solve this problem of social data.
39:40
And the closest we get to that is we start thinking about the idea that we could replicate data within a household across a set of data boxes, perhaps. And then, well, what I did again was punt on the hard problem. So you end up in a world where you're devolving the challenge of managing access to social data into a social matter.
40:01
So you say that, well, you'll discuss it with other people in the house before you start revealing this data, because you know it's sensitive, because you're aware of other people's views. It's not like you're living in your house, in your social situation, unaware of anybody else in the house and what they think about this. So I think those are two interesting challenges from a technical perspective: how to support these kinds of interactions
40:20
and these needs in this system. And with that, I'll finish. Any questions?
40:50
So thank you very much for the talk. If you have questions, please line up at the microphones. Microphone two, I think. Thank you. Thanks for that.
41:00
I wonder how you see this moving beyond academia into sort of broad adoption. And if you have any thoughts on something like the Estonian eCitizenship model for how this could potentially scale. And then, and I guess also your thoughts on just what do you think needs to happen for this to be adopted at scale?
41:23
So I'm not familiar enough with the Estonian eCitizen model to comment on that. Frankly, I think that for this to be adopted at scale, we probably need to re-implement everything. So it's a little bit less of a research prototype. That would be a good start. I think that one of the big challenges in terms of adoption
41:44
is actually around what these applications might be. There was a strong interest from other parties in this project around IoT data particularly. And one of the things that seems to me to be the case around IoT data is that we've got all these opportunities to collect lots of it, but nobody's quite sure what to do with it
42:01
in terms of really compelling applications that make a great deal of sense. So in that sense, it may be that it's all a dead duck and there's no need for anything like this because in fact none of that's ever gonna take off because it's never gonna be compelling enough to be really, really useful. So I think having some killer applications or some real use cases that are valuable here would be good.
42:21
Some of the ones that I mentioned, the couple of ones I mentioned from the BBC and from other collaborators who I think were at the University of York at the Hackathon we ran, they started to become more interesting I think. So you can start to see some use cases arriving there, but it's quite slow to find them and quite slow to build them. The other thing that we would definitely need
42:40
for broader adoption of this sort of platform that we really need to fix as part of that rewrite is to make the development process much, much easier. It turns out that essentially there were professional developers that were hired to build that BBC demo and they did it and they did a great job and it worked, but I think everybody involved found the development process much harder than they expected.
43:02
The idea that you can't simply access a cloud service when you want to in your code, that you have to request permission for that and actually think about all that process is quite alien to modern development practices I think. So I think the development process is something we really need to work on to actually give us a hope of being adopted.
43:21
So yeah, that's two or three things. Thank you very much. More questions? Yeah, go ahead. Yeah, thanks a lot for the great talk. Would you say that, basically, what you had in mind when you developed this were IoT applications?
43:40
Well, so it sort of changed over time. When we first started with dataware, we were thinking about social media, social networks, email, IRC logs, chat logs and so forth as personal data. We constantly wanted to do things like getting banking data out, financial data was an obvious sort of thing that people find sensitive but would like to do interesting things with.
44:01
As time passed, IoT became more of a thing. One of my collaborators in this project was essentially funded to look at IoT data specifically, so that was where the interest in that came from. I think that, given the domestic context that we were targeting, IoT data is a more obvious sort of thing to look at
44:21
than other sorts of household data. Finance data is another one that's still kind of obvious. Personal health data, now, with the sort of wearables and monitoring you can do. All of these things are sort of there. But in some sense, I don't think it matters too much in terms of the challenges that we were coming across as we were doing this.
44:41
I mean, they're fairly endemic across the space, whether or not it's IoT data or other forms of data. You start to realize quite soon when you try and actually build these things that you've got these problems that data is inherently social, multiple people are implicated and so on and so forth. Great. Any more questions?
45:02
Oh yeah, microphone one, I think. I would like you to elaborate, if you could, on the different levels. There's the level of a household containing a family or someone, and there's the level of a community. I think if you have something to control the temperature in the living room,
45:23
you could have this box tell you that two other, unnamed, people in your household would like it to be somewhat warmer, but you're paying the bills, so you say this cannot be the case. And the whole neighborhood, street or block or something,
45:40
that's a different level where you have different questions that you could elaborate and where you could have good use of this box. Yes, so that's an interesting challenge. That's part of the reason we were thinking about this very widespread, federated data processing, essentially,
46:01
the idea being that, as I understand it, one of the ways that you can nudge people to reduce energy consumption, for example, is to tell them what the average is for people in their demographic. If that's lower than their consumption, that acts as a prompt for them to think about bringing their consumption down. Whereas if you do it at too large a scale, everybody in the country,
46:22
then it becomes less meaningful. But if you know that households in more or less the same configuration as yours are on average using a lot less energy, you might start to think about it. So that was the drive for those sorts of applications where you want to look at data across multiple data boxes simultaneously, where those data boxes may be spread across the wide area. So we started to investigate some technical means to do that.
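The demographic-comparison nudge described here can be sketched roughly as follows. This is a hypothetical illustration in Python, not code from the actual Databox or Owl codebases; all of the names (`household_average`, `demographic_average`, `nudge`) are invented. The key design point it tries to capture is that each data box releases only a single aggregate value, never its raw readings:

```python
# Hypothetical sketch of the "compare to similar households" nudge.
# Each data box is modelled as releasing a single aggregate (its
# household's average daily energy use in kWh), not raw readings.
from statistics import mean

def household_average(daily_readings_kwh):
    """What a single data box would release: one aggregate, not raw data."""
    return mean(daily_readings_kwh)

def demographic_average(aggregates):
    """What a coordinator computes across data boxes of similar households."""
    return mean(aggregates)

def nudge(own_average, peer_average, threshold=1.1):
    """Prompt only if this household uses notably more than its peers."""
    if own_average > threshold * peer_average:
        return (f"Similar households average {peer_average:.1f} kWh/day; "
                f"you average {own_average:.1f} kWh/day.")
    return None

# Example: three similar households, each releasing only its own aggregate.
boxes = [[9.0, 11.0, 10.0], [8.0, 9.0, 10.0], [14.0, 16.0, 15.0]]
aggregates = [household_average(b) for b in boxes]
peer_avg = demographic_average(aggregates[:2])  # the comparison group
message = nudge(aggregates[2], peer_avg)        # third household is prompted
```

In a real federated deployment the aggregation itself would also run across the wide area, and you would want the aggregates protected (for example by adding noise or secure aggregation) before they leave each box; this sketch ignores all of that.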
46:43
There's a system that a postdoc of mine built called Owl, which is a data processing system for the OCaml programming language, which was trying to embody some of those ideas. If you go to ocaml.xyz, that's the website for that particular thing. He took, I think, 18 months,
47:01
and he wrote 180,000 lines of OCaml code to implement it, which was fairly impressive. We haven't got to the point where we can deploy any of those yet, or actually test any of them out, and certainly not at that sort of scale. That's something that I'm hoping to do in the next year or two, with some other developments that we've got in Cambridge around the digital built environment,
47:21
where I might be able to start to deploy some of these ideas and see how they work in terms of data which is being collected and managed at a scale which is larger than a single household. So you don't have the same kind of domestic framing for it. Okay, do you have any more questions, microphone two?
47:40
Could you elaborate a little bit further on the trust of applications, especially if they start doing unintentional things, such as requesting data that, combined with or based on other data, reveals information without you intending it to?
48:02
So I think that's essentially one of the challenges that we haven't really addressed. If an application that you install asks for access to data that you have not given it permission for, it can't have that; those requests will simply be denied. But if an application that you've installed has been given access to some data, and it manages to do some processing of that data,
48:22
and you're happy with that, and the results of that processing go back to the data processor's home base, and then they're able to join that with some other source of data that you had no idea about, then we can't do anything about that at this point. And that's one of the challenges here: what to do when the data that you thought was okay turns out not to be okay to reveal,
48:41
because somebody's found something else that lets them attack it in some way. I don't know what to do about that. Thank you. Thanks. So we have one more question, yes? Yes. So when multiple apps are trying to access the same data, is there a standard that you're using, like semantic web standards, to understand what certain rows of tables mean between applications?
49:05
Not really, no. We didn't go down the route of trying to sort of taxonomize everything and put everything into an ontology. So at the moment, the application writer just has to know the data source that they're accessing. If they're accessing the Philips Hue light bulb data, they happen to know what that format is. So each application is talking to its own data store
49:21
within the cloud, within the data box? No, the author of each application needs to know the format of the data that that application's going to process. So somebody has to go and look at some specs before they write their app. And how do you see this project? Are there parallels with, let's say, the Solid project that's being led by Tim Berners-Lee, or the ownCloud project, where again,
49:41
I feel there's kind of a common pattern? Based on my understanding of those projects, I think that we are focused on the platform, and the control in the platform, and less on trying to control what the application tries to compute out of it. Solid, I think, has maybe moved on
50:01
since I last looked at it, but I think initially it was quite focused around the browser, for example, and we're not trying to stand in relation to the web at all. It's about having a device. I think that's the other thing that, again, when I last looked, still seemed to be fairly unique. It seemed to be something we were doing differently,
50:21
which was having a physical device that users could control directly and trying to provide the affordances that you get by having a physical device, rather than having something that's just abstract software in the cloud somewhere that you can't really control in the same way. Good, microphone one. Hi, I'm curious if you have any data
50:43
on the problem awareness in the UK, like the problem awareness among the population, whether they're already aware of the implications, I guess. Off the top of my head, I don't.
51:02
From a previous project, we did do a review of a lot of the privacy literature, so papers that have been published about people's attitudes towards privacy and understanding of the problems of privacy as represented in data. But I don't actually have any sort of statistical data about how aware the population generally is of these kinds of issues.
51:21
We have looked a little bit at trying to do that sort of thing. When I was at Nottingham, for example, we did some work with one of the standard surveys that I think the city council executed every year, or frequently anyway. And if I recall correctly, there were some questions in that, the answers to which did not make sense
51:42
from a technical perspective. So one of the questions that was asked was, do you use the internet? I think a lot of the respondents said, no, I don't use the internet. Why would I use the internet? Another question that was asked was, how do you arrange to meet up with friends? And a lot of those same respondents who don't use the internet use Facebook to meet up with friends. So I think it can be quite difficult
52:01
from survey data sometimes to actually work out and tease apart what is really going on in terms of people's understandings and concerns about this, because some of these concepts are quite abstract, and a lot of it is very dynamic. It's this sort of recipient design: I can give you one answer to the question, am I concerned about the privacy of my data? And if you frame the question slightly differently,
52:22
I will give you a different answer, because you've triggered something else. And so I think it's quite difficult to gather really robust data where you can really be satisfied with the inferences you draw. Thank you. So there are no more questions, I think. Another round of applause, please, for Mort.