Developing a privacy-aware map-based cross-platform social media dashboard for municipal decision-making
Formal Metadata
Number of Parts | 351
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/68897 (DOI)
Production Year | 2022
Transcript: English (auto-generated)
00:00
All right, thank you very much. So hello, everyone. I'm really excited to be here today. My name is Dominik, and I will present a talk with a very long title tonight: developing a privacy-aware, map-based, cross-platform social media dashboard for municipal decision-making. But just keep in mind privacy-aware and social media dashboard. Those are the two things that I will talk about tonight.
00:24
So just some words about me. I'm a PhD student at the TU Dresden, and I work for a GBCON site, as well as the Joint Research Center of the European Commission. And my main research interests lie in geomatics, geospatial data science, and programming as well. And that's the reason why I'm here, of course. And if you're interested, just check out my blog,
00:42
where I frequently post something about our work, about privacy, about data, et cetera. So more or less, everything that we'll talk about tonight is also on my blog, split up into small portions that are easy to follow. OK. So what you will learn today is basically three things. So first of all, I will talk about LBSN structure,
01:01
which is location-based social network structure. It's basically a package to split up social media posts and to streamline social media posts from different platforms. The second thing that you see here is HyperLogLog, which you have probably never heard of. Is there anyone who has ever heard of HyperLogLog before?
01:22
No one. That's great. So I can explain it to you very well. And the third and last thing is LBSN dashboard, which is the front end for the structure that we developed here. And that's actually the thing that you see on the right, but we will go more into detail later on. So let's start with social media.
01:41
Location-based social networks is a very long title for the most common social media platforms that we use, like Instagram, Twitter, Flickr, TikTok, even YouTube. Like all of those platforms, they tend to have some kind of geospatial information attached, like a coordinate, an Instagram location, or something that indicates some specific location
02:03
for the data that is posted there. Take, for example, an Instagram post that is tagged with FOSS4G, with the location here on site. And maybe people say, oh, it was great. OK, so we have this information and a particular geotag, meaning it's taking place here at FOSS4G, for example.
02:23
And LBSN structure is basically a common, language-independent, cross-network social media data scheme. And what does it want to do? The question is very simple: how to standardize all the different APIs from these platforms. Because, for example, Flickr has a public API, Twitter also has a public API, as well as Instagram.
02:43
And all these APIs are pretty different. So if you take a look at, for example, this API over here, the result of the Instagram API, there are tons of different nodes. And all of these APIs are different, of course, and they return different JSON if you query them. And the idea is to find a way to standardize
03:02
all the different data that is returned by these APIs. And the answer is actually pretty simple. So if you take a look at this data scheme here, it's basically splitting up each individual social media post into four different facets. These facets are these pyramids that you can see here.
03:22
So there is, for example, the spatial facet, the temporal one, the topical one, as well as a social one. And each of these facets has different granularities. For example, the spatial one is pretty simple to understand. So you have a country at the very top, then a region, a city, a place, and then maybe lat/long. So for example, if you would take FOSS4G,
03:42
this would mean a place, or if you would even add a location, like right here in this building, this might even have some kind of lat/long attached to it. And it is very similar for the temporal, topical, and social facets. So the idea is just to split up each post into very tiny, almost atomic components.
04:04
So for example, let's take this fictional sample of a social media post by sunlover22 that has some kind of text component and some kind of metadata. And all you do is you just split it up into metadata on the one hand, for example, user ID, like sunlover22,
04:21
a timestamp, as well as a location. And then, for example, if you take all the terms that are in the caption of this picture, for example, "today", "enjoying", "the", "wonderful", "on", "with", as well as the emojis, you really have the most atomic components of each social media post. And you can repeat this process with all kinds of different social media platforms
04:42
and divide each post into these tiny atomic components. So if you think about it, if you do this with all platforms, you end up with a database with all these tiny fragments of posts. And then you can also put, for example, Instagram posts and Flickr posts, as well as Twitter posts, in one bucket.
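As a rough illustration of the splitting step described here, the following Python sketch breaks one fictional post into its facets. The `AtomicPost` class, the `split_post` helper, the caption, the timestamp, and the coordinates are all hypothetical and chosen only for this example; they are not taken from the LBSN structure package.

```python
from dataclasses import dataclass, field
import unicodedata


@dataclass
class AtomicPost:
    """Hypothetical container for the facets of a single post."""
    origin: str        # platform the post came from
    user_id: str       # social facet
    timestamp: str     # temporal facet
    latlng: tuple      # spatial facet at its finest granularity
    terms: list = field(default_factory=list)      # topical facet
    hashtags: list = field(default_factory=list)   # topical facet
    emoji: list = field(default_factory=list)      # topical facet


def split_post(origin, user_id, timestamp, latlng, caption):
    """Split a caption into terms, hashtags, and emoji tokens."""
    post = AtomicPost(origin, user_id, timestamp, latlng)
    for token in caption.split():
        if token.startswith("#"):
            post.hashtags.append(token[1:].lower())
        elif all(unicodedata.category(ch) == "So" for ch in token):
            # Unicode category "So" (symbol, other) covers most emoji
            post.emoji.append(token)
        else:
            word = token.strip(".,!?").lower()
            if word:
                post.terms.append(word)
    return post


# Fictional post; all values are made up for illustration.
post = split_post(
    origin="instagram",
    user_id="sunlover22",
    timestamp="2022-08-24T18:30:00",
    latlng=(43.77, 11.25),
    caption="Enjoying the wonderful sunset today! #sunset \U0001F31E",
)
print(post.terms)     # ['enjoying', 'the', 'wonderful', 'sunset', 'today']
print(post.hashtags)  # ['sunset']
```

Because every platform ends up in the same structure, posts from different sources can be stored side by side, with `origin` recording where each one came from.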
05:02
Great, so this looks pretty much like this. So on the left hand side, you have the different social media platforms, which are obviously not limited to the ones here. And then you just split it up into these different facets. Great, so having this database, for example, you could already produce a nice front end
05:21
and a dashboard that you could work with if you wanted to. But the main problem here is that there is no privacy at all included in this dashboard. So if you have a database with the original raw data and user IDs, it's very detrimental for user privacy. And we don't want to put the user's privacy at risk. And that's the reason why we need to tackle privacy.
05:43
And this brings us to the second part and the most important part of my talk, which is the HyperLogLog algorithm. So what is HyperLogLog? First of all, it's a probabilistic data structure. So it has nothing to do with the raw data itself. And as well, it's a cardinality estimator. So in other words, it's just a solver
06:02
for the very particular count distinct problem. So for example, if I would count the distinct number of people in this room, I would end up with maybe 20. Okay, great. But I could also count maybe the number of times that people went in and out. And this count would not be distinct; it would be a different number.
06:23
HyperLogLog has some very interesting specifics. For example, a super low error rate of between 2 and 4% and a very low memory consumption. And by low, I mean really low. So for example, you can put 1 billion elements in only 1.5 kilobytes of data.
06:41
So just think about harvesting 1 billion user IDs and the amount of space that it would usually take on a hard disk. Using HyperLogLog, it only consumes 1.5 kilobytes in this case. And last but not least, it's very fast. And what I mentioned already, it doesn't need to save the original data, but it's only a data abstraction in this case.
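The error rate and memory figures mentioned here are consistent with the standard HyperLogLog accuracy estimate, where the relative standard error is roughly 1.04 divided by the square root of the number of registers. The sketch below assumes 2048 registers of 6 bits each; these parameters are an assumption for illustration and are not stated in the talk.

```python
import math


def hll_standard_error(registers):
    """Approximate relative standard error of a HyperLogLog estimate."""
    return 1.04 / math.sqrt(registers)


def hll_memory_bytes(registers, bits_per_register=6):
    """Memory needed for the registers alone (bits_per_register is assumed)."""
    return registers * bits_per_register // 8


m = 2048  # assumed number of registers
print(f"{hll_standard_error(m):.1%}")  # 2.3%
print(hll_memory_bytes(m))             # 1536
```

The memory use depends only on the number of registers, not on how many elements are added, which is why a billion user IDs can fit in the same 1.5 kilobytes.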
07:03
Count distinct is actually a very hard thing to do for your CPU or your GPU, because the way it works is actually more complicated than it looks. So let's take this Python example in this case. So we have some array with repeating numbers like one, two, three, for example. And how you would do this is simply
07:21
if you count the total length of this array, it is just 10. So you can literally just query the length of it. But if you want the set, it's very computationally and memory intensive. So you need to consume a lot of resources to count the distinct elements, because you always need to keep in mind the elements that you have already seen and compare them to all the following elements.
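The slide itself is not part of the transcript, so the following is only a minimal reconstruction of the kind of Python example described: an array of ten repeating numbers, its total length, and its exact distinct count. The concrete values are assumed.

```python
# Ten values with repeats; the exact numbers are assumed for illustration.
values = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]

total = len(values)          # total count: just the length of the array
distinct = len(set(values))  # exact distinct count: must remember every value seen

print(total, distinct)  # 10 3
```

The exact approach keeps every distinct value in memory, so its footprint grows with the number of distinct elements.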
07:42
And the way it works with HyperLogLog is different. It's actually simpler. So this is the HyperLogLog Python package in this case. And the way it works is you have this array and you create an empty HyperLogLog set, and you add each element of the array one after another to this HLL set.
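The add-then-query flow described here can be sketched without any third-party package. The `TinyHLL` class below is a simplified, self-contained illustration of the HyperLogLog idea and is not the Python package shown in the talk; the class name, the choice of SHA-1 as hash, and the precision of 11 index bits are assumptions.

```python
import hashlib
import math


class TinyHLL:
    """Minimal HyperLogLog sketch for illustration only."""

    def __init__(self, p=11):
        self.p = p                     # number of index bits (assumed)
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash; only hash-based traces are stored, never the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


hll = TinyHLL()
for value in [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]:
    hll.add(value)          # one pass over the data, one element at a time
print(round(hll.cardinality()))
```

For this small input the estimate should land close to the true distinct count of 3. Adding an element only ever updates one register, so the memory footprint stays fixed no matter how many elements pass through.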
08:02
This means that this is a linear process. In the moment when you add it, this HyperLogLog set grows and you can immediately query the length of the HyperLogLog set. And this is consuming way less memory. Why? We will come to this in a second. This is the algorithm. And I will not talk about this algorithm.
08:21
Instead, I decided to bring you an analogy that is hopefully easy enough to follow. And I decided to bring in some Nintendo characters that you might know. And so let's just jump into it. So just imagine we had a huge beautiful castle of Zelda, for example, and you see this gate here.
08:41
Like there's the castle gate. And if she wanted to throw a privacy party, people need to enter through this gate. And there are tons of characters and possible people that could join this party, for example. And all these people, they have different shapes, they have different silhouettes. And what Zelda is trying to do here, she's applying a very simple yet effective trick.
09:03
So just imagine this gate here was not a normal gate, but more of a video game gate. So what she forces people to do is not to walk through the gate and open it, but instead people just jump through it. They just jump through it. And what happens is they leave the silhouette in the gate.
09:20
So when Mario jumps through that gate, you can see the silhouette that he leaves behind. It's this one. Now that's already very crucial to HyperLogLog. We will get to it in a second. So when more people attend the party, for example, there's Luigi joining the party and he's also jumping through it. And you see, we get a silhouette that is not corresponding too much.
09:41
So before, if you take a look at the Mario silhouette, it's easy to understand that this is maybe one person or one character. Instead, if you take a look at the combined silhouette, it looks more like nothing, like you can't really tell what is in there. And if you make more characters join the party, like Donkey Kong, for example, Yoshi, Mario, and for example, last but not least, Peach,
10:02
you end up with this bulky silhouette in this gate. And you don't really know who passed through that gate unless you know the people very well, but you can only raise suspicion, and you never know for sure who passed through that gate. So what HyperLogLog is capable of doing, taking this analogy, is brilliant because HyperLogLog can estimate
10:22
the number of unique visitors solely based on the ratio of the silhouette area to the gate area. So let that sink in. HyperLogLog can estimate the number of unique visitors solely based on the ratio of silhouette area to gate area. So for example, if roughly 80% of this gate
10:42
is sort of cut out from the gate, in our analogy, HyperLogLog could estimate that there were like six people probably joining the party. So this is what HyperLogLog is capable of, only taking a look at this combined silhouette here. So there is no more information about the people themselves,
11:02
like there is no, they didn't leave anything behind, any traces that lead to them, but just a tiny fraction of each individual silhouette that combined leads to a different silhouette in this case. And this has very particular privacy effects. So if we take a look at, for example, the more the better. If you start with one character, for example, Bowser,
11:22
it's very easy to see that one character joined this party. There is something else. If, for example, Mario would attend the party after Bowser passed, he could very simply hide in this silhouette and no one would know that he actually joined the party. So you can imagine like if the silhouette is growing even more,
11:42
more people can hide in it and you never know for sure that someone really passed through that gate. There is actually one negative effect of HyperLogLog and that's the proof of non-occurrence. Just imagine that you had only the Bowser silhouette in this case, then Luigi is joining the party.
12:00
You see that the silhouette changes, and when the HyperLogLog set changes, you know for sure that someone hasn't been there before. For example, if at the very end of the party there's only Bowser's silhouette and no trace of Luigi, one would know for sure that he has not attended the party. But on the other hand,
12:21
there's also no proof of occurrence. So for example, if we had only this silhouette here and we forced Luigi to pass through that silhouette, there is a chance that he passed through it but there is no certainty. And that's the main reason why HyperLogLog is privacy-aware and protects privacy in this case: you can never be sure
12:41
who actually passed through that gate and also the more it's growing, the safer it gets for the people. On the other hand, this brings us pretty much to the privacy dilemma because if you have the raw data and just think about all the amounts of petabytes of social media data that there are in this world, if you have this raw data, you can perform spectacular analysis
13:01
but privacy is of course at risk because you work with user IDs, with sensitive information but on the other side, if you have no data, there's 100% privacy because no one is working with the data but if you still want to work with the data but in a manner that privacy doesn't get affected badly, there must be some kind of middle way in this case
13:22
and this is where so-called probabilistic data structures like HyperLogLog or the so-called Bloom filter come into play and where they help out because they only take a fraction of, for example, a user ID and it's enough for HyperLogLog to estimate the number of unique elements and that's the reason why we tend to call it privacy-aware
13:41
and not yet privacy-preserving, because if you ever hear the term privacy-preserving, this corresponds to differential privacy, which is 100% private, which HyperLogLog in this case is not. And this is also what Desfontaines et al. claimed in their title, "Cardinality estimators do not preserve privacy", which is true. If on the other hand,
14:01
you want to take a closer look at the privacy effects of HyperLogLog, check out the paper by Alexander Dunkel et al., which is the one down here, and there's a possible attacking scenario, so what could happen in the worst case? So what about ease of use? There are some very particular features of HyperLogLog that I will also explain with the silhouettes.
14:20
If for example, you take a look at this gate here, let's assume this one was the front gate like the front entrance of the castle but this one here was the back entrance and at the very end of the party, you wanted to know how many people all together joined the party, like how many unique people joined the party. The way you do it is just to put a very simple overlay of these two silhouettes together,
14:42
and you end up with this combined silhouette, and HyperLogLog is just doing its job, so you can always create lossless unions of HyperLogLog sets. On the other hand, you can also create so-called intersections and for example, if you wanted to know how many unique people passed through both gates like for example, front entrance
15:01
as well as back entrance, so for example, coming in at one gate and also passing through the second gate, you can create these intersections. For example, silhouette one is the blue one, silhouette two is this orange one and the intersection would be the red one and HyperLogLog could estimate solely based on this red shape how many people walked actually through both gates
15:22
in this case. So perfect, we come back to our LBSN structure. Think about these social media posts split into these tiny components of facets. Let's just translate this analogy to the real world. The gate in this case that you see here is the HyperLogLog set
15:42
that is sort of retaining the information about the silhouette or about the unique number of people that passed that gate. The silhouette instead is the unique user identifier or just user ID on social media. It can be your ID, it can be an email address or a phone number, it doesn't matter what, it just must be unique.
16:03
And what we can do now is for each of these tiny atomic components of social media, for example, a very simple hashtag like FOSS4G, we create an individual HLL set. So just think about this gate, we create this gate for every tiny fragment of social media.
16:20
For example, for this hashtag FOSS4G, we can do the same with locations like Piazza XY, for example, or the FOSS4G location, the conference center. And you create these HLL sets for each of these tiny fragments. The cool thing is then you can combine them easily. So for example, if I wanted to analyze
16:41
not only hashtag FOSS4G, but also hashtag FOSS4G 2022, I would just create the union and then have the combined number for both hashtags. And I would have the unique number of users on social media that posted something considering the hashtag, for example. Considering the locations,
17:00
you could also work with H3, for example, or create some more aggregates, which would benefit privacy even more if you would take not the Instagram locations, for example, or really the location, like the coordinate, but instead use, for example, a raster or hexagons or something. This could also be done in advance.
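The union of per-hashtag HLL sets described above can be sketched as follows. This is a simplified, self-contained illustration with assumed parameters, fictional user IDs, and hypothetical function names; it is not the implementation used for the dashboard. The union takes the element-wise maximum of the registers, which gives exactly the sketch that would result from adding all users to one set. The intersection is estimated here with the inclusion-exclusion principle, a common approach whose error grows when the overlap is small.

```python
import hashlib
import math

P = 11       # index bits (assumed)
M = 1 << P   # number of registers


def sketch(user_ids):
    """Build HLL registers from an iterable of user ID strings."""
    regs = [0] * M
    for uid in user_ids:
        h = int.from_bytes(hashlib.sha1(uid.encode()).digest()[:8], "big")
        idx, rest = h >> (64 - P), h & ((1 << (64 - P)) - 1)
        regs[idx] = max(regs[idx], (64 - P) - rest.bit_length() + 1)
    return regs


def union(a, b):
    """Lossless union: element-wise maximum of two register arrays."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs):
    """Estimate the number of distinct elements behind a register array."""
    raw = 0.7213 / (1 + 1.079 / M) * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range correction
    return raw


# Fictional users: 600 per hashtag, 200 of them used both hashtags.
foss4g = sketch(f"user{i}" for i in range(0, 600))
foss4g2022 = sketch(f"user{i}" for i in range(400, 1000))

both = union(foss4g, foss4g2022)
print(round(estimate(both)))   # estimate of unique users across both hashtags

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
overlap = estimate(foss4g) + estimate(foss4g2022) - estimate(both)
print(round(overlap))          # estimate of users who used both hashtags
```

With these fictional inputs the union estimate should be near 1,000 and the overlap near 200, each within the error bounds of the sketch.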
17:20
So we have our privacy aware database. And what we can do now is we can query it. On the left hand side, you see an example query that is actually pretty simple. So just querying the cardinality, remember the count distinct function, of, for example, the hashtags. What you see on the right hand here is the 23 most common hashtags
17:41
that were used in the city of Bonn, because this is my case study area in this case. And you see we have those kinds of hashtags like Bonn Germany, Instagram is Bonn, Love, Bonn Instagram, et cetera. So those hashtags have been used by roughly this amount of posts in this case. And we can easily look at
18:01
how Bonn is reflected on social media. Okay, so having this database ready, we can, of course, create an API. So we just head over to LBSN dashboard. This is a repository that you can look into. It's on GitHub and it's all open source. And this one here is just FastAPI
18:22
and just creating a simple API infrastructure based on this privacy aware database. Now, if you wanna see it in action, and if you have, for example, a laptop with you, that would be better, just scan this QR code. You can even do it on your phone, but I promise you it's not optimized for mobile, so it will look a bit messy.
18:40
Anyway, just go to this demo here and you can check out how this dashboard actually works. The QR code is also right down here, so just scan it if you'd like to. This is the interface of the dashboard. And let's start with a very simple example. For example, I just found more or less accidentally in my data set that for the FOSS4G 2016 in Bonn,
19:03
I could see, for example, where people tended to post the most on social media. And I didn't even know that it took place here in Gronau, which is where the conference center is. And it seems to me that most of the people posted something here, but also of course in the city center of Bonn and also a bit in the outskirts, so to say.
19:22
So you can see, for example, where the density of a certain topic is predominant. You can do the same if you zoom in, and these ones here are so-called hex bins or hexagonal bins that bin the number of posts on the fly here, and you see that in this conference center,
19:41
there were most of the posts located. You can also perform some individual spatial queries, like draw your custom area of interest and take a look at the relative distribution. For example, this is Endenich, like a part of Bonn. And what you can see here, there's one hexagonal bin that has 1,400 posts, which is a lot,
20:00
and it's predominant for this whole area. And so we can look into it. If you look on OpenStreetMap, for example, we can see that there's the so-called Harmonie. It's a concert hall where a lot of concerts take place. Or we can go a step further and take a lot of geometries. Like this one here is, for example, the urban green spaces of Bonn,
20:20
according to the municipal land use plan. And we could just take a look at the heat map, for example, where people tend to be the most. For example, we see here on the Rhine area, most people tend to chill here, but also on this green axis here, and the Rhine in general takes a huge spot in this heat map, for example.
20:42
You can also take a look at the hex bins in this case, and you see that we have like three clusters, one here, one here, and one here. And so we can work with all of these different themes, like technically doing different analyses, also for very specific queries in this case. So thank you very much. And I'm really excited for your questions,
21:01
if you have any.