Interactive network visualizations as "guided close reading" devices for the social sciences
Formal Metadata
Number of Parts | 542
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/61915 (DOI)
FOSDEM 2023, part 419 / 542
Transcript: English (auto-generated)
00:08
Okay. Hi everyone. It's a big pleasure to be here. My name is Armin Pournaki, and I'm a Ph.D. candidate in applied mathematics, and I work on building tools for discourse analysis.
00:23
And we build tools for discourse analysis based on methods from graph theory, network science, and natural language processing. And today I want to present a tool called the Twitter Explorer that is already a bit older, but that was built at the Max Planck Institute for Mathematics in the Sciences in my previous group.
00:43
And the idea was to build a tool that allows researchers who don't necessarily have programming skills to collect Twitter data, visualize them using graphs, and explore the data and maybe generate hypotheses in their pipelines.
01:02
So this kind of tool building and this research happens in the field called computational social science. So when I was preparing my slides two days ago, I thought it would be good to maybe give a little overview of computational social science, then say why we built the Twitter Explorer and where we saw somehow the need for a new tool.
01:23
Of course, introduce the features of the tool because it's kind of a talk on programming, the architecture, and maybe give some insights on the usage. But when I was sitting down to make the slides two days ago, I was confronted with this. And of course, since the tool is essentially an entry point into the free API, there's also a part of
01:51
it that uses the research API, which of course led us directly to this question, what happens to the research API? It's also not entirely clear, right?
02:04
So instead of just giving this talk the way I was planning to, which I will still do, I wanted to ask a few questions first that we might then take up in the discussion. And I think there is even something planned later, right? Some kind of panel discussion.
02:22
So I'm just going to throw some questions out there that I think are really pressing now, especially in the research field. How serious is this? By this I don't mean the implications of it, because I know a few people whose theses are now in jeopardy because they can't collect data. But how serious is it in the sense of, will it actually happen? Or is it some scare tactic? So I think this is something that is hard to predict.
02:51
And then, and these are also questions I think we can discuss here: is there a way for us, as users and not necessarily only as researchers, to claim our data, the digital traces that we leave on these platforms?
03:07
And how can things like the Digital Services Act play a role in this? And the last question is very broad, but how do we move on in the sense of how can we see this as some kind of wake-up call maybe?
03:23
And how can we use this new development to maybe on one hand move to different platforms, but on the other hand also to think about how we do computational social science in the future? So with these questions that we're going to discuss later, I'm still going to give my original talk.
03:43
So in computational social science, a typical pipeline for a project is you have a research question, then you collect data related to it. And in this case, it may be data from online social platforms. And then you analyze it and ideally you generate some more insights on the research question you had in the beginning.
04:04
And sometimes the exploration and the analysis of the data can help you maybe refine also the questions you had in the beginning. So it's some kind of loop that you can see in this way. And the tool that I'm going to present, the Twitter Explorer, is precisely made for this second part.
04:20
For both facilitating the collection and also the exploration of such data. And this pipeline is that we start with text. So in our case it's tweets that are annotated with some kind of metadata. We have on Twitter different types of interactions. So you can mention someone, you can reply to someone or retweet.
04:45
And we choose one type of metadata and cast it into an interaction network. And then we want to find the most significant, for instance, clusters or the significant correlations in this data
05:00
by using 2D spatializations. And typically these are done using force layouts. But today, for instance, in the graph room there were also some talks about new methods of node embedding. And so I think this is also something that we can discuss maybe in the question section. But one reason why I think force layouts are good is that especially if you use them in a context where you work with social science researchers
05:28
who don't necessarily have a lot of knowledge about the latest machine learning algorithms, they are quite straightforward to explain in the sense that you have a spring system and nodes that are strongly connected tend to attract each other.
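The spring-system intuition can be made concrete with a toy layout routine. This is a minimal sketch under stated assumptions, not the layout code the Twitter Explorer actually uses: the function name `spring_layout`, the force laws (repulsion k²/d between all pairs, attraction d²/k along links) and all parameter values are illustrative choices.

```python
import math
import random


def spring_layout(nodes, edges, iterations=300, k=1.0, step=0.05, seed=0):
    """Toy force-directed layout: all pairs repel, linked pairs attract."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes, magnitude k^2 / distance.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                force[u][0] += f * dx / d
                force[u][1] += f * dy / d
                force[v][0] -= f * dx / d
                force[v][1] -= f * dy / d
        # Spring attraction along each link, magnitude distance^2 / k.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            force[u][0] -= f * dx / d
            force[u][1] -= f * dy / d
            force[v][0] += f * dx / d
            force[v][1] += f * dy / d
        # Move each node along its net force, capping the step length.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy)
            if mag > 0:
                scale = step * min(mag, 1.0) / mag
                pos[n][0] += fx * scale
                pos[n][1] += fy * scale
    return pos


def distance(pos, u, v):
    """Euclidean distance between two laid-out nodes."""
    return math.hypot(pos[u][0] - pos[v][0], pos[u][1] - pos[v][1])


# Two triangles joined by a single bridge link (made-up example graph).
nodes = ["a", "b", "c", "x", "y", "z"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]
layout = spring_layout(nodes, edges)
print(round(distance(layout, "a", "b"), 2), round(distance(layout, "a", "z"), 2))
```

Linked nodes settle near a preferred spring length while unlinked nodes are pushed apart, which is why densely connected groups show up as visual clusters.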
05:45
And especially if you look at interaction networks on Twitter, since retweeting can be considered endorsement, clusters in such 2D spatializations can then correspond to something like opinion clusters. And there's a lot of research being done in that way.
06:03
But one question that we always had when we look at these networks is how do we actually go back to the data that generated them? And this is something that we try to kind of tackle with building these tools. So why we built it is firstly to provide an interface for researchers without programming skills, also to collect and visualize the data,
06:24
because we were working a lot with social scientists that did not have these programming skills, but had a lot of hypotheses about the data that they could not test. Then, of course, to facilitate the exploration of controversial issues on social media.
06:41
And, this is the point that I was making before, to add some layer of interpretability to these 2D spatializations by providing access from within the interface to the actual data that created these node positions. And finally, we see it in the context of a larger scientific scope of using the network paradigm as something like a sampling mechanism for the data.
07:10
Because if you're confronted with a large number of tweets, for instance, of course, everyone knows that you can't read all of them manually. So you need some kind of way to get to the tweets that are relevant for you to read.
07:23
And this is what we use the network for, essentially. So we can, when we look at retweet networks, immediately identify, for instance, the most influential actors in the debate, and then read precisely those tweets that they made to maybe influence other actors. And we call this guided close reading, because if you do only close reading, then you have to read all the text.
07:46
If you have distant reading, you kind of look only at the network on a structural level, and this is something in between. So what can the tool do? It collects tweets.
08:02
I mean, I think we have like one week left for the V2 and the V1. So far, the V2 academic is safe, but we don't know that. So you can search for tweets from the past seven days using the API. In the second part, in the visualizer, you can display just a simple time series
08:24
of the tweets to see maybe if there's some kind of special activity during one day. You can build these interaction networks, build co-hashtag networks. So we divide it into some kind of two types of networks, which we call semantic networks and interaction networks.
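The daily time series mentioned above amounts to counting tweets per calendar day. A minimal sketch, assuming each tweet is a dictionary with an ISO 8601 `created_at` string (the field name and the sample records are hypothetical, and this is not the tool's own code):

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(tweets):
    """Count tweets per calendar day from ISO 8601 timestamps."""
    counts = Counter(
        datetime.fromisoformat(t["created_at"]).date().isoformat() for t in tweets
    )
    return dict(sorted(counts.items()))


# Hypothetical records; real API payloads carry many more fields.
tweets = [
    {"id": 1, "created_at": "2023-02-03T09:15:00"},
    {"id": 2, "created_at": "2023-02-03T18:40:00"},
    {"id": 3, "created_at": "2023-02-04T11:05:00"},
]
print(tweets_per_day(tweets))
```

A spike in one of the daily counts is the "special activity during one day" that the time series is meant to reveal.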
08:43
And then you can compute the typical measures people compute on networks, and especially compute clusters using modularity-based algorithms. And all this happens in some kind of interactive interface using JavaScript and D3.js.
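Modularity-based algorithms such as Louvain search for the partition that maximizes a score Q, the sum over communities c of L_c/m − (d_c/2m)², where m is the number of links, L_c the number of links inside c and d_c the total degree of c. The sketch below only evaluates Q for a given partition of an undirected, unweighted graph; it does not perform the optimization, and the function name and example graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # links with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    total_degree = defaultdict(int)  # summed degree per community
    for node, deg in degree.items():
        total_degree[community[node]] += deg
    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2 for c in total_degree
    )


# Two triangles joined by one bridge link (made-up example).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]
two_camps = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
one_blob = {n: 0 for n in two_camps}
print(modularity(edges, two_camps), modularity(edges, one_blob))
```

The split into two triangles scores higher than lumping everything into one community, which is the kind of comparison a modularity-based algorithm makes repeatedly while it searches.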
09:04
And this is essentially the part where it gets interesting, because so far, all the other things you can do with a lot of other tools, especially Gephi. I think you can even collect tweets with some plugin. So I think all of this is not new, and this is kind of where it gets interesting. And I think this is time for a quick demo. I don't know how much.
09:25
OK, I have plenty of time. I think I talk too fast. OK, so I have prepared some Python environments that already have the Twitter Explorer installed, but usually you would do it like this.
09:53
And then all you need to do to fire up this interactive interface is type Twitter Explorer collector.
10:02
And this will open a browser window from which you can choose your API access, choose the path to which the tweets will be downloaded, and insert your search query, maybe adding some advanced settings and saving options.
10:21
So I don't know. This is a question to the audience now, what we should search for. This is easy, and you're looking into the future. I already have this network prepared for the last slide. Sorry. Can you look at the reaction to the Twitter API shutdown?
10:40
We could, but what would we look for then? Is there maybe a hashtag, like API shutdown? Maybe we need to go to Twitter itself, API something like this. Ideally, we would find some kind of hashtag. OK, let's just use maybe this as a search query.
11:12
OK, now it's collecting in the background. And then we can open another browser window here and fire up the visualizer.
11:30
Now we see that while this is still collecting, we can already access... Oh, there were only 400 tweets, so there seems to be...
11:40
So we can look at a time series of tweets, and then we can choose different types of networks to create. We can filter them by language if we want. And this is the language that the Twitter API returns.
12:03
So there is no language detection going on here. We can do some network reduction methods, like taking only the largest connected component of the graph. Then we have this option here to remove the metadata of nodes that are not what we call public figures.
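The reduction to the largest connected component mentioned above can be sketched with a breadth-first search over the links, ignoring their direction. This is an illustrative stand-alone version with a made-up edge list, not the tool's implementation.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest component, ignoring link direction."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


edges = [("a", "b"), ("b", "c"), ("x", "y")]
print(sorted(largest_connected_component(edges)))
```

Dropping the small disconnected fragments keeps the layout focused on the main body of the debate.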
12:24
So if you want to publish some explorable networks, it is advisable to do so. There is not, as far as I know, not a very distinctive or clear rule, after which point one is considered such a public figure. But within our consortium, we decided that it's 5,000 followers. This is also something we could discuss.
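The public-figure rule can be expressed as a simple filter over node metadata. In this sketch the 5,000-follower threshold comes from the talk, while the field names, the placeholder labels and the choice of "at or above" the threshold are assumptions made for illustration.

```python
PUBLIC_FIGURE_MIN_FOLLOWERS = 5000  # threshold quoted in the talk


def strip_private_metadata(nodes, threshold=PUBLIC_FIGURE_MIN_FOLLOWERS):
    """Keep metadata only for accounts at or above the follower threshold."""
    cleaned = []
    for index, node in enumerate(nodes):
        if node["followers"] >= threshold:
            cleaned.append(dict(node))
        else:
            # Replace everything identifying with an opaque placeholder label.
            cleaned.append({"label": f"node_{index}"})
    return cleaned


# Hypothetical node records.
nodes = [
    {"label": "big_news_outlet", "followers": 120000},
    {"label": "private_person", "followers": 40},
]
print(strip_private_metadata(nodes))
```

The node itself stays in the graph, so the structure is unchanged; only the information that would identify a private account is dropped.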
12:47
But since Twitter is public by default, in a way, anything you post is somehow potentially to be used and displayed somewhere. Then you can export the graph to all sorts of formats. Then you can aggregate nodes.
13:03
This means, for instance, removing them based on how many retweets they received or how many retweets they made themselves, and removing, for instance, nodes that only retweeted one person. So say you have a graph, and then there are some nodes that only retweet this one person.
13:39
I don't know if everyone can see that, but they tend to clutter the force-directed algorithm.
13:48
Structurally, they do not necessarily add anything to the network. So if you have very, very large graphs, it makes sense to remove these and somehow fold them into this supernode.
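One way to read this aggregation step: accounts that retweeted exactly one account and were never retweeted themselves are removed, and the retweeted account remembers how many of them it absorbed. The sketch below implements that reading; the function name and the exact criterion are assumptions, not the tool's actual rule.

```python
from collections import defaultdict


def collapse_single_retweeters(edges):
    """Fold accounts that retweeted exactly one account (and were never
    retweeted) into a per-target count; keep all other links."""
    targets = defaultdict(set)  # who each account retweeted
    retweeted = set()           # accounts that were retweeted at least once
    for source, target in edges:
        targets[source].add(target)
        retweeted.add(target)
    leaves = {s for s, t in targets.items() if len(t) == 1 and s not in retweeted}
    kept = [(s, t) for s, t in edges if s not in leaves]
    absorbed = defaultdict(int)
    for leaf in leaves:
        (target,) = targets[leaf]
        absorbed[target] += 1
    return kept, dict(absorbed)


# Made-up retweet links (retweeter, retweeted account).
edges = [("u1", "star"), ("u2", "star"), ("u3", "star"),
         ("hub", "star"), ("hub", "other"), ("star", "other")]
kept, absorbed = collapse_single_retweeters(edges)
print(kept, absorbed)
```

The absorbed count can then be used to scale the supernode, so the information about how many accounts amplified it is not lost.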
14:02
Then you can do traditional community detection. Then it will be saved as an HTML that you can then open. So we see here that this is, again now, in a retweet network, every node is a user and the link is drawn from A to B if A retweets B.
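The definition above (a link from A to B if A retweets B) translates directly into a weighted, directed edge list. A minimal sketch, assuming hypothetical `user` and `retweeted_user` fields on each tweet record:

```python
from collections import Counter


def retweet_edges(tweets):
    """Weighted directed links: (retweeter, original author) -> retweet count."""
    return Counter(
        (t["user"], t["retweeted_user"]) for t in tweets if t.get("retweeted_user")
    )


# Hypothetical records; original tweets have no retweeted_user.
tweets = [
    {"user": "alice", "retweeted_user": "bob"},
    {"user": "alice", "retweeted_user": "bob"},
    {"user": "carol", "retweeted_user": "bob"},
    {"user": "bob", "retweeted_user": None},
]
print(retweet_edges(tweets))
```

Mention and reply networks follow the same pattern with a different metadata field as the link target.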
14:31
Now we can look at this user, TChambers, and look at the actual tweets that were made for them to end up at this part of the visualization.
14:51
So the data we collect is kind of sparse, so this network doesn't look that interesting. But I have prepared some fallback options.
15:05
So what we did in a case study a few months ago was to look at the repercussion of some discussions in the US about red flag laws. Red flag laws are specific kinds of laws for gun control that allow state-level judges to
15:25
confiscate temporarily guns from people that are deemed to be a threat to themselves or to the public. These laws created very big repercussions, especially on social media and especially in the conservative camps.
15:41
And this is one typical example where people then can analyze on Twitter if there's something like echo chambers, or if people then maybe retweet each other only within similar camps. And then people draw conclusions very fast. And what we want to do with this tool is to show that maybe things are not as simple as they seem.
16:04
So I have prepared these networks. I think I will make it a bit smaller.
16:21
So this is now a bit bigger than what we had before. We have roughly 25,000 nodes and 90,000 links. And this is already one limitation of the tool that I would also like to discuss in the end: you can't display immensely huge graphs.
16:41
So 100,000 links approximately is kind of the limit. And I think this is also where integrating it with other tools such as Sigma or Gephi might actually make a lot of sense. And so now I can color the nodes by the Louvain community. We can turn off the light also.
17:03
And now we can wonder what are these two communities. And right now the node size is proportional to the indegree, meaning how often a given node was retweeted. So these may then be considered as something like the opinion leaders of the given camps.
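In-degree in a retweet network is just a count of how often each account appears as the target of a link. The sketch below computes it and maps it to a drawing radius; the square-root scaling and its constants are an illustrative choice, not necessarily what the tool does.

```python
import math
from collections import Counter


def in_degree(edges):
    """Count how often each account was retweeted (one edge per retweet)."""
    counts = Counter()
    for _source, target in edges:
        counts[target] += 1
    return counts


def node_radius(degree, min_radius=2.0, scale=3.0):
    """Radius grows with the square root of the in-degree (assumed scaling)."""
    return min_radius + scale * math.sqrt(degree)


# Made-up retweet links (retweeter, retweeted account).
edges = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "y")]
degrees = in_degree(edges)
print(degrees.most_common(1), node_radius(degrees["x"]))
```

Sorting by this count gives the candidate "opinion leaders" whose tweets are worth reading first.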
17:25
And so if we go here, we see for instance on this side Donald Trump Jr. And we can then look exactly at the tweets that led the visualization to put him where he was. So we don't need to go into the details of what he said, but you see the point.
17:46
We can also change the node size to the number of followers. And then we get an immediate view at who the main actors are that in general are also influential on Twitter. So we have the New York Times here and Wall Street Journal.
18:10
So we can see that we have something like a more liberal versus a more conservative camp. But if we look only at the retweet behavior, we might think that, okay, these are separated echo chambers and people do not talk to each other.
18:26
But what is interesting is if we look at other types of networks in this example. So we can look at the replies. I think I will make it a bit smaller. And all of a sudden we don't see this very strong segregated clustering anymore that we saw here.
18:45
Maybe it's easier if I put it in. But we see something more of a hairball layout. And when we look at the nodes, we see that indeed the path of going for instance from Donald Trump to Hillary Clinton or New York Times,
19:09
of those people that were very far apart in the retweet network, is maybe not that long in the reply network. Meaning that these opposing camps actually maybe do talk to each other. And it might be more interesting to see how they talk to each other and what they say.
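The claim about path lengths can be checked with a breadth-first search that counts hops between two accounts. This sketch ignores link direction and uses made-up edge lists to contrast a retweet network with a reply network; it is an illustration, not part of the tool.

```python
from collections import defaultdict, deque


def shortest_path_length(edges, source, target):
    """Hops between two accounts, ignoring direction; None if unreachable."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None


# Hypothetical camps: A and B are never bridged by retweets...
retweet_links = [("a1", "A"), ("b1", "B")]
# ...but a supporter of B also replies to A.
reply_links = [("a1", "A"), ("b1", "B"), ("b1", "A")]
print(shortest_path_length(retweet_links, "A", "B"),
      shortest_path_length(reply_links, "A", "B"))
```

A short path in the reply network between accounts that are disconnected or far apart in the retweet network is the signal that the camps do interact.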
19:22
And this is something that you can do when you use this interface and look at the tweets and the actual replies. I don't know if it's so nicely visible. So it allows you to then actually go to the parts of the platform that generate this data and that then generate these networks.
19:44
And finally, as a small example of the semantic networks, we can look at the hashtags that are used. Again, I'll make it smaller.
20:02
And you see that, for instance, there is one kind of hateful conservative hashtag cluster. And again, maybe I should have said that in the hashtag networks, every node is a hashtag and they are connected if they appear together in the same tweet. So this is a very low level way of seeing what is going on in the data in a way.
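The co-hashtag construction described above (hashtags as nodes, linked when they appear in the same tweet) can be sketched with pairwise combinations. The `hashtags` field, the lower-casing of tags and the sample tweets are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cohashtag_edges(tweets):
    """Undirected links between hashtags that co-occur in a tweet,
    weighted by the number of tweets in which they co-occur."""
    weights = Counter()
    for tweet in tweets:
        # Deduplicate and normalise case so each pair is counted once per tweet.
        tags = sorted({tag.lower() for tag in tweet["hashtags"]})
        weights.update(combinations(tags, 2))
    return weights


# Made-up tweets.
tweets = [
    {"hashtags": ["RedFlagLaws", "Marxism"]},
    {"hashtags": ["redflaglaws", "marxism", "GunControl"]},
    {"hashtags": ["GunControl"]},
]
print(cohashtag_edges(tweets))
```

Sorting the tags before pairing them gives every undirected link a single canonical key.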
20:25
You don't need to do some kind of topic modelling or complicated techniques. You can literally just by looking at the hashtags already get a hint at how the different camps speak about the same topic. So if you go here in this area, this is about gun confiscation laws.
20:45
So Marxism in this case is also a good example. Right now we don't really know how it is used. And it can be used either by conservatives or by liberals. And it's important to look at it in the context of the data. So then we would have to...
21:02
Okay, five minutes left. Good. I will go back to the slides. Okay, so under the hood, this whole backend of the collector and the visualizer is written in Python.
21:22
And it's using the Streamlit Python library to serve it on a local frontend. So this is actually a very convenient library. I guess a lot of people also know it. But you can write your code in Python and then it essentially serves it in interfaces that look like this.
21:46
And the Explorer is written in HTML and JavaScript. And it uses D3 and prints the graph on Canvas. Which is also why it's probably not as fast as Sigma is.
22:03
But it has some other nice features that are especially due to this force graph library. So I think if anyone has questions, I'm gonna go into the details in the questions anyways. So this is how you install it. It's fairly simple, if you have a running Python version above 3.7.
22:22
And there's also an API. So of course, especially here, probably people will not be so interested in using the Streamlit interface. But you may want to include it into some kind of existing code pipeline that you have. And this is essentially the API. For semantic networks and interaction networks.
22:44
So I invite you to try it out yourself while you still can. You have five days. Of course, if you have the research API, you might be able to use it for a bit longer. But otherwise, go on these websites fast.
23:03
And I will stop the talk with some questions. Actually, I came here with more questions than answers. And I'm really hoping for a lively discussion now. Because I'm not originally a developer, so I kind of wrote this a bit on my own. And I wonder if this integration of Python and JavaScript is actually a good idea.
23:22
Because in theory, it would also be possible to probably do everything in JavaScript. And maybe do it on the client side so you wouldn't have to install all these libraries. Then, okay, maybe one thing that I would like to show is that I experimented with temporal networks. So of course, doing temporal force layouts is kind of a non-trivial task.
23:44
But we can kind of look a little bit at the temporality of these networks. By at least displaying only the links that are active during a given day. So this is also kind of nice, I think. But I would like to discuss maybe other visualization paradigms for this kind of network.
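Showing only the links that are active on a given day is a filter over timestamped edges. A minimal sketch, assuming each link carries an ISO 8601 timestamp (the tuple layout and sample data are hypothetical):

```python
from datetime import date, datetime


def links_active_on(edges, day):
    """Keep only links whose timestamp falls on the given calendar day."""
    return [
        (source, target)
        for source, target, timestamp in edges
        if datetime.fromisoformat(timestamp).date() == day
    ]


# Made-up timestamped retweet links.
edges = [
    ("alice", "bob", "2023-02-03T09:15:00"),
    ("carol", "bob", "2023-02-04T11:05:00"),
    ("dave", "alice", "2023-02-04T23:59:00"),
]
print(links_active_on(edges, date(2023, 2, 4)))
```

Node positions stay fixed from the layout of the full network; only the set of drawn links changes from day to day.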
24:04
Then one thing that would be really interesting, I think, is to dig deeper into a visualization paradigm for hierarchical structure of communities. Meaning that, okay, in theory I can either run stochastic block models or Louvain community detection and stop them at a certain level.
24:20
And then have some kind of hierarchical node structure. But how to visualize that is another question, but I think it would be very interesting, especially for very large graphs. Then another question is force layouts. Should we still use them now that everyone is doing node2vec and all these other things? I think yes, but maybe there's good arguments against it.
24:41
And on a deeper conceptual level, and this first one is a question for people who already have much more experience in building tools for the social sciences: how do you further integrate these kinds of methods into existing, maybe also more qualitative social science pipelines?
25:01
So it's kind of an open question. And how can we devise something like a research protocol for these kinds of interactive network visualizations? Because as you saw in my demo, we look at the big nodes, we look at the tweets they made, and it gives us some kind of intuition of what's going on in the debate. But how can we formalize such kinds of visual network analyses?
25:23
And I think, I mean, there's people in the audience who actually work on this, so it would be very interesting for me to talk about this. And finally, to end on actually maybe a bit nicer note, is that there is the network of FOSDEM, as we had already said in the beginning, on this website.
25:42
So it is updated every 15 minutes, thanks to a data collection done by my colleague Diatrice. Thank you very much. And so if you go on this website, you can see the retweet network of FOSDEM. And if you tweet, then you can find yourself in the network also.
26:05
So yeah, what do we have here in the middle? Okay, FOSDEM itself. And there was Ubuntu, Debian, somewhere. Okay, time's up. Thank you.
26:54
Yes, so the question is, I mentioned that you can only collect tweets from the last seven days.
27:00
With the free API, this is a limitation. But the tool itself just writes into existing CSV. It depends, basically. So if you do the same keyword search multiple times, then it will just append to a CSV. Yes, I mean, this is the question right now. It depends, because the question is, what happens on Mastodon?
27:28
I don't know. All these, like if you want to look at political controversies and such discussions, I don't know if Mastodon is mature enough yet to, or adopted enough yet.
27:40
I think if you want to look at the FOSDEM community, it's great. So for this, yes.
28:26
I'm from the University. I'm served at a conference of citizens with human people, same in collection, and not computers. We can choose our computer only for people to be able to learn so often. So I'm very reactive about this kind of thing.
28:42
I don't know what else to think about that. Well, I don't know, which point exactly should I address? Because you raised a lot of... Okay, if I can rephrase. So you are concerned about this kind of research also? Yes, because it can be used to track users across political camps.
29:10
Yes, okay. I see. So I think it's more about the representativeness of Twitter data for the wider population,
29:21
where, of course, you're totally right. It is kind of a subset of highly politicized, maybe also a bit more educated than average people. But this is also not what we're trying to do. We're not trying to infer, I don't know, actual election results based on Twitter data. So yeah, I don't know if I addressed the point.
29:43
Maybe we can talk more about it. Right. Thank you.