Mapping the Chatter: Spatial Metaphors for Dynamic Topic Modelling of Social Media

Formal Metadata

Title
Mapping the Chatter: Spatial Metaphors for Dynamic Topic Modelling of Social Media
Title of Series
Number of Parts
351
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2022

Content Metadata

Subject Area
Genre
Abstract
Topic modelling is a branch of Natural Language Processing that deals with the discovery of conversation topics in a document corpus. In social media, it translates into aggregating posts into topics of conversation and observing how these topics evolve over time (hence the “dynamic” adjective [Murakami, 2021]). Conveying the results of topic modelling to an analyst is challenging, since the topics often do not lend themselves naturally to meaningful labelling and the relationships between them can involve hundreds of dimensions. Furthermore, the popularity of topics is itself subject to change over time. In this paper, we propose a spatialization technique based on open-source software that reduces the intrinsic complexity of dynamic topic modelling output to familiar topographic objects, namely ridges, valleys, and peaks. This offers new possibilities for understanding complex relationships that change over time, and overcomes issues with traditional topic modelling visualisation approaches such as network graphs [Karpovich, 2017]. Spatialization [Fabrikant, 2017], a technique that uses spatial metaphors to aid cognitive tasks, has been a research field since the early ’90s. It can be used to make sense of vast amounts of information by reducing them to a physical landscape. In this work, we consider spatialization of topics in a 3D space where the X-axis is the similarity of topics posted on the same day, the Y-axis is the similarity of topics across time and how their relationships evolve, and the Z-axis is a measure of the topic popularity. With this approach, a topic is reduced to a single point in a 3D space, and the interpolated surface constructed out of these points becomes a landscape with peaks, ridges, and valleys. More precisely, the “valleys” represent less popular topics, the “peaks” represent the more popular ones, and flat surfaces indicate topics of average popularity.
Our team is working on the Australian Data Observatory project, which has been collecting tweets and other social media posts (Instagram, Reddit, YouTube, Flickr, etc.) related to Australia for the last 12 months. Through the use of the new Twitter academic license, the project is harvesting tens of millions of tweets per month. The social media posts are stored and analyzed daily using the deep learning BERTopic package. The BERTopic output is then stored and served through a ReST API, which is used by different clients (at present these are Jupyter notebooks and a web application). The intended audience of our platform is composed of domain researchers, including social scientists, linguists, and data journalists. The goal is to support big data exploration at scale and overcome the smaller-scale cottage industry of social media research that has hitherto been the norm in academia in Australia.
Topic modelling is often presented using 2D visualizations, such as circles with size proportional to topic popularity and position related to the similarity between topics. The dynamic (temporal) aspect of topic evolution is typically shown with animations that show how topics morph into different ones and wax and wane in popularity, or it is ignored completely and researchers just use static topic modelling visualisations. There is merit in trying a different approach for dynamic topic visualisation: namely, to map the social media landscape to the physical one, as this metaphor allows the simultaneous appreciation of time, topic similarity, and popularity while allowing, via zoom operations, the aggregation/disaggregation of topics into bigger/smaller clusters of posts. This 3D landscape naturally aids the end-user in understanding complex, highly dimensional data at a scale and volume that would otherwise be impossible.
The formation of islands, archipelagos, mountain ranges, or valleys related to mainstream topics such as Covid, vaccination, and lockdown, through to geopolitical events such as the invasion of Ukraine, provides a finger on the pulse of what is being discussed at scale by the broader population across the social media landscape. This approach is currently realised using a web application that enables the “topographic” exploration of the topic landscape, with functions to improve the user experience in the areas of topic labelling and inter-topic distance. There are a few criticalities in the proposed visualization: the distance between topics has to be drastically reduced in dimensionality, from the many dimensions provided by the deep learning model to just one (the X-axis); the Y-axis (time) has to be put in relation to a completely different measure (distance between topics) to make it amenable to interpolation; topic popularity (the Z-axis) has a huge variability, leading to irregular surfaces, hence the need for a non-linear scaling of the Z-axis; and communicating the meaning of each topic to the user is difficult, as the top terms of each topic may not be meaningful to a human and make for a poor label. The proposed processing and visualization are developed using only open-source tools and frameworks, leveraging the work of the open-source geospatial community. All the software developed in the course of the Australian Data Observatory project is released under the Apache 2.0 license and is available through the University of Melbourne GitLab source code repository.
Keywords
Quantum chromodynamics, Dialect, Asynchronous Transfer Mode, Local Group, Mathematical model, Social software, YouTube, Flickr, Algorithm, Standard Model, Natural language, Bit error rate, Googol, Twitter, Database, Process (computing), Spacetime, Vector space, Level (video gaming), Presentation of a group, Area, Machine learning, Hypermedia, Moment (mathematics), Dynamical system, Projective plane, Order (biology), CASE <Informatik>, Mixture model, Visualization (computer graphics), Dimensional analysis, Navigation, Digitizing, Right angle, Tesselation, Computer animation
Plot (narrative), Time evolution, Radius, Dimensional analysis, Circle, Instance (computer science), Semantics (computer science), Number, Dynamical system, Hypermedia, Distance, Plotter, Analytic continuation, Point (geometry), Surface, Cartesian coordinate system, Evolute, Quantum chromodynamics, Visualization (computer graphics), Point cloud, Multiplication sign, Mathematical model, Computer animation
Term (mathematics), Bit, Instance (computer science), Computer animation
Surface, Algorithm, ASCII, Dimensional analysis, Estimator, Population density, Kernel (computing), Electronic visual display, Computer icon, Twitter, Auditory masking, Level (video gaming), Hypermedia, Visualization (computer graphics), Computer animation
Transcript: English (auto-generated)
Thank you. So the presentation is titled Mapping the Chatter. What does this mean? It means that we'd like to present a novel visualization technique for dynamic topic modeling. So what is dynamic topic modeling, first of all?
It's a natural language processing technique. Right, it's a problem area in which some algorithms are used to determine what a text is about: its topic, or a mixture of topics in the case of longer documents. Now, we apply this to social media posts, which are very short.
So basically, it's what people are talking about in the modern piazza of social media. To do this, we have to collect many social media posts. Then we use an algorithm to determine what each post is about. And then we cluster them in order to determine the topics and the popularity of every topic.
So this is what we do at the Australian Digital Observatory, which is a project jointly funded by the University of Melbourne, University of Technology, University of Queensland, and some other entities.
We collect 400,000 social media posts every day, mainly from Twitter, but also YouTube, Flickr, and stuff like that, about Australia, from Australians or from people that are living in Australia at the moment.
So we have collected 121 million posts so far. We store them in a cluster database. Every night, there is a topic modeling algorithm that runs through those data. It's a deep learning algorithm based on Google's BERT language model.
And then we determine the topics, and we cluster them. Now, there is one more problem, though. How do you visualize these topics? Because every topic is a vector in 384 dimensions,
which means that unless you are a Spacing Guild navigator, you cannot conceptualize that. Traditionally, this has been the visualization used. So you reduce the number of dimensions from 384 to 2.
And then you plot them. So every circle is a topic, and the size of the circle is the popularity, that is, the number of social media posts for that specific topic. And you do it for every day. Now, it's simple enough.
The problem is when you have dynamic topic modeling, so you want to see the evolution through time. Then it's more complicated, because you need to have so many plots to look at. So we thought about doing something different, that is, using a spatial metaphor. So on the x-axis, you have time.
On the y-axis, you have the distance between topics, semantic distance. So a topic on the Russian-Ukrainian war will be, say, closer to a topic of the Ukrainian economy than to one on the US economy, for instance.
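The idea of semantic distance can be illustrated with a small sketch. The talk does not say which distance measure is used, so the cosine distance below is an assumption, and the four-dimensional vectors are made-up stand-ins for the real 384-dimensional topic vectors.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical topic vectors (illustration only, not real model output).
topics = {
    "russia_ukraine_war": [0.9, 0.8, 0.1, 0.0],
    "ukraine_economy": [0.7, 0.3, 0.6, 0.1],
    "us_economy": [0.0, 0.1, 0.7, 0.9],
}

war = topics["russia_ukraine_war"]
d_ukraine = cosine_distance(war, topics["ukraine_economy"])
d_us = cosine_distance(war, topics["us_economy"])

print(f"war vs Ukrainian economy: {d_ukraine:.3f}")
print(f"war vs US economy:        {d_us:.3f}")
```

With these toy vectors the war topic ends up closer to the Ukrainian economy topic than to the US economy topic, which is the ordering described in the example above.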
And then you have the z-axis, which is the topic popularity, that is, the number of social media posts for that topic for the day. You then have a point cloud, basically. You drape. Sorry, you don't drape. You interpolate that with a continuous 3D surface,
and then you have something that resembles a physical landscape. So for instance, this is the Ridge of Batman. So this over there is the lead-up to the release of The Batman, the movie. And you see these topics over there: book, chapter, writer.
The one on the ridge is movie, Batman. These are the top terms, yep. So if you look a little bit, that's at the beginning of March. You see here, this is Musk. We call it the Musk peak, because apparently Musk tweeted about his buying of Twitter.
And you see that topic (Twitter, tweet, social media) at the bump of popularity. And you see over there the Ridge of Batman. And how did we do it? We reduce the dimensionality from 384 to 1 using UMAP, then use a kernel density estimator for the surface.
And then we use QGIS to do the visualization, using the Qgis2threejs plugin. And that's all.
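The surface step can be sketched in a few lines of plain Python. This is not the project's code: `popularity_surface`, the sample points, the bandwidth, and the `log1p` scaling are all illustrative assumptions. The sketch takes points that have already been reduced to (day, position on the distance axis, number of posts) and sums a Gaussian kernel around each one, weighted by log-scaled popularity, which amounts to an unnormalised weighted kernel density estimate.

```python
import math


def popularity_surface(points, xs, ys, bandwidth=1.0):
    """Interpolate (x, y, posts) points onto a grid with a Gaussian kernel.

    Returns a list of rows where grid[i][j] is the height at (xs[j], ys[i]).
    """
    grid = []
    for y in ys:
        row = []
        for x in xs:
            height = 0.0
            for px, py, posts in points:
                d2 = (x - px) ** 2 + (y - py) ** 2
                weight = math.exp(-d2 / (2.0 * bandwidth ** 2))
                # log1p is an assumed non-linear scaling to tame huge
                # differences in popularity between topics.
                height += weight * math.log1p(posts)
            row.append(height)
        grid.append(row)
    return grid


# Hypothetical data: (day, position on the distance axis, number of posts).
points = [
    (0, 2.0, 100), (1, 2.0, 400), (2, 2.0, 1600), (3, 2.0, 6400), (4, 2.0, 3000),
    (3, 7.0, 50000),
]
xs = list(range(5))   # days 0..4
ys = list(range(10))  # positions 0..9 on the distance axis
surface = popularity_surface(points, xs, ys)

ridge = surface[2][3]   # day 3, on the long-running topic
valley = surface[5][3]  # day 3, between the two topics
peak = surface[7][3]    # day 3, on the one-day burst
print(f"ridge={ridge:.2f} valley={valley:.2f} peak={peak:.2f}")
```

With these toy values, a topic that stays popular over several days shows up as a ridge along the time axis, a one-day burst as an isolated peak, and the space between them as a valley. A grid like this could then be exported as a raster for display in a GIS tool.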