SentText: A Tool for Lexicon-Based Sentiment Analysis in Digital Humanities
Formal Metadata
Title: SentText: A Tool for Lexicon-Based Sentiment Analysis in Digital Humanities
Number of Parts: 14
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/52968 (DOI)
Production Year: 2021
Production Place: Regensburg
Transcript: English (auto-generated)
00:01
I hope everyone sees my slides, and I will, as you hear, switch to English. Do you see my slides? Okay. So my name is Thomas Schmidt from the Media Informatics Group of the University of Regensburg, and I will present SentText, a tool for lexicon-based sentiment analysis in digital humanities.
00:25
So, as was already said, Johanna Dangel was the primary developer of this tool, and Christian Wolff was the supervisor of the thesis connected to this tool's development. And yes, this is originally a demo/short paper contribution.
00:43
So I will talk a little bit about the background and how we approached this tool's development, but I will mainly focus on showing the tool and using it. I will start with a little bit of background on sentiment analysis. Here you see just a standard definition of sentiment analysis.
01:04
Also called opinion mining, it is the field of study that analyzes people's opinions, sentiments, appraisals, attitudes, and emotions, mostly in written text. Overall, the idea of sentiment analysis is to try to predict the sentiment expressed in a text unit.
01:23
Whether it is rather positive or rather negative, rather neutral, or something like mixed. And that is the basic premise of this method. It has become a rather popular method in recent years in the text mining community and in information science in general.
01:40
And it is applied in various research areas like social media analysis, the analysis of product reviews, sentiment analysis on Twitter, and similar research branches. If you were to approach sentiment analysis nowadays, the state-of-the-art idea would
02:00
probably be to use some sort of large pre-annotated corpus with sentiment annotations. For example, say you want to do Twitter sentiment analysis: there are a lot of large Twitter corpora with annotated sentiment per tweet. You would use these as a training corpus and then run some machine learning algorithm to train a model.
02:24
Probably nowadays with something like a large pre-trained model such as BERT or similar. Of course, not every topic and not every research area has large amounts of annotated corpora. And of course, especially nowadays, the modern machine learning approaches are not as transparent as other methods.
02:47
So there is also another, still very popular method, which is a little bit simpler: lexicon-based methods. And this is also what is still quite popular in the DH community, due to a certain
03:03
lack of annotated corpora, although there are developments going more in the direction of machine learning. Briefly, about lexicon-based approaches to sentiment analysis, since this is the main idea of our tool: it is a rather simple concept.
03:23
Instead of having a pre-annotated corpus, you have a pre-annotated list of words, so-called sentiment-bearing words, which are annotated with respect to their sentiment polarity. So whether a word is rather positive or negative in its general usage, and
03:40
this can be a number like plus one or minus one, or a value on some metric scale. You use these lists, which are created in various ways, sometimes expert-based, sometimes through some sort of semi-automatic process. And then you perform rather trivial calculations for a text unit.
04:01
Here is just a very simple example. You simply count the words detected as positive, you count the words detected as negative, and then you perform some basic mathematical calculations to get an overall value for the polarity, the expressed valence of the text. Of course, there are approaches to make this more sophisticated. You can try to account for
04:26
negation words, you can try to look at valence intensifiers, and you can perform lemmatization to improve this. The overall premise remains a basic calculation over the words detected in the text.
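To make this concrete, here is a minimal Python sketch of such a basic calculation. The tiny lexicon and the aggregation formula are illustrative assumptions, not the exact ones any particular tool uses:

```python
# Minimal sketch of a lexicon-based sentiment calculation.
# The lexicon entries and the aggregation are illustrative assumptions.
lexicon = {"wunderbar": 1.0, "glücklich": 0.7, "schrecklich": -1.0, "traurig": -0.6}

def score_text(text, lexicon):
    tokens = text.lower().split()
    values = [lexicon[t] for t in tokens if t in lexicon]
    positives = sum(1 for v in values if v > 0)
    negatives = sum(1 for v in values if v < 0)
    polarity = sum(values)  # one common aggregate: the sum of word polarities
    return positives, negatives, polarity

print(score_text("Das Leben ist wunderbar aber manchmal traurig", lexicon))
# -> (1, 1, 0.4)
```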
04:41
As I already mentioned, due to the lack of corpora, and because it is very transparent and very easy to perform, lexicon-based approaches were for a long time the approaches to use in the DH context when performing sentiment analysis. Just some examples of what people are doing: they are, for example, looking at the most negative sentiment words
05:08
used in plays by Shakespeare, for example here in Hamlet. They look at relationships between characters. This is a graph that visualizes the accumulated sentiment in
05:23
speeches expressed by a specific character in a Shakespeare play, Othello, towards Desdemona, and you can see in the visualization that the accumulated sentiment becomes more and more negative, which is in line with the actual content of the play.
05:40
And similar things have also been explored in various other areas close to digital humanities, like literary studies, but also for historical language and other research areas. The idea is the same. We also contributed to this branch of research. We explored
06:05
lexicon-based sentiment analysis for another tool we developed for quantitative drama analysis, which I will shamelessly plug here. In this specific case, we performed lexicon-based sentiment analysis on plays by Lessing, we evaluated approaches, and we explored visualizations.
06:25
We developed a web tool to explore sentiment analysis in Lessing's plays, a tool that performs analyses like this and produces visualizations like this. This is, for example, a bar chart visualizing how the sentiment becomes more and more negative over the course of a
06:45
play by Lessing. This is something you can identify in all of his plays. And we developed a lot of visualizations for this method. But as we did this, and as more users used the tool and gave us feedback on what is interesting and so on,
07:02
the feedback we received pointed more and more towards the idea that people wanted to perform their own lexicon-based sentiment analysis, and people also wanted to have more transparency about how these results actually come about. So we saw a lack of tools in this context, and we decided to actually
07:25
develop such a tool that people could use to perform their own lexicon-based sentiment analysis. And the idea was to focus primarily on the DH community. We wanted to make it as accessible as possible.
07:41
My impression is that tools that need a lot of installation and have a lot of dependencies are not used as frequently as tools that are more accessible, like web tools. And we also wanted to provide some transparency about how the calculations come about.
08:02
We also applied some methods of the user-centered design process, and we performed a requirements analysis. We integrated methods of usability engineering. For example, we performed interviews with a couple of people from the DH context, but also from other backgrounds, with the goal of making the tool as
08:31
adapted to the specific community we are designing it for as possible. I will not go into the details of all of these methods, since I really want to show the tool.
08:43
Just some of the requirements we gathered through this process are shown here. As I already said, people primarily wanted to use their own material. They wanted to adjust lexicons. As you can imagine, text types in DH are usually not contemporary language, but most of the time historical language or other domain-specific language.
09:07
And people wanted to adjust this. They wanted transparent results. The web is seen as a very accessible platform to use, which of course causes a lot of problems.
09:23
This is the idea we followed here. I have summarized some of the overall functionality. We do perform some advanced things with our tool, like lemmatization and negation handling. Right now everything is focused on German, but we plan to extend this to other languages as well.
09:42
Instead of talking about the functionality, I will actually show the tool in a live demo. It will actually also be the first time that I explore this specific text. And as you all know, in IT nothing can go wrong with a live demo.
10:04
Let's look at the tool. I hope you all see my browser now. Can you give me some quick feedback?
10:21
This is the start page of SentText. We integrated a lot of documentation and explanations, as we have learned that people are very interested in all of this. We explain the different pre-processing steps, what types of data you can upload, what you can download, and so on. If you go to the sentiment analysis part of the tool, you can then indeed upload your files and perform sentiment analysis.
10:50
We offer some basic German sentiment lexicons, but there is also the possibility to upload your own lexicon if you follow a specific data standard, a rather simple one, which you can read more about in the documentation of the tool.
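I am not reproducing SentText's actual upload standard here; as a hypothetical illustration, such a simple lexicon standard often amounts to one word and one polarity value per line, which could be loaded like this (see the tool's documentation for the real specification):

```python
import csv

# Hypothetical lexicon file format, one "word<TAB>polarity" pair per line:
#   wunderbar     1.0
#   schrecklich  -1.0
def load_lexicon(path):
    with open(path, encoding="utf-8") as f:
        return {word: float(value) for word, value in csv.reader(f, delimiter="\t")}
```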
11:07
Among the advanced options, as I already said, we can perform lemmatization with an off-the-shelf German lemmatizer. This process is very time-consuming.
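The talk does not name the lemmatizer. As an illustration of what off-the-shelf German lemmatization looks like, here is a sketch using spaCy; this is an assumption for demonstration, not necessarily the library behind SentText:

```python
import spacy

# Requires the German model: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

def lemmatize(text):
    # Reduce inflected forms to dictionary forms, so that e.g. "ging"
    # or "gegangen" can match a lexicon entry for "gehen".
    return [token.lemma_ for token in nlp(text)]

print(lemmatize("Die Häuser standen im Regen"))
```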
11:21
Again, since we are on the web, performance is a thing, so to speak. But you also have some other adjustment possibilities, like stopword lists, and you can use negations and shifters and so on.
11:41
I looked for an interesting example where I hoped to find some interesting results, for state-of-the-art research, so to speak. What we will do is compare German rap lyrics with German Schlager lyrics. I looked hard for an English word for Schlager, but I did not find one; it seems to be "Schlager" in English as well.
12:04
I will prepare the sentiment analysis. This might take a little time, so of course I prepared it beforehand. Just to give you an insight into how this looks: this corpus is basically a list of lyrics from various artists of the respective rap or Schlager genre.
12:26
I don't know much about rap, but I do have some artists here, yes, some well-known German Schlager artists. It is not super representative, but of course I just want to show how the tool is applied.
12:40
So please keep in mind that this is not a study about German lyrics or anything like that. Nevertheless, if you perform the analysis, what you get is a screen like this. On the left side, in this specific case, the entire corpus of rap lyrics and the entire corpus of Schlager lyrics are each treated as one document, so to speak.
13:07
So we just compare the two documents with each other. You can, of course, import more documents and then create so-called folders to compare document collections with each other. Here we will just focus on this setup. We are currently looking at the Schlager results.
13:27
You get a normalized score, and it is always very, very small, since it is normalized by the number of tokens, which helps to compare documents of different sizes.
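As a plausible sketch of such a token normalization (the exact formula is described in the tool's documentation, not confirmed here):

```python
def normalized_score(positive_hits, negative_hits, token_count):
    # Dividing by the document length yields very small values,
    # but makes documents of different sizes comparable.
    return (positive_hits - negative_hits) / token_count

# e.g. a 10,000-token corpus with 320 positive and 280 negative hits:
print(normalized_score(320, 280, 10_000))  # 0.004
```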
13:51
Nevertheless, the overall impression is indeed that the Schlager texts look less negative than the rap texts, although the difference is quite small. We produce some visualizations that you can explore.
14:02
For example, a pie chart of the detected negative words and the detected positive words. You can look at the strongest sentiment-bearing words; this is just for Schlager in this example. So these are the very positive words that are most frequently used in Schlager.
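A ranking like this can be produced with a simple frequency count over the lexicon hits; a minimal sketch, reusing the illustrative lexicon from above:

```python
from collections import Counter

def top_sentiment_words(tokens, lexicon, positive=True, n=10):
    # Keep only tokens found in the lexicon with the requested polarity,
    # then rank them by frequency.
    hits = [t for t in tokens if t in lexicon and (lexicon[t] > 0) == positive]
    return Counter(hits).most_common(n)
```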
14:23
These are the most negative words used in Schlager. For the word "Hölle" (hell), I think there is one very specific song that is the reason for this result here. And you can explore other analyses. We also tried to connect this with a little bit of close reading, which is actually something
14:45
that, as we found out, users really like: to explore how the sentiment analysis actually works. In the text panel on the right, you can explore your document and look at the specific words that were detected and what sentiment value they actually have.
15:03
So this is actually the part people often go into, and they look at it and say, okay, a wrong word was detected, this word shouldn't be positive, and so on. For some words the results sound intuitive. I was looking a bit for an example where negation comes into play.
15:27
But I'm not sure if I can find one off the top of my head. So, if a negation word is close to a sentiment-bearing word, the valence gets flipped to the other polarity.
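As a sketch of how such a negation rule can work (a simple window-based polarity flip; the tool's actual rules may differ):

```python
NEGATIONS = {"nicht", "kein", "keine", "niemals"}

def score_with_negation(tokens, lexicon, window=3):
    # Flip the polarity of a sentiment-bearing word if a negation
    # occurs within the preceding `window` tokens.
    total = 0.0
    for i, token in enumerate(tokens):
        if token in lexicon:
            value = lexicon[token]
            if any(t in NEGATIONS for t in tokens[max(0, i - window):i]):
                value = -value
            total += value
    return total

print(score_with_negation("das ist nicht schrecklich".split(), {"schrecklich": -1.0}))
# -> 1.0: the negation flips the negative word to positive
```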
15:43
Yes, but this is something that people actually quite like. You can also compare documents with each other. In this case, we would compare rap to Schlager, for example. For example, if I want to compare the most negative words in Schlager and the most negative words in
16:04
rap, you can look at something like this and then get an overall impression of what is happening. I would argue that some of the results here are already telling of the specific genre. Of course, you can also look at this from a more quantitative standpoint, at the specific word distribution.
16:29
Here you would indeed see that there are more negative words, 46%, than in Schlager, which apparently is a bit more positive, according to this specific method and all the limitations that are connected to it.
16:49
Yes, so I have all the links in my presentation, and I will also share the presentation later on. Of course, you will also find all the links to the tool and so on in the specific paper. I think that is always the most fun with tools: to explore them yourself.
17:09
So much from my side; we also performed a usability test. I will try to go back to the slides here. I hope you all see my slides again.
17:25
It was a rather small usability test, but the feedback was rather positive, with very good results. Of course, we had some sort of iterative development, so we were constantly trying to improve the tool.
17:42
Overall, there are still a lot of missing features. Other than that, I am rather glad the tool exists. It seems to be used rather often; the access numbers are surprisingly high. I don't know how all these people find the tool. Every now and
18:02
then, I receive some mails. What people would still like most is, of course, more lexicons and more languages: could you integrate something for Spanish, and so on? This is something to do. And of course, some sort of user management. Right now, there is more or less none of that.
18:22
We basically save nothing from a user. We don't even save the text, so you can't save your overall dashboard. You can save the PNGs, you can download some tables with the results, and you can download an XML file with your results, but you can't save your entire process, so to speak.
18:47
Overall, since I also do sentiment analysis with other methods, I would say that if you really want to do a rigorous scientific research project performing lexicon-based sentiment analysis,
19:03
you usually need more control than this tool can offer. But I still think this tool has its value. You can explore a text type, get first results, get first insights, and get a first understanding of where your specific texts might be problematic.
19:21
I also think this tool is very nice to use for educational purposes, just to show the tool to students so they can explore sentiment analysis and its methodology. If you want more information, the good friends at forTEXT created a very extensive
19:43
tutorial for the tool, which I can recommend if you want to learn more. Other than this, I thank you for all your attention. I thank Johanna for the great tool. Her contact data is also here, and I hope you had some fun with the talk.
20:04
Thank you very much.