NLP Application in Cases of Violence Against Women
Formal Metadata
Number of Parts | 131
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/69483 (DOI)
EuroPython 2024, 40 / 131
Transcript: English (auto-generated)
00:04
Okay, thank you. Welcome, everybody, to my talk. Today I will talk about natural language processing applied to cases of violence against women. This is the agenda
00:23
for today. A little about myself: I'm from São Paulo, Brazil, and my journey into the world of data began back in 2018. I am a data scientist, and I became an active member
00:46
of PyLadies São Paulo in 2019, and I'm very grateful for this community. I started to
01:03
teach programming to teenagers in Brazil, especially in some suburban regions. Well, why this topic? For us women, it's a normal situation, especially for us in South America, but I
01:24
think these problems exist around the world. Every single day in Brazil, you can read in the newspapers about violence against women. As I mentioned before, I'm a data scientist, but first of all, I am a woman, and when I read this news about cases of violence
01:50
against women, I felt this violence against me as well. I think every single woman in this room has felt the same. I hope so. And that's why I thought I could use my technical knowledge
02:07
to understand this social problem. For this reason, I started researching data on domestic violence, but to my surprise, I didn't find any data about this situation.
02:24
We have a problem. This is the situation, some news about the violence. One in three women has suffered this, and I didn't find the data. It's impossible, no? I know this
02:45
data is very sensitive, and there are some issues around it, but I needed to find a solution for this problem because I wanted to identify patterns in these reports,
03:02
because when a woman speaks to, for example, the police, women talk in the same way, but, you know, her feelings are much more than that situation. And for
03:21
this reason, I had a good idea, you know: I thought, oh, I can collect this data from newspapers, online newspapers. I collected only Brazilian Portuguese
03:42
because it's my reality. But this first idea, collecting the data from newspapers, didn't work, of course, because it's a little complex to separate the woman's report from the journalist's text. But I am Brazilian, you know, you fight all the
04:06
time, and I had the second idea: oh, I can collect from YouTube too. YouTube has a lot of videos. But first of all, I needed to know if there were videos of women talking
04:28
about their experience of violence, because I know it's very sensitive. So first of all, I did a manual search using some keywords such as report,
04:45
violence and women. And look here, I put it in English, but in Portuguese I found a lot of videos about this problem. Luckily for me, I located the videos. Unfortunately for these women, I found these videos. I collected only Brazilian Portuguese, and to
05:09
my surprise, I found several videos about this situation. But if you work with data, you know that when you collect videos, it's so hard to find the patterns
05:25
in these videos, because it is unstructured data. And I had the third idea: oh, I can collect these in an automated way. That's when I looked for APIs that could help
05:45
me collect these videos. And YouTube has an API that allows you to collect these videos. In addition, YouTube has rich and deep material about these reports because
06:06
there are a lot of channels from groups of women that support other women, and they felt comfortable talking about their experience. And for me, YouTube has,
06:26
this is very rich material because I wanted to identify patterns because, you know, especially for us women, there are a lot of patterns in this situation. If you are
06:43
poor or rich, it doesn't matter. It's the same. At this stage, I collected 119 videos, but most of them didn't talk about domestic violence, and my focus was domestic violence,
07:02
because there are a lot of other kinds of violence against women. But for my master's research, I used only domestic violence. After I collected this data, I needed to transform
07:22
the unstructured data into structured data. That's when I had the idea to use Whisper. Whisper is from OpenAI. Whisper is an automatic speech recognition model. And that was
07:40
exactly what I needed to transcribe my videos to text. In other words, I transformed my unstructured data into structured data. This part was very hard for me because I needed to listen to every
08:04
single video to confirm that the video talked about domestic violence, and I needed to validate transcription by transcription. To my surprise, Whisper works very well, especially because,
08:25
you know, Whisper works very well in English, for example, but in Brazilian Portuguese, to my surprise, it also works very well, almost 100%. But my mind, no, my mind didn't work well,
08:43
because I listened to every single video, and they talk about violence. I guess, not especially me, but for women, you know, I felt very sad in this period. But let me explain my data a little bit.
09:05
For example, with the YouTube API you can collect some metadata, and I collected the video ID and the video title, plus the original transcription and the validated transcription. The fourth one here is the transcription that I needed to check, transcription by transcription.
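The collection step and the four columns just listed can be sketched as follows. This is a minimal illustration, not the code used in the talk: it assumes the YouTube Data API v3 `search` endpoint, and the names `VideoRecord`, `build_search_url`, `records_from_search_response`, the query string and the sample response are all hypothetical. No network call is made; the sample dictionary only mimics the shape of a search response.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Public endpoint of the YouTube Data API v3 search method.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


@dataclass
class VideoRecord:
    """One row of the dataset: the four columns mentioned in the talk."""
    video_id: str
    title: str
    original_transcription: str = ""   # raw speech-to-text output, filled in later
    validated_transcription: str = ""  # manually checked text, filled in later


def build_search_url(query: str, api_key: str, max_results: int = 50) -> str:
    """Build a search URL restricted to videos relevant to Brazilian Portuguese."""
    params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,   # the API accepts at most 50 per page
        "relevanceLanguage": "pt",
        "regionCode": "BR",
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def records_from_search_response(response: dict) -> list[VideoRecord]:
    """Turn a decoded search response into empty-transcription records."""
    records = []
    for item in response.get("items", []):
        video_id = item.get("id", {}).get("videoId")
        if video_id is None:  # skip anything that is not a video result
            continue
        records.append(VideoRecord(video_id=video_id, title=item["snippet"]["title"]))
    return records


# Hypothetical response with the same shape the API returns.
sample_response = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "abc123"},
         "snippet": {"title": "Relato de uma sobrevivente"}},
        {"id": {"kind": "youtube#channel", "channelId": "chan1"},
         "snippet": {"title": "Canal de apoio"}},
    ]
}
records = records_from_search_response(sample_response)
```

Keeping the original and the validated transcription as separate fields makes it possible to compare the automatic output with the manually checked text later.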
09:32
This part took a while. I needed to use Google Colab because I needed a GPU,
09:43
because I used the biggest Whisper model, for Portuguese. It took more or less 90 hours to transcribe these videos. Well, after that, ah, this is very important, the fourth one.
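A minimal sketch of the transcription step, assuming the open-source `openai-whisper` package; the talk does not show its code. The function names and the audio file name are illustrative. The model is passed in as an argument so the helper can be tried without downloading the large checkpoint.

```python
from typing import Any, Protocol


class SpeechModel(Protocol):
    """Anything with a Whisper-style transcribe method."""
    def transcribe(self, audio: str, **kwargs: Any) -> dict: ...


def load_whisper(size: str = "large") -> SpeechModel:
    """Load a Whisper checkpoint; needs `pip install openai-whisper`.

    The import is done lazily so this module can be imported without the package.
    Whisper uses a GPU automatically when one is available.
    """
    import whisper
    return whisper.load_model(size)


def transcribe_portuguese(model: SpeechModel, audio_path: str) -> str:
    """Transcribe one audio file, telling the model the speech is Portuguese."""
    result = model.transcribe(audio_path, language="pt")
    return result["text"].strip()


# Usage (assumption: the audio track was already extracted to a local file):
# model = load_whisper("large")
# text = transcribe_portuguese(model, "video_audio.mp3")
```

Fixing `language="pt"` skips automatic language detection, which is a reasonable choice when every video is known to be in Portuguese.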
10:08
I took out some parts of the text, because in some parts people say, for example, "subscribe to my channel" or other things, or someone asks the woman,
10:22
"oh, talk about your experience". I removed these parts. The fourth column, the validated column, is the text of the women, the report of each woman. After that, I needed to start analyzing the data, but first I needed to standardize it,
10:46
because we have some problems when you use text for analysis. This part was very important: when I removed the stop words, I prepared
11:06
my own list for this part, because in Portuguese you use a lot of words that, how can I explain, don't carry meaning for the context. I have a great friend
11:24
who helped me with this part. Some other words I removed as well, for example, women, woman or violence, because that is my context, you know. And the last one, for me, is the most
11:40
important: I removed the real names. I know it's public data because it's on YouTube, but out of respect for the victims, I removed real names, because sometimes the women say their own name a lot, or the name of a child
12:01
and other people. Well, this part is very important. To extract the patterns in these reports, I used BERTopic. BERTopic is particularly useful for
12:22
discovering hidden topics in text, especially in large collections of data. For this reason, I used it to identify patterns in the women's
12:40
reports. For example, here, I don't want to understand the report of one woman; I want to understand the collective, not the individual. Well, another positive point of BERTopic is that it is a topic
13:08
modeling technique, and its modularity offers flexibility in choosing the algorithm used at each stage of the application. For example, in the embeddings, you can choose your model,
13:24
or in the dimensionality reduction part, you can choose the algorithm too, whatever you want. And for me, BERTopic was essential
13:42
because I can change, for example, in the embeddings, the model I use to a Brazilian one. You can see below the steps that I used. For example,
14:02
this name is very cool, you know. Okay. This part, the embeddings, is very curious because in Brazil we have a musical instrument called the berimbau, you know, and you mix BERT and berimbau,
14:28
and that is BERTimbau. Brazilians are very creative, you know. This is the instrument here. So for this part I used a pretrained model for Portuguese, named BERTimbau.
14:45
It was very useful for me because it understands our language better than other, multilingual models. And for the dimensionality reduction part, I used UMAP.
15:08
And for the clustering step, I used HDBSCAN. I used HDBSCAN because with it you don't need to specify the number of clusters that you would like to find.
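The three components named so far (BERTimbau embeddings, UMAP, HDBSCAN) could be plugged into BERTopic roughly as below. This is a sketch under stated assumptions, not the configuration used in the talk: the Hugging Face model id is assumed to be the BERTimbau base checkpoint, the hyperparameter values are illustrative defaults rather than the speaker's settings, and `PIPELINE` and `build_topic_model` are hypothetical names. The third-party imports happen inside the function, so the snippet loads without those packages installed.

```python
# Illustrative settings only; the talk does not state any hyperparameters.
PIPELINE = {
    # Assumed Hugging Face id for BERTimbau (base, cased).
    "embedding_model": "neuralmind/bert-base-portuguese-cased",
    "umap": {"n_neighbors": 15, "n_components": 5, "min_dist": 0.0,
             "metric": "cosine", "random_state": 42},
    # No number of clusters appears here: HDBSCAN infers it from density.
    "hdbscan": {"min_cluster_size": 10, "metric": "euclidean",
                "prediction_data": True},
}


def build_topic_model(config: dict = PIPELINE):
    """Assemble a BERTopic model from the configured components.

    Needs: pip install bertopic sentence-transformers umap-learn hdbscan
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(config["embedding_model"]),
        umap_model=UMAP(**config["umap"]),
        hdbscan_model=HDBSCAN(**config["hdbscan"]),
    )


# Usage (assumption: `docs` is a list of cleaned transcriptions):
# topic_model = build_topic_model()
# topics, probabilities = topic_model.fit_transform(docs)
```

Documents that HDBSCAN cannot assign to any dense region are labelled as outliers (topic -1 in BERTopic), which suits a setting where the number of patterns is unknown in advance.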
15:27
In my situation, I didn't know what number I would need to put here to find these clusters. And this is the reason I used HDBSCAN, because I wanted to identify, to discover
15:49
hidden patterns, and HDBSCAN works very well for my situation. And the last part here, the tokenizer, and this part
16:08
to get an accurate representation of each topic, I prepared the matrix and I used TF-IDF for this. But this part here is very important because
16:29
I used it on each cluster, or topic. I mean, I used TF-IDF for each cluster
16:43
that I found using HDBSCAN. For example, when you use TF-IDF normally, you apply it per sentence, only to the sentence. In my case, I used it per cluster, or topic, which is the same thing here.
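The per-cluster weighting described here corresponds to what BERTopic calls class-based TF-IDF: all documents of a cluster are treated as one long document, and a term's frequency in that cluster is weighted by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. The pure-Python sketch below illustrates that idea on toy English sentences; the function names, the stop-word list and the example data are illustrative assumptions, not taken from the talk.

```python
import math
from collections import Counter, defaultdict


def class_tfidf(docs, labels, stop_words=frozenset()):
    """Score terms per cluster instead of per document."""
    # 1. Pool the tokens of every document that shares a cluster label.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        class_counts[label].update(tokens)

    # 2. Term frequency across all clusters and average words per cluster.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(class_counts)

    # 3. Normalised in-cluster frequency times the class-based IDF term.
    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_words(scores, label, n=3):
    """Highest-scoring terms of one cluster, ties broken alphabetically."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy data: two clusters of two short documents each.
docs = [
    "i was afraid to leave",
    "afraid and alone at home",
    "my family did not believe me",
    "family and friends did not believe",
]
labels = [0, 0, 1, 1]
stop_words = frozenset({"i", "was", "to", "and", "at", "my", "did", "not", "me"})

scores = class_tfidf(docs, labels, stop_words)
print(top_words(scores, 0, 1))  # ['afraid']
print(top_words(scores, 1, 2))  # ['believe', 'family']
```

Because the unit is the cluster, the top words describe what a group of reports has in common rather than what is distinctive about a single report.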
17:08
Well, this is what BERTopic showed me after all these steps. Here are the topics that had the most semantically similar
17:27
words. You can see, for example, I know it's in Portuguese, but the idea here, for example, the first topic here, topic two: these are the top five topics that had the best semantic representation. Because, for example, you can ask me, oh,
17:45
what about topic one or topic three? This part is a very manual process: you read the words that the model found and you see whether they make sense or don't make
18:02
sense. In my case, the top five topics here had a semantic representation. This word here, for example, the first word in topic two, let's say as it is known in English, is the word that had the most representation
18:30
in topic two, and the other words are the set of words that were associated with this topic here,
18:41
topic two. I know that with the image in Portuguese it's sometimes very difficult to understand, so let's take a closer look at topic two. Here, in an English translation, are the words associated with this topic.
19:07
Well, when you see, for example, the phrases here, the part with the documents, this part is a mix: there are a lot of reports from women, not from one; yes, it's a document,
19:27
I don't know how each woman talked about this situation. But here, the algorithm found the patterns. And, for example, it's so hard to read this, because when I start to read it,
19:47
it makes me very sad. But you can see here that there is a pattern. Well, when observing the associated words, it's possible to see that the aggressions
20:09
and the violence experienced by these women probably stimulated in them the desire to seek
20:23
and to end all of the adversities they faced. Here, the woman knows about the violence. We know it, but they needed to escape from this violence.
20:47
But it's very complicated, especially for us women. I guess it's very hard for a woman
21:00
to wake up and escape from this violence. Here is one of the other topics where patterns were found; I needed to read a lot about domestic violence for these cases.
21:25
And this is the part where I needed some help from a specialist in domestic violence. As I said before, I am a data scientist, I'm not a specialist in domestic violence.
21:47
But I identified some patterns, especially for this topic that I'm showing you. It's too sad to read this, because sometimes the family doesn't believe the woman, or society doesn't.
22:08
But when she wakes up from this situation, she can escape and start a new life again.
22:23
Well, remember that I'm not a specialist in domestic violence, but it was possible to identify some patterns here. This is the plot showing that there are other topics here. You can see
22:46
that, for example, this one is very distant from the other side. But you can see some patterns in the words. It's in Portuguese, but it makes sense, believe me. If you speak Spanish,
23:06
it's very close. There are a lot of patterns here. Now I need to work with someone who is a specialist in domestic violence. And in conclusion,
23:22
another thing that I would like to leave you thinking about is that you can use your technical knowledge for social good, you know. I know there are a lot of individuals here,
23:42
but we can use our technical knowledge for society, not only for money, money, money, money. You can use it for us, you know, for humanity. And
24:02
thank you a lot. I appreciate your time. If you have any questions, let me know. This is my LinkedIn. That's it. Thank you.
24:20
Perfect. We have four minutes for questions. So run to the microphones, everybody. Hi, thank you for the presentation. I have a question. In the beginning, you mentioned that the data from newspapers wasn't suitable for your purpose. And I'm not sure I understood why.
24:41
So could you please elaborate on that? Why I didn't use the newspapers? Yeah, okay. Because it's so difficult to separate the woman's report from the journalist's text. In the newspaper, sometimes the journalist asks the woman something and the woman
25:04
responds. And I couldn't separate these parts. I wanted to find only the women's reports. So it's hard to separate the question from the answer. Yeah, exactly. Because sometimes it's not easy to see. Because sometimes it's only one
25:24
text, you know. And on the video, it's like different voices and you can separate them. Yeah. Okay, thank you. All right, cool. Hi. First of all, excellent talk. That was super interesting. Thank you. Now that you have this data, do you have plans on kind of what your
25:46
next steps are? Like, what are you planning on doing with this data now? Yeah, good question. I wonder about this all the time. I would like to move on. Now I guess I need someone who
26:03
knows about domestic violence to tell me, oh, this makes sense, or this doesn't make sense. For us, the more technical people, oh, it works, it's very cool. But I don't know if it makes sense to a specialist. My next step is to find someone who knows a lot about this
26:25
and to use it for other contexts as well, for homophobia or other things, racism. Cool. Yeah, that sounds super interesting. And I've got a second question if that's okay.
26:41
Is there anything in your setup that would stop somebody else being able to apply this to a different language? So you said everything here was Portuguese, Brazilian Portuguese. Is there anything to stop somebody else doing it with Spanish or German or whatever? Yeah, there is a multilingual model you can use. But especially in my case,
27:07
I was a master's student and my advisor told me, oh, only Brazilian Portuguese, because there are so many other languages to understand, and especially because I understand a little bit of this
27:21
situation in Brazil. But you can use, for example, if you collect YouTube reports from around the world, the same steps. The only change is that, because I used BERTimbau since it's Brazilian Portuguese, you switch to the multilingual model. Cool. Thank you very much.
27:41
You're welcome. Unfortunately, we have to cut this here, but I'm sure Deborah is running around, so just grab her and ask her all your questions. Thank you again for the very cool talk. And please give it up for Deborah once again. Thank you.