Emotional Trauma, Machine Learning and the Internet
Formal Metadata
Title: Emotional Trauma, Machine Learning and the Internet
Title of Series: re:publica 2017 (165 / 234)
Number of Parts: 234
Author: Caroline Sinders
License: CC Attribution - ShareAlike 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/32954 (DOI)
Transcript: English (auto-generated)
00:17
I'm an online harassment researcher with the Wikimedia
00:20
Foundation, as well as an open lab fellow with BuzzFeed and Eyebeam. For the past two years, I've been studying online harassment inside of social networks. For the past five years, I've been studying online protest as well as human behaviour and the kind of data we create inside of social networks. The title of this talk is machine emotional labour plus machine learning. So, I spent
00:45
a lot of time studying people on the internet, and you should laugh at my GIFs, because they're great. One of the things I've been thinking a lot about is the kinds of content we create inside of social networks, the kind of language
01:01
and conversations we have, because it exists online, because it exists inside of technology, any kind of conversation we have is actually data. Our interactions, be they emotional, professional, flirtatious, angry, sad, or mad, all exist as a form of data. Well, two years ago, I was working at IBM Watson as a design
01:21
researcher working on chat bot software, and I started to think about how much of our language is actually networked and how much of it actually exists as data once it's put inside of technology. Very basically, this is how a computer understands language. This is how an algorithm or a chat bot software would parse out a conversation you're having. It's incredibly literal. In this example, it
01:42
reminds me to feed the baby tomorrow at 7 a.m. The computer doesn't actually need to know what remind me is. Someone can program it to say the letters R-E-M-I-N-D equals do this task on a calendar. It doesn't actually understand your language, it's actually pinpointed to a series of commands inside of programming. Or even like this, for
02:03
example. Hello, what's the weather today? A weather API plus your location can return things that are denoted as weather, be it hail, storm, windy, overcast, et cetera, and can return with hello, it's sunny. But someone has to define this language, someone has to define what weather is.
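A minimal sketch of the literal keyword-to-command mapping she describes; the function names and the hard-coded weather lookup are illustrative assumptions, not IBM Watson's or any real assistant's API.

```python
# Minimal sketch of a literal command parser: the bot matches the letters
# R-E-M-I-N-D or the keyword "weather" and hands off to a hard-coded task.
# get_weather() is a hypothetical stand-in for "a weather API plus your location".

def get_weather(location: str) -> str:
    # A real system would call an external weather service here.
    return "sunny"

def handle(message: str, location: str = "Berlin") -> str:
    text = message.lower()
    if text.startswith("remind me to "):
        task = message[len("remind me to "):]
        # A real assistant would create a calendar entry for this task.
        return f"Reminder created: {task}"
    if "weather" in text:
        return f"Hello, it's {get_weather(location)}."
    return "Sorry, I don't understand that."

print(handle("Remind me to feed the baby tomorrow at 7 a.m."))
print(handle("Hello, what's the weather today?"))
```

The parser never understands the sentence; someone has defined, literally and in advance, what counts as a reminder or as weather.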
02:22
And those definitions can carry any kind of bias into them, especially when it's something that's not literal, like the various conversations we have. People define through code what a system is, and we have to live inside of those systems. How much knowledge or agency do we have inside of those systems? So this is one of the problems with machine
02:42
learning, and I would say with unsupervised machine learning. Unsupervised machine learning is a series of algorithms that work autonomously through code. There isn't a lot of human intervention. Supervised machine learning actually has much more human interaction within it. A researcher can help guide a lot of the parameters inside of that algorithm. This is an example that was written about
03:04
in the Guardian about a year ago by Leigh Alexander. On the let me see, on my right, which is also your right, when you Google the words unprofessional hair, this is what popped up a year ago. And as you can see, this is mainly people of colour hairstyles. On the left is
03:24
actually when you Google professional hair. You can see it's mainly white people. I don't think an engineer set out to decide to make this algorithm racist. They were using pre-existing data sets and images that had already been marked professional, non-professional, but it has carried the inherent biases we already have in our society
03:42
into a space where we actually can't intervene algorithmically. We're fed these results but there's nothing we can do to stop it. Latanya Sweeney's work on biased algorithms in 2013 was actually pretty instrumental in also pointing out how unintended biases can pop up, specifically, again, in
04:01
the US. Latanya Sweeney is a professor at Harvard. She works in their data privacy lab. She started Googling black-sounding names and saw they triggered arrest-related ads. She even Googled her own name, and the same thing happened. Latanya Sweeney, question mark, arrested, question mark. Ads popped up there about bail bonds. This is a way to, when you're arrested, pay your bail in
04:23
the United States. Professor Sweeney has no criminal record. It's the implication of these ads, right? Earlier today, in a talk with Jillian York and Matt Stender, they spoke specifically about this kind of algorithmic bias inside of facial recognition software, and they had this really
04:41
great Kate Crawford quote that I would love to repeat back to you. Kate Crawford is part of this initiative in the United States called AI Now. It's a non-profit designed to look at bias that exists as we start to create more and more diverse systems that can be used to create more and more diverse products using machine learning. Algorithms are being fed certain images, often chosen by engineers, and the system builds a model of the world
05:03
based on those images. If a system is fed on photos of people who are overwhelmingly white, it will have a harder time recognising non-white faces. So what does that sort of mean? What kind of bias data is being fed to all these systems that exist inside of our lives? Who made the data? Where does it come from? How big is the data
05:22
set? How old is it? How many different genders are represented? And where can we access that data to even begin to look and fact-check it? This bias is already intervening in our daily lives, and it's creating harmful, erroneous results from that. It's creating very literal results. This is an example of Faception, an
05:44
Israeli algorithmic face detection company that's being used to start detecting different kinds of faces. The problem with this, other than the surveillance aspect, which is major and grand and awful, is more how wrong this kind of data can be. Who's training it? Again,
06:01
what does the data set look like? How are they determining who these faces are? And also, when there's the added layer of emotional analysis, who's deciding what happiness is and what is sadness? This bias is a particular kind of feedback loop from an older data set that's used to create predictions for new data. There's already a lot of bias in that. In fact, a lot
06:21
of predictive algorithms are using older data sets. You get caught in this feedback loop. The head of Google's machine-learning group actually wrote this on Medium. Predictive policing listed as Time Magazine's 50 best inventions of 2011 is an early example of such a
06:42
problem. It's the idea to use machine-learning to allocate police resources to likely crime spots. Believing in machine-learning's objectivity, several U.S. states implemented this policing approach. However, many noticed that the system was learning from previous data. If police were patrolling black neighborhoods more than white neighborhoods, this
07:00
would lead to more arrests of black people. The system then learns that those arrests are more likely, thus leading to this reinforcement of this original human bias. So what does it mean that we're using this older data that hasn't been course corrected? How many different products already exist within that? A lot of the work that I'm working on is thinking about how can you use machine-learning to
07:21
start looking at online harassment. But how are we thinking about what is harassment and who's determining this? I actually have a really funny example. This is one of my favorite words. Because it's an insult in America, and it's a term of endearment in the UK. And as a poorly, poorly
07:40
behaved American, I love this word so much. And it's also an example of a smart idea gone awry, because it represents a problem in technology. It's called the Scunthorpe problem. The Scunthorpe problem occurs when a spam filter or search engine blocks emails or search results because their text contains a certain kind
08:00
of word. So Hotmail implemented this, and the citizens of Scunthorpe, England, could not register their email if they included the word Scunthorpe, because it has the word cunt in it. And this is something where design fails spectacularly and it fails in what could be an awful and unintended way. And this is one of the causal
08:22
effects to be considered. Word-blocking is often a thing people think about implementing very quickly inside of online harassment. Think about if users could have blocked the word Gamergate, for example. What would their experiences have been like when we've had such a big problem with the online harassment campaign Gamergate? The better example of why to think
08:41
about the Scunthorpe problem, though, is when a company implements on your behalf this kind of word filtering, right? So Hotmail implemented removing the word cunt, but no-one decided as an individual user or consumer that that's the word they wanted to filter out, and thus we end up with problems like this. So how do we think about the causal effects of these
09:01
greater decisions that companies are making for us? I'm a major advocate of user agency, that's why I work for the Wikimedia Foundation. It's an open-source company where we actually co-design with our editors to think about the different problems that exist inside of our platform. But what would it look like if users could have more agency inside of these spaces? But then from there, can machine
09:21
learning be used along with co-designing with different kinds of users? As a designer, I think about this a lot. How do we design transparently with algorithms? Can it exist, and what would it look like? And by transparently, I mean what kinds of algorithms are we using, but also what are the decisions that we're making? How do you fold in a community, a large and expansive community
09:42
across many different cultures and languages when you're trying to set standards? How can you do that with harassment? I think this is a really important thing to consider, especially as we put more and more of our data and more and more of our time into these really large social networks that are private companies. Social networks are
10:02
becoming the commons. It's where we discuss, it's where we talk about everything. People fall in love who have never met before inside of social networks, right? Social media is not just social media, it's a communication tool, but what happens inside of those spaces is very opaque. It's very private, it's not very public. Part of the problem with harassment is harassment can be very nebulous. It can
10:21
be literal, it can be contextual, and it can be cultural. These are hard things to sort of teach an algorithm. How do you teach context? How do you teach culture? It's much easier with something literal like word blocking. It's users can implement that kind of blocking on their own. So social media is a mixed
10:40
emotional and identity space. So, one thing I've noticed in the past couple years is we've actually moved closer to understanding what harassment is across all different kinds of networks and across all different kinds of groups. For example, the word doxing is starting to become much more commonplace. It's the public release of private documents. But even better, doxing is already being folded into codes of conduct in
11:03
different platforms. A no-doxing rule actually exists on Pastebin, even though doxing still occurs there. Now, it's one thing to fold it into your rules, it's another thing to actually implement it. And I think it's sort of important to think about that, like how we're moving closer to this space of having a more general understanding as to what harassment
11:21
is. But where can machine learning work into this? Doxing could have just a series of numbers within it if you release someone's email, or, sorry, someone's phone number, right? And that's a set number. You can have an algorithm run through that and try to check those numbers, or look for something that says phone number plus, like, a string of numbers following that.
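A rough sketch of the two literal checks mentioned so far: spotting a posted phone number, and blocking a word only at word boundaries so a filter does not trip over Scunthorpe. The regular expressions are illustrative guesses, not anything a real platform ships.

```python
import re

# Illustrative only: a crude check for "phone number plus a string of numbers",
# roughly the literal pattern described above. A real platform would need
# locale-aware formats and far more context.
PHONE_PATTERN = re.compile(r"(phone( number)?\D{0,10})?(\+?\d[\d\s().-]{7,}\d)", re.IGNORECASE)

def looks_like_doxing(post: str) -> bool:
    return bool(PHONE_PATTERN.search(post))

# Word-boundary matching is also the simple fix for the Scunthorpe problem:
# block the standalone word, not every string that happens to contain it.
BLOCKED = [re.compile(rf"\b{re.escape(w)}\b", re.IGNORECASE) for w in ["cunt"]]

def contains_blocked_word(text: str) -> bool:
    return any(pattern.search(text) for pattern in BLOCKED)

print(looks_like_doxing("her phone number is +49 30 1234 5678"))  # True
print(contains_blocked_word("I live in Scunthorpe"))              # False
```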
11:42
So this is something I actually walk a few social media companies through. The idea is that it's supposed to represent a decontextualized space, which actually doesn't exist on social networks. There is no decontextualized neutral space. Every space has bias in it. It carries all the prejudice we have from real life into this space. The reason I like to walk people through
12:01
it is it's more of a design exercise to think about harassment and how you could use machine learning inside of this space to mitigate harassment. So this is just a regular decontextualized Facebook photo. A user asks to remove the tag. Most places would be like, that's great, we can totally do that, remove the tag. What if they ask to remove the photo?
12:21
You can go through these series of questions next and think of these as a series of steps that could exist as code or as a series of design choices, be it radio dials or whatnot, inside of an interface. So why? Because I feel uncomfortable. I actually noticed when I was doing my two years of research into Gamergate that the majority
12:40
of Gamergate victims would start off talking about their harassment by saying, I feel uncomfortable. And I think of this as a good dog whistle. Why do you feel uncomfortable? How expansive is the word uncomfortable? What are all the different meanings of the word uncomfortable? It's not just discomfort. It can mean a variety of different things. I don't like the photo because maybe I look unattractive.
13:01
I'm afraid to lose my job. I'm afraid to upset my family. I'm afraid to upset my peers or I feel unsafe. So none of this actually required any kind of algorithmic intervention because the user is picking these. What's important is what happens after a user would file a report. So how do we start thinking about ways in which machine learning could actually be used for moderators and filing harassment reports?
13:22
What I'm suggesting is an algorithmic intervention that uses supervised machine learning that uses the knowledge researchers have with a better framework using machine learning to isolate new trends that could exist inside of harassment reports. Your users will tell you what's happening and what's wrong on the platform, especially if it's a daily part of their lives.
13:42
Gamergate victims would file multiple, multiple, multiple reports with the exact words Gamergate inside of it. So if any kind of supervised machine learning program had been run on these reports, they would have noticed a new word appearing, the word Gamergate. They also would have noticed an increasing amount of reports that were happening over the course of Gamergate that should have triggered any kind of system
14:02
to say we have a new trend appearing and it's not a good one. Inside this example I show is different ways to think about how these different emotions could be marked in a much more literal or categorical way. The first one, I like my photo. Perhaps that is considered annoying
14:20
content. You just remove the tag and you push the report to a section called annoying. Perhaps afraid to lose my job, afraid to lose my family, upset my peers, maybe that gets marked more as abuse. You can then rate the level of abuse. You can put it on a scale. From there, you can then think about which moderator it goes to and put in an estimated time for a response.
14:42
This way the user is actually getting some feedback from any report that they filed. The majority of harassment victims actually don't receive any kind of response on spaces like Twitter when they file a harassment claim. The last one, I feel unsafe. You could follow up if it's something that is viewed as a dangerous situation.
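A sketch of how that routing could look in code; the reason strings, queue names, severity scores and response-time estimates are all invented for illustration, not taken from any real platform.

```python
from dataclasses import dataclass

# Sketch of the triage flow described above. The user picks a reason;
# the reason maps to a queue, a severity, and an estimated response time.
REASONS = {
    "i don't like the photo":        ("annoying",  1),
    "i'm afraid to lose my job":     ("abuse",     3),
    "i'm afraid to upset my family": ("abuse",     3),
    "i'm afraid to upset my peers":  ("abuse",     2),
    "i feel unsafe":                 ("dangerous", 5),
}

@dataclass
class Report:
    reporter: str
    text: str
    reason: str

def triage(report: Report) -> dict:
    queue, severity = REASONS.get(report.reason.lower(), ("review", 2))
    # Higher severity goes to trained moderators and gets a faster estimated
    # response, so the reporter hears something back instead of silence.
    eta_hours = {1: 72, 2: 48, 3: 24, 5: 1}.get(severity, 48)
    return {"queue": queue, "severity": severity, "estimated_response_hours": eta_hours}

print(triage(Report("alex", "please remove this photo of me", "I feel unsafe")))
```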
15:04
What's even more important is moderators could change what's marked as abuse, the level of abuse, et cetera, in real time by re-rating the reports that they got, re-tagging, flagging new abusive words that they haven't seen, and also having the system do this as well.
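A minimal sketch of that re-teaching loop, assuming a scikit-learn-style bag-of-words classifier; the toy texts and labels are invented, and the point is only that moderator corrections become new labelled training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# "Re-tagging re-teaches the system": moderator decisions are appended to the
# labelled set and the classifier is simply refit. Example data is invented.
texts = ["you look great", "I will find where you live", "gamergate is coming for you"]
labels = ["annoying", "abuse", "abuse"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def moderator_retag(text: str, corrected_label: str) -> None:
    """A moderator overrides a tag; the correction becomes training data."""
    texts.append(text)
    labels.append(corrected_label)
    model.fit(texts, labels)   # supervised: the system only learns from human labels

moderator_retag("drink bleach", "abuse")
print(model.predict(["gamergate will find you"]))
```

In practice the refit would be batched rather than run per correction, but the shape of the loop is the same: humans label, and the model only follows.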
15:20
What this does, if you think about the fact that a moderator could re-tag, what that means is that they're re-teaching the system. The system is learning from their actual input. So the system is not doing anything autonomously. It's learning directly from moderators. Now, this idea I'm proposing would only work if you have trained moderators. Spaces like Facebook actually push the majority of their moderating content
15:40
to spaces like the Philippines. They often are not trained in any kind of deep cultural context of what they're looking at, nor are they given any kind of emotional support. But the idea behind this is maybe we could perhaps create a way to start better understanding emotional trauma and emotional data inside of these systems but also letting trained moderators, letting researchers and
16:01
ethnographers also determine what is happening inside of this space. It's allowing for more human intervention as well as a more human response. The majority of this talk is really about how machine learning can be seen as a collaborative tool for humans. There are things machines do really, really well.
16:21
They process data and images faster than humans can. But there are things that humans do extraordinarily well. We're good at sussing out things. We're good at understanding context. We're good at asking questions. We're good at following up. We're good at not being literal unless we have to be literal. So as a machine learning designer and
16:40
researcher, I want to think about what the future of machine learning could be if it's viewed as a collaborative tool or viewed as an extension of myself, not viewed as an autonomous third party, not viewed as artificial intelligence humanoids, but viewed as an actual tool in my tool kit that I can also be more involved in the process of training.
17:02
Especially for something like harassment, when internet language is changing so, so quickly, how do we think about what machines do well? If there is a better way to sort of store specific words, seeing the analytic rise in words or images and actually being able to compare those to new images arising, we could maybe actually be able to study
17:20
meme culture as well as harassment in a much better and more nuanced way as opposed to waiting for victims to talk very publicly about the kinds of problems they're having on these networks. How do you mitigate and how do you lessen harm of victims and how do you also create a system that's easier and safer for moderators to look at this harmful material?
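One simple way to "see the analytic rise" of a term is to compare its frequency in a recent batch of reports against a baseline window; the thresholds and toy reports below are assumptions for illustration only.

```python
from collections import Counter
import re

def term_counts(reports):
    """Count lowercase word and hashtag tokens across a batch of reports."""
    counts = Counter()
    for report in reports:
        counts.update(re.findall(r"[a-z#@']+", report.lower()))
    return counts

def emerging_terms(baseline, recent, min_count=3, ratio=5.0):
    """Flag terms that are brand new, or whose rate jumped well above the baseline."""
    old, new = term_counts(baseline), term_counts(recent)
    total_old, total_new = max(sum(old.values()), 1), max(sum(new.values()), 1)
    flagged = []
    for word, n in new.items():
        if n < min_count:
            continue
        if old[word] == 0 or (n / total_new) >= ratio * (old[word] / total_old):
            flagged.append(word)
    return flagged

baseline = ["he insulted me", "spam again", "rude reply"]
recent = ["gamergate mob in my mentions", "more gamergate threats", "gamergate dogpile again"]
print(emerging_terms(baseline, recent))  # ['gamergate'] on this toy data
```

Run over harassment reports, a check like this would have surfaced a brand-new word like Gamergate, plus the rising report volume, long before victims had to explain it publicly.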
17:44
So, slang and harassment vernacular changes quickly and often. I mentioned earlier that I was a BuzzFeed and Eyebeam Open Lab fellow. What that means is for the past six months, I've been studying the rise of the alt-right in the United States as well as the rise of populism.
18:00
I've been looking at how the Trump presidency has sort of changed the digital landscape inside of spaces like 4chan, 8chan, and Reddit, as well as how it's changed harassment culture. It's become a lot more political. It's also become a lot more specific. Now, this is where ethnography is really important, but also where what I'm describing comes directly into play.
18:22
Right before the alt-right subreddit was removed from Reddit, this is what the page looked like. What you see here is specific language that deals with the alt-right where they're promoting their space as a space of white nationalism. What that means is it's an inherently violent
18:41
thought, term, and space to be in. It's a political science term that I hadn't seen appear in spaces like Reddit before. In spaces like Gamergate and KotakuInAction, et cetera, they would talk about fighting for their own identity as gamers, but they didn't use classical political speech inside of the way that they describe themselves. What's different about this election
19:01
is that it's become a lot more specific, and people aren't afraid to actually talk about the fact that they're white nationalists. White nationalism supports the idea of a white state and white identity. Now, this is sort of terrifying, I think, as a researcher. This is also what their subreddit looked like the last day it was
19:21
active on Reddit. So I scraped all of the alt-right subreddit as well as The_Donald, and I started looking at word frequency and how often specific words appeared inside of this subreddit versus other subreddits. This is The_Donald. So you'll see garbage,
19:42
Trump, Hillary's down there, obviously, white supremacy, et cetera. And out of all of this, I started reading all these different blogs that I was finding inside these two different spaces, and what I noticed is that there is a rise of particular slang terms, so if we go back for a second, I don't know if you see this here, but
20:02
they've created a guide to all the different kinds of terms that they use. They've created guides for indoctrinating new members into the alt-right. So I took a variety of guides I found from the Daily Stormer, which is a neo-Nazi website that's also linked to off the alt-right subreddit, as well as a neo-reactionary site.
20:22
That's a rise of neo-conservative politics in various alt-right sites. And what I created was what I call a hate speech dictionary, and this is a small snippet of it. What I'm doing is I'm tagging all the different words that I see if it's a blog, if it's a person, if it's a slang term, if it's a space or a location.
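A sketch of one way such a tagged lexicon could be laid out; the entries are invented placeholders, not rows from her actual dictionary, and the second tagging dimension she describes next would simply be another field on each entry.

```python
# Invented placeholder entries illustrating the structure, not real dictionary rows.
hate_speech_dictionary = [
    {"term": "example-blog-name",  "type": "blog",   "source": "scraped guide"},
    {"term": "example-figure",     "type": "person", "source": "scraped guide"},
    {"term": "example-slang-term", "type": "slang",  "source": "subreddit scrape"},
    {"term": "example-forum-name", "type": "space",  "source": "subreddit scrape"},
]

def lookup(term):
    """Return every tagged entry for a term, so a researcher can see how it was classified."""
    return [entry for entry in hate_speech_dictionary if entry["term"] == term.lower()]
```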
20:42
I'm then also tagging the words if it's general alt-right, white nationalist, white supremacist, or neo-Nazi. And I've been working directly with the Southern Poverty Law Center on determining which words fall into which category. This is actually already being used with ProPublica, a journalistic site
21:01
in the United States, for a lot of big data analysis and scraping that they're doing. Okay, so why would I do this, other than subjecting myself to looking at horrifying parts of the internet? You can laugh if you want to. Because this knowledge didn't exist before, and any
21:21
kind of machine learning system actually wouldn't be able to recognise any of these terms as hate speech, because it doesn't actually exist inside of a spreadsheet that you can feed a system that is designed to analyse hate speech. This takes research, this takes contextual learning, and it takes a fair amount of teaching. But now that
21:41
I've compiled this hate speech database, I can actually use machine learning to help me look for new terms. I couldn't have used machine learning first because I didn't know what it was looking for. But because these terms appear with such frequency next to other words, I can now compare it and look for new and emerging words as they appear inside of these different spaces, inside of these different blogs, inside of these
22:01
different social networks. I can now see any kind of new slang that comes up, and I can hand this off to journalists and researchers. This is an example from Google's Perspective API. Now, I actually really like Perspective, but the reason I bring this up is this is, according to the Anti-Defamation League,
22:22
this is the most popular white supremacy phrase in the world. And, as you can see, it's only rated as 37 per cent toxic. Perspective is a new API that's supposed to look at toxicity in language. It's not designed to work autonomously, and we can see that, in this case, if it were working autonomously, it would fail. It would fail
22:42
in a very large and very real way. But Perspective is designed to work with moderators inside the New York Times to help better partition the comments that they're getting, so they can go through them faster and easier, and determine which comments are suitable to exist on the Times. It's not perfect. It is designed for a very specific use case, and for
23:01
moderators within that use case to teach this over time. This is more of Perspective. The bottom says, have a good day in Farsi, and I love you in Chinese. But, again, it's a way to start helping partition the different kinds of the web, or, like, the volume of load that a moderator is getting. It is
23:22
not perfect. And I will keep saying that. But even with its inaccuracies and false flags, a human can quickly suss out these errors and still work in a faster way as opposed to working without any kind of tool or extension or help. So I'm going to bring it probably all back
23:41
again. This is where technology can alleviate a specific kind of workload that we're having. It can alleviate the different kinds of problems that exist inside of moderating, especially as the world is in such tumultuous times. What does it mean to exist on the internet now with the rise of populism? And as an American, I think about this a lot with the rise of the alt-right.
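A sketch of that kind of load partitioning; score_toxicity() below is a toy stand-in for an external scoring service such as Perspective, not that API, and the word list and threshold are arbitrary.

```python
def score_toxicity(comment):
    """Toy heuristic standing in for an external scoring service; not a real model."""
    flagged_words = {"hate", "kill", "scum"}
    words = comment.lower().split()
    return sum(word.strip(".,!?") in flagged_words for word in words) / max(len(words), 1)

def partition_queue(comments, threshold=0.2):
    """Order a comment queue so moderators review the likely-worst items first."""
    scored = sorted(((score_toxicity(c), c) for c in comments), reverse=True)
    review_first = [c for score, c in scored if score >= threshold]
    review_later = [c for score, c in scored if score < threshold]
    return review_first, review_later
```

A human still makes the final call on every item; the score only changes the order and volume of what a moderator sees at once.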
24:02
But as someone who works in online harassment, I'm curious as to how many conversations I'll have to look at that will be shrouded in political disagreement when it's actually a conversation that's really rooted in racism. How do you determine and set standards for that? But how do you also prep moderators for what they're about to see? How would you prep a moderator to see any kind of Nazi
24:21
insignias? How do you make sure that people's workflows are not overrun with this kind of heavy emotional lifting? There are things we can do with technology to help bear the weight of that, to help bear the load of the emotional labor that moderators are facing. Now we still have to train these systems, but it can help partition the way
24:41
that we look at these spaces, given how fast different kinds of algorithms can parse information. And the big thing about this is how do we create more spaces that have user-generated definitions of safety? I don't know what Twitter thinks of as good quality when they enacted across all of our different accounts the quality filter. I don't know
25:01
if Twitter is on the same page with me when I think about good, because Donald Trump still has a Twitter account. I don't know what is even considered safety or what is considered of high importance inside of spaces like that. And so, as a researcher who's really rooted on user agency, I wonder how do we create spaces that have user-generated definitions
25:21
of safety? How large can that be? How large can we scale these different kinds of spaces? Thank you. Caroline Sinders!
25:41
I think we have plenty of time for Q&A, and you're prepared for Q&A. Yeah, I always prepare for Q&A. Okay, we have two microphones in that room. Yes, we have. So we have one question over there. Was it like this? And the second question over there.
26:02
Okay, so microphone is on the way. Please introduce yourself for us, and then start with the question. Yeah, my name is I'm a journalist from Berlin, and I have basically two questions. The first is that you mentioned that you want to generate these secure spaces
26:21
where people are protected from hate speech and so on. But I wonder if you run into the danger that you create filter bubbles where you keep those people who are affected by right-wing ideology out, and you can't convince them of the opposite anymore. That's the first question.
26:41
And the second one is I was in the US last year and met the author of a book called Weapons of Math Destruction, and her basic theory was that the algorithms that predict, for example, policing, or that are used in Twitter to
27:02
filter certain kinds, that they have to be open source in general. So we create kind of an ethic how to create and how to program algorithms. I don't know if you have a judgment on this. Well, I'm super open source given that Wikipedia is an
27:20
open source tool and platform, but I also agree. In these spaces, how do you think of ways to create agency around algorithms? We should be able to see the way that they're written, but also the data that they're fed. Even before an algorithm is implemented inside of a social network, it has to be written and created. It has to be trained on models, right? What are those models? Who trained it? How big
27:41
was the data set? How old is the data set? We need access to all of those things, so I very much agree with that. Secondly, I'm not advocating for companies to implement any kind of word blocking. The research I was doing was actually to see how do you start to think about hate speech in a new way beyond sentiment analysis. Classically, there's five different kinds of sentiments in sentiment analysis.
28:02
I don't think hate speech would necessarily fit into anger plus disgust and maybe a little bit of joy. It fits into a different space. Anger, disgust, joy, and there's sadness. I always forget the last one. Those are the ways that that's classic sentiment analysis, right? But there's so much more human emotion than disgust, anger, joy, sadness, et cetera, right?
28:21
So, another part of your question was with filter bubbles. I think we already exist in filter bubbles. I think filter bubbles are a problem, but I think we already exist inside of them. I mean, re:publica is technically a filter bubble if you think about it. You chose to come to this conference for a specific reason. It's in a specific geolocation. It has specific talks that you
28:41
like. Perhaps your work let you get off that, like, come to this conference for the day. It's about technology. It's about community. It's about collaboration. But some people don't actually like those topics, so how would you reach them? I think it would be really hard to do that. But also, people are allowed to decide and design what spaces they exist in. I think
29:00
one of the few things that doesn't get talked about with filter bubbles is oftentimes, it can be safer to be in a filter bubble, especially if you're a person of color or a marginalized group. I don't know if I would want Trump, a series of Trump supporters jumping in on my Facebook on certain conversations. That's why I have really high privacy settings, right?
29:20
I think the problem with filter bubbles is when we don't get to decide what's in our bubble when it's algorithmically decided for us, which is something that Facebook does. But we also already live in a bubble because we're friending people that we know, and that we probably know somewhat well, or we've met or we already have mutual friends. I think people forget that our actual IRL social
29:40
networks, our family networks, our work networks are already technically filter bubbles. So, time for the next question. Where was the next question? Here's the next question. Great, thank you. Hi, I'm a filmmaker. I like your talk very much.
30:00
Very general question. Concerning the outlook of your research, when you think about our societies and the impact that what you call emotional machine learning might have from a political perspective. Do you think in that way during your research are you more
30:22
concerned with what is actually happening? Because I have the impression that people who actually program that, your examples with Google, they have an idea probably of society, which is an idea that has a perspective. Can you elaborate on that,
30:40
please? Sure, or I'll try to. I think a big problem is when things are designed, they're designed with too few use cases or too few personas. A thing I like to do when I design is to think about what's the worst thing that could happen, like what's the worst that could go wrong. So if I were to come up with a product idea that's maybe about helping people work out better,
31:01
I like to go through all the ways it could be used in the worst possible way, and then think about that as the causal effects of what I'm designing, and how do I not implement that? I don't know how often algorithms go through that kind of like I would say intense QA, intense testing.
31:20
Part of the problem is that we're having to work with data, and like we have to work with so few really large data sets. So part of the problem is that a lot of external algorithms are made inside of like university labs, and then they only have access to certain kinds of data. So they're having to use these data sets
31:40
that are really old, or they're not quite large enough, or doesn't quite fit their problem, but they're also not designing the stuff for public consumption, they're designing it to sort of test an idea. From there, then these things can be folded into public consumption. So I think part of the problem is actually like looking at companies that are implementing machine learning into products and asking them to spend a lot of time on the product. The problem with that is
32:02
like modern capitalism doesn't actually set up any kind of design firm or small company or bigger agency to spend that much time on developing a product that's using machine learning, and that's really the main problem. It's also thinking about data and consensual data as a currency. So consensual data in the sense of can users opt in?
32:23
Are they choosing to give up their data? Are they a part of the process? And then can you train based off this? A big thing is that there's just not enough data, but also not enough data that people willingly give up, and then data being used in a smart way. Next question.
32:42
Yes, here. And where else? And here, okay. So first here, and then there. Thank you. Katarzyna, Panoptykon Foundation, Poland. I understand that the context for your research is the use of algorithms for moderation mostly, yes? So to help people who moderate
33:02
content. Can you also think of a different context where this kind of machine learning could help us understand what is happening online and how we should react? That's one part of my question. And second, more detailed, I was intrigued with your example of the algorithm learning what annoying is, and I
33:22
understand censoring that content from me in the future. I use censoring because that's how I see the risk of this. So if the algorithm learns on the basis of one case where I clicked, I don't like it, that this type of thing is annoying, it might well become a problem for myself in the future.
33:40
So maybe I didn't get the example, but if you could elaborate on that, that would be great. Thank you. Sure. So in the example, I didn't necessarily intend for annoying to be the basis of a machine learning section, more so in the sense that if you look at any kind of harassment filing, annoying is often listed as a thing. And what that means is it's like a low
34:00
priority in terms of a harassment report. So if you're marking content as annoying, it's something that they won't actually pay attention to, they'll just remove it from your timeline or from your view, but it often just gets swept under the rug. So I think the second part of your question was asking how this could be used beyond moderation. I mean, I think using any kind of research-guided machine
34:22
learning extension would be fantastic for looking at emerging geopolitical trends inside of spaces like Twitter. I'm concerned that with how Twitter is being run, will we see another Arab Spring inside of a space like Twitter, or will it be shut down and censored by the government? And how can we watch that
34:42
and as an international community support that kind of protest that's happening? So I think a good example is being able to use these things to look at different kinds of digital protests that's starting to exist. A big fear I have as someone who works in online harassment but also has studied protest is a protest campaign and a harassment
35:01
campaign, if you look at it just based off the volume, they look almost identical. So Gamergate, people thought, was a fandom fracturing. They thought it was two sides of one fandom fighting and infighting. Gamergate see themselves as protesters. They see themselves protesting a change in games. They see themselves
35:21
protesting society that wants to change games and take it away from them. Victims of Gamergate viewed Gamergate as what it is, which is a harassment campaign, but it's incredibly complex. So if we're using tools to sort of look at just very basic analytic systems or identifying markers, which would be volume, will we
35:42
limit the ability to protest when we're trying to solve harassment? And that's why I think context is really important as well as specific researchers that can work inside of these systems and can work alongside these systems. There's a lot of different identifying markers, I think, to harassment campaigns. A big thing you can look at is like the interaction history between
36:01
users. Have they ever interacted before? Is one user a brand new user versus an older user? Twitter did a study and found that most egg accounts are either spam or they're engaging in harassment, so that's a good example. If an account is really, really young and doesn't have a photo tied to it, it could be a bot or it could be engaging in harassment.
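A sketch of the opt-in, user-side filter she goes on to propose; the age threshold and field names are invented for illustration, not drawn from any platform's actual settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Account:
    handle: str
    created_at: datetime
    has_profile_photo: bool

def hide_from_mentions(sender: Account, filter_new_eggs: bool, min_age_days: int = 7) -> bool:
    """Hide mentions from brand-new accounts with no photo, but only if the user opted in."""
    if not filter_new_eggs:                  # the user has to choose this filter themselves
        return False
    is_new = datetime.utcnow() - sender.created_at < timedelta(days=min_age_days)
    return is_new and not sender.has_profile_photo   # the classic "egg" pattern

troll = Account("egg123", datetime.utcnow() - timedelta(hours=2), has_profile_photo=False)
print(hide_from_mentions(troll, filter_new_eggs=True))  # True
```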
36:21
So is there a way for people to filter out by their own choice, like opting in to filter out new egg accounts? That's a way to cut down harassment that isn't that algorithmically heavy and it's letting users decide and determine their own privacy settings. Okay, and the next question comes from the audience
36:40
to the right of me here in the second row. Hi there, my name is Anastasia, I'm a design management student. I enjoyed the talk. Thank you. I was kind of wondering about a different element of some things that you mentioned so you said both that maybe more care needs to be taken in the preparation
37:01
of moderators so that they're kind of more educated when determining what might be abusive and what not. But you also mentioned that for instance, studying the alt right was kind of scary, which I totally understand. So I'm wondering if you've thought about potential psychological preparation
37:21
for moderators in addition to the more practical aspects of their preparation so that they themselves don't just become embittered and sad in the process and are actually able to do their jobs effectively and be still relatively happy humans. Totally.
37:41
I mean, I think a big thing is looking at what exists inside of the harassment reports and being able to change it up. So like being able to mark something as annoying, for example, if you've been looking at something that's rated as highly, highly abusive, being able to switch from, okay, I looked at highly abusive content all day to can I look at annoying content for hopefully a week
38:02
as a way to sort of lessen that emotional labor. So that's like one big thing to think about is do people have enough varied tasks to where they're not having to look at the same level of abusive content be it like videos of beheadings or child pornography. Images are a lot easier actually
38:20
to sort of, in a way I guess, moderate. They're worse to look at, but if an image has existed before, you can use image hashing or metadata from the image to see if it's the same or a similar image. What's tricky is when it's a brand new image. Someone still has to look at that, but at least that's cutting down on some of the volume of like more
38:41
abusive content people are looking at. What is harder is with words and being able to like rate the words as abusive. And that's why I think looking at slang words is a good example of a dog whistle, so you know, gamergate again is a great example of like they use the hashtag with everything so any abuse report that has gamergate
39:01
could go to a special kind of section of this is like all the Gamergate stuff ever. Or if you go back to the dictionary and being able to look and see like enough of these words have existed and it's being reported as harassment or abuse, like this is probably something talking about white nationalism. Great. Do we have more questions?
39:21
Yes. Here in the second row as well. Can we have applause for the volunteers of the microphones as well? Hi there, Ernst. Thank you for your talk. I was wondering, you mentioned
39:41
a lot of tools are available to moderators but as a general public, if you want to interact outside your own filter bubble or perhaps with these groups and trying to establish a dialogue, I can imagine that your Facebook or Twitter profile can easily be flooded with a backwash
40:02
of trying to interact with a group that's perhaps not as open to your view. Do you think your research also provides a set of tools to maybe monitor threats or to interact safely with these groups while still maintaining your own profile and personal
40:21
identity online? I think it could. I think part of the problem is spaces like Facebook and Twitter view the product as a very closed off product, a space like Wikipedia views Wikipedia as an open platform that you can write your own templates to exist on. That's much more of a protocol in the sense
40:41
that you can make changes to your own personal account and change the way that you interact with it. I think a massive problem is how productized things are. We live in a time of seamless product design and also the people that work for these companies, they're hard to find, they're hard to figure out who they are, they're impossible
41:00
to contact. These spaces exist as these sort of nebulous entities. I think that this could be used to help people perhaps interact or mitigate the kinds of interactions they have. I don't actually interact with the alt-right at all. I did interact with Gamergate when I was studying it. It did have negative consequences for me.
41:21
So I think this is kind of a really tricky space. I study from afar, but I also study looking at the kinds of language they generate online and on the internet. I think it's important to talk about that. They are putting out content from this hyper-specific Western perspective inside of spaces that talk a lot about irony
41:41
and that don't talk about intentionality. So I wouldn't say that with the database I showed or any kind of machine-learning algorithm that you could say this person is like X percent white supremacist. I don't think you can generate those kinds of conclusions from data, nor should you. That goes a lot more into thought crime and policing territory than I'm comfortable with, and it also
42:02
creates another layer of surveillance that I don't think that we need. I do think that it is interesting for us to have conversations as a public about what we consider hate speech, and what is considered acceptable inside of these spaces, meaning I think the code of conduct often gets overlooked. So with
42:20
something like doxing, for example, two years ago, that wasn't considered harassment. Now, it is widely implemented on different social networks, and it's even implemented on the site where doxing occurs, which is Pastebin, but at least that is a more normalised term.
42:40
are we normalising in this space is really important. I'm not suggesting that we remove white supremacy terms from the internet. I do think it's important for us now with the rise of populism, even in Europe, to talk about what kind of systems do we exist in, and what is considered acceptable. One thing to point out is most social networks are American-based
43:00
companies. That also means that they're implementing American policy in spaces that are not America. America has a very tricky relationship to the First Amendment, which is the freedom of speech amendment. A lot of these companies err on this very libertarian side where anything can be said in these social networks, and they should just let anything go.
43:21
And I don't necessarily think that they should create a lot of policy around what can be said. I do think that they need to have internal and external dialogues around what kind of spaces are they generating if people are planning offline attacks, and also what are the effects of language. It's one thing to say
43:42
you can't talk about rape on this site. It's another thing to say you can't make rape threats, and those are not the same thing. And I think that's where things get super, super tricky. And that's where this understanding of freedom of speech often gets really nebulous, and it becomes an argument of censorship. I don't think social networks should outlaw
44:02
the word rape. I do think that they should have rules that say don't make rape threats. One more question? No? Okay. So, thank you, Caroline Sinders.