
Getting Your Data Joie De Vivre Back!


Formal Metadata

Title
Getting Your Data Joie De Vivre Back!
Number of Parts
118
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Most of us work too much and play too little. When was the last time you smiled at something you made? Playing with fun datasets, especially big ones, opens up weird new forms of technical recreation. Why not train an amusing model in a browser tab while you're waiting for that day-job Spark query to finish? I'll show you some data toys I've built using AI and interesting data sets; most of them involve both backend data science and front-end visualization tricks. They range from poetry-composition helpers to game log analysis to image deconstruction and reconstruction. All of them taught me something, often about myself and what I like artistically, and sometimes about what "big data" actually means.
Transcript: English (auto-generated)
I also wasn't registered for the conference and my MacBook wasn't working with the projector. Let's see if we get through this. So I'm very happy to be here. This is always interesting to me to give talks.
So I do a lot of NLP these days. I'm not going to talk about NLP as much today, although we are going to do some Word2Vec models. Right now I'm working on hate speech detection, toxic and offensive speech, and it's horribly depressing.
That's also why I'm not going to talk about NLP. So this talk for me is a sort of a self-care piece on getting some fun back into my daily hacking life so that I'm not working on things like that all the time and thinking about things like that all the time. So I was a teacher.
I'm actually consulting now. That's where I'm doing the hate speech detection, but I'm still in Leon. So I'm just down the road from you guys by a train. When I do invited talks like this, I get to do what I want. So like I said, this is me doing self-care and hacking on the weekends, and probably I should develop some real life hobbies rather than hacking, but I'm a nerd just like
you guys. But this is the thing about hacking. It's really hard for those of us who program to separate out the programming for fun from the programming for work. Everybody I talk to is like, oh, yeah, I did this side project and I learned how to do blah, and now I can use it on my job.
It's an especially American problem where we're always thinking about how to monetize our free time. Let's do less of that. Let's make more junk, and let's have a good time doing it. That's one of the things I wanted to say to you guys. It's also a message to me.
So I'm going to talk about some things I did for this talk that are just totally junk, but fun junk that I admit I learned a lot from and probably wouldn't have done otherwise; I was inspired to do them by wanting to make junk. So you guys might be familiar with Hieronymus Bosch's The Garden of Earthly Delights. This is this massive, massive, beautiful painting with three panels.
The left side is sort of a Garden of Eden simple thing. In the middle, we have whatever the hell's going on that has to do with people on Earth and maybe corruption and desire and talking fish, who knows?
And then over on the right, we have The Last Judgement or sort of the end of the world, hell, whatever. There's a lot of analysis of this picture. It's a form of big data. I got really interested in this because of the Boschbot on Twitter. The Boschbot is this person who is posting quantum random selections of this big picture
as a tiny little thumbnail. And it's an amazingly fun distraction during the day when you're watching the American politics and British politics scrolling past and you get this little picture of pure joy from the 1500s.
The way it works is he posts this or she posts this image that's a very, very zoomed in random segment. And then of course, you want to go and click on it and you do and you get like the bigger context, which is random. I mean, so sometimes it's a good segment and sometimes it's not and then you fave
it and you move on. So what I wondered as a data analyst is what are the details in this big picture that people fave the most? What did they like the most? And so I made an app and these are the sort of stages I went through doing this.
Because we're a hacking audience, I want you to know the parts that were fun and the parts that were really hard. But also because I'm learning while I'm hacking, I'm pushing through the pain, right? There was some real pain on some of these projects. So first of all, there's this cool library, Twint, where you don't need to use Twitter's API; it has a really nice command-line interface to pull down tweets,
and you can even ask for tweets from a given user with only images in them, which is what I did in this case. So I pulled down January 1st through May 5th of this year; I finished the project, in its current state, in May. So that was a super awesome library that I hadn't used.
It's really cool. Then I used pandas to load that data from JSON, sort, filter, do some work on it, and get the top 10 by likes. Pandas is always fun, totally awesome. The part that was interesting and hard, and that I did not succeed at (I'll talk about that in a second), is figuring out where in the giant, giant picture that little segment came from. We can talk about that in a second. And then I used Leaflet.js in a web app to display this. So you're going to see a lot of web stuff, because I was sort of getting back to some web programming for these.
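The load-sort-filter step she describes can be sketched like this. The field names follow Twint's JSON export (likes_count, photos), but treat them as assumptions; a real run would start from pd.read_json(..., lines=True) on the scraped file rather than an inline frame:

```python
import pandas as pd

# Tiny stand-in for the scraped tweets; real data comes from Twint's JSON.
tweets = pd.DataFrame([
    {"id": 1, "likes_count": 120, "photos": ["a.jpg"]},
    {"id": 2, "likes_count": 310, "photos": ["b.jpg"]},
    {"id": 3, "likes_count": 45,  "photos": []},
])

# Keep only tweets that actually carry an image, then take the top 10 by likes.
with_images = tweets[tweets["photos"].str.len() > 0]
top = with_images.nlargest(10, "likes_count")
print(top["id"].tolist())  # most-liked first: [2, 1]
```

With only ten rows to keep, `nlargest` avoids a full sort of the whole frame.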
So this is the app, let me just quickly demo it so you can see. So what I did was with the top 10, I located them in the image and put a little marker there for where they were taken from the big picture.
You can see at a glance, the thing I was most curious about is what panel are they coming from. The one on the right has an awful lot of action going on. It's not too surprising when you look at the actual details of what's going on in that panel. There's very little action over in the Garden of Eden section. And then we have a bit, and I think there would be a bit more when I redo this and
update it in the middle panel because there's so much stuff there. So I wanted to size by likes. So you can see there's a big one here. This is the biggest one. And then I was like, okay, well, is it just a fact that the things that are older have more likes because duh. But no, it's not true.
So this one at the time I did the scraping was the most liked, but actually the newest as well. So if we click on that, we zoom in and that's the picture people liked.
So I learned an awful lot about Twitter from that. Now if we look at some other things going on up here. So this one, this is just burning in hell, that little image, which is pretty cool too. There's one down here that I like.
So one of the cool things about this picture is really zooming in pays off a lot. This is like big data that's totally fun. So we can sort of move around. There's so much detail in this picture, it's ridiculous. So this is the lowest level zoom level.
You can see probably like the paint fleck drying effect here. This is one of the reasons localizing where the little segments came from in the big picture was a no-go for me. Okay, let's go back to this. So this is big data.
Big data is just data that's hard to deal with, right? It was slow to load the high-res image, which is 223 MB, in one go in the browser, which is what I was going to do originally. And then if you want to pan and zoom around, it's basically ridiculous. Pillow-based Python tools won't load it.
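The specific refusal here is Pillow's decompression-bomb guard; a minimal sketch of the opt-out (the filename is hypothetical):

```python
from PIL import Image

# Pillow warns above ~89 megapixels and raises DecompressionBombError above
# twice that. Setting the cap to None disables the check entirely; only do
# this for files you trust.
Image.MAX_IMAGE_PIXELS = None

def load_big_scan(path):
    # Would now open a 223 MB museum scan without tripping the guard.
    return Image.open(path)
```

The guard exists because crafted images can expand to exhaust memory, so disabling it globally is a deliberate trade-off.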
You actually get an error warning about decompression bombs, which I was like, what? So basically you can disable that and get it to work anyway, but that shows you the size of this as a big image. But worse, the algorithms I tried, to locate where that little posted snippet
was in the big image, crashed my laptop several times. Now obviously I could have gone to a remote machine with more memory, et cetera, but the way they were failing, and the kinds of matches I was getting when I changed resolutions, were leading me to believe that they weren't going to work. If anyone is a real expert at neural-net template matching and feature detection,
I would love to talk to you afterwards to see if I could figure this out. However, there was an out, this was a hack, right? I'm like, okay, solving this problem isn't the point of my exercise. It was what are the parts of the picture people like.
So I just went into Photoshop and figured out where these segments were and hand-coded a JSON file for the top 10. Obviously that doesn't scale to a live app. And I talked to the Boschbot author, and he or she has now added the actual snippet locations, with the top-left corner, et cetera, to all of the posts, just for me.
So I have to make a live app with these after I'm done with this talk. So there will be a live app showing what the most recent likes and faves were and what section of the image they came from. So that's kind of cool. I made a friend. So I said I used Leaflet. The big-data part of this for web display was something I had been
wondering about for a while. I sort of knew about it, but I hadn't made anything with it. You can think of a really big picture that you want to zoom and pan on as a map, and the tools for doing maps online are the tools for doing big image display like that. So what you do is you tile your image.
You have sections of the big image at different resolutions and sizes for all these different zoom layers, and it theoretically happens seamlessly. I think my laptop's running out of memory, and that's why it wasn't so seamless when I was moving around. But essentially, you use something that's a map tool, gdal2tiles,
to make these layered directories for the different zoom levels. And there's a lot online about how to do it. It's just interesting that we're essentially dealing with tools for maps. So this is me this morning in my hotel room trying to remember the Bash one-liner to show the size of each directory.
So at the zeroth level, which is the top view, there's only one image. And then as you zoom in, seven being the lowest zoom level where you can see the little paint flecks, we have the most images. And we only pull up the ones that we're looking at at that point, which is how you get a smooth effect.
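That growth in tile counts follows from the pyramid scheme: each zoom level doubles both image dimensions, so the number of tiles roughly quadruples. A sketch of the arithmetic, with hypothetical dimensions for a large scan and the 256-pixel square tiles gdal2tiles uses by default:

```python
import math

def tiles_per_zoom(width, height, max_zoom, tile_size=256):
    """Tile counts per zoom level for a tiled image pyramid: the native
    image lives at max_zoom, and each level below halves both dimensions."""
    counts = {}
    for z in range(max_zoom + 1):
        factor = 2 ** (max_zoom - z)  # downscale factor at this level
        cols = math.ceil(width / factor / tile_size)
        rows = math.ceil(height / factor / tile_size)
        counts[z] = cols * rows
    return counts

# Hypothetical dimensions for a very large painting scan, 8 zoom levels (0-7).
print(tiles_per_zoom(30000, 17000, 7))  # zoom 0 -> a single tile
```

Only the tiles in the current viewport are fetched at each zoom level, which is why panning stays smooth even though the deepest level holds thousands of tiles.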
So the other thing that's cool about treating this image like a map is that if you use a tool like Leaflet.js, which is for interactive map display online, it's an old established tool with lots and lots of UI features and controls. It's actually really easy to add stuff like when I zoom into a certain level, replace it with the rectangle showing what the edges of that snippet were.
When I zoom back out, replace it with a dot. And you get tool tips and things like that. So that was cool. And naked guys butt on a fish. So this one, it's only because I was talking to the Bosch bot
about how he or she did the segments of the image randomly that I learned why this was number one. At first I was like, well, obviously that's just people like that picture, right? Actually, it's the pinned image on their account at the top of the account. And so obviously it's the first thing you see, so you go and you hit like, right?
It's not just that. There was a celebrity who went and retweeted it. And so that's actually why it's number one. So that when you're looking at faves on Twitter, you're always gonna have these effects where somebody really famous accelerates your liking. So it's a bit hard.
She's a comic. She's actually a pretty funny follow. So just this past week, The Pudding, who do a lot of interesting interactive data visualizations, did something similar. This is a t-SNE layout using Mario Klingemann's RasterFairy to squarify the layout.
They used a tool called OpenSeadragon instead of Leaflet. So you get the same kinds of zoom, pan, move around. I don't have a mouse, so it's less than cool. But essentially these are very detailed tiled images. I could probably plug in my same tiles.
All they've got is the zoom in and out button. Because they didn't use Leaflet, they don't get access to the same number of UI controls and things that I wanted to have. So even if I switched, I'd have to do a lot of coding from hand to get some of the effects that I was building in the app.
OK, so that was a fun project that now has me promising to build a live updating real web app. Excellent. Word2Vec toys. So one of these is a project I did last year for a conference that I then didn't go to because they had weird funding problems.
But it led to the project after it. So I'm going to talk about this. So most of you are probably at least glancingly familiar with Word2Vec, which has been all over the place for a few years. The concept in Word2Vec is that we analyze a lot of text, and we come up with essentially a matrix
with a vector representation for each word in the document collection that we've analyzed. We do that by looking at sort of a window of context around the words. So what we do is we essentially encode in this matrix things about the relationship of this word
to other words in the collection. Which means that we can do math on that matrix like find me the most similar vectors using cosine similarity, for instance. And those are words that are related in that they've occurred in the same contexts.
My source for these projects is Project Gutenberg, which is a great source for fair use text. Lots and lots of text and lots of formats. You can even download and read it on your Kindle. You can get the plain text. There are libraries out there to deal with getting text out of this because it's a huge collection.
The one that I settled on using is actually an R project, gutenbergr, from David Robinson, because it allowed me to query by subject. So I do a lot of things by subject, like say ghost stories, or poetry in this case, or detective stories or whatever.
You can do other queries by author, or by other types of metadata. Now, Gutenberg updates its catalog fairly regularly, so you have to keep your local copy of the metadata up to date, because I don't believe they have an online API that you can query to do all of what we've done here.
All right, so anyway, so I downloaded his code. I had to update all of the metadata tables because in January, in particular in January, there was a big change. And then one of the things his library does that some libraries don't do is it strips out the header text and the footer text, which is this sort of generic stuff on Gutenberg files.
You don't want that in your language model. All right, so then if you wanna make a Word2Vec model, there's lots of info online about how to do it. The most popular way is Gensim. Gensim wants you to put in tokenized sentences.
So essentially you have this-is-a-sentence and its tokens, just the words as a list. And then your entire document collection is one of these for each document, sentences generally. And then I'm giving you this particular code example because it shows how to save the model
and then load the model. If you're gonna do the kinds of things that I did where you use your model for other things, you need to be able to save and then reload it in some place like a Flask API. So one of the attributes of Word2Vec that a lot of people have heard about is that you can make these sort of 2D layouts
that show you those relationships between words. So like I said, you can do cosine similarity and determine words that are similar to each other in the vector space. So you can also make these sort of explorable maps. Thanks to Peter Baumgartner for a gist that I modified to get this working quickly for this talk.
So this is folk and fairy tales. And this is an interactive exploratory tool for looking at the space. So it's in Plotly, which means that I can filter and sort and things better with a map. So if I roll over these,
you can see what the words are and see possibly why they're grouped together. So this is things that have to do with time and distance probably, fifth, sixth, fourth, eighth, afterwards, week, before, short, afterward, years, months, miles.
So frequently you get parts of speech that are grouped similarly. Boy, it would be way better if I had a mouse. Okay, over here we have things that are sayest, mayest, wust, wilst, thyself.
So apparently there was some really old-fashioned text in this Gutenberg file, and it grouped all of those old-fashioned helper verbs together. Over here, further, rather, sooner, higher, sweeter, et cetera. Up here we have other languages, words from other languages, which is strange,
and I could go and look at these. In this particular layout, color means that this word occurred more often in the text file. So his gist that I used to make this, I updated so that you can do your own with the word counts as well as the model.
And it makes this cool, this cool map. Okay, so like I said, you can find words that are related to each other with these models. So you can do the, so for polite, what are the most similar words? Courteous, friendly, cordial, professional, attentive, gracious, that looks like a good model.
So my question was, what if we went from the closest word to polite, which is courteous, and then from courteous we looked for what the closest word is? It's not necessarily friendly, because of the way these things work, right? And so what if we kept chaining from word to closest neighbor, to that word's closest neighbor, and so on, in these models?
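The chaining idea can be sketched as a greedy walk over a vector table, skipping words already visited so the chain doesn't loop back on itself. The vectors below are made up for illustration; a real model (e.g. a trained Word2Vec's word vectors) would supply them:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_chain(vectors, start, length=4):
    """Greedy chain: from each word, step to its nearest unvisited neighbour."""
    chain = [start]
    visited = {start}
    for _ in range(length - 1):
        current = vectors[chain[-1]]
        candidates = [(w, cosine(current, v))
                      for w, v in vectors.items() if w not in visited]
        if not candidates:
            break
        best = max(candidates, key=lambda pair: pair[1])[0]
        chain.append(best)
        visited.add(best)
    return chain

# Toy 3-dimensional "embeddings", purely hypothetical.
toy = {
    "fire":    [1.0, 0.1, 0.0],
    "burning": [0.9, 0.2, 0.1],
    "flame":   [0.8, 0.3, 0.0],
    "folded":  [0.1, 0.9, 0.2],
}
print(nearest_chain(toy, "fire"))  # ['fire', 'burning', 'flame', 'folded']
```

Note that the chain is not symmetric: burning being closest to fire does not mean the next hop from burning goes back to fire, which is exactly why excluding visited words keeps the chain interesting.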
And so I made an app. This is the word fire in a Gutenberg older-poetry Word2Vec model. The closest word is burning; the closest word to that, but a bit distant compared to the others, is folded.
Lids, curls, silks, garlic, glossy. And so if we keep doing that, this swoop is supposed to indicate that we're reading left to right, we get these words. Now if I change to a different model,
this is a slightly different model that includes things that were released in January. We get different results. So the closest word for fire is flame, burning, crimson, purple, foam, gleaming, okay? Now if I switch to folklore and fairy tales, the one I was just looking at in the big graph,
we get hearth, and then pretty far away, oven, pale, empty, dipped. So think about Hansel and Gretel, things like that. The witch is gonna put them in the oven or whatever. So these are nouns related to scary fire situations, probably. Okay, and I don't think Nerd2Vec works very well.
I'm gonna talk about that in a second. This is quite different. And yeah, so nerd2vec, which is built on like Star Trek, Doctor Who, Star Wars, firing, shooting, throwing, not so surprising. All right, so this is another example.
Alison Parrish is the one who inspired me to do this originally, because she had a poetry Gutenberg corpus; then I made the Word2Vec model and built these apps with it. So what we're seeing here is the vertical distance, the distance between one word and its next closest neighbor.
And then the size here is essentially encoding the fact that there were loops. We don't want cycles in this list of words. It gets boring really fast because you see the same word repeated over and over. So I just sort of increment the count for a word and then go on to the next closest so that we get an actual chain. And the color is just pretty in this case.
I didn't say that as a database person, but it is. All right, so like I said, there was new public domain stuff in January 2019. So some of the things that were added were really cool things to work with, like Robert Frost. Robert Frost is a well-known American poet.
His stuff all entered public domain. A bunch of other great fiction works that are essentially more modern in feel. So this is an interesting contrast here. Believe, the word believe in the older model pre-January. We had a model for poetry that went learn, grieve, suffer, choose, understand, et cetera.
And then after I updated the model with the things from January of this year that were just released, we get magazines in there somewhere, which is super weird. So now believe goes to suppose, then believe again (I didn't take out loops from the seed word), then down. And then over there in the second column, magazines,
satires, newspaper, et cetera. You can see it kind of goes off the rails sometimes over to the right, where we get into really specific words like parliament. This is in a poetry model too, which is weird. Yeah, so this is the source on Nerd2Vec. So Nerd2Vec is definitely very different in its properties.
Now, this brings me to my next project. So all of that was just sort of precursor to the project I actually wanted to make for this talk, which is that I'm really interested in creativity assist tools that help you build things using AI or machine learning or whatever,
but with a human solidly in the loop and in control. So one of my inspirations with text games is things like blackout poetry. You have an entire text, but you only use certain words in it. That's Austin Kleon.
This is cut-up poetry, where different texts are combined in order to make a poem by a human author. So this goes all the way back to the surrealists; that's Timothy David Ray. This is an interesting case: Reznikoff wrote poetry based on legal briefings.
So he took legal documents and then put together segments to make poems that are fairly moving, actually. So I built an app. So what I wanted to do was mess up poetry.
I wanted to take good poetry and then play with it at the word level. The inspiration here was like, am I gonna learn, am I gonna make anything better? Probably not, but am I gonna learn why they did what they did in great detail? Probably. So it's a really interesting close reading exercise
to take someone's work and then mess with it. But I wanted to do it in a kind of principled way with lots of constraints. So what I wanted to do was for each word there that's in color, I'm loading up the list of the 10 closest words in a poetry model. What other poets might have used instead.
Okay, so essentially this is an editable demo. So that's Robert Frost. This, I have a bunch of different models in here, Word2Vec models. I have the older original poetry one. I have up to 1923 poetry. So if I pick woods, I can pick a different word.
So I'm going through the woods, maybe I'm going through the fields instead. Whose fields these are, I think I know. His house is in the village. His table is in the village. Which, see this is not a good poem.
But his table's in the castle, I kind of like. All right, so it turns out that, I'll get to the color in a second. It turns out that these longer poems are a bit harder to work with. Let me pick something shorter. Okay, so this is Amy Lowell.
Your voice is like bells over roofs at dawn when a bird flies. Okay, you see how they aren't all the same part of speech so there are Word2Vec models that are encoded for part of speech so I could have done that. But instead I'm building a model that's just like what did other poets do in this kind of context, right?
Your cry is like bells. So at the point where you've done the second word, you might want to go back and fix the other one. This is the way this thing works, right? Over our surf.
Okay, so these words about time are usually the easiest. Okay, now, Nerd2Vec is weird. So remember, Star Wars, Star Trek, Doctor Who,
character, name, role, performance, vocal. Some of them are just like what? All right, sorry.
I don't know, it's so pillars, I don't know. So vampires have got to be at night. Notice there's like sirens. So in one of these you get Sith, the Sith Lord. So all right, so they're fun. In fact, it turns out that actually the really
funnest ones are the haiku for some reason. So with haiku, of course, you have a syllable count. So you have to decide am I gonna stick with the syllable count or not. So I usually try to stick with syllable counts, so old pond. Okay, now look, here's a point about these older embeddings.
They are not good at getting contexts of words that are very different. So if a word can appear in different meanings in different places, the modern Transformer language models are much, much better at handling that. These old models aren't. This is an exact example of that. So pond, I'm in the Nerd2Vec model.
Doctor Who, Amy Pond is a character, if you're a Doctor Who fan. So is Rory. A bunch of these are people. So essentially the word pond in the Nerd2Vec model is about a person more than it is about ponds in the woods or whatever. In poetry, it's about ponds in the woods.
But so here, we essentially have a case where Word2Vec sort of breaks down, or this is a shitty model for poetry, which it definitely is. All right, so let's talk about this a little more. So in fact, the color here represents a normalized distance between the original word and its next closest relative.
So essentially, if you see something that's pinker, that means that the next relative here is further away from it than the next relative is in these cases. So that can mean that that's like an interestingly different word that isn't as common in these poems.
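That normalization is just a min-max rescale of each word's distance to its nearest neighbor, so the two extremes anchor the ends of the color ramp. A minimal sketch, with made-up distance values:

```python
def neighbor_scores(distances):
    """Min-max normalize each word's distance to its nearest neighbor,
    so the values can drive a color scale (0 -> blue, 1 -> pink).
    `distances` maps word -> distance to its closest relative."""
    lo, hi = min(distances.values()), max(distances.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero if all distances match
    return {word: (d - lo) / span for word, d in distances.items()}

# Invented example distances for three words in a poem line.
scores = neighbor_scores({"voice": 0.20, "bells": 0.35, "dawn": 0.80})
print(scores)  # "voice" -> 0.0 (bluest), "dawn" -> 1.0 (pinkest)
```

A word scoring near 1.0 is the "interestingly different" case she describes: its nearest relative sits unusually far away.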
I did a lot of work to get this color because I'm a database person. I don't know if it was worth it and paid off in terms of actual use. That's one of the points about doing dataviz and interactive work is you aren't always sure you're gonna get a payoff from the feature that you just spent
a lot of time working on. So essentially, this is just me explaining how this worked. D3 is what I used for data visualization on the web. The way this works, just so you understand, is that you essentially take a domain of numbers, which in my case are the distances
between a word and its next closest neighbor, an array of all of those distances, and I get the minimum and I get the maximum, and those are my anchor points for the color scale. So the minimum is gonna be the blue, and the maximum is gonna be the pink. So I could've recoded this, but I just used somebody's picture last night at 11 o'clock.
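The scale she describes is a linear map from the [min, max] distance domain onto a blue-to-pink range, which is what a two-stop `d3.scaleLinear` with color interpolation gives you. Here's roughly the same mapping sketched in Python; the endpoint RGB values are invented for illustration:

```python
def lerp_color(c0, c1, t):
    """Linearly interpolate between two RGB colors, like a two-stop D3 scale."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))

# Made-up endpoint colors for the ramp: blue at the minimum, pink at the maximum.
BLUE, PINK = (70, 130, 220), (230, 80, 160)

def color_for(distance, d_min, d_max):
    """Map a distance onto the blue->pink ramp, clamping to the domain."""
    t = (distance - d_min) / (d_max - d_min)
    t = min(max(t, 0.0), 1.0)
    return lerp_color(BLUE, PINK, t)

print(color_for(0.2, 0.2, 0.8))  # the domain minimum maps to pure blue
```

In the app, the domain comes from the array of all word-to-neighbor distances and the range from the chosen color stops; D3 handles the interpolation itself.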
So that's how the encoding to color works. So it also turned out that to get those scores, to get that color, I had to radically rearchitect the app. This is another thing about dataviz. So this is the stack, because I stupidly used React.
The stack is incredibly deep here. We have the app at the top with the dropdowns for the model and the poem, and then we have the poem renderer. We have each line in the poem. We have words that are clickable that load up a little dialog that have this list that you pick from and then you pass it back up to redraw the whole poem, right?
So initially, I called the API to get the closest word here, which is the obvious place. I click on a word, I get the list of closest words, super simple. But to get the colors, I had to do it up here, and I was in Promise hell. So truthfully, anyone here who's actually good at React, I wouldn't mind having a chat.
Because React, essentially, is all about passing state. This is what my code looks like, too: I have like 18 functions called handleChange. It's insane. All right, so like I said,
the Basho haiku are actually the coolest. This is the original haiku: the first cold shower, even the monkey seems to want a little coat of straw. I ran it through one of the poetry models: the last wet shower, even the basil seems to try a little cap of dirt, which I think is very haiku. But then the last hot meteor, even the lizard seems to need a little robe of guts, is the Nerd2Vec version. I read that and I laughed. All right, so the parts of this project. Making the Word2Vec models is always easy with Gensim and always fun. When I decided to code the poem in React, it's essentially because I've been doing data science in server-side Python for a little while now
and haven't been making data vis apps, I was like, oh, let's just use the tool du jour. But in fact, this was a really complicated app to be my starter app in React. The learning curve was hard. Why isn't this updating
and why is this updating too much was really hard. And then at the point where I decided I had to do the visualization by getting all of the distances up front and then passing them back down to the dialogue list, that was really hard. The promises handling was hard. Putting the actual color on each button was super easy.
D3 is fun and it's still fun in React. I was glad to see. And then at the end, like last week, I was like, oh, I'm gonna add a dropdown to change the poem, add a dropdown to change the Word2Vec model so I can get Nerd2Vec in there. That was actually kind of easy. So at that point, I'm like, maybe React is good for big projects
where you need to make a little change. Okay. I think hooks would have helped, which is the latest way of handling all the state passing. So some related AI creativity work. So we're moving away from my projects
to sort of the bigger picture. I really like that human in the loop working with a model or a tool or a learned representation. And for me, the interesting AI art is the art where I got to do an awful lot of the customization of the authoring of the curation,
not just throw up a big GAN model and then tweak one parameter. So obviously, if you're keeping up with the GP2 model, big language generation models, Hugging Faces, right with Transformer, is an example where you can write with the GP2 model.
I still don't love it. I mean, I don't know if you've tried it, but essentially, it's a great app, but... All right. So essentially, we've got a starter text, and then we get options that the model generated based on what it was trained with. So Legolas and Gimli advanced on the orcs,
raising their weapons with a harrowing war cry, but the heroes were not to be defeated. So that's it. And then you start typing something.
Hit Tab, and then you get what it would complete with. A lot of times, it's nonsensical. Most times, it's nonsensical. Even GPT-2 isn't very good at big-picture coherence in text. So for me, this is still not a superb creativity tool. It's fun and funny, and I'll probably add a slide to my deck later with one of my favorite stories I created with it, but it still has its issues. Okay. In the visual space, there's a lot more going on. NVIDIA's GauGAN demo is pretty cool. So what you do is you draw with a pen that says,
what are you drawing? Sky, tree, cloud, mountain, or whatever, and then you tell it to turn it into a photorealistic picture, and it does that based on what it's trained with. So this is like my not-very-good trees turned into weird trees by an ocean with some hills. So I like it because it's totally surreal, but it's obviously not photorealism.
It's surreal maybe because my input was bad or because of how it's putting it together. And then you can edit it again and get more stuff going on, but it's still not superb. This is one of my favorites. Sorry about my Pinterest thing. This is one of my favorites, though. GANbreeder is a very cool app.
This person is updating it to do more things with it. The way it works is this. Those were my pictures, by the way. I'm obsessed with generating castles. I just wanna build the coolest castle, fortress, tower, whatever, and that's one of my obsessive goals. Okay, so if you go to the GANbreeder page,
and you can see sort of random things other people have created or that have come out of the app, you can start from an image, and then what it does is it generates children related to that image, and you can edit the genes, and you can see what went into this. Garden spider, jellyfish, sea anemone, comic book,
consommé (a soup), bell pepper. Okay, click on that, get more children. Those are looking pretty, actually. Okay, I'm gonna save that.
Anyway, there's a bot that's posting cool ones from this and they're about to do major updates on the UI tools for this thing. It's gonna be great. It's a super awesome tool. So this is a case of a human in the loop editing generated model pictures.
I said I'm obsessed with castles, so as soon as he or she said they were updating it, I'm like, more castles, please, because essentially there's one category for castles and they aren't sufficient for me to really make the castle of my dreams. All right, so that brings us to some of my next points. So doing this stuff, find, make, use cool data sets,
but be clever about it, right? There's so much interesting data out there and you can make your own data sets. Finding fun data sets isn't actually that hard, but there are two resources I'm gonna point you to. Jeremy Singer-Vine's Data Is Plural mailing list is awesome. He's a data journalist and there's a big archive
of all of his data sets out there and it's a great newsletter with just a few short bullets with like a description and he loves weird data sets. So that's super fun. I also obsessively collect links to data set collections. So my pinboard, which I update daily with my many open tabs from Twitter,
there's a data set tag. And some of it isn't gonna be obvious. It's things like game, game archives, things like that. So you're gonna look and go, why is that a data set? It's a data set if you're a data scientist or a data goofer offer. Okay, and you can obviously make your own data sets. This is Josh Stevens, who's a great map maker.
He did this now quite famous map of Bigfoot sightings in the US. Bigfoot, the abominable snowman, the yeti. So there's like public data about this and he turned it into this map that was all over in lots and lots of articles at the time. This wasn't for work, it was goofing off.
And then he has over here a little like educational bivariate explainer about population versus sightings of Bigfoot, which is super interesting. Okay, of Oz the wizard. This is one of my favorite weird examples. So this is data you didn't know was data.
So just like me editing poetry. So this is this guy, Matt Bucy, who alphabetized everything in the Wizard of Oz. Ready.
Already. All right. All right. All right. All right. Okay. Yeah, so everything in here is alphabetized.
So it's essentially alphabetical and then linear or temporal sequence. Decree, the grade of degree. Delusion. All right, so obviously he used code to do this.
We could have done this, right? I'm kidding, I wouldn't have done this. But it's amazing. So one of the things about that exercise is that, as he said, his appreciation for the film increased.
It's just like looking at the choice of a word in a poem and thinking, could I do better? Could I turn it into mine? When you look very closely at something using tools, you start to love how it was made. You start to love the data. Doing things with artistic works is even better because you learn a lot about the art.
Another interesting AI art project. So Victor Dibia did this African mask curation project. So he hand-curated zillions of pictures of African masks and then used it to generate new African masks with a GAN.
So there's a good write-up and an interesting demo that he made. This is a case of a passion project where you curate your own data set to do something interesting with it. I'm working on gargoyles and castles, obviously. Stay tuned. Anna Ridler, so probably Luba Elliott is gonna talk about these projects tomorrow, so I'm not going into any detail, but Anna Ridler took pictures of tulips and has a huge display of all the tulips she used, and then she used them, for the Netherlands, to generate GAN images of tulips that have never existed. This is another hand-curated, beautiful collection.
Helena Sarin makes really interesting GAN art. She also curates and collects and does her own imagery that she feeds into the GAN to make her own artwork. I'm quite sure Luba will talk about this tomorrow. I love her style. It's very different from a lot of other GAN art.
So, that said, don't be a jerk. When you're doing this kind of creative work and using data sets other people created or inspiration from other people or poems or code, give credit to your sources and inspirations, and in particular, don't be this guy.
Hopefully we all know about DeepNude. This dude is like, the algorithm only works with women because images of nude women are easier to find online, but he's hoping to create a male version, too. Bullshit. He's not gonna make a male version, okay? And his argument was, if I don't do it, someone else will do it in a year.
What? Be smarter and be more interesting, and don't be offensive, essentially. So, the code was all over GitHub because he said it was too late, it was out there. I saw late last night that GitHub is taking it down, so good luck to them,
but there's a lot of really offensive things you can do just because they're easy. It doesn't make them fun. So, this is an example of sort of a tired metaphor, but it spoke to me. So, this is my patio in Lyon. I have lots of flower pots on it. There's a particular flower that I really like that's essentially a weed that I cannot grow well in these pots.
I don't know what the deal is. One seed fell off the edge. You can see it sort of down here. This is one plant. It gets almost no sun, no rain, and it fell on gravel, and it did that. That's one plant. I can't put it in a pot and have that success,
but sometimes the stuff you just throw out there that lands where you think is shitty, it does the best, and that's one of the cool things about being creative and putting your stuff up there is that sometimes it will put a flower in a place full of weeds.
I think you understand what I'm getting at. Max Kreminski also has this reminder: if inexperienced creators are using your tool to churn out loads of half-baked garbage, your tool is a phenomenal success. So true. What we want is people to be making a lot of half-baked garbage and having a good time.
So what I would say as a takeaway is if you find yourself playing with something you made and you're actually smiling or you're laughing at it and you're playing with it to check out something new, you've definitely succeeded at a fun weekend hack, and you wanna make yourself and other people smile. You don't wanna be a jerk,
and you wanna make the world a better place for yourself and for other people. So I wanna give some shout-out thanks to, this is not all of the open-source projects I use, but the particular helps for this. The Basho bot was really helpful. Peter Baumgartner's gist for making a UMAP display in Plotly was an amazing help. I used this dude's Word2Vec API code,
which I could've written myself, but I saved a bunch of time; weekend hack. Peter Gassner helped me a little with React, and this person helped a ton of people in a comment on the UMAP repo about how to actually get UMAP recognized in your Python project. You can see how many people are super happy; I'm one of them. So anyway, that's my talk. Thank you. Thank you, Lynn.
Thanks, this was a great keynote. I guess you have many questions maybe, but you're around until Thursday? Yeah, I think let's talk over coffee so we don't. Yeah, okay, thank you very much. Thank you.