How we Used NLP and Django to Build a Movie Suggestion Website & Twitterbot
Formal Metadata
Title: How we Used NLP and Django to Build a Movie Suggestion Website & Twitterbot
Title of Series: DjangoCon US 2016
Part Number: 3
Number of Parts: 52
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/32701 (DOI)
Language: English
Transcript: English (auto-generated)
00:18
It's my first time speaking at DjangoCon, so I'm very excited to be here.
00:23
Yeah, this is my email, so if you have any questions or anything, please feel free to reach out. And real quick, if anyone likes to follow along, the slides can be downloaded at that URL. It's just a PDF in case anyone is looking for that.
00:45
So real quick, to give you an overview of what we did: this talk is about a project we implemented. We worked with the Cleveland International Film Festival to build a little bit of software for them.
01:01
This was sort of a hobby project that we did. Briefly, about the Cleveland International Film Festival: it's a two-week-long event, similar to Toronto or Sundance, the other big film fests, and they have a lot of independent films.
01:21
It was held in April this year. And it's a pretty big event for Cleveland, so we were really excited to do some stuff with it. I myself am not a film buff. I'm sure many of you in the room are. Shout out to a movie I saw there, Morris from America. It was an independent film, very good.
01:41
And I'm not affiliated with the Cleveland Film Fest, so that's my disclaimer. And about the project: that's our little robot guy there. So we built a movie recommendation engine. There are over 400 films at this festival, and they're pretty much all brand-new releases,
02:02
independent. They've never been seen. They've never been reviewed. The only thing we have on them is a title and a description. So when you're deciding which one to see, especially if you don't know the producers or don't know who's making the film, it's kind of intimidating for an average person.
02:22
So we decided to take the publicly available data that they published on their website and do a little bit of natural language processing on it to identify which films were similar to the other films so we could have a guide as to what to watch. And we also did the Twitter bot too as sort of a fun aspect of it
02:42
that would tweet out during the showtimes. It would say, you know, this film is starting, and then when the film was over, it would say this film is ending; if you liked it, catch this other film that's similar. And it was all driven by computer-generated statistics about the films. So it was a quick thing we did in a couple days.
03:01
It's 100% Python except for the use of cron. It was all built with open-source software, and once again, the data was all publicly available from their website. So what we're gonna look at in this talk is we're gonna put on our search engine hats, first of all, and figure out how to scrape the data from their website
03:23
and get it into our Django database, into our models. The second thing we're gonna do is look at some Django management commands, which are pretty simple, but maybe new for a few people. They're very useful. And then we're gonna get into the meat of the talk and look at some concepts in natural language processing,
03:43
just some basic stuff, and then explore functionality in the NLTK, which is the Natural Language Toolkit. It's an open-source toolkit. It's all in Python. It's very easy to use, and it's a great way to learn natural language processing or entry-level AI if you're interested.
04:02
And then we're gonna look at using the Twitter API to make sort of a dumb Twitter bot. We're not making a Microsoft Tay or anything like that. And we're gonna use the cron job to sort of give life to our Twitter bot to tweet out at appropriate times.
04:20
And if we have time, I'll go into the Whoosh and Haystack searching of the movie database. So first things first: we're basically starting from nothing. We need to look at their website and identify what the data is and what we need to do to represent it. And we need to make some models.
04:41
So the most obvious models are the movie model to represent the film itself and the showtime to represent the time that the film is playing, because some of them will play two or three times over the course of the film fest. So I will just go to their website real quick.
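For reference, here is a minimal sketch of what those two models might look like. The field names are illustrative guesses, not the project's actual schema:

    # models.py - rough sketch of the two models described above.
    # Field names are guesses for illustration, not the real project code.
    from django.db import models


    class Movie(models.Model):
        title = models.CharField(max_length=255)
        description = models.TextField()
        url = models.URLField(blank=True)

        def __str__(self):
            return self.title


    class Showtime(models.Model):
        movie = models.ForeignKey(Movie, on_delete=models.CASCADE,
                                  related_name="showtimes")
        starts_at = models.DateTimeField()  # store these timezone-aware (see the note below)
        ends_at = models.DateTimeField(null=True, blank=True)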
05:11
They've got a whole list of films. Now, this may have changed a little bit since it's all over, but here's a film on their website.
05:23
And as you can see, it's a very nice looking site. It has all the information laid out, but once again, putting on our search engine hats, how are we going to identify what is what on the page? We know that this is a description, but how will Python know? And we know that this is a title,
05:42
how will Python know? So for that, we use two libraries, one of which is urllib, which I'm sure many people have used in here, and the other is beautifulsoup. So urllib basically just gives you the ability to do requests,
06:03
sort of network type things. Beautiful Soup sort of parses the HTML markup into a document and lets you query it. So if you've ever used jQuery, Beautiful Soup is sort of a Python way of doing a similar thing, querying the different HTML tags on the page
06:21
and identifying what's in there, parsing the data out of them, et cetera. So I did wanna open source this, but because we could potentially DDoS their site from crawling it way too much, I didn't, but I'm gonna show you the code.
06:45
So basically, urllib is a pretty simple tool that will let you, oh, that's not showing.
07:07
So urllib basically just lets you make a request to any web, any resource out there. It's very straightforward. It reads that then into memory, and from there, we will take what we just read
07:22
and parse that with beautifulsoup as HTML. So now we have this whole soup object, which is essentially the HTML document, and from here, we can query through it. So the obvious thing is, okay, the H1 is probably gonna be the title, right? So we go through and we parse some of that stuff,
07:42
but as you have seen in your careers probably, web pages aren't always formatted in a way that makes sense, especially when you get JavaScript and stuff going on. So this ended up being a little bit more complicated than we initially thought to try and find,
08:01
okay, where is something existing? We have to look through different columns on the page, et cetera. So naturally, this is a little bit hard-coded more than a regular search engine would be, but you can basically just loop through everything on the page and parse out that info using a query similar to like jQuery,
08:23
like right here you see: we're gonna find all divs that have these certain classes, and then within there, find the paragraph tags. And right there, you've got an object that you can play with and get the content out of. So that's Beautiful Soup; it's really great if you need to parse any HTML.
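A minimal sketch of that scraping step, assuming Python 3; the URL path and the CSS class name are placeholders, not the festival site's real markup:

    # Fetch one film page and pull out a title and description.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("https://www.clevelandfilm.org/films/some-film").read()
    soup = BeautifulSoup(html, "html.parser")

    # The obvious guess: the h1 is probably the title.
    title = soup.find("h1").get_text(strip=True)

    # jQuery-style querying: find divs with certain classes, then the
    # paragraph tags inside them.
    paragraphs = []
    for div in soup.find_all("div", class_="film-description"):
        for p in div.find_all("p"):
            paragraphs.append(p.get_text(strip=True))
    description = " ".join(paragraphs)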
08:43
So we basically go through that painstakingly until we're able to identify everything on the page, and then loop through here, and we just create our Django object right from that. So this is our movie object, our model, I should say, and we just create it from this stuff that we parsed,
09:03
and now we have a database full of data, but we need to run this, because this is sort of a script, right? It's almost like a migration, you could say, because you have to populate your database before we can even do anything with it, and then the other quick thing I'll note
09:20
is to pay attention to time zones if you're ever dealing with dates, because that can really throw some things out of whack. So here we are using the US Eastern time zone; that's just always something to be cognizant of. So we've got our big, long function that takes care of requesting the pages from the website
09:45
and then identifying the data in them, but how are we gonna run that? So, Django management commands: they're very simple to do. They really are this easy. They may seem a little bit cryptic,
10:00
but a Django management command is just what you run from manage.py: from my virtualenv, I can just do python manage.py runserver. That's the obvious one, but if you just do help, it will list them all, and you can see down here that we have a few of our custom ones,
10:23
and the one we're interested in is create movies. So all it takes to make a Django management command is to put a file within your project, under the app's management/commands directory; my app is called web. It's as straightforward as it can get,
10:42
and then the name of your command is its own Python file. It really is that simple: all you have to do is call a Python function from inside your handle method, so creating this one simple file will give you the Django management command
11:03
that shows up right down there. And if you notice, I had a little help string in there, so if you do help, it will output my help text right up here
11:20
and all of the standard options, so it's very easy to make a Django management command. And then all I have to do is run that, which I'm not gonna run now because that would make hundreds of requests to their site to scrape everything, but it's as easy as that to do. So I would run my migrations, make my database,
11:41
and then run this, and it will populate my database. So now we've got all that stuff taken care of, and that's code tip number one. Once again, you can see it just gives you the management command right there, so that part was in principle pretty easy.
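For anyone who hasn't written one, a management command along those lines might look roughly like this; the app name ("web"), the command name, and the scrape_films() helper are all illustrative, not the talk's exact code:

    # web/management/commands/create_movies.py
    from django.core.management.base import BaseCommand

    from web.scraper import scrape_films  # hypothetical module holding the crawl function


    class Command(BaseCommand):
        help = "Crawl the festival site and populate the Movie and Showtime tables."

        def handle(self, *args, **options):
            count = scrape_films()
            self.stdout.write("Imported %d films" % count)

Running python manage.py help will then list this command alongside the built-in ones, and python manage.py create_movies runs the scrape.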
12:01
But once again, it took a little bit longer than expected to find the right HTML; it's sort of an SEO thing, almost. Now onto the good part, which is the natural language processing. So before we look at any code for that, I'm gonna go over a few concepts,
12:22
so as you know, natural language processing is basically understanding text in a way that a human would, but you're doing it with the machine, so in our case, what we're trying to understand is how similar are two movies? If we've got one about war and one about love,
12:42
how's the computer going to know that those are even similar? The most obvious way is to look at the words that are in the document, and that's what TF-IDF is. It stands for term frequency-inverse document frequency, and basically what that means is it looks at both of the documents
13:00
and says which terms show up most frequently within those documents, so if you have three documents and one of them has the term love in it, repeated 10 times, and another one has the term love once, and the third one has the term love five times, then document one and document three will be determined to be the most similar
13:22
out of that group. So because of the simplicity, it's relatively easy to implement, but the problem is it's not very smart, because it literally just counts up the words and says which documents have the most in common,
13:40
and one other thing I should mention is that when you are making this comparison, you remove what's known as stop words, so like the, a, and, your prepositions, that type of thing, so you're really only looking at words that actually have context or meaning rather than just words that are there for structure,
14:03
so let's look at these two sentences, right? I went to the bank to deposit money compared to I slid down the bank by the lake. TFIDF would say, oh, they're both similar, they are short and they both contain the word bank, so they must be similar, but we know
14:22
because we are humans that the word bank has two completely different meanings in this context. And one of the more traditional examples that you see in textbooks is I gave my dog a bone versus the sailor dogs the barmaid, right? One is being used as a verb and means to pester,
14:41
and one is being used as a noun and means an animal, completely different meanings, so TFIDF really isn't that good at solving this, but half of the time it will get pretty accurate because our language isn't that diverse, so what's a better way to approach this? A better way is to look at the word sense disambiguation
15:03
and what this means is that you actually look at the meaning of every word based on the context that it's in, so once you understand the context, you will know that, okay, this word refers to something completely different than it does in this other context, therefore they are not similar,
15:24
and the way you would go about this is you can't just sum up all the words in the document, you have to look at it sentence by sentence so that you can get the correct context, so once you determine the meaning of each word within each context, you would take another step
15:41
and look at the lemmas. Now, the true way of describing what a lemma is: it's the meaning of the word when it's in your head, before you actually speak the word. That's the true definition; it's hard to describe, but it's basically a synonym,
16:00
so it's the true like abstracted meaning of that word, so we will look at each sentence, identify the meaning of the word within that sentence and then look at lemmas or synonyms of that word within its context, and when we do that and compare two documents we'll definitely have a much better sense
16:22
for how similar they are, so here's our same example once again, I went to the bank to deposit money, here we identify bank as a meaning and it is in this case a financial institution that accepts deposits and channels the money,
16:43
the lemmas would be bank, banking company, financial institution; there are probably more. And then in our other example, I slid down the bank by the lake, here we identify the meaning of bank to mean sloping land, especially beside a body of water,
17:01
and the lemmas would be slope, curve, side, edge, shore, shoreline, et cetera. So if we were using word sense disambiguation to compare the similarity of these, we would not even put them in the same class whatsoever.
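As a small illustration, NLTK's built-in Lesk implementation can be asked to disambiguate "bank" in each sentence. Which synset it actually picks depends on its fairly simple algorithm, so the output is not guaranteed to match human judgement; this assumes the punkt and wordnet data have been fetched with nltk.download:

    from nltk import word_tokenize
    from nltk.wsd import lesk

    sent1 = word_tokenize("I went to the bank to deposit money")
    sent2 = word_tokenize("I slid down the bank by the lake")

    sense1 = lesk(sent1, "bank")
    sense2 = lesk(sent2, "bank")

    # Each result is a WordNet synset with a definition and lemma names,
    # like the "financial institution" and "sloping land" senses above.
    print(sense1, sense1.definition(), sense1.lemma_names())
    print(sense2, sense2.definition(), sense2.lemma_names())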
17:20
and then the third topic on natural language processing that we used was sentiment analysis, now this is something that can be very difficult and things such as the IBM Watson and many other big AI programs really try to pinpoint this down, but there are some very simple ways of doing it, but basically what sentiment analysis means
17:42
is determining the feeling of the text, is it positive or is it negative, that's the most common implementation of it, but when you combine it with other forms of NLP you can get a deeper level of what kind of sentiment is actually being expressed, product reviews, negative comments on things,
18:04
they might indicate different sentiments within the realm of positive and negative, so for this example we're just gonna do positive or negative, but we will be determining that on every movie description, so now to get into a little bit of the code,
18:24
so TFIDF, we're gonna use NLTK and we're also gonna use Scikit for this because TFIDF is sort of a mathematical thing more than anything, you're summing up the words and comparing the vectors of each document, so this is the code that we used to do that,
18:42
it's very simple when you use scikit-learn, which gives you a TF-IDF vectorizer, and from there you can basically do a fit transform. So the first thing you need to do, of course, is clean all the stop words, punctuation, et cetera out,
19:04
get it into a pure list of words essentially and we're doing that on our film descriptions and then we just run them through the vectorizer and this here is basically building a matrix because you're comparing every single film
19:23
to every single other film, so if there's 400 films, it's basically big O of N squared, you're not comparing it to itself because that would be 100% matched, so it's big O of N squared minus N, but what the result is is a huge grid of you can think of all the movies on axis Y
19:44
and all the movies once again on axis X and how well do they compare to each other, so it's a big amount of data and this link here, the Scikit's pretty well documented, so if you're interested, definitely check it out, it's a little bit difficult to install in PIP
20:01
because it requires SciPy as a prerequisite, and all of that requires a good amount of compiling and C libraries and stuff, but it is just a pip package, so you can definitely install it. So it's very easy to do TF-IDF; word sense disambiguation is a lot more difficult.
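Before getting to that, here is a minimal sketch of the TF-IDF step just described, assuming the film descriptions are already in the database; it is illustrative, not the talk's exact code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    from web.models import Movie  # the Movie model sketched earlier; app name is illustrative

    descriptions = list(Movie.objects.values_list("description", flat=True))

    # Build TF-IDF vectors, dropping English stop words.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(descriptions)

    # N x N grid of pairwise scores: 0.0 means nothing in common, 1.0 means identical.
    similarity = cosine_similarity(tfidf_matrix)
    print(similarity[0][1])  # how similar film 0 is to film 1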
20:25
so we'll dive into the code a little bit for that, so you can see that our TF-IDF function,
20:44
can't see it, oh, so the TF-IDF function, this was the exact same code that was just in the slide, relatively simple, you basically use SKLearn
21:01
to build a vector for that. Word sense disambiguation is a little bit more involved because you basically wanna do the same thing, you wanna build a matrix to see how similar each movie is to each other movie, but you're not looking exactly at the words
21:21
in that description, you're looking at the meaning of each word and then breaking that down to look for similar lemmas. So in the case of the bank example, if one document contains the word bank and we know that it means a financial institution and another document contains the word finance,
21:41
we will still count those as being similar even though they're different words. So to do that, there's a lot of looping and a lot of good stuff that needs to happen, and in this implementation, I sort of did it manually, we did not use the TF-IDF vectorizer, so basically loop through each sentence here,
22:03
tokenize it based on the sentence level, not at the word level, look for the important words, which is basically removing all the stop words, you know, remove your thes, ands, et cetera, and then use the Lesk word sense disambiguation algorithm, which is built right into NLTK
22:22
and it's the Lesk algorithm; I'm not exactly sure about the details of what these algorithms do, but it's built into NLTK. And this, what this does here, this is really the magic function call: you give it the tokens from the sentence and the specific word
22:41
that you wanna identify the meaning of in the sentence. So in this case, the word would be bank and the sentence would be, I deposited money in the bank, and it will give you the synset in return, which is the list of all the lemmas and the synonyms. So what I'm gonna do now is add all of those synonyms
23:01
and everything to my words to check for, so the words that I'm checking for might actually be larger than the document itself, because I wanna find everything that's similar to it. And then from there, we now have a complete list of words to check for, based on the meanings, the lemmas, and the synonyms.
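A rough sketch of that word-expansion step, mirroring the approach described rather than the actual implementation (requires the punkt, stopwords, and wordnet data from nltk.download):

    from nltk import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.wsd import lesk

    STOP = set(stopwords.words("english"))


    def words_to_check(description):
        """Collect the document's words plus the lemma names of their Lesk senses."""
        words = set()
        for sentence in sent_tokenize(description):
            tokens = word_tokenize(sentence)
            for token in tokens:
                if not token.isalpha() or token.lower() in STOP:
                    continue
                words.add(token.lower())
                synset = lesk(tokens, token)  # the "magic" call described above
                if synset is not None:
                    words.update(name.lower() for name in synset.lemma_names())
        return words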
23:20
now we're gonna do the nested for loop where we look through and build that matrix and cross compare every document, so the results of this are sometimes very similar to TF-IDF and sometimes very different, and I will show you,
23:59
the comparator object was what I'm using
24:01
to compare the similarity of two movies, so here you can see one film named Zelos, one named You Carry Me, the TF-IDF, which is basically seeing how many words that they have in common, it scores 0.013, zero means zero in common,
24:22
one means 100% in common, so here it's 0.01, which means they're not very close at all, the word disambiguation actually scores a little bit higher because there probably are some synonyms involved in that, if you look at a few others,
24:40
you can see that there are some differences; usually the word sense disambiguation scores them as more similar than term frequency-inverse document frequency does. So we use both of those metrics when we're deciding, okay, how similar really are these?
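The comparator object mentioned here isn't shown in detail, but a Django model along these lines would do the job; the model and field names are guesses for illustration:

    from django.db import models


    class Comparator(models.Model):
        movie = models.ForeignKey("web.Movie", on_delete=models.CASCADE,
                                  related_name="comparisons")
        other_movie = models.ForeignKey("web.Movie", on_delete=models.CASCADE,
                                        related_name="+")
        tfidf_score = models.FloatField()  # term frequency-inverse document frequency
        wsd_score = models.FloatField()    # word sense disambiguation

    # Pulling the most similar films for one movie is then just an ordered query, e.g.
    # Comparator.objects.filter(movie=some_movie).order_by("-wsd_score")[:5]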
25:05
And then the third, kind of the fun one here, is sentiment analysis, and this one's also really easy to do because of NLTK. So we're using the VADER sentiment analyzer, which is once again an algorithm, and it's built into NLTK,
25:22
it's particularly relevant in our case because the way this algorithm works is it was trained from a data set, and this data set consisted of 10,000 tweets and 10,000 movie reviews, and I forget when this was trained, but I want to say it was maybe like 2011, 2012,
25:42
sometime around then, so it may have changed a little bit, but each one of those tweets and movie reviews was tagged by a human as saying, okay, this one's positive, or this one's negative. You then feed those into the computer, into the algorithm, and it learns based on the syntax
26:04
and the grammar and the word choices what a positive text looks like and what a negative text looks like. So from here, you can basically feed anything you want into the VADER algorithm, and it will scale it from negative one to positive one,
26:22
whether it's 100% negative or 100% positive based on its learning data, its training set, so because it was trained on movie reviews, it was pretty relevant to us, I think.
26:40
Once again, the code is very simple for this because it already exists in the NLTK: you have a SentimentIntensityAnalyzer object. It really couldn't get any easier than that, so all you have to do is, in our case, clean up the data a little bit and run it through the polarity scoring function.
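A minimal sketch of that step; it assumes the vader_lexicon data has been downloaded via nltk.download("vader_lexicon") and is not the exact project code:

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores("A warm, funny story about an unlikely friendship.")
    print(scores)              # a dict of neg / neu / pos / compound scores
    print(scores["compound"])  # the overall score, from -1.0 to +1.0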
27:03
What we're looking for is the compound score. It does produce a lot of other scores and metrics; the compound score is sort of the overall, the average, which is what we wanted in this case. And that's really all you have to do, so we just loop through the movies, run it through the NLTK functionality,
27:23
and it gives you that, so in some ways, it makes you seem a lot smarter than you actually are, and I'll show you a quick example. I should have started out with this, too, but this was the site that we made, and it used to show you now playing and recently ended.
27:42
Obviously, it's all over, but there are a few lists that we can pull from, and we represented that as a dark plot or an upbeat plot because, pertaining to the movie reviews, or not reviews, but the movie descriptions,
28:00
that's what it most closely corresponded to, so this chart here just goes from negative one to positive one, and it varies per film, and I did read through a few of them myself to kind of spot check, and in general, it's pretty accurate.
28:20
There are a few that were kind of off, but in general, it was pretty accurate, so very good functionality right there in NLTK, and NLTK is primarily designed to be an academic-type thing just to learn and to play on. It's probably not very good for doing any real, if you're looking at a business functionality
28:40
or something, it would not be good for that, but for projects like this, it's great, so that's what we did, and so now that we have established three different forms of natural language processing the descriptions of these movies,
29:02
it's time to crunch the numbers, so our crawler pulled in 436 films from the clevelandfilm.org website. We did TFIDF on each film compared to each one, so that was a big O of N squared, and we also did the word sense disambiguation on each one,
29:22
so we broke it down into every sentence of every description and cross-compared all that. We ended up with 189,000 comparisons by the time our database was fully populated, and as I mentioned, we used the comparator model to store the comparison between each pair of films,
29:43
so the actual movie model itself doesn't contain any of that. It just contains the movie information, and how we did that was through a set of management commands, and I'm going way too slow here,
30:01
so I'll speed it up. That was a Django management command, and then the last good part of this was the Twitter API, so doing this is very, very simple if you're just doing it with your own Twitter account. This is what we did, which looks a little bit complicated, but the simple form using Twython is super simple.
30:24
You just create a Twitter app at apps.twitter.com. Since you're the owner of it, it has access to your account, and you can access all of your API keys right there at apps.twitter.com. You basically plug those into Twython,
30:41
and you can send a tweet, so I will send a quick tweet here just to show it off because it's kind of cool. Made my DjangoCon tweet function here.
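The simple form looks roughly like this; the key and token values are placeholders that would come from your own app at apps.twitter.com:

    from twython import Twython

    twitter = Twython(
        "APP_KEY", "APP_SECRET",              # placeholders for your app's credentials
        "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
    )
    twitter.update_status(status="Hello from DjangoCon!")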
31:09
It's literally as simple as doing that, calling Twython's Twitter.update_status, and if I run my management command,
31:28
I get a Django 1.10 deprecation warning, but then I also get my online tweet in, and if you check out the Twitter site,
31:47
so all we did was use a cron job, which is basically a one-liner: in your cron entry, you enter the virtual environment that your project's running in.
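An entry along those lines might look roughly like this; every path and the command name are placeholders:

    # Run every five minutes, using the virtualenv's Python, and append output to a log file.
    */5 * * * * /home/deploy/venv/bin/python /home/deploy/filmbot/manage.py tweet_showtimes >> /home/deploy/filmbot/cron.log 2>&1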
32:02
You run the management command, and then you just specify to log it, so we just ran that every five minutes, and it would run through and query the database and tweet out what was coming up, so that's the bulk of our project. I hope you got something out of it.
32:21
I don't know if we have any time for Q&A, a few minutes maybe. I'm here for Vint.
32:41
Thanks for the talk. I was just gonna say the website looked really nice, and I was gonna ask what sort of front-end frameworks you use for that. So for this one, we just use materialize.css. It's basically a clone of the Google material design. It's very simple, it's pretty good. I found a couple little bugs in it. We normally use Bootstrap for most stuff,
33:01
but we wanted this to have a little Android feel since it's kind of robot-themed, so yeah, materialize.css. Have you thought about the possibility of evaluating your comparison algorithm
33:22
by, say, piping a front-end into something like Netflix where you already have a wide knowledge of the movies that are out there, and you could make a kind of a qualitative analysis of how well your engine performed with movies?
33:40
That would be, yeah, that's a great idea, and once again, enlarging our data set, too, would be great because the sentiment analysis was just off of a small subset of 10,000 reviews, and our own comparisons were just programmatic. They were not going off of any existing data, so yeah, probably for version two,
34:00
we would wanna compare that to an outside source of data for even more accuracy. Hey, so I'm not terribly deeply familiar with NLTK, but I wondered the previous question, did you test how well it performed, and how could you do that, was good.
34:23
I also wondered why you did the parsing with Beautiful Soup more by hand than using a tool like Scrapy that is probably, probably heavier, it's just 300 movies, so.
34:43
But I think the biggest thing I wanted to mention, since I'm familiar with OpenNLP and the Java stuff, is that it turns out, Googling on NLTK while you're doing the presentation, that it was developed at UPenn. And the other comment I wanted to make was
35:06
that because this was, it took me a moment to understand, I had to sort of think about it myself while you were talking, because this was for the Cleveland Film Festival, there wasn't any need to do a search function, you know?
35:24
People were already there, they were looking at the schedule, what do I go see next, you know? And I think if you do this again, that would be a good thing to point out, because people are probably familiar with how you would do this with search, just pipe it into Solr or something, or Elasticsearch, you know?
35:40
Yeah, and actually, if we had a little more time, I went a little too in detail, but I did, if you want to download the slides, there are a few additional ones with Haystack and Whoosh just to do a really simple search, and it wasn't really relevant to any of the NLP stuff, it was just part of the website that we made, you know,
36:04
to search through that information, it just made a simple search up there. So, once again, not relevant to the NLP or the Twitter part, but it was part of the whole project, so there are a few slides on implementing a really simple Haystack search in like three steps, so if you are interested in some searching.
36:25
Any more questions? This is naive, but I didn't understand why you used the manage.py interface instead of just calling the Python code directly.
36:41
So the reason we used that was because all the Python code needed to run within the Django environment. So yeah, that's fine, I guess you could drop into the Django shell and call it, but the management command provides sort of a simple way to do that, rather than having to drop into the management shell,
37:01
so that you can access the models in the environment. Any more questions?
37:25
Since you did the work to have the word sense disambiguation, why did you keep the original method of just the word count comparison? We sort of kept it almost as a crude benchmark to see is there really a difference, and originally when we had our,
37:42
when we had the website where it would show you the similarity, you know, this is just making a simple Django query to pull the most similar ones, just ordering top to lowest. We did it with TFIDF, and then some of them were not very similar, so we're like, okay, let's try a better way,
38:01
so we did word sense disambiguation, and the similarity results changed when we reloaded the page, basically, and they're a lot more accurate, so we have both in there for reference, but we primarily looked at the word sense disambiguation first, TFIDF second.
38:23
All right, thank you everyone for coming. Another hand for Vincent. Thank you.