
Is that spam in my ham?


Formal Metadata

Title
Is that spam in my ham?
Part Number
7
Number of Parts
169
Author
Lorena Mesa
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Language
English

Content Metadata

Abstract
Lorena Mesa - Is that spam in my ham?

Beginning programmers or Python beginners may find it overwhelming to implement a machine learning algorithm, yet machine learning is becoming applicable to ever more areas. This talk introduces key concepts and ideas and uses Python to build a basic classifier, a common type of machine learning problem, providing some jargon to help those who are self-taught or currently learning.

-----

Supervised learning, machine learning, classifiers, big data! What in the world are all of these things? To a beginning programmer, the questions described as "machine learning" questions can be mystifying at best. In this talk I will define the scope of a machine learning problem, identifying an email as ham or spam, from the perspective of a beginner (not a master of all things "machine learning") and show how Python can help us learn, simply, how to classify a piece of email.

To begin we must ask: what is spam? How do I know it "when I see it"? From previous experience, of course! We will provide human-labeled examples of spam to our model so it can estimate the likelihood of spam or ham. This approach, using examples and data we already know to determine the most likely label for a new example, uses the naive Bayes classifier. Our model will look at the words in the body of an email, finding the frequency of words in both spam and ham emails and the frequency of spam and ham overall. Once we know the prior likelihood of spam and what makes something spam, we can try applying a label to a new example. Through this exercise we will see at a basic level what types of questions machine learning asks, learn to model "learning" with Python, and understand how learning can be measured.
Transcript: English (auto-generated)
Good afternoon, everyone. Welcome to this afternoon's session. I'd like to introduce Lorena Mesa. She's a platform engineer at Sprout Social in Chicago, and she's a Star Trek fan, and she's gonna talk to us about spam and natural language processing. Hello.
So, real fun fact, I have a loud voice. Is this too strong, or should I be a little quieter? Louder. Louder? Oh, this is great. I can be loud. All right, thank you so much for joining me tonight. This afternoon, I should say, the name of this talk is Is That Spam in My Ham?
Subtext, a novice's inquiry into classification. So, as my announcer already said, my name is Lorena Mesa, and as you can see, I'm a huge Star Trek fan, so live long and prosper. Apart from that, I'm here from Chicago.
A little bit about me and why I wanted to chat on this topic, I'm actually a career changer. So, a few years ago, I came from being a data analyst in the social science space. Specifically, I worked at Obama for America doing data governance, and I then switched into doing software engineering about three years ago. Some big questions that were driving me at the time
are captured in this talk, but some other things I do, I love Django Girls. I helped with the workshop yesterday. It's a glorious, glorious thing. If you have the opportunity to mentor, please do. If you would like to sign up for another one, please do it as well. PyLadies Chicago is a group that I founded in Chicago,
and I recently was voted to the board of directors for the Python Software Foundation, which is very exciting. So, I'm gonna chat a little bit about this great experience that we've all had before. I think we might have all had some kind of email at some point in time where we get something that flutters into our inbox and it has language like,
de-junk and speed up your slow PC. And of course, we would trust an email that comes from AOL underscore member info at emails, yes, with a Z on it, dot AOL dot com, and of course, I'm gonna trust anything that tells me this is free, this is great, you really should do it.
So, I think when we see emails like this, we know visually just by looking at it that it's a piece of spam. We know it's junk, we don't care about it, we ignore it. So, how do we move from saying I know it when I see it to saying I can programmatically detect what a piece of spam is by using Python?
So, in today's chat, we're gonna be thinking about three questions. One, what is machine learning? Two, how is classification a part of this world? And three, how can I use Python to solve a classification problem like spam detection? This chat is going to be really focused on a beginner understanding of machine learning.
So, if you are looking for more intermediate and advanced talks, I definitely know this would be a great conference to check out some of that, but we're gonna really be taking this from the lens of a beginner. So, machine learning. If you were to follow the emojis on the left hand side, the top left would be me, confused,
not sure what machine learning is. I'm like, is it a robot? Is it Johnny Five? Johnny Five being a superhero from a children's movie I loved when I was a little kid, who's super quirky, can arch their eyebrows, and come save the day. Well, I don't really think machine learning is Johnny Five, so let's go ahead
and think a little bit about what machine learning is. One of the things I like to do when I begin working in a new problem space is to find some language to orient myself, to understand what types of problems I will be solving. If I were to look around for language defining machine learning, I might find something like this.
Some discussion saying that there's pattern recognition, computational learning, artificial intelligence, what's going on? I don't know what that is. But there is a part of this that does make sense to me. The study of algorithms that can learn and make predictions on data. I like data, I like algorithms, tell me more. So I think a better way we can think about machine learning
is to borrow some language from Tom Mitchell, the chair of the Machine Learning Department at Carnegie Mellon. He wrote Machine Learning, which is kind of a quintessential text for folks who want to start learning about the field, and he says we can think about machine learning in three parts.
We can say a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Okay, so we have a task, we have experience,
we have a performance measurement. I can do this. This makes sense to me. So when I think about experience and how do I know what I know? Well, I'm a human and being a human, the way that I know what I know comes from my memory. I have memories stored up that teach me things about what I like, what I don't like,
what I should do, what I shouldn't do. So maybe as a kid and I was a very hyperactive child, I would be running around like a maniac all the time because I had to be super fast. But what happens when you run around as a little kid and you're growing in your body, you might be klutzy, you might fall and skin your knee. How many times do you have to skin your knee and elbows?
For me, it probably took quite some time to learn I shouldn't run around like a maniac; I should walk around like a normal person so I don't hurt myself. That pain was a teaching experience for me. Likewise, when my grandmother was in the kitchen making tamales (because I love tamales), I would always be trying to stick my hand
on the stove and more than once, I definitely burned my hand. The idea of putting your hand on red hot coils, not very smart. So over time, I learned to recognize that as a sign I shouldn't do that. So when we think of experience as a human, we may think of our memories. What does that mean in different problem spaces?
If I were to ask the question, what is the historical experience of the stock market? Well, I could say if I wanna understand what a piece of stock has done historically, I might go look at what the records tell me about the price of that stock two years ago on July 17th, one year ago on July 17th, and depending on how far back
I wanna do some analysis, I have historical data that can tell me something about the historical performance of that stock. So we have human memories, we have some memories there, but maybe in other spaces, again, we wanna go to historical data that can teach us something. And so coming to machine learning and classification, what does experience actually mean?
Let's frame this in Mitchell's framework. Our first problem's going to be identifying a task. For us, we wanna classify a piece of data. So our question is, is an email spam or ham? And the idea here of ham is just anything that's not spam. It's cute, it rhymes, so spam or ham, that's our task.
Our experience, we're gonna have a set of labeled training data. Essentially, what does that mean? We have a collection of emails, and we have a label that is saying that the email's either ham or spam. So we have a collection of emails that we already know is one thing or the other.
And then our performance measurement. Is the label correct? So what we need to do is be able to verify if emails are indeed spam or ham. So thinking about a classifier that we can use, we can think of naive Bayes. Naive Bayes is a type of probabilistic classifier. I love this image because I really want to know
who's the person that has a neon light of Bayes' theorem in their office or their front window. I don't know who that person is, but I applaud you; you are really great. So, naive Bayes comes to us from statistics. It's based on Bayes' theorem, no surprise.
One of the key things with Bayes' theorem, when we talk about the likelihood of events, is that we treat events as independent of one another. That's where the naive assumption comes from when we say we're going to be using a naive Bayes classifier. So for those of us who may not remember
exactly what it means when we talk about independent and dependent events, let's have a quick refresher. So if I was going to ask you, what's the probability of flipping a quarter six times in a row and getting heads, how would you go about solving that problem? Well, let's think about it on the first flip. I have two outcomes, I have heads or tails,
so the likelihood of getting heads is gonna be 0.5. The second time I flip it, 0.5. The third time and so forth, 0.5. Each flip is independent of the others, so to get the likelihood of multiple heads in a row, we just multiply the individual probabilities together. So when we talk about independent events, we're trying to think of the outcomes.
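To make that concrete, here is the arithmetic as a tiny Python snippet (purely illustrative):

```python
# Six independent fair coin flips: multiply the individual probabilities.
p_heads = 0.5
p_six_heads = p_heads ** 6
print(p_six_heads)  # 0.015625, i.e. about a 1.6% chance
```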
In contrast to dependent events, let's say we're talking about horse number five on the, I guess, your right-hand side. If my question was, what's the likelihood that horse number five is gonna win the big derby, one of the things I would say is, well, we need to think about what are the weather conditions?
Is it rainy, is it sunny? Perhaps we wanna think about the age of the horse, the health of the horse. There can be other things that are tied up in the likelihood of horse number five winning. So in this context, the probability of horse number five winning is going to be dependent on other things, for example, the weather.
So when we talk about naive Bayes, our assumption is we have independent events. So when we talk about emails, we're really gonna be thinking about the words that make up the emails. So let's think about these words. If I was gonna ask what's the likelihood of the word Messi appearing with the word Barcelona,
we're going to assume that there's no relationship. That's what Naive Bayes tells us to do, even though in our heads we might think that there's a relationship. Or back to some really spammy language we love, what's the relationship between buy and now? We're gonna assume that there is no relationship, that the likelihood of buy is not going to be impacting the likelihood of now
appearing in a corpus of words for an email. So, naive Bayes and spam classifiers. Again, our question is: what is the probability of an email being ham or spam? For Bayes' theorem, here in the middle, we've got three things to think of.
One, the likelihood of the predictors given the class. Two, the prior probability of the class. And three, the prior probability of the predictor. All of these together help us compute the a posteriori probability of a class. When I say class, our classes here are ham and spam.
Those are the only two classes we have. Our predictors are going to be the words in the email itself. So for example, if I'm looking at a piece of content, and I say okay, well what's the likelihood of a predictor being in the spam or ham class, we can say, if I'm looking at the word free,
we can think of it as, well 28 out of 50 spam emails have the word free. We will do this for each word in our email, and we will find the likelihoods of all the predictors and multiply them together. We also then need to consider the prior probability of the class. So given the entire collection of data we're looking at,
how many of them are of one class and how many of another? So for spam, if we have 150 emails we're working with, we can say 50 of those documents are spam. So 50 out of 150. And then the prior probability of the predictor, we're here saying, well how many times has the word free
appeared in all of our emails, let's say it's 72 out of 150, and there you go. So the Bayes theorem is basically frequency tables. How many times has this thing appeared? How many times has it appeared in the class? How many times has this class appeared in the collection of things that we're looking at?
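Here is that arithmetic as a small Python snippet, using the illustrative counts from the talk; the variable names are mine:

```python
# Bayes' theorem for a single predictor, with the talk's illustrative counts:
# 28 of 50 spam emails contain "free", 50 of 150 emails are spam,
# and "free" appears in 72 of 150 emails overall.
p_free_given_spam = 28 / 50    # likelihood of the predictor given the class
p_spam = 50 / 150              # prior probability of the class
p_free = 72 / 150              # prior probability of the predictor

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.389
```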
Great, we've made some calculations. We found some values between zero to one. How do we know which one to pick? Pretty easy. Whichever one has the higher maximum a posteriori probability. So the reason why we would say a posteriori here is we're not looking at anything new, we're looking at historical data, things that have already happened.
Once we've made a calculation for class ham and for class spam, we simply just pick the larger of the two and we say, this email is going to be either ham or spam. Pretty simple. So why naive Bayes? Well, I think just walking through this,
we can arrive at an answer. It's pretty straightforward. It's as simple as frequency tables. I think we can all do this together. It may seem a little bit daunting at first, but once you start realizing the application of it, you can see that it's pretty straightforward. So for the context of if you are starting to think about classifiers and problems you want to start looking at, I would say this is a great one to start with.
The math is accessible. And while you can use other algorithms, we will talk about some of the limitations in a moment, this is a good one to start with. So that's great. But how do I use Python to detect spam? Okay, well, I cheated a little bit. I didn't do all my own data collection and munging and cleaning, as fun as that is.
I instead went to find a data source out there that was already cleaned and labeled for me, and where did I get it? I got it from Kaggle in the Classroom. Kaggle is a website that hosts competitions; the Classroom component is more of their teaching problems. They have open competition problems as well. But I loved that my data was cleaned and labeled
and I could just get right to work building a thing. So in our example here, our training data has 2,500 emails, 1,721 of which are labeled as ham (a one) and the balance labeled as spam (a zero). The labels themselves are just in a CSV.
We have an ID and we have the prediction, zero or one. Pretty straightforward. And that's a little grainy, I apologize. But the emails themselves are collections of text with some HTML in it. So what are we gonna use when we write our very, very simplistic naive Bayes spam classifier?
We're gonna use these three things. We're gonna use the email module, which will parse our emails into Message objects. We're going to use lxml because, as I said, those emails have some HTML embedded in them, and right now all I care about is the words themselves, so I wanna strip that stuff out. And then we'll use NLTK, the Natural Language Toolkit, which will help us filter out stop words.
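The slide code isn't legible in the video, so here is a minimal sketch of that preprocessing under my own assumptions (a single, non-multipart text payload, and a hypothetical email_to_text helper):

```python
import email
from lxml import html

def email_to_text(raw_email):
    """Parse a raw email into a Message object and strip embedded HTML."""
    message = email.message_from_string(raw_email)
    body = message.get_payload()  # simplification: assumes one non-multipart text part
    return html.fromstring(body).text_content()  # keep the words, drop the tags
```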
So let's go ahead and get to it and train the spam filter. So the training of the Python naive Bayes classifier, when I say train, we're gonna go through these steps. The first thing we're going to do is we're gonna tokenize the text. We will explain that in just a moment.
One thing I do wanna say is when we look at all the corpus of words in an email, I am not treating words like shop and shopping as the same word. You can actually do that. That's called stemming. So that would be like a bonus feature. I encourage you to go try that on your own. So I didn't do that for this example. So we're gonna go ahead. We're gonna tokenize our words,
which we're gonna do for each email that we process. We then wanna keep track of the unique words that we see across all the documents we process; this will come into play later, to help us with zero word frequencies. We are then going to increment the word frequency for each category.
So our categories here being ham or spam. We're going to increment the category count, which again is that prior probability of the classes that we need to take into account. And then we're also just gonna keep track of how many words are in each category. And it's good to know how many training examples we've actually processed. So that's the last step.
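A minimal sketch of that bookkeeping; the class and attribute names are my own, not the speaker's actual code, and it leans on a tokenize helper like the one sketched a bit further down:

```python
from collections import defaultdict

class NaiveBayesSpamClassifier:
    def __init__(self):
        self.vocabulary = set()                   # unique words seen across all emails
        self.word_counts = defaultdict(lambda: defaultdict(int))  # category -> word -> count
        self.category_counts = defaultdict(int)   # emails seen per category (class priors)
        self.words_per_category = defaultdict(int)  # total word occurrences per category
        self.total_examples = 0                   # training examples processed

    def train(self, text, category):
        for word in tokenize(text):               # tokenize() is sketched below
            self.vocabulary.add(word)
            self.word_counts[category][word] += 1
            self.words_per_category[category] += 1
        self.category_counts[category] += 1
        self.total_examples += 1
```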
So training is pretty much going to start with this: tokenizing text into a bag of words. That's what it is, a bag of words. Essentially, this is very simplistic; I've trimmed it down a little. What we wanna do is pull out the words. This is after we've already removed the embedded HTML.
And we're gonna say, hey, for each word in our text, let's lowercase the word. We're gonna check that it's actually a word, because why not? And as long as this word isn't in the corpus of stop words for the English language, let's keep it. Stop words are words like the, and, or: words that may appear often
but may not provide us a lot of value when thinking about whether this thing is going to be spam or not. You can get that list from NLTK; I'm glad I didn't have to compile it myself. We do this for each email, and now we have a bag of words.
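A sketch of that tokenizer, assuming NLTK's English stop word list has already been downloaded with nltk.download("stopwords"):

```python
import re
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop English stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]
```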
Remember that zero word frequency thing I was talking about? Well, let's think about this. I've done my training, and I have a new email. In this email that I'm trying to classify, I have the word free. But problem: historically, I've never seen the word free in the spam collection of emails I've looked at. So what's gonna happen when I calculate
the likelihood of all my predictors? I'm gonna get zero. To offset that, we can add a small constant, which is what Laplace smoothing permits us to do, and that gives us a small offset so it doesn't throw our math out the window.
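A tiny illustration of the problem and the fix, with made-up counts:

```python
# Hypothetical counts: "free" was never seen in spam during training.
count_free_in_spam = 0
words_in_spam = 10_000   # total word occurrences across spam emails
vocab_size = 5_000       # unique words seen in training

p_plain = count_free_in_spam / words_in_spam  # 0.0, which zeroes the whole product
p_smoothed = (count_free_in_spam + 1) / (words_in_spam + vocab_size)  # small but nonzero
```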
So let's talk about classifying. All right, this is a giant wall of text, but I just wanted to point out that it's quite literally iteration, counting, and dictionaries. That's all this is; there is no black box magic here. Essentially, what we do in classify is say, for each category, we're going to compute this a posteriori probability. We wanna go ahead and find the probability
of all the predictors, then multiply that by the prior probability of the class itself, and pick the one that has the higher value; that's what we classify the email as. Not very magical. In the get-predictors-probability step, if we see something we haven't seen before, we go ahead and add a value of one to it.
One point right here, about floating-point underflow: when you multiply many small probabilities together and really care about precision, you're gonna need to use specific objects. You could work with logs instead, but in this case I used Decimal objects. There's a note here, which you probably can't read (I will share these slides), that comes from the Stanford natural language processing material on how to handle this kind of floating-point computation, and they said use decimals, so that's what I went with.
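Continuing the earlier sketch (again, my own structure rather than the code on the slide), the classify method might look like this, with the Laplace smoothing and Decimal arithmetic folded in:

```python
from decimal import Decimal

def classify(self, text):
    """Method for NaiveBayesSpamClassifier above: return the category with
    the larger a posteriori probability."""
    best_category, best_score = None, Decimal(0)
    for category in self.category_counts:
        # prior probability of the class
        prior = Decimal(self.category_counts[category]) / Decimal(self.total_examples)
        likelihood = Decimal(1)
        denominator = self.words_per_category[category] + len(self.vocabulary)
        for word in tokenize(text):
            # Laplace smoothing: add one so unseen words don't zero the product
            count = self.word_counts[category].get(word, 0) + 1
            likelihood *= Decimal(count) / Decimal(denominator)
        score = prior * likelihood
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```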
So, okay: performance measurement. I've classified, I've picked a thing. How do I know how well I did? Okay, so I go ahead; my detector trains and evaluates, and what I eventually come out with is 223 correct and 27 incorrect, so my performance measurement is about 89%. As a small footnote, I believe about 90% accuracy is a benchmark; we can obviously do better here, and we'll talk about what doing better can mean in a moment.
As for how to split up our training data: let's do a 90-10 split. It's pretty much what I've seen as a standard; I'm sure in different problem spaces you might wanna chunk things up differently, but I went with 90-10. Essentially all I did was say, hey, on 90% of my data, let's go ahead and train, and then on the other 10%, we're gonna classify. And how do we know if a prediction is correct or incorrect? Whatever label we ultimately assigned it, check labels.csv, see if it's correct or incorrect, and it's basically straight math from there. That's how we got the 89%.
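A sketch of that evaluation loop, under the assumptions that labels.csv has Id and Prediction columns as on the earlier slide, and that emails is a list of (id, text) pairs loaded elsewhere:

```python
import csv

# Assumed: labels.csv maps email ids to 1 (ham) or 0 (spam),
# and `emails` is a list of (email_id, text) pairs in a fixed order.
with open("labels.csv") as f:
    labels = {row["Id"]: int(row["Prediction"]) for row in csv.DictReader(f)}

split = int(len(emails) * 0.9)            # 90% train, 10% test
classifier = NaiveBayesSpamClassifier()
for email_id, text in emails[:split]:
    classifier.train(text, labels[email_id])

correct = sum(
    classifier.classify(text) == labels[email_id]
    for email_id, text in emails[split:]
)
print(correct / len(emails[split:]))      # e.g. 223/250, about 89%
```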
So, some things to watch out for. False positives: ooh, this is really fun. For example, Google does spam filtering really well, right? But even they can have some flaws. I actually do like to sign up for Patagonia emails, and this email was flagged as spam. A false positive is when something is incorrectly identified; here, a legitimate email marked as spam. So you can run into this. When something is misclassified, what's the problem? Is our implementation,
because we're talking about naive Bayes, too naive? One way we can correct this is to tell Google, hey, this is actually not spam. I can validate the data and send it to them, and they can feed it into their implementation and try to correct for that in the future. So false positives are a thing to watch out for. And there are some limitations and challenges with naive Bayes.
Obviously, this independence assumption is very, very simplistic. If I get a marketing email about Barcelona and they aren't talking about Messi, I'm gonna be very confused. (Granted, there's been some talk about him being traded, so we shall see.) But this independence assumption
is quite simplistic; it is not the way things work in the real world. What are the side effects of that? Well, for one, we're gonna overestimate the probability of the label ultimately selected, meaning we push things toward binaries: an email looks either firmly one way or firmly the other in how it aligns with the category label.
And also, remember how I said I cheated and didn't label my own data? Well, here's the other thing: human error. This type of algorithm, a classifier, is a form of supervised learning; it requires historical, labeled sets of data to learn from in order to make predictions.
And that data process is prone to human error. What happens if, say, I'm a professor making use of all my student lackeys, some of them have been up all night, ten of them looked at the same email, and they all came up with different labels for it, but it's in my training set? That's gonna be very inconsistent, so I need to think about that as well.
How is the labeling of the data happening? So as much as we don't like to think about data munging, data cleaning, and data collection, that's actually a really important part of the process when working with supervised machine learning problems. So how can we improve our performance? Well, we can do more and better feature extraction,
because while I would like to say that emails can only be identified by the words in them, we know that's not true. Predicting sentiment of emails is very complicated, very difficult. Natural language processing is a huge field. I'm not getting into that myself, but we need to think of other ways we can identify spam. So what are some things?
Perhaps the subject. Is there something weird in the subject I can pay attention to? What about the images? Is there an abundance of images in spammy emails, or maybe there's none, I don't know. How about the sender? Remember that really cool email address with the Z in it? Because clearly I would trust AOL emails, whatever that was.
Then again, I don't trust most AOL stuff, so that's another thing. As for other possible features to consider, we can think about capitalization, irregular punctuation, things like that. Ultimately, we also want more data. So, do you like data (or Data, from Star Trek) and want more?
Want to learn more? Go to Kaggle, they're super sweet. I'd also highly recommend Sarah Guido's Introduction to Machine Learning with Python; she's a great data scientist at Bitly, and I've heard great things about it. And also, your local friendly Python user group. We love talking, we love learning together. Talk to people here; there's a great talk after this
talking more about machine learning. Stay for it. So thanks. And if anything, I hope what you may have learned is correlation may be causation, or causation may be correlation, I don't know. So we can implement a thing, but the question then comes to how do we interpret those results?
And that's where I challenge you to go ahead and try some things out. Thank you so much, y'all. Any questions?
I did such a great job. No one has any questions. Well, if you do have questions, I'll be hanging out in this area out here for a few minutes, but like I said, I do want to hear the next talk. So I'll be around, my name's Lorena. Please reach out and say hi. It's a pleasure to be here. Thank you so much for listening.