Is that spam in my ham?

Video in TIB AV-Portal: Is that spam in my ham?

Formal Metadata

Is that spam in my ham?
Title of Series
Part Number
Number of Parts
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Lorena Mesa - Is that spam in my ham?

Beginning programmers or Python beginners may find it overwhelming to implement a machine learning algorithm. Increasingly, machine learning is becoming applicable to many areas. This talk introduces key concepts and ideas and uses Python to build a basic classifier - a common type of machine learning problem - providing some jargon to help those who may be self-educated or currently learning.

-----

Supervised learning, machine learning, classifiers, big data! What in the world are all of these things? To a beginning programmer, the questions described as "machine learning" questions can be mystifying at best. In this talk I will define the scope of a machine learning problem - identifying an email as ham or spam - from the perspective of a beginner (not a master of all things "machine learning") and show how Python can help us simply learn how to classify a piece of email.

To begin we must ask: what is spam? How do I know it "when I see it"? From previous experience, of course! We will provide human-labeled examples of spam to our model for it to understand the likelihood of spam or ham. This approach, using examples and data we already know to determine the most likely label for a new example, is the Naive Bayes classifier. Our model will look at the words in the body of an email, finding the frequency of words in both spam and ham emails and the frequency of spam and ham overall. Once we know the prior likelihood of spam and what makes something spam, we can try applying a label to a new example.

Through this exercise we will see at a basic level what types of questions machine learning asks, learn to model "learning" with Python, and understand how learning can be measured.
Good afternoon everyone, welcome to this afternoon session. I'd like to introduce our speaker: she's a platform engineer at Sprout Social in Chicago, she's a Star Trek fan, and she's going to talk to us about spam and natural language processing. [Mic check.] Thank you so much for joining me this afternoon. The name of this talk is "Is that spam in my ham?": text analysis and an introduction to classification. As my announcer said, my name is Lorena Mesa, and as you can see, live long and prosper. I'm here from Chicago. A little bit about me and why I wanted to chat on this topic: I'm actually a career changer. A few years ago I came from being a data analyst in the social sciences; specifically, I worked at Obama for America doing data governance, and I switched into software engineering about three years ago.
Some of the big questions that were driving me at the time are captured in this talk. Some other things I do: I love our community groups, and I hope you enjoyed the workshop yesterday; if you have the opportunity to help with one, or would like to sign up for another one, please do. I founded the PyLadies Chicago group, and I was recently voted onto the board of directors of the Python Software Foundation, which I'm very excited about.
So, a little bit about a great experience that we've all had before. I think we've all had some kind of e-mail at some point in time that lands in our inbox with language like "jump-start your slow PC", and of course we all trust e-mail that comes from an "AOL_member_info@" sort of address. Of course, I never trust anything that tells me "this is great, you really should do it." So I think when we see spam, we usually know just by looking at it that it's a piece of spam: we junk it, we don't care about it, we ignore it. So how do we move from saying "I know it when I see it" to saying "I can programmatically detect whether something is spam" by using Python?
In this talk I'm thinking about three questions: one, what is machine learning? Two, how is classification a part of this world? And three, how can I use Python to solve a classification problem like spam detection? This talk is going to be really focused on a beginner's understanding of machine learning, so if you are looking for more intermediate and advanced topics, I know this to be a great conference to check some of those out; we're really taking this from the land of the beginner. So: machine learning.
If you're like the fellow on the top left, a bit confused, not sure what machine learning is, you might ask: is it a robot? Is it Johnny 5, the robot superhero from the children's movie I loved as a little kid? He's super quirky, he can move his eyebrows. Well, I don't really think machine learning is Johnny 5, so let's go ahead and think a little bit about what machine learning is.
One of the things I like to do when I begin working on a new problem is to find some language to ground myself, to understand what types of problems will be solved. If I look around for language defining machine learning, I might find discussions of "pattern recognition", "computational learning", "artificial intelligence"... what's going on? I don't know about all that, but there is a part that does make sense to me: the study of algorithms that can learn from and make predictions on data. I like that; I want algorithms, tell me more. An even better way to think about machine learning is to borrow some language from Tom Mitchell, the chair of the Machine Learning Department at Carnegie Mellon University, whose book Machine Learning is kind of a quintessential text for folks learning about the field. He says we can think about machine learning in a three-part way: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. OK: we have a task, we have experience, we have a performance measurement. I can do this; this makes sense. So what about experience?
How do we know what we know? Well, I'm a human, and the way I know what I know comes from my memory. I have memories stored up that teach me things about what I like, what I don't like, what I should do, what I shouldn't do. As a kid I was a very active child; I'd be running around like a maniac all the time because I had to be super fast. But what happens when you run around as a little kid, your body still growing? You might trip, you might fall and skin your knee. How many times did that happen before the lesson stuck? It probably took quite some time for me to learn that I shouldn't run around like a maniac, that I should walk around like a normal person so I don't hurt myself. That became a teaching experience for me. Likewise, my grandmother would be in the kitchen making mole, and because I love mole I would always try sticking my hand near the stove, and more than once I definitely burned my hand. The idea of putting your hand near red-hot oil is not very smart, so over time I learned to recognize that as a sign I shouldn't do that. So if we think of experiences as the making of our memory, what does that mean in different problems? What if I ask about the historical experience of the stock market?
Well, if I want to understand what a piece of stock has done historically, I can look at what the records tell me about the price of that stock two years ago on July 17th, one year ago on July 17th, and, depending on how far back I want to go with my analysis, I have historical data that can tell me something about the historical performance of that stock. So for a human we have memories; in other settings we have historical data that can teach us something. Coming back to machine learning and classification, then: what does experience actually mean?
Let's bring this into Mitchell's framework. Our first step is going to be identifying a task. We want to classify a piece of data, so our question is: is an e-mail spam or ham? The idea of "ham" here is just anything that's not spam. It's cute, just go with it. So "spam or ham" is the task. For our experience, we're going to have a set of labeled training data. Essentially what that means is we have a collection of e-mails, and each has a label saying the e-mail is either ham or spam: a collection of e-mails that we already know to be one thing or the other. And then the performance measurement is: is the label correct? What we need to do is be able to verify whether e-mails are indeed spam or ham.
Thinking about a classifier we can use, we can think of naive Bayes. Naive Bayes is a type of probabilistic classifier. I love this image; I really want to know who hung a neon light of Bayes in their office front window. I don't know who that person is, but I thought it was really great. Naive Bayes comes to us from statistical theory; it's based on Bayes' theorem, no surprise. And one of the key things with Bayes' theorem here, when we talk about the likelihood of events, is that we treat events as independent of one another. That's where the "naive" assumption comes from when we say we're going to use a naive Bayes classifier.
For those of us who may not remember exactly what it means to talk about independent and dependent events, let's have a quick refresher. If I were to ask you the probability of flipping a quarter six times in a row and getting heads every time, how would you go about solving that? Well, on the first flip I have two outcomes, heads or tails, so the likelihood of getting heads is 0.5. The second time it's again 0.5, the third time, and so forth: each flip is 0.5. The likelihood of flipping a quarter and receiving multiple heads in a row is built from outcomes that are independent of one another; when we talk about independent events, one outcome doesn't affect the next. Contrast that with dependent events: the horse race on the right-hand side. If my question is the likelihood that horse number 5 is going to win the derby, one of the things I would say is that we need to think about the weather conditions: is it rainy, is it sunny? Perhaps also the age of the horse, the health of the horse. There can be other things tied up in the likelihood of horse number 5 winning. In this context, the probability of horse number 5 winning is dependent on other things, for example the weather. So when we talk about naive Bayes, our assumption is that we have independent events.
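The coin-flip arithmetic above is easy to check; a minimal Python sketch of the independence idea:

```python
# Six independent flips of a fair coin: the joint probability of
# heads every time is the product of the individual probabilities.
p_heads = 0.5
p_six_heads = p_heads ** 6  # 0.5 * 0.5 * ... six times

print(p_six_heads)  # 0.015625
```

Each flip contributes the same factor precisely because no flip depends on the previous one; dependent events like the horse race would not factor this way.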
So when we talk about e-mails, we're really going to be thinking about the words that make up the e-mail. Let's think about these words in pairs: for the likelihood of the word "Messi" appearing with the word "Barcelona", we're going to assume that there is no relationship. That's what naive Bayes tells us to do, even though you might think there is one. Or, back to our spammy language: what's the relationship between "buy" and "now"? We're going to assume there is no relationship, that the likelihood of "buy" is not going to impact the likelihood of "now" appearing in the corpus of words that forms an e-mail.
So for our spam classifier, again, our question is: what is the probability of an e-mail being ham or spam? With Bayes' theorem, here in the middle, we have three things we need to find: one, the likelihood of the predictors given the class; two, the prior probability of the class; and three, the prior probability of the predictors. All of these together help us compute the a-posteriori probability of the class. When I say class, our classes here are ham and spam; those are our only two classes. Our predictors are going to be the words in the e-mail itself. For example, if I'm looking at a piece of content and asking what's the likelihood of a predictor being in the spam class, then for the word "free" we might find that 28 out of 50 spam e-mails have the word "free" in them. We do this for each word in our e-mail, and we find the likelihood of all the predictors by multiplying them together. We also need to consider the prior probability of the class: given the entire collection of data we're looking at, how many documents are of one class and how many of the other? So if we have 150 e-mails we're working with, we might say 50 of those documents are spam: 50 out of 150. And the prior probability of the predictor is saying, how many times has the word "free" appeared across all our e-mails? Let's say 7 out of 150. And there you go: Bayes' theorem here is basically frequency tables. How many times has a word appeared in a class? How many times has a class appeared in the collection of things we're looking at? Great, we've made some calculations, we've found some values between 0 and 1. Which one do we pick? Pretty easy: whichever one has the higher, the maximum, a-posteriori probability. The reason we say "a posteriori" here is that we're not looking at anything new; we're looking at historical data, things that have already happened. Once we've made a calculation for the class ham and for the class spam, we simply pick the larger of the two, and we say this e-mail is going to be either ham or spam. Pretty simple.
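In Python, that comparison is just a few lines. The 28-out-of-50 "free" count and the 50-out-of-150 prior are the talk's toy figures; the ham count of 7 out of 100 is an assumption added here for illustration:

```python
# Toy naive Bayes comparison for a one-word e-mail containing "free".
spam_with_free, spam_total = 28, 50   # 28 of 50 spam e-mails contain "free"
ham_with_free, ham_total = 7, 100     # assumed: 7 of 100 ham e-mails contain "free"
all_emails = spam_total + ham_total   # 150 e-mails in the collection

# Score per class: likelihood of the predictor times the class prior.
# The shared denominator P("free") is identical for both classes, so it
# can be dropped when we only need the maximum a-posteriori class.
spam_score = (spam_with_free / spam_total) * (spam_total / all_emails)
ham_score = (ham_with_free / ham_total) * (ham_total / all_emails)

label = "spam" if spam_score > ham_score else "ham"
print(label)  # spam
```

With more than one word per e-mail, each class score would simply pick up one likelihood factor per word, as described above.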
So why naive Bayes? Well, I think just from walking through that we can see it's pretty straightforward; it's as simple as frequency tables. I think we can all do this together. It may seem a little bit daunting at first, but once you start seeing the application of it, you can see that it's pretty approachable. So if you are starting to think about classifiers and classification problems, this is a great one to start with: the math is accessible, and once you're comfortable you can use other algorithms. We will talk about some of the limitations in a moment, but this is a good one to start with. That's great, but how do we use this to detect spam?
OK, well, I cheated a little bit. I didn't do all my own data collection, munging and cleaning, as one does; instead I went and found a resource out there that was already cleaned and labeled for me. Where did I get it? Kaggle. Kaggle is a website that hosts competitions; there's a classroom component with more teaching-oriented problems, and they have open competition problems as well. I love that my data was clean and labeled and I could get right to work building. In our example here, our training data has 2,500 e-mails, 1,721 of which are labeled as ham (label 1), with the balance labeled as spam (label 0). The labels themselves are just that: 0 or 1, pretty straightforward. And the e-mails themselves (the slide is a little grainy, I apologize) are collections of text with some HTML in them. So what are we going to use to write our very, very simplistic naive Bayes spam classifier? These three things: for any e-mail, we're going to parse the e-mail into message objects; we're going to use lxml, because as I said those e-mails have some HTML embedded in them, and right now all I care about is the words themselves, so lxml strips that out; and we'll use NLTK, the Natural Language Toolkit, which can help us filter out stop words.
So let's go ahead and get to it and train the spam filter. For training the naive Bayes classifier, when I say "train", we're going to go through these steps. The first thing we're going to do is tokenize; I'll explain in just a moment. One thing I will say: when we look at the corpus of words in an e-mail, I am not treating words like "shop" and "shopping" as the same word. You can actually do that, it's called stemming, and that would be a bonus feature; I encourage you to go play around with it, but I didn't do it for this example. So we're going to tokenize the words of each e-mail that we process, and we're going to keep track of the unique words that we see across all the documents we process; this will come into effect later to help us with zero word frequencies. We're then going to increment a word frequency count for each category, our categories here being ham and spam, and increment the category counts, which again feed the prior probability of the classes that we need to take into account. We're also going to keep track of how many words are in each category, and it's good to know how many training examples we've actually processed. That's the last step. So training pretty much starts with tokenizing.
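The bookkeeping described in those steps, word frequencies per category, category counts, a vocabulary of unique words, and a count of processed examples, can be sketched with plain dictionaries. The class and attribute names here are illustrative, not the speaker's actual code:

```python
from collections import defaultdict

class SpamTrainer:
    """Minimal counters for training a naive Bayes spam filter."""

    def __init__(self):
        self.word_freq = {"ham": defaultdict(int), "spam": defaultdict(int)}
        self.category_counts = {"ham": 0, "spam": 0}    # feeds the class priors
        self.words_per_category = {"ham": 0, "spam": 0}
        self.vocabulary = set()                          # unique words ever seen
        self.trained = 0                                 # e-mails processed

    def train(self, words, category):
        """Record one tokenized e-mail under its known label."""
        for word in words:
            self.word_freq[category][word] += 1
            self.words_per_category[category] += 1
            self.vocabulary.add(word)
        self.category_counts[category] += 1
        self.trained += 1

trainer = SpamTrainer()
trainer.train(["free", "offer"], "spam")
trainer.train(["meeting", "notes"], "ham")
print(trainer.category_counts)  # {'ham': 1, 'spam': 1}
```

These counters are all the classifier needs later: likelihoods come from `word_freq` and `words_per_category`, priors from `category_counts`.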
Tokenizing turns text into a bag of words; that's what it is, a bag of words. Essentially this is very simplistic, stripped down a little. What we want to do is pull out the words (this is after we've already removed the HTML embedded in them), and for each word in our text, lowercase the word, and as long as the word isn't in the corpus of stop words for the English language, keep it. Stop words are words like "a", "and" and "or": words that appear often but may not provide a lot of value in thinking about whether this thing is going to be spam or not. You can get that list from NLTK; I'm glad I didn't have to compile it myself. We go ahead and do this for each e-mail, and now we have a bag of words.
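A minimal version of that tokenizer might look like this; the stop-word set below is a tiny stand-in for NLTK's full English list (`nltk.corpus.stopwords.words("english")` in the real pipeline):

```python
import re

# A few stop words standing in for NLTK's full English stop word list.
STOP_WORDS = {"a", "an", "and", "or", "the", "is", "in", "of", "to"}

def tokenize(text):
    """Turn e-mail text (HTML already stripped) into a bag of words:
    lowercase every word and drop English stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("Speed up YOUR slow PC and win a FREE prize"))
# ['speed', 'up', 'your', 'slow', 'pc', 'win', 'free', 'prize']
```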
Remember that zero word frequency thing I was talking about? Let's think about it. I've done my training, and now I have an e-mail I'm trying to classify that contains the word "free"; but, problem: I have never historically seen the word "free" in the spam collection of e-mails I've looked at. So what happens when I calculate the likelihood of all my predictors? It goes to zero. To offset that, we can add a small constant, which Laplace smoothing permits us to do; that gives us a small offset so that one unseen word doesn't throw our math out the window.
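The fix can be written directly into the likelihood calculation: add one to every count (and the vocabulary size to every denominator) so an unseen word never drives the whole product to zero. A sketch with assumed counts:

```python
def word_likelihood(word_count, words_in_class, vocab_size):
    """P(word | class) with Laplace (add-one) smoothing."""
    return (word_count + 1) / (words_in_class + vocab_size)

# "free" seen 0 times among 200 ham words, with 1,000 unique words overall:
p = word_likelihood(0, 200, 1000)
print(p)  # small but non-zero: 1/1200
```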
So let's talk about classifying. Sorry, this is a giant wall of text, but I just wanted to point out that it's quite literally iteration, counting and dictionaries; there's no black-box magic here. Essentially all we do in the classifier is say: for each category, we're going to create a posterior probability. We go ahead and find the probability of all the predictors, multiply that by the prior probability of the class itself, and pick the one that has the higher value; that's what we classify the e-mail as. It's not that dramatic. In the get-predictors-probability step, if we see something we haven't seen before, we go ahead and add one to the count. And one point right here about floating-point underflow: when you're doing work where you really care about having very precise decimal points, you're going to want a specific numeric type.
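That loop, the product of smoothed word likelihoods times the class prior with the maximum taken over categories, fits in one short function; the counts below are toy values, not the talk's data:

```python
def classify(words, word_freq, category_counts, vocab_size):
    """Return the category with the larger a-posteriori score."""
    total_docs = sum(category_counts.values())
    best_category, best_score = None, -1.0
    for category, freqs in word_freq.items():
        words_in_class = sum(freqs.values())
        score = category_counts[category] / total_docs  # class prior
        for word in words:
            # Likelihood with add-one smoothing for unseen words.
            score *= (freqs.get(word, 0) + 1) / (words_in_class + vocab_size)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

word_freq = {"spam": {"free": 28, "winner": 10}, "ham": {"meeting": 30}}
category_counts = {"spam": 50, "ham": 100}
print(classify(["free", "winner"], word_freq, category_counts, vocab_size=4))
# spam
```

Note how small the running `score` gets as the word loop multiplies fraction after fraction; that is exactly where the precision concern above comes from.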
You could use logs instead, but in this case I used Decimal objects, and there were no tears. There's a note here, which you probably can't read (I will share the slides), that comes from the Stanford Natural Language Processing material describing how to handle floating-point underflow in this computation; that's what I went with. So, OK.
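The underflow problem is easy to demonstrate: multiplying a long e-mail's worth of small likelihoods collapses to exactly 0.0 in ordinary floats, while summing their logs (the alternative mentioned above) stays finite and comparable:

```python
import math

probs = [1e-4] * 1000  # a long e-mail's worth of small likelihoods

product = 1.0
for p in probs:
    product *= p        # underflows to exactly 0.0 in ordinary floats

log_score = sum(math.log(p) for p in probs)  # finite and comparable

print(product)    # 0.0
print(log_score)  # about -9210.3
```

Since log is monotonic, comparing log scores across categories picks the same winner as comparing the raw products would, without the underflow.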
Performance measurement: I've classified the thing, so how do we know how well it did? I took my detector, trained it and evaluated it, and what I eventually came out with is 223 correct and 27 incorrect; my performance measurement is about 89 per cent. As a small footnote, an accuracy of about 90 per cent is, I believe, a common benchmark, and we can obviously do better here; we'll talk about what "doing better" means in a moment. On the idea of how I split up my training data: a 90/10 split is pretty much what I've seen as a standard. Given different problems you might want to chunk things up differently, but I went with 90/10. Essentially all I did was say: take 90 per cent of the data and train on it, then go ahead and classify the remaining 10 per cent. And how do we know if something is correct or incorrect? Whatever label we ultimately assign, we check against the label that shipped with the data; that's how we see what's correct and what's incorrect, and it's basically straightforward. That's how we got the 89 per cent.
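The split and the score reduce to a few lines; with 250 held-out e-mails and 27 misclassified, the numbers work out to about 89 per cent:

```python
emails = list(range(2500))            # stand-ins for the 2,500 labeled e-mails
split = int(len(emails) * 0.9)
train, test = emails[:split], emails[split:]  # 90% to train on, 10% to evaluate

correct, incorrect = 223, 27          # evaluation results reported in the talk
accuracy = correct / (correct + incorrect)
print(f"{accuracy:.1%}")  # 89.2%
```

In practice you would also shuffle before splitting so the held-out 10 per cent isn't biased by the order of the file.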
Some things to watch out for: false positives. This is a really fun one. Google, for example, does spam filtering really well, but even they can have some misses. I actually like to sign up for Patagonia e-mails, and this Patagonia e-mail got filed as spam. A false positive is when something is incorrectly identified, so you can run into this. And when something is incorrectly classified, what's the recourse? In real implementations (since we're asking whether naive Bayes is too naive), we can tell Google that this is actually not spam; I can validate the data and send it back to them, and they can feed that into their implementation and try to improve that feature. So false positives are something to watch out for. And there are some limitations and challenges with naive Bayes.
Obviously, this independence assumption is very, very simplistic. If I get a marketing e-mail about Barcelona and it isn't talking about Messi, I'm going to be very confused (apparently there are rumors about him being traded, so we shall see). But the independence assumption is quite simplistic; it is not the way these things work in the real world. What are the side effects of that? One is that we end up overestimating the probability of the label we ultimately select, meaning we create more of a binary: things lean either more to the left or more to the right in how we assign category labels. Also, remember how I said I cheated and didn't label my own data? Here's another thing: human error. These classifier algorithms are called supervised learning; they require a historical, labeled set of data to learn from in order to make predictions, and human error can creep in. What happens if, say, I'm a professor making use of my student lackeys, some of them have been up all night, and ten of them look at the same e-mail and all come up with different labels for it, but it's all in my training set? That can be very inconsistent, so I need to think about how the labeling of the data happened. As much as we don't like to think about the demands of data cleaning and data collection, that's actually a really important part of the process when working with a supervised machine learning problem. How can we improve our performance?
We can do more and better feature extraction, because as much as I would like to say that e-mails can only be identified by the words in them, we know that's not true. Predicting the sentiment of e-mails is very complicated, very difficult; natural language processing is a huge field, and I'm not getting into all of that myself. But we need to think of other ways we can identify spam. What are some candidates? The subject line: is there something in the subject we can use? What about images: is there an abundance of images in spam e-mails, or maybe none at all? And the sender: remember that really trustworthy e-mail address from earlier? Clearly I would trust anything from an "AOL_member_info@" address, whatever that was. Again, I don't trust most such addresses, so that's another feature. If we're going to consider all other possible features, we can also think about capitalization, irregular punctuation and things like that. Ultimately, we also want more data: more input to learn more.
I'd also highly recommend the book Introduction to Machine Learning with Python; the author is a great data scientist, and I've heard great things about it. And check out your friendly local Python user group; we love talking about and learning these things together. There's also a great talk after this one covering more machine learning, so stay for it. Thanks. If there's anything I hope you may have learned, it's that correlation may be causation, or causation may be correlation, I don't know. We can implement a thing, but the question then becomes how to interpret the results, and that's where I challenge you to go ahead and try some things yourself. Thank you so much.
Do you have any questions? It's OK if no one has a question; you can always look things up later. I'll be out in this area here for a few minutes, but like I said, I do want to hear the next talk, so I'll be around. My name's Lorena, please reach out. It's been a pleasure to be here; thank you so much for listening.