Is that spam in my ham?

Speech Transcript
Good afternoon everyone, welcome to this afternoon's session. I'd like to introduce Lorena Mesa. She's a platform engineer at Sprout Social in Chicago, and she's going to talk to us about spam and natural language processing.
Thank you so much for joining me this afternoon. The name of this talk is "Is That Spam in My Ham?", a novice's inquiry into text analysis and classification.
As my announcer introduced, my name is Lorena Mesa, and as you can see I'm a fan of Star Trek, so live long and prosper. I hail from Chicago. A little bit about me and why I wanted to chat on this topic: I'm actually a career changer. A few years ago I came from being a data analyst in the social sciences; specifically, I worked at Obama for America doing data governance, and I switched into software engineering about three years ago.
Some of the big questions that were driving me at the time are captured in this talk. Some other things I do: I love Django Girls, and I helped at the workshop yesterday. It's a glorious, curious thing; if you have the opportunity to coach at one of these, do it, and if you would like to sign up for another one, please do as well. I founded the PyLadies Chicago group, and I was recently voted onto the board of directors of the Python Software Foundation, which I'm very excited about.
So, a little bit about a shared experience we've all had. I think we've all gotten an email at some point where something filters into our inbox with language like "speed up your slow PC," and of course we trust email that comes from an address like AOL_member_info at some dubious domain, and of course the minute anything tells me "this is great, you really should do it," I trust it. So when we see an email like this, we usually know just by looking at it that it's a piece of spam.
We junk it, we don't care about it, we ignore it. So how do we move from saying "I know it when I see it" to saying "I can programmatically detect whether something is spam," using Python? In this talk we'll think about three questions: one, what is machine learning? Two, how does classification fit into that world? And three, how can I use Python to solve a classification problem like spam detection?
This talk is going to be really focused on a beginner's understanding of machine learning, so if you are looking for more intermediate or advanced topics, I know this to be a great conference to go check some of those out, but we're really taking this from the land of the beginner. So, machine learning. Maybe you're like the character on the top left: confused, not sure what machine learning is. Is it a robot? Is it Johnny 5, being a superhero in a children's movie? I loved it when I was a little kid; he's super quirky, he can move his eyebrows.
Well, I don't really think machine learning is Johnny 5, so let's think a little bit about what machine learning actually is. One of the things I like to do when I begin working on a new problem is to find some language to ground myself, to understand what types of problems are being solved. If I were to look around for some language defining machine learning, I might find something like this: discussions of pattern recognition, computational learning, artificial intelligence. I don't know about all of that, but there is a part of it that does make sense to me: the study of algorithms that can learn from and make predictions on data. I like it; algorithms, data, tell me more. I think a better way we can think about machine learning is to borrow some language from Tom Mitchell, the chair of the Machine Learning Department at Carnegie Mellon University, whose textbook Machine Learning is kind of a quintessential text for folks learning about this field. He says we can think about machine learning in a three-part way: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. OK, so we have a task, we have experience, we have a performance measure. I can do this; this makes sense to me.
So what about experience? How do I know what I know? Well, I'm a human, and the way I know what I know comes from my memory. I have memories stored up that teach me things about what I like, what I don't like, what I should do, what I shouldn't do. Maybe as a kid I was a very active child, running around like a maniac all the time because I had to be super fast. But what happens when you run around like every little kid, while your body is growing? You might bruise, you might fall and skin your knee. How many times did that happen? It probably took quite some time for me to learn that I shouldn't run around like a maniac, that I should walk around like a normal person so I don't hurt myself. That became a teaching experience for me. Likewise, my grandmother would be in the kitchen making mole, because I love mole, and I would always be sticking my hand near the stove, and more than once I definitely burned my hand. The idea of putting your hand near red-hot oil is not very smart, so over time I learned to recognize that as a sign that I shouldn't do that. So we can think of experiences as the making of our memory. What does that mean in different problems? If I were to ask the question "what is the historical experience of the stock market?", I could say: if I want to understand what a piece of stock has done historically, I can look at what the records tell me about the price of that stock two years ago on July 17th, one year ago on July 17th, and, depending on how far back I want to do some analysis, I have historical data that can tell me something about the historical performance of that stock. So as humans we have memories, but in other settings we have historical data that can teach us something.
So coming to machine learning and classification, what does experience actually mean? Let's bring this into Mitchell's framework. Our first problem is going to be identifying a task. For us, we want to classify a piece of data, so our question is: is an email spam or ham? The idea here is that ham is just anything that's not spam. So spam or ham, that's our task. For our experience, we're going to have a set of labeled training data. Essentially what that means is we have a collection of emails, and each one has a label saying the email is either ham or spam; it's a collection of emails that we already know to be one thing or the other. And then our performance measure is: is the label correct? What we need to do is be able to verify whether emails are indeed spam or ham.
So thinking about a classifier that we can use, we can think of Naive Bayes. Naive Bayes is a type of probabilistic classifier. I love this image because I really want to know who the person is that hung a neon light of Bayes in their office front window. I don't know who that person is, but I thought it was really great. Naive Bayes comes to us from statistics; it's based on Bayes' theorem, no surprise. One of the key things with Bayes' theorem, when we talk about the likelihood of events, is that here we treat events as independent of one another. That's where the naive assumption comes from when we say we're going to be using a Naive Bayes classifier.
For those of us who may not remember exactly what it means when we talk about independent and dependent events, let's have a quick refresher. If I were to ask you what's the probability of flipping a quarter six times in a row and getting heads every time, how would you go about solving that problem? Well, let's think about it. On the first flip I have two outcomes, heads or tails, so the likelihood of getting heads is going to be 0.5. The second time it's again 0.5, the third time, and so forth, it's going to be 0.5. So flipping a quarter and receiving multiple heads in a row is a sequence of events that are independent of one another: when we talk about independent events, the outcome of one flip doesn't change the likelihood of the next.
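As a quick worked example (the talk only states the per-flip 0.5; the product below is my arithmetic), independence lets us simply multiply the per-flip probabilities:

$$
P(\text{six heads in a row}) = 0.5 \times 0.5 \times 0.5 \times 0.5 \times 0.5 \times 0.5 = 0.5^{6} = \tfrac{1}{64} \approx 0.016
$$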
In contrast, dependent events. Take horse number five, on, I guess, your right-hand side. My question was: what's the likelihood that horse number five is going to win the derby? One of the things I would say is, well, we need to think about the weather conditions. Is it rainy? Is it sunny? Perhaps think about the age of the horse, the health of the horse. There can be other things that are tied up in the likelihood of horse number five winning. In this context, the probability of horse number five winning is going to be dependent on other things, for example the weather. So when we talk about Naive Bayes, our assumption is that we have independent events.
When we talk about emails, we're really going to be thinking about the words that make up the email. So let's think about these words in pairs. For the likelihood of the word "Messi" appearing with the word "Barcelona," we're going to assume that there is no relationship; that's what Naive Bayes tells us to do, even though you might think there is one. Or, back to some really spammy language we love: what's the relationship between "buy" and "now"? We're going to assume there is no relationship, that the likelihood of "buy" appearing is not going to impact the likelihood of "now" appearing in the corpus of words that forms an email.
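Written out (my notation, not a slide from the talk), the naive independence assumption says that, given a class, the joint probability of the words factors into a product of per-word probabilities:

$$
P(w_1, w_2, \ldots, w_n \mid c) \;=\; \prod_{i=1}^{n} P(w_i \mid c)
$$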
So for many spam classifiers, again, our question is: what is the probability of an email being ham or spam? With Bayes' theorem, there are three things we need to compute: one, the likelihood of the predictors given the class; two, the prior probability of the class; and three, the prior probability of the predictors. All of these together help us compute the posterior probability of the class. When I say class, our classes here are ham and spam; those are our only two classes. Our predictors are going to be the words in the email itself.
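In symbols, this is just standard Bayes' theorem, matching the three pieces listed above:

$$
P(c \mid w_1, \ldots, w_n) \;=\; \frac{P(w_1, \ldots, w_n \mid c)\, P(c)}{P(w_1, \ldots, w_n)}
$$

where $c$ is ham or spam, the numerator holds the predictor likelihood and the class prior, and the denominator is the predictor prior.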
So for example, if I'm looking at a piece of content and asking what's the likelihood of a predictor being in the spam or ham class: if I'm looking at the word "free," we might count that 28 out of 50 spam emails have the word "free" in them. We do this for each word in our email, and we find the likelihoods of all the predictors and multiply them together. We also need to consider the prior probability of the class: given the entire collection of data we're looking at, how many of the documents are of one class and how many of another? So if we have 150 emails we're working with, we can say 50 of those documents are spam, 50 out of 150. And then the prior probability of the predictor: we're asking how many times the word "free" has appeared in all our emails, let's say 7 out of 150. And there you go: Bayes' theorem is basically frequency tables. How many times has this word appeared in this class, and how many times has this class appeared in the collection of things we're looking at?
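Plugging in the counts used as illustrations in the talk (they're example numbers meant to show the frequency-table idea, not a consistent dataset), the pieces for the word "free" and the class spam would be:

$$
P(\text{free} \mid \text{spam}) = \tfrac{28}{50} = 0.56, \qquad P(\text{spam}) = \tfrac{50}{150} \approx 0.33, \qquad P(\text{free}) = \tfrac{7}{150} \approx 0.05
$$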
Great, we've made some calculations, we've found some values between 0 and 1. How do we know which one to pick? Pretty easy: whichever one has the higher value, the maximum a posteriori probability. The reason we say "a posteriori" here is that we're not looking at anything new; we're looking at historical data, things that have already happened. We've made a calculation for the class ham and for the class spam, and we simply pick the larger of the two and say this email is going to be either ham or spam. Pretty simple.
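As a decision rule (standard maximum a posteriori notation, not from the slides), the shared denominator can be dropped since it's the same for both classes:

$$
\hat{c} \;=\; \arg\max_{c \,\in\, \{\text{ham},\,\text{spam}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$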
So why Naive Bayes? Well, I think just walking through that, we can see it's pretty straightforward: it's as simple as frequency tables. I think we can all do this together. It may seem a little bit daunting at first, but once you start seeing the application of it, you can see that it's quite approachable. So if you are starting to think about classifiers and classification problems, this is a great one to start with: the math is accessible, and you can move on to other algorithms later. We will talk about some of the limitations in a moment, but it is a good one to start with.
That's great, but how do we actually use this to detect spam? OK, well, I cheated a little bit. I didn't do all my own data collection, munging, and cleaning, as rewarding as that is; instead I went to find a resource out there that was already cleaned and labeled for me. And where did I get that? Kaggle. Kaggle is a website that hosts competitions; they have a classroom component with teaching problems, and they have open competition problems as well. I love that my data was clean and labeled and I could get right to work building. So in our example here, our training data has 2,500 emails, 1,721 of which are labeled as ham (1), with the balance labeled as spam (0). The labels themselves are simple: we have an ID and a prediction, 0 or 1, pretty straightforward. And, the slide is a little grainy, I apologize, but the emails themselves are collections of text with some HTML in them. So what are we going to use to write our very, very simplistic Naive Bayes spam classifier? Three things: we're going to parse emails into message objects; we're going to use lxml, because as I said those emails have some HTML embedded in them, and right now all I care about is the words themselves, not the markup; and we're going to use NLTK, the Natural Language Toolkit, which can help us filter out stop words.
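A minimal sketch of that preprocessing step, assuming raw email strings as input (the talk doesn't show its exact code; the function name is mine, but `email` and `lxml.html` are the standard tools for this):

```python
import email
from lxml import html

def email_body_text(raw_message: str) -> str:
    """Parse a raw email and return its body as plain text, markup stripped."""
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():          # walk() visits every leaf of a multipart message
        if not part.is_multipart():
            parts.append(part.get_payload())
    body = " ".join(parts)
    # lxml parses the (possibly HTML) body; text_content() drops the tags
    return html.fromstring(body).text_content() if body.strip() else ""
```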
So let's go ahead and get to it and train the spam filter. For training the Naive Bayes classifier, when I say "train," we're going to go through these steps. The first thing we're going to do is tokenize; I'll explain what I mean in just a moment. One thing I will say is that when we look at the corpus of words in an email, I am not treating words like "shop" and "shopping" as the same word. You can actually do that; it's called stemming. Consider that a bonus feature, and I encourage you to go try it out, but I didn't do it for this example. So we're going to tokenize the words, and for each email that we process, we want to keep track of the unique words that we've seen across all the documents we've processed; this will come into effect later to help us with zero word frequencies. We are then going to increment the word frequency for each category, our categories here being ham and spam. We're going to increment the category counts, which again give us the prior probability of the classes that we need to take into account. We're also going to keep track of how many words are in each category, and it's good to know how many training examples we've actually processed; that's the last step. A sketch of this bookkeeping follows.
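Here's a minimal sketch of that training loop, under my own naming (the talk shows its own code, which I'm not reproducing; the `tokenize` helper is defined in the next snippet):

```python
from collections import defaultdict

class NaiveBayesSpamFilter:
    def __init__(self):
        self.vocabulary = set()                       # unique words seen anywhere
        self.word_counts = {                          # word frequency per category
            "ham": defaultdict(int),
            "spam": defaultdict(int),
        }
        self.category_counts = {"ham": 0, "spam": 0}  # for the class priors
        self.total_words = {"ham": 0, "spam": 0}      # words per category
        self.examples_seen = 0                        # training emails processed

    def train(self, text: str, category: str) -> None:
        """Update the frequency tables with one labeled email."""
        for word in tokenize(text):
            self.vocabulary.add(word)
            self.word_counts[category][word] += 1
            self.total_words[category] += 1
        self.category_counts[category] += 1
        self.examples_seen += 1
```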
So training pretty much starts with tokenizing text into a bag of words; that's all it is, a bag of words. Essentially this is very simplistic, so let me break it down a little. What we want to do is pull out the words; this is after we've already removed the HTML that was embedded in them. For each word in our text, we lowercase the word, and as long as the word isn't in the corpus of stop words for the English language, we keep it. Stop words are words like "a," "and," "or": words that may appear often but don't provide a lot of value in deciding whether something is spam or not. You can get that list from NLTK; I'm glad I didn't have to compile it myself. We go ahead and do this for each email, and now we have a bag of words.
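A sketch of that tokenizer, assuming the NLTK stopwords corpus has been downloaded (`nltk.download('stopwords')`); the regex is my choice of word splitter:

```python
import re
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def tokenize(text: str) -> list:
    """Lowercase, split into word tokens, and drop English stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```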
Now remember that zero word frequency thing I was talking about? Let's think about this. I've done my training, and I have an email that I'm trying to classify, and it contains the word "free." Problem: I have never historically seen the word "free" in the spam collection of emails that I've looked at. So what's going to happen when I calculate the likelihood of all my predictors? One of the factors is zero, so the whole product is zero. To offset that, we can add a small constant, which is what Laplace smoothing permits us to do, and that small offset keeps our math from going out the window.
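In code, the smoothed per-word likelihood might look like this, continuing the class sketch above (add-one smoothing over the vocabulary size is one common choice; the talk just says "a small constant"):

```python
# method on NaiveBayesSpamFilter from the earlier sketch
def word_likelihood(self, word: str, category: str) -> float:
    """P(word | category) with Laplace (add-one) smoothing, so unseen
    words contribute a small probability instead of zeroing the product."""
    count = self.word_counts[category][word]
    return (count + 1) / (self.total_words[category] + len(self.vocabulary))
```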
So let's talk about classify. Sorry, this is a giant wall of text, but I just wanted to point out that it's quite literally iteration, counting, and dictionaries; there's no black-box magic here. Essentially all we do in the classifier is say: for each category, we're going to compute this posterior probability. We find the probability of all the predictors, we multiply that by the prior probability of the class itself, and we pick the one that has the higher value; that's what we classify the email as. It's not magical. In the get-predictors-probability step, if we see something we haven't seen before, we go ahead and add a value of one to the count; that's the smoothing. And there's this point right here about floating-point underflow: when you're doing math where you really care about very precise decimal values, you can use Decimal objects, or you could use logs instead; in this case I used Decimal objects. And there's this nice bit, which you probably can't read but I will share the slides, from the Stanford Natural Language Processing materials describing how to handle floating-point underflow in this computation, and that's the advice I followed.
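Putting it together, a minimal classify method in the same spirit (using Decimal for the running product, per the underflow point; summing logs would work too):

```python
from decimal import Decimal

# method on NaiveBayesSpamFilter from the earlier sketch
def classify(self, text: str) -> str:
    """Return the category with the higher (unnormalized) posterior."""
    posteriors = {}
    for category in ("ham", "spam"):
        prior = Decimal(self.category_counts[category]) / Decimal(self.examples_seen)
        likelihood = Decimal(1)
        for word in tokenize(text):
            likelihood *= Decimal(str(self.word_likelihood(word, category)))
        posteriors[category] = prior * likelihood
    return max(posteriors, key=posteriors.get)
```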
So, performance measurement: my classifier picked a label, but how do we know how well it did? I go ahead, I have my detector, it's trained, and I evaluate. What I eventually come out with is 223 correct and 27 incorrect, so my performance measure is about 89 per cent. A small footnote: an accuracy of about 90 per cent is, I believe, a common benchmark, so we obviously can do better here, and we'll talk about what doing better could mean in a moment.
On the idea of how I split up my training data: a 90/10 split is pretty much what I've seen as a standard. I'm sure that given different problems you might want to chunk things up differently, but I went with 90/10. Essentially all I did was say: take 90 per cent of the data and train on it, then take the remaining 10 per cent and classify it. And how do we know if something is correct or incorrect? Well, whatever label we ultimately assign, we check against the label the data came with; if they match it's correct, otherwise incorrect. It's basically that straightforward, and that's how we got the 89 per cent.
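A sketch of that evaluation loop (`labeled_emails`, a list of `(text, label)` pairs, is my stand-in for however the Kaggle data gets loaded):

```python
import random

def evaluate(filter_cls, labeled_emails, train_fraction=0.9):
    """Train on 90% of the labeled data, report accuracy on the held-out 10%."""
    random.shuffle(labeled_emails)
    split = int(len(labeled_emails) * train_fraction)
    train_set, test_set = labeled_emails[:split], labeled_emails[split:]

    nb = filter_cls()
    for text, label in train_set:
        nb.train(text, label)

    correct = sum(1 for text, label in test_set if nb.classify(text) == label)
    return correct / len(test_set)

# e.g. evaluate(NaiveBayesSpamFilter, labeled_emails)
# per the talk, this landed around 0.89 on the Kaggle set
```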
Some things to watch out for: false positives. This is a real thing. For example, Gmail does these things really well, right? They're really good at spam filtering, but even they can have some flaws. I actually like to sign up for Patagonia emails, and this Patagonia email ended up in my spam folder. A false positive is when something is incorrectly identified as spam, so you can run into this. So we can ask: when something is incorrect, what's the recourse? In good implementations, and here we're really asking whether Naive Bayes is too naive, we can give feedback: I can tell Google that this is actually not spam. I can validate the data and send it back to them, and they can feed it into their implementation and try to improve that behavior. So false positives are a thing to watch out for.
And there are some limitations with Naive Bayes, some challenges. Obviously this independence assumption is very, very simplistic. If I get a marketing email about Barcelona and they aren't talking about Messi, I'm going to be very confused; at present there are some talks about him being traded, so we shall see. But obviously the independence assumption is quite simplistic; it is not the way these things work in the real world. What are the side effects of that? Well, one of them is that we're going to overestimate the probability of the label we ultimately select, meaning we create sharper binaries: we push things either further to the left or further to the right of the category labels. Also, remember how I said I cheated and didn't label my own data? Here's another thing: human error. These algorithms, these classifiers, are called supervised learning; they require a historically labeled set of data to learn from in order to make predictions, and human error can creep into this. What happens if, let's say, I'm a professor making use of all my student lackeys, and some of them have been up all night, and ten of them looked at the same email and they all came up with different labels for it, but it's in my training set? That can be very inconsistent, so I need to think about that as well: how was the labeling of the data happening? So as much as we don't like to think about the demands of data cleaning and data collection, that's actually a really important part of the process when working with supervised machine learning problems.
How can we improve our performance? Well, we can do more and better feature extraction, because while I would like to say that emails can be identified only by the words in them, we know that's not true. Predicting the intent of emails is very complicated, very difficult; natural language processing is a huge field, and I'm not going to do it justice here. But we need to think of other ways we can identify spam. What are some other features? Perhaps the subject line: is there something in the subject we can use? What about images: is there an abundance of images in spam emails, or maybe none at all? What about who the sender is? Remember that really trustworthy AOL email address from earlier, because clearly I would trust anything AOL; then again, I don't trust most AOL stuff. That's another feature. If we're going to think about what other possible features to consider, we can think about capitalization, irregular punctuation, and things like that. Ultimately we also want more data: want to learn more? Good, go get more data.
I also would highly recommend Introduction to Machine Learning with Python; Sarah Guido is a great data scientist, and I've heard great things about the book. And also your friendly local Python user group: we love talking and learning together. And talk to people here; there's a great talk after this one covering more machine learning, so stay for it. Thanks.
If anything, I hope what you may have learned is: correlation may be causation, or causation may be correlation, I don't know. We can implement a thing, but the question then becomes how to interpret those results, and that's where I challenge you to go ahead and try some things. Thank you so much.
Do you have any questions? It's OK if no one has any questions; you can hit me with them one-on-one. I'll be out in this area here for a few minutes, but like I said, I want to hear the next talk, so I'll be around. My name's Lorena; please reach out. It's been a pleasure to be here; thanks so much for listening.

Metadata

Formal Metadata

Title: Is that spam in my ham?
Series Title: EuroPython 2016
Part: 07
Number of Parts: 169
Author: Mesa, Lorena
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute, and make the work or content publicly available in unchanged or adapted form for any legal, non-commercial purpose, provided you credit the author/rights holder in the manner they specify and pass on the work or content, including in adapted form, only under the terms of this license.
DOI: 10.5446/21176
Publisher: EuroPython
Publication Year: 2016
Language: English

Content Metadata

Subject Area: Computer Science
Abstract: Lorena Mesa - Is that spam in my ham? Beginning programmers or Python beginners may find it overwhelming to implement a machine learning algorithm. Increasingly, machine learning is becoming applicable to many areas. This talk introduces key concepts and ideas and uses Python to build a basic classifier, a common type of machine learning problem. It provides some jargon to help those who may be self-educated or currently learning. ----- Supervised learning, machine learning, classifiers, big data! What in the world are all of these things? For a beginning programmer, the questions described as "machine learning" questions can be mystifying at best. In this talk I will define the scope of a machine learning problem, identifying an email as ham or spam, from the perspective of a beginner (not a master of all things "machine learning") and show how Python can help us simply learn how to classify a piece of email. To begin we must ask: what is spam? How do I know it "when I see it"? From previous experience, of course! We will provide human-labeled examples of spam to our model for it to understand the likelihood of spam or ham. This approach, using examples and data we already know to determine the most likely label for a new example, uses the Naive Bayes classifier. Our model will look at the words in the body of an email, finding the frequency of words in both spam and ham emails and the frequency of spam and ham. Once we know the prior likelihood of spam and what makes something spam, we can try applying a label to a new example. Through this exercise we will see at a basic level what types of questions machine learning asks, learn to model "learning" with Python, and understand how learning can be measured.
