
Data Mining Overview, Association Rule Mining (16.12.10)


Formal Metadata

Title
Data Mining Overview, Association Rule Mining (16.12.10)
Series title
Part
8
Number of parts
13
Author
Contributors
License
CC Attribution - NonCommercial 3.0 Germany:
You may use, modify and reproduce the work or its contents for any legal and non-commercial purpose, and distribute and make it publicly accessible in unmodified or modified form, provided that you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Year of publication
Language
Producer
Production year
2011
Production location
Braunschweig

Content Metadata

Subject area
Genre
Abstract
In this course, we examine the aspects of building, maintaining and operating data warehouses, and give an insight into the main knowledge discovery techniques. The course deals with basic issues like storage of the data, execution of analytical queries and data mining procedures. The course will be taught completely in English. The general structure of the course is:
Typical DW use case scenarios
Basic architecture of a DW
Data modelling on a conceptual, logical and physical level
Multidimensional E/R modelling
Cubes, dimensions, measures
Query processing, OLAP queries (OLAP vs OLTP), roll-up, drill-down, slice, dice, pivot
MOLAP, ROLAP, HOLAP
SQL99 OLAP operators, MDX
Snowflake, star and starflake schemas for relational storage
Multidimensional physical storage (linearization)
DW indexing as a search optimization means: R-trees, UB-trees, bitmap indexes
Other optimization procedures: data partitioning, star join optimization, materialized views
ETL
Association rule mining, sequence patterns, time series
Classification: decision trees, naive Bayes classification, SVM
Cluster analysis: k-means, hierarchical clustering, agglomerative clustering, outlier analysis
Transcript: English (automatically generated)
So, hello everyone, welcome to the lecture data warehousing and data mining and we kind of are through the data warehousing part so we can kind of like get rid of that and for the rest of the term we will deal with data mining issues and some very exciting algorithms and interesting ways to get more out of your data.
And so today we will start with a short introduction into what business intelligence is and then present the first algorithm which will be frequent item set mining and association rule mining today.
So last time we were talking about how to build data warehouses that concluded our idea of what a data warehouse actually is in terms of the software that is done we were talking a little bit about the ETL process which is the most important process in the life cycle of a data warehouse
because the old crap in, crap out paradigm is still valid so if the data in your data warehouse is bad then all the decisions based on that data and all what you can mine from this data is of low quality too.
And that's kind of very annoying so what you have to do is you have to consider how do you get the data how is the data transformed into some global schema that you can really use for a lot of applications which is extensible also for new and upcoming applications
and then you have to look in the transformation phase a little bit into the topic of data cleaning how do you make sure that the data is really correct and that the data is complete because that determines the major quality issues of your data warehouse
and interestingly enough if you look at some of the big companies today that have huge data warehouses terabytes and terabytes of data and you look at the quality of the data that is actually in there you will more often than not find that the data quality is actually pretty poor
and of course that is a problem in today's data warehouses in bigger companies that cost a lot of money to clean Finally you have to load it that is usually a process that is done in bulk mode
so you load all the data that has been cleaned that has been transformed that has been extracted from the underlying productive sources and put it overnight or whenever into the data warehouse and the data warehouse is ready for all the OLAP queries that you could think of the next day
Another thing that we were talking about briefly last time was metadata so how do you describe what data actually is in the data warehouse how do you keep up with the metadata and how do you get an understanding about the nature of the data that comprises your data warehouse
Another thing that is very often done these days in terms of metadata is the so called data lineage or data provenance so information about where does the data come from That can be very interesting to see if there are some oddities or some things that you just don't understand
Looking at the trail of data may very often help to get a feeling how this data was aggregated and maybe there is something wrong So this can point you really to interesting insights about your data warehouse But as I said today we want to go into data mining rather
and the major term that is needed if we talk about data mining and data warehouses in one lecture is business intelligence because that is what you want You want to find information that you don't know yet
about your business, about your company, about your customers, about whatever you have in your data warehouse Since you have all the data in the data warehouse that's a good place to look for it On the other hand it's quite a difficult thing to look for that because you just dump the data in the data warehouse
and there is no way that you find out some hidden relationships or something You just have your schema and loads and tons of data So we will briefly provide an overview of what business intelligence challenges are today and then start with data mining
and today will be one of the most important algorithms for data mining that is very often employed especially for marketing purposes Association rule mining Good! So what is business intelligence?
Business intelligence is kind of insights about your company Insights about what you do in your company and it's very often said that what in your data warehouse actually resides is data
and what you need to make decisions is information and those are two totally different things You can generate information from data Well actually you have to generate information from data otherwise it would be guesswork and the data gives you the support to have some statements about what's going on
and the information makes the data digestible in a way So you can easily grasp the concepts underlying things And there is another step, many people talk about this pyramid kind of shape
You have a lot of data in your company You extract information out of the data and what is the peak of our little pyramid here?
No, decisions are the actions that you take based on what is in the pyramid. Knowledge, exactly. So this is the most refined analysis data that you will have
It's built up on information and gives you some time perspective into it and some decision support rather The decisions that you then take is basically the actions that is supported by your knowledge
And so what you basically do is you take the data warehouse This is here where all the data resides The data warehouse And to get from here to there you use analytic tools So this is where OLAP resides
And this is where data mining algorithms reside Working on the base data and generating information from that And this is basically a manual step A manual step putting together information into knowledge
which is basically the work of analysts That just goes through all the information and then build strategies from it and find out what you should do to improve the processes of your company or to increase revenue of your company
or to increase the happiness of your customers or increase the creativity of your workforce or whatever, it can be all kinds of stuff But it's securely founded by all the data that we have stored in our data warehouse
That's the basic idea Okay With that we're going to store again, I see Same old problems every year There we go
Come on Business intelligence, hello So what are typical applications for business intelligence? One very important thing is segmentation Mostly customer segmentation or product segmentation where you find different groups of customers
that you could address by certain advertising campaigns or whatever You find out how the market is actually segmented and whether your products cater for specific segments or for the market as a whole
And then you have decisions like Well, if I don't cover the whole market yet, why don't I address some other segments or something like that Then there's the propensity to buy So who are the customers that are willing to buy, that have the money, that are willing to spend the money
if you just give them the right idea what they want Also has something to do with advertising Profitability of course Is it worthwhile catering for certain customers or are these customers demanding extremely high quality products for very cheap prices
So your revenues are actually marginal and eaten away by the taxes anyway So you might not be willing to cater for these customers anymore Very central thing, fraud detection
Think about a credit card company or something like that that needs to find out whether the transactions taken on a credit card are possible Figure for example where the card was used and you find out that the card was used in three hours in three different countries
Maybe that is possible It's very improbable So with these things and typical data mining algorithm you can detect fraud or at least you can generate candidates for fraud detection
that you then can look over and find out whether this really was fraud Same thing, customer attrition So are the customers leaving Are things slowing down And the channel optimization
So how do you get the product to the customer Are you addressing the segments through the right channels So probably emails is not a very good way of advertising anymore because everybody has spam filters these days Maybe making some postal things would be much nicer
but you can't afford it so much and so on So find out what your customers really want Customer segmentation as I just said This is basically about the market segments So are they young people, are they older people
Does your product cater for them all? You might need some flashy products for the younger people and some products that are very reliable for the older people or something like that. Also the personalization of customer relationships, which is usually referred to as customer relationship management,
or you will all have heard of CRM tools that are de rigueur in industry today, which just means customer relationship management. So follow up on purchases, find out whether the customer is satisfied with the product
Maybe you can do something to make him even more satisfied You can handle warranty claims and stuff like that over customer relationship management Very important part of these days And of course via customer relationship management and data mining you can find out whether somebody always uses the warranty
or hardly ever uses the warranty and then in fulfilling the claims if somebody hardly ever uses the warranty well why would you care, just fulfill the claim but if somebody always uses the warranty
and has returned some product five times already there might be something wrong So this is what you find out via customer relationship management Second thing, propensity to buy Which customers are most likely to respond to a promotion How can I target certain customer groups
especially campaign probabilities of your planning, advertising campaigns Where should I put television commercials If I have certain customer groups it might be more interesting to put my advertisement in the evening program If I want to cater for kids and they should go to their parents nagging about what they want for Christmas
I want this new product from Tito's wonderful company Then put it during Spongebob or whatever So the children's television program Profitability What is the lifetime profitability of my customer
Does it pay off to send him or her a Christmas card Are they going to buy some car in due course or did they just buy a car and I couldn't be interested less the next five years they will definitely buy no new car or something like that That's kind of interesting to follow up on customers
and find out about the overall profitability So you can really focus on the profitable customers It's very often done in banks these days So you have these key account management where really if some customers are especially profitable
you open up a key account for them and have some people that exactly cater for this customer. Very important in today's business and in today's industry. Fraud detection, as I said: what transactions are likely to be fraudulent? So credit card companies use a lot of that
and have very clever ways these days to find out what is actually fraud and what is a normal use of a card Has also something to do with a little bit of What was I going to say
Outlier analysis and profiling So find out what somebody normally does and then look for the outliers that are not typical for his or her profile These outliers could be fraud You hear the messages on the news every five minutes
Be it the Nigeria scam or things like here 400,000 pound insurance scam that can be found out and the good thing about data mining algorithms or business intelligence algorithms and that is that they work automatically
So you can run them overnight and react very quickly to what's happening sometimes while it's happening but at least immediately after it has happened and that could cut your losses and could make things easily addressable
also for police and stuff Customer attrition Channel optimization So what customers should I take care for because they are bound to leave or they are kind of like behaving in a way that seems they are considering other alternatives
to my product or to my services and the idea is to prevent the loss of high value customers and if somebody is not profitable anyway well your competition is absolutely welcome to those customers of course Channel optimization
whether you do TV advertisements, email spamming or like fanciful banners in the city center or whatever you do it has to fit to the customer, it has to fit the product and this is what you find out via business intelligence
So what you basically do is you have the data warehouse at the center of your company you have some data sources which is basically productive systems but can also be customer relationship management systems or key account systems
and the ETL process ETL is basically bringing all that data into your data warehouse and from this data you have the business users mainly analysts that work with the data, that work on the data
and the most important point here are dashboards where you can see things that are automatically generated some things are reports like we did with our OLAP queries but more often than not you will find that there are some data mining algorithms that give you some hint what is going on
or what is interesting, what are trends you should look at and what are trends that you can safely forget about because they are not interesting or they are not and finally it goes to the managers or to the decision makers
that will develop the strategies or adopt strategies as shown by the analysts and the interesting thing is here in the future component intelligent systems, so this is where mainly data mining algorithms work on their own directly on the data warehouse and also deliver the data what is normal, what is exciting, what is new
to the user interface so the analysts can really see what is happening The one thing to find out what's happening is automated decision tools
so for example fraud detection mainly happens automatically very often you don't have a human in the loop anymore but the algorithms run and immediately when your credit card seems fishy
you get a letter, your credit card or the account has been closed or has been at least put on ice at the moment did you really do these three transactions that look fishy to us? and then you respond yes I was buying something at Greenland yesterday or something
or you respond no definitely that was not me and then the process is invoked for credit card misuse but all this closing down the account or freezing the account
sending out the letter for the potential fraud victim all that can be done automatic and this is kind of interesting you have just rule based systems that have a certain solution if the case arises
fraud or possible fraud, probable fraud at this point you don't know whether it is fraud or whether it will be perfectly harmless but better safe than sorry so you should really consider freezing the account at a very early stage
before all the money is gone and then the customer says well I wasn't that and you should have detected it so you have to be reliable for the revenue same goes for loan approvals
also that is mainly automatically these days you just hand in what you earn and what you want and what you basically have in terms of securities and then the decision can be made automatically same goes for business performance management so just getting the indicators of how your business is running
is a very important thing mostly today based on so called balanced scorecards which is an economic system that shows you the key performance indicators
of your business and that can be done automatically that can be performed in an automated way directly on the data in your data warehouse and then you get the balanced scorecards and you can see what performance indicators are up and running
and what performance indicator should need a little bit of attention very good one of the possibilities to get this to the people, to the decision makers are so called dashboards or business cockpits
because they look like the dashboard in your car a little bit you may have some meters showing whether you are in the green area or whether you are in the red area with your revenues or what you gained or some regression analysis
showing whether the trend is upwards or downwards or stagnation at the moment and you can see all these pictures and derive the information and the knowledge from it that you might need to find out what is happening in your business well, so how to do that?
That's the kind of interesting thing How do we get all this wonderful data, all this wonderful information that we want to visualize in our dashboard or in our business cockpit? The answer of course is data mining also known as knowledge discovery in databases
there was an early term for that the life sciences then thought knowledge discovery in databases sounds a little bit too bizarre let's just call it data mining like you go into the mine and you pick out all the data all the information that you need from your data mountain
kind of interesting and this is exactly what it is you find the interesting information or patterns that occur in your database given that the database is so large that you can't do it by hand
that you just don't see it I mean if I can look through my account books it's very easy to find out whether I've been profitable or not if I'm a multinational company with thousands of holdings it's not so easy anymore
and you have to look very closely what you need for having interesting information is that it is non-trivial it's obvious that when you earn a lot of money that things are good I don't need somebody to tell me that
but maybe somebody should tell me about a certain segment of customers being more and more dissatisfied with my services because that is not trivial it's mostly implicit so I mean it's not just a single number but it's information that puts a lot of numbers together
that are in some factual connection with each other and it should of course be previously unknown because I don't like a system telling me what I already know more of the same I could have McKinsey for that but that is basically what I mean when I'm talking about interesting information
and of course there are other ways to get to that information so for example there's deductive query processing, expert systems, statistical programs that also run we have a different lecture this semester
which is about expert systems actually it's a knowledge-based systems lecture and there are a lot of techniques mostly rule-based or I don't know there are a lot of logical ways of addressing information of mining out interesting information
but that's not what we want we want purely statistical algorithms running over the data and then we want to find out what's interesting in our data the applications of data mining are then on one hand database analysis so find out what is in the data
and of course decision support find out what to do when something happens in the data and as in the case of business intelligence these are all applications that are valuable there are a lot of more applications especially in the data
in the life sciences there's a lot of data around they have to find some characteristics or some patterns in this data they use a lot of data mining other things go for text mining so there are a lot of products at the moment exploiting blog posts or finding out how people think about a certain product
by looking at posts on the internet or email analysis or whatever you see a lot of new services at Google every five minutes to find something out that you didn't previously know
so these are typical data mining applications. The one that we are focusing on is mainly market analysis: you want to know how the market is segmented, how your customers fit into the profiles you address with your products
you want to find purchasing patterns is it worthwhile to close down the shop over the weekend because nobody buys something over the weekend especially your product or whatever
do you just want to get into the Christmas business and forget about the rest of the year cross market analysis so what happens between different products how they are related so maybe up selling, down selling, cross selling
would be a good idea if a customer buys something maybe he wants the extension or something that fits very well to this product that would be a nice thing to have also summary information so again the topic is reporting
I want to know about what's happening in my business I want to have the reports and that of course includes all the statistics second thing is corporate analysis and risk management finance planning what is my company worth
what is the cash flow that's happening at the moment can I predict some interesting trends are the sales going up or down is some product running out or is some product rather getting hip at the moment also resource planning so what do I have to put into resources
do I need more of something or should I slow down production a little bit because the market doesn't take it at the moment also very good to know what your competitors are doing so how's their market
how's their segmentation how's their revenues and their sales that's kind of interesting pricing strategies of course what is the stuff worth that I produce which is not only dependent on what I pay for it but rather dependent on what the market is willing to give for it
there might be high quality product that the market is just not taking if they are too expensive and there might be some which happens very often actually some amazingly cheap product that the market is prepared to pay everything for so why not take it then
that's basically what you do the architecture of data mining systems is basically as I just showed you for the business intelligence you have the data warehouse that is basically built from the productive systems and resident data in the data warehouse
and on top of that you put a data mining engine that is the main component that runs all the algorithms for data mining that you are interested in finds the pattern and then you need something to evaluate the pattern very often this is rule based very often this is based on some constraints
that you can fiddle with which is part of knowledge base and then you have a graphical user interface to get the information across or build nice charts that you can show to your boss and say well here is a trend channel that is sloping down and we don't want that do we
we have to do something good the major parts in data mining that we will cover in these lectures are association we will do that today association means that
things may be correlated with each other because they are causal so if one thing depends on some other product that is bought or some other event that has happened then those should be considered together
obviously but how do I know which things are dependent on each other that is association mining second thing classification and prediction how do the segments fall apart
how can I predict which segment is stronger or is getting weaker it is very important to find models that kind of allow me to predict a lot of what is going on in my business and there are a lot of ways to present
all that information, so decision trees, classification rules, we will go into some of them during the next lectures and of course we will also look a little bit into the future with prediction algorithms and find out what will happen next. Cluster analysis is a very big
topic in data mining so you might have different classes and you have to label them you have to find out how to group your products or your services together that you can manage them more efficiently
for example produce similar product together or outsource a certain kind of service to some subcontractor or whatever same goes for the advertisement if I know what cluster of people are buying a certain product I can directly advertise and maybe if I want to address people
that are rich I don't go into the slums or the poor parts of the city and distribute leaflets, but I will just go to the hip suburbs where all these beautiful people live and try my advertising there
Then already talked about outlier mostly fraud detection and stuff like that something happens that is out of the normal maybe a black friday some market crash or something that is definitely out of the normal better detected early maybe some fraudulent use of a credit card
again better detected early this is rather useful if you have rare events So today's topic is association rule mining as I said associations between objects and the idea is
that whether objects or products or customers or whatever it is that you are looking at occur together there has to be some hidden relationship between them Why are exactly these five people buying that product?
What common idea do they have? And the classical application for that is so called market basket analysis very often done by the big vendors the big grocers so for example walmart
was one of the very big contenders of market basket analysis for what reason actually so what it actually does is you have a supermarket and you have your market basket you know like this little basket and you put in stuff and you buy that stuff
and then you find out for example some association rule: everybody who buys cheese also buys wine, or it's very probable that if somebody buys cheese he will also buy wine. It's obvious where the rule comes from, because very many people like to have wine and cheese
like a dessert or something like that and you will have a certain support for the rules so for example 10% of the customers buy cheese and wine together and 80% who buy cheese will also buy wine
this is a rule that is intuitively clear interesting thing why should I know? being walmart
special advertising my beer and wine no my cheese and wine special buy the things together and get 15% off or something like that other reasons?
silly as it sounds that's one of the major reasons planning where to put your products if I know there's a relationship between two products
it's much more probable that I get customers to buy exactly these two products or to look at promotions that I'm doing on these products if I put them together very close to each other because then people will think well here I bought the wine oh there's cheese wonderful that goes very well together
also grab some cheese and this is one of the reasons I mean with cheese and wine it's kind of a classical example that is very obvious but for example walmart has found out that nappies so pampers and stuff like that are very often bought together with
exactly with beer six-packs it's not obvious at all well if you think about it it's kind of like it's the daddies that go Saturday afternoon shopping
and then they have toddlers at home so they can't go out for a beer they have to care for their toddlers they have to stay home with the wife so I need a lot of beer well put good cold beer next to the pampers and you're fine because people will buy it so this is kind of like just getting things more efficient
let's go into the algorithms for association rule mining: how do you find out, given a large set of data, how do you find out efficiently and correctly what are the associations that are valid for your data. We have two basic components
one the items that I'm selling the products everything that is in my supermarket is a product so I1 is a product and I2 is a product and so on and two
what is bought and I consider this as a market basket as a transaction so everybody takes his market basket or his little shopping carts puts everything on the till and all the stuff that is bought together
by a single customer is one such transaction the data is very easy to get because all the supermarkets now have electronic cash registers and scanners for everything that is bought
so at the end of the day you know exactly that there was a customer. You don't care about the identity of the customer, and as long as the customer does not use Payback cards or Deutschland cards
or how it's called you know like these profiling tools you know that some customer whoever he may be bought these items together today an association rule is an implication where you say
somebody who bought these items also bought these items and they should be different I mean I don't want tautologies somebody who bought wine also bought wine haha that's not very novel but typical things are somebody who bought wine also bought cheese
somebody who bought pampers also bought beer this is what we are trying to find out and if we take the items sold in the store I1 might be beef I2 chicken I3 cheese and so on
everything that you sell and then you take you look into the basket of people at the cash register and say well there was one guy who bought beef, chicken and milk together and there was a second guy who bought beef and cheese and the third one bought cheese and wine and so on ok?
that's the basic idea behind it if somebody buys beef and chicken then it's very probable that he also will buy milk rules of that type allow me to put the fridge with the milk next to the beef and chicken
good of course I could make thousands of such rules because people will buy anything they need and I might come up with oh this guy bought shoe polish or something like that
bought wine so wine and shoe polish do have something in common there must be a hidden connection well the hidden connection might have been that he just needed both things which can always happen so the rules can be rather weak like the shoe polish and wine example or they can be strong
and what makes them strong? basically that it occurs very often that people buy these things and if I consider something like shoe polish people will buy that when they need it when they need a new can doesn't really matter what they bought together
it's just that every two months you will buy a box of shoeshine and that's about it there are two basic measures for the strength of a rule which is the support
and the confidence in the rule that you have the support deals with the data how strong is the data? and the confidence deals with the semantics support basically means does it happen very often
that people buy shoe polish or that people buy wine or is it just a one-off I have this one guy I've never sold shoe polish in my shop and there comes one guy and he buys shoe polish and a pineapple wow, there's a new rule
no, obviously there's no new rule it has to happen a certain amount of times that people buy shoe polish until I can state some sensible information about what is often bought together with shoe polish okay? that is basically the support so the support of a rule is the percentage
of transactions that contain both parts of the rule okay? then I can say they happen together basically this is the probability that all these things occur in a single transaction
how often does it happen? well the support for some rule is just you take all the items in X and in Y and you count the number of transactions
where all these items in X and Y are bought together and you divide it by the number of transactions that you did during the whole day or that you did in your shop okay? then you find out how often do people buy this product
normalized by how often people buy in your shop anyway and you find out whether something is interesting or not the confidence of some rule deals with the semantics of X is somehow the cause of buying Y
so if something, if somebody buys cheese very probable that he also buys wine how about the other way around? somebody buys wine
is he also bound to buy cheese? well it could differ, couldn't it? they could be bought together very often but one might be the cause for the other so if I buy wine, cheese would go very well with it but there might be a lot of anti-alcoholics
that don't like wine at all but rather like cheese so they will buy cheese and not buy wine so the rule seems to work rather the way somebody who buys wine also is bound to buy cheese then somebody who buys cheese also is bound to buy wine
how do I find out? I count the occurrences where wine and cheese have been bought together and look how many times one of the products
has been bought individually okay? so if somebody buys if I wanted to state the rule if somebody buys cheese he is bound to buy wine I count the number of times wine and cheese were bought together
and divide them by the number that cheese was bought with or without wine then I found out how often is it really that they are both bought together or that one of the things is bought individually okay? so the confidence of the rule is basically
take the number of times they are bought together, so basically the support, and divide it by the number of times that the cause has been bought at all yes? clear?
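Just to make those two measures concrete, here is a minimal Python sketch, not from the lecture and with purely illustrative item names, that counts support and confidence for a rule X => Y over a list of transactions:

# Minimal sketch: support and confidence of a rule X => Y,
# where each transaction is a set of items.

def support(X, Y, transactions):
    # fraction of transactions containing every item of X and Y together
    both = sum(1 for t in transactions if X | Y <= t)
    return both / len(transactions)

def confidence(X, Y, transactions):
    # among the transactions containing the body X,
    # the fraction that also contains the head Y
    body = sum(1 for t in transactions if X <= t)
    both = sum(1 for t in transactions if X | Y <= t)
    return both / body if body else 0.0

# Illustrative toy data, not the lecture's numbers:
transactions = [{"cheese", "wine"}, {"cheese", "bread"},
                {"wine", "cheese"}, {"shoe polish", "pineapple"}]
print(support({"cheese"}, {"wine"}, transactions))     # 0.5
print(confidence({"cheese"}, {"wine"}, transactions))  # roughly 0.67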
good well, the rationale behind support and confidence is basically that if the support is too low we have no way of saying this is a typical rule this really has statistical relevance but it may be just like our shoeshine and pineapple idea
you know that happened once and nobody ever bought shoeshine again then this is not a rule, this is stupid so low support should be avoided same goes for low confidence if the confidence is low
and people buy anything with cheese then putting the wine next to the cheese is as random as putting shoeshine next to the cheese it doesn't help you because you don't have the confidence in your rule obviously
the association rule mining as an algorithm now has to discover all associations in the number of transactions that you have with a minimum support and a minimum confidence ok? so you set some values and say well I'm only interested in products that make 10% of my business
I'm not considering shoeshine get me to the big players but I only want rules that are reasonably believable maybe have a confidence over 80% or something
well that's what we do so let's try it I have my number of transactions here 7 customers came in today the first one bought beef, chicken and milk second one bought beef and cheese third one bought cheese and boots
and now you see it's not too easy to see all the associations in these transactions is it? because it could be possible that somebody who buys beef always also buys chicken and milk one possible rule
it could also be possible that somebody who buys beef and chicken always buys milk how about somebody who buys beef and milk always buys chicken there are lots of possibilities and I could put all my items that were sold together in any arbitrary order
that is an exponential explosion of possible rules so what do I do? I calculate the support and the confidence for every one of these combinations madness
I have 5 things here, cheese, boots, beef, chicken and milk, and with clothes 6 things here. Walmart has 20,000 in a single supermarket. Try building all the rules out of 20,000 objects
madness so let's consider we have a minimum support and a minimum confidence we are only interested in things that we sell more than 30% of our sales
and we are only interested in rules that have a minimum confidence of 80% so they are rather believable. So one thing would be a rule I just invented: whoever buys chicken and clothes also buys milk. How often does that occur? well how often do chicken, clothes and milk
occur together in my transactions? once, twice, three times. This is only chicken, this is only milk, this is chicken and clothes, and that's it, so these do not count
because they don't contain all the objects that I'm interested in, all the items, but these three do count and I have 3 out of 7 transactions dealing with it. 3 out of 7 is about 43%
which is definitely higher than my minimum support of 30% ok? so this obviously beats the minimum support. What is my confidence? well the rule states that whoever bought chicken and clothes will also buy milk so let's find out
who buys chicken and clothes: chicken and clothes here, here, here, and that's it ok? in how many of these cases
where somebody bought chicken and clothes did he also buy milk? 1, 2, 3, so 3 of 3 ok? this is 100% confidence. it never happened
that somebody bought chicken and clothes without buying milk ok? so the rule seems to be causal. it is not, because I invented it, but still it has a confidence of 100% and a support of about 43%
for my little supermarket here ok? everybody clear? good. now we could have all kinds of rules: somebody who buys clothes also buys milk and chicken, somebody who buys milk also buys clothes and chicken, blah blah blah, I can make thousands of these rules. how do I find out efficiently
what association rules are above the minimum support and what association rules beat the minimum confidence any way somebody can think of?
now if you had the idea you would have invented one of the most prestigious algorithms in data mining a very valid algorithm we will see the algorithm in a minute and it's actually quite simple once you have seen it
actually very believable of course this kind of association rule mining is a rather simplistic view of the shopping basket so we are not considering the quantity in which things are combined or what price is paid
so we would be affected by some special offers where you say ok this is amazingly cheap today so it will be bought with everything this week or something like that there are lots of problems with the basic thing but once you define your transactions and your items in the right way
you will find that it is definitely valid and it doesn't matter how you mine out these rules you can always compute the minimum confidence and the minimum support so it doesn't matter what algorithms you use
the rules are the same so there is a certain set of rules that is above minimum support and minimum confidence. Yeah, of the major algorithms we will look at these two today:
the Apriori algorithm and mining with multiple minimum supports, and there are also so-called class association rules, which we will go into very briefly in the detour at the end. But the best known algorithm of all definitely is the Apriori algorithm
a very efficient way of getting things done properly it basically consists of two steps the first one is so called frequent item set mining so I have to find out all the item sets
that are above the minimum support and the second step is generating candidate rules that have a chance of being above the confidence of being above the minimum confidence
from these frequent item sets and these are the candidate rules that will be generated and only these we will not look at all rules that are possible and still get the correct result and that is a rule and the basic idea is that when somebody has a frequent item
that is above the minimum support all the items contained in the set will have to be frequent themselves so if I have a minimum support of
I don't know 10% or something then I can say if milk, cheese and bread is a frequent item set then all of these three items individually
have to occur at least in the number of times they occur together on top of that they can occur individually but their support is definitely higher or equal
to the minimum support of the bigger set OK? Understood? So this is what is called a downward closure property any subset of a frequent item set is also frequent
a frequent item set. If I take chicken, clothes, milk, and I have chicken, clothes, milk here, here and here, then all parts of it, like chicken and clothes, chicken and milk, clothes and milk, occur at least in these three instances and can
occur some more. Okay? So this occurs three times. What about the others? How
about chicken and milk? Chicken, milk; chicken, milk; chicken, milk; chicken, milk. It occurs in those three instances but even occurs a fourth time without the clothes. But it
definitely occurs at least as many times as the big set. Okay? And this property is amazingly useful because it allows us to steer the candidate
generation from frequent item sets of low cardinality to frequent item sets of high cardinality. And that means we won't have to try all the different rules
that are possible but only those that definitely have a frequent item set that works. So how do we find frequent items? We start by looking at one element sets. Okay? Once we have found all one element sets, how can two
element sets be built? Well basically by putting together one element sets that are frequent. Because if part of it would be not frequent, the downward closure property wouldn't hold. Okay? So the bigger set cannot be frequent.
That's the basic idea. For each iteration, I look at the generated candidates of the last iteration and put them together such that one more
new item gets into the set. That means I go from a k-1 frequent set to a k item frequent set. Good? And as a brief optimization, I will assume that the items are sorted in a
lexicographic order. So I have a topological sorting of the items such that I can find out what, well, very easily find out what is the intersection of item sets. Because if they would be kind of like all twisted
round, I will always have to look when comparing two item sets what is contained in both of them. If I order them alphabetically or lexicographically, I will just have to look at the beginning and once they start diverging
then they cannot have a large intersection anymore. Good! Let's start with finding all the item sets of size 1, which basically is all products that are sold more often than the minimum support. Easy! Just go through your
transaction list and find them. Then I need, the next step, I need to build two-dimensional, three-dimensional, four-dimensional, five-dimensional frequent item sets. How do I do that? Well, I take the one-element item sets, put them
together to two-element item sets. This is a candidate. I still have to check in the data whether this really is a two-element item set. Because, consider,
I have an item I1 and an item I2, and say I1 occurs in two transactions and I2 occurs in two other transactions. Okay, and these are my transactions. So, it doesn't really matter what they were bought together with, you
know, like could have been bought with all different things. Consider now I have a minimum support of two. And both I1 and I2 are frequent items. But is I1 and I2 a frequent item set? In my little example here. Is it? Exactly! It's not.
Both items individually fulfill the minimum support, but the intersection in the transactions is empty. So, they are never bought together. Good! So, we find
the candidates that are possible and now we have to see which of them are actually valid. How do we find the candidates? This is the join step. We put together all the sets of cardinality k minus 1 such that one new
item enters the set. This is a very efficient thing to do because we have ordered the things lexicographically. So, they have to have the same head and in the end one single item should be differing. Because if I take those two
sets and put them together, then I get a set of cardinality k. Ok, the head has to be the same. The last element may differ. Putting these two together
results in, of course, the same head. But then two different items here. Again, sorted lexicographically and the cardinality of ik is k. Ok? Everybody sees that? Very
efficient to do. And that is important for this algorithm, of course. Good! Now I have a candidate. I have to look whether this candidate is valid. Just go through the data and look if it fulfills the minimum support. Ok? So, all the candidates that do not
respect the downward closure property get pruned. What does it mean to respect it? That all of the candidate's subsets are frequent item sets. Do I know the subsets? Well, yes. Because in the last
step I calculated the k-1 frequent sets. If there is a k-1 subset of any of my candidates that have k products which is not among those frequent k-1 sets, then I can immediately prune it. I can
immediately cut it out. Because then the downward closure property would be violated. Clear? Let's consider an example. It makes it easier to see. So, assume we have worked our way up to
three item sets. And these item sets are 123, 124, 134, 135 and 234. And now I want to put together four item sets. So, I want to go from F3 to F4. I have to join the candidates. How do I join candidates?
Well, I look at the first items and those with the same first items can be joined together because having different items in the last place would create a
new item set with four entries. Okay? So, what I have to do, I start here and look
what tuples could this one be joined with? This is 1, 2. This is 1, 2. So, definitely here's a join. This is 1, 2. This is 1, 3. No join. 1, 3. 2, 3. No join. Okay. So, those cannot be joined with it. Putting the first two together results in a four-tuple set. Clear? Good. Let's go to the next.
1, 2, 4. 1, 2. No, no, no. Doesn't create something new. Next, 1, 3. Oh yes. 1, 3 here. 1, 3 here.
So, this creates 1, 3, 4, 5. Okay? The next one, 1, 3, 5, doesn't create anything new, it was already used in that join. And 2, 3, 4 cannot be joined with anything because there's no second set starting with 2, 3. Okay? So, I get two
new candidates from my five 3-item sets. Right? And those are the only candidates that could be frequent item sets. There could be no frequent item set, I don't know, having a 6 in it or
something like that because 6 is not a frequent item. Good? Well, now I have to see whether those actually do the trick. Well, how can I prove they do the trick? I look at all the
possibilities to get three numbers out of them. So, I could have 1, 2, 3. 1, 2, 4. 1, 3, 4. 2, 3, 4.
These are all three element sets that are part of the four element sets. Right? And I have them here. Do they exist? Are they frequent items? Well, let's look about it. 1, 2, 3. 1, 2, 4. 1, 3, 4. 2, 3, 4.
We have them here. And for your convenience, I put them here. Okay. 1, 2, 3. Got it. 1, 2, 4. Got it. 1, 3, 4. Got it. 2, 3, 4. Got it. All the subsets of my bigger frequent item set are itself frequent item sets.
The downward closure property holds, so this candidate survives the pruning. Let's look at our second candidate. 1, 3, 4. 1, 3, 5. 1, 4, 5. 3, 4, 5. Those are all its three-element subsets. Look at it. 1, 3, 4 is there. 1, 3, 5 is there.
The other two, 1, 4, 5 and 3, 4, 5, are not there. They are not frequent item sets. The downward closure property is violated.
Prune it. It's not a valid candidate. Okay? So the candidate list for 4-item frequent item sets contains just a single candidate, 1, 2, 3, 4.
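As a small aside, the join and prune step we just walked through fits into a few lines of Python. This is only a sketch of the idea, assuming item sets are kept as lexicographically sorted tuples; it is not the lecture's own code:

from itertools import combinations

def generate_candidates(frequent_k_minus_1):
    """Join frequent (k-1)-item sets that share the same prefix,
    then prune candidates that have an infrequent (k-1)-subset."""
    prev = sorted(frequent_k_minus_1)      # item sets as sorted tuples
    prev_set = set(prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            # join step: same head (all but the last item), last item differs
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # prune step: every (k-1)-subset must itself be frequent
                if all(sub in prev_set for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

# the five frequent 3-item sets from the example
F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(generate_candidates(F3))   # [(1, 2, 3, 4)], the candidate (1, 3, 4, 5) gets pruned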
Of course, if we do that, we have to scan through the list time and again. Let's look at an example with a minimum support of 0.5. We have four transactions here, which means I need two occurrences to make 50%. Okay?
Well, let's look at the individual items. Here's item 1. Here's item 1. So I note two occurrences, which makes the minimum support of 50%. Item 2. Here's item 2. Here's item 2. Here's item 2. So I note 3 for item 2.
Minimum support of 50% done. Item 3. Here's item 3. Here's item 3. Here's item 3.
Note 3. Now it's kind of getting bizarre. Item 4. Oh, only one occurrence. Good. Item 5. 1, 2, 3. Okay? Now, these are the candidates which actually have the minimum support of 50%.
Everything that occurs two times or more. So the 4 is cut out. Clear? Good.
Now, I have to join them together. Well, joining one-element sets to two-element sets is very easy because they all fit together, they all have the same (empty) beginning: 1, 2. 1, 3. 1, 5. 2, 3. 2, 5. 3, 5.
The 4 does not occur anymore because it in itself is not a frequent item set. I'm actually closing down the space here, the search space, still getting the correct result. Makes the algorithm efficient. Good. Now I have to look at what actually happens.
Well, in this case, everything belongs to F1, so there's nothing pruned yet, because all these things are frequent item sets. I don't have the 4 in it. And now I have to scan through the things, which of these candidates actually have a minimum support
of 50%. It's all possible that they have a minimum support of 50%, but where does it actually occur? So let's look at 1, 2. 1, 2 happens once. OK. Note a 1. 1, 3. Here's 1, 3. Here's 1, 3. Happens twice.
1, 5. Here's 1, 5. Happens only once. OK. You see how it's working. Calculate the minimum support. You need 2 or more.
1, 5. Definitely not in the game. 1, 2. Definitely not in the game. All the others are frequent item sets of size 2. Now, how do we join them to item sets of size 3? Well, can I join something with 1? No, I don't have 1 anymore.
OK. Can I join anything with a 2? Yes. Here's a second 2. Join them. 2, 3, 5. OK. Can I join anything with a 3? No, there's no 3 to join. Good.
Only single candidate for 3 items, frequent item sets that I have to look through. OK. How about the subsets? 2, 3, 5. 2, 3 is here. 3, 5. Oh, yeah. 2, 3 is here.
3, 5 is here. 2, 3, 3, 5. 2, 5 is here. OK. Or there. No need to prune it. Is it really supporting? Well, let's look about it.
We have 2, 3, 5. It's a valid candidate, as we said. 2, 3, 5 happens here. Happens here. That's it. Two occurrences. Support of 50% holds. Definitely is in the set.
Can we join that with something? Well, no. We don't have anything anymore. I considered 5 one-element item sets. I considered 5 two-element item sets. I considered a single three-element item set.
That's much less as the combination of the 5 items I have here and all the subsets you can gain from that. OK? Very easy, very efficient. Everybody clear about the step 1 of the a priori algorithm?
Good. Then, let's make it a little bit more difficult. No, not at all. Let's go to step 2. In step 2, we will generate rules from these frequent item sets and evaluate the confidence that we have.
Well, the frequent item sets have to be distributed because a frequent item set is kind of 1, 2, 3. An association rule looks like who bought 1 and 2 also bought 3.
So there are several ways to split the items in the frequent item set to gain association rules. So what we basically do is we distribute the items in the frequent item set
such that we need a support. How often do the items occur together divided by how many transactions do I have? We need a confidence, which is basically how often do the items occur together
divided by how often does the body of the rule occur on its own because if it occurs on its own, it's not the cause for y. And this is what our confidence wants to measure.
Basically, this is support y by support x. This is the confidence that we want to calculate. So what can we do? We have frequent item set 2, 3, 5 and it has a support of 50% as we calculated. What does it have in terms of subsets?
We have 2, 3, 2, 5, 3, 5, 2, 3, 5. All of those could be the body of the rule and the head of the rule would follow from what is missing. If the body of the rule is 2, 3, then of course the 5 is missing
would be put in the head of the rule. Good.
We can calculate the supports for all the different things like we did in the last step. So for example, 2, 3 has a support of 50%, 2, 5 has a support of 75% and so on. And now we can generate all the rules that are possible.
2, 3 makes 5, 2, 5 makes 3, 3, 5 makes 2, 2 makes 3 and 5, 3 makes 2 and 5 and 5 makes 3 and 2. And now we need to know the confidence. The confidence is basically the occurrence of all items divided by the occurrence of the items that make the body of the rule.
Easy to calculate. So for example, how often are 2, 3 and 5 bought together? 2, 3 and 5, 2, 3 and 5, 2 times. How often are 2 and 3 bought together?
Also 2 times. Meaning we have confidence of 100% that this rule is correct. 2 and 3 are never bought unless also 5 is bought.
Might be a cause. Very high confidence. 100%. Do this for all the others and we find we have actually two rules with 100% and all the other rules have a confidence of 2, 3 which for our toy example here is very high.
Usually you won't get so good rules in an association of rule mining. The confidence is usually much lower. And the support for all the rules is kind of the same of course because it's the same set. We're generated from this single set.
Clear? Easy to calculate. You make that for all the different sets that you have, all the frequent item sets, you get your association rules only considering a minimal number of candidates.
Good? Break? Okay, let's reconvene at 5 past. So, let's cut to the chase.
To summarize the algorithm, we want association rules of that type and we need to know the support for the body of the rule
and the support for the whole frequent item set. And of course we already got these values when calculating the frequent item set. So testing the confidence is not a problem
and the support we know anyway. For every step of the algorithm, for every extension of the frequent item sets, we just made one pass through the data.
So if our largest item set is of size k, we pass over the data only k times. Given that there's exponentially many association rules, that's quite efficient.
It can be done in linear time. It's kind of nice. The mining that we're using exploits the sparsums of data and high minimum support and minimum confidence thresholds makes it especially rare because the frequent item sets will break down nicely
and the candidate generation can be restricted to only a few ones. On the other hand, it's kind of interesting sometimes to focus on rare items. So what about the shoe polish? I will never get any rules with shoe polish because it's not bought too often in the supermarket,
as least when compared to milk or to yogurt or whatever it is that people buy every day in the supermarket. They buy shoe polish once a month or something. That's different. So all my rules will be considering milk and yogurt and bread and whatever
and no rule will consider shoe polish, shoe shine because it's just a rare item. What can we do about that? That is a variant of the ARPURA algorithm. So basically it's mining with multiple minimal supports.
So you say, well, the minimum support is not bound to the set of items in my shop. I don't have a global minimum support but for every product I have a special minimum support that reflects how often this product as a whole is bought.
Good. Sounds easy, poses some problems. Well, the basic point is that it really introduces the rare items into the association rules. If you have, for example, cooking pans or frying pans or something like that
and you are sure that they are bought much less frequently in your supermarket than bread or milk, just lower the minimum support for these products and they have a chance of getting into the frequent item sets and then you're done.
This is exactly the rare item problem. If the minimum support is set to high, you will never find rules involving this. If you need a rule that involves both frequent and rare items and you use a global minimum support,
then you would have to set it very low, generating a lot of rules about bread, sugar, milk, blah, you know, like everything because everything is over that minimum support basically. It doesn't help you either. A global support cannot do the trick. You need individual support. That is what I'm stating here.
If you have a minimal support for each item, then you need to find out what the rules are. Making a rule of shoeshine and bread is not very reasonable because one is a frequently bought item that will very often occur without the shoeshine.
The confidence of rules like that is definitely very low. You don't want them. How about things that are also rare?
Well, that could work. So maybe people buy shoeshine with laces or with shoes, yes, right. Very rarely, but the confidence could be very high. So what you basically do, you restrict the frequent item sets to those where objects
have not too far diverging minimal supports. You don't want bread and shoeshine, but you might want shoeshine and shoes. Or you might want shoeshine and caviar.
Both things are very rarely bought. They could have something to do with each other. Well, probably don't, but still. Both rare items and this is what you do. You basically look at the maximum support for any item in your set
and you look at the minimum support for any item in your set and if they diverge too much, if you have one that has a minimum support of 50% and one that has a minimum support of 10%, just don't consider it. It's not a sensible rule.
Good. This is basically what you do. Once you have chosen your frequent item sets, every item in the frequent item set may have a different minimum support. What do you do for the rules that you generate?
The basic idea is take the minimum of support values that you have and that is the support value for the whole rule. So for example, if you have the user-specified minimum values for bread, shoes and clothes, 2%, 0.1%, 0.2%
and we consider it works, you know, like 2% is not too far from 0.1% or 0.2%, it doesn't diverge too much, then I could have a rule, everybody who buys clothes also will buy bread,
has support of 0.15 and a confidence of 70%, but how about the minimum support? Clothes and bread have 0.2% and 2%. This is too low.
This rule doesn't get the minimum support value. On the other hand, if you have clothes and shoes taken from the same frequent item set, so the support and the confidence are the same, but this time the 0.15 work because it's larger than 0.1.
Clear? Depending on what items are in your rule, you define the minimum support for the rule as the minimum minimum support of all the items in the rule. Good. The problem with multiple minimal supports
is that the downward closure property breaks. And why is this? Consider we have four items in the database, one, two, three, four, and have different minimum supports, 10%, 20%, 5%, 6%.
Then we could again join them, so for example one and two has a support of 9%, the minimum of 10% and 20% is 10%, so this is not a frequent item anymore,
it doesn't cut the mustard. What is with one, two, three? Well, if we take one, two, three, we get 10%, 20%, and 5%.
Now the minimal support for this set is 5%, not 10% as was the case here. So this set has a minimal support of 10%, this has a min-sub of 5%.
Okay? What I know from the downward closure property is that the frequency of one, two, three
definitely is smaller or equal than the frequency of one, two. But now that I also have a smaller min-support, it could be sufficient. And if we wouldn't have created the one, two,
we would never have thought of the one, two, three with our a priori algorithm. Okay? We have to work on that. That's a little bit difficult. Anyway, basically we will this time sort all the items again,
but not according to their lexicographic order, but according to their minimal support values. So we have total order of items. The smallest minimal support starts and then the minimal support grows.
What is the idea behind that? If we do so, and then take the beginnings of the list of items and frequent item sets, the minimum will be the same. Okay? It can only happen with one item that is taken out
that will change it. The minimal support algorithm is a straightforward extension of the a priori algorithm. Again, we have as a step one frequent item set generation. Only that the candidates are now generated
with respect to the new multiple minimal supports. And we need a slightly different pruning step because we still have to consider some of the sets that don't make the minimal support.
But maybe they could be useful in some other sets, in some bigger sets where they will meet the minimal support. And step two, the rule generation works exactly as in the a priori algorithm. So what do we do?
For the frequent item set generation, we take the set and we take the minimal supports for each. 10%, 20%, 5%, 6%. Now we order the item such starting with the lowest minimal support
going to the highest. So 5% here, item 3 is the first one, item 4 is the second one, item 1, 10%, and item 2, the 20% comes last. Okay? Just ordered. Nothing happened.
Now I scan the data and count each item. 3 occurs 6 times, 4 occurs 3 times, 1 occurs 9 times, and let 2 occurs 25 times. So this is a very frequent item. Assume we have 100 transactions.
So the support for 2 is already 25%. Support for 3 is 6%, 4 is 3%, 1 is 9%. Good. Having said that, we go through the items and find the first item that needs its minimal support.
And this item is the seed for generating joint partners. Okay? Let's do it. So, we said start with 3 and the minimum support for item 3 is 5%,
but we have 6 occurrences in 100 transactions makes 6%. So, that is valid. Okay? Next one. 4. We have 3 occurrences in 100 transactions.
The minimum support for 4 was 6%. Argh! Doesn't work with 4. Okay. The 1. For the 1, we need 10%. We have 9.
So, it doesn't need its minimal support, but it needs the minimum support of the first item. So, it's a valid partner. It's not in itself a frequent item, but it's a valid partner together with a rare item.
Okay? So, we use it here to build a new frequent item. Okay? We do the same with the 2. Has a minimum support of 20%. Happens 25 times.
Not only beats its own support, but also specifically beats the 5% here. Good. So, these are 3 candidates. 3, 1, and 2. And now, we will find out that the item set 1
really has a support that is lower than itself and should be excluded. So, this is not a candidate for 1 element sets. So, if it's not a candidate for 1 element sets, why do I consider here the minimum support
of the first item that I started with? Well, because the 1 could be interesting for building higher order frequent item set, though in itself it is not a frequent item set. And how would it be interesting?
Well, I could combine it with the 3. It has 9%. The 3 has a minimum support of 5%. Okay? So, it's above the minimum support for the 3 if it's put together with it.
Right? This is why we keep it. Okay. Well, now we have to do 2 element sets. We have found out that item 2 and 3
meet their own minimum supports so they can be mixed with others and at least 1 is kind of like not a good candidate but still a valid candidate and 4 was definitely out of the race.
Okay? It doesn't even meet the lowest standards. Let us assume we don't want any rules where the spread is too far. Okay? Then we take the first evidence and now use the candidate set.
So 1, 2, 3 as elements and not those meeting their actual minimum supports because of the downward closure property. So we know the 4 is definitely out of the game.
Those 2 are good contenders and this one is not a frequent item set item but is a contender for joint. Okay? Still I don't have to join with the 4.
So, what do we do? We have the support of 3 which is obviously larger than the minimum support and we could form level 2 candidates with the 3. Two possibilities, 3, 1, 3, 2.
Okay? 3, 1 is a candidate. The support of 1 is larger than the minimum support and the support of 3 minus the support of 1 is smaller than the 5
because this has a support of 9. This has occurrence of 6. That's quite close together. Okay? Let's look at 3, 2. Well, 25 and 6.
It's very far apart. It differs by 10%, by more than 10%. So, cut it out. Okay? It doesn't have to do anything with the support or the minimum support or stuff like that.
It's only because of the divergence rule. But see that we have created a candidate matching all the minimum support requirements by some item that in itself is not a frequent item set.
Okay? Good. Understood. We do the things. We take the next seed from error. So, 1, 2 is the next probability. The support of 1 is smaller than the minimum support.
So, we cannot use that as a seed anyway. And after the candidate generation is completed, we have only a single candidate, 3, 1. Now we look over the data again, read the transaction list and find out that the support for 3, 1 is 6,
which is larger than the minimum support of 3 and 2. Minimum support of 3 was 5%. Okay? So, this is 5% here. And thus, this is a valid candidate. This is a valid frequent item set. Okay?
That's basically how it works. If we want to have bigger sets, it works exactly like before. But we have to look at this clause, the divergence clause. This is another check. Otherwise, we will do exactly the same thing, which means look at the front.
If the heads of the two lists are identical up to the last thing, these make the new joint partners. Okay? Good. And then we need the prune part.
So, now the point is really that we cannot go to the k-1 subsets to find out whether this is a valid set, because they might have been pruned away due to a higher minimum support. The downward closure property doesn't hold anymore.
Still, if we find all the subsets, everything's good. We're still valid. But it could be that the head of the rule, the first item having the smallest support, is missing in the rule.
That would increase the minimum support because we have ordered the list so that the smallest minimum support is in the front. And that would break it. So this is the exception that we definitely need here. Look at the head item.
If it does include the head item, then it has to be in the candidate list that has been calculated before of the smaller sets. If it does contain the first item in the list,
it has to be in the smaller candidates. And if it does not contain it, it does not have to be in the smaller candidates list. And so this is what we do. So for example, if we have a couple of three element sets here, and we join them,
here's 1-2, here's 1-2, makes 1-2-3-5. Here's 1-3, here's 1-3, makes 1-2-3-5. Here's 1-4, here's 1-4, makes 1-2-5-6. Only joins we can do.
Then, after pruning, we might get 1-2-3-5, 1-3-5-4-5. Let's try to figure out what's in the set. So we need to have all the sets that can be built out of
all the three element sets that can be built out of this four element set. 1-2-3 is there. 1-2-5 is there. 1-3-5, there.
And 2-3-5, 2-3-5, there. Or there, valid candidate. Let's look at the second one. 1-3-4, 1-3-4, let's do it in a different color. 1-3-4 is there.
1-3-4-5 is missing. Does it matter that it's missing? No, it does not matter, because 1-3-4-5 does not include the minimum support item. This is the one that presses down the minimum support.
So we don't need it in the list before. All the other things, this is 1-4-5. 1-4-5 is there. And 1-3-5, 1-3-5 is there.
So they are both there, and thus we have that. Then for the last one, 1-4-5-6. How can we build things out of that?
Let me get rid of my little drawings here. We can have 1-4-5 is there. We could have 4-5-6, doesn't have to be there because the first item is not in the set. We could have 1-5-6,
oops, that is missing, and the first item is included in the list. So the whole thing goes down the drain. Okay? Clear? Similar as before,
very easy. Just, you don't have to find those sets where the first item is not in the list, and that's all. Okay? Well, for the rule generation,
we found that the downward closure property is not valid anymore. That means that we have frequent k order items that contain k-1 non-frequent items.
So we may have built something whose subgroups are not frequent items in itself. And, well, that is a problem because we don't have the support values. We have to compute the support values for those.
It's not a problem, but we have to go over the data again. It's an efficiency problem, not the effectivity. Well, that is called the so-called head item problem, so we might end up with some sets where we have to look at supports
but don't have the support values because the set of smaller cardinality is not a frequent item set in itself. Well, easiest way to deal with that is just if something is missing, go over the data, find out what it is.
Of course, also a very inefficient way to deal with it. There are some other ways to there are some other ways to deal with it. We will come to that in a minute. So, to give you an example, if we have frequent item sets, shoe closes bread,
then we could have the minimum support is the minimum of all the three of them. So if we have bread closes shoes, then 0.1 definitely is the minimum. So this is here. Okay? Now we have, for example, closes bread, has a support of 0.15.
Shoe closes bread has a support of 0.12. And the minimum support holds. But what about closes bread? Well, for closes bread, the minimum here is 0.2.
So the minimum support for this set does not hold anymore. I don't have the confidence for that. All the rules I can build that have the closes and bread I cannot compute.
Okay? They might be true. I just cannot compute them without accessing the data. Good? Clear? Okay. Not too difficult. So this is what I call the
the head item problem. Basically I can calculate the support of some item or the confidence in some rules and in some I cannot do it without reading the data again. well, basically there is a
possible solution that at least heightens the probability without reading the data again. If I take the non all the probabilities where one non-frequent item set
is taken out. So if I take those and take out one non-frequent item set then I will get to this and that was the probability that I didn't record in the first call. But if I record all the probabilities also for those items it's not guaranteed that I get
all the things I need but it's probable. So also calculating them when you run through the data is a good idea. Anyway Good! Advantages of multiple minimal support.
It's very realistic for practical application because bread and shoeshine are sold together but are not the same thing. One is a convenience article that is bought every day. The other is an article that is bought a luxury article basically that is bought once a month or something. You cannot
deal with them with a global threshold. If we use multiple minimal support and consider the divergence we will not end up with all rules you know like people like bread because they like shoeshine or something like that. But we rather have rules of
the type people buy laces because they buy shoeshine because they think at that point that they also need new laces or something like that. So this is kind of like rare item rules and that is what we set out to do.
Basically if we set the support values to 100% we can prevent items to occur in rules anyway. Also a nice idea. I am definitely not interested in anything that has something to do with bread. Well I just ask for a minimum support
for bread of 100%. If there is a single transaction where bread was not bought that is it for the bread. Okay. So I can effectively exclude items from frequent items. Maybe a good idea.
Yeah and that is it for today in terms of algorithms. Yes. This is basically how how many it is kind of like a steering wheel setting the
minimal support for the granularity of rules that you are interested in. If you really only want rules that have something to do with the top products you set raise the bar. If you are interested in all
your products and their relationships you will get thousands of rules use a small minimal support. That is basically it. So it is kind of like it sets the granularity how you look at your data.
How you look at your product. So in the detour we will look at another algorithm the class association rule mining and that will do. Silvio. Okay. So we have seen the classical approaches. We have seen the cases
where we are going to perform some market basket analysis. But this kind of analysis is not targeted. So I am not expecting something specific in the right side of my rules. Therefore there was the idea of class association rule
mining to introduce specific semantics to this kind of rules by exploiting classes. The idea was we will define a set of classes and we try to identify those rules which explain the
important items co-occurring inside those classes. And for this algorithm I have taken an example from the area of text mining. It is I think the best field for class association rules.
Okay. So let's imagine our data as a set of documents. Here I have taken seven documents each speaking about different topics. The first three documents speak about education and the last four about sports. We have then the nouns.
Well, of course for the sake of an example I have taken a very small number of nouns for each document. You should imagine about 300 to 400 unique nouns. I have taken here three nouns for education for the first document and so on.
we apply classical a priori algorithm and we calculate of course rules. With support and confidence as we've seen before. So for example, for the education, for student school, we find the support of two divided by seven.
And this has to fulfill the minimum support condition and the minimum confidence condition. Clearly, the minimum confidence condition is not such a problematic situation.
The association rules that I am getting out of here, as you can observe, have a classical structure. Namely, the words are in the left side, the class at the right side. This is the whole idea of class association rule mining.
This is why the advantage here is that association rule mining with classes can be performed in just one step. I am going to perform the candidates to join and prune, identify the frequent item sets.
But then I don't need to perform any combinations to extract the rules, because I already know that the classes are at the right of my rule system, yeah? So then I have each transaction as before,
resulting in rules of type condition set. These are my words or items and labels of my classes. It's pretty simple. Therefore, the a priori algorithm can be further used to calculate class association rules,
just as before, just adding semantics through classes. Of course, also the idea of multiple minimum supports can be used here. I can add two different classes, different minimum supports. A classical example for this is when I define two classes,
one positive, one negative, with the corresponding items. And then I say, okay, I'm really interested in what makes the positive class. So I'm going to use a lower minimum support, and I'm going to use a higher
support for the negative class, since I'm not that interested, or it comes a lot in my data anyway. If I want, I can also exclude it entirely from my data by using a minimum class support of 100%. Good, some classical tools for performing association rule mining,
all the three algorithms we've discussed about today, or variations we've discussed about today, are offered as well as in open source projects, as in commercial projects. The most known open source projects for association rule mining and data mining as a whole are RapidMiner and Veca.
You can download them and play with them, test them, and see what kind of functionality they offer. They are completely free for student use. As commercial solutions, they are a bit more powerful. They scale very well with data, are the intelligent miner offered by IBM.
There's a product from SPSS, and of course, from Oracle, they also have their ODM miner for the Oracle database. I wanted to bring you a practical example.
So I've tested association rule mining with the a priori algorithm on a data set which I have obtained from the California University. Of course, you can find there a lot of data sets for performing exactly association rule mining.
And here I've downloaded a data set with regard to customer preferences when it comes to buying a car. The classes are unacceptable. The client found a certain product unacceptable, acceptable, good,
up to very good. And the attributes which the customers had to consider when deciding upon the class value was the cost of the car, which varied from high to low, very high to very low, the maintenance cost, the number of doors, luggage room, and so on.
So there were more attributes for considering the acceptability of a car. I've performed this study with VECA, and in the VECA interface, one can set how many rules one wants to obtain. I've set here, I think, 100 rules.
One can set the support intervals, the confidence, the class index. In the case, one wants to use the class association rule mining. In this case, I've performed classical a priori algorithm with classical association rule mining, so no class involved.
And I wanted to see what comes out. And I've observed that among, so the first rule, for example, was that coupe, so a two person car, was found as unacceptable by most of the clients.
This is a pretty convincing rule. It has a confidence of 100%. The same can be said, for example, about cars with two persons and small luggage compartment, again unacceptable, and so on.
These are pretty simple rules. VECA displays at first the rules with just a small number of items in the left part of our rule. So I waited a bit longer and received also a bit more comprising rules.
Like, for example, this is a sedan, a four seat car, which was found unacceptable due to its low safety. So if a four seat car is found unacceptable, then it is most surely because it's unsafe. And if you wait some more or define a larger threshold of allowed rules,
then you'll get even complexer rules which decrease also in confidence due to the large number of records in your data set. Good, I then wanted to try something larger. And I've taken a car accidents database, which actually is pretty interesting.
It is quite big, so 350,000 throws and 54 attributes. And I was also curious to see what kind of attributes. And then there's information about the hour the accident took place, where the accident took place, how the intersection,
if it's an intersection, looked like. What are the meteorological weather conditions? All this information is there in the attribute. So I wanted to see, okay, how do the rules look like in order to detect what kind of conditions are more frequent for such accidents?
I started the VECA and for only 54 attributes and 350,000 throws, it died pretty fast. So you can imagine the complexity of the algorithm can go up pretty fast. And the four giga of memory are fast, not enough.
Of course, there are improvements of this algorithm, which achieve this kind of performance and can deal with more amount of data. And I then also was interested to see how Walmart can do such analysis. And I found out that actually what they are doing is kill it with iron.
So they just put a lot of power when performing something like this. They try to prune as much as possible. They have a tree-based algorithm and then it works. So it's not a good idea to do this on a laptop as I've tried it.
Good, so what I would like you to take from today's lecture, we've spoken about the business intelligence. We've spoken about the importance of segmenting the customers, about the propensity of the customers to buy. How and what are they disposed to buy?
What is the profitability based on the customer? And how may I keep the customer to the company? We've spoken about an overview regarding data mining. What is data mining in general? What are the interesting algorithms?
And we've started with association rule mining. We've introduced the a priori algorithm, which we can measure the strength or the weaknesses of rules through support and confidence. The most important part of the algorithm
is the very intuitive downward closure property, which minimizes the intermediate results by a high order of magnitude. We've discussed about the association rule mining by considering multiple minimum supports to solve the rare item problem.
And we've discussed about the head item problem, which is derived from using multiple minimum supports. After the holiday, we'll continue throughout the data mining field
as an application to data warehouses with time series, trend and similarity search analysis, as well as sequence patterns. Thank you and have a nice holiday.