6th HLF – Laureate Lectures: Big Data, Technological Disruption and the 800 Pound Gorilla in the Corner


Formal Metadata

Title
6th HLF – Laureate Lectures: Big Data, Technological Disruption and the 800 Pound Gorilla in the Corner
Author
Stonebraker, Michael
License
No Open Access License:
German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.
Publisher
Heidelberg Laureate Forum Foundation
Release Date
2018
Language
English

Content Metadata

Abstract
Michael Stonebraker: "Big Data, Technological Disruption and the 800 Pound Gorilla in the Corner" This talk will focus on the current market for “Big Data” products, specifically those that deal with one or more of “the 3 V’s”. I will suggest that the volume problem for business intelligence applications is pretty well solved by the data warehouse vendors; however upcoming “data science” tasks are poorly supported at present. On the other hand, there is rapid technological progress, so “stay tuned”. In the velocity arena recent “new SQL” and stream processing products are doing a good job, but there are a few storm clouds on the horizon. The variety space has a collection of mature products, along with considerable innovation from startups. I will discuss opportunities in this space, especially those enabled by possible disruption from new technology. Also discussed will be the pain levels I observe in current enterprises, culminating in my presentation of “the 800 pound gorilla in the corner.” The opinions expressed in this video do not necessarily reflect the views of the Heidelberg Laureate Forum Foundation or any other person or associated institution involved in the making and distribution of the video.

[Music]
For the second part of our doubleheader, you should realize that our speaker really had to brave the elements and work very hard to get here: Mike's flight into Frankfurt was cancelled, he went to Berlin, and he just got here this morning. Our speaker, as those of you in computer science surely know, is the legend of databases, the winner of the Turing Award in 2014 and, a number of years earlier, of the IEEE John von Neumann Medal, among other honors: Mike Stonebraker.

Good morning. Lufthansa is high on my list right now, but other than that I'm delighted to be here. What I'm going to talk about is big data. This is going to be the absolute opposite of the talk you just heard: it is a huge, mushy field, and I'll tell you what I think is worth working on and what I think is not worth working on. I'm going to talk about big data, and the interesting things about big data are about disruption, because that's when the established vendors get faked out by the new guys. Then I'm going to talk about the 800-pound gorilla in the corner, which is the thing you really should be working on.

So what is big data all about? Well, it was invented by a marketing guy at some company maybe a decade ago, and it doesn't mean anything; it means whatever the marketing department wants it to mean. What I'm going to say it means is that you've got a big data problem if you've got one of the three V's: either you've got too much data and you can't manage it all, or it's coming at you too fast and your software can't keep up, or it's coming at you from too many places and you have a horrible data integration problem. I'm going to talk about each of the three.

In the big volume area, if you want to do SQL analytics, that is, stupid analytics on a lot of data, that's one problem; if you want to do more complex analytics on a lot of data, that's a second problem. If you want to do SQL on petabytes of data, there are a bunch of commercial vendors. I know of maybe 20 or so installations that are running in production with multiple petabytes of data on very, very big clusters, and in my opinion, as a research problem, this is the definition of solved: there are plenty of commercial offerings. Just for example, Zynga, the game company, the guys who wrote FarmVille, records every click anywhere in the world in real time and uses that to have their marketing and engineering folks figure out how to sell more virtual goods. The real business model of Zynga and FarmVille is to convince Chinese teenagers to impress their dates by buying more virtual goods. Anyway, if you want to do SQL, I don't see it being a big research problem at this point, and everybody runs the same architecture, which has pretty much been proved to be the right way to do it: they're all running multi-node systems that are partitioned, and they're all parallel column stores. If you don't know what any of that means, go look at Vertica, go look at Presto, go look at Hive; they all work roughly the same way. Do not go look at Oracle: Oracle doesn't run this way, and Oracle is nowhere in the warehouse business. So anyway, there's no gorilla here.
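As a rough illustration of the shared-nothing, partitioned, parallel column-store architecture described above, here is a minimal Python sketch. It is a toy simulation, not any vendor's implementation: rows are hash-partitioned across pretend nodes, each node stores its partition column-wise, and an aggregate is computed as per-node partials that are then merged.

```python
# Toy shared-nothing, partitioned, parallel column store (illustration, not a real system).
from collections import defaultdict

NUM_NODES = 4  # pretend cluster size, an assumption for the sketch

# Each "node" keeps its partition of the table as columns, not rows.
nodes = [{"region": [], "sales": []} for _ in range(NUM_NODES)]

def load(rows):
    """Hash-partition incoming rows across the nodes, storing them column-wise."""
    for region, sales in rows:
        node = nodes[hash(region) % NUM_NODES]
        node["region"].append(region)
        node["sales"].append(sales)

def total_sales_by_region():
    """Run the aggregate locally on every node, then merge the per-node partials."""
    merged = defaultdict(float)
    for node in nodes:
        local = defaultdict(float)
        # Column-at-a-time scan: only the two columns the query needs are touched.
        for region, sales in zip(node["region"], node["sales"]):
            local[region] += sales
        for region, total in local.items():
            merged[region] += total
    return dict(merged)

load([("EU", 10.0), ("US", 20.0), ("EU", 5.0), ("ASIA", 7.5)])
print(total_sales_by_region())   # {'EU': 15.0, 'US': 20.0, 'ASIA': 7.5} in some order
```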
There are only two flies in the ointment that I see in the data warehouse world. The first one: all the data warehouse products were built a decade or more ago (I built one of them), and we were all optimizing for network bandwidth. That was the high pole in the tent: try to save sending bits around between the nodes in a partitioned database. Well, the problem is that in the last ten years network bandwidth has been increasing faster than CPU and main-memory bundles have been getting beefier, so, alas, current data warehouse products and deployed systems are not network-bound. Anybody who does research on saving network bandwidth in the data warehouse world: forget it, it's not worth it, it's not the high pole in the tent. This is going to cause a re-architecting of commercial data warehouse products.
The good news is that the engineers will have some more work to do, and this may allow some new competitors in. The second fly in the ointment for these guys is way more serious: everybody is going to move to the cloud sooner or later. My favorite example is from Dave DeWitt; the previous speaker mentioned Azure. Here are Azure's data centers as they are currently deploying them: they are shipping containers in a parking lot. Chilled water, power and Internet go in; otherwise they're sealed. Roof and walls are optional; they're only there if you need them for security. Line up the shipping containers; if there's any failure, turn that node off, and when too many nodes get turned off, bring in a forklift, take the thing away and replace it with a new one. If you think about raised flooring in Heidelberg versus shipping containers in the Columbia River Valley, guess who's going to be cheaper. So everybody's going to move there for cost reasons.

And AWS, mentioned by the previous speaker, plays by different rules than if you are buying hardware. AWS gives you two different storage options, S3 and EBS. I'm not going to tell you what they look like, but one is dramatically cheaper than the other for no good reason; Amazon is not implementing economics here, they're pricing in an arbitrary way. Also, they have a bunch of in-house solutions that they wrote: Redshift is one, Aurora is another, Spectrum is a third. Amazon gets to choose how to price their internal software because they are running the hardware; all the other guys who want to run database systems on AWS have to rent resources from Amazon, and Amazon's charging algorithm dramatically favors their in-house solutions. So it's not a level playing field at all, which is kind of a problem if you want to compete against the in-house solutions. And even if you don't, Amazon has the notion of a T-shirt size, which is a bundle of resources that you can pay for, with virtual CPU speeds, memory sizes, that sort of stuff. There are at least 50 of them with different prices, and you have to decide which one is the best way for you to compute. So cloud architectures are going to be a huge challenge for anybody who's going to run there, and we're all going to run there.

However, all of this is a blip in the overall scheme of things. The big problem with warehouses is that that's yesterday's problem. People have been predicting that data science will supersede business intelligence (business intelligence is SQL analytics on data warehouses) as soon as you all train enough data scientists to go out and fill all the open positions. Predictive models are great, and they're going to replace big tables of numbers. Data science is of course another meaningless word; it's whatever the marketing guys decide it means. So let me tell you what I think it means: basically it's complex math operations, Leslie Lamport kind of stuff: machine learning, clustering, k-means, predictive modeling, Bayesian analysis, all that stuff. That's what big analytics is all about, the world of the quants and rocket scientists. And the thing that is really interesting is that this is mostly specified as linear algebra on array data, and there are a dozen or so common inner loops. So let me give you a quick example of why you should believe what I just said. Everybody's favorite example is the New York Stock Exchange. Consider the closing price on all trading days for the last five years of, say, Oracle and IBM.
The quants all want to ask: I wonder if those two time series are correlated? If they're correlated positively, then if IBM goes up you want to buy Oracle; if they're correlated negatively, then if IBM goes up you want to short Oracle. The obvious calculation is the covariance between these two time series. It's written down in red; go look it up in any linear algebra book. That's not very hard; you can do it on your wristwatch. So to make it interesting, what the quants really want to do is do this for all pairs. On the NYSE there are about 4,000 stocks, and five years' worth of trading days is about a thousand cells, so for any stock we have a thousand closing prices, and we have 4,000 stocks. The red thing is therefore a 4,000-by-1,000 matrix; that's the data you get to deal with. And covariance, ignoring the constants and subtracting off the mean, what is it in Leslie Lamport's terms? It's that red array matrix-multiplied by its transpose. The inner loop of practically all complex analytics is stuff like this.
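As a concrete rendering of that inner loop, here is a small NumPy sketch. The 4,000 by 1,000 matrix is filled with random numbers purely for illustration; real closing prices would go in its place.

```python
import numpy as np

# 4,000 stocks x 1,000 trading days of closing prices (random stand-ins here).
prices = np.random.rand(4000, 1000)

# Subtract each stock's mean; the covariance matrix is then (up to a constant)
# the centered array matrix-multiplied by its own transpose.
centered = prices - prices.mean(axis=1, keepdims=True)
cov = centered @ centered.T / (prices.shape[1] - 1)   # 4,000 x 4,000 result

print(cov.shape)  # (4000, 4000): one covariance per pair of stocks
```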
Now, the thing to be very clear about is that SQL is not going to do any of this stuff. First of all, SQL is on tables, not on arrays, and it doesn't have any clue what stock times stock-transpose would mean. If you want to try something really hard, try this in SQL with some relationalization of that red matrix. So this is what you want to do if you're into complex analytics.

This is the rage in machine learning, and the real rage in machine learning is deep learning, big neural networks. They're really good at image understanding, they're really good at finding cats on the Internet, they're good at natural language processing. But this is not going to take over the world; it is absolutely not going to take over the world. Why not? Well, I'm involved in a startup that I'll mention later on; we do data integration, we have a big machine learning application, and it's conventional random forest techniques, machine learning as it's been known for 20 or 30 years, and we have a couple hundred customers, none of whom are interested in deep learning for this application. So what's the big problem? The big problem is training data in the enterprise, which is where Tamr exists. Training data means: IBM SA is, or is not, the same thing as IBM Incorporated. Is the Spanish subsidiary the same thing as the US subsidiary? That's a yes-or-no question, but it requires your CFO to give you the answer, so training data is invariably human-approved, even if it's automatically generated. The high pole in the tent is getting enough training data to make any model work, and that's a struggle for all Tamr customers. If you do deep learning, you need one or two orders of magnitude more training data, and enterprises are having a hard time getting enough just to do conventional ML. So in cases where training data cannot be automatically produced, deep learning is not going to go there. Also, right now it's a little hard to explain deep learning predictions. If I'm predicting your credit score, and whether or not you get a loan depends on the answer, then if the answer to "why did you say X?" is "well, this thing here said so", that will get you sued. In a lot of real-world enterprise situations you have to be able to explain yourself, and so far that's a little tricky. I'm sure the deep learning guys will get there, but they're not there right now. So there's going to be a lot of different machine learning stuff; do not drink the deep learning Kool-Aid, at least not for all applications.
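To make the point about conventional ML with human-approved training data concrete, here is a minimal sketch of supervised entity matching in the spirit described above. The string-similarity features, the toy labels, and the model choice are all illustrative assumptions, not Tamr's actual pipeline.

```python
# Toy entity matching with a random forest (illustrative only, not Tamr's system).
import os
from sklearn.ensemble import RandomForestClassifier

def features(a: str, b: str):
    """Crude similarity features between two company-name strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb)
    prefix = len(os.path.commonprefix([a.lower(), b.lower()]))
    return [jaccard, prefix, abs(len(a) - len(b))]

# Human-approved labels: 1 = same real-world entity, 0 = different.
pairs = [
    ("IBM SA", "IBM Incorporated", 1),
    ("IBM Incorporated", "International Business Machines", 1),
    ("IBM SA", "Intel Corporation", 0),
    ("Oracle Corp", "Oracle Corporation", 1),
    ("Oracle Corp", "Teradata Corp", 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Score an unseen pair; in practice a human still approves the final decision.
print(model.predict([features("IBM Deutschland GmbH", "IBM Corp")]))
```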
Now, in this world of complex analytics there are lots of ML platforms, and I'm sure a bunch of you have used some of them. I just want you to know there's no data management and no persistence in any of this stuff. You run one of your ML packages, it produces a bunch of output data, and you want to save that output data, because ML is an iterative process: you try it with different parameters and you get a sequence of results. You really have a workflow, and current ML packages are clueless about how to deal with persistence, workflow and data sets over time. You can certainly run the stat packages; they also have weak or non-existent data management, the same problem the ML packages have. They use file-system storage, and if you ever learn one thing from the data management community, it's this: put your data in a database system, because then you can find it again; do not put it in a file system and encode all the metadata in the name of the file. So anyway, I'm not a big fan of stat packages. There are also a bunch of array database systems, most of them startups. If you have to deal with "red" data, array data, why not use a database system that understands arrays instead of one that doesn't? So there are SciDB, TileDB, and Rasdaman from here in Germany; they're starting to get traction in the genomics space, and we'll see how they do off into the future. And please, if you learn only one thing from what I say this morning: do not use Hadoop for anything. Hadoop is not good for anything; if you're using it, stop and do something else. I could give you a long song and dance about why you're being stupid if you're using Hadoop, but I won't.

So it's really the Wild West right now. There's all kinds of stuff, and it's a very immature field. There are lots of things that you would want: a toolkit that could implement a lot of different algorithms, a toolkit that would remember everything you did and allow you to back up three weeks because you screwed up. The goal is something wildly different from what we have right now. It's the Wild West, and there's lots of interesting work to be done here, but this is not the 800-pound gorilla; we will come to that.

OK, so how about if your data is coming at you too fast? Well, since we're tagging everything of value, that generates huge velocity. All the automobile insurance companies in the US are putting a sensor in your car to take note of how you drive; if you jam on the brakes too often, your insurance rates are going up. They are reading that sensor maybe 30 times a second, times lots and lots of cars. Smartphones, state-of-the-art Internet games: all of this sends velocity through the roof, so there are a lot more big streams of stuff to process. There are two different solutions here. If you want to look for patterns in a firehose (across the CNBC screen you're seeing the trades on the NYSE, and you want to find out if Oracle is going up and, within a short period of time, IBM is going down; find me a strawberry followed within a hundred milliseconds by a banana), there are complex event processing (CEP) systems focused on this. Probably the most popular modern one is Kafka; Storm is another; there are a bunch of proprietary systems. And I don't hear anybody complaining that these systems won't keep up with their problem. They could of course be made easier to code and to use, but I don't hear anybody complaining about performance.
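As a toy illustration of the kind of pattern-over-a-stream query CEP systems handle ("a strawberry followed within a hundred milliseconds by a banana"), here is a minimal in-memory Python sketch. It is not Kafka or Storm, just the idea of a time-bounded sequence match.

```python
# Toy complex-event-processing pattern matcher (in-memory sketch, not Kafka/Storm).
from collections import deque

WINDOW_MS = 100  # "followed within a hundred milliseconds"

class PatternDetector:
    def __init__(self, first, second, window_ms=WINDOW_MS):
        self.first, self.second, self.window_ms = first, second, window_ms
        self.recent_first = deque()  # timestamps of recent 'first' events

    def on_event(self, symbol, ts_ms):
        """Feed one event; return True when 'first' then 'second' occur within the window."""
        # Drop 'first' events that are too old to matter.
        while self.recent_first and ts_ms - self.recent_first[0] > self.window_ms:
            self.recent_first.popleft()
        if symbol == self.first:
            self.recent_first.append(ts_ms)
            return False
        return symbol == self.second and bool(self.recent_first)

detector = PatternDetector("strawberry", "banana")
stream = [("strawberry", 0), ("cherry", 40), ("banana", 90), ("banana", 300)]
for symbol, ts in stream:
    if detector.on_event(symbol, ts):
        print(f"pattern matched at t={ts}ms")   # fires at t=90, not at t=300
```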
The other thing I hear is epitomized by a company called GETCO. You've probably never heard of it, but it's about 10% of NYSE trading volume, day in, day out. It's an electronic trading company that has electronic trading desks in London, New York, Singapore, all around the world, and the CEO is kind of nervous, because all these independent trading engines could all decide to short IBM at the same time, and that would create way too much risk for the company as a whole. So what he wants to do, and what he's currently doing, is: for every security that they trade, some 60,000 of them, assemble my real-time global position, on a millisecond basis, and alert me if my exposure is greater than some tolerance. That's his application. Now, if you think about this for a minute, you can't lose any messages; if you lose a message this whole thing fails, and if your hardware fails, then of course this fails. So this looks like high-performance online transaction processing: you want to update a database at very, very high rates. There are a bunch of things that look like this, and my suspicion is that there are more of these guys than there are people who want CEP, but we'll see; the commercial marketplace votes with its feet.

So if you want to do this, you have three choices. You can run Oracle and the other current vendors, whom I affectionately call the elephants, people who've been around for decades and give you what I like to call Old SQL. Then, recently, there are a bunch of people who've alleged that NoSQL is a good idea; that's 75 or so vendors, maybe 150, there are a lot of them, and their marketing strategy is: you don't want SQL, it's too slow, and you can't afford ACID, it's too slow (ACID means transactions), so give up SQL, give up transactions, and buy my software. Then there's a third camp, which has come to be called NewSQL. These people say SQL is a good idea; high-level languages are a very good idea, Leslie just proved it to you, and SQL is a high-level language, whereas the NoSQL guys say code in assembler. And, by the way, transactions are a really, really good idea. But you can't use Oracle, which was architected 20 years ago; you need a new architecture. So if you run Oracle, forget it; in this whole world it's a couple of orders of magnitude too slow. If you run a NoSQL system (MongoDB is an example, Cassandra is an example), there are no standards, and, by the way, if you want ACID on a NoSQL system, then you get to code it in user space, and that is a pain worse than death. So I'm not a big fan of giving up ACID, giving up SQL, and certainly giving up standards is a lousy idea. The NewSQL guys are giving you main-memory DBMSs, high availability, failover, staying up, and so on, and they're running different concurrency control schemes than the Oracles of the world. There are a bunch of vendors here, all of whom will be happy to do a million transactions a second, and I don't know anybody who wants to go faster than that yet. So I view this as: it doesn't look very hard. In the big velocity space there are some people who want to do CEP, and products in the marketplace seem to work reasonably well (they could be easier to use, of course), and then there's next-generation OLTP, which is a new class of database systems that are fast enough to deal with this market. So I don't see anything in the big velocity space that looks like an 800-pound gorilla.
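To make that real-time global position workload concrete, here is a minimal Python sketch of the aggregation such a trading firm would need. The security names, tolerance, and trade format are invented for illustration, and the hard parts (no lost messages, failover, very high update rates) are exactly what it leaves out.

```python
# Toy real-time exposure monitor (illustration only; ignores durability and failover).
from collections import defaultdict

TOLERANCE = 1_000_000  # alert when |net position| exceeds this (made-up number)

positions = defaultdict(float)  # security -> net position across all desks

def on_trade(security, quantity, price, side):
    """Apply one fill from any desk and alert if global exposure gets too large."""
    signed = quantity * price * (1 if side == "buy" else -1)
    positions[security] += signed
    if abs(positions[security]) > TOLERANCE:
        print(f"ALERT: exposure on {security} is {positions[security]:+,.0f}")

# A few fills arriving from different desks.
on_trade("IBM", 5_000, 150.0, "sell")    # -750,000: still inside tolerance
on_trade("IBM", 4_000, 150.0, "sell")    # -1,350,000: alert fires
on_trade("ORCL", 2_000, 50.0, "buy")     # +100,000: fine
```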
OK, there is one fly in the ointment: those of you who know about RDMA, which is a very fast way to communicate between nodes, the people who do this NewSQL stuff may have to deal with RDMA, at which point they're going to have to rethink their concurrency control algorithms. But they'll just do that; this isn't that hard.

So where's the 800-pound gorilla? It's in the big variety space, and I want to tell you two scenarios. Scenario number one has to do with data scientists, the people we are trying to help with complex analytics, machine learning, all that stuff. Merck, a big pharma company, has around a thousand data scientists. Here's the template of what they do, day in, day out. They have an idea: does Rogaine cause weight gain in mice? I don't have any idea whether that's an interesting question or what the answer is; it just exemplifies something you might want to know the answer to. So the Merck scientist has a huge data discovery problem. Merck has 4,000 or so Oracle databases, and they don't even know exactly how many they have; they have a huge data lake plus all kinds of files, and the public web is a treasure chest of information. Just finding datasets that might answer this question is a huge challenge, and once you get them, then you've got to do data integration: you've got to put them together so they are semantically the same. Here's a quote from a data scientist at iRobot, the guys who sell you the robot vacuum cleaners. She said: I spend 90% of my time finding and cleaning the data. Ninety percent, so 10% is left for something else. Then she said: I spend 90% of the other 10% running my data models and finding out they don't work, and they don't work because of data cleaning errors; the input is too dirty to actually use. So she spends 99% of her time finding and cleaning data, and 1% of her time doing the analytics for which she was hired. Those of you who want to make a difference: you can go help a data scientist with her algorithms, which she spends at most half an hour a week doing, or you can help her with her data integration and data finding problem, which is what she does thirty-nine and a half hours a week. So this is the high pole in the data science tent: getting and cleaning your data.

The enterprise guys have a version of this which is very compelling. GE, the conglomerate in the news a lot about whether their management is any good: the GE CFO estimated that they can save a hundred million dollars a year with the following tactic. They have 75 procurement systems. A procurement system, for those of you who don't know: if you work for a company and you need some paper clips, you go to your procurement system, it spits out a purchase order, you take the purchase order down to Staples and you get your paper clips. GE has 75 of them. The obvious correct answer, using Leslie Lamport notions, is one; GE has 75 for all kinds of political reasons.
So if GE can empower each of these 75 procurement folks, when their contract with Staples comes up for renewal, to find out what terms and conditions the 74 other procurement officers negotiated and then demand most-favored-nation status, that's worth a hundred million dollars to GE. Of course, what do you have to do? You have to integrate 75 supplier databases, because there is a supplier database behind each of these procurement systems, and they're all independently constructed. And by the way, once you get done doing that, GE wants to integrate parts, customers, lab data; it goes on and on.

So here is the data integration challenge faced by GE and bunches of other enterprises. For every local data source, after you find it, you've got to ingest it, getting it into some common place. You've got to perform transformations: you guys use euros, we use dollars, and you've got to get the units to be the same. You've got to clean the data, which is an Achilles heel; a rule of thumb is that at least 10% of your data is either wrong or missing, and you've got to fix it. Then you've got to line up the fields: you call them wages, I call them salary; that's called schema integration. Then GE wants to find all the different records that correspond to Staples, that is, deduplicate all of the data you've put together. And sometimes you want to find golden values: you've got five records for Dave Patterson, three of them say he's 46, one of them says he's 52, and you've got to decide; gee, maybe they're all wrong. Anyway, that's what finding golden values means. You've got to do all of this stuff, and the killer is that you've got to do it at scale. GE has about 10 million supplier records, and running an N-squared algorithm on ten million records is not "go out for lunch", it's "take a short vacation", so you can't use an N-squared algorithm. And since we're here in Europe: it turns out Toyota is a customer of this company, Tamr. Toyota has an independent subsidiary in every country; Spain has one, France has one, and so forth. If you buy a Toyota in Spain and you then move to France, Toyota develops amnesia about you; the French guys have no idea what's in the Spanish database. So Toyota is in the process of integrating their customer databases all over Europe: about 30 million records in 40 different languages, just so you can get a sense for it. Do not even think about naive algorithms in Python; this is a problem of making stuff perform at scale.
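Here is a toy Python sketch of a few of the steps just described: unit transformation, schema alignment, blocking to avoid the N-squared comparison, and picking a golden value. The conversion rate, field names, and records are invented for illustration; a real pipeline at the scale above looks nothing like this.

```python
# Toy data-integration steps: transform units, align schemas, block, pick golden values.
from collections import Counter, defaultdict

EUR_TO_USD = 1.1  # assumed conversion rate, purely for illustration

def normalize(record, currency):
    """Align field names (wages -> salary) and convert euros to dollars."""
    out = dict(record)
    if "wages" in out:                       # schema integration
        out["salary"] = out.pop("wages")
    if currency == "EUR":                    # unit transformation
        out["salary"] = round(out["salary"] * EUR_TO_USD, 2)
    return out

def blocking_key(record):
    """Cheap blocking key so we only compare records within a block, not all N^2 pairs."""
    return record["name"].lower().split()[-1]

def golden_value(records, field):
    """Majority vote for a 'golden' value across candidate duplicate records."""
    return Counter(r[field] for r in records).most_common(1)[0][0]

european_source = [{"name": "Dave Patterson", "wages": 90000, "age": 46}]
us_source = [{"name": "David Patterson", "salary": 99000, "age": 46},
             {"name": "dave patterson", "salary": 98000, "age": 52}]

records = ([normalize(r, "EUR") for r in european_source] +
           [normalize(r, "USD") for r in us_source])

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)        # candidate duplicates share a block

for key, group in blocks.items():
    print(f"{key}: {len(group)} candidate duplicates, golden age = {golden_value(group, 'age')}")
```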
So that's what you have to do, and the traditional solution simply does not work. The traditional solution is what's called extract, transform and load (ETL), brought to you by a whole bunch of legacy companies, and it's been extended somewhat recently with what's come to be called master data management (MDM), brought to you by the same cast of characters. ETL requires way too much manual programming by a programmer, so it's just too expensive, and MDM simply does not scale, certainly not to 30 million records. Why do I say it doesn't scale? Well, MDM says: if I have IBM SA and I have IBM Incorporated and they're the same thing, match those, deduplicate them. So they're saying: write a bunch of rules to decide what matches and how to merge multiple records into a single golden record; do it with rules. Just as a round number, a human can write around 500 rules; nobody has written rule systems that work with 5,000 rules, because you just can't get your head around them. One of GE's problems is that they have 20 million spend transactions (a spend transaction is: Leslie went down to Staples and bought paper clips), which I think might be the annual spend in a couple of divisions, and their CFO just asks: how much money are we spending on computers, how much money are we spending on travel? They have a pre-existing classification hierarchy (parts, then computers, then memory, and so forth), and all they want to do is classify 20 million transactions into this hierarchy. So somebody wrote about 500 rules, which is about as much as you can get your head around, and that classified 2 million of the 20 million spend transactions. That leaves 18 million to go, so this is pretty useless so far. You can't write 5,000 rules or 12,000 rules; the technology is just the wrong way to think about things. The obvious solution is that you've got to use machine learning and statistics to do as much of the heavy lifting as possible, with minimal human involvement. So Tamr, the company I'm associated with, used the 500 rules as training data, which gave you 2 million labeled records, and built a predictive model to classify the remaining 18 million. You've got to do it this way; the alternative simply does not scale. So anyway, this is a hugely important problem where the traditional solutions do not work and are known not to work. There are lots of startups in this space; it's the wild, wild west, so hold on to your seatbelt. But remember: if you're interested in data science, this is the important part of data science, even though it's not "data science". If I were you, I would work on this stuff.

So finally, in summary: machine learning will be omnipresent, absolutely omnipresent; some of it will be deep learning, some of it will be conventional ML. Complex analytics is going to replace data warehousing as soon as we can get enough people to understand this stuff. And both ML and complex analytics are going to go nowhere without good data, and good data requires data integration at scale: the 800-pound gorilla. This is the high pole in the tent, in my opinion, and what I'd like to leave you with is: if you want to really make a difference, think about working on this stuff. I'm done.

OK, given when we started, I think we have about five minutes for questions. [Music] Let's see if we have questions from the young people first. There's one. Should I be running with this mic? You're running with the mic, OK.
Thank you very much for the presentation. At the beginning of the presentation you explained the meaning of big data with the three V's: big volume, big velocity, big variety. My question is: at what point can you describe your data as big, as having big volume, big velocity and big variety? Thank you.

OK, so what exactly does this mean when you're down in the trenches? The answer is that you have a big data problem if you're tearing your hair out trying to get your application done. It isn't specifically a number of bytes; it's how much pain you are in, and if you're in a lot of pain, you have a big data problem.

Another question from the young folks.

Hi Mike, thank you for the talk. First of all, I'm very surprised that you're not wearing one of your signature red shirts.

I was told by the organizers that I had to dress up, so this is as dressy as it gets.

All right, no red shirt this time. The question is whether you think it may be possible that at some point we will develop analytics algorithms that won't need clean data, because they will be robust to all the error and noise, the noise that may be added by the human in the loop or whatever. I mean, statisticians have worked with sensor data and with missing data for a while, with multiple imputation and stuff like that. Surely their algorithms are not scalable, but can't we move to robust machine learning, robust analytics that won't need such an expensive data cleaning step?

OK, so here's my problem with that. The question is: why can't we build super-smart algorithms that will tolerate dirty data? The answer is: because you can't tell it's dirty. IBM SA and IBM Incorporated either are or are not the same thing. If they're the same thing, your algorithm goes one way; if they're different, your algorithm goes a different way. Only a human can tell you whether they're the same or not.

I'm going to give him the mic anyway.

Let me just ask the question another way: what are you actually doing when you're cleaning data? How do you clean your data, and why are you so sure that a machine can't do it?

Well, I agree. What happens is that you can use outlier detection, you can use functional dependencies, a bunch of stuff, and that gives you a collection of things that you think might be dirty; if multiple of these algorithms think it's dirty, then the chances that it is go up. So then you say, well, how do I fix it? Automated repair is in its infancy. You can say maybe ML can do that, but the trouble is that you're then going to have to check the ML, and that's the big problem, because then you're back to the human in the loop as the high pole in the tent.
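As a rough sketch of the "combine several weak dirty-data signals" idea in that answer, here is a toy Python example. The three signals (a missing-value check, a crude plausibility range, and a simple functional-dependency check) and the threshold of two agreeing signals are illustrative assumptions.

```python
# Toy dirty-data detection: flag a row only when several weak signals agree.
rows = [
    {"zip": "69117", "city": "Heidelberg", "salary": 52000},
    {"zip": "69117", "city": "Heidelburg", "salary": None},     # typo'd city and missing salary
    {"zip": "10115", "city": "Berlin", "salary": 51000},
    {"zip": "10115", "city": "Berlin", "salary": 9_000_000},    # implausible value (one signal only)
]

def missing_signal(row):
    return any(v is None for v in row.values())

def outlier_signal(row):
    s = row["salary"]
    return s is not None and not (10_000 <= s <= 1_000_000)     # crude plausibility range

def fd_signal(row, zip_to_cities):
    # Functional dependency zip -> city: a zip mapping to several cities is suspicious.
    return len(zip_to_cities[row["zip"]]) > 1

zip_to_cities = {}
for r in rows:
    zip_to_cities.setdefault(r["zip"], set()).add(r["city"])

for i, row in enumerate(rows):
    signals = [missing_signal(row), outlier_signal(row), fd_signal(row, zip_to_cities)]
    if sum(signals) >= 2:                      # one signal alone is not enough to flag
        print(f"row {i} looks dirty: {row}")   # a human would still confirm the repair
```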
He looks skeptical, but we will go on to the next question, which will also be the last question; there was a hand in the back.

Hello, I'm from the Technical University of Berlin, and I've worked a lot with sensor networks. Sensors are getting really cheap; do you think that collecting more data can compensate for the cleaning problem, because other sources of data can help to fix problems in the data?

I'm just saying that ten percent of your data is wrong, so if you're going to launch a nuclear missile based on your current data, I wouldn't do it; I'd clean your data first. Now, if you don't care about accuracy, then by all means do whatever you're doing. I think the answer is that, at least in enterprises, there are a bunch of Tamr customers who ask "how many suppliers do I have?", and it really makes a difference to how they think about things whether the answer is 50 or a thousand, and dirty data is typically the difference between 50 and a thousand.

Yeah, I agree, but somehow if you have more data, you will spend less effort on cleaning and the problem gets smaller.

OK, I think we should take further discussion to the lunch break. Let us thank Mike for the very insightful talk. [Applause] [Music]