
Teaching machines new tricks


Formal Metadata

Title
Teaching machines new tricks
Subtitle
Machine learning: Silver bullet or route to evil?
Number of Parts
95
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
According to the Gartner Hype Cycle, machine learning is currently at the peak of its hype. Scanning current press publications we find anything from Elon Musk warning that AI is the biggest existential threat to humanity, to scientists fooling machine learning models with seemingly tiny modifications to street signs, to machine learning enhancing smartphone pictures, as well as introductory material trying to explain what machine learning is about. According to Wikipedia, machine learning is "the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed." This keynote will detail what it takes to build a successful machine learning pipeline. We will explore some examples of how machine learning has evolved over the last twenty years and close by highlighting some of the implications that new machine-learning-based systems have.
Transcript: English (auto-generated)
Yeah, hello and welcome everybody to the keynote of the second day of FrOSCon, of the 12th FrOSCon. It is really a pleasure for me to introduce Isabel Drost-Fromm.
She is a member of the Apache Software Foundation, co-founder of the Apache Mahout project, serial incubator so to say. Mahout is a platform and environment for quickly creating scalable machine learning applications,
and this is also what the talk is about: teaching machines new tricks, silver bullet or route to evil. Please welcome Isabel, thank you. Thank you for this very warm welcome.
So today we are going to talk a little bit about machine learning and its implications. Why is this even relevant? If you take a look at the Gartner Hype Cycle and see where AI and machine learning are floating, it's pretty much at the top. So there's a lot of hype involved, but it's also a technology that's more and more pervasive.
What we are going to look at is what it looks like to build a machine learning application so that we are going from magical fairy dust to actual understanding of what it looks like on a very very high level. We're going to see a few of the successful applications as well as their historic background and where we are coming from.
And I will conclude with a few implications. As a warning, I'll probably not use up the whole time slot, so if you have any questions we will probably have a lot of time for a Q&A session at the end. So, a little bit of an introduction on why I'm giving this talk about machine learning.
I entered this field as a researcher back in 2003 before I realized that writing real software is more interesting than publishing papers, which is when I went into industry. In industry I realized that was like 10 years ago that the libraries that
we had back then that dealt with machine learning were either licensed under something strange like give me a call if you want to use it, so no real license actually, or they were active and visible for as long as they had research funding
and not afterwards, or they couldn't scale to large amounts of data. And that's how, together with a few people at the Apache Software Foundation, namely Grant Ingersoll and others, we founded Apache Mahout: scalable machine learning under a proven open source license,
with the goal of building a community that survives individual contributors not having time or interest anymore. And that's how Mahout came to be. Apart from that, I come from Berlin. If you know someone from Berlin, they don't like traveling outside; they will more likely invite you to come to them.
So if you need an excuse to make your employer pay for your train or flight ticket to come to Berlin in summer, go to Berlin Buzzwords if you're interested in anything scalable like data analysis, search, machine learning, and by now also virtualization systems like Kubernetes. That's early June. Trust me, Berlin is lovely in summer.
If you're not into big data and you still want to convince your employer to travel to Berlin, I'm currently trying to make a conference on everything free and open source backstage fly.
It's named FOSSbackstage.de. It's going to be two days after Berlin Buzzwords next year; on November 20th this year there is going to be a kickoff workshop. All things open source governance, licensing, community building, the Apache Way, et cetera.
Why do I get time to do this? I'm currently working as open source strategist at Europace AG. You may not know them, but raise your hand if you built a house or needed financing for some kind of construction work in your house.
Then you have probably spoken, in Germany, to someone who gave you a mortgage offering, and if they were comparing different offerings, they were probably using our system to figure out what is best for you. The processes that we use internally to develop software are pretty much converging towards independent teams, towards making
decisions close to the team where they are needed, towards making decisions without huge hierarchies and escalation paths.
This is fairly aligned with how at least the Apache Software Foundation works. The Apache Software Foundation also is a do-ocracy: the people who want to do something take the decision and actually submit a patch.
So there's quite some alignment and that's why Europace is supporting making this conference on open source behind the scenes fly. Okay, you know a little bit about me. I'm famous for doing meetups and for taking this microphone and handing it through the audience. I will save
you from that today. What I want to do with you is a quick show of hands. How many of you know what this formula is all about? Okay, pretty much more than half. Good. Anyone heard about MXNet? Okay. I would love to see at least one Amazonian hand up.
So it's a deep learning framework, currently under incubation at the Apache Software Foundation, which is heavily sponsored by Amazon with development time. Someone knows this logo? TensorFlow? A few more? Okay. About a third
maybe? A quarter? Anyone seen these logos before? Spark? Mahout, which I mentioned before? Okay, got quite a mix back. Anyone taken a machine learning course before? Okay. Hopefully you won't feel bored, because
Bayes' theorem, which we saw earlier, is pretty much the only equation that I have in my slides. Sorry. Any geeks in the room? Okay. Anyone else who's scared of AI? Or at least who knows the movie? Come on. Okay. Why did I make this quick show of hands?
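For reference, the formula shown on the slides is presumably Bayes' theorem (the transcript's "base theorems" reads like a mis-transcription); in standard notation:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```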
Recently I came across a study; one of my colleagues sent it to me, actually: what do customers really think about AI? 72% believe that they understand what artificial intelligence is all about.
However, when asked whether they actually use any artificial intelligence technology, only 34% answered yes. If you then take a look at the devices that the people who were asked are using, and
at the applications and the services that they are using, the true figure would be more like 84%. So chances are high that, at least if you're using some form of mobile phone, you have interacted with some sort of machine-learning-based system before in your life.
What does the press say? According to Elon Musk, artificial intelligence is our biggest existential threat. According to the MIT, maybe it could solve the world's biggest problems.
If I look at the spectrum, to me this looks like people are thinking we are talking about magical fairy dust. You take it, you sprinkle it over your project and suddenly everything works like a charm. Or does it? Let's see what Wikipedia has to say about it. It's not particularly helpful.
If you look at machine learning, what we read is something like, we want to have a program that makes data-driven predictions or decisions through building a model from sample inputs.
Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible. That's a bit more usable. If I take a look at the machine learning book by Tom Mitchell from 1997: machine learning is the study of computer algorithms that improve automatically through experience.
Sounds like magic, doesn't it? Actually, it's a little bit more like math. This is how your machine learning algorithm sees your data. This is by now fairly old-fashioned. It's an SVM model. It's a very simple SVM model and it only finds a linear separation.
So, what do we have to do to our data? For instance, in this case, a classifier can do something with it. Imagine that our language consisted only of two words, one of them being high-performance computing, the other word being sunny weather.
All texts that contain only sunny weather, that's the green dot over here, but not high-performance computing, probably weather forecasts maybe.
What about the red dot over here? It talks about high-performance computing and sunny weather. It's probably some research publications on how to create weather forecasts maybe.
The plus in the lower right, the blue one, contains only high-performance computing, so probably something computer science related. What our algorithm will do is look at a lot of examples. We tell it which category each example belongs to, and then it will find the line that separates the examples that belong to the class from those that don't.
It creates a weight vector that tells it what the line should look like, plus maybe an offset from the origin where the line should lie, and then we've got our separation. All easy.
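The two-word example above can be sketched in code. The following is a minimal illustration using a hand-rolled perceptron rather than the SVM from the talk (the documents, word choices, and learning parameters are all illustrative assumptions); it learns exactly the kind of weight vector plus offset described above:

```python
# Toy version of the talk's two-word example: each document is a 2-D
# vector of word counts for "high-performance computing" (x1) and
# "sunny weather" (x2). Labels: +1 = computing-related, -1 = not.
data = [
    ((0, 1), -1),  # weather forecast: only "sunny weather"
    ((1, 1), +1),  # research paper: both words
    ((1, 0), +1),  # CS text: only "high-performance computing"
]

def train_perceptron(data, epochs=20, lr=1.0):
    """Learn weights w and offset b so that sign(w.x + b) matches the labels."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified: nudge the line
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

w, b = train_perceptron(data)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1
```

The learned line separates the "computing" texts from the weather forecast; in a real setting the same idea runs over thousands or millions of dimensions instead of two.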
In reality, of course, we don't have only two words in our language, otherwise life would be boring. Imagine doing this not in 2D, but in a more high-dimensional space, like a couple thousand, couple million dimensions. The only things that we are doing is just drawing lines in space, just creating hyperplanes, that's everything.
If you bring deep learning into the mix, you're no longer drawing hyperplanes, what you're drawing is some kind of non-linear separation, but essentially the concept remains the same.
In order to come up with this hyperplane, what do we need to do? We want to learn from data, so where do we get this data from, where does it live? Does it live in a database? Do we have it somewhere in a CSV format? Is it maybe in some kind of proprietary format that we first have to read and convert?
Is it maybe just available on a few sheets of paper on your employee's desk? What if it's not even recorded at all? In order to train a machine learning model, you will spend a lot of time figuring out where in your
company the data that you need to build this model lives, and to get this to a format that makes sense. No shiny model training, model tuning, parameter tuning, just figuring out where the data lives and talking to a lot of teams and people. Great, we've got the data, we are done, right? Not quite yet. What we have now is a huge
pile of documents. Could be images, could be voice, could be audio, could be music, could be something like text. But we don't have mathematical vectors yet, so what we need beforehand is a first transformation step.
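That transformation step, turning raw documents into mathematical vectors, can be sketched as a minimal bag-of-words vectorizer (the example documents are made up; real feature generation adds tokenization rules, stop-word handling, TF-IDF weighting, and so on):

```python
# Minimal bag-of-words feature generation: turn raw text documents into
# the count vectors that a classifier actually consumes.
docs = [
    "sunny weather expected tomorrow",
    "high-performance computing for weather models",
]

def build_vocabulary(docs):
    """Map each distinct lowercase token to a fixed vector dimension."""
    vocab = sorted({token for doc in docs for token in doc.lower().split()})
    return {token: idx for idx, token in enumerate(vocab)}

def vectorize(doc, vocab):
    """Count how often each vocabulary token occurs in the document."""
    vec = [0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```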
We need to come up with some kind of feature generation, feature selection. Let me tell you a story of a team that tried to build an information extraction system for identifying relevant information in job advertisements.
What you need for that is a location plus at least a job title. Back in the days these advertisements were published not only in HTML but also in PDF. So they ran these PDF documents through a text converter and then created features in order to identify what was the job title.
Now guess what the most important feature was? Is this line of text located at the very bottom of the document? Why on earth? Because the PDF to text generator would take all the text that was marked as title and put it at the very bottom.
So the very bottom line often was the job title that people were looking for. There was another team building a spam classifier. The data scientist spent weeks and weeks optimizing the model, coming up with different
feature transformations, with different normalizations, but he couldn't quite get it above the threshold where he could publish it and deploy it. So what this person ended up doing was going to the ops people, giving them a spreadsheet with all of the features sorted by how much influence they had on the decisions the machine learning model would take.
Ops people would then go ahead and say, hey, there's a feature missing, over here. We know it, that's what it looks like in production. It took a couple of cycles, maybe a week, maybe two. Suddenly there was
a huge performance boost, things could go out to production and everything was great. So what this means is that for feature generation, you will need to spend a lot of time talking to your teams, talking to your ops teams, talking to people who know the business in order to figure out what the best features are.
When converting data, you will run into all sorts of lovely issues. How many of you have dealt with timestamp conversion? Same here. You will have to deal with typos in your data, either because
that's just human error, because someone couldn't spell whatever the feature is correctly, or on purpose. At least I personally have registered a couple dozen times in the past with fake identity information in order to retrieve a PDF document that was published.
So there's going to be a lot of noisy data out there. Or, to speak of a location provider: there are hotel owners who purposefully give you the wrong latitude and longitude for their hotel.
Why would they do that? Well, you'll probably be more likely to book it if it's closer to the ocean. Suddenly it's moved. Oftentimes the data that you will need isn't captured. You
will have to go through additional loops in order to deal with privacy protection. But there's also the issue of nobody actually thought that this piece of data would be interesting.
Maybe there's not enough data to train off, maybe the records that you would like to put into one packet are not linkable together. So overall, you will spend a lot of time just managing data, just getting pipelines together.
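One concrete taste of that unglamorous pipeline work is the timestamp conversion mentioned above. A minimal sketch, assuming a few common input formats (the format list is illustrative, not a real schema); unparseable records are flagged rather than guessed:

```python
# Normalise timestamps that arrive in several formats into UTC ISO-8601.
from datetime import datetime, timezone

# Illustrative assumption: the formats we expect to see in the raw data.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%d.%m.%Y %H:%M", "%m/%d/%Y %I:%M %p"]

def normalize_timestamp(raw):
    """Return an ISO-8601 UTC string, or None for unparseable records."""
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw.strip(), fmt)
            return dt.replace(tzinfo=timezone.utc).isoformat()
        except ValueError:
            continue
    return None  # noisy, fake, or missing data: flag it rather than guess

iso = normalize_timestamp("31.12.2016 23:59")  # German-style day-first format
bad = normalize_timestamp("not a date")        # flagged as None
```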
All sorts of boring works that you will have to do before you come to some kind of fancy machine learning. If you want to know more about this, there's two papers out there, one by Google and one by Amazon. The one on the lower right was published a couple of weeks ago at VLDB.
Okay, now we've found our data, we've preprocessed it, we've trained our model. We're done, right? Well, if you're a software engineer, you know that you need to test and identify how good the quality of your solution is, so you need to evaluate quality.
What's the quality metrics that you want to use? If you look at literature, you will find dozens of metrics that you can use only for classification. So you will have to identify whether it's accuracy, whether it's precision recall,
whether it's F-measure, or whether it's something like AUC, or what have you. On top of that, think about the spam classification example; that's the best-known one. Your algorithm putting something in your inbox you didn't want to see is not too bad; it's annoying at worst.
But your spam classifier putting something in your spam folder that you would have wanted to read, that's not quite so good. So you want to put weights on your errors depending on your business use case.
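The weighting of errors just described can be made concrete: precision and recall computed from confusion-matrix counts, plus a cost-weighted error where a false positive (legitimate mail in the spam folder) is penalized more heavily than a false negative (spam in the inbox). The counts and the 5:1 cost ratio are illustrative assumptions:

```python
# Precision/recall from confusion-matrix counts, plus a business-weighted error.
def precision_recall(tp, fp, fn):
    """tp: spam caught; fp: ham flagged as spam; fn: spam that got through."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def weighted_cost(fp, fn, fp_cost=5.0, fn_cost=1.0):
    # fp: ham wrongly sent to the spam folder (expensive for the user)
    # fn: spam that reached the inbox (merely annoying)
    # The 5:1 cost ratio is an illustrative assumption, not a given.
    return fp * fp_cost + fn * fn_cost

p, r = precision_recall(tp=90, fp=5, fn=10)
cost = weighted_cost(fp=5, fn=10)
```

Two classifiers with the same accuracy can thus have very different weighted costs, which is why the metric has to follow the business use case.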
It's all fancy and shiny to use deep learning models or support vector machines, these black boxes. But as soon as you talk to your manager, you will probably want to be able to explain why a certain decision was made. If you've ever built a search engine, you probably remember that your manager comes into your office and
tells you, hey, why is this search result not at the very top when I search for this query? Clearly there's a few ranking factors that are involved that were optimized. So often what happens is that in many cases, something that is easy to explain is something that wins.
Also you want to make what happens visually available. So just having a number that tells you yes or no is helpful to some extent, but you probably also want to visualize what your algorithm does.
And finally you need to deploy what you did to production. Sounds easy: you just take what you trained and put it on your server. Except, if you talk to your average data scientist, they will want an environment which lets them experiment very easily.
And what's easy to experiment with is often something that cannot be deployed easily. So you will have to figure out how to make this transition. Now to make things worse, you will probably have a loop where your quality evaluation informs which additional data you should have explored to begin with.
Plus you will probably have an additional loop. Once you've put something to production, user behavior will probably change, so you will have to retrain your model. Just to tell you one story about an email provider that was building a spam classifier.
If they build a model today, it was usually effective for roughly three months and after that it was useless. Already ten years ago what happened was that at some point their disks were filling up. Why was this happening?
Because spammers had figured out that the spam classifier didn't look at images, so they were hiding the spam message with an image. So suddenly you have a different classification problem there. So to summarize, you will always have these 80-20 rules that you already know from engineering.
You will spend 20% of your time selecting and tuning your model, plus 80% of your time on integration, feature preparation, and deployment of your pipeline. You probably want this to be a continuous process where you observe your usage, you measure how good you are, you
decide which feature to implement next and then you actually act and make those changes in order to re-enter the cycle. The tighter you make the cycle, the faster you make these iterations, the faster you can be than your competitors.
What you want to do is to automate as much as possible, what you want to do as well is to be able to fail fast and cheaply. You want to be able to learn from others, so that means that you want to build API services that make it possible that people are being tricked into supplying you more data.
You probably want to make something like annotation a game. Back in my research days, what we did was pay students to annotate documents. You can still do something like that today with Amazon Mechanical Turk or other click workers. You can train these people, but the annotations will still be noisy, so even there the cost will go up.
You can put out rewards for better data, think about being local scouts on Google Maps. Okay, let's look at the history of a few applications. Anyone know where the image might be coming from?
So that's actually handwritten character recognition task, standard data mining task, standard machine learning task published back in 1995.
Fast forward like 10 years and scientists were trying to learn to complete sentences. Imagine like a help desk scenario, you've got emails coming in and it's often the same questions that are being asked. So you can try to train a model of offset data in order to help these agents to provide better and faster answers.
Fast forward another 10 years and you have end customers swearing at their mobile phones because autocorrect doesn't do what they want. More often than not, though, swipe-to-type actually does work.
Plus you've got tiny little devices that you can talk to and they will answer your questions, hopefully correctly. So what we are looking at is 20 years going from raw character recognition to speech recognition and interaction.
Again like roughly mid 90s what we had was something like little robots trying to play soccer. Make a random guess, when was the first time that a car crossed the desert autonomously?
Anyone? Anyone have a guess? 5 years ago? 2003? This was pretty close, it was 2005.
There was a car driving through Berlin; if you come from Berlin, there is this street which goes from the Siegessäule, which has a large roundabout, to the Brandenburg Gate. When was it? Actual real traffic, capital of Germany. Guess? 2 years ago? 2011.
You can still watch the video on YouTube. It's fairly impressive. So going from there we now got an advertisement for actual autonomous cars and it actually reached public discourse.
So we had 20 years going from indoor robots to autonomous cars let loose on real-world streets, at least in the US.
How did we get there? What we needed was collaboration. If you think back to the RoboCup soccer games, the basis there was that they had multidisciplinary teams and they were sharing the code and the data sets after the yearly competition was over.
The winning team was supposed to share the code so that others could be kick started and could work off of that and could be faster. Essentially what this means also, if you think back to your corporate environment is
that the lone data scientist coming up with a magical idea, that's a lie. Think about the data scientist trying to build a spam classifier who was only boosted after talking to ops people. You will need to have a production understanding in order to understand your data.
You will need to have a production understanding in order to figure out if a drop in data is something meaningful or if it's just an outage. It being an outage is very interesting to your business but probably not to your prediction model.
You will need to have an understanding of data privacy, because as soon as you link data sets together and as soon as you start to acquire more data, you will need to understand whether you are allowed to do that and what kind of notice you have to give to your users. You will need to decide what is better for you: whether you want to ship pre-made models, or whether you want to build services and with these services gain insight into additional data. What also helped was competition. Back in the old days what we had were the UCI data sets; they were fairly limited, and what you learn on these probably doesn't transfer one-to-one to production settings.
It helps to have something like Kaggle where you've got a wide range of data sets and of tasks that you can deal with. It's still not ideal but you want to have some tasks where you can compete on pre-agreed metrics and you want to have teams work on these tasks together.
What also helps is to focus on your customers, focus on the problems that have business value. What are the most valuable problems that you can solve right now?
For that you need more than just mathematical and theoretical understanding; you need business understanding. Looking at the implications: data-intensive, machine-learning-based applications are being deployed in a wide range of settings.
One of them being pages and websites where ordinary people share their thoughts and become what was previously only possible for the press:
someone who shares a message with the wider public, and a message that's not only spoken but written and permanent. So we are going beyond traditional media; we are accelerating the publication process, cycle, and speed.
We are increasing it in terms of reach, so false statements suddenly make it very much faster to the end user. We are dealing with amplification effects.
As engineers what we typically say is that these are just algorithms, but there's someone training them, there's someone deploying them, there's someone tuning them. Plus there's someone selecting the data that's being fed into these algorithms in order to make decisions.
We've seen changes in communications that have had an impact on politics before. Oftentimes they were first exploited before they were regulated or used for good.
Where do we see influence? It's all fun and games until it has real-world impact. Looking at the ecosystem today, these algorithms and these models can impact real human beings' lives; they can have an impact on elections. What your algorithm selects to be shown at the top, to be
shown as a trend, or to be shown to many users can impact how these people think and act. What about training algorithms to predict whether someone should receive an insurance contract or a mortgage?
What if there's implicit racism or implicit bias from prior decisions? Of course this will manifest in the models themselves. There's this rule of garbage in, garbage out: whatever you see in the data that you train with, you will see in the model afterwards. Then suddenly you've got the question whether machine-driven bias is any worse than human bias.
Or is human bias worse? Or is there something that we can do about the bias inherent in these models? Can we somehow deal with our input data and get the bias out of it?
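One hedged sketch of what checking your input data for bias could look like: before training, compare outcome rates across groups in the historical decisions. The groups, labels, and numbers below are all invented purely for illustration.

```python
# Hypothetical historical loan decisions: (group, approved?) pairs.
# Groups, labels, and numbers are invented purely for illustration.
records = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 0), ("B", 0), ("B", 1), ("B", 0),
]

def approval_rates(rows):
    """Approval rate per group: a first, crude check for bias in the labels."""
    totals, approved = {}, {}
    for group, label in rows:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + label
    return {g: approved[g] / totals[g] for g in totals}

rates = approval_rates(records)
print(rates)  # a large gap between groups is a signal worth investigating
```

A model trained naively on such labels will tend to reproduce the gap; spotting it before training is the first step toward correcting for it.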
What about models deployed among judges or among police officers in order to drive their decisions? What about bias in the data sets over there?
Currently autonomous vehicles are being tested in real-world traffic situations. Earlier, when I talked about the cycle of development, I told you that it's good to go through the cycle very fast in order to improve fast, that it's good to have a fail-fast scenario.
We are here in Germany. Imagine a fail-fast scenario in a car driving on the motorway. On the other hand, machine learning can lead to more automation and a different landscape when it comes to jobs and employment.
We've had that in the past as well. So this is just one poem that's famous at least in Germany. And usually first it becomes worse before regulation helps and makes the situation better.
So what we have to think about is do we want machines to entirely replace certain jobs? Or how will these machines change jobs? And what will the transition period look like?
What implications will we have on society? As engineers, and I'm an engineer myself, we believe that we are making purely technical decisions. After all it's just a computer, it's just the programs that we write, right? But these purely technical decisions are starting to have real-life consequences for fellow human beings.
Even worse, there are real-life consequences for human beings if you imagine these same advances being rolled out in the military.
Impacting how warfare is waged. What impact will it have if face recognition enters the military? If other machine learning models are trained to be used in the military?
What type of regulation? Do we want any regulation there? And if so, which type? There's one open letter there, signed by several AI and robotics researchers as well as practitioners, urging the world to regulate AI being deployed in a warfare setting.
So how many of you still believe that AI or machine learning really is magical pixie fairy dust? I hopefully helped dispel this magic.
At the end of the day, machine learning is just a tool. It's a little bit of algorithm, it's a little bit of data that you use in order to train your models. It can be used for good and evil as so many things. At the end of the day, what it's being used for is our decision.
And with that, I'm open for questions. Any questions, comments, thoughts?
Yeah, sorry? Whether I really think it's our decision. I do think to some extent it is our decision. After all, we can go and vote.
And we can talk to people who can vote. You go to your politician and you talk to them about what a regulation should look like.
And they will then have to enforce it. If I can vote, I need something to vote on. There needs to be an election where this is actually a topic.
This will not come in this election. It will not come in the next four years. And by then, it's all decided anyway. So I don't see where we can vote. You can rebel, you can act against it. The other option would be to get organized, like the FSFE did in order to further the cause of free software and free software licenses, and become a lobbyist on that topic.
And draw people together who think like you. Well, then it's not just a decision, then it becomes a lot more than that. Sure. So, the question was about techniques for getting rid of bias in your data.
First of all, you have to analyze your data set.
You have to figure out where this data is coming from and then maybe start to stratify. When I was working at a maps provider, we did something very simple: we had a lot of traffic in, say, Germany,
so our quality metrics were always dominated by the good German performance. What we started to do was look at individual metrics, like how well we are doing within France, or how well we are doing within China if you want to expand to China.
And then look at these metrics individually and start to optimize for these. That would be one option.
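That kind of per-region stratification can be sketched in a few lines. All numbers, the metric, and the country mix below are invented; the point is only that an aggregate score hides weak regions that stratified scores reveal.

```python
# Hypothetical map-quality scores tagged by country. Values, metric, and
# country mix are invented; the aggregate hides the weak regions.
measurements = [
    ("DE", 0.95), ("DE", 0.97), ("DE", 0.96),  # lots of German traffic
    ("FR", 0.70),
    ("CN", 0.55),
]

def overall(rows):
    """Single aggregate score, dominated by whichever region has most data."""
    return sum(score for _, score in rows) / len(rows)

def per_country(rows):
    """Stratified scores that expose the regions lagging behind."""
    buckets = {}
    for country, score in rows:
        buckets.setdefault(country, []).append(score)
    return {c: sum(s) / len(s) for c, s in buckets.items()}

print(overall(measurements))      # looks decent overall
print(per_country(measurements))  # FR and CN clearly lag behind DE
```

Once the per-country numbers are visible, each one can become its own optimization target.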
How would you do it? That's a very hard question. Sorry, I don't have an answer for that.
What's easier is, if you have access to the data, you can start looking at the decisions they make and try to figure out whether there is a pattern. There is a very famous video right now, viral in the social media space, where you see someone with dark skin trying to get soap from a soap dispenser, and it doesn't work.
As soon as they put something white over the hand, it suddenly works. Then apparently there is some kind of bias in there.
So the question was whether agencies employing machine learning are aware of this bias. I'm pretty sure that those who train the models are aware of this bias.
You can't go through machine learning training without being taught that there is bias in data that you have to deal with. I would doubt that the police officer on the street understands the model so deeply that he or she is aware of it, other than through people telling and training him or her.
People tend to be afraid of change and people tend to be afraid of what they don't understand.
I believe that so far the way this works isn't so well known amongst the general public. If I show you the model up there, which is just a line going through space, it's fairly simple actually.
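For intuition on that "line going through space": a minimal perceptron, trained on a tiny invented 2-D data set, does nothing more than learn the coefficients of such a line. This is a sketch for illustration, not how any production system is built.

```python
# A toy perceptron: learn a linear decision boundary w.x + b = 0.
# The four 2-D points and their labels (+1/-1) are made up for illustration.
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), 1), ((1.5, 1.0), 1)]

w, b = [0.0, 0.0], 0.0
for _ in range(20):                                  # a few passes suffice here
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:     # point on the wrong side?
            w[0] += y * x1                           # nudge the line toward it
            w[1] += y * x2
            b += y

# Every training point now lies on the correct side of the learned line.
assert all(y * (w[0] * x1 + w[1] * x2 + b) > 0 for (x1, x2), y in data)
```

The learned model really is just three numbers describing a line; the apparent magic lives in the data and engineering around it.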
But what we tend to show people is what it can do in the end user application. This sounds magical. It's a bit of the tip of the iceberg problem.
They see what works, but they don't see how much work had to go into this model in order to make this one use case work. And this tip of the iceberg looks so magical that building something similar almost looks simple. They don't see the 20 years of research that went into it beforehand.
That's my personal take. Have a public discussion, have more and better education. Yeah?
Regarding the quote from Elon Musk at the beginning: I'm sure it's a tool, but I think the quote is motivated by the idea of a singularity. What's your take on the singularity? Will it come or will it not?
I can't predict the future. I can't predict the future, but given what you've seen in this talk about how it works today, what's your take? When will it come? At a certain point, there might be a point where machines can go autonomous.
All of a sudden their ability to replicate and innovate will exponentially outgrow the same ability of mankind.
If the proposition that this will happen is true, then we will have a problem.
So I don't know if the proposition is true and I would like to assess it. Is there really a danger that that moment will come? So the issue is that as engineers we love talking about theoretical issues. Right now with AI we have real issues out there.
We've got cars driving on machine learning models where there is no law in place how to deal with them. There are machine learning models deployed out there that are capable of influencing our political decisions right now as we talk.
I believe we should rather deal with these problems first before thinking about something theoretical. Probably agree. Maybe you would say it's a pretty major discussion, so we're not yet at that point. In my personal opinion, we're not there yet. We've got way more problems out there to deal with before we go into this discussion.
I totally agree. I did not look into that much detail into this theory about singularity. It's complicated. It's a lovely problem and it's nice to think about it. I believe right now we've got other issues to talk about.
I think some of these thoughts are even too hypothetical, because we limit ourselves even with the technology of today. We have issues with driving cars autonomously in an environment that is built for human drivers. If we support our autonomous cars with technology in the streets, we can even
reduce the distance between the cars and narrow the streets and all of that. But in doing so we lose the ability to drive on our highways ourselves. I think that's the main problem. Not the technique, not the AI, but that we lose the ability to live in our environment independently of machines.
That is a more critical part in my view. What I would like to do in this discussion is to focus on the real use cases we already have out there.
To figure out whether we want to do something there and if so what do we want to do there. Instead of arguing if there is going to be in the distant future something happening. Look at what's out there right now and see what we have to do if we do have to do something.
There we are back to where you need to get involved in politics in order to talk about regulation. Because Google isn't the only one who has this data and who has this power over you. They are the ones making it accessible. There are other companies who have the same data and who have the same models. There are state institutions that have the same data and build the same models.
So it's not only about Google, but it's about the topics that you have to discuss at a wider circle. The question is kind of a follow up to a very recent talk by General Kennedy.
So are you aware of it and how would you get involved in this?
Will European companies be able to solve the problem?
So about data protection being forced on European companies.
Was this an advantage or a disadvantage? Let me turn the question around: maybe it's even an advantage for European companies to be subject to this law, because people can be certain that their data isn't being used for purposes that they didn't agree to. From the perspective of a company that wants to roll out machine learning, of course they also always
want more data in order to build better models and in order to build better services. But from a customer's perspective it's like two sides of a coin. You want those better services that are being based on your data, so you will have to share it somehow.
But on the other hand you want to have control over your data. So maybe it's even an advantage to be able to tell your customers, hey I'm subject to this law, it's being enforced right here, your data is safe with us. Not just because we say so, but because the state enforces us to do so.
The question was about the roadmap of Apache Mahout. Roadmaps go to the mailing list, it's an all volunteer project.
Right now for deep learning most people that are active in Mahout are also active in MXNet. So there are ideas flowing back and forth. But other than that I would ask you to go to the dev list, ask there so that others can benefit as well.
I won't make predictions on behalf of other people there. I don't see any further questions. If I may take the opportunity again: I think this shows that this development around deep learning and AI takes
the responsibility of the engineers who are doing this work, and the ethical implications that the work has, a step further.
I think this is what happened with IT over the last 50 years. It started out as just number crunching, as corporate databases that were not very visible,
and now IT has developed more and more sophisticated systems and is moving into the public realm, where it concerns personal preferences, decisions, and personality. So society gets more involved in the outcomes of what happens.
So this is very important work, it raises very important questions, and I would like to thank Isabel a lot again. Thank you very much.