
An Introduction to Deep Learning


Formal Metadata

Title
An Introduction to Deep Learning
Title of Series
Part Number
166
Number of Parts
169
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Deep learning: how it works, how to train a deep neural network, the theory behind deep learning, recent developments and applications. ----- (length: 60 mins) In the last few years, deep neural networks have been used to generate state of the art results in image classification, segmentation and object detection. They have also successfully been used for speech recognition and textual analysis. In this talk, I will give an introduction to deep neural networks. I will cover how they work, how they are trained, and a little bit on how to get going. I will briefly discuss some of the recent exciting and amusing applications of deep learning. The talk will primarily focus on image processing. If you are completely new to deep learning, please attend T. Rashid's talk 'A Gentle Introduction to Neural Networks (with Python)'. His talk is in the same room immediately before mine, and his material is really good and will give you a good grounding in what I will present to you.
Transcript: English (auto-generated)
The title is An Introduction to Deep Learning. So please welcome Jeff. Good morning, thanks for coming.
OK, I'd like to start by thanking Tariq Rashid for giving his excellent, gentle introduction to neural networks. I'm going to build upon that and hopefully show you how to develop some of the networks that have been used to get the really good computer vision results that we've seen recently. So our focus is mainly going to be on image processing
this morning. And this talk is, I'm going to cover more of the principles and the maths behind it than the code. And the reason is, it's quite a big topic. There's quite a lot to go through. I've got to squeeze it into an hour. So a little less code, but hey, I hope it's useful. So a quick overview of what we're going to go through.
We are going to discuss the Theano library, which is the one I personally use, although there's also libraries like TensorFlow. We're going to cover the basic model of what is a neural network, just building on Tariq's talk. Then we're going to go through convolutional networks. And these are some of the networks that have been getting the really, really good results
that we've seen recently. Then we'll look briefly at Lasagne, which is another Python library that builds on top of Theano to make it easier to build neural networks. We'll discuss why it's there and what it does. And then I'll give you a few hints about how to actually build a neural network, how to actually structure it, what layers to choose, just so you have a rough idea on how
to train them, just a few hints and tips to practically get going. And then finally, time permitting, I'll go through the OxfordNet VGG network, which is a pre-trained network that you can download under Creative Commons from Oxford University.
I'll show how you can use that yourself, because I'll go through why it's sometimes useful to use a network that somebody else has trained for you, and then tweak it for your own purposes. Now, the nice thing is there are some talk materials. This is based off a tutorial I gave at PyData London in May. If you check out the GitHub repo, Britefury/deep-learning-tutorial-pydata2016, you'll find all the notebooks are viewable on GitHub, so you should be able to see everything there in your browser. I would ask, though, that please, please, please do not try and run this code during the talk. And the reason is that when you run the stuff that uses the VGGNet OxfordNet models, that will need to download a 500 meg weights file, and you will kill the Wi-Fi if you all start doing that. So please do that in your own time, if that's OK. Also, if you want to get more in depth about Theano and Lasagne, I've put up some slides. If you check out my Speaker Deck profile, there'll be this talk's slides, and there'll also be an intro to Theano and Lasagne as well. That will give you a breakdown of Python code using Theano and Lasagne, what it does, and how to use it. And furthermore, if you don't have a machine available,
you don't want to set it up yourself, I've set up an Amazon AMI for you. So if you want to go use one of their GPUs, you can go and grab a hold of that and run all the code there. Everything's all set up, and I hope it's all relatively easy to get into. All right, now time to get into the meat of the talk.
And what better place to start than ImageNet? ImageNet is an academic image classification data set. You've got about a million images. I think it might be even more now. They're divided into 1,000 different classes, so you've got various different types of dog, various different types of cat, flowers, buckets, whatever else, whatever you can come up with, rocks, snails.
As for the ground truths, you've got a bunch of images that have been scraped off Flickr, and you've got to provide a ground truth of what each image is. The way all those were prepared is they went and got some people to do it over Amazon Mechanical Turk. Now, for the top five challenge, what you've got to do is produce a classifier that, given
an image, will produce a probability score of what it thinks it is. And you score a hit if the ground truth class, the actual true class, is somewhere within your neural network's (or whatever it is you use) top five choices for what it thinks the image is.
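The top-five scoring rule just described is easy to sketch; here is a minimal NumPy illustration, where the probabilities and class indices are invented for the example and a toy 10-class problem stands in for ImageNet's 1,000 classes:

```python
import numpy as np

def top5_hit(class_probs, true_class):
    """Score a hit if the true class is among the five highest-scoring predictions."""
    top5 = np.argsort(class_probs)[::-1][:5]  # indices of the five largest scores
    return true_class in top5

# Toy example: 10 classes instead of ImageNet's 1,000
probs = np.array([0.02, 0.30, 0.05, 0.20, 0.01, 0.15, 0.10, 0.07, 0.06, 0.04])
print(top5_hit(probs, true_class=3))  # True: class 3 is the 2nd most probable
print(top5_hit(probs, true_class=4))  # False: class 4 is the least probable
```

Dividing the number of misses by the number of test images gives the top-five error rate that the challenge reports.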
And in 2012, the best approaches at the time used a lot of handcrafted features. For those of you familiar with computer vision, these are things like SIFT, HOG, and Fisher vectors. And they stick it into a classifier, maybe a linear classifier. And the top five error rate was around 25%. And then the game changed.
Krizhevsky, Sutskever, and Hinton, in their paper ImageNet Classification with Deep Convolutional Neural Networks, bit of a mouthful, managed to get the error rate down to 15%. And in the last few years, more modern network architectures have gotten it down further. Now we're down to about 5% to 7%. I think people like Google and Microsoft
even got down to three or four. And I hope that this talk is going to give you an idea of how that's done. OK. Let's have a quick run over Theano. Neural network software comes in two flavors, or it's kind of on a spectrum, really. You've got the kind of neural network toolkits that are quite high level at one end, and at the other end, you've got expression compilers.
With a neural network toolkit, you specify the neural network in terms of layers. Expression compilers are somewhat lower level: you describe the mathematical expressions that Tariq covered, the ones behind the layers, which effectively describe the network.
And it's a more powerful and flexible approach. Theano is an expression compiler. You're going to write NumPy-style expressions, and it's going to compile it to either C, to run on your CPU, or CUDA, to run on an NVIDIA GPU if you have one of those available. And once again, if you want to go get an intro to that,
there are my slides there that I mentioned earlier. There's a lot more to Theano, so go check out the deeplearning.net website to learn more about it; that gives you the full description of the API and everything it'll do, some of which you may want to use. There are, of course, others. There's TensorFlow, developed by Google,
and that's gaining popularity really fast these days, so that may well be the future, we'll see. Okay, what is a neural network? Well, we're going to cover a fair bit of what Tariq covered in the previous talk, but it's got multiple layers, and the data propagates through each layer, and it's transformed as it goes through.
So, we might start out with our image of a bunch of bananas. It's going to go through the first hidden layer and get transformed into a different representation, and then get transformed again to the next hidden layer, and finally, we end up with, assuming we're doing an image classifier, we end up with a probability vector. Effectively, all the values in that vector
will sum up to one, and our predicted class is the element in the probability vector that has the highest probability. Okay, and this is what our network kind of looks like. We see there are weights, as you saw in the previous talk,
that connect all the units between the layers, and you see our data being put in on the input and propagating through and arriving at the output. Breaking down a single layer of a neural network, we've got our input, which is basically a vector, an array of numbers. We multiply by our weights matrix,
which are the crazy lines, and then we add a bias term, which is simply an offset. You just add a vector, and then you have our activation function or non-linearity. Those terms are roughly interchangeable, and that layer activation is what then goes into the next layer, or to the output if it's the last layer in the network.
Mathematically speaking, X is our input vector. Y is our output. We represent our weights by their weights matrix. That's one of the parameters of our network. Our other parameter is the bias. We've got our non-linearity function, and normally these days, that's ReLU,
rectified linear unit. It's about as simple as they come. It's simply max of X and zero. That's the activation function that's become the most popular recently. In a nutshell, Y equals F of W X plus B, repeated for each layer as it goes through,
and that's basically a neural network. Just that same formula repeated over and over once for each layer. And to make an image classifier, we're gonna take the pixels from our image, we're gonna splat them out into a vector, to stretch them out row by row,
run them through the network, and get our result. So in summary, our neural network is built from layers, each of which is a matrix multiplication, then our bias, then our plan on linearity. Okay, and how to train a neural network.
We've got to learn values for our parameters, the weights and the biases for every layer. And for that, we use back propagation. We're gonna initialize our weights randomly, there'll be a little more on this later. We're gonna initialize the biases all to zero. And then for each example in our training set,
we want to evaluate, as Tariq said, we've got to evaluate on our network's prediction, see what it reckons the output is, compare it to the actual training output, what it should produce, given that input. We've got to measure our cost function, which is roughly speaking the error.
That's the difference between what our network is predicting and what it should predict, the ground truth output. Now, the cost function is kind of important, so we'll just go and discuss that a little bit. For classification, where the idea is given an input, and a bunch of categories, which category best describes this input.
Our final layer, we use a function called softmax, as our non-linearity or activation function, and outputs a vector of class probabilities. The best way of thinking about it is, let's say I've got a bunch of numbers and I sum them all up,
and then I divide each element by the sum. That'll give us roughly the proportion or a probability, assuming all of our numbers to start with are positive. But they can also go negative in a neural network, so the softmax adds one little wrinkle. What we do is we take our input numbers, we compute the exponent of them all,
and then we sum them up and we divide the exponent by the sum of the exponents, that's softmax. And then our cost function, our error function, is negative log likelihood, also known as categorical cross-entropy. To do that, you've got to take the log of, let's just say you have an image of a dog. You run the image through the network,
you see what the predicted probability is for dog. You take the log of that probability, which is gonna be negative, if it's predicted probability is one, the log of that's gonna be zero. If it's like 0.1, it's gonna be quite strongly negative. You negate that log, and so the idea is, if it's supposed to output dog,
it should give a probability of one. If it's giving a probability of less than that, the log, the negative log of that will be quite positive, which indicates high error. So that's your cost. Now, regression is different. Rather than classifying an input and saying, which category closely matches this, you're trying to quantify it.
You're measuring the strength of something or the strength of some response. Typically, with that, your final layer doesn't have an activation function. It's just identity, linear. And your cost is gonna be sum of squared difference. Then, what we've gotta do with our neural networks, we've gotta reduce the cost, reduce the error using gradient descent.
And what we have to do is we have to compute the derivative, the gradient of the cost, with respect to our parameters, which is all our weights and all our biases within our layers. The cool thing about it is, Theano does the symbolic differentiation for you. I can tell you right now
that you don't wanna be in a situation where you have this massive expression for your neural network, and you've got to go and compute the derivative of the cost with respect to some parameter by hand, because you will make a mistake. You will flip a minus sign somewhere, and then your network won't learn, and debugging it will be a goddamn nightmare, because it'll be really hard to figure out
where it's gone wrong. So, I would recommend getting a symbolic mathematical package to do it for you, or use something like Theano that just handles it all, and literally you write that code there. The cost by the weight is Theano grad cost weight, and other toolkits do this as well, just to save you time and sanity.
Then you update your parameters. You take your weights, and you subtract the learning rate, which is lambda, multiplied by the gradient, and I'd generally recommend that learning rate should be somewhere in the region of one times 10 to the minus four,
to one times 10 to the minus two, something in that region. You're also gonna, you typically don't train one example at once. You're gonna take what's known as a mini-batch of about 100 samples from your data set. You're going to compute the cost of each of those samples,
average all the costs together, and then compute the derivative of that average cost with respect to all of your parameters, and then the idea is you end up with an average, and that, the idea is that it means that you get about 100 samples processed in parallel, and that means when you run it on a GPU, that tends to speed things up a lot, because it uses all of the parallel
processing power of a GPU. Training on all the examples in your entire training set is called an epoch, and you often run through multiple epochs to train your neural network, something like 200 or 300. So, in summary, take a mini-batch of training samples,
evaluate, run them through the network, measure the average error or cost across the mini-batch, and use gradient descent to modify the parameters to reduce the cost, and repeat the above until done. All right. Multi-layer perceptron. This is the simplest neural network architecture,
and it's nothing we haven't seen so far. It uses only what are known as fully connected, or dense layers, and in a dense layer, each unit is connected to every single unit in the previous layer,
and to carry on, to pick up from Tariq's talk, the MNIST handwritten digits data set is a good place to start. A neural network with two hidden layers, both the 256 units, after 300 iterations, gets about 1.83% validation set error.
So it's about 98.17% accuracy, which is pretty good. However, these handwritten digits are quite a special case. All the digits are nicely centered within the image. They're roughly the same position, scaled to about the same size,
and you can see that in the examples there, and our fully connected networks have one weakness. There's no translational invariance. If, imagine, you wanna, like, okay, take an image and detect, it's gotta detect a ball somewhere within the image. What it effectively means is you, it'll only learn to pick it up, pick up the ball in the position
where it's been seen so far. It won't learn to generalize it across all positions in the image, and one of the cool things we can do is, if we take the weights that we learn, and we say, take one of the neurons or one of the units in the first hidden layer, and take the strengths of the weights that link them to all the pixels in the input layer,
and visualize that that's what you end up with. So you see that your first hidden layer, the weights are effectively formed with a bunch of little feature detectors that pick up the various strokes that make up the digits. So it's kind of cool to visualize it, but that shows you how the dense layers are translationally dependent.
And so for general imagery, like say if you wanna detect cats, dogs, various eyes and everything that makes up the various little creatures and all the various things, you've gotta have a training set large enough to have every single possible feature in every single location of all the images, and you've gotta have a network that's got enough units to represent all this variation.
Okay, so you're going to have a training set in the trillions, a neural network with billions and billions of nodes, and you're going to need enough about all the computers in the world and the heat death of the universe in order to train it. So moving on, convolutional networks is how we address that.
Convolution, it's a fairly common operation in computer vision and signal processing. You're gonna slide a convolutional kernel over the image, and what you do is you imagine, say, the image pixels are in one layer. You're gonna take your kernel,
which has got a bunch of little weights, a bunch of little values, and you're gonna multiply the value in the kernel by the pixel underneath it for all the values in the kernel, and you're gonna take those products and sum them all up, and you're gonna slide the kernel over one position, do the same, slide it over, do the same,
and what you end up with is an output, and, well, they're often used for feature detection, so a brief detour, Gabor filters, if we produce these filters which are a product of a sine wave and a Gaussian function,
you end up with these little, sort of, these little waves, these little soft circular wave things, and if you do the convolution, you'll see that they act as a feature detector that detects certain features in the image, so you can see how it roughly corresponds. You can see the ones with the vertical bars there roughly pick out the vertical lines in the image of the bananas. The horizontal bars pick out the horizontal lines,
and you can see how convolutions act as a feature detector, and they're used quite a lot for that, so back on track to convolutional networks, back, we'll have a look for a quick recap. That's what our fully connected there looks like with all of our inputs connected to all of our outputs.
In a convolutional layer, you'll notice that the node on the right is only connected to a small neighborhood of nodes on the left, and the next node down is only connected to a small corresponding neighborhood. The weights are also shared, so it means you use the same value
for all the red weights and for all the greens and for all the yellows, and the values of these weights form that kernel, that feature detector, and for practical computer vision, whether you're producing the kernels manually or learning them like in a convolutional network, more than one kernel has to be used
because you've got to extract a variety of features. It's not sufficient just to be able to detect all of just the horizontal edges. You want to detect the vertical ones and all the other various orientations and sizes as well, so you've got to have a range of kernels, so you're going to have different weight kernels, and the idea is you've got an image there with one channel on the input
and about three channels on the output, or what you might find in a typical convolutional network, you might actually have about 48 channels or 256. I'll show you some examples later of some architectures, and you end up with some very high dimensionality in the sort of channels output, but okay.
So each kernel connects to the pixels in all channels in the previous layer, so it draws in data from all channels in the previous layer. However, the maths is still the same, and the reason is is because a convolution can be expressed as a multiplication by a weight matrix. It's just that the weight matrix is quite sparse,
but the maths doesn't really change as far as conceptually, and that's fortunate for us because it means that the gradient descent and everything we've done so far just still works. As for how you go about figuring that out, I'd just recommend letting Sienna do that for you. I wouldn't do it myself. I wouldn't recommend it.
There's one more thing we need, downsampling. So typically if you've worked in Photoshop or GIMP or any of these other image editing packages, you might want to shrink an image down by a certain amount, say by 50%. You want to shrink the resolution, and for that we use two operations, either max pooling or striding.
Max pooling, what you're gonna do is you can see that little image up there is divided into four colored blocks. Say the blue block has four pixels. What we do is we take those four pixels, we pick the one with the maximum value, and we use that. So rather than averaging, we just take the maximum, and that's max pooling. And it downsamples the image by the fact
that PFP is the sort of size of the pooling, and it operates on each channel independently. The other option is striding. And what you do there is you effectively pick a sample, skip a few, pick a sample, skip a few. It's even simpler. It's often quite a lot faster
because what you can do is a lot of the convolution operations support strided convolutions, where rather than taking, sort of producing the output and throwing some away, they just effectively jump over by a few pixels each time. So that's faster, and you get similar results. So, moving on.
Jan LeCun used convolutional networks to solve the MNIST dataset in 1995, and this is a simplified version of his architecture. What you've got is you've got 20 kernels. You've got this 28 by 28 input image,
one channel because it's monochrome. You've got 20 kernels, five by five. So they reduced the image to 24 by 24, but it's now 20 channels deep. Max pool, shrink it by half. Then we have 50 kernels, five by five. And now we've got a 50 channel image, eight by eight. Max pool, shrink it by half.
And then we flatten it, and you do a fully connected dense layer to two, five, six units. And finally, fully connected to our 10 unit output layer for our class probabilities. After 300 iterations of the training set, we get 99.21% accuracy. 0.79% error rate, it's not too bad.
And what about the learned kernels? It's interesting to think about what the feature detectors it's picking up. So if we look at, say, a big dataset like ImageNet, this is the Krizhevsky paper that I mentioned right at the beginning. These are the kernels that get learned by the neural network, and for comparison, you can see Gabor filters over there. Now, the reason the color ones are at the bottom
is just because of the way they did the actual thing involving two GPUs and the way they split it up. But if you look at the top row, you can see how it's picked up all these very little edge detectors of various sizes and orientations. That's the first layer. Zyla and Fergus took it a little further, and they figured out a way of visualizing the kernels,
how they respond to the second layer. So you can see you've got kernels there that respond to various sort of slightly more complex features, things like squares and curved texture, little sort of eye-like features or circular features. And then further up, on about layer three, you get somewhat more complex features still,
where you've got things that recognize simple parts of objects. Okay, so this gives you an idea of roughly how the convolutional networks fit together. They operate as sort of feature detectors where each layer builds on the previous one, picking up ever more complex features.
Okay, now I'll move on to Lasagna. If you want to specify your network using the mathematical expressions using Theano, it's really powerful, but it's quite low level. If you have to write out your neural network as mathematical expressions and numpy expressions each time, it could get a bit painful. Lasagna builds on top of it,
and it makes it nicer to build networks using Theano. And its API, rather than just allowing you to specify mathematical expressions, you can construct layers of the network, but you can also then get the expressions for the output for its output or loss. And it's quite a thin layer on top of Theano, so it's worth understanding Theano.
But the cool thing about it is, if you have one of these mathematical expression compilers, if you want to come up with some crazy new loss function or do something new and crazy, whatever it is you like, or you want to be sort of inventive, you can just go write out the maths and let Theano take care of figuring out how to run that using CUDA, using NVIDIA's CUDA,
so you don't have to worry about it yourself. It's quite easy to get going. You just do it all in Python, and it all just works great. So that's why I happen to like it. And once again, slides are available if you want to go and dive in more detail. Okay.
As for how to build and train neural networks, I think we'll start out with a bit about the architecture. If you want a neural network to, if you want to get a nice neural network that's gonna work, I'm gonna try and give you some rough ideas of what kind of layers you wanna use and where in order to get something that's gonna give you good results.
So your early part of the network is gonna be, just after your input layer, is going to be blocks that are gonna consist of some number of convolutional layers, two, three, four convolutional layers, followed by a max pooling layer that effectively downsamples. Or alternatively, you could also use striding as well.
And then you have another block the same. And you'll note that the notation is, that's quite common in the academic literature, is you specify the number of filters, the number of kernels, and then the three specifies the size. So often you use quite small filters, only three by three kernels.
MP2 means max pooling downsamples, factor two. And notes that after we've done the downsampling, you double the number of filters in the convolutional layers. And then finally at the end, after your blocks are convolutional on the max pooling layers, you're gonna have the fully connected or also known as dense layers,
where you'll typically, if you've got quite a large resolution coming out of there, you'll want to work out what the sort of dimensionality is at that point, and then roughly maintain that or reduce it perhaps a bit in your fully connected layer. You could have two or three fully connected layers if you like, and then finally you've got your output.
And there's the notation for fully connected layers. Does that just mean 256 channels? Okay, so overall, as discussed previously, your convolutional layers are gonna detect the features in the various locations throughout the image. Your fully connected layers
are gonna pool that information together and finally produce the output. There are also some architectures. You could look at the Inception networks by Google or ResNets by Microsoft for inspiration if you want to go and have a look at what some other people have been up to. Go on slightly more complex topics.
Batch normalization. It's recommended in most cases. It makes things better. It's necessary for deep networks. By the way, I should tell you, deep learning neural networks, a deep neural network is simply a network of roughly more than four layers. That's all it is. That's what makes them deep. And so if you want particularly deep networks of more than eight layers, you'll want batch normalization, otherwise they just won't train very well.
It can also speed up training because your cost drops faster per epoch. Although it can take more, each one can take it longer to run. You can reach lower error rates as well. The reason why it's good is sometimes you've got to think about the magnitude of the numbers.
You might start out with the magnitude, the numbers of a certain magnitude in your input layer, but that magnitude might be increased or decreased by multiplying by the weights to get to the next layer. And if you stack a lot of layers on top of each other, you can find that the magnitude of your values either exponentially increases or exponentially shrinks towards zero.
Either one of those is bad. It screws the training up completely. So batch normalization, it standardizes it by dividing by the standard deviation, subtracting the mean after each layer. So you want to insert it into your convolution of fully connected layers after the matrix multiplication, but before adding the bias and before non-linearity.
But the nice thing is Lasagna with a single call does that for you, so you don't have to do too much surgery yourself on the neural network. Dropout. It's pretty much necessary for training. You don't use it at train time, but you don't use it at prediction and test time when you actually want to run a sample
through the network to see what its output is. It reduces what's known as overfitting. Overfitting is a particularly horrific problem in machine learning. It's gonna bite you all, it's gonna bite you all the time in machine learning. It's what you get when you train your model on your training data. It's very, very good at the samples
that are in your initial training set, but when you want to show a new example that's never seen before, it just dies, it fails completely. So, essentially what it means is it gets particularly good at those examples. It picks out features of those particular training samples and fails to generalize.
So, dropout combats this. What you're gonna do is you're gonna randomly choose units in a layer and you're gonna multiply a random subset of them by zero, usually around half of them. And you're gonna keep the magnitude of the output the same by scaling it up by a factor of two.
And then doing test predict, you just run as normal with the dropout turned off. You're gonna apply it after the fully connected layers. Normally you don't bother, you can do it after the convolutional layers as well, but the fully connected layers towards the end is normally where you apply it. That's how you do it in lasagna.
And to show you what it actually does, this is with your dropout turned off, so you see all the outputs going through. Those little diamonds represent our dropout, so we take half of them, we pick them and turn them off, and you see the gray weight lines. What that effectively means is when doing training, the back propagation won't affect those weights because the dropout kills them off.
And then the next time around, you turn off a different subset of them and furthermore. And the reason it works is it causes the units to learn a more robust set of features rather than learning to sort of co-adapt and develop features that are a bit too specific to those units.
So that's roughly how it sort of combats overfitting. Data set augmentation. Because training in neural networks is notoriously data hungry, you want to reduce the overfitting and you need to enlarge your training set. And you can do that by artificially modifying your existing training set
by taking a sample and modifying it somehow and adding that modified version to the training set. So for images, you're just gonna take your image, you're gonna shift it over by a certain amount or up and down by a bit. You're gonna rotate it a bit, you're gonna scare a little bit, horizontally flip it. Be careful of that one.
So for example, if you've got images of people and you've vertically flipped and so they're upside down, that will just screw up your training set. So when you're doing data set augmentation, you've got to think about what you need from your data set and what it should output and think about whether your transformations are a good idea. Okay, and finally, data standardization.
Neural networks train more effectively when your data set has a mean of zero, all the values are a mean of zero, and unit variance or standard deviation of one. And also with regression, you want to standardize your input data,
and with regression, you want to standardize the output. Remember that in regression, we are quantifying something so we're producing real valued outputs. You want to make sure that's standardized as well. I've personally found that. I've been bitten when I haven't done that. But when you use your network, when you deploy it, don't forget to do the reverse of the standardization
to get it back into the space that you, back into the sort of scale and range that you want it to be in the first place. And to do that standardization, you extract all the samples into an array, and in the case of images, you're just gonna go through all the images and extract all the pixels and splat them out into a big long array,
keep all the RGB channels separate, and you're gonna compute the mean and standard deviation in red, green, and blue, and you're gonna zero the mean by subtracting it and then divide by the standard deviation and that's standardization. Okay. When training goes wrong, as it often will,
you'll notice, what you wanna do is, as you train, you wanna get an idea of what the value of your loss function is. When it goes crazy and starts heading towards 10 to the 10 and eventually goes nan, everything's gone to hell.
So you gotta track your loss as you train your network so you can watch for this. Okay. If you have the error rate equivalent of a random guess, like it's just throwing a dial, throwing a coin, it's not damn well learning anything. And essentially, it's learning to predict
a constant value a lot of the time. Sometimes it just, there isn't enough data for it to pick up the patterns. It can also learn to predict a constant value, let's say, for instance, that you have a data set where, say, you've got them divided into, say, 10 classes, but say, the last class only has
about 0.5% of the examples. Now, one of the best ways, the sneaky, horrid little neural network will figure out a way to cheat you is to simply never predict that last class because it's only gonna be wrong in 0.5% of the cases and that's actually a pretty good way of getting the loss down to a pretty low value
by concentrating on all the other classes and getting those right. And the problem is it's a local minima, there's a local minimum of the, you can think of it as a local minimum of your cost function and neural networks get stuck in those a lot and it will be the bane of your existence. They most often don't learn what you expect them to
or what you want them to. You'll look at it and think, as a human, I know the result is this and the neural network will learn to pick up features and detect something quite different and so, yeah, local minima of the bane of your existence and I'm gonna illustrate this with a really nice, cool example
that is available online. I'm gonna talk about how you design a computer vision pipeline using neural networks. With a simple problem like handwritten digits, you could just throw it at a neural network, one neural network and it'll do it, great, wonderful. But for some more complex problems, they're often just not enough and neural networks are not a silver bullet
so please don't believe all the hype that's around deep learning right now. It's theoretically possible to use a single neural network for a complex problem if you have enough training data which is often an impractical amount. So for more complex problems, you gotta break the problem down into smaller steps and I'm gonna talk a bit about
Felix Lau's second place solution to the Kaggle competition on identifying right whales. So his first naive solution was to train a classifier to identify individuals so I'm gonna pull up his website and okay, cool, okay.
So effectively, these patterns on the head of the whale is what you use to identify an individual and the challenge is to pick out, figure out, given an image of a whale, figure out which individual it is and this is the kind of image you get in the training set. You've got the ocean surrounding a little whale as he breaches, as he pokes his head over the surface
and you gotta figure out who he is from that picture. So Felix's first solution was just to stick that through a classifier and see what happens. So let me scroll and find out, okay. Baseline naive approach, here we go and what he found is that it gave no better
than random chance, hmm. So what he then did is he used what's called saliency detection where he used a trick to figure out which parts of the image are influencing the network's output the most and he found out that actually, bits of the ocean were affecting it. Why would it do that?
Okay, try a thought experiment. I want you to imagine that I give you this problem. You've got a bunch of images of right whales and I say that's number one, that's number seven, that's number 13 but you've also been given really, really horrendous, horrible amnesia that has completely wiped your mind of the concept of what a whale is, what the ocean is,
of just about every human concept you have. So you are literally starting out with images. There's zero knowledge at all, no semantic knowledge about the problem. You can't even guess what it is. You're just given images and given numbers and then told from this training set, figure out what these are. How are you gonna make that decision? Is it the ocean? Is it the whale?
What part of the image is actually helping you make that decision? And when you think about it from the perspective of a neural network, that's where every neural network is starting out from. It's starting out from zero knowledge and that's why the initial solution didn't work very well. You could do if you had a billion images with all the ground truths and the whale, the marine biologists have gone, you know,
hand classified a billion images of them and put in enormous amounts of human effort because then the signal will eventually come through the noise. But we can't practically do that in real life. So his solution, his, yeah, I mentioned the region-based saliency
so found out they had locked onto the wrong features. So he trained what's called a localizer. Now, I've told you about classifiers and regressors. Localizers, what they do is they look at image and they find my target point of interest is over there in the image. And so what he did is he found,
he got the localizer to take that image of the whale and found out that the head is there. And after that, he ran it through the classifiers. The idea is he first gets, trains a network to look for whale, pick it out, crop it out from the image and then just work on that piece. And then furthermore, he trained a key point finder.
Whale had a liner, here we go. He trained a key point finder to find the front of the head and the back of the head. So he could then take the image of the whale and rotate it so that they're all in the same orientation and position. And after that, having got sort of really uniform images
of whales, he could then run it through the classifier. And eventually, that train the classifier on oriented and crop whale head images got him second place in the Kaggle competition. So I think that's kind of a nice illustration of how you gotta be careful how you use these things.
All right, how am I doing for time? Great, okay. Might even have a bit of extra to go through a few extra things. Never know, we'll see, we'll see, we'll see. So, OxfordNet VGGNet and transfer learning.
Using a pre-trained network is often a good idea. The OxfordNet VGG19 is a 19-layer neural network. It was trained on that big million image data set called ImageNet. And the great thing is they have generously made the neural network weights file available under create a commons license with CC attribution.
And you can get it there. There's also a Python pickled version that you can grab hold of as well. They're very simple and effective models. They consist of three by three convolutions, max pooling and fully connected layers. That's the architecture.
And if you wanna classify an image of the VGG19, I shall show you an IPython notebook that will do that. All right, so we're gonna take an image to classify, which is our little peacock here.
So we've got our little peacock we're gonna classify. We're going to load in our pre-trained network. I think I'd better skip over the code a bit. This is gonna be a bit too dull,
but effectively, you can go through the notebook yourself. It's on the GitHub. I hope you don't mind if I spin through this quite quickly. So we're just going through a bit about what the model's like. Okay, so this is where we actually build our architecture. So you can see the input layer. We've got our convolutional layers, max pooling. This is all the Lasagne API.
We'll skip all this. We'll go down. Finally, we've got our output, which has got a softmax non-linearity. There you go. Build it. We're gonna drop all our parameters in. Okay, sorry, this is originally from my tutorial. So anyway, finally, we show our image that we're gonna classify,
and we predict our probabilities here. And we notice the output is a vector of a thousand probabilities. And we find out that the predicted class is 84 with probability 98.9%, which is a peacock. And you can run that yourself, and you can find out that it'll work. So the cool thing is you can take the pre-trained network and just use it yourself.
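That last step — turning the network's thousand raw scores into probabilities and picking the winner — can be sketched like this. The logits here are a toy stand-in for the network's final layer output, with class 84 chosen just to mirror the peacock example:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: turns raw scores into probabilities."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Toy stand-in for the final-layer output over ImageNet's 1000 classes.
logits = np.zeros(1000)
logits[84] = 10.0            # pretend class 84 ('peacock') scores highest

probs = softmax(logits)      # a vector of 1000 probabilities summing to 1
pred = int(np.argmax(probs)) # the predicted class index
```

The predicted class is just the index of the largest probability, which is what the notebook reads off.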
And transfer learning is a cool trick, and this is the last trick I wanna show you. Training a neural network from scratch is notoriously data-hungry: you need a ton of training data. And preparing all that is time-consuming and expensive. What if we don't have enough training data to get good results? We don't have the money to prepare it. Well, the ImageNet dataset is really huge. Millions of images with ground truths. What if we could somehow use the ImageNet dataset
with all its vast data to help us with a different task? Well, the good news is we can. And the trick is this. Rather than try and reuse the data, you train a neural network like VGG19, or you download VGG19, retain part of that network, throw away the end part of it,
and stick some new stuff on the end that will output what we want. And that way, you effectively train just the bit that you've added, and then fine-tune it at the end. I'll go over that. But essentially, what you can do is you can reuse part of VGG19
to, say, classify images that weren't in ImageNet, for classes and kinds of object category that weren't in ImageNet. You can reuse it for localization, where you wanna find the location of an object, like the location of that whale head, maybe. Or segmentation, where you wanna find the exact outline of the object's boundary.
And to do transfer learning, what we do is we're gonna take VGG19 that looks like that. Those are all our layers. We're gonna chop off those last three. The stuff on the left just gets hidden so we can show some text. But we chop off those last three layers,
and then we create our new ones, randomly initialized, on the end. Then what we do is train the network with only your training data, but you're only gonna train the parameters of the new layers that you've created. And then you fine-tune it, where you train parameters on all the layers: having trained those initial new ones,
you then fine-tune the whole lot. You just do training this time, updating the parameters of all layers, and this will get you some better accuracy. And the result is a nice, shiny new network with good performance on your particular target domain, somewhat better than you could get starting out with just your own data set.
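The two-phase recipe can be sketched as a skeleton like this. Everything here is a toy stand-in — the layer names, the plain vectors in place of real weight tensors, and `fake_grads` in place of real back-propagated gradients — the point is just the freeze-then-fine-tune structure:

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy stand-in for a network: pre-trained layers plus freshly added ones.
params = {
    'conv1': rng.randn(8), 'conv2': rng.randn(8),     # pre-trained, kept
    'new_fc': rng.randn(8), 'new_out': rng.randn(8),  # new, randomly initialized
}

def fake_grads(params):
    """Stand-in for real back-propagated gradients."""
    return {name: np.ones_like(p) for name, p in params.items()}

def sgd_step(params, grads, trainable, lr):
    """Update only the parameters whose names are listed in `trainable`."""
    for name in trainable:
        params[name] -= lr * grads[name]

conv1_before = params['conv1'].copy()

# Phase 1: train only the new layers; the pre-trained ones stay frozen.
for _ in range(10):
    sgd_step(params, fake_grads(params), ['new_fc', 'new_out'], lr=0.1)

assert np.array_equal(params['conv1'], conv1_before)  # frozen layers untouched

# Phase 2: fine-tune everything, typically with a smaller learning rate.
for _ in range(10):
    sgd_step(params, fake_grads(params), list(params), lr=0.01)
```

Training the new layers first, while the old ones are frozen, stops large random gradients from the freshly initialized layers wrecking the carefully pre-trained weights; the gentle fine-tune then nudges the whole network toward your domain.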
Okay, so finally, some cool work in the field that might be of interest to you. Zeiler and Fergus, I think I mentioned this briefly already: "Visualizing and Understanding Convolutional Networks". They decided to visualize the responses of the convolutional layers to various inputs.
So you've seen these images where they visualize what's going on. If ever you wanna find out what your network's picking up, this is a good place to look. And these guys decided to figure out if they can fool a neural network. So they generate images that are unrecognizable to human eyes, but recognized by the network.
So for instance, the neural network has high confidence that that is in fact a robin. It looks like horrible noise, but it thinks that's a cheetah, that's an armadillo, that's a peacock, really. I can't see a peacock there.
They then went on to effectively say, well, how can we generate images that do sort of make sense to a human? That's a king penguin. That's a starfish. And you can kind of see where it's picking things up. It's looking for texture, but it's not really looking for the actual structure of the object. So it's picking up certain things and ignoring other quite important features.
You can run neural networks in reverse. You can get them to generate images as well as classify them. So these guys decided to make one generate chairs. They give the orientation, the design, the color and the parameters of the chair, and it's trained to generate an image. So you end up with these chairs, and they're even able to morph them.
This one got a lot of press: "A Neural Algorithm of Artistic Style". And if you've got the Prisma app on iPhone, you'll know what that's all about. They took OxfordNet, and they extract texture features from one image and apply them to another.
So you take a photo of, say, this waterfront, and you take a painting, like, say, Starry Night by Van Gogh, and it repaints the photo in the style of Van Gogh, or in the style of Edvard Munch's The Scream, or any of these others. It's very, very cool. And the nice thing is there are iPhone apps that do this now.
And what these guys did is, this is a bit of a masterpiece of work. They've generated, these images of bedrooms are generated by a neural network, and the way they did it is they trained two neural networks, one to be a master forger, and the other to be the detective. The master forger tries to generate an image, and the detective tries to tell,
is that a real image of a bedroom, or is that one that's been generated by the forger? And the idea is you co-adapt them to get them both better, so that the master forger gets better and better until it generates pictures like that, which is kinda cool. And they even took it further by combining some of the latent parameters: if you've seen some of the results from the sort of king minus man plus woman equals queen stuff that's been done with the Word2vec models, they found similar things with facial expressions as well. Anyway, I hope you found this helpful. I hope it's been good.
You've been a great audience. Thank you very much. We have about nine minutes now for the questions.
It was a great talk, thank you. I have actually several questions. The first one: when you are modeling a neural network, how do you choose, or is there a way to choose, how many hidden layers and neurons are in them? Because I know that was an issue for me
when I was modeling some. I'm not aware of any particular rule of thumb for choosing how to design your network architecture. The rule of thumb I use is to look at things that have worked for other people and build off that. So, the OxfordNet architecture: people find the small convolutional kernels work well, a few of those layers followed by max pooling or striding, and those blocks repeated. I think there are some people who've probably tried things like grid search, where you automatically alter the architecture, but given the fact that for something like an ImageNet model your training time can extend into days or even weeks, even on really big GPUs, that can be impractical. So I'm afraid to say, it's just rule of thumb as far as I know. Just try it out and see if it works. Well, I would look up the literature, see what other people have done, and just adapt it. I'm sorry, I can't give you more information on that.
My second question: we saw that you guys are analyzing images and numbers. Is there a way that you can use strings as input and recognize patterns in them? How would you do that? Would you have to transform them somehow?
I mean for text processing. Yeah. I think what people tend to do is use something like Word2vec to convert each word into an embedding, which is like a 200 or 600 element vector, and then use what's called a recurrent neural network, where rather than just having it go through to the output, it goes partially through and then feeds back into an earlier layer. So then it sort of has an idea of time. I've not implemented those models. I'm afraid I'm outside my comfort zone there in terms of being able to advise you. But look at recurrent neural networks. But yeah, they tend to use word embeddings to convert the words into a sort of vector.
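The two word-to-vector representations the answer mentions can be sketched like this. The vocabulary, dimension and random embedding matrix are toy stand-ins; a real model would load trained Word2vec vectors:

```python
import numpy as np

vocab = {'the': 0, 'cat': 1, 'sat': 2}   # toy vocabulary
embedding_dim = 4                        # real models use e.g. 200-600
rng = np.random.RandomState(0)
embeddings = rng.randn(len(vocab), embedding_dim)

def embed(word):
    """Look up the dense vector for a word (a stand-in for Word2vec)."""
    return embeddings[vocab[word]]

def one_hot(word):
    """The sparser alternative: all zeros except a one for this word."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v
```

The dense lookup gives every word a small, information-rich vector; the one-hot version grows with the vocabulary and is almost entirely zeros.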
The more trivial way of doing that is to just turn it into a one-hot representation, where if you've got 2,000 words in your vocabulary, you represent a word with a vector of all zeros except a one in the position for the particular word it is. But given the sparsity of that input, that often causes problems, which is why they use the embeddings. And the last question, sorry. Could you train a neural network to do, like, math, like addition, maybe multiplication? And if you can, would it be maybe faster than the usual way that the processors are doing it?
You can train one to do addition. I think actually there are some people who've managed to take the MNIST data set, where you take two handwritten digits in an image, it figures out what they are and then is trained to produce the sum. It can work. Multiplication, they don't do. Actually, people haven't figured out how to get one to do that. So the models actually can't extend to certain things,
which is interesting. So there are certain things they just don't do very well. Oh well. So I think it's quite limited. But as for would it be faster? No way would it be faster, no. Because you're using a hell of a lot of mathematical operations just to do something that is a one-instruction operation on your processor. So, sure.
Thanks for the talk. Really, really interesting, and great stuff at the end around the images. What are your thoughts on how neural networks could be applied to text analytics? Because most people don't do that. Text analytics. It's outside my area, so I don't know. But I would speak to Katharine Jarmul. She's here, and she gave a very, very good talk: a really, really good intro and overview of what the text processing world is like. And she gave quite a few neural network models. Neural networks are some of the best models for it now. But it's outside my area of expertise, and she knows her stuff on that, so I'd speak to Katharine Jarmul. Any other question?
The name of neural networks comes from the science of the brain. Do you know if it's used widely in brain science? Not sure. I think that the model we use for the neural networks I've been talking about here is quite different from how neurons in the brain work. My very, very basic layman's understanding of brain neurons is that they operate on spike rate. So they generate output spikes, and it's the frequency of those that is roughly the strength of their output, I think, I don't know. So I don't think that they're that much alike. I think where the similarity is, is that people looked at how neurons in the brain are all hooked up to each other and said, how can we make something that models this? But what we've got is something that seems to work well given our processors
and seems to produce very good pattern recognition. But as for similarities to the brain beyond that, I don't feel comfortable saying any more. Any other questions? Hi, have you heard of self-driving cars using deep learning to implement how they drive?
I wonder how they would update the cost function because it's a stream of video rather than a fixed static output. I've heard about it. I'm not sure how the hell they're doing it. I don't know.
I suppose, if you were to try and do something like that, one of the things you could do is prepare a bunch of footage where you say the human who's driving this car has done well, as in they haven't crashed it or killed anyone or done something like that. So the idea is that all of that's good,
and maybe if there is some footage of some accidents, you say that's bad, don't do that. Or what you probably want to do is say: given this video, have these outputs, as in steer like this, accelerate and brake like this, produce these decisions.
So that's actually a little bit like the Atari game playing neural networks that Google developed, the stuff where they got really good scores on the video games, where they take the input, the screen, and they decide whether to move up, down, left, right, shoot. But it's a similar thing, where instead of deciding whether to move up, down, left, right, and shoot,
you're controlling the steering wheel, the accelerator and the brakes. You could do it like that, but given my experience of it, and given the fact that, as I mentioned, with particularly rare examples, rare situations, quite often the neural network will just cheat, because they might make up 0.0001% of your training set, it'll never actually bother to learn anything from them. It'll discover a local minimum of the cost function that ignores them. I would not be very comfortable getting into a car that was just controlled by a neural network. I would not want to put my life in the hands of a vehicle like that.
But that might be how you could build it; I don't think it would be very good, though. Any other questions? Hi. Do you ever combine neural networks with other techniques, like approximation algorithms? Approximation algorithms? Yeah, like optimization techniques? I was thinking about the travelling salesman problem, for example. I don't know. I haven't tried them for that. I wouldn't be surprised if someone's tried it, but I'm not aware of it, because I haven't looked into it, I'm afraid.
Hmm. That's a difficult one, I'm afraid I don't know. I'm sorry. You'd have to figure out a way of coming up with some kind of cost function that measures how good its solution is. How one would go about doing that in certain problems, I don't know. I have time for.
One last question. Maybe it's kind of a technical question about Theano. When you apply dropout, does the expression get recompiled and re-optimized to be efficient, to not take account of those dropped weights? Or do the floating point operations get to the GPU or CPU, but they are zero, so they don't affect the gradient? I think it's the second. Because what you do is you get this random number generator to generate either a zero or a one, and then you put that multiply in the expression. So I think it's just not actually optimized. I think it'd be quite difficult to optimize.
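The multiply-by-a-random-mask the answer describes can be sketched in numpy. This is the "inverted dropout" variant, which also rescales the surviving units so nothing needs changing at test time — a common convention, not necessarily exactly what any particular Theano model did:

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Dropout as a multiply: draw a random 0/1 mask, zero out a fraction
    p_drop of the units, and scale up the survivors so the expected
    activation is unchanged."""
    keep = (rng.uniform(size=activations.shape) >= p_drop)
    mask = keep.astype(activations.dtype)
    return activations * mask / (1.0 - p_drop)
```

Because the zeros simply flow through the multiply, the full set of floating point operations is still executed; the dropped units just contribute nothing to the output or the gradient, which matches the "I think it's the second" answer.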
Because the problem is, for every single sample in the mini-batch, you're actually blocking out a different subset of the units. So I'm not even sure how one would actually go about optimizing in an efficient way. Because you've got to almost select which units you're dropping out, and then from that, decide what operations you can save. And you've got to do that on the fly.
And I think that'd be quite tough. So I would guess that it doesn't. Since there are no other questions, I'll thank Geoff for his wonderful talk.
And I'll say, enjoy your lunch.