
Real-time Face Detection and Emotion/Gender classification with Convolutional Neural Networks


Formal Metadata

Title
Real-time Face Detection and Emotion/Gender classification with Convolutional Neural Networks
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
In this work we present a real-time system for face detection and emotion/gender classification using Convolutional Neural Networks and Haar-like features.
Transcript: English (auto-generated)
Well, thank you all for coming to my presentation on real-time face detection and emotion and gender classification using convolutional neural networks. First, a little bit about me: I'm Octavio, and as he mentioned, I'm a master's student here and a participant in the RoboCup @Home competition, so all of this work was brought about because of the international RoboCup competition for the @Home league. So, our main motivation: why would we like to have an emotion and gender classification
system? Well, here at our university we have this robotics course, and obviously our intention is to put this into a robot. Our current robot platforms are being used for home systems, or as personal assistance systems. In this case we have the robot Pepper, which is very famous; I believe this robot belongs to a French company, and they're trying to sell it so it can be deployed at your home and serve as an assistance system that can take pictures, remind you of things, and so on.
Obviously, these sorts of robots require high-level perception skills, and we can think of several examples. One of them is an elderly woman or man who is trying to communicate that they are in pain, or who might feel some sort of anger or sadness and is unable to express it. In that case we would like the robot to extract information from this person's face in order to proceed as it should, for example in order to call someone or to tell someone. There are also other cases, going to the other extreme, in which the robot should take a picture, and we would only like to take the picture when everyone in the frame is happy, right? So there are a lot of cases in which we would like to extract information from the user's face in order to trigger some sort of action with the robot. One important characteristic of emotion classification is that it's not so easy. One would think that classifying someone else's emotion only by looking at the face could be substantially easy, but apparently humans perform at around 65 percent accuracy on the dataset that we're using, which is the FER-2013 dataset; it's open source and it has around 35,000 grayscale images. In this example I have shuffled all these images, and we can look around and see: it's easy for us to detect this person that is happy, right? But maybe this one, I don't know which class I should assign to this person, maybe sad or neutral. And the baby, I don't know which label to give to the baby either, right?
Another thing that we can see from the dataset is that the faces are not presented in a canonical way: the pose of the face is rotated, and this might also be problematic for performing the classification. The other dataset that we used was the IMDB dataset, which consists of images from the IMDB website. It's almost half a million pictures, and as you can see, they are pictures of the small world of the Hollywood ecosystem, which eventually leads to problems, as I will talk about later.
But it's one of the biggest datasets out there, with around half a million images labeled with gender. So, in spite of the problems we have in our robot, which are basically that we are constrained hardware-wise and that the estimation or classification of gender and emotions is rather complicated, we have to create a system that is efficient and able to perform all of these tasks. In this work we present such a system, which does face detection, gender classification and emotion classification, all in real time and open source. We release our models and all code open source, and it's robot- and framework-independent; we tried to make all the appropriate software accommodations so that the robotics community can adapt and deploy our software. Furthermore, we also know that convolutional neural networks, or neural networks in general, are used as black boxes, and I mean they are black boxes, but often we would like to see what the hidden features learned by the convolutional neural network are. Therefore we also implemented a visualization method. I'm not saying it will tell you what the convolutional neural network actually learned, but it tells you something; it produces a visualization that is, I believe, interesting to interpret. This method is called guided backpropagation. I don't know how familiar you are with neural networks; I've worked with them for around one year, but could you raise your hand if you have worked with neural networks before, so I can get a feeling? Yeah, okay, there's a lot of people here, so I will go through this quickly, but not too fast.
So basically, neural networks are function approximators: you have some inputs and some outputs, and you want to create a function that maps these inputs to these outputs. The way you create this function is with a model that contains weights, which are represented by these arrows, and we change these weights using an optimization algorithm that tries to minimize an error at the outputs. So basically, we minimize an error and change the weights accordingly, in order to make the neural network approximate the function given by our data. Here I display two of the most important families of neural networks: one of them is recurrent neural networks and the other is fully connected neural networks. Recurrent neural networks have a hidden state, which means the network itself depends on the history of the previous inputs. One could think that recurrent neural networks might be better, or more suitable, when inputs are given sequentially; time series are one sort of area where recurrent neural networks are used, but also language, since we construct sentences and a sentence has a sequential meaning to it.
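To make the "function approximator" idea concrete, here is a minimal sketch in Keras (the framework used for this work, as mentioned later); the toy data and layer sizes are my own, purely for illustration:

```python
import numpy as np
from tensorflow import keras

# Toy data: learn to approximate y = sin(x) from samples (made-up example).
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)

# A small fully connected network: the weights are the "arrows" between
# layers, and the optimizer adjusts them to minimize the output error.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")  # minimize mean squared error
model.fit(x, y, epochs=200, verbose=0)
```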
The work here uses convolutional neural networks, not plain fully connected feed-forward neural networks, so I will explain a little bit about them as well. Convolutional neural networks are basically feed-forward neural networks with two imposed constraints, and they substitute the matrix multiplication that was happening here with a convolution operation. In the convolution operation we have a kernel, and this kernel gets convolved with our image. I don't know if you are familiar with the operation, but you put this small matrix over the image, multiply all the elements, sum them up, and this gives you one output pixel; then you keep moving this operator forward, and each time it is applied you get one more output pixel. At the end, from this handcrafted kernel, we get some edge detection, and this edge detection could eventually help us create a system that is able to recognize something, right? But the idea of a convolutional neural network is to learn the elements of the kernel, so that when it is convolved with the image you get features that are important, or more suitable, for the classification. As I mentioned before, two constraints are imposed on a fully connected neural network in order to convert it into a convolutional one. The first is local connectivity, which says that not all input values are equally important; you only look at a local patch of the input values for each computation. In other words, a pixel located on the far end of the image might not be as important for a local patch located on the other end. The second is weight sharing, which says that a feature extracted from one local patch of pixels will also be important for the other patches, and therefore you assign the same weights to all the patches. You impose these two constraints, and magically your feed-forward neural network is converted into a convolutional neural network.
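As a quick illustration of the sliding-kernel operation described above, here is a plain 2D convolution with a handcrafted edge-style kernel; the kernel values and the random stand-in image are my own choices, not from the talk:

```python
import numpy as np
from scipy.signal import convolve2d

# A handcrafted 3x3 edge-detection kernel (a Laplacian-style filter);
# the point of a CNN is that these numbers are learned instead.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

image = np.random.rand(64, 64)  # stand-in for a grayscale face image

# Slide the kernel over the image: at each position, multiply
# element-wise, sum up, and write one output pixel.
edges = convolve2d(image, kernel, mode="valid")
print(edges.shape)  # (62, 62)
```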
This is one of the images that I like very much, because it explicitly shows how the convolutional neural network works. Let's think about an RGB image: an RGB image is a volume of three matrices stacked together, and we have a kernel, represented by this other tiny matrix, which is also a volume of three stacked matrices. What we do is convolve the kernel with the input. For example, looking at this specific x, y position in the input feature map, we multiply each value, this zero times zero and this one times this minus one, and we do this with the first slice of the kernel; then the second slice of the feature map with the second slice of the kernel, and so forth for the other ones. Once we have computed all these multiplications they get summed, then the three resulting scalars get summed, and finally a bias gets added. This is how a convolutional neural network looks in the convolution layer, at least: you are learning these sorts of 3D kernel weights that are convolved with an input volume, an image for example. Another interesting thing is that we have, for example, two kernels here, kernel one and kernel two, and each one gives us its own feature map. A feature map is just a modification of the original image obtained by convolving the kernel, so if we have n kernels we will have n feature maps at the end; we can control the number of feature maps by the number of kernels that we have.
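In Keras, that "n kernels give n feature maps" relationship looks roughly like this (the input size and number of filters are illustrative):

```python
from tensorflow import keras

# 8 kernels, each a 3x3x3 volume convolved over an RGB input: every
# kernel produces its own feature map, so the number of output feature
# maps equals the number of kernels.
inputs = keras.Input(shape=(48, 48, 3))
features = keras.layers.Conv2D(filters=8, kernel_size=3, padding="same")(inputs)
model = keras.Model(inputs, features)
print(model.output_shape)  # (None, 48, 48, 8) -> 8 feature maps
```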
This is one of the most successful convolutional neural networks, VGG16. VGG16, I believe, won the 2014 ImageNet competition; the ImageNet competition is a competition for classifying 1,000 classes from 1.3 million images, and this is an incredible result. I like to think of it as a system that you could integrate, for example, into your robot platform so it is able to identify 1,000 classes: the robot walks around, you take a picture of something, and it says, yeah, this is a computer, this is a person, and it does this for 1,000 classes. But these sorts of convolutional neural networks are computationally very expensive; for example, VGG16 has around 138 million weights. Another interesting thing is that, as I explained before, we have these convolutional layers, and at the end you add fully connected layers, which is the traditional approach to constructing such a network: convolution, convolution, convolution, and then the fully connected layers. These fully connected layers account for 90% of all parameters in the network, which is also very interesting: it might be called convolutional, but parameter-wise it's 90% a fully connected network, right? So one solution, in order to reduce the number of parameters in the fully connected part, is to add a global average pooling layer.
A global average pooling layer takes the feature maps and converts each of them into a single scalar value. In the previous approach, each feature map gets flattened and concatenated with all the others, and from there a fully connected layer produces the output values. In this case, what we actually do is take the average of all the elements in each feature map; this average turns the feature map into a scalar value, and these scalar values become the output nodes. So why are we taking the average? This is rather interesting: the average gives you information from all the individual nodes in the feature map, whereas if you take something like the maximum value, you are losing information. So the mentality behind using the average is to give every node in the feature map a share in the representation.
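A small comparison of the two endings, with made-up sizes, to show where the parameters go:

```python
from tensorflow import keras

# Ending in Flatten + Dense: every value in every feature map gets its
# own weight per output class, which is where most parameters live.
# Ending in GlobalAveragePooling2D: each feature map is averaged into a
# single scalar, so no extra weights are needed before the output.
x = keras.Input(shape=(6, 6, 64))          # 64 feature maps of size 6x6
flat = keras.layers.Flatten()(x)           # 2304 values
dense_out = keras.layers.Dense(7)(flat)    # 2304*7 + 7 = 16,135 weights
gap_out = keras.layers.GlobalAveragePooling2D()(x)  # 64 values, 0 weights
```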
Our initial architecture, the one proposed in this work, consists of nine convolutional layers and uses global average pooling at the last layer. It has around 600,000 parameters, and it achieves an accuracy of 96% on the IMDB dataset and 66% on the FER-2013 dataset. This model is 7.5 megabytes, so it's rather small. Other methods achieve the same accuracy, but using an ensemble of convolutional neural networks, which might prove too slow for
real-time systems. But this is not the end of the story, because 600,000 parameters still seems rather cumbersome for a simple task, so we decided to explore more modern convolutional architectures, and we encountered the Xception architecture. It combines two of the most successful experimental ideas in convolutional neural networks, the use of residual modules and depthwise separable convolutions, and these two ideas are basically the state of the art right now. Residual modules: instead of performing a purely sequential operation on the feature maps, we add an identity connection, so we perform a convolution, a nonlinearity, a convolution again, and then we add the previous feature map back. You can think of this as a perturbation method: you have the feature map and you want to learn which perturbations of it actually make the classification more accurate. But the most important thing about residual modules is what they do for backpropagation, the optimization algorithm that modifies the weights: they create a bridge to the previous feature maps, so the gradient can propagate easily through the network. This is a well-known problem in convolutional neural networks: when the network gets very big, the backpropagated gradient vanishes or explodes, depending on certain assumptions. The identity connections let you backpropagate the gradient easily, and an architecture built this way won, I believe, the 2015 ImageNet competition, with 101 convolution layers one after another.
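A sketch of one such residual module in Keras; it assumes the incoming tensor already has the same number of channels as the block, and it is meant to show the idea rather than reproduce the exact module from the talk:

```python
from tensorflow import keras

def residual_block(x, filters=64):
    """Sketch of a residual module: the block learns a 'perturbation'
    that is added back onto the unchanged (identity) feature map, which
    also gives the gradient a shortcut path during backpropagation."""
    shortcut = x
    y = keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = keras.layers.Add()([y, shortcut])   # identity + learned residual
    return keras.layers.Activation("relu")(y)
```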
The other idea is depthwise separable convolutions. As I explained before, you have these 3D-volume kernels that map the input feature maps to another set of feature maps. Depthwise separable convolutions try to separate the cross-correlations between the spatial values and the channel values: instead of having 3D volume kernels convolved with the whole input volume, we make them channel-dependent, so each input channel has its own D×D×1 kernel. Then we learn to mix the values outputted by these filters in several ways, which is done by a 1×1 convolution: what the 1×1 convolution is doing is mixing the values given by the per-channel kernels. This sort of factorization reduces the parameters by a factor of roughly 1/N + 1/D², where N is the number of output feature maps and D is the spatial size of the kernel.
This eliminated a lot of parameters in our convolutional neural network, and in the end our final architecture, which is based on this Xception network, reduced the parameters about 10 times more: we now have around 60,000 parameters, and the model is around 853 kilobytes. It achieves almost the same accuracy in gender classification as the previous model, 95% instead of 96%, and the emotion classification accuracy is exactly the same, 66%. Our complete pipeline, which includes face detection, gender classification and emotion classification, takes around 0.48 seconds, and that is on a low-end GPU; on an i5 CPU it was around 0.051 seconds.
So basically, we reduced the number of parameters that we actually need in a convolutional neural network in order to build a real-time system. I'll show more of this later, but these are some of our results, and this is what we want, right? Here we have our RoboCup team in Italy while we're at a competition; blue represents women and the other color represents men. And 'happy man': I mean, this is fairly easy to recognize, right? Happy, happy, happy, happy, and this is 'neutral', if you cannot read it. We also have something here which is 'sad man', and, I don't know, this error here was made because of the face detection algorithm, which is the OpenCV implementation, but if you look at it more closely you can see there are some sort of eyebrows there, and the man does look sad. This is another fun thing that I worked on: this is the Solvay Conference, with Marie Curie, Albert Einstein and the other famous physicists, and it says 'angry woman', which I believe is correct, and 'neutral man'. Then this movie that I like, which is called 12 Angry Men, and you can see that not all of them are angry; I think none of them are angry: neutral, neutral. And I'll show you some of the results that I have on our GitHub repo.
By the way, these are the posts that I made in order to publish my project, and this is rather interesting, because I published it without any idea of how interesting it could be for people, but it turned out that a lot of people liked it. It got a nice reception, and I got a lot of nice contributions from the open-source software community. Here are the same examples, well, different faces for the emotions, and here is a real example with me; I'll explain later why this might be interesting, because this one says 'sad', and I might not look so sad, right? But this 'angry', I'm not angry here, this is kind of weird, right? I'll talk about these errors later on, which is also very interesting.
And yes, perfect. Let me just go back here to the visualization technique that I was talking to you about. This visualization technique is called guided backpropagation, and basically what it does is: you give the image as input, and you can think of it as wiggling the elements of the picture and seeing which neuron inside the neural network gets activated. You look at the neuron that gets activated the most, and you keep wiggling the pixels to find the pixels that correlate the most with the activation of a neuron inside your convolutional neural network. Therefore it can tell you which pixels in the image make a neuron activate the most.
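A minimal sketch of guided backpropagation in TensorFlow 2, not the talk's exact implementation: the only change from a plain gradient computation is a ReLU whose backward pass also zeroes out negative incoming gradients, and the model-patching loop assumes the layers had their activations set as 'relu':

```python
import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    """ReLU whose backward pass only lets positive gradients flow
    through positive activations: the core trick of guided backprop."""
    def grad(dy):
        return tf.cast(dy > 0, dy.dtype) * tf.cast(x > 0, dy.dtype) * dy
    return tf.nn.relu(x), grad

def guided_saliency(model, image, neuron_index):
    # Swap every ReLU in the (hypothetical) emotion model for the
    # guided version; assumes activations were set via 'relu' strings.
    for layer in model.layers:
        if getattr(layer, "activation", None) is tf.keras.activations.relu:
            layer.activation = guided_relu
    x = tf.convert_to_tensor(image[tf.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x)[0, neuron_index]   # activation we want to explain
    # Gradient of that activation w.r.t. the input pixels: large values
    # mark the pixels that drive the neuron the most.
    return tape.gradient(score, x)[0].numpy()
```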
Here, for example, I believe every row is an emotion, and apparently this guy was angry; yeah, Samuel L. Jackson was angry here. We can see, for example, that the neural network responds very heavily to the smile, obviously; this is something that we expected from the neural network. We also see something interesting here, which is that the 'surprise' classification depends solely, or mostly, on the eyes, on how wide your eyes are open, so this is also very interesting. And 'angry', which is the part that I wanted to talk about: for 'angry' it seems that you only have to make your face like this and you will trick the neural network into believing that you're actually angry. You don't have to be extremely angry, you just have to perform this sort of face, and this is why, in the real-life example, I didn't look angry but was classified as angry. This is also why you have to be careful around these open-source implementations, right? Because now I know what activates the neural network the most, so I can make the faces that activate it; you also have to think about that.
Another interesting problem that arose: as I mentioned before, this dataset is very much oriented toward Western-looking faces, right? I was at the RoboCup competition that was happening in Japan, and I received a lot of emails from people in Japan telling me that my implementation didn't work for Asian people; it was not working because, obviously, the dataset is of these people, right? Something else very interesting that I encountered was the issue with glasses: the neural network confused people that were angry with people that were only wearing glasses, like mine for example, which are very dark. But most interesting, I think, is that the neural network got confused about gender: a friend of mine was sitting in front of the live demo without glasses and it classified her as a woman; she put her glasses on and it said 'man'. This is something I would like to pinpoint about the datasets: you really have to dig into the dataset to see what errors might occur inside the network. It might be biased to say that a person wearing glasses is a man. Why? Because all the samples of persons wearing glasses that you gave the neural network are men, right? So, future work: the reduction of parameters let us create a real-time system, but this is not the end of how much we can reduce the number of parameters. We could also apply evolutionary strategies to reduce the parameters further, for example depending on the number of kernels we have; this was not a parameter that I tweaked too much, so I believe further optimization can be done here. We can also incorporate more classifications: the networks that we're training are so small that we can create more of them, in order to predict age for example, and age data is in the IMDB dataset, so we can easily train another neural network for this. And the last one is to create double-headed models, which means that from a single forward pass you can output several classes, for example emotion, gender and age. But in order to do this successfully, we would have to have a dataset that contains all the labels, so that every time we forward-pass an image we have the information of the three labels. This is something that we do not have, and something I wasn't able to do; but at least for age and gender, which are contained in the same dataset, we can create this sort of double-headed neural network.
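A sketch of what such a double-headed model could look like in Keras; the trunk, the layer sizes and the choice of age as a regression head are my own assumptions for illustration:

```python
from tensorflow import keras

# A "double-headed" model: one shared convolutional trunk and several
# output heads, so a single forward pass predicts gender and age.
inputs = keras.Input(shape=(64, 64, 1))
x = keras.layers.Conv2D(16, 3, activation="relu")(inputs)
x = keras.layers.Conv2D(32, 3, activation="relu")(x)
x = keras.layers.GlobalAveragePooling2D()(x)

gender = keras.layers.Dense(2, activation="softmax", name="gender")(x)
age = keras.layers.Dense(1, name="age")(x)  # age as a regression head

model = keras.Model(inputs, [gender, age])
model.compile(optimizer="adam",
              loss={"gender": "sparse_categorical_crossentropy",
                    "age": "mse"})
```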
And yes, thank you for your time. How much time do we have, do we have time? Yeah, okay. This was sort of half of the presentation that I was trying to give today, and if we have time after the questions I would like to give another talk, if you are interested, which is again about convolutional neural networks and also has an open-source angle to it: image captioning. But yeah, thank you. Any questions?
Yeah, thank you. This is a very interesting question: how can we make sure that our networks generalize correctly, so that small perturbations in the input data don't change the output? This is a very hot topic right now, and I believe the paper that you read was on adversarial examples, where they try to create different modifications of the input. I think there are even examples in which they add noise that we cannot perceive to an image, and the class changes entirely from what it was assigned, for example into 'panda', while the noise is not recognizable to us. This is a well-known problem, but another thing is that it's well known not only for convolutional neural networks but for most machine learning algorithms: one can also optimize noise added to the input data of an SVM, or something else, so that it changes the output. So this is a big problem in machine learning in general, and not necessarily only in convolutional neural networks, yes.
...it doesn't really see it, and then I find a rule set for how to combine these signals. What I think is missing is that, when you think of the process of recognizing things, you always have a filter, and your thinking process happens after everything is filtered. This is why we can't see this noise that is added to images; we can still say, okay, this is a stop sign. I think what really needs to be done is to put some filtering in front of the neural network, so that... this is something where we still have...
I believe, I mean, as I mentioned, this is currently a very hot topic, and there is even a competition on Kaggle in which they are trying to create adversarial examples, in order to make the neural networks more robust.
So there's no answer as of right now? Something that's just doing octagon recognition, then something that's just doing red detection, and then something on top of it, all three in parallel, plus the red, of course...
...information of age or gender, and you've picked the most successful set of... Yeah, okay, so the question is how the training and evaluation process is performed. Well, the training process is, I believe, well known in neural networks: it's just the implementation of the backpropagation algorithm, and that's why it's so important, because I don't have to choose the elements in the kernels; the neural network learns to choose the values for me. As for the evaluation procedure, we basically divided our dataset into training, validation and test; we used the model that performed best on the validation set, and then we just proceeded to test it. So yeah, the complete choice of the values in the kernels is not mine, I didn't do it, the optimizer did it for me.
Exactly, yeah.
Yeah, you do have to perform... this is why you have the validation set: you want to see that your model keeps performing correctly on something you're not training on, in order to see that it performs well and is able to generalize. And yes, this validation set comes only from the original dataset, yes.
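That split can be sketched with scikit-learn (the arrays are placeholders; the 80/20 proportions are the ones mentioned later in the talk):

```python
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(1000, 48, 48)    # placeholder image data
labels = np.random.randint(0, 7, 1000)   # placeholder emotion labels

# Hold out a test set, then carve a validation set out of the
# training portion; model selection uses only the validation set.
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42)
# Pick the model that does best on (x_val, y_val), then report on x_test.
```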
Yes, the loss function is nothing fancy here. There are several loss functions, as you mentioned, for example the squared loss, but here we only use the typical classification loss, the cross-entropy loss. It tells you, treating your output values as if they were a probability distribution, how well they compare to the probability distribution of the actual labels. So, the cross-entropy loss; nothing really special.
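As a tiny numeric example of that comparison (a made-up softmax output against a one-hot label):

```python
import numpy as np

def cross_entropy(predicted, target):
    """Compare the network's output, read as a probability
    distribution, against the true distribution (a one-hot label)."""
    return -np.sum(target * np.log(predicted))

predicted = np.array([0.7, 0.2, 0.1])  # softmax output over 3 classes
target = np.array([1.0, 0.0, 0.0])     # true class is the first one
print(cross_entropy(predicted, target))  # ~0.357; 0 is a perfect match
```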
Yeah, so the question is whether or not we can incorporate video analysis in our robots, and obviously this is something very interesting as well, but it's also something that becomes computationally complicated, because video adds another dimension to your neural network. But there is current research on this, for example to detect what a person is doing: action classification, yes, action classification. Further questions before I proceed to the next part? Yes.
Can you name the numbers of the reduction: how much did you manage to cut, from your examples, with the pooling and one-by-one convolutions and everything, how much did you win? Yeah, the question is how much I won from cutting out the last fully connected layers. Well, I would have had to train a fully connected network in order to see how many values actually give me the same classification accuracy, so I didn't do this; I proceeded directly to perform the global average pooling, which I knew was something that already gave a reduction. But as we saw, the incorporation of the Xception architecture reduced the parameters 10 times more, right? Yes.
...keep all the convolutions and everything, and just train this last fully connected layer to, let's say, detect all the cats; so I didn't see this fact about 90 percent. Is there hope to use this sort of approach to train something easily in home conditions, on your own GPU? Have you ever tried something like that? Yeah, so the question is whether I have tried doing fine-tuning. Yes, I've tried doing fine-tuning, but as you saw in the VGG16 network, the network consists of three fully connected layers; you don't take out all three, you take out only one, and then you train on top of this one. So you don't get rid of that 90 percent, right? You still have something.
Cool, yes. Yes, the slides will be available, sorry for that. All the information about everything here is on my GitHub repository, so you can find it there, but I'll make sure that the people from FrOSCon can also have it. Okay, good. Then, since I guess we have 20 minutes, I'll proceed with something that might also be interesting for people. What I'm going to present now is
image captioning and classification of anomalous situations. This project was also made with Professor Paul Plöger, and it also trains a convolutional and an LSTM network, in order to account for situations that might be dangerous. For example, we would like to train a neural network that creates a description of a situation, and this description gives us information on whether or not the situation should be important for the robot. Image captioning consists of a convolutional neural network and an LSTM network that in conjunction construct a sentence about the image: the neural network takes the image and says, for example, 'a group of young people playing a game of frisbee'. That's image captioning. The basic thing we would like to build is a system in which a robot sees a situation, takes a picture of it, and then describes the situation, makes a sentence about it. This sentence can be communicated to someone else, who can then proceed as they see fit regarding whatever is happening. So, for example, instead of having several components in a robot that try to do pose estimation and several separate classifications, we will have a single
component that tells us everything we would like to know about the situation at hand. In this case, the robot will approach, take a picture, and then say 'an elderly man is lying unconscious on the ground', and this is the system that we created. LSTMs are another form of recurrent neural network, a little bit more complicated, but they are 'differentiable versions of the memory chips in a digital computer'; that's what Alex Graves, a famous researcher who worked with LSTMs for a long time, said. Basically, LSTMs contain several gates that control the input, the output and the hidden state of the network; I will not go into too much detail about this.
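For reference, a bare-bones sketch of those gates for a single LSTM step in NumPy; the stacked weight layout is my own illustrative convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step, written out to show the gates: the input (i),
    forget (f) and output (o) gates control what enters, stays in and
    leaves the cell state c, which acts like the 'memory chip'."""
    z = W @ x + U @ h + b              # all four pre-activations at once
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update memory
    h_new = sigmoid(o) * np.tanh(c_new)                # expose a view of it
    return h_new, c_new
```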
Yes, the process by which the image captioning system works is this: it passes an image through a pre-trained convolutional neural network, as I explained before, and feeds it into the first state of the recurrent neural network. The recurrent neural network then starts predicting, given the information of the image, the first word of the sentence; then it uses that word, together with the hidden state, in order to produce the next word in the sentence, and it continues to carry on until it has performed a sentence such as 'a group of young people playing a game of frisbee'. It's shown explicitly here: you have a start token and the information of the image, and it produces 'straw'; then, given the word 'straw' and the hidden information of the previous step, it produces 'hat'; and given 'hat' and the information from the previous two steps, it gives you the end token. So it learns to create a sentence in this way.
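A sketch of that decoding loop; every name here (the step function, embedding, vocabulary and token ids) is a hypothetical stand-in, since the talk doesn't give the actual interfaces:

```python
import numpy as np

def generate_caption(cnn_features, step, embed, vocab,
                     start_id, end_id, max_len=20):
    """Sketch of the decoding loop described above: start from the
    image features and the start token, and feed each predicted word
    back in until the end token (all arguments are hypothetical)."""
    h = cnn_features            # image information seeds the hidden state
    c = np.zeros_like(h)
    word = start_id
    caption = []
    for _ in range(max_len):
        h, c, logits = step(embed(word), h, c)
        word = int(np.argmax(logits))       # greedy choice of next word
        if word == end_id:
            break
        caption.append(vocab[word])
    return " ".join(caption)
```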
There was a problem regarding the data for this case, because there usually aren't datasets that contain anomalous situations: we don't have datasets that contain, for example, guns, or people maybe in pain, or blood. So we actually had to go to Flickr, look for these images, and then ask people around our university to label them, to create a sentence for each of these images. This was a very enduring process, but at the end we collected eight thousand, or sorry, one thousand and eight captioned images, captioned by 20 different people. All of our images are under a Creative Commons license, and I hope to make them available soon so people can also work on this in machine learning. These are some of the examples collected in our data, which include, for example, a person that has been injured, a police officer holding a gun, some violent images, and things being on fire. On these image captioning models we obtained a METEOR score of 14.2; METEOR tries to compare sentences, basically measuring how much a sentence is in accordance with another reference sentence. This sort of evaluation is also very mysterious, because it's built by humans and some of its parameters are hand-tuned, and I believe this metric should improve, but anyhow, this is the sort of metric that people use for these captioning models.
We also obtained an accuracy of 97 percent, which means: I give the system an image and I want to know whether the image is an anomaly or not, and for this we received an accuracy of 97 percent. These are some of the results that we obtained. This is the image that we received, and the neural network said 'a man is doing a trick on a skateboard'. Then we gave it this one: 'a train traveling down the tracks next to a lush field'. 'A group of people flying kites in the field.' 'Pizza with cheese and herbs on the table.' 'A desk with a laptop and a monitor'; this one is, I think, very difficult, and could also be a good application for robots. And then 'a man on a surfboard riding a wave in the ocean'. These are some of the errors that we also got on these non-anomalous images: 'a woman sitting on a bed with a laptop computer'; well, it's not a laptop computer but some sort of cushion. Then 'a dirty toilet in a small room with a wooden floor'; it's rather dirty, but there's no wooden floor or anything like that, although there is wood here, so I don't know what's actually happening there, right? We would have to make some sort of error evaluation of the network to actually see what's happening. 'A cat sitting on a car seat with a concerned look': this word 'concerned' is also very curious, like, why did it happen, right? Well, I went into the data and found out that a lot of people describe cats that are looking towards them as 'concerned', so this is why we have the word 'concerned' here, because it's also in the data, right. And then we discovered that the neural network doesn't know how to count: 'a cup of coffee and a cup of coffee'; this might not be entirely wrong, but still. And yeah, 'a man is eating a sandwich and drinking beer'.
'A man and a woman holding wine glasses' is also wrong. And these are some of the examples our neural network captioned that I believe are positive: 'a man is holding a gun'; so it only took the image, having learned the English language, and created a sentence out of it. And then here: 'a house is burning', 'a car is crashing on a snow-covered street', 'there is a woman with blood on the floor', 'a man is choking another man', 'a firefighter is trying to put out a fire'. I mean, this is what we wanted: a system that we can incorporate into a robot and that is able to tell us whether some situation might be an anomaly, right? And these are some of the results in which the neural network didn't perform as well for anomalies. 'People with red helmets are sitting on a car': we see that the neural network doesn't always finish its sentences. 'There is a broken window laying on the ground': there is no broken window; I believe this might be because of some variation here on the floor, but I'm not entirely sure. 'A man is showing his injured shoulder': it's not his shoulder that he is showing. Then 'a man is showing his right arm, in which he has a severe injury': this is not his right arm, but it's difficult, right? 'A man is being held by the police': it's not the police but referees. And 'a man is dancing with a woman on a dance floor': I don't know what's happening there, to be honest. These are also difficult captions, right? What are we asking of the network, what are we training it for? This is also complicated; we have to look at the images we're given. And yeah, thank you for your time. This is
another small project. Do we have questions? Questions, yeah. Yeah, the question is whether I wrote everything from scratch or whether I used a framework. No, I used TensorFlow with Keras, which is the easiest way, yeah.
Any other question? Yeah. Yes, it is rather small, yes. Yes, we divided the dataset accordingly, like 80 percent for training and then, from this 80 percent, another 20 percent for validation, and then we chose the appropriate values so we don't overfit, at least in the classification. But the captioning part is rather complicated to evaluate, and this is why I was making a small remark about the metrics used in the captioning scenario; they're also very hard-coded, so I don't trust them too much. Yes?
Any other question? Yeah, okay, cool. Thank you.