Real-time Face Detection and Emotion/Gender classification with Convolutional Neural Networks

Video in TIB AV-Portal: Real-time Face Detection and Emotion/Gender classification with Convolutional Neural Networks

Formal Metadata

CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

In this work we present a real-time system for face detection and emotion/gender classification using Convolutional Neural Networks and Haar-like features.
Keywords: FrOSCon meets Science

Well, thank you all for coming to my presentation on real-time face detection and emotion/gender classification using convolutional neural networks.
So first, a little bit about me. I am, as he mentioned, a master's student here, and I am also a participant in the RoboCup@Home competition, so all of this work came about because of the international RoboCup competition, for the @Home league. Now, our main motivation: why would we like to have an emotion and gender classification system? Well, here at the university we are working with robots, of course, and obviously our intention is to put this into a robot. Current domestic robot platforms are being used as home systems or as personal assistant systems. In this case we have the robot Pepper here, which is very famous. This robot belongs to a French company, and they are trying to sell it so that it can be deployed at your home as a service system that can take pictures and remind you of things. Obviously, these sorts of robots
require highly developed perception skills, and we can think of several examples. One of them is an elderly woman or man who is trying to communicate that they are in pain, or who might feel some sort of anger or might be sad, and who may be unable to express it. We would like the robot to extract information from this person's face in order to proceed as it should, for example to call someone or to tell someone. There are also other cases, going to the other extreme, in which the robot should take a picture, and we would only like it to take the picture once everyone in the frame is happy, right? So there are a lot of cases we can think of in which we would like to extract information from the user's face in order to support some sort of interaction with the world. And, well, one important
characteristic of emotion classification is that it is not as easy as one would think. Classifying someone's emotion only by looking at the face might seem easy, but human performance is only about 65 percent accuracy on the dataset we are using, which is the FER-2013 dataset. And yes, it is open and everything, and it has 35 thousand grayscale images. In this example I have a few of these images. It is easy for us to say that this person is happy, right? But this one is questionable: this person may be sad or neutral. And for the baby, I don't know which class to give it. Another thing that we can see from the dataset is that the faces are not all presented in a canonical way; that is, the poses of the faces are rotated, so this might also be problematic for performing a classification. And the other
dataset that we used was the IMDB dataset, which consists of images from the IMDb website. It has almost half a million pictures, and as you can see they are pictures from the small world of the Hollywood system, right? This eventually leads to problems, which I will talk about later. But it is one of the biggest datasets out there, with about half a million images labeled with gender. So again, in
spite of the problematic things we face when working with a robot, which are basically that it is constrained hardware-wise, and also that the classification of gender and emotions is rather complicated, we have to create a system that is efficient and able to perform all of these tasks. So in this work we present such a system, which is able to do face detection, gender classification and emotion classification, all in real time and open source: we released our models and all our code as open source. It is also independent of the robot framework, so we tried to make it possible for the robotics community to use it as well. Furthermore, we also know that convolutional networks, or neural networks in general, are used as black boxes. I mean, they are black boxes, but often we would like to see what hidden features were learned by the convolutional network. Therefore we also implemented a visualization method. I am not saying it will tell you what the convolutional network actually learned, but it tells you something about it: it makes a visualization that is really interesting to interpret. This visualization is called guided back-propagation. I don't know how familiar you are with neural networks. Raise your hand if you have worked with neural networks, just so I can get a feeling. OK, so there are a lot of people here, so I'll go through this quickly. Basically, neural networks are function approximators: you have some inputs and you have some outputs, and you want to create a function that maps the inputs to the outputs. You create this function with a set of nodes, where the model contains the weights, which are represented by these arrows, and you change these arrows using an
optimization algorithm that tries to minimize an error at the outputs. So basically we are trying to minimize an error, and we change the weights accordingly in order to make the neural network approximate the function given by our data. Here I display two of the most important types of neural networks: one of them is recurrent networks and the other one is fully connected networks. Remember, recurrent neural networks have a hidden state, which means that the neural network itself is dependent on the history of the previous inputs. So one could think that recurrent networks might be better, or more suitable, when the inputs arrive sequentially. Time series, for example, are one area in which recurrent networks are used, but also language: when we construct a sentence, the sentence has a sequential meaning to it.
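The idea described above, changing the weights so that the error at the output shrinks, can be illustrated with a minimal sketch. This is not the training code from the talk: the two-parameter linear model, the toy data, and the names `train` and `learning_rate` are assumptions chosen only to keep the example small.

```python
def train(inputs, targets, learning_rate=0.1, steps=500):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(steps):
        # Prediction error for every sample with the current weights.
        errors = [(w * x + b) - y for x, y in zip(inputs, targets)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Move the weights a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated by y = 2x + 1 (an assumption for the example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train(xs, ys)
```

A real network has many more weights and a nonlinear model, but the loop has the same shape: compute the error, compute how the error changes with each weight, and adjust the weights accordingly.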
The other network here is, well, not necessarily convolutional networks, but feed-forward or fully connected neural networks. I will explain a little bit about convolutional networks as well. Convolutional networks are basically feed-forward networks on which we impose constraints, and basically they substitute the matrix multiplication that is happening here
with a convolution operation. I will skip the details about the convolution operation, but in a convolution we have a kernel, and this kernel gets convolved with our image. Here we have our picture, and we apply a convolution to it. Maybe you are familiar with the operation: you put this matrix over here, multiply all the elements and sum them up, and this gives you one pixel over here. Then you keep moving the operator forward, and each time the operator is applied you get one pixel value. At the end, from this hand-crafted kernel I get some edge detection, and this edge detection could eventually help us create a system that is able to recognize something, right? But the idea of a convolutional neural network is to learn the elements of the kernel, so that once you convolve the image you get features that are important, or more suitable, for the classification.
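The multiply-and-sum procedure described above can be sketched in a few lines of plain Python. The toy image and the vertical-edge kernel are assumptions for illustration; like most convolutional network libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (valid positions only, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum everything up.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Toy 4x6 image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]
# Hand-crafted kernel that responds to vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
edges = convolve2d(image, kernel)
# edges == [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The output is nonzero only where the kernel straddles the dark-to-bright boundary. In a convolutional network the nine kernel values would be learned rather than written by hand.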
Two constraints are imposed on a neural network, or rather on a fully connected network, in order to convert it into a convolutional neural network. One of them is local connectivity, which tells you that not all the input values are going to be equally important for the neural network, or that you only look at a local patch of the input values in order to make the computation. In other words, a pixel located at the far end of the image might not be as important for a local patch. The other one is weight sharing. Weight sharing tells you that a feature extracted from one local patch of pixels will also be important for the other patches, and therefore you assign the same weights to all the patches. In the end you apply these two constraints and, magically, the fully connected network is converted into a convolutional neural network. This is one of the images that I like, because it tells you explicitly how the convolution in the network works. In this case, let's think about an RGB image. An RGB image is a matrix that consists of three matrices stacked together, and we have a kernel, which is a tiny matrix that is also a volume of three stacked matrices. What we basically do is convolve the
kernel with the input. We do this by, for example, looking at a specific x, y location in the input feature map and multiplying each value there by the kernel values: this one is 0, this one is 1, this one is minus 1. We do this with the first slice of the kernel, then the second slice of the feature map is multiplied by the second slice of the kernel, and so on for the other ones. Once we have done all these multiplications, all of the products get summed, then the values that came from the separate slices get summed, and finally a bias also gets added to this sum. That gives us the output value. There is something about this that was improved later on, which I will talk about, but this is what a convolutional neural network looks like, in the convolution layer at least: you are just learning this sort of 3D kernel weights, which are convolved with the input volume, for example an image. Another interesting thing is that here we have, for example, two kernels. One kernel gives us one feature map and the other kernel gives us another feature map, and a feature map is just a modification of the original image by the kernel. So if we have n kernels, we will have n feature maps at the end. So we can control the
number of feature maps you want through the number of kernels that you have.
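The multi-channel case described above (multiply slice by slice, sum everything, add a bias, and get one feature map per kernel) can be sketched as follows. The function name `conv_layer`, the toy input and the kernel values are assumptions for illustration only.

```python
def conv_layer(volume, kernels, biases):
    """volume: [channels][height][width]; kernels: [n][channels][kh][kw]."""
    channels = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    feature_maps = []
    for kernel, bias in zip(kernels, biases):
        kh, kw = len(kernel[0]), len(kernel[0][0])
        fmap = []
        for i in range(height - kh + 1):
            row = []
            for j in range(width - kw + 1):
                # Sum the products over every channel slice, then add the bias.
                total = bias
                for c in range(channels):
                    for a in range(kh):
                        for b in range(kw):
                            total += volume[c][i + a][j + b] * kernel[c][a][b]
                row.append(total)
            fmap.append(row)
        feature_maps.append(fmap)
    return feature_maps


# Toy 3-channel 3x3 input in which every value is 1.
rgb = [[[1] * 3 for _ in range(3)] for _ in range(3)]
# Two 3-channel 2x2 kernels: one of all ones, one of all minus ones.
kernels = [[[[1, 1], [1, 1]] for _ in range(3)],
           [[[-1, -1], [-1, -1]] for _ in range(3)]]
maps = conv_layer(rgb, kernels, biases=[1, 0])
# Two kernels produce two feature maps, each of size 2x2.
```

Adding a third kernel to the list would produce a third feature map, which is the sense in which the number of kernels controls the number of feature maps.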
One of the most successful convolutional networks is VGG16, which I believe won the 2014 ImageNet competition. The ImageNet competition is about classifying images into one of a thousand classes, using 1.3 million images. This is a very incredible result. I like to think of it as some sort of system that you can integrate into a robot platform and that is able to identify one thousand classes. So now we can put this into a robot, and the robot will walk around and take a picture of something and say, yes, this is a person, and it will do this for a thousand classes. We can now do this very easily, but these sorts of convolutional networks are computationally very expensive: for example, they have around 138 million weights. An interesting thing is that, as explained before, we have these convolutional layers, and at the end you have fully connected layers, which is a very traditional approach to constructing a network. So you have convolution, convolution, convolution, and then you have the fully connected layers, and the fully connected layers account for 90 percent of all the parameters in the network. This is very interesting: it might be called convolutional, but it is 90 percent a fully connected network, right?
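The figures quoted above can be checked with a short parameter count. The sketch assumes the standard published VGG16 configuration (thirteen 3x3 convolution layers, five 2x2 max-pooling steps, a 224x224 RGB input and three fully connected layers); the talk itself does not list the layers, so the configuration below is an assumption.

```python
# Output channels of the 13 convolution layers; "M" marks a 2x2 max pooling.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def count_vgg16_parameters(input_channels=3, input_size=224, num_classes=1000):
    conv_params = 0
    channels, size = input_channels, input_size
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2  # pooling halves the spatial size
        else:
            conv_params += 3 * 3 * channels * layer + layer  # weights + biases
            channels = layer
    fc_params = 0
    fan_in = channels * size * size  # flattened feature maps: 512 * 7 * 7
    for fan_out in (4096, 4096, num_classes):
        fc_params += fan_in * fan_out + fan_out  # weights + biases
        fan_in = fan_out
    return conv_params, fc_params


conv_params, fc_params = count_vgg16_parameters()
total = conv_params + fc_params
fc_share = fc_params / total
```

Under these assumptions `total` comes to 138,357,544 parameters, and the fully connected layers hold a little over 89 percent of them, in line with the roughly 90 percent mentioned in the talk.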
A simple solution to reduce the number of parameters that we have in the fully connected part is to add a global average pooling layer. The global average pooling layer takes each feature map and converts it into a single scalar value. In the previous approach, each feature map gets flattened and then concatenated with the other ones; from there a fully connected layer is applied, and that produces the output values. In this case, what we do instead is take each feature map and compute the average of all its elements. This average is a scalar value, and these scalar values are the actual output nodes. Why do we use the average? That is a good question, and it is rather interesting: with the average you need information from every single node in the feature map in order to get something significant, whereas with something like the maximum value you are losing information. So the reasoning behind using the average is to give individual representation to all the nodes in the feature map.
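The difference between the two approaches described above can be sketched in plain Python. The feature-map values are assumptions for illustration; in the talk's setting the last convolution layer would produce one feature map per class.

```python
def global_average_pooling(feature_maps):
    """Reduce each feature map ([height][width]) to the mean of its elements."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def flatten(feature_maps):
    """The older approach: concatenate every element into one long vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


# Two toy 2x2 feature maps.
maps = [[[1.0, 3.0], [5.0, 7.0]],
        [[0.0, 0.0], [2.0, 2.0]]]
scores = global_average_pooling(maps)  # one scalar per feature map
vector = flatten(maps)                 # 8 values that a dense layer would still need weights for
```

Global average pooling itself has no trainable parameters, whereas a fully connected layer on top of `vector` needs one weight per input element and output node.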
Our initial architecture, which is the one I propose in this work, consists of nine convolution layers and uses global average pooling at the last layer. It has around 600,000 parameters and achieves an accuracy of 96 percent on the IMDB dataset and 66 percent on FER-2013. This model is 7.5 megabytes, so it is rather small. Other methods achieve the same accuracy, but using ensembles of convolutional networks, so they might prove too slow for real-time systems. But this is not the end of the story, because in the end 600,000 parameters is still rather cumbersome for a simple task. So we decided to explore more modern convolutional architectures, and we encountered the Xception architecture, which combines two of the most successful experimental assumptions in convolutional networks: the use of residual modules and depthwise separable convolutions. These two assumptions
are basically the state of the art in convolutional networks. The first one is residual modules: instead of only performing sequential operations on the feature maps, we split off a branch from the feature map, perform a convolution and a convolution again, followed by a nonlinear function, and then add the previous feature map. This is something like a perturbation method, in which you have the feature map and you want to see which changes to the feature map actually make it better, so that the classification becomes more accurate. But the most important thing about residual modules is that they make the backpropagation algorithm, which is the optimization algorithm that modifies the weights, work efficiently. Because they create a bridge to the previous feature maps, the gradient is able to propagate through the entire network. This addresses a well-known problem in convolutional networks: the networks get so deep that the back-propagated gradient keeps decreasing, or also explodes, depending on certain assumptions. Residual connections allow you to back-propagate the gradient easily, and therefore this architecture, which consists of more than a hundred layers of convolutions one after another, won, I believe, the 2015 ImageNet competition. The other one is depthwise separable convolutions. As explained before, you have these kernels that map the input images into feature maps. Depthwise separable convolutions try to separate the cross-correlations between the spatial values and the channel values. So instead of having 3D volume kernels that are convolved with the input feature map, we make them channel dependent: each channel has its own kernel, so for every input feature map we have a specific kernel, which is a
single-channel kernel: in this case we have a K by K by 1 kernel. Then we learn to mix the values output by these filters in several ways, which are given by the 1 by 1 convolutions. So what the 1 by 1 convolution is doing is mixing the values given by these per-channel kernels. This sort of reduction gives a factor of 1 over N plus 1 over D squared, and it eliminates many parameters in our convolutional networks. Our final architecture, which is based on Xception, reduces the parameters by a factor of ten, so we have about 60,000 parameters now, and the model is a bit over 800 kilobytes. It achieves almost the same accuracy in gender classification: the previous model, if you recall, had 96 percent, and this one has 95. Emotion classification accuracy was exactly the same, 66 percent. Our complete pipeline, which includes face detection, gender classification and emotion classification, takes around 0.48 seconds [figure unclear in the recording] on our low-end GPU, and on a CPU it was around [unclear]. So basically we reduced the number of parameters that we actually need in a convolutional network in order to build a real-time system.
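The reduction factor mentioned above can be checked by counting weights. The sketch reads N as the number of output feature maps and D as the kernel width, with M input channels; this reading and the example layer sizes are assumptions, since the talk does not define the symbols, and biases are ignored.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    # One kernel_size x kernel_size x in_channels volume per output feature map.
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    depthwise = kernel_size * kernel_size * in_channels  # one 2D kernel per input channel
    pointwise = in_channels * out_channels               # 1x1 convolutions that mix the channels
    return depthwise + pointwise


# Example layer sizes (not taken from the talk).
D, M, N = 3, 64, 128
ratio = separable_conv_params(D, M, N) / standard_conv_params(D, M, N)
expected = 1 / N + 1 / D ** 2
```

For these sizes the standard layer needs 73,728 weights and the separable one 8,768, and `ratio` agrees with `1 / N + 1 / D ** 2`, which is about 0.119.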
Here are some of our results. This is our RoboCup@Home team at the competition. Here one color represents women and the other represents men. The labels here read happy, happy, this one is neutral, and we also have something here which is a sad man. This error here was made because of the face detection algorithm, which is the OpenCV implementation; if you look more closely you can see some errors there and there. And this is another image that I worked on, which is the Solvay Conference, and here's
[unclear] and other physicists, and an angry woman, which I believe is correct, and a neutral man. This one is from a movie called 12 Angry
Men, and you can see that not all of them are angry: some of them are neutral, sad, neutral. And yeah, I'll
show you some of the results that have emerged from this. By the way, this is the repository that was uploaded in order to provide the code and everything. This is rather interesting, because I published it without any idea of how interesting it would be for people, but it turned out that a lot of people like it, and I got a lot of nice contributions from the open-source community. Here is the same kind of example for the emotions; these are real pictures of me, and I'll explain later why. This is maybe interesting because, OK, this is "sad", and I might not look so sad, right? Or this is "angry", and I'm not angry here. I'll talk about these errors later on, which is also very interesting. And yeah, and
well, just over here, this is
the visualization technique I was talking about. This technique is called guided back-propagation. Basically, you input the image and then you wiggle the pixels; you could think of it as perturbing the elements of the picture and seeing which neuron gets activated in the neural network. You look at the neuron that gets activated the most, then you wiggle the pixels and find the pixels that correlate most with the activation of that neuron inside the convolutional network. It can therefore tell you which pixels in the image make a neuron activate the most. For example, I believe every row here is an emotion. We can see, for example, that the neuron for happy relies heavily on the smile. This is something we expect from the neural network. We also see something interesting here, which is that the surprise classification depends solely, or mostly, on the eyes, that is, on how wide your eyes are open. This is also interesting. And then angry, which is the part I want to talk about: it seems that you only have to frown, you only have to make a face like this, and the neural network will be tricked into believing that you are actually angry. You don't really have to be extremely angry; you just have to perform the frown. This is why, as I was telling you, in the real pictures of me I did not really look angry. You also have to be careful with open-source implementations, right? Because if I know what activates a neuron the most, I can make the faces that activate it. There's another interesting
idea, or rather a problem, that arose, and it has to do with the dataset, right? I mean, this dataset is
very oriented to Western faces, right? While I was here, the RoboCup competition was happening in Japan, and I received a lot of e-mails from people in Japan telling me that my implementation did not work, because it was not working for Asian people. I mean, it was not working because, obviously, the dataset does not contain these people, right? Something else very interesting that I encountered was the effect of glasses. The neural network confused people who were angry with a person who was only wearing glasses, like mine for example, which are very dark. But the most interesting thing is this: a friend of mine was sitting in front of it while we had the live demo on. She did not have any glasses on, and it was classifying her as a woman; she put her glasses on, and it said "man". So this is something very interesting that the network picked up from the dataset. You really have to look into the dataset to see what errors might occur inside the network. It might be biased to say that because a person is wearing glasses, it must be a man. Why? Because all the samples of a person wearing glasses that you give to the neural network are men. And so,
Yes, future work. The reduction of parameters made a real-time system possible, but this is not the end of how much we can reduce the number of parameters. We can also apply some evolutionary strategies in order to reduce the parameters, depending maybe on the number of kernels that we have. This was also not a parameter that I tweaked too much, so I believe further optimization can be done here.

We can also incorporate more classifications. The networks are so small that we can train more of them for other tasks, for example age. Age data is also in the IMDB dataset, so we can easily train one more model.

The last one is to train double-headed models, which means that from a single forward pass you can output several classes: for example, one output for emotion, one for gender, and one for age. But in order to do this successfully, we would need a dataset that contains all the true labels, so that every time we forward-pass an image we have the information of the true labels. This is something that we do not have at this point. But at least for age and gender, which are contained in the same dataset, we can create a sort of double-headed neural network.

And yes, thank you for your time. OK, this was the first half of the presentation I wanted to give today. If we have time after the questions, I would like to give another talk that I think is interesting, which is again about convolutional networks, this time for image captioning. Thank you.
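The double-headed idea from the future-work part can be sketched as one shared feature computation followed by several small output heads. This is a plain-Python toy under stated assumptions, not the model from the talk: the layer shapes, the head names, and the choice of summing the per-head losses are all illustrative.

```python
import math


def linear(W, b, x):
    """Affine layer: W is a list of rows, b a list of biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def multi_head_forward(x, shared, heads):
    """Compute shared features once, then one probability vector per head.

    shared : (W, b) of the shared layer (stands in for the convolutional body)
    heads  : dict mapping a head name to its own (W, b)
    """
    W, b = shared
    features = [max(0.0, v) for v in linear(W, b, x)]   # single forward pass
    return {name: softmax(linear(Wh, bh, features))
            for name, (Wh, bh) in heads.items()}


def multi_head_loss(predictions, true_labels):
    """Sum of cross-entropies; requires a true label for EVERY head."""
    return sum(-math.log(predictions[name][true_labels[name]])
               for name in predictions)
```

The loss function shows why the dataset must contain every label for every image: a sample without, say, an age label cannot contribute to the age head.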
[Audience question, unintelligible.]

Thank you, that is a very interesting question: how can we make sure that the networks are able to generalize correctly, and that small perturbations in the input data don't change the output? This is also a very hot topic right now. I believe the paper that you read was maybe on adversarial networks, in which they try to create different modifications. There are even some examples in which they add noise that we cannot see to an image, and then the class changes entirely from what it was assigned, for example into a panda, but the noise is not recognizable; we are not able to perceive it. Yes, this is a well-known problem. It is well known for convolutional networks, but it is also well known for most machine learning models, so it is not only for convolutional networks. One can also optimize the noise that you add to the data given to an SVM or something else, and that will change the output. So this is a big problem in machine learning in general, and not necessarily only in convolutional networks.

[Audience follow-up, unintelligible.]

Yes, I agree with you. This is currently a very hot topic, and there is even a competition on Kaggle in which they are trying to create adversarial examples in order to make the networks more robust. So there is no definitive answer yet.

[Audience question, unintelligible.]

OK, so the question is how the training and evaluation were performed. The training process is, I believe, well known for neural networks: it is just the implementation of the backpropagation algorithm. That is the best way to do it, because I don't have to choose the elements in the kernels; the network learns to choose the values for me. As for the evaluation procedure, we basically divided the dataset into training, validation, and test. We used the model that performed better on validation and then evaluated it on the test set. Choosing the values in the kernels is not something I do; the optimizer did it for me.

[Audience follow-up, unintelligible.]

What you have to do is hold out a validation set, because you want to see that your model is performing correctly. So you split off something that you are not training on, in order to see that the model is performing well and is able to generalize well. And yes, this validation set comes only from the original dataset.
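The evaluation procedure described in this answer (train, validation, test, and picking the model that does best on validation) can be sketched as follows. The fractions, the seed, and the function names are assumptions for illustration; the talk does not state the exact values used here.

```python
import random


def three_way_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)       # fixed seed for reproducibility
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


def select_by_validation(val_scores):
    """Return the name of the model with the highest validation score."""
    return max(val_scores, key=val_scores.get)
```

The test list is touched only once, after `select_by_validation` has picked a model, so the reported number is not influenced by the selection.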
[Audience question, unintelligible.]

Yes, so the loss function is nothing fancy here. You have several loss functions, the mean squared loss for example, but here we only use the typical classification loss, the cross-entropy. It tells you how well the output values, looking at them as if they were a probability distribution, compare to the probability distribution of the actual labels. So it is the cross-entropy loss, nothing really special.

[Audience question, unintelligible.]

Yes, so the question is whether or not we can incorporate video analysis. This is something very interesting as well, but it is also something that becomes computationally complicated, because you would add another dimension to the neural network.
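For reference, the cross-entropy mentioned in the loss-function answer can be written in a few lines. This is a generic sketch in plain Python, not the training code from the talk; the function names are illustrative.

```python
import math


def softmax(logits):
    """Turn raw network outputs into a probability distribution."""
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(predicted, target):
    """Cross-entropy between a predicted and a target distribution.

    predicted : probabilities from the network (for example, softmax output)
    target    : true distribution, usually one-hot for classification
    """
    # Terms with zero target probability contribute nothing and are skipped,
    # which also avoids evaluating log(0).
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0)
```

With a one-hot target the sum reduces to the negative log-probability that the network assigns to the correct class.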
But yes, there is current research on this, for example to detect what a person is doing; action classification, it is called. Further questions before I proceed to the next part?

[Audience question, unintelligible.]

Yes, so the question is how much I gained from cutting out the last fully connected layers. Well, I would have had to train a fully connected network in order to see how many parameters actually give me the same classification accuracy. I didn't do this; I proceeded directly to perform the global average pooling, which I knew was something that already gives a reduction. But as we saw, for example, the incorporation of the Xception architecture gave a further reduction of about ten times.

[Audience question, unintelligible.]

Yes, so the question is whether I tried doing fine-tuning. Well, yes, I have tried fine-tuning, but as mentioned, the VGG16 network, for example, consists of three fully connected layers, so you would take off only the last one and then train on top of the rest. So you don't get rid of these 90 percent of the parameters right here. All the information about this is on my GitHub repository; you can find everything there.

OK, good. Then I'll proceed. Since we have some time, here is something else that might also be interesting for you.
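The "90 percent" remark about fully connected layers can be checked with simple arithmetic. The sketch below uses the standard ImageNet VGG16 layer sizes (a 7×7×512 feature map followed by 4096, 4096, and 1000 units) and the known parameter count of its convolutional part; these figures are not taken from the talk. The seven-class head at the end is a hypothetical example of what global average pooling saves.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Standard ImageNet VGG16 figures (assumed here, not quoted in the talk).
VGG16_CONV_PARAMS = 14_714_688
FLATTENED = 7 * 7 * 512                      # 25088 values after the last pooling

fc_params = (dense_params(FLATTENED, 4096)
             + dense_params(4096, 4096)
             + dense_params(4096, 1000))
fc_share = fc_params / (VGG16_CONV_PARAMS + fc_params)

# Hypothetical 7-class head on the same 7x7x512 feature map:
flatten_head = dense_params(FLATTENED, 7)    # flatten, then dense
gap_head = dense_params(512, 7)              # global average pooling, then dense
```

Global average pooling collapses each 7×7 map to one number, so the layer after it sees 512 inputs instead of 25088, which is where the reduction comes from.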
So yes, what I am going to present is image captioning and classification of anomalous situations. This project was also done with Professor Paul Plöger, and it is about training a convolutional and an LSTM network in order to account for situations that might be dangerous. For example, we would like to train a neural network that gives us a description of the situation, and this description will tell us whether or not the situation should be important for the robot.

Image captioning consists of a convolutional network and a recurrent network that constructs a sentence from the image. The neural network takes the image and says, for example, "a group of young people playing a game of frisbee". This is image captioning.

The basic thing that we would like to build is a system in which a robot sees a situation, takes a picture of it, and then describes the situation, that is, makes a sentence about it. These sentences can be communicated to someone else, who can then proceed as they see fit regarding whatever is happening. So, for example, in this case the robot would approach, and instead of having several components in the robot that try to do maybe pose estimation or run several classifications, we will have a single component that tells us everything that we would like to know about the situation. In this case the robot will approach, take a picture, and say "an elderly man is laying on the ground". This is the system that we created.

It uses LSTMs, another form of recurrent network, which are a little bit more complicated. They are differentiable versions of the memory chips in a digital computer, as Alex Graves said, who is a famous researcher that has worked with LSTMs for a long time. Basically, LSTMs contain several gates that control the input, the output, and the hidden state of the network. I will not go into too much detail about this.

The process by which the image captioning system works is this: it passes an image through a pre-trained convolutional neural network, as explained before, and it passes the result as the first state of the recurrent neural network. The recurrent network then starts predicting, given the information of the image, the first word in the sentence. Then it uses the next value, on which it was trained, together with the hidden state in order to produce the next word in the sentence, and it carries on until it forms sentences such as "a group of young people".

This is explicitly shown here. You have the start token and the information of the image, and then it produces "a". Then, given the word "a" and the information of the previous step, it produces the next word, and so on, until, given the information from the previous steps, it gives you the end token. This is how it learns to create a sentence.
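The word-by-word generation described here can be sketched as a greedy decoding loop. In the sketch the recurrent network is replaced by a caller-supplied `step` function, and the token names `<start>` and `<end>` are assumptions; the actual model and vocabulary from the talk are not reproduced.

```python
def greedy_caption(image_features, step, start="<start>", end="<end>", max_len=20):
    """Generate a caption one word at a time.

    image_features : output of the (pre-trained) convolutional network,
                     used as the first state of the recurrent network
    step           : function (previous_word, state) -> (next_word, new_state);
                     a stand-in for one LSTM step plus picking the best word
    """
    state = image_features
    word = start
    caption = []
    for _ in range(max_len):                 # hard limit in case <end> never comes
        word, state = step(word, state)
        if word == end:
            break
        caption.append(word)
    return caption


# Toy stand-in for the trained network: a fixed next-word table.
TOY_TABLE = {"<start>": "a", "a": "group", "group": "of",
             "of": "young", "young": "people", "people": "<end>"}


def toy_step(word, state):
    """Look up the next word and extend the state with the word just seen."""
    return TOY_TABLE[word], state + (word,)
```

Calling `greedy_caption((), toy_step)` walks the table from the start token to the end token. A real system would compute word probabilities from the hidden state at each step instead of using a lookup table.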
There was a problem regarding the data for this case, because there is usually no dataset that contains anomalous situations. We don't have datasets that contain, for example, guns, or people in pain, or blood. So we actually had to go to Flickr, look for these images, and then ask people around our university to create a sentence for each of them. In the end we collected more than a thousand captioned images, which were captioned by 20 different people. All the images are under a Creative Commons license, and I hope to make them available soon so that people can work on this in machine learning as well.

These are some of the examples collected in our dataset, which include, for example, a police officer holding a gun, some violent images, and some things on fire in the street.

We implemented one of these image captioning models and obtained a METEOR score of 14.2. This score basically measures how much a sentence is in accordance with another, reference sentence. This sort of evaluation is also rather mysterious, because the metric is built by humans and some of its parameters are hand-tuned. I believe this metric should improve, but anyhow, this is the sort of method that people use for these captioning models. We also obtained an accuracy of 97 percent, which means that if I give it an image and I want to know whether the image is an anomaly or not an anomaly, it is right 97 percent of the time. These are some of our results.
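To give a feeling for what "in accordance with a reference sentence" means, the sketch below computes a plain unigram-overlap F-score between a generated caption and one reference. It is a much simplified stand-in and not the metric reported in the talk: real captioning metrics add stemming, synonym matching, word-order penalties, and tuned weights. The accuracy helper corresponds to the anomaly versus non-anomaly decision; both function names are illustrative.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall between two sentences."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())     # words matched, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions, labels):
    """Fraction of images whose anomaly / non-anomaly prediction is correct."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)
```

Even this toy version shows the weakness hinted at in the talk: a caption can score well by sharing common words with the reference while describing the scene wrongly.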
For the first of these images, the neural network said "a man is riding a skateboard". Then we gave it an image of this: "a train rolling down the tracks". Then "a group of people flying kites in a field", "a pizza which is on top of a table", and "a desk with a laptop and a monitor". So this also seems to be a very good application for robots. And then "a man surfing a wave in the ocean".

These are some of the errors that we also got on the non-anomalous images. "A woman sitting on a bed with a laptop computer": it is not a laptop computer. Then "a dirty toilet in a small room with a white floor". It is rather dirty, and there is a white floor in it, but there is blood here, so I don't really know what is happening. We would have to do some sort of evaluation of the network to actually see what is happening there.

Then there is a cat sitting "with a concerned look". This "concerned look" is also very much a case of: why did it happen? Well, I went into the data and I found out that a lot of people described images using the words "look" and "concerned". This is why we have "concerned" here: it is also a bias in the data. And then we discovered that our network doesn't know how to count: "a cup of coffee and a cup of coffee". This might not be entirely wrong, but still.
Then "a man sitting and drinking beer", and "a man and a woman holding wine glasses", which is also wrong.

These are some of the anomaly examples that our neural network captioned and that I believe are right. "A man is holding a gun": so it took the image and was able to, first, learn the English language and then create a sentence out of it. "A house is burning." "A car is crashed in a snow-covered street." "There is a woman with blood on the floor." "A man is choking another man." "A firefighter is trying to put out a fire." This is what we want: a system that we can incorporate in a robot and that is able to tell us whether some situation might be an anomaly.

And these are some of the results in which the network didn't perform as well for anomalies. In the first one we see that the neural network doesn't always finish its sentences. "There is a broken window laying on the ground": I believe this might be because there is something on the floor, but I'm not entirely sure. "A man is showing his shoulder": he is not showing his shoulder. And then "a man shows his right arm, in which he has a severe injury": this is wrong, and this one is difficult. "A man is being held by the police": I think in this example it is the referees. "A man is dancing with a woman": what is happening there? So this is also a difficult caption. What are we asking the network, and what did we train it for? It is complicated, so we have to look at which images were used, and so on.
Thank you for your time. Are there any questions?

[Audience question, unintelligible.]

The question is whether I wrote everything from scratch or whether I used a framework for it. Yes and no: I used TensorFlow and Keras. Other questions?

[Audience question, unintelligible.]

Well, we divided the dataset accordingly, 80/20: 80 percent for training, and then from this 80 percent, another 20 percent for validation. Then we chose the appropriate values so that we don't overfit, at least in the classification. The captioning part is rather complicated to evaluate, and this is why I was making a small remark on the metrics used in captioning. So it is also very hard to decide on that.

[Audience question, unintelligible.]