
Neural networks


Formal Metadata

Title
Neural networks
Alternative Title
Нейронні мережі (Neural Networks)
Part Number
5
Number of Parts
10
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Today we have a lecture about neural network structure. We start from the simplest architecture and then discuss different architectures for different tasks, for example image analysis, speech analysis, and so on. First of all, I would like to explain why we actually have these kinds of classification and prediction models and what they mean, and we start from the simplest linear model.
The main idea of a linear model, as we discussed before, is to choose an appropriate mathematical model and to place this model so as to minimize the distance between the predicted data and the real data, which you can see here as blue points.
Of course, the mathematical model can be written in vector notation. In this case, our task is to find the coefficients and, based on these coefficients, to build our model. Next, we would like to minimize the error between the real data and the predicted data.
To do that, we build a loss function, in this case the mean squared error: this is the predicted data, this is the real data, and we must solve an optimization task. On the other side, if we talk about classification,
in the simplest case we try to find a hyperplane in the feature space and, based on this hyperplane, to split the positive cases and the negative cases into two independent subspaces without any intersection. And here we have the same situation.
We must build a decision function, and to do that we could use different algorithms with different mathematical explanations. But the task is the same: for this decision function, we need to find coefficients.
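As a rough illustration of this idea, here is a minimal Python sketch, with made-up data, that fits a linear model by minimizing the mean squared error with gradient descent; the data, learning rate, and number of iterations are all illustrative.

```python
import numpy as np

# Toy data: y is roughly a linear function of x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)

w, b = 0.0, 0.0   # coefficients of the linear model y_hat = w * x + b
lr = 0.01         # learning rate

for _ in range(2000):
    y_hat = w * x + b
    err = y_hat - y                 # predicted minus real data
    mse = np.mean(err ** 2)         # loss function: mean squared error
    # Gradients of the MSE with respect to w and b.
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w ~ {w:.2f}, b ~ {b:.2f}, final MSE ~ {mse:.4f}")
```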
On the other hand, we can build a more sophisticated model. For example, as you can see here, in this feature space our decision cannot be based on only one hyperplane. In this case, we can solve the classification problem using three separate hyperplanes. So we have new features, three additional hyperplanes with their own parameters, and of course we have our decision. But the task is the same: we must build a linear model based on these new features.
And what does this mean for us? In this case we have one complex task and, in addition, three separate sub-tasks. It means that we can rewrite our algorithm in terms of a computational graph. We have our input data, the axes x1 and x2; next we have our three hyperplanes; and we have the output, our decision. Our task in this computational graph is to find the coefficients w, to train our computational graph, to train our algorithm, so that it solves the problem we have at the output.
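A minimal sketch of such a computational graph in Python, with made-up weights for the three hyperplanes and for the final linear decision; the point is only to show how the three hyperplane outputs become new features for the last unit.

```python
import numpy as np

def step(z):
    """Simple threshold: 1 if z >= 0, otherwise 0."""
    return (z >= 0).astype(float)

# Illustrative (made-up) parameters for the three hyperplanes h1, h2, h3.
W_hidden = np.array([[ 1.0, -1.0],    # weights of h1 on (x1, x2)
                     [-1.0,  1.0],    # weights of h2
                     [ 1.0,  1.0]])   # weights of h3
b_hidden = np.array([0.5, 0.5, -1.5])

# Illustrative parameters of the final linear decision on the new features.
w_out = np.array([1.0, 1.0, 1.0])
b_out = -2.5

def predict(x):
    """Forward pass through the small computational graph."""
    h = step(W_hidden @ x + b_hidden)   # three new features from three hyperplanes
    return step(w_out @ h + b_out)      # final linear decision on those features

print(predict(np.array([0.2, 0.8])))   # 0.0
print(predict(np.array([2.0, 2.0])))   # 1.0
```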
If we talk about this computational graph and we would like to solve this optimization task (because each time the goal is the same: we would like to minimize the error), it means that the task can be solved with derivatives, since we would like to minimize something. In this case our computational graph is transformed into a graph of derivatives, and the main task for us is to calculate derivatives: to find a path from one part of the graph to another, multiply all the edge values along this path (because we have the coefficients w), add the resulting derivatives of the different paths, and so on. And actually, this graph of derivatives is one of the simplest units of an artificial neural network. A real artificial neural network looks like this: for example, here you can see the architecture of AlexNet, developed in 2012.
Here you can see that we have a lot of small blocks combined together. We have more than 60 million parameters, and the training time, the time for finding these w coefficients, is around six days.
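Returning to the graph of derivatives for a moment, here is a tiny sketch of the path rule described above (multiply the edge derivatives along each path and add the paths), with a made-up two-path graph and a numerical check.

```python
# A tiny graph with two paths from x to y:
#   u = x**2,  v = 3*x,  y = u * v
# Chain rule over the graph: multiply edge derivatives along each path, then sum the paths.
def grad_y_wrt_x(x):
    u, v = x**2, 3*x
    dy_du, dy_dv = v, u          # edge derivatives at the output node
    du_dx, dv_dx = 2*x, 3        # edge derivatives at the input node
    return dy_du * du_dx + dy_dv * dv_dx   # path 1 + path 2

def y(x):
    return (x**2) * (3*x)

x0, eps = 2.0, 1e-6
print(grad_y_wrt_x(x0))                        # 36.0
print((y(x0 + eps) - y(x0 - eps)) / (2*eps))   # numerical check, ~36.0
```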
So what does an artificial neural network mean in this case? We have a set of connected input and output units, the same as we have here, input units and output units, where each connection has a weight associated with it. We also need to develop an algorithm to train our neural network to classify, to solve our problem. Learning in our case means finding appropriate coefficients and then using these coefficients and the resulting model on new data. The simplest neural network can be used for a classification task.
But of course, a neural network, depending on its structure, can also be used for regression, for clustering, and so on. Connectionism here refers to a computer modeling approach to computation that is loosely based upon the architecture of the brain.
So a neural network is actually a simplified model of the human brain. Here you can find short information about the history of neural networks. The first work on neural networks appeared in the middle of the last century, and neural networks were used for solving mathematical tasks. After that, a major problem arose related to the computational cost of neural networks.
As you can see here, each time we must multiply huge matrices, and this was a problem for the computers of that time. Later, new neural network architectures were developed.
Now neural networks can be used for many different applications: handwriting and text recognition, voice and speech recognition, self-driving vehicles, navigation and management, and so on.
Here you can see a comparison of the brain and a traditional computer, with a neural network as a simplified model of the human brain. Similarly to the human brain, an artificial neural network consists of nodes and connections between the nodes. Similar to the human brain, a neural network can deal with incomplete data, with noisy data, and so on. But the human brain works in parallel, whereas on a traditional computer we can organize only serial or centralized computation. So why do we say that a neural network is a model of the human brain? Here we can look at the biological inspiration.
Our brain consists of neurons; here you can see the structure of such a neuron. Each neuron has inputs, an output (the result of its calculation), and a processing unit. The human brain is a connection of many neurons, and the output of one neuron is the input of the next neuron.
When the human brain learns, connections are changed or new connections are added. In our case we have something similar: we must train our network to change the weights of the connections between different neurons, between different nodes, in order to classify or predict unknown data. Because of that, a neural network needs a long time for training, since, as I said, we deal with huge matrices.
On the other hand, a neural network has a high tolerance to noisy and incomplete data. In this lecture we discuss the neural network as the simplest classifier. To analyze the classification problem, our data must of course be normalized; we talked about this before. Just as a reminder, two basic normalization techniques are min-max normalization and decimal-scaling normalization. Here you can find the formula and an example for min-max normalization, and the same for decimal-scaling normalization: this is the formula, and this is an example.
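Just as a sketch of those two techniques in Python (the exact formulas on the slide may be written slightly differently):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def decimal_scaling_normalize(x):
    """Decimal scaling: divide by 10**j, where j is the smallest integer
    such that all scaled absolute values are below 1."""
    x = np.asarray(x, dtype=float)
    max_abs = np.abs(x).max()
    j = int(np.floor(np.log10(max_abs))) + 1 if max_abs >= 1 else 0
    return x / (10 ** j)

values = np.array([200, 300, 400, 600, 950])
print(min_max_normalize(values))          # approximately [0, 0.133, 0.267, 0.533, 1]
print(decimal_scaling_normalize(values))  # [0.2, 0.3, 0.4, 0.6, 0.95]
```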
So now I would like to present a neural network consisting of a single neuron. Here you can see its structure: we have inputs, our data; we have weights for these inputs; we have our calculation unit; and we have our output, the class label. The computational unit consists of two parts.
The first part is the summator, which computes the weighted sum of our signals. The second part is the activation function, which transforms the result of the summator into a class label. Regarding the summator, here you can find an example.
Suppose w1 is equal to 0.5 and w2 is equal to 0.52, and our input values are equal to 0.3 and 0.8, respectively. In this case, the weighted sum computed by our computational unit is expressed as you can see here. Next, we must transform the result of the summator, this value, into a classification label.
To do that, we use an activation function. The simplest activation function, the step function, looks like this: if the result of the summator is less than 0.5, our decision will be 0; otherwise it will be 1. So for our example the weighted sum is equal to 0.45, and therefore our classification result will be equal to 0; if the weighted sum were greater than 0.5, the decision would be 1. In addition, if we try to minimize
the error between the real data and the predicted data, we must carefully analyze the error behavior for each feature in our data set. Suppose we have two features, x1 and x2, and we build our classification model for them. Based on the analysis we see that the error rate along the x1 feature is more or less good, but for the x2 feature, where our classification model sits here, the error rate is huge. It means that we must shift our mathematical model to another place along the x2 axis.
To do that, we can use a bias. The bias is an additional value that allows us to shift our mathematical model. So this is the classical summator, and this is the bias; based on it, we simply move our mathematical model up or down. The bias is represented as an extra input in our simple single-neuron network.
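A minimal sketch of such a single neuron in Python; the weights, inputs, bias, and threshold below are illustrative and do not have to match the numbers on the slide.

```python
import numpy as np

def step(z, threshold=0.5):
    """Step activation: class 1 if the weighted sum reaches the threshold, else class 0."""
    return 1 if z >= threshold else 0

def neuron(x, w, bias=0.0, threshold=0.5):
    """Single neuron: summator (weighted sum plus bias) followed by a step activation."""
    z = np.dot(w, x) + bias
    return z, step(z, threshold)

# Illustrative numbers, made up for this sketch.
x = np.array([0.3, 0.8])
w = np.array([0.5, 0.3])
print(neuron(x, w))            # weighted sum about 0.39, below the threshold -> class 0
print(neuron(x, w, bias=0.2))  # the bias shifts the sum above the threshold -> class 1
```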
So we have our input signals, and we have the bias, which of course has its own weight too. As before, we have the summator, the activation function, and the output. Regarding activation: this additional function allows us to transform the weighted sum into the actual value of our class, and to do that we can use different functions. For a classification model we use the simplest ones, for example the step function, presented as you can see here,
the ramp function, the sigmoid function, which is widely used for different tasks, the Gaussian function, and so on. The step function allows us to solve a binary classification problem: for example, if our value is less than c, our decision will be a; otherwise our decision will be b. With the ramp function we can solve a multi-class classification problem, when we have three or more classes.
In our case we can talk about a three-class classification: if our value is in the range between 0 and c, we have the first class; between c and d, the second class; and above d, the last class. Next, the sigmoid function can be used
not only for classification tasks but also for prediction tasks. The result of the sigmoid function lies in the range between 0 and 1, and we have real values. Of course, if we would like to solve a binary classification task, we can add an additional step function after the sigmoid: if the result of the sigmoid function is higher than 0.5, it will be the first class, and otherwise the second class. But because the result is a real value, the sigmoid function can be used for prediction tasks too.
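Here is a small sketch of these activation functions in Python; the exact parameterization of the ramp and Gaussian functions is one possible choice, not necessarily the one on the slide.

```python
import numpy as np

# Sketches of the activation functions mentioned above (parameter names are illustrative).
def step(z, c=0.5, low=0.0, high=1.0):
    return low if z < c else high                # binary decision

def ramp(z, c=0.0, d=1.0):
    return np.clip((z - c) / (d - c), 0.0, 1.0)  # linear between c and d, flat outside

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # smooth output in (0, 1)

def gaussian(z, mu=0.0, sigma=1.0):
    return np.exp(-((z - mu) ** 2) / (2 * sigma ** 2))

# The sigmoid gives a real value, so it can also be used for prediction;
# adding a step on top of it turns it back into a binary classifier.
z = 0.8
print(sigmoid(z), step(sigmoid(z)))
```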
So in this case we can present our computational graph, for example, as in this picture. We have two input elements, three computational elements, and of course the outputs. This model allows us to organize a multi-class classification, because we have three computational elements and three outputs.
We can use different activation functions for that. An example of the operation of this neural network is given below: we have our neural network, we have the input signals, and we have a vector of outputs, this one; we call this a multi-class classification task. We would like to calculate this vector, and to do that we calculate each neuron separately. We start with the first neuron: first the summator, after that the step activation function, and we have the result. The second neuron is the same, summator and activation function, and then the last neuron. Based on this simple neuron, we can build the first architecture and use it for real tasks. And the first architecture is the perceptron.
This is a specialized form of a single-layer feed-forward neural network. The perceptron was proposed by Rosenblatt in 1958 as a simple neuron that is used to classify its input into one of two categories.
The perceptron uses a step function that returns a positive one or a negative one, and in addition it can contain a bias too. So the perceptron can be used for classification, but with one very important limitation: the perceptron can deal only with linearly separable classes. What does a linearly separable model mean? It means that we can use only a limited list of Boolean
functions inside our calculation element, inside our calculation unit. For example, previously we talked about the summator; here you can see a unit that implements the Boolean function OR. Here is the explanation: the result of the Boolean function OR is equal to 0 only if both arguments of this function are equal to 0, so for x1 equal to 0 and x2 equal to 0 the result is false, and in all other cases the result is true. That is why we can use one separating line and split our space into two separate subspaces,
one with the negative class and one with the positive class. Based on this linear separation, we can organize the learning process of the perceptron. The learning process is organized as a change of the weights by this formula, w_i := w_i + learning rate * error * x_i: we have the old value of the weight, the error, the value of our input signal, and the learning rate. The learning rate allows us to change the speed of learning.
As we said earlier about the graph of derivatives, each time we try to find a local minimum and, based on these local minima, to find the global minimum. To do that, we split our space into separate regions based on the derivative values and try to find a local minimum in each region separately. The learning rate allows us to control the size of these regions. If we create a lot of small regions, the learning process will be very slow, but the result will be very precise; if we choose a huge size for each region, we can miss a local minimum, and as a result we can miss the global minimum too. That is why, if we need a very precise result, it is better to use a small learning rate, and if we would like to find a solution very quickly, we can use a large learning rate.
Here you can find an example of perceptron training. We have randomly initialized weights, w1 and w2, and we have a data set consisting of three instances.
For the first instance we have these values, and we try to find the output of our perceptron: we compute the weighted sum and then apply the activation function, and the result is equal to 0. The real value is 0 and the predicted value is 0, which means that our weights do not change. For the next instance we have these values, and again our predicted result is the same as the real result, so we do not change our weights. For the next training instance, as you can see here, the real value is equal to 1 and the predicted value is equal to 0. In this case we have an error, computed as the real value minus the predicted value; that is why the error equals 1. Suppose the learning rate is equal to 0.2. In this case we must recalculate our weights based on the formula above.
Based on that, the new value of the weights will be 0 for w1 and 0.4 for w2. We must repeat this process until we get a stable result, with the weights no longer changing.
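A minimal sketch of this training procedure in Python, using the update rule and the learning rate 0.2 from above; the training set here is an illustrative one (the Boolean OR function), not the data set from the slide.

```python
import numpy as np

def step(z, threshold=0.5):
    return 1 if z >= threshold else 0

# Illustrative training set: the Boolean OR function, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

w = np.zeros(2)   # "randomly" initialized weights (zeros here for simplicity)
lr = 0.2          # learning rate, as in the lecture example

changed = True
while changed:                        # repeat until an epoch passes with no weight change
    changed = False
    for x_i, t in zip(X, y):
        predicted = step(np.dot(w, x_i))
        error = t - predicted         # real value minus predicted value
        if error != 0:
            w = w + lr * error * x_i  # perceptron learning rule
            changed = True

print(w)   # final stable weights, e.g. [0.6 0.6] for this setup
```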
So once more about the limitation of the perceptron: a perceptron can implement only linearly separable functions, such as AND, OR, and complement. It cannot, for example, implement the XOR function. The XOR function is not a linearly separable function: we obtain 0 if the arguments are equal, for example for 0, 0 and for 1, 1, and we obtain 1 if the arguments have different values. The picture of the space looks like this: in this part of the space we have the negative class, and in this part and in this one we have the positive class.
So we need more than one line, more than one hyperplane, to split our data; that is why XOR cannot be computed by the perceptron model. To solve this problem, to overcome the limitations of the perceptron, we can use a multilayer feed-forward neural network, as in the sketch below.
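As a small sketch of why the extra layer helps, here is a hand-crafted two-layer network of step units that computes XOR; the weights are one illustrative choice, not taken from the lecture.

```python
import numpy as np

def step(z):
    return (np.asarray(z) >= 0).astype(int)

# XOR truth table: 0 for equal arguments, 1 for different arguments.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# A hand-crafted two-layer network (one illustrative choice of weights):
#   h1 fires for x1 OR x2, h2 fires for x1 AND x2, output = h1 AND (NOT h2).
W1 = np.array([[1.0, 1.0],     # OR unit
               [1.0, 1.0]])    # AND unit
b1 = np.array([-0.5, -1.5])
w2 = np.array([1.0, -2.0])     # output combines the two hidden units
b2 = -0.5

for x in X:
    h = step(W1 @ x + b1)
    out = step(w2 @ h + b2)
    print(x, int(out))          # prints the XOR of the two inputs
```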
In this case we have a number of input units, as before, and output units, but in addition we have hidden nodes, hidden units, which allow us to solve more complex tasks. So we have an input vector, an output vector, hidden units, and of course weights for each layer, for each neuron in our neural network. Here you can see an example of a multilayer perceptron. We have an input vector consisting of two values; in addition, we have a weight vector for the first calculation elements in the hidden layer. After that we have the result of this layer, and this result is actually the input for the next layer. Finally, we have the output of the whole neural network.
In this case, the neural network learning process is organized like the classical learning process for one perceptron, for one neuron: each time we calculate the weighted outputs from the previous units and send the result to the next unit. The perceptron training algorithm looks like this. First of all, we initialize our weights randomly; after that, we mark that our weights may change. Then we organize a loop: while the weights must still be changed, we start from the hypothesis that the weights will not change and then check the result. First we calculate our output; if the output is not equal to the real data, we mark that the weights must be changed, update the weights, and discard the old values. After that, we start a new cycle of the loop.
In a multilayer feed-forward network, the units in the hidden layers and the output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. The inputs of a multilayer feed-forward network are the records, with their non-class attributes and normalized attribute values. From these inputs we can build the input vector, and this input vector is fed to the input layer. There are as many input nodes as non-class attributes, and of course we know the length of this attribute vector. After that we have the hidden layer; we can use more than one hidden layer, with a different number of neurons in each layer. The output layer corresponds to the class attribute, and of course we can use more than one node in the output layer too. To train our neural network, we use a technique called backpropagation.
The backpropagation technique allows us to iteratively train our neural network based on the error found in the last layer, then in the previous layer, and so on. For each sample, the weights are modified to minimize the error between the real data and the predicted data. Here you can find the steps of the backpropagation algorithm. We start from randomly initialized weights and biases in this range, in this interval. Next, for each training sample, we calculate the weighted sum and the activation function for each layer in our neural network; that is, we propagate the inputs forward and compute the net input and output of each unit in the hidden and output layers. After that we use the backpropagation step: we propagate the error from the last layer back to the first layer of our neural network and, based on this error, update the weights and biases to reflect the propagated error. Propagation through a hidden layer looks like this.
We calculate the weighted sum, we calculate the activation function, after that we calculate the error, and then we propagate this error to the previous layers, back through our neural network.
So how do we actually propagate the inputs? For each unit in the input layer, its output is simply equal to its input. The net input to each unit in the hidden and output layers is computed from the outputs of the previous layer: we use the weighted sum, and after that we apply the activation function. A widely used activation function is the sigmoid, or logistic, activation function.
Here you can see the formula for it. After calculating the output of the neural network, we try to find the error, and we start from the error in the last layer of the network, because there we have the actual data and we can compare this actual data with the predicted data. So we have the true output, the real data, and we have the predicted data as the result of the activation function. Based on that, and remembering that our neural network is a graph of derivatives, we take the derivative, the rate of change, of the activation function and use it to calculate the error. After that we propagate this error to the hidden layer: for the hidden layer we first take the weighted sum of the errors from the output layer, and of course we again multiply by the derivative term of the hidden unit. Next, the update of the weights and biases is organized using the same formula as for a single neuron.
We compute delta w, which is the product of the learning rate, the error, and the input, and the same is done for the bias.
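For reference, with sigmoid units the formulas described above can be written as follows (this is standard notation, assumed here: O_j is the output of unit j, I_j its net input, T_j the target value, theta_j the bias, and l the learning rate):

```latex
\begin{aligned}
\text{net input and output:}\quad & I_j = \sum_i w_{ij} O_i + \theta_j, \qquad O_j = \frac{1}{1 + e^{-I_j}} \\
\text{output-layer error:}\quad & \mathit{Err}_j = O_j (1 - O_j)(T_j - O_j) \\
\text{hidden-layer error:}\quad & \mathit{Err}_j = O_j (1 - O_j) \sum_k \mathit{Err}_k\, w_{jk} \\
\text{updates:}\quad & \Delta w_{ij} = l \cdot \mathit{Err}_j \cdot O_i, \qquad \Delta \theta_j = l \cdot \mathit{Err}_j
\end{aligned}
```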
To update our weights we can use more than one iteration; one iteration through the training set is called an epoch. Alternatively, the weight and bias increments can be accumulated in variables, and the weights and biases are updated after all the samples of the training set have been presented. We can use this technique for training a neural network too, but we will discuss that for more complicated architectures of neural networks. In addition, we can use a terminating condition: for example, we can train our neural network until all the delta w values are smaller than a specified threshold.
In practice, several hundred thousand epochs may be required before the weights converge. All the backpropagation formulas can be found on this slide. First of all, we start with the feed-forward pass, from the beginning to the end, without loops inside our neural network: we calculate the weighted sum, and after that we calculate the real output based on the activation function. For the last layer, the output layer, we find the classification or prediction error. After that, we can also calculate the error for the hidden layer based on this formula and use the error values for updating our weights. And here you can find a simple example of backpropagation.
Our neural network consists of three input neurons, one hidden layer with two neurons, and one output layer. We have the real inputs, we have randomly generated weights, and we have biases. We must calculate one epoch of our neural network, and we start with the feed-forward pass. This means that we must calculate the input for the next layer.
To do that, we calculate the weighted sum for unit 4 from the outputs x1, x2, and x3: it will be x1 multiplied by w14, plus x2 multiplied by w24, plus x3 multiplied by w34, plus the bias. This is the weighted sum.
After that, we apply the activation function and obtain the result. We do the same with the second neuron in our hidden layer. Then the outputs of the hidden-layer neurons 4 and 5 serve as the inputs for the last neuron. For the last neuron we do the same: we calculate the weighted sum and the activation function, where the weighted sum is calculated from the weights and these inputs.
These are our inputs, and of course the bias. After that we calculate the sigmoid and obtain the result. Next, we can calculate the error, and we start in this case from the last layer, from the output layer, this one. Based on this formula we have
the output, multiplied by 1 minus the output, multiplied by the real data minus the predicted data. Next we calculate the error for the hidden layer based on this formula: we have the weighted error from the previous step, and we have this unit's output multiplied by 1 minus its output. And the same for the next element.
After that, we can update our weights. Suppose the learning rate is equal to 0.9. Based on the formula, we have the old value, the learning rate, then the error, and the corresponding output value. We do the same with the rest of the weights in our neural network.
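A minimal sketch of this whole worked example in Python: one feed-forward pass, the error calculation, and one round of weight and bias updates for a 3-2-1 network with sigmoid units. The initial weights, biases, and inputs below are made up for illustration and may differ from the numbers on the slide; the learning rate 0.9 is the one mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative numbers (made up here; the slide's values may differ).
x = np.array([1.0, 0.0, 1.0])           # inputs x1, x2, x3
target = 1.0                            # real output
W1 = np.array([[0.2, -0.3],             # w14, w15
               [0.4,  0.1],             # w24, w25
               [-0.5, 0.2]])            # w34, w35
b1 = np.array([-0.4, 0.2])              # biases of hidden units 4 and 5
w2 = np.array([-0.3, -0.2])             # w46, w56
b2 = 0.1                                # bias of output unit 6
lr = 0.9                                # learning rate, as in the lecture

# Feed-forward pass.
h = sigmoid(x @ W1 + b1)    # outputs of hidden units 4 and 5
o = sigmoid(h @ w2 + b2)    # output of unit 6

# Errors, starting from the output layer.
err_out = o * (1 - o) * (target - o)
err_hidden = h * (1 - h) * (err_out * w2)

# Weight and bias updates (delta = learning rate * error * input).
w2 += lr * err_out * h
b2 += lr * err_out
W1 += lr * np.outer(x, err_hidden)
b1 += lr * err_hidden

print(np.round(h, 3), round(float(o), 3))   # hidden outputs and network output
print(np.round(W1, 3), np.round(w2, 3))     # updated weights after one sample
```

Repeating this over many samples and epochs, with a terminating condition on the weight changes, gives the full backpropagation training loop described earlier.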
And that is all. In the next lecture we will discuss more complicated architectures, such as recurrent neural networks, convolutional neural networks, and so on.
Thanks.