Object detector
Formal Metadata
Number of Parts: 10
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/68388 (DOI)
Transcript: English (auto-generated)
00:05
Good afternoon. Today we will talk about detecting objects in images. What is object detection? Object detection is a technology that belongs to the field of computer vision and digital image processing.
00:20
The task is to detect objects of a certain type, such as living beings, cars, and buildings, in a digital image or a video. For each type of object there is a set of specific features with which you can classify the object. For example, to identify a face, such features will be the eyes, lips, nose, skin color, and the distance between the eyes.
00:44
The same specific features are also used to detect objects. As another example, with the development of smart city technologies, fast and efficient object recognition systems are required in order to minimize the hardware requirements of these technologies, as well as to increase the accuracy of their results.
01:04
The areas of application of object detection algorithms are varied: medical care, retail, security systems, personal identification, virtual assistants, and much more. Detecting objects in an image is a current problem in computer vision.
01:23
Pattern recognition is a general term that describes a range of computer vision tasks that solve the problem of detecting objects in an image or video frame. These include image classification, object localization, object detection, and segmentation. Let's draw a line between them.
01:40
First of all, in image classification the type or class of object in the image is determined, for example, person, cat, airplane, or others. The next one is object localization, when you need to find objects and mark them somehow, for example, with a rectangle. The next one is object detection, where you need to find objects
02:02
and mark them and classify them. And the last one is segmentation, when you need to find objects and separate them from the background by highlighting them. Figure 1 shows how these several problems are solved using the example of a single object and of many objects. The detection task.
02:20
It is required to determine the location of all objects of a given class in the image, as in figure 2. Object class: a specific object, any registration plate or number of a car, the face of an arbitrary person, or an arbitrary person. Location shows pixels belonging to the object image, the object outline,
02:43
and the area of the image containing the object, as a rectangle, ellipse, and others. The detection task is to assign to each image I a set of positions of objects B of the classes of interest, as in figure 3. The detection scheme is shown in figure 4, which consists of the following stages:
03:04
image preprocessing, such as correction of exposure, noise, and others; scene segmentation, as highlighting areas of interest, or areas with objects; image classification, as finding objects in areas of interest.
03:22
Verification, as filtering out false positives of the classifier, and tracking, as tracking changes in the position of an object over time. Key-point detection algorithms: the Moravec detector. One of the most common types of special points are corners in an image, because, unlike edges, corners in a pair of images can be uniquely matched,
03:45
the location of the corners can be determined using local detectors. The input of a local detector is the image. The output is a matrix with elements whose values determine the degree of likelihood of finding a corner in the corresponding pixels of the image.
04:03
Next, pixels with a degree of likelihood less than a certain threshold are cut off. Repeated corners are removed using NMS, the non-maximum suppression procedure. All remaining non-zero map elements correspond to corners in the image. The main disadvantages of the detector under consideration are the lack of invariance
04:25
to rotation-type transformations, and the occurrence of detection errors in the presence of a large number of diagonal edges. It is obvious that the Moravec detector has the property of anisotropy in the eight principal directions of window displacement.
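As an illustration (not part of the lecture materials), the Moravec response described above, the minimum sum of squared differences over the eight principal window displacements, can be sketched in Python; the window size and test image are illustrative:

```python
import numpy as np

def moravec_response(img, window=3):
    """Moravec corner response: for each pixel, the minimum SSD between the
    window around it and the window shifted in each of 8 directions."""
    h, w = img.shape
    r = window // 2
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]
    response = np.zeros_like(img, dtype=float)
    for y in range(r + 1, h - r - 1):
        for x in range(r + 1, w - r - 1):
            patch = img[y - r:y + r + 1, x - r:x + r + 1].astype(float)
            ssds = []
            for dy, dx in shifts:
                shifted = img[y + dy - r:y + dy + r + 1,
                              x + dx - r:x + dx + r + 1].astype(float)
                ssds.append(((patch - shifted) ** 2).sum())
            # Anisotropy: only these 8 displacement directions are tested.
            response[y, x] = min(ssds)
    return response

# A bright square on a dark background: its corners respond strongly,
# while flat regions and the square's interior give zero response.
img = np.zeros((12, 12))
img[4:9, 4:9] = 255.0
resp = moravec_response(img)
```

Thresholding `resp` and applying non-maximum suppression, as described above, would yield the final corner list.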
04:43
Its disadvantages are that it is not invariant under rotation transformations, and it gives a large number of false alarms on edges due to noise. The Harris and Stephens detector. The Harris detector is based on the Moravec detector and is its improvement, since it considers intensity changes in all directions.
05:05
Harris and Stephens introduce derivatives in some fundamental directions and expand the intensity function into a Taylor series, as we can see in formula 5. In the Harris matrix, or autocorrelation matrix, as a rule,
05:21
a weighted sum of products of derivatives with the coefficients of a Gaussian window is chosen: a window of size three by three pixels with coefficients 0.04, 0.12, 0.04; 0.12, 0.36, 0.12; 0.04, 0.12, 0.04.
05:46
Note that the Harris matrix is symmetric and positive semidefinite. By calculating the eigenvalues of the resulting matrix, image points can be classified into edges and corners. If both eigenvalues of the autocorrelation matrix are large enough,
06:03
then a small shift of the window leads to significant changes in intensity, so the pixel contains a corner. If one eigenvalue is significantly larger than the other, then this means that the intensity changes only when the window is shifted perpendicular to the edge direction,
06:20
so the pixel belongs to an edge. If the eigenvalues are close to zero, then the current pixel contains neither corners nor edges. Calculating the eigenvalues of the matrix M requires the use of the square-root operation. The Harris detector, compared to the previously discussed detector, requires more calculation due to the need to construct convolutions with the Gaussian kernel.
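The eigenvalue classification above is commonly implemented through the Harris response R = det(M) - k * trace(M)^2, which avoids computing the eigenvalues and the square root explicitly. A minimal sketch, using the 3x3 Gaussian weights quoted in the lecture; the constant k = 0.04 is a conventional choice, not specified here:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris response R = det(M) - k*trace(M)^2 per pixel, where M is the
    Gaussian-weighted autocorrelation matrix of first derivatives."""
    img = img.astype(float)
    # First derivatives along the vertical and horizontal directions.
    Iy, Ix = np.gradient(img)
    # 3x3 Gaussian weighting coefficients from the lecture.
    g = np.array([[0.04, 0.12, 0.04],
                  [0.12, 0.36, 0.12],
                  [0.04, 0.12, 0.04]])
    def smooth(a):
        out = np.zeros_like(a)
        h, w = a.shape
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                out[y, x] = (a[y - 1:y + 2, x - 1:x + 2] * g).sum()
        return out
    # Entries of the autocorrelation (Harris) matrix M at each pixel.
    Sxx, Syy, Sxy = smooth(Ix * Ix), smooth(Iy * Iy), smooth(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2

# A bright square on a dark background: R > 0 at corners, R < 0 on edges,
# R = 0 in flat regions.
img = np.zeros((12, 12))
img[4:9, 4:9] = 1.0
R = harris_response(img)
```

The sign of R reproduces the three cases in the lecture: positive for corners (both eigenvalues large), negative for edges (one dominant eigenvalue), near zero for flat areas.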
06:45
At the same time, it is quite susceptible to noise. Increasing the size of the Gaussian window makes it possible to suppress noise, but this leads to a significant computational cost, so it is necessary to find a compromise between the quality of the algorithm and the number of operations performed.
07:05
The Harris detector has the property of anisotropy along the horizontal and vertical directions, because the autocorrelation matrix contains first derivatives only along the specified directions. Compared to its predecessor, this detector is invariant with respect to rotation,
07:24
and the number of corner detection errors is not large due to the introduction of convolution with the Gaussian weighting coefficients. Detection results change significantly when the image is scaled, so subsequently modifications of the Harris detector arose that take into account
07:44
the second derivatives of the intensity function, for example the Harris-Laplace detector. The disadvantages are, first of all, greater computational complexity compared to the Moravec detector, sensitivity to noise, and dependence of detection results on image scaling. The MSER detector.
08:06
The MSER detector, from maximally stable extremal regions, identifies many different regions with extremal properties
08:24
of the intensity function within the region and at its outer boundary. Let's consider the idea of the algorithm for the case of a black-and-white image. Let's imagine all possible thresholdings of the image. As a result, we obtain a set of binary images at different threshold values, from 0 to 255.
08:47
A pixel whose intensity is less than the threshold is considered black, otherwise white. Thus, a pyramid is built in which at the initial level, corresponding to the minimum intensity value, there is a white image,
09:01
and at the last level, corresponding to the maximum intensity value, there is a black image. At some moment, black spots appear on the white image, corresponding to the local intensity minima. As the threshold increases, the spots begin to grow and merge, eventually forming a single black image.
09:23
Such a pyramid makes it possible to construct many connected components corresponding to white areas, regions with the maximum intensity value. If we invert the binary images in the pyramid, we get a set of regions with the minimum intensity value. The algorithm consists of several stages.
09:41
First of all, let's sort the set of all image pixels in ascending or descending order of intensity. Note that counting sort is possible here in time proportional to the number of pixels. Then let's build a pyramid of connected components. For each pixel of the sorted set, we perform the following sequence of actions.
10:01
We update the list of points included in the component and update the areas of the next component, as a result of which the pixels of the previous level will be a subset of the pixels of the next level. Then we search for local minima for all components: we find pixels that are present in a given component but are not part of the previous ones.
10:23
A set of local level minima corresponds to an extremal region in the image. The advantages are invariance to affine transformations of intensity, stability, simultaneous detection of regions at different scales, and computational efficiency. The FAST detector.
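The thresholding pyramid that MSER is built on can be sketched as follows; the toy image and its intensity values are illustrative, not from the lecture:

```python
import numpy as np

def threshold_pyramid(img):
    """Pyramid of binary images for thresholds 0..255: a pixel whose
    intensity is less than the threshold is black (True), otherwise white."""
    return [(img < t) for t in range(256)]

# Toy image: a dark blob (intensity 50) on a bright background (200).
img = np.full((8, 8), 200, dtype=np.uint8)
img[2:5, 2:5] = 50
levels = threshold_pyramid(img)

# Area of the black region at each level: the blob appears once t > 50,
# stays stable at 9 pixels over a wide threshold range (a maximally stable
# region), and merges with the background once t > 200.
areas = [lvl.sum() for lvl in levels]
```

A full MSER implementation would additionally track connected components across levels and measure the relative area change to select the maximally stable ones.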
10:42
The detectors described earlier determine special points in an image, in particular corners, by applying some model algorithm directly to the pixels of the original image. An alternative approach is to use machine learning algorithms to train a point classifier on a set of images. The FAST detector, features from accelerated segment test, uses decision trees for pixel classification.
11:06
For each pixel of the image, a circle with a center at this point is considered, which is inscribed in a square with a side of 7 pixels, as we can see in figure 8. The circle passes through 16 pixels of the neighborhood. The disadvantages are that there is no
11:23
generalization for the quick test, and the full test does not use information from the quick test. Each neighboring pixel x, from 1 to 16, relative to the central pixel p, can be in one of three states, as we can see in formula 7. Next, a decision tree is constructed, as in figure 9. At each level of the decision tree, the set
11:48
corresponding to a tree node is divided into subsets by selecting the most informative point, the pixel with the highest entropy, and the resulting decision tree is used to determine the corners in the test images.
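A hedged sketch of the FAST segment test: the three-state classification of circle pixels, as in formula 7, plus the plain contiguous-run test. The lecture's learned decision tree is replaced here by the simple test, and the threshold t and run length n are illustrative assumptions:

```python
def pixel_state(center, neighbor, t=10):
    """Classify a circle pixel relative to the center p into one of three
    states: darker, similar, or brighter (a reading of formula 7)."""
    if neighbor <= center - t:
        return "darker"
    if neighbor >= center + t:
        return "brighter"
    return "similar"

# 16 offsets of the Bresenham circle of radius 3 around the center pixel,
# inscribed in a 7x7 square (as in figure 8).
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2),
          (-1, 3)]

def is_fast_corner(img, y, x, t=10, n=12):
    """Segment test: the pixel is a corner candidate if at least n of the 16
    circle pixels are contiguously darker or brighter than the center.
    Caller must keep (y, x) at least 3 pixels from the image border."""
    c = int(img[y][x])
    states = [pixel_state(c, int(img[y + dy][x + dx]), t) for dy, dx in CIRCLE]
    for target in ("darker", "brighter"):
        run = [s == target for s in states * 2]  # doubled to handle wrap-around
        best = cur = 0
        for v in run:
            cur = cur + 1 if v else 0
            best = max(best, cur)
        if best >= n:
            return True
    return False
```

For example, an isolated bright dot on a dark background passes the test (all 16 circle pixels darker), while a flat region fails it.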
12:01
R-CNN. The R-CNN, or region-based convolutional neural network, architecture was developed in 2014 by Ross Girshick and others. This method is based on the following algorithm. First of all, finding potential objects in the image and dividing them into regions using the selective search method.
12:23
Second, extracting features of each resulting region using convolutional neural networks. Last, classification of the extracted features using a support vector machine and refining the boundaries of regions using linear regression. As a result, we get separate regions with the objects and their classes. At the center are
12:45
convolutional neural networks, which show good accuracy on images, but this architecture has disadvantages. The entire system requires a lot of time for training. In the selective search method, the image
13:01
is first segmented into about 2000 regions, which are then merged into larger regions using a greedy algorithm. In addition, convolutional networks themselves also require computing power. Another disadvantage is that it cannot be used for video. This follows from the disadvantages above, since all intermediate methods are
13:25
energy-consuming, so the frames simply will not have time to be processed. Also, selective search is not a machine learning algorithm, so it may have problems identifying potential objects in different images. At the moment, the R-CNN model is outdated and not used.
13:42
The shortcomings of R-CNN led the authors in 2015 to improve the model. They called it Fast R-CNN. It is based on the following architecture, as we can see in figure 11. First of all, the image is fed to the input of a convolutional neural network and processed
14:01
by selective search. As a result, we have a feature map and regions of potential objects. Second, the coordinates of the regions of potential objects are converted into coordinates on the feature map. Third, the feature map with the regions is transferred to the region of interest pooling layer.
14:25
Here, a grid of fixed size is superimposed on each region. Max pooling is then applied to reduce dimensionality. Thus, every region of potential objects gets the same fixed dimension. Fourth, the resulting features are fed to the input of a fully connected layer, which transmits them to two other fully connected layers.
14:48
The first one, with the softmax activation function, determines the probability of belonging to a class. The second one determines the boundaries, or displacement, of the region of a potential object.
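The region of interest pooling step described above can be sketched as follows; the output grid size and the toy feature map are illustrative, and the region is assumed to be at least as large as the grid:

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """RoI max pooling sketch: superimpose a fixed-size grid on the region
    and take the maximum in each cell, so every region, whatever its size,
    yields an output of the same fixed dimension."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    gh, gw = out_size
    h, w = region.shape
    out = np.zeros(out_size)
    # Cell boundaries: split the region into gh x gw roughly equal cells.
    ys = np.linspace(0, h, gh + 1).astype(int)
    xs = np.linspace(0, w, gw + 1).astype(int)
    for i in range(gh):
        for j in range(gw):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36).reshape(6, 6).astype(float)
pooled = roi_max_pool(fmap, (0, 0, 4, 4))  # 4x4 region -> fixed 2x2 output
```

Because the output dimension is fixed, the pooled features from regions of any size can be fed to the same fully connected layers.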
15:00
Fast R-CNN shows slightly higher accuracy and a large reduction in processing time in contrast to R-CNN, since it is not necessary to feed all regions to the convolutional layer. Nonetheless, this method still uses the expensive selective search. Therefore, the authors came to Faster R-CNN.
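The softmax activation used by the classification branch turns raw scores into class probabilities; a minimal sketch, with hypothetical scores:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting the maximum before
    exponentiating avoids overflow without changing the result."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical raw scores for three object classes plus background.
scores = [2.0, 0.5, 0.1, -1.0]
probs = softmax(scores)  # probabilities summing to 1
```

The class with the largest score receives the largest probability, and the region is assigned to it.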
15:23
The authors continued to improve Fast R-CNN and proposed Faster R-CNN in 2016. They developed their own region proposal method to replace selective search: the RPN, or region proposal network. The RPN is based on the anchor system. The Faster R-CNN architecture is formed as follows.
15:46
First, the image is fed to the input of the convolutional network, and thus the feature map is formed. Second, the feature map is processed by the RPN, or region proposal network, layer, as we can see in figure 12.
16:00
And here a sliding window is traversed over the feature map. The center of the sliding window is connected to the center of the anchors. Anchors are areas that have different aspect ratios and different sizes. The authors use three aspect ratios and three sizes. Based on the intersection over union metric, the degree of intersection of anchors and ground-truth
16:27
rectangles, the decision is made about the current region, whether there is an object or not. Third, the subsequent Fast R-CNN algorithm is used. The feature map with the regions of objects is transferred to the pooling layer, followed by the processing
16:44
of fully connected layers for classification, as well as determining the displacement of the borders of potential objects. The Faster R-CNN model does slightly worse localization than Fast R-CNN, but is faster. The model is trained as a single network. The error function
17:02
is a weighted loss function for two branches corresponding to classification and regression. R-FCN, or region-based fully convolutional network, is a logical continuation of the development of the Faster R-CNN method. The main idea of R-FCN is to generate, at the output of the network, confidence maps of
17:24
belonging to the admissible classes, which are sensitive to the location of areas of the possible presence of objects, or position-sensitive score maps. The R-FCN operation diagram: as shown in figure 13, the extracted features from the original image pass
17:41
a forward path through convolutional neural networks, adding convolutional layers and generating a set of confidence maps for belonging to the admissible classes, which are sensitive to the location of areas of the possible presence of objects. Then regions of possible presence of objects are generated using a fully convolutional RPN.
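The intersection over union metric, which the RPN above uses to decide whether an anchor contains an object, reduces to a few lines; the (x0, y0, x1, y1) box format is an assumption for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# An anchor is typically kept as a positive example when its IoU with a
# ground-truth rectangle is high (e.g. above 0.7) and discarded when low.
```

Identical boxes give IoU 1, disjoint boxes give 0, and partial overlaps fall in between.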
18:05
Next comes combining the confidence maps according to the relative position in the area, the position-sensitive pooling layer. According to the location of the area, the corresponding part of the confidence maps corresponding to the relative position of an object is cut out, and the resulting maps are combined according to the relative position.
18:22
The final classification uses a softmax classifier. The classifier input is a confidence vector of the region belonging to each of the admissible classes, obtained through voting. The list of deep models for object detection is not limited to those discussed in the lecture. There are a large number of modifications of the considered architectures, in particular Faster R-CNN and
18:46
SSD, as evidenced by the results of well-known competitions for detecting objects of different classes. Thank you for your attention. See you next time!