
Segmentation problem


Formal Metadata

Title
Segmentation problem
Alternative Title
Segmentation problem (Ukrainian: Завдання сегментації)
Title of Series
Number of Parts
10
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Transcript: English (auto-generated)
Hello and welcome. Today we will talk about one of the most complex methods of digital image processing: image segmentation. Segmentation is the process of dividing a digital image into several segments, that is, fragments or groups of pixels, according to some general criterion.
Segments differ from each other in elementary characteristics such as brightness, color, texture, and shape. The purpose of segmentation is to simplify or change the representation of an image so that it is simpler and easier to analyze.
Depending on the criterion, different image segmentation problems are obtained. Object extraction: selection of a specific object specified by the user or otherwise given. For example, the user can select an object with a bounding box or draw an outline of the object.
Unsupervised segmentation: dividing an image into regions that are homogeneous in their visual characteristics and differ from neighboring regions. Incorrect selection of segments in an image can ultimately affect the quality of recognition and even make it impossible. Therefore, the task of segmentation is extremely important and very relevant.
Semantic segmentation. Semantic segmentation is the process of assigning a label to every pixel in the image. This is in stark contrast to classification where a single label is assigned to the entire picture.
Semantic segmentation treats multiple objects of the same class as a single entity. On the other hand, instance segmentation treats multiple objects of the same class as distinct individual objects or instances. Typically, instance segmentation is harder than semantic segmentation.
Comparison between semantic and instance segmentation, figure 1. Classical methods. Before the deep learning era kicked in, a good number of image processing techniques were used to segment an image into regions of interest. Some of the popular methods are listed below.
Gray-level segmentation. The simplest form of semantic segmentation involves hard-coded rules or properties that a region must satisfy to be assigned a particular label. The rules can be framed in terms of a pixel's properties, such as its gray-level intensity.
One such method that uses this technique is the split-and-merge algorithm. This algorithm recursively splits an image into sub-regions until a label can be assigned, and then combines adjacent sub-regions with the same label by merging them.
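As a rough illustration, the split step of such an algorithm might look like the sketch below. This is a toy version under assumptions of my own: homogeneity is judged by gray-level spread against an arbitrary threshold, the merge step that fuses adjacent same-label regions is omitted, and the function name is invented.

```python
import numpy as np

def split_and_merge(img, thresh=10.0, min_size=2):
    """Toy split step of split-and-merge: recursively quarter the image
    until each region's gray-level spread is below `thresh`, then fill
    the region with its mean intensity (a stand-in for a label).
    The merge step (fusing adjacent same-label regions) is omitted."""
    out = np.empty_like(img, dtype=float)

    def split(r0, r1, c0, c1):
        block = img[r0:r1, c0:c1]
        # Homogeneous (or too small to split further): assign one "label".
        if block.max() - block.min() <= thresh or (r1 - r0) <= min_size:
            out[r0:r1, c0:c1] = block.mean()
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        split(r0, rm, c0, cm); split(r0, rm, cm, c1)
        split(rm, r1, c0, cm); split(rm, r1, cm, c1)

    split(0, img.shape[0], 0, img.shape[1])
    return out

# A 4x4 image with one bright quadrant: the split step isolates it.
img = np.zeros((4, 4)); img[:2, :2] = 100.0
seg = split_and_merge(img)
```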
Figure 2. The problem with this method is that rules must be hard-coded. Moreover, it is extremely difficult to represent complex classes such as humans with just gray-level information. Hence, feature extraction and optimization techniques are needed to properly learn the representations required for such complex classes.
Consider segmenting an image by training a model to assign a class to each pixel. If our model is not perfect, we may obtain noisy segmentation results that are impossible in nature, such as dog pixels mixed with cat pixels, as shown in figure 3.
These can be avoided by considering a prior relationship among pixels, such as the fact that objects are continuous, and hence nearby pixels tend to have the same label. To model these relationships, we use conditional random fields, CRFs.
Conditional random fields, CRFs. CRFs are a class of statistical modeling methods used for structured prediction. Unlike discrete classifiers, CRFs can consider neighboring contexts such as relationship between pixels before making predictions.
This makes it an ideal candidate for semantic segmentation. Each pixel in the image is associated with a finite set of possible states. In our case, the target labels are the set of possible states. The cost of assigning a state or label, u, to a single pixel, x, is known as its unary cost.
To model relationships between pixels, we also consider the cost of assigning a pair of labels (u, v) to a pair of pixels (x, y), known as the pairwise cost. We can consider only pairs of pixels that are immediate neighbors (grid CRF), or all pairs of pixels in the image (dense CRF).
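A minimal sketch of these costs, assuming a tiny 1-D "image" of three pixels, two labels, and a simple Potts pairwise cost (all invented here for illustration):

```python
import numpy as np

def crf_energy(labels, unary, pairs, pairwise_cost):
    """Energy of a labeling: the sum of per-pixel unary costs plus a
    pairwise cost for every listed pair of neighboring pixels.
    `unary[i, u]` is the cost of giving pixel i label u;
    `pairwise_cost(u, v)` penalizes the label pair (u, v)."""
    e = sum(unary[i, labels[i]] for i in range(len(labels)))
    e += sum(pairwise_cost(labels[i], labels[j]) for i, j in pairs)
    return e

unary = np.array([[0.0, 2.0],   # pixel 0 prefers label 0
                  [1.0, 1.0],   # pixel 1 is ambiguous
                  [2.0, 0.0]])  # pixel 2 prefers label 1
pairs = [(0, 1), (1, 2)]        # grid CRF: immediate neighbors only
potts = lambda u, v: 0.0 if u == v else 1.0  # penalize disagreement

smooth = crf_energy([0, 0, 1], unary, pairs, potts)  # energy 2.0
noisy  = crf_energy([0, 1, 0], unary, pairs, potts)  # energy 5.0
```

Minimizing this energy over all labelings prefers the smoother assignment, which is exactly the smoothing effect the pairwise term is meant to provide.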
The sum of the unary and pairwise costs of all pixels is known as the energy (or cost) of the CRF, and it is minimized to obtain a good segmentation output. Deep learning methods. Deep learning has greatly simplified the pipeline to perform semantic segmentation and is producing results of impressive quality.
Model architectures. One of the simplest and most popular architectures used for semantic segmentation is the fully convolutional network (FCN). An FCN passes the input image through a series of convolutions that progressively downsample it; this set of convolutions is typically called the encoder.
The encoded output is then upsampled either through bilinear interpolation or a series of transpose convolutions. This set of transposed convolutions is typically called the decoder. This basic architecture, despite being effective, has a number of drawbacks.
One such drawback is the presence of checkerboard artifacts due to uneven overlap of the output of the transpose convolution or deconvolution operation. Another drawback is poor resolution at the boundaries due to loss of information from the process of encoding.
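The uneven-overlap origin of checkerboard artifacts can be seen in one dimension with a naive transposed convolution. This is illustrative code of my own, not any particular framework's implementation:

```python
import numpy as np

def transpose_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution: each input element scatters
    a scaled copy of the kernel into the output, `stride` apart."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * np.asarray(kernel)
    return out

# Kernel size 3 with stride 2: the size is not divisible by the stride,
# so kernel copies overlap unevenly. Interior output positions alternate
# between one and two contributions, the 1-D analogue of checkerboarding.
y = transpose_conv1d(np.ones(5), kernel=[1.0, 1.0, 1.0], stride=2)
```

With a uniform kernel and all-ones input, an even output would be constant; the alternating values here are purely an artifact of the stride/kernel mismatch.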
Several solutions were proposed to improve the performance quality of the basic FCN model. Below are some of the popular solutions that proved to be effective. U-Net. The U-Net is an upgrade to the simple FCN architecture. It has skip connections from the output of convolution blocks to the corresponding input of the transposed convolution block at the same level.
These skip connections allow gradients to flow better and provide information from multiple scales of the image. Information from larger scales (upper layers) can help the model classify better. Information from smaller scales (deeper layers) can help the model segment and localize better.
Tiramisu model. The Tiramisu model is similar to the U-Net except for the fact that it uses dense blocks for convolution and transpose convolutions.
A dense block consists of several layers of convolutions where the feature maps of all preceding layers are used as inputs for all subsequent layers. The resultant network is extremely parameter efficient and can better access features from older layers. A downside of this method is that, due to the nature of the concatenation operations
in several ML frameworks, it is not very memory efficient and requires a large GPU to run. Multiscale methods. Some deep learning models explicitly introduce methods to incorporate information from multiple scales. For instance, the pyramid scene parsing network (PSPNet) performs the pooling operation (max or average) using four
different kernel sizes and strides to the output feature map of a CNN such as the ResNet. It then upsamples the size of all the pooling outputs and the CNN output feature map using bilinear interpolation and concatenates all of them along the channel axis.
A final convolution is performed on this concatenated output to generate the prediction. Atrous (dilated) convolutions present an efficient method to combine features from multiple scales without increasing the number of parameters by a large amount.
By adjusting the dilation rate, the same filter has its weight values spread out farther in space, which enables it to learn more global context. Hybrid CNN-CRF methods. Some methods use a CNN as a feature extractor and then use the extracted features as the unary cost (potentials) input to a dense CRF.
This hybrid CNN-CRF method offers good results due to the ability of CRFs to model inter-pixel relationships. Loss functions. Unlike normal classifiers, a different loss function must be selected for semantic segmentation.
Pixelwise softmax with cross entropy. Labels for semantic segmentation are the same size as the original image. The label can be represented in one-hot encoded form as depicted below, figure 12.
Since the label is in a convenient one-hot form, it can be directly used as the ground truth (target) for calculating cross entropy. However, softmax must be applied pixelwise on the predicted output before applying cross entropy, as each pixel can belong to any one of our target classes.
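A minimal numpy sketch of this pixelwise loss, with a tiny made-up 2x2 image and three classes:

```python
import numpy as np

def pixelwise_softmax_ce(logits, onehot):
    """Mean cross entropy over all pixels.
    logits: (H, W, C) raw network outputs; onehot: (H, W, C) labels."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(onehot * log_softmax).sum(axis=-1).mean()

# 2x2 "image", 3 classes; the label is a one-hot volume of the same
# spatial size as the prediction, exactly as described above.
labels = np.array([[0, 1], [2, 1]])
onehot = np.eye(3)[labels]                 # shape (2, 2, 3)
logits = 10.0 * onehot                     # confident, correct logits
loss = pixelwise_softmax_ce(logits, onehot)
# Confident correct predictions give a loss near 0; uniform logits
# would give log(3) per pixel, the entropy of a 3-way guess.
```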
Focal loss. Focal loss proposes an upgrade to the standard cross entropy loss for use in cases of extreme class imbalance. Consider the plot of the standard cross entropy loss equation (figure 13, blue curve).
Even in the case where our model is pretty confident about a pixel's class, say 80%, cross entropy has a tangible loss value, here around 0.3. On the other hand, focal loss (purple curve, with a gamma of 2) does not penalize the model to such a large extent when it is confident about a class; that is, the loss is nearly 0 at 80% confidence.
Let us explore why this is significant with an intuitive example. Assume we have an image with 10,000 pixels and only two classes: a background class (0 in one-hot form) and a target class (1 in one-hot form).
Let us assume 97% of the image is the background and 3% of the image is the target. Now, say our model is 80% sure about pixels that are background but only 30% sure about pixels that are the target class. Using cross entropy, the loss due to background pixels equals 97% of 10,000 (9,700 pixels) multiplied by a per-pixel loss of about 0.3, which
is roughly 2,910, and the loss due to target pixels equals 3% of 10,000 (300 pixels) multiplied by about 1.2, which is roughly 360. Clearly the loss due to the more confident class dominates, and there is very little incentive for the model to learn the target class.
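The arithmetic of this example, and the corresponding focal-loss numbers, can be checked in a few lines. Note that using the exact per-pixel value -log(p) gives slightly different totals than the rounded plot readings above, but the same conclusion:

```python
import numpy as np

def ce(p):
    """Cross entropy for a pixel whose true-class probability is p."""
    return -np.log(p)

def focal(p, gamma=2.0):
    """Focal loss: cross entropy scaled down by (1 - p)^gamma,
    so confident predictions contribute almost nothing."""
    return -((1 - p) ** gamma) * np.log(p)

n_pixels = 10_000
n_bg, n_fg = int(0.97 * n_pixels), int(0.03 * n_pixels)  # 9,700 / 300
p_bg, p_fg = 0.8, 0.3   # model confidence per class, as in the example

ce_bg, ce_fg = n_bg * ce(p_bg), n_fg * ce(p_fg)
fl_bg, fl_fg = n_bg * focal(p_bg), n_fg * focal(p_fg)
# With plain cross entropy the easy background dominates the total loss;
# with focal loss its contribution shrinks by (1 - 0.8)^2 = 0.04, so the
# hard target class now drives the gradient.
```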
Comparatively, with focal loss the loss due to background pixels equals 9,700 multiplied by a per-pixel loss of nearly 0, which is close to 0. This allows the model to learn the target class better. Dice loss. Dice loss is another popular loss function used for semantic segmentation problems with extreme class imbalance.
The dice loss is used to calculate the overlap between the predicted class and the ground truth class. Our objective is to maximize the overlap between the predicted and ground truth class that is to maximize the dice coefficient.
Hence, we generally minimize 1 - D (one minus the dice coefficient) instead, which achieves the same objective, as most ML libraries provide options for minimization only. Even though dice loss works well for samples with class imbalance, the formula for calculating its derivative, shown above, has squared terms in the denominator.
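A minimal soft dice loss might look like the sketch below (my own illustrative version, for the binary case, with a small epsilon guarding the denominator):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft dice loss: 1 - dice coefficient, so that maximizing the
    overlap becomes a minimization problem, as described above.
    pred: predicted foreground probabilities; target: binary mask."""
    inter = (pred * target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    return 1.0 - dice

target = np.array([[1, 1], [0, 0]], dtype=float)
perfect  = dice_loss(target, target)          # full overlap -> loss ~0
disjoint = dice_loss(1.0 - target, target)    # no overlap   -> loss ~1
```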
When those values are small, we could get large gradients leading to training instability. Evaluation of segmentation accuracy. How is segmentation accuracy assessed? First, there are per-pixel metrics.
That is, we estimate the proportion of correctly segmented pixels in the image, regardless of what problem we are considering. A pixel counts as correct if its label is correct, and we report the proportion of correct pixel labels relative to all pixels. The second evaluation option is approximately the same as the one we used in object detection.
That is, we have many segments. We compare the segments to each other and calculate the IOU (intersection over union) criterion, figure 14. For each object class under consideration, we count the fraction of correctly classified pixels relative to the union of correctly classified pixels, false positives, and false negatives.
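For pixel label maps, this per-class IOU is a few lines of numpy (a sketch with a made-up 2x2 example):

```python
import numpy as np

def iou_per_class(pred, truth, n_classes):
    """Per-class IOU = TP / (TP + FP + FN), computed on label maps."""
    ious = []
    for c in range(n_classes):
        p, t = (pred == c), (truth == c)
        inter = np.logical_and(p, t).sum()   # true positives
        union = np.logical_or(p, t).sum()    # TP + FP + FN
        ious.append(inter / union if union else float("nan"))
    return ious

truth = np.array([[0, 0], [1, 1]])
pred  = np.array([[0, 1], [1, 1]])
ious = iou_per_class(pred, truth, 2)   # class 0: 1/2, class 1: 2/3
```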
In order to calculate IOU, you must be able to calculate the internal volume of the objects in question. In the case of polygonal models, they most often resort to volume estimation using the Monte Carlo method.
Also, for correct calculation of IOU, it is necessary that the compared models have the same scale and orientation. In computer vision problems, IOU is often used to evaluate how correctly a bounding box has been found, that is, as a metric of the quality of the algorithm.
But recently, various modifications of IOU have appeared that can also be used as a loss function. Problems in image segmentation. First, variability in image quality. One of the main problems in image segmentation is the variability of image quality.
Images obtained under different lighting conditions with different resolutions or content of noise and artifacts can significantly affect the accuracy of the segmentation process. For example, when segmenting medical images, changes in image quality due to different imaging
modalities or patient movement may make it difficult to accurately identify structures of interest. Second, ambiguity in object boundaries. Another common challenge in image segmentation is ambiguity in object boundaries.
In some cases, the boundaries between different objects or regions in an image may be unclear or indistinct, making it difficult to segment them accurately. For example, when segmenting natural scenes, the boundaries between objects such as trees, bushes, and grass may be blurred, leading to segmentation errors.
Third, complex object shapes. Segmenting objects with complex shapes can present serious challenges. Objects with irregular contours, curved boundaries, or complex structures often require more sophisticated methods for accurate segmentation. For example, segmenting organs in medical images such as the liver or brain can be challenging due to their complex shapes and internal structures.
Fourth, overlapping and occlusion. Segmenting images with overlapping objects or occlusions can be particularly challenging.
When objects of interest overlap or occlude each other, it becomes difficult to accurately separate them and assign correct labels. This problem is commonly encountered in various fields, including object detection in computer vision or cell segmentation in biomedical imaging.
Fifth, lack of sufficient training data. The availability of labeled training data is critical for training accurate image segmentation models. However, acquiring a large and diverse dataset with annotated ground truth can be a time-consuming and labor-intensive process.
Moreover, in certain areas, such as rare diseases or specific medical conditions, obtaining sufficient amounts of high-quality training data can be extremely challenging. Transfer learning, where pre-trained models are fine-tuned on small datasets, can be a useful strategy in such cases. Additionally, data augmentation techniques such as rotation, scaling, or flipping
can help generate synthetic training samples to compensate for scarce data. Image segmentation faces various challenges related to image quality, object boundaries, complex shapes, occlusion, and training data availability.
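The simple augmentations mentioned above (flips and rotations) are one-liners in numpy; for segmentation, the same transform must of course also be applied to the label mask:

```python
import numpy as np

def augment(img):
    """Generate simple synthetic variants of one training image:
    horizontal/vertical flips and 90- and 180-degree rotations.
    For segmentation, apply the identical transforms to the mask."""
    return [np.fliplr(img), np.flipud(img),
            np.rot90(img, 1), np.rot90(img, 2)]

img = np.arange(6, dtype=float).reshape(2, 3)
variants = augment(img)   # 4 extra samples from one labeled image
```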
Thank you for your attention and see you again!