Evaluation of Deep Learning Instance Segmentation Models for Pig Precision Livestock Farming
Formal Metadata

Number of Parts: 30
License: CC Attribution 4.0 International (you are free to use, adapt, copy, distribute, and transmit the work in adapted or unchanged form for any legal purpose, as long as it is attributed to the author)
Identifier: 10.5446/53692 (DOI)
Transcript: English (auto-generated)

00:00
Then let me start with my presentation on the topic "Evaluation of Deep Learning Instance Segmentation Models for Pig Precision Livestock Farming". First of all, I want to introduce the topic by describing the motivation, the problem, and the current demographic situation
00:21
of pig livestock farming, especially in Germany. Data from the Federal Statistical Office show two opposing trends: a continuously declining number of pig farms in Germany since 2010, with simultaneously increasing numbers
00:40
of pigs per farm since 2010, together with a highly volatile slaughter price. These three conditions make it pretty difficult for a farmer to remain economically profitable in pig livestock farming.
01:00
In addition to that, society and politics alike are demanding more sustainable and animal-friendly husbandry, which puts additional pressure on the farmer and makes economically profitable pig livestock farming even more difficult. So the question arises how these challenges
01:21
can be addressed in the present or in the future. Because of these conditions, the term precision livestock farming has increasingly gained popularity in the literature because it's seen as a possible approach to some of these challenges.
01:41
For example, the effective management of the continuously increasing number of pigs per farm. So precision livestock farming describes systems that utilize modern camera and sensor technologies to enable automatic real-time monitoring in livestock production, supervising animal health, welfare, and behavior.
02:00
This involves the automated acquisition, processing, analysis, and evaluation of sensor-based data such as temperature, humidity, or CO2 concentration, as well as image and video data. That was quite a lot of information for just two sentences, so let me decompose the problem with this image.
02:22
So basically, precision livestock farming describes a system that utilizes camera and sensor data to monitor the livestock of the respective farmer and tries to analyze these data streams with some kind of processing unit
02:41
or computing device that is placed in the farm, in the pen, or outside the farm. Maybe once 5G becomes widespread, we could stream these big data streams to a distant server, but under current conditions this is not possible.
03:04
So the farm installs camera systems above a respective pen, for example, to monitor the livestock, and the sensors log environmental information such as temperature, humidity, and so on. And on these edge devices or processing units,
03:24
specific models, rule systems, or software applications are implemented, which deal with specific tasks based on the data streams they get from the camera and sensor recordings. For example, in PLF, many different use cases
03:43
can be addressed using only the image data provided by the cameras: for example, monitoring of pig behavior, detection of aggressive behavior, early detection of diseases, pig counting, pig posture classification, and so on.
04:04
The possibilities based on image data alone are nearly endless. And the systems we develop need to extract use-case-specific information and provide it to the farmer, who can access it
04:20
via smartphone or mobile app from wherever he wants. For example, if the monitoring system detects anomalies in the livestock, the farmer can be informed immediately via the app and react accordingly, instead of only noticing a problem when it may already be too late
04:44
by entering the pen and inspecting the livestock locally. The challenges in enabling these systems are quite diverse, because when looking
05:01
at these systems, the most difficult problem is the actual video and image data streams. Because image and video data are unstructured, we can't access information from them right away. So we need methods that enable the extraction
05:24
of structured information from such video streams. Now let's look at single images from a pen. This example image comes from the literature and also served as a foundation for the data set in our paper,
05:40
and the problems become apparent. For example, when detecting pigs in images, even just for localization and recognition, we face extreme piling of pigs, occlusion by objects and by other pigs, soiling from different objects in the pen,
06:08
we have different lighting conditions, a variety of camera positions, camera lenses, backgrounds, and numbers of pigs, because no pen is the same
06:23
when compared to another. So we have constantly changing factors that need to be considered, and the only methods currently available to cope with these constantly changing factors are deep learning methods
06:42
which is why they are currently used quite often in the PLF literature. So when we talk about the detection of pigs, there are basically two approaches. First, object detection, which aims to put a bounding box
07:01
around the respective object to be detected, so that we get specific information about the pig in the image as numerical data, for example in the form of bounding box coordinates.
07:21
So with the help of bounding boxes, we can locate the pig in the image and use this structured information in further processing, for example to track the pig across images. What we don't get with a bounding box is specific information about the contours of the pig.
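This difference can be sketched in a few lines of plain Python. The tiny mask below is a made-up stand-in for one segmented pig, not data from the paper; it just illustrates that a pixel mask carries strictly more information than the box derived from it.

```python
def mask_to_bbox(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) around all 1-pixels."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    return min(xs), min(ys), max(xs), max(ys)

def foreground_fraction(mask, bbox):
    """Share of bounding-box pixels that actually belong to the object."""
    x0, y0, x1, y1 = bbox
    box_pixels = (x1 - x0 + 1) * (y1 - y0 + 1)
    object_pixels = sum(v for row in mask for v in row)
    return object_pixels / box_pixels

# A diagonally lying "pig": its tight bounding box is half background.
mask = [[1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1]]
bbox = mask_to_bbox(mask)            # (0, 0, 3, 2)
frac = foreground_fraction(mask, bbox)  # 6 object pixels / 12 box pixels = 0.5
```

The box alone cannot tell us which of its pixels are pig and which are pen floor; the mask can.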
07:41
For example, for use cases such as tail biting events, we would like to have very precise information about the tail region or the head region, and a bounding box cannot give us this: looking at a bounding box, half of the box is background
08:01
and half is actual pig, and we cannot distinguish which pixels belong to the pig and which don't. This is why we investigated instance segmentation methods, because they give us pixel-level accuracy. Here we can access the contours of the pig
08:23
and use this information for use cases such as tail biting detection in the future. But first we needed to evaluate whether there are models that can actually be used for this kind of use case in this domain,
08:41
which is why we evaluated different instance segmentation models in this paper. There are many different instance segmentation architectures to be found in the literature.
09:00
The domain of computer vision is constantly evolving; almost every week a new architecture with a new state of the art is introduced, so it's pretty difficult to keep track of everything happening in this field. We therefore defined some selection criteria
09:21
so that we could choose specific models or architectures to evaluate in this paper. These criteria are oriented towards the needs of PLF systems as well as towards architectures that have already been used in the literature.
09:42
We defined four different criteria, which I want to present now. The first criterion is that the chosen model should be as accurate as possible, at all costs. So we chose the DetectoRS architecture,
10:04
since it's the most accurate model in terms of precision currently available in the literature. The second criterion: we looked at models with a fast inference time, ideally real-time, which enables models
10:34
that can actually cope with high-FPS video streams. For example, if we need to analyze
10:44
fast-moving objects in video data, for example when detecting aggressive behavior, model inference time was also a very important criterion. The third criterion: we looked at architectures with innovative approaches
11:00
to instance segmentation. This had no specific PLF-related reason; we were simply curious about which innovative architectures exist in the literature. We chose the DETR architecture for instance segmentation
11:21
based on the Facebook implementation. It uses transformer models for instance segmentation, which is quite new and differs completely from the other approaches, which is why we wanted to take a look at it. The fourth criterion: architectures already applied
11:41
in the PLF literature, which is why we chose Mask R-CNN as the baseline. To train and evaluate the models, we created a custom data set, which I want to describe briefly now. Here are some example images from the data set.
12:01
As you can see, we have a wide variety of images in the set. We took great care to include as many different camera angles, lighting conditions, camera locations,
12:21
numbers of pigs, and so on into the data set, so that our model eventually becomes as robust as possible. The data set covers about 20 different locations with varying numbers of pigs per image, and here's an example of how we annotated
12:45
the actual images for the instance segmentation task. As you can see, we had to place small points along the contours of each pig to declare the actual instance. This is very time-consuming,
13:01
and a lot of time went into it, but we had to invest this much time because data quality is the most important thing when training deep learning models on image data. This is why it was considered the most important factor of all.
13:23
Here is some more information about the data set. We annotated 731 images, each containing between 3 and 30 pigs, and on average 90 points per polygon were annotated.
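How such polygon annotations turn into per-pixel training masks can be sketched with a simple even-odd rasteriser. The four-point toy polygon below is a stand-in for the roughly 90-point contours described above; real annotation tools do this conversion for you.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge crosses the horizontal scan line at y
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def polygon_to_mask(polygon, width, height):
    """Rasterise an annotated contour into a binary segmentation mask."""
    return [[1 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0
             for x in range(width)]
            for y in range(height)]

# Toy 4-point contour; real pig annotations averaged about 90 points.
poly = [(1, 1), (6, 1), (6, 4), (1, 4)]
pig_mask = polygon_to_mask(poly, 8, 6)
area = sum(map(sum, pig_mask))  # number of pig pixels: 5 x 3 = 15
```

Each pixel is tested at its center, so the mask is exactly the set of pixels whose centers fall inside the annotated contour.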
13:43
Again, this yields high-quality segmentation masks. A 75/25 split was used to train and test the selected models. So now to the results,
14:00
to the quantitative evaluation. We chose different metrics to evaluate the results. We use the average precision, including the mean average precision, which is the average precision averaged over different IoU thresholds.
14:20
In this case, we used thresholds from 0.5 to 0.95 in steps of 0.05, and we also report the average precision at IoU thresholds of 0.5 and 0.75 as references in the results.
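A minimal sketch of this metric setup, assuming standard COCO-style thresholds: the mask IoU decides whether a prediction counts as a match at each threshold. The two 2x2 toy masks are invented for illustration only.

```python
def mask_iou(a, b):
    """Intersection over union of two same-sized binary masks."""
    inter = sum(p & q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    union = sum(p | q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inter / union

# COCO-style mean AP averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
thresholds = [0.5 + 0.05 * i for i in range(10)]

pred = [[1, 1],
        [1, 1]]
gt   = [[1, 1],
        [0, 0]]
iou = mask_iou(pred, gt)               # 2 / 4 = 0.5
hits = [iou >= t for t in thresholds]  # a match only at the 0.50 threshold
```

A prediction that barely overlaps its ground truth passes the loose 0.5 threshold but fails the stricter ones, which is why averaging over all ten thresholds rewards precise masks.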
14:43
We also use the inference time as a metric, both on GPU and on CPU, and we logged the parameter count of each model to indicate its size, because the parameter count is a good indicator
15:03
of how many resources we would actually need to deploy the model on low-cost hardware, for example. So the results are not surprising: DetectoRS is the most accurate of the tested models.
15:21
This comes at a price in inference time and number of parameters: it is by far the slowest and by far the largest of the tested models, three times larger than the others, which is why it is questionable whether this type of architecture can be deployed on low-cost hardware in the PLF domain.
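Whether a model fits the available hardware can be checked with a latency benchmark like the sketch below. The `dummy_predict` function is a made-up stand-in; a real measurement would call the deployed segmentation model on real frames instead.

```python
import time

def benchmark(predict, frame, warmup=3, runs=20):
    """Mean per-frame latency in seconds and the resulting throughput in FPS."""
    for _ in range(warmup):      # warm-up runs are excluded from timing
        predict(frame)
    start = time.perf_counter()
    for _ in range(runs):
        predict(frame)
    latency = (time.perf_counter() - start) / runs
    return latency, 1.0 / latency

# Stand-in predictor; a real benchmark would run the segmentation model here.
def dummy_predict(frame):
    return [sum(row) for row in frame]

frame = [[0] * 64 for _ in range(64)]
latency, fps = benchmark(dummy_predict, frame)
```

Running this on the target edge device, with the actual model as `predict`, shows directly whether the required frame rate for the use case (e.g. tracking fast-moving pigs) is achievable.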
15:44
More surprising was the fact that the Mask R-CNN model was the fastest of the tested models; we thought SOLOv2 would take this place. However, Mask R-CNN is also the worst of the tested models in terms of mean average precision.
16:01
But looking at the mean average precision over all the different IoU thresholds, it becomes apparent that the models don't differ that much in prediction accuracy. This was also very surprising, because we thought that DetectoRS, being three times larger than the other models,
16:21
had to have a significant edge in detection accuracy, but this was not the case. So the question arises whether these types of big models are fit for use in the PLF domain. Here are some other key facts.
16:42
The DETR model was the smallest of the tested models, but the slowest on CPU. SOLOv2 was the best model at an IoU threshold of 0.5, with an average precision of 0.980,
17:02
and was also the fastest on CPU. Overall, the results show again that these models are quite similar in terms of accuracy, and this brings us to the conclusions of our research.
17:24
In terms of prediction accuracy, every tested model is suitable in the context of PLF. Regarding model size and complexity: increased complexity does not necessarily have a significant effect on improving
17:43
the mean average precision, i.e. the accuracy of the model. Looking at DetectoRS, three times larger than the others, simply scaling the model up does not improve accuracy, which is why other methods are needed to address these shortcomings in the future. Overall, SOLOv2, DETR, and Mask R-CNN
18:02
are better suited than DetectoRS for the PLF domain, and in our opinion, Mask R-CNN is the best baseline for starting with instance segmentation on PLF data. These are my sources. Thank you for listening.
18:21
Sorry for the rough start and for running over time, and thank you for listening.