
HIECTOR: Hierarchical object detector for cost-efficient detection at scale


Formal Metadata

Title
HIECTOR: Hierarchical object detector for cost-efficient detection at scale
Number of Parts
351
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2022

Content Metadata

Abstract
Object detection, classification and semantic segmentation are ubiquitous and fundamental tasks in extracting, interpreting and understanding the information acquired by satellite imagery. Applications for locating and classifying man-made objects, such as buildings, roads, aeroplanes, and cars typically require Very High Resolution (VHR) imagery, with spatial resolution ranging approximately from 0.3 m to 5 m. However, such VHR imagery is generally proprietary and commercially available at a high cost. This prevents its uptake by the wider community, in particular when analysis at large scale is desired. HIECTOR (HIErarchical deteCTOR) tackles the problem of efficiently scaling object detection in satellite imagery to large areas by leveraging the sparsity of such objects over the considered area-of-interest (AOI). This talk presents a hierarchical method for detection of man-made objects, using multiple satellite image sources with different Ground Sample Distance (GSD). The detection is carried out in a hierarchical fashion, starting at the lowest resolution and proceeding to the highest. Detections at each stage of the pyramid are used to request imagery and apply the detection at the next higher resolution, therefore reducing the amount of data required and processed. We evaluate HIECTOR for the task of building detection for a middle-eastern country, estimating oriented bounding boxes around each object of interest. For the detection of buildings, HIECTOR is demonstrated using the following data sources: Sentinel-2 imagery with 10 m GSD, Airbus SPOT imagery pan-sharpened to 1.5 m pixel size and Airbus Pleiades imagery pan-sharpened to 0.5 m pixel size. Sentinel-2 imagery is openly available, making its use very cost-efficient. The Single-Stage Rotation-Decoupled Detector (SSRDD) algorithm is used. Given that single buildings are not discernible at 10 m GSD, a bounding box does not describe a single building but rather a cluster of buildings.
The estimated bounding boxes at 10 m are joined and the resulting polygon area is used to further request SPOT imagery at the pan-sharpened pixel size of 1.5 m. In the case of SPOT imagery, given the higher spatial resolution, one bounding box is estimated for each building. As a final step, predictions are improved in areas with low confidence by requesting Airbus Pleiades imagery at the pan-sharpened 0.5 m pixel size. Ablation studies show that HIECTOR achieves a mean Average Precision (mAP) score of 0.383 and a 20-fold reduction in costs compared to using only VHR imagery at the highest resolution, which achieves a mAP of 0.452. Code will be released under the MIT license. We will also release the trained models on Sentinel-2, SPOT and Pleiades imagery. In addition, manually labelled building footprints over Dakar will be open-sourced to allow users to evaluate the generalisation of the models over different geographical areas. The Sentinel Hub service is used by HIECTOR to request the commercial imagery sources on the specific polygons determined at each level of the pyramid, making it possible to request, access and process specific sub-parts of the AOI.
Transcript: English (auto-generated)
So hi everybody, I'm Nate. I'm a data scientist at Sinergise, and today I'm going to be presenting HIECTOR, the Hierarchical Object Detector using multiscale satellite imagery. So first, some motivation. I don't know about you, but when I first encountered some kind of satellite
or aerial imagery, it was with Google Maps like back in 2006 or something when I was very young. And the first thing I wanted to do is see my house from it. And yeah, you saw the house, it was very nice. But then years later, I started working at the Earth Observation Company, and I got introduced to Sentinel-2. But there you find out that, okay,
Sentinel-2, you try to find your house. You can't really see your house because it's less than 10 meters by 10 meters, and it doesn't even cover one Sentinel-2 pixel. So it means that detection of man-made objects from this medium-resolution imagery is basically not possible, so you require high-resolution imagery, which is very expensive,
and doing the detection over large areas is limited to organizations with a lot of resources or a lot of money. But it turns out that man-made objects, so buildings, roads, and so on, are actually, if you look at the larger scale, at a large AOI, very sparse
and cover a very small percentage of an area of interest, especially if your area of interest is large, maybe country level or even continental level. And this means that we can leverage the hierarchical approach. So we have different levels of detection.
We use the lowest one to do kind of rough estimation and then increase in accuracy as we go along, and to minimize the amount of very high-resolution imagery that we need. And for this, we do this so-called hierarchical divide-and-conquer approach. So first, we start with detection at medium resolution.
So your medium resolution is your Sentinel-2 or maybe even Landsat, and in our use case, we use ESA Sentinel-2 imagery and the 10-meter bands, so blue, green, red, and near-infrared. And the idea here is to use this imagery
for approximate localization of the built-up areas. So you want to know where the actual built-up areas are because just not ordering the very expensive high-resolution imagery over those areas can save you a lot of money. So here is an example of Azerbaijan, and you see that the detected built-up areas are in red,
and the other areas are basically areas where we do not expect any buildings to be or any man-made objects. But of course, for the areas that do contain built-up, you still need to detect them somehow. As I said, Sentinel-2 is not really sufficient to accurately see the footprint of your house
or the bounding box. And this means that we try to use Airbus imagery, so higher resolution, so in our case, we use SPOT, and the idea here is to try to detect large and medium objects. Okay, so here we have this covered,
but especially in developing countries and in some more urban regions, it is a problem because you have some very dense areas with small buildings, maybe townships and so on, which are very hard to separate even at 1.5-meter resolution. And for that, you need something like maybe Pleiades imagery at half a meter resolution,
and the idea here is to order Pleiades only on those areas where the previous level is not enough, and to accurately detect the small objects. So the final product would look something like this: in green, you have the Pleiades imagery and the detections on it,
and in red, the predictions on SPOT, and they are combined together into a map. Of course, I mean, doing detections and so on, we need some kind of model, but I think it's not the point of this talk to go into technical details about the model,
but just to give you an idea, so we use a model that detects oriented bounding boxes directly. So we are not doing segmentation or pixel-based, but we are going more into the object detection approach where you try to localize the object by finding the bounding boxes,
and the model uses a ResNet backbone and a feature pyramid network to be scale-invariant in a sense. And another thing, in practical terms, is that this model turns out to produce, for each detection, a lot of overlapping bounding boxes
with different amounts of pseudo-probability, and that's why we need to apply non-max suppression to get only one bounding box per detected building. As far as training goes, of course, you have the model, you need to train it. We trained two models, one was for Sentinel-2 imagery,
and the other one was a combined model to which we fed both SPOT and Pleiades. And for Sentinel-2, it was quite easy. So we had images available across the whole country of Azerbaijan, and we also had building footprints available for the entire country. I mean, the ground-truth data was not so accurate,
but it was accurate enough to train this kind of model. And again, remember, the idea here is not to train kind of an accurate object or a building-class detector, but it's to find built-up areas rather than single buildings. And for Spot and Pleiades, of course, like it was mentioned before, this imagery is quite expensive.
So we only had imaging data over small AOIs, and these images covered only, for example, nine square kilometers, which is just 0.01% of the country's area. And another thing with these kinds of images is that they have very high spatial resolution,
which means that all the problems in the reference data, so all the kind of geolocation inaccuracies, are even more prevalent. So what we did, we took some areas, and we manually validated buildings in those areas to get kind of a very, very clean reference data set, which enabled us to train an accurate model.
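Incidentally, the non-max suppression step mentioned a moment ago can be sketched in a few lines. This is a minimal, illustrative version for axis-aligned boxes; the actual model predicts oriented boxes, whose overlap needs polygon intersection, but the greedy keep-the-best logic is the same:

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes
    # given as (xmin, ymin, xmax, ymax).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it too much, repeat. Returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping detections of one building collapse to the single higher-scoring box, while a distant box survives.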
And for these two, yeah, we trained the single-stage rotation-decoupled detector. I guess the main engineering part of our procedure is the inference. So it takes a lot of engineering effort to have inference on a large area be scalable.
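The first inference step is splitting the AOI into a sub-grid of chunks. A minimal sketch, using plain coordinates with no projection handling (a hypothetical helper, not the actual HIECTOR code):

```python
def split_aoi(xmin, ymin, xmax, ymax, chunk):
    """Split a bounding box into a grid of chunk-sized tiles.

    Edge tiles are clipped to the AOI, so the grid covers it exactly.
    Returns a list of (xmin, ymin, xmax, ymax) tiles.
    """
    tiles = []
    y = ymin
    while y < ymax:
        x = xmin
        while x < xmax:
            tiles.append((x, y, min(x + chunk, xmax), min(y + chunk, ymax)))
            x += chunk
        y += chunk
    return tiles
```

Each tile is then the unit for which imagery is ordered and the detector is run.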
So what we do is, let's say you have an area of interest, you first split it into a sub-grid of little chunks on which you then order Sentinel-2 imagery. So for each of these sub-chunks,
you then apply the model, and this means that you get the detections, that you get the detections where the built-up areas are. After you have these built-up areas recognized, you order spot imagery for those, and again, apply the same procedure as before
by applying the SPOT/Pleiades model to get, again, detections on those areas. But here we come to a problem. So, okay, we ordered SPOT imagery across all the areas that we suspect are built-up,
but how do we know where to then order Pleiades imagery? So we sub-split those areas again, so there's a lot of re-gridding in this procedure, and for each of these sub-splits, we calculate the so-called drill-down index, which is a function of the number of buildings in that chunk,
and the confidence of the model in its predictions. So the idea here is that you have an area with a lot of small buildings, or a lot of buildings in general, and if the confidence of the model is poor, then there you need to kind of verify your predictions using Pleiades imagery. And again, once you identify those areas,
you order Pleiades imagery, and on that imagery you run your model again, and then you combine the predictions from the spot level and the Pleiades level into your final map.
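The drill-down decision described above could be sketched like this. The exact index function isn't given in the talk, so the formula and threshold below are hypothetical; the talk only states that it combines the number of buildings in a chunk with the model's confidence, and that many buildings plus low confidence should trigger a Pleiades order:

```python
def drill_down_index(n_buildings, confidences, n_weight=1.0):
    """Hypothetical drill-down index: many detected buildings with
    low mean pseudo-probability -> high index -> worth ordering
    higher-resolution Pleiades imagery over this chunk."""
    if not confidences:
        return 0.0
    mean_conf = sum(confidences) / len(confidences)
    return n_weight * n_buildings * (1.0 - mean_conf)

def needs_pleiades(n_buildings, confidences, threshold=5.0):
    # threshold is an assumed tuning knob, not a value from the talk
    return drill_down_index(n_buildings, confidences) > threshold
```

A dense chunk with shaky predictions scores high and gets Pleiades imagery; a chunk with a couple of confidently detected buildings does not.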
Of course, I mean, this is all nice and well if you're doing small areas, but again, here the idea is to scale. And here we come to a problem. So let's say that you have an area the size of Azerbaijan, as was discussed, and we sub-grid it into, I don't know,
100 by 100 meter, or maybe one kilometer by one kilometer, you still get a lot of chunks, and if each one takes 10, 15 seconds to process, this means that it will take a lot of time. So here we leverage the capabilities of Ray, which is one of the open source libraries that we utilize in our team.
And what Ray does is, first of all, it enables you to parallelize across cores on a single machine. So let's say you have a server somewhere lying around for something like this. You can speed up your processing by running Ray, and it will take care that your kind of workflow is fully utilizing all the cores.
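The fan-out pattern is straightforward; here is a sketch using the standard library so it stays self-contained (with Ray, the same shape is a `@ray.remote` task per chunk collected with `ray.get`, which also scales out to a cluster). `detect_chunk` is a hypothetical stand-in for the download-and-infer step:

```python
from concurrent.futures import ThreadPoolExecutor

def detect_chunk(chunk):
    # Hypothetical stand-in: in the real workflow this would order
    # imagery for the chunk and run the detector on it. Here we just
    # echo the chunk so the fan-out pattern is visible.
    return ("detections", chunk)

def run_all(chunks, workers=8):
    # Fan each chunk out to a worker and collect results in order.
    # Ray equivalent (sketch): ray.get([detect_chunk.remote(c) for c in chunks])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect_chunk, chunks))
```

Threads suit the I/O-bound download-heavy parts; for CPU-bound inference on one machine you would use processes, and Ray when spreading across EC2 instances.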
But even this, there's only so much you can do with a single machine. If you want to go kind of higher, or want to utilize on a larger area, you need to do some cloud computing. Here, in our company, we use AWS. So what you can do, you can use Ray again,
so you need to set up a cluster configuration and set up a Docker image, but when that is done, the only thing you need to do is run Ray with your configuration, and it will parallelize your computation across cheap AWS EC2 instances. I know this sounds expensive, but in reality,
it's not really. So for the processing we ran, the costs were in the order of tens or hundreds of euros. So it's not really that much cost, and it's affordable for smaller companies, or even, if need be, individuals. Of course, once you have your data,
I mean, your predictions, we wanted to validate them to see if they are any good, and if what we produced actually makes sense. So we kind of validated those in two different steps. So we did validation on the Sentinel-2 imagery,
and what we found out is that the built-up areas are quite nicely detected, even for the few isolated buildings. So the main question that we had was, okay, let's say that you have a single building somewhere, or maybe a small village with two, three houses. Would we still be able to get these areas
using this Sentinel-2 approach? And it turns out, yes, we can. So the calculation that we made was that we retain around 40% of the entire area of interest, so 60% can be completely discarded and doesn't need to be processed anymore, and with this, we only lose 0.6% of the buildings.
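The retained-area/lost-buildings trade-off can be computed as a simple filtering statistic over the chunks, illustrated by this hypothetical helper:

```python
def retention_stats(chunk_areas, chunk_building_counts, keep_mask):
    """Share of total area and of total buildings retained after
    discarding chunks the Sentinel-2 stage flagged as non-built-up.

    keep_mask[i] is True if chunk i is kept for the next level.
    """
    total_area = sum(chunk_areas)
    total_buildings = sum(chunk_building_counts)
    kept_area = sum(a for a, k in zip(chunk_areas, keep_mask) if k)
    kept_buildings = sum(b for b, k in zip(chunk_building_counts, keep_mask) if k)
    return kept_area / total_area, kept_buildings / total_buildings
```

In the talk's numbers, the kept-area share came out around 0.40 while the kept-buildings share stayed at about 0.994.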
And one of the natural questions that one asks is, okay, why even use Sentinel-2? Why don't you just use something like OpenStreetMap? And it turns out that, especially in developing countries, so for example Azerbaijan, or in Africa, OpenStreetMap is not really accurate,
or it's really not up to date. So for example, in this use case of Azerbaijan, using OpenStreetMap, we would lose 3% of the buildings, so much more than just using Sentinel-2 imagery. Of course, we also then validated the Spot and Pleiades predictions, and kind of the main takeaways was that
we wanted to see if what we did makes sense, and yeah, the conclusions are that it does. So Pleiades is much better at detecting smaller buildings, as expected and as wanted, because the spatial resolution is high. But another one, and here on the graph, we see that, for example, the pseudo-probability distribution for Pleiades is kind of much higher,
so the model is more confident, because again, it gets a higher spatial resolution. But one problem that we faced when we looked at the imagery is the false positive detections. So this is kind of a caveat, because we had limited imagery for Spot and Pleiades, and this limited imagery was kind of focused more on built-up areas,
and so that means that the model didn't see a lot of other areas where there are no buildings, so this means that we had some problems with the false positive detections, yeah, both for the built-up areas and the non-built-up areas. Like I said, of course, there is no free lunch,
so this approach does lead to some loss in accuracy, it can't be free, but maybe we can make it a bit cheaper with similar quality of results. So here we have a nice table comparison between the accuracy, so the accuracy here is the mean Average Precision score,
and the cost saving over using just Pleiades, so let's say that you wanted to just use Pleiades, what would be your cost and what would be your accuracy? So first we look at the first one, we see that the accuracy is, for just using Pleiades,
the mean Average Precision is 0.45, but let's say that we just use the first level of HIECTOR, so Sentinel-2 to get the built-up areas and then order Pleiades over those, we already get a saving of approximately 2.5 times with negligible loss in accuracy, I think,
so there's not much of a difference. Similarly, if we just use Spot, okay, it's much cheaper than Pleiades, but there is a loss in mean average precision, but again, then with Hector, so with all three levels,
and with a threshold that we found works best, you see that the accuracy is somewhere between what you get from Pleiades and what you get just using SPOT, but the cost saving is large, so you save 20 times the amount of money, basically,
and we feel like, yeah, this is a nice result for most use cases. Of course, if you require very accurate precision and you have unlimited amounts of money, then you might consider just using Pleiades, but for most companies, I think using this hierarchical approach makes a lot of sense. Of course, these are scores that we calculated
on our manually validated areas, so it's, yeah, if you use a different area or maybe something like this, then maybe it's not completely accurate, but the rough idea should be the same. After we had this model tuned,
kind of built for Azerbaijan, we also tried some transfer learning and fine-tuned this model on Dakar to see how well it transfers to a completely different geographical region. First of all, we used the Sentinel-2 model without fine-tuning, because we had basically
no training data for Dakar. We just tried to run the Sentinel-2 model as is, and you can see here on the graph that, yeah, the model is much less confident, but overall, the quality seemed to be quite okay. And what we also tried is to fine-tune
the SPOT/Pleiades model. Again, I said that we had no reference data, or not of the quality that we wanted, so we manually labeled a data set of 7,000 buildings and trained the model on that. And it turns out that this leads, actually,
to quite good results, so we compared these results to Open Buildings, so from Google, which provides buildings, basically, across the whole of Africa, and the results are, I think, comparable, if not even favorable for HIECTOR.
So, I've been talking about what we did, so the question from the audience is, how can I use it, and what have we made publicly available? So, HIECTOR is fully open-source, so we've published the models and labels for Dakar, and they are published on an open AWS S3 bucket
that you can freely access. We've also published the full HIECTOR code on our GitHub, so Sentinel Hub HIECTOR, under the MIT license, and we've also published an example notebook, to get you started, that does inference only,
so no training, but for a lot of use cases, maybe this could be enough for somebody to try out what HIECTOR is about. Yeah, just something about the code base, so the code base allows you to reproduce the training and the inference steps in total, but there are, of course, some prerequisites,
which, unfortunately, are unavoidable, so you do need a Sentinel Hub account, because we do work with Sentinel Hub, and you do need some credits for very high-resolution imagery, but again, yeah, depends on your area of interest, how much you need, and our code base is kind of tailored to be used with AWS,
so you need an S3 bucket for the storage of the data, and for the EC2 spot instances. I think this about covers my presentation, so if you have any questions,
please let me know, and I would be happy to answer. All right, thank you, guys.