Earth Observation through Large Vision Models
Formal Metadata
Title: Earth Observation through Large Vision Models
Series: EuroPython 2024, talk 106 of 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
DOI: 10.5446/69428
Language: English
Transcript: English (auto-generated)
00:04
Welcome to EuroPython. My name is Mayank, and I'm a data scientist at ESRI. My talk is about Earth observation through large vision models. So I work mainly on the intersection of satellite data and AI. So let's get started.
00:21
What is Earth observation? And how do we observe events occurring on Earth? Gathering information about Earth's physical, chemical, and biological systems is known as Earth observation. And we can do that with the help of various sensors, like thermometers for temperature,
00:41
barometers for air pressure, soil moisture sensors, and seismographs for measuring the intensity of earthquakes. Now all of these are ground-based observations, which we do from the place of occurrence. The next is remote sensing, which we do from a distance from the place of occurrence. And that can be done by putting the sensors
01:00
on airplanes, drones, satellites, and ships. And they provide a wide-angle view of a particular area. So in this talk, we are mainly going to focus on satellite data. What are satellites? There are a lot of satellites orbiting around the Earth at more than 600 kilometers above its surface.
01:21
And each of these satellites has specific sensors, like GPS for location. There are communication sensors. There's magnetic field sensors, gravitational sensors, and most of these satellites have imagery sensors. So there are various sectors which are dependent upon satellite data set, like urban planning.
01:44
So whenever there's a need of construction of bridges, railways, or tram stations, planning needs to be done, which can be done with the help of overhead imagery. Agriculture. So it can be useful for identifying the crop health. So do you see red trees in this imagery?
02:00
So this is the infrared radiation getting reflected by the trees. And it can be quite helpful in identifying the crop health before it is even visible to our eyes. Disaster management. So there are various specialized satellites which work like a radar system. And they can send the microwave,
02:20
which can even penetrate through clouds. And hence, it can be used for observing the Earth on a cloudy day. And it can be useful for weather forecasting as well. So what's the journey of data from orbit to analysis? So there are a lot of satellites,
02:41
and they have different orbits. Some are geostationary satellites, which remain fixed with respect to the Earth's surface. There are low Earth orbit satellites and medium Earth orbit satellites. So ground stations need to be located strategically on the surface of the Earth based on the speed of these satellites. Data is sent through high-powered antennas
03:02
to these ground stations. And after that, data is sent for processing. So here, during processing, noise removal is done, and data is calibrated based on the sensors. And after that, it is converted into a readable image format. Storage and distribution.
03:21
Now the data is stored into large databases, and then it is distributed over various platforms, which you are going to see in the next slide. Analysis. Now this is where the magic happens. Trained analysts or deep learning professionals use this satellite data set in order to get some outcome out of it.
03:41
Now here are various data platforms where you can download satellite data for free, like Earthdata from NASA, AWS, Sentinel Hub, Bhuvan from the Indian Space Research Organization, Copernicus from the European Space Agency, and USGS. All of these platforms provide different types of satellite data. Some of them might have high resolution.
04:03
Some of them might have different number of bands, and so on. So here you can see an example of Copernicus browser. You just need to provide the area of extent you want to download, the date and time, type of satellite data you want, and you can click on search.
04:21
After that, you'll get the links to download the satellite data set. Now once we have the imagery, how can we create algorithms which can identify how many bridges are there, number of boats, or what's the color of the water in this image, or where exactly is the roundabout, that can be quite useful for urban planning,
04:43
and can we even increase the resolution of this image? We can take the help of deep learning to do that. Now if you compare a satellite image with a regular image, you'll see that the satellite image has many more pixels than a regular image.
05:01
Hence, our deep learning model won't be able to process all of it at once. So we divide it into smaller chips in order to do the processing. So here's how a general deep learning workflow goes. We take the data and we split it into training and validation sets.
05:22
Then we train the model. And after that, we evaluate it on the validation set. So based on the evaluation score, we save it as a predictive model. And if we are not satisfied with the score, we fine tune it until we get a better score.
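A minimal sketch of the chipping step described above, assuming a large scene already loaded as a NumPy array (the 256-pixel chip size and the array shape are illustrative, not from the talk):

```python
import numpy as np

def chip_image(scene: np.ndarray, chip_size: int = 256):
    """Split a large (H, W, C) satellite scene into smaller chips.

    Chips that would run past the image border are simply skipped here;
    a real pipeline would usually pad or overlap them instead.
    """
    chips = []
    height, width = scene.shape[:2]
    for top in range(0, height - chip_size + 1, chip_size):
        for left in range(0, width - chip_size + 1, chip_size):
            chips.append(scene[top:top + chip_size, left:left + chip_size])
    return chips

# Example: a fake 4-band scene of 2048 x 2048 pixels -> 64 chips of 256 x 256
scene = np.zeros((2048, 2048, 4), dtype=np.uint8)
chips = chip_image(scene)
print(len(chips))  # 64
```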
05:41
Now once we have a trained model, we can use it for various tasks, like here we are using it for classification. We can classify airplanes versus cars, or we can detect airplanes and cars in this example. Now this is supervised learning,
06:01
and once we have a model, it can only classify these two classes. Now what if we have a requirement that we want to classify or detect red cars, or a car with a sunroof? We'll have to go through the whole training process again. We'll have to label the red cars, add them to the dataset, and retrain.
06:20
Or we can add language into it. So this is where multimodal deep learning can be helpful. So here we can take the help from different modalities, like vision, audio, and text, and then create a better representation of the data, which can be helpful to learn the model better,
06:41
and then we can save it as a predictive model. So we have to create a shared representation for this data. Like here in this example, we can give a data type like text, dog barking,
07:00
an image, and a video for the scene. We pass it to a multimodal model, which contains various encoders that convert it into an embedding space. If the embedding values are similar, they will be placed close together in this space. And similarly, for a different class, like cars honking in the city and the corresponding audio, it will be put into a different place in this space,
07:23
dependent upon the class. So here you can see, irrespective of the type of data we pass, it will be placed into a shared representation space. Now, you must have seen various vision language models which are available on platforms like Kaggle or Hugging Face.
07:42
Like you give a prompt and image, it will provide you the whole list of ingredients. Like here we have provided a prompt that give me a recipe of these cookies, and it will provide the list of ingredients and the recipe of these cookies. Similarly, here's another interesting use case.
08:01
We have revenue growth of different sectors, like food delivery, classifieds, payments, edtech. Now we ask the model to analyze the contents of this image as a markdown table, and here it has created it as a markdown table. Now, which of the business' units has the highest revenue growth?
08:21
So here, the classifieds unit has the highest revenue growth. The model is able to predict it by looking at this image. Now, what if we use these models on satellite data? Here's an example of an image captioning model.
08:42
Now, this model gives a caption by providing the image. Here, we have provided a satellite image, and it has given the caption that an aerial view of a city with buildings. Now, you can see there's a lot more than the city with buildings. There's a ground field, there's a tennis court,
09:00
a swimming pool, and there are many more things. So we want more information out of the model instead of this generalized information. Similarly, here you can see we ask the model to detect airplanes and windmills. So it has detected this windmill as an airplane as its propeller is looking like
09:21
the propeller of an airplane. So we need to fine-tune these models. But before moving ahead, let's see what challenges we face with satellite data. Objects are quite small in
09:43
satellite images, and they are composed of just few pixels. So if a model is trained on low-resolution images, it won't perform well on high resolution or vice versa. So change in resolution affects the performance of model hugely.
10:01
Position of the sun. So some images might be overexposed. As you can see on the right, there are very high reflections from the objects, and you are not even able to see the road properly due to reflection. And there might be darker regions in the image, like you see a lot of shadows in this image. So models won't be able to perform well
10:22
in these images as well. Bad weather. So if there are clouds, snow, or haze. We get only one shot for a specific area at a specific time, so our model should be able to work well in these conditions as well.
10:45
And the last is the angle which the satellite makes with the Earth's surface. A lot of taller objects might look different from different angles, so the model won't be able to identify them from those angles as well. Now let's see how we can fine-tune these multimodal models,
11:02
which are vision language models. So we'll start with CLIP. This is the most fundamental multimodal model, on which other models have been developed. As the name (Contrastive Language-Image Pre-training) says, it is based on contrastive learning for training, and this model has been created by OpenAI,
11:22
which uses images along with their captions as their training dataset. So the main goal here is to pull the correct image and text pairs towards each other and push the incorrect ones away from each other. So this is how contrastive loss works. Here you can see we have a text encoder
11:41
and an image encoder. We take a batch of text and a batch of images and pass them to the encoders, which convert them into embeddings. Now, these embeddings are used to compute cosine similarities between them. So here we calculate the cosine similarity between all the combinations of embeddings,
12:00
like I1T1, I2T2, or I1T3. So at the main diagonal, you will see the correct image text pairs, and at the off-diagonal, you'll see incorrect image text pairs. So our main goal here is to increase the similarity at the main diagonal and decrease it at the off-diagonal elements. So this is how the model learns
12:21
and the way it gets updated. Once we have a trained model, given a list of texts and an image, it will be able to provide a probability for each prompt we provide. So here you see a satellite image of a roundabout,
12:42
a satellite image of an intersection, and a satellite image of a church, and the roundabout prompt gets the highest probability. Similarly for airplanes, cropland, and buildings. Now let's see where we can use the CLIP model.
13:01
So it can be used for zero-shot image classification. Just like the red car example we saw, we can ask the model to classify the red car in that image and it will be able to give a high probability for that, or for the car with the sunroof. So this is how zero-shot learning works.
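A minimal zero-shot classification sketch with OpenAI's clip package, along the lines of the roundabout example above; the image path and the prompt list are placeholders:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder satellite chip and candidate captions
image = preprocess(Image.open("chip.png")).unsqueeze(0).to(device)
prompts = [
    "a satellite image of a roundabout",
    "a satellite image of an intersection",
    "a satellite image of a church",
]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {p:.3f}")
```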
13:20
We can just give a prompt, and the model doesn't require any training for that specific class. Next, image search. We can provide a prompt and the model will be able to retrieve matching images from the dataset, and based on that, we can search the images. Here you can see an example. We give a prompt that three cars were driving on the bridge opposite to a green river.
13:42
So the model has retrieved all the images of a bridge opposite to a green river. And here you see there are exactly three cars; based on the performance of the model, this result is ranked third. And similarly for other examples.
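Image search can be sketched the same way: embed all chips once, embed the query text, and rank by cosine similarity. The chip paths and the query string below are placeholders, not the talk's actual data:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed every image in the collection once (paths are placeholders)
paths = ["chip_001.png", "chip_002.png", "chip_003.png"]
with torch.no_grad():
    image_feats = torch.cat([
        model.encode_image(preprocess(Image.open(p)).unsqueeze(0).to(device))
        for p in paths
    ])
    image_feats /= image_feats.norm(dim=-1, keepdim=True)

    # Embed the text query and normalise it as well
    query = clip.tokenize(["three cars on a bridge over a green river"]).to(device)
    text_feat = model.encode_text(query)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

# Cosine similarity = dot product of the normalised embeddings
scores = (image_feats @ text_feat.T).squeeze(1)
for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(path, round(score, 3))
```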
14:01
So it can also be used for conditional image generation in text-to-image models. Now you must have seen various models like DALL-E or Stable Diffusion which can generate images based on the text you provide. If you don't condition those models, they will generate images randomly. Those models can be conditioned with the help of CLIP.
14:22
That's why we are able to provide the text and the image gets generated. And in the reverse, it can also be used to generate the caption based on the image. So once we pass the image, it will be able to generate captions from that image. And that can be done by combining clip with a generative text model.
14:43
And lastly, the CLIP score can be used for evaluation as well. We can download various multimodal models from online platforms, and we can use a pre-trained CLIP model in order to evaluate which of these models suits our use case best.
15:04
So let's see how we can fine-tune this model. We are going to use PyTorch and a pre-trained model from OpenAI in order to fine-tune it. So first we provide the image path
15:21
and the path for the captions. And here we are providing the OpenAI model which is loaded from clip and a pre-processor which will be used for processing our images. Then we create a custom class which will be used to create a custom data loader. Here we are tokenizing our text
15:43
and pre-processing our images based on the OpenAI model and which will be returning image and their captions. So here we are creating a data loader with a batch size 50 and passing a custom data set class.
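A sketch of what such a custom dataset class might look like, assuming a list of image paths and a parallel list of captions; the class name, file names, and truncation choice are assumptions, not the speaker's exact code:

```python
import clip
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class SatelliteCaptionDataset(Dataset):
    """Pairs each satellite chip with its caption, in CLIP's input format."""

    def __init__(self, image_paths, captions, preprocess):
        self.image_paths = image_paths
        self.captions = clip.tokenize(captions, truncate=True)  # (N, 77) token ids
        self.preprocess = preprocess

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = self.preprocess(Image.open(self.image_paths[idx]))
        return image, self.captions[idx]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
dataset = SatelliteCaptionDataset(["chip_001.png"], ["a roundabout"], preprocess)
loader = DataLoader(dataset, batch_size=50, shuffle=True)
```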
16:00
Then we are creating an optimizer which will be used for fine-tuning our model. After that we are creating a cross-entropy loss. Here we are not using the contrastive loss, as we are fine-tuning it for our downstream task, which is the satellite dataset.
16:20
Now here's our training loop, which is in PyTorch, and you just have to provide a model which gives you the predicted logits, and the ground truth. We compute the loss for the text and the images, add them up, backpropagate the gradients, and update the model parameters. As simple as that.
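A sketch of that fine-tuning loop, reusing the model and loader from the snippet above; the learning rate and epoch count are placeholders. CLIP returns logits in both directions, so a cross-entropy loss is applied to each and the two are averaged:

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()

for epoch in range(5):  # placeholder epoch count
    for images, texts in loader:
        images, texts = images.to(device), texts.to(device)
        optimizer.zero_grad()

        # CLIP returns image->text and text->image logits for the batch
        logits_per_image, logits_per_text = model(images, texts)

        # Ground truth: the i-th image in the batch matches the i-th caption
        targets = torch.arange(len(images), device=device)
        loss = (loss_img(logits_per_image, targets) +
                loss_txt(logits_per_text, targets)) / 2

        loss.backward()
        optimizer.step()
```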
16:43
So this is how we can fine-tune clip model. Now this is a place in USA which is famous for clubs and casinos. Does anyone want to guess what place is this?
17:05
Yeah, this is Vegas, and this image is from 2018, and on the left we have an image from 2014. Now, how many of you think there are more than approximately 100 differences between these images?
17:23
And how about more than 1,000 at this particular resolution? So there are approximately 100 differences in these two images, and finding them is called change detection. It can be quite helpful in identifying the urban development that happened over the past few years
17:42
or it can be useful for doing environmental monitoring like deforestation or glacial melting is happening over a place. It can be useful for monitoring that or it can be useful for defense. If there's some illegal mining activity happening over some place or some illegal activity at the border,
18:01
it can be useful for that as well, and for disaster management. So there are various use cases we can think of which can be done with the help of change detection. Now, how can we use vision language models in order to detect change? We can do image captioning in order to do that.
18:24
So we provide a past image and a future image. Model will be able to provide how many houses have been built along this road based on these two images. Similarly here it has given a caption that a small green area appears in the desert. So based on this information
18:42
we can keep the relevant information and remove the rest. Now imagine there's an earthquake in a particular area and your task is to identify the buildings which have been damaged due to the earthquake and send reinforcements there as quickly as possible.
19:03
So this is where object detection comes into action. So it can provide the approximate counts of number of objects in that particular area and it can localize those objects by providing the bounding box and the coordinates of those areas. With the help of that we can identify
19:20
the relative distances of those objects, and we can send the reinforcements there quickly, which is efficient. There are various models being used, like YOLO, SSD, and Faster R-CNN, and those models are trained on a fixed number of classes; they can be useful for detecting ships,
19:40
airplanes or trucks and buildings. So there are various applications we can think of. Now let's see how can we use vision models in that case as well. So it can be used for zero shot object detection. You just have to provide a prompt and it will be able to detect those objects.
20:01
Now we just have to describe the object we want to detect like the old large green and red ground track field. It will detect it for you. And similarly a short large overpass. Now it can also refer which object we want to detect. Like here we want to detect the chimney on the left.
20:23
We just have to provide a prompt. It will detect the left chimney and filter out the other ones. And similarly, here, the ship on the top. So it is able to resolve which objects we are trying to identify. We can use Grounding DINO in order to do that.
20:42
It is an open-source model which is built by IDEA Research on top of the Detection Transformer (DETR), and it learns from both detection and grounding data. What this means is that it requires images, their captions, and bounding boxes on those images.
21:02
And the caption should describe what's inside that bounding box, not the whole image, just like we did with the CLIP model. It can be used for doing zero-shot object detection, just like we saw on the last slide. And it can pinpoint exactly which part of the image corresponds to which part of the text.
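As a hedged illustration of this prompt-driven zero-shot detection, recent versions of the Hugging Face transformers library ship Grounding DINO checkpoints; the model id, thresholds, image path, and queries below are assumptions, not from the talk:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("scene.png")   # placeholder satellite chip
text = "a chimney. a ship."       # queries are lowercase and dot-separated

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
for box, label, score in zip(results["boxes"], results["labels"], results["scores"]):
    print(label, round(score.item(), 3), box.tolist())
```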
21:21
Now let's see how this model works. This is the whole architecture. Initially, we use a Swin Transformer as the image backbone and BERT as the text backbone, which convert the inputs into image and text features.
21:41
These features are then passed to the feature enhancer. The feature enhancer uses attention mechanisms in order to fuse the features from image and text. Now let's see what happens in the feature enhancer. Here we are passing the text and image features to a self-attention layer.
22:01
What the self-attention layer does is identify the relationships within the image features and within the text features. After that we use image-to-text cross-attention. What this does is fuse images and text. It processes the text here with the help of keys and values
22:22
and queries from the images. And next, it processes the images with the help of the text, by taking keys and values from the image and queries from the text. After that we pass it to a feed-forward layer, and then the weights get updated. Now, once we have the updated features,
22:42
we use language-guided query selection, which selects a number of queries and filters out the rest, which are not required by our model. So here we select the relevant queries, which are sent to the cross-modality decoder, and filter out the rest. What the decoder does is further fuse the image
23:03
and text features, and then it predicts the bounding boxes. The predicted bounding boxes with their classes are sent to the loss. Here we are using a contrastive loss, just like we saw in CLIP, and a localization loss, which measures the difference between the predicted bounding box
23:21
and the ground-truth bounding box. In a nutshell, this is how this model works. We can use the MMDetection library in order to fine-tune this model. We just have to create a config file where we specify the type of model we want.
23:41
Here we are passing the number of queries, which is 900 in this case, used by the decoder. After that we are passing the preprocessor which will be used for processing the images. And then we are passing the BERT model and the Swin Transformer, which are our encoder backbones.
24:01
And here we are passing the visual layer, text layer, and fusion layer configurations, just like the fusion of text and image we saw in the feature enhancer; in the same way, we are creating this encoder. After that we are creating a decoder
24:21
which contains self-attention, cross-attention for text and images. And after that we create a loss function which will be used in config files. So here we are passing the argument for contrastive loss and the bounding box loss. So these arguments are currently set for satellite dataset.
24:41
They can be set for different domains like medical imagery or fashion imagery and so on. Now here we are using MMDetection to load the config file we just created. We are passing the image directory, and after that we pass the annotations,
25:01
which are our labels. And after that we just have to run the runner which will train our model. After that we can use the inferencer where we can pass the path for a config file and we can pass the weights which were trained from this API.
25:20
And after that we create an inferencer, where we are passing the image file and the prompt. Here we are passing a prompt that we want to detect golf fields. After training the model for around 20 epochs, it is able to detect these golf fields in this area.
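A minimal sketch of that inference step with MMDetection's DetInferencer; the config and checkpoint paths and the prompt are placeholders, and exact argument names may differ slightly between MMDetection releases:

```python
from mmdet.apis import DetInferencer

# Paths to the config written earlier and the weights produced by training
inferencer = DetInferencer(
    model="configs/grounding_dino_satellite.py",      # placeholder config path
    weights="work_dirs/grounding_dino/epoch_20.pth",  # placeholder checkpoint
)

# Text-prompted detection on a single image; results and visualisations
# are written to the output directory.
inferencer(
    inputs="images/airport_area.png",  # placeholder image
    texts="golf field.",
    out_dir="outputs/",
)
```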
25:47
Now this was for vision language models. Now let's see how can we increase the resolution of these images with the help of super resolution model. So what do you mean by resolution of an image?
26:01
It's just the amount of information inside an image, like the number of pixels, the number of channels, or the range of values inside each pixel. We have mainly four types of resolution. First is radiometric resolution, which is the amount of information inside each pixel.
26:22
So the values can range from 0 to 255, or just from 0 to 1. Here you see we have a binary image, so pixels are either uniformly black or uniformly white; each has a value of either zero or one. And here we have different shades of gray.
26:42
And the next is spectral resolution. Now the video or images you take from your cell phone usually have three bands, but satellite sensors are much more advanced. They can capture more bands, like 10 number of bands or 100 number of bands.
27:02
We call them multispectral or hyperspectral images. And they capture different types of information from the Earth's surface. The next one is temporal resolution. It defines the time it takes for the satellite to complete a whole revolution around the Earth
27:21
and revisit the same area; the shorter that revisit time, the higher the temporal resolution. And then we have spatial resolution, which is the area represented by a pixel. Like if a pixel represents a small car, it would be a high resolution
27:41
as compared to a pixel which can represent a large house. So there's a trade-off between temporal and spatial resolution. Now if you take a video from your cell phone, and you take it on a very high resolution, like 8K resolution,
28:02
you won't be able to take it on a very high FPS, like 60 FPS or 90 FPS. You'll get a lot of motion blur and noise. Or if you reduce the resolution of those images, you'll be able to do that in a high FPS. Similarly, if a satellite is orbiting at a very fast speed,
28:23
you won't be able to capture images at a very high resolution. This is how there's a trade-off between these temporal and spatial resolution. Now with the help of super resolution model, we can increase the spatial resolution. Now, why do we need to increase the spatial resolution?
28:42
Because high resolution imagery is quite expensive. What it means that it would require you to send a larger sensor to the space, which would result in a more payload for the rocket, and the launch would be expensive. And not only that, it would also take a lot of time
29:00
for the high-resolution data to be sent to the ground station, and it would require more storage. Super resolution also helps to overcome the trade-off between temporal and spatial resolution, just like we saw on the last slide. And we can even enhance legacy data. There are various sensors which were launched in the past
29:21
and they were not quite advanced. So we can enhance that data with the help of super resolution model. And of course, we can do the better analysis with the help of deep learning models or manually. So there are various classical methods which can be used for this super resolution.
29:50
Like we have interpolation methods like nearest neighbor, bilinear, and bicubic. We can do that easily with the help of OpenCV library.
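A quick sketch of those classical upscaling methods with OpenCV; the file name and the 4x factor are arbitrary:

```python
import cv2

# Load a low-resolution chip (placeholder path) and upscale it 4x
low_res = cv2.imread("low_res_chip.png")

nearest = cv2.resize(low_res, None, fx=4, fy=4, interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(low_res, None, fx=4, fy=4, interpolation=cv2.INTER_LINEAR)
bicubic = cv2.resize(low_res, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

cv2.imwrite("upscaled_bicubic.png", bicubic)
```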
30:02
And we have deep learning methods. They not only increase the resolution of an image, they can even remove noise and compression artifacts from that image. GANs and diffusion models are widely used for this approach. And it only requires high-resolution imagery
30:21
in order to train, since we can create low-resolution imagery out of it. So let's see what the GAN-based approach is and how we can use it. It contains a generator and a discriminator. The generator takes a low-resolution image
30:41
and generator generates a high-resolution image. Initially, if it is not trained, it can generate a random image. And then we pass a generated image and a high-resolution image to a discriminator which identify whether the image is generated by a generator or it is an original high-resolution image.
31:01
So after a few training steps, it will not be able to identify whether a real high-resolution image was passed or whether the image was generated by the generator. This is how a GAN-based model gets trained. This training is quite unstable, but it is fast as compared to our latest
31:21
diffusion-based models. Now let's see how diffusion-based models work. Now it is based on a forward and reverse diffusion process. Once we have a high-resolution image, we can add noise to that high-resolution image
31:40
which gives us a noisy image; this is known as the forward diffusion process. Now, with the help of a trained model, we can remove this noise and get the high-resolution image back, which is known as the reverse diffusion process. This is how diffusion models work in a nutshell.
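The forward diffusion step can be written in a few lines; this sketch follows the standard DDPM formulation, where the noisy image is a mix of the clean image and Gaussian noise controlled by a cumulative schedule (the schedule values and shapes are illustrative):

```python
import torch

def forward_diffusion(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor):
    """Add noise to a clean image x0 to obtain the noisy image x_t.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    """
    noise = torch.randn_like(x0)
    alpha_bar_t = alphas_cumprod[t]
    x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise
    return x_t, noise

# Illustrative linear beta schedule with 1000 steps
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.rand(1, 3, 64, 64)  # placeholder "high-resolution" image
x_t, noise = forward_diffusion(x0, t=500, alphas_cumprod=alphas_cumprod)
```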
32:01
Now, how can we train this model? You can see the various steps we follow in order to train it. We have an original image and a noise sample, and we add the noise to that original image
32:23
and we pass it to a U-Net model, which generates a predicted noise. The U-Net should predict the noise which we have added to the image. After that, we calculate the loss function to check whether it has predicted the correct noise or not.
32:41
We use the predicted and the real noise to calculate the loss and update the weights of the U-Net until it predicts the noise better. And after training the model, we use sampling, which is inferencing. Here, we pass noise to the trained neural network,
33:01
the model predicts the noise, and we subtract that predicted noise from the noisy image, and this is done N times. So iteratively, we remove noise for something like 500 steps, and then we get a generated image. Now, how can we use this to increase the resolution?
33:22
We can concatenate a low-resolution image to a noisy image. So here, we have a low-resolution image like 16 by 16. We upscale it with the help of interpolation method to match the resolution of the original image and after that, we concatenate it with a noisy image.
33:43
Then we pass it to a U-Net neural network. So this is how we can fine-tune a super-resolution model for our dataset. There are different approaches, like DDPM or SR3.
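Putting the pieces together, one training step of an SR3-style model can be sketched as follows: upsample the low-resolution image, concatenate it with the noisy high-resolution image, and train a network to predict the added noise. The stand-in denoiser, shapes, and hyperparameters are placeholders, not the speaker's actual model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the U-Net: takes 6 channels (3 upsampled LR + 3 noisy HR)
# and predicts the 3-channel noise. A real model would be a proper U-Net.
denoiser = nn.Conv2d(6, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

hr = torch.rand(4, 3, 64, 64)           # placeholder high-resolution batch
lr = F.interpolate(hr, size=(16, 16))   # simulate the low-resolution input

# Forward diffusion: add noise to the high-resolution image
t = torch.randint(0, 1000, (1,)).item()
noise = torch.randn_like(hr)
x_t = alphas_cumprod[t].sqrt() * hr + (1 - alphas_cumprod[t]).sqrt() * noise

# Condition on the low-resolution image by upsampling and concatenating
lr_up = F.interpolate(lr, size=hr.shape[-2:], mode="bilinear", align_corners=False)
pred_noise = denoiser(torch.cat([lr_up, x_t], dim=1))

optimizer.zero_grad()
loss = F.mse_loss(pred_noise, noise)    # the network should recover the added noise
loss.backward()
optimizer.step()
```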
34:03
So you can find the complete code on GitHub, which I'll show you on the last slide. And here, we are just passing the transforms and the data loader for our model, where we are resizing the image to create a low-resolution image, and after that, we are creating a U-Net model
34:22
passing the number of input and output channels so it can even make a high-resolution image out of a multispectral image. And after that, we have a SR3 model. We are creating a class of SR3 where we are passing the required parameters like learning rate, size of image, and the loss type.
34:44
Now here, we are passing a scheduler, which will be a noise scheduler; it is used for adding the noise. Then we can train it for something like 800 epochs. Now, this model is quite slow compared to a GAN-based approach, but the results are quite good
35:01
as compared to the GAN-based approach. Now here, you can see the input, which is a low-resolution image, the target image we are trying to achieve, and then the prediction. The model is able to generate a
35:22
prediction based on the low-resolution image and you can see cars and trees and the houses are much sharper. Thank you, everyone.
35:43
Thank you so much for your presentation. It was so informative. If you have any questions, you can ask your questions in Discord, not here, please, for company privacy and Mayank will reply questions on Discord later. Thank you for your understanding.
36:00
The remaining sessions will be in forum hall. Thank you so much. Thank you, everyone.