
How Can We Get 3D-Pictures of Moving Objects with Just One Camera?


Formal Metadata

Title
How Can We Get 3D-Pictures of Moving Objects with Just One Camera?
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Recovering our 3D world with only one camera is a challenge in many fields, ranging from self-driving cars to plastic surgery. In this video, DANIEL CREMERS presents innovations from computer vision that tackle this challenge. Modelling the movement of the camera in addition to the geometry of the depicted world, together with new algorithms, allows the 3D world to be recovered in real time. The algorithms compute the highest possible consistency between consecutive images and make the reconstruction possible, even in the presence of moving objects, with the limited computational power of a laptop.
Transcript: English (auto-generated)
The research question of our project is to recover the 3D world in front of a camera from multiple images taken with that camera. This is a problem that we as humans are familiar with from our own experience: we see the world with our eyes, but we only see two-dimensional projections of that world, and the challenge is to recover the 3D world. This is actually a very classical problem in my research area, called computer vision, and interestingly it has been studied for a hundred years by now. Back in 1913, the Austrian mathematician Erwin Kruppa showed that if you see five corresponding points in two images, you can recover the motion of the camera that relates the two images and the 3D locations of these five points. What makes our project different from Kruppa's is that we don't typically have corresponding points when we switch on the camera. What we see are colors, and so the challenge that we tackle in this project is how to get directly from the colors of the various pixels in the images to a 3D model of the world and to the motion of the camera.
The method that we use to solve this problem takes as input the color images from a moving camera, possibly a handheld camera, and from these images it computes the motion of the camera over time, that is, the rotation the camera undergoes and its translation, and in addition it computes the 3D geometry of the world. This is a problem known as SLAM, which stands for simultaneous localization and mapping: localization means estimating the camera motion, mapping means estimating the 3D structure of the world. To understand how we deviate with this method from existing approaches, you have to see that the existing approaches typically try to find corresponding points, very much based on the original Kruppa paper. They say: let's identify characteristic points in one image, try to find the same points in another image, put these in correspondence, and then recover the structure, very much following Kruppa's original recipe.
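As an illustration of that classical correspondence-based pipeline (a sketch of the approach the speaker is contrasting with, not of the method presented in the video), the motion between two frames could be recovered roughly as follows; the image files and the camera intrinsics K are hypothetical placeholders:

# Classical correspondence-based motion recovery, sketched with OpenCV.
# File names and intrinsics are placeholders, not from the talk.
import cv2
import numpy as np

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[525.0, 0.0, 319.5],   # assumed pinhole intrinsics
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])

# 1. Detect characteristic points and match them across the two images.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Recover the relative camera motion from the correspondences
#    (five points suffice in theory; RANSAC uses many for robustness).
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

# 3. Triangulate the matched points to get their 3D locations
#    (the translation, and hence depth, is only known up to scale).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T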
Where we deviate from that is that we don't extract points from the image; instead we try to find a model of the 3D world and of the camera motion such that the entire images are consistent, such that the colors match. This is a technique called direct image alignment. There are actually many innovations in the paper, but this is one of the most important, central ones; the core component of our method is this idea of directly aligning images. You can phrase the method in this way: find a model of the world and a model of the camera motion such that, if I project an image into that world and transform it into the new camera coordinates, the images are consistent, so that ideally every pixel has a corresponding partner of similar color. Technically, our contribution is that we suggest a cost function that contains the parameters representing the camera motion and the parameters representing the 3D world, and we then optimize that cost function to find the best possible camera transformation and the best possible model of the world, those that put the images into consistency. The cost function essentially measures the color consistency of the images: if they are very color consistent, the cost is low; if they are not color consistent, the cost is very high. In other words, what our algorithm computes is the camera transformation and the 3D model of the world that minimize this cost, that give the best color consistency among consecutive images.
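The talk does not spell out the cost function; a minimal sketch of what such a photometric cost can look like, assuming a reference image with per-pixel depth estimates and a candidate camera motion (R, t), is the following (an illustration in the spirit of direct image alignment, not the authors' implementation):

# Sketch of a photometric cost for direct image alignment. Inputs
# (grayscale images, depth map, candidate pose, intrinsics) are assumed.
import numpy as np

def photometric_cost(img_ref, img_new, depth_ref, R, t, K):
    """Average squared intensity difference between the reference image
    and the new image, after warping every reference pixel through its
    estimated depth and the candidate camera motion (R, t)."""
    h, w = img_ref.shape
    K_inv = np.linalg.inv(K)
    cost, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            d = depth_ref[v, u]
            if d <= 0:                      # no depth estimate at this pixel
                continue
            # Back-project pixel (u, v) to a 3D point, move it into the
            # new camera's coordinates, and project it again.
            p3d = d * (K_inv @ np.array([u, v, 1.0]))
            q = K @ (R @ p3d + t)
            if q[2] <= 0:                   # point behind the camera
                continue
            u2, v2 = q[0] / q[2], q[1] / q[2]
            if 0 <= u2 < w - 1 and 0 <= v2 < h - 1:
                # Color consistency: compare intensities at the two locations.
                r = float(img_ref[v, u]) - float(img_new[int(v2), int(u2)])
                cost += r * r
                count += 1
    return cost / max(count, 1)

An iterative optimizer such as Gauss-Newton would then search for the rotation, translation, and depth values that drive this cost down.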
In addition, one of the aspects where we deviate from classical techniques is that along with the geometry we also estimate an uncertainty. The algorithm not only estimates where the world is, it also knows how certain, how sure it is. That is very helpful: we not only have an estimate of the world, we also know the precision of these estimates in some sense.
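One standard way to maintain such per-pixel uncertainties, sketched here purely for illustration, is to store a variance next to each depth value and fuse new observations by inverse-variance weighting; the exact update rule of the system described in the video may differ:

# Sketch: per-pixel depth with uncertainty, fused over time by
# inverse-variance weighting. Illustrative, not the authors' update rule.
def fuse_depth(d_old, var_old, d_obs, var_obs):
    """Fuse the current estimate (d_old, var_old) with a new observation
    (d_obs, var_obs); the fused result is more certain than either."""
    var_new = (var_old * var_obs) / (var_old + var_obs)
    d_new = (var_obs * d_old + var_old * d_obs) / (var_old + var_obs)
    return d_new, var_new

# Example: a confident estimate barely moves toward a noisy observation.
d, var = fuse_depth(2.0, 0.01, 3.0, 1.0)   # -> d ~ 2.01, var ~ 0.0099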
The findings of our method are numerous. First of all, since we do not abstract the images to a small, sparse set of key points, as is commonly done, we save on runtime, which means that our algorithms are faster: we can recover the 3D world in real time on a single laptop CPU. That means we don't need huge, extensive computational power; many colleagues in our field who do real-time reconstruction need graphics cards and multiple PCs to recover the world in real time, whereas for us a simple laptop is sufficient. In addition, the components in our algorithm that make it large-scale capable mean that the reconstructions are not limited to a desktop environment: we can recover entire houses, multiple buildings, even entire cities in real time.
In addition, we find that since we estimate not just the structure of the world but also an uncertainty, we get much more robust estimates. In other words, if there are aspects that make the estimation difficult, for example moving objects that pass in front of the camera, our algorithms tend to be extremely robust to such nuisances. Although they are built on the assumption that only the camera moves and the world is static, even if there are parts that move through the world, the algorithm can cope with that and can reliably recover a large and metrically accurate world.
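A common ingredient for this kind of robustness, shown here only as an illustration of the principle, is to down-weight pixels with large photometric residuals, for example with a Huber weight, so that moving objects contribute little to the estimated camera motion; the threshold value below is an assumption:

# Sketch: robust (Huber) weighting of photometric residuals, so that pixels
# on moving objects, which produce large residuals, have little influence
# on the estimated camera motion. The threshold k is illustrative.
def huber_weight(residual, k=10.0):
    """Weight 1 for small residuals, decaying as k/|r| for large ones."""
    r = abs(residual)
    return 1.0 if r <= k else k / r

def robust_cost(residuals, k=10.0):
    return sum(huber_weight(r, k) * r * r for r in residuals)

# A pixel on a moving person (residual 80) gets weight 10/80 = 0.125
# instead of the weight 1 it would have in a plain least-squares cost.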
Our method has enormous societal relevance; there are numerous applications that can profit from this technique. What makes our algorithm very attractive is that it is very fast, very accurate, extremely robust to the perturbations that always happen in the real world, and that it runs in real time on a single CPU, even on a laptop. The fact that it runs in real time on a CPU means that we can deploy this algorithm on smartphones, and that will allow everyday people to take their smartphone, go around the world with it, and have the 3D world emerge in real time as a model computed on the smartphone. So I can do virtual tourism: if I have been to the Sistine Chapel, I can take out my smartphone and, rather than just taking classical 2D pictures, I can create a 3D model of the Sistine Chapel. Then I can go back home and show my family and friends what it feels like to walk through the Sistine Chapel, because once you have a 3D model, you can interactively go through that model and get a very immersive sense of actually being there. In addition to just displaying a model of the world that I saw, you can also insert virtual objects. For example, you can go online, say to a furniture store, download 3D models of a chair, a sofa, or a table, and insert these in real time into your living room. You walk around with your smartphone and see your living room, but in addition you have an artificial object in that scene, so you can get a feel for what that chair or sofa would look like standing in your living room, and then decide whether you want to buy this furniture or not.
Furthermore, there are applications like driver assistance and autonomous, self-driving cars, a very important research area that many car companies are looking into these days. We have worked with numerous car manufacturers to devise autonomous driving approaches and driver assistance systems. The challenge there is to recover the 3D world in real time, so that I know at any given moment where my car is and where the 3D world is. In particular, one of the most important challenges is to figure out whether there are obstacles in front of the car, whether there is something I could potentially hit. We have devised algorithms which can compute a 3D model of the world in real time, and they can even predict what is going to happen in the next few seconds: you can estimate the motion of things in front of the car, and that way you can see that there is, say, a child running into the driving corridor, and you can then either send a warning signal to the driver or actively engage in the driving and perform a braking or swerving maneuver to avoid the obstacle. These applications are very important because every day thousands of people die in car accidents.
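As a toy illustration of such a prediction (a simple constant-velocity extrapolation, not the actual system developed with the car manufacturers), one could extrapolate a tracked 3D position a couple of seconds ahead and test it against a hypothetical driving corridor:

# Sketch: predicting where a tracked object will be a few seconds ahead,
# under a constant-velocity assumption. Numbers and corridor bounds are
# made up for illustration.
import numpy as np

def predict(positions, dt, horizon):
    """positions: (N, 3) array of recent 3D positions sampled every dt
    seconds. Returns the extrapolated position `horizon` seconds ahead."""
    velocity = (positions[-1] - positions[0]) / (dt * (len(positions) - 1))
    return positions[-1] + horizon * velocity

# A child moving leftward toward the lane, observed over three frames.
track = np.array([[4.0, 0.0, 20.0], [3.5, 0.0, 19.5], [3.0, 0.0, 19.0]])
future = predict(track, dt=0.5, horizon=2.0)   # -> [1.0, 0.0, 17.0]
in_corridor = abs(future[0]) < 1.5             # hypothetical corridor width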
And why is that? Because human performance is not reliable. If someone did not sleep well the night before, if they have their mind on something else, if they are stressed, if they had too much coffee or not enough coffee, then their driving performance will degrade. In contrast, if you can teach a machine to see the obstacles, to see children, and to avoid them by braking, and we are working on exactly that, then that performance is reliable and constant; it doesn't depend on how much coffee the driver has had. That way you can save numerous lives, and so we are very much hoping that we can help to save lives. Another application of our method is in plastic surgery. We have been working with plastic surgeons to deploy these and other algorithms into the market, the idea being that you can measure the progress of a surgical operation: you can determine what the state is before the operation and what it is after, because you can create 3D models. You can not only monitor the operation, but from a 3D model you can even simulate what an organ would look like if it were changed in this or that way, and so patients actually get feedback, before ever deciding on an operation, on whether that operation would be beneficial or not. All of this can be done, and the key component is an algorithm that recovers the 3D world; we are working on exactly these algorithms.
In terms of research, the challenges that arise and that we are facing are to extend the reconstruction algorithms from a static world to a dynamic world, a world where objects move around the scene. Then it is not only about recovering the motion of the camera, as we do now, and the structure of the 3D world, but also about determining the motion of objects in that world. One of the challenges is that objects can move rigidly, like a car, but often you have non-rigidly moving objects, like people, which are articulated, and estimating these articulated motions along with the camera motion in real time is a very challenging open problem. Another important challenge that goes beyond what we have done is to recover not only the geometry of the world but also its semantic meaning.
That is important for machines: to not only recover the 3D structure but to actually understand what is happening, that there is a person sitting in front of me, that there are people, tables, chairs, bottles, and so on. This goes beyond just the geometry: we have to assign to each point in the reconstructed world a semantic label of the objects we see. That will be very helpful for understanding, so that I can go around and recover my environment from a handheld camera, but not only recover the environment in terms of geometric structure: I can also say that I have seen 25 chairs in this room and three tables. And if you want to do monitoring over time, you can have algorithms that autonomously recover the world and monitor whether it is still the same number of chairs as what I saw yesterday, and so on.
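As a sketch of how such semantic labels could be attached to a reconstruction (an illustration only; the segmentation source, such as a neural network, and the label ids are assumptions, not part of the method described here), one could project the reconstructed 3D points into a semantically segmented image:

# Sketch: attaching semantic labels to reconstructed 3D points by projecting
# them into a segmented image with one integer class id per pixel.
import numpy as np

def label_points(points3d, seg_image, R, t, K):
    """points3d: (N, 3) world points; seg_image: (H, W) integer class ids.
    Returns one class id per point (-1 if it projects outside the image)."""
    labels = np.full(len(points3d), -1, dtype=int)
    h, w = seg_image.shape
    for i, p in enumerate(points3d):
        q = K @ (R @ p + t)                 # project into the camera
        if q[2] <= 0:                       # point behind the camera
            continue
        u, v = int(q[0] / q[2]), int(q[1] / q[2])
        if 0 <= u < w and 0 <= v < h:
            labels[i] = seg_image[v, u]
    return labels

# "How many chairs did I see?" then reduces to counting spatially grouped
# clusters of points that carry the chair label.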