Understanding 3D Reconstruction with COLMAP

00:57:02
https://www.youtube.com/watch?v=EdIuDLicU0c

Summary

TLDR: In this episode of Computer Vision Decoded, the hosts discuss Structure from Motion (SfM) and 3D reconstruction using the COLMAP software. They explain the workflow from feature extraction to camera pose estimation and incremental reconstruction. Jared Heinly, a computer vision expert, elaborates on the importance of camera models, feature matching strategies, and geometric verification. The episode also contrasts incremental and global reconstruction methods, emphasizing the efficiency of COLMAP and the newer GLOMAP software. Listeners are encouraged to experiment with COLMAP to gain practical experience in 3D reconstruction.

Takeaways

  • 📸 COLMAP is an open-source tool for 3D reconstruction.
  • 🔍 Feature extraction identifies unique landmarks in images.
  • 🔗 Feature matching connects similar features across images.
  • 🧮 Geometric verification ensures accurate matches between images.
  • 🔄 Incremental reconstruction adds images one at a time.
  • 🌍 Global reconstruction estimates poses for all images simultaneously.
  • ⚙️ Bundle adjustment refines camera poses and 3D points.
  • 🖼️ Good imagery is crucial for successful reconstruction.
  • 🛠️ Experimenting with COLMAP helps understand 3D reconstruction better.
  • 📚 Tutorials and documentation are available on COLMAP's website.

Timeline

  • 00:00:00 - 00:05:00

    In this episode, the hosts introduce the topic of structure from motion and 3D reconstruction using COLMAP, a free and open-source software package. They aim to demystify the process of 3D reconstruction from imagery, with expert Jared Heinly explaining the workflow involved in obtaining camera poses and creating 3D models.

  • 00:05:00 - 00:10:00

    Jared shares his background with COLMAP, detailing its origins and development by Johannes Schönberger during his time at UNC Chapel Hill. The discussion highlights the evolution of COLMAP from earlier software focused on aerial photography to a more generalized tool for 3D reconstruction from various image collections.

  • 00:10:00 - 00:15:00

    The hosts discuss the initial steps in using COLMAP, emphasizing the importance of understanding camera positions and the process of extracting images from a video or taking multiple photos from different angles to create a 3D model.

  • 00:15:00 - 00:20:00

    Jared explains the significance of feature extraction, where unique landmarks in photographs are identified to establish 2D relationships between images. This step is crucial for later 3D reconstruction, as it allows the software to track points across multiple images.

  • 00:20:00 - 00:25:00

    The conversation moves to the matching process, where the software identifies correspondences between features in different images. This involves geometric verification to ensure that matches make sense in the context of camera motion and scene geometry.

  • 00:25:00 - 00:30:00

    The hosts discuss the various matching algorithms available in COLMAP, such as exhaustive, sequential, and vocab tree matching, and how to choose the right one based on the dataset and image collection strategy.

  • 00:30:00 - 00:35:00

    Incremental reconstruction is introduced, where the software builds a 3D model by adding images one at a time. Jared explains the initialization process and how the software determines which images to use based on feature matches and camera motion.

  • 00:35:00 - 00:40:00

    The episode covers the iterative loop of image registration, triangulation, and bundle adjustment, which refines both the 3D points and camera poses as new images are added to the reconstruction.

  • 00:40:00 - 00:45:00

    Jared clarifies the role of bundle adjustment in optimizing the alignment of 3D points and camera positions, and how it can be performed locally or globally depending on the reconstruction size and complexity.

  • 00:45:00 - 00:50:00

    The hosts briefly touch on GLOMAP, newer software that offers a global reconstruction approach, allowing for faster processing by estimating camera poses for all images simultaneously, contrasting it with COLMAP's incremental method.

  • 00:50:00 - 00:57:02

    The episode concludes with encouragement for listeners to experiment with COLMAP, emphasizing the importance of taking sharp images and understanding the reconstruction process to achieve better results.



Video Q&A

  • What is COLMAP?

    COLMAP is open-source software for 3D reconstruction from images, allowing users to estimate camera poses and create 3D models.

  • What is the first step in 3D reconstruction using COLMAP?

    The first step is feature extraction, where unique landmarks in the images are identified.

  • How does COLMAP handle feature matching?

    COLMAP uses various algorithms for feature matching, including exhaustive, sequential, and vocabulary tree methods, depending on the nature of the image dataset.

  • What is geometric verification in COLMAP?

    Geometric verification is a process that ensures the matches between features in different images make sense geometrically, filtering out incorrect matches.

  • What is the difference between incremental and global reconstruction?

    Incremental reconstruction adds images one at a time, while global reconstruction estimates the 3D poses of all images simultaneously.

  • Can I use COLMAP on a standard computer?

    Yes, COLMAP can run on standard computers, although performance may vary based on hardware specifications.

  • What is bundle adjustment?

    Bundle adjustment is an optimization process that refines the 3D points and camera poses to improve the accuracy of the reconstruction (a toy sketch of the underlying least-squares idea follows after this Q&A list).

  • What should I consider when taking images for 3D reconstruction?

    Ensure to take sharp images with good features and varied angles to improve the quality of the 3D reconstruction.

  • Is there a learning curve for using COLMAP?

    Yes, while COLMAP is powerful, understanding its various options and workflows may require some time and experimentation.

  • Where can I find tutorials for COLMAP?

    COLMAP's official website provides tutorials and documentation to help users get started with the software.
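
    As referenced in the bundle adjustment answer above, the core of that step is a nonlinear least-squares problem over reprojection error. The toy sketch below uses made-up data and SciPy rather than the Ceres-based solver COLMAP actually relies on, and it is shrunk to refining a single camera against fixed 3D points; the real step jointly refines every camera pose and every 3D point.

    import numpy as np
    from scipy.optimize import least_squares

    def rotate(p, r):
        # Axis-angle (Rodrigues) rotation of the Nx3 points p by the vector r.
        theta = np.linalg.norm(r)
        if theta < 1e-12:
            return p
        k = r / theta
        return (p * np.cos(theta) + np.cross(k, p) * np.sin(theta)
                + k * (p @ k)[:, None] * (1 - np.cos(theta)))

    def residuals(params, pts, obs):
        # params = [rx, ry, rz, tx, ty, tz, f]; simple pinhole projection of known 3D points.
        pose, f = params[:6], params[6]
        cam = rotate(pts, pose[:3]) + pose[3:6]
        proj = f * cam[:, :2] / cam[:, 2:3]
        return (proj - obs).ravel()              # reprojection errors stacked into one vector

    pts = np.random.rand(20, 3) + [0, 0, 4]            # synthetic points in front of the camera
    obs = 1000.0 * pts[:, :2] / pts[:, 2:3]            # their pixels for f = 1000 and identity pose
    x0 = np.concatenate([0.01 * np.ones(6), [900.0]])  # deliberately wrong starting guess
    result = least_squares(residuals, x0, args=(pts, obs))
    print(result.x[6])                                  # focal length recovered close to 1000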

Transcript
  • 00:00:00
    Welcome to another episode of computer
  • 00:00:02
    vision decoded. I'm really excited about
  • 00:00:04
    this episode because it's going to solve
  • 00:00:06
    a lot of questions that we get about
  • 00:00:10
    structure from motion and 3D
  • 00:00:12
    reconstruction when it comes to COLMAP
  • 00:00:15
    and just figuring out how to do some of
  • 00:00:17
    the basics of 3D reconstruction from
  • 00:00:20
    imagery. And as always, I have Jared
  • 00:00:22
    Heinly, our in-house computer vision
  • 00:00:25
    expert, to walk us through what happens
  • 00:00:28
    when you run software like COLMAP to
  • 00:00:31
    get camera poses, 3D reconstruction, and
  • 00:00:34
    kind of break down how that all works at
  • 00:00:36
    a tangible level. So when you walk away
  • 00:00:38
    from this episode, you should have a
  • 00:00:40
    better understanding of this black box
  • 00:00:43
    of COLMAP and other 3D reconstruction
  • 00:00:46
    software that follows the same workflow.
  • 00:00:48
    So, as always, Jared, thanks for joining
  • 00:00:50
    me and welcome to the episode. Yeah,
  • 00:00:52
    thank you. Let's just get to what we're
  • 00:00:54
    all here for. Let's let's learn about
  • 00:00:56
    COLMAP. And I don't want to say
  • 00:00:59
    specifically COLMAP, but we're going
  • 00:01:00
    to use it as the basis for this episode
  • 00:01:04
    to have something for someone to follow
  • 00:01:06
    along. And since it's open-source and
  • 00:01:09
    free, they can download COLMAP and do
  • 00:01:13
    this on their own PC without, you know,
  • 00:01:16
    have to pay for some third party
  • 00:01:18
    software that they won't learn as much
  • 00:01:20
    through. So Jared, let's just start off
  • 00:01:22
    with I'm going to share my screen. I
  • 00:01:25
    have some images and we want to turn
  • 00:01:27
    these images into a 3D model or just at
  • 00:01:31
    least know where these cameras are in
  • 00:01:33
    relation to each other. I'm going to be
  • 00:01:35
    doing some screen shares. If you're
  • 00:01:37
    listening to the audio only, I'll do my
  • 00:01:39
    best to talk about what we have on the
  • 00:01:42
    screen. But, uh, if I start out here, I
  • 00:01:45
    have a picture of a well, it was a
  • 00:01:48
    fountain that used to work in front of
  • 00:01:50
    the Oregon State Capitol. I took this
  • 00:01:52
    one sunny day last year. And if I flip
  • 00:01:55
    through the images, I basically walked
  • 00:01:58
    around this fountain and got a bunch of
  • 00:02:02
    good angles. In fact, I believe I used a
  • 00:02:04
    video and extracted a bunch of images
  • 00:02:07
    and at some points there's some sun
  • 00:02:09
    issues, things like that. But it was
  • 00:02:12
    good enough for me to get a 3D model. So
  • 00:02:15
    Jared, what's what's the first step
  • 00:02:17
    someone would take then to turn this
  • 00:02:20
    into a 3D model? Know where the cameras
  • 00:02:22
    are, things like that? Yeah. Yeah. Well,
  • 00:02:24
    you just you hinted it right there at
  • 00:02:26
    the very end. Know where the cameras
  • 00:02:27
    are. And I guess and to try to refine
  • 00:02:29
    some of my language. A lot of times when
  • 00:02:30
    I say camera, sometimes I mean, you
  • 00:02:32
    know, image and camera. I'll use those
  • 00:02:34
    words interchangeably sometimes, you
  • 00:02:36
    know, but you said that you walked
  • 00:02:37
    around with a single camera, you know,
  • 00:02:39
    your phone or a DSLR or whatever it may
  • 00:02:42
    be. And from that video, maybe you
  • 00:02:44
    extract frames, you know, images or you
  • 00:02:46
    took photos yourself. And so you have
  • 00:02:49
    multiple images taken by a single
  • 00:02:51
    physical camera, but you were moving
  • 00:02:54
    around that scene, moving around that
  • 00:02:56
    object. And so that camera was occupying
  • 00:02:58
    different physical 3D points in space
  • 00:03:01
    and then these images were captured from
  • 00:03:02
    those different 3D points and those from
  • 00:03:04
    those different 3D 3D perspectives. So
  • 00:03:07
    you know as humans we just do this
  • 00:03:09
    naturally like as you just flipped
  • 00:03:10
    through those photos there and you know
  • 00:03:12
    uh you know and as you kind of orbited
  • 00:03:14
    around that fountain it's like yeah our
  • 00:03:17
    brains are immediately like oh yeah okay
  • 00:03:19
    I can see that the ground is a little
  • 00:03:21
    bit closer. Here's this foreground
  • 00:03:23
    fountain. I see the trees in the
  • 00:03:24
    background. I see some other structures
  • 00:03:26
    in the background and I'm immediately I
  • 00:03:29
    can see that yep you were moving to the
  • 00:03:31
    left and sort of this clockwise motion
  • 00:03:33
    this thing's you know near that other
  • 00:03:35
    things are far and our brains are
  • 00:03:37
    immediately doing all of that 3D
  • 00:03:38
    reasoning but in order to have software
  • 00:03:42
    do this in order to have a computer
  • 00:03:43
    generate a 3D reconstruction or 3D
  • 00:03:46
    representation of what's in these photos
  • 00:03:49
    it has to figure out it has to do all of
  • 00:03:51
    that math and it doesn't know how to do
  • 00:03:52
    that reasoning by default. It has to figure
  • 00:03:54
    out well where were you standing when
  • 00:03:56
    that photo was taken? Where was the
  • 00:03:58
    camera positioned? How was it angled?
  • 00:04:00
    What was the zoom level uh of of the
  • 00:04:02
    current lens? And so it's doing all has
  • 00:04:04
    to figure out where everything was
  • 00:04:07
    oriented. And that's typically one of
  • 00:04:08
    the first processes is trying to figure
  • 00:04:09
    out how how are things related to each
  • 00:04:11
    other, you know, and once we kind of
  • 00:04:12
    know how they're related, then figure
  • 00:04:14
    out what is the 3D 3D geometry that that
  • 00:04:17
    uh describes describes that
  • 00:04:19
    relationship.
  • 00:04:21
    And so it goes through. So, I I don't
  • 00:04:23
    have it on my screen, but I will pull it
  • 00:04:26
    up in a second, but COLMAP has in
  • 00:04:28
    their tutorial information a good kind
  • 00:04:32
    of diagram. I'll let me bring that up,
  • 00:04:35
    but it it it basically shows the
  • 00:04:36
    workflow that it goes through. So, if I
  • 00:04:39
    go to the actual website for COLMAP
  • 00:04:43
    and you go look at their tutorial, you
  • 00:04:45
    can see that. So, let's just pull that
  • 00:04:47
    up on my screen as well. while he's
  • 00:04:50
    pulling that up. Um, just jump in with a
  • 00:04:53
    little bit of personal history about
  • 00:04:54
    COLMAP. So, I did my PhD back at UNC
  • 00:04:59
    Chapel Hill. So, I was there from 2010
  • 00:05:00
    to 2015. And while I was there, Johannes
  • 00:05:03
    Schönberger, he came to UNC for two
  • 00:05:05
    years to do his masters. And so,
  • 00:05:07
    Johannes, he's the author of COLMAP.
  • 00:05:10
    Um, but at the time when he was there,
  • 00:05:11
    COLMAP didn't exist. Johannes had
  • 00:05:14
    worked on uh previous structure-from-motion
  • 00:05:16
    software and had built he'd worked with
  • 00:05:18
    some uh I believe it was drones so
  • 00:05:20
    aerial photography 3D reconstruction and
  • 00:05:23
    so he had built a pipeline that he had
  • 00:05:24
    called MAVMAP. I I'll probably get this
  • 00:05:27
    wrong but I think like you know mobile
  • 00:05:28
    aerial vehicle, MAVMAP, like map or
  • 00:05:31
    mobile vehicle mapper and so but he was
  • 00:05:34
    looking to generalize that to move
  • 00:05:36
    beyond just aerial photography and to do
  • 00:05:39
    more general purpose image collections
  • 00:05:42
    and So it was this idea of image
  • 00:05:44
    collections you know where he came up
  • 00:05:46
    with COLMAP, collection mapper, um to
  • 00:05:49
    say I want to take a collection of
  • 00:05:51
    images and generate a 3D reconstruction
  • 00:05:53
    from it. So he was working on that while
  • 00:05:54
    he was at UNC. I may have been one of
  • 00:05:57
    the first people to actually use COLMAP
  • 00:05:59
    um in in my final uh PhD project. I
  • 00:06:04
    had uh processed a 100 million images on
  • 00:06:07
    a single PC and I was doing so this you
  • 00:06:10
    feature matching extraction but then I
  • 00:06:12
    needed some way to reconstruct them and
  • 00:06:14
    our lab had some other software that
  • 00:06:16
    could do 3D reconstruction but Johannes
  • 00:06:17
    had just written this first version of
  • 00:06:19
    COLMAP and so I said great let's use
  • 00:06:21
    that and that that was efficient that
  • 00:06:24
    was fast and it did did exactly what we
  • 00:06:26
    needed to do and so that that helped uh
  • 00:06:28
    helped get my paper across the goal line
  • 00:06:29
    there at the very end so nice and since
  • 00:06:32
    then Johannes has gone off you know at
  • 00:06:33
    ETH Zurich and now uh at other companies
  • 00:06:36
    and continued to you know open source
  • 00:06:38
    COLMAP and now it's used all over the
  • 00:06:40
    world and has won won him some awards
  • 00:06:42
    for it. So you know interestingly the
  • 00:06:45
    GLOMAP that came out last year and he
  • 00:06:48
    had his fingers in that as well. Yep. So
  • 00:06:50
    it's not over. I still see COLMAP
  • 00:06:52
    being updated on a semi-regular basis as
  • 00:06:56
    well. So although it came out a few
  • 00:07:00
    several years ago, it's it's not static.
  • 00:07:02
    No. No. because it because it is such a
  • 00:07:05
    an important step the the task that COLMAP
  • 00:07:08
    solves and and similarly GLOMAP
  • 00:07:10
    you know figuring out the 3D pose you
  • 00:07:13
    know pose is position plus orientation
  • 00:07:16
    figuring out the 3D pose of images is a
  • 00:07:19
    key step in so many uh 3D pipelines you
  • 00:07:22
    if you want to understand the world in
  • 00:07:24
    3D you got to figure out where these
  • 00:07:25
    images were taken from you know and
  • 00:07:27
    that's the key task that that COLMAP
  • 00:07:30
    uh solves for a lot of people Okay, that
  • 00:07:33
    that makes that makes sense. I had no
  • 00:07:34
    idea also that COLMAP stood for
  • 00:07:36
    collection mapper. I'm I mean it makes
  • 00:07:39
    sense, but I thought maybe it was a long
  • 00:07:41
    acronym. So, um so, okay. Well, I have
  • 00:07:45
    this diagram up then. If you're
  • 00:07:47
    watching, you can see it on the screen,
  • 00:07:48
    but if you're listening, it's basically
  • 00:07:51
    a workflow of how images go from just a
  • 00:07:55
    collection of images to a 3D
  • 00:07:58
    reconstruction. And you got camera
  • 00:08:00
    poses. And I'm going to show this in
  • 00:08:01
    COLMAP on my screen as well. But this
  • 00:08:04
    diagram just shows the different phases
  • 00:08:06
    are the right or steps that you go
  • 00:08:08
    through to get from pictures to 3D. And
  • 00:08:11
    it starts out with feature extraction.
  • 00:08:14
    And if you actually go to the tutorial
  • 00:08:16
    as well. So if I share the just the
  • 00:08:17
    tutorial page, that diagram makes sense.
  • 00:08:20
    But the minute you start diving into it,
  • 00:08:22
    you have a wall of text that to most
  • 00:08:25
    people won't make through this very well
  • 00:08:28
    unless they are perhaps a computer
  • 00:08:30
    science major, someone like Jared who
  • 00:08:32
    does this academically or for a job. I
  • 00:08:36
    look at this and I'm like, okay, some of
  • 00:08:38
    this makes sense. A lot of this is
  • 00:08:40
    beyond me. So, we're going to break that
  • 00:08:42
    down. So, yeah. Okay, starting out with
  • 00:08:43
    feature extraction. So, what is that
  • 00:08:45
    step? So, what do we We're taking the
  • 00:08:46
    images and sounds like something's
  • 00:08:48
    happening there with features. Yeah.
  • 00:08:50
    Yeah. Absolutely. So, uh, and just to
  • 00:08:52
    take a step back here too. So, like we
  • 00:08:54
    like you said, this is, you know, sort
  • 00:08:55
    of a workflow, a sequence of steps that
  • 00:08:57
    goes into generating a reconstruction.
  • 00:08:59
    So, you had those images input. There's
  • 00:09:00
    a sort of first block of steps that's
  • 00:09:02
    labeled correspondence search. After
  • 00:09:04
    that, we have incremental reconstruction
  • 00:09:06
    and then finally we end up with a final
  • 00:09:08
    reconstruction. But yeah, so within
  • 00:09:09
    correspondence search, our goal for
  • 00:09:11
    correspondence search is to figure out
  • 00:09:14
    the 2D relationship between that
  • 00:09:17
    collection of images. So, we're not even
  • 00:09:18
    talking about, you know, real 3D yet.
  • 00:09:20
    There might be some hints at 3D uh in
  • 00:09:23
    these steps, but we haven't done any
  • 00:09:25
    reasoning to really understand which
  • 00:09:27
    photos, you know, are where in 3D space.
  • 00:09:30
    Um, so it's just about 2D understanding,
  • 00:09:33
    2D matching, 2D correspondence between
  • 00:09:35
    this this collection of images. So, with
  • 00:09:38
    that in mind, first step is feature
  • 00:09:41
    extraction. So the goal there is to
  • 00:09:45
    identify unique landmarks within a
  • 00:09:48
    photograph. And these unique unique
  • 00:09:51
    landmarks, the intent of that is if I
  • 00:09:54
    can identify a unique landmark, you
  • 00:09:56
    know, a 2D point in one photo, hopefully
  • 00:09:59
    I can identify that same point in
  • 00:10:00
    another photo and another photo and
  • 00:10:02
    another photo. And if I can identify and
  • 00:10:05
    follow or you know or track that 2D
  • 00:10:07
    point between multiple images now I can
  • 00:10:10
    use that as a constraint later on when I
  • 00:10:13
    do the 3D reconstruction. I can say, "Hey,
  • 00:10:15
    hey, however these images are
  • 00:10:18
    positioned, that point that they saw,
  • 00:10:21
    that pixel should converge to a common
  • 00:10:24
    3D position in space." And so it's
  • 00:10:26
    adding a sort of a viewing constraint
  • 00:10:28
    saying, you know, each image saw a 2D
  • 00:10:30
    point. I don't know the depth of that
  • 00:10:31
    point. So it all it sort of gives me is
  • 00:10:33
    a viewing ray. So along this direction
  • 00:10:36
    out into the scene, I saw this unique
  • 00:10:38
    landmark. Now, I've seen that same
  • 00:10:40
    landmark in many other photos. I want to
  • 00:10:42
    identify that and add that as a
  • 00:10:44
    constraint because that is like most
  • 00:10:45
    likely you know a 3D point. So feature
  • 00:10:48
    extraction is the automatic
  • 00:10:50
    identification of typically tens of
  • 00:10:52
    thousands thousands or tens of thousands
  • 00:10:54
    of these unique landmarks in an image. A
  • 00:10:57
    lot of times there are different flavors
  • 00:10:58
    of feature detection. The one used in
  • 00:11:01
    COLMAP is SIFT, scale-invariant
  • 00:11:03
    feature transform. What it does is it
  • 00:11:05
    looks for I call it a blob style
  • 00:11:08
    detector where it's looking for a patch
  • 00:11:11
    of pixels that has high contrast to its
  • 00:11:13
    its background. So it could be something
  • 00:11:15
    that's you know light colored surrounded
  • 00:11:17
    by dark or vice versa something that's
  • 00:11:19
    dark surrounded by light. You know it's
  • 00:11:21
    going to look at look for these at
  • 00:11:23
    multiple scales. That's why it's scale
  • 00:11:25
    invariant. So multiple resolutions. So
  • 00:11:27
    this could be something that's you know
  • 00:11:28
    very small or something that's larger in
  • 00:11:31
    the image. Mhm. But once it's
  • 00:11:34
    found that sort of high contrast
  • 00:11:36
    landmark, it now will then extract some
  • 00:11:40
    representation of uh the appearance, you
  • 00:11:44
    know, of of the area around that
  • 00:11:45
    landmark. So it'll say, "Hey, I found
  • 00:11:47
    something interesting." So maybe it's
  • 00:11:48
    the um you know, a door knob on a
  • 00:11:51
    door, you know. So it'll say, "Hey, that
  • 00:11:53
    that door knob is a different color
  • 00:11:55
    than the background, the rest of the
  • 00:11:57
    door." And so now I want to describe
  • 00:11:59
    that door knob. And so I'm going to look
  • 00:12:01
    I don't want to look just at the
  • 00:12:02
    door knob itself. I'm going to look
  • 00:12:03
    around it and say here's my door knob
  • 00:12:05
    and then oh there's this wood pattern on
  • 00:12:08
    the door around it. And so it's going to
  • 00:12:09
    come up with a representation for that.
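
    A minimal sketch of that detect-and-describe step, using OpenCV's SIFT rather than COLMAP's internal implementation (the image path is hypothetical):

    import cv2

    img = cv2.imread("fountain_0001.jpg", cv2.IMREAD_GRAYSCALE)

    # Detector: blob-like, high-contrast keypoints found across multiple scales.
    sift = cv2.SIFT_create(nfeatures=10000)   # roughly the per-image cap mentioned later in the episode
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # Each keypoint is a 2D location plus a scale; each descriptor is a 128-number
    # summary of the appearance around it (the "door knob plus the wood around it").
    print(len(keypoints), descriptors.shape)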
  • 00:12:12
    And so what SIFT actually does or what
  • 00:12:14
    different feature representations are
  • 00:12:16
    that could be a whole podcast in and of
  • 00:12:18
    itself. But at a conceptual level you
  • 00:12:20
    just think about it. It sort of
  • 00:12:23
    summarizes what that looks like at at a
  • 00:12:25
    rough level. It says, "Okay, I saw
  • 00:12:27
    something dark in the middle and then
  • 00:12:28
    there was this, you know, rough pattern
  • 00:12:30
    around its vicinity." Mhm. Okay. So then
  • 00:12:34
    I'm bringing up COLMAP and this is
  • 00:12:36
    I've unfortunately had already run the
  • 00:12:38
    project because I didn't want us to have
  • 00:12:40
    to sit and watch things go and a lot of
  • 00:12:42
    these things run really fast. So SIFT is
  • 00:12:44
    fast if you can run it on GPU. I don't
  • 00:12:47
    can't necessarily show what 10 what say
  • 00:12:49
    I think it maxes at 10,000 by default
  • 00:12:52
    but if you have COLMAP and you kind of
  • 00:12:54
    want to follow along the first thing you
  • 00:12:56
    do is set up a new project and that
  • 00:12:59
    part's pretty easy but then you just go
  • 00:13:01
    to processing and hit feature
  • 00:13:03
    extraction and you get to pick a camera
  • 00:13:06
    model. Why is that important? Why is
  • 00:13:07
    picking a camera model important for
  • 00:13:09
    this? Well, this is important and and
  • 00:13:11
    this this ends up being really important
  • 00:13:13
    later on when we start thinking about
  • 00:13:14
    the geometry of these images and and
  • 00:13:17
    what kind of camera and lens was used
  • 00:13:20
    because these camera models are it is
  • 00:13:23
    defining the geometry of that camera. So
  • 00:13:27
    this right now you have a simple radial
  • 00:13:29
    camera you know selected and so
  • 00:13:30
    underneath of it sort of in grayscale
  • 00:13:34
    are some parameters listed. It says,
  • 00:13:35
    "Oh, simple radial has f, cx, c y, and
  • 00:13:39
    k." Mhm. And so you kind of have to know
  • 00:13:41
    from a computer vision literature that f
  • 00:13:44
    is your focal length. CX and CY, that's
  • 00:13:46
    the principal point. So that's defined,
  • 00:13:48
    well, where is the center uh of my image
  • 00:13:50
    or where is the optical axis of my my
  • 00:13:54
    lens and how is that aligned with the
  • 00:13:56
    image center? So a lot of times I just
  • 00:13:57
    kind of say, hey, hand wavy, it's you
  • 00:13:59
    know, what's the center of my image? And
  • 00:14:01
    then that K is a a single radial
  • 00:14:05
    distortion term. So it's assuming a lot
  • 00:14:07
    of times lenses introduce a little bit
  • 00:14:10
    of curvature effect, you know, curvature
  • 00:14:12
    distortion to them. And so we're going
  • 00:14:13
    to use a single mathematical term, a
  • 00:14:16
    single, you know, polynomial term to
  • 00:14:19
    represent the distortion in that lens.
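
    As a rough sketch of what those parameters mean, this is how a SIMPLE_RADIAL-style model (one focal length f, principal point cx, cy, and a single radial term k) typically maps a 3D point in camera coordinates to a pixel; the numbers are made up:

    import numpy as np

    def project_simple_radial(point_cam, f, cx, cy, k):
        x = point_cam[0] / point_cam[2]      # perspective division
        y = point_cam[1] / point_cam[2]
        r2 = x * x + y * y
        d = 1.0 + k * r2                     # one polynomial distortion factor
        return np.array([f * d * x + cx, f * d * y + cy])

    # e.g. a 4000x3000 image with a mild distortion term
    print(project_simple_radial(np.array([0.2, -0.1, 2.0]),
                                f=3200.0, cx=2000.0, cy=1500.0, k=0.05))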
  • 00:14:22
    This might be great. This is great for a
  • 00:14:24
    lot of just, you know, general cameras.
  • 00:14:26
    But if you know that your lens has a
  • 00:14:29
    little bit more distortion, maybe you're
  • 00:14:31
    using, you know, a wide-angle camera, a
  • 00:14:33
    GoPro or a drone that has uh a wider
  • 00:14:36
    field of view and some distortion. If
  • 00:14:38
    you have a really wide angle camera,
  • 00:14:40
    something that you can see lot of
  • 00:14:42
    distortion, then you might want one of
  • 00:14:43
    these one of these fisheye versions.
  • 00:14:45
    They have simple radial fisheye or the
  • 00:14:47
    normal fisheye. There's even I think at
  • 00:14:49
    the very bottom of the list, there's one
  • 00:14:50
    called FOV. That's one that's really
  • 00:14:53
    great for super wide angle. Mhm. You
  • 00:14:56
    know, a lot of times for a normal camera
  • 00:14:58
    like your iPhone in your pocket or your
  • 00:14:59
    DSLR or your point and shoot or whatever
  • 00:15:01
    it ends up being or your simple radial
  • 00:15:04
    or your radial models um are nice
  • 00:15:07
    because they assume that you've um
  • 00:15:12
    you've got a single focal length. You
  • 00:15:14
    know, your pixels are square. So, I
  • 00:15:16
    don't need more than one f term. You
  • 00:15:18
    want to model your principal point with
  • 00:15:20
    cx, cy. And here the radial model added an
  • 00:15:24
    extra lens distortion. So now instead of
  • 00:15:25
    just K, now we have K1 and K2. So that's
  • 00:15:28
    two radial distortion terms. So we can
  • 00:15:30
    do a little better job of estimating the
  • 00:15:33
    distortion of our lens. Okay. And so
  • 00:15:35
    COLMAP asks for this right away
  • 00:15:37
    because what it's doing is it has that,
  • 00:15:39
    you know, part of that project creation
  • 00:15:41
    process is you create a database. And so
  • 00:15:43
    that's going to be, you know, a
  • 00:15:44
    collection of data stored on disk. And
  • 00:15:46
    so this process of feature extraction uh
  • 00:15:49
    is when COLMAP goes through all of
  • 00:15:51
    your images, extracts features, but then
  • 00:15:54
    also creates those image entries in the
  • 00:15:56
    database. And so it needs to know what
  • 00:15:58
    style of camera is going to be
  • 00:16:00
    associated with that image. Mhm. And and
  • 00:16:03
    we could go and deepen a bunch of
  • 00:16:05
    buttons on here. Yeah. I don't want if
  • 00:16:07
    you just run this in default and simple
  • 00:16:09
    radial and using a smartphone or something,
  • 00:16:12
    you'll be okay. But, you know, like here
  • 00:16:14
    is thinking I have all these different
  • 00:16:15
    cameras. There's options where you can
  • 00:16:17
    say use it's always one camera. So, it
  • 00:16:20
    just assumes then everyone's the same
  • 00:16:21
    camera, which is great. Yeah, that's
  • 00:16:23
    good. There's options for masks. I just
  • 00:16:27
    bring up a mask on my screen. This is me
  • 00:16:30
    masked. This is a mask. Not necessarily
  • 00:16:33
    the mask you would um use, but basically
  • 00:16:36
    there's a picture of me. This might be
  • 00:16:38
    the wrong picture. And I have
  • 00:16:40
    the mask as a separate file. And then if
  • 00:16:42
    you kind of like combine the two, you
  • 00:16:43
    end up with me masked out. And that's
  • 00:16:46
    like a way to say you want me not to be
  • 00:16:48
    in this result. You can mask out things.
  • 00:16:51
    Specifically, if you want perhaps just
  • 00:16:52
    an object to be reconstructed, you want
  • 00:16:54
    to mask out a background, things like
  • 00:16:56
    that we could go deep into. But there's
  • 00:16:59
    all these options, right, to help get
  • 00:17:01
    the right key points. So, if I go to
  • 00:17:03
    this database, so I ran this already and
  • 00:17:05
    I have this database manager where I can
  • 00:17:07
    kind of jump into things and I pick one
  • 00:17:09
    of these and I'm just going to hit show
  • 00:17:11
    image, it's going to bring up the image
  • 00:17:13
    and I can make this nice and big on my
  • 00:17:15
    screen. What we're seeing now is an
  • 00:17:16
    image of the fountain. I'm on the back
  • 00:17:20
    side of it right now with all these red
  • 00:17:22
    circles which are key points. Not
  • 00:17:24
    necessarily all the features, right?
  • 00:17:25
    It's just some of the ones that I think
  • 00:17:27
    it matched on. Is that wrong or am I on
  • 00:17:29
    the wrong I'm not I'm not entirely sure.
  • 00:17:31
    Yeah. Yeah. In some software packages,
  • 00:17:33
    they may show you all of them or may
  • 00:17:35
    show you just just the ones that have
  • 00:17:37
    been matched. I'm not sure with this
  • 00:17:39
    spec specific viewer right now. So,
  • 00:17:41
    yeah. And I'm not 100% clear either. I
  • 00:17:44
    haven't read the documentation. All I
  • 00:17:45
    know is visualizing. So, this is an idea
  • 00:17:47
    of key points where you'll notice
  • 00:17:49
    there's no key points where you have a
  • 00:17:50
    lot of low contrast, not a lot of visual
  • 00:17:53
    variation. So, I'm on my screen. And
  • 00:17:55
    there's a part where it shows the street
  • 00:17:57
    and there's just not much going on there
  • 00:18:00
    versus there's a lot of points on the
  • 00:18:01
    fountain which has all these ornate
  • 00:18:03
    decorations on it. In the background
  • 00:18:05
    there's trees and buildings that it's
  • 00:18:07
    latching onto. So it makes sense that
  • 00:18:09
    where you have less variation you're
  • 00:18:11
    going to have less features that it's
  • 00:18:14
    it's oh the sky is also another one
  • 00:18:16
    where you
  • 00:18:18
    this nice tree behind this thing it
  • 00:18:19
    caught a lot on. So it doesn't mean it
  • 00:18:21
    matched on those because you might not
  • 00:18:23
    see those. So if I then I'm going to
  • 00:18:25
    close this and then you can look at show
  • 00:18:27
    overlapping images. So you know if I
  • 00:18:29
    click here you can look at the the
  • 00:18:30
    matches. You're going to see then this
  • 00:18:32
    kind of correspondence matches where
  • 00:18:34
    it's finding key points between two
  • 00:18:37
    images and they show these green lines
  • 00:18:39
    basically saying these two images have
  • 00:18:41
    matching features that it it believes
  • 00:18:43
    are the same points. Right. Is that what
  • 00:18:45
    we're seeing? Exactly. Exactly. So that
  • 00:18:47
    this is now sort of moved to the second
  • 00:18:50
    and third bubbles within that
  • 00:18:52
    correspondence search block. So back to
  • 00:18:54
    that correspondence search. The first
  • 00:18:55
    step was the feature extraction which
  • 00:18:57
    was just the identification of these key
  • 00:19:00
    points in each of the images. So it
  • 00:19:02
    wasn't even trying to compare images
  • 00:19:03
    yet. We're just saying for each image
  • 00:19:05
    let me find those key points. And as as
  • 00:19:07
    Jonathan said, by default, if you've got
  • 00:19:09
    a GPU enabled version of COLMAP and
  • 00:19:12
    you've got a nice GPU in your computer,
  • 00:19:14
    uh it will use the GPU implementation,
  • 00:19:15
    that graphics processor, which makes it
  • 00:19:17
    go a lot faster. So once we've extracted
  • 00:19:20
    those key points and those or or
  • 00:19:22
    features, again, I use those terms
  • 00:19:24
    interchangeably a lot, the key point and
  • 00:19:25
    the feature. Now, we want to match
  • 00:19:28
    images together, and that's to discover
  • 00:19:31
    which images show similar content. And
  • 00:19:34
    so the result of that is going to be the
  • 00:19:37
    set of correspondences, the set of uh
  • 00:19:39
    features saying the features in this
  • 00:19:40
    image matched to the features in this
  • 00:19:42
    image. And those were those green lines
  • 00:19:43
    that Jonathan had shown up uh just prior
  • 00:19:46
    saying that you know not all of the key
  • 00:19:48
    points from one image matched to the
  • 00:19:49
    other. There was some subset but um
  • 00:19:52
    we're trying to discover what those
  • 00:19:54
    matches are. In this diagram we said
  • 00:19:57
    that you know we had feature extraction,
  • 00:19:58
    matching and then geometric
  • 00:20:00
    verification. Matching and geometric
  • 00:20:02
    verification uh a lot of times will go
  • 00:20:04
    hand in hand you know so you run matching
  • 00:20:06
    and then you immediately run geometric
  • 00:20:08
    verification after that. So the
  • 00:20:10
    intention there is your matching is just
  • 00:20:14
    trying to figure out which features look
  • 00:20:18
    similar between two images but it's not
  • 00:20:20
    trying to do any sort of 2D or 3D
  • 00:20:23
    reasoning. So, it may think that, oh,
  • 00:20:26
    the the top of the tree in one image
  • 00:20:29
    looks like the top of another tree in
  • 00:20:31
    another image, but they're in completely
  • 00:20:32
    different parts of the image, and it
  • 00:20:33
    doesn't even make sense. Like, it it may
  • 00:20:35
    confuse things or especially if you
  • 00:20:37
    have, you know, a building with some
  • 00:20:38
    sort of repetitive pattern on it. You
  • 00:20:40
    know, the same brick repeated over and
  • 00:20:42
    over again, but you have some sort of
  • 00:20:43
    unique windows or unique artwork, you
  • 00:20:46
    know, that appears, you know, on that
  • 00:20:47
    wall. For feature matching, it may end
  • 00:20:49
    up matching incorrect parts of the image
  • 00:20:51
    to each other. So matching does its best
  • 00:20:55
    to try to figure out what matches, but
  • 00:20:56
    it might be wrong. It's geometric
  • 00:20:58
    verification's job to come in and clean
  • 00:21:01
    those up to figure out, well, now that I
  • 00:21:03
    have these initial set of matching key
  • 00:21:05
    points between my two images, which ones
  • 00:21:08
    actually make sense based on our
  • 00:21:10
    knowledge of geometry and how cameras
  • 00:21:12
    move. And so that's where sometimes you
  • 00:21:15
    can leverage, you know, knowing what
  • 00:21:16
    kind of camera model you have can be
  • 00:21:18
    helpful. knowing if if you expect a lot
  • 00:21:20
    of distortion or if it's a fisheye lens,
  • 00:21:22
    that can help. But sometimes um some
  • 00:21:25
    methods don't even try to use that
  • 00:21:27
    information. We'll just look at the 2D-to-2D
  • 00:21:29
    relationships. Mhm. And so there are
  • 00:21:31
    some key words that you might see would
  • 00:21:33
    be it's you know estimating a homography
  • 00:21:36
    homography like a perspective transform
  • 00:21:38
    or an essential matrix or a fundamental
  • 00:21:40
    matrix. So each of these sort of
  • 00:21:42
    relationships, each of these matrices is
  • 00:21:44
    a way to describe how a point in one
  • 00:21:47
    image matches to a location in another
  • 00:21:50
    image or a set of locations in another
  • 00:21:52
    image. And and so we're trying to
  • 00:21:54
    estimate, you know, is there a valid
  • 00:21:57
    camera motion that we can imagine to get
  • 00:22:01
    a set of points in one image to move to
  • 00:22:03
    the set of points in the other image.
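
    A compact sketch of those two stages together: match SIFT descriptors between two images, then keep only the matches consistent with a single camera motion by robustly fitting a fundamental matrix. kp1/desc1 and kp2/desc2 are assumed to come from the extraction sketch above, and the thresholds are illustrative, not COLMAP's exact defaults:

    import cv2
    import numpy as np

    # 1) Appearance-only matching with Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(desc1, desc2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
            good.append(pair[0])

    # 2) Geometric verification: RANSAC fit of a fundamental matrix, keep the inliers.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 4.0, 0.999)
    print(len(good), "tentative matches,", int(mask.sum()), "geometrically verified inliers")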
  • 00:22:04
    That's what geometric verification is
  • 00:22:06
    doing. Just figuring out those those 2D
  • 00:22:09
    relationships uh between images. And
  • 00:22:12
    somewhere in my logs, you can see some
  • 00:22:14
    hints of that. So, as this running, it's
  • 00:22:16
    showing all kinds of text on your screen
  • 00:22:18
    and it's I'm sure some of that when it's
  • 00:22:21
    well, it's showing bundle adjustment on
  • 00:22:23
    my screen right now, but at one point
  • 00:22:25
    it's it's talking about some of that the
  • 00:22:27
    matches and running different algorithms
  • 00:22:30
    in the background to get that. Um, so
  • 00:22:32
    and then if I if I click on like one of
  • 00:22:34
    these points that it created, it almost
  • 00:22:36
    it shows you where you have multiple
  • 00:22:37
    matches on a specific point and things
  • 00:22:40
    you can do to kind of get different
  • 00:22:42
    views and get hints of what we're
  • 00:22:44
    talking about here. But so one thing we
  • 00:22:46
    we didn't really talk about when you're
  • 00:22:47
    matching these images too that there's
  • 00:22:50
    there's different options as well. So
  • 00:22:52
    when I go through here, I'm processing,
  • 00:22:53
    I've got my key points. It goes fast on
  • 00:22:55
    a GPU because it's able to like look at
  • 00:22:57
    all the different images all at once,
  • 00:22:59
    right? They don't care about respect to
  • 00:23:00
    each other when you're extracting
  • 00:23:02
    features. But then you get to the point
  • 00:23:03
    where you need to do your matching. This
  • 00:23:06
    is where it's all CPU driven because
  • 00:23:08
    it's kind of either a sequential or
  • 00:23:10
    exhaustive, but it's not able to look at
  • 00:23:11
    every image all at once. But there's
  • 00:23:14
    options here where if I go to this
  • 00:23:17
    button here, it's not displaying on my
  • 00:23:19
    screen correctly for some reason. Oh,
  • 00:23:20
    there we go. You can you can do
  • 00:23:22
    exhaustive, sequential, vocab tree,
  • 00:23:25
    spatial. There's these different styles
  • 00:23:27
    you can pick or I want to say styles,
  • 00:23:29
    different algorithms you can pick to
  • 00:23:31
    match these. Yep. My understanding
  • 00:23:33
    always is if you have a random
  • 00:23:35
    collection of images like someone walked
  • 00:23:37
    around and they're not necessarily one
  • 00:23:39
    image is taken and then your next image
  • 00:23:42
    you moved over and took just of the same
  • 00:23:44
    part of the scene. But I don't know,
  • 00:23:46
    maybe you're just walking around taking
  • 00:23:47
    pictures in all which directions.
  • 00:23:49
    Exhaustive is what you want to use
  • 00:23:50
    because it's going to you can explain
  • 00:23:52
    this but it's going to like kind of try
  • 00:23:54
    to get every image to match to every
  • 00:23:56
    image versus sequential where you're
  • 00:23:58
    saying no no no each image was taken in
  • 00:24:01
    sequence. So I see the fountain from one
  • 00:24:03
    spot I moved a few feet took another
  • 00:24:05
    photo of it. They should be sequentially
  • 00:24:06
    somewhat matching to each other. Does
  • 00:24:09
    that sound correct? Is am that the right
  • 00:24:11
    assumption? You you're exactly right.
  • 00:24:13
    You're exactly right. So yeah, once you
  • 00:24:15
    once you've extracted the key points
  • 00:24:16
    from a single image, now you want to
  • 00:24:18
    figure out well which pairs of images
  • 00:24:20
    you know are related to each other. So
  • 00:24:22
    the the simplest most naive way is to
  • 00:24:24
    say well let me match every single image
  • 00:24:25
    to every single other one. Let me look
  • 00:24:27
    at all order n squared every single
  • 00:24:30
    combination of pairs of images that I
  • 00:24:32
    can imagine. And so that's what
  • 00:24:34
    exhaustive matching is doing. So
  • 00:24:35
    exhaustive matching like you said it's
  • 00:24:37
    great when you have sort of an unsorted
  • 00:24:40
    random collection of images and
  • 00:24:41
    especially it works well if you have you
  • 00:24:43
    know the order of a few hundred images
  • 00:24:45
    um you know because because it is doing
  • 00:24:48
    this you know every image to every other
  • 00:24:50
    image that quickly gets expensive in
  • 00:24:52
    terms of time like that's going to take
  • 00:24:53
    a lot of time to compute if you try to
  • 00:24:55
    do this on thousands of images you can
  • 00:24:57
    still do it you just have to wait a long
  • 00:24:59
    time but yeah it's it's great because
  • 00:25:00
    it's going to try to discover every
  • 00:25:02
    single pair of matching images that it
  • 00:25:05
    Mhm. And so that's where then the
  • 00:25:06
    sequential is nice if you have something
  • 00:25:08
    like you said there in the fountain
  • 00:25:10
    sequence where you know hey these are
  • 00:25:12
    you know frames from a video or my
  • 00:25:14
    images you maybe I was taking photos but
  • 00:25:16
    I I'm taking them in order like oh I
  • 00:25:18
    started here took a photo took a few
  • 00:25:20
    steps took another photo took a few more
  • 00:25:22
    steps took another photo and so there is
  • 00:25:23
    some sort of sequential information to
  • 00:25:26
    those photos you know that images taken
  • 00:25:29
    near each other in that list show
  • 00:25:31
    similar content and that's what
  • 00:25:33
    sequential it'll leverage leverage that
  • 00:25:35
    information to help the the matching be
  • 00:25:36
    more efficient. And then I don't really
  • 00:25:39
    understand vocab tree. I do know that if
  • 00:25:41
    you want to do an exhaustive style
  • 00:25:43
    match, not sequential, but you have
  • 00:25:45
    let's say 800 images, I've always heard
  • 00:25:47
    use a vocab tree. Yeah. Yeah, that
  • 00:25:50
    that's exactly right. So the vocab tree,
  • 00:25:53
    you might heard like it's a vocabulary
  • 00:25:55
    tree or image retrieval style matching.
  • 00:25:57
    Yeah. What it's doing behind the scenes
  • 00:25:59
    is is it uses a image lookup data
  • 00:26:04
    structure. So it takes all the images,
  • 00:26:06
    comes up with a really compact
  • 00:26:09
    summarization of the kinds of things
  • 00:26:11
    that are in each image and then provides
  • 00:26:13
    a way that I can say, hey, for this
  • 00:26:15
    given image, what other images in my
  • 00:26:18
    data set are likely to have the same
  • 00:26:21
    kinds of things in them. you know, it's
  • 00:26:23
    not a guarantee, but it just says, you
  • 00:26:25
    know, if I'm I have one image and I've
  • 00:26:27
    got 10,000 other images I can match to,
  • 00:26:29
    I can ask it, well, hey, I don't want to
  • 00:26:31
    look at all 10,000. Can you at least
  • 00:26:33
    give me a sorted list of the ones that
  • 00:26:35
    are most likely to match? And so that's
  • 00:26:37
    what the vocab tree option does for you
  • 00:26:38
    is it returns that ranked list and then
  • 00:26:41
    so instead of matching all 10,000, I can
  • 00:26:43
    choose to match the best 50 or the best
  • 00:26:45
    100 or whatever my threshold. Speed up.
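
    The idea behind that retrieval step can be sketched as a tiny bag-of-visual-words index; the real vocabulary tree is hierarchical, much larger, and uses an inverted file, and descriptors_per_image here is an assumed list of per-image SIFT descriptor arrays:

    import numpy as np
    from sklearn.cluster import KMeans

    # Learn a small vocabulary of "visual words" from all descriptors.
    vocab = KMeans(n_clusters=256, n_init=4, random_state=0).fit(np.vstack(descriptors_per_image))

    def bow(desc):
        # Summarize one image as a normalized histogram of word counts.
        hist = np.bincount(vocab.predict(desc), minlength=256).astype(float)
        return hist / (np.linalg.norm(hist) + 1e-9)

    hists = np.array([bow(d) for d in descriptors_per_image])

    def top_candidates(query_idx, k=50):
        # Rank the other images by similarity and only run matching on the best k.
        sims = hists @ hists[query_idx]
        sims[query_idx] = -1.0
        return np.argsort(-sims)[:k]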
  • 00:26:48
    Yep. It's more efficient. Yeah. Um, once
  • 00:26:51
    you get beyond three to 400 images,
  • 00:26:55
    exhaustive should not be your option.
  • 00:26:57
    You should go to vocab tree unless
  • 00:26:59
    they're all sequentially taken. And then
  • 00:27:01
    always use sequential. Well, not always,
  • 00:27:02
    but that's that's probably your default.
  • 00:27:04
    So, if I'm taking a video and then
  • 00:27:06
    extracting images, sequential is always
  • 00:27:09
    going to work. Well, always going to be
  • 00:27:10
    your first option if you want to be as
  • 00:27:12
    fast as possible. And so, and then and
  • 00:27:14
    then in here, you can I know you can you
  • 00:27:16
    can uh pick loop detection. So, it's
  • 00:27:18
    trying to we've talked about that
  • 00:27:19
    before, right? is it's trying to detect
  • 00:27:21
    have you come back to an area correct
  • 00:27:23
    and and and that will do it using the
  • 00:27:26
    vocab tree option like so if I do loop
  • 00:27:28
    detection so under the sequential tab if
  • 00:27:31
    I do loop detection and then specify a
  • 00:27:33
    vocab tree path there at the bottom that
  • 00:27:37
    will enable it to say oh as I'm
  • 00:27:39
    processing through all those video
  • 00:27:40
    frames you know every 10th frame or
  • 00:27:42
    every 50th frame or every 100th frame
  • 00:27:43
    whatever you set it to you can have it
  • 00:27:46
    go and then do a vocabulary tree
  • 00:27:48
    retrieval do that image retrieval step
  • 00:27:50
    to try to discover loop closures within
  • 00:27:53
    within some of that uh that okay so we
  • 00:27:56
    have these options I always just say and
  • 00:27:58
    then there's spatial and transitive we
  • 00:28:00
    haven't talked about that does spatial
  • 00:28:02
    have to do with GPS exactly right so it
  • 00:28:04
    just says you know for each image
  • 00:28:05
    assuming if the images have embedded uh
  • 00:28:08
    geo tags so GPS data embedded in the
  • 00:28:10
    EXIF it will say for each image just
  • 00:28:13
    find other images with similar GPS and
  • 00:28:15
    match to those yes I love that a lot of
  • 00:28:18
    people here listening probably are
  • 00:28:20
    taking drone images and spatial is the
  • 00:28:24
    one I always use. That's a great option
  • 00:28:26
    because a lot of times that drone is
  • 00:28:27
    looking straight down or you know it's
  • 00:28:29
    not looking at completely random
  • 00:28:31
    directions but there is some order and
  • 00:28:33
    structure to that drone data and so that
  • 00:28:36
    and in fact a lot of the drones that
  • 00:28:37
    people are using nowadays have a really
  • 00:28:39
    good GPS on it. thinking of the
  • 00:28:42
    enterprise versions of like a DJI drone
  • 00:28:45
    are getting really good GPS. Even even
  • 00:28:47
    without an RTK attachment, it's not
  • 00:28:50
    going to it's not going to throw a bunch
  • 00:28:52
    of error into there. And then what's
  • 00:28:53
    transitive? That's the one I don't think
  • 00:28:55
    I've ever touched. I don't even know
  • 00:28:56
    what that means. Yeah, that just that's
  • 00:28:58
    a way to densify a set of existing
  • 00:29:01
    matches. So suppose you had gone and run
  • 00:29:04
    one of the existing modes. Say you ran,
  • 00:29:06
    Okay, maybe not exhaustive, but like if
  • 00:29:08
    you had ran sequential or ran your
  • 00:29:11
    spatial or ran your vocab tree, but then
  • 00:29:14
    you wanted to go back and create a a
  • 00:29:16
    more complete set of connections between
  • 00:29:18
    images. What transitive will do is it'll
  • 00:29:20
    look at your database and it'll say,
  • 00:29:22
    "Hey, if image A matched to B and image
  • 00:29:26
    B matched to image C, but I didn't try
  • 00:29:29
    to match image A directly to C, let me
  • 00:29:31
    go ahead and do that now." And so it
  • 00:29:33
    goes back and finds these transitive
  • 00:29:35
    links between images and attempts to do
  • 00:29:38
    that matching. And so what that does
  • 00:29:39
    that just creates a stronger set of
  • 00:29:40
    connections between images which will
  • 00:29:43
    help COLMAP out during the
  • 00:29:44
    reconstruction phase. Okay. So that I
  • 00:29:46
    feel like this gives me a good idea then
  • 00:29:48
    of or the the
  • 00:29:49
    listener/viewer an idea. There's
  • 00:29:52
    different options. Pick the one that
  • 00:29:54
    makes sense for the data set you have.
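
    For intuition, the order-based and location-based modes mostly differ in which image pairs they even attempt to match (the vocab tree variant instead ranks candidates by appearance, as sketched earlier). A toy enumeration with eight images and fake GPS:

    import numpy as np
    from itertools import combinations

    n = 8
    gps = np.random.rand(n, 2)                     # stand-in for per-image EXIF GPS

    # Exhaustive: every image against every other one, O(n^2) pairs.
    exhaustive = list(combinations(range(n), 2))

    # Sequential: only images close to each other in capture order.
    sequential = [(i, j) for i, j in exhaustive if j - i <= 3]

    # Spatial: each image against its nearest neighbours by GPS distance.
    spatial = set()
    for i in range(n):
        dist = np.linalg.norm(gps - gps[i], axis=1)
        for j in np.argsort(dist)[1:3]:            # two nearest, skipping the image itself
            spatial.add(tuple(sorted((i, int(j)))))

    # Transitive: if (a, b) and (b, c) were matched, also try (a, c).
    transitive = set(sequential)
    nbrs = {i: {a for p in sequential for a in p if i in p and a != i} for i in range(n)}
    for b in range(n):
        for a in nbrs[b]:
            for c in nbrs[b]:
                if a < c:
                    transitive.add((a, c))

    print(len(exhaustive), len(sequential), len(spatial), len(transitive))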
  • 00:29:56
    You might get the best results out of
  • 00:29:59
    exhaustive as far as error, but you might
  • 00:30:01
    be waiting a day. Heard people say, "I
  • 00:30:04
    set this and now it's telling me it'll
  • 00:30:06
    be ready in 28 hours." Well, probably
  • 00:30:08
    not the right mode. You should probably use a
  • 00:30:09
    vocab tree, but you know, I always say
  • 00:30:12
    find the right one. Start with
  • 00:30:14
    sequential. If you have sequential
  • 00:30:15
    images, at least you probably get good a
  • 00:30:17
    good result there. I also want and just
  • 00:30:20
    to mention it back you know in the
  • 00:30:22
    diagram under the correspondence search
  • 00:30:24
    you know they do break it down versus
  • 00:30:26
    the feature extraction feature matching
  • 00:30:28
    and then geometric
  • 00:30:29
    verification that geometric verification
  • 00:30:32
    those options show up on that matching
  • 00:30:35
    those matching settings screens that we
  • 00:30:36
    just saw for each of those tabs at the
  • 00:30:39
    bottom there was the general settings or
  • 00:30:41
    general options and a lot of those
  • 00:30:43
    general options are related to geometric
  • 00:30:46
    verification saying when I'm matching
  • 00:30:49
    these points and I want to then verify,
  • 00:30:51
    you know, what sort of pixel error do I
  • 00:30:53
    expect or what is the minimum number of
  • 00:30:55
    inliers or an inlier ratio and so that
  • 00:30:58
    those inliers are the number of
  • 00:31:00
    geometrically verified matches between a
  • 00:31:02
    pair of images. And so that's that's
  • 00:31:04
    where geometric verification kind of
  • 00:31:06
    comes into play within this COLMAP
  • 00:31:08
    workflow. Okay. So just move this along.
  • 00:31:11
    Then I do want to point out I'm going to
  • 00:31:13
    show COLMAP one more time. At this
  • 00:31:15
    point, you've ran both your feature
  • 00:31:17
    extraction and feature matching. You
  • 00:31:19
    will still see nothing on your screen.
  • 00:31:21
    Well, you will see logs, but you will
  • 00:31:22
    not see these camera poses, which I
  • 00:31:24
    have. So, I have a point I have this
  • 00:31:26
    sparse point cloud. I have these red
  • 00:31:28
    camera positions around it, and none of
  • 00:31:31
    this shows up because at this point, we
  • 00:31:33
    haven't we haven't created a point
  • 00:31:35
    cloud. We haven't projected anything
  • 00:31:37
    yet. So, we're moving from
  • 00:31:40
    correspondence search to, if I bring up
  • 00:31:42
    that diagram one more time, we're moving
  • 00:31:43
    on to incremental reconstruction, and
  • 00:31:45
    that's where we start to see fun things
  • 00:31:47
    happening on the COLMAP uh GUI screen.
  • 00:31:50
    If you're running on the GUI, you'll
  • 00:31:51
    start to see camera poses show up. So,
  • 00:31:54
    the first step is initialization. What
  • 00:31:56
    is that? So, is that just just starting?
  • 00:31:59
    Yeah, that's what it is. I mean, it's
  • 00:32:00
    it's the starting process for this
  • 00:32:03
    incremental reconstruction. So
  • 00:32:06
    incremental reconstruction is just one
  • 00:32:08
    style to attempt to do 3D
  • 00:32:10
    reconstruction. And so the the core idea
  • 00:32:13
    here is that you know like you said we
  • 00:32:16
    don't have any in 3D information yet. So
  • 00:32:18
    we're going to start with the minimum
  • 00:32:19
    amount that we need which is a pair of
  • 00:32:21
    images. So let's start with a pair of
  • 00:32:22
    images and then figure out what is their
  • 00:32:25
    3D relationship you know between those
  • 00:32:27
    images as well as what 3D points did
  • 00:32:30
    they see in the scene. And so we're
  • 00:32:32
    going to create this two view
  • 00:32:33
    reconstruction. take that pair of
  • 00:32:34
    images, triangulate an initial set of 3D
  • 00:32:36
    points, and then we use that as the
  • 00:32:39
    initialization for the rest of the
  • 00:32:41
    reconstruction. And so everything after
  • 00:32:43
    that is going to figure out, well, based
  • 00:32:44
    on these initial two images and some
  • 00:32:46
    points, how can I add a third image to
  • 00:32:49
    that? And how does it relate? And now
  • 00:32:50
    that I have these three, how can I add a
  • 00:32:52
    fourth and then a fifth and a sixth? And
  • 00:32:54
    so you just keep adding images one at a
  • 00:32:56
    time to grow a larger and larger
  • 00:32:59
    reconstruction. But initialization is
  • 00:33:01
    just what is that initial pair? Which
  • 00:33:04
    two images am I going to start with to
  • 00:33:07
    build this entire reconstruction?
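
    A minimal sketch of that two-view initialization, assuming pts1 and pts2 are the verified matching pixels of the chosen starting pair and K is the 3x3 intrinsics matrix implied by the camera model (this uses OpenCV's essential-matrix and triangulation routines, not COLMAP's own code):

    import cv2
    import numpy as np

    # Relative pose of the second image with respect to the first.
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Triangulate the initial sparse points that seed the whole reconstruction.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera fixed at the origin
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)    # homogeneous, 4 x N
    points3d = (X[:3] / X[3]).T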
  • 00:33:09
    Okay. And then and then it kind of goes
  • 00:33:11
    into a circle. So if you look at this, I
  • 00:33:13
    say circle the the diagram on the screen
  • 00:33:15
    shows image registration, triangulation,
  • 00:33:18
    bundle adjustment, outlier filtering,
  • 00:33:20
    and then if you follow the lines, you
  • 00:33:22
    notice you're really doing a loop. Yep.
  • 00:33:24
    So it's looping through that process.
  • 00:33:26
    And then also this dashed line showing
  • 00:33:29
    reconstruction. So it's kind of probably
  • 00:33:30
    looping through that and adding to the
  • 00:33:33
    reconstruction while it's going or Yep.
  • 00:33:36
    Okay. Exactly right. Exactly right. So
  • 00:33:38
    it's it's that initialization that picks
  • 00:33:41
    the first pair of images. But
  • 00:33:43
    once I have my pair of images, now I'm
  • 00:33:46
    going to enter in this loop that starts
  • 00:33:49
    with image registration. So image
  • 00:33:50
    registration is a fancy name for: how
  • 00:33:53
    can I add a new image to my existing
  • 00:33:55
    reconstruction? And
  • 00:33:58
    so what it's going to look at is based
  • 00:34:01
    on the 3D points that have already been
  • 00:34:03
    triangulated. It's going to ask what's
  • 00:34:06
    the best next image in my data set that
  • 00:34:10
    also saw those points. And then if um
  • 00:34:13
    and once I find that image you know via
  • 00:34:15
    via the set of feature matches. So we
  • 00:34:17
    say, you know, if I've matched image
  • 00:34:17
    one and two and triangulated that, well,
  • 00:34:19
    if image two matched to image three,
  • 00:34:21
    then image three is seeing the same
  • 00:34:25
    points in the scene. So let me add image
  • 00:34:27
    three. And so there, it's a 2D-to-3D
  • 00:34:30
    registration process, a 2D-3D pose
  • 00:34:32
    estimation process where I take the 2D
  • 00:34:34
    points in that third image and I want to
  • 00:34:37
    align those 2D points with the 3D points
  • 00:34:40
    that have been triangulated. So you
  • 00:34:41
    might hear that as image registration,
  • 00:34:44
    the perspective-n-point (PnP) problem, or pose
  • 00:34:47
    estimation. There's a few different
  • 00:34:48
    words for what this process is, but
  • 00:34:50
    you're adding a new image to the
  • 00:34:52
    reconstruction. And so that's the image
  • 00:34:54
    registration step.
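(One way to picture that 2D-to-3D registration step is the perspective-n-point solve. The sketch below uses OpenCV's PnP solver on made-up data rather than Colmap's own implementation; the arrays stand in for points already triangulated in the reconstruction and for matches into the new image.)

```python
import cv2
import numpy as np

# Stand-ins: pts3d are scene points that have already been triangulated, pts2d is
# where the matched features land in the new (third) image, K is its intrinsics.
rng = np.random.default_rng(1)
pts3d = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
rvec_true = np.array([0.0, 0.2, 0.0])
tvec_true = np.array([0.3, 0.0, 0.1])
pts2d, _ = cv2.projectPoints(pts3d, rvec_true, tvec_true, K, None)
pts2d = pts2d.reshape(-1, 2)

# 2D-3D pose estimation (PnP) with RANSAC to reject bad 2D-3D matches:
# it recovers where the new camera must have been to see those points there.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
R_new, _ = cv2.Rodrigues(rvec)   # orientation of the newly registered image
```

    I do know when I ran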
  • 00:34:56
    this um I can always take a a video and
  • 00:34:59
    kind of project onto this in post. But
  • 00:35:02
    when it's creating this reconstruction,
  • 00:35:04
    instead of taking image one and then
  • 00:35:08
    image two and then image three and kind
  • 00:35:11
    of building off that, I'll notice it'll
  • 00:35:13
    pick, if you look at my if if you're
  • 00:35:15
    watching this on video, you'll notice I
  • 00:35:17
    took two loops and some of the images
  • 00:35:19
    are like right above each other almost
  • 00:35:20
    where I held the phone at like above my
  • 00:35:22
    head and then I held it down at chest
  • 00:35:23
    level. So I have two loops and there's a
  • 00:35:25
    lot of common key points, common
  • 00:35:27
    features. So, as it's building this up,
  • 00:35:30
    it started at this kind of where I
  • 00:35:31
    started walking around this this
  • 00:35:33
    fountain, but it's using images from
  • 00:35:35
    further along in the video extraction or
  • 00:35:38
    sorry, the images I had. So, it used like
  • 00:35:40
    image one and image 180 because those
  • 00:35:45
    are next to each other and had a lot of
  • 00:35:47
    strong feature matches. So, they're not
  • 00:35:48
    necessarily using images in sequence of
  • 00:35:50
    how you took them. It's ones that had
  • 00:35:52
    strong correlation.
  • 00:35:54
    That's a great point. That's a great
  • 00:35:56
    point. Yeah, it it isn't just going to
  • 00:35:57
    go, you know, 1 2 3 4 5 6, you know,
  • 00:36:00
    it's not going to do them in order, you
  • 00:36:01
    know, it's going to start that pair of
  • 00:36:03
    images. It's going to look through all
  • 00:36:04
    of the images in your collection and
  • 00:36:07
    find the pair. And it might not be the
  • 00:36:08
    consecutive pair, but find the pair of
  • 00:36:10
    images, you know, that maximizes some
  • 00:36:12
    criteria. You know, it's a pair of
  • 00:36:14
    images that has strong connectivity. So,
  • 00:36:16
    there were a lot of feature matches, but
  • 00:36:18
    I also want to make sure that that pair
  • 00:36:19
    of images has, you know, differences in
  • 00:36:22
    viewpoint. I don't want two images that
  • 00:36:24
    were taken at the exact same position in
  • 00:36:26
    space because that gives me no 3D
  • 00:36:29
    information. I need, you know, we talked
  • 00:36:31
    about this in the last episode, this
  • 00:36:32
    concept of a baseline. I need some sort
  • 00:36:34
    of translation. I need some motion
  • 00:36:36
    between two images or maybe it was in
  • 00:36:38
    our depth map episode, you
  • 00:36:41
    know, we talked about this, you know, in
  • 00:36:43
    that we need motion between images in
  • 00:36:45
    order to estimate depth. So the
  • 00:36:47
    initialization is looking for the same
  • 00:36:48
    thing. It wants lots of matches
  • 00:36:50
    between the images, but it also wants a
  • 00:36:52
    strong amount of motion between them.
  • 00:36:54
    So, it's going to pick whichever pair of
  • 00:36:56
    images maximizes those that criteria and
  • 00:36:59
    once it has that, then it'll start
  • 00:37:01
    adding other images that are strongly
  • 00:37:04
    connected to those initial ones. And
  • 00:37:05
    yeah, it won't necessarily do it in
  • 00:37:07
    order that you capture those images. It
  • 00:37:09
    can be in the order in which those
  • 00:37:10
    connections are strongest. And I was
  • 00:37:13
    mostly seeing, like,
  • 00:37:16
    the first photo and then somewhere
  • 00:37:17
    further along where I came and did a
  • 00:37:20
    loop. I saw those two photos start
  • 00:37:22
    together because I think, as we were
  • 00:37:24
    talking about, the baseline was
  • 00:37:26
    better. There was more parallax
  • 00:37:28
    because these are pretty closely
  • 00:37:30
    spaced images I took from picture to
  • 00:37:32
    picture. So not a lot has changed versus
  • 00:37:34
    the next loop, I'm looking at the
  • 00:37:36
    exact same part of the fountain, but I
  • 00:37:38
    have a different elevation and angle. So
  • 00:37:40
    there's a lot of parallax movement
  • 00:37:42
    between those images. So it was
  • 00:37:45
    matching those better, as opposed
  • 00:37:48
    to image one to image two. It's more of
  • 00:37:50
    image one to image 180 because of that
  • 00:37:52
    baseline was probably better. So you get
  • 00:37:54
    to the fun thing: when you run this in
  • 00:37:56
    the Colmap GUI, you get to
  • 00:37:57
    watch those build and you get to see the
  • 00:37:59
    point cloud just start to generate in
  • 00:38:03
    front of you and you get an
  • 00:38:04
    understanding then of what it's doing in
  • 00:38:06
    these logs that are looping through this
  • 00:38:08
    process over and over. And you can kind
  • 00:38:10
    of see it just iteratively add to the
  • 00:38:12
    scene and build and refine. When it's
  • 00:38:14
    doing this incremental reconstruction,
  • 00:38:17
    is it refining the camera poses as it
  • 00:38:19
    goes or is it just saying, "Here's the
  • 00:38:21
    camera poses. There's where they are."
  • 00:38:24
    No, there's there's refinement. There's
  • 00:38:26
    refinement. And a lot of times that
  • 00:38:27
    refinement is called bundle
  • 00:38:29
    adjustment. That's a key word
  • 00:38:31
    that's used commonly in the literature.
  • 00:38:32
    I remember the first time I heard the
  • 00:38:33
    word bundle adjustment. I was a first
  • 00:38:35
    year grad student and I had no idea what
  • 00:38:38
    the person was talking about. I was
  • 00:38:39
    like, "What? A bundle of sticks? A
  • 00:38:41
    bundle of what? A straw? What is going
  • 00:38:43
    on?" Um, but no, b a bundle adjustment.
  • 00:38:45
    So, it's the idea of refining the 3D
  • 00:38:49
    points as well as the camera positions.
  • 00:38:51
    And so you end up with just a bundle of
  • 00:38:53
    constraints, you know, a bunch of
  • 00:38:54
    constraints saying, you know, these 2D
  • 00:38:56
    points in these images all triangulate
  • 00:38:58
    and all saw the same 3D point in the
  • 00:39:00
    scene, but I've got a bunch of images
  • 00:39:01
    and I've got a bunch of points. How can
  • 00:39:04
    I optimize the alignment of all of this
  • 00:39:07
    data? And that's what bundle adjustment
  • 00:39:09
    is.
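(For a sense of what that "bundle of constraints" looks like in code, here is a heavily simplified sketch of a bundle adjustment residual minimized with SciPy. It is a toy stand-in, not Colmap's actual solver: one shared intrinsic matrix, no distortion, and the observation list and starting values are assumed to come from the reconstruction built so far.)

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # assumed shared intrinsics

def reprojection_residuals(params, observations, n_cams, n_pts):
    # params packs every camera pose (3 rotation + 3 translation numbers)
    # followed by every 3D point; the residual is "projected minus observed"
    # for each 2D observation, which is what bundle adjustment minimizes.
    poses = params[:n_cams * 6].reshape(n_cams, 6)
    points = params[n_cams * 6:].reshape(n_pts, 3)
    residuals = []
    for cam_idx, pt_idx, observed_xy in observations:
        R = Rotation.from_rotvec(poses[cam_idx, :3]).as_matrix()
        t = poses[cam_idx, 3:]
        proj = K @ (R @ points[pt_idx] + t)
        residuals.append(proj[:2] / proj[2] - observed_xy)
    return np.concatenate(residuals)

# observations would be (camera index, point index, observed 2D location) triples,
# and x0 the current poses and points stacked into one vector; least_squares then
# nudges all of them at once so that every reprojection error shrinks together.
# refined = least_squares(reprojection_residuals, x0,
#                         args=(observations, n_cams, n_pts))
```

    So yeah, so as Colmap is running,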
  • 00:39:13
    it's it's doing that image registration
  • 00:39:15
    process. It'll add a new image. It then
  • 00:39:17
    runs triangulation which creates new 3D
  • 00:39:20
    points based on that new image and other
  • 00:39:22
    images that are already there but then
  • 00:39:24
    it'll do bundle adjustment which will
  • 00:39:25
    say how can I refine that and there's
  • 00:39:28
    two styles of bundle adjustment that I
  • 00:39:30
    believe Colmap uses: one of them is
  • 00:39:32
    local bundle adjustment, the other is
  • 00:39:33
    global. So a lot of times what you will
  • 00:39:35
    see is, you know, suppose we had
  • 00:39:38
    already reconstructed a thousand images
  • 00:39:40
    and we're adding the thousand-and-first.
  • 00:39:42
    Um, when I add that thousand-and-first image,
  • 00:39:45
    you know, trying to do bundle adjustment
  • 00:39:46
    using all thousand images, that takes a
  • 00:39:48
    long time. Um, and so we
  • 00:39:52
    recognize that, well, that
  • 00:39:54
    thousand-and-first image, that next
  • 00:39:56
    image that I'm adding, you know, well,
  • 00:39:57
    it's off in the corner of the
  • 00:39:58
    reconstruction, you know, it's far away
  • 00:40:00
    from the other side of the
  • 00:40:01
    reconstruction. You know, these these
  • 00:40:02
    things aren't really related to each
  • 00:40:04
    other. So, I can run a local bundle
  • 00:40:06
    adjustment. Let me just optimize only
  • 00:40:08
    those cameras and points that are near
  • 00:40:11
    that new image that I just added or
  • 00:40:13
    those new points that I've triangulated.
  • 00:40:15
    And so, that's a way to sort of do this
  • 00:40:16
    local refinement. And I can do that
  • 00:40:18
    every single time I add a new image. And
  • 00:40:21
    then periodically, Colmap will run a global
  • 00:40:24
    bundle adjustment. So there's some
  • 00:40:25
    settings there. I think every, you know,
  • 00:40:27
    once the reconstruction is increased in
  • 00:40:29
    size by 10% or you've added every, you
  • 00:40:31
    know, 500 images or something, there's
  • 00:40:33
    certain criteria, especially at the end
  • 00:40:35
    of the reconstruction, Colmap will run a
  • 00:40:37
    global bundle adjustment which says,
  • 00:40:39
    let's optimize everything. Let's
  • 00:40:41
    optimize the points. Let's optimize the
  • 00:40:44
    camera poses. And something we haven't
  • 00:40:46
    mentioned is it will also be optimizing
  • 00:40:48
    the camera parameters. So back when we
  • 00:40:51
    picked that camera model and we said,
  • 00:40:52
    "Oh, you know, we're going to use a
  • 00:40:53
    camera model that has a focal length
  • 00:40:55
    term and a principal point cx and cy, or
  • 00:40:57
    maybe has some radial distortion terms."
  • 00:40:59
    During bundle adjustment, Colmap will
  • 00:41:01
    also be optimizing those parameters as
  • 00:41:04
    well to figure out well what is the
  • 00:41:05
    field of view of my camera that's the
  • 00:41:07
    focal length, or how much lens distortion
  • 00:41:09
    was there, in order to achieve that
  • 00:41:12
    alignment. Would it run those? Cuz we
  • 00:41:15
    didn't cover this earlier on, but let's
  • 00:41:18
    say you do have a camera model
  • 00:41:21
    calibration file. So, you're saying, I
  • 00:41:23
    know this. I think DJI, again
  • 00:41:27
    in their enterprise-level drones, will
  • 00:41:28
    give you this information on their
  • 00:41:31
    lenses cuz they've been calibrated and
  • 00:41:33
    it's in the EXIF data. Will that
  • 00:41:35
    change? Does it do like a refinement on
  • 00:41:37
    top of that or does it just say no, no,
  • 00:41:39
    no, you give us that, we won't change
  • 00:41:40
    that. That's that's an option. So I
  • 00:41:43
    think, either under the
  • 00:41:44
    reconstruction options or under the
  • 00:41:46
    bundle adjustment options, there are ways
  • 00:41:47
    to say, hey, do I want to refine my focal
  • 00:41:50
    length, do I want to refine, you know, my
  • 00:41:52
    distortion terms. So you could, you
  • 00:41:55
    know enable or disable that setting. To
  • 00:41:57
    that point I do believe you know that
  • 00:41:59
    Colmap will parse the EXIF data in those
  • 00:42:02
    images and if it sees that yeah there is
  • 00:42:03
    a focal length cuz a lot of times an
  • 00:42:05
    image will you know will contain you
  • 00:42:07
    know that oh this was taken with a 10 mm
  • 00:42:09
    lens or a 24 mm lens you know and so
  • 00:42:12
    Colmap can parse that data to take an
  • 00:42:14
    initial guess at what it thinks that
  • 00:42:16
    focal length is you know what's the
  • 00:42:17
    field of view of the camera and can use
  • 00:42:18
    that as initialization. But a lot of
  • 00:42:21
    times there is benefit to refine that,
  • 00:42:23
    because it may get you
  • 00:42:25
    close, but it might not be close enough
  • 00:42:28
    to get a really sharp
  • 00:42:30
    reconstruction.
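(If you do want to feed in a known calibration and control whether it gets refined, that toggle lives in Colmap's options. Here is a hedged command-line sketch wrapped in Python: the option names are taken from Colmap's CLI but may differ between versions, and the paths and calibration numbers are placeholders.)

```python
import subprocess

# Feed a known calibration in up front (fx, fy, cx, cy for the PINHOLE model)...
subprocess.run([
    "colmap", "feature_extractor",
    "--database_path", "project/database.db",
    "--image_path", "project/images",
    "--ImageReader.camera_model", "PINHOLE",
    "--ImageReader.camera_params", "1000,1000,640,360",
    "--ImageReader.single_camera", "1",
], check=True)

# ...and tell the mapper's bundle adjustment not to touch those intrinsics.
subprocess.run([
    "colmap", "mapper",
    "--database_path", "project/database.db",
    "--image_path", "project/images",
    "--output_path", "project/sparse",
    "--Mapper.ba_refine_focal_length", "0",
    "--Mapper.ba_refine_principal_point", "0",
    "--Mapper.ba_refine_extra_params", "0",
], check=True)
```

    So, okay, so I got a lot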
  • 00:42:32
    more appreciation for what's happening
  • 00:42:34
    here. I tell people to run this on their
  • 00:42:36
    computer. You don't need the highest
  • 00:42:38
    spec computer to run a small data set
  • 00:42:40
    and learn how this works. I ran this on
  • 00:42:42
    my older computer which doesn't have you
  • 00:42:45
    know 24 cores or anything and it still
  • 00:42:47
    ran fairly quick. I'd say there's
  • 00:42:49
    there's some things you gave me some
  • 00:42:51
    notes. I think we covered largely most
  • 00:42:53
    of it. But then from here, you can do
  • 00:42:55
    things. So, I've ran this through. You
  • 00:42:58
    can hit automatic reconstruction. It'll
  • 00:43:00
    create all this, but then you can hit
  • 00:43:01
    bundle adjustment, which is that global
  • 00:43:03
    one at the end. And then you can build a
  • 00:43:05
    dense reconstruction, which we're not
  • 00:43:06
    really going to cover on this episode.
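(For reference, the automatic reconstruction button roughly corresponds to chaining the individual steps discussed above. A hedged sketch of the sparse part of that pipeline, driving the colmap command line from Python; the paths are placeholders and exact flags can vary between versions.)

```python
import os
import subprocess

db, images, sparse = "project/database.db", "project/images", "project/sparse"
os.makedirs(sparse, exist_ok=True)

# Correspondence search: feature extraction, then matching plus geometric verification.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", db, "--image_path", images], check=True)
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", db], check=True)

# Incremental reconstruction: initialization, image registration, triangulation,
# and the local/global bundle adjustments described above.
subprocess.run(["colmap", "mapper",
                "--database_path", db, "--image_path", images,
                "--output_path", sparse], check=True)

# Optional final global bundle adjustment over the whole model.
subprocess.run(["colmap", "bundle_adjuster",
                "--input_path", f"{sparse}/0",
                "--output_path", f"{sparse}/0"], check=True)
```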
  • 00:43:08
    This is just kind of like here's how we
  • 00:43:10
    got that workflow I showed to get
  • 00:43:12
    the camera poses, the sparse point
  • 00:43:14
    cloud, and then from there, you can use
  • 00:43:15
    it for more downstream tasks, right? So
  • 00:43:17
    I could use this for, again, doing a dense
  • 00:43:20
    3D reconstruction, where
  • 00:43:21
    I want to get millions of points on this
  • 00:43:23
    scene, or I can use this as the basis for
  • 00:43:27
    initializing 3D Gaussian splatting.
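(As a hedged example of handing the result to a downstream task, the sparse model Colmap writes out can be read back with the pycolmap bindings, assuming they are installed; attribute names can differ across versions, and the path is a placeholder.)

```python
import numpy as np
import pycolmap

# Load the sparse model the mapper wrote out (cameras, images, 3D points).
rec = pycolmap.Reconstruction("project/sparse/0")
print(len(rec.cameras), "cameras,", len(rec.images), "images,", len(rec.points3D), "points")

# Stack the sparse point cloud into an Nx3 array, e.g. to seed Gaussian splatting
# or a dense reconstruction step.
xyz = np.array([p.xyz for p in rec.points3D.values()])
```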
  • 00:43:29
    There's just different things you can
  • 00:43:31
    use once you've got camera positions and a
  • 00:43:34
    sparse point cloud. I'm
  • 00:43:36
    showing also on my screen, I didn't talk
  • 00:43:38
    about it, you have these kind of magenta
  • 00:43:40
    lines. This is showing which of
  • 00:43:42
    these images matched. If I
  • 00:43:44
    double-clicked on one, it'll show
  • 00:43:47
    that kind of information of the key
  • 00:43:49
    points and which ones matched to it. But
  • 00:43:51
    you can just click around and and and
  • 00:43:52
    learn things. Double click on different
  • 00:43:54
    parts of the scene. It'll show you the
  • 00:43:57
    point and which which different cameras
  • 00:43:59
    made up that point. And it's a good tool
  • 00:44:02
    to kind of learn how this works because
  • 00:44:03
    it's very visual on the screen. Lots of
  • 00:44:06
    data, lots of options. You can even
  • 00:44:09
    create animations in this if you really
  • 00:44:11
    want to show off what you learned. There
  • 00:44:12
    is one thing we didn't really talk
  • 00:44:14
    about. Well, there's a couple things.
  • 00:44:15
    So, incremental reconstruction. Everyone
  • 00:44:17
    always complains. I got the newest GPU.
  • 00:44:19
    This should be really fast. Why is this
  • 00:44:22
    running so slow? My GPU is not even
  • 00:44:23
    being used and it says it's taking 5
  • 00:44:26
    hours to run my thousand image data set.
  • 00:44:29
    Why is that? Why can't we use a GPU for
  • 00:44:30
    this incremental reconstruction? Or I
  • 00:44:32
    know we can, but why can't we in Colmap
  • 00:44:34
    the way it's configured? Yeah. Yeah.
  • 00:44:36
    Because Colmap... Yeah. A lot of these
  • 00:44:38
    algorithms are not easy to parallelize
  • 00:44:41
    on a GPU. So a GPU works well when
  • 00:44:44
    you're doing the exact same operation on
  • 00:44:46
    millions of things, you know, cuz that's
  • 00:44:48
    what a GPU does. Its job is to draw
  • 00:44:51
    pixels to a screen, you know, on your on
  • 00:44:53
    your monitor on your desktop. And so
  • 00:44:55
    you've got millions of pixels on your
  • 00:44:56
    screen. And so that GPU is processing a
  • 00:44:58
    million pixels at once and figures out
  • 00:45:00
    what to draw. And so for tasks like
  • 00:45:03
    feature extraction where, hey, I've got,
  • 00:45:05
    again, millions of pixels and I want to
  • 00:45:07
    figure out which ones have features in
  • 00:45:08
    them, a GPU is great. Or feature matching:
  • 00:45:12
    I've got tens of thousands of features
  • 00:45:13
    in one image, tens of thousands of the
  • 00:45:15
    other. I want to figure out which
  • 00:45:16
    features match with each other. Then
  • 00:45:18
    again that's great for a GPU. For
  • 00:45:20
    incremental reconstruction. It's like
  • 00:45:23
    I'm operating on one image at a time and
  • 00:45:25
    I have to just solve a math equation and
  • 00:45:28
    do some, you know, linear algebra to
  • 00:45:30
    figure out what's the 3D position or
  • 00:45:32
    pose of that image. That's not a very
  • 00:45:34
    parallelizable task. And so it's not
  • 00:45:37
    very easy to uh adapt some of these
  • 00:45:40
    algorithms to the GPU. I will say,
  • 00:45:42
    another thing too that contributes to it
  • 00:45:44
    is Colmap is very flexible. There's a
  • 00:45:48
    lot of algorithms, a lot of switches, a
  • 00:45:50
    lot of different techniques that you can
  • 00:45:51
    use and to implement all of those on the
  • 00:45:54
    GPU would just take a lot of time. It's
  • 00:45:56
    nice having software that's flexible.
  • 00:45:58
    You know, with Colmap being open source, a
  • 00:46:00
    bunch of people contributing to it, it's
  • 00:46:02
    nice having a flexible platform where
  • 00:46:04
    people can easily dive in, make changes,
  • 00:46:08
    add their own algorithm, plug it in,
  • 00:46:10
    tweak things, and play with it. So
  • 00:46:11
    having that having that sort of more
  • 00:46:13
    general-purpose CPU-based implementation
  • 00:46:16
    is is helpful. But yeah, to get back to
  • 00:46:18
    the core, it really is primarily just
  • 00:46:19
    around the algorithms. A lot of these
  • 00:46:21
    algorithms are not parallelizable or
  • 00:46:24
    not well suited for processing on a GPU.
  • 00:46:27
    That makes sense. Someone once
  • 00:46:29
    explained it to me, or was trying to
  • 00:46:31
    explain it: it's like your CPU is a
  • 00:46:32
    really good detective at solving clue by
  • 00:46:35
    clue, one thing at a time, versus a GPU,
  • 00:46:38
    it's like it can just point out all the
  • 00:46:39
    clues all at once. Yeah. But you really
  • 00:46:41
    need that, like, hard math equation, and
  • 00:46:44
    you need really fast cores, trying
  • 00:46:46
    to solve those things one at a time, and
  • 00:46:48
    it's incremental. So think about it.
  • 00:46:50
    It's like you can't you can't solve all
  • 00:46:51
    these all at once as is. So that's
  • 00:46:54
    something that people just have to keep
  • 00:46:55
    in mind that don't get frustrated. It's
  • 00:46:58
    just how this technology works today.
  • 00:47:00
    And there's Glomap. So how does Glomap
  • 00:47:01
    make this all of a sudden magically
  • 00:47:03
    fast? Yeah. So Glomap is a different
  • 00:47:06
    style for that reconstruction process.
  • 00:47:09
    So Glomap, that's global mapper, you
  • 00:47:12
    know. So global reconstruction versus
  • 00:47:15
    incremental reconstruction. So instead
  • 00:47:16
    of, here in Colmap, we just talked about how it
  • 00:47:18
    uses an incremental reconstruction, adds,
  • 00:47:20
    you know, one image at a time, whereas
  • 00:47:23
    global reconstruction it tries to figure
  • 00:47:26
    out the 3D poses of all of the images
  • 00:47:29
    all at once. So Glomap still has that
  • 00:47:32
    same correspondence search step. So to
  • 00:47:34
    run Glomap you still got to extract
  • 00:47:36
    key points extract features from your
  • 00:47:38
    image. You got to match them. Got to run
  • 00:47:39
    your geometric verification. But once
  • 00:47:42
    you have that web of connectivity
  • 00:47:44
    between your images, you can then run
  • 00:47:46
    global reconstruction techniques. And so
  • 00:47:49
    there's a few different steps there. In
  • 00:47:51
    Glomap, they run rotation averaging
  • 00:47:54
    first. So the idea with that is that you
  • 00:47:57
    look at all of the feature matches
  • 00:47:59
    between your pairs of images. For each
  • 00:48:02
    pair, you estimate how much rotation
  • 00:48:05
    occurred between that pair of images,
  • 00:48:07
    you know. So that gives you a
  • 00:48:08
    constraint. But now if I look at all of
  • 00:48:10
    the rotations that I estimated between
  • 00:48:12
    all of the pairs, can I come up with a
  • 00:48:15
    consistent orientation for all of my
  • 00:48:17
    images that satisfies each of those
  • 00:48:19
    pair-wise constraints? So, can I arrange
  • 00:48:22
    the orientations of my images so that
  • 00:48:24
    all of those pair-wise rotations make
  • 00:48:26
    sense? And that's what rotation
  • 00:48:28
    averaging does. So, it's not even
  • 00:48:30
    looking at position. It's just trying to
  • 00:48:32
    rotate all of the images. And once
  • 00:48:34
    they're rotated in 3D space, then it
  • 00:48:36
    does a global positioning step which
  • 00:48:39
    simultaneously solves both the camera
  • 00:48:41
    positions as well as some of the 3D
  • 00:48:43
    points. And so it kind of throws all of
  • 00:48:45
    the cameras into a big soup, a big mess.
  • 00:48:47
    It gives them a bunch of random
  • 00:48:49
    initializations and then defines these
  • 00:48:51
    constraints saying, well, these images
  • 00:48:53
    saw these common points. How can I
  • 00:48:56
    rearrange all of these images so that
  • 00:48:59
    they line up and see those common
  • 00:49:01
    points? So it's similar to bundle
  • 00:49:03
    adjustment, that idea of: take a
  • 00:49:05
    bunch of images that see points and
  • 00:49:07
    refine it, but uh it uses a different
  • 00:49:10
    formulation, a different set of
  • 00:49:11
    constraints that is better suited to,
  • 00:49:15
    you know, random unknown camera
  • 00:49:16
    positions. And so that's this global
  • 00:49:18
    positioning problem that they solve.
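(A toy sketch of the rotation averaging idea, not Glomap's actual algorithm: fabricate pairwise relative rotations between a few cameras, then repeatedly nudge each camera's orientation toward what its neighbours' relative rotations predict until the pairwise constraints are satisfied.)

```python
from scipy.spatial.transform import Rotation

# Ground-truth orientations for four cameras, used only to fabricate the inputs.
R_true = [Rotation.random(random_state=i) for i in range(4)]

# Pairwise relative rotations R_ij (with R_j = R_ij * R_i), the kind of constraint
# you get from the verified matches between each image pair.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rel = {(i, j): R_true[j] * R_true[i].inv() for i, j in edges}

# Start every camera at an arbitrary orientation and repeatedly replace each one
# with the average of what its neighbours' relative rotations say it should be.
R = [Rotation.identity() for _ in range(4)]
for _ in range(50):
    for i in range(1, 4):                          # keep camera 0 fixed as the reference
        predictions = []
        for (a, b), R_ab in rel.items():
            if b == i:
                predictions.append(R_ab * R[a])        # edge a->i predicts R_i = R_ai * R_a
            elif a == i:
                predictions.append(R_ab.inv() * R[b])  # edge i->b predicts R_i = R_ib^-1 * R_b
        R[i] = Rotation.concatenate(predictions).mean()
```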
  • 00:49:20
    So once
  • 00:49:22
    you've run your rotation averaging, your
  • 00:49:23
    global positioning, you get a
  • 00:49:24
    reconstruction that's pretty close. And
  • 00:49:26
    then you can run bundle adjustment, you
  • 00:49:29
    know, an actual high quality refinement
  • 00:49:31
    using bundle adjustment. And then you
  • 00:49:33
    have your your 3D reconstruction. So it
  • 00:49:35
    skips a lot of this incremental slow
  • 00:49:37
    process that wasn't parallelizable. The
  • 00:49:39
    rotation averaging uh and global
  • 00:49:41
    positioning that's a little better
  • 00:49:42
    suited to parallelization and is is more
  • 00:49:44
    efficient because you're not having to
  • 00:49:45
    do this one after the other after the
  • 00:49:47
    other. Yeah. And I have it on my screen
  • 00:49:49
    here, the project page, where it kind of
  • 00:49:51
    shows what you were talking about. And this
  • 00:49:53
    last one, showing it all happening all at
  • 00:49:55
    once, where it just kind
  • 00:49:58
    of resolves at once. I do want to say
  • 00:50:01
    that, to me,
  • 00:50:03
    there's a low, what's the
  • 00:50:06
    right word? You're not going
  • 00:50:08
    to be wasting a lot of your time to give
  • 00:50:10
    this a shot to see if this works well
  • 00:50:12
    for your project because you don't have
  • 00:50:14
    to wait a lot of time for it to do the
  • 00:50:16
    incremental reconstruction. So, it
  • 00:50:17
    doesn't work well with all scenes as I
  • 00:50:19
    found, but because you know within
  • 00:50:22
    minutes if it's going to work well or
  • 00:50:23
    not, it's worth a shot and you get to
  • 00:50:26
    learn what scenes work well with it.
  • 00:50:27
    You've done some tests as well, Jared.
  • 00:50:30
    You kind of can't get too tied in on
  • 00:50:31
    a bunch of little things. I feel like
  • 00:50:33
    you need more of a global view, or,
  • 00:50:35
    you know, the example images have a
  • 00:50:37
    lot of features and aren't really close
  • 00:50:39
    tied in on little features in a scene.
  • 00:50:43
    Mhm. Yeah. From my
  • 00:50:46
    experience with Glomap and other
  • 00:50:48
    global structure from motion, global
  • 00:50:49
    reconstruction techniques, they work
  • 00:50:52
    best when you have a lot of connections
  • 00:50:55
    between your images. Mhm. So it's not
  • 00:50:58
    you just walking through a cave or
  • 00:51:00
    walking down, you know, a city street
  • 00:51:01
    and never returning back. It likes a lot
  • 00:51:04
    of loop closures. It likes a lot of
  • 00:51:06
    connectivity, a lot of different vantage
  • 00:51:08
    points and overlap and diverse content.
  • 00:51:10
    And so it it takes the strength of those
  • 00:51:14
    diverse and dense connections and very
  • 00:51:17
    quickly figures out how to arrange them
  • 00:51:18
    to produce that final reconstruction.
  • 00:51:20
    And that's probably why, in my experience,
  • 00:51:22
    when I have these broader view
  • 00:51:24
    shots, it works well, because I have a
  • 00:51:25
    lot of connections. I have a lot of
  • 00:51:28
    unique features. Whereas if you get too close in
  • 00:51:31
    on one little object, or you have a lot
  • 00:51:33
    of, like, I think I've done some
  • 00:51:35
    indoors that haven't turned out because
  • 00:51:36
    you have a lot of just blank white walls
  • 00:51:38
    with not a lot of features. So, it's
  • 00:51:40
    just not able to do that. So, all right.
  • 00:51:43
    Well, this is something, I say, I had on
  • 00:51:44
    my screen just to kind of show some
  • 00:51:47
    examples. If you're listening, I will
  • 00:51:49
    make sure I link them in the show notes as
  • 00:51:50
    well: Glomap and Colmap. But Glomap's
  • 00:51:53
    an interesting one you can look
  • 00:51:55
    at. It drops on top of Colmap.
  • 00:51:59
    So, even getting it running isn't like a
  • 00:52:00
    large lift. And you see Johannes in the
  • 00:52:03
    list of names. So, you can see he's
  • 00:52:05
    still working on these things. I think
  • 00:52:07
    this is interesting because it does make
  • 00:52:08
    things go faster. And if you look at the
  • 00:52:10
    results, they are in the same range
  • 00:52:13
    of accuracy as you get with incremental
  • 00:52:15
    reconstruction using Colmap. So it's
  • 00:52:17
    not saying well this is fast but it's
  • 00:52:18
    not nearly as good. It's fast and is
  • 00:52:20
    good if you have a good result but you
  • 00:52:22
    find out really quick because I've
  • 00:52:23
    noticed that the results either are
  • 00:52:25
    absolutely all over the place or you
  • 00:52:27
    have a really good sparse point cloud
  • 00:52:29
    and so you know if it's good or not. In
  • 00:52:31
    fact, you'll see cameras all over the
  • 00:52:33
    place where everything's kind of like
  • 00:52:34
    this weird looking cube and and that's
  • 00:52:37
    how you know it didn't work. But you
  • 00:52:39
    will know based off of your output.
  • 00:52:40
    Yeah, I've gotten a few, I say
  • 00:52:43
    Borg cubes, that's what I think they
  • 00:52:44
    look like, but I think I've gotten a few
  • 00:52:46
    cubes as my results.
  • 00:52:49
    All right. Well, I
  • 00:52:51
    think we covered this all really well. I
  • 00:52:53
    hope at the end of this people will go
  • 00:52:56
    try Colmap, or, I mean, even if they
  • 00:52:58
    use other software, it will follow
  • 00:53:01
    relatively the same sort of process. I
  • 00:53:04
    don't think you could maybe there's
  • 00:53:05
    other ways it's done. I'm sure there is,
  • 00:53:06
    but this is the standard kind of method
  • 00:53:09
    that most at least follow this sort of
  • 00:53:12
    style. And now there's all this machine
  • 00:53:14
    learning stuff that's different. But as
  • 00:53:15
    far as classical 3D reconstruction from
  • 00:53:18
    imagery, this is a very well-known and
  • 00:53:20
    reused pipeline for a lot of projects.
  • 00:53:24
    Yeah. And it's a great, like you said,
  • 00:53:25
    like just go and try that. That's,
  • 00:53:25
    I can't stress that enough, just
  • 00:53:27
    try it. You know, if you're either,
  • 00:53:28
    one, just, you know, new to computer
  • 00:53:31
    vision and want to understand how 3D
  • 00:53:34
    reconstruction works, you know, or maybe
  • 00:53:36
    you kind of understand it but don't
  • 00:53:37
    under, you know, but want to get a
  • 00:53:39
    better insight of how things work behind
  • 00:53:41
    the scenes. A tool like Colmap is
  • 00:53:43
    great just to, you know, throw some
  • 00:53:44
    images at it, run a reconstruction, and
  • 00:53:46
    then start poking around. There's a lot
  • 00:53:48
    of neat visualizations that Jonathan
  • 00:53:50
    showed, where you can look at a point and
  • 00:53:51
    see which images saw it, or, in an image,
  • 00:53:54
    what did it match to. There's other
  • 00:53:56
    debug visualizations where you can look
  • 00:53:57
    at sort of the match graph or the match
  • 00:53:59
    matrix and see how uh the different
  • 00:54:02
    patterns or ways that images are
  • 00:54:04
    matching to each other. So, it's it's a
  • 00:54:06
    nice way to get in get your hands dirty
  • 00:54:09
    and see how this process of turning
  • 00:54:12
    pixels to 2D information to final 3D
  • 00:54:15
    results, you know, and and that mapping
  • 00:54:17
    from, you know, 2D to 3D and all the uh
  • 00:54:20
    information that goes into that. So,
  • 00:54:21
    it's a great way to get in there and get
  • 00:54:22
    an intuition for how this all works
  • 00:54:23
    behind the scenes. Yes, definitely. And
  • 00:54:26
    I would say the most important part when
  • 00:54:29
    you're trying to run this is picking the
  • 00:54:31
    right matching strategy, because that can
  • 00:54:33
    be the difference
  • 00:54:34
    between waiting hours versus an hour or
  • 00:54:37
    minutes.
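(To that point, the matcher you pick should follow how the images were captured. A hedged sketch of the three common choices via the colmap command line; the subcommands are real, but the flags and vocabulary-tree path shown are placeholders to adapt.)

```python
import subprocess

db = "project/database.db"

# A few hundred photos taken from all around an object: exhaustive matching is fine.
subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)

# Frames pulled from a video, in capture order: sequential matching only compares
# nearby frames (optionally with loop detection), which is far faster.
# subprocess.run(["colmap", "sequential_matcher", "--database_path", db], check=True)

# Thousands of unordered images: vocabulary-tree matching narrows down likely pairs first.
# subprocess.run(["colmap", "vocab_tree_matcher", "--database_path", db,
#                 "--VocabTreeMatching.vocab_tree_path", "vocab_tree.bin"], check=True)
```

    So, well, thanks Jared for this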
  • 00:54:39
    episode and kind of covering all this
  • 00:54:42
    stuff. I hope this was tangible enough
  • 00:54:44
    for people to go try it and having the
  • 00:54:46
    visuals up. So, if you're listening, go
  • 00:54:48
    find this video on the EveryPoint
  • 00:54:51
    YouTube channel. We have a playlist of
  • 00:54:53
    all of our episodes. I'll make sure. I
  • 00:54:56
    haven't named it yet, but I'm sure
  • 00:54:57
    Colmap will be in the name. It'll be, uh,
  • 00:55:00
    I can't remember what episode we're on,
  • 00:55:01
    but it's like 15 or 16. You will see
  • 00:55:04
    that it'll be a
  • 00:55:07
    great way for you to learn this if
  • 00:55:08
    you're getting into this, cuz
  • 00:55:09
    I see every day I didn't go over these,
  • 00:55:11
    but we have questions I see every day
  • 00:55:13
    either on my videos or on Reddit or
  • 00:55:18
    Discord. There's these different
  • 00:55:19
    communities that are all using projects
  • 00:55:21
    that require Colmap to run to start,
  • 00:55:24
    think 3D Gaussian splatting, and it's
  • 00:55:26
    just obvious that this is something that
  • 00:55:28
    people just know they have to use but
  • 00:55:30
    have no idea what's happening. They just
  • 00:55:32
    know they threw a bunch of images at it
  • 00:55:34
    and something came out and then they're
  • 00:55:36
    going to do something else with it. But
  • 00:55:38
    they have no appreciation for the
  • 00:55:40
    sausage making of Colmap. If you
  • 00:55:43
    know what each step is, you can get
  • 00:55:45
    better results in my opinion. Just play
  • 00:55:47
    with it. see what works, learning what
  • 00:55:49
    those different options are. If you
  • 00:55:51
    don't know what an option is as well,
  • 00:55:52
    jump on our YouTube channel, ask a
  • 00:55:54
    question. I will be watching and trying
  • 00:55:56
    to respond as intelligently as possible
  • 00:55:59
    on those and and give you a a good
  • 00:56:01
    answer. So Jared, any other parting
  • 00:56:03
    thoughts you want on this? You you said
  • 00:56:04
    go get give it a try. Any other tips you
  • 00:56:07
    would give people? Take good sharp
  • 00:56:09
    imagery. And just do it
  • 00:56:11
    yourself. Get out and try, you know,
  • 00:56:12
    take your own photos and see how
  • 00:56:14
    they turn out. Yeah, take your own
  • 00:56:16
    photos. Don't go use the, like, open-
  • 00:56:18
    source data sets, because you know those
  • 00:56:20
    are going to work, and you know those are
  • 00:56:22
    great for testing but not great for
  • 00:56:24
    learning on your own data. So, right, well,
  • 00:56:28
    thank you. And again, if you're
  • 00:56:29
    listening, this will be on all major
  • 00:56:31
    podcast players. Please, if you can,
  • 00:56:34
    subscribe to our channel or to
  • 00:56:37
    one of our podcast episodes. That'll mean
  • 00:56:39
    a lot to us to know that we're making the
  • 00:56:40
    right content and that you guys care
  • 00:56:42
    about learning about this information.
  • 00:56:44
    And as always, let us know in the
  • 00:56:46
    comments as well on our YouTube channel
  • 00:56:47
    if there is something here that you
  • 00:56:49
    would like us to go deeper in. Maybe we
  • 00:56:51
    can get someone like Johannes on one of
  • 00:56:53
    these episodes to go super deep if you
  • 00:56:56
    want to. Anyways, well, thanks Jared for
  • 00:56:58
    being on this episode and I'll see you
  • 00:57:00
    guys in the next
Tags
  • Colmap
  • 3D Reconstruction
  • Structure from Motion
  • Feature Extraction
  • Camera Pose
  • Geometric Verification
  • Incremental Reconstruction
  • Global Reconstruction
  • Computer Vision
  • Open Source