Volumetric Video and Photogrammetry

Capture Tips and Tricks

MOD Tech Labs
21 min read · Sep 25, 2020

This is a transcript taken from a series of lightning talks focused on modern content creation techniques during SIGGRAPH 2020. Enjoy!

ALEX: Our goal is to share some of the insider info around volumetric video and photogrammetry and how to get the best capture from whichever type of capture method you choose. There are some intrinsic things that we’ll mention that are specific to the MOD Tech Labs processing solution, but overall, our goal is to create opportunities across the spectrum. Our intake is completely universal — we can take in any kind of volumetric data, photogrammetry data, and scan data. And our output is completely universal as well — .obj and .fbx, etc. — standard file types. These capture tricks are broad here, but we do have a more lengthy capture guide that you can find at the bottom of this article.

ALEX: I’m Alex Porter, CEO and Co-Founder of MOD Tech Labs. This is our second startup in the tech space — we actually come from XR, and this tool was originally created when Tim and I were running Underminer Studios. Ultimately, what we wanted to do was create the opportunity for massively scalable content creation. And what we have come to after three and a half years of working on this tool is a highly scalable cloud SaaS solution. I’m not going to go too deep into that, but that’s sort of the frame of reference of where we’re coming from.

My background is in interior design and construction technology, and across the last few years, we’ve worked in entertainment, media, medical tools, and more. What we’ve always done is that sort of back-end tools creation — whether it’s building an AR scanning tool or a VR wheelchair driving experience, there’s an opportunity for folks to build up and scale their own interactions and content, immersively, in VFX, geospatial, medical, or elsewhere.

We are a venture-backed startup based in Austin, Texas, and over the last few years we have been awarded Top Innovator awards by Intel for 3 years running and the City of Austin Innovation Award in 2019. We are also part of the NVIDIA Inception program.

TIM: And I’m Tim Porter, Co-Founder and CTO at MOD Tech Labs. I’ve spent coming up on 20 years in the video game, movie, and immersive media industries. In games, I was a technical artist, and in movies I was a pipeline technical director. Now, I’m a CTO — so, kind of a natural slide for me. I am the Chair of the Consumer Technology Association’s XR working group and also serve as their Vice-chair of XR standards.

Where my perspective comes from is the maker side. Basically, “How do we make tools that can reach everyone?” I am very fortunate in being able to pick up new technologies and then be able to use them in a technical fashion very quickly. The whole idea behind it is — how do we help people? Being a tech artist is kind of a mixed bag of tricks. My specific role was acting as an inter-department liaison. I did automated tools and toys for artists and device-specific optimization, and all of those really kind of led into where MOD is, which is being able to take technology that is really difficult to build or automate, or very time consuming, and distill that down into something that is easy to use, quick, and doesn’t require infrastructure — something that small and mid-tier studios have a massive issue with.

ALEX: We’re gonna break down our suggestions into photogrammetry, scanning, and volumetric video for best capture practices.

TIM: Photogrammetry — a real light 101, conversation-wise — is using photos. Scanning: we’re going to talk about RGBD data and LiDAR data, so both structured/unstructured light and laser. Then volumetric video — we’re gonna focus a little bit more on the videogrammetry style of volumetric video, since it is a bit more agnostic and more widely known.

ALEX: For your photogrammetry rig setup, there are a few really important things. Camera placement and camera focus are extremely important. Depending on your physical floorspace and what you’re trying to capture, the still object is typically placed in the center.

If you have multiple cameras or if you’re using a single point-and-shoot, you’re definitely going to want to use something like a tripod, so that you can have professional quality with even photographs from every angle and part that you’re trying to capture. That will help you get even more detail. For extremely detailed objects, you need even more photos. You always want to use — where possible — identical cameras and lenses. There are some solutions that can use multiple styles of cameras, but ultimately, it’s a lot easier to solve if you have a singular type. So the general way that we like to say it is a 15-degree section between each of the cameras, and that helps create those overlapping data points that you’ll want to make it really high-end.

TIM: Exactly. This is just a light concept when you’re thinking about any level of scanning. I’ve seen people do a lot more. I’ve seen people get away with less. But the general rule of thumb — especially if you’re building a static rig — is 15 degrees in each direction. 14.5, if you’re really being super precise and want to do something super high-end, professional grade, especially for faces and things like that. But really, once you get below that number, you start running into data that’s not really needed — a lot of newer systems will actually throw away that data.
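
As a rough illustration of that 15-degree rule of thumb, here is a minimal Python sketch (our own, not a MOD tool) that lays out one horizontal ring of cameras around a subject; the 2-meter radius is an arbitrary placeholder.

```python
import math

# Illustrative only: one horizontal ring of cameras at 15-degree spacing.
# The 2-meter radius is a placeholder, not a recommendation.
def ring_positions(radius_m=2.0, step_deg=15.0):
    count = int(round(360.0 / step_deg))
    return [
        (radius_m * math.cos(math.radians(i * step_deg)),
         radius_m * math.sin(math.radians(i * step_deg)))
        for i in range(count)
    ]

ring = ring_positions()
print(f"{len(ring)} cameras per ring at 15-degree spacing")  # 24 cameras
```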

When you’re talking about individual assets, you can go a little bit more on the topology and the flow of the asset. If it’s something you have a handheld camera for, you know, kind of follow the edges and go, “Okay, well, this is an overlapping area, so I need to get a couple extra in this area” — versus multiple-camera rigs for your capture. With static rigs, you’re looking for a good average coverage for each one of these different things. When we’re talking about identical lenses, the reason why it’s easier to solve is that the machine learning algorithms do go through and do an amount of understanding as to the camera’s intrinsics and extrinsics — basically, where it is in three-dimensional space, the camera’s field of view, and a whole bunch of other very heavy data points — and it does that over an aggregation of the images.

So, the more images you have — especially if they are all the same, the calculation speed goes up (the amount of time you spend calculating that information goes down), and the quality goes up because the same capture characteristics carry through all of the different images as they go along. It ends up creating a higher quality result.
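
To make “intrinsics and extrinsics” a bit more concrete, here is a hedged, textbook-style pinhole projection sketch in Python with placeholder values; a real photogrammetry solver estimates K, R, and t from the images themselves, as Tim describes.

```python
import numpy as np

# Intrinsics: focal length in pixels (fx, fy) and principal point (cx, cy).
# These numbers are placeholders for a 4K-ish sensor, not measured values.
fx, fy, cx, cy = 2400.0, 2400.0, 1920.0, 1080.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsics: rotation R and translation t place the camera in the scene.
R = np.eye(3)
t = np.array([[0.0], [0.0], [3.0]])   # subject roughly 3 units in front

# Project one 3D point on the subject into pixel coordinates.
X = np.array([[0.1], [0.2], [0.0]])
x = K @ (R @ X + t)
u, v = float(x[0] / x[2]), float(x[1] / x[2])
print(f"pixel: ({u:.1f}, {v:.1f})")
```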

Yes, I’ve seen tons of different setups. A lot of professional rigs do use multiple camera types, but you’re talking about rigs with typically 100-200 cameras, and there really is no replacement for that. If you’re instead doing something closer to a single lens and camera type, you can get away with a little bit less, for sure.

And of course, you always want to make sure that you prefocus your cameras, that the subject is in frame as much as possible, and that you have overlap between frames. The general rule of thumb is three images per point — that’s pretty good.

ALEX: Some of the typical configurations are: dome coverage — around, with the subject in the middle — and often, if you’re doing full body, you will see cylindrical styles of rig setup. It has a lot more to do with what you’re capturing and what your physical footprint is. There are some ways to use both of these to your own benefit in different situations.

That “three-shots-per-point” is really important. We recently had some folks submit photogrammetry, with massive whitespace — they were too far back from the object that they were capturing. And when you have that, you’re going to miss a lot of the detail and you’re going to miss a lot of those really fine points that need to overlap. So, getting that object in-frame as much as possible with the least amount of extraneous stuff in the background or outside of the object is really important.

Then, for each scene, you want to overlap by 40%. Again, a lot of that has to do with mapping those points of interest across all of the data.
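
As a back-of-the-envelope check on those coverage rules, here is a small sketch; the 45-degree limit is our own assumption for when a camera still sees a surface point usefully, not a number from the talk.

```python
# Count how many cameras on a 15-degree ring usefully see one surface point.
# Assumption: a camera contributes if it sits within 45 degrees of the point's
# outward normal, a made-up grazing-angle limit used only for illustration.
def views_per_point(step_deg=15.0, max_offset_deg=45.0):
    camera_angles = [i * step_deg for i in range(int(360 / step_deg))]
    point_facing_deg = 0.0

    def angular_distance(a, b):
        d = abs(a - b) % 360
        return min(d, 360 - d)

    return sum(1 for a in camera_angles
               if angular_distance(a, point_facing_deg) <= max_offset_deg)

print(views_per_point())  # 7 views, comfortably above the three-image minimum
```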

TIM: And of course, these numbers are going to continue to update themselves. At one point in time, it was 60% and you wanted six images apiece. Now, it’s getting down to three, and with things like view synthesis, those numbers are going down on a regular basis. I’ve seen view synthesis shots that can do a full capture of an asset in under 30 images — all the way around — and it gets absolutely everything, and the quality is just phenomenal; crisp edges, shine and sheen, and things like that.

But once we roll back and talk about today’s technology — what everyone’s using, what goes into the queues people process with — it’s still three images right now. 40% overlap… you can get away with a little bit less if you have really high interest points without a lot of surface divisions — basically a massive silhouette change. If you have something like an earring, you’re going to need more information in there — especially if it’s an intricate earring versus a stud, and different things like that.

So, it really is that kind of balancing act between them. If you have a static object that isn’t very visually interesting — unlike, say, a very interesting shirt — you might end up needing more photos to pull information out of lighting changes, shadows, or something like that, just kind of depending on what’s going on in there. Things like stark, featureless assets are very difficult because the solver is looking for interest points to map between images.

Being able to shoot on a white background is kind of difficult at times, because you end up getting either reflection or refraction bounce from the ground and things like that. The same if you’re wearing an all-black shirt. You can get good shots out of black shirts and things like that, but it’s just a more difficult solve all-in-all, and you’ll get much cleaner results if you provide something like a plaid. On the other end, with a material like that, you run into different issues, like making sure the lines stay straight. Anything that is camera-safe but still has some form to it is a good answer there.

ALEX: Scanners are more of a continuous roll, if you will, rather than individual images. The goal is to maintain the integrity and the level all-around. You want to stay at the same angle and travel across the object, evenly. Again, a tripod is required to have that stability and that professional quality.

There are lots of ways that scanners are used: LIDAR, drones, etc. Tim’s going to talk more about RGBD. Then there’s no focus required for a scanner, specifically — typically, they have all those things intrinsically set up. Again, you’ll want to fill the frame as much as possible with the subject. One common thing that we are seeing is that combination of scanner data with photogrammetry, so that’s a really great way to supercharge your data sets.

TIM: The one thing that you always want to pay attention to is that each and every one of these scanners has a min and max distance — you want to stay in that sweet spot. There are physical charts out there, like the ones for the D435 that comes from Intel. That RealSense camera has a two-foot distance on that — once you get outside of that, you start losing quality. Once you get inside of that, you start having packed information that ends up causing issues with reconstruction. You can either get issues with warping, or the other thing you get is stippling once you’re too close. And once you’re too far away, it’s like stippling too — but it’s more of a mountainous, spurious kind of visual that comes out of it. So, you know, be careful with that.
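
Here is a tiny, illustrative way to enforce that min/max band in post, assuming you have depth frames as NumPy arrays in meters; the 0.6 m and 3.0 m limits are placeholders, so check your own scanner’s spec sheet.

```python
import numpy as np

# Placeholder usable band in meters; substitute your scanner's documented range.
NEAR_M, FAR_M = 0.6, 3.0

def mask_depth(depth_m: np.ndarray) -> np.ndarray:
    """Zero out samples outside the usable band so they never reach reconstruction."""
    out = depth_m.copy()
    out[(out < NEAR_M) | (out > FAR_M)] = 0.0
    return out

# Fake frame just to show the call; real frames would come from your capture SDK.
frame = np.random.uniform(0.1, 5.0, size=(480, 640)).astype(np.float32)
print(f"{(mask_depth(frame) > 0).mean():.0%} of samples kept")
```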

With LiDAR — if you’re talking about most ground-based LiDAR — it’s set up to scan the way that it’s set up. FARO does a wonderful job at setting up their systems so that they do what they do. If you’re talking about plane-based scanning, make sure that you get about 20% to 30% overlap, so that when the data comes back you can use that to clean up as you’re doing a flyby pass.
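
That overlap figure translates directly into flight-line spacing. A minimal sketch, with a made-up 120 m swath width just for illustration:

```python
# Flight-line spacing for a given sideways overlap between passes.
def line_spacing(swath_m: float, overlap: float) -> float:
    return swath_m * (1.0 - overlap)

swath = 120.0  # hypothetical ground swath covered by one pass, in meters
for overlap in (0.2, 0.3):
    print(f"{overlap:.0%} overlap -> fly lines about {line_spacing(swath, overlap):.0f} m apart")
```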

Specifically, drone technology has come a long way, with quadcopters able to carry heavier and heavier payloads. I’m starting to see a lot more data that revolves around photogrammetry on top of the scan data, so you’re seeing a lot of time-of-flight scanners out there and not nearly as many structured-light scanners on drones. Although, I have seen some RealSense scanning drones out there that mix with a DSLR of sorts, and they provide some decent feedback — but it really depends on what you’re going for. If you have to get a drone that close, you’ve got to have a really good pilot — so there’s a lot of trade-offs and a lot of different areas with this.

One of the better solutions is sky-based LiDAR with ground-based photogrammetry. Combining those two together provides both the crisp edges that you receive from LiDAR and a lot of the fill-in information that you get from photogrammetry — that kind of “spray and pray” setup — especially when you have a large area to cover, versus the precision that photogrammetry and ground-based LiDAR alone will end up getting you. So, if you have big things, the combination of the two provides more filled-in results in a shorter amount of time and is more economical.

ALEX: Volumetric video rig setup — this is interesting and fun. The typical model right now for much of volumetric video capture is a dedicated stage. We believe that is definitely a valuable way for some people to access it, but it may not be realistic for other folks. That is part of the reason we actually created our processing solution at MOD: to create the opportunity to bring volumetric video to others that are already doing photogrammetric capture. They already understand the tenets of this, they have the equipment to do photogrammetric capture, and all they really need is a few calibration tweaks to be able to capture volumetric video.

So, a lot of these things are relatively similar to the photogrammetric capture setup, with a few qualifiers here and there. For starters, a minimum of three cameras per 15-degree section, which is based on the tenets of photogrammetry — call it “videogrammetry” if you will. We are working to create that overlap in data and make sure that you get the most detail that you can to create that moving object.

With camera focus, you really want the same focal length in each camera and the same type of focus on each camera — no autofocus — that definitely causes issues, because all the cameras are going to do their own variations of autofocus, and then it’s harder to create that end result where you’re combining them. A global shutter is preferred, and stay away from fisheye lenses. You don’t want to warp or have any of the cameras be individual — you want them all to have the same setup, and if you have all the exact same cameras… even better. We have worked with everything from webcam rigs to DSLR rigs to bullet-time rigs, creating the opportunity to recalibrate those styles and bring them in.
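
One cheap sanity check before processing is to confirm every camera reports the same focal length. A rough sketch using Pillow’s EXIF reader, assuming JPEGs with EXIF data and a hypothetical folder name:

```python
from pathlib import Path
from PIL import Image

FOCAL_LENGTH_TAG = 37386  # standard EXIF FocalLength tag

def focal_lengths(image_dir):
    lengths = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        exif = Image.open(path)._getexif() or {}
        lengths[path.name] = exif.get(FOCAL_LENGTH_TAG)
    return lengths

lengths = focal_lengths("capture_session_01")  # hypothetical folder name
if len(set(lengths.values())) > 1:
    print("Warning: mixed focal lengths detected:", lengths)
```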

TIM: Why do we talk about three sections? It really is the vertical overlap — you end up having a kicker section, a mid-section, and a facial section. You do want to put at least a couple up over top, and this is something I tend to see with almost every client rig that we get — they don’t really count on the top of the head coming out that good. They aren’t that worried about it because a lot of them wear skullcaps and then they put the hair on afterwards. Well, I can tell you, if you believe that you have the manpower to go ahead and put hair onto every single frame of volumetric video… you go ahead and do that, but that sounds like enjoyment well past the level of entertainment that I find myself in on a regular basis.

You really do have to capture now — it’s all now, or you’re going to be in for a lot of pain later. So, doing things over the head, and making sure that you do count the floor — I see this even with very large professional volumetric rigs where they don’t do a lot of work in getting that separation between the ground and people’s feet. You’ll end up seeing these flat feet. I know not everybody wears a pair of Chuck Taylors, there is a sole on these things… but they do go through and they chop off the bottom of the feet. You end up needing to have more on the ground than people really think that you should, so that you can go ahead and separate these individuals. It’s very important.

One caveat when it comes to fisheye lenses — fisheye lenses are nasty. The reason why they’re nasty is because what they do is actually stretch the parts of the image that are on the edges. Every camera does do this — very true. Whether you realize it or not, the reason why we understand how far away a point is in three-dimensional space is because of the perspective distortion that a camera projects onto a flat image. So, you have this flat image, and close to the center there is very little stretching; as it gets further out towards the edge, every single image has some stretching — even a prime lens has a certain amount of stretch that comes out — it’s just locked to that exact focal length and provides a much better result.

Fisheye lenses do that at a much higher rate. What ends up happening is that you lose more viable information — even if you do a wonderful de-warp, and I have seen some de-warps where your eye will not see it. I can promise you that computer vision will see every minute difference that happens between each one of these images. And when it’s trying to go around the entire circle, it will see those little differences and it will place a little bit over here versus a little bit over there. That may not seem like a lot, but when you’re talking about “a little over here” for every single frame, that makes the edges dance, and dancing edges make people nauseous, and people don’t like being nauseous. This is something that we try not to do.
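
If you are stuck with fisheye footage, a de-warp pass usually looks something like the OpenCV sketch below. The K and D values here are placeholders; real ones come from a checkerboard calibration (e.g. cv2.fisheye.calibrate). And as Tim says, even a clean-looking de-warp leaves residue that computer vision can still see.

```python
import cv2
import numpy as np

# Placeholder intrinsics and fisheye distortion coefficients; real values
# come from calibrating the actual lens, e.g. with cv2.fisheye.calibrate.
K = np.array([[600.0, 0.0, 960.0],
              [0.0, 600.0, 540.0],
              [0.0, 0.0, 1.0]])
D = np.array([0.05, -0.01, 0.0, 0.0])

img = cv2.imread("frame_0001.png")  # hypothetical frame from the rig
undistorted = cv2.fisheye.undistortImage(img, K, D, Knew=K)
cv2.imwrite("frame_0001_undistorted.png", undistorted)
```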

There are several solutions that we are definitely working on regularly, involving machine learning algorithms that are obviously way smarter than the people that build them (i.e. me). This is something that will produce great results. I’ve seen a lot of good solutions at SIGGRAPH around how to deal with fisheye lenses, because sometimes you only have the space for a fisheye. Fisheyes are really wonderful at getting coverage. The problem is that the coverage they provide is not the coverage that you want.

ALEX: The rig coverage, again, is really similar to photogrammetry. Very typical, dome coverage — definitely want to make sure that you get the top of the head, as Tim mentioned… the feet and the head if you’re doing full body.

There are some technologies we’re experimenting with, and we’ve actually created temporal interpolation to recreate some missing data — we’ve had some datasets that did not have all of the ideal shots. That’s not always going to be feasible, to be honest, depending upon what the subject is. So having the correct amount of coverage is very important for fidelity. Especially on the face if you’re doing a bust, because the whole point of facial volumetric video is to get all the macro and micro expressions — all that flushing, all of the little fine lines — the minute movements of our face that make us human.

So you want cameras directed at all sides of the subject, making sure that you clearly get coverage for things like ears, hair, and the top and the back of the head, all those sort of really interesting, weird places that are a little hidden.

Bust shots require a minimum of 210 degrees of recorded data. So it’s not 180°; even though we typically call it 180°, it’s really 210°, because you do want to get the back of the ear. That’s a huge part of it.

TIM: So you capture 210 degrees so you can chop it down to 180°. Go back to our rule of three… you need to add an extra 15 degrees on either side — call it 210. Then you end up getting the area that you need, because otherwise you’ll get some warping and wobbling from the fact that those edge areas only have one camera covering certain points along there, or maybe two at certain other points, just based on where the 180-degree drop-off that you actually need falls, left-right and north-south.
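
The arithmetic Tim is describing is simple enough to write down: pad each side of the 180° deliverable by one 15° section.

```python
# Bust-arc padding: a 180-degree deliverable plus one 15-degree section per side.
DELIVERABLE_DEG = 180
PAD_DEG = 15
STEP_DEG = 15

capture_arc = DELIVERABLE_DEG + 2 * PAD_DEG       # 210 degrees of recorded data
positions_per_row = capture_arc // STEP_DEG + 1   # 15 camera positions, fence-posted
print(capture_arc, positions_per_row)
```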

When we’re talking about cameras on all sides of the subject, it’s the same — it’s photogrammetry — you want coverage going all the way around. Cylinders do a really good job at this. The biggest issue with cylinders is that the undersides of the arms and the groin area end up much further away from any camera point, so you end up seeing a lot of issues in those areas. Even if you can do a full dome, you still run into those issues.

So, a lot of what I see always involves more cameras. When you get up to those 210° camera ranges, you’ll want some of them pointing upwards in certain areas. You have some that point in that cylindrical kind of pattern, and then you have these ones that point up under the arms and other problem shots that are in there. If you’re smart about it, you point everybody in a single direction and then you get the under-arm and under-groin shots, so you end up getting quality results. It’s kind of a lot of fun trying to figure this out and get those good quality results.

ALEX: The other way that we’ve actually combated this internally, here at MOD, is that we’ve created the opportunity to use the best of both worlds. So, photogrammetry in an A-pose or T-pose for the body itself, and then that body can be rigged or you can put mocap on it — there’s a lot of cool things you can do at that point — all kinds of animations or sequences. You can map it to an actual motion capture suit recording, and then use volumetric for the bust. The other benefit there is that volumetric for the bust is a significantly smaller physical footprint, and that gives you the opportunity to do that high fidelity facial and body capture, then create a combination technology that maximizes your output.

TIM: Definitely. And let’s be honest, people know how to deal with mocap a lot better, and with the way that we’re doing an all-in-one .fbx — where all the assets are in there at one time — it really is just a smart result for the output and for the use cases.

ALEX: We’re not going to go through all of this (pictured above) one-by-one — but this is also found in our capture guide. This is just a sort of an overview. General best practices. This pretty much goes across for almost all of them. Some of them are a little bit better — as you get into scanning, you’re going to capture those thin objects better, more effectively. Some of that shine and sheen is going to be less of a problem than it is with photogrammetry or volumetric video.

Again, on the camera specs, many of them are the same, right? If you use the same camera, it’s great. If you can minimize the extraneous features — the fisheye, the autofocus, the white balance — really just try to keep them the same across. It’s ideal.

One other difference in the way we intake capture data is that our processing solution is actually most functional when you do not have a blue, green, or white screen or set behind you. It works much more efficiently when we have points of data/points of interest behind and around the subject — whether it’s photogrammetric or volumetric video. We use machine learning and computer vision to do the camera calibration, background extraction, edge detection, and things like that, which are intrinsic to understanding the trigonometry, the depth, and where the subject ends and the world begins.
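
MOD’s actual pipeline isn’t public, but as a generic illustration of the kind of edge detection and background separation being described, here is a short OpenCV sketch on a hypothetical frame; a textured set gives steps like these real features to latch onto.

```python
import cv2

frame = cv2.imread("subject_cam01_f0001.png")  # hypothetical frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Edge detection: a busy, textured background produces stable, trackable edges.
edges = cv2.Canny(gray, threshold1=50, threshold2=150)

# Background separation: a subtractor like MOG2 is normally fed a frame sequence;
# a single frame is used here only to illustrate the call.
subtractor = cv2.createBackgroundSubtractorMOG2(history=120, detectShadows=False)
foreground_mask = subtractor.apply(frame)

cv2.imwrite("edges.png", edges)
cv2.imwrite("foreground_mask.png", foreground_mask)
```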

TIM: One of the big things that people tend to miss often is that they use bounce cards for flat lighting. What you end up getting out of that are white reflections and refractions in human skin. So, panel lighting is good as long as it’s not too hot. You can find some really good ones on Amazon — a pair for $50 or less now on the low-end — and get really good results, and they have battery packs in them. We like them. I’ve used them in a couple of shoots and they’re great, especially for traveling around and things like that.

DSLRs are really good in a lot of different areas. They do kind of have a limitation when it comes to volumetric video — and some of them can be quite loud when you do individual image captures — although, if you switch over to video, they still provide wonderful results. Something that’s in the middle are things like the RX0s, because they are specifically meant for small-use photos. Basically anything minus a GoPro, because GoPro lenses are pretty rough, being fisheye and whatnot.

And then of course, as we’ve covered before, no autofocus, and you definitely want to white balance your devices normally. We do have a solution now that automates a color passport. We also have one coming out right now that uses a neural network to create the same color tone across all the images. So, if you provide a white-balanced reference image in your upload that you want everything to look like, then you can do that and you don’t have to really worry about it. They do need to be processed afterwards in our systems, of course, but you can also do that on your own. It just really depends on where you’re going with that. But you definitely want the assets that go into processing to be white balanced — there’s just no replacement for the quality on that.
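
As a stand-in for a proper color passport or the neural-network pass Tim mentions, here is a bare-bones gray-world white balance in Python, illustrative only and not the MOD color pipeline.

```python
import cv2
import numpy as np

def gray_world(img_bgr: np.ndarray) -> np.ndarray:
    """Scale each channel so the image's average color lands on neutral gray."""
    img = img_bgr.astype(np.float32)
    means = img.reshape(-1, 3).mean(axis=0)   # per-channel mean
    gains = means.mean() / means              # push each channel toward the gray mean
    return np.clip(img * gains, 0, 255).astype(np.uint8)

balanced = gray_world(cv2.imread("frame_0001.png"))  # hypothetical frame
cv2.imwrite("frame_0001_wb.png", balanced)
```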

One thing that’s not necessarily on here, but is well-known, is the use of raw. In a lot of cases, you do get better results. In some cases you don’t. And that kind of just depends on how you’re doing that.

ALEX: So, how does MOD fit in? Why is this important to us? It’s important to us because capture is not our specialty. We are very familiar with capture — we understand the best practices, how to enable others to capture more effectively, and to increase their capabilities. But what we are is the processing solution.

We actually have distributed processing, so we are 98% faster in a lot of cases. We use automated systems: you drag and drop the imagery data into the project folder, it uploads over an encrypted connection, goes directly to our private secure cloud, we process to your specifications, and we deliver it back to you.

Really, the entire premise of this is to open up the ecosystem and the capability for people to do more functional things without having to have the infrastructure and be able to minimize the specialized staff that you have to have for a lot of these processes, and really drop that massive overtime. A lot of the things that we’re doing and processing are those really manual time intensive tasks that are best served by a machine.

ALEX: We have another resource aside from the capture guide and our website. If you scroll down on our homepage, the capture guide is available for you to download as a PDF. We also have an Intel article that was published a couple of years ago that is a highly technical piece. It talks about cost benefits and, more intensively, the details around how to execute these kinds of captures. You can find that at the link above. And we’re always here for you as a resource. We really are interested in sharing our knowledge and giving people more capabilities.

TIM: So thank you, everybody! You can find us at modtechlabs.com. Alex’s email is alex@modtechlabs.com and you can reach me at tim@modtechlabs.com. And with the code MODtalks, we are giving out $500 worth of free processing. So please go on our platform, sign up, use that activation code and have a little fun. See what you like and how you enjoy it. So, thank you all very much.
