There’s been a lot of interest lately in “cinematic VR”, and a lot of confusion about exactly how it works. Since I’ve had to answer the same questions multiple times, I thought I would put all the answers together into a blog post so I can refer people here rather than explaining it over and over. 🙂
There are two basic types of VR experiences.
The first type uses realtime, interactive 3D graphics to display a virtual world. The scene is represented by a large number of small triangles (polygons) and is typically displayed using a game engine such as Unity or Unreal. In this type of VR, you have complete freedom to move around, look wherever you like, and experience the entire world in full stereoscopic 3D. The disadvantage is that because the entire scene has to be rendered at anywhere from 60 to 120 frames per second, some sacrifices have to be made in terms of richness and detail.
The second type of VR experience uses either real-world video footage that’s been captured using a special camera rig, or pre-rendered computer graphics that typically have a lot more detail than could be rendered using a realtime game engine. It’s this second type that is the focus of this blog post.
The term “cinematic VR” is often used to describe this type of VR, since the technology is very similar to that of film. A set of cameras (real or virtual) is used to produce a linear experience that may or may not be viewed stereoscopically. The visual quality is generally much better than that of realtime VR, but the tradeoff is that the user has very limited (if any) freedom to interact with the experience.
In order to better understand the limitations of cinematic VR, let’s take a look at how it works. The simplest form of cinematic VR is a 360-degree spherical monoscopic video. The user is essentially surrounded by a virtual sphere onto which the video is projected. They can look all around (360 degrees horizontally) and all the way up and down (180 degrees vertically). There are many cameras that can be used to create this sort of video, including the Bublcam, the various 360heroes rigs, and more. For content distribution, there are a number of websites that aim to be the “YouTube of VR video”, and of course YouTube itself has recently added support for this kind of 360-degree monoscopic video.
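Under the hood, that spherical video is usually stored as an equirectangular frame (a longitude/latitude unwrap of the sphere). As a rough sketch of what a player does every frame, here’s how a unit view direction could be mapped to a pixel in such a frame. The coordinate conventions here are my own assumptions, not any particular player’s code:

```python
import math

def direction_to_equirect(dx, dy, dz, width, height):
    """Map a unit view direction to (u, v) pixel coordinates in an
    equirectangular frame. Assumed convention: -z is forward,
    +y is up, yaw increases to the right."""
    yaw = math.atan2(dx, -dz)                    # -pi .. pi, 0 = straight ahead
    pitch = math.asin(max(-1.0, min(1.0, dy)))   # -pi/2 .. pi/2
    u = (yaw / (2.0 * math.pi) + 0.5) * (width - 1)
    v = (0.5 - pitch / math.pi) * (height - 1)
    return u, v
```

Looking straight ahead lands in the middle of the frame, and looking straight up hits the top row. Note that the top and bottom rows each represent a single point on the sphere, which is why equirectangular video spends a disproportionate number of pixels near the poles.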
However, this approach has a huge limitation. With a monoscopic camera, what the user is seeing is still a flat image — there’s no depth to it. That’s fine if everything in the scene is far away, such as when the user is standing in the Grand Canyon or hang gliding over the Azores. However, as soon as you move to an indoor scene where objects are closer to the camera, the illusion is broken. It’s immediately clear that you’re just looking at a projection onto a sphere, not actually standing in a virtual environment.
That brings us to the next “level” of cinematic VR — stereoscopic viewing. Most people are somewhat familiar with how ordinary stereograms work — the idea is that you present a slightly different image to each eye, so that the brain can use the disparities between the images (parallax) to get a sense of depth. That’s the principle behind everything from a View-Master to a 3D movie.
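The geometry behind that depth cue is simple: for an idealized pinhole stereo pair, depth is inversely proportional to disparity. A minimal sketch (the units and the 64 mm camera spacing below are illustrative assumptions):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Classic pinhole stereo relation: Z = f * B / d.
    Nearer objects shift more between the two eyes' images."""
    if disparity_px <= 0:
        return float("inf")  # zero parallax -> no stereo depth cue
    return focal_px * baseline_m / disparity_px
```

With a focal length of 1000 pixels and cameras 64 mm apart, a 32-pixel disparity corresponds to a point 2 m away. As disparity approaches zero the depth cue vanishes, which is exactly why distant scenery works fine even in mono.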
Creating stereoscopic video from a single point of view is pretty straightforward — two cameras, side by side, give you a pair of images that you can display to the user to recreate the scene in three dimensions. The challenge is to combine this stereoscopic viewing with a full 360 x 180 spherical view. That brings us to…
Issue #1 — You need a special camera rig
A lot of people mistakenly assume that it should be as simple as taking two monoscopic spherical cameras (such as the Bublcam) and placing them side by side. That doesn’t work at all, and when you think about it for a bit you can see why. Each of the two cameras is capturing a spherical panorama from the location of the corresponding eye. However, when you look in different directions in the real world, your eyes don’t swivel through 360 degrees in their sockets — instead, they rotate around a central point (the top of your neck). That makes a huge difference.
To understand the difference, imagine that you have captured a 360 degree spherical image from each of two cameras separated by approximately the same distance as your eyes. When you’re looking straight ahead, everything is great — you see perfect stereo, as captured by the cameras. However, when you try looking over your right shoulder, the view you see from the two cameras no longer gives you any parallax — your view direction is now aligned with the line between the two camera positions, so both eyes are seeing essentially the same image. Any sense of stereo depth has completely disappeared.
Things get even worse when you rotate past 90 degrees. The sphere for the right eye is now showing the left-eye perspective, and the sphere for the left eye is now showing the right-eye perspective. Your brain has no idea what to do with this, and all you can do is close one eye and view the scene monoscopically. You can avoid this reversal by sticking to an upright hemisphere, 180 horizontal by 180 vertical, instead of trying to do a full 360 degrees around. Even so, you’ll only have stereo depth when looking straight ahead, and it will diminish to zero as you look towards the sides, top or bottom of the hemisphere. If that’s acceptable, you can simply use a pair of cameras with fisheye lenses to capture the scene.
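The falloff (and the reversal past 90 degrees) can be captured in a toy model: the baseline that matters for stereo is the component of the camera separation perpendicular to your view direction, which shrinks with the cosine of the viewing angle. This is a simplification of my own, assuming a horizontal look direction:

```python
import math

def effective_baseline(camera_sep_m, view_angle_deg):
    """Component of a fixed side-by-side camera separation that is
    perpendicular to the view direction. Positive = normal stereo,
    zero = no depth cue, negative = left/right eye views swapped."""
    return camera_sep_m * math.cos(math.radians(view_angle_deg))
```

At 0 degrees you get the full separation, at 90 degrees it drops to zero (no parallax), and at 180 degrees it is fully negative — the eye views are swapped, just as described above.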
The right way to do it for a still image is to rotate a pair of conventional cameras around a central point, taking a series of shots, and then stitch them together for each eye separately. Experience has shown that 30 to 40 rotation steps are enough to produce good results.
However, that obviously doesn’t work for video, which is why people are developing complicated rigs (such as IZugar, 360Heroes 3DH3PRO12H, Jaunt, Samsung’s Project Beyond and others). These rigs have multiple cameras arranged in a circle, each camera equipped with a wide-angle lens. The cameras can be grouped in pairs, and images from the left-eye cameras are stitched together separately from the images from the right-eye cameras. Because you typically only have half a dozen pairs of cameras (compared with the 30 to 40 rotation steps for the still-image case), there’ll be some quite noticeable stitching artifacts (seams).
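The reason fewer cameras means worse seams comes down to coverage: with N cameras per eye arranged in a circle, each camera has to cover at least 360/N degrees plus some overlap for blending, so a small rig forces very wide lenses and widely spaced stitch lines. A back-of-the-envelope sketch (the overlap figure is an assumption, not a rig spec):

```python
def required_fov_deg(num_cameras_per_eye, overlap_deg=15.0):
    """Minimum horizontal field of view each camera needs so that
    adjacent views overlap enough to blend at the stitch lines."""
    return 360.0 / num_cameras_per_eye + overlap_deg
```

Six cameras per eye need roughly 75-degree lenses, while the 30-to-40-step still-image approach gets away with much narrower slices and far more overlap, which is why its seams are so much cleaner.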
Issue #2 — Fixed IPD
The distance between your eyes, often referred to as the interpupillary distance or IPD, varies quite a bit from one person to another. However, in any stereoscopic camera rig the cameras will be a fixed distance apart. That means the world will look perfect for some people (the ones whose IPD is a close match for the camera spacing) but will look wrong for most others. There’s nothing to be done about this in software, since it’s purely the result of a mismatch between two physical distances (the spacing of the cameras and the IPD of the viewer).
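Under a small-angle approximation, the effect of the mismatch is a uniform rescaling of the world: every perceived distance gets multiplied by the ratio of the viewer’s IPD to the camera spacing. This is a rough model of my own, ignoring screen geometry and lens distortion:

```python
def perceived_distance(true_distance_m, camera_sep_m, viewer_ipd_m):
    """Small-angle model: a viewer whose IPD differs from the camera
    spacing sees every depth rescaled by viewer_ipd / camera_sep."""
    return true_distance_m * viewer_ipd_m / camera_sep_m
```

A rig spaced at 64 mm viewed by someone with a 58 mm IPD makes a 2 m object appear at roughly 1.81 m — the whole scene feels slightly miniaturized, and no software setting can undo it because the disparities are already baked into the footage.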
Issue #3 — You can’t tilt your head
In this case, by “tilt your head”, I mean tip your head over onto your left or right shoulder (what more accurately would be called “rolling” your head). You can look all around you, and you can look up and down (with the caveat that you lose stereo depth, as described above), but if you tip your head you lose the illusion completely. Again, it’s easy to see why — the cameras were side by side when the footage was shot, not at an angle, and if your eyes are at an angle it won’t look right. In fact, cinematic VR playback systems usually ignore rolling altogether, and only pay attention to pitch and yaw (i.e. looking up and down and all around).
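In code, “ignoring rolling” just means building the view rotation from yaw and pitch alone. Here’s a hypothetical sketch of what a player’s orientation filter might look like (row-major 3x3 matrix, angles in radians — not any particular player’s actual code):

```python
import math

def playback_rotation(yaw, pitch, roll):
    """View rotation for cinematic VR playback: yaw and pitch are
    honored, roll is deliberately discarded because the footage was
    shot with the cameras level. Returns R = R_yaw @ R_pitch."""
    del roll  # ignored on purpose
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    return [
        [cy,  sy * sp, sy * cp],
        [0.0, cp,      -sp],
        [-sy, cy * sp, cy * cp],
    ]
```

Tipping your head changes the headset’s reported roll, but the rendered view doesn’t budge — the stereo pair simply can’t be re-derived for rolled eyes.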
And speaking of ignoring inputs…
Issue #4 — You can’t move your head
This is perhaps the biggest limitation of cinematic VR. Since the footage was only shot from one specific location, it can only be viewed from one location. To understand this, imagine that the camera rig was set up in an office, facing an open door leading to the hallway. Now imagine that in the hall, just a bit to the left of the doorway, is a coat rack. Also imagine that just a bit to the right of the doorway is a chair.
If the camera rig is facing the door, it will see the hallway and neither the coat rack nor the chair will be visible. If you were actually sitting where the camera was positioned, you could shift over a bit to your right to see the coat rack, or a bit to your left to see the chair. However, when viewing the scene in VR, that doesn’t work — no matter how much you move your head left or right, you’ll never see the coat rack or the chair, since they were never visible to the camera. Cameras can’t see around corners, and no amount of clever software will let you see something that wasn’t recorded!
If you try to move your head in a cinematic VR experience, even a little bit, all that will happen is that the image will be distorted. The further your head moves away from the location of the camera, the more distortion there’ll be. That’s why cinematic VR players ignore positional input, since there’s no way to make use of it.
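The size of that distortion is easy to estimate with a toy model: an object at distance d, viewed from a point offset by x from the capture position, appears roughly atan(x/d) away from where your visual system expects it. (My own back-of-the-envelope model, not a rendering simulation.)

```python
import math

def positional_error_deg(head_offset_m, object_distance_m):
    """Rough angular error when your head is `head_offset_m` away from
    the capture point: how far off (in degrees) an object at the given
    distance appears from where it should be."""
    return math.degrees(math.atan2(head_offset_m, object_distance_m))
```

Leaning 10 cm sideways puts an object 1 m away about 5.7 degrees off from where it should be, which your brain reads as the world smearing or swimming. For distant scenery the error is negligible — another reason landscapes tolerate cinematic VR far better than interiors.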
Are there ways around these limitations? Not using current approaches, no. However, it’s possible to get around all of these problems using lightfields — but that’s a technology that won’t be available for a while, and it will be a topic for another blog post.