Overview of Multi-View Image Technology

Kensuke Hisatomi

A real-space three-dimensional model can be generated from multi-view images filmed by a number of cameras. We believe that such three-dimensional models can be used to produce three-dimensional images, since they can be handled with a high degree of freedom within a virtual space. In this article, we classify methods of generating a three-dimensional model from multi-view images into multi-view based and stereo-view based approaches, and explain each method. We then describe a method, based on a ray tracing algorithm, of generating integral 3D images from the three-dimensional model.

1. Introduction

NHK is conducting research into integral 3D TV, which does not require special glasses and uses the principles of integral photography (IP). Filming for integral 3D TV generally involves using a high-resolution camera to film through a lens array formed of a large number of micro lenses arranged in two dimensions. A large number of small images (elemental images), each corresponding to one micro lens, are captured through the lens array placed in front of the imager. Each elemental image is a record of light rays that are emitted in different directions from the subject and pass through the corresponding micro lens. If we show these elemental images on a high-resolution, two-dimensional display and place a lens array on its surface, the light ray emitted from a pixel in an elemental image travels only along the straight line through the optical principal point*1 of the corresponding micro lens. As a result, different light rays reach the right and left eyes, giving a sense of depth, because the light rays are reproduced in much the same way as in real space.

However, if this technique is used to film a distant subject, the integral 3D image is displayed flat, without sufficient depth, since only light rays passing through the lens array can be captured and the lens array becomes relatively small as the target subject becomes larger. In this case, although the display faithfully reproduces the "light-ray space" of the real scene, the resulting image will not have sufficient depth when the subject is zoomed in on in three-dimensional space, in the same way that zooming is used in two-dimensional TV broadcasting.

One way of overcoming this problem is to generate the elemental images from multi-view images filmed by a number of cameras1). To generate the elemental images with this technique, we first generate a three-dimensional model of the subject from the multi-view images, then place that model together with a lens array and a display in a virtual space within the computer, and obtain the value that ought to be shown on each pixel of the display by using a ray tracing algorithm. A model generated in this way has the advantage that a wide angle can be covered by fewer cameras, because light rays between the cameras can be acquired by interpolation. Doing the calculations in a virtual space also makes it possible to generate images of large-scale or distant subjects that convey sufficient depth, since the lens array and imaging elements can be set freely and it is possible to configure a large lens array that would be difficult to set up in real space.

Below, we describe techniques for generating a three-dimensional model from multi-view images and a method of generating integral three-dimensional images from that model.

2. Generation of three-dimensional model from multi-view images

The techniques of generating three-dimensional models from multi-view images can be roughly divided into multi-view based approaches, which generate a model all at once by using all of the camera images, and stereo-view based approaches, which integrate models generated sequentially from pairs of cameras.

2.1 Multi-view based approach

An example of the multi-view based approach is the volume intersection method, also known as the silhouette method. Here, a binary image called a silhouette image (in which the silhouette of the subject is represented by black and the other regions by white) is made from each camera image and used to generate a three-dimensional model. The three-dimensional space of the filming region (a region in which the subject could exist) is divided into small cubes (voxels) arranged at regular intervals, as shown in Fig. 1. If a voxel in the interior of the subject is projected onto the silhouette images, the projected point falls within the black region in every silhouette image. Using this principle, the voxels are projected in sequence onto the silhouette images. If the projected point falls outside the black region in even one image, that voxel is deleted; if it falls within the black region in all of the silhouette images, the voxel is retained (the yellow region in Fig. 1). Extracting the retained region as a three-dimensional model completes the volume intersection method. Since indentations in the subject are not reflected in the silhouettes, this technique cannot in principle reconstruct them, which limits its accuracy. However, it yields a comparatively stable three-dimensional model, so it is often used as the initial shape for generating a more accurate model.
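As a rough Python sketch of this carving procedure (the function name, the boolean silhouette representation, and inputs such as the 3x4 projection matrices and the voxel grid are our own assumptions, not part of the method as published):

```python
import numpy as np

def carve_voxels(silhouettes, projections, grid_points):
    """Volume intersection (shape-from-silhouette).

    silhouettes : list of HxW boolean arrays, True inside the subject
                  (the "black region" in the text).
    projections : list of 3x4 camera projection matrices (world -> pixel).
    grid_points : Nx3 array of voxel centre coordinates.
    Returns a boolean mask of the voxels kept (the visual hull).
    """
    homog = np.hstack([grid_points, np.ones((len(grid_points), 1))])  # Nx4
    keep = np.ones(len(grid_points), dtype=bool)
    for sil, P in zip(silhouettes, projections):
        uvw = homog @ P.T                       # project all voxels at once
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(grid_points), dtype=bool)
        hit[inside] = sil[v[inside], u[inside]]
        keep &= hit          # delete any voxel that falls outside a silhouette
    return keep
```

Each voxel is tested against every silhouette, and the surviving set approximates the visual hull of the subject.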

In addition, since it requires few parameter adjustments and a comparatively small amount of computation, the method is suitable for practical applications. An example of an application using a human model generated by the volume intersection method is the digital extra in a TV drama (Fig. 2). We filmed two actors with 24 HD cameras and used the volume intersection method to generate three-dimensional human models from these images, as shown in Fig. 2 (a). We generated a crowd scene of several hundred people by copying and positioning these models2). A test case of a crowd scene produced by duplicating one such digital extra is shown in Fig. 2 (b), and an example of a similar scene in an actual TV drama is shown in Fig. 2 (c).

A technique that is similar to the volume intersection method is voxel coloring3). In this technique, each voxel is projected onto the cameras from which it is visible and the pixel values at the projected positions are acquired. The variance of these pixel values is then calculated. On the assumption that a voxel on the model surface shows a small variance, voxels with a low variance are retained and those with a large variance are deleted to generate the three-dimensional model. With this method it is possible to reconstruct indentations, but maintaining the continuity of the surface is a challenge, since each voxel is deleted or retained independently. A method has been proposed that moves the voxels on the surface of the initial shape obtained by the volume intersection method towards the interior, searching for positions that reduce the projection error (the sum of the differences between the pixel values of two visible cameras at the projected positions) while maintaining continuity with the surrounding voxels4). Fig. 3 shows a three-dimensional model of a Noh (classical Japanese musical drama) actor that was generated with this technique from 40 multi-view images for a three-dimensional archive of Noh dance9).
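The core photo-consistency test of voxel coloring can be sketched in Python as follows. Note that the occlusion-aware visibility computation, which is the subtle part of the full algorithm, is assumed here to be given, and all names and the threshold tau are illustrative assumptions:

```python
import numpy as np

def voxel_coloring_pass(grid_points, images, projections, visible, tau=100.0):
    """One photo-consistency sweep of voxel coloring (simplified sketch).

    images      : list of HxWx3 float arrays.
    projections : list of 3x4 camera matrices.
    visible     : NxC boolean array, True if voxel n is visible in camera c.
    tau         : variance threshold (an assumed tuning parameter).
    Returns the indices of voxels kept as photo-consistent.
    """
    keep = []
    for n, X in enumerate(grid_points):
        samples = []
        for c, (img, P) in enumerate(zip(images, projections)):
            if not visible[n, c]:
                continue
            u, v, w = P @ np.append(X, 1.0)
            x, y = int(round(u / w)), int(round(v / w))
            if 0 <= y < img.shape[0] and 0 <= x < img.shape[1]:
                samples.append(img[y, x])
        # a voxel on the true surface should look similar from all cameras,
        # so keep it only when the colour variance across cameras is small
        if len(samples) >= 2 and np.var(np.asarray(samples), axis=0).sum() < tau:
            keep.append(n)
    return keep
```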

Recently, attempts have been made to obtain an initial shape by the volume intersection method, project the voxels in the vicinity of its surface onto multiple cameras to obtain projection errors, and then improve the accuracy of the surface shape by applying mathematical optimization to those errors. Fig. 4 shows a three-dimensional model generated by a two-partition optimization technique*2 called graph cuts5)6). The surface shape can be reconstructed accurately from 24 cameras, fewer than the 40 used for Fig. 3.
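As an illustration of the two-partition step only, the following sketch labels each voxel as inside or outside by a minimum cut. It assumes the PyMaxflow package, assumes that per-voxel data costs (e.g. derived from projection errors) and the voxel adjacency have already been computed, and simplifies the edge weighting; it is not the specific formulation of references 5) and 6):

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow

def segment_voxels(data_cost, neighbours, smooth_w=1.0):
    """Label voxels as inside/outside the object by a minimum graph cut.

    data_cost  : Nx2 array; cost of labelling voxel n outside (column 0)
                 or inside (column 1).
    neighbours : list of (i, j) index pairs of adjacent voxels.
    smooth_w   : weight of the smoothness term (assumed parameter).
    Returns a boolean array, True where the voxel is labelled inside.
    """
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(len(data_cost))
    for i, j in neighbours:
        # cutting between adjacent voxels costs smooth_w in both directions,
        # which discourages ragged surfaces and maintains continuity
        g.add_edge(nodes[i], nodes[j], smooth_w, smooth_w)
    for n, (c_out, c_in) in enumerate(data_cost):
        # source side = inside, sink side = outside
        g.add_tedge(nodes[n], c_out, c_in)
    g.maxflow()
    return np.array([g.get_segment(n) == 0 for n in nodes])
```

The smoothness edges make it expensive for the cut to pass between neighbouring voxels, which is what enforces the surface continuity mentioned above.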

Figure 1: Volume intersection method
Figure 2: Digital extras
Figure 3: 3D model generation supported by regional surface feature
Figure 4: 3D model generation using graph cuts

2.2 Stereo-view based approach

The technique of generating a three-dimensional model from a stereo image selects two cameras from the multiple cameras, determines how the pixels in one image correspond to those in the other, and estimates the depth of the subject by the principle of triangulation. For example, when two cameras are arranged in parallel as shown in Fig. 5, if we find the pixel q in the right image that corresponds to a pixel p in the left image, we can obtain the depth z from the position coordinates of p and q by using the similarity of the blue-framed triangles, as follows:

z = fB / (xL − xR)

where f is the focal length, B is the baseline length between the cameras, and xL and xR are the horizontal coordinates of p and q, respectively. If this is done for all the pixels in the left image, a depth map, in which the depth values are arranged in two dimensions, can be acquired. Since such a depth map is an array of distances from the filming position of the left image, it contains no three-dimensional information on regions, such as the side and rear surfaces, that cannot be seen from that position. For that reason, even if the depth map is converted into a three-dimensional shape, a dropout region with no three-dimensional information will appear as a "hole" in the image when the viewpoint moves away from the filming position of the left image. This means that a number of depth maps must be integrated to make a three-dimensional model that includes the side surfaces.
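As a check of the formula above, here is a minimal Python sketch, assuming the per-pixel correspondences (the disparity map xL − xR) have already been found by stereo matching:

```python
import numpy as np

def depth_from_disparity(disparity, f, B):
    """Triangulate depth for a parallel stereo pair (cf. Fig. 5).

    disparity : HxW array of xL - xR in pixels, from stereo matching.
    f         : focal length in pixels.
    B         : baseline length between the cameras (e.g. in metres).
    Returns an HxW depth map computed as z = f*B / (xL - xR); pixels
    with zero or negative disparity are marked invalid (NaN).
    """
    disparity = np.asarray(disparity, dtype=float)
    z = np.full(disparity.shape, np.nan)
    valid = disparity > 0
    z[valid] = f * B / disparity[valid]
    return z
```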

Camera placement also tends to differ between the multi-view and stereo-view methods. With the multi-view method, the distance between cameras is comparatively long, and filming is often done from the entire periphery or from a comparatively large number of directions. With the stereo-view method, on the other hand, the distance between the two cameras is comparatively short, and it is not always necessary to film from many directions. In addition, since the stereo-view method can estimate depth from just two cameras, such a system often needs fewer cameras than the multi-view method does.

There are various other ways of generating three-dimensional models, such as Structure from Motion7), which also estimates the camera parameters, and Shape from Shading8), which estimates the orientation of surfaces from their shading. The best choice of model generation technique depends on the filming method, the filming environment, and the intended use.

Figure 5: Depth estimation by stereo-view

3. Generation of integral three-dimensional image

A three-dimensional model generated by the techniques described in Section 2 enables us to make elemental images much more freely in a virtual space on a computer than in real space. In this section, we show how to position components such as the model and the lens array within the virtual space when generating elemental images, and then describe the process of generating them.

Suppose that a lens array serving as the reference is placed within the virtual space and the three-dimensional model of the subject is placed in its vicinity, as shown in Fig. 6. The display surface for the elemental images is placed at a distance from the lens array equal to the focal length of its micro lenses. The three-dimensional model can be placed behind the lens array, in front of it, or even straddling it.

Elemental images can be generated by executing a ray tracing algorithm for each pixel of the display. In other words, we draw a straight line from each pixel through the optical principal point of the micro lens closest to that pixel. Of the points where this straight line intersects the model, we assign to the pixel the value of the intersection closest to the viewer.
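In outline, the procedure is a simple ray caster. The following Python sketch is our own illustration: the function names, the flat-array representation of the display and lens array, and the intersect routine (which is assumed to return the colour of the nearest ray-model intersection) are all assumptions, not the published implementation:

```python
import numpy as np

def render_elemental_images(pixels, lens_centres, intersect):
    """Generate elemental images by ray tracing (simplified sketch).

    pixels       : Nx3 array of pixel positions on the display surface.
    lens_centres : Mx3 array of optical principal points of the micro lenses.
    intersect    : function(origin, direction) -> colour of the nearest
                   intersection with the 3D model, or a background colour;
                   assumed to be supplied by the renderer.
    Returns an Nx3 array of pixel colours (the elemental images).
    """
    colours = np.zeros((len(pixels), 3))
    for n, p in enumerate(pixels):
        # each pixel pairs with the micro lens closest to it
        lens = lens_centres[np.argmin(np.linalg.norm(lens_centres - p, axis=1))]
        direction = lens - p
        direction /= np.linalg.norm(direction)
        # trace the ray from the pixel through the lens principal point
        colours[n] = intersect(p, direction)
    return colours
```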

Fig. 7 shows elemental images generated from a three-dimensional model of a Noh actor. If we show these elemental images on a high-resolution display and observe them through the lens array, a three-dimensional image appears that changes with the viewpoint. As shown in Fig. 8, the positional relationship between the Noh actor in the foreground and the pine tree in the background varies with the viewpoint. Since the three-dimensional model was generated with the multi-view technique described in Section 2, it does not include a background; the Noh stage that forms the background in the figure was generated by computer graphics (CG).

3D TV broadcasting will require three-dimensional images that include real-space backgrounds. Surround filming is undesirable for this because the cameras would capture one another, so it will be necessary to generate a three-dimensional model, including the background, from images filmed from one side, similar to the traditional camera placement in broadcasting. In such a case, it would be difficult to use the multi-view technique, which presupposes images filmed from all directions.

Figure 6: Method of generating elemental images
(showing horizontal section)
Figure 7: Elemental images
Figure 8: Integral three-dimensional image (retakes)

4. Conclusion

We described techniques for generating three-dimensional models by using the multi-view and stereo-view based approaches. As an example of an application of such models, we described a method of generating integral three-dimensional images that enable three-dimensional viewing without glasses. Our techniques should make the conditions for filming integral three-dimensional images less restrictive and enable the filming of more varied 3D content.