NHK Laboratories Note No. 487


Atsushi Matsui, Simon Clippingdale, Fumiki Uzawa*, and Takashi Matsumoto**

Intelligent Information Processing, NHK Science & Technical Research Laboratories
*NHK Okayama Broadcasting Station
**Dept. of Electrical Engineering & Bioscience, Waseda University


A new algorithm is proposed for face recognition within a Bayesian framework. Posterior distributions are computed by Markov chain Monte Carlo (MCMC). The face features are those used in our previous work [1][2], based on the Elastic Graph Matching method. Whereas our previous method optimizes facial feature point positions so as to maximize a similarity function between each model and the face region in the input sequence, the proposed approach evaluates posterior distributions of models conditioned on the input sequence. Experimental results show a marked improvement in robustness: the proposed algorithm eliminates almost all identification errors on sequences showing individuals talking, and reduces identification errors by more than 90% on sequences showing individuals smiling, although no smiling data was used in training.

To appear in Proceedings of ICPR2004, Cambridge, 23-26 August 2004

Human faces in broadcast video exhibit substantial variation in position, size, head pose, facial expression and so on, forcing face recognition systems for video indexing to incorporate flexibility in the database and/or matching algorithms used. The authors have introduced a prototype recognition system[1][2] which uses deformable template matching and is based on the Elastic Graph Matching method[3][4]. Although this system can absorb a certain amount of facial deformation due to expressions and speech movements, recognition errors can occur for larger deformations, and additionally there are a number of system parameters which are set in a heuristic fashion.

In this work, we introduce a probabilistic Bayesian approach which estimates posterior probabilities for each template, conditioned on the input sequence, and uses a Markov chain Monte Carlo (MCMC) method to sample the combined space of system parameters and template deformations. We show that this approach achieves superior recognition results, including cases where input sequences contain speech movements and facial expressions.

In section 2 we briefly review the deformable template matching procedure and similarity function used in our original system. In section 3 we introduce the new Bayesian approach, and show how the most probable model can be estimated together with system parameters. Experimental results are shown in section 4, and the paper concludes with a discussion of the results and possible directions for further work.

The deformable templates used in our original system [1][2] are constructed from face images of target individuals at multiple poses, labeled with feature point positions. Each template consists of normalized feature point coordinates together with features computed by convolutions with Gabor wavelets at each of the feature points. The Gabor wavelet at resolution r and orientation n is a sinusoidal grating patch with a 2-D Gaussian envelope:


with 2-D spatial frequency given by


for N_orns = 8 orientations n = 0, ..., 7 and N_res = 5 resolutions r = 0, ..., 4.
This data representation is similar to that used in the Elastic Graph Matching system[3][4] for face recognition in static images, but the chosen feature points differ, as do the parameters of the Gabor wavelets.
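As an illustration of the kind of filter bank described above, the following minimal Python sketch evaluates a complex Gabor wavelet (a plane-wave carrier under an isotropic 2-D Gaussian envelope) for 8 orientations and 5 resolutions. The parametrization, including the constants `K_MAX`, `SPACING` and `SIGMA` and all function names, is an assumption modelled on a form common in the Gabor-jet literature; it is not the parametrization of this paper, whose wavelet parameters differ from those of Elastic Graph Matching.

```python
import cmath
import math

N_ORNS = 8   # number of orientations, n = 0, ..., 7
N_RES = 5    # number of resolutions, r = 0, ..., 4

# Hypothetical constants, chosen only for illustration.
K_MAX = math.pi / 2.0      # spatial frequency magnitude at the finest resolution
SPACING = math.sqrt(2.0)   # frequency ratio between adjacent resolutions
SIGMA = 2.0 * math.pi      # envelope width relative to the wavelength


def spatial_frequency(r, n):
    """Return the 2-D spatial frequency (kx, ky) at resolution r, orientation n."""
    k = K_MAX / SPACING ** r
    theta = math.pi * n / N_ORNS
    return k * math.cos(theta), k * math.sin(theta)


def gabor(x, y, r, n):
    """Complex Gabor wavelet value at pixel offset (x, y)."""
    kx, ky = spatial_frequency(r, n)
    k2 = kx * kx + ky * ky
    # 2-D Gaussian envelope whose width scales with the wavelength.
    envelope = (k2 / SIGMA ** 2) * math.exp(-k2 * (x * x + y * y) / (2.0 * SIGMA ** 2))
    # Sinusoidal grating (complex plane wave) along the chosen orientation.
    carrier = cmath.exp(1j * (kx * x + ky * y))
    return envelope * carrier


def filter_bank_indices():
    """All (resolution, orientation) pairs in the bank."""
    return [(r, n) for r in range(N_RES) for n in range(N_ORNS)]
```

Under these assumptions the bank contains 40 wavelets per feature point, and convolving an image with each of them at a feature point yields one complex coefficient per (r, n) pair.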
The original scheme[1][2] applies templates to input video frames and deforms them by shifting the feature points so as to maximize the similarity to the Gabor features in the template. It then computes an overall match score for each deformed template, incorporating a penalty related to the deformation as follows:


where A denotes the undeformed template and B the deformed feature points on the image; cA and cB are feature vectors of Gabor wavelet coefficients from the template and from the deformed feature point positions on the image; EA,B is the deformation energy between the feature points in the template and the deformed feature points on the image, up to a dilation, rotation and shift; are weights for the feature similarity and spatial deformation terms; and  is the modulation wavelength of the Gabor wavelet at resolution r.
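A deformation energy defined "up to a dilation, rotation and shift" can be read as the residual that remains after the best similarity alignment of the two point sets. The sketch below shows one way to compute such a residual, treating 2-D points as complex numbers, and combines it with a cosine feature similarity into a penalized score. The function names, the cosine similarity, and the weights `w_feat` and `w_def` are hypothetical placeholders; the actual similarity function (3) may differ in form and normalization.

```python
import cmath
import math


def deformation_energy(template_pts, deformed_pts):
    """Squared residual after the least-squares similarity alignment."""
    if len(template_pts) != len(deformed_pts) or len(template_pts) < 2:
        raise ValueError("need two point sets of equal size (at least 2 points)")
    a = [complex(x, y) for x, y in template_pts]
    b = [complex(x, y) for x, y in deformed_pts]
    # Removing the centroids accounts for the shift.
    ca, cb = sum(a) / len(a), sum(b) / len(b)
    a0 = [p - ca for p in a]
    b0 = [q - cb for q in b]
    denom = sum(abs(p) ** 2 for p in a0)
    if denom == 0.0:
        raise ValueError("template points must not all coincide")
    # One complex factor s encodes the optimal dilation |s| and rotation arg(s).
    s = sum(p.conjugate() * q for p, q in zip(a0, b0)) / denom
    return sum(abs(q - s * p) ** 2 for p, q in zip(a0, b0))


def cosine_similarity(c_a, c_b):
    """Normalized dot product of two real feature vectors."""
    dot = sum(u * v for u, v in zip(c_a, c_b))
    na = math.sqrt(sum(u * u for u in c_a))
    nb = math.sqrt(sum(v * v for v in c_b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (na * nb)


def match_score(c_a, c_b, template_pts, deformed_pts, w_feat=1.0, w_def=0.1):
    """Feature similarity minus a weighted deformation penalty (hypothetical form)."""
    return (w_feat * cosine_similarity(c_a, c_b)
            - w_def * deformation_energy(template_pts, deformed_pts))


def similarity_transform(pts, scale, angle, shift):
    """Helper for the example: dilate, rotate and shift a point set."""
    s = scale * cmath.exp(1j * angle)
    out = [s * complex(x, y) + complex(*shift) for x, y in pts]
    return [(z.real, z.imag) for z in out]


template = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
moved = similarity_transform(template, 1.5, math.radians(30.0), (3.0, -2.0))
bent = [(0.0, 0.0), (2.0, 0.0), (2.5, 1.0), (0.0, 1.0)]
energy_moved = deformation_energy(template, moved)   # close to zero
energy_bent = deformation_energy(template, bent)     # strictly positive
```

A pure dilation, rotation and shift of the template incurs essentially no penalty, while displacing a single feature point does, which is the behaviour the penalty term is meant to provide.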


3.1. Likelihood

In general, finding the global maximum of the similarity function (3) is difficult, and prone to falling into local maxima. Moreover, the optimal value of each parameter often depends on the other variable parameters and fixed system parameters, and good results may be achieved only if all parameters are optimized with respect to the input data. But generalizing the optimization to unknown input data may not be feasible without some strong constraints or assumptions on the data. A Bayesian approach, however, offers a principled way of tackling this problem.
Within a Bayesian framework, the feature similarity term in (3) can be used to define a likelihood function:


where D denotes observed data, x represents a set of feature points at which the features are measured, β is a parameter, H is a model (hypothesis or template) and Z_β(x, β) is a normalizing factor.
3.2. Prior Distribution for Feature Points

We consider a prior distribution for the feature points and derive a posterior distribution with the Bayesian framework. The penalty term in (3) can be utilized to formulate a prior distribution. Suppose that we are provided with a set of sample feature points x. We assume the following Gaussian prior distribution for x:



and Z_α is a normalizing factor:


where NEP is the number of feature points. Figure 1 shows a sample from a set of images consisting of 7 facial expressions (6 basic facial expressions[5] + 1 neutral) posed by 3 Japanese actors and 5 Japanese actresses. The feature points between the eyes and on the upper lip are used for normalization, and hence have zero variance.

3.3. Joint Posterior Probability of Parameters

Bayes' formula gives the joint posterior distribution of x, α, and β given the observed data D:


Now let Dtrain denote the training data and d the test data. The proposed algorithm seeks





with model prior distribution P(H).
Our algorithm consists of two steps: a training phase, which computes the parameter posterior given by (8), and a recognition phase, which estimates P(H | d, Dtrain) given by (10) and selects the H for which (10) is maximum.
3.4. Video Sequence Data

This paper considers the situation where the test data is given as a video sequence. Let dn be the image data at the nth frame and let Dn-1 = {d1, d2, ..., dn-1} be the image data set up to the previous frame. At each frame n, one can consider P(H | Dn-1, Dtrain) to be the model prior distribution in place of P(H | Dtrain) in (10). Thus, (9) in the present setting gives:






(Dn-1 is absent from the right-hand side of (14) because the test data sets are not used for computing parameter posteriors.) When no information is available about the probability of each model H before the first observation, d1, arrives, it is natural to assume a uniform prior for every H:

$$P(H) = \frac{1}{N_{\mathrm{persons}}}$$
where Npersons is the number of registered persons.
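The frame-by-frame use of the previous posterior as the next prior can be sketched as a short recursive update. In this minimal Python example the per-frame likelihood values are made-up numbers and the function names are hypothetical; in the actual system the likelihood of each frame under each model would come from the integration over feature points and parameters.

```python
def uniform_prior(models):
    """Uniform prior 1 / N_persons over the registered models."""
    return {h: 1.0 / len(models) for h in models}


def update_posterior(prior, frame_likelihood):
    """One Bayes step: posterior is proportional to likelihood times prior."""
    unnormalized = {h: prior[h] * frame_likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total <= 0.0:
        raise ValueError("all models have zero probability for this frame")
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical likelihoods P(d_n | H) for two frames and three persons.
frames = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.5, "B": 0.4, "C": 0.1},
]

posterior = uniform_prior(["A", "B", "C"])
for likelihood in frames:
    # The posterior after frame n-1 serves as the prior at frame n.
    posterior = update_posterior(posterior, likelihood)

best_model = max(posterior, key=posterior.get)
```

Because each step renormalizes, evidence accumulates multiplicatively over the sequence, so a model that is only slightly favoured in every frame can still dominate after many frames.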
3.5. MCMC Sampler

The proposed algorithm uses Markov chain Monte Carlo (MCMC) to carry out the integration defined by (14). We draw Q samples


so that we have


Thus the posterior probability for each H at the nth observation of a series of test images {d1,...,dn} can be estimated as follows:


Figure 2 shows an example of the spatial distribution of feature point locations x(q) drawn by the Metropolis-Hastings method[6] with Gaussian proposal densities.
Figure 2. Example of the distribution of x(q) and training data Dtrain.
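The sampling step can be illustrated with a random-walk Metropolis-Hastings sampler using Gaussian proposals, followed by a Monte Carlo average over the drawn samples. In this sketch the target is a toy 2-D Gaussian standing in for the posterior of a single feature point; the target, step size, burn-in length and all function names are assumptions for illustration, not the posterior or settings used in the experiments.

```python
import math
import random


def make_log_target(mean, std):
    """Unnormalized log-density of an isotropic 2-D Gaussian (toy posterior)."""
    def log_p(x, y):
        return -((x - mean[0]) ** 2 + (y - mean[1]) ** 2) / (2.0 * std * std)
    return log_p


def metropolis_hastings(log_p, start, n_samples, step=1.0, burn_in=500, seed=0):
    """Draw n_samples points after burn-in using Gaussian random-walk proposals."""
    rng = random.Random(seed)
    x, y = start
    current = log_p(x, y)
    samples = []
    for i in range(burn_in + n_samples):
        xp = x + rng.gauss(0.0, step)
        yp = y + rng.gauss(0.0, step)
        proposed = log_p(xp, yp)
        # The proposal is symmetric, so the acceptance ratio is p(x') / p(x).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            x, y, current = xp, yp, proposed
        if i >= burn_in:
            samples.append((x, y))
    return samples


def monte_carlo_mean(f, samples):
    """Approximate an expectation by the average of f over the samples."""
    return sum(f(x, y) for x, y in samples) / len(samples)


samples = metropolis_hastings(make_log_target((3.0, -1.0), 1.0), (0.0, 0.0), 5000)
mean_x = monte_carlo_mean(lambda x, y: x, samples)
mean_y = monte_carlo_mean(lambda x, y: y, samples)
```

Only ratios of the target density enter the acceptance test, so the normalizing constant of the posterior never has to be evaluated during sampling.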


Table 1 shows recognition results for three different input test conditions: neutral, talking and smiling. Template images (with neutral expressions) and test images showed 8 males and 2 females in frontal pose against a neutral background. For the smiling test data, each individual varied their facial expression from "neutral" to "happiness" while being recorded by a video camera. For comparison, we tuned the parameters of the original system using all test images. For the Bayesian MCMC system, we used a sample of 20 images from the "talking" test data for training. We assumed that face regions were pre-detected so that the center position and radius of the face region were already available.
  In formulating a hierarchical Bayesian model as described here, one way of defining prior distributions is to assign to the top-level hyperparameters prior distributions that are vague[7] without being improper. Gamma priors for hyperparameters associated with Gaussian distributions are often successfully applied. In our experiments, we used


which gives a reasonably vague Gamma distribution. The parameter β was fixed at 5.0 (see below).

Table 1. Face recognition results (ID error rate, 2,000 samples/person)
Test data     Original system      Bayesian MCMC
Neutral       4.0% (32 / 796)      0.0% (0 / 796)
Talking       8.2% (121 / 1483)    0.1% (1 / 1483)
Smiling       37.4% (461 / 1233)   2.6% (32 / 1233)
TOTAL         17.5% (614 / 3512)   0.9% (33 / 3512)
proc. time*   3 min.               50 min.

To evaluate the sensitivity of the approach to errors at the preceding face detection stage, we performed a numerical simulation of a noisy face detector. Tables 2, 3 and 4 show the recognition results when face regions were dilated, horizontally shifted and vertically shifted, respectively. To reduce the total processing time, we kept only every tenth MCMC sample.
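A noisy detector of this kind can be simulated by perturbing the pre-detected face region before recognition. The sketch below assumes a face region described by a center and radius in pixels, with offsets taken from the column headers of Tables 2 to 4; the class and function names are hypothetical, and the thinning helper simply keeps one MCMC sample in ten.

```python
from dataclasses import dataclass

# Offsets applied to the detected region (assumed to be in pixels).
OFFSETS = (-4, -2, -1, 0, 1, 2, 4)


@dataclass(frozen=True)
class FaceRegion:
    cx: float      # center x
    cy: float      # center y
    radius: float  # face region radius


def perturb(region, d_radius=0.0, dx=0.0, dy=0.0):
    """Return a copy of the region with a dilation and/or shift error applied."""
    return FaceRegion(region.cx + dx, region.cy + dy, region.radius + d_radius)


def thin(samples, keep_every=10):
    """Keep every keep_every-th MCMC sample to cut processing time."""
    return samples[::keep_every]


detected = FaceRegion(cx=160.0, cy=120.0, radius=50.0)
dilated = [perturb(detected, d_radius=o) for o in OFFSETS]   # dilation errors
x_shifted = [perturb(detected, dx=o) for o in OFFSETS]       # horizontal shift errors
y_shifted = [perturb(detected, dy=o) for o in OFFSETS]       # vertical shift errors

thinned = thin(list(range(2000)))   # 2,000 samples per person become 200
```

Each perturbed region would then be passed to the recognizer in place of the true detection, one error type at a time, so that the effect of each kind of geometric error can be measured separately.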

Table 2. ID error rate of Bayesian MCMC using noisy face detector [%] (dilation, 200 samples/person)
Offset [pixels]   -4     -2     -1     0      +1     +2     +4
Neutral           49.6   18.8   6.5    0.1    0.0    10.2   30.8
Talking           31.8   1.3    0.4    0.7    0.3    5.8    29.0
Smiling           49.4   17.6   9.2    4.4    7.1    13.4   27.9
TOTAL             42.1   11.0   4.4    1.9    2.6    9.5    29.0
proc. time*       5 min.

Table 3. ID error rate of Bayesian MCMC using noisy face detector [%] (x-shift, 200 samples/person)
Offset [pixels]   -4     -2     -1     0      +1     +2     +4
Neutral           71.6   9.8    0.1    0.1    17.2   35.2   45.2
Talking           66.3   17.3   0.0    0.7    3.3    11.9   33.5
Smiling           71.7   47.6   11.6   4.4    16.4   16.4   56.6
TOTAL             69.4   26.5   4.2    1.9    7.7    18.7   44.4
proc. time*       5 min.

Table 4. ID error rate of Bayesian MCMC using noisy face detector [%] (y-shift, 200 samples/person)
Offset [pixels]   -4     -2     -1     0      +1     +2     +4
Neutral           86.3   18.0   5.3    0.1    17.2   65.8   89.2
Talking           83.0   13.3   8.2    0.7    1.3    31.3   91.1
Smiling           86.2   27.5   14.4   4.4    8.7    34.7   93.5
TOTAL             84.9   19.4   9.8    1.9    7.5    40.2   84.8
proc. time*       5 min.
*CPU: Alpha21264-1.25GHz; mem.:2GB; OS: Tru64 UNIX. Processing times shown are for the recognition phase.

Table 1 shows that the Bayesian MCMC approach was able to recognize all the faces in the neutral test data without error. On the talking test data, a small sample of which was used for training, the proposed approach reduced the ID error rate from 8.2% to 0.1%, showing that the MCMC samples adequately represent likely patterns of facial deformation during speech. On the smiling test data, none of which was used for training, the Bayesian MCMC approach reduced the ID error rate by over 90%. There are several possible reasons for this robustness to distortions in the test data. The most likely reason seems to be that the distribution of the MCMC samples is sufficiently broad to capture most deformations, while representing facial structure useful for identification using whole distributions, not merely the modal feature points. The sampling is performed independently for each template H , and a set of samples distributed around each feature point can encode much more about H than the feature point positions alone as registered in the templates.
On the other hand, the current Bayesian MCMC approach has no procedure for adjusting for differences in face position between training data and test data. If there are geometrical errors in the face detection process, the system projects all MCMC samples onto the test data with some bias in their size and/or position. Because the posterior distribution is sharply peaked, this incorrect mapping causes the system to miss the peaks of the posterior distribution given the test data. Tables 2, 3 and 4 show that the method is sensitive to size and position errors in face detection, which needs to be accurate to about ±1 pixel to keep the total ID error rate below 10%. The average face-region radius in the test images was 50 pixels, so the required accuracy is roughly 2%.
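The headline figures in this discussion follow directly from the counts in Table 1 and from the stated face radius. The short Python check below recomputes them; the variable and function names are illustrative only.

```python
# (errors of original system, errors of Bayesian MCMC, number of test images)
TABLE_1 = {
    "Neutral": (32, 0, 796),
    "Talking": (121, 1, 1483),
    "Smiling": (461, 32, 1233),
}


def error_rate(errors, total):
    """ID error rate in percent."""
    return 100.0 * errors / total


def relative_reduction(before, after):
    """Relative reduction in errors, in percent."""
    return 100.0 * (before - after) / before


total_original = sum(v[0] for v in TABLE_1.values())   # errors, original system
total_mcmc = sum(v[1] for v in TABLE_1.values())       # errors, Bayesian MCMC
total_images = sum(v[2] for v in TABLE_1.values())     # test images overall

# Smiling data: from 461 errors to 32 errors on the same 1233 images.
smiling_reduction = relative_reduction(461, 32)

# A tolerance of 1 pixel on a face region of radius 50 pixels.
required_accuracy = 100.0 * 1 / 50
```

The smiling-data reduction comes out at about 93%, consistent with the "over 90%" stated above, and a 1-pixel tolerance on a 50-pixel radius corresponds to 2%.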


In this study, we set the parameter β heuristically because of the difficulty of evaluating the normalizing constant:


Annealed importance sampling[8] is one possible solution to this problem. Sequential Monte Carlo (SMC)[9] may be effective in representing the influence of past data Dn-1 in (14).
Additionally, as noted above, an accurate face detector is currently required. Lampinen et al. introduced an object detector[10] using Gabor filters and an MCMC sampler, which may be the detector most compatible with our face recognizer.


[1]  Clippingdale, S. and Ito, T., "A Unified Approach to Video Face Detection, Tracking and Recognition," Proc. ICIP′99, Kobe, Japan, 1999, pp.662-666.
[2]  Clippingdale, S. and Ito, T., "Partial automation of database acquisition in the FAVRET face tracking and recognition system using a bootstrap approach,"  Proc. MVA2000, Tokyo, Japan, 2000, pp.5-8.
[3]  Wiskott, L., Fellous, J-M., Krüger, N., and von der Malsburg, C., "Face Recognition by Elastic Bunch Graph Matching," Technical Report IR-INI 96-08, Institut für Neuroinformatik, Ruhr-Universität Bochum, 1996.
[4]  Okada, K., Steffens, J., Maurer, T., Hong, H., Elagin, E., Neven, H., and von der Malsburg, C., "The Bochum/USC Face Recognition System And How it Fared in the FERET Phase III Test," Face Recognition: From Theory to Applications, Springer, 1998.
[5]  Ekman, P. and Friesen, W. V., Unmasking the Face, Prentice-Hall, Inc., Englewood Cliffs, 1975.
[6]  MacKay, D. J. C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.
[7]  Neal, R. M., Bayesian Learning for Neural Networks, Springer, 1996.
[8]  Neal, R. M., "Annealed Importance Sampling," Technical Report 9805, Dept. of Statistics, Univ. of Toronto, 1998.
[9]  Doucet, A., Freitas, N. de, and Gordon, N., Sequential Monte Carlo Methods in Practice, Springer, 2001.
[10]  Lampinen, J., Tamminen, T., Kostiainen, T., and Kalliomaki, I., "Bayesian object matching based on MCMC sampling and Gabor filters," Proc. SPIE Intelligent Robots and Computer Vision XX: Algorithms, Techniques, and Active Vision, 2001, Vol.4572 pp.41-50.

Mr. Atsushi Matsui
Atsushi Matsui received the B.E. and M.E. degrees in Electrical Engineering from Waseda University, Tokyo, Japan in 1994 and 1996 respectively. He joined Japan Broadcasting Corporation (NHK) in 1996. Since 1998, he has been with NHK Science and Technical Research Laboratories, engaged in research on speech recognition and face recognition. He is a member of the Institute of Electronics, Information and Communication Engineers (IEICE) and the Institute of Image Information and Television Engineers of Japan (ITE).
Mr. Simon Clippingdale
Simon Clippingdale received the B.Sc. Honours degree in Electronic and Electrical Engineering in 1982 from the University of Birmingham, U.K. and the Ph.D. degree in Computer Science in 1988 from the University of Warwick, U.K. He was a Japanese Government Science & Technology Agency (STA) Research Fellow at NHK Science and Technical Research Laboratories in 1990-91, and after lecturing at the University of Warwick, joined NHK in 1996. Since then he has been with NHK Science and Technical Research Laboratories, pursuing research on image recognition and vision. He is currently a Senior Research Engineer, and is a member of the Institute of Electronics, Information and Communication Engineers (IEICE) and the Institute of Image Information and Television Engineers of Japan (ITE). He currently serves on the IEICE Technical Committee on Pattern Recognition and Media Understanding.
Mr. Fumiki Uzawa
Fumiki Uzawa received the B.E. degree in Electrical, Electronics and Computer Engineering from Waseda University, Tokyo, Japan, in 2004. His research interests include face recognition and image processing. He joined Japan Broadcasting Corporation (NHK) in 2004. Since then he has been with NHK Okayama Broadcasting Station. He is a member of the Institute of Electronics, Information and Communication Engineers (IEICE).
Mr. Takashi Matsumoto
Takashi Matsumoto received a B.S. degree in Electrical Engineering from Waseda University, Tokyo, Japan, a M. Sc. degree in Applied Mathematics from Harvard University, Cambridge, MA, and a Ph.D. degree in Electrical Engineering from Waseda University. Since 1980, he has been a Professor at Waseda University, Tokyo, Japan. Currently, Dr. Matsumoto is with the Signal Processing Group, Cambridge University, U.K., on leave from Waseda University. His research interests include Monte Carlo based hierarchical Bayesian algorithms, Particle Filters, MCMC, on-line signature verification, on-line face recognition, on-line handwriting recognition, non-linear time series prediction, and bioinformatics. He is on the editorial board of Circuits, Systems, and Signal Processing. He is a coauthor of the book Bifurcations (Springer-Verlag, 1993). Dr. Matsumoto is a past Editorial Board Member, as well as a guest coeditor of the Proceedings of the IEEE. He is a member of the IEICE Biometric Person Authentication Technical Committee as well as a member of the ITE Technical Committee on Next-Generation Image Input Devices. He serves as a member of the Steering Committee of SVC 2004, Signature Verification Competition. He is a fellow of the IEEE.

Copyright 2004 NHK (Japan Broadcasting Corporation) All rights reserved. Unauthorized copy of the pages is prohibited.