NHK Laboratories Note No. 485


Simon Clippingdale, Mahito Fujii

Human Science


  Face recognition systems based on elastic graph matching work by comparing the positions and image neighborhoods of a number of detected feature points on faces in input images with those in a database of pre-registered face templates. Such systems can absorb a degree of deformation of input faces due, for example, to facial expression, but may generate recognition errors if the deformation becomes large. We show that, somewhat counter-intuitively, robustness to facial expressions can be increased by applying random perturbations to the positions of feature points in the database of face templates. We present experimental results on video sequences of people smiling and talking, and discuss the probable origin of the observed effect.

To appear in Proceedings of VLBV'03, Madrid, 18-19 September 2003
Published in Springer-Verlag Lecture Notes in Computer Science series

1. Introduction
  The detection and recognition of faces in video sequences promises to make feasible automatic indexing based on meaningful descriptions of content such as "persons A and B talking" or "meeting of E.U. finance ministers." However, broadcast video is largely unconstrained, requiring face recognition systems to handle head rotation about various axes; nonrigid facial deformations associated with speech and facial expression; moving and cluttered backgrounds; variable lighting and shadow; and dynamic occlusions. In this paper we address the issue of robustness to facial expression.

  A prototype face recognition system has been developed (FAVRET: FAce Video REcognition & Tracking [1]) which aims to handle some of the variability by using a database containing multiple views of each registered individual, and a flexible matching method (deformable template matching) which can absorb a certain amount of facial deformation. The template matching process is embedded in a dynamic framework in which multiple hypotheses about face identity, pose, position and size are carried forward and updated at each frame (cf. [2]).

  The template data representation is based on that used in the Elastic Graph Matching (EGM) system of Wiskott and von der Malsburg [3,4] and a number of related systems [5,6,7]. It consists of the coordinates of a number of predefined facial feature points, together with Gabor wavelet features computed at multiple resolutions at each feature point. Like the EGM system, the present system works by matching input face images against a database of face templates using the positions of facial feature points detected in the input image, and Gabor wavelet features measured on the input image at each detected feature point [1].

  Although the system can absorb some deformation of input faces due to facial expressions or speech movements, significant deformations can give rise to recognition (identity) errors even when the system tracks the facial feature points more or less correctly. However, random perturbations of the feature points from which the deformable templates are constructed can be shown to increase the robustness of recognition under such conditions without affecting the accuracy of the system where there is no deformation of the input face.

  Section 2 describes the FAVRET prototype system and the template matching process, and shows examples of the system output in the cases of correct recognition and of recognition error caused by facial expression. Section 3 discusses the effect of perturbation of the feature points in the database of face templates, and gives numerical results of experiments on video sequences. Section 4 contains a discussion of the results and the likely source of the observed increase in robustness. The paper concludes with some comments on the applicability of the method and remaining issues in the deployment of such recognition systems.

2. The FAVRET Prototype System

2.1 Architecture

  The architecture of the FAVRET prototype system is illustrated in figure 1. The database contains deformable templates at multiple resolutions constructed from face images of target individuals at multiple poses (section 2.2). The system generates a set of active hypotheses which describe the best few template matches (feature point positions, face pose and identity) achieved for each likely face region on the previous frame of input video. Templates are initialized on each new input frame based on the contents of these hypotheses, and are then allowed to deform so as to maximize the similarity between the features in the template and those measured on the image at the deformed feature points (section 2.3).
  Once each feature point attains a similarity maximum, an overall match score is computed for the entire deformed template (section 2.3). If the match score exceeds a threshold, processing continues recursively to templates at the next higher resolution in the database until a set of terminating leaves in the database 'tree' is obtained. At each terminating leaf a new active hypothesis is generated, with an 'evidence' value accumulated during the descent of the tree according to the matches achieved at each template. The active hypotheses compete on this evidence value with others from the same region of the image. Information from them is integrated over time by updating at each frame a set of region hypotheses, one per face region. Running estimates of probability for each registered individual are updated from these evidence values in broadly Bayesian fashion and the largest determines the identity output by the system for the given region; the probability estimate is output as a confidence measure.
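The coarse-to-fine descent and evidence accumulation described above can be sketched as follows. This is an illustrative stand-in, not the system's actual code: `Node` and `descend` are our own names, and the match scores are stubbed as fixed numbers rather than computed from templates.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One deformable template in the tree-structured database."""
    score: float                      # match score of this template (stubbed)
    children: list = field(default_factory=list)

def descend(node, threshold, evidence=0.0):
    """Recursively descend the template tree, accumulating evidence;
    return the evidence values of hypotheses generated at terminating
    leaves. Branches whose match score falls below the threshold are
    pruned, as in the recursive matching described above."""
    if node.score < threshold:
        return []                     # match too poor: prune this branch
    evidence += node.score
    if not node.children:             # terminating leaf: new active hypothesis
        return [evidence]
    hyps = []
    for child in node.children:
        hyps.extend(descend(child, threshold, evidence))
    return hyps

# Toy tree: a low-resolution root with two higher-resolution children
tree = Node(0.9, [Node(0.8, [Node(0.7)]), Node(0.3)])
hyps = descend(tree, 0.5)
# One hypothesis survives, from the 0.9 -> 0.8 -> 0.7 path;
# the 0.3 branch is pruned.
```

In the real system the surviving hypotheses would then compete on their evidence values within each face region.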


Figure 1. Architecture of FAVRET prototype system

2.2 Deformable Templates

  The deformable templates in the system database are constructed from face images of target individuals at multiple poses, labeled with feature point positions. Figure 2 shows images at 10-degree intervals; wider intervals can be used (work is in progress on automatically extracting and labeling such images from video sequences [8]).
  Each template consists of the normalized spatial coordinates of those feature points visible in the image, together with features (Gabor wavelet coefficients) computed as point convolutions with a set of Gabor wavelets at each of the feature points. The Gabor wavelet at resolution r and orientation n is a complex-exponential grating patch with a 2-D Gaussian envelope (figure 3):

    ψ_{r,n}(x) = exp( −k_r² |x|² / 2σ² ) exp( i k_{r,n} · x )    (1)

with the spatial extent of the Gaussian envelope given by σ / k_r and 2-D spatial center frequency given by

    k_{r,n} = k_r (cos(nπ/8), sin(nπ/8)),   k_r = k_0 2^{−r/2}    (2)

for Nres = 5 resolutions r = 0,...,4 at half-octave intervals and Norn = 8 orientations n = 0,...,7.
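For concreteness, such kernels can be generated as follows. This is a minimal sketch: the values of k0, sigma and the kernel size are illustrative assumptions, not the parameters used in the system.

```python
import numpy as np

def gabor_kernel(r, n, size=33, k0=np.pi / 2, sigma=2 * np.pi, n_orn=8):
    """Complex Gabor kernel at resolution r, orientation n: a
    complex-exponential grating patch under a 2-D Gaussian envelope.
    k_r decreases by half-octaves (factor 2**-0.5 per step), so the
    envelope width sigma / k_r grows with r. Parameter values here
    are illustrative, not the paper's."""
    k_r = k0 * 2.0 ** (-r / 2.0)                  # half-octave spacing
    phi = n * np.pi / n_orn                       # orientation angle
    kx, ky = k_r * np.cos(phi), k_r * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-k_r ** 2 * (x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y))
    return envelope * carrier

# A feature vector at one feature point holds one complex coefficient
# per kernel: the point convolution of the image with each kernel.
kernels = [gabor_kernel(r, n) for r in range(5) for n in range(8)]
```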

Figure 2. Multiple views annotated with feature point positions

Figure 3. Example Gabor wavelet. Left: real part; right: imaginary part

  This data representation resembles that used in the Elastic Graph Matching system of Wiskott and von der Malsburg [3,4] for face recognition in static images, but the chosen feature points differ, as do the parameters of the Gabor wavelets. A given template contains features computed at a single resolution r; Nres = 5 templates are computed from each registered image. Sufficiently similar low-resolution templates are merged by appropriate averaging to give the tree-structured database depicted in figure 1 (only 3 resolutions are shown in the figure).

2.3 Template Matching

  The system initializes templates from active hypotheses by estimating the least-squares dilation, rotation and shift from the feature point set in the template to that in the hypothesis and applying this transformation to the template, thus preserving the spatial relationships between template feature points. Thereafter, each feature point is allowed to migrate within the image until a local maximum of feature similarity is found between the features in the template and those measured on the image. The estimation of the feature point shift required to maximize similarity uses phase differences [3] between the Gabor wavelet features in the template and those measured from the image. The process of estimating the shift, measuring the wavelet features at the shifted position and computing the shifted feature similarity is iterated until the similarity is maximized.
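The least-squares dilation, rotation and shift between two point sets can be estimated in closed form; the following standard 2-D similarity (Procrustes) fit is a sketch of the kind of computation involved, under our own assumptions rather than the system's actual code.

```python
import numpy as np

def similarity_fit(src, dst):
    """Least-squares dilation s, rotation R and shift t mapping the
    template point set `src` onto the hypothesis point set `dst`
    (2-D similarity Procrustes fit via SVD)."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(0), dst.mean(0)
    a, b = src - mu_s, dst - mu_d                # centered point sets
    cov = b.T @ a                                # 2x2 cross-covariance
    u, sv, vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(u @ vt))           # guard against reflections
    R = u @ np.diag([1.0, d]) @ vt
    s = (sv[0] + d * sv[1]) / (a ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Initialize template feature points by applying the fitted transform,
# preserving their spatial relationships.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
s, R, t = similarity_fit(pts, 2.0 * pts + [3.0, 4.0])
init = (s * (R @ pts.T)).T + t                   # recovers 2*pts + [3, 4]
```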
  Once each feature point attains a similarity maximum, the overall match score for the deformed template is computed from the individual feature point similarities and the deformation energy of the feature point set relative to that in the undeformed template:

    S(A, A') = α (c_A · c_A') / (||c_A|| ||c_A'||) − β E_{A,A'} / λ_r²    (3)

where A denotes the undeformed template and A' the template deformed on the image; c_A and c_A' are feature vectors of Gabor wavelet coefficients from the template and from the deformed feature point positions on the image; E_{A,A'} is the deformation energy (sum of squared displacements) between the feature point positions in the template and the deformed feature point positions on the image, up to a dilation, rotation and shift; α and β are weights for the feature similarity and spatial deformation terms; and λ_r = 2π / k_r is the modulation wavelength of the Gabor wavelet at resolution r.
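A minimal sketch of a match score of this general form (normalized feature correlation minus a scaled deformation penalty) follows. The weight and wavelength values are illustrative, and unlike the system, this sketch does not factor out the best-fitting dilation, rotation and shift before computing the deformation energy.

```python
import numpy as np

def match_score(c_A, c_Ap, pts_A, pts_Ap, alpha=1.0, beta=0.1, lam=8.0):
    """Overall match score for a deformed template: feature similarity
    (normalized correlation of coefficient vectors) minus a deformation
    penalty scaled by the modulation wavelength lam of the template's
    resolution. alpha, beta and lam are illustrative values.
    Note: for brevity this uses raw displacements for the deformation
    energy, without removing a best-fitting similarity transform."""
    sim = np.dot(c_A, c_Ap) / (np.linalg.norm(c_A) * np.linalg.norm(c_Ap))
    E = ((pts_Ap - pts_A) ** 2).sum()    # sum of squared displacements
    return alpha * sim - beta * E / lam ** 2
```

An undeformed template with identical features scores exactly alpha; any displacement of the feature points lowers the score.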

2.4 System Output and ID Error Examples

  For each face region in the input video, the system outputs the identity of the registered individual for which the estimated probability is highest, or indicates that the face is unknown. Figure 4 shows examples from an experiment using a database of 12 individuals registered at 10-degree pose intervals. The output is superimposed on the relevant input frame. The identity in the left example is correct. However, when this subject smiles, the system outputs an incorrect identity as shown in the right example.

Figure 4. Correct (left) and incorrect (right) recognition examples


  It has been found that applying a random perturbation to the locations of feature points used to construct deformable templates improves the robustness of the system with respect to facial deformations such as result from facial expressions and speech movements.

Figure 5. Original (left) and perturbed (right) feature point positions

  The left side of figure 5 shows an image annotated with the 'correct' feature point positions used to build the database in the system that produced the output shown in figure 4. The right side of figure 5 shows the same image with feature point positions perturbed by 5 pixels, rounded to the nearest pixel, in uniformly-distributed random directions (the scaling differs slightly because image normalization is based on the feature point positions). A second database was constructed using such perturbed feature points: a different set of perturbations was applied to the feature points identified on each registered image before the deformable templates were computed. The usual merging of similar templates was omitted in both cases, because the perturbed templates merge little if at all, owing to the dissimilarity of their feature point positions.
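The perturbation itself is straightforward to reproduce; a sketch follows (the function name and the seeded generator are our own choices).

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only for reproducibility

def perturb(points, radius=5.0):
    """Move each feature point `radius` pixels in an independent,
    uniformly-distributed random direction, rounding the perturbed
    position to the nearest pixel."""
    pts = np.asarray(points, float)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=len(pts))
    offsets = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return np.round(pts + offsets)

# Example: each point lands about 5 px away (within rounding error)
# in its own random direction.
pts = np.array([[100.0, 120.0], [140.0, 118.0], [120.0, 150.0]])
moved = perturb(pts)
```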
  Experiments were conducted on video sequences showing 10 of the 12 registered individuals. The sequences showed subjects neutral (expressionless), smiling and talking. Table 1 shows the numbers of correct and incorrect IDs output over each video sequence by each system, summed over all 10 individuals in each condition. The totals (correct + incorrect) differ slightly due to differences in the instants at which tracking locked on or failed.

Table 1: Recognition results using databases with original and
perturbed feature points, under various input test conditions

              Original feature points         Perturbed feature points
Condition     Correct (%)    Incorrect (%)    Correct (%)    Incorrect (%)
Neutral       934  (97.5)      24  (2.5)       948  (99.0)     10  (1.0)
Smiling       679  (51.6)     636 (48.4)      1074  (82.7)    224 (17.3)
Talking      1151  (75.2)     380 (24.8)      1431  (93.5)     99  (6.5)
Total        2764  (72.7)    1040 (27.3)      3453  (91.2)    333  (8.8)

  It is clear from the results shown in Table 1 that perturbation of the feature points leads to a significant increase in robustness of the system to those deformations associated with smiling and talking. Although the results vary among the ten individuals tested, there is no overall performance penalty even in the neutral (expressionless) case.

4. Discussion


  One might be tempted to suspect that the effect is due to the relative lack of high-resolution feature energy at the perturbed feature points, which tend to lie in somewhat smoother areas of the face (the perturbation corresponds to only about a quarter of the modulation wavelength of the lowest-resolution wavelets, but about one wavelength for the highest-resolution wavelets, sufficient to shift the envelope significantly off the original feature points). However, previous experiments which restricted the maximum resolution used for matching suggest that this alone does not have the pronounced effect observed here.

  Rather, it seems that the observed effect is due to an increase in the separation between feature point configurations deformed by facial expressions and the feature point configurations of templates corresponding to other individuals in the database. Without perturbation, a deformation can transform one individual's configuration into something very close to another's, yielding a low value of the deformation energy term E_{A,A'} in (3) for an incorrect individual; the perturbation prevents this. The randomness of the perturbation directions ensures that deformations due to facial expression are unlikely to mimic the perturbations by chance (as a simple example, deformations due to facial expression are often more or less reflection-symmetric about the center line of the face, whereas the perturbations are not). Considering a configuration of N feature points in a template as a single point in 2N-dimensional space, the random perturbations increase the separation between such points in a subspace orthogonal to that generated by deformations due to facial expression.
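The claimed geometric effect can be checked numerically on synthetic configurations; all the numbers below are toy values, not data from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def config_distance(p, q):
    """Distance between two N-point configurations viewed as single
    points in 2N-dimensional space."""
    return float(np.linalg.norm((p - q).ravel()))

def rand_perturb(pts, radius=5.0):
    """Independent random-direction perturbation of each point."""
    th = rng.uniform(0.0, 2.0 * np.pi, len(pts))
    return pts + radius * np.stack([np.cos(th), np.sin(th)], axis=1)

# Two synthetic 'individuals' whose N-point configurations are close
N = 20
base = rng.uniform(0.0, 100.0, (N, 2))
other = base + rng.normal(0.0, 1.0, (N, 2))     # a nearby configuration

d_before = config_distance(base, other)
d_after = config_distance(rand_perturb(base), rand_perturb(other))
# Independent perturbations add about 2 * radius**2 * N to the expected
# squared separation, so d_after almost surely exceeds d_before.
```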

  The method is applicable in principle to all EGM-based recognition systems provided that the perturbations are sufficiently small and the features used are of sufficiently broad support that the target facial structure remains represented at the perturbed positions. Although the experiments reported here used a single perturbation at each feature point at all resolutions, it may be that smaller perturbations should be applied at higher resolutions for best results.

  The learning of models of feature point motions associated with typical facial expressions would allow systems to disregard expression-induced subspaces of deformation when computing the deformation energy term E_{A,A'} in (3) while penalizing deformation components in orthogonal subspaces. This is in some ways complementary to the perturbation approach discussed here, in that it attempts to reduce E_{A,A'} for the correct individual while the perturbation approach increases E_{A,A'} for others. Further work is required to determine whether a combination of the two would be effective.
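Such a scheme might compute the deformation energy only outside a learned expression subspace; the following is a hypothetical sketch, in which the basis is a toy stand-in for a learned model of expression-induced motion.

```python
import numpy as np

def energy_outside_subspace(delta, basis):
    """Deformation energy of the 2N-dimensional displacement vector
    `delta` after projecting out a learned expression subspace spanned
    by the orthonormal columns of `basis` (a hypothetical model).
    Components inside the subspace go unpenalized; components in
    orthogonal directions are penalized in full."""
    residual = delta - basis @ (basis.T @ delta)
    return float(residual @ residual)

# Toy example: a 4-D deformation space with a 1-D 'expression' direction
b = np.array([[1.0], [0.0], [0.0], [0.0]])
d = np.array([3.0, 0.0, 4.0, 0.0])
e = energy_outside_subspace(d, b)   # 16.0: the component along b is ignored
```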

  The above discussion has not considered the role of the wavelet coefficients and the feature similarity term in (3); it may be that there is some effect other than that resulting from the reduction of the effective resolution mentioned above, and further work is again required to assess the nature of any such effect.

  Probably the two greatest hurdles to be overcome before recognition systems can be useful in practice on unconstrained video are robustness to deformations such as those discussed here, and robustness to variable lighting. For EGM-based systems of the present type, the latter is likely to require attention to the feature similarity term in (3) and to the effect of lighting-induced feature variation on it.

References


[1] S. Clippingdale and T. Ito, "A Unified Approach to Video Face Detection, Tracking and Recognition," Proc. International Conference on Image Processing ICIP'99, Kobe, Japan, 1999.
[2] M. Isard and A. Blake, "Contour tracking by stochastic propagation of conditional density," Proc. European Conference on Computer Vision ECCV'96, pp. 343-356, Cambridge, UK, 1996.
[3] L. Wiskott, J.-M. Fellous, N. Krüger and C. von der Malsburg, "Face Recognition by Elastic Bunch Graph Matching," TR96-08, Institut für Neuroinformatik, Ruhr-Universität Bochum, 1996.
[4] K. Okada, J. Steffens, T. Maurer, H. Hong, E. Elagin, H. Neven and C. von der Malsburg, "The Bochum/USC Face Recognition System and How it Fared in the FERET Phase III Test," Face Recognition: From Theory to Applications, eds. H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulié and T. S. Huang, Springer-Verlag, 1998.
[5] M. Lyons and S. Akamatsu, "Coding Facial Expressions with Gabor Wavelets," Proc. Third IEEE International Conference on Automatic Face and Gesture Recognition FG'98, Nara, Japan, 1998.
[6] S. McKenna, S. Gong, R. Würtz, J. Tanner and D. Banin, "Tracking Facial Feature Points with Gabor Wavelets and Shape Models," Proc. 1st International Conference on Audio- and Video-Based Biometric Person Authentication, Lecture Notes in Computer Science, Springer-Verlag, 1997.
[7] D. Pramadihanto, Y. Iwai and M. Yachida, "Integrated Person Identification and Expression Recognition from Facial Images," IEICE Trans. Information & Systems, Vol. E84-D, 7, pp. 856-866 (2001).
[8] S. Clippingdale and T. Ito, "Partial Automation of Database Acquisition in the FAVRET Face Tracking and Recognition System Using a Bootstrap Approach," Proc. IAPR Workshop on Machine Vision Applications MVA2000, Tokyo, November 2000.

Simon Clippingdale
Simon Clippingdale received the B.Sc. Honours degree in Electronic and Electrical Engineering in 1982 from the University of Birmingham, U.K., and the Ph.D. degree in Computer Science in 1988 from the University of Warwick, U.K. He was a Japanese Government Science & Technology Agency (STA) Research Fellow at NHK Science and Technical Research Laboratories in 1990-91, and after lecturing at the University of Warwick, joined NHK in 1996. Since then he has been with the Human Science Research Division of NHK Science and Technical Research Laboratories, pursuing research on image recognition and vision. He is currently a Senior Research Engineer.
Mahito Fujii
Mahito Fujii obtained a Bachelor's degree in Electrical Engineering in 1981 and a Master's degree in 1983, both from Nagoya University. He joined NHK in 1983 and has been with NHK Science and Technical Research Laboratories since 1987, working on visual biocybernetics, image recognition and 3-D image information processing. From 1998 to 2001 he was with the Human Information Processing (HIP) Laboratories of Advanced Telecommunications Research Institute International (ATR). He is currently a Chief Engineer in the Research Planning and Coordination Division of NHK Science and Technical Research Laboratories.

Copyright 2003 NHK (Japan Broadcasting Corporation). All rights reserved. Unauthorized copying of these pages is prohibited.