SPEECHREADING USING SHAPE AND INTENSITY INFORMATION

Juergen Luettin (1,2), Neil A. Thacker (1), Steve W. Beet (1)

(1) Dept. of Electronic and Electrical Engineering,
    University of Sheffield, Sheffield S1 3JD, UK
(2) IDIAP, CP 592, 1920 Martigny, Switzerland

Luettin@idiap.ch, N.Thacker@shef.ac.uk, S.Beet@shef.ac.uk

ABSTRACT

We describe a speechreading system that uses both shape information from the lip contours and intensity information from the mouth area. Shape information is obtained by tracking and parameterising the inner and outer lip boundary in an image sequence. Intensity information is extracted from a grey-level model based on principal component analysis.
In comparison to other approaches, the intensity area deforms with the shape model to ensure that similar object features are represented after non-rigid deformation of the lips. We describe speaker-independent recognition experiments based on these features and Hidden Markov Models. Preliminary results suggest that similar performance can be achieved by using either shape or intensity information, and slightly higher performance by their combined use.

1. INTRODUCTION

Visual information from the speaker's face provides speech information which is often complementary to the acoustic signal and which can improve the performance of speech recognition systems [1][2]. One of the main difficulties in speechreading is the extraction of visual speech features.
It is still not well known which features are important for speech recognition and how to represent them. Although it is generally agreed that most visual speech information is contained in the inner and outer lip contour, it has also been shown that the visibility of the teeth and tongue provides important speech cues [3][4]. Particularly for fricatives, the place of articulation can often be determined visually, e.g. for labiodental (upper teeth on lower lip), interdental (tongue behind front teeth) and alveolar (tongue touching gum ridge) places. Other speech information might be contained in the protrusion and wrinkling of the lips.

Speechreading approaches can be classified into image-based and model-based systems.
Image-based systems use grey-level information from an image region containing the lips, either directly or after some processing, as speech features. Most image information is therefore retained, but it is left to the recognition system to discriminate speech information from linguistic variability and illumination variability. Model-based systems usually represent the lips by geometric measures, like the height or width of the outer or inner lip boundary, or by a parametric contour model which represents the lip boundaries. The extracted features are of low dimension and invariant to illumination. Model-based systems depend on the definition of speech-related features by the user.
The definition may therefore not include all speech-relevant information, and features like the visibility of the teeth and tongue are difficult to represent.

We have previously described a speechreading system [5] based on shape features which represent the outline of the inner and outer lip contour, and their modelling by Hidden Markov Models (HMMs). The system performed well for a speaker-independent recognition task, but it did not contain any intensity information, which might provide additional speech information. Here we extend this system by augmenting the feature vector with intensity information extracted from the mouth region. We evaluate the contribution of intensity information separately and in combination with shape features.

2. SHAPE MODELLING

For modelling the shape variability of lips, we use an approach based on active shape models [6][7]. These are statistically based deformable models which represent a contour by a set of points. Patterns of characteristic shape variability are learned from a training set, using principal component analysis (PCA). The main modes of shape variation captured in the training set can therefore be described by a small number of parameters. The main advantage of this modelling technique is that heuristic assumptions about legal shape deformation are avoided. Instead, the model is only allowed to deform to shapes similar to the ones seen in the training set.
Any shape x representing the co-ordinates of the contour points can be approximated by

    x = x̄ + P b ,    (1)

where x̄ is the mean shape, P the matrix of eigenvectors of the covariance matrix and b a vector containing the weights for each eigenvector. Only the first few eigenvectors corresponding to the
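The training and fitting of such a point-distribution model, i.e. computing the mean shape and eigenvectors and then approximating a shape via equation (1), can be sketched as follows. This is a minimal numpy illustration under assumed conventions (shapes stored as flattened coordinate vectors, a fixed number of retained modes), not the authors' implementation; the function names are hypothetical.

```python
import numpy as np

def train_shape_model(shapes, n_modes=5):
    """Build a PCA shape model from aligned training shapes.

    shapes: (n_samples, 2*n_points) array, each row the concatenated
    (x, y) coordinates of one lip contour.
    Returns the mean shape x_bar and the matrix P whose columns are
    the n_modes eigenvectors of the covariance matrix with the
    largest eigenvalues.
    """
    shapes = np.asarray(shapes, dtype=float)
    x_bar = shapes.mean(axis=0)
    centred = shapes - x_bar
    cov = centred.T @ centred / (len(shapes) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # largest variance first
    P = eigvecs[:, order[:n_modes]]             # (2*n_points, n_modes)
    return x_bar, P

def fit_shape(x, x_bar, P):
    """Project a shape onto the model and reconstruct it via (1):
    b = P^T (x - x_bar), then x ≈ x_bar + P b."""
    b = P.T @ (x - x_bar)
    return x_bar + P @ b, b
```

Because P is truncated to a few modes, the reconstruction is constrained to the subspace of shape variation seen in training, which is exactly how the model excludes deformations unlike the training shapes.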