Integration of Ohman and Rule-based Coarticulation Models for Visualization of Pure Lithuanian Diphthongs

A visualization methodology is presented for pure Lithuanian diphthongs that are influenced by neighbouring phonemes. The Ohman and Rule-based coarticulation control models are integrated. Rules for Lithuanian diphthong animation are defined based on Lithuanian phonology. Most of them can be realized using the Ohman control model; the others use linear interpolation of expressiveness parameters. The proposed coarticulation model can be applied to define expressiveness coefficients in viseme-driven speech animation engines. DOI: http://dx.doi.org/10.5755/j01.eee.19.1.3260


I. INTRODUCTION
Movements of the vocal organs involved in the production of speech sounds (articulators) significantly improve the understanding of the acoustic signal. Consequently, computer-generated 3D human head models designed to animate synthesized or natural speech (talking heads) have recently become an important part of human-computer communication. Talking heads can be employed for e-consulting services: a virtual secretary, a web navigator, or a virtual agent responsible for conveying information to the user in Smart Ecological and Social Apartments (SESA). Talking heads are also widely used in e-learning technologies for the correct presentation of sound pronunciation [1], and in the movie, advertising and computer game industries.
Developing a new speech animation engine is a time-consuming task that also requires considerable financial support and specific knowledge. Translingual speech animation can be applied to save resources and to integrate a talking head built on the phonetics of a foreign language. For instance, the visual similarity of Lithuanian and English phonemes was used to animate Lithuanian speech with an English speech animation engine [2].
However, every language is unique and has specific phonetic rules for speech production. Moreover, the coarticulation effect, in which a conceptually isolated speech sound is influenced by neighbouring segments, has a strong influence on visual speech. Coarticulation control models [3] should be applied to govern the articulatory movements for a given phonetic target specification. Typically, this specification is expressed as a sequence of time-labelled phonemes, including stress and phrasing markers.

II. COMPARISON OF COARTICULATION CONTROL MODELS
Different researchers have identified various coarticulation control models, including the Rule-based [4], Cohen-Massaro [5], Ohman [6] and Artificial Neural Network (ANN) [7] models. Rule-based (parametric) coarticulation models use a set of explicit rules to model steady-state properties of pronounced phonemes and parametrically control how these phonemes are fused into connected speech. For instance, Beskow [4] proposed a rule-based model in which each phoneme is assigned a target vector of articulatory control parameters. Some parameter values can be left undefined to allow these targets to be influenced by coarticulation. If a target is left undefined, its value is inferred from context using interpolation. For example, the lip rounding parameter in a $V_1CCCV_2$ utterance (vowel $V_1$ is unrounded, $V_2$ is rounded) is unspecified for the consonants $C$, so the consonant targets are determined from the vowel context by linear interpolation from $V_1$ to $V_2$.
Most rule-based animation engines are built on formal linguistic theory, so the implementation of a rule-based model is limited by the incomplete understanding of the coarticulation effect and by the difficulty of building a full set of rules for phonemes that are influenced by their neighbours.
Data-driven animation engines commonly exploit the Cohen-Massaro [5], Ohman [6] and ANN [7] coarticulation control models. The Cohen-Massaro and Ohman models are associated with speech production theory. Artificial neural networks must be trained to predict articulatory parameter values on a frame-by-frame basis.
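The interpolation of undefined targets described above can be illustrated with a minimal Python sketch. It is not Beskow's implementation: the function name `fill_undefined_targets`, the use of `None` for an undefined target, and the choice to interpolate by phoneme position rather than by time are all assumptions made for illustration.

```python
from typing import List, Optional


def fill_undefined_targets(targets: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between the nearest defined neighbours.

    Assumption: interpolation is done over phoneme positions, not over time.
    Undefined targets at either edge copy the nearest defined value.
    """
    defined = [i for i, value in enumerate(targets) if value is not None]
    if not defined:
        raise ValueError("at least one target must be defined")
    filled: List[float] = []
    for i, value in enumerate(targets):
        if value is not None:
            filled.append(float(value))
            continue
        left = max((j for j in defined if j < i), default=None)
        right = min((j for j in defined if j > i), default=None)
        if left is None:
            filled.append(float(targets[right]))
        elif right is None:
            filled.append(float(targets[left]))
        else:
            weight = (i - left) / (right - left)
            filled.append(targets[left] + weight * (targets[right] - targets[left]))
    return filled


# V1 C C C V2: unrounded V1 (0.0), three unspecified consonants, rounded V2 (1.0)
lip_rounding = fill_undefined_targets([0.0, None, None, None, 1.0])
print(lip_rounding)
```

In this sketch the three consonants receive evenly spaced lip rounding values between the two vowel targets.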
The Cohen-Massaro model [5] uses a dominance function that gradually increases up to a peak value and then decreases. This function is employed to model articulatory gestures. However, certain targets, such as the closure in a bilabial stop, cannot be achieved with this model.
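The behaviour of a dominance function can be sketched as follows. This is a simplified illustration, not the parameterisation used in [5]: it assumes a symmetric negative-exponential dominance function and a dominance-weighted average of segment targets, and the names `dominance`, `blended_parameter`, `alpha`, `theta` and `c` are chosen here for illustration only.

```python
import math
from typing import Sequence


def dominance(t: float, center: float, alpha: float, theta: float, c: float = 1.0) -> float:
    """Symmetric negative-exponential dominance (assumed form).

    Rises to the peak value `alpha` at `center` and decays on both sides.
    """
    return alpha * math.exp(-theta * abs(t - center) ** c)


def blended_parameter(
    t: float,
    targets: Sequence[float],
    centers: Sequence[float],
    alphas: Sequence[float],
    thetas: Sequence[float],
) -> float:
    """Dominance-weighted average of the segment targets at time t."""
    weights = [dominance(t, ctr, a, th) for ctr, a, th in zip(centers, alphas, thetas)]
    return sum(w * x for w, x in zip(weights, targets)) / sum(weights)


# Two segments with targets 0.0 and 1.0, centred at t = 0.0 and t = 1.0
targets, centers = [0.0, 1.0], [0.0, 1.0]
alphas, thetas = [1.0, 1.0], [4.0, 4.0]
at_first_peak = blended_parameter(0.0, targets, centers, alphas, thetas)
midway = blended_parameter(0.5, targets, centers, alphas, thetas)
```

Under these assumptions the blended value at the first segment's peak stays slightly above its target of 0.0, because the neighbouring segment still has non-zero dominance there. This is one way to see why a complete closure may not be reached with a weighted-average scheme.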
In the Ohman coarticulation control model modified by Reveret [8], $C$ is the set of all consonants in the analysed phonetic structure. The vowel track $v(t)$ is formed by temporally blending successive fixed vowel targets $a_i$ over the $N$ vowels in the analysed utterance.
A perceptual intelligibility experiment compared the coarticulation control models together with an audio-alone condition (Table I) [3]. The results confirm that all coarticulation control models give significantly increased speech intelligibility over the audio-alone case. They also show that the rule-based model has the highest intelligibility score and that Ohman is more effective than Cohen-Massaro. Therefore, in our research we propose to integrate the Rule-based and Ohman coarticulation models to visualize pure Lithuanian diphthongs.
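For orientation, the quantities named in the description of the modified Ohman model (the consonant set $C$, the consonant target $c$, the vowel track $v(t)$, the vowel targets and the number of vowels $N$) can be arranged as in the sketch below. It follows the commonly cited form of the Ohman model and is not a reproduction of the equations in [8]. The articulatory trajectory $x(t)$, the temporal blending function $k_c(t)$ and the coarticulation factor $\rho_c$ are assumed names, and the vowel index is written $j$ to match the blending function $b_j(t)$.

```latex
x(t) = v(t) + \sum_{c \in C} k_c(t)\,\rho_c\,\bigl[c - v(t)\bigr],
\qquad
v(t) = \sum_{j=1}^{N} b_j(t)\,a_j
```

In this form each consonant pulls the trajectory away from the vowel track towards its own target, by an amount controlled by its coarticulation factor and its blending function.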

III. COARTICULATION ANALYSIS OF LITHUANIAN DIPHTHONGS
The term diphthong refers to two adjacent vowel sounds occurring within the same syllable. Technically, a diphthong is a vowel with two different targets. This means that the tongue and lips move during the pronunciation of this vowel. Languages differ in the length of diphthongs. Diphthongs typically behave like long vowels in languages with phonemically short and long vowels. The Lithuanian language is a good example of this case [9]. Languages also differ in the number of diphthongs (10 in British English [10], 6 in Dutch [11], etc.).
There are two types of Lithuanian diphthongs: 9 pure diphthongs with a Vowel-Vowel (VV) structure, and 20 mixed diphthongs, which are made of the vowels "a", "e", "i", "u" and the consonants "l", "m", "n", "r" and have a Vowel-Consonant (VC) structure [12]. Only pure Lithuanian diphthongs (ai, au, ei, ui, ie, uo, eu, oi, ou) are analysed in our research. They regularly appear in Lithuanian words (e.g. miegas, saulė, eisena), so visualization of these diphthongs requires additional attention, especially when an English speech animation engine is used to animate Lithuanian speech. The architecture of the Lithuanian speech animation engine was presented in [2].
The stressing of a diphthong strongly influences the visualization of speech, since a stressed syllable is more expressive than the others. In addition, the position of the accented phoneme strongly influences the appearance of neighbouring phonetic segments. English diphthongs are always stressed with the falling accent, whereas Lithuanian diphthongs can also be stressed with the rising accent. Thus, three situations of Lithuanian diphthong stressing can be distinguished:
1) The diphthong is stressed with the falling accent (ái). A falling diphthong starts with a vowel quality of higher prominence (higher pitch or volume) and ends in a semivowel with less prominence. Examples of falling diphthongs: láimė, áugti, léisti;
2) The diphthong is stressed with the rising accent (aĩ). A rising diphthong begins with a less prominent semivowel and ends with a more prominent full vowel. Examples: eĩti, šiẽnas;
3) The diphthong is in an unstressed syllable (e.g. traukinys, ąžuolas).
People prepare themselves for the pronunciation of the next phoneme during the articulation of the current phoneme. So, the second group of features that influence the visualization and expressiveness of a Lithuanian diphthong includes its location in the word and its neighbours.
The position of the diphthong in the word, together with information about its stress, is analysed to investigate their influence on speech animation. The rules for Lithuanian diphthong visualization are included in the framework shown in Fig. 1.

IV. PROPOSED FRAMEWORK FOR PURE LITHUANIAN DIPHTHONG VISUALIZATION
Rules for pure Lithuanian diphthong visualization (Fig. 1) can be divided into two main groups: rules to which the adapted Ohman coarticulation control model can be applied (dotted and greyed rules) and rules to which it cannot (white rules).
The Ohman coarticulation control model described earlier does not define coarticulation between two consonants. Therefore, the influence of an analysed consonant must extend no further than the peak of the preceding or following gesture. Moreover, the Ohman model [8] is designed for VCV, VCCV or VCCCV phonetic utterances, so its application to VVCV and VCVV utterances must be considered separately. In this paper we propose a technique by which VVCV and VCVV utterances can be visualized using the Ohman coarticulation model designed for the VCCV phonetic structure.
A Lithuanian diphthong can be stressed in three ways (falling accent, rising accent or non-stressed), so the influence of the stressed vowel $V^S$ on the appearance of the non-accented vowel of the diphthong should be analysed. As stated earlier, a falling diphthong ($V_1^S V_2$) starts with a vowel quality of higher importance and ends in a semivowel with less prominence. This means that the semivowel ($V_2 = V_{semi}$) is visually much less expressed too. Therefore we define that supreme expressiveness coefficient [...]. The rising diphthong $V_1 V_2^S$ begins with a semivowel and ends with a more prominent full vowel, so $V_1 = V_{semi}$. In the meantime, we treat the non-stressed [...].
Vowel articulation strongly influences the pronunciation of neighbouring phonemes [10]. The visual expressiveness of consonants is much lower, so they are highly dependent on neighbouring vowels. A semivowel has characteristics similar to those of a consonant: it is strongly influenced by the stressed vowel and its expressiveness is much lower. So in VVCV and VCVV utterances, we propose to treat the semivowel as a virtual consonant ($C_V$), whose maximum expressiveness values are equal to the maximum expressiveness of the semivowel. This transcription makes it possible to use the Ohman coarticulation control model for diphthongs in VVCV and VCVV utterances. For instance: syllable [...], where V is a semivowel. When the unstressed vowel of the diphthong is the last phoneme of the word and its visual appearance is very important for speech understanding, the expressiveness coefficient for the viseme of this semivowel is equal to 2/3 of the vowel's maximum expressiveness.
Finally, the two white rules in Fig. 1 describe the situation in which a stressed diphthong is in the last syllable of a word that ends in a consonant. The Ohman model is not suitable here, so we propose to apply linear interpolation between the expressiveness coefficients of these phonemes.
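The virtual consonant transcription can be sketched in a few lines of Python. This is an illustrative sketch only: the pattern encoding as a list of "V"/"C" labels, the label "Cv" for the virtual consonant, and the function names are hypothetical. The general maximum expressiveness of the semivowel is passed in as a parameter (`semivowel_max`) because its value is not fixed here; only the 2/3 ratio for a word-final semivowel is taken from the text.

```python
from typing import List

# From the text: a word-final semivowel gets 2/3 of the vowel's maximum expressiveness.
WORD_FINAL_RATIO = 2.0 / 3.0


def to_virtual_consonant(pattern: List[str], semivowel_index: int) -> List[str]:
    """Return a copy of a V/C pattern with the semivowel relabelled as 'Cv'.

    'Cv' is a hypothetical label for the virtual consonant.
    """
    if pattern[semivowel_index] != "V":
        raise ValueError("the semivowel must be one of the vowels in the pattern")
    relabelled = list(pattern)
    relabelled[semivowel_index] = "Cv"
    return relabelled


def semivowel_peak(vowel_max: float, semivowel_max: float, word_final: bool) -> float:
    """Maximum expressiveness assigned to the semivowel viseme.

    `semivowel_max` is an assumed input; only the word-final ratio comes from the text.
    """
    return WORD_FINAL_RATIO * vowel_max if word_final else semivowel_max


# Falling diphthong at the start of a VVCV utterance: the second vowel is the semivowel.
print(to_virtual_consonant(list("VVCV"), 1))
```

After relabelling, the VVCV pattern has the same shape as a VCCV utterance, which is the structure the Ohman model is designed for.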
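For the white rules, linear interpolation between two expressiveness coefficients could look like the following minimal sketch. The per-frame sampling, the function name `interpolate_expressiveness` and the example coefficient values are assumptions for illustration and are not taken from the animation engine in [2].

```python
from typing import List


def interpolate_expressiveness(start: float, end: float, frames: int) -> List[float]:
    """Expressiveness per frame, moving linearly from `start` to `end` (both included)."""
    if frames < 2:
        raise ValueError("need at least two frames")
    step = (end - start) / (frames - 1)
    return [start + step * i for i in range(frames)]


# Hypothetical coefficients: stressed vowel at 1.0, following phoneme at 0.5, five frames
transition = interpolate_expressiveness(1.0, 0.5, 5)
print(transition)
```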
To estimate the quality of our proposed Ohman and Rule-based control model, we compared viseme expressiveness before and after application of the model. The results of this experiment are shown in Fig. 2. They confirm that the expressiveness parameters of visemes defined by the proposed coarticulation model correspond much more closely to the coarticulation of the Lithuanian word "juodas", which includes a pure Lithuanian diphthong.

V. CONCLUSIONS
Since all humans are experts in lip reading and detect even the slightest errors in speech animation, expressive speech with integrated coarticulation rules is a crucial part of any speech animation system. Because pure Lithuanian diphthongs regularly appear in Lithuanian words, their accurate visualization considerably improves the clarity of animated speech.
Eight rules for pure Lithuanian diphthong animation were defined in this paper. Six of them employ the adapted Ohman coarticulation control model to define the expressiveness parameters of visemes, and the remaining two use linear interpolation between the expressiveness parameters of visemes. The proposed integration of the Ohman and Rule-based coarticulation models was applied in a rule-based Lithuanian speech animation engine [2]. Comparison of viseme expressiveness before and after integration of the proposed rules showed that the intelligibility of animated Lithuanian words with pure Lithuanian diphthongs noticeably increased.
Since there is no intervocalic coarticulation in the Ohman model, the blending function $b_j(t)$ can also be applied to consonants that are between vowels.

Fig. 1. Framework for pure Lithuanian diphthong animation. Dotted and greyed rules define the situations in which the Ohman coarticulation control model can be applied; white rules define the situations in which it cannot.

Fig. 2. Expressiveness parameter before and after application of our proposed model.
The Ohman model defines coarticulation between two vowels. The vowel track $v(t)$ is formed by interpolation between fully expressed vowel targets (visemes). So, this model can be applied to track parameters of the visemes that appear in phonetic structures [...]. Each consonant is specified by a target value $c$, a coarticulation factor [...]

TABLE I. SUMMARY OF INTELLIGIBILITY TEST OF VISUAL SPEECH SYNTHESIS CONTROL MODELS [3].