Web Services Based Hybrid Recognizer of Lithuanian Voice Commands

This paper presents a recently developed medical-pharmaceutical information system with a voice user interface. It is the first computerized system oriented towards healthcare services and industry in which Lithuanian voice commands are the primary means of control. Another essential property of the system is its hybrid nature: two different recognizers, an adapted commercial Spanish speech recognizer available from Microsoft and a locally developed HMM speech recognizer based on Lithuanian acoustic models, operate in parallel. The recognition hypotheses produced by the two recognizers are combined using logical rules obtained with decision rule induction algorithms such as Ripper. Together these measures made it possible to achieve very high speaker-independent voice command recognition accuracy, acceptable for practical deployment of the system. The best recognition accuracy achieved was 98.9 % for 1000 Lithuanian voice commands. The paper discusses optimization issues related to the development of the system. DOI: http://dx.doi.org/10.5755/j01.eee.20.9.8713


I. INTRODUCTION
During the last 15 years speech technologies (recognition, synthesis, speaker identification) have become an integral part of industrial applications of information technology in various areas of human activity. It is widely accepted that the healthcare industry and services are among the areas where speech recognition is gaining the strongest position. The main rationale for applying voice processing in healthcare is the desire to save the working time that highly qualified medical personnel spend on routine documentation, to speed up and ease information search and presentation, and in some cases to let healthcare practitioners concentrate on more important and urgent tasks. But the introduction of speech recognition into healthcare services has more far-reaching consequences and greater benefits than simply helping practitioners to document or find information. It is acknowledged that speech recognition has five main benefits for the healthcare industry: reusability, flexibility, fewer medical errors, productivity and the so-called "one and done" effect [1]. Reusability means that speech recognition encourages the production of structured text, which is better suited for further processing in other applications and is often immediately available, e.g. for billing. Flexibility means that voice recognition is not tied to a single device and is well suited to cloud implementation, so a physician may use the same recognition profile from any device and any place. From a reimbursement standpoint, the closer an organization brings delivery of care and recording of care together, the more accurate the information is. Fewer medical errors means that speech recognition systems produce structured text, often with a strictly predefined vocabulary, which is more easily transferred to another healthcare institution and more likely to be understood exactly there (historically, medical documentation began when the need arose to transfer a patient from one doctor to another). Productivity means that physicians using EMR (electronic medical records) are able to serve more patients (some studies have shown up to one third more patients [1]). The "one and done" principle means that with speech recognition a physician can independently perform tasks that would otherwise require the help of other people (transcriptionists, secretaries, etc.): physicians can dictate, see what they said, make small corrections, sign the documentation and send it off, avoiding multiple steps and the people involved in them.
Manuscript received February 24, 2014; accepted June 10, 2014.
Worldwide, it is estimated that about half a million healthcare practitioners have the opportunity to use speech technologies in their daily activities [2]. A well-known example of speech recognition in medical practice is Providence Health and Services, one of the biggest healthcare service providers in North America, which installed the voice-controlled electronic health records management system Epic in 27 hospitals and more than 250 clinics. It is stated that this system is used daily by more than 8000 practising physicians. The financial benefits of voice recognition are well illustrated by the fact that Dragon Dictate software saved the Norman Regional Health System, a single hospital centre based in Oklahoma City, $1.8 million in transcription services in 2013 alone [3]. According to a recent UN study [4], many healthcare stakeholders see the number of people with access to cell phones, which the study estimates at around 6 billion, as a major opportunity to increase access to healthcare worldwide. In many places of the world it would otherwise be difficult to improve access to healthcare services.
In any case, the main prerequisite for implementing such systems in Lithuania is a speech engine that recognizes Lithuanian medical terms and other voice commands and phrases with sufficient reliability. Obviously, from the user's perspective, the more universal the system and the more flexible its vocabulary, the better [5]. But to arrive at a practically implementable system that is useful for healthcare practitioners, some vocabulary restrictions have to be applied. The developed Lithuanian medical information system is able to recognize the names of the most common diseases, the most popular drug names and the complaints most often met in medical practice. The total number of voice commands implemented in the system is about 1000, which is probably the largest number of Lithuanian voice commands implemented in a practically oriented Lithuanian speech recognition system. Because of the lack of available speech resources, many Lithuanian speech recognition studies, including some recent ones (e.g. [6]), are still devoted to speaker-dependent recognition.
To achieve the accuracy goals, we decided to develop a hybrid speech recognition system for Lithuanian medical and pharmaceutical terms. By a hybrid system we mean the use of several different speech recognizers whose hypotheses are combined to derive a final decision [7].
Our earlier experience suggested using the Microsoft Spanish speech engine as the foreign-language engine to be adapted and used as one of the recognizers [8]. A CD-HMM based speech recognizer was chosen as the proprietary Lithuanian speech recognizer, since this type of model has been shown to be the most efficient one in a wide variety of applications. In [9] we showed that such an approach has the potential to reach the goals, since the errors produced by the different recognizers are not correlated and different approaches may be used to improve overall performance. This paper is devoted to various system optimization issues and the final design.
The rest of this paper is organized as follows. Chapter II presents the methodology used to design and evaluate the speech recognition system for Lithuanian healthcare institutions. Chapter III presents the speech corpora used to train the system and optimize its performance. Chapters IV and V present experimental evaluation results and the system optimization issues that made it possible to achieve the highest recognition accuracy. Finally, some conclusions are presented.

II. ALGORITHMS AND APPROACHES
This chapter introduces the methods and principles used in developing the recognition system. As mentioned earlier, the hybrid system uses two different recognizers: a commercial Spanish speech recognizer adapted to recognize Lithuanian voice commands and a CD-HMM based proprietary Lithuanian recognizer. Adaptation of the foreign-language recognizer is based on the principles of multilingual recognition [10]. These approaches can be summarized as an expectation that the phonetic properties of one language (usually the less widely used one) can largely be described by the acoustic-phonetic models of another language. The aim of the training is to find the best correspondence between the phonetic models of the different languages. This adaptation procedure is usually called mapping and has been investigated in many studies [11]-[13]. The research showed that in many cases this method achieves sufficiently high recognition accuracy. In our case the main problem of adaptation is to find the best phonetic sequences (transcriptions) for writing Lithuanian voice commands. Our previous work showed that using a Spanish synthesizer, or automatically generating a wide set of possible transcription sequences and evaluating their efficiency experimentally, may be useful.
The proprietary Lithuanian speech recognizer was built using continuous density hidden Markov models (CD-HMM) [14]. This model has proved to be the most reliable and most widely used approach today for designing speech recognition engines for a very wide set of applications. The essential problem is obtaining the speech corpora needed to train CD-HMM models. The absence of sufficiently large amounts of speech data is the main obstacle in developing recognizers for languages such as Lithuanian. Naturally, a proprietary Lithuanian recognizer has the greater potential to achieve the highest recognition accuracy, provided that enough training data is available.

III. SPEECH CORPORA
To train and evaluate the medical-pharmaceutical Lithuanian voice command recognition system, a special speech corpus, MEDIC, was developed [9]. In the first stage, recordings of 631 voice commands were collected: 217 names of the most common diseases, 208 of the most frequent complaints and 206 of the most often used drugs. This command set consisted of 1114 separate words, 777 of which were unique. Later, recordings of 100 more drug names were collected, bringing the total number of drug names to 306 and the total number of recorded voice commands to 731. Each voice command was recorded 20 times by 12 different speakers (5 females and 7 males). The total duration of the recordings was about 100 hours. Finally, the list of voice commands was supplemented with the names of 83 diseases, 92 complaints and 94 drugs without new recordings. The total list of voice commands that the speech recognizer is trained on and able to recognize thus contains 1000 commands.

IV. OPTIMIZATION OF SPANISH RECOGNIZER
In [9] we presented basic results showing that the adapted Spanish recognizer and the proprietary Lithuanian recognizer produced uncorrelated results. The recognition accuracy of the adapted Spanish recognizer was not high. This chapter presents some issues in the optimization of the Spanish recognizer. Several transcription selection criteria were evaluated: a) synthesizing the transcribed Lithuanian voice commands with a Spanish synthesizer and selecting the transcription that gives the best-sounding realization; b) empirical selection of the best transcriptions; c) intuitive selection. In case a) the investigator writes the Lithuanian command using Spanish grammar rules and feeds the transcribed command to the synthesizer. If the synthesized command sounds natural enough, the transcription is used for recognition. The empirical selection relies on the fact that foreign recognizers come with lexicon design tools. For example, for the Lithuanian command "viduriavimas" the tool provided the following set of possible Spanish transcriptions (called PRON transcriptions):

B I D U DX J A B I M A S
B I D U DX J A B J M A S
B J D U DX J A B I M A S
B I D W DX J A B I M A S
B I D U DX I A B I M A S
The intuitive transcription selection method is based on analogies with other commands or on other heuristics. Selecting the optimal transcriptions is very time-consuming. As an example, consider the optimization of the command "nemiga". The initial transcription achieved an accuracy of only 28.7 %. The lexicon design tool generated 6 transcription candidates, and 6 more transcriptions were obtained using the intuitive approach (among them nemiga, niamiga and niamika). Analysis of all 14 transcriptions yielded a template that achieved the highest accuracy for this command, 79.6 %. Table I shows the recognition accuracy of some commands. These results lead to the conclusion that for medium-size vocabulary (more than 500 commands) recognition tasks there is no universal method for finding the best transcriptions: many possible candidates have to be evaluated, and a good result is not guaranteed. But with proper use of lexical and grammatical constraints, recognition of a limited vocabulary can reach a relatively good accuracy level.
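The empirical selection of transcriptions can be illustrated with a minimal sketch: each candidate transcription is scored by its measured recognition accuracy and the best one is kept. This is not the procedure actually implemented by the lexicon design tool; the function name `select_best_transcription` and the `accuracy_of` callback (a stand-in for running the adapted recognizer on recordings of the command) are assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple


def select_best_transcription(
    candidates: Iterable[str],
    accuracy_of: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the candidate transcription with the highest measured accuracy.

    `accuracy_of` is a hypothetical callback that runs the recognizer with the
    given transcription on held-out recordings and returns its accuracy.
    """
    # Score every candidate once; recognition runs are the expensive part.
    scored = [(accuracy_of(candidate), candidate) for candidate in candidates]
    if not scored:
        raise ValueError("no candidate transcriptions supplied")
    # On ties, max() keeps the first candidate that reached the top score.
    best_accuracy, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_accuracy
```

The cost of this search grows with the number of candidates per command multiplied by the number of commands, which is consistent with the observation that transcription selection is very time-consuming for vocabularies of several hundred commands.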

V. TRAINING AND EVALUATION OF PROPRIETARY LITHUANIAN RECOGNIZER AND REALIZATION OF RECOGNITION SYSTEM
The proprietary Lithuanian speech recognizer is based on the CD-HMM model. Its basic version uses triphones as the basic speech unit for modelling the acoustic events that occur in a speech utterance. Gaussian mixtures are used to model the probabilities of particular acoustic events. Acoustic properties are described using MFCC features. The Viterbi search algorithm is used as the basis of the decoding procedure to find the most likely sequence of acoustic events.
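The Viterbi search mentioned above can be sketched on a toy model. The sketch is deliberately simplified and is not the decoder of the described recognizer: it uses discrete observation symbols and dictionary lookups, whereas a CD-HMM recognizer would compute emission scores from Gaussian mixtures over MFCC vectors and decode over triphone state networks. All names and the data layout are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: Dict[str, Dict[str, float]],
    log_emit: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most likely state sequence, working in the log domain."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back: List[Dict[str, str]] = []
    for obs in observations[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Working with log-probabilities avoids numerical underflow on long utterances, which is the usual reason decoders sum log scores instead of multiplying probabilities.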
The first acoustic models were obtained using previously collected speech corpora (about 35 hours of recordings) whose content was not related to medical-pharmaceutical topics. Later the models were retrained using the specially designed speech corpus described above. Data from the MEDIC corpus was also used to evaluate the efficiency of the recognizer, but in order to model the speaker-independent mode the recordings of the tested speaker were not used for training. Similar experiments were carried out with the 1000-command set, which includes voice commands that were not recorded for the corpus. In this case the overall recognition accuracy for 1000 commands was 98.83 %. This is a minor decrease compared with the case when only 731 commands were used. It is also important that a similar slight decrease in accuracy was observed for each speaker used in testing, and there were no cases in which a particular speaker suffered significant performance degradation.
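A speaker-independent evaluation of this kind is commonly organized as a leave-one-speaker-out split: each speaker in turn is held out for testing while the models are trained on the remaining speakers, and the per-speaker accuracies are then averaged. The sketch below assumes a hypothetical record layout of (speaker id, command) pairs and a hypothetical `train_and_score` callback; it illustrates the protocol only and is not the evaluation code used for the reported results.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical record layout: (speaker_id, command_label).
Utterance = Tuple[str, str]
Fold = Tuple[str, List[Utterance], List[Utterance]]


def leave_one_speaker_out(utterances: Sequence[Utterance]) -> List[Fold]:
    """Build one (held_out_speaker, train, test) fold per speaker."""
    speakers = sorted({speaker for speaker, _ in utterances})
    folds: List[Fold] = []
    for held_out in speakers:
        # The held-out speaker never appears in the training part.
        train = [u for u in utterances if u[0] != held_out]
        test = [u for u in utterances if u[0] == held_out]
        folds.append((held_out, train, test))
    return folds


def averaged_accuracy(
    utterances: Sequence[Utterance],
    train_and_score: Callable[[List[Utterance], List[Utterance]], float],
) -> float:
    """Average the per-speaker accuracies over all folds."""
    return mean(
        train_and_score(train, test)
        for _, train, test in leave_one_speaker_out(utterances)
    )
```

With 12 speakers this yields 12 folds, each trained on 11 speakers and tested on the remaining one.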
Finally, both recognition approaches were combined into a single hybrid recognizer. To make the final decision, the rule induction algorithm Ripper was applied [15], [16]. A detailed description of the combination approach and of the rules used to combine the outputs of the two recognizers can be found in [17]. Each object in the training set (a set of recognizer output parameters) was described using seventy features. Among these features are the confidence of the result provided by the adapted Spanish (SP) recognizer, the average log probability of the Lithuanian (LT) recognizer hypothesis, the proportion and likelihood of all sounds present in the hypotheses produced by both recognizers, and some other parameters (such as gender probability, silence probability at the start and the end of the utterance, etc.). The efficiency of the hybrid decision rule was evaluated using a standard cross-validation procedure: data from 11 speakers were used to derive the rule, while the 12th speaker was used to check its efficiency. The results of all speakers were then averaged. The experiments showed that the Ripper rule operates with 97.85 % ± 2.30 % accuracy. Since the logical rule is invoked only when the outputs of the recognizers differ, we can conclude that the overall accuracy of the hybrid recognizer should be 98.92 %. Such accuracy for a 1000-command speaker-independent task can be regarded as very high and has never before been achieved for Lithuanian speaker-independent recognition tasks with vocabularies of similar size.
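The combination logic described above can be sketched as follows: when the two recognizers agree, their common hypothesis is accepted, and only when they disagree is a learned rule consulted to choose between them. The actual rules are given in [17]; the rule `toy_rule`, its feature names (`sp_confidence`, `lt_avg_log_prob`) and its thresholds are invented here purely for illustration and do not reproduce the induced Ripper rules.

```python
from typing import Callable, Mapping

Features = Mapping[str, float]


def hybrid_decision(
    sp_hypothesis: str,
    lt_hypothesis: str,
    features: Features,
    prefer_lt: Callable[[Features], bool],
) -> str:
    """Combine the hypotheses of the SP and LT recognizers."""
    # The rule is invoked only when the recognizers disagree.
    if sp_hypothesis == lt_hypothesis:
        return sp_hypothesis
    return lt_hypothesis if prefer_lt(features) else sp_hypothesis


def toy_rule(features: Features) -> bool:
    """Hypothetical stand-in for an induced rule list (made-up thresholds)."""
    # Trust the SP recognizer when it is very confident.
    if features.get("sp_confidence", 0.0) >= 0.9:
        return False
    # Otherwise prefer the LT hypothesis if its average log probability is high enough.
    return features.get("lt_avg_log_prob", float("-inf")) > -60.0
```

Because the rule only arbitrates disagreements, its errors affect just the fraction of utterances on which the recognizers differ, which is why the overall accuracy of the hybrid recognizer can exceed the accuracy of the rule itself.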
Further we will provide some insights into practical

TABLE I. RECOGNITION ACCURACY OF SEVERAL VOICE COMMANDS USING ADAPTED SPANISH RECOGNIZER BEFORE AND AFTER OPTIMIZATION.
Table II shows the recognition accuracy for the 731 commands from the speech corpus, averaged over all speakers.