The Acoustic Model
An acoustic model is used in automatic speech recognition (ASR) to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. A well-trained acoustic model (often called a voice model) is critical to the performance of an ASR-enabled application.
Analog to Digital Conversion
You may be familiar with the visual representation of spoken audio known as a spectrogram.
The spectrogram is a visual representation of a frequency spectrum: a deconstruction of the frequencies that make up the source audio. It is depicted in segments called frames. Each frame is approximately 25 ms long and contains not only the frequency information but also the energy (volume) in the signal. This energy level is represented in the spectrogram by color.
Before the information contained in the audio and shown in the spectrogram can be used by an ASR, it must be converted into a digital representation. This digital representation is in the form of a large group of numbers. These numbers are the basis of an acoustic model.
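To make the idea of frames and "a large group of numbers" concrete, here is a minimal sketch in Python. It assumes a 16 kHz sample rate and uses a synthetic 1 kHz tone in place of recorded speech; the function names are illustrative only. Production systems typically use overlapping, windowed frames and an optimized FFT rather than the naive transform shown here.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # samples per second (an assumption, not from the article)
FRAME_MS = 25         # frame length described in the text


def make_tone(freq_hz, seconds, amplitude=0.5):
    """Synthesize a sine wave as a stand-in for recorded speech."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def split_into_frames(samples, frame_len):
    """Cut the signal into consecutive, non-overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def frame_energy(frame):
    """Mean squared amplitude: a simple measure of loudness."""
    return sum(x * x for x in frame) / len(frame)


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform: one magnitude per frequency bin."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


frame_len = SAMPLE_RATE * FRAME_MS // 1000   # samples per 25 ms frame
signal = make_tone(1000, 0.1)                # 0.1 s of a 1 kHz tone
frames = split_into_frames(signal, frame_len)

spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / frame_len  # strongest frequency in the frame

print(len(frames), frame_energy(frames[0]), peak_hz)
```

Each frame is reduced to a list of magnitudes plus an energy value. Numbers of this kind, computed frame by frame, are what a spectrogram draws and what an acoustic model is built from.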
Acoustic Model Training
Acoustic models are trained by taking audio recordings of speech and their text transcriptions and creating statistical representations of the sounds that make up each word. As a simple rule, a model trained with hundreds, and sometimes thousands, of hours of transcribed audio will perform better than a model trained with limited audio. Quantity of data is important, but the data should also cover a broad distribution of sounds.
Distribution describes how evenly the audio data is spread over each sound unit. Speech recognition will not work well if the acoustic model training data contains only a portion of the roughly 40 phonemes in the English language. The model may also be poor if some phonemes have far more training data than others. Without delving into the complexities of acoustic models, we can say that a model performs best when used by speakers who sound similar to the speakers in the recorded training audio. This is why vendors often refer to different models even for the same language, e.g., English. Because of the variation in how the global population speaks English, acoustic models are created for regional variations, e.g., US English, Australian English, UK English.
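The distribution check described above can be sketched as a simple count of how often each phoneme appears in the training transcripts. The phoneme inventory and utterances below are hypothetical and heavily shortened (ARPAbet-style symbols are assumed for illustration); a real check would run over the full phoneme set and the complete training corpus.

```python
from collections import Counter

# Hypothetical, shortened phoneme inventory (illustration only).
PHONEME_INVENTORY = ["AA", "AE", "B", "D", "IY", "K", "S", "T"]

# Hypothetical training utterances, already converted to phoneme sequences.
training_utterances = [
    ["K", "AE", "T"],   # "cat"
    ["S", "AE", "T"],   # "sat"
    ["B", "AE", "T"],   # "bat"
    ["S", "IY", "D"],   # "seed"
]


def phoneme_counts(utterances):
    """Count how many times each phoneme occurs across all utterances."""
    return Counter(p for utterance in utterances for p in utterance)


def coverage_report(utterances, inventory, min_count=2):
    """Return (counts, missing, sparse) for the given phoneme inventory.

    missing: phonemes with no training examples at all.
    sparse:  phonemes seen at least once but fewer than min_count times.
    """
    counts = phoneme_counts(utterances)
    missing = [p for p in inventory if counts[p] == 0]
    sparse = [p for p in inventory if 0 < counts[p] < min_count]
    return counts, missing, sparse


counts, missing, sparse = coverage_report(training_utterances, PHONEME_INVENTORY)
print("missing:", missing)
print("sparse:", sparse)
```

In this toy data set the vowel AE is well represented while AA never appears, which is the kind of imbalance that additional, targeted training audio is meant to correct.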
Why Should You Care About Acoustic Models?
If written correctly, speech recognition requirements will initially protect you from added and unexpected costs. If ASR accuracy results do not meet the prescribed standards, the vendor bears the cost of correction. Testing for accuracy is likely limited to a relatively small set of speakers using a limited set of supported phrases. After system acceptance, the speech system will be exposed to a more challenging environment that will highlight areas of unacceptable performance. This will likely result from the discovery of new phrases that were not previously considered or from the addition of a user who has a difficult accent or speaking style. An earlier post, Defining Supported Phrases, suggested how to limit your cost exposure in the area of supported phrases (the language model). However, accuracy is also affected by acoustic model performance, which can be more troublesome to plan for.
Speaker Dependent and Independent Acoustic Models
In speech recognition, there is the concept of speaker-dependent and speaker-independent acoustic models. A speaker-dependent model uses only audio spoken by the intended user. For many applications, speaker-dependent models are impractical: the work required to gather, prepare, and process the data for each speaker model is costly.
It is more common to see systems that use speaker-independent models. However, it is not uncommon for the resulting system to perform poorly for speakers whose speech characteristics are not well represented by the acoustic model. A strong Scottish accent will not be well recognized by an ASR using a US English acoustic model. Nor will a person with a strong US southern accent achieve satisfactory results with a model developed using audio only from speakers in New York. It is not unusual for female and male speakers with similar regional accents to achieve different results simply because the characteristics of the male and female voice differ.
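One way to make these speaker differences visible is to score recognition results separately for each speaker. The article does not name a metric; the sketch below assumes word error rate (WER), a commonly used measure, computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The speakers and phrases are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical (reference, recognizer output) pairs per speaker.
results_by_speaker = {
    "speaker_a": [("turn left heading two seven zero",
                   "turn left heading two seven zero")],
    "speaker_b": [("turn left heading two seven zero",
                   "turn left wedding two seven hero")],
}

wer_by_speaker = {
    speaker: sum(word_error_rate(ref, hyp) for ref, hyp in pairs) / len(pairs)
    for speaker, pairs in results_by_speaker.items()
}
print(wer_by_speaker)
```

Reporting accuracy per speaker, rather than as a single average, shows whether an acceptable overall figure is hiding one or two users for whom the system barely works.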
In an application that is intended to be used by many speakers over a long period, speaker-related performance issues can be quite common. When they arise, what means are available to you to improve or resolve the situation?
Acoustic Model Improvements
There are a number of ways to improve the acoustic model performance. A description of each method is not useful for this discussion. The most common methods usually relate to some form of model retraining with the provision of additional training data. For troublesome speakers, transcribed audio from just that speaker can be used to make significant improvements. Audio that addresses the distribution issue described earlier will also help. The methods are not important, but who can make those improvements is.
Questions to Ask Your Vendor
The long-term success of your proposed system or application purchase can, and likely will, hinge on the choice of ASR technology. The following questions will help you choose the right vendor.
- Do I have access to tools and processes that can be used by my staff to address acoustic model performance?
- If yes, what tools are available and what level of expertise is required to use them effectively?
- Is there a cost associated with access to this capability?
If you are required to have the vendor make improvements, the following questions should be asked.
- Do you, as the vendor, provide an audio transcription service, and what does this service cost? It is estimated that each hour of audio can require as much as 20 hours to transcribe. The cost of transcription can be a material part of the service.
- What subject matter expertise is used to provide transcripts? Many domains contain specialized terminology that is difficult for a transcriber to understand without subject matter experience. Accurate transcriptions are very important to achieve the best results.
- What methods are available for acoustic model improvements?
- What are the pros and cons of each method? When should each be used?
- What are the costs and conditions associated with each method?
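The transcription estimate in the questions above lends itself to a quick back-of-the-envelope calculation. The sketch below uses the article's upper figure of 20 hours of work per hour of audio; the hourly rate and the amount of audio are hypothetical placeholders to be replaced with your vendor's actual numbers.

```python
def transcription_effort_hours(audio_hours, hours_per_audio_hour=20):
    """Labor hours needed to transcribe the given amount of audio.

    The default of 20 is the upper estimate quoted in the article.
    """
    return audio_hours * hours_per_audio_hour


def transcription_cost(audio_hours, hourly_rate, hours_per_audio_hour=20):
    """Total cost, given a (hypothetical) hourly rate for a transcriber."""
    effort = transcription_effort_hours(audio_hours, hours_per_audio_hour)
    return effort * hourly_rate


# Example: 10 hours of audio at a hypothetical rate of 30.0 per hour.
effort = transcription_effort_hours(10)
cost = transcription_cost(10, hourly_rate=30.0)
print(effort, cost)
```

Even a modest 10 hours of additional training audio can imply around 200 hours of transcription work, which is why the cost and the transcriber's subject matter expertise are worth settling before purchase.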
Who Developed the Speech Technology?
It is most important to establish whether the application vendor develops its own speech recognition product. Very often the answer is no, and the application vendor is a reseller of ASR technology.
If your application vendor is a reseller, tread carefully if acoustic model modification is important to you. There are not many ASR providers who sell suitable technology for the domains described in these articles. The larger and better-known speech recognition companies are not known for responsiveness and flexibility. Does the company make most of its revenue by selling tens of thousands of licenses into a particular industry? Such companies have little to no incentive to help if you use only a handful of licenses. In addition, the speech company’s commercial off-the-shelf (COTS) acoustic model may not be suited to your application and domain.
The acoustic model is a very important component of the recognition process. If you have concerns that the difficulties described in this article may apply to you, well-written requirements are paramount. The information shared here will help you write requirements that will minimize unexpected costs later in your program. Given the importance of acoustic models, a responsive vendor of speech recognition technology that provides access to tools and services at a reasonable cost would seem the obvious choice.
Of course, the ASR technology is not necessarily the driving factor in choosing a command and control or simulation application. But what can you do if you want a particular system or application product and the answers to your ASR questions fall short? It is reasonable to insist that any ASR offered be standards-based. If the application or system integrates with a standards-based ASR, you may be able to insist that the ASR component be replaced with one that is better suited to your needs. If you are not being offered a standards-based ASR, you may need to question the longer-term costs and viability of the product you are being offered.