Tuning and ongoing acoustic model training have become the accepted norm for masking underlying phoneme error rates and the poor word and phrase recognition that results from them. Because people speak in such diverse ways, out-of-the-box speech recognition often struggles to recognize the speech of many members of the target audience.
Tuning means teaching the ASR to recognize the words or phrases it previously failed to recognize. It can involve both modifications to the language model (e.g., adding real names and words to the dictionary) and adaptation of the acoustic model through the collection and transcription of additional training audio.
Tuning and training for each domain-specific VUI is time-consuming and often very expensive. It can also seem never-ending: each time the data is analyzed and further tuning is carried out, application usability improves only by a small increment. The cost of application and acoustic tuning can become a significant yearly expense. Remember that attractive license pricing may be only a small part of the cost of ownership of an ASR system.
Most ASR vendors agree with and promote the industry recommendation that 40% to 50% of the total deployment cost of an ASR system should be spent on tuning.
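To make that recommendation concrete, the arithmetic can be sketched as follows. The function name and the cost figures are hypothetical and chosen only for illustration; the sketch assumes that "total deployment cost" means tuning plus everything else (licenses, integration, hardware).

```python
def tuning_budget(non_tuning_cost: float, tuning_share: float) -> float:
    """Tuning spend implied when tuning is a given share of TOTAL deployment cost.

    If tuning = share * total and total = non_tuning_cost + tuning,
    then tuning = non_tuning_cost * share / (1 - share).
    """
    if not 0 <= tuning_share < 1:
        raise ValueError("tuning_share must be in the range [0, 1)")
    return non_tuning_cost * tuning_share / (1 - tuning_share)


# Hypothetical figure: 100,000 spent on everything except tuning.
for share in (0.40, 0.50):
    print(f"{share:.0%} share -> tuning budget {tuning_budget(100_000, share):,.0f}")
```

Under these assumptions, a 40% share implies a tuning budget of about 66,667 on top of the 100,000, and a 50% share implies a tuning budget equal to all other costs combined.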
When selecting an ASR vendor, Verbyx recommends that you ask specifically what is covered by your initial purchase. If you are concerned, insist on performance-based payment terms and reasonable support costs. We also recommend that you question the features of the underlying technology. For example: How quickly can you adapt our acoustic models? How much training data is required for this process? Can we edit our own dictionaries? What features does your software provide to mitigate the need for ongoing tuning? What methods do you employ for tracking performance and performance improvements?
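As an illustration of the last question, one widely used way to track recognition performance is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The sketch below is a minimal, assumed implementation and not any particular vendor's method; the function name and sample transcripts are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word and one dropped word out of seven.
reference = "call john smith at extension four two"
hypothesis = "call jon smith at extension two"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Computing this figure on the same held-out set of transcribed calls before and after each tuning round gives a like-for-like measure of whether the tuning effort is actually paying off.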