Speech Recognition Accuracy Doesn’t Matter!

Error Rate

Vendor Accuracy is Meaningless!

Why say that, when accuracy would seem all-important? Let me clarify.

Of course, speech recognition accuracy is an important metric. There are several measures of ASR performance, and your ASR-enabled application dictates which measure best predicts usability. However, the ASR accuracy figures quoted by vendors are mostly meaningless.

ASR Accuracy is a Distraction

Quoted ASR accuracy should not distract you when making buying decisions. The figures are unique to the circumstances of the vendor’s test, and the accuracy your ASR system achieves will not match the numbers the vendor provides.

Factors Affecting ASR Accuracy

Many factors affect speech recognition accuracy. A meaningful comparison of accuracy scores requires that each ASR be tested under nearly identical conditions. For vendor-quoted figures, that is almost certainly not the case.

The following key factors will affect ASR accuracy results:
1. The test environment, particularly background noise. For instance, air conditioning can have a major impact during live testing.
2. The complexity and size of the language model and dictionary affect test performance.
3. A test platform with large CPU and memory resources will outperform a lower-specified computer.
4. Accuracy and processing speed are closely linked. Changing the ASR configuration can improve accuracy, but the improvement may come at the cost of reduced speed. Consequently, you must compare processing time as well. If processing times are too long for your application, a high accuracy score is meaningless.
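One way to keep processing time in view is to report a real-time factor (RTF) next to each accuracy score. RTF is processing time divided by audio duration, so values below 1.0 mean the recognizer runs faster than real time. The minimal Python sketch below assumes you have already measured accuracy and processing time for each configuration. The configuration names and figures are invented placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One batch test run with a given ASR configuration (hypothetical values)."""
    config: str
    accuracy: float      # e.g. 1 - WER, as a fraction
    processing_s: float  # wall-clock seconds the recognizer needed
    audio_s: float       # total duration of the test audio in seconds

    @property
    def rtf(self) -> float:
        """Real-time factor: processing time divided by audio duration."""
        return self.processing_s / self.audio_s


def usable(runs, max_rtf):
    """Keep only configurations fast enough for the application, best accuracy first."""
    fast_enough = [r for r in runs if r.rtf <= max_rtf]
    return sorted(fast_enough, key=lambda r: r.accuracy, reverse=True)


# Hypothetical figures for illustration only.
runs = [
    RunResult("fast", accuracy=0.90, processing_s=30.0, audio_s=600.0),
    RunResult("accurate", accuracy=0.94, processing_s=900.0, audio_s=600.0),
]
for r in usable(runs, max_rtf=1.0):
    print(f"{r.config}: accuracy={r.accuracy:.2f}, RTF={r.rtf:.2f}")
```

In this made-up example, the "accurate" configuration scores higher but needs 1.5 seconds to process each second of audio. It is therefore filtered out for a live application that requires RTF of 1.0 or less.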

The Importance of Acoustic Models in ASR Accuracy

The acoustic model is arguably the biggest contributor to accuracy outside of the decoder. Does the format of the acoustic model used to produce the quoted accuracy scores suit your end-use environment? Acoustic models are particularly affected by:

  • Model resolution – Using an 8 kHz acoustic model will produce different results when tested with 8 kHz and 32 kHz audio.
  • Training data – Acoustic models are trained on audio that represents a particular market and domain. Vendors might test a model optimized for UK English against predominantly UK English speakers. Testing the same model against US English speakers will produce different scores.
  • Domain tuning – An acoustic model should be tuned for specific user domains. A banking call center acoustic model may not test well in a technical support call center. If the model has no previous knowledge of specialist words or phrases in the new domain, accuracy will suffer. For instance, the airline callsign Alitalia is not a word found in an English dictionary. Alitalia must be added to the dictionary to eliminate the error, and an adaptation to the acoustic model will probably be required as well.
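Before a test, check which words in your transcripts are missing from the recognizer's dictionary. Each one, like Alitalia, is a guaranteed error until it is added. The sketch below is a plain set comparison written in Python. The word list and transcript are made-up examples. A real ASR dictionary would have to be extracted from the vendor's own lexicon format, which varies by product and is assumed here to reduce to a simple list of words.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lower-case a transcript and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def out_of_vocabulary(transcripts, dictionary):
    """Return each transcript word that the dictionary lacks, with how often it occurs."""
    vocab = {w.lower() for w in dictionary}
    missing = {}
    for line in transcripts:
        for word in tokenize(line):
            if word not in vocab:
                missing[word] = missing.get(word, 0) + 1
    return missing


# Hypothetical dictionary and transcript for illustration only.
dictionary = ["climb", "and", "maintain", "flight", "level", "one", "two", "zero"]
transcripts = ["Alitalia one two climb and maintain flight level one two zero"]
print(out_of_vocabulary(transcripts, dictionary))
```

Running this on your full transcript set gives a short list of dictionary additions to raise with the vendor before any accuracy figures are compared.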

You now understand why the ASR accuracy measured in your user environment will not match the figures quoted by ASR vendors. There are several ways to limit your exposure to the risk of a poorly performing system. This Free Download provides a guide to writing requirements that mitigate risk.

What is Important? Planning!

Adding ASR to an application such as a training simulator is expensive. If the ASR project is carefully planned and well implemented, however, it is also money well spent. Your decision to use ASR must come only after extensive research and planning. Those activities should include a side-by-side comparison of ASR accuracy.

The comparison itself is quite straightforward: use batch recognition testing. The testing process is simple and requires relatively little time. You should be present for the testing process.

Speech Recognition Accuracy Testing

Preparing for speech recognition accuracy tests is a little more involved. You cannot control many of the factors described earlier, but you can prepare representative test data. The data should include recordings of people speaking the phrases typically used in your application, along with the associated text transcripts. Test results become more meaningful as the number of recordings and the variety of speakers increase.
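Batch test accuracy is usually reported as word error rate (WER). WER is the minimum number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The minimal Python sketch below uses a standard edit-distance table. It assumes whitespace-separated text that is already normalized. Real scoring tools also normalize case, punctuation, and numbers before comparing. The example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match when cost is 0)
            )
    return d[len(ref)][len(hyp)] / len(ref)


# Invented example: one substitution (alitalia -> italy) and one deletion (zero).
ref = "alitalia one two climb flight level three five zero"
hyp = "italy one two climb flight level three five"
print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Here two errors against nine reference words give a WER of about 0.222. Accuracy figures are often quoted as 1 minus WER, which is one more reason to confirm exactly how a vendor computed its numbers.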

You may be fortunate enough to have existing recordings from your application. The buyer should closely guard the test recordings, but transcripts can be shared with the vendor. Batch testing requires both audio and transcripts.
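Because every recording in a batch test needs its transcript, a quick consistency check before the test can save a wasted run. This Python sketch assumes a file convention that is not from the article: each recording `name.wav` sits next to a transcript `name.txt` in the same folder. It returns the complete pairs and flags anything unmatched.

```python
from pathlib import Path


def build_manifest(folder):
    """Pair name.wav with name.txt; report recordings or transcripts without a partner."""
    root = Path(folder)
    audio = {p.stem: p for p in root.glob("*.wav")}
    text = {p.stem: p for p in root.glob("*.txt")}
    # dict key views support set operations: & (both present) and - (only one present)
    pairs = [(audio[s], text[s]) for s in sorted(audio.keys() & text.keys())]
    missing_text = sorted(audio.keys() - text.keys())   # recordings lacking a transcript
    missing_audio = sorted(text.keys() - audio.keys())  # transcripts lacking a recording
    return pairs, missing_text, missing_audio
```

For example, `build_manifest("test_data")` (a hypothetical folder name) returns the complete pairs plus the file stems of any recordings or transcripts that lack a partner. You can then fix those before the data is used.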

The language model used in the test will be problematic for some of your target vendors. Demanding a batch recognition test will eliminate companies that have no existing expertise in your domain. Likewise, some companies may have no desire to invest in your industry or in you as a customer. Some companies will decide not to pursue the opportunity if batch testing is a requirement.

Yes, this comparison does seem like a lot of work. If you choose not to do side-by-side testing of the ASR, then you must be meticulous in defining the test and acceptance criteria. The vendor must clearly understand the contractual consequences of test failures.
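When the test and acceptance criteria go into a contract, they are easier to enforce if stated as measurable thresholds that both parties can check mechanically. The Python sketch below shows one hypothetical way to express them. The field names and threshold values are placeholders for figures from your own requirements, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    """Hypothetical contractual thresholds; replace with values from your requirements."""
    max_wer: float       # highest acceptable word error rate, as a fraction
    max_rtf: float       # highest acceptable real-time factor
    min_utterances: int  # smallest test set that counts as a valid test


def check(criteria, wer, rtf, utterances):
    """Return a list of failed criteria; an empty list means the test passed."""
    failures = []
    if utterances < criteria.min_utterances:
        failures.append(f"test set too small: {utterances} < {criteria.min_utterances}")
    if wer > criteria.max_wer:
        failures.append(f"WER {wer:.1%} exceeds limit {criteria.max_wer:.1%}")
    if rtf > criteria.max_rtf:
        failures.append(f"RTF {rtf:.2f} exceeds limit {criteria.max_rtf:.2f}")
    return failures


# Placeholder thresholds and results for illustration only.
criteria = AcceptanceCriteria(max_wer=0.10, max_rtf=1.0, min_utterances=500)
print(check(criteria, wer=0.12, rtf=0.4, utterances=800))
```

Returning every failed criterion, rather than stopping at the first, gives both parties a complete record of why a test did not pass.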

Buying ASR From the Application Vendor

In an earlier article, we discussed buying your ASR from your connected application vendor. If, after following that advice, you choose to do so, there are further benefits. Batch recognition testing is easier for application vendors, and an existing language model will likely be available for batch testing. A company that sells only speech recognition licenses will find it difficult to meet your requirements. A partnership with an experienced vendor is crucial to the successful deployment of an ASR-equipped application.

ASR Accuracy: The Real Questions – Future Costs

Given the cost of integrating ASR into your application, future-proofing your investment should be a major consideration. All too often, buyers do not budget sufficiently for ongoing improvements, support, and maintenance of an ASR. Understanding the future cost of ownership is critical.

What Answers Do I Need?

Seek to understand the key factors affecting ASR accuracy described earlier. You should also ask questions designed to reveal the cost of ownership. How can you, as the user, make improvements after delivery? Do you always need the vendor’s services to do so?

Focus your questions on the dictionary, the language and acoustic models, and the supporting tools:

1. Can I add words to the ASR dictionary?
2. Can I create or improve my language models?
3. Is the language model format based on W3C standards?
4. Is acoustic model adaptation by the buyer permitted?
5. Is the acoustic model format proprietary?
6. Can I use open-source acoustic models?
7. Are the acoustic model training tools proprietary?
8. Are tools included to assist in the analysis of test results?
9. Is a batch recognition test capability included?
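Questions 8 and 9 are easy to probe during evaluation. A useful analysis tool should, at a minimum, show which words are misrecognized most often, because those point to dictionary or model gaps. As a rough illustration of the output to look for, this Python sketch aligns each reference and hypothesis pair with the standard library's `difflib.SequenceMatcher` and tallies the reference words that were substituted or deleted. `SequenceMatcher` aligns by matching blocks rather than by a strict minimum edit distance, so its counts can differ slightly from a formal WER scorer. The example transcripts are invented.

```python
from collections import Counter
from difflib import SequenceMatcher


def missed_words(pairs):
    """Count reference words substituted or deleted across (reference, hypothesis) pairs."""
    misses = Counter()
    for reference, hypothesis in pairs:
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
            if tag in ("replace", "delete"):  # reference words the recognizer got wrong
                misses.update(ref[i1:i2])
    return misses


# Invented transcripts for illustration only.
pairs = [
    ("alitalia one two descend", "italy one two descend"),
    ("alitalia three four climb", "italia three four climb"),
    ("speedbird five turn left", "speedbird five turn"),
]
for word, count in missed_words(pairs).most_common(3):
    print(word, count)
```

A vendor tool that produces this kind of ranking, ideally with proper alignment and per-speaker breakdowns, makes post-delivery improvements far easier to target.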

Note that even with access to the tools to improve speech recognition accuracy, you may not have the resources or expertise to use them. Specialized experience is needed, so outside help might be necessary. Alternatively, the vendor may insist that you use their services. In both cases, it is paramount that you understand the potential cost of outside help and the expected response time for each task. Fair prices for services are of little value if the waiting time is unreasonable.

A Tale of Caution

I will leave you with a tale of caution that demonstrates the possible outcome of not having some control over key components of the system. A government customer purchased many ASR equipped training simulators. The experts (the vendor) estimated that just over 100 phrases would be sufficient to meet the end-user needs. The training system now has support for tens of thousands of phrases.

The customer regularly requests the addition of new phrases. The simulator has been in service for more than 20 years.


Vendor-quoted accuracy is meaningless. Your ability to make improvements to the system you choose, on the other hand, is anything but. Spend the time to understand the capabilities of the company you plan to do business with.

VBX blogger

