Speech Recognition Accuracy Measures

In this first post on speech recognition accuracy measures, we describe the common methods for measuring error and accuracy rates. In a follow-up post, we will discuss why those measurements might not be a practical way to describe your simulation ASR requirements.

Word Accuracy

The most commonly quoted measure of speech recognition accuracy is Word Accuracy (Wacc). Wacc expresses the number of correctly recognized words as a fraction of the total number of words spoken. It can be a useful measurement when validating language models. Word Accuracy is calculated by comparing the returned sentence with the known accurate transcription of the test audio.

Wacc is a poor measure of usability.
The Formula for Calculating Word Accuracy

Word Accuracy (Wacc) is then evaluated using the following formula:

Wacc = (W - Ws - Wi - Wd) / W × 100%

W indicates the total number of words to be recognized.

Ws is the number of words that were substituted (the result replaced a word in the transcription with another word).

Wi is the number of words inserted (a word was added that was not in the transcription).

Wd is the number of words that were deleted (a word that was in the transcription was missing from the result).
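The definitions above can be sketched in code. The following is a minimal illustration, not a reference scorer: it assumes Wacc = (W - Ws - Wi - Wd) / W expressed as a percentage, splits sentences on whitespace, ignores case, and aligns the result to the transcription with a standard minimum edit distance. The names `count_errors` and `word_accuracy` are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class ErrorCounts(NamedTuple):
    substitutions: int  # Ws
    insertions: int     # Wi
    deletions: int      # Wd


def count_errors(reference: list[str], hypothesis: list[str]) -> ErrorCounts:
    """Align hypothesis to reference by minimum edit distance and count errors."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = (total errors, Ws, Wi, Wd) for reference[:i] vs hypothesis[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, 0, i)  # all reference words deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, j, 0)  # all hypothesis words inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, n, d = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, s, n, d)          # correct word
            else:
                diagonal = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = cost[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            t, s, n, d = cost[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            # Tuples compare on total errors first, so the minimum is kept.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, ins, dels = cost[-1][-1]
    return ErrorCounts(subs, ins, dels)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return Wacc as a percentage (assumed case-insensitive, whitespace tokens)."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    errors = count_errors(ref_words, hyp_words)
    w = len(ref_words)
    return 100.0 * (w - errors.substitutions - errors.insertions - errors.deletions) / w


if __name__ == "__main__":
    # One substitution in a four-word transcription.
    print(word_accuracy("turn left heading two", "turn left heading three"))  # 75.0
```

Because inserted words are counted as errors, a result with many insertions can score below 0% under this formula.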

Digit Accuracy

In many applications, Digit Accuracy is a very important metric. Digit accuracy only considers the numbers in a sentence. It is calculated in an identical manner to word accuracy. In some applications, it can be difficult to achieve high digit accuracy. If digits are important to your application, make sure that you specifically ask about digit performance.
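As a rough illustration of restricting the score to digits, the sketch below keeps only the digit words from the transcription and the result, then applies the same word-accuracy calculation to what remains. The digit vocabulary (`DIGIT_WORDS`), the function names, and the choice to filter before aligning are all assumptions made for this example; a real scoring tool may define digits differently.

```python
# Assumed digit vocabulary: spoken digits, including "oh" for zero.
DIGIT_WORDS = {
    "zero", "oh", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
}


def edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Minimum number of substitutions, insertions and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        previous = current
    return previous[-1]


def digit_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy (percent) computed over the digit words only."""
    ref_digits = [w for w in reference.lower().split() if w in DIGIT_WORDS]
    hyp_digits = [w for w in hypothesis.lower().split() if w in DIGIT_WORDS]
    if not ref_digits:
        raise ValueError("reference contains no digits")
    errors = edit_distance(ref_digits, hyp_digits)
    return 100.0 * (len(ref_digits) - errors) / len(ref_digits)


if __name__ == "__main__":
    spoken = "my account number is four five nine two"
    returned = "my account number is four nine nine two"
    print(digit_accuracy(spoken, returned))  # 75.0
```

In this example the word accuracy over all eight words would be 87.5%, while the digit accuracy is 75%, which is why digit performance is worth asking about separately.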

Sentence Accuracy

Sentence accuracy has limited value as a measure of speech recognition accuracy.

When measuring sentence accuracy, the returned sentence (hypothesis) is compared against the transcription of the spoken audio (reference). If the hypothesis does not match the reference exactly, the sentence is counted as an error and the Sentence Accuracy for that utterance is 0%.

For example, the user says, “I would like to speak to the accounting department please” and the ASR returns, “I would like to speak with the accounting department please”. The ASR in this example returned a hypothesis almost identical to the reference, but this is still considered a sentence error. By contrast, only a single word out of a total of 10 was incorrectly recognized, resulting in a 90% word accuracy.
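The all-or-nothing nature of this measure is easy to see in a short sketch. The code below assumes that a match means identical word sequences after lowercasing and splitting on whitespace; the `sentence_accuracy` function and that normalization are illustrative choices, not part of any particular ASR product.

```python
def normalize(text: str) -> list[str]:
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def sentence_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Percentage of (reference, hypothesis) pairs that match exactly."""
    if not pairs:
        raise ValueError("at least one utterance is required")
    correct = sum(1 for ref, hyp in pairs if normalize(ref) == normalize(hyp))
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    test_set = [
        # One wrong word out of ten still makes the whole sentence an error.
        ("I would like to speak to the accounting department please",
         "I would like to speak with the accounting department please"),
        ("yes please", "yes please"),
    ]
    print(sentence_accuracy(test_set))  # 50.0
```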

Note that a true measure of error rates should be calculated using many test utterances from a wide assortment of speakers. The results from a single utterance are not representative of overall performance. Additionally, if the speaker uses words or phrases that are not part of the defined grammar, the error for that word or phrase is removed from the score.

Speech Recognition Accuracy – Semantic Errors

Arguably the most useful measurement of accuracy in a command and control application is the semantic error (or command error). The semantic error indicates whether the desired end result occurred in the application.

Semantic error is useful in validating application design. For example, if the user said “YES”, the ASR returned “YES”, and the “YES” action was executed, it is clear that the desired result was achieved.

What happens if the ASR returns text that does not exactly match the utterance? For example, the user said “NOPE”, the ASR returned “NO”, and the “NO” action was executed. Semantically, this should also be considered a successful dialog. If you measure performance using word error rate (WER) or sentence error rate (SER), this example will result in a poor score, but in practice the ASR produced the desired result.
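One way to picture semantic scoring is to compare the action the user intended with the action the application executed, rather than comparing the words. The sketch below assumes a small, hypothetical phrase-to-action table (`ACTION_FOR_PHRASE`); a real application would derive the action from its own grammar or dialog logic.

```python
from typing import Optional

# Hypothetical mapping from recognized phrases to application actions.
ACTION_FOR_PHRASE = {
    "yes": "YES",
    "yeah": "YES",
    "no": "NO",
    "nope": "NO",
}


def action_for(text: str) -> Optional[str]:
    """Return the action a phrase triggers, or None if it triggers nothing."""
    return ACTION_FOR_PHRASE.get(text.strip().lower())


def semantic_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Percentage of (spoken, recognized) pairs that lead to the intended action."""
    if not pairs:
        raise ValueError("at least one utterance is required")
    correct = 0
    for spoken, recognized in pairs:
        intended = action_for(spoken)
        executed = action_for(recognized)
        if intended is not None and intended == executed:
            correct += 1
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    dialogs = [
        ("YES", "YES"),   # exact match, correct action
        ("NOPE", "NO"),   # words differ, but the "NO" action is still correct
        ("NO", "NO"),
        ("YES", "NO"),    # wrong action: a genuine semantic error
    ]
    print(semantic_accuracy(dialogs))  # 75.0
```

Scored word for word, the second dialog would count as an error; scored semantically, only the last one does.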

What does Speech Recognition Accuracy mean for you?

When trying to determine the suitability of one ASR over another, speech recognition accuracy measures are an important indicator of the expected performance. When evaluating accuracy, think carefully about what measure is being used, how that measure applies to your specific application, and what constraints were in place during testing. For example, a 1% error rate on a 100-name telephone directory using a context-free grammar is easier to achieve than a 5% error rate using a statistical language model for free speech applications. ASR accuracy/error rates alone are not enough to determine the suitability of one vendor over another. Be wary of any vendor that tries to convince you based on their headline numbers alone.

Free Download

This Free Download provides additional guidance on the topic. It also includes advice for writing general requirements for the purchase of a speech recognition system.

P.S. Just in case you choose to ignore this advice, VRX is 99.9999% accurate! 😊

VBX blogger


With decades of collective experience in all aspects of speech recognition and its practical application, the Verbyx team are committed to delivering first class products with first class support.
