Speech Recognition Accuracy Measures
In this first post on speech recognition accuracy measures, we describe the common methods for measuring error and accuracy rates for automatic speech recognition (ASR). In a follow-up post, we will discuss why those error rate measurements might not be a practical way to describe your simulation ASR requirements.
Word Accuracy
The most commonly quoted speech recognition accuracy measure is Word Accuracy (Wacc). Wacc represents the proportion of the words spoken that were correctly recognized. It can be a useful measurement when validating language models. Word Accuracy is calculated by comparing the returned sentence with the known accurate transcription of the test audio.
The Wacc performance is then evaluated using the following formula:
Wacc = (W - Ws - Wi - Wd) / W × 100%
Where:
W indicates the total number of words to be recognized.
Ws is the number of words that were substituted (the result replaced a word in the transcription with another word).
Wi is the number of words inserted (a word was added that was not in the transcription).
Wd is the number of words that were deleted (a word that was in the transcription was missing from the result).
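The calculation above can be sketched in Python. This is a minimal illustration rather than a reference scorer: the function names count_errors and word_accuracy are our own, the text is assumed to be lower-cased and free of punctuation, and the substitution, insertion, and deletion counts come from a standard minimum-edit alignment between the transcription and the result.

```python
def count_errors(reference, hypothesis):
    """Return (Ws, Wi, Wd) from a minimum-edit alignment of two sentences."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # cost[i][j] = (total_edits, subs, ins, dels) for ref[:i] versus hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (t, s, n, d)          # match, no new error
            else:
                best = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = cost[i][j - 1]
            best = min(best, (t + 1, s, n + 1, d))  # insertion
            t, s, n, d = cost[i - 1][j]
            best = min(best, (t + 1, s, n, d + 1))  # deletion
            cost[i][j] = best
    _, subs, ins, dels = cost[len(ref)][len(hyp)]
    return subs, ins, dels


def word_accuracy(reference, hypothesis):
    """Wacc = (W - Ws - Wi - Wd) / W, returned as a fraction."""
    w = len(reference.split())
    ws, wi, wd = count_errors(reference, hypothesis)
    return (w - ws - wi - wd) / w


reference = "i would like to speak to the accounting department please"
hypothesis = "i would like to speak with the accounting department please"
print(count_errors(reference, hypothesis))
print(word_accuracy(reference, hypothesis))
```

Because insertions are subtracted as well, a result with many extra words can push Wacc below zero.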
Digit Accuracy
In many applications, Digit Accuracy is a very important metric when specifying speech recognition accuracy measures. Digit Accuracy only considers the numbers in a sentence. It is calculated in an identical manner to Word Accuracy. In some applications, it can be difficult to achieve high digit accuracy. If digits are important to your application, make sure that you specifically ask about digit performance.
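As a rough sketch of how this differs from Word Accuracy, the Python below keeps only the digit words from the transcription and the result, then scores them the same way. The DIGITS set (including "oh" for zero), the function names, and the example sentences are assumptions made for illustration; a real scorer may align the full sentences first and then count only the digit positions.

```python
# Assumed digit vocabulary; adjust for your application (e.g. "niner", "double").
DIGITS = {"zero", "oh", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"}


def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            curr[j - 1] + 1,         # insertion
                            prev[j] + 1))            # deletion
        prev = curr
    return prev[-1]


def digit_accuracy(reference, hypothesis):
    """Word Accuracy restricted to digit words; None if no digits were spoken."""
    ref = [w for w in reference.lower().split() if w in DIGITS]
    hyp = [w for w in hypothesis.lower().split() if w in DIGITS]
    if not ref:
        return None
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


reference = "my account number is four five nine two"
hypothesis = "my account number is four nine nine two"
print(digit_accuracy(reference, hypothesis))
```

In this made-up example one wrong digit out of four gives a digit accuracy of 75%, while the same single error scored over all eight words would give a Word Accuracy of 87.5%, which is why it is worth asking for the digit figure separately.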
Sentence Accuracy
Sentence accuracy has limited value when defining speech recognition accuracy measures.
When measuring sentence accuracy, the returned sentence (hypothesis) is compared against the known transcription of the spoken audio (reference). If the hypothesis does not match the reference exactly, the sentence is counted as an error and the Sentence Accuracy for that utterance is 0%.
For example, the user says, “I would like to speak to the accounting department please” and the ASR returns, “I would like to speak with the accounting department please”. The ASR in this example returned a hypothesis almost identical to the reference, but this is still considered a sentence error. By contrast, only a single word out of a total of 10 was incorrectly recognized, resulting in a 90% word accuracy.
Note that a true measure of error rates should be calculated using many test utterances from a wide assortment of speakers. The results from a single utterance are not truly representative of overall performance. Additionally, if the speaker uses words or phrases that are not part of the defined grammar, the error for that word or phrase is removed from the score.
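Over a set of test utterances, sentence accuracy is simply the share of hypotheses that match their reference exactly. The Python sketch below illustrates this; the function name and the four test pairs are invented for the example, and matching is assumed to ignore case and extra whitespace.

```python
def sentence_accuracy(pairs):
    """Fraction of (reference, hypothesis) pairs that match word for word."""
    correct = sum(
        1 for reference, hypothesis in pairs
        if reference.lower().split() == hypothesis.lower().split()
    )
    return correct / len(pairs)


# Illustrative test set: (reference, hypothesis)
test_set = [
    ("i would like to speak to the accounting department please",
     "i would like to speak with the accounting department please"),
    ("yes", "yes"),
    ("call the front desk", "call the front desk"),
    ("cancel my order", "cancel my order"),
]
print(sentence_accuracy(test_set))
```

Here a single wrong word makes the first utterance a sentence error, so the set scores 75% even though nearly every word was recognized correctly.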
Speech Recognition Accuracy – Semantic Errors
Arguably the most useful measurement of accuracy in a command and control application is the semantic error (or command error). The semantic error determines whether the desired end result occurred in the application.
A semantic error measure is useful in validating application design. For example, if the user said “YES”, the ASR returned “YES”, and the “YES” action was executed, it is clear that the desired result was achieved.
What happens if the ASR returns text that does not exactly match the utterance? For example, the user said “NOPE”, the ASR returned “NO”, and the “NO” action was executed. Semantically, this should also be considered a successful dialog. If you measure performance using word error rate (WER) or sentence error rate (SER), this example will result in a poor score, but in practice the ASR produced the desired result.
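One way to picture semantic scoring is to compare the action the application takes with the action the user intended, rather than comparing the text. The Python sketch below is hypothetical: the ACTIONS table, the action names, and the trial data are assumptions for illustration, not part of any particular ASR product.

```python
# Hypothetical mapping from recognized or spoken text to an application action.
ACTIONS = {
    "yes": "CONFIRM",
    "yeah": "CONFIRM",
    "no": "REJECT",
    "nope": "REJECT",
}


def semantic_accuracy(trials):
    """Fraction of (spoken, recognized) pairs that trigger the intended action."""
    correct = 0
    for spoken, recognized in trials:
        intended = ACTIONS.get(spoken.lower())
        executed = ACTIONS.get(recognized.lower())
        if intended is not None and intended == executed:
            correct += 1
    return correct / len(trials)


# Illustrative trials: (what the user said, what the ASR returned)
trials = [("YES", "YES"), ("NOPE", "NO"), ("YES", "NO")]
print(semantic_accuracy(trials))
```

The “NOPE”/“NO” trial counts as a success here because both map to the same action, whereas a word-level or sentence-level score would count it as an error.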
What does Speech Recognition Accuracy mean for you?
When trying to determine the suitability of one ASR over another, speech recognition accuracy measures are an important indicator of the expected performance. When evaluating accuracy, think carefully about which measure is being used, how that measure applies to your specific application, and what constraints were in place during testing. For example, a 1% error in a 100-name telephone directory using a context-free grammar is easier to achieve than a 5% error using a statistical language model for free speech applications. ASR accuracy and error rates alone are not enough to determine the suitability of one vendor over another. Be wary of any vendor that tries to convince you based on their headline numbers alone.
Free Download
This Free Download provides additional guidance on the topic. It also includes advice for writing general requirements for the purchase of a speech recognition system.
P.S. Just in case you choose to ignore this advice, VRX is 99.9999% accurate! 😊