Accuracy Methods for Speech Recognition
In this first post on accuracy methods for speech recognition, we describe the common methods for measuring error rates. In a follow-up post, we will discuss why those error rate measurements might not be a practical way to describe your simulation ASR requirements.
The most commonly quoted measurement of accuracy is Word Accuracy (Wacc). Wacc represents the number of correctly recognized words as a fraction of the total number of words spoken. It can be a useful measurement when validating language models. Word Accuracy is calculated by comparing the returned sentence with the known accurate transcription of the test audio.
Wacc is then evaluated using the following formula:
Wacc = (W - Ws - Wi - Wd) / W
where:
W indicates the total number of words to be recognized.
Ws is the number of words which were substituted (the result replaced a word in the transcription with another word).
Wi is the number of words inserted (a word was added that was not in the transcription).
Wd is the number of words that were deleted (a word that was in the transcription was missing from the result).
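As a rough illustration of the formula, the sketch below counts substitutions, insertions and deletions with a standard minimum-edit-distance alignment and then applies Wacc = (W - Ws - Wi - Wd) / W. The function names, the lowercase whitespace tokenization and the sample utterance are assumptions made for this example; they are not taken from any particular ASR toolkit.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, insertions, deletions) for a minimum-edit alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (edits, subs, ins, dels) for ref[:i] against hyp[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, 0, i)   # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, j, 0)   # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, n, d = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, n, d)          # words match, no cost
            else:
                diagonal = (e + 1, s + 1, n, d)  # substitution
            e, s, n, d = table[i - 1][j]
            deletion = (e + 1, s, n, d + 1)
            e, s, n, d = table[i][j - 1]
            insertion = (e + 1, s, n + 1, d)
            # Tuples compare on total edits first, so min() keeps the cheapest path.
            table[i][j] = min(diagonal, deletion, insertion)
    _, subs, ins, dels = table[-1][-1]
    return subs, ins, dels


def word_accuracy(reference, hypothesis):
    """Wacc = (W - Ws - Wi - Wd) / W, with W counted on the reference."""
    total = len(reference.split())
    if total == 0:
        raise ValueError("reference must contain at least one word")
    subs, ins, dels = align_counts(reference, hypothesis)
    return (total - subs - ins - dels) / total


# Hypothetical utterance: one substitution ("john" -> "jon") and one insertion ("please").
print(align_counts("call john smith", "call jon smith please"))  # (1, 1, 0)
```

Note that because inserted words count against the score, Wacc computed this way can fall below zero when the hypothesis contains many extra words.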
Digit Accuracy only considers the numbers in a sentence. It is calculated in an identical manner to word accuracy. In some applications, it can be difficult to achieve high digit accuracy. If digits are important to your application, make sure that you specifically ask about digit performance.
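One possible way to compute digit accuracy is to keep only the digit tokens from the reference and the hypothesis and then score them exactly as words are scored. The sketch below does that under several assumptions: the set of spoken digit words (including "oh"), the helper names and the sample account number are all hypothetical, and a real test would also need to decide how to treat words such as "one" when they are not used as a number.

```python
# Assumption: these tokens count as digits; adjust for your own application.
DIGIT_WORDS = {"zero", "oh", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"}


def digit_tokens(text):
    """Keep only the tokens that look like digits."""
    return [w for w in text.lower().split() if w in DIGIT_WORDS or w.isdigit()]


def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        current = [i]
        for j, h in enumerate(hyp, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (r != h)))  # match or substitution
        previous = current
    return previous[-1]


def digit_accuracy(reference, hypothesis):
    """Same formula as word accuracy, restricted to digit tokens."""
    ref = digit_tokens(reference)
    hyp = digit_tokens(hypothesis)
    if not ref:
        raise ValueError("reference contains no digits")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


reference = "my account number is four five nine two"
hypothesis = "my account number is four nine nine two"
print(digit_accuracy(reference, hypothesis))  # 0.75
```

In this made-up example only one word in eight is wrong, yet a quarter of the digits are, which is why digit performance is worth asking about separately.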
When measuring sentence accuracy, the returned sentence (hypothesis) is compared against the transcription of the spoken audio (reference). If the hypothesis does not match the reference exactly, the Sentence Accuracy for that utterance is 0%.
For example, the user says, “I would like to speak to the accounting department please” and the ASR returns, “I would like to speak with the accounting department please”. The ASR in this example returned a hypothesis almost identical to the reference, but this is still considered a sentence error. By contrast, only a single word out of a total of 10 was incorrectly recognized, resulting in a 90% word accuracy.
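The all-or-nothing nature of sentence accuracy can be sketched in a few lines. The example below assumes that case and punctuation are ignored before comparison (a choice the scoring tool has to make, not something defined in this post), and the second test pair is invented purely for illustration.

```python
import string


def normalize(text):
    # Assumption: case and punctuation do not count as recognition errors.
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


def sentence_accuracy(pairs):
    """Fraction of (reference, hypothesis) pairs that match exactly."""
    if not pairs:
        raise ValueError("at least one utterance is required")
    exact = sum(1 for reference, hypothesis in pairs
                if normalize(reference) == normalize(hypothesis))
    return exact / len(pairs)


pairs = [
    # The example from the text: one wrong word makes the whole sentence wrong.
    ("I would like to speak to the accounting department please",
     "I would like to speak with the accounting department please"),
    # Hypothetical second utterance that is recognized exactly.
    ("yes please", "Yes, please."),
]
print(sentence_accuracy(pairs))  # 0.5
```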
Note that a true measure of error rates should be calculated using many test utterances from a wide assortment of speakers. The results from a single utterance are not truly representative of overall performance. Additionally, if the speaker uses words or phrases that are not part of the defined grammar, the error for that word or phrase is removed from the score.
Arguably the most useful measurement of accuracy in a command and control application is semantic error (or command error). Semantic error measures whether the desired end result occurred in the application.
Semantic error is useful in validating application design. For example, if the user said “YES”, the ASR returned “YES”, and the “YES” action was executed, it is clear that the desired result was achieved.
What happens if the ASR returns text that does not exactly match the utterance? For example, the user said “NOPE”, the ASR returned “NO”, and the “NO” action was executed. Semantically, this should also be considered a successful dialog. If you measure performance using word error rate (WER) or sentence error rate (SER), this example will result in a poor score, but in practice the ASR produced the desired result.
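To make the contrast concrete, here is a minimal sketch of semantic scoring. It assumes a hypothetical lookup table that maps recognized text to an application action; the table contents, the action names and the function names are invented for this example and will differ in any real application.

```python
# Hypothetical mapping from recognized text to the action the application runs.
ACTIONS = {
    "yes": "CONFIRM",
    "yeah": "CONFIRM",
    "no": "REJECT",
    "nope": "REJECT",
}


def action_for(text):
    """Return the action triggered by the text, or None if nothing matches."""
    return ACTIONS.get(text.strip().lower())


def semantic_accuracy(trials):
    """Fraction of (spoken, recognized) pairs that trigger the intended action."""
    if not trials:
        raise ValueError("at least one trial is required")
    correct = 0
    for spoken, recognized in trials:
        intended = action_for(spoken)
        executed = action_for(recognized)
        if intended is not None and intended == executed:
            correct += 1
    return correct / len(trials)


trials = [("YES", "YES"), ("NOPE", "NO")]
print(semantic_accuracy(trials))  # 1.0
```

Scored as exact sentences, the same two trials would give only 50% accuracy, because “NOPE” and “NO” are different strings even though they lead to the same action.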
What does this mean for you?
When trying to determine the suitability of one ASR over another, error rates are a key indicator of the performance you should expect in your own implementation. When looking at accuracy, think carefully about what measure is being used, how that measure applies to your specific application, and what constraints were in place during testing. For example, a 1% error rate on a 100-name telephone directory using a context-free grammar is easier to achieve than a 5% error rate using a statistical language model for free speech applications. ASR accuracy and error rates alone are not enough to determine the suitability of one vendor over another. Be wary of any vendor that tries to convince you based on their headline numbers alone.
P.S. Just in case you choose to ignore this advice, VRX is 99.9999% accurate! 😊