One measure of ASR performance is accuracy: how well it recognizes utterances. Recognition accuracy depends on many factors, one being the design of the language model. There is little point in building a great ASR if, in operation, your language model does not support the technical words and phrases that might be spoken during the dialog. ASR vendors often report an accuracy of over 99%, but what does that figure mean? And is it of any value to your application?
Accuracy is a quantitative measurement and is often quoted in the form of an error rate. Error rates can be calculated in several ways.
This is the most commonly quoted measure of accuracy. Word error rate (WER) is the number of word errors (substitutions, deletions, and insertions) divided by the total number of words spoken. It can be a useful measurement when validating language models.
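The WER calculation can be sketched as a word-level edit distance between the reference transcript and the ASR hypothesis. This is a minimal illustration, not a production scoring tool; real evaluations would also handle text normalization (case, punctuation, numbers).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance over words).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("yes please", "yes"))  # 0.5: one deleted word out of two
```

Note that because insertions also count as errors, WER can exceed 100% when the ASR returns more words than were spoken.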
A significant problem with using WER as a measure is that it often gives no indication of usability. If, during a dialog, the ASR returned "NOPE" when the user said "NO", this would be counted as a recognition error. In practice, however, both words have the same semantic meaning.
WER can often be reduced by analyzing your grammar; for instance, by adding "nope" as a valid word.
Digit error rate (DER) is similar to WER and is calculated in an identical manner, but over digits only. In some applications it can be difficult to achieve high digit accuracy. If digits are important to your application, make sure that you specifically ask about digit performance.
When measuring sentence error rate (SER), the returned sentence (hypothesis) is compared against a transcript of the spoken audio (reference). If the hypothesis does not match the reference exactly, the SER for that utterance is 100%.
For example, the user says "I would like to speak to the accounting department please" and the ASR returns "I would like to speak with the accounting department please". The hypothesis is almost identical to the reference, but it is still counted as a sentence error. In this example there was a single word incorrectly recognized out of a total of 10, giving a 10% WER. Note that a true measure of SER should be calculated over many test utterances, as the result from a single utterance is not representative of overall performance.
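The per-utterance comparison above can be sketched as follows; the utterance pairs here are illustrative test data, not real benchmark results.

```python
def sentence_error_rate(pairs) -> float:
    """Fraction of (reference, hypothesis) pairs that do not match exactly."""
    errors = sum(1 for ref, hyp in pairs if ref != hyp)
    return errors / len(pairs)

# Illustrative test set: one exact match, one near-miss.
utterances = [
    ("i would like to speak to the accounting department please",
     "i would like to speak with the accounting department please"),
    ("transfer me to support",
     "transfer me to support"),
]
print(sentence_error_rate(utterances))  # 0.5: the near-miss counts as a full sentence error
```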
Semantic error (or command error) is not a commonly used measurement, but it is arguably the most useful measure of accuracy in a command-and-control application. Semantic error determines whether the desired end result occurred in the application.
Semantic error is useful in validating application design. For example, if the user said "YES", the ASR returned "YES", and the "YES" action was executed, it is clear that the desired result was achieved. What happens if the ASR returns text that does not exactly match the utterance? For example, what if the user said "NOPE", the ASR returned "NO", and the "NO" action was executed? Should that be considered a successful dialog? Measured by WER or SER this example scores poorly, but in practice the ASR produced the desired result.
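One way to score semantic accuracy is to map both the reference and the hypothesis to the application action they would trigger before comparing. The synonym map below is hypothetical; in a real application it would come from your own dialog design or semantic grammar.

```python
# Hypothetical mapping from surface words to dialog actions.
SYNONYMS = {"nope": "no", "nah": "no", "yeah": "yes", "yep": "yes"}

def action(word: str) -> str:
    """Normalize an utterance word to the action it triggers."""
    w = word.lower()
    return SYNONYMS.get(w, w)

def semantic_match(reference: str, hypothesis: str) -> bool:
    """True when both utterances trigger the same application action."""
    return action(reference) == action(hypothesis)

print(semantic_match("NOPE", "NO"))  # True: same action, so no semantic error
```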
When trying to determine the suitability of one ASR over another, error rates are a key indicator of the performance to expect in your own implementation. When looking at error rates, think carefully about which measure is being used, how that measure applies to your specific application, and what constraints were in place during testing. For example, a 1% WER in a 100-name telephone directory using a context-free grammar is easier to achieve than a 5% WER using a statistical language model for free-speech applications. ASR error rates alone are not enough to determine the suitability of one vendor over another. Be wary of any vendor that tries to convince you based on their headline numbers alone.
P.S. Just in case you choose to ignore this advice, VRX is 99.9999% accurate!