Word Error Rates are Misleading

The Common Use of Word Error Rates

“The system must have a Word Error Rate (WER) of 2% or less.” This appears to be a reasonable requirement. With such a small error rate, surely I can be confident that my speech recognition implementation will be a success?

Unfortunately, for simulation or command-and-control applications, the answer is no. Word Error Rate is not a good measure of a speech system's usability in these domains.

Testing Method

Prior to automated testing for WER, some preparation is necessary. First, the audio must be accurately transcribed. The data is then cleaned by removing poor-quality audio and utterances that contain unsupported phrases. The transcriptions are compared against the hypotheses (the ASR results) to identify errors. A large test set with a broad range of speakers is vital for meaningful results: the data should include male and female speakers with a range of accents representative of the target users.
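The comparison step above is normally a word-level edit-distance alignment. As a minimal sketch (not any particular scoring toolkit), WER is the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length:

```python
# Sketch of a word-level WER calculation using Levenshtein alignment.
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "american one one three five heading two three zero"
hyp = "american one one tree five heading two three zero"
print(f"WER: {wer(ref, hyp):.1%}")  # one substitution in nine words -> 11.1%
```

Production scoring tools also normalize case, punctuation, and alternate spellings before alignment, but the core metric is this ratio.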

Customer testing usually includes a free-play test phase. After free-play, the same scoring method is used, although in this case the audio has not previously been transcribed. The contract may also allow the vendor to exclude utterances from speakers with a particularly difficult speaking style.

ASR is seeing adoption as a means of cockpit command and control. In this application very low word error rates are vital.

Ideal Test Conditions

So word error rate is an excellent measure of ASR performance when approved speakers use supported terminology. But we must also consider the impact of using ASR with non-approved speakers, that is, speakers unfamiliar with the strict approved terminology (such as trainees), in less-than-ideal conditions. This is an issue we will tackle in a future post. For now, we will continue to examine word errors.

Why WER Is Not a Good Measure

If the test data contains 10,000 words and our error rate is 2%, then 200 words were recognized incorrectly. This does not seem like a major hurdle, but consider an application where numbers are both critical and common, such as a flight simulator:

“American one one three five, heading two three zero and descending to three thousand.”

If on examination we find that the errors are in words such as “and”, then in the example above we are still likely to achieve the correct semantic match. But if the majority of the two hundred errors are the word “three”, we have an unusable system.
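The point can be made concrete with a small sketch. Using the document's own numbers (200 errors out of 10,000 words, a 2% WER), the two runs below score identically on WER yet differ completely in usability once we separate errors on semantically critical words (the digits here; the vocabulary set is illustrative, not from any standard) from errors on filler words:

```python
# Sketch: two test runs with identical overall WER but very different usability.
# The "critical" vocabulary (spoken digits) carries the semantic payload.
CRITICAL = {"zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "niner"}

def error_breakdown(errors: list[str]) -> dict:
    """Split a list of misrecognized reference words into critical vs filler."""
    critical = sum(1 for w in errors if w in CRITICAL)
    return {"total": len(errors),
            "critical": critical,
            "filler": len(errors) - critical}

# Both runs: 200 errors in 10,000 words, i.e. a 2% WER.
run_a = ["and"] * 180 + ["three"] * 20   # mostly harmless filler errors
run_b = ["three"] * 180 + ["and"] * 20   # mostly digit errors: unusable system

print(error_breakdown(run_a))  # {'total': 200, 'critical': 20, 'filler': 180}
print(error_breakdown(run_b))  # {'total': 200, 'critical': 180, 'filler': 20}
```

A plain WER target cannot distinguish run A from run B; a per-vocabulary breakdown like this one can.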

So if you can’t rely upon WER, how can you specify your accuracy requirements? Stay tuned.
