Adding speech recognition to a training application has many benefits, but the primary one is cost: ASR can reduce the cost of simulation by reducing or eliminating the need for human role players. However, the challenges created by removing people can seem daunting. Role players have three primary attributes that must be matched to ensure that the introduction of ASR is a success:
Comprehension – Role players can interpret the meaning of a trainee's communication even when it is given in an unexpected or unofficial format. An air traffic controller is subject to well-defined, formal “phraseology” when communicating with pilots. The controller may be expected to issue the instruction “American eleven twenty-three, runway zero five, cleared to land”. However, an inexperienced trainee under pressure can often struggle to remember the instruction in the expected format. Even if the trainee were to say “American eleven twenty-three, when you are ready, land on runway zero five”, the role player would understand the exact intent of the instruction, whereas a poorly implemented ASR would not.
Conditional Instructions – If a trainee issues a conditional instruction (i.e., when condition x is met, then perform action y), a role player would typically monitor the target entity until evidence of condition x is noted. They would then manually enter the instruction for the entity to perform action y. Without a human in the loop to continuously monitor the simulation, it becomes the sole responsibility of the simulator to understand that a conditional instruction has been issued and to execute action y when the condition is met. This is not an ASR technology problem, since the ASR can return the conditional instruction to the connected device, but it is of paramount importance that you understand the implications for your simulator if you are considering adding ASR to your training program.
Conversation – Customers who implement an ASR system often specify word accuracy as the measure of the speech technology's performance. Typically it is expressed as follows: “the system must achieve a word accuracy of greater than 95%” (we will discuss word accuracy in a future post). Even if this accuracy is achieved, there is no guarantee that your system will be usable. As a retired air traffic controller, I can testify that controller-to-pilot communications in the real world fall well short of 95% word accuracy. If that is the case, why might 95% word accuracy not be enough? Simply put, when there is any doubt in the real world, there is a conversation between the speakers that continues until the ambiguity is gone. An ASR must be able to handle ambiguity by simulating that conversation process.
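To make the conditional-instruction point above more concrete, here is a minimal Python sketch of how a simulator might hold an instruction until its condition is met, which is the monitoring job a role player would otherwise do. Every name in it (`ConditionalInstruction`, `Simulator`, `tick`, the `altitude_ft` state field, the callsign string) is a hypothetical illustration and not part of any particular ASR or simulator product. It also assumes the ASR output has already been turned into a condition and an action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; a real simulator will differ.
EntityState = Dict[str, float]


@dataclass
class ConditionalInstruction:
    callsign: str                               # entity the instruction targets
    condition: Callable[[EntityState], bool]    # "condition x"
    action: str                                 # "action y"


@dataclass
class Simulator:
    pending: List[ConditionalInstruction] = field(default_factory=list)
    executed: List[str] = field(default_factory=list)

    def receive(self, instruction: ConditionalInstruction) -> None:
        """Store an instruction handed over by the ASR side."""
        self.pending.append(instruction)

    def tick(self, states: Dict[str, EntityState]) -> None:
        """Called on every simulation update, in place of a human watching."""
        still_waiting = []
        for instr in self.pending:
            state = states.get(instr.callsign)
            if state is not None and instr.condition(state):
                # Condition x observed: perform action y exactly once.
                self.executed.append(f"{instr.callsign}: {instr.action}")
            else:
                still_waiting.append(instr)
        self.pending = still_waiting


sim = Simulator()
sim.receive(
    ConditionalInstruction(
        callsign="AAL1123",
        condition=lambda s: s["altitude_ft"] <= 3000,
        action="turn left heading 270",
    )
)
sim.tick({"AAL1123": {"altitude_ft": 5000}})  # condition not met, keep waiting
sim.tick({"AAL1123": {"altitude_ft": 2900}})  # condition met, action executed
print(sim.executed)
```

The design choice worth noting is that the instruction stays in the simulator's pending list across updates. The ASR only delivers it once; deciding when the condition has been met is the simulator's job.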
In the next post, we will introduce and discuss the language model and the considerations needed when dealing with non-expert speakers.