Speech Recognition Basics

Background


Speech recognition is an immensely complex task, especially given the thousands of languages, many thousands of dialects and millions of accents spoken worldwide. With the relative success and popularity of voice-enabled personal assistants, we might be forgiven for thinking speech recognition is a solved problem. Nothing could be further from the truth.

The Problem!

The primary function of an automatic speech recognition (ASR) system is to recognize words, numbers and sentences. It starts this process by attempting to recognize strings of phonemes (the small units of sound that distinguish one word from another). Unfortunately for the ASR user, all recognizers perform poorly at the phoneme recognition task. Even so, there are a number of excellent speech recognition applications that deliver more than acceptable levels of usability. These applications work around the inherent limitations of the recognition process in three ways: the voice user interface (VUI) is often designed to guide you toward a limited set of responses, the acoustic model is trained with huge quantities of transcribed audio, and the language model is constantly being optimized.

With these measures in place, the ASR system, even with its poor phoneme recognition, can take an educated guess at what was said. As long as the user plays by the imposed rules, the system will often work well. Unfortunately, humans tend to dislike being told what to do, and we do not always play by the rules, especially if the rules do not make sense.
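
To make the idea of an "educated guess" concrete, here is a minimal sketch of how a limited set of responses can rescue a poorly recognized phoneme string. It is an illustration only, not a description of how any particular recognizer (Verbyx's included) works: production systems use probabilistic acoustic and language models rather than plain edit distance. The lexicon, the ARPAbet-style phoneme symbols and the function names are all assumptions made for this example.

```python
# Hypothetical pronunciation lexicon for a prompt that only accepts three
# responses. Phoneme symbols are ARPAbet-style and purely illustrative.
LEXICON = {
    "yes": ["Y", "EH", "S"],
    "no": ["N", "OW"],
    "repeat": ["R", "IH", "P", "IY", "T"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # drop a phoneme from a
                            curr[j - 1] + 1,      # insert a phoneme from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_guess(phonemes, lexicon=LEXICON):
    """Return the allowed response closest to the noisy phoneme string."""
    return min(lexicon, key=lambda word: edit_distance(phonemes, lexicon[word]))


# The recognizer misheard the vowel, but only one allowed response is close.
noisy = ["Y", "AH", "S"]
print(best_guess(noisy))
```

The smaller the set of allowed responses, the more phoneme errors the system can absorb before it picks the wrong word. The same sketch also shows the weakness described above: if the user says something outside the lexicon, the system still returns its nearest allowed response.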

Improving Accuracy

For many years, the common approach to improving speech recognition accuracy has been acoustic model training. The theory is quite simple: if we train our system with samples of voices from as many people as possible, accuracy will improve. It is a simple theory, but it requires significant data processing and yields only incremental improvements.

Getting Back to the Science of ASR

At Verbyx, we understand the importance of acoustic model training, but we do not rely on crunching data and improving acoustic models. We deliver a product that is sharply focused on recognition accuracy. Verbyx scientists use their deep understanding of physics, mathematics, linguistics and biology (aural physiology, neuroscience) to deliver genuinely groundbreaking improvements.