
ASR 101: How Does ASR Work?


Speech recognition is a complex and computationally intensive process. There are thousands of publications that explain the science behind current speech recognition technologies. However, the basic process can be described very simply.

Speech Recognition Basics

The recognition process begins when the ASR "front-end" receives its input in the form of digital audio. The audio is converted, in a series of frames, from the time domain to the frequency domain; typically, each frame covers 1/50th of a second. The frames are further processed to produce a set of feature data called "cepstral coefficients". The feature data is then passed to a second key component, the "decoder", where it is processed using hidden Markov models (HMMs). This part of the process calls upon two other pre-prepared components: the acoustic model and the language model (grammar).
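The front-end steps above can be sketched in a few lines of Python. This is a simplified illustration, not a production feature extractor: real systems add windowing, mel filterbanks, and delta features, and the function names, sample rate, and coefficient count below are assumptions chosen for the example.

```python
# Minimal sketch of the ASR front-end: slice audio into 1/50th-second
# frames, move each frame to the frequency domain, and derive simple
# cepstral coefficients. Illustrative only; real front-ends do more.
import numpy as np

def frames_from_audio(audio, sample_rate=16000, frame_duration=0.02):
    """Split a 1-D audio signal into non-overlapping 20 ms (1/50 s) frames."""
    frame_len = int(sample_rate * frame_duration)   # 320 samples at 16 kHz
    n_frames = len(audio) // frame_len
    return audio[:n_frames * frame_len].reshape(n_frames, frame_len)

def cepstral_coefficients(frame, n_coeffs=13):
    """Time domain -> frequency domain -> log magnitude -> cepstrum."""
    spectrum = np.fft.rfft(frame)                   # frequency domain
    log_mag = np.log(np.abs(spectrum) + 1e-10)      # compress dynamic range
    cepstrum = np.fft.irfft(log_mag)                # "quefrency" domain
    return cepstrum[:n_coeffs]                      # keep low-order coeffs

# One second of a synthetic 440 Hz tone as stand-in audio.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)
frames = frames_from_audio(audio, sr)
features = np.array([cepstral_coefficients(f) for f in frames])
print(frames.shape)    # (50, 320): 50 frames of 320 samples each
print(features.shape)  # (50, 13): 13 coefficients per frame
```

One second of audio yields 50 frames, matching the 1/50th-of-a-second frame duration described above; it is this 13-number feature vector per frame, not the raw audio, that the decoder consumes.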

The acoustic model is pre-prepared by the ASR training tools. The trainer processes huge quantities of audio files and their text transcriptions to extract the common acoustic characteristics of each individual phoneme (context-independent phoneme) as well as of each phoneme together with its context (context-dependent phoneme). Together, these phoneme characteristics form the acoustic model.
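As a toy illustration of this training step, the sketch below averages feature vectors grouped by phoneme label. Real trainers fit Gaussian mixtures or neural networks per HMM state over millions of frames; the phoneme labels, features, and function name here are invented for the example.

```python
# Toy acoustic-model "training": average the feature vectors observed
# for each phoneme label. Real systems fit statistical models instead.
from collections import defaultdict

def train_acoustic_model(labeled_frames):
    """labeled_frames: iterable of (phoneme, feature_vector) pairs.
    Returns phoneme -> mean feature vector (a crude acoustic template)."""
    sums = {}
    counts = defaultdict(int)
    for phoneme, feats in labeled_frames:
        if phoneme not in sums:
            sums[phoneme] = list(feats)
        else:
            sums[phoneme] = [s + f for s, f in zip(sums[phoneme], feats)]
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}

# Hypothetical aligned training data: (phoneme, cepstral features).
data = [("AH", [1.0, 2.0]), ("AH", [3.0, 4.0]), ("K", [0.0, 1.0])]
model = train_acoustic_model(data)
print(model["AH"])  # [2.0, 3.0]
```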

The language model includes all possible sentences (or word sequences), where "all" means all sentences in a specific application domain, not all sentences in the selected language. This improves recognition accuracy and performance, since it removes a huge number of impossible sentences from consideration.
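The domain restriction can be sketched as a small grammar that enumerates only the word sequences the application expects. The commands below are invented; a real language model would assign probabilities rather than a simple yes/no, but the pruning effect is the same.

```python
# A sketch of a domain-restricted language model (grammar): only the
# word sequences the application domain allows are ever hypothesized.
allowed_sentences = {
    ("call", "home"),
    ("call", "office"),
    ("check", "voicemail"),
}

def in_grammar(words):
    """True if the word sequence is a sentence the domain allows."""
    return tuple(words) in allowed_sentences

print(in_grammar(["call", "home"]))      # True
print(in_grammar(["purple", "banana"]))  # False: pruned from the search
```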

The decoder uses the HMMs to perform two tasks, segmentation and recognition, simultaneously. The segmentation process cuts the input audio into separate pieces, where each piece maps to a state of a phoneme. The recognition process analyses each sequence of five audio segments to determine which of the ~45 phonemes of the English language it most probably represents.
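This simultaneous segmentation-and-recognition can be seen in a minimal Viterbi decoder over a tiny HMM: the single best state path both labels each frame (recognition) and marks where one phoneme ends and the next begins (segmentation). The two states, their probabilities, and the "lo"/"hi" observations are invented stand-ins for real phoneme states and feature vectors.

```python
# Minimal Viterbi decoding over a two-state HMM. The recovered state
# path labels every frame AND segments the audio in one pass.
import math

states = ["AH", "K"]
start_p = {"AH": 0.5, "K": 0.5}
trans_p = {"AH": {"AH": 0.7, "K": 0.3}, "K": {"AH": 0.3, "K": 0.7}}
emit_p  = {"AH": {"lo": 0.8, "hi": 0.2}, "K": {"lo": 0.2, "hi": 0.8}}

def viterbi(observations):
    # V[t][s] = log-probability of the best path ending in state s at time t
    V = [{s: math.log(start_p[s] * emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, V[-2][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            V[-1][s] = score + math.log(emit_p[s][obs])
            back[-1][s] = prev
    # Trace back the most probable state sequence.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(V) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["lo", "lo", "hi", "hi"]))  # ['AH', 'AH', 'K', 'K']
```

The output segments the four frames into an "AH" region followed by a "K" region while simultaneously identifying each phoneme; a real decoder does the same over thousands of states and frames.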

Through a series of complex algorithms, the decoder evolves a number of simultaneous hypotheses, evaluating the probability of each as the process continues. At the end of the audio utterance, a single hypothesis (typically the one with the highest probability) is offered as the recognition result. Multiple candidates can also be offered for further processing outside the core ASR process (a feature called "n-best"). Once the hypothesis is produced, the ASR process is over; the application integrated with the ASR then determines how to process the provided results.
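The n-best idea can be sketched as a ranking of the decoder's surviving hypotheses by probability: the top entry becomes the recognition result, and the rest are handed to the application. The hypotheses and scores below are invented for illustration.

```python
# Sketch of n-best output: rank surviving hypotheses by probability
# and return the top n for the application to post-process.
import heapq

hypotheses = [
    ("recognize speech", 0.62),
    ("wreck a nice beach", 0.21),
    ("recognise peach", 0.09),
]

def n_best(hyps, n):
    """Return the n highest-probability hypotheses, best first."""
    return heapq.nlargest(n, hyps, key=lambda h: h[1])

best = n_best(hypotheses, 1)[0]
print(best[0])               # recognize speech
print(n_best(hypotheses, 2)) # top two, for the application to choose from
```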