The front-end extracts data from the digital representation of the spoken words (the audio signal), putting it into a form the decoder can use. To make pattern recognition easier, the digital audio is transformed into the "frequency domain", where the individual frequency components of a sound can be identified. From these components it is possible to approximate how the human ear perceives the sound. The transformation results in a graph of the amplitudes of the frequency components, describing the sound heard.
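The frequency-domain transform described above can be sketched with a short-time Fourier transform over one audio frame. This is a minimal illustration, not any particular toolkit's front-end: the sample rate, frame size, and the synthetic 440 Hz tone standing in for real speech are all assumptions made for the example.

```python
import numpy as np

# Illustrative values, not mandated by any ASR toolkit.
SAMPLE_RATE = 16000          # samples per second
FRAME_SIZE = 512             # samples per analysis frame

# Synthesise a 440 Hz tone as a stand-in for one frame of speech audio.
t = np.arange(FRAME_SIZE) / SAMPLE_RATE
frame = np.sin(2 * np.pi * 440.0 * t)

# Window the frame to reduce spectral leakage, then transform to the
# frequency domain.
windowed = frame * np.hamming(FRAME_SIZE)
spectrum = np.fft.rfft(windowed)
amplitudes = np.abs(spectrum)      # amplitude of each frequency component

# Frequency (in Hz) corresponding to each FFT bin.
freqs = np.fft.rfftfreq(FRAME_SIZE, d=1.0 / SAMPLE_RATE)

# The dominant component sits near 440 Hz, matching the input tone.
peak_hz = freqs[np.argmax(amplitudes)]
```

The `amplitudes` array is exactly the "graph of the amplitudes of the frequency components" mentioned above; a real front-end would go on to warp these values toward human perception (e.g., a mel-scale filter bank) before handing features to the decoder.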
The acoustic model is produced, or pre-prepared, by the ASR training tools. The trainer processes a large set of audio files together with their text transcriptions to extract common acoustic characteristics for each individual phoneme (context-independent phoneme) as well as for each phoneme with its context (context-dependent phoneme). In practice, each phoneme is divided into five states in a time sequence, since its characteristics differ slightly from beginning to end. Together, all of these phoneme-state characteristics form the voice model.
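The phoneme-state organisation above can be sketched as a simple data structure. This is a toy illustration only: the phoneme names, the triphone naming convention, and the random placeholder vectors standing in for trained acoustic characteristics are all assumptions, not output of any real trainer.

```python
import random

N_STATES = 5  # each phoneme is split into five states in time sequence

def make_states(phoneme, left=None, right=None):
    """Return the state units for a phoneme, optionally with context.

    Context-independent phonemes are keyed by name alone; context-dependent
    ones use a hypothetical left-phoneme+right naming scheme.
    """
    name = phoneme if left is None else f"{left}-{phoneme}+{right}"
    # One placeholder "acoustic characteristics" vector per state;
    # a real trainer would estimate these from transcribed audio.
    return {f"{name}/s{i}": [random.random() for _ in range(3)]
            for i in range(N_STATES)}

model = {}
model.update(make_states("AH"))                       # context-independent
model.update(make_states("AH", left="K", right="T"))  # context-dependent
```

The resulting `model` holds ten state units: five for the bare phoneme and five for the same phoneme in one specific context, mirroring how the trainer keeps both kinds side by side.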
The acoustic model is the output of the training process. It provides a representation of the many sounds a human is capable of generating. An acoustic model is typically specific to a language or dialect; however, it is possible to create a voice model suitable for speakers with a range of different accents and pronunciations within the same language, e.g., regional differences in pronunciation between states, or between UK and US English speakers.
The language model (including the dictionary) is used by the decoder to determine the most likely hypothesis. The language model describes to the decoder the relationship between words and the probability of words appearing in a particular order; e.g., it indicates that the phrase "I went for a walk" is probable, while "For a walk I went" is highly unlikely to occur. Language models may be domain specific. In context-free implementations, the language model is referred to as the grammar.
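The word-order scoring described above can be illustrated with a toy bigram model. The tiny corpus and smoothing parameters below are invented for the example; a real language model is estimated from a large text corpus.

```python
from collections import defaultdict

# Invented miniature corpus, purely for illustration.
corpus = [
    "i went for a walk",
    "i went for a run",
    "she went for a walk",
]

# Count bigrams over each sentence, with start/end markers.
bigram_counts = defaultdict(int)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[(w1, w2)] += 1

# Count how often each word appears as the left side of a bigram.
unigram_counts = defaultdict(int)
for (w1, _), c in bigram_counts.items():
    unigram_counts[w1] += c

def sentence_prob(sentence, alpha=0.01, vocab=20):
    """Bigram sentence probability with add-alpha smoothing so that
    unseen word pairs get a small but non-zero probability."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigram_counts[(w1, w2)] + alpha) / (unigram_counts[w1] + alpha * vocab)
    return p

likely = sentence_prob("i went for a walk")
unlikely = sentence_prob("for a walk i went")
```

Because every bigram in "i went for a walk" occurs in the corpus while "for a walk i went" relies on several unseen pairs, `likely` comes out far larger than `unlikely`, which is exactly the signal the decoder uses to prefer one hypothesis over another.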