What is an Acoustic Model?
Acoustic models characterize how sound changes over time. Each phoneme, or speech sound, is modeled by a sequence of states and signal observation probabilities: distributions over the sounds that you might hear (observe) in each state. Sphinx2 uses a 5-state phonetic model, so each phone model has exactly five states. At run time, frames of the input audio are compared with the distributions in the states to determine which states are likely to have produced the observed (heard) audio. Acoustic models perform best when they are matched to the conditions in which they will be used. That is, English acoustic models work best for English, and telephone models work best on the telephone. With SphinxTrain, you can train acoustic models for any language, task, or channel condition. Context-independent phones (CI-phones) are modeled using data from many different contexts, whereas triphones are phone models that take the left and right context into account.
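To make the idea of comparing frames against state distributions concrete, here is a minimal sketch in Python. It is not Sphinx2 code. It assumes a toy phone model with five left-to-right states, one-dimensional frames, a single Gaussian per state, and equal stay/advance transition probabilities. All names and numbers (STATE_GAUSSIANS, viterbi_log_score, the means and variances) are hypothetical and chosen only for illustration; a real recognizer works on multi-dimensional feature vectors with trained parameters.

```python
import math

# Hypothetical toy phone model: five states, each with a 1-D Gaussian
# given as (mean, variance). Real models use trained, multi-dimensional
# distributions; these values are illustrative assumptions only.
STATE_GAUSSIANS = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0)]

# Assumed transition probabilities: stay in a state or advance to the next.
LOG_STAY = math.log(0.5)
LOG_ADVANCE = math.log(0.5)

NEG_INF = float("-inf")


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_log_score(frames, gaussians=STATE_GAUSSIANS):
    """Best-path log score for frames through a left-to-right model.

    The path must start in the first state and end in the last state.
    Returns -inf if there are too few frames to reach the last state.
    """
    n = len(gaussians)
    scores = [NEG_INF] * n
    scores[0] = log_gaussian(frames[0], *gaussians[0])
    for x in frames[1:]:
        new_scores = [NEG_INF] * n
        for s in range(n):
            stay = scores[s] + LOG_STAY
            advance = scores[s - 1] + LOG_ADVANCE if s > 0 else NEG_INF
            best = max(stay, advance)
            if best > NEG_INF:
                # Add the observation score of this frame in state s.
                new_scores[s] = best + log_gaussian(x, *gaussians[s])
        scores = new_scores
    return scores[-1]


if __name__ == "__main__":
    matched = [0.0, 1.0, 2.0, 3.0, 4.0]     # frames that resemble the model
    mismatched = [4.0, 3.0, 2.0, 1.0, 0.0]  # frames that do not
    print(viterbi_log_score(matched) > viterbi_log_score(mismatched))
```

In this sketch, frames that follow the state means receive a higher score than frames that do not, which is the same effect, in miniature, as a model matched to its conditions outperforming a mismatched one.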