How does TTS work?
TTS is often described as two conceptual stages. In the first stage, the system decides how the text should be spoken: how each word should be pronounced, what length and pitch each phoneme should have, and so on. In the second stage, the system does its best to create audio that matches the specification produced by the first stage. TTS software has little or no understanding of the text being read. It uses rules, lists, dictionaries, etc. to make very sophisticated guesses about how a piece of text should be read. While general performance can be quite good, some decisions are intrinsically hard to make without some level of understanding. For example, the word “bass” is pronounced differently in the phrases “bass drum” and “bass boat”, and choosing correctly requires knowing which sense is meant. Intonation depends in many cases on the writer’s intention, which often cannot be inferred in short texts even by human readers. As a result, TTS systems will occasionally make mistakes and can be fooled by carefully constructed texts. These are challenging problems for all TTS systems,