Can you elaborate on the difference between diphone synthesis and unit selection synthesis?
In unit selection, the synthesis unit chosen for a particular system can be any of several sizes: half-phones, phones, diphones, syllables, etc. The key idea is that the database holds multiple examples of the same unit recorded in different contexts, where context may be some combination of the adjoining phonemes and prosodic features (e.g. emphasized or non-emphasized). The examples can be clustered to find representative units for each acoustically distinct variant. With diphone synthesis, we use only one example of each phone-to-phone transition and do not keep different versions of a diphone for different contexts. Unit selection systems are more difficult to build and require more labelled data, but they can produce much better quality than diphone synthesis (though not necessarily).
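To make the contrast concrete, here is a minimal toy sketch in Python. The answer above does not say how a unit selection system picks among its examples; the sketch assumes one common formulation, a target cost (how well a candidate matches the desired context) plus a join cost (how smoothly neighbouring candidates connect), minimized by dynamic programming. Every name, feature and pitch value in it is made up for illustration and does not come from any real synthesizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str           # diphone label, e.g. "h-e"
    stressed: bool      # toy prosodic context feature
    start_pitch: float  # pitch in Hz at the left edge (made-up value)
    end_pitch: float    # pitch in Hz at the right edge (made-up value)


def diphone_select(inventory, targets):
    """Diphone synthesis: one stored example per diphone, context is ignored."""
    return [inventory[name] for name, _ in targets]


def target_cost(unit, want_stressed):
    # Assumed cost: penalize a mismatch with the requested stress.
    return 0.0 if unit.stressed == want_stressed else 1.0


def join_cost(left, right):
    # Assumed cost: pitch discontinuity at the concatenation point.
    return abs(left.end_pitch - right.start_pitch)


def unit_select(database, targets, join_weight=0.01):
    """Unit selection: pick one candidate per target by dynamic programming."""
    columns = [database[name] for name, _ in targets]
    # costs[i][k]: cheapest total cost of a path ending in candidate k of target i
    costs = [[target_cost(u, targets[0][1]) for u in columns[0]]]
    back = [[None] * len(columns[0])]
    for i in range(1, len(columns)):
        row_cost, row_back = [], []
        for unit in columns[i]:
            options = [
                costs[i - 1][j] + join_weight * join_cost(prev, unit)
                for j, prev in enumerate(columns[i - 1])
            ]
            j_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[j_best] + target_cost(unit, targets[i][1]))
            row_back.append(j_best)
        costs.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    k = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    path = []
    for i in range(len(columns) - 1, -1, -1):
        path.append(columns[i][k])
        k = back[i][k]
    path.reverse()
    return path


# Toy unit selection database: two recorded examples per diphone.
database = {
    "h-e": [Unit("h-e", False, 100, 110), Unit("h-e", True, 120, 140)],
    "e-l": [Unit("e-l", False, 112, 105), Unit("e-l", True, 138, 130)],
    "l-o": [Unit("l-o", False, 104, 95), Unit("l-o", True, 131, 120)],
}
# Toy diphone inventory: keep only the first example of each diphone.
diphone_inventory = {name: cands[0] for name, cands in database.items()}

# Requested sequence: (diphone, should it be stressed?)
targets = [("h-e", True), ("e-l", True), ("l-o", False)]

diphone_units = diphone_select(diphone_inventory, targets)
selected_units = unit_select(database, targets)
print([u.stressed for u in diphone_units])
print([u.stressed for u in selected_units])
```

With this toy data the diphone path always returns the single stored (unstressed) example, while the unit selection path can follow the requested stress pattern because it has alternatives to choose from. The cost functions and the `join_weight` value are arbitrary choices for the sketch; real systems use richer features and tuned weights.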