How fast is speech recognition?
This is more than one question. First, one must distinguish Computational-Real-Time (CRT) from Impressionistic-Real-Time (IRT). CRT means that the processing time equals the duration of the speech input (1x real time). IRT is the user’s impression of not having to wait, which depends on intelligent user-interface design, stream-oriented processing, and CRT. If the system keeps the user waiting 20 seconds before responding to a 20-second spoken turn, then even though it achieves CRT, it feels slow, so it is not IRT. An important influence on IRT is whether the system does file (batch) processing or streaming processing. File processing requires the speech waveform input to be complete before any further processing starts. Streaming processing starts decoding blocks of the waveform as they become available. IRT can be improved by various interface-design techniques, such as playing a canned part of a response (which doesn’t depend on the result of the decoding) after the