Maciej Zasadzki ADVICOS
Posted 10/21/2025 10:12 AM

We are using Google Cloud Services for speech recognition in the Play module and have observed that the response time is often quite long – sometimes taking several seconds to complete. This delay negatively affects the user experience.

We would like to ask:

- Which specific parameters can be used in the Play module to reduce the STT response time?
- Are there any recommended settings for improving recognition speed (e.g. timeout settings, silence detection thresholds, interim results, etc.)?
- Is it possible to force early termination of recognition once speech is detected and silence follows?

Any advice or configuration examples would be greatly appreciated.
SupportTeam
Posted 10/22/2025 04:35 AM

First of all, if you have a requirement that involves any Speech Recognition - whether Voice Agents, Transcription, or anything else - please get in touch with sales@voiceguide.com. The best solution to meet your requirements can then be discussed directly.

In your specific case we can see that you have been using the Google Cloud Services STT integration that was created over 6 years ago, and from your post it appears that you are using this integration for a 'Voice Agent' type application. We now offer better options for the 'Voice Agent' use case than that 6 year old approach. Please get in touch with sales@voiceguide.com.

But regarding Google Cloud Services STT:

Google Cloud Services STT has some delay before returning the final confirmation of the detected words and the 'end of speech' marker. In our experience that delay is a bit less than a second, but it may vary by language. "Interim results" for each word arrive earlier - in our experience about half a second earlier (this may also vary by language).

For the fastest reaction with the Google Cloud Services STT based integration, the general approach is as follows:

On the Google Cloud Services STT integration module, certain Result Variables get updated when results arrive - both 'confirmed' and 'interim' results. See:

$RV[moduletitle_speech_preview]
$RV[moduletitle_speech_preview_stability]
$RV[moduletitle_speech]
$RV[moduletitle_speech_stability]

These $RVs get updated in real time as your client is speaking and Google STT sends out the recognition events word by word. $RV[moduletitle_speech] stores the 'confirmed' speech and $RV[moduletitle_speech_preview] shows the just-recognized 'preview'.
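As a minimal sketch of how those two $RVs relate (the getRV() helper below is a hypothetical stand-in for however your supervisory script reads a VoiceGuide Result Variable; 'moduletitle' is a placeholder for your module's title):

```javascript
// Hypothetical helper: getRV(name) stands in for however your
// supervisory script reads a VoiceGuide Result Variable by name.
// Combines the 'confirmed' speech with the latest interim 'preview'
// to form the current best-guess transcript.
function bestTranscript(getRV) {
  const confirmed = getRV("moduletitle_speech") || "";
  const preview = getRV("moduletitle_speech_preview") || "";
  return (confirmed + " " + preview).trim();
}
```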
Since you have access to those $RVs in real time, you can often determine the customer's intent even before the customer has finished speaking, prepare an answer, and start playing something back to the caller just as they are completing their question/statement. Done right, your voice agent starts its reply right as the customer finishes their sentence - just like a live human would.

Note that "preparing an answer" does not necessarily mean always performing TTS and then playing the generated audio. You can select one of a set of previously prepared, ready-to-play sound files - or just start playing a lead-in sound prompt for a certain answer category while generating the rest of the answer with TTS in the background, so the TTS is ready by the time the lead-in sound file finishes playing.

Sometimes it's a good idea to save your generated answers for a while, in case the same answer needs to be re-used on a future call. For all these prepared/saved answers you would need some sort of database storing the previously generated answers and the intent each one was for; a lookup against that database then lets you re-use a previously generated sound file for at least part of your reply. Read up on "Semantic Caching".

Note that this requires ongoing Natural Language Processing (NLP) type analysis as the words arrive - but often the intent is determined fairly easily, and you can add more intelligence to your process to select which intent-determination approach should be used in different situations or for different parts of the caller's question/statement, only resorting to more advanced analysis when necessary. You usually do not need constant NLP analysis, or to call an "AI" type system after every word.
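A minimal sketch of that answer-caching idea (in-memory here for illustration; a real deployment would use a database table mapping intent to the previously generated sound file, and the intent names and file path are made up):

```javascript
// Minimal in-memory semantic cache: maps a detected intent to the
// path of a previously generated TTS sound file, so the audio can be
// re-used on a future call instead of being re-generated.
class AnswerCache {
  constructor() { this.byIntent = new Map(); }
  // Returns a cached sound file path for this intent, or null if
  // the answer has to be generated fresh.
  lookup(intent) { return this.byIntent.get(intent) || null; }
  // Records a freshly generated answer for later re-use.
  store(intent, soundFilePath) { this.byIntent.set(intent, soundFilePath); }
}
```

On a cache miss you generate the TTS and store the resulting file; on a hit you can start playing the saved file immediately.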
Any process (even a VBScript or JavaScript started just before the STT module and matched to that module only) can poll those $RVs, look at both $RV[moduletitle_speech] and $RV[moduletitle_speech_preview], and perform the relevant analysis and reactions. (Make sure you have a short 'sleep' between successive $RV polling calls.)

See the vgEngine trace to observe how these Result Variables get updated in real time during the STT recognition process.

You can verify your Google Cloud Services STT delays by looking at the sound file that stores the speech received by VoiceGuide during the STT process, and comparing what is recorded in that sound file against the timestamps of the Google Cloud events shown in the vgEngine trace. The input data sound file is saved in VoiceGuide's \temp\ subdirectory.

There aren't any settings you can modify in that integration that will affect Google Cloud Services' STT reporting delay.
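A hedged sketch of such a polling loop, in JavaScript. getRV() and onIntent() are hypothetical stand-ins for however your script reads $RVs and triggers the early reaction; the interval and timeout values are assumptions to tune, and the Polish yes/no keywords are just example intents:

```javascript
// Supervisory polling loop (sketch): repeatedly read the confirmed
// and preview $RVs, try to determine intent, and sleep briefly
// between polls as recommended above.
async function pollSpeech(getRV, onIntent, intervalMs = 100, timeoutMs = 8000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Look at both the confirmed speech and the interim preview.
    const text = ((getRV("moduletitle_speech") || "") + " " +
                  (getRV("moduletitle_speech_preview") || "")).trim().toLowerCase();
    if (/\btak\b/.test(text)) return onIntent("yes", text);  // Polish "tak" = yes
    if (/\bnie\b/.test(text)) return onIntent("no", text);   // Polish "nie" = no
    await new Promise(r => setTimeout(r, intervalMs));       // short sleep between polls
  }
  return onIntent("timeout", "");
}
```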
Maciej Zasadzki ADVICOS
Posted 10/22/2025 01:02 PM

We are experiencing long recognition delays when using Google Cloud STT in the Play module. From the logs, we can see that speech is actually recognized quickly, but the recognition process continues for several seconds until EV_TIMEOUT_RECOGNITION_COMPLETE is triggered.

Example from the logs:

At 142433.310, we receive: initialPrompt_library_speech_stability|0.9 → meaning the speech was confidently recognized ("tak mam")
But the recognition session ends only at 142440.443 → more than 7 seconds later

This delay is too long for production use. We would like to ask:

- Is it possible to terminate recognition early once speech is recognized with high stability (e.g., 0.9)?
- Can this be controlled using parameters within the Play module?
- Or does this require a custom script (e.g. based on an evScriptEvent)?

Any advice or configuration examples would be greatly appreciated.

evEngine.txt
SupportTeam
Posted 10/22/2025 11:37 PM

You would need to contact the STT service provider with this question. It sounds like a bug should be reported against their recognition model for your language.

The trace shows that this was a very short utterance. Does a similar delay (lack of response/event from your STT provider) happen for your language when a longer utterance is spoken?

As this was a very short utterance (it looks like a yes/no response was expected?), you could handle such short responses by having your own supervisory process react to them, rather than waiting for the official 'end of speech' event from the STT provider. You will get a faster system reaction to short utterances this way. As outlined in the previous post in this thread:

Quote: "Any process (even a VBScript or JavaScript started just before the STT module and which is matched to that module only) can be polling those $RVs and look at both $RV[moduletitle_speech] and $RV[moduletitle_speech_preview] and perform relevant analysis and reactions. (btw: make sure you have some short 'sleep' between successive $RV polling calls)"
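As a sketch of the early-reaction decision for short utterances, the supervisory process could check the preview $RV against the expected short responses once the reported stability is high enough. The 0.9 threshold and the Polish yes/no keywords follow the example in this thread; both are assumptions to tune for your application:

```javascript
// Decide whether to react early to a short expected utterance based
// on the interim preview text and its stability score, instead of
// waiting for the provider's end-of-speech event. Threshold and
// keyword list are assumptions taken from this thread's example.
function shouldReactEarly(previewText, previewStability,
                          expected = ["tak", "nie"], minStability = 0.9) {
  const firstWord = (previewText || "").trim().toLowerCase().split(/\s+/)[0] || "";
  return previewStability >= minStability && expected.includes(firstWord);
}
```

With the poster's log values this would fire as soon as "tak mam" arrives with stability 0.9, rather than 7+ seconds later at EV_TIMEOUT_RECOGNITION_COMPLETE.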