Maciej Zasadzki ADVICOS
Posted 10/21/2025 10:12 AM

We are using Google Cloud Services for speech recognition in the Play module and have observed that the response time is often quite long – sometimes taking several seconds to complete. This delay negatively affects the user experience.

We would like to ask:

- Which specific parameters can be used in the Play module to reduce the STT response time?
- Are there any recommended settings for improving recognition speed (e.g. timeout settings, silence detection thresholds, interim results, etc.)?
- Is it possible to force early termination of recognition once speech is detected and silence follows?

Any advice or configuration examples would be greatly appreciated.
SupportTeam
Posted 10/22/2025 04:35 AM

First of all, if you have a requirement that involves any Speech Recognition - whether Voice Agents, Transcription, or anything else - please get in touch with sales@voiceguide.com. The best solution to meet your requirements can then be discussed directly.

In your specific case we can see that you have been using the Google Cloud Services STT integration that was created over 6 years ago, and from your post it appears that you are using this integration for a 'Voice Agent' type application. We now offer better options for the 'Voice Agent' use case than that 6 year old approach. Please get in touch with sales@voiceguide.com.

But regarding Google Cloud Services STT:

Google Cloud Services STT has some delay before returning the final confirmation of the detected words and the 'end of speech' marker. In our experience that delay is a bit less than a second, but it may vary by language. "Interim results" for each word arrive earlier - in our experience about half a second earlier (this may also vary by language).

For the fastest reaction with the Google Cloud Services STT based integration, the general approach is as follows:

On the Google Cloud Services STT integration module, certain Result Variables get updated when results arrive - both 'confirmed' and 'interim' results. See:

$RV[moduletitle_speech_preview]
$RV[moduletitle_speech_preview_stability]
$RV[moduletitle_speech]
$RV[moduletitle_speech_stability]

These $RVs get updated in real time as your client is speaking and Google STT sends out the recognition events word by word. $RV[moduletitle_speech] stores the 'confirmed' speech and $RV[moduletitle_speech_preview] shows the just-recognized 'preview'.
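As a minimal sketch of how those two $RVs relate (the getRV() helper below is a hypothetical stand-in for however your supervisory script reads a VoiceGuide Result Variable; 'moduletitle' is a placeholder for your module's title):

```javascript
// Hypothetical helper: getRV(name) stands in for however your
// supervisory script reads a VoiceGuide Result Variable by name.
// Combines the 'confirmed' speech with the latest interim 'preview'
// to form the current best-guess transcript.
function bestTranscript(getRV) {
  const confirmed = getRV("moduletitle_speech") || "";
  const preview = getRV("moduletitle_speech_preview") || "";
  return (confirmed + " " + preview).trim();
}
```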
Since you have access to those $RVs in real time, you can often determine the customer's intent even before the customer has finished speaking, prepare an answer, and start playing something back to the caller just as they are completing their question/statement. Done right, your voice agent starts its reply right as the customer finishes their sentence - just like a live human would.

Note that "preparing an answer" does not necessarily mean always performing TTS and then playing the generated audio. You can select one of a set of previously prepared, ready-to-play sound files - or just start playing a lead-in sound prompt for a certain answer category while generating the rest of the answer with TTS in the background, so the TTS is ready by the time the lead-in sound file finishes playing.

Sometimes it's a good idea to save your generated answers for a while, in case the same answer needs to be re-used on a future call. For all these prepared/saved answers you would need some sort of database storing the previously generated answers and the intent each one was for; a lookup against that database then lets you re-use a previously generated sound file for at least part of your reply. Read up on "Semantic Caching".

Note that this requires ongoing Natural Language Processing (NLP) type analysis as the words arrive - but often the intent is determined fairly easily, and you can add more intelligence to your process to select which intent-determination approach should be used in different situations or for different parts of the caller's question/statement, only resorting to more advanced analysis when necessary. You usually do not need constant NLP analysis, or to call an "AI" type system after every word.
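A minimal sketch of that answer-caching idea (in-memory here for illustration; a real deployment would use a database table mapping intent to the previously generated sound file, and the intent names and file path are made up):

```javascript
// Minimal in-memory semantic cache: maps a detected intent to the
// path of a previously generated TTS sound file, so the audio can be
// re-used on a future call instead of being re-generated.
class AnswerCache {
  constructor() { this.byIntent = new Map(); }
  // Returns a cached sound file path for this intent, or null if
  // the answer has to be generated fresh.
  lookup(intent) { return this.byIntent.get(intent) || null; }
  // Records a freshly generated answer for later re-use.
  store(intent, soundFilePath) { this.byIntent.set(intent, soundFilePath); }
}
```

On a cache miss you generate the TTS and store the resulting file; on a hit you can start playing the saved file immediately.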
Any process (even a VBScript or JavaScript started just before the STT module and matched to that module only) can poll those $RVs, look at both $RV[moduletitle_speech] and $RV[moduletitle_speech_preview], and perform the relevant analysis and reactions. (Make sure you have a short 'sleep' between successive $RV polling calls.)

See the vgEngine trace to observe how these Result Variables get updated in real time during the STT recognition process.

You can verify your Google Cloud Services STT delays by looking at the sound file that stores the speech received by VoiceGuide during the STT process, and comparing what is recorded in that sound file against the timestamps of the Google Cloud events shown in the vgEngine trace. The input data sound file is saved in VoiceGuide's \temp\ subdirectory.

There aren't any settings you can modify in that integration that will affect Google Cloud Services' STT reporting delay.
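A hedged sketch of such a polling loop, in JavaScript. getRV() and onIntent() are hypothetical stand-ins for however your script reads $RVs and triggers the early reaction; the interval and timeout values are assumptions to tune, and the Polish yes/no keywords are just example intents:

```javascript
// Supervisory polling loop (sketch): repeatedly read the confirmed
// and preview $RVs, try to determine intent, and sleep briefly
// between polls as recommended above.
async function pollSpeech(getRV, onIntent, intervalMs = 100, timeoutMs = 8000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Look at both the confirmed speech and the interim preview.
    const text = ((getRV("moduletitle_speech") || "") + " " +
                  (getRV("moduletitle_speech_preview") || "")).trim().toLowerCase();
    if (/\btak\b/.test(text)) return onIntent("yes", text);  // Polish "tak" = yes
    if (/\bnie\b/.test(text)) return onIntent("no", text);   // Polish "nie" = no
    await new Promise(r => setTimeout(r, intervalMs));       // short sleep between polls
  }
  return onIntent("timeout", "");
}
```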
Maciej Zasadzki ADVICOS
Posted 10/22/2025 01:02 PM

We are experiencing long recognition delays when using Google Cloud STT in the Play module. From the logs, we can see that speech is actually recognized quickly, but the recognition process continues for several seconds until EV_TIMEOUT_RECOGNITION_COMPLETE is triggered.

Example from the logs:

At 142433.310, we receive: initialPrompt_library_speech_stability|0.9 → meaning the speech was confidently recognized ("tak mam")
But the recognition session ends only at 142440.443 → more than 7 seconds later

This delay is too long for production use. We would like to ask:

- Is it possible to terminate recognition early once speech is recognized with high stability (e.g., 0.9)?
- Can this be controlled using parameters within the Play module?
- Or does this require a custom script (e.g. based on an evScriptEvent)?

Any advice or configuration examples would be greatly appreciated.

evEngine.txt
SupportTeam
Posted 10/22/2025 11:37 PM

You would need to contact the STT service provider with this question. It sounds like a bug should be reported against their recognition model for your language.

The trace shows that this was a very short utterance. Does a similar delay (lack of response/event from your STT provider) happen for your language when a longer utterance is spoken?

As this was a very short utterance (it looks like a yes/no response was expected?), you could handle such short responses by having your own supervisory process react to them, rather than waiting for the official 'end of speech' event from the STT provider. You will get a faster system reaction to short utterances this way. As outlined in the previous post in this thread:

Quote: "Any process (even a VBScript or JavaScript started just before the STT module and which is matched to that module only) can be polling those $RVs and look at both $RV[moduletitle_speech] and $RV[moduletitle_speech_preview] and perform relevant analysis and reactions. (btw: make sure you have some short 'sleep' between successive $RV polling calls)"
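As a sketch of the early-reaction decision for short utterances, the supervisory process could check the preview $RV against the expected short responses once the reported stability is high enough. The 0.9 threshold and the Polish yes/no keywords follow the example in this thread; both are assumptions to tune for your application:

```javascript
// Decide whether to react early to a short expected utterance based
// on the interim preview text and its stability score, instead of
// waiting for the provider's end-of-speech event. Threshold and
// keyword list are assumptions taken from this thread's example.
function shouldReactEarly(previewText, previewStability,
                          expected = ["tak", "nie"], minStability = 0.9) {
  const firstWord = (previewText || "").trim().toLowerCase().split(/\s+/)[0] || "";
  return previewStability >= minStability && expected.includes(firstWord);
}
```

With the poster's log values this would fire as soon as "tak mam" arrives with stability 0.9, rather than 7+ seconds later at EV_TIMEOUT_RECOGNITION_COMPLETE.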