How about an EVNT chunk in the wave files? #1473

FrontierDK · 2022-04-05T20:25:16Z

Hi all :)

When using the good ol' SAPI interface, events could be embedded in the .WAV files, allowing accurate timing when, for example, highlighting words on a web site. SAPI stored them in an EVNT chunk inside the .WAV file. How about adding this to the TTS part? Other services expose comparable timing information alongside their SSML support (see the Azure and Amazon Polly links below).
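For anyone who wants to check whether a given .WAV file carries such a chunk, here is a minimal sketch in Python. It relies only on the generic RIFF layout (a 4-byte chunk ID, a 4-byte little-endian size, then the data padded to an even length). The function names and the demo file are hypothetical, the chunk order in the demo is arbitrary, and the EVNT payload is a placeholder: the sketch does not attempt to decode SAPI's serialized event records.

```python
import io
import struct


def list_riff_chunks(stream):
    """Return (chunk_id, data_offset, size) for each top-level chunk of a RIFF/WAVE stream."""
    header = stream.read(12)
    if len(header) < 12:
        raise ValueError("not a RIFF/WAVE file")
    riff, _total_size, form = struct.unpack("<4sI4s", header)
    if riff != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    while True:
        head = stream.read(8)
        if len(head) < 8:
            break
        chunk_id, size = struct.unpack("<4sI", head)
        chunks.append((chunk_id.decode("ascii", "replace"), stream.tell(), size))
        # Chunk data is padded to an even number of bytes.
        stream.seek(size + (size & 1), io.SEEK_CUR)
    return chunks


def build_demo_wav():
    """Build a tiny in-memory WAV with a placeholder EVNT chunk (NOT real SAPI data)."""
    def chunk(chunk_id, payload):
        pad = b"\x00" if len(payload) & 1 else b""
        return chunk_id + struct.pack("<I", len(payload)) + payload + pad

    # PCM, mono, 16 kHz, 16-bit
    fmt = struct.pack("<HHIIHH", 1, 1, 16000, 32000, 2, 16)
    body = (
        b"WAVE"
        + chunk(b"fmt ", fmt)
        + chunk(b"EVNT", b"xyz")  # odd length on purpose, to exercise the padding rule
        + chunk(b"data", b"\x00\x00" * 4)
    )
    return b"RIFF" + struct.pack("<I", len(body)) + body


if __name__ == "__main__":
    for chunk_id, offset, size in list_riff_chunks(io.BytesIO(build_demo_wav())):
        print(chunk_id, offset, size)
```

To try it on a real file, open it with `open(path, "rb")` and pass the file object to `list_riff_chunks`; a chunk with the ID `EVNT` in the result would indicate that events are present.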

Here is a demo .WAV file which contains events:
Events.zip

Besides text highlighting, the events can also be used for other functions, such as lip syncing:
https://www.youtube.com/watch?v=ui9XT47uwxs

More info
https://documentation.help/SAPI-5/WP_SimpleTTS.htm
https://groups.google.com/g/microsoft.public.speech_tech.sdk/c/VfotWbZ7oDQ?pli=1
https://groups.google.com/g/microsoft.public.speech_tech.sdk/c/R6vbasYoHNQ/m/1EpOaKUslloJ
https://github.com/JakobOvrum/speech4d/blob/master/source/speech/windows/sapi.d
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?pivots=programming-language-csharp
https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html

Here are the SAPI events for the attached file (times in seconds, written with a decimal comma):

Event START_INPUT_STREAM at 0
Event VOICE_CHANGE at 0
Event PHONEME at 0
Event VISEME at 0
Event SENTENCE_BOUNDARY at 0,00
Event WORD_BOUNDARY at 0,00
Event PHONEME at 0,00
Event VISEME at 0,00
Event PHONEME at 0,06
Event VISEME at 0,06
Event PHONEME at 0,15
Event VISEME at 0,15
Event WORD_BOUNDARY at 0,26
Event PHONEME at 0,26
Event VISEME at 0,26
Event PHONEME at 0,38
Event VISEME at 0,38
Event PHONEME at 0,49
Event VISEME at 0,49
Event WORD_BOUNDARY at 0,53
Event PHONEME at 0,53
Event VISEME at 0,53
Event PHONEME at 0,62
Event VISEME at 0,62
Event PHONEME at 0,79
Event VISEME at 0,79
Event WORD_BOUNDARY at 0,96
Event PHONEME at 0,96
Event VISEME at 0,96
Event PHONEME at 1,01
Event VISEME at 1,01
Event PHONEME at 1,07
Event VISEME at 1,07
Event WORD_BOUNDARY at 1,11
Event PHONEME at 1,11
Event VISEME at 1,11
Event PHONEME at 1,15
Event VISEME at 1,15
Event PHONEME at 1,21
Event VISEME at 1,21
Event PHONEME at 1,27
Event VISEME at 1,27
Event WORD_BOUNDARY at 1,32
Event PHONEME at 1,32
Event VISEME at 1,32
Event PHONEME at 1,39
Event VISEME at 1,39
Event WORD_BOUNDARY at 1,47
Event PHONEME at 1,47
Event VISEME at 1,47
Event PHONEME at 1,51
Event VISEME at 1,51
Event PHONEME at 1,57
Event VISEME at 1,57
Event PHONEME at 1,66
Event VISEME at 1,66
Event PHONEME at 1,74
Event VISEME at 1,74
Event PHONEME at 1,79
Event VISEME at 1,79
Event WORD_BOUNDARY at 1,84
Event PHONEME at 1,84
Event VISEME at 1,84
Event PHONEME at 1,92
Event VISEME at 1,92
Event PHONEME at 1,96
Event VISEME at 1,96
Event WORD_BOUNDARY at 1,99
Event PHONEME at 1,99
Event VISEME at 1,99
Event PHONEME at 2,02
Event VISEME at 2,02
Event WORD_BOUNDARY at 2,08
Event PHONEME at 2,08
Event VISEME at 2,08
Event PHONEME at 2,15
Event VISEME at 2,15
Event PHONEME at 2,21
Event VISEME at 2,21
Event PHONEME at 2,26
Event VISEME at 2,26
Event PHONEME at 2,32
Event VISEME at 2,32
Event PHONEME at 2,40
Event VISEME at 2,40
Event PHONEME at 2,49
Event VISEME at 2,49
Event PHONEME at 2,55
Event VISEME at 2,55
Event PHONEME at 2,60
Event VISEME at 2,60
Event PHONEME at 2,66
Event VISEME at 2,66
Event PHONEME at 2,81
Event VISEME at 2,81
Event END_INPUT_STREAM at 3,43
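
To show how a listing like the one above could drive word highlighting, here is a rough Python sketch. It assumes the plain-text format used in this post ("Event NAME at TIME", with a decimal comma); the function names are made up for illustration, and this is not how SAPI itself exposes events.

```python
import re

# Matches lines such as "Event WORD_BOUNDARY at 0,26" (decimal comma or point).
EVENT_LINE = re.compile(r"^Event\s+(\w+)\s+at\s+(\d+(?:[.,]\d+)?)\s*$")


def parse_events(log_text):
    """Turn the text listing into a list of (event_name, seconds) tuples."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_LINE.match(line.strip())
        if match:
            name, stamp = match.groups()
            events.append((name, float(stamp.replace(",", "."))))
    return events


def word_spans(events):
    """Pair each WORD_BOUNDARY with the next boundary (or END_INPUT_STREAM)."""
    marks = [t for name, t in events
             if name in ("WORD_BOUNDARY", "END_INPUT_STREAM")]
    return list(zip(marks, marks[1:]))


SAMPLE = """Event START_INPUT_STREAM at 0
Event WORD_BOUNDARY at 0,00
Event PHONEME at 0,00
Event WORD_BOUNDARY at 0,26
Event WORD_BOUNDARY at 0,53
Event END_INPUT_STREAM at 3,43"""

if __name__ == "__main__":
    for start, end in word_spans(parse_events(SAMPLE)):
        print(f"highlight word from {start:.2f}s to {end:.2f}s")
```

Note that the last span runs to END_INPUT_STREAM, so it includes any trailing silence after the final word.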
