This project demonstrates the conversational tools in Google Cloud by building a voice-powered web game.
The game has three functions:
- Shows a picture-based question
- Captures and processes a voice-based answer
- Checks whether the answer is correct and gives feedback
Capturing audio involves device-specific steps and requires referring to native documentation and open-source libraries.
In this project, we use the browser's native `getUserMedia` API to capture audio. We also cover the background on audio streams and audio file formats that is needed for an effective implementation.
For example, in Chrome, we can get access to the user's microphone with the code below:

```js
// Legacy callback-style API; modern browsers prefer the promise-based form:
// navigator.mediaDevices.getUserMedia({ audio: true }).then(onSuccess).catch(onError);
navigator.getUserMedia({ audio: true }, onSuccess, onError);
```
The audio stream then needs to be converted into a compatible format. For this, we use the open-source RecordRTC JavaScript library.

```js
function onSuccess(stream) {
  recordAudio = RecordRTC(stream, { type: "audio", mimeType: "audio/webm" });
}
```
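For illustration, here is a sketch of how such a recorder might be started and stopped to obtain a `webm` blob; the `startAnswer`/`stopAnswer` names and the `sendToServer` helper are illustrative, not part of the project code:

```js
// Assumes recordAudio was created in onSuccess above.
function startAnswer() {
  recordAudio.startRecording();
}

function stopAnswer() {
  // stopRecording takes a callback that fires once the recording is finalized
  recordAudio.stopRecording(() => {
    const blob = recordAudio.getBlob(); // audio/webm blob
    sendToServer(blob);                 // hypothetical upload helper
  });
}
```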
Converting speech to text and vice versa is done using Natural Language Processing (NLP), which enables machines to read and understand human language. NLP models can perform tasks such as transcription (speech to text), speech synthesis (text to speech), sentiment analysis, intent detection, spam filtering, and autocomplete.

When choosing a model, there are two broad options:
- Use an existing model: fast to adopt and benefits from existing advancements, but offers less flexibility.
- Bring your own model (BYOM): gives more control and customization, but requires significant time, money, and expertise.

For deployment, there are also two options:
- On the cloud, via an API or a paid service like Google Cloud Speech: offers processing power and abstraction, but introduces potential security and latency concerns.
- Locally on the device, using capabilities like the Web Speech API: gives data privacy and low latency, but faces challenges with model conversion and limited processing power.
The client code provides functionality for capturing and processing speech using both the Web Speech API and a custom cloud-based solution.
The `SpeechSingleton` class manages a singleton instance of the Web Speech API for speech recognition. It initializes the recognizer, handles speech events, and processes speech results. The `useWebSpeechApi` hook initializes the `SpeechSingleton` with a provided callback for handling speech responses and resets it upon unmounting.
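As a rough sketch, a hook along these lines could wrap the browser recognizer directly; the actual project routes this through the `SpeechSingleton` class, and the details below are assumptions:

```js
import { useEffect } from "react";

// Minimal useWebSpeechApi-style hook: start recognition on mount,
// forward results to the callback, stop on unmount.
export function useWebSpeechApi(onResult) {
  useEffect(() => {
    const Recognition =
      window.SpeechRecognition || window.webkitSpeechRecognition;
    if (!Recognition) return; // Web Speech API not available in this browser

    const recognizer = new Recognition();
    recognizer.continuous = true;
    recognizer.interimResults = true;
    recognizer.onresult = (event) => {
      const result = event.results[event.results.length - 1];
      onResult(result[0].transcript, result.isFinal);
    };
    recognizer.start();

    // Reset the recognizer when the component unmounts
    return () => recognizer.stop();
  }, [onResult]);
}
```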
The cloud-based solution uses `RecordRTC` for audio recording and Socket.IO for streaming audio to a server for processing. The `useCloudSpeechApi` hook sets up a socket connection to receive speech results from the server and starts continuous audio recording using RecordRTC. The recorded audio is streamed to the server for speech-to-text conversion.
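A hedged sketch of this path, using RecordRTC's `timeSlice`/`ondataavailable` options for chunked recording; the `stream-translate` event name comes from the backend description below, while the server URL and the `speech-result` response event are assumptions:

```js
import { io } from "socket.io-client";
import RecordRTC from "recordrtc";

const socket = io("http://localhost:3022");

// Receive transcripts pushed back from the server
socket.on("speech-result", (text) => {
  console.log("Transcript from server:", text);
});

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const recorder = RecordRTC(stream, {
    type: "audio",
    mimeType: "audio/webm",
    timeSlice: 1000, // emit a blob roughly every second
    ondataavailable: (blob) => socket.emit("stream-translate", blob),
  });
  recorder.startRecording();
});
```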
The `speak` function uses the Web Speech API for text-to-speech synthesis. It manages the state of the recognizer and recorder during synthesis so that playback and recording do not interfere with each other.
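A minimal sketch of such a helper, using only the standard synthesis API; the recognizer/recorder bookkeeping is project-specific and omitted here:

```js
function speak(text, onDone) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = "en-US";
  utterance.onend = () => {
    if (onDone) onDone(); // e.g. resume the recognizer/recorder here
  };
  window.speechSynthesis.speak(utterance);
}
```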
Overall, the code provides a flexible setup for speech recognition and synthesis, leveraging both client-side APIs and server-side processing.
The Web Speech API enables voice data handling in web apps through two main components:
- SpeechSynthesis (Text-to-Speech) and
- SpeechRecognition (Asynchronous Speech Recognition).
Speech recognition is accessed via the `SpeechRecognition` interface, which recognizes voice input from an audio source and responds accordingly. It uses the `SpeechGrammar` interface to define the grammar to recognize, written in the JSpeech Grammar Format (JSGF).
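For example, a JSGF grammar can be attached to a recognizer roughly like this; the word list is made up for illustration, and browser support for grammars varies:

```js
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const GrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;

const grammar =
  "#JSGF V1.0; grammar animals; public <animal> = cat | dog | bird ;";

const recognition = new Recognition();
const grammarList = new GrammarList();
grammarList.addFromString(grammar, 1); // weight 1 = highest priority
recognition.grammars = grammarList;

recognition.onresult = (event) => {
  console.log(event.results[0][0].transcript);
};
recognition.start();
```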
Speech synthesis is accessed via the `SpeechSynthesis` interface and allows web apps to read text content out loud. Available voices are represented by `SpeechSynthesisVoice` objects, and the text to be spoken by `SpeechSynthesisUtterance` objects.
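Voices can be listed and assigned to an utterance like this; the language filter is just an example, and the available voices depend on the user's browser:

```js
const utterance = new SpeechSynthesisUtterance("Is this a cat or a dog?");

// getVoices() may return an empty list until the voiceschanged event fires
const voices = window.speechSynthesis.getVoices();
const englishVoice = voices.find((v) => v.lang.startsWith("en"));
if (englishVoice) utterance.voice = englishVoice;

window.speechSynthesis.speak(utterance);
```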
The backend code (`server` folder) sets up an Express server with Socket.IO to handle real-time audio transcription using Google Cloud's Speech-to-Text service.
- Server setup: An Express application is created, serving a simple "hello world" response on the root endpoint and listening on port 3022.
- Socket.IO integration: The server uses Socket.IO to handle incoming connections. When a client sends an audio stream (`stream-translate` event), the server saves the audio stream to a file and processes it using the `transcribeAudioStream` function.
- Transcription logic: The `transcribeAudioStream` function uses Google Cloud's Speech-to-Text client to transcribe the audio stream. It creates a request with the specified audio settings and streams the audio to Google Cloud's API. The transcribed results are sent back to the client via Socket.IO events.
- Google Cloud Speech-to-Text: The code configures the Speech-to-Text client with parameters such as sample rate, encoding, and language code, ensuring the audio settings match between the client and the server.
We enable real-time audio transcription by streaming audio from the client to the server, which then uses Google Cloud's Speech-to-Text service to transcribe the audio and send back the results.
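A simplified sketch of such a server, assuming the `@google-cloud/speech` Node.js client and a `speech-result` response event; the response event name and config values are assumptions, and the intermediate file-saving step is omitted:

```js
const express = require("express");
const http = require("http");
const { Server } = require("socket.io");
const speech = require("@google-cloud/speech");

const app = express();
app.get("/", (req, res) => res.send("hello world"));

const server = http.createServer(app);
const io = new Server(server, { cors: { origin: "*" } });
const speechClient = new speech.SpeechClient();

// Stream a chunk of audio to Google Cloud Speech-to-Text and pass
// each transcript to the callback.
function transcribeAudioStream(audio, callback) {
  const request = {
    config: {
      encoding: "WEBM_OPUS",  // must match the client's audio/webm recording
      sampleRateHertz: 48000, // must match the client's sample rate
      languageCode: "en-US",
    },
    interimResults: true,
  };
  const recognizeStream = speechClient
    .streamingRecognize(request)
    .on("error", console.error)
    .on("data", (data) => {
      const transcript = data.results[0]?.alternatives[0]?.transcript;
      if (transcript) callback(transcript);
    });
  recognizeStream.write(audio);
  recognizeStream.end();
}

io.on("connection", (socket) => {
  socket.on("stream-translate", (audio) => {
    transcribeAudioStream(audio, (text) => socket.emit("speech-result", text));
  });
});

server.listen(3022, () => console.log("listening on 3022"));
```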
Install the packages by navigating to the `client` folder and running `npm install`. Run `npm start` to start the app.
By default, the app uses the browser-based Web Speech API. To use the Cloud Speech API, navigate to the `server` folder, then run `npm install` and `npm start` to start the Node server. In the client app, switch from the default `useWebSpeechApi` hook to the `useCloudSpeechApi` hook.