Here’s a README.md
file for your project:
# Multi-Modal AI Voice Assistant
## Overview
The Multi-Modal AI Voice Assistant is a sophisticated application designed to interact with users through advanced text and image-based responses. Utilizing cutting-edge technologies like Groq and Google Generative AI, this assistant provides comprehensive and contextually accurate answers based on user inputs. It integrates various functionalities, including image processing, clipboard text extraction, and a versatile user interface to enhance the overall user experience.
## Features
- **Advanced AI Integration**: Utilizes Groq and Google Generative AI for high-quality text and image-based responses.
- **Image Processing**: Includes capabilities for screenshot capture and webcam image analysis to provide detailed context.
- **Clipboard Extraction**: Extracts and incorporates clipboard text into responses for enriched user interaction.
- **Versatile Interface**: Supports both text input and manual command selection for flexible interaction with the assistant.
## Installation
To get started with the Multi-Modal AI Voice Assistant, follow these installation steps:
1. **Clone the Repository**:
```bash
git clone https://github.com/Aditya-professional-life/multi-modal-ai-assistant.git
-
Navigate to the Project Directory:
cd multi-modal-ai-assistant
-
Install the Required Dependencies:
pip install -r requirements.txt
Ensure that you have the following libraries installed:
faster-whisper
groq
google-generativeai
opencv-python
Pillow
pyperclip
pyttsx3
SpeechRecognition
pytesseract
-
API Keys: Configure your API keys for Groq and Google Generative AI by replacing the placeholders in the code with your actual keys.
-
Webcam and Screenshot Paths: The application saves captured images and screenshots as
webcam.jpg
andscreenshot.jpg
respectively. Ensure these paths are writable.
-
Running the Assistant:
Execute the script:
python main.py
-
Interaction Modes:
- Voice Mode: Say the wake word followed by your command to interact via voice.
- Text Mode: Type your prompt directly and see the response.
-
Commands:
- "Take Screenshot": Captures a screenshot of your current screen.
- "Capture Webcam": Takes a photo using the webcam.
- "Extract Clipboard": Incorporates text from the clipboard into the response.
Contributions are welcome! Please fork the repository and submit a pull request with your proposed changes.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or support, please reach out to [email protected].
Happy coding!
Feel free to adjust any specific details, like API key setup or paths, to match your actual project setup.