This repo is an experimental prototype combining Google Gemini (formerly Bard), Google Quickdraw, OpenCV, SpeechRecognition and Python. A person or child draws a doodle with their fingers using OpenCV, and Gemini tries to guess what the doodle is. Google Quickdraw is used to generate more doodles of what the drawing might be. The person can tell Gemini whether the guess is right or wrong via one of the following user modes: keyboard input, microphone/speech input, or hand sign. If asked, Gemini will tell a story about the drawings. The end.
Sort of. I said this prototype was for Gemini to guess about a drawing. However, Gemini happens to be a bit wonky at the moment, often overexplaining itself, lecturing you about political correctness or creepily telling you about your current location.
The other purpose of this prototype is to guide Gemini toward more helpful and appropriate content for the user. There are three modes to achieve this:
- `--ref_mode text_ref` submits explicit instructions and context to Gemini. These are found in the various `.txt` files and can be set once. This could be considered a "first order" method.
- `--ref_mode self_ref` feeds Gemini back its own reply with instructions found in the `doodle_refine_instruction.txt` file. This could be considered a "second order" method, as Gemini is tasked with a second reply to refine its first reply.
- `--ref_mode critic_ref` takes this further and creates a second instance of Gemini with the instructions to refine the first Gemini's reply.

Users may easily modify the text files, but be sure to keep the same formatting. After you complete your session, you may compare the results of these "interventions" in the `sessions.csv` file.
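The three reference modes can be sketched as a small dispatch function. This is only an illustration of the idea, not the repo's actual code: `refine_reply`, the `model` and `critic` callables, and the way instructions are joined to the prompt are all assumptions, and the Gemini call is replaced by a plain function that maps a prompt to a reply.

```python
from typing import Callable, Optional

# Hypothetical stand-in for a Gemini session: takes a prompt, returns a reply.
Model = Callable[[str], str]


def refine_reply(
    prompt: str,
    ref_mode: str,
    model: Model,
    critic: Optional[Model] = None,
    context: str = "",
    refine_instruction: str = "",
) -> str:
    """Return a reply according to the chosen reference mode (illustrative only)."""
    if ref_mode == "text_ref":
        # First order: prepend explicit instructions/context to the prompt.
        return model(f"{context}\n{prompt}".strip())

    # Both second-order modes start from an unrefined first reply.
    first_reply = model(prompt)

    if ref_mode == "self_ref":
        # The same model is asked to refine its own reply.
        return model(f"{refine_instruction}\n{first_reply}")

    if ref_mode == "critic_ref":
        # A second, separate model instance refines the first reply.
        if critic is None:
            raise ValueError("critic_ref needs a second model instance")
        return critic(f"{refine_instruction}\n{first_reply}")

    raise ValueError(f"unknown ref_mode: {ref_mode}")
```

Passing the models in as callables keeps the sketch testable without a network connection; in practice each callable would wrap a real Gemini session.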
Below, the critic Gemini is instructed to simplify the first Gemini's reply so a 4-year-old can understand it, and to not talk about itself.
First Order | Second Order |
---|---|
In this session, we had a conversation about what you might be drawing. I guessed that you were drawing a bird based on the limited information I have about your age and the potential clues in the image you might be creating. We also discussed the challenges of accurately guessing a drawing without seeing it. | We talked about your drawing! It's hard to know for sure what it is without seeing it all, but maybe it's a bird? |
It is recommended to set up a Conda or venv environment and run `pip install -r requirements.txt` in the command line with the environment active.
You will also need to install Firefox and have an internet connection. You may need to open a Gemini session in Firefox first so the program can access the cookies. However, in my use this only needs to be done once; afterwards the cookies can be accessed even if the Firefox app is closed.
This repo includes about 10 categories of Google Quickdraw data. If you want all 345 categories, download the bin files here and update the `doodle_list.txt` file accordingly.
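If you want to inspect a downloaded bin file yourself, the sketch below reads one drawing at a time. It assumes the files follow the binary layout documented in the Quick, Draw! dataset repository (a fixed header followed by per-stroke x and y byte arrays); `read_drawing` is a hypothetical helper name, and this repo may load the data differently.

```python
import struct
from typing import BinaryIO, Optional

# Assumed header layout: key_id (uint64), country code (2 bytes),
# recognized flag (int8), timestamp (uint32), stroke count (uint16).
HEADER = struct.Struct("<Q2sbIH")


def read_drawing(stream: BinaryIO) -> Optional[dict]:
    """Read one drawing record, or return None at end of file."""
    header = stream.read(HEADER.size)
    if len(header) < HEADER.size:
        return None
    key_id, country, recognized, timestamp, n_strokes = HEADER.unpack(header)

    strokes = []
    for _ in range(n_strokes):
        (n_points,) = struct.unpack("<H", stream.read(2))
        # Each stroke stores all x coordinates, then all y coordinates (0-255).
        xs = struct.unpack(f"<{n_points}B", stream.read(n_points))
        ys = struct.unpack(f"<{n_points}B", stream.read(n_points))
        strokes.append((xs, ys))

    return {
        "key_id": key_id,
        "country_code": country.decode("ascii", errors="replace"),
        "recognized": bool(recognized),
        "timestamp": timestamp,
        "strokes": strokes,
    }
```

Calling `read_drawing` repeatedly on a file opened with `open(path, "rb")` walks through the drawings until it returns `None`.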
Here is a quick demo of what you should see when running the code, including some bugs and suboptimal performance of the free version of Gemini, opencv, my drawing skills, coding, etc.
`python main.py` will run the program with the default `--user_mode` of keyboard input and the default `--ref_mode` of `text_ref`.
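For readers who want to see how such defaults are typically declared, here is a minimal `argparse` sketch. The flag names, choices and defaults come from this README; `build_parser` and everything else about the layout is an assumption and may not match `main.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two mode flags (illustrative, not the repo's code)."""
    parser = argparse.ArgumentParser(description="Doodle guessing prototype (sketch)")
    parser.add_argument(
        "--user_mode",
        choices=["keyboard", "speech", "sign"],
        default="keyboard",
        help="how the user talks to Gemini",
    )
    parser.add_argument(
        "--ref_mode",
        choices=["text_ref", "self_ref", "critic_ref"],
        default="text_ref",
        help="how Gemini's reply is guided or refined",
    )
    return parser


# Parsing an empty argument list yields the defaults described above.
args = build_parser().parse_args([])
```

With a parser like this, `python main.py --user_mode speech --ref_mode critic_ref` would select speech input and the critic refinement mode.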
In `--user_mode keyboard` (default), type the instruction you want Gemini to perform and press enter. Recall that one purpose of this experiment is to guide Gemini's response, so the various `.txt` files contain keywords and the instructions that those keywords trigger. Try modifying these. Default keywords are `guess` to guess the drawing and `wrong`, `try`, `again` to ask Gemini to try again.
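The keyword idea can be sketched as a simple lookup. The table below is hypothetical: the real keywords and instruction texts live in the repo's `.txt` files, whose exact format is not shown here, and `match_instruction` is a name invented for this example.

```python
import re
from typing import Dict, Optional

# Hypothetical keyword -> instruction table (the repo reads these from .txt files).
# "wrong" is listed first so that "wrong, guess again" triggers a retry.
KEYWORD_INSTRUCTIONS: Dict[str, str] = {
    "wrong": "That guess was wrong. Please try a different guess.",
    "guess": "Guess what the drawing is.",
}


def match_instruction(user_text: str, table: Dict[str, str]) -> Optional[str]:
    """Return the instruction for the first keyword found in the user's text."""
    words = set(re.findall(r"[a-z']+", user_text.lower()))
    for keyword, instruction in table.items():
        if keyword in words:
            return instruction
    return None
```

Matching on whole words rather than substrings avoids accidental triggers such as "guess" inside "guessed".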
`--user_mode speech` functions similarly, except you simply speak instead of typing. It should automatically record when clear speech is detected. When ready, press `s` to send the speech to Gemini.
`--user_mode sign` is more limited:
- raise your left index finger to have Gemini guess about the drawing
- raise your left index and middle finger to have Gemini guess again
- raise your left index, middle and ring finger to have Gemini tell a story about the drawing (what Gemini guessed).
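The sign mode amounts to a lookup from left-hand finger states to an action. The sketch below assumes the hand tracker already reports which fingers are raised as booleans; how that detection happens is not covered here, and the action names and `left_hand_action` are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# (index_up, middle_up, ring_up) for the left hand -> hypothetical action name.
LEFT_HAND_ACTIONS: Dict[Tuple[bool, bool, bool], str] = {
    (True, False, False): "guess",
    (True, True, False): "guess_again",
    (True, True, True): "story",
}


def left_hand_action(index_up: bool, middle_up: bool, ring_up: bool) -> Optional[str]:
    """Map raised left-hand fingers to an action, or None for any other pose."""
    return LEFT_HAND_ACTIONS.get((index_up, middle_up, ring_up))
```

Any pose outside the three listed gestures returns `None`, so stray hand shapes do not send anything to Gemini.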
The `--ref_mode` options have been discussed in the introduction. The other args, `--smooth` and `--mode_len`, control the drawing and the number of frames used to compute gesture recognition, respectively. `--smooth` must be between 0 and 1, while `--mode_len` can be any integer from 1 upward. Keep in mind that we probably have 30 or so frames a second, so 30 might be a reasonable upper limit.
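One plausible reading of these two arguments is sketched below, and it is only a guess: it assumes `--smooth` is an exponential smoothing weight applied to the fingertip position, and that `--mode_len` is the size of a sliding window over which the most frequent gesture (the statistical mode) is taken. `smooth_point` and `GestureVoter` are hypothetical names, and the repo may compute both differently.

```python
from collections import Counter, deque
from typing import Hashable, Tuple


def smooth_point(
    prev: Tuple[float, float], new: Tuple[float, float], smooth: float
) -> Tuple[float, float]:
    """Blend the previous and new point; smooth=0 follows the new point exactly."""
    if not 0.0 <= smooth <= 1.0:
        raise ValueError("smooth must be between 0 and 1")
    return (
        smooth * prev[0] + (1.0 - smooth) * new[0],
        smooth * prev[1] + (1.0 - smooth) * new[1],
    )


class GestureVoter:
    """Report the most common gesture over the last mode_len frames."""

    def __init__(self, mode_len: int) -> None:
        if mode_len < 1:
            raise ValueError("mode_len must be at least 1")
        self.window: deque = deque(maxlen=mode_len)

    def update(self, gesture: Hashable) -> Hashable:
        # Oldest frames fall out of the window automatically.
        self.window.append(gesture)
        return Counter(self.window).most_common(1)[0][0]
```

Under this reading, a larger `mode_len` makes gesture recognition steadier but slower to react: at roughly 30 frames a second, a window of 30 means a new gesture needs about half a second of frames before it wins the vote.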
To draw, press `up arrow` to enter draw mode. `Down arrow` will exit draw mode. Draw options are as follows:
- raise your right index finger to draw a thin line
- raise your right index and middle finger to draw a thicker line
- raise your right index, middle and ring finger to erase the drawing.
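The draw-mode rules above can be sketched as a small state object. Everything here is illustrative: the key names are plain strings rather than real OpenCV key codes (which vary by platform), the line widths are made-up values, and `DrawState` is not taken from the repo.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical line widths in pixels.
THIN_WIDTH = 2
THICK_WIDTH = 8


@dataclass
class DrawState:
    drawing: bool = False  # draw mode is off until the up arrow is pressed

    def on_key(self, key: str) -> None:
        """Toggle draw mode from an abstract key name ("up" or "down")."""
        if key == "up":
            self.drawing = True
        elif key == "down":
            self.drawing = False

    def brush_for(
        self, index_up: bool, middle_up: bool, ring_up: bool
    ) -> Optional[Tuple[str, int]]:
        """Map raised right-hand fingers to a brush action while in draw mode."""
        if not self.drawing or not index_up:
            return None
        if middle_up and ring_up:
            return ("erase", 0)
        if middle_up:
            return ("line", THICK_WIDTH)
        if not ring_up:
            return ("line", THIN_WIDTH)
        return None
```

Keeping the mode flag separate from the gesture lookup means the right hand can move freely without leaving marks whenever draw mode is off.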