Inspiration
We were inspired by the difficulty that people who rely on augmentative and alternative communication (AAC) tools face in communicating quickly and effectively. We saw a group of people who could greatly benefit from the Gemini API to communicate faster and with more personality.
What it does
GeminAAC takes audio input, an image of the user's face, and engineered text, and automatically generates possible responses for an AAC user. Instead of pressing several buttons to navigate a complex UI, the user can choose from a selection of generated responses or use the streaming feature to quickly respond word by word. These responses are tailored to the user's age, sex, occupation, interests, and hobbies. Our Gemini-powered solution lets users efficiently select a response that reflects their personality, relieving as much burden on the user as possible.
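For a sense of how that personalization works, here is a minimal sketch of folding a user profile into the model's system instruction (the field names and wording are illustrative, not our exact schema):

```python
# Hypothetical user profile; the fields mirror what GeminAAC collects,
# but the exact schema and wording here are illustrative.
profile = {
    "age": 34,
    "occupation": "teacher",
    "interests": ["gardening", "sci-fi novels"],
}

def build_system_instruction(p):
    """Fold user details into a persona so suggested replies sound like the user."""
    return (
        f"You suggest short spoken replies for an AAC user: a {p['age']}-year-old "
        f"{p['occupation']} who enjoys {', '.join(p['interests'])}. "
        "Offer 3-5 natural, first-person responses to what was just said."
    )
```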
How we built it
GeminAAC's backend is written in Python. We used PyAudio to capture streamed audio and OpenCV to capture the user's facial expression. We uploaded these media files to the Google File API so they could be fed into Gemini. Our Gemini prompt combined these media files, engineered text asking the model for possible responses, system instructions describing the user and their personality, and a JSON schema specification to guarantee we could parse Gemini's responses. We then called the Gemini REST API with this prompt using the Python requests library. Gemini's response was sent to our frontend either as fully generated replies or as streamed words that the user can select individually. We seamlessly integrated our frontend with the existing open-source AAC software Cboard to give the user comprehensive options.
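In condensed form, the capture-and-upload half of that pipeline looks roughly like the sketch below. This is a minimal illustration assuming the documented File API resumable-upload flow; the function names, clip length, and error handling are simplified stand-ins for our actual code:

```python
import os
import wave

import cv2
import pyaudio
import requests

API_KEY = os.environ["GEMINI_API_KEY"]  # assumes the key lives in an env var
BASE = "https://generativelanguage.googleapis.com"

def record_wav(path="utterance.wav", seconds=5, rate=16000):
    """Capture a short clip from the default microphone with PyAudio."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=1024)
    frames = [stream.read(1024) for _ in range(int(rate / 1024 * seconds))]
    stream.stop_stream(); stream.close()
    width = pa.get_sample_size(pyaudio.paInt16)
    pa.terminate()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1); wf.setsampwidth(width); wf.setframerate(rate)
        wf.writeframes(b"".join(frames))
    return path

def snap_face(path="face.jpg"):
    """Grab one webcam frame with OpenCV to capture the user's expression."""
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    cap.release()
    if ok:
        cv2.imwrite(path, frame)
    return path

def upload_file(path, mime_type):
    """Resumable upload to the File API; returns a file URI usable in prompts."""
    data = open(path, "rb").read()
    start = requests.post(
        f"{BASE}/upload/v1beta/files?key={API_KEY}",
        headers={
            "X-Goog-Upload-Protocol": "resumable",
            "X-Goog-Upload-Command": "start",
            "X-Goog-Upload-Header-Content-Length": str(len(data)),
            "X-Goog-Upload-Header-Content-Type": mime_type,
            "Content-Type": "application/json",
        },
        json={"file": {"display_name": os.path.basename(path)}},
    )
    upload_url = start.headers["X-Goog-Upload-URL"]
    done = requests.post(
        upload_url,
        headers={"X-Goog-Upload-Offset": "0",
                 "X-Goog-Upload-Command": "upload, finalize"},
        data=data,
    )
    return done.json()["file"]["uri"]
```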
Challenges we ran into
On the front-end, we faced serious challenges. We built on an open-source AAC called Cboard, and ran into many issues creating our own components, adding and removing tiles, creating custom tiles, working with the API, handling asynchronous loading of responses, and much more. We would implement a feature only for it to break because of something buried in the massive codebase, and then spend hours tracking down why. We also had issues running the project on Mac, which limited collaboration. On top of that, none of us was particularly experienced in frontend development, so we had to learn a ton of React on the spot, inside a pre-existing, large, undocumented codebase, while implementing an entire copilot and recommendation system.

On the backend, our biggest challenge was building correct API calls. We originally planned to use the Python SDK for its convenience, but the SDK did not include the brand-new JSON feature. We wanted to guarantee that Gemini's response would always be usable by our application, so we built the API calls as raw HTTP requests with the Python requests library. These requests grew complicated, since each one had to carefully combine our audio, image, text, system instructions, prompt configuration, and the conversation history. This was a tedious process, but it had to be done to minimize breaking points.
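For illustration, here is a trimmed-down version of one such hand-built request (the model name, prompt wording, and schema fields are placeholders, and the real call also handled streaming and richer history):

```python
import json
import requests

def suggest_replies(audio_uri, image_uri, system_text, history, api_key):
    """One REST call combining media, instructions, history, and a response schema."""
    url = ("https://generativelanguage.googleapis.com/v1beta/models/"
           f"gemini-1.5-flash:generateContent?key={api_key}")
    body = {
        "systemInstruction": {"parts": [{"text": system_text}]},
        # Prior turns come first, then the new user turn with both media parts.
        "contents": history + [{
            "role": "user",
            "parts": [
                {"fileData": {"mimeType": "audio/wav", "fileUri": audio_uri}},
                {"fileData": {"mimeType": "image/jpeg", "fileUri": image_uri}},
                {"text": "Suggest replies the user might want to say next."},
            ],
        }],
        "generationConfig": {
            # JSON mode plus an explicit schema guarantees parseable output.
            "responseMimeType": "application/json",
            "responseSchema": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "properties": {"reply": {"type": "STRING"}},
                    "required": ["reply"],
                },
            },
        },
    }
    resp = requests.post(url, json=body, timeout=60)
    resp.raise_for_status()
    text = resp.json()["candidates"][0]["content"]["parts"][0]["text"]
    return json.loads(text)  # the schema constraint makes this parse reliably
```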
Accomplishments that we're proud of
We are proud of the solution we can deliver to people who use AAC devices: generating responses and providing suggested completions reduces response latency. We are also proud of capturing emotion through images to guide the responses, of our integration with an open-source AAC to deliver this functionality, and of our product's ability to take in user information and deliver personalized responses. We also take pride in developing a solution that uses the multimodal functionality of the Gemini API and demonstrates its capability to drive meaningful solutions for users.
What we learned
We learned a lot about accessibility devices, from their major pain points to their strengths and benefits. We learned about developing solutions with accessibility in mind, from measuring backend latency to designing an intuitive and efficient user interface for the front end. We also learned about integrating and working with a large language model API, from the benefits of rapid autocompletion and suggested responses to reading documentation closely enough to follow the data-formatting practices needed for deliberate output, such as the JSON mode that was only available through the REST API.
What's next for GeminAAC
We hope to make GeminAAC a product that people will actually use. We still need to iterate on the front-end, as no one on our current team is particularly proficient there. There are some bugs in the backend, especially around Gemini quota limits, that we hope to handle gracefully. We want to reduce GeminAAC's latency, which would greatly improve the user experience. We hope to learn from the user's actual responses to better predict what they might say in the future. We also want to integrate more deeply with the AAC, using an image generation model to generate icons for GeminAAC-generated tiles. Finally, we believe that fine-tuning on common AAC responses would greatly improve our suggested replies and our copilot. We firmly believe GeminAAC is a promising example of using AI for the greater good, and with these improvements, we can build a great product.
Built With
- cv2
- javascript
- pyaudio
- python
- react
- restapis