Inspiration

Imagine a world where the internet is truly accessible to everyone, regardless of age or physical ability. We're building a web-based autonomous AI agent to turn this vision into reality. Our agent acts as a personal navigator for the web, making it easy and intuitive for the elderly and those with physical challenges to explore, interact, and benefit from online resources. By harnessing the power of AI, we aim to eliminate digital barriers and ensure that the internet is a place of empowerment and accessibility for all.

What it does

Our project is a web-based AI agent that navigates websites and performs actions based on user intents. It uses computer vision techniques to analyze the content of web pages, identify clickable elements, and interact with them autonomously. The agent engages in a conversation with the user, understanding their intents through speech-to-text input. It then determines the appropriate actions to take on the website, such as clicking buttons, filling in inputs, or scrolling, to fulfill the user's request. The result is a hands-free, voice-controlled browsing experience that makes it easier for users to interact with websites.
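Under the hood, each spoken request is reduced to a small, fixed set of page actions. Below is a minimal sketch of what that action space could look like; the class name, fields, and example decomposition are illustrative rather than our exact internal schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Illustrative set of actions the agent can perform on a page.
ActionType = Literal["click", "fill", "scroll", "speak"]


@dataclass
class AgentAction:
    """One step the agent takes toward fulfilling a spoken intent."""
    action: ActionType
    target: Optional[str] = None  # description of the element to act on
    value: Optional[str] = None   # text to type when action == "fill"


# Example: "search for winter jackets" might decompose into
#   AgentAction("fill", target="search box", value="winter jackets")
#   AgentAction("click", target="search button")
```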

How we built it

We built the web-based AI agent using a combination of technologies and frameworks. The core of the agent is developed using Python, leveraging the Selenium library for web automation and interaction with web pages. We utilized the Google Vertex AI platform, specifically the Gemini model, to process user intents and determine the appropriate actions to take on the website. The agent employs computer vision techniques, such as element detection and text recognition, to analyze the content of web pages and identify clickable elements. We used the Google Cloud Speech-to-Text API for converting user speech into text and the Google Cloud Text-to-Speech API for generating spoken responses. The agent's decision-making process is guided by a set of predefined rules and heuristics based on the user's intent and the available actions on the web page.
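To give a concrete feel for how these pieces fit together, here is a condensed sketch of the listen–decide–act loop, assuming the standard Google Cloud Speech-to-Text and Text-to-Speech Python clients, the Vertex AI SDK, and Selenium. The project ID, model name, prompt format, and XPath-based targeting below are placeholders rather than our exact configuration.

```python
import json

import vertexai
from google.cloud import speech, texttospeech
from selenium import webdriver
from selenium.webdriver.common.by import By
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholder project
model = GenerativeModel("gemini-1.0-pro")                          # model name is illustrative
stt = speech.SpeechClient()
tts = texttospeech.TextToSpeechClient()
driver = webdriver.Chrome()


def transcribe(audio_bytes: bytes) -> str:
    """Convert the user's recorded speech to text with Cloud Speech-to-Text."""
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = stt.recognize(config=config, audio=audio)
    return response.results[0].alternatives[0].transcript


def plan_action(intent: str, page_elements: list[dict]) -> dict:
    """Ask Gemini to choose the next step given the intent and the visible elements."""
    prompt = (
        "You control a web browser. Given the user's intent and the clickable "
        "elements on the page, reply with a JSON object with keys "
        "'action', 'target', and 'value'.\n"
        f"Intent: {intent}\nElements: {json.dumps(page_elements)}"
    )
    return json.loads(model.generate_content(prompt).text)


def execute(action: dict) -> None:
    """Carry out one planned step with Selenium."""
    if action["action"] == "scroll":
        driver.execute_script("window.scrollBy(0, 600);")
        return
    element = driver.find_element(By.XPATH, action["target"])  # target is an XPath here
    if action["action"] == "click":
        element.click()
    elif action["action"] == "fill":
        element.send_keys(action["value"])


def speak(text: str) -> bytes:
    """Render the agent's spoken reply with Cloud Text-to-Speech."""
    response = tts.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return response.audio_content
```

In practice the bare `json.loads` call above is too optimistic, which is exactly the parsing problem we describe under Challenges.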

Challenges we ran into

During the development process, we encountered several challenges. One of the main ones was accurately identifying and interacting with clickable elements on web pages. Websites often have complex layouts and dynamic content, making it difficult to reliably locate and click on specific elements, so we had to develop robust computer vision techniques to handle different types of elements and account for variations in page structure. Another challenge was that the model's responses to our prompts did not always follow a structure we could reliably parse; we looked into Pydantic and implemented validation checks on the model's output before acting on it. We also had to integrate the various components of the system, such as speech recognition, text-to-speech, and the Gemini model from Google Vertex AI, and ensuring smooth communication and data flow between them required careful design and troubleshooting. Finally, handling edge cases and unexpected user inputs posed challenges in terms of error handling and providing meaningful responses to the user.
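The heart of the fix for the parsing issue was to validate the model's reply against a schema before the agent acts on it. A simplified sketch of that idea using Pydantic v2 syntax follows; the field names, the allowed actions, and the return-None-and-re-prompt policy are illustrative of the approach rather than our exact implementation.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class PlannedAction(BaseModel):
    """Schema the model's reply must satisfy before the agent will act on it."""
    action: str
    target: Optional[str] = None
    value: Optional[str] = None

    @field_validator("action")
    @classmethod
    def action_must_be_known(cls, v: str) -> str:
        if v not in {"click", "fill", "scroll", "speak"}:
            raise ValueError(f"unknown action: {v}")
        return v


def parse_model_reply(raw: str) -> Optional[PlannedAction]:
    """Return a validated action, or None so the caller can re-prompt the model."""
    try:
        # model_validate_json rejects both malformed JSON and out-of-schema fields.
        return PlannedAction.model_validate_json(raw)
    except ValidationError:
        return None
```

Rejecting and retrying a malformed reply turned out to be much safer than letting the agent execute whatever the model produced.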

Accomplishments that we're proud of

We are proud of several accomplishments in our project. Firstly, we successfully developed an AI agent that can autonomously navigate websites based on user intents. The agent can understand natural language commands through speech input and perform actions on web pages accordingly. It can click buttons, fill out forms, scroll through content, and provide spoken feedback to the user. We are proud of the seamless integration of speech recognition, text-to-speech, and web automation technologies, creating a hands-free browsing experience. Additionally, we are pleased with the agent's ability to handle various types of websites and adapt to different page layouts and element structures. The agent's decision-making process, guided by the Gemini model and predefined rules, enables it to determine the most appropriate actions to take based on the user's intent and the available options on the web page.

What we learned

Throughout the development of this project, we gained valuable knowledge and skills in several areas. We learned how to leverage the power of the Google Vertex AI platform, particularly the Gemini model, for natural language understanding and decision-making. We gained experience in integrating speech recognition and text-to-speech technologies to enable voice-based interactions with the AI agent. We deepened our understanding of web automation using Selenium and learned techniques for identifying and interacting with web elements programmatically. Additionally, we learned the importance of error handling and designing robust systems that can handle unexpected user inputs and edge cases. We also gained insights into the challenges and considerations involved in building conversational AI agents that can understand and fulfill user intents in a web browsing context. Overall, this project provided us with hands-on experience in developing an AI-powered web agent and expanded our knowledge in the areas of natural language processing, computer vision, and web automation.

What's next for OASIS - Online Agent for Shopping Interactivity and Support

  • Expand to more websites!
  • Add caching
  • Streamline the overall experience

Built With

  • Python
  • Selenium
  • Google Vertex AI (Gemini)
  • Google Cloud Speech-to-Text
  • Google Cloud Text-to-Speech
