According to the NVIDIA Technical Blog, NVIDIA has unveiled ReMEmbR, a project that uses generative AI to enable robots to reason and act based on extended observations.
Innovative Vision Language Model
Vision language models (VLMs) combine the language understanding of foundation large language models (LLMs) with the visual capabilities of vision transformers (ViTs). By projecting text and images into the same embedding space, these models can take unstructured multimodal data, reason over it, and return structured output. Because they build on extensive pretraining, VLMs can be adapted to a variety of vision-related tasks through new prompts or parameter-efficient fine-tuning.
ReMEmbR: Improving Robot Perception and Autonomy
ReMEmbR combines LLMs, VLMs, and retrieval-augmented generation (RAG) to enable robots to reason and act based on what they observe over long periods of time, from hours to days. The system is designed to address challenges such as handling very large contexts, reasoning over a spatial memory, and building a prompt-based agent that keeps querying for more data until the user’s question is answered.
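The "query until answered" behavior can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions, not ReMEmbR's actual agent: `run_agent`, `retrieve`, `is_answerable`, and the `MEMORIES` list are hypothetical names, and the keyword check stands in for the judgment an LLM would make.

```python
# Hypothetical stored observations; a real system would hold these in a vector database.
MEMORIES = [
    "t=10s: hallway near the entrance",
    "t=95s: kitchen with a coffee machine",
    "t=180s: printer next to a window",
]

def retrieve(question, step):
    """Hypothetical retrieval tool: returns one more memory on each call."""
    return MEMORIES[step:step + 1]

def is_answerable(question, context):
    """Stand-in for the LLM deciding whether it has enough context (assumption)."""
    return any("coffee" in item for item in context)

def run_agent(question, retrieve, is_answerable, max_steps=5):
    """Keep retrieving until the question can be answered or the step budget runs out."""
    context = []
    for step in range(max_steps):
        if is_answerable(question, context):
            return context, step
        context.extend(retrieve(question, step))
    return context, max_steps

context, steps = run_agent("Where is the coffee machine?", retrieve, is_answerable)
print(steps, context[-1])
```

The step budget (`max_steps`) is a design choice in this sketch: it keeps the loop from running forever when the memory holds nothing relevant.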
In the memory-building phase, a VLM and a vector database are used to build a long-horizon semantic memory. In the query phase, an LLM agent reasons over that memory. ReMEmbR is fully open source and runs on-device, making it accessible to a wide range of applications.
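The two phases can be illustrated with a toy example. This is a minimal sketch, not the project's code: the captions stand in for VLM output, the bag-of-words `embed` function stands in for a learned text embedding model, and the in-memory `SemanticMemory` class (a hypothetical name) stands in for a real vector database.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (assumption); real systems use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticMemory:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (caption, embedding) pairs

    def add(self, caption):
        self.entries.append((caption, embed(caption)))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [caption for caption, _ in ranked[:k]]

# Memory-building phase: store captions as the robot moves around.
memory = SemanticMemory()
for caption in [
    "a hallway with a row of elevators",
    "a kitchen with a coffee machine and snacks",
    "an empty meeting room with a whiteboard",
]:
    memory.add(caption)

# Query phase: retrieve the most relevant memory for a question.
best = memory.search("where can I get coffee and snacks", k=1)
print(best)
```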
Real-world applications and demos
To demonstrate the capabilities of ReMEmbR, NVIDIA developed a real-world example using Nova Carter and NVIDIA Isaac ROS. A robot equipped with ReMEmbR can answer questions and guide individuals within an office environment. The demo highlights the system’s ability to build an occupancy grid map, run a memory builder, and operate ReMEmbR agents.
In the demo, the robot uses a monocular camera and global position information to create a vector database. This database stores text embeddings, timestamps, and pose information, allowing the robot to efficiently query and retrieve information to perform tasks such as guiding a user to a specific location.
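To make the role of timestamps and poses concrete, here is a small sketch of what one stored record and a few lookups might look like. The `MemoryRecord` layout, the helper names, and the sample values are assumptions for illustration, and a simple keyword score is used in place of real text embeddings.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    caption: str      # text description of what the camera saw
    timestamp: float  # seconds since the start of the run
    pose: tuple       # (x, y, yaw) in the map frame

def keyword_score(query, caption):
    """Toy relevance score: fraction of query words found in the caption."""
    q = set(query.lower().split())
    c = set(caption.lower().split())
    return len(q & c) / len(q) if q else 0.0

def find_goal(records, query):
    """Return the pose of the best-matching record, usable as a navigation goal."""
    best = max(records, key=lambda r: keyword_score(query, r.caption))
    return best.pose

def nearest_in_time(records, t):
    """Return the record whose timestamp is closest to t."""
    return min(records, key=lambda r: abs(r.timestamp - t))

def nearest_in_space(records, x, y):
    """Return the record captured closest to map position (x, y)."""
    return min(records, key=lambda r: math.hypot(r.pose[0] - x, r.pose[1] - y))

records = [
    MemoryRecord("lobby with a reception desk", 10.0, (0.0, 0.0, 0.0)),
    MemoryRecord("kitchen with a coffee machine", 95.0, (12.5, 3.0, 1.57)),
    MemoryRecord("printer next to a window", 180.0, (20.0, -4.0, 3.14)),
]

goal = find_goal(records, "coffee machine")
print(goal)
```

Keeping the pose alongside each description is what lets a text match turn into a place the robot can drive to.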
Integration with speech recognition
Recognizing the need for intuitive user interaction, NVIDIA has integrated speech recognition into the ReMEmbR system. Using the WhisperTRT project, which optimizes OpenAI’s Whisper model with NVIDIA TensorRT, robots can process voice queries and generate appropriate responses to enhance the user experience.
Future outlook
ReMEmbR’s approach of combining LLMs, VLMs, and RAG opens up new possibilities for robotics applications. By giving robots the ability to reason and act based on extended observations, this technology has the potential to change areas such as autonomous driving, surveillance, and conversational assistance.
For those interested in exploring generative AI in robotics, NVIDIA offers a wide range of resources and documentation through its Developer Program, including tutorials, code samples, and community support to help developers get started with their own generative AI robotics applications.