©2023 The Headline House
Google’s new robot system relies on a language model for control. Thanks to internal monologues, the system can interact more flexibly with its environment.
Flexible robots that are to perform multiple tasks in the real world must have a large repertoire of basic skills and be able to plan their use. This includes recognizing when they need to change their approach because a particular action or plan isn’t working.
This planning, constant perceptual feedback, and the control of the system at all levels are some of the subtasks that an embodied agent must seamlessly combine to act intelligently in its environment.
AI researchers are tackling these challenges with different approaches. Many use reinforcement learning to teach robots to move, but flexible behavior also requires planning.
Meta's AI chief Yann LeCun presented his plans for autonomous artificial intelligence around March. For the time being it is not meant to be housed in a robot, but otherwise it has all the building blocks of a flexible agent that can plan.
Large Language Models for Embodied Agents
Central to LeCun’s model is a world model, which is meant to give the AI system a basic understanding of the world. Such world models do not exist yet.
One reason to assume that they will be technically possible has been provided in recent years by large language models. These models can generate and process text. Trained on gigantic amounts of text, they hold a wealth of knowledge about the world. In some examples, they also show a rudimentary, albeit not robust, ability to reason, such as in Google’s PaLM experiments.
AI researchers from Google’s robotics department, among others, are therefore asking the question: Can language models serve as reasoning models that combine multiple feedback sources and become interactive problem solvers for embodied AI tasks in robots, for example?
Other work has already shown that language models can be used to plan actions in robots. The Google team is now asking whether the capabilities of language models can also be used to replan when things go wrong.
Google shows inner robot monologues
The approach is modeled on what is called “thinking in language”. As an example, the team cites an internal monologue that might play out when a person tries to unlock a door: “I have to unlock the door; I’m trying to take this key and put it in the lock…no wait it doesn’t fit, I’m trying another…it worked, now I can turn the key.”
This thought process includes making decisions about immediate actions to solve the overall task (picking up the key, unlocking the door), observing the results of the attempted actions (key doesn’t fit), and corrective actions in response to these observations (try different key). Such an inner monologue is therefore a natural framework for the integration of feedback for large language models, according to the researchers. They call the approach “Inner Monologue”.
Whereas older approaches directly generate a complex plan for a goal from a language model and thus have no possibility for corrections, the Google team continuously feeds the language model with more information as the robot interacts with the environment.
This information includes, for example, a description of the objects visible in a scene or feedback as to whether an action was successful or not. Based on this information, the language model can also ask people questions if an instruction is unclear or no longer executable.
Google Inner Monologue controls robots in simulation and reality
Google’s team is testing Inner Monologue in simulation and reality. The language model also generates the commands that control the robot. It was prepared for this with only a few examples (few-shot learning).
In the simulation, a virtual robotic arm sorts virtual objects. In reality, a real robotic arm sorts plastic bananas and ketchup bottles. If an action is unsuccessful, the language model reissues the same command.
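The feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: `ToyEnvironment`, `stub_language_model`, `inner_monologue`, and the prompt format are all hypothetical names invented for this example, and a simple rule-based function stands in for the language model. The point is only to show how scene descriptions and success feedback are appended to the prompt after every step, so that a failed action leads to the same command being issued again.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for the robot and its perception stack."""
    objects: list = field(default_factory=lambda: ["banana", "ketchup bottle"])
    # Assumption for the demo: the first grasp of these objects fails.
    fail_once: set = field(default_factory=lambda: {"banana"})

    def describe_scene(self) -> str:
        return "Scene: " + (", ".join(self.objects) or "nothing left")

    def execute(self, command: str) -> bool:
        target = command[len("pick up "):]
        if target in self.fail_once:
            self.fail_once.discard(target)
            return False  # simulated failed grasp
        self.objects.remove(target)
        return True


def stub_language_model(prompt: str) -> str:
    """Placeholder for the language model (assumption: a real system would
    query an LLM here). Proposes a command based on the latest scene line."""
    scene_lines = [line for line in prompt.splitlines() if line.startswith("Scene: ")]
    latest = scene_lines[-1][len("Scene: "):]
    if latest == "nothing left":
        return "done"
    return "pick up " + latest.split(", ")[0]


def inner_monologue(task, env, model, max_steps=10):
    """Closed loop: the prompt grows with feedback after every action."""
    prompt_lines = ["Human: " + task]
    for _ in range(max_steps):
        prompt_lines.append(env.describe_scene())        # scene feedback
        command = model("\n".join(prompt_lines))
        prompt_lines.append("Robot: " + command)
        if command == "done":
            break
        success = env.execute(command)
        prompt_lines.append("Success: " + str(success))  # success feedback
    return prompt_lines


if __name__ == "__main__":
    for line in inner_monologue("sort the objects", ToyEnvironment(), stub_language_model):
        print(line)
```

In this toy run, the first attempt to pick up the banana fails, the failure is written back into the prompt, and the stub model issues the same command a second time. An open-loop planner that generates the whole plan up front would have no way to notice the failure.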
Google’s test in a real environment is impressive: a mobile robotic arm has to pick up, store or throw away cans or snacks and has to deal with human intervention in the process. It repeats failed actions, describes scenes and asks appropriate questions.
Thanks to the language capabilities, the system can continuously adapt to new instructions and set new goals when old ones are not achievable. It also understands multiple languages, can use past actions and environmental feedback to better understand a scene, and can handle typos. There are video examples of this on the Inner Monologue project page.
In the future, the team plans to reduce the model’s reliance on human feedback, such as by using advanced image/video captioning and visual question-answering.