AI Mimics Toddler-Like Learning to Unlock Human Cognition

Summary: Using a PV-RNN approach, this embodied AI model combines visual observations, body-sense feedback, and verbal instructions to develop language and action skills in a way that mirrors toddler learning. Unlike large language models (LLMs), which rely on massive data sets, this system learns through embodied interactions and achieves compositional results with far less data and computing power.

Researchers have found that the model's modular and transparent design makes it useful for studying how people acquire cognitive skills, such as integrating language and actions. The model offers insights for developmental neuroscience and could lead to safer, more ethical AI that learns from behavior and exposes its decision-making processes.

Important facts:

  • Learning like a toddler: The AI learns compositionality by combining sensory input, language, and actions.
  • Transparent design: The architecture allows researchers to study internal decision-making processes.
  • Practical benefits: The model requires far less data than LLMs, and its development emphasizes ethical, embodied AI.

Source: OIST

People are experts at generalizing. If you teach a young child to recognize the color red by showing them a red ball, a red truck, and a red rose, they will most likely identify the color of a tomato correctly, even if it is the first time they have seen one.

A key milestone in learning to generalize is compositionality: the ability to compose a whole from reusable parts and to decompose it again, much as an object can be built up from, and broken back down into, its components.

The first neural networks, which later evolved into large language models (LLMs) that revolutionized our society, were developed to study how information is processed in our brains.

Ironically, the information processing pathways within these models became increasingly obscure as they became more sophisticated. Today, some models have trillions of adjustable parameters.

Researchers at OIST’s Cognitive Neurorobotics Research Unit have crafted a new embodied-intelligence architecture that exposes the network’s hidden states to scientific scrutiny. Remarkably, this model learns to generalize in ways that closely mirror the learning patterns of young children. Their results have now been published in Science Robotics.

“This paper demonstrates a potential mechanism by which neural networks can achieve compositionality,” said Dr. Prasanna Vijayaraghavan, first author of the study.

“Our model does not derive conclusions from large data sets, but rather by combining language with vision, proprioception, working memory, and attention, just as young children do.”


LLMs, based on a transformer network architecture, learn statistical relationships between words in sentences from vast amounts of textual data. They have access to virtually any word in any conceivable context and, based on that, predict the most likely answer to a given question.

Instead, the new model is based on the PV-RNN framework (a predictive-coding-inspired variational recurrent neural network), trained through embodied interactions that integrate three input streams simultaneously: vision, in the form of video of a robotic arm moving colored blocks; proprioception, the sense of our limbs' movement, in the form of the robotic arm's joint angles as it moves; and language instructions such as “put red on blue.”

The model is then asked to produce a visual prediction and corresponding joint angles in response to language instructions, or language instructions in response to sensory input.
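To make this concrete, the Python sketch below illustrates the general shape of such a multimodal recurrent model: separate encoders for vision, joint angles, and words feed a shared recurrent core that predicts the next step in each stream. It is a minimal illustration written for this article, not the authors' PV-RNN; every module name, dimension, and design choice here is an assumption made for readability.

```python
# Illustrative sketch of a multimodal recurrent model (NOT the authors' PV-RNN).
# All names and sizes are hypothetical, chosen only to show the data flow.
import torch
import torch.nn as nn

class MultimodalRecurrentSketch(nn.Module):
    def __init__(self, vision_dim=64, joint_dim=7, vocab_size=20, hidden_dim=128):
        super().__init__()
        # One small encoder per modality: visual features, joint angles, word tokens.
        self.vision_enc = nn.Linear(vision_dim, hidden_dim)
        self.joint_enc = nn.Linear(joint_dim, hidden_dim)
        self.word_enc = nn.Embedding(vocab_size, hidden_dim)
        # A shared recurrent core integrates the three streams step by step.
        self.core = nn.GRU(3 * hidden_dim, hidden_dim, batch_first=True)
        # Decoders predict the next visual frame, the next joint angles, and a word.
        self.vision_dec = nn.Linear(hidden_dim, vision_dim)
        self.joint_dec = nn.Linear(hidden_dim, joint_dim)
        self.word_dec = nn.Linear(hidden_dim, vocab_size)

    def forward(self, vision_seq, joint_seq, word_seq):
        # vision_seq: (batch, T, vision_dim); joint_seq: (batch, T, joint_dim);
        # word_seq: (batch, T) integer token ids.
        fused = torch.cat(
            [self.vision_enc(vision_seq),
             self.joint_enc(joint_seq),
             self.word_enc(word_seq)], dim=-1)
        h, _ = self.core(fused)
        # One prediction per time step for each modality.
        return self.vision_dec(h), self.joint_dec(h), self.word_dec(h)

# Tiny usage example with random data standing in for real robot recordings.
model = MultimodalRecurrentSketch()
vision = torch.randn(2, 10, 64)        # 2 sequences, 10 steps of visual features
joints = torch.randn(2, 10, 7)         # 7 joint angles per step
words = torch.randint(0, 20, (2, 10))  # token ids for an instruction
v_pred, j_pred, w_logits = model(vision, joints, words)
print(v_pred.shape, j_pred.shape, w_logits.shape)
```

In the actual PV-RNN, the recurrent core is variational and is trained by minimizing free energy, the principle described next.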

The framework is grounded in the free energy principle, which states that our brain continuously predicts sensory input based on previous experience and acts to minimize the difference between prediction and observation.

This difference, referred to as “free energy,” is a measure of uncertainty. By minimizing free energy, our brain maintains a steady state.
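The toy function below shows, under simple Gaussian assumptions, how a free-energy-style objective of this kind can be computed: a prediction-error term plus a term penalizing how far the network's internal beliefs drift from its prior expectations. It is an illustrative simplification, not the loss used in the paper, and all variable names are hypothetical.

```python
# Toy free-energy-style objective: prediction error plus a KL "complexity" term.
# This is a simplified sketch, not the paper's exact training loss.
import torch

def free_energy_sketch(observation, prediction, posterior_mu, posterior_logvar):
    # Prediction error: mismatch between what was expected and what was observed.
    prediction_error = ((observation - prediction) ** 2).sum()
    # Complexity term: KL divergence from a unit-Gaussian prior over latent states.
    kl = -0.5 * (1 + posterior_logvar
                 - posterior_mu ** 2
                 - posterior_logvar.exp()).sum()
    return prediction_error + kl

obs = torch.randn(8)                     # what was actually observed
pred = torch.zeros(8)                    # what the network predicted
mu, logvar = torch.zeros(4), torch.zeros(4)  # latent beliefs matching the prior
print(free_energy_sketch(obs, pred, mu, logvar))
```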

Combined with a limited working memory and attention span, this makes the AI mirror human cognitive constraints: it has to process information sequentially and update its predictions step by step, rather than all at once as LLMs do.
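The small sketch below illustrates that contrast in the simplest possible terms: an all-at-once pass sees the whole input simultaneously, while a bounded working memory forces step-by-step updates that use only the most recent observations. The window size is an assumption chosen purely for illustration.

```python
# Contrast between all-at-once processing and step-by-step updates under a
# bounded working memory. Purely illustrative; the window size is hypothetical.
from collections import deque

def process_all_at_once(inputs):
    # Every element is available simultaneously.
    return sum(inputs)

def process_sequentially(inputs, memory_size=3):
    # Only the last `memory_size` observations fit in working memory;
    # the running prediction is updated one step at a time.
    memory = deque(maxlen=memory_size)
    prediction = 0.0
    for x in inputs:
        memory.append(x)
        prediction = sum(memory) / len(memory)  # update using memory contents only
    return prediction

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(process_all_at_once(data))    # 15.0: the whole sequence at once
print(process_sequentially(data))   # 4.0: average of the last 3 items only
```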

By studying the flow of information within the model, researchers can gain insight into how different input data are integrated to produce simulated actions.


This modular architecture has allowed researchers to learn more about how compositional structure develops in infants. As Dr. Vijayaraghavan explains, “We found that the more times the model encounters the same word in different contexts, the better it learns it.”

This is consistent with real life, where a toddler learns the concept of red much faster by interacting with many different red objects in different ways than by pushing the same red truck over and over.

Opening the black box

“Our model requires a significantly smaller training set and much less computing power to achieve compositional results. It makes more errors than LLMs do, but its errors are comparable to those humans make,” says Dr. Vijayaraghavan.

It is precisely this property that makes the model so useful for cognitive scientists and AI researchers who want to map the decision-making processes of their models.

Although its purpose differs from that of the LLMs in use today, so the two cannot be meaningfully compared in terms of raw performance, PV-RNN nevertheless demonstrates how neural networks can be organized to give greater insight into their information-processing pathways: its relatively compact structure allows researchers to visualize the network's hidden states.
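As a toy illustration of what visualizing hidden states can mean in practice, the short Python sketch below reads out the per-time-step hidden vectors of a small recurrent layer so they can be inspected directly. The layer and its sizes are placeholders, not the paper's network.

```python
# "Opening the black box": a compact recurrent layer exposes one hidden vector
# per time step, which can be read out and inspected. Sizes are hypothetical.
import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(1, 5, 8)       # one sequence of 5 time steps
hidden_states, _ = gru(sequence)      # shape (1, 5, 16): one hidden vector per step

# Trace how the internal representation evolves over time,
# here summarized by the norm of the hidden state at each step.
for t, h in enumerate(hidden_states[0]):
    print(f"step {t}: hidden-state norm = {h.norm().item():.3f}")
```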

The model also speaks to the long-standing “poverty of the stimulus” argument, which holds that the language input available to children is insufficient to explain their rapid language acquisition.

Despite working with a very limited data set, especially compared to LLMs, the model still captures compositional structure. This suggests that grounding language in behavior may be an important catalyst for children's ability to acquire language so efficiently.

Furthermore, this embodied learning could pave the way for safer and more ethical AI in the future, both by improving transparency and by giving the AI a better grasp of the impact of its actions.

When the word ‘suffering’ is learned from a purely linguistic perspective, as LLMs learn it, it carries less emotional weight than it does for PV-RNN, which learns meaning not through language alone but through embodied experience.

The researchers are continuing to improve the model's capabilities and are using it to explore several domains of developmental neuroscience.

“We are curious to see what insights we can gain into the process of cognitive development and language acquisition in the future,” said Professor Jun Tani, head of the research unit and lead author of the paper.

How we acquire the intelligence necessary to build our society is one of the great questions in science. While PV-RNN does not answer it, it opens up new avenues for research into how information is processed in our brains.

“Observing how the model merges linguistic cues with physical actions sheds light on the core mechanisms of human cognition,” said Dr. Prasanna Vijayaraghavan.

“It has already taught us a lot about how structure emerges in language acquisition, and it shows that there is potential for more efficient, transparent, and safer models.”

Abstract

Developing structure through interactive learning of robot language and actions

Humans excel at applying acquired skills to novel situations. Central to this flexibility is compositionality: our knack for breaking complex wholes into reusable parts and reassembling them as needed.

A fundamental question in robotics relates to this ability: How can linguistic compositionality be developed simultaneously with sensorimotor skills through associative learning, especially when people learn only a limited set of linguistic compositions and their associated sensorimotor patterns?

To address this issue, we propose a brain-inspired neural network model that integrates vision, proprioception, and language into a predictive coding and active inference framework based on the free energy principle.

The effectiveness and capabilities of this model were assessed through multiple simulation experiments with a robotic arm.

Our results show that generalization to unlearned verb-noun compositions improves significantly when the variability of task compositions during training is increased.

We believe this stems from the strong impact of sensorimotor learning on the structures the model builds within its latent linguistic space.

Ablation studies suggest that visual attention and working memory are necessary to accurately generate visuomotor sequences that achieve linguistically represented goals.

This insight broadens our understanding of the mechanisms underlying the development of compositionality through the interaction of linguistic and sensorimotor experience.
