Paper Title
Crossmodal Language Grounding in an Embodied Neurocognitive Model
Paper Authors
Paper Abstract
Human infants are able to acquire natural language seemingly easily at an early age. Their language learning seems to occur simultaneously with learning other cognitive functions as well as with playful interactions with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration. However, characterising the underlying mechanisms in the brain is difficult, and explaining the grounding of language in crossmodal perception and action remains challenging. In this paper, we present a neurocognitive model for language grounding which reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data. In our scenario, we utilise the humanoid robot NICO to obtain the EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver. The model analysis shows that crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment. The representations self-organise hierarchically and embed temporal and spatial information through composition and decomposition. This model can also provide the basis for further crossmodal integration of perceptually grounded cognitive representations.
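The "implicit adaptation of timescales" mentioned in the abstract points to recurrent units whose leak rates are learned rather than fixed, in the spirit of continuous-time / multiple-timescale recurrent networks. The sketch below is a minimal NumPy illustration under that assumption, not the paper's actual implementation; the class and parameter names (`CTRNNCell`, `tau_init`, `log_tau`) are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

class CTRNNCell:
    """Continuous-time RNN cell with per-unit timescales tau.

    Hypothetical sketch: a large tau gives slow, integrative
    dynamics (abstract context); tau near 1 gives fast dynamics
    (raw sensory detail). Stacking such layers with increasing
    tau is one way to obtain a fast-to-slow hierarchy.
    """

    def __init__(self, n_in, n_hid, tau_init=2.0):
        s = 1.0 / np.sqrt(n_hid)
        self.W_x = rng.uniform(-s, s, (n_hid, n_in))   # input weights
        self.W_h = rng.uniform(-s, s, (n_hid, n_hid))  # recurrent weights
        self.b = np.zeros(n_hid)
        # log-parameterised so tau = 1 + exp(.) stays >= 1 and can be
        # adapted by gradient descent -- an "implicit" timescale adaptation
        self.log_tau = np.full(n_hid, np.log(tau_init - 1.0))

    def step(self, x, u_prev):
        """One leaky-integrator update of the internal state u."""
        tau = 1.0 + np.exp(self.log_tau)
        y_prev = np.tanh(u_prev)
        u = (1.0 - 1.0 / tau) * u_prev + (1.0 / tau) * (
            self.W_x @ x + self.W_h @ y_prev + self.b
        )
        return u  # unit output is tanh(u)

# toy rollout over a short "sensory" sequence
cell = CTRNNCell(n_in=3, n_hid=8, tau_init=4.0)
u = np.zeros(8)
for x in rng.normal(size=(5, 3)):
    u = cell.step(x, u)
print(np.tanh(u))  # unit activations after the sequence
```

Here each unit's tau is learned individually; in a multiple-timescale setting, layers with progressively larger tau would carry the slower, more abstract representations, matching the hierarchical self-organisation the abstract describes.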