Cambridge, United Kingdom – 17th June 2024

Toshiba’s Cambridge Research Laboratory (CRL) is excited to announce the acceleration of its cutting-edge research into Embodied AI, a technology that combines physical presence with cooperative intelligence. This move reflects CRL’s commitment to advancing innovative research in the field of sustainable and human-centric AI. Our latest results are presented in two papers at CVPR (Conference on Computer Vision and Pattern Recognition), a high-impact computer science conference.

In today’s rapidly evolving landscape, AI is becoming the cornerstone of technological innovation. Conversational agents and virtual assistants have become commonplace; however, AI still has not been effectively brought into physical domains, or into every industry. Fields such as logistics, maintenance and manufacturing cannot be fully addressed in cyberspace or with software alone. For example, Embodied AI can play an important role in reshaping the retail industry, which involves interaction in dynamic environments with ever-changing product offerings and customer demands.

CRL is at the forefront of research into Embodied AI, fueled by a £15 million investment from Toshiba in the field over the next five years. In line with this vision, Toshiba is excited to announce that CRL’s innovative core technologies will soon enhance Toshiba’s existing AI catalogue. Our first industrial prototype of Embodied AI is planned for presentation in 2027, bringing us closer to a new era of intelligent collaboration between humans and machines.

The Essence of Embodied AI

Embodied AI is an agent-based AI that can manipulate objects and communicate with people, assisting us in physical tasks. These agents learn interactively from users in their environment, adapt quickly, and continuously expand their capabilities. The classical paradigm of perception and manipulation is to understand in order to act: for example, segmentation is used to detect pedestrians so that an ADAS (advanced driver assistance system) can operate. In the new framework of Embodied AI, we instead act in order to understand: Embodied AI has fast adaptation and continuous learning through task execution at its core, building a growing catalogue of skills in perception, reasoning, and action.

Embodied AI is critical to addressing the real-world challenges of next-generation industry. Fast adaptation enables a workforce to intuitively apply the power of physical AI to novel tasks through simple interaction. Our research focuses on software that can be applied to versatile hardware, integrating intelligence from disparate systems and learning from many modalities and deployments. Ultimately, CRL’s Embodied AI will enable a versatile assistive system with the potential to revolutionize business operations in numerous sectors, while also adapting flexibly to meet the ever-changing needs of industry.

The Research

CRL’s renewed focus on Embodied AI reflects Toshiba’s ongoing commitment to improving the world through innovation. Building on a foundation of results in 3D perception and human interaction from the former Computer Vision Group and Speech Technology Group, CRL’s new Vision & Learning Group (VLG) and Language & Interaction Group (LIG) will drive advancements towards two key objectives:

  • Fast Adaptation: CRL’s research will explore methods for enabling rapid adaptation of AI systems to new environments. By interacting with both humans and the environment, these systems can be deployed with minimal effort and cost.

  • Continuous Learning: Leveraging past experiences and multiple deployments, CRL’s technology will generalise knowledge into “common sense.” This continuous learning process enhances functionality and ensures ever-improving AI technologies across diverse scenarios.

CRL’s strategic shift to Embodied AI aligns with Toshiba’s strategic plan for software-defined services. As part of Toshiba’s broader digital transformation, CRL envisions an AI that can adapt to new hardware with minimal effort and facilitate the completion of long-horizon tasks by combining the collaborative strengths of humans and machines.

Toshiba CRL’s Technical Presentations at CVPR 2024

As part of its latest research achievements, Toshiba’s Vision & Learning Group (VLG) is set to present two papers on Embodied AI at CVPR, the largest and one of the most influential international conferences in this field. These papers present fundamental technologies that address two core challenges of Embodied AI: simplified interaction and fast adaptation.

1. Simplified Interaction: Innovative Pose Estimation through Natural Language

Traditionally, setting up robotic systems has required expert knowledge. VLG’s technology aims to simplify this process, making it accessible to a wider audience. VLG’s Dialogue-Based Localization system pioneers the combination of natural language interaction with geometric computer vision tasks. By reasoning about possible robot poses within novel environments, the system iteratively refines its pose estimates over the course of a dialogue. Key features include:

  • Natural Language Reasoning: Our system leverages foundation models, state-of-the-art machine learning models trained on very large language and vision datasets, to estimate poses based on textual input.

Chao Zhang, VLG’s expert on multi-modal foundation models, emphasises: “Not only does this present a world first in combining a vision and language foundation model in an iterative setting. Looking at it from a customer perspective, our technology is also privacy preserving, as no sensitive image data is required for the localization task.”

2. Fast Adaptation: Introducing ReCoRe, an Efficient Training Framework for World Models

In our ever-changing world, robots must quickly adapt to new environments and tasks. VLG’s approach ensures efficient learning and generalization across diverse scenarios. ReCoRe (Regularized Contrastive Representation Learning) guides the training of world models in autonomous systems. These models represent a simplified internal abstraction of the environment, capturing essential aspects without unnecessary complexity. Our approach:

  • Guided Learning: By incorporating task-specific auxiliary tasks based on expert knowledge, our model learns faster and more efficiently (with fewer samples and reduced computation).

This technology will be presented by Rudra Poudel, VLG’s lead scientist on World Models for Reinforcement Learning. Commenting on the results, he says: “World Models compress noisy sensor input, emphasizing task-relevant signals. They let robots ‘imagine’ future outcomes and choose optimal actions. Our ReCoRe framework leads in efficient world model learning for reinforcement learning and domain adaptation.”

--- End ---

About Toshiba

Toshiba Corporation leads a global group of companies that combines knowledge and capabilities from almost 150 years of experience in a wide range of businesses, from energy and social infrastructure to electronic devices, with world-class capabilities in information processing, digital and AI technologies. These distinctive strengths support Toshiba in building infrastructure that everyone can enjoy, and a connected data society, and in achieving the Company’s ultimate goal, a future that realizes carbon neutrality and a circular economy. Guided by the Basic Commitment of the Toshiba Group, “Committed to People, Committed to the Future,” Toshiba contributes to society’s positive development with services and solutions that lead to a better world. The Group and its 110,000 employees worldwide secured annual sales of 3.4 trillion yen (US$25.1 billion) in fiscal year 2022.
