
Task: Human-centered In-building Embodied Delivery

We have developed a brand-new virtual environment system from scratch, constructing a multi-level connected building space modeled after a polar research station. The environment includes autonomous human characters, robots with grasping and mobility capabilities, and a large number of interactive items. On top of this environment, we built a delivery dataset containing 13k language instructions that guide robots in providing services. We simulate human behavior through the human characters and sample their varied daily-life needs. Finally, we propose a method centered around a large multimodal model (LMM) as the baseline system for this dataset. Compared with past embodied-data work, ours focuses on a virtual environment centered on human-robot interaction in commercial scenarios. We believe this will bring new perspectives and exploration angles to the embodied community.
  • Purpose: Deliver the requested item to the vicinity of the designated character.
  • Delivery Items: Items in the environment that can be grabbed and moved.
  • Customers: Ten virtual human characters with different daily activities inside the building. They will move within the building for their own purposes.
  • Spatial Scope: The reachable areas within different rooms of a three-story building.
  • Time Setting: Real-world time, but simulation can be accelerated.
  • Scenario Map: A 2D projected obstacle map of the scenario, plus pre-sampled panoramic photos at various locations on the map.
  • Robot Positioning: We adopt relative localization rules for robot positioning.
  • Robot Actions: Movement, joint control, and manipulation.
  • Robot Skills: Local navigation by coordinate, 6-DOF visual grasping, and pose adjustment.
  • Sensors: Two RGB-D cameras (head and arm), tactile sensors.
  • Success Criteria: Place the target object within 3 meters of the target person.
  • Constraints: Completion within 8 minutes, without any dangerous collisions, and without access to environmental metadata.
  • Evaluation: Based on total time taken and the success rates of delivering the object, grasping the target object, and identifying the target character.
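The success criteria and constraints above can be sketched as a simple check. This is an illustrative sketch, not the dataset's actual evaluation code; the function and argument names are assumptions.

```python
import math

DELIVERY_RADIUS_M = 3.0   # object must end up within 3 m of the recipient
TIME_LIMIT_S = 8 * 60     # task must finish within 8 minutes

def delivery_succeeded(object_pos, person_pos, elapsed_s, num_collisions):
    """Illustrative check of the task's success criteria."""
    dist = math.dist(object_pos, person_pos)  # Euclidean distance in metres
    return (dist <= DELIVERY_RADIUS_M
            and elapsed_s <= TIME_LIMIT_S
            and num_collisions == 0)

# Example: object placed ~2.3 m from the recipient after 5 minutes, no collisions.
print(delivery_succeeded((1.0, 0.0, 0.0), (3.0, 1.2, 0.0), 300, 0))  # → True
```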

Human-centered in-building embodied delivery is a task that originates from a real commercial delivery scenario. It refers to precise delivery service, performed by embodied robots, for users in private spaces where external delivery services cannot operate. The task typically requires the robot to locate the target item based on the user's request (e.g., "grasp a water bottle from the kitchen and bring it to me") across multiple rooms of a three-story building (a polar research station building), and ultimately deliver it to the designated location or person. The robot must consider the user's context (behavior or schedule), as the user moves around the building pursuing their own goals during the delivery.

In constructing the delivery service data, we first use a large language model (LLM) to generate reasonable daily activities and varied demands for the virtual characters based on their profiles. The robot is then required to locate and deliver the appropriate objects to meet the human characters' demands and accomplish the task objectives. We continually generate diverse data by modifying character needs, daily routines, and target objects. Furthermore, we incorporate a manual review and refinement stage to ensure the balance of the task data.
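The generation loop described above can be sketched as follows. This is a hypothetical illustration: `call_llm` stands in for any chat-completion client, and the profiles and needs are invented samples, not the dataset's real ones.

```python
import json
import random

# Invented sample data; the real dataset uses its own character profiles.
PROFILES = [
    {"name": "Alice", "role": "biologist", "preference": "herbal tea"},
    {"name": "Bob", "role": "engineer", "preference": "coffee"},
]
NEEDS = ["drinking", "eating", "working", "resting"]

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned instruction here.
    return "Please bring a cup of coffee from the kitchen to Bob."

def generate_task(seed: int) -> dict:
    """Sample a character and need, then ask the LLM for an instruction."""
    rng = random.Random(seed)
    profile = rng.choice(PROFILES)
    need = rng.choice(NEEDS)
    prompt = (
        f"Character profile: {json.dumps(profile)}\n"
        f"Current need: {need}\n"
        "Write one natural delivery instruction a robot could follow."
    )
    return {"profile": profile, "need": need, "instruction": call_llm(prompt)}

print(generate_task(0)["instruction"])
```

In the real pipeline, the generated instructions would then pass through the manual review stage mentioned above.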
The dataset is represented as a JSON file, and task initialization, execution, and evaluation are accomplished through a Python API. The task involves a variety of objects with different styles. NPCs in the environment engage in continuously simulated life activities, generating various needs over time, such as eating, drinking, working, and resting. At these moments, NPCs may require certain items to fulfill their demands (e.g., food, drinks, mobile phones). We simulate robot delivery services by collecting these needs: by querying environmental data, we automatically gather a large number of delivery tasks. We refine the language content with the LLM and conduct manual checks and corrections. In particular, we use an LMM to produce textual annotations (visual feature descriptions) of image data, reducing manual work and increasing diversity. The figures above show the ten distinct NPCs serving as service targets, each with their own profile and preferences, and the spatial distribution of scenes within the task set, demonstrating long-range visibility across spaces.
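To make the JSON representation concrete, here is a plausible shape for a single task record and how a simulator Python API might consume it. The field names and API methods are assumptions for illustration, not the actual schema.

```python
import json

# Illustrative task record; real field names may differ.
task_record = json.loads("""
{
  "task_id": 42,
  "instruction": "Grasp a water bottle from the kitchen and bring it to me.",
  "target_npc": "Alice",
  "target_object": "water_bottle",
  "start_floor": 1
}
""")

# Typical lifecycle with a simulator Python API (method names are assumptions):
# env = prs.connect()
# env.reset(task_record)      # task initialization
# obs = env.step(action)      # execution loop
# result = env.evaluate()     # time taken, success rate, collisions
print(task_record["target_object"])
```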

A data generation instance. Based on the settings combined with large models, we generate human activities, target objects, robot positions, task instructions, and a complete robot execution process.


The available information in the task.

Task Definition and Settings
  • Robots operate within a relatively fixed building space.
  • The residents within the building are the recipients of the service, and they typically move throughout the building based on personal needs and objectives. Robots can access relevant information about the recipients to better locate and identify them.
  • The transportation service may cover a substantial area, involving different floors and rooms.
  • Robots typically need to understand human instructions in order to search for and retrieve the correct target items, and deliver them to the designated recipients.
Based on the aforementioned scenario requirements, we provide the following task definition and settings, as shown in the table below:


Human-centered in-building embodied delivery task setting.


We propose an LMM-based approach as the baseline method, employing a modular architecture that encompasses language instruction analysis, multimodal target search, and robotic action execution. The baseline comprises language, vision, and action modules, as shown in the figure below, covering language parsing, navigation search, scene understanding, object recognition, segmentation, localization, and object manipulation.

Modular method for the robot delivery task with LLM and LMM.
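The modular flow (language → vision → action) can be sketched as below. Each function is a stub standing in for the corresponding LLM/LMM component; the names and return shapes are assumptions, not the baseline's real interfaces.

```python
def parse_instruction(text: str) -> dict:
    """Language module: extract target object and recipient (stubbed)."""
    return {"object": "water_bottle", "recipient": "Alice"}

def search_target(goal: dict) -> tuple:
    """Vision module: navigation search + scene understanding (stubbed).
    Returns a grasp point in the robot frame."""
    return (2.0, 1.5, 0.8)

def execute_delivery(goal: dict, grasp_point: tuple) -> bool:
    """Action module: 6-DOF grasping, local navigation, placement (stubbed)."""
    return True

goal = parse_instruction("Bring a water bottle to Alice.")
ok = execute_delivery(goal, search_target(goal))
print(ok)  # → True
```

In the real system each stub would be backed by a model (the LLM for parsing, the LMM for search and recognition) and by the robot's navigation and grasping skills.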

Simulation Environment

Grounded in the task setting and business requirements, we have constructed from scratch a novel simulation environment modeled after a real-world polar research station (referred to as the Polar Research Station Environment, PRS). The environment comprises a three-story building interconnected by stairs and a functional elevator. It integrates common human societal scenarios in a community-like pattern: bedrooms, gyms, offices, laboratories, medical rooms, wards, living rooms, leisure spaces, and so on. This design aims to cover as wide a range of everyday in-building scenarios as possible. To simulate the daily activities that drive delivery services, the environment includes over a dozen virtual human characters acting according to their individual intentions. We also provide a range of interactive objects to support the tasks, and we have designed a simulated robot with grasping and moving capabilities to serve the human character agents.

We present a multi-story polar research station building. A virtual environment must meet the requirements of its target task, yet existing environments remain constrained in depicting such commercial scenarios by factors such as scene richness, spatial complexity, character portrayal, continuous environmental state systems, long-term operation, and the setup of items and robots. Consequently, we crafted a brand-new virtual environment, the PRS Environment, specifically tailored to support various generalist agent and robotic tasks.

PRS Simulator Trial