RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot

Shanghai Jiao Tong University

RH20T pairs each robot manipulation sequence with a corresponding human demonstration, providing <Human Demonstration, Robot Manipulation> pairs for every task.

Abstract

A key challenge in robotic manipulation in open domains is acquiring diverse and generalizable skills for robots. Recent research on one-shot imitation learning has shown promise in transferring trained policies to new tasks given a demonstration. This capability is attractive for enabling robots to acquire new skills and for improving task and motion planning. However, due to limitations of the training datasets, the community has mainly focused on simple cases, such as push or pick-place tasks, relying solely on visual guidance. In reality, many skills are more complex, and some even require both visual and tactile perception to solve.

This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception. To achieve this, we have collected a dataset comprising over 110,000 contact-rich robot manipulation sequences across diverse skills, contexts, robots, and camera viewpoints, all collected in the real world. Each sequence in the dataset includes visual, force, audio, and action information, along with a corresponding human demonstration video. We have invested significant effort in calibrating all the sensors and ensuring a high-quality dataset. The dataset will be made publicly available.

Tasks

We select 48 tasks from RLBench and 29 tasks from MetaWorld, and introduce 70 self-proposed tasks that are frequently encountered and achievable by robots. Here are some selected tasks:

Tele-Operation

Unlike previous methods that simplify the tele-operation interface with 3D mice, VR remotes, or mobile phones, we emphasize intuitive and accurate tele-operation for collecting contact-rich robot manipulation data.

Platform Setup

Each platform contains a robot arm with a force-torque sensor, a gripper, 1-2 in-hand cameras, 8-10 global RGBD cameras, and 2 microphones for data collection. A haptic device and a pedal allow the operator to tele-operate the robot intuitively. These devices are all linked to a data collection workstation.

Data Collection

Data Details

Overview

The following table lists the data modalities in our dataset. The last modality, fingertip tactile sensing, is only available on robot Cfg. 7.

Modality                  Size              Frequency
RGB image                 1280 x 720 x 3    10 Hz
Depth image               1280 x 720        10 Hz
Binocular IR images       2 x 1280 x 720    10 Hz
Robot joint angle         6 / 7             10 Hz
Robot joint torque        6 / 7             10 Hz
Gripper Cartesian pose    6 / 7             100 Hz
Gripper width             1                 10 Hz
6-DoF Force/Torque        6                 100 Hz
Audio                     N/A               30 Hz
Fingertip tactile         2 x 16 x 3        200 Hz

The sizes of the robot joint angle, the robot joint torque and the gripper Cartesian pose depend on the robot type.
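
For reference, the table above can also be written down as a small Python dictionary, e.g. for sanity-checking array shapes and sampling rates when loading the data. The MODALITIES constant below is only an illustrative summary, not part of our API; sizes marked "6 or 7" depend on the robot type.

# Illustrative summary of the modality table (not part of the dataset API).
MODALITIES = {
    # name:                   (size per sample, frequency in Hz)
    "rgb_image":              ((1280, 720, 3),  10),
    "depth_image":            ((1280, 720),     10),
    "binocular_ir_images":    ((2, 1280, 720),  10),
    "robot_joint_angle":      ("6 or 7",        10),   # robot-dependent
    "robot_joint_torque":     ("6 or 7",        10),   # robot-dependent
    "gripper_cartesian_pose": ("6 or 7",        100),  # robot-dependent
    "gripper_width":          (1,               10),
    "force_torque_6dof":      (6,               100),
    "audio":                  (None,            30),   # size not fixed
    "fingertip_tactile":      ((2, 16, 3),      200),  # robot Cfg. 7 only
}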

Sample

Here is a sample visualization of data, including RGBD images and binocular infrared images.

Point Cloud

We visualize the point cloud generated by fusing the RGBD data from these multi-view cameras. The red pyramids indicate the camera poses. Additionally, the robot model is rendered in the scene based on the joint angles recorded in our dataset. In the following videos, it is evident that all the cameras are calibrated with respect to the robot's base frame, and all the recorded data are synchronized in the temporal domain. The details of the robot configuration (Robot Cfg) can be found in the paper appendix.
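
To make the fusion concrete, below is a minimal numpy sketch of the back-projection and multi-view merging described above. It assumes per-camera intrinsics and camera-to-base extrinsic transforms recovered from the calibration files, and a depth map already converted to meters; the helper name depth_to_points is an illustrative assumption, not a function of our API.

import numpy as np

def depth_to_points(depth_m, K, T_base_cam):
    # depth_m:    (H, W) depth map in meters
    # K:          (3, 3) camera intrinsic matrix
    # T_base_cam: (4, 4) camera-to-robot-base extrinsic transform
    H, W = depth_m.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    z = depth_m.reshape(-1)
    valid = z > 0                                             # drop missing depth
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]               # back-project x
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]               # back-project y
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]  # homogeneous, (N, 4)
    return (T_base_cam @ pts_cam.T).T[:, :3]                  # points in the robot base frame

# Fusing all calibrated views is then a concatenation over cameras:
# cloud = np.concatenate([depth_to_points(d, K_i, T_i) for d, K_i, T_i in views])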

Download

Caution: The RH20T dataset comprises volunteer-recorded human-robot interactions and may feature volunteers' faces and voices. Please take care not to inspect or share sensitive content, and use the dataset solely for model training purposes.


Note: We provide a resized version (640x360) of our dataset because the original is too large (40TB). After unzipping and extracting images from the videos, the resized version takes about 5TB for RGB and 10TB for RGBD. You can use Rclone to download files from Google Drive according to this thread.


Task Description File

RGB with Robot Information:
RH20T_cfg1.tar.gz (178GB) (Google Drive|Baidu Cloud)
RH20T_cfg2.tar.gz (80GB) (Google Drive|Baidu Cloud)
patch.tar.gz (3GB) (Google Drive|Baidu Cloud) (contains the camera calibration files and robot joint angles for cfg1 and cfg2; unzip and merge with cfg1 and cfg2 respectively.)
RH20T_cfg3.tar.gz (26GB) (Google Drive|Baidu Cloud)
RH20T_cfg4.tar.gz (88GB) (Google Drive|Baidu Cloud)
RH20T_cfg5.tar.gz (37GB) (Google Drive|Baidu Cloud)
RH20T_cfg6.tar.gz (76GB) (Google Drive|Baidu Cloud)
RH20T_cfg7.tar.gz (37GB) (Google Drive|Baidu Cloud)

Depth:
RH20T_cfg1_depth.tar.gz (228GB) (Google Drive|Baidu Cloud)
RH20T_cfg2_depth.tar.gz (108GB) (Google Drive|Baidu Cloud)
RH20T_cfg3_depth.tar.gz (26GB) (Google Drive|Baidu Cloud)
RH20T_cfg4_depth.tar.gz (83GB) (Google Drive|Baidu Cloud)
RH20T_cfg5_depth.tar.gz (66GB) (Google Drive|Baidu Cloud)
RH20T_cfg6_depth.tar.gz (99GB) (Google Drive|Baidu Cloud)
RH20T_cfg7_depth.tar.gz (41GB) (Google Drive|Baidu Cloud)

Data Format


|-- RH20T
    |-- RH20T_cfg1
    |   |-- calib/                            # Calibration folder, including the calibration-time gripper Cartesian pose, intrinsic and extrinsic matrices, etc. Extrinsic matrices are the ArUco marker's translations with respect to the camera frame.
    |   |-- task_0001_user_0001_scene_0001_cfg_0001/        # Robotic manipulation data
    |   |    |-- metadata.json                # Robot manipulation scene metadata, including the scene finishing timestamp, task completion rating (0 denotes robot failure, 1 denotes task failure, 2-9 denote completion quality, higher is better), calibration timestamp, and calibration quality (0 means some cameras are not calibrated; 1-5 indicates calibration accuracy, lower is better), etc.
    |   |    |-- cam_[serial_number]/         # Multiple cameras
    |   |    |    |-- color.mp4               # Color images, encoded as video. The extraction code is available in our API code.
    |   |    |    |-- timestamps.npy          # Timestamps for each image; our extraction code uses them to decode the images.
    |   |    |    `-- depth.mp4 (optional)    # Depth images, encoded as video. The extraction code is available in our API code.
    |   |    |-- transformed/
    |   |    |    |-- tcp.npy                 # Gripper Cartesian pose in each cam's coord, {serial number: [{"timestamp": ..., "tcp": ..., "robot_ft": ...}]}, where "tcp" values are xyz+quat (7D) Gripper Cartesian poses
    |   |    |    |-- tcp_base.npy            # Gripper Cartesian pose in base coord, {serial number: [{"timestamp": ..., "tcp": ..., "robot_ft": ...}]}, where "tcp" values are xyz+quat (7D) Gripper Cartesian poses
    |   |    |    |-- joint.npy               # Joint angles, {serial number: {timestamp: joint angle array}}
    |   |    |    |-- gripper.npy             # Gripper commands and information, {serial number: {timestamp: {"gripper_command": 3D array, "gripper_info": 3D array}}}, where the 1st element in the 3D array is the actual gripper width in millimeters (0-110)
    |   |    |    |-- force_torque.npy        # 6-DoF force/torque in cam's coord, {serial number: [{"timestamp": ..., "zeroed": ..., "raw": ...}]}, where "zeroed" values are pre-processed
    |   |    |    |-- force_torque_base.npy   # 6-DoF force/torque in base coord, {serial number: [{"timestamp": ..., "zeroed": ..., "raw": ...}]}, where "zeroed" values are pre-processed
    |   |    |    `-- high_freq_data.npy      # High frequency data, {serial number: [{"timestamp": ..., "zeroed": ..., "raw": ..., "tcp": ...}]}
    |   |    `-- audio_mixed/
    |   |
    |   |-- task_0001_user_0001_scene_0001_cfg_0001_human/  # Human demonstration data corresponding to the above robotic manipulation
    |   |    |-- metadata.json                # Human demonstration metadata, including scene starting and finishing timestamps, calibration timestamp and quality
    |   |    |-- cam_[serial_number]/
    |   |    |    |-- color.mp4
    |   |    |    |-- timestamps.npy
    |   |    |    `-- depth.mp4 (optional)
    |   |    `-- audio_mixed/
    |   |
    |   |
    |   `-- ... ...
    |
    |
    |-- RH20T_cfg2/
    |   `-- same as above
    |
    |
    |-- ...
    |
    |
    `-- RH20T_cfg7/
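
As a rough illustration of this layout, the sketch below reads one episode with plain json / numpy. It assumes the .npy files store Python dictionaries exactly as annotated above (hence allow_pickle=True) and that metadata.json exposes the task completion rating under a key named "rating"; the key name, the rating threshold, and the episode path are illustrative assumptions, and the official API code should be preferred for actual use.

import json
import numpy as np

# Hypothetical episode path following the layout above.
scene = "RH20T/RH20T_cfg1/task_0001_user_0001_scene_0001_cfg_0001"

with open(scene + "/metadata.json") as f:
    meta = json.load(f)

# Ratings of 0/1 mark robot/task failures; 2-9 indicate completion quality,
# so keep episodes rated 2 or higher. The key name "rating" is an assumption.
if meta.get("rating", 0) >= 2:
    # The .npy files store Python dicts as described above, hence allow_pickle.
    tcp_base = np.load(scene + "/transformed/tcp_base.npy",
                       allow_pickle=True).item()               # {serial: [records]}
    serial = sorted(tcp_base.keys())[0]                        # pick one camera view
    records = tcp_base[serial]                                 # list of per-timestep dicts
    timestamps = np.array([r["timestamp"] for r in records])
    poses = np.stack([np.asarray(r["tcp"]) for r in records])  # (T, 7): xyz + quaternion

    # Align a query timestamp (e.g. an image timestamp) to the nearest recorded pose.
    def nearest_pose(t_query):
        return poses[np.abs(timestamps - t_query).argmin()]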

BibTeX

@inproceedings{
    fang2023rh20t,
    title = {RH20T: A Robotic Dataset for Learning Diverse Skills in One-Shot},
    author = {Fang, Hao-Shu and Fang, Hongjie and Tang, Zhenyu and Liu, Jirong and Wang, Junbo and Zhu, Haoyi and Lu, Cewu},
    booktitle = {RSS 2023 Workshop on Learning for Task and Motion Planning},
    year = {2023}
}

License

The dataset is licensed under a mixture of licenses as it is partly funded by a company. It is divided into two subsets: RH20T-C (commercial) and RH20T-NC (non-commercial).

The RH20T-C subset contains episodes with names containing 'scene_0001' to 'scene_0005'. It is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

The RH20T-NC subset contains episodes with names containing 'scene_0006' to 'scene_0010'. It is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0): it is freely available for non-commercial use and may be redistributed under the same conditions. Commercial use of the RH20T-NC subset, or of models trained on it, is not allowed.

If you have any further questions, please contact fhaoshu@gmail.com.