RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot

Abstract

A key challenge in robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots. Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations. This feature is attractive for enabling robots to acquire new skills and improving task and motion planning. However, due to limitations in the training dataset, the current focus of the community has mainly been on simple cases, such as push or pick-place tasks, relying solely on visual guidance. In reality, there are many complex skills, some of which may even require both visual and tactile perception to solve.

This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception. To achieve this, we have collected a dataset comprising over 110,000 contact-rich robot manipulation sequences across diverse skills, contexts, robots, and camera viewpoints, all collected in the real world. Each sequence in the dataset includes visual, force, audio, and action information, along with a corresponding human demonstration video. We have invested significant effort in calibrating all the sensors and ensuring a high-quality dataset. The dataset will be made publicly available.

Tasks

We select 48 tasks from RLBench, 29 tasks from MetaWorld, and introduce 70 self-proposed tasks that are frequently encountered and achievable by robots. Here are some selected tasks:

Tele-Operation

Unlike previous methods that simplify the tele-operation interface using 3D mice, VR remotes, or mobile phones, we emphasize intuitive and accurate tele-operation when collecting contact-rich robot manipulation data.

Platform Setup

Each platform contains a robot arm with a force-torque sensor, a gripper, and 1-2 in-hand cameras, together with 8-10 global RGBD cameras and 2 microphones for data collection. A haptic device and a pedal allow the operator to tele-operate the robot intuitively. All of these devices are linked to a data collection workstation.

Data Collection

Data Details

Overview

The following table lists the data modalities in our dataset. The last modality, fingertip tactile sensing, is available only in robot Cfg. 7.

Modality	Size	Frequency
RGB image	1280 x 720 x 3	10 Hz
Depth image	1280 x 720	10 Hz
Binocular IR images	2 x 1280 x 720	10 Hz
Robot joint angle	6 / 7	10 Hz
Robot joint torque	6 / 7	10 Hz
Gripper Cartesian pose	6 / 7	100 Hz
Gripper width	1	10 Hz
6-DoF Force/Torque	6	100 Hz
Audio	N/A	30 Hz
Fingertip tactile	2 x 16 x 3	200 Hz

The sizes of the robot joint angle, the robot joint torque and the gripper Cartesian pose depend on the robot type.
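Because the modalities in the table are recorded at different rates (for example, 10 Hz images versus 100 Hz force/torque), a consumer of the data will often want to pair each image with the closest-in-time high-frequency sample. The following is a minimal sketch of one way to do that, assuming both timestamp arrays use the same unit and the reference array is sorted; the function name `nearest_indices` is hypothetical and is not part of the RH20T API.

```python
import numpy as np

def nearest_indices(query_ts, ref_ts):
    """For each query timestamp, return the index of the closest reference timestamp.

    Assumes ref_ts is sorted in ascending order and has at least two entries.
    """
    query_ts = np.asarray(query_ts)
    ref_ts = np.asarray(ref_ts)
    # Index of the first reference timestamp that is >= each query timestamp,
    # clipped so that both idx - 1 and idx are valid positions.
    idx = np.clip(np.searchsorted(ref_ts, query_ts), 1, len(ref_ts) - 1)
    left = ref_ts[idx - 1]
    right = ref_ts[idx]
    # Pick whichever neighbour is closer in time (ties go to the earlier one).
    return np.where(query_ts - left <= right - query_ts, idx - 1, idx)

# Illustrative timestamps in milliseconds (made-up values, not dataset values).
image_ts = [0, 104, 206]          # low-rate stream
ft_ts = np.arange(0, 210, 10)     # high-rate stream
matches = nearest_indices(image_ts, ft_ts)
```

Nearest-neighbour matching is only one option; interpolating the high-frequency signal at the image timestamps may suit some uses better.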

Sample

Here is a sample visualization of data, including RGBD images and binocular infrared images.

Point Cloud

We visualize the point cloud generated by fusing the RGBD data from these multi-view cameras. The red pyramids indicate the camera poses. Additionally, the robot model is rendered in the scene based on the joint angles recorded in our dataset. In the following videos, it is evident that all the cameras are calibrated with respect to the robot's base frame, and all the recorded data are synchronized in the temporal domain. The details of the robot configuration (Robot Cfg) can be found in the paper appendix.
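The fusion described above amounts to back-projecting each depth image through its camera intrinsics and mapping the resulting points into a shared frame. The sketch below illustrates the idea under stated assumptions: a pinhole camera model, depth values in metres with 0 marking invalid pixels, and a 4x4 camera-to-base transform per camera. The function names and these conventions are illustrative and are not taken from the RH20T calibration files or API.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into (N, 3) points in the camera frame."""
    h, w = depth.shape
    # u is the pixel column, v is the pixel row.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0  # assumption: zero depth means "no measurement"
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def transform_points(points, T):
    """Apply a 4x4 homogeneous transform T to (N, 3) points."""
    return points @ T[:3, :3].T + T[:3, 3]

def fuse(views):
    """Concatenate clouds from several cameras after moving them to one frame.

    `views` is an iterable of (camera_frame_points, T_base_cam) pairs.
    """
    return np.concatenate([transform_points(p, T) for p, T in views], axis=0)
```

A real pipeline would also carry per-point colour and would likely filter depth noise before fusing; both are omitted here for brevity.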

Download

Caution: The RH20T dataset comprises volunteer-recorded human-robot interactions, possibly featuring volunteers' faces and voices. Exercise care to avoid inspecting or sharing sensitive content; kindly utilize the dataset solely for model training purposes.

Note: We provide a 640x360-resized version of our dataset because the original is too large (40TB). After extraction, the current dataset size is ~5TB for RGB and ~10TB for RGBD. Depth images in this version may be inaccurate due to compression. You can use Rclone to download files from Google Drive according to this thread. We also provide a data parsing and visualization API to decode and use this dataset.

Task Description File

RGB with Robot Information:
RH20T_cfg1.tar.gz (178GB) (Google Drive|Baidu Cloud)
RH20T_cfg2.tar.gz (80GB) (Google Drive|Baidu Cloud)
patch.tar.gz (3GB) (Google Drive|Baidu Cloud) (contains the camera calibration files and robot joint angles for cfg1 and cfg2; unzip and merge with cfg1 and cfg2, respectively)
RH20T_cfg3.tar.gz (26GB) (Google Drive|Baidu Cloud)
RH20T_cfg4.tar.gz (88GB) (Google Drive|Baidu Cloud)
RH20T_cfg5.tar.gz (37GB) (Google Drive|Baidu Cloud)
RH20T_cfg6.tar.gz (76GB) (Google Drive|Baidu Cloud)
RH20T_cfg7.tar.gz (37GB) (Google Drive|Baidu Cloud)
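As a rough illustration of the "unzip and merge" step mentioned for the patch archive, the sketch below extracts several `.tar.gz` files into one destination folder so that files sharing a directory path end up side by side. It assumes the archives use matching top-level folder names, which has not been checked against the actual files; the helper name `extract_archives` is hypothetical.

```python
import tarfile
from pathlib import Path

def extract_archives(archives, dest):
    """Extract each .tar.gz archive into dest, merging overlapping folders."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest, **kwargs)
    return dest

# Example call (paths are placeholders for wherever the downloads were saved):
# extract_archives(["RH20T_cfg1.tar.gz", "patch.tar.gz"], "RH20T")
```

If the patch archive turns out to use a different folder layout, its contents would need to be moved into the matching configuration folder by hand.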

Depth:
RH20T_cfg1_depth.tar.gz (228GB) (Google Drive|Baidu Cloud)
RH20T_cfg2_depth.tar.gz (108GB) (Google Drive|Baidu Cloud)
RH20T_cfg3_depth.tar.gz (26GB) (Google Drive|Baidu Cloud)
RH20T_cfg4_depth.tar.gz (83GB) (Google Drive|Baidu Cloud)
RH20T_cfg5_depth.tar.gz (66GB) (Google Drive|Baidu Cloud)
RH20T_cfg6_depth.tar.gz (99GB) (Google Drive|Baidu Cloud)
RH20T_cfg7_depth.tar.gz (41GB) (Google Drive|Baidu Cloud)

Note 2: A 320x180-resized version of our dataset is also available. In this version, RGB images are video-compressed, while depth images use lossless compression for more accurate 3D data. This version is ideal if you want to utilize the precise real-world 3D information.

RGB:
RH20T_cfg1.tar.gz (30.3GB) (Google Drive | Baidu Cloud)
RH20T_cfg2.tar.gz (14.8GB) (Google Drive | Baidu Cloud)
RH20T_cfg3.tar.gz (4.4GB) (Google Drive | Baidu Cloud)
RH20T_cfg4.tar.gz (14.7GB) (Google Drive | Baidu Cloud)
RH20T_cfg5.tar.gz (8.2GB) (Google Drive | Baidu Cloud)
RH20T_cfg6.tar.gz (13.6GB) (Google Drive | Baidu Cloud)
RH20T_cfg7.tar.gz (6.7GB) (Google Drive | Baidu Cloud)

Depth:
RH20T_cfg1.tar.gz (572.2GB) (Google Drive | Baidu Cloud)
RH20T_cfg2.tar.gz (319.8GB) (Google Drive | Baidu Cloud)
RH20T_cfg3.tar.gz (71.3GB) (Google Drive | Baidu Cloud)
RH20T_cfg4.tar.gz (227.9GB) (Google Drive | Baidu Cloud)
RH20T_cfg5.tar.gz (200.7GB) (Google Drive | Baidu Cloud)
RH20T_cfg6.tar.gz (272.4GB) (Google Drive | Baidu Cloud)
RH20T_cfg7.tar.gz (96.0GB) (Google Drive | Baidu Cloud)

LowDim:
RH20T_cfg1.tar.gz (79.9GB) (Google Drive | Baidu Cloud)
RH20T_cfg2.tar.gz (31.9GB) (Google Drive | Baidu Cloud)
RH20T_cfg3.tar.gz (11.3GB) (Google Drive | Baidu Cloud)
RH20T_cfg4.tar.gz (38.6GB) (Google Drive | Baidu Cloud)
RH20T_cfg5.tar.gz (6.3GB) (Google Drive | Baidu Cloud)
RH20T_cfg6.tar.gz (31.3GB) (Google Drive | Baidu Cloud)
RH20T_cfg7.tar.gz (14.9GB) (Google Drive | Baidu Cloud)

Calibration:
RH20T_cfg1.tar.gz (805.6MB) (Google Drive | Baidu Cloud)
RH20T_cfg2.tar.gz (584.0MB) (Google Drive | Baidu Cloud)
RH20T_cfg3.tar.gz (334.7MB) (Google Drive | Baidu Cloud)
RH20T_cfg4.tar.gz (391.6MB) (Google Drive | Baidu Cloud)
RH20T_cfg5.tar.gz (79.1MB) (Google Drive | Baidu Cloud)
RH20T_cfg6.tar.gz (79.6MB) (Google Drive | Baidu Cloud)
RH20T_cfg7.tar.gz (14.9MB) (Google Drive | Baidu Cloud)

Data Format


|-- RH20T
    |-- RH20T_cfg1
    |   |-- calib/                            # Calibration folder, including calibration-time Gripper Cartesian pose, intrinsic and extrinsic matrices etc. Extrinsic matrices are the Aruco marker's Translations with respect to the camera frame.
    |   |-- task_0001_user_0001_scene_0001_cfg_0001/        # Robotic manipulation data
    |   |    |-- metadata.json                # Robot manipulation scene metadata, including scene finishing timestamp, task completion rating (0 denotes robot failure, 1 denotes task failure, 2-9 denotes completion quality, higher is better), calibration timestamp and calibration quality (0 means some cameras are not calibrated, 1-5 means calibration accuracy, lower is better), etc.
    |   |    |-- cam_[serial_number]/         # Multiple cameras
    |   |    |    |-- color.mp4               # Color images, encoded as video. The extraction code is available in our API code.
    |   |    |    |-- timestamps.npy          # Timestamp for each image; our extraction code uses it to decode images.
    |   |    |    `-- depth.mp4 (optional)    # Depth images, encoded as video. The extraction code is available in our API code.
    |   |    |-- transformed/
    |   |    |    |-- tcp.npy                 # Gripper Cartesian pose in each cam's coord, {serial number: [{"timestamp": ..., "tcp": ..., "robot_ft": ...}]}, where "tcp" values are xyz+quat (7D) Gripper Cartesian poses
    |   |    |    |-- tcp_base.npy            # Gripper Cartesian pose in base coord, {serial number: [{"timestamp": ..., "tcp": ..., "robot_ft": ...}]}, where "tcp" values are xyz+quat (7D) Gripper Cartesian poses
    |   |    |    |-- joint.npy               # Joint angles, {serial number: {timestamp: joint angle array}}
    |   |    |    |-- gripper.npy             # Gripper commands and information, {serial number: {timestamp: {"gripper_command": 3-element array, "gripper_info": 3-element array}}}, where the 1st element of each array is the actual gripper width in millimeters (0-110)
    |   |    |    |-- force_torque.npy        # 6-DoF force/torque in cam's coord, {serial number: [{"timestamp": ..., "zeroed": ..., "raw": ...}]}, where "zeroed" values are pre-processed
    |   |    |    |-- force_torque_base.npy   # 6-DoF force/torque in base coord, {serial number: [{"timestamp": ..., "zeroed": ..., "raw": ...}]}, where "zeroed" values are pre-processed
    |   |    |    `-- high_freq_data.npy      # High frequency data, {serial number: [{"timestamp": ..., "zeroed": ..., "raw": ..., "tcp": ...}]}
    |   |    `-- audio_mixed/
    |   |
    |   |-- task_0001_user_0001_scene_0001_cfg_0001_human/  # Human demonstration data corresponding to the above robotic manipulation
    |   |    |-- metadata.json                # Human demonstration metadata, including scene starting and finishing timestamps, calibration timestamp and quality
    |   |    |-- cam_[serial_number]/
    |   |    |    |-- color.mp4
    |   |    |    |-- timestamps.npy
    |   |    |    `-- depth.mp4 (optional)
    |   |    `-- audio_mixed/
    |   |
    |   |
    |   `-- ... ...
    |
    |
    |-- RH20T_cfg2/
    |   `-- same as above
    |
    |
    |-- ...
    |
    |
    `-- RH20T_cfg7/
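
The folder naming scheme and the rating scale described in the tree above can be decoded with a few lines of code. The following is a minimal sketch, assuming every folder follows the four-digit pattern shown in the example name; the helper names are hypothetical and are not part of the RH20T API. Since the key names inside metadata.json are not listed here, `describe_rating` takes the rating value directly.

```python
import re

# Pattern inferred from the example folder name in the tree above.
SCENE_RE = re.compile(
    r"^task_(\d{4})_user_(\d{4})_scene_(\d{4})_cfg_(\d{4})(_human)?$"
)

def parse_scene_name(name):
    """Split a scene folder name into its numeric fields."""
    m = SCENE_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    task, user, scene, cfg = (int(g) for g in m.groups()[:4])
    return {
        "task": task,
        "user": user,
        "scene": scene,
        "cfg": cfg,
        # The "_human" suffix marks the paired human demonstration.
        "is_human": m.group(5) is not None,
    }

def describe_rating(rating):
    """Interpret the task completion rating stored in metadata.json."""
    if rating == 0:
        return "robot failure"
    if rating == 1:
        return "task failure"
    if 2 <= rating <= 9:
        return f"completed (quality {rating}, higher is better)"
    raise ValueError(f"rating out of range: {rating}")

info = parse_scene_name("task_0001_user_0001_scene_0001_cfg_0001_human")
```

Pairing a robot episode with its human demonstration then reduces to comparing the parsed fields while ignoring `is_human`.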

BibTeX

@inproceedings{
  fang2024rh20t,
  title        = {RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot},
  author       = {Fang, Hao-Shu and Fang, Hongjie and Tang, Zhenyu and Liu, Jirong and Wang, Chenxi and Wang, Junbo and Zhu, Haoyi and Lu, Cewu},
  booktitle    = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
  pages        = {653--660},
  year         = {2024},
  organization = {IEEE}
}

License

The dataset is licensed under a mixture of licenses as it is partly funded by a company. It is divided into two subsets: RH20T-C (commercial) and RH20T-NC (non-commercial).

The RH20T-C subset contains episodes with names containing 'scene_0001' to 'scene_0005'. It is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

The RH20T-NC subset contains episodes with names containing 'scene_0006' to 'scene_0010'. It is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0); it is freely available for non-commercial use and may be redistributed under these conditions. Commercial use of the RH20T-NC subset or models trained on it is not allowed.
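
When filtering episodes by license, the scene number in the episode name determines the subset. The sketch below encodes the two ranges stated above; the function name `license_subset` is hypothetical, and scene numbers outside 1-10 are treated as errors because this section does not say how they are licensed.

```python
import re

def license_subset(episode_name):
    """Return "RH20T-C" or "RH20T-NC" based on the scene number in the name."""
    m = re.search(r"scene_(\d+)", episode_name)
    if m is None:
        raise ValueError(f"no scene number in {episode_name!r}")
    scene = int(m.group(1))
    if 1 <= scene <= 5:
        return "RH20T-C"    # CC BY-SA 4.0
    if 6 <= scene <= 10:
        return "RH20T-NC"   # CC BY-NC 4.0
    raise ValueError(f"scene {scene} is outside the documented ranges")

subset = license_subset("task_0001_user_0001_scene_0006_cfg_0001")
```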

If you have any further questions, please contact fhaoshu@gmail.com.