Note
This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.
Dataset Card for MotIF-1K

MotIF-1K is a robotics motion dataset containing 1,022 demonstrations across 13 task categories, used to benchmark and fine-tune vision-language models (VLMs) for motion-based success detection. Each demonstration includes a video of the motion, multiple pre-rendered trajectory visualizations, task instructions, and motion descriptions.
The FiftyOne dataset is a grouped dataset where each group represents one trajectory and each group slice represents a different visual representation of that trajectory — mirroring the exact input formats used in the paper.
This is a FiftyOne dataset with 1023 samples.
Installation
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/motif-1k")
# Launch the App
session = fo.launch_app(dataset)
Dataset Details
Dataset Sources
HuggingFace repository: https://huggingface.co/datasets/myconnects/motif
Paper: MotIF: Motion Instruction Fine-tuning (Hwang et al., 2024), https://arxiv.org/abs/2409.10683
Code / collection scripts: https://github.com/Minyoung1005/motif
Project page: https://motif-1k.github.io
License: MIT
FiftyOne Dataset Structure
Grouped Dataset Overview
Dataset name: motif-1k
Media type: group
Default slice: video_trajviz
Groups: 1,022 (653 human_motion + 369 stretch_motion)
Group Slices
Every trajectory group contains up to 13 slices. Each slice is a separate fo.Sample with its own media file and labels. Not all slices are present for every group — see the Incomplete samples note below.
| Slice name | Media type | Description | Always present? |
|---|---|---|---|
| video_trajviz | video | Raw video with the trajectory overlaid; the default slice and the paper’s primary representation | No (absent for 182 incomplete stretch samples) |
| video_raw | video | Clean video without any trajectory overlay; carries the interactive per-frame trajectory Polyline | Yes |
| last_frame_trajviz | image | Final video frame with trajectory overlay; the exact image input used by the paper’s VLM | No |
|  | image | Final video frame, no overlay | No |
| opticalflow | image | Full optical-flow visualization of all keypoints | No |
|  | image | 2-keyframe storyboard, clean | No |
|  | image | 2-keyframe storyboard with trajectory overlay | No |
|  | image | 4-keyframe storyboard, clean | No |
|  | image | 4-keyframe storyboard with trajectory overlay | No |
|  | image | 9-keyframe storyboard, clean | No |
|  | image | 9-keyframe storyboard with trajectory overlay | No |
|  | image | 16-keyframe storyboard, clean | No |
|  | image | 16-keyframe storyboard with trajectory overlay | No |
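To work with one representation at a time, a flat view over a single slice can be pulled out of the grouped dataset. The helper below is a minimal sketch: `slice_view` is a hypothetical name introduced here, while `group_slices` and `select_group_slices()` are existing members of FiftyOne datasets. Because most slices are missing for incomplete groups, the returned view may contain fewer samples than there are groups.

```python
def slice_view(dataset, slice_name):
    """Return a flat view holding only the samples of one group slice.

    `dataset` is expected to be a grouped FiftyOne dataset, for example the
    object returned by `load_from_hub`. The helper name is illustrative.
    """
    available = list(dataset.group_slices)
    if slice_name not in available:
        raise ValueError(f"unknown slice {slice_name!r}; available: {available}")
    # select_group_slices() flattens the grouped dataset to the requested slice
    return dataset.select_group_slices(slice_name)
```

For example, `slice_view(dataset, "video_raw")` would return the clean videos that carry the per-frame trajectory Polylines.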
Sample-Level Fields
All fields below are present on every slice of every group.
| Field | Type | Description |
|---|---|---|
|  |  | FiftyOne group handle |
|  |  | Source config: human_motion or stretch_motion |
|  |  | Trajectory index within its config (0-based) |
|  |  | High-level task goal |
|  |  | Fine-grained motion specification |
| num_steps |  | Number of steps as stored in the source (may differ from trajectory_length) |
| trajectory_length |  | Actual number of trajectory points (len(trajectory)) |
| has_source_artifacts |  | Whether this sample’s group has all pre-rendered visualizations |
| tags |  | Always includes the config name; incomplete groups are also tagged "incomplete" |
Label Fields
video_raw slice: frames.trajectory (per-frame Polyline)
The video_raw slice carries a frame-level progressive trajectory annotation. At frame N, the Polyline contains the first N trajectory points, so the path draws itself out as the video plays.
Frame 1: a zero-length degenerate segment marking the trajectory start position (renders as a dot)
Frame N: the full trajectory path accumulated to that point
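The progressive scheme above can be sketched in plain Python. This is an illustration rather than the import script itself: `progressive_shapes` is a hypothetical helper, and it assumes one trajectory point per video frame, with 1-based frame numbers as FiftyOne uses.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def progressive_shapes(points: List[Point]) -> Dict[int, List[Point]]:
    """Map each 1-based frame number N to the first N trajectory points."""
    shapes: Dict[int, List[Point]] = {}
    for n in range(1, len(points) + 1):
        shape = list(points[:n])
        if n == 1:
            # A single point cannot form a line, so repeat it to get the
            # zero-length segment that renders as a dot at the start position
            shape.append(points[0])
        shapes[n] = shape
    return shapes
```

Each resulting list could then be used as the single shape of a `fo.Polyline` (its `points` argument takes a list of shapes) on the matching frame.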
Each Polyline carries these label attributes:
| Attribute | Type | Description |
|---|---|---|
|  |  | Coordinate convention used |
|  |  | Whether the source provided a |
|  |  | How the trajectory was corrected |
|  |  | Pixel offset applied in x (0 for identity and realsense_heuristic) |
|  |  | Pixel offset applied in y (0 for identity and realsense_heuristic) |
All Polyline coordinates are normalized to [0, 1] × [0, 1] relative to the video frame.
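When drawing a trajectory with an external tool, the normalized coordinates have to be scaled back to pixels for the specific video, since resolutions vary across the dataset. A minimal sketch, with `to_pixels` as a hypothetical helper name:

```python
from typing import List, Tuple


def to_pixels(
    points: List[Tuple[float, float]], width: int, height: int
) -> List[Tuple[float, float]]:
    """Scale normalized (x, y) coordinates to pixel coordinates."""
    return [(x * width, y * height) for x, y in points]
```

Points of a trajectory that leaves the frame map to pixel values below 0 or beyond the frame size; they are passed through unchanged here.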
Dataset Composition
| Config | Agent | Trajectories | Has all slices? |
|---|---|---|---|
| human_motion | Human (6 different people) | 653 | Yes, all 13 slices |
| stretch_motion | Hello Robot Stretch 2 | 188 | Yes, all 13 slices |
| stretch_motion | Hello Robot Stretch 2 | 182 | No, video_raw only |
| Total |  | 1,022 |  |
Task Categories
13 categories spanning non-interactive, object-interactive, and user-interactive motions:
Category |
Tasks |
|---|---|
Non-interactive |
Outdoor Navigation, Indoor Navigation, Draw Path |
Object-interactive |
Shake, Pick and Place, Stir, Wipe, Open/Close Cabinet, Spread Condiment |
User-interactive |
Handover, Brush Hair, Tidy Hair, Style Hair |
Trajectory Coordinate System
The trajectory field in the source data stores 2D pixel coordinates [x, y] per timestep. The coordinate space differs by config — this is a known source-side inconsistency, not a parsing bug:
|  | Applies to | Correction applied |
|---|---|---|
|  | All | Identity: MediaPipe hand detection runs on the cropped video frame, so coordinates match the stored video dimensions directly |
|  |  | Per-sample pixel translation detected from the red endpoint marker in |
|  |  | Best-effort: coordinates divided by 1280×720 (the RealSense D435i native capture resolution per the collection script). No source ground truth is available for this subset. |
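The three corrections can be summarized as a single normalization step. The sketch below is an assumption-laden illustration, not the import code: `normalize_points` is a hypothetical helper, the method label `"translation"` is made up here (only `identity` and `realsense_heuristic` are names taken from this card), and the offset is assumed to be added to the raw pixel coordinates.

```python
from typing import List, Tuple

# Native capture resolution quoted in the table above
REALSENSE_SIZE = (1280, 720)


def normalize_points(
    points: List[Tuple[float, float]],
    frame_size: Tuple[int, int],
    method: str = "identity",
    offset: Tuple[float, float] = (0.0, 0.0),
) -> List[Tuple[float, float]]:
    """Convert raw pixel coordinates to [0, 1] x [0, 1] frame coordinates."""
    if method == "identity":
        # Coordinates already live in the stored video's pixel space
        (width, height), (dx, dy) = frame_size, (0.0, 0.0)
    elif method == "translation":  # hypothetical label for the per-sample shift
        # Assumption: the detected offset is added before normalizing
        (width, height), (dx, dy) = frame_size, offset
    elif method == "realsense_heuristic":
        # Best-effort: divide by the capture resolution, not the video size
        (width, height), (dx, dy) = REALSENSE_SIZE, (0.0, 0.0)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return [((x + dx) / width, (y + dy) / height) for x, y in points]
```

Keeping the method as an explicit argument mirrors the way each Polyline records how its trajectory was corrected, so downstream code can treat the best-effort subset with more caution.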
Known Data Quality Issues
The following issues were identified during import and are preserved in the data:
Incomplete stretch_motion subset (182 groups): These groups have no pre-rendered visualizations (video_trajviz, last_frame_trajviz, opticalflow, and the storyboards are all absent). Only video_raw is available. These samples cannot be used with the paper’s VLM evaluation methodology without regenerating the visualizations. They are identified by has_source_artifacts == False or the "incomplete" tag.
num_steps vs trajectory_length disagreement (~160 rows): The source’s num_steps field reflects the original capture length before some post-processing trimmed the trajectory. trajectory_length (= len(trajectory)) is the reliable count and is used for all frame-level annotations.
Trajectory partially outside frame: Some trajectories extend into negative coordinates or past the video edges. FiftyOne clips these gracefully at the frame border; no values are modified.
Variable video resolutions: Human demos span 14 different square resolutions (208×208 to 480×480 plus one 640×480). Stretch demos with artifacts use three resolutions (320×320, 352×352, 480×480). The incomplete stretch subset uses eight different resolutions (192×192 to 720×720).
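A quick way to screen for the first three issues is to flag them per record. The sketch below works on plain dictionaries so it stays independent of any particular loader; `quality_flags` and the `points` key (normalized trajectory coordinates) are hypothetical names, while `has_source_artifacts`, `num_steps`, and `trajectory_length` are the fields described in this card.

```python
from typing import Any, Dict, List


def quality_flags(record: Dict[str, Any]) -> List[str]:
    """Return the known data-quality issues that apply to one record."""
    flags: List[str] = []
    # Groups without pre-rendered visualizations
    if not record.get("has_source_artifacts", True):
        flags.append("incomplete")
    # Source step count disagrees with the actual number of points
    if record.get("num_steps") != record.get("trajectory_length"):
        flags.append("num_steps_mismatch")
    # Any normalized point outside the unit square lies outside the frame
    points = record.get("points", [])
    if any(not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) for x, y in points):
        flags.append("outside_frame")
    return flags
```

A record with no flags is complete, has consistent counts, and stays inside the frame.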
Citation
@article{hwang2024motif,
title={MotIF: Motion Instruction Fine-tuning},
author={Hwang, Minyoung and Hejna, Joey and Sadigh, Dorsa and Bisk, Yonatan},
journal={arXiv preprint arXiv:2409.10683},
year={2024}
}
APA: Hwang, M., Hejna, J., Sadigh, D., & Bisk, Y. (2024). MotIF: Motion Instruction Fine-tuning. arXiv preprint arXiv:2409.10683.