Dataset Card for Qualcomm Exercise Video Dataset (Benchmark)#
This is the benchmark split of the dataset as described here.
This is a FiftyOne dataset with 74 samples.

Installation#
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc.
dataset = load_from_hub("Voxel51/qualcomm-exercise-video-dataset-benchmark")
# Launch the App
session = fo.launch_app(dataset)
QEVD-FIT-COACH-Benchmark Dataset Card#
Dataset Details#
Dataset Description#
The QEVD-FIT-COACH-Benchmark dataset contains 74 workout video sessions with real-time AI fitness coaching feedback. Each session is approximately 2.5-3 minutes long and features a single participant performing 4-5 different exercises with continuous feedback from an AI fitness coach.
The dataset includes temporal annotations for feedback events, including exercise transitions, form corrections, encouragement, and repetition counting. All feedback is aligned with video frames and includes precise timing information.
Curated by: Qualcomm AI Research
Language(s): English (en)
License: [Qualcomm Dataset Research License](https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/Dataset-Research-License-Feb-25-2025.pdf)
Format: MP4 videos (H.264/AVC codec, 640x360 resolution, 30 FPS) with JSON annotations
Dataset Sources#
Repository: https://www.qualcomm.com/developer/software/qevd-dataset
Paper: https://arxiv.org/pdf/2407.08101v2
Dataset Type: Video + Temporal Annotations
Dataset Statistics#
Total Videos: 74 workout sessions
Total Duration: ~195 minutes (~3.25 hours)
Video Duration: 150-166 seconds per session (avg: 158s)
Total Feedback Events: 2,511 temporal detections
Exercise Transitions: 498 events
Regular Feedback: 2,013 events (corrections, encouragement, counting)
Unique Feedback Messages: 1,592 distinct messages
Average Feedback per Session: 33.9 events
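These counts can be reproduced from the loaded dataset with FiftyOne's aggregation methods. A minimal sketch, assuming dataset was loaded as in the Usage snippet above and using the per-sample count fields described under Dataset Structure below:
print(dataset.count())                      # 74 videos
print(dataset.sum("num_feedback_events"))   # 2,511 feedback events
print(dataset.sum("num_transitions"))       # 498 transitions
print(dataset.sum("num_feedbacks"))         # 2,013 regular feedbacks
print(dataset.mean("num_feedback_events"))  # ~33.9 events per session
print(len(dataset.distinct("feedback_events.detections.label")))  # 1,592 unique messages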
Video Specifications#
Resolution: 640x360 pixels (16:9 aspect ratio)
Frame Rate: 30 FPS
Codec: H.264/AVC (avc1) - re-encoded from MPEG-4 Part 2
Container: MP4
Audio: None (video only)
Average Frames per Video: ~4,740 frames
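The specifications above can be verified against the stored VideoMetadata. A minimal sketch, again assuming dataset from the Usage snippet:
dataset.compute_metadata()  # populates metadata for any samples missing it
print(dataset.distinct("metadata.frame_rate"))     # [30.0]
print(dataset.distinct("metadata.frame_width"))    # [640]
print(dataset.distinct("metadata.frame_height"))   # [360]
print(dataset.mean("metadata.total_frame_count"))  # ~4,740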
Uses#
Direct Use#
This dataset is suitable for:
Video Understanding Research
- Temporal action detection and localization
- Activity recognition in fitness/exercise contexts
- Video-text alignment and grounding
Fitness AI Development
- Training models for automated form correction
- Exercise recognition and classification
- Real-time feedback generation systems
Human-Computer Interaction
- Conversational AI for fitness coaching
- Multimodal interaction studies
- Feedback timing and delivery analysis
Computer Vision Applications
- Pose estimation evaluation
- Action segmentation
- Temporal event detection
Dataset Structure#
FiftyOne Format#
The dataset is structured as a FiftyOne video dataset with temporal detections:
Sample {
    'id': ObjectId,
    'filepath': str,                   # Path to MP4 video file
    'video_id': str,                   # Video identifier (e.g., '0006')
    'metadata': VideoMetadata {
        'duration': float,             # Duration in seconds
        'frame_rate': float,           # 30.0 FPS
        'frame_width': int,            # 640 pixels
        'frame_height': int,           # 360 pixels
        'total_frame_count': int,      # ~4,740 frames
        'encoding_str': str,           # 'avc1' (H.264)
        'mime_type': str,              # 'video/mp4'
    },
    'feedback_events': TemporalDetections {
        'detections': [
            TemporalDetection {
                'label': str,          # Feedback text
                'support': [int, int], # [start_frame, end_frame]
                'confidence': float,   # 1.0 (ground truth)
                'is_transition': bool, # True for exercise changes
                'feedback_type': str,  # 'transition' or 'feedback'
            },
            ...
        ]
    },
    'transitions': TemporalDetections, # Subset: transitions only
    'feedbacks': TemporalDetections,   # Subset: regular feedback only
    'num_feedback_events': int,        # Total count
    'num_transitions': int,            # Transition count
    'num_feedbacks': int,              # Regular feedback count
}
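In practice, these fields are accessed as ordinary sample attributes. A minimal sketch, assuming dataset from the Usage snippet above:
sample = dataset.first()
print(sample.video_id)             # e.g., '0006'
print(sample.num_feedback_events)  # event count for this session
# Each temporal detection carries the feedback text and its frame range
for detection in sample.feedback_events.detections[:5]:
    start_frame, end_frame = detection.support
    print(detection.label, detection.feedback_type, (start_frame, end_frame))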
Temporal Detection Format#
Each temporal detection represents a feedback event with:
Frame-based timing: the support field contains [start_frame, end_frame]
Time-based timing: computed as frame / fps to get seconds
Feedback text: the actual coaching message
Type classification: transition vs. regular feedback
Example:
Frame range: [397, 460]
Time range: 13.23s - 15.33s
Label: "First up are high knees!"
Type: transition
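The frame-to-seconds conversion above is a direct division by the frame rate. A minimal sketch that reproduces the example:
def support_to_seconds(support, fps=30.0):
    # Convert a [start_frame, end_frame] support to (start_s, end_s)
    start_frame, end_frame = support
    return start_frame / fps, end_frame / fps
print(support_to_seconds([397, 460]))  # (13.23..., 15.33...)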
Feedback Categories#
The dataset contains four main types of feedback:
Exercise Transitions (498 events)
- Announce new exercises or the end of the session
- Examples: “First up are high knees!”, “Moving on to squats!”, “That’s the end of the session.”
Form Corrections
- Real-time technique feedback
- Examples: “Your stance is too narrow!”, “Tighten your core!”, “Wrong leg pal!”
Encouragement
- Motivation and positive reinforcement
- Examples: “Nice!”, “You crushed it!”, “Love the high knees!”
Counting
- Repetition tracking
- Examples: “10”, “We are at 5 reps!”, “20”
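Because each detection stores is_transition and feedback_type attributes, the categories can be separated with a standard FiftyOne label filter. A minimal sketch, assuming dataset from the Usage snippet above:
from fiftyone import ViewField as F
# View containing only exercise-transition announcements
transitions = dataset.filter_labels("feedback_events", F("feedback_type") == "transition")
# View containing only regular feedback (corrections, encouragement, counting)
feedbacks = dataset.filter_labels("feedback_events", F("feedback_type") == "feedback")
print(transitions.count("feedback_events.detections"))  # 498
print(feedbacks.count("feedback_events.detections"))    # 2,013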
Exercise Types#
Common exercises in the dataset include:
High knees
Jumping jacks
Squats (regular, jumps, kicks)
Pushups
Butt kickers
Mountain climbers
Lunges (walking, jumps)
Planks (regular, moving, taps)
Stretches (quad, arm cross chest)
And more…
Dataset Creation#
Curation Rationale#
The dataset was created to benchmark AI fitness coaching systems, particularly for:
Evaluating temporal feedback generation
Testing exercise recognition accuracy
Assessing form correction capabilities
Measuring real-time interaction quality
Source Data#
Data Collection and Processing#
Recording Setup: Controlled environment with single participant
Video Format: Originally MPEG-4 Part 2, re-encoded to H.264 for compatibility
Session Structure: Each session contains 4-5 exercises performed sequentially
Duration: Sessions are approximately 2.5-3 minutes each
Frame Alignment: Feedback annotations are frame-synchronized with video
Data Processing Pipeline#
Original Format: JSON annotations + MP4 videos + NumPy timestamp files
Frame Alignment: Feedback array pre-aligned with video frames
Temporal Detection Extraction: Contiguous frame ranges identified for each feedback
Video Re-encoding: Converted from MPEG-4 to H.264 for browser compatibility
FiftyOne Integration: Structured as temporal detection dataset
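The temporal detection extraction step (finding contiguous frame ranges for each feedback) amounts to run-length grouping of identical consecutive entries in the frame-aligned feedback array. A hedged sketch of the idea, not the exact pipeline code:
def extract_ranges(frame_feedbacks):
    # frame_feedbacks[i] is the feedback string shown at frame i, or None
    # Returns (text, start_frame, end_frame) for each contiguous run
    ranges = []
    current, start = None, None
    for i, text in enumerate(frame_feedbacks):
        if text != current:
            if current is not None:
                ranges.append((current, start, i - 1))
            current, start = text, i
    if current is not None:
        ranges.append((current, start, len(frame_feedbacks) - 1))
    return ranges
# e.g., [None, "Nice!", "Nice!", None] -> [("Nice!", 1, 2)]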
Annotations#
Annotation Process#
Annotation Type: Temporal feedback events with frame-level precision
Alignment Method: Frame-by-frame feedback array synchronized with video
Timestamp Format: UNIX timestamps (seconds since epoch) for reference
Quality: Ground truth annotations (confidence = 1.0)
Annotation Fields#
feedbacks: Frame-by-frame feedback array (aligned with video frames)
feedback_timestamps: UNIX timestamps for each unique feedback event
is_transition: Boolean flags indicating exercise transitions
video_timestamps: Frame-level UNIX timestamps in nanoseconds
Personal and Sensitive Information#
Contains: Video recordings of human subjects performing exercises
Identifiable Information: Visual appearance of participants
Privacy Considerations: Videos show individuals in workout clothing performing exercises
Anonymization: No explicit anonymization mentioned in source data
Citation#
@inproceedings{livefit,
title = {Live Fitness Coaching as a Testbed for Situated Interaction},
author = {Sunny Panchal and Apratim Bhattacharyya and Guillaume Berger and Antoine Mercier and Cornelius B{\"{o}}hm and Florian Dietrichkeit and Reza Pourreza and Xuanlin Li and Pulkit Madan and Mingu Lee and Mark Todorovich and Ingo Bax and Roland Memisevic},
booktitle = {NeurIPS (D&B Track)},
year = {2024},
}
APA#
Panchal, S., Bhattacharyya, A., Berger, G., Mercier, A., Böhm, C., Dietrichkeit, F., Pourreza, R., Li, X., Madan, P., Lee, M., Todorovich, M., Bax, I., & Memisevic, R. (2024). Live fitness coaching as a testbed for situated interaction. Advances in Neural Information Processing Systems (Datasets and Benchmarks Track).