Note
This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.
Dataset Card for RIS-LAD#

This is a FiftyOne dataset with 2103 samples.
Installation#
If you haven’t already, install FiftyOne:
pip install -U fiftyone
Usage#
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc.
dataset = load_from_hub("Voxel51/RIS-LAD")
# Launch the App
session = fo.launch_app(dataset)
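As the comment above notes, load_from_hub also accepts arguments such as max_samples. A minimal sketch that loads a small subset and restricts it to the training split (the "train" tag follows the split tags described under Dataset Structure below):
# Load at most 100 samples (see 'max_samples' above)
subset = load_from_hub("Voxel51/RIS-LAD", max_samples=100)
# Restrict to the training split via its sample tag
train_view = subset.match_tags("train")
print(train_view)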
Dataset Details#
Dataset Description#
RIS-LAD (Referring Low-Altitude Drone Image Segmentation) is the first fine-grained Referring Image Segmentation benchmark specifically designed for low-altitude drone (LAD) scenarios.
The dataset contains 13,871 meticulously annotated image-text-mask triplets collected from real-world drone footage captured at altitudes of approximately 30-100 meters with oblique viewing angles (30°-60°).
Unlike existing remote sensing RIS datasets that focus on high-altitude satellite or fixed-angle imagery, RIS-LAD addresses unique challenges of low-altitude drone perception including:
Strong perspective changes and foreshortening from oblique views
Tiny and densely packed objects
Variable illumination conditions including nighttime scenes
Category drift (tiny targets causing confusion with larger, semantically similar objects)
Object drift (difficulty distinguishing among crowded same-class instances)
The dataset was constructed using a semi-automatic pipeline combining SAM-2 for high-quality instance masks and multimodal LLM-generated referring expressions, followed by human refinement and verification.
Curated by: Kai Ye, Yingshi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, Liujuan Cao (Xiamen University)
Language(s) (NLP): English
License: CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0)
Dataset Sources#
Repository: https://github.com/AHideoKuzeA/RIS-LAD-A-Benchmark-and-Model-for-Referring-Low-Altitude-Drone-Image-Segmentation
Paper: RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation
Dataset Download: Google Drive
Uses#
Direct Use#
This dataset is intended for:
Referring Image Segmentation (RIS): Training and evaluating models that segment objects based on natural language descriptions
Vision-Language Research: Multi-modal learning combining computer vision and natural language processing
Low-Altitude Drone Perception: Developing perception systems for drone applications operating at 30-100m altitude
Visual Grounding: Research on grounding natural language expressions to visual regions
Benchmark Evaluation: Comparing RIS methods specifically under challenging low-altitude drone conditions with tiny, dense objects and variable illumination
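RIS benchmarks are typically scored by the overlap (IoU) between each predicted mask and its ground-truth mask. A minimal, generic sketch of a per-sample mask IoU in NumPy (illustrative only, not the paper's official evaluation code):
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)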
Out-of-Scope Use#
Commercial Applications: The dataset is licensed under CC BY-NC-SA 4.0, restricting commercial use
High-Altitude Remote Sensing: The dataset is specifically designed for low-altitude (30-100m) oblique views and may not generalize well to satellite or high-altitude imagery
Ground-Level Scene Understanding: The oblique drone perspective differs substantially from conventional ground-view datasets
Privacy-Sensitive Applications: Users should be aware that drone imagery may contain identifiable individuals or private property
Dataset Structure#
FiftyOne Format#
When converted to FiftyOne using the provided conversion script, each sample contains:
filepath: Path to the image file
tags: Dataset split as a tag (train, val, or test)
prompts: List of all referring expression strings for that image
ground_truth: FiftyOne Detections object containing:
label: Object category name
bounding_box: Normalized bounding box coordinates [x, y, width, height] in range [0, 1]
mask: Binary segmentation mask (cropped to the bounding box region)
ref_id: Unique reference ID
ann_id: Annotation ID linking to the original data
referring_expression: The natural language description for this specific object
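A minimal sketch of reading these fields from a loaded dataset (continuing from the Usage snippet above; to_segmentation is assumed here as the standard FiftyOne way to expand the box-cropped mask to a full-frame mask):
# Image dimensions are needed to expand box-cropped masks
dataset.compute_metadata()

sample = dataset.first()
print(sample.filepath, sample.tags, sample.prompts)

for detection in sample.ground_truth.detections:
    print(detection.label, detection.referring_expression)
    # bounding_box is [x, y, width, height] in relative [0, 1] coordinates
    x, y, w, h = detection.bounding_box
    # Expand the box-cropped binary mask to a full-frame segmentation
    full_mask = detection.to_segmentation(
        frame_size=(sample.metadata.width, sample.metadata.height)
    )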
Object Categories#
The dataset includes 8 object categories commonly found in low-altitude drone imagery:
| Category | Count | Description |
|---|---|---|
| car | 4,365 | Most common category |
| people | 2,910 | Pedestrians and individuals |
| motor | 2,803 | Motorcycles and motorized two-wheelers |
| truck | 1,648 | Trucks and large vehicles |
| bus | 732 | Buses |
| bicycle | 640 | Bicycles |
| tricycle | 528 | Tricycles |
| boat | 245 | Boats and watercraft |
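The per-category counts above can be reproduced from a loaded FiftyOne dataset with a label-count aggregation over the ground_truth field (continuing from the Usage snippet above):
# Count annotations per object category
print(dataset.count_values("ground_truth.detections.label"))
# e.g. {'car': 4365, 'people': 2910, 'motor': 2803, ...}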
Dataset Creation#
Curation Rationale#
Existing referring image segmentation (RIS) datasets focus primarily on conventional ground-view scenes or high-altitude remote sensing imagery. These settings differ substantially from low-altitude drone (LAD) views where:
Perspectives are oblique (30°-60° angles) rather than top-down or horizontal
Objects are tiny and densely packed
Illumination varies widely, including nighttime scenes
Flight altitude (30-100m) is far lower than that of satellite or high-altitude aerial imagery (well over 1,000m)
RIS-LAD was created to bridge this gap and enable research on referring image segmentation specifically for low-altitude drone applications, which are increasingly deployed in real-world perception systems due to their flexibility and cost-effectiveness.
Source Data#
Data Collection and Processing#
Image Collection:
Source: Real-world drone footage captured at altitudes of 30-100 meters
Viewing angles: Oblique perspectives at 30°-60° angles
Resolution: 1080×1080 pixels
Conditions: Various illumination including daytime and nighttime scenes
Total images: 2,104 unique images
Annotation Pipeline (Semi-Automatic):
Instance Segmentation: High-quality instance masks generated using SAM-2 (Segment Anything Model 2) with prompting
Referring Expression Generation: Initial expressions generated by multimodal LLMs given:
Cropped instance images
Location cues
Category information
Human Refinement: Manual verification and refinement of both masks and expressions
Quality Control: Careful verification of all 13,871 image-text-mask triplets
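A schematic sketch of this semi-automatic pipeline (the functions below are hypothetical placeholders for the SAM-2, multimodal LLM, and human-review stages described above, not the authors' implementation):
# Hypothetical placeholder stages; each stands in for a component named above
def sam2_instance_mask(image, prompt):
    """Step 1: high-quality instance mask from SAM-2 prompting."""
    raise NotImplementedError

def mllm_referring_expression(instance_crop, location_cue, category):
    """Step 2: initial referring expression from a multimodal LLM."""
    raise NotImplementedError

def human_refine(image, mask, expression):
    """Step 3: manual refinement and verification."""
    raise NotImplementedError

def build_triplets(instances):
    """Assemble image-text-mask triplets from candidate instances."""
    triplets = []
    for image, prompt, crop, location, category in instances:
        mask = sam2_instance_mask(image, prompt)
        expression = mllm_referring_expression(crop, location, category)
        mask, expression = human_refine(image, mask, expression)
        triplets.append((image, expression, mask))
    return triplets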
Who are the source data producers?#
The source data was collected from real-world drone operations. The specific locations and operators are not disclosed in the publicly available information. The dataset was curated and annotated by researchers at Xiamen University.
Annotations#
Annotation process#
The dataset uses a semi-automatic annotation pipeline:
Segmentation Masks: Generated using SAM-2 with human-in-the-loop prompting and verification
Referring Expressions:
Initially generated by multimodal LLMs
Provided with cropped object images and spatial location information
Manually refined by human annotators
Verified for accuracy and naturalness
The annotations include:
Binary segmentation masks (RLE format)
Bounding boxes
Natural language referring expressions
Object category labels
Tokenized text
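The RLE masks in the original annotation files can be decoded with pycocotools, assuming COCO-style RLE encoding (a common convention for RIS datasets; the toy mask below is only for illustration):
import numpy as np
from pycocotools import mask as mask_utils

# Toy round trip: encode a binary mask as COCO-style RLE, then decode it
toy_mask = np.zeros((1080, 1080), dtype=np.uint8, order="F")  # F-contiguous
toy_mask[100:200, 300:400] = 1

rle = mask_utils.encode(toy_mask)   # {'size': [h, w], 'counts': ...}
decoded = mask_utils.decode(rle)    # back to a (1080, 1080) uint8 array
assert (decoded == toy_mask).all()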
Who are the annotators?#
The annotation team consisted of researchers from Xiamen University who performed the human refinement and verification steps of the semi-automatic pipeline. Specific demographic information about annotators is not provided.
Personal and Sensitive Information#
The dataset contains drone imagery captured from low altitudes (30-100m) which may include:
Identifiable individuals: People visible in public spaces
Vehicles: Cars, motorcycles, trucks, buses with potentially visible license plates
Location information: Urban and outdoor scenes
Privacy Considerations:
Images are from real-world drone footage
No explicit anonymization process is described
Users should be aware of potential privacy implications
The non-commercial license (CC BY-NC-SA 4.0) provides some restrictions on use
Dataset-Specific Challenges#
The paper identifies two key failure modes that are prevalent in this dataset:
Category Drift: Tiny targets can cause models to incorrectly segment larger, semantically similar objects
Object Drift: Dense crowds of same-class instances make it difficult to distinguish which specific instance is being referred to
Potential Biases#
Domain Bias: Focused on urban/outdoor surveillance scenarios typical of drone operations
Category Distribution: Heavily skewed toward vehicles (cars: 31%, motor: 20%, truck: 12%) vs. other categories
Illumination Bias: While nighttime scenes are included, the distribution between day/night is not specified
Expression Style: Referring expressions generated by LLMs may have stylistic patterns that differ from purely human-generated descriptions
Citation#
BibTeX:
@misc{ye2025risladbenchmarkmodelreferring,
title = {RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation},
author = {Kai Ye and YingShi Luan and Zhudi Chen and Guangyue Meng and Pingyang Dai and Liujuan Cao},
year = {2025},
eprint = {2507.20920},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2507.20920}
}
APA:
Ye, K., Luan, Y., Chen, Z., Meng, G., Dai, P., & Cao, L. (2025). RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation. arXiv preprint arXiv:2507.20920.