Note

This is a Hugging Face dataset. For large datasets, ensure huggingface_hub>=1.1.3 to avoid rate limits. Learn more in the Hugging Face integration docs.

Dataset Card for Form Understanding in Noisy Scanned Documents Plus#

This is a FiftyOne dataset with 1026 samples.

Installation#

If you haven’t already, install FiftyOne:

pip install -U fiftyone

Usage#

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other arguments, such as 'max_samples', are also available
dataset = load_from_hub("Voxel51/form_understanding_in_noisy_scanned_documents_plus")

# Launch the App
session = fo.launch_app(dataset)

Dataset Details#

Dataset Description#

FUNSD+ (Form Understanding in Noisy Scanned Documents Plus) is an enhanced version of the original FUNSD dataset designed for form understanding tasks. The dataset provides ground truth data for extracting structured information from scanned forms, including entity recognition and relationship extraction between form fields and their values.

FUNSD+ addresses inconsistencies in labeling found in the original FUNSD dataset and significantly expands the document count from 199 to 1,113 documents. The dataset contains annotations for headers, questions (field labels), answers (field values), and their relationships, making it suitable for training and evaluating models for key-value extraction, document layout analysis, and form understanding tasks.

Each sample includes:

Scanned form images
Word-level OCR tokens with bounding boxes
Entity labels (header, question, answer, other)
Grouped words forming semantic units
Linked groups showing relationships between questions and answers

Curated by: Konfuzio (Helm & Nagel GmbH)
Shared by: Konfuzio via Hugging Face
Language(s) (NLP): English (en)
License: FUNSD+ Custom License

Dataset Sources#

Repository: https://huggingface.co/datasets/konfuzio/funsd_plus
Homepage: https://konfuzio.com/en/funsd-plus/
Paper:
- Original FUNSD: FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents (Jaume et al., 2019)
- Related: Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (Powalski et al., 2021)
Demo: https://app.konfuzio.com/d/303962/

Uses#

Direct Use#

The FUNSD+ dataset is intended for:

Form Understanding: Training and evaluating models for extracting structured information from scanned forms
Key-Value Extraction: Identifying relationships between field labels (questions) and their corresponding values (answers)
Document Layout Analysis: Understanding spatial and semantic layout of form documents
Named Entity Recognition: Detecting and classifying text entities in documents (headers, questions, answers)
OCR Post-Processing: Improving OCR results by understanding document structure
Multi-modal Document Understanding: Combining visual (layout) and textual information for document comprehension
Benchmarking: Comparing performance of document AI models on a standardized dataset

The dataset can be used directly for training transformer-based models like LayoutLM, LayoutLMv2, LayoutLMv3, BERT-based models, and other architectures designed for document understanding.
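LayoutLM-style models generally expect word boxes scaled to a 0-1000 range rather than absolute pixels, whereas this card lists `bboxes` in absolute pixel coordinates. The following is a minimal sketch of that conversion; the helper name `normalize_bbox` and the example page size are illustrative assumptions, not part of the dataset or any library.

```python
def normalize_bbox(bbox, width, height):
    """Scale an absolute-pixel [x_min, y_min, x_max, y_max] box to 0-1000.

    `normalize_bbox` is a hypothetical helper written for this example.
    """
    x_min, y_min, x_max, y_max = bbox
    scaled = [
        1000 * x_min / width,
        1000 * y_min / height,
        1000 * x_max / width,
        1000 * y_max / height,
    ]
    # Clamp to the valid range in case OCR boxes slightly exceed the page
    return [min(1000, max(0, int(v))) for v in scaled]


# Hypothetical word box on a 1000x1400 pixel page
print(normalize_bbox([100, 140, 300, 210], width=1000, height=1400))
# → [100, 100, 300, 150]
```

The image width and height would come from each sample's image, so the same word box maps to different normalized values on pages of different sizes.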

Out-of-Scope Use#

Non-English Documents: The dataset contains only English-language forms and may not generalize well to other languages
Modern Digital Forms: Optimized for scanned/noisy documents rather than born-digital forms
Handwritten Forms: The dataset focuses on printed/typed text, not handwriting recognition
Privacy-Sensitive Applications: Users must not attempt to identify individuals in the dataset (per license terms)
Unstructured Documents: Not suitable for documents without form-like structure (e.g., essays, articles, books)

Dataset Structure#

The dataset contains the following fields for each sample:

image (PIL Image): Scanned form image in PNG format
- Typical size: ~1000x1000 to ~1400x1400 pixels
- Format: RGB or grayscale
words (list of strings): OCR-extracted text tokens
- Length: Variable (typically 50-300 words per document)
- Contains individual words/tokens from the form
bboxes (list of lists): Bounding boxes for each word
- Format: [x_min, y_min, x_max, y_max] in absolute pixel coordinates
- Coordinates correspond to word positions in the image
labels (list of integers): Entity type labels for each word
- 0: Other (non-semantic text)
- 1: Header (document titles, form names)
- 2: Question (field labels, prompts)
- 3: Answer (field values, responses)
grouped_words (list of lists): Indices grouping words into semantic units
- Groups related words that form complete entities
- Example: [[0, 1, 2], [3], [4, 5]] groups words 0-1-2 together, word 3 alone, words 4-5 together
linked_groups (list of lists): Indices showing relationships between word groups
- Represents question-answer pairs and other semantic relationships
- Example: [[0, 1]] links group 0 (question) to group 1 (answer)
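
As an illustration of how these fields fit together, the sketch below resolves `linked_groups` into question-answer text pairs. The toy sample is invented for this example, and the code assumes each entry of `linked_groups` is a two-element `[question_group, answer_group]` pair, as in the example above; real samples may need extra handling.

```python
LABEL_NAMES = {0: "other", 1: "header", 2: "question", 3: "answer"}

# Hypothetical toy sample mirroring the fields described above
sample = {
    "words": ["Name", ":", "Jane", "Doe", "Date", "1999"],
    "labels": [2, 2, 3, 3, 2, 3],
    "grouped_words": [[0, 1], [2, 3], [4], [5]],
    "linked_groups": [[0, 1], [2, 3]],
}


def group_text(sample, group_idx):
    """Join the words of one group into a single string."""
    word_indices = sample["grouped_words"][group_idx]
    return " ".join(sample["words"][i] for i in word_indices)


def linked_pairs(sample):
    """Return (source_text, target_text) for every link.

    Assumes each link is a pair of group indices.
    """
    return [
        (group_text(sample, src), group_text(sample, dst))
        for src, dst in sample["linked_groups"]
    ]


print(linked_pairs(sample))
# → [('Name :', 'Jane Doe'), ('Date', '1999')]
```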

Dataset Splits#

Split	Number of Samples	Size (MB)
Train	1,026	~183
Test	113	~21
Total	1,139	~204

Comparison with Original FUNSD#

	FUNSD	FUNSD+
Documents	199	1,113
Headers	563	1,604
Questions	4,343	14,695
Answers	3,623	12,154
Questions with no answers	720 (16.6%)	2,691 (18.3%)
Answers without questions	0	114 (0.9%)
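
The percentages in the table can be reproduced from the raw counts. The short check below assumes each percentage is taken relative to the total number of questions (or answers) in the same dataset; the helper `pct` is defined only for this example.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts copied from the comparison table
print(pct(720, 4343))        # FUNSD questions with no answers   → 16.6
print(pct(2691, 14695))      # FUNSD+ questions with no answers  → 18.3
print(pct(114, 12154))       # FUNSD+ answers without questions  → 0.9
print(round(1113 / 199, 1))  # growth in document count          → 5.6
```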

Dataset Creation#

Curation Rationale#

The FUNSD+ dataset was created to address several limitations in the original FUNSD dataset:

Scale: Expand from 199 to 1,113 documents to provide more training data for deep learning models
Annotation Quality: Fix inconsistencies in labeling found in the original FUNSD dataset
Key-Value Extraction: Improve the dataset’s effectiveness for training models to extract question-answer pairs from forms
Robustness: Provide more diverse examples of form layouts and structures
Benchmarking: Create a more comprehensive benchmark for evaluating form understanding models

Source Data#

Data Collection and Processing#

The dataset consists of scanned business forms and documents. Following the original FUNSD methodology:

Forms were scanned at various resolutions and quality levels to simulate real-world noisy documents
OCR was performed to extract text and bounding boxes
Images were processed to standard formats (PNG)
Annotations were created for entity types and relationships

Who are the source data producers?#

The source data consists of business forms and documents. The original FUNSD dataset was created by Guillaume Jaume et al. FUNSD+ was curated and expanded by Konfuzio (Helm & Nagel GmbH), specifically by Davide Zagami and Christopher Helm.

Annotations#

The dataset includes human-annotated labels for:

Entity types (header, question, answer, other)
Word groupings into semantic units
Relationships between entities (question-answer pairs)

Annotation process#

Following the original FUNSD annotation guidelines, annotators labeled:

Individual words with entity type labels
Semantic groupings of words
Relationships between questions and answers

FUNSD+ includes revised annotations that fix inconsistencies from the original FUNSD dataset, particularly improving the accuracy of key-value pair annotations.

[More Information Needed for specific annotation tools, guidelines, inter-annotator agreement, etc.]

Who are the annotators?#

Annotations were performed by Konfuzio team members and potentially external annotators. [More Information Needed for specific annotator demographics and qualifications]

Personal and Sensitive Information#

Per the dataset license, users agree to not attempt to determine the identity of individuals in this dataset. The forms may contain business information, but efforts have been made to use non-sensitive documents. Users should be aware that some forms may contain names, addresses, or other potentially identifying information and should handle the data accordingly.

Bias, Risks, and Limitations#

Technical Limitations:

Language: English-only; may not generalize to other languages
Domain: Business forms; may not transfer well to other document types
OCR Quality: Pre-extracted OCR may contain errors
Annotation Inconsistencies: Although improved over FUNSD, some annotation inconsistencies may remain; note that 18.3% of questions have no linked answer

Biases:

Geographic Bias: Forms may predominantly reflect US/Western business practices
Temporal Bias: Forms reflect document styles from specific time periods
Domain Bias: Limited to business forms; not representative of all document types

Risks:

Models trained on this dataset may not perform well on documents with significantly different layouts or languages
The dataset size, while larger than FUNSD, may still be limiting for some deep learning approaches
Presence of potentially identifying information requires careful handling

Recommendations#

Users should:

Be aware that models trained on FUNSD+ may not generalize to non-form documents or non-English text
Consider the dataset size limitations when training large models
Comply with the license terms, particularly regarding not attempting to identify individuals
Evaluate models on domain-specific test sets if deploying to production
Be cautious about annotation quality, particularly for edge cases (questions without answers, etc.)
Consider data augmentation to improve model robustness
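
One way to inspect the edge cases mentioned above is to list question groups that appear in no link. This is a minimal sketch on an invented toy sample; it assumes a group's entity type can be read from the label of its first word, which the card does not state explicitly.

```python
QUESTION = 2  # label ID for "question", per the Dataset Structure section


def unanswered_questions(sample):
    """Return the text of question groups that take part in no link."""
    linked = {g for link in sample["linked_groups"] for g in link}
    result = []
    for idx, group in enumerate(sample["grouped_words"]):
        # Assumption: the first word's label represents the whole group
        if group and sample["labels"][group[0]] == QUESTION and idx not in linked:
            result.append(" ".join(sample["words"][i] for i in group))
    return result


# Hypothetical toy sample: "Fax No." is a question with no linked answer
toy = {
    "words": ["Name", "Jane", "Fax", "No."],
    "labels": [2, 3, 2, 2],
    "grouped_words": [[0], [1], [2, 3]],
    "linked_groups": [[0, 1]],
}

print(unanswered_questions(toy))
# → ['Fax No.']
```

Running a check like this over a split gives a quick view of which unanswered questions are genuinely blank form fields and which may be missing links.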

Citation#

BibTeX:

@misc{zagami_helm_2022,
  title = {FUNSD+: A larger and revised FUNSD dataset},
  author = {Zagami, Davide and Helm, Christopher},
  year = {2022},
  month = {Oct},
  journal = {FUNSD+ | A larger and revised FUNSD dataset},
  publisher = {Helm & Nagel GmbH},
  url = {https://konfuzio.com/funsd-plus/}
}

@inproceedings{jaume2019funsd,
  title={FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents},
  author={Jaume, Guillaume and Ekenel, Hazim Kemal and Thiran, Jean-Philippe},
  booktitle={2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)},
  volume={2},
  pages={1--6},
  year={2019},
  organization={IEEE}
}

APA:

Zagami, D., & Helm, C. (2022, October 18). FUNSD+: A larger and revised FUNSD dataset. Helm & Nagel GmbH. https://konfuzio.com/funsd-plus/

Jaume, G., Ekenel, H. K., & Thiran, J. P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol. 2, pp. 1-6). IEEE.

More Information#

Konfuzio Homepage: https://konfuzio.com/
Konfuzio Python SDK: https://github.com/konfuzio-ai/konfuzio-sdk
Interactive Demo: https://app.konfuzio.com/d/303962/
Original FUNSD Dataset: https://guillaumejaume.github.io/FUNSD/

Dataset Card Contact#

Konfuzio Contact: https://konfuzio.com/en/contact/
Dataset Issues: https://huggingface.co/datasets/konfuzio/funsd_plus/discussions