Paper Review - LongDocURL (ACL 2025)

Introduction

Published at ACL 2025, "LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating" addresses important limitations in document understanding evaluation and exposes fundamental gaps in how we test AI models on complex documents.

The research, conducted by a collaborative team from the Chinese Academy of Sciences and Alibaba Group, addresses a clear gap in the field: while large vision-language models (LVLMs) have made rapid progress in document understanding, existing benchmarks remain inadequate for evaluating their capabilities on real-world, long-form documents.

The Problem: Why Current Benchmarks Fall Short

Picture this: You're trying to evaluate how well AI models understand complex documents, but your benchmark only tests single-page documents. It's like testing someone's ability to read a novel by only showing them the cover page. That's exactly the problem the researchers identified with existing document understanding benchmarks.

Most current benchmarks, such as DocVQA, only handle single-page documents, and many models already exceed 95% accuracy on them. Meanwhile, real-world documents are often 50-150 pages long, filled with complex layouts, tables, figures, and cross-references that must be understood together.

Enter LongDocURL: A New Approach

LongDocURL (Long Document Understanding, Reasoning, and Locating) is a benchmark that addresses these limitations. It provides a comprehensive test for document AI models.

What Makes It Special?

Scale That Matters:

  • 2,325 question-answer pairs (not just 3 samples!)
  • 396 unique documents covering more than 33,000 pages
  • 8 diverse document types: research papers, manuals, books, theses, project proposals, presentations, meeting minutes, and work summaries
  • Average 89 pages per document (range: 50-150 pages)
  • 20 distinct subtask categories for fine-grained evaluation

Three Core Capabilities:

| Primary Task | Composition Criteria | Subtasks | Characteristics |
|---|---|---|---|
| Understanding | Single/multi-page, single/cross-element understanding | 8 | Split by evidence type: text, layout, table, figure |
| Reasoning | Numerical calculation and statistical summary, classified by page count | 8 | Built on the same evidence-type split as Understanding |
| Locating | Cross-element relationship reasoning between different element types | 4 | Classified by element combination (text+table, figure+layout, etc.) |

The Innovation: Cross-Element Locating

Here's where LongDocURL gets really interesting. Traditional benchmarks focus on single elements (just text, just tables, just figures). But real documents require understanding relationships between elements.

For example, imagine a question like: "Which section title best matches this paragraph?" This requires the model to:

  1. Understand the paragraph content
  2. Scan through multiple section titles
  3. Analyze the semantic relationship
  4. Make a cross-element connection

This is exactly what LongDocURL tests with its innovative cross-element locating tasks.

Task Examples: Input, Output, and Evidence

| Task Type | Input (Question) | Output (Answer) | Evidence (Source) |
|---|---|---|---|
| Understanding | What section best matches the description: <description>... | IV.2.1.1. Main categories of scientific purposes | Pages 29–30, section title + paragraph |
| Reasoning | Which scientific area used the most animals in 2018? (with options A-D) | A. Animal diseases and disorders | Page 37 figure, page 38 table |
| Locating | Where is the figure showing usage located? | Page 24, Figure 1 | Page 24 figure location information |
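To make the task format concrete, here is a sketch of what a single QA record could look like. The field names and values below are illustrative assumptions based on the statistics in this review, not the released dataset's actual schema:

```python
# Hypothetical sketch of a LongDocURL-style QA record (field names are assumptions).
sample = {
    "doc_id": "example_doc_0001",              # illustrative identifier
    "task": "Reasoning",                       # Understanding / Reasoning / Locating
    "subtask": "MP_Figure_Reasoning",          # one of the 20 subtask categories
    "question": "Which scientific area used the most animals in 2018?",
    "options": ["A. Animal diseases and disorders", "B. ...", "C. ...", "D. ..."],
    "answer": "A. Animal diseases and disorders",
    "answer_format": "String",                 # String / Integer / Float / List / None
    "evidence_pages": [37, 38],                # pages holding the supporting evidence
    "evidence_elements": ["Figure", "Table"],  # element types the answer depends on
}
```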

The Technical Magic: How They Built It

A Semi-Automated Pipeline That Actually Works

Building a benchmark of this scale entirely by hand would be prohibitively expensive. The researchers instead created a four-stage, semi-automated pipeline:

Stage 1: Extract & Filter

  • Crawled 200k PDF documents from CommonCrawl
  • Filtered for documents with 50-150 pages and English content (a minimal filter sketch follows this list)
  • Used GPT-4o to classify document types
  • Final selection: 396 documents
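The released pipeline code isn't reproduced here, but the page-count and language filter described above is easy to picture. Below is a minimal sketch assuming PyMuPDF and langdetect are available; the GPT-4o document-type classification step is omitted:

```python
# Minimal sketch of the Stage 1 filter (50-150 pages, English); illustrative only.
import fitz                    # PyMuPDF
from langdetect import detect  # lightweight language identification

def keep_document(pdf_path: str) -> bool:
    doc = fitz.open(pdf_path)
    if not (50 <= doc.page_count <= 150):
        return False
    # Sample text from the first few pages to guess the document language.
    text = " ".join(doc[i].get_text() for i in range(min(3, doc.page_count)))
    return bool(text.strip()) and detect(text) == "en"
```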

Stage 2: QA Generation

  • Converted PDFs to "text-type-bbox" triples (text content, element type, bounding box); a rough extraction sketch follows this list
  • Used multi-step iterative querying with GPT-4o
  • Generated questions across all 20 sub-tasks
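The "text-type-bbox" representation can be approximated with PyMuPDF's block extraction, as in the rough sketch below. The coarse text/figure typing is an assumption; the authors' parser distinguishes finer element types (layout, table, figure):

```python
# Rough approximation of "text-type-bbox" triples using PyMuPDF blocks.
import fitz

def page_triples(page):
    triples = []
    # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type) per block.
    for x0, y0, x1, y1, text, _block_no, block_type in page.get_text("blocks"):
        elem_type = "figure" if block_type == 1 else "text"  # very coarse typing
        triples.append((text.strip(), elem_type, (x0, y0, x1, y1)))
    return triples
```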

Stage 3: Automated Verification

  • Checked task relevance, format correctness, and faithfulness
  • Identified and flagged problematic samples
  • Some tasks had up to 75% negative samples initially!

Stage 4: Human Verification

  • Human annotators reviewed automated results
  • Cross-checked each other's work
  • Recovered negative samples where possible

The Smart Classification System

Each question gets classified along multiple dimensions:

  • Task Type: Understanding (53.5%), Reasoning (16.6%), Locating (29.9%)
  • Evidence Elements: Text (42.8%), Table (37.5%), Layout (33.5%), Figure (23.9%)
  • Page Scope: Single-page (47%) vs Multi-page (53%)
  • Element Scope: Single-element (62.9%) vs Cross-element (37.1%)

This creates 20 distinct sub-tasks, allowing for granular analysis of model capabilities.
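Since the subtask label is just the combination of these dimensions, it can be derived mechanically. The sketch below uses illustrative names, not the dataset's actual schema:

```python
# Sketch: deriving the 20 subtask labels from the classification dimensions above.
from dataclasses import dataclass

@dataclass
class QAMeta:
    task: str          # "Understanding", "Reasoning", or "Locating"
    element: str       # main evidence element: "Text", "Layout", "Table", "Figure"
    multi_page: bool   # single-page vs. multi-page evidence
    pair: str = ""     # element pair for Locating, e.g. "Figure_Table"

def subtask_label(meta: QAMeta) -> str:
    if meta.task == "Locating":
        return f"{meta.pair}_Locating"            # 4 element-pair subtasks
    scope = "MP" if meta.multi_page else "SP"
    return f"{scope}_{meta.element}_{meta.task}"  # 2 scopes x 4 elements x 2 tasks = 16

# Example: subtask_label(QAMeta("Reasoning", "Figure", True)) -> "MP_Figure_Reasoning"
```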

The Data: What's Actually Inside

Question Type Distribution

The dataset includes 9 different question types, with extract being the most common (53.5%):

  • extract: 1,243 questions - Direct information extraction
  • extract_fig2tab: 231 questions - Figure to table conversion
  • topic2title: 201 questions - Topic to title matching
  • calculate: 145 questions - Numerical calculations
  • summary2title: 137 questions - Summary to title matching
  • summary2tab: 126 questions - Summary to table matching
  • count: 117 questions - Counting tasks
  • compare: 112 questions - Comparison tasks
  • summarize: 13 questions - Summarization tasks

Answer Format Distribution

  • String: 1,882 answers (40.5%) - Text responses
  • List: 1,514 answers (32.6%) - Multiple items
  • Integer: 862 answers (18.5%) - Whole numbers
  • Float: 370 answers (8.0%) - Decimal numbers
  • None: 22 answers (0.5%) - Unanswerable questions

Evidence Sources

  • Text: 1,988 (42.8%) - Pure text content
  • Table: 1,742 (37.5%) - Tabular data
  • Layout: 1,558 (33.5%) - Headers, titles, footers
  • Figure: 1,112 (23.9%) - Charts and images

How They Tested: The 3-Stage Evaluation Protocol

| Stage | Process | Input Example | Output Example |
|---|---|---|---|
| 1. Response Generation | Models generate free-form answers | "Which scientific area used the most animals in 2018?" | "I think the answer is Animal diseases and disorders, based on the chart on page 37." |
| 2. Answer Extraction | GPT-4o extracts a concise final answer | "I think the answer is the USA, based on the second paragraph." | "USA" |
| 3. Score Calculation | Compare with ground truth using generalized accuracy | Extracted answer vs. ground truth | 1 (correct) or 0 (incorrect) |

Answer Type Scoring Rules

| Answer Type | Scoring Rule | Example |
|---|---|---|
| String | Exact or relaxed match | "USA" vs "U.S.A." |
| Integer | Numeric match | 2020 |
| Float | Numeric match with tolerance | 16.8 |
| List | Set or subset match | ["Asia", "Africa"] |
| None | Null/empty response for unanswerable questions | null |
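A minimal sketch of these per-type rules looks roughly like this. The paper's generalized accuracy metric may differ in details (tolerances, list handling), so treat it as an approximation rather than the official scorer:

```python
# Approximate per-type scoring, following the table above (not the official scorer).
def score(pred, gold, answer_type: str, tol: float = 0.01) -> int:
    try:
        if answer_type == "None":
            return int(pred in (None, "", "null", "None"))
        if answer_type == "Integer":
            return int(int(float(pred)) == int(float(gold)))
        if answer_type == "Float":
            gold_f = float(gold)
            return int(abs(float(pred) - gold_f) <= tol * max(1.0, abs(gold_f)))
        if answer_type == "List":
            norm = lambda xs: {str(x).strip().lower() for x in xs}
            return int(norm(pred) == norm(gold) or norm(gold) <= norm(pred))
        # String: relaxed match ignoring case and punctuation ("USA" vs "U.S.A.").
        clean = lambda s: "".join(ch for ch in str(s).lower() if ch.isalnum())
        return int(clean(pred) == clean(gold))
    except (TypeError, ValueError):
        return 0
```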

Input Methods: Image vs Text

| Input Method | Description | Performance (GPT-4o score) | Pros | Cons |
|---|---|---|---|---|
| Image input (cut-off) | 30 continuous pages around the answer evidence (LongDocURL default) | 64.4 | Preserves visual structure: tables, figures, layout | Requires more memory, slower inference |
| OCR text (PyMuPDF) | Plain text extraction | 36.5 | Lightweight, faster | Loses table/chart layout and formatting |
| OCR text (Docmind) | Markdown-aware OCR | 66.2 | Retains some structure (markdown tables) | Still loses some visual cues (charts, figures) |

Why Image Input Wins

The structural information lost in plain OCR text is significant (see the sketch after this list):

  • Table formatting disappears
  • Figure-text relationships break
  • Layout cues are lost
  • Cross-element reasoning becomes nearly impossible
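To see why, compare what the two pipelines actually feed the model. The sketch below uses PyMuPDF for both paths (the Docmind markdown-aware OCR path is not shown); it is illustrative, not the benchmark's own preprocessing code:

```python
# Illustrative comparison of the two input pipelines using PyMuPDF.
import fitz

def page_inputs(pdf_path: str, page_numbers: list[int], dpi: int = 144):
    doc = fitz.open(pdf_path)
    images, texts = [], []
    for n in page_numbers:
        page = doc[n]
        images.append(page.get_pixmap(dpi=dpi).tobytes("png"))  # rendered page image
        texts.append(page.get_text())  # plain text layer: tables and layout collapse to lines
    return images, texts
```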

The 20 Subtask Categories: A Deep Dive

Understanding Tasks (8 subtasks)

  • MP_Text_Understanding: 443 questions - Multi-page text extraction
  • SP_Table_Understanding: 263 questions - Single-page table parsing
  • SP_Text_Understanding: 259 questions - Single-page text extraction
  • MP_Figure_Understanding: 174 questions - Multi-page figure analysis
  • MP_Layout_Understanding: 172 questions - Multi-page layout parsing
  • MP_Table_Understanding: 115 questions - Multi-page table analysis
  • SP_Figure_Understanding: 94 questions - Single-page figure analysis
  • SP_Layout_Understanding: 91 questions - Single-page layout parsing

Reasoning Tasks (8 subtasks)

  • MP_Text_Reasoning: 115 questions - Multi-page text calculations
  • SP_Table_Reasoning: 98 questions - Single-page table calculations
  • MP_Figure_Reasoning: 85 questions - Multi-page figure calculations
  • MP_Table_Reasoning: 69 questions - Multi-page table calculations
  • MP_Layout_Reasoning: 40 questions - Multi-page layout calculations
  • SP_Text_Reasoning: 40 questions - Single-page text calculations
  • SP_Figure_Reasoning: 28 questions - Single-page figure calculations
  • SP_Layout_Reasoning: 12 questions - Single-page layout calculations

Locating Tasks (4 subtasks)

  • Figure_Table_Locating: 231 questions - Finding figure-table relationships
  • Cross_Title_Locating: 201 questions - Cross-referencing titles
  • Para_Title_Locating: 137 questions - Paragraph-title matching
  • Cross_Table_Locating: 126 questions - Cross-table relationships

Open-Source Model Reality Check

Context Length Limitations

| Model | Context Length | Max Pages (Image) | Max Pages (Text) | Notes |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 32K tokens | ~8-12 pages | ~16 pages | Best open-source performance |
| LLaVA-1.5-7B | 4K-32K tokens | ~4-8 pages | ~8-16 pages | Variable context length |
| LLaVA-Next-7B | 32K tokens | ~8-12 pages | ~16 pages | Improved version |
| Llama-3-8B | 8K tokens | N/A (text-only) | ~2-3 pages | Text-only model |
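A quick back-of-the-envelope helper makes these limits concrete. The per-page token costs below are assumptions for illustration only; real costs depend on the model's image tokenizer, resolution, and page density:

```python
# Rough page-budget estimate; TOKENS_PER_PAGE values are assumed, not measured.
TOKENS_PER_PAGE = {"image": 2500, "ocr_text": 1800}

def max_pages(context_tokens: int, mode: str, reserved_for_answer: int = 1024) -> int:
    budget = context_tokens - reserved_for_answer
    return max(0, budget // TOKENS_PER_PAGE[mode])

# Example: max_pages(32_000, "image") -> 12, roughly matching the "~8-12 pages" above.
```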

Practical Solutions for Open-Source Models

| Solution | Description | Pros | Cons |
|---|---|---|---|
| Chunking | Split a 30-page input into 8-page chunks | Handles long documents within context limits | May lose cross-chunk context |
| Selective processing | Focus on evidence pages only (e.g., page 54 from a 42-71 range) | Efficient, targeted processing | May miss relevant context |
| Hybrid approach | OCR text + selective image processing | Balances performance and efficiency | Complex implementation |
| Sliding window | Process overlapping 8-page windows | Maintains some context continuity | Increased computational cost |

Chunking Strategy Example

For a 30-page document (pages 42-71) with evidence on page 54, two strategies are shown below, followed by a short code sketch:

Option 1: Simple Chunking

  • Chunk 1: Pages 42-49 (8 pages)
  • Chunk 2: Pages 50-57 (8 pages) ← Evidence page 54
  • Chunk 3: Pages 58-65 (8 pages)
  • Chunk 4: Pages 66-71 (6 pages)

Option 2: Evidence-Centered Chunking

  • Focus chunk: Pages 50-57 (8 pages) ← Contains evidence page 54
  • Context chunks: Pages 42-49, 58-65 for additional context
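Here is a small sketch of both options under the page numbers used above; it illustrates the strategy and is not code from the benchmark:

```python
# Sketch of simple vs. evidence-centered chunking for pages 42-71, evidence on page 54.
def simple_chunks(first: int, last: int, size: int = 8) -> list[list[int]]:
    return [list(range(p, min(p + size, last + 1))) for p in range(first, last + 1, size)]

def evidence_centered_chunk(evidence_page: int, first: int, last: int, size: int = 8) -> list[int]:
    start = max(first, min(evidence_page - size // 2, last - size + 1))
    return list(range(start, min(start + size, last + 1)))

print(simple_chunks(42, 71))               # four chunks: 42-49, 50-57, 58-65, 66-71
print(evidence_centered_chunk(54, 42, 71)) # [50, 51, 52, 53, 54, 55, 56, 57]
```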

The Performance Reality Check

After testing 26 different model configurations, here's what they found:

The Winner (Sort Of):

  • GPT-4o scored 64.5 points - the only model to meet the "passing standard"
  • But even this top performer has room for improvement

The Open-Source Reality:

  • Best open-source model: Qwen2-VL with just 30.6 points
  • Most open-source models with <13B parameters scored below 20 points
  • That's a 2x performance gap between proprietary and open-source models

The Text vs Image Input Surprise:

  • Text-input models performed significantly worse than image-input models
  • Top LLM score trailed top LVLM score by about 30 points
  • Why? Because converting documents to plain text loses crucial structural information

Where Models Struggle Most

Document Structure Parsing:

  • Models scored highest on pure text questions
  • Lowest scores on table-related questions
  • This highlights a major weakness in document structure understanding

Cross-Element Relationships:

  • 37.1% of questions require understanding relationships between different elements
  • Models consistently struggle with these cross-element tasks
  • This is exactly where LongDocURL exposes current limitations

Multi-Page Reasoning:

  • Surprisingly, single-page questions were harder than multi-page questions
  • Why? Because multi-page questions often have more context to work with
  • But models like GPT-4o actually performed worse on multi-page locating tasks

The Error Analysis: What's Really Going Wrong?

The researchers conducted a detailed error analysis on 97 failed cases from GPT-4o. The results reveal the real bottlenecks:

Perceptual Errors (32.7%):

  • The biggest problem: models can't accurately recognize or parse document elements
  • Issues with heading hierarchies, figure-text correspondences
  • Complex document structures remain a major challenge

Reasoning Errors (16.8%):

  • Even when evidence is correctly identified, models fail at calculation and comparison
  • Shows that understanding ≠ reasoning

Format Inconsistency (20.6%):

  • Models give correct answers but in wrong formats
  • Example: "$50.2 million" vs "50,200,000" - same value, different surface format
  • Highlights the inflexibility of purely rule-based evaluation (a tolerant normalizer is sketched below)
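A tolerant normalizer illustrates how such format mismatches could be reconciled before scoring; this is a sketch of the general idea, not part of the paper's evaluation code:

```python
# Sketch: normalize "$50.2 million" and "50,200,000" to the same number.
import re

_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def normalize_number(answer: str):
    text = answer.lower().replace(",", "").replace("$", "").strip()
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*(thousand|million|billion)?", text)
    if not match:
        return None
    value = float(match.group(1)) * _SCALE.get(match.group(2), 1.0)
    return round(value, 2)  # round away float noise from the scale multiplication

# normalize_number("$50.2 million") == normalize_number("50,200,000") == 50200000.0
```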

Other Issues (29.9%):

  • Hallucinated evidence, irrelevant answers, incomplete evidence
  • Shows models sometimes "make up" information when they can't find it

Key Insights for Model Development

Multimodal Training Matters:

  • Qwen2-VL (multimodal) significantly outperformed its text-only variant
  • Extensive multimodal training strengthens both comprehension and generation

Human Feedback Helps:

  • LLaVA-OneVision-Chat (with DPO and human feedback) outperformed its base model
  • Direct Preference Optimization boosts generalization and reasoning

Layout Parsing is Critical:

  • Frequent failures due to poor layout analysis and table/chart parsing
  • Incorporating layout parsing into training could significantly improve performance
  • Multi-stage frameworks (parse structure first, then reason) might be the way forward

Why This Benchmark Changes Everything

The Scale Revolution

  • 10x more pages than existing benchmarks
  • 2x more samples than MMLongBench-Doc
  • 8 diverse document types vs single-type focus

The Complexity Revolution

  • 53% multi-page questions (vs single-page focus)
  • 37% cross-element questions (vs single-element focus)
  • Real reasoning tasks (vs simple extraction)

What This Means for the Future

For Researchers

This benchmark finally provides a realistic testbed for long document understanding. No more inflated performance numbers from easy single-page tasks. This is the real deal.

For Industry

Companies building document AI systems now have a proper benchmark to evaluate their models. The 2x performance gap between proprietary and open-source models is a clear call to action.

For Open Source

The poor performance of open-source models (best score: 30.6 vs GPT-4o's 64.5) shows there's massive room for improvement. This could drive significant investment in open-source document understanding research.

The Bottom Line

LongDocURL provides a comprehensive benchmark that exposes the current limits of document understanding, despite rapid progress in large vision-language models.

The fact that even GPT-4o scores only 64.5 points indicates that long-document understanding remains far from solved and that benchmarks of this kind are needed to measure real progress.

Key Takeaways:

  1. Cross-element relationships are important - 37% of questions require understanding relationships between different document elements
  2. Structure matters - Text-only models perform 30 points worse than vision-language models
  3. Open-source models need improvement - The 2x performance gap presents a research opportunity
  4. Perceptual errors are the biggest bottleneck - 33% of errors come from poor document element recognition

This benchmark provides a comprehensive evaluation framework for document understanding: models that perform well on LongDocURL are more likely to handle real-world document tasks effectively, which is exactly the kind of signal document AI research needs.


References:

  • GitHub Repository
  • arXiv Paper - "LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating"
  • ACL 2025 Proceedings - Presented at ACL 2025
  • Institutions: Chinese Academy of Sciences, Alibaba Group