Paper Review - LongDocURL (ACL 2025)

Introduction

Published at ACL 2025, "LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating" addresses important limitations in document understanding evaluation and exposes fundamental gaps in how we test AI models on complex documents.

The research, conducted by a collaborative team from the Chinese Academy of Sciences and Alibaba Group, addresses a clear gap in the field: while large vision-language models (LVLMs) have made rapid progress in document understanding, existing benchmarks remain inadequate for evaluating their capabilities on real-world, long-form documents.

The Problem: Why Current Benchmarks Fall Short

Picture this: You're trying to evaluate how well AI models understand complex documents, but your benchmark only tests single-page documents. It's like testing someone's ability to read a novel by only showing them the cover page. That's exactly the problem the researchers identified with existing document understanding benchmarks.

Most current benchmarks, such as DocVQA, only handle single-page documents, and many models already exceed 95% accuracy on them. Meanwhile, real-world documents are often 50-150 pages long, filled with complex layouts, tables, figures, and cross-references that must be understood together.

Enter LongDocURL: A New Approach

LongDocURL (Long Document Understanding, Reasoning, and Locating) is a benchmark that addresses these limitations. It provides a comprehensive test for document AI models.

What Makes It Special?

Scale That Matters:

  • 2,325 question-answer pairs (not just 3 samples!)
  • 396 unique documents covering more than 33,000 pages
  • 8 diverse document types: research papers, manuals, books, theses, project proposals, presentations, meeting minutes, and work summaries
  • Average 89 pages per document (range: 50-150 pages)
  • 20 distinct subtask categories for fine-grained evaluation

Three Core Capabilities:

| Primary Task | Composition Criteria | Subtasks | Characteristics |
|---|---|---|---|
| Understanding | Single/multi-page, single/cross-element understanding | 8 | Split by evidence type: text, layout, table, figure |
| Reasoning | Numerical calculation and statistical summary, classified by page count | 8 | Built on the same evidence-type split as Understanding |
| Locating | Cross-element relationship reasoning between different element types | 4 | Classified by element combination (text+table, figure+layout, etc.) |

The Innovation: Cross-Element Locating

Here's where LongDocURL gets really interesting. Traditional benchmarks focus on single elements (just text, just tables, just figures). But real documents require understanding relationships between elements.

For example, imagine a question like: "Which section title best matches this paragraph?" This requires the model to:

  1. Understand the paragraph content
  2. Scan through multiple section titles
  3. Analyze the semantic relationship
  4. Make a cross-element connection

This is exactly what LongDocURL tests with its innovative cross-element locating tasks.

Task Examples: Input, Output, and Evidence

| Task Type | Input (Question) | Output (Answer) | Evidence (Source) |
|---|---|---|---|
| Understanding | What section best matches the description: <description>... | IV.2.1.1. Main categories of scientific purposes | Pages 29–30, section title + paragraph |
| Reasoning | Which scientific area used the most animals in 2018? (with options A-D) | A. Animal diseases and disorders | Page 37 figure, page 38 table |
| Locating | Where is the figure showing usage located? | Page 24, Figure 1 | Page 24 figure location information |
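To make the task format concrete, here is a sketch of what a single QA record could look like. The field names and values below are illustrative assumptions based on the statistics in this review, not the released dataset's actual schema:

```python
# Hypothetical sketch of a LongDocURL-style QA record (field names are assumptions).
sample = {
    "doc_id": "example_doc_0001",              # illustrative identifier
    "task": "Reasoning",                       # Understanding / Reasoning / Locating
    "subtask": "MP_Figure_Reasoning",          # one of the 20 subtask categories
    "question": "Which scientific area used the most animals in 2018?",
    "options": ["A. Animal diseases and disorders", "B. ...", "C. ...", "D. ..."],
    "answer": "A. Animal diseases and disorders",
    "answer_format": "String",                 # String / Integer / Float / List / None
    "evidence_pages": [37, 38],                # pages holding the supporting evidence
    "evidence_elements": ["Figure", "Table"],  # element types the answer depends on
}
```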

The Technical Magic: How They Built It

A Semi-Automated Pipeline That Actually Works

Building a benchmark of this scale entirely by hand would be prohibitively expensive. The researchers instead created a four-stage, semi-automated pipeline:

Stage 1: Extract & Filter

  • Crawled 200k PDF documents from CommonCrawl
  • Filtered for documents with 50-150 pages and English content (a minimal filter sketch follows this list)
  • Used GPT-4o to classify document types
  • Final selection: 396 documents
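The released pipeline code isn't reproduced here, but the page-count and language filter described above is easy to picture. Below is a minimal sketch assuming PyMuPDF and langdetect are available; the GPT-4o document-type classification step is omitted:

```python
# Minimal sketch of the Stage 1 filter (50-150 pages, English); illustrative only.
import fitz                    # PyMuPDF
from langdetect import detect  # lightweight language identification

def keep_document(pdf_path: str) -> bool:
    doc = fitz.open(pdf_path)
    if not (50 <= doc.page_count <= 150):
        return False
    # Sample text from the first few pages to guess the document language.
    text = " ".join(doc[i].get_text() for i in range(min(3, doc.page_count)))
    return bool(text.strip()) and detect(text) == "en"
```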

Stage 2: QA Generation

  • Converted PDFs to "text-type-bbox" triples (text content, element type, bounding box); a rough extraction sketch follows this list
  • Used multi-step iterative querying with GPT-4o
  • Generated questions across all 20 sub-tasks
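The "text-type-bbox" representation can be approximated with PyMuPDF's block extraction, as in the rough sketch below. The coarse text/figure typing is an assumption; the authors' parser distinguishes finer element types (layout, table, figure):

```python
# Rough approximation of "text-type-bbox" triples using PyMuPDF blocks.
import fitz

def page_triples(page):
    triples = []
    # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type) per block.
    for x0, y0, x1, y1, text, _block_no, block_type in page.get_text("blocks"):
        elem_type = "figure" if block_type == 1 else "text"  # very coarse typing
        triples.append((text.strip(), elem_type, (x0, y0, x1, y1)))
    return triples
```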

Stage 3: Automated Verification

  • Checked task relevance, format correctness, and faithfulness
  • Identified and flagged problematic samples
  • Some tasks had up to 75% negative samples initially!

Stage 4: Human Verification

  • Human annotators reviewed automated results
  • Cross-checked each other's work
  • Recovered negative samples where possible

The Smart Classification System

Each question gets classified along multiple dimensions:

  • Task Type: Understanding (53.5%), Reasoning (16.6%), Locating (29.9%)
  • Evidence Elements: Text (42.8%), Table (37.5%), Layout (33.5%), Figure (23.9%)
  • Page Scope: Single-page (47%) vs Multi-page (53%)
  • Element Scope: Single-element (62.9%) vs Cross-element (37.1%)

This creates 20 distinct sub-tasks, allowing for granular analysis of model capabilities.
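Since the subtask label is just the combination of these dimensions, it can be derived mechanically. The sketch below uses illustrative names, not the dataset's actual schema:

```python
# Sketch: deriving the 20 subtask labels from the classification dimensions above.
from dataclasses import dataclass

@dataclass
class QAMeta:
    task: str          # "Understanding", "Reasoning", or "Locating"
    element: str       # main evidence element: "Text", "Layout", "Table", "Figure"
    multi_page: bool   # single-page vs. multi-page evidence
    pair: str = ""     # element pair for Locating, e.g. "Figure_Table"

def subtask_label(meta: QAMeta) -> str:
    if meta.task == "Locating":
        return f"{meta.pair}_Locating"            # 4 element-pair subtasks
    scope = "MP" if meta.multi_page else "SP"
    return f"{scope}_{meta.element}_{meta.task}"  # 2 scopes x 4 elements x 2 tasks = 16

# Example: subtask_label(QAMeta("Reasoning", "Figure", True)) -> "MP_Figure_Reasoning"
```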

The Data: What's Actually Inside

Question Type Distribution

The dataset includes 9 different question types, with extract being the most common (53.5%):

  • extract: 1,243 questions - Direct information extraction
  • extract_fig2tab: 231 questions - Figure to table conversion
  • topic2title: 201 questions - Topic to title matching
  • calculate: 145 questions - Numerical calculations
  • summary2title: 137 questions - Summary to title matching
  • summary2tab: 126 questions - Summary to table matching
  • count: 117 questions - Counting tasks
  • compare: 112 questions - Comparison tasks
  • summarize: 13 questions - Summarization tasks

Answer Format Distribution

  • String: 1,882 answers (40.5%) - Text responses
  • List: 1,514 answers (32.6%) - Multiple items
  • Integer: 862 answers (18.5%) - Whole numbers
  • Float: 370 answers (8.0%) - Decimal numbers
  • None: 22 answers (0.5%) - Unanswerable questions

Evidence Sources

  • Text: 1,988 (42.8%) - Pure text content
  • Table: 1,742 (37.5%) - Tabular data
  • Layout: 1,558 (33.5%) - Headers, titles, footers
  • Figure: 1,112 (23.9%) - Charts and images

How They Tested: The 3-Stage Evaluation Protocol

| Stage | Process | Input Example | Output Example |
|---|---|---|---|
| 1. Response Generation | Models generate free-form answers | "Which scientific area used the most animals in 2018?" | "I think the answer is Animal diseases and disorders, based on the chart on page 37." |
| 2. Answer Extraction | GPT-4o extracts a concise final answer | "I think the answer is the USA, based on the second paragraph." | "USA" |
| 3. Score Calculation | Compare with ground truth using generalized accuracy | Extracted answer vs. ground truth | 1 (correct) or 0 (incorrect) |

Answer Type Scoring Rules

| Answer Type | Scoring Rule | Example |
|---|---|---|
| String | Exact or relaxed match | "USA" vs "U.S.A." |
| Integer | Numeric match | 2020 |
| Float | Numeric match with tolerance | 16.8 |
| List | Set or subset match | ["Asia", "Africa"] |
| None | Null/empty response for unanswerable questions | null |
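A minimal sketch of these per-type rules looks roughly like this. The paper's generalized accuracy metric may differ in details (tolerances, list handling), so treat it as an approximation rather than the official scorer:

```python
# Approximate per-type scoring, following the table above (not the official scorer).
def score(pred, gold, answer_type: str, tol: float = 0.01) -> int:
    try:
        if answer_type == "None":
            return int(pred in (None, "", "null", "None"))
        if answer_type == "Integer":
            return int(int(float(pred)) == int(float(gold)))
        if answer_type == "Float":
            gold_f = float(gold)
            return int(abs(float(pred) - gold_f) <= tol * max(1.0, abs(gold_f)))
        if answer_type == "List":
            norm = lambda xs: {str(x).strip().lower() for x in xs}
            return int(norm(pred) == norm(gold) or norm(gold) <= norm(pred))
        # String: relaxed match ignoring case and punctuation ("USA" vs "U.S.A.").
        clean = lambda s: "".join(ch for ch in str(s).lower() if ch.isalnum())
        return int(clean(pred) == clean(gold))
    except (TypeError, ValueError):
        return 0
```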

Input Methods: Image vs Text

| Input Method | Description | Performance (GPT-4o score) | Pros | Cons |
|---|---|---|---|---|
| Image input (cut-off) | 30 continuous pages around the answer evidence (LongDocURL default) | 64.4 | Preserves visual structure: tables, figures, layout | Requires more memory, slower inference |
| OCR text (PyMuPDF) | Plain text extraction | 36.5 | Lightweight, faster | Loses table/chart layout and formatting |
| OCR text (Docmind) | Markdown-aware OCR | 66.2 | Retains some structure (markdown tables) | Still loses some visual cues (charts, figures) |

Why Image Input Wins

The structural information lost in plain OCR text is significant (see the sketch after this list):

  • Table formatting disappears
  • Figure-text relationships break
  • Layout cues are lost
  • Cross-element reasoning becomes nearly impossible
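To see why, compare what the two pipelines actually feed the model. The sketch below uses PyMuPDF for both paths (the Docmind markdown-aware OCR path is not shown); it is illustrative, not the benchmark's own preprocessing code:

```python
# Illustrative comparison of the two input pipelines using PyMuPDF.
import fitz

def page_inputs(pdf_path: str, page_numbers: list[int], dpi: int = 144):
    doc = fitz.open(pdf_path)
    images, texts = [], []
    for n in page_numbers:
        page = doc[n]
        images.append(page.get_pixmap(dpi=dpi).tobytes("png"))  # rendered page image
        texts.append(page.get_text())  # plain text layer: tables and layout collapse to lines
    return images, texts
```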

The 20 Subtask Categories: A Deep Dive

Understanding Tasks (8 subtasks)

  • MP_Text_Understanding: 443 questions - Multi-page text extraction
  • SP_Table_Understanding: 263 questions - Single-page table parsing
  • SP_Text_Understanding: 259 questions - Single-page text extraction
  • MP_Figure_Understanding: 174 questions - Multi-page figure analysis
  • MP_Layout_Understanding: 172 questions - Multi-page layout parsing
  • MP_Table_Understanding: 115 questions - Multi-page table analysis
  • SP_Figure_Understanding: 94 questions - Single-page figure analysis
  • SP_Layout_Understanding: 91 questions - Single-page layout parsing

Reasoning Tasks (8 subtasks)

  • MP_Text_Reasoning: 115 questions - Multi-page text calculations
  • SP_Table_Reasoning: 98 questions - Single-page table calculations
  • MP_Figure_Reasoning: 85 questions - Multi-page figure calculations
  • MP_Table_Reasoning: 69 questions - Multi-page table calculations
  • MP_Layout_Reasoning: 40 questions - Multi-page layout calculations
  • SP_Text_Reasoning: 40 questions - Single-page text calculations
  • SP_Figure_Reasoning: 28 questions - Single-page figure calculations
  • SP_Layout_Reasoning: 12 questions - Single-page layout calculations

Locating Tasks (4 subtasks)

  • Figure_Table_Locating: 231 questions - Finding figure-table relationships
  • Cross_Title_Locating: 201 questions - Cross-referencing titles
  • Para_Title_Locating: 137 questions - Paragraph-title matching
  • Cross_Table_Locating: 126 questions - Cross-table relationships

Open-Source Model Reality Check

Context Length Limitations

| Model | Context Length | Max Pages (Image) | Max Pages (Text) | Notes |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 32K tokens | ~8-12 pages | ~16 pages | Best open-source performance |
| LLaVA-1.5-7B | 4K-32K tokens | ~4-8 pages | ~8-16 pages | Variable context length |
| LLaVA-Next-7B | 32K tokens | ~8-12 pages | ~16 pages | Improved version |
| Llama-3-8B | 8K tokens | N/A (text-only) | ~2-3 pages | Text-only model |
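A quick back-of-the-envelope helper makes these limits concrete. The per-page token costs below are assumptions for illustration only; real costs depend on the model's image tokenizer, resolution, and page density:

```python
# Rough page-budget estimate; TOKENS_PER_PAGE values are assumed, not measured.
TOKENS_PER_PAGE = {"image": 2500, "ocr_text": 1800}

def max_pages(context_tokens: int, mode: str, reserved_for_answer: int = 1024) -> int:
    budget = context_tokens - reserved_for_answer
    return max(0, budget // TOKENS_PER_PAGE[mode])

# Example: max_pages(32_000, "image") -> 12, roughly matching the "~8-12 pages" above.
```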

Practical Solutions for Open-Source Models

| Solution | Description | Pros | Cons |
|---|---|---|---|
| Chunking | Split a 30-page input into 8-page chunks | Handles long documents within context limits | May lose cross-chunk context |
| Selective processing | Focus on evidence pages only (e.g., page 54 from a 42-71 range) | Efficient, targeted processing | May miss relevant context |
| Hybrid approach | OCR text + selective image processing | Balances performance and efficiency | Complex implementation |
| Sliding window | Process overlapping 8-page windows | Maintains some context continuity | Increased computational cost |

Chunking Strategy Example

For a 30-page document (pages 42-71) with evidence on page 54, two strategies are shown below, followed by a short code sketch:

Option 1: Simple Chunking

  • Chunk 1: Pages 42-49 (8 pages)
  • Chunk 2: Pages 50-57 (8 pages) ← Evidence page 54
  • Chunk 3: Pages 58-65 (8 pages)
  • Chunk 4: Pages 66-71 (6 pages)

Option 2: Evidence-Centered Chunking

  • Focus chunk: Pages 50-57 (8 pages) ← Contains evidence page 54
  • Context chunks: Pages 42-49, 58-65 for additional context
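Here is a small sketch of both options under the page numbers used above; it illustrates the strategy and is not code from the benchmark:

```python
# Sketch of simple vs. evidence-centered chunking for pages 42-71, evidence on page 54.
def simple_chunks(first: int, last: int, size: int = 8) -> list[list[int]]:
    return [list(range(p, min(p + size, last + 1))) for p in range(first, last + 1, size)]

def evidence_centered_chunk(evidence_page: int, first: int, last: int, size: int = 8) -> list[int]:
    start = max(first, min(evidence_page - size // 2, last - size + 1))
    return list(range(start, min(start + size, last + 1)))

print(simple_chunks(42, 71))               # four chunks: 42-49, 50-57, 58-65, 66-71
print(evidence_centered_chunk(54, 42, 71)) # [50, 51, 52, 53, 54, 55, 56, 57]
```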

The Performance Reality Check

After testing 26 different model configurations, here's what they found:

The Winner (Sort Of):

  • GPT-4o scored 64.5 points - the only model to meet the "passing standard"
  • But even this top performer has room for improvement

The Open-Source Reality:

  • Best open-source model: Qwen2-VL with just 30.6 points
  • Most open-source models with <13B parameters scored below 20 points
  • That's a 2x performance gap between proprietary and open-source models

The Text vs Image Input Surprise:

  • Text-input models performed significantly worse than image-input models
  • Top LLM score trailed top LVLM score by about 30 points
  • Why? Because converting documents to plain text loses crucial structural information

Where Models Struggle Most

Document Structure Parsing:

  • Models scored highest on pure text questions
  • Lowest scores on table-related questions
  • This highlights a major weakness in document structure understanding

Cross-Element Relationships:

  • 37.1% of questions require understanding relationships between different elements
  • Models consistently struggle with these cross-element tasks
  • This is exactly where LongDocURL exposes current limitations

Multi-Page Reasoning:

  • Surprisingly, single-page questions were harder than multi-page questions
  • Why? Because multi-page questions often have more context to work with
  • But models like GPT-4o actually performed worse on multi-page locating tasks

The Error Analysis: What's Really Going Wrong?

The researchers conducted a detailed error analysis on 97 failed cases from GPT-4o. The results reveal the real bottlenecks:

Perceptual Errors (32.7%):

  • The biggest problem: models can't accurately recognize or parse document elements
  • Issues with heading hierarchies, figure-text correspondences
  • Complex document structures remain a major challenge

Reasoning Errors (16.8%):

  • Even when evidence is correctly identified, models fail at calculation and comparison
  • Shows that understanding ≠ reasoning

Format Inconsistency (20.6%):

  • Models give correct answers but in wrong formats
  • Example: "$50.2 million" vs "50,200,000" - same value, different surface format
  • Highlights the inflexibility of purely rule-based evaluation (a tolerant normalizer is sketched below)
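A tolerant normalizer illustrates how such format mismatches could be reconciled before scoring; this is a sketch of the general idea, not part of the paper's evaluation code:

```python
# Sketch: normalize "$50.2 million" and "50,200,000" to the same number.
import re

_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def normalize_number(answer: str):
    text = answer.lower().replace(",", "").replace("$", "").strip()
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*(thousand|million|billion)?", text)
    if not match:
        return None
    value = float(match.group(1)) * _SCALE.get(match.group(2), 1.0)
    return round(value, 2)  # round away float noise from the scale multiplication

# normalize_number("$50.2 million") == normalize_number("50,200,000") == 50200000.0
```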

Other Issues (29.9%):

  • Hallucinated evidence, irrelevant answers, incomplete evidence
  • Shows models sometimes "make up" information when they can't find it

Key Insights for Model Development

Multimodal Training Matters:

  • Qwen2-VL (multimodal) significantly outperformed its text-only variant
  • Extensive multimodal training strengthens both comprehension and generation

Human Feedback Helps:

  • LLaVA-OneVision-Chat (with DPO and human feedback) outperformed its base model
  • Direct Preference Optimization boosts generalization and reasoning

Layout Parsing is Critical:

  • Frequent failures due to poor layout analysis and table/chart parsing
  • Incorporating layout parsing into training could significantly improve performance
  • Multi-stage frameworks (parse structure first, then reason) might be the way forward

Why This Benchmark Changes Everything

The Scale Revolution

  • 10x more pages than existing benchmarks
  • 2x more samples than MMLongBench-Doc
  • 8 diverse document types vs single-type focus

The Complexity Revolution

  • 53% multi-page questions (vs single-page focus)
  • 37% cross-element questions (vs single-element focus)
  • Real reasoning tasks (vs simple extraction)

What This Means for the Future

For Researchers

This benchmark finally provides a realistic testbed for long document understanding. No more inflated performance numbers from easy single-page tasks. This is the real deal.

For Industry

Companies building document AI systems now have a proper benchmark to evaluate their models. The 2x performance gap between proprietary and open-source models is a clear call to action.

For Open Source

The poor performance of open-source models (best score: 30.6 vs GPT-4o's 64.5) shows there's massive room for improvement. This could drive significant investment in open-source document understanding research.

The Bottom Line

LongDocURL provides a comprehensive benchmark that exposes the current limits of document understanding, despite rapid progress in large vision-language models.

The fact that even GPT-4o scores only 64.5 points indicates that long-document understanding remains far from solved and that benchmarks of this kind are needed to measure real progress.

Key Takeaways:

  1. Cross-element relationships are important - 37% of questions require understanding relationships between different document elements
  2. Structure matters - Text-only models perform 30 points worse than vision-language models
  3. Open-source models need improvement - The 2x performance gap presents a research opportunity
  4. Perceptual errors are the biggest bottleneck - 33% of errors come from poor document element recognition

This benchmark provides a comprehensive evaluation framework for document understanding: models that perform well on LongDocURL are more likely to handle real-world document tasks effectively, which is exactly the kind of signal document AI research needs.


References:

  • GitHub Repository
  • arXiv Paper - "LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating"
  • ACL 2025 Proceedings - Presented at ACL 2025
  • Institutions: Chinese Academy of Sciences, Alibaba Group