GT-Free OCR Metrics — Dataset Collection

Overview

GT-Free OCR Metrics is a reference-free evaluation framework for OCR systems. Instead of comparing OCR output against manually transcribed ground truth, the pipeline renders the OCR output back into an image and measures visual similarity against the original page scan — no ground-truth text required at test time.

The framework is validated on OmniDocBench (1 355 real-world document pages, English and Chinese, 9 document categories: PPT2PDF, academic literature, book, colorful textbook, exam paper, magazine, newspaper, note, research report) by correlating reference-free metric scores with reference-based edit distance and TEDS. The best composite method achieves Spearman ρ = 0.494 (mean across 5 OCR-output variants, p < 0.001 at N = 1 355), with a per-variant peak of ρ = 0.605 on the formula-only variant. The top method stacks per-element CLIP cosine (top-k percentile), DocSim (a learned document-similarity head), multi-scale SSIM, and OCR log-probability entropy.
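The validation step above boils down to a rank correlation between per-page metric scores and per-page reference-based scores. A minimal, dependency-free sketch of Spearman's ρ follows; the score values are made up for illustration, and the sign convention used for the reported ρ is an assumption (the framework may correlate against a quality score rather than a raw distance).

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Illustrative values only: a reference-free similarity score per page and
# the reference-based edit distance for the same pages.
metric_scores = [0.91, 0.72, 0.85, 0.40, 0.66]
edit_distances = [0.05, 0.30, 0.10, 0.75, 0.20]
rho = spearman_rho(metric_scores, edit_distances)
print(f"Spearman rho = {rho:.3f}")
```

With these toy values the correlation is negative, because a higher similarity score goes with a lower edit distance; correlating against 1 minus the edit distance flips the sign.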

The collection consists of two derived datasets (each shipped in a fast-download parquet edition alongside the raw per-page layout) and one pre-trained similarity model, grouped in the GT-Free OCR Metrics HuggingFace Collection.

How It Works

Original page scan
       |
       v
  Qwen OCR  -->  HTML with bounding boxes
                          |
        +-----------------+------------------+
        |  Mask non-target regions            |  Parse element bboxes
        v                                     v
  masked_original.png            Render --> reconstructed.png
        |                                     |
        +------------- Compare ---------------+
              (SSIM patches, LPIPS, CLIP,
               DINOv2, logprobs, coverage, ...)

Five OCR extraction variants cover different subsets of document elements (text, formula, table) with and without image masking.

Variant	Elements extracted	Pages
`all`	text + formula + table (images masked)	1 355
`all_no_mask`	text + formula + table (images unmasked)	1 355
`text`	text only	1 349
`formula`	formula only	200
`table`	table only	351
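The masking step in the diagram can be sketched as follows: every element whose type is not in the variant's target set has its bounding box painted gray. This is only an illustration on a tiny grayscale image stored as nested lists; the element dictionary layout, the `(x0, y0, x1, y1)` box convention, and the fill value 128 are assumptions, not the pipeline's actual format.

```python
def mask_non_target(image, elements, target_types, fill=128):
    """Return a copy of `image` with non-target element boxes set to `fill`.

    image        : list of rows, each a list of grayscale pixel values
    elements     : list of {"type": str, "bbox": (x0, y0, x1, y1)} dicts,
                   with x1 / y1 exclusive (assumed convention)
    target_types : element types that should stay visible
    """
    masked = [row[:] for row in image]  # copy so the input stays untouched
    height = len(masked)
    width = len(masked[0]) if masked else 0
    for element in elements:
        if element["type"] in target_types:
            continue
        x0, y0, x1, y1 = element["bbox"]
        # Clamp the box to the image so out-of-range boxes are harmless.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                masked[y][x] = fill
    return masked


# A 4x4 white page with one text block and one image block.
page = [[255] * 4 for _ in range(4)]
page_elements = [
    {"type": "text", "bbox": (0, 0, 2, 2)},
    {"type": "image", "bbox": (2, 2, 4, 4)},
]
# Corresponds loosely to the `all` variant: images are masked out.
masked_page = mask_non_target(page, page_elements, {"text", "formula", "table"})
```

Note that this sketch does not handle overlapping boxes: a non-target box that overlaps a target element would gray out part of it.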

Datasets & Model

Dataset

Render-and-Compare Pairs + DocSim Triplets

For each of the 1 355 OmniDocBench pages, the dataset provides a masked_original PNG (the original page scan with non-target document elements grayed out) and a reconstructed PNG (the OCR output rendered back into an image via HTML). The HTML output and per-element bounding-box JSON (text / formula / table) are also included. Five element-subset variants are released as separate configs: ocr_all (text, formula, and table regions rendered; image regions masked), ocr_all_no_mask (the same elements without image masking), ocr_text (only text regions rendered, everything else masked), ocr_table (only table regions rendered), and ocr_formula (only formula regions rendered). The pairs are designed for training and evaluating reference-free visual similarity metrics for OCR quality.

A sixth config, docsim_triplets, provides 20 280 anchor / positive / negative triplets (19 266 train / 1 014 validation) used to train the DocSim LoRA model below. Each triplet is a JSONL record whose anchor_path, positive_path, and negative_path fields point to images within the five variant configs of this same dataset (anchor = a masked_original.png; positive = a reconstructed.png with low text edit distance to the GT; negative = a reconstructed.png with high text edit distance). Triplets are formed both same-page (best vs. worst OCR variant of one page) and cross-page; supervision is by per-page text edit distance against the OmniDocBench ground truth.
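As an illustration of the same-page case, the sketch below picks the best and worst OCR variant of one page by text edit distance and emits a JSONL record with the three path fields named above. The directory layout in the paths, the choice of which variant supplies the anchor, and the `min_gap` threshold are hypothetical; only the field names come from the dataset description.

```python
import json


def same_page_triplet(page_id, edit_distance_by_variant, min_gap=0.0):
    """Build one same-page triplet record, or None if no usable pair exists.

    edit_distance_by_variant maps a config name (e.g. "ocr_text") to the
    text edit distance of that variant's OCR output against the ground truth.
    """
    if len(edit_distance_by_variant) < 2:
        return None
    best = min(edit_distance_by_variant, key=edit_distance_by_variant.get)
    worst = max(edit_distance_by_variant, key=edit_distance_by_variant.get)
    gap = edit_distance_by_variant[worst] - edit_distance_by_variant[best]
    if gap <= min_gap:
        return None  # variants are too similar to give a training signal
    return {
        # Assumption: the anchor is the masked original of the best variant.
        "anchor_path": f"{best}/{page_id}/masked_original.png",
        "positive_path": f"{best}/{page_id}/reconstructed.png",
        "negative_path": f"{worst}/{page_id}/reconstructed.png",
    }


# Illustrative edit distances for a single page.
distances = {"ocr_all": 0.12, "ocr_text": 0.40, "ocr_table": 0.05}
record = same_page_triplet("page_0001", distances)
jsonl_line = json.dumps(record)  # one line of a JSONL file
```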

The dataset is distributed in two formats: a parquet edition (64 zstd-compressed shards, ~9.4 GB, image bytes inline, image variants only) and a raw per-page-directory layout (image variants plus the docsim_triplets JSONL config).

Dataset

Qwen OCR Log-Probabilities

For each of the 1 355 OmniDocBench pages, the dataset records the full token-level log-probability stream emitted by Qwen3.5-122B-A10B during OCR inference (including top-N alternatives per token), together with a per-bounding-box aggregation derived from those streams (entropy, min / mean logprob, etc.). Combined with the rendered-image dataset, these provide an OCR-internal confidence signal that is statistically independent of visual similarity. This enables hybrid metrics that combine the two signals, as well as studies of where the OCR model is uncertain.

Distributed in two formats: a parquet edition (7 zstd-shards, ~115 MB, JSON inline) and a raw per-page-directory layout.
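The per-bounding-box aggregation described above can be sketched as follows. The token record layout (`logprob` for the emitted token, `top_logprobs` for the alternatives) is an assumption, and so is the choice to renormalize the top-N alternatives before computing entropy; the dataset's own aggregation may differ.

```python
import math


def token_entropy(top_logprobs):
    """Shannon entropy in nats over the top-N alternatives of one token.

    The top-N probabilities rarely sum to 1, so they are renormalized first
    (an assumption made for this sketch).
    """
    weights = [math.exp(lp) for lp in top_logprobs]
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def aggregate_bbox(tokens):
    """Summarize the tokens that fall inside one bounding box."""
    if not tokens:
        raise ValueError("a bounding box needs at least one token")
    logprobs = [t["logprob"] for t in tokens]
    entropies = [token_entropy(t["top_logprobs"]) for t in tokens]
    return {
        "mean_logprob": sum(logprobs) / len(logprobs),
        "min_logprob": min(logprobs),
        "mean_entropy": sum(entropies) / len(entropies),
    }


# Two illustrative tokens: one a coin flip between two alternatives,
# one emitted with full confidence.
bbox_tokens = [
    {"logprob": math.log(0.5), "top_logprobs": [math.log(0.5), math.log(0.5)]},
    {"logprob": 0.0, "top_logprobs": [0.0]},
]
bbox_stats = aggregate_bbox(bbox_tokens)
```

A box with a low minimum logprob or a high mean entropy marks a region where the OCR model was unsure, independently of how the rendered output looks.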

Model

DocSim LoRA

A pre-trained document-similarity head that extends frozen OpenCLIP ViT-B/32 (laion2b_s34b_b79k checkpoint) and DINOv2 ViT-B/14 backbones with rank-16 LoRA adapters (α 32, dropout 0.05) on the encoders' Q/V projections and a 1280→512→256 MLP projection head, for approximately 0.87 M trainable parameters on top of ~174 M frozen backbone parameters. The model maps document page images to 256-dimensional ℓ₂-normalized embeddings tuned for fine-grained document similarity. It was trained on 20 280 triplets (19 266 train / 1 014 validation) with edit-distance-ranked supervision: anchor = masked original, positive = a reconstruction with low text edit distance, negative = a reconstruction with high text edit distance. Training used a cosine triplet-margin loss (margin 0.1) with batch size 16 and learning rate 1e-4 for 3 epochs on a single NVIDIA RTX 6000 Ada GPU. Best validation accuracy: 99.9 %.
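For readers unfamiliar with the objective, here is a minimal sketch of a cosine triplet-margin loss for a single triplet, using the margin of 0.1 stated above. The exact formulation used in training is not given here, so the form below (hinge on the difference of cosine distances) is an assumption, and the toy 2-dimensional vectors stand in for the model's 256-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss: the positive must be closer to the anchor than the
    negative by at least `margin`, measured in cosine distance."""
    dist_pos = 1.0 - cosine_similarity(anchor, positive)
    dist_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, dist_pos - dist_neg + margin)


# Well-separated triplet: positive matches the anchor, negative is orthogonal.
easy_loss = cosine_triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])

# Violating triplet: the negative is the one that matches the anchor.
hard_loss = cosine_triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

The loss is zero once the positive beats the negative by the margin, so well-ordered triplets stop contributing gradient and training concentrates on the ones still ranked incorrectly or too closely.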

60-Page Stratified Sample

A 60-page stratified sample of the render-and-compare dataset (~370 MB) is available at omnidocbench-render-compare-sample for quick exploration without downloading the full ~9.4 GB dataset.

Sample creation methodology

Pages were selected by stratified random sampling:

Each of the 1 355 pages in ocr_all was assigned to one of 12 document categories (slides, book, academic paper, financial, newspaper, exam, textbook/notes, notes, magazine, DocStructBench, colorful, other) inferred from its page_id prefix.
Five pages were drawn uniformly at random from each category (random seed 42), giving a 60-page base sample (12 categories × 5 pages) that spans the full range of document types.
Up to 5 additional pages were added from the sparse ocr_formula and ocr_table variants so that each of those variants is represented by at least 5 pages.
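The procedure above can be sketched in a few lines. The prefix rule in `category_from_prefix`, the synthetic page ids, and the helper names are hypothetical; only the per-category count of 5, the seed 42, and the top-up target of 5 pages come from the description.

```python
import random
from collections import defaultdict


def category_from_prefix(page_id):
    """Hypothetical rule: the category is the part before the first '_'."""
    return page_id.split("_")[0]


def stratified_sample(page_ids, category_of, per_category=5, seed=42):
    """Draw `per_category` pages uniformly at random from each category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for page_id in sorted(page_ids):  # sort for a reproducible draw
        by_category[category_of(page_id)].append(page_id)
    sample = []
    for category in sorted(by_category):
        pool = by_category[category]
        sample.extend(rng.sample(pool, min(per_category, len(pool))))
    return sample


def top_up(sample, variant_pages, minimum=5, seed=42):
    """Add pages from a sparse variant until it appears `minimum` times."""
    rng = random.Random(seed)
    present = sum(1 for page_id in sample if page_id in variant_pages)
    missing = max(0, minimum - present)
    candidates = sorted(set(variant_pages) - set(sample))
    return sample + rng.sample(candidates, min(missing, len(candidates)))


# Synthetic ids: 12 categories with 10 pages each.
all_pages = [f"cat{c:02d}_{i:03d}" for c in range(12) for i in range(10)]
base_sample = stratified_sample(all_pages, category_from_prefix)

# Pretend only these pages contain formulas.
formula_pages = set(all_pages[:8])
final_sample = top_up(base_sample, formula_pages)
```

Drawing the same number of pages from every category gives each document type equal weight regardless of its share of the full dataset, so small categories are not drowned out.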

Full per-category and per-variant counts are in the sample dataset README.

Source Dataset

Both derived datasets are built from OmniDocBench (OpenDataLab / Shanghai Jiao Tong University, CC-BY-NC-4.0). The original page scans and ground-truth annotations are available from that repository. No personally identifiable information was collected; financial report pages may incidentally reference company names or executive names.

License

Datasets (omnidocbench-render-compare, omnidocbench-render-compare-parquet, omnidocbench-qwen-ocr-logprobs, omnidocbench-qwen-ocr-logprobs-parquet) — CC-BY-NC-4.0, inherited from the OmniDocBench source.
DocSim LoRA weights — Apache 2.0 (model weights); training data is CC-BY-NC-4.0.