Overview
GT-Free OCR Metrics is a reference-free evaluation framework for OCR systems. Instead of comparing OCR output against manually transcribed ground truth, the pipeline renders the OCR output back into an image and measures visual similarity against the original page scan — no ground-truth text required at test time.
The framework is validated on OmniDocBench (1 355 real-world document pages, English and Chinese, 9 document categories: PPT2PDF, academic literature, book, colorful textbook, exam paper, magazine, newspaper, note, research report) by correlating reference-free metric scores with reference-based edit distance and TEDS. The best composite method achieves Spearman ρ = 0.494 (mean across 5 OCR-output variants, p < 0.001 at N = 1 355), with a per-variant peak of ρ = 0.605 on the formula-only variant. The top method stacks per-element CLIP cosine (top-k percentile), DocSim (a learned document-similarity head), multi-scale SSIM, and OCR log-probability entropy.
The collection consists of two derived datasets (each shipped in a fast-download parquet edition alongside the raw per-page layout) and one pre-trained similarity model, grouped in the GT-Free OCR Metrics HuggingFace Collection.
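The validation step described above, correlating a reference-free score with a reference-based one across pages, can be sketched as follows. This is a minimal pure-Python illustration of Spearman's rank correlation, not the project's evaluation code. The function names and the five per-page values are made up for the example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-page values, for illustration only.
visual_similarity = [0.91, 0.85, 0.40, 0.77, 0.62]
edit_distance = [0.05, 0.10, 0.60, 0.30, 0.20]
rho = spearman_rho(visual_similarity, edit_distance)
print(round(rho, 3))
```

In this toy input the correlation is negative because one quantity is a similarity and the other a distance. Only the ranks matter, so the raw scales of the two metrics do not need to match.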
How It Works
               Original page scan
                        |
                        v
      Qwen OCR --> HTML with bounding boxes
                        |
      +-----------------+------------------+
      | Mask non-target regions            | Parse element bboxes
      v                                    v
masked_original.png                 Render --> reconstructed.png
      |                                    |
      +------------- Compare ---------------+
            (SSIM patches, LPIPS, CLIP,
             DINOv2, logprobs, coverage, ...)
Five OCR extraction variants cover different subsets of document elements (text, formula, table) with and without image masking.
| Variant | Elements extracted | Pages |
|---|---|---|
| all | text + formula + table (images masked) | 1 355 |
| all_no_mask | text + formula + table (images unmasked) | 1 355 |
| text | text only | 1 349 |
| formula | formula only | 200 |
| table | table only | 351 |
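As a rough illustration of the Compare step, the sketch below computes a single-window SSIM between two small grayscale images in pure Python. It is a deliberate simplification: the pipeline lists patch-based and multi-scale SSIM among several signals, whereas this version uses one global window. The function name and the 4x4 toy images are assumptions made for the example.

```python
def global_ssim(img_a, img_b, dynamic_range=255.0):
    """Single-window SSIM over two equal-sized grayscale images (lists of rows)."""
    a = [float(p) for row in img_a for p in row]
    b = [float(p) for row in img_b for p in row]
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


# Toy stand-ins for masked_original.png and reconstructed.png (0 = ink, 255 = paper).
masked_original = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
# A reconstruction that dropped one ink pixel.
reconstructed = [
    [255, 255, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
# A reconstruction with ink and paper swapped everywhere.
inverted = [[255 - p for p in row] for row in masked_original]

score_same = global_ssim(masked_original, masked_original)
score_close = global_ssim(masked_original, reconstructed)
score_far = global_ssim(masked_original, inverted)
```

A perfect reconstruction scores 1, a slightly degraded one scores lower, and an inverted one scores lowest. A production comparison would typically slide a local window over the page instead of pooling all pixels.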
Datasets & Model
Render-and-Compare Pairs + DocSim Triplets
For each of 1 355 OmniDocBench pages, the dataset provides a
masked_original PNG (the original page scan with non-target
document elements grayed out) and a reconstructed PNG (the OCR
output rendered back into an image via HTML). The HTML output and
per-element bounding-box JSON (text / formula / table) are also included.
Five element-subset variants are released as separate configs:
ocr_all (image regions masked, everything else rendered),
ocr_all_no_mask (same elements, without image masking),
ocr_text (only text regions rendered, the rest masked),
ocr_table (only table regions) and
ocr_formula (only formula regions).
The pairs are designed for training and evaluating reference-free visual
similarity metrics for OCR quality.
A sixth config, docsim_triplets, provides
20 280 anchor / positive / negative triplets
(19 266 train / 1 014 validation) used to train the
DocSim LoRA model below. Each triplet is a JSONL record whose
anchor_path, positive_path, and
negative_path fields point to images within the five
variant configs of this same dataset (anchor = a
masked_original.png; positive = a
reconstructed.png with low text edit distance to the GT;
negative = a reconstructed.png with high text edit
distance). Triplets are formed both same-page (best vs. worst OCR
variant of one page) and cross-page; supervision is by per-page text edit
distance against the OmniDocBench ground truth.
Distributed in two formats: a parquet edition
(64 zstd-shards, ~9.4 GB; image bytes inline; image variants only)
and a raw per-page-directory layout (image variants
plus the docsim_triplets JSONL config).
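A minimal sketch of reading the triplet records follows. The three field names come from the description above. The function name, the validation behaviour, and the example paths are hypothetical and do not reflect the dataset's actual directory layout.

```python
import json

REQUIRED_FIELDS = ("anchor_path", "positive_path", "negative_path")


def parse_triplets(lines):
    """Parse JSONL lines into (anchor, positive, negative) path tuples."""
    triplets = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise KeyError(f"line {line_no}: missing fields {missing}")
        triplets.append(tuple(record[f] for f in REQUIRED_FIELDS))
    return triplets


# Two made-up records; the paths are placeholders, not real dataset paths.
example_lines = [
    json.dumps({
        "anchor_path": "ocr_all/page_a/masked_original.png",
        "positive_path": "ocr_all/page_a/reconstructed.png",
        "negative_path": "ocr_text/page_a/reconstructed.png",
    }),
    json.dumps({
        "anchor_path": "ocr_table/page_b/masked_original.png",
        "positive_path": "ocr_table/page_b/reconstructed.png",
        "negative_path": "ocr_table/page_c/reconstructed.png",
    }),
]
triplets = parse_triplets(example_lines)
```

Because an open text file iterates line by line, the same function can be called on a file handle for the JSONL file from the raw layout.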
Qwen OCR Log-Probabilities
For each of 1 355 OmniDocBench pages, the dataset records the full token-level log-probability stream emitted by Qwen3.5-122B-A10B during OCR inference (including top-N alternatives per token), and a per-bounding-box aggregation derived from those streams (entropy, min / mean logprob, etc.). Together with the rendered-image dataset, these provide an OCR-internal confidence signal that is statistically independent of visual similarity, enabling hybrid metrics that combine the two and studies of where the OCR model is uncertain.
Distributed in two formats: a parquet edition (7 zstd-shards, ~115 MB, JSON inline) and a raw per-page-directory layout.
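The per-bounding-box aggregation can be pictured with the sketch below. It assumes each token carries its own log-probability and the log-probabilities of its top-N alternatives. The key names `logprob` and `top_logprobs` and the output field names are hypothetical, and the entropy is computed over the renormalised top-N alternatives only, which may differ from how the released aggregates were derived.

```python
import math


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) of the renormalised top-N alternatives."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total <= 0:
        raise ValueError("top_logprobs must contain at least one finite value")
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def aggregate_box(tokens):
    """Summarise the tokens that fall inside one bounding box.

    Each token is a dict with hypothetical keys:
      "logprob"      : log-probability of the emitted token
      "top_logprobs" : log-probabilities of the top-N alternatives
    """
    if not tokens:
        raise ValueError("bounding box has no tokens")
    logprobs = [t["logprob"] for t in tokens]
    entropies = [token_entropy(t["top_logprobs"]) for t in tokens]
    return {
        "min_logprob": min(logprobs),
        "mean_logprob": sum(logprobs) / len(logprobs),
        "mean_entropy": sum(entropies) / len(entropies),
    }


# One confident token and one token split evenly between two alternatives.
box_tokens = [
    {"logprob": math.log(0.99), "top_logprobs": [math.log(0.99), math.log(0.01)]},
    {"logprob": math.log(0.5), "top_logprobs": [math.log(0.5), math.log(0.5)]},
]
summary = aggregate_box(box_tokens)
```

A box with a low minimum log-probability or a high mean entropy marks a region where the OCR model was unsure, regardless of how the rendered page looks.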
DocSim LoRA
A pre-trained document-similarity head that extends frozen
OpenCLIP ViT-B/32 (laion2b_s34b_b79k checkpoint)
and DINOv2 ViT-B/14 backbones with rank-16 LoRA adapters
(α 32, dropout 0.05) on the encoders' Q/V projections, and a
1280→512→256 MLP projection head — approximately
0.87 M trainable parameters on top of ~174 M frozen
backbone parameters. The model maps document page images to 256-dimensional
ℓ2-normalized embeddings tuned for fine-grained document
similarity. Trained on 20 280 triplets
(19 266 train / 1 014 validation) with edit-distance-ranked
supervision: anchor = masked original, positive = a reconstruction
with low text edit distance, negative = a reconstruction with high text
edit distance. Training used a cosine triplet-margin loss (margin 0.1,
batch size 16, learning rate 1e-4) for 3 epochs on a single NVIDIA RTX 6000 Ada GPU.
Best validation accuracy: 99.9 %.
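The training objective can be illustrated with a small pure-Python sketch. It assumes the common formulation max(0, d(a, p) - d(a, n) + margin) with d as cosine distance, and it assumes that validation accuracy means the fraction of triplets whose positive is closer to the anchor than the negative. Neither detail is confirmed by the description above. The 3-dimensional vectors stand in for the model's 256-dimensional embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [v / norm for v in vec]


def cosine_distance(u, v):
    """1 minus the cosine similarity of u and v."""
    u, v = l2_normalize(u), l2_normalize(v)
    return 1.0 - sum(a * b for a, b in zip(u, v))


def triplet_margin_loss(anchor, positive, negative, margin=0.1):
    """Zero once the positive is closer than the negative by at least `margin`."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


def triplet_accuracy(triplets):
    """Fraction of triplets whose positive is closer to the anchor than the negative."""
    correct = sum(
        1 for a, p, n in triplets if cosine_distance(a, p) < cosine_distance(a, n)
    )
    return correct / len(triplets)


# Toy embeddings: the positive nearly matches the anchor, the negative is orthogonal.
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negative = [0.0, 1.0, 0.0]

loss_good = triplet_margin_loss(anchor, positive, negative)
loss_bad = triplet_margin_loss(anchor, negative, positive)  # roles swapped
accuracy = triplet_accuracy([(anchor, positive, negative), (anchor, negative, positive)])
```

A well-ordered triplet contributes no loss, while a mis-ordered one is penalised by the size of the violation plus the margin.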
60-Page Stratified Sample
A 60-page stratified sample of the render-and-compare dataset (~370 MB) is available at omnidocbench-render-compare-sample for quick exploration without downloading the full ~9.4 GB dataset.
Sample creation methodology
Pages were selected by stratified random sampling:
- Each of the 1 355 pages in ocr_all was assigned to one of 12 document categories (slides, book, academic paper, financial, newspaper, exam, textbook/notes, notes, magazine, DocStructBench, colorful, other) inferred from its page_id prefix.
- 5 pages were drawn uniformly at random from each category (random seed 42), giving a 60-page base sample representative of the full distribution.
- Up to 5 additional pages were added from the sparse ocr_formula and ocr_table variants to ensure each appears at least 5 times.
Full per-category and per-variant counts are in the sample dataset README.
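The base-sample step can be sketched as below. The prefix rule in `category_of`, the toy page ids, and the category names are hypothetical. Reusing seed 42 here will not reproduce the released sample, since the original page ordering and sampling code are not given.

```python
import random
from collections import defaultdict


def category_of(page_id):
    """Hypothetical rule: the category is the text before the first underscore."""
    return page_id.split("_", 1)[0]


def stratified_sample(page_ids, per_category=5, seed=42):
    """Draw up to `per_category` pages uniformly at random from each category."""
    groups = defaultdict(list)
    # Sort first so the result does not depend on the input order.
    for page_id in sorted(page_ids):
        groups[category_of(page_id)].append(page_id)
    rng = random.Random(seed)
    sample = []
    for category in sorted(groups):
        pages = groups[category]
        sample.extend(rng.sample(pages, min(per_category, len(pages))))
    return sample


# Toy corpus: 12 made-up categories with 20 pages each.
toy_categories = [f"cat{k:02d}" for k in range(12)]
toy_pages = [f"{cat}_{i:03d}" for cat in toy_categories for i in range(20)]
sample = stratified_sample(toy_pages)
```

Drawing a fixed number per category guarantees every category is present, at the cost of over-representing small categories relative to their share of the full set.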
Source Dataset
Both derived datasets are built from OmniDocBench (OpenDataLab / Shanghai Jiao Tong University, CC-BY-NC-4.0). The original page scans and ground-truth annotations are available from that repository. No personally identifiable information was collected; financial report pages may incidentally reference company names or executive names.
License
- Datasets (omnidocbench-render-compare, omnidocbench-render-compare-parquet, omnidocbench-qwen-ocr-logprobs, omnidocbench-qwen-ocr-logprobs-parquet): CC-BY-NC-4.0, inherited from the OmniDocBench source.
- DocSim LoRA weights: Apache 2.0 (model weights); training data is CC-BY-NC-4.0.