MinerU2.5 is a 1.2B-parameter vision-language model purpose-built for high-resolution document parsing. It uses a two-stage, coarse-to-fine pipeline—fast global layout on a downsampled page, then native-resolution crop recognition for text, tables, and formulas—to hit state-of-the-art accuracy with low compute. The team recommends vLLM (including the async engine) for high-throughput serving, and reports strong results on OmniDocBench and related OCR/Doc tasks.
Overall Performance (1-Edit, CDM, TEDS aggregated)
Model | Score |
---|---|
MinerU2.5 | 90.67 |
MonkeyOCR-pro-3B | 88.85 |
dots.ocr | 88.41 |
Gemini-2.5 Pro | 88.03 |
Qwen2.5-VL-72B | 87.01 |
PP-StructureV3 | 86.96 |
MonkeyOCR-pro-1.2B | 86.73 |
Nanonets-OCR-s | 85.59 |
MinerU2-VLM | 85.56 |
InternVL3.5-241B | 82.66 |
POINTS-Reader | 80.98 |
Mistral OCR | 78.83 |
MinerU2-pipeline | 75.51 |
GPT-4o | 75.02 |
OCRFlux | 74.82 |
Dolphin | 74.67 |
Marker | 71.30 |
Element-wise Performance
Text Block (1-Edit)
Model | Score |
---|---|
MinerU2.5 | 95.34 |
dots.ocr | 95.24 |
PP-StructureV3 | 92.71 |
Qwen2.5-VL-72B | 92.55 |
Gemini-2.5 Pro | 92.52 |
MonkeyOCR-3B | 92.18 |
Formula (CDM)
Model | Score |
---|---|
MinerU2.5 | 88.46 |
Qwen2.5-VL-72B | 88.27 |
Gemini-2.5 Pro | 87.25 |
MonkeyOCR-3B | 87.23 |
InternVL3.5 | 85.90 |
MonkeyOCR-1.2B | 85.81 |
Table (TEDS)
Model | Score |
---|---|
MinerU2.5 | 88.22 |
dots.ocr | 86.78 |
MonkeyOCR-3B | 86.78 |
Gemini-2.5 Pro | 85.71 |
Qwen2.5-VL-72B | 84.24 |
InternVL3.5 | 83.54 |
Reading Order (1-Edit)
Model | Score |
---|---|
MinerU2.5 | 96.62 |
dots.ocr | 94.72 |
PP-StructureV3 | 92.66 |
MonkeyOCR-3B | 91.45 |
Gemini-2.5 Pro | 90.30 |
Qwen2.5-VL-72B | 89.85 |
Model Components
Component | Details |
---|---|
Vision Backbone | NativeRes-ViT – 675M parameters |
Language Decoder | LM Decoder – 0.5B parameters |
Output Format | Markdown (supports text, tables, formulas, figures) |
Pipeline Overview
Stage | Step | Description |
---|---|---|
Stage I: Layout Analysis | Resize | Downsample the document image for fast global layout (e.g., 2640 × 3320 px → 1036 × 1295 px)
| Layout Detection | Detect bounding boxes for elements (tables, images, text, figures, captions)
| Cropping | Extract crops for each detected region (with `<box_start>` and `<ref_start>` tags)
| Native-Res Handling | Decide whether to drop a region or keep it as a figure
| Output Example | Order: 1 → Box [163, 81, 836, 129] → Type: Table Caption → Orientation: ⬆️ |
Stage II: Content Recognition
Process | Details |
---|---|
Merge by Order | Crops are ordered sequentially for recognition |
Parallel Decoding Modules | Text Recognition, Table Recognition, Formula Recognition
Adjustments | Orientation correction (e.g., rotate crops) |
High-Resolution Crops | Examples: 1715 px, 1687 px, 1124 px (fine-grained recognition) |
Output
Format | Description |
---|---|
Markdown | Structured results with support for lists, headers, equations, tables, figures, etc. |
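Putting the two stages together, the flow can be sketched as pseudocode. The helpers below (resize_for_layout, detect_layout, recognize_crop) are illustrative names only, not the library's API; in practice the whole routine is wrapped by mineru-vl-utils' two_step_extract, shown later in the tutorial.
# Illustrative pseudocode of the coarse-to-fine pipeline; helper names are hypothetical.
def parse_page(page_image, model):
    # Stage I: global layout analysis on a downsampled copy of the page
    small = resize_for_layout(page_image)              # e.g., down to ~1036 x 1295 px
    regions = detect_layout(model, small)              # boxes, element types, reading order, orientation
    # Stage II: content recognition on native-resolution crops
    results = []
    for region in sorted(regions, key=lambda r: r.order):
        crop = page_image.crop(region.bbox)            # crop at native resolution
        if region.angle:
            crop = crop.rotate(region.angle, expand=True)   # orientation correction
        results.append(recognize_crop(model, crop, region.type))  # text / table / formula decoding
    return results                                     # assembled into Markdown downstream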
Performance on OmniDocBench
Across Different Elements
Model Type | Methods | Parameters | Overall ↑ | Textᴱᵈᶦᵗ ↓ | Formulaᶜᴰᴹ ↑ | Tableᵀᴱᴰˢ ↑ | Tableᵀᴱᴰˢ-S ↑ | Read Orderᴱᵈᶦᵗ ↓ |
---|---|---|---|---|---|---|---|---|
Pipeline Tools | Marker-1.8.2 | – | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| MinerU2-pipeline | – | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| PP-StructureV3 | – | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
General VLMs | GPT-4o | – | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| InternVL3-76B | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| InternVL3.5-241B | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| Qwen2.5-VL-72B | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| Gemini-2.5 Pro | – | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
Specialized VLMs | Dolphin | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| OCRFlux | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| Mistral-OCR | – | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| POINTS-Reader | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| olmOCR-7B | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| MinerU2-VLM | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| Nanonets-OCR-s | 3.7B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| MonkeyOCR-pro-1.2B | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| MonkeyOCR-3B | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| dots.ocr | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| MonkeyOCR-pro-3B | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
⭐ MinerU2.5 | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
Across Various Document Types (overall edit distance per document type; lower is better)
Model Type | Models | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
---|---|---|---|---|---|---|---|---|---|---|
Pipeline Tools | Marker-1.8.2 | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| MinerU2-pipeline | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.3950 | 0.0736 | 0.2603 | 0.0411 |
| PP-StructureV3 | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
General VLMs | GPT-4o | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.1420 | 0.6254 | 0.2611 | 0.3343 |
| InternVL3-76B | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
| InternVL3.5-241B | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
| Qwen2.5-VL-72B | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.2380 | 0.1232 | 0.0264 |
| Gemini-2.5 Pro | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
Specialized VLMs | Dolphin | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
| OCRFlux | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
| Mistral-OCR | 0.0917 | 0.0531 | 0.0610 | 0.1349 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
| POINTS-Reader | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
| olmOCR-7B | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.1220 | 0.0459 |
| MinerU2-VLM | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
| Nanonets-OCR-s | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
| MonkeyOCR-pro-1.2B | 0.0961 | 0.0354 | 0.0530 | 0.1110 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
| MonkeyOCR-3B | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
| dots.ocr | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
| MonkeyOCR-pro-3B | 0.0879 | 0.0459 | 0.0517 | 0.1067 | 0.0726 | 0.0482 | 0.0937 | 0.1141 | 0.0211 |
⭐ MinerU2.5 | MinerU2.5 | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.0540 | 0.1161 | 0.0104 |
GPU Configuration (Inference Rule-of-Thumb)
Scenario | Precision / Quant | “Works” VRAM (est.) | Smooth VRAM (est.) | Example GPUs | Tips / Notes |
---|---|---|---|---|---|
Lightweight single-image runs | 8-bit or 4-bit quantized | 6–8 GB | 8–12 GB | RTX 3050/3060 8–12 GB, L4 24 GB (ample) | Keep batch=1, modest max tokens; quantize weights; lower image resolution if tight. (Estimate based on model scale.) |
Standard single-GPU (best balance) | FP16 / BF16 | 10–12 GB | 12–16 GB | RTX 4070/4070 Ti 12–16 GB, A4000 16 GB | Use vLLM engine; enable the provided MinerU logits processor on vLLM ≥ 0.10.1; two-stage flow keeps memory predictable. (Hugging Face) |
High-throughput server | FP16 / BF16 | 16–24 GB | 24–40 GB | A5000 24 GB, A6000 48 GB, A100 40–80 GB, H100 80 GB | Use vllm-async-engine for concurrency (authors cite strong fps on A100). Tune max concurrency / token length; pin threads and pre-load model. (Hugging Face) |
CPU fallback (debug only) | INT8/FP32 (CPU) | — | — | 16-core+ CPU | Very slow; useful just to validate pipeline; switch to GPU for real use. |
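If you are at the lower end of these estimates, vLLM's standard engine arguments can cap the memory footprint. A minimal sketch, assuming stock vLLM knobs (gpu_memory_utilization, max_num_seqs); the values are examples, not tuned recommendations:
from vllm import LLM
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor

# Sketch: constrain vLLM on a smaller card; adjust the numbers to your GPU.
llm = LLM(
    model="opendatalab/MinerU2.5-2509-1.2B",
    logits_processors=[MinerULogitsProcessor],
    gpu_memory_utilization=0.80,   # leave headroom on a 12 GB card
    max_num_seqs=4,                # limit concurrent sequences per batch
)
client = MinerUClient(backend="vllm-engine", vllm_llm=llm)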
Implementation pointers (from the official snippets):
- Prefer vLLM: `LLM(model="opendatalab/MinerU2.5-2509-1.2B", logits_processors=[MinerULogitsProcessor])`. For heavy concurrency, use AsyncLLM (see the sketch after this list).
- Wrapper: `mineru-vl-utils` provides `two_step_extract(...)` / `aio_two_step_extract(...)` for the coarse-to-fine routine.
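A rough sketch of the async wiring is below. The backend string matches the table above, but the constructor keyword (vllm_async_llm) is assumed to mirror the synchronous vllm_llm argument, so verify it against the mineru-vl-utils README before relying on it:
import asyncio
from PIL import Image
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.engine.arg_utils import AsyncEngineArgs
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor

# Sketch only: kwarg names are assumed to mirror the synchronous example.
async_llm = AsyncLLM.from_engine_args(AsyncEngineArgs(
    model="opendatalab/MinerU2.5-2509-1.2B",
    logits_processors=[MinerULogitsProcessor],
))
client = MinerUClient(backend="vllm-async-engine", vllm_async_llm=async_llm)

async def parse_pages(paths):
    # Extract several pages concurrently with the async variant of two_step_extract
    images = [Image.open(p).convert("RGB") for p in paths]
    return await asyncio.gather(*(client.aio_two_step_extract(img) for img in images))

results = asyncio.run(parse_pages(["page_1.png", "page_2.png"]))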
Resources
Link: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
Step-by-Step Process to Install & Run MinerU2.5-2509-1.2B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running MinerU2.5-2509-1.2B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based models like MinerU2.5-2509-1.2B.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected: Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like MinerU2.5-2509-1.2B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that MinerU2.5-2509-1.2B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Base System Packages (Ubuntu)
Install the essentials you’ll need for MinerU2.5-2509-1.2B: Python 3.10 venv/pip, Git + LFS, FFmpeg, OpenGL libs, and build tools.
Run the following commands to install base system packages:
sudo apt update
sudo apt install -y python3.10-venv python3-pip git git-lfs ffmpeg libgl1 libglib2.0-0 build-essential
git lfs install
Step 9: Create & Activate a Python Virtual Environment
Isolate everything for MinerU2.5-2509-1.2B in its own venv, then upgrade the basic build tools.
Run the following commands to create & activate a python virtual environment:
python3.10 -m venv ~/miner
source ~/miner/bin/activate
python -m pip install -U pip wheel setuptools
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch torchvision torchaudio
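Optionally, confirm that PyTorch can see the GPU before moving on:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"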
Option A: Transformers Backend (Simple, 6 GB+ VRAM)
Step 11: Install the Utilities
Run the following command to install utilities:
pip install "mineru-vl-utils[transformers]" pillow
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running the model script with MinerU2.5-2509-1.2B, it's a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., app.py) and add the following code. This version uses the transformers backend that Option A installs; the model/processor setup below follows the usual Hugging Face pattern for this checkpoint, so adjust class or argument names if your installed versions differ:
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
from mineru_vl_utils import MinerUClient
# Load the model and processor for the transformers backend (class and argument
# names follow the common transformers usage for this checkpoint; check the model
# card if they differ in your installed versions)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "opendatalab/MinerU2.5-2509-1.2B", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(
    "opendatalab/MinerU2.5-2509-1.2B", use_fast=True)
client = MinerUClient(backend="transformers", model=model, processor=processor)
image = Image.open("test_page.png")
blocks = client.two_step_extract(image)
print(blocks)
What This Script Does
- Loads the MinerU2.5 weights and processor through Hugging Face transformers (device_map="auto" places them on the GPU).
- Builds a MinerUClient that uses the transformers backend for inference.
- Loads test_page.png as the input page image.
- Runs MinerU’s two-step extraction (coarse layout → fine recognition for text/tables/formulas).
- Prints the structured blocks (type, bbox, angle, content) detected on the page.
Step 14: Run the Script
Run the script with the following command:
python3 app.py
This will download the model on first run and print the extracted blocks in the terminal.
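If you want output closer to Markdown than the raw Python objects, you can post-process the returned blocks. The sketch below (appended to app.py) is only an example: it assumes each block exposes type and content fields as described above, either as attributes or dict keys depending on the mineru-vl-utils version, so adapt the accessors to whatever print(blocks) actually shows:
# Hypothetical post-processing sketch; adapt field access to your library version.
def _field(block, name):
    # Works whether a block is a dict or an object with attributes.
    return block.get(name) if isinstance(block, dict) else getattr(block, name, None)

def blocks_to_markdown(blocks):
    parts = []
    for block in blocks:
        content = _field(block, "content")
        if not content:
            continue
        btype = str(_field(block, "type") or "").lower()
        if "formula" in btype or "equation" in btype:
            parts.append(f"$$\n{content}\n$$")   # wrap formulas as display math
        else:
            parts.append(content)                # text, titles, captions, table markup, etc.
    return "\n\n".join(parts)

print(blocks_to_markdown(blocks))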
Option B: vLLM Backend (Fast, Scalable, 8 GB+ VRAM)
Step 15: Install the Utilities
Run the following command to install utilities:
pip install "mineru-vl-utils[vllm]" "vllm>=0.10.1" pillow
Step 16: Create the Script
Create a file (e.g., mineru_vllm_test.py) and add the following code:
from vllm import LLM
from PIL import Image
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor
llm = LLM(
model="opendatalab/MinerU2.5-2509-1.2B",
logits_processors=[MinerULogitsProcessor],
# keep native context; don't override max_model_len unless you really need to
)
client = MinerUClient(backend="vllm-engine", vllm_llm=llm)
img = Image.open("test_page.png").convert("RGB")
print(client.two_step_extract(img))
What This Script Does
- Initializes vLLM with the MinerU2.5 VLM and MinerU’s logits processor.
- Constructs a MinerUClient that uses the vLLM engine for GPU-accelerated inference.
- Opens test_page.png and converts it to RGB for consistent processing.
- Runs MinerU’s two-step extraction (global layout → fine recognition of text/tables/formulas).
- Prints the structured blocks detected (type, bbox, angle, content).
Step 17: Run the Script
Run the script with the following commands:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python3 mineru_vllm_test.py
This will download the model on first run and print the extracted blocks in the terminal.
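To go beyond a single test page, you can reuse the same client across a folder of page images. A simple sketch; the pages/ directory and the .blocks.txt output naming are just examples:
from pathlib import Path
from PIL import Image
from vllm import LLM
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor

llm = LLM(model="opendatalab/MinerU2.5-2509-1.2B",
          logits_processors=[MinerULogitsProcessor])
client = MinerUClient(backend="vllm-engine", vllm_llm=llm)

# Parse every PNG in pages/ and write the raw block dump next to each image.
for path in sorted(Path("pages").glob("*.png")):
    image = Image.open(path).convert("RGB")
    blocks = client.two_step_extract(image)
    (path.parent / (path.stem + ".blocks.txt")).write_text(repr(blocks))
    print(f"{path.name}: {len(blocks)} blocks")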
Conclusion
MinerU2.5 makes high-resolution document parsing practical: a compact 1.2B VLM, a fast coarse-to-fine pipeline, and simple Python snippets that run anywhere—from a budget GPU to an A100. With a NodeShift (or any) GPU VM set up, you can start with the Transformers path for simplicity or switch to vLLM for serious throughput (and the async engine when you scale). The examples above take you from a single page image to structured blocks you can turn into Markdown, tables, and LaTeX—then batch it across folders. From here, try the async engine, add quantization for smaller GPUs, or drop a lightweight UI on top; you now have a reliable, production-ready foundation for document intelligence.