MinerU2.5 is a 1.2B-parameter vision-language model purpose-built for high-resolution document parsing. It uses a two-stage, coarse-to-fine pipeline—fast global layout on a downsampled page, then native-resolution crop recognition for text, tables, and formulas—to hit state-of-the-art accuracy with low compute. The team recommends vLLM (including the async engine) for high-throughput serving, and reports strong results on OmniDocBench and related OCR/Doc tasks.
Overall Performance (1-Edit, CDM, TEDS aggregated)
Model | Score |
---|---|
MinerU2.5 | 90.67 |
MonkeyOCR-pro-3B | 88.85 |
dots.ocr | 88.41 |
Gemini-2.5 Pro | 88.03 |
Qwen2.5-VL-72B | 87.01 |
PP-StructureV3 | 86.96 |
MonkeyOCR-pro-1.2B | 86.73 |
Nanonets-OCR-s | 85.59 |
MinerU2-VLM | 85.56 |
InternVL3.5-241B | 82.66 |
POINTS-Reader | 80.98 |
Mistral OCR | 78.83 |
MinerU2-pipeline | 75.51 |
GPT-4o | 75.02 |
OCRFlux | 74.82 |
Dolphin | 74.67 |
Marker | 71.30 |
Element-wise Performance
Text Block (1-Edit)
Model | Score |
---|---|
MinerU2.5 | 95.34 |
dots.ocr | 95.24 |
PP-StructureV3 | 92.71 |
Qwen2.5-VL-72B | 92.55 |
Gemini-2.5 Pro | 92.52 |
MonkeyOCR-3B | 92.18 |
Formula (CDM)
Model | Score |
---|---|
MinerU2.5 | 88.46 |
Qwen2.5-VL-72B | 88.27 |
Gemini-2.5 Pro | 87.25 |
MonkeyOCR-3B | 87.23 |
InternVL3.5 | 85.90 |
MonkeyOCR-1.2B | 85.81 |
Table (TEDS)
Model | Score |
---|---|
MinerU2.5 | 88.22 |
dots.ocr | 86.78 |
MonkeyOCR-3B | 86.78 |
Gemini-2.5 Pro | 85.71 |
Qwen2.5-VL-72B | 84.24 |
InternVL3.5 | 83.54 |
Reading Order (1-Edit)
Model | Score |
---|---|
MinerU2.5 | 96.62 |
dots.ocr | 94.72 |
PP-StructureV3 | 92.66 |
MonkeyOCR-3B | 91.45 |
Gemini-2.5 Pro | 90.30 |
Qwen2.5-VL-72B | 89.85 |
Model Components
Component | Details |
---|---|
Vision Backbone | NativeRes-ViT – 675M parameters |
Language Decoder | LM Decoder – 0.5B parameters |
Output Format | Markdown (supports text, tables, formulas, figures) |
Pipeline Overview
Stage | Step | Description |
---|---|---|
Stage I: Layout Analysis | Resize | Downsample the document image for fast global layout (e.g., 2640 × 3320 px → 1036 × 1295 px)
| Layout Detection | Detect bounding boxes for elements (tables, images, text, figures, captions)
| Cropping | Extract crops for each detected region (with `<box_start>` and `<ref_start>` tags)
| Native-Res Handling | Decide whether to drop a region or keep it as a figure
| Output Example | Order: 1 → Box [163, 81, 836, 129] → Type: Table Caption → Orientation: ⬆️ |
Stage II: Content Recognition
Process | Details |
---|---|
Merge by Order | Crops are ordered sequentially for recognition |
Parallel Decoding Modules | Text Recognition, Table Recognition, Formula Recognition
Adjustments | Orientation correction (e.g., rotate crops) |
High-Resolution Crops | Examples: 1715 px, 1687 px, 1124 px (fine-grained recognition) |
Output
Format | Description |
---|---|
Markdown | Structured results with support for lists, headers, equations, tables, figures, etc. |
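Putting the two stages together, the flow can be sketched as pseudocode. The helpers below (resize_for_layout, detect_layout, recognize_crop) are illustrative names only, not the library's API; in practice the whole routine is wrapped by mineru-vl-utils' two_step_extract, shown later in the tutorial.
# Illustrative pseudocode of the coarse-to-fine pipeline; helper names are hypothetical.
def parse_page(page_image, model):
    # Stage I: global layout analysis on a downsampled copy of the page
    small = resize_for_layout(page_image)              # e.g., down to ~1036 x 1295 px
    regions = detect_layout(model, small)              # boxes, element types, reading order, orientation
    # Stage II: content recognition on native-resolution crops
    results = []
    for region in sorted(regions, key=lambda r: r.order):
        crop = page_image.crop(region.bbox)            # crop at native resolution
        if region.angle:
            crop = crop.rotate(region.angle, expand=True)   # orientation correction
        results.append(recognize_crop(model, crop, region.type))  # text / table / formula decoding
    return results                                     # assembled into Markdown downstream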
Performance on OmniDocBench
Across Different Elements
Model Type | Methods | Parameters | Overall ↑ | Textᴱᵈᶦᵗ ↓ | Formulaᶜᴰᴹ ↑ | Tableᵀᴱᴰˢ ↑ | Tableᵀᴱᴰˢ-S ↑ | Read Orderᴱᵈᶦᵗ ↓ |
---|---|---|---|---|---|---|---|---|
Pipeline Tools | Marker-1.8.2 | – | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| MinerU2-pipeline | – | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| PP-StructureV3 | – | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
General VLMs | GPT-4o | – | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| InternVL3-76B | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| InternVL3.5-241B | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| Qwen2.5-VL-72B | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| Gemini-2.5 Pro | – | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
Specialized VLMs | Dolphin | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| OCRFlux | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| Mistral-OCR | – | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| POINTS-Reader | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| olmOCR-7B | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| MinerU2-VLM | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| Nanonets-OCR-s | 3.7B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| MonkeyOCR-pro-1.2B | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| MonkeyOCR-3B | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| dots.ocr | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| MonkeyOCR-pro-3B | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
⭐ MinerU2.5 | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
Across Various Document Types (overall edit distance per document type; lower is better)
Model Type | Models | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
---|---|---|---|---|---|---|---|---|---|---|
Pipeline Tools | Marker-1.8.2 | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| MinerU2-pipeline | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.3950 | 0.0736 | 0.2603 | 0.0411 |
| PP-StructureV3 | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
General VLMs | GPT-4o | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.1420 | 0.6254 | 0.2611 | 0.3343 |
| InternVL3-76B | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
| InternVL3.5-241B | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
| Qwen2.5-VL-72B | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.2380 | 0.1232 | 0.0264 |
| Gemini-2.5 Pro | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
Specialized VLMs | Dolphin | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
| OCRFlux | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
| Mistral-OCR | 0.0917 | 0.0531 | 0.0610 | 0.1349 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
| POINTS-Reader | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
| olmOCR-7B | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.1220 | 0.0459 |
| MinerU2-VLM | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
| Nanonets-OCR-s | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
| MonkeyOCR-pro-1.2B | 0.0961 | 0.0354 | 0.0530 | 0.1110 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
| MonkeyOCR-3B | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
| dots.ocr | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
| MonkeyOCR-pro-3B | 0.0879 | 0.0459 | 0.0517 | 0.1067 | 0.0726 | 0.0482 | 0.0937 | 0.1141 | 0.0211 |
⭐ MinerU2.5 | MinerU2.5 | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.0540 | 0.1161 | 0.0104 |
GPU Configuration (Inference Rule-of-Thumb)
Scenario | Precision / Quant | “Works” VRAM (est.) | Smooth VRAM (est.) | Example GPUs | Tips / Notes |
---|---|---|---|---|---|
Lightweight single-image runs | 8-bit or 4-bit quantized | 6–8 GB | 8–12 GB | RTX 3050/3060 8–12 GB, L4 24 GB (ample) | Keep batch=1, modest max tokens; quantize weights; lower image resolution if tight. (Estimate based on model scale.) |
Standard single-GPU (best balance) | FP16 / BF16 | 10–12 GB | 12–16 GB | RTX 4070/4070 Ti 12–16 GB, A4000 16 GB | Use vLLM engine; enable the provided MinerU logits processor on vLLM ≥ 0.10.1; two-stage flow keeps memory predictable. (Hugging Face) |
High-throughput server | FP16 / BF16 | 16–24 GB | 24–40 GB | A5000 24 GB, A6000 48 GB, A100 40–80 GB, H100 80 GB | Use vllm-async-engine for concurrency (authors cite strong fps on A100). Tune max concurrency / token length; pin threads and pre-load model. (Hugging Face) |
CPU fallback (debug only) | INT8/FP32 (CPU) | — | — | 16-core+ CPU | Very slow; useful just to validate pipeline; switch to GPU for real use. |
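If you are at the lower end of these estimates, vLLM's standard engine arguments can cap the memory footprint. A minimal sketch, assuming stock vLLM knobs (gpu_memory_utilization, max_num_seqs); the values are examples, not tuned recommendations:
from vllm import LLM
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor

# Sketch: constrain vLLM on a smaller card; adjust the numbers to your GPU.
llm = LLM(
    model="opendatalab/MinerU2.5-2509-1.2B",
    logits_processors=[MinerULogitsProcessor],
    gpu_memory_utilization=0.80,   # leave headroom on a 12 GB card
    max_num_seqs=4,                # limit concurrent sequences per batch
)
client = MinerUClient(backend="vllm-engine", vllm_llm=llm)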
Implementation pointers (from the official snippets):
- Prefer vLLM: `LLM(model="opendatalab/MinerU2.5-2509-1.2B", logits_processors=[MinerULogitsProcessor])`. For heavy concurrency, use AsyncLLM (see the sketch after this list).
- Wrapper: `mineru-vl-utils` provides `two_step_extract(...)` / `aio_two_step_extract(...)` for the coarse-to-fine routine.
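A rough sketch of the async wiring is below. The backend string matches the table above, but the constructor keyword (vllm_async_llm) is assumed to mirror the synchronous vllm_llm argument, so verify it against the mineru-vl-utils README before relying on it:
import asyncio
from PIL import Image
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.engine.arg_utils import AsyncEngineArgs
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor

# Sketch only: kwarg names are assumed to mirror the synchronous example.
async_llm = AsyncLLM.from_engine_args(AsyncEngineArgs(
    model="opendatalab/MinerU2.5-2509-1.2B",
    logits_processors=[MinerULogitsProcessor],
))
client = MinerUClient(backend="vllm-async-engine", vllm_async_llm=async_llm)

async def parse_pages(paths):
    # Extract several pages concurrently with the async variant of two_step_extract
    images = [Image.open(p).convert("RGB") for p in paths]
    return await asyncio.gather(*(client.aio_two_step_extract(img) for img in images))

results = asyncio.run(parse_pages(["page_1.png", "page_2.png"]))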
Resources
Link: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
Step-by-Step Process to Install & Run MinerU2.5-2509-1.2B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running MinerU2.5-2509-1.2B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based models like MinerU2.5-2509-1.2B.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected: Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like MinerU2.5-2509-1.2B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that MinerU2.5-2509-1.2B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Base System Packages (Ubuntu)
Install the essentials you’ll need for MinerU2.5-2509-1.2B: Python 3.10 venv/pip, Git + LFS, FFmpeg, OpenGL libs, and build tools.
Run the following commands to install base system packages:
sudo apt update
sudo apt install -y python3.10-venv python3-pip git git-lfs ffmpeg libgl1 libglib2.0-0 build-essential
git lfs install
Step 9: Create & Activate a Python Virtual Environment
Isolate everything for MinerU2.5-2509-1.2B in its own venv, then upgrade the basic build tools.
Run the following commands to create & activate a python virtual environment:
python3.10 -m venv ~/miner
source ~/miner/bin/activate
python -m pip install -U pip wheel setuptools
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch torchvision torchaudio
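Optionally, confirm that PyTorch can see the GPU before moving on:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"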
Option A: Transformers Backend (Simple, 6 GB+ VRAM)
Step 11: Install the Utilities
Run the following command to install utilities:
pip install "mineru-vl-utils[transformers]" pillow
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running the model script with MinerU2.5-2509-1.2B, it's a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., app.py) and add the following code. This version uses the transformers backend that Option A installs; the model/processor setup below follows the usual Hugging Face pattern for this checkpoint, so adjust class or argument names if your installed versions differ:
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
from mineru_vl_utils import MinerUClient
# Load the model and processor for the transformers backend (class and argument
# names follow the common transformers usage for this checkpoint; check the model
# card if they differ in your installed versions)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "opendatalab/MinerU2.5-2509-1.2B", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(
    "opendatalab/MinerU2.5-2509-1.2B", use_fast=True)
client = MinerUClient(backend="transformers", model=model, processor=processor)
image = Image.open("test_page.png")
blocks = client.two_step_extract(image)
print(blocks)
What This Script Does
- Loads the MinerU2.5 weights and processor through Hugging Face transformers (device_map="auto" places them on the GPU).
- Builds a MinerUClient that uses the transformers backend for inference.
- Loads test_page.png as the input page image.
- Runs MinerU’s two-step extraction (coarse layout → fine recognition for text/tables/formulas).
- Prints the structured blocks (type, bbox, angle, content) detected on the page.
Step 14: Run the Script
Run the script with the following command:
python3 app.py
This will download the model on first run and print the extracted blocks in the terminal.
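If you want output closer to Markdown than the raw Python objects, you can post-process the returned blocks. The sketch below (appended to app.py) is only an example: it assumes each block exposes type and content fields as described above, either as attributes or dict keys depending on the mineru-vl-utils version, so adapt the accessors to whatever print(blocks) actually shows:
# Hypothetical post-processing sketch; adapt field access to your library version.
def _field(block, name):
    # Works whether a block is a dict or an object with attributes.
    return block.get(name) if isinstance(block, dict) else getattr(block, name, None)

def blocks_to_markdown(blocks):
    parts = []
    for block in blocks:
        content = _field(block, "content")
        if not content:
            continue
        btype = str(_field(block, "type") or "").lower()
        if "formula" in btype or "equation" in btype:
            parts.append(f"$$\n{content}\n$$")   # wrap formulas as display math
        else:
            parts.append(content)                # text, titles, captions, table markup, etc.
    return "\n\n".join(parts)

print(blocks_to_markdown(blocks))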
Option B: vLLM Backend (Fast, Scalable, 8 GB+ VRAM)
Step 15: Install the Utilities
Run the following command to install utilities:
pip install "mineru-vl-utils[vllm]" "vllm>=0.10.1" pillow
Step 16: Create the Script
Create a file (e.g., mineru_vllm_test.py) and add the following code:
from vllm import LLM
from PIL import Image
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor
llm = LLM(
model="opendatalab/MinerU2.5-2509-1.2B",
logits_processors=[MinerULogitsProcessor],
# keep native context; don't override max_model_len unless you really need to
)
client = MinerUClient(backend="vllm-engine", vllm_llm=llm)
img = Image.open("test_page.png").convert("RGB")
print(client.two_step_extract(img))
What This Script Does
- Initializes vLLM with the MinerU2.5 VLM and MinerU’s logits processor.
- Constructs a MinerUClient that uses the vLLM engine for GPU-accelerated inference.
- Opens test_page.png and converts it to RGB for consistent processing.
- Runs MinerU’s two-step extraction (global layout → fine recognition of text/tables/formulas).
- Prints the structured blocks detected (type, bbox, angle, content).
Step 17: Run the Script
Run the script with the following commands:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python3 mineru_vllm_test.py
This will download the model on first run and print the extracted blocks in the terminal.
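To go beyond a single test page, you can reuse the same client across a folder of page images. A simple sketch; the pages/ directory and the .blocks.txt output naming are just examples:
from pathlib import Path
from PIL import Image
from vllm import LLM
from mineru_vl_utils import MinerUClient, MinerULogitsProcessor

llm = LLM(model="opendatalab/MinerU2.5-2509-1.2B",
          logits_processors=[MinerULogitsProcessor])
client = MinerUClient(backend="vllm-engine", vllm_llm=llm)

# Parse every PNG in pages/ and write the raw block dump next to each image.
for path in sorted(Path("pages").glob("*.png")):
    image = Image.open(path).convert("RGB")
    blocks = client.two_step_extract(image)
    (path.parent / (path.stem + ".blocks.txt")).write_text(repr(blocks))
    print(f"{path.name}: {len(blocks)} blocks")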
Conclusion
MinerU2.5 makes high-resolution document parsing practical: a compact 1.2B VLM, a fast coarse-to-fine pipeline, and simple Python snippets that run anywhere—from a budget GPU to an A100. With a NodeShift (or any) GPU VM set up, you can start with the Transformers path for simplicity or switch to vLLM for serious throughput (and the async engine when you scale). The examples above take you from a single page image to structured blocks you can turn into Markdown, tables, and LaTeX—then batch it across folders. From here, try the async engine, add quantization for smaller GPUs, or drop a lightweight UI on top; you now have a reliable, production-ready foundation for document intelligence.