DeepSeek-OCR is a cutting-edge vision-language model from DeepSeek AI designed for intelligent optical character recognition and document understanding. Built on the DeepSeek-VL-v2 architecture, it fuses visual perception with contextual text reasoning to accurately convert complex images, documents, and charts into structured text or Markdown formats. Optimized for GPU inference with FlashAttention 2, DeepSeek-OCR offers exceptional speed and precision in multilingual OCR, document layout parsing, and visual-text compression — making it a powerful tool for next-generation document intelligence.
Compression on Fox Benchmark
| Text Tokens per Page (Ground Truth) | Precision (%), 64 vision tokens | Precision (%), 100 vision tokens | Compression (×), 64 vision tokens | Compression (×), 100 vision tokens |
|---|---|---|---|---|
| 600–700 | 96.5 | 98.5 | 10.5 | 6.7 |
| 700–800 | 93.8 | 97.3 | 11.8 | 7.5 |
| 800–900 | 83.8 | 96.8 | 13.2 | 8.5 |
| 900–1000 | 85.8 | 96.8 | 15.1 | 9.7 |
| 1000–1100 | 79.3 | 91.5 | 16.5 | 10.6 |
| 1100–1200 | 76.3 | 89.8 | 17.7 | 11.3 |
| 1200–1300 | 59.1 | 87.1 | 19.7 | 12.6 |
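These ratios follow directly from the definition: compression is ground-truth text tokens divided by vision tokens, so a page of roughly 670 text tokens decoded from 64 vision tokens gives 670 / 64 ≈ 10.5×, matching the first row. The pattern worth noting is that precision stays above 90% up to roughly 10× compression and degrades steadily beyond about 15×.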
Performance on OmniDocBench

| Model | Avg. Vision Tokens per Image | Overall Edit Distance (↓) | Notes |
|---|---|---|---|
| DeepSeek-OCR (Gundam-M, 200 dpi) | ~1200 | < 0.25 | High accuracy (best region) |
| DeepSeek-OCR (Gundam) | ~1500 | < 0.25 | High accuracy |
| DeepSeek-OCR (Large) | ~1000 | < 0.25 | High accuracy |
| DeepSeek-OCR (Base) | ~1000 | < 0.25 | High accuracy |
| DeepSeek-OCR (Small) | ~900 | ≈ 0.3 | |
| DeepSeek-OCR (Tiny) | ~800 | ≈ 0.35 | |
| GOT-OCR2.0 | ~800 | ≈ 0.35 | |
| dots.ocr (200 dpi) | ~4500 | ≈ 0.2 | |
| dots.ocr | ~3000 | ≈ 0.25 | |
| Qwen2.5-VL-72B | ~3500 | ≈ 0.25 | |
| Qwen2.5-VL-7B | ~2500 | ≈ 0.35 | |
| OCRFlux-3B | ~3500 | ≈ 0.25 | |
| MinerU2.0 | ~4500 | ≈ 0.2 | |
| InternVL3-78B | ~2500 | ≈ 0.3 | |
| InternVL2-76B | ~6500 | ≈ 0.45 | |
| OLMOCR | ~2500 | ≈ 0.4 | |
| SmolDocling | ~300 | ≈ 0.5 | Very low token usage |
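Read together, the table shows the DeepSeek-OCR Large and Base variants reaching the same sub-0.25 edit-distance region as dots.ocr, MinerU2.0, and Qwen2.5-VL-72B while using roughly a quarter to a third of their vision tokens.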
DeepSeek-OCR — Suggested GPU Configs
| Variant | Image Setup | Vision Tokens | Dtype | Min VRAM (safe) | Recommended GPUs | Typical Batch |
|---|---|---|---|---|---|---|
| Tiny | base_size=512, image_size=512, crop_mode=False | 64–100 | BF16 | 12 GB | RTX 4090 (24 GB), L4 (24 GB), A5000 (24 GB) | 2–4 |
| Small | base_size=640, image_size=640, crop_mode=False | 64–100 | BF16 | 16 GB | 4090 (24 GB), L40S (48 GB), A6000 (48 GB) | 2–3 |
| Base | base_size=1024, image_size=1024, crop_mode=False | 64–100 | BF16 | 24 GB | 4090 (24 GB) ✓, A100 (40/80 GB), L40S (48 GB) | 1–2 |
| Large | base_size=1280, image_size=1280, crop_mode=False | 64–100 | BF16 | 32–40 GB | L40S (48 GB), A100 (40/80 GB), H100 (80 GB) | 1–2 |
| Gundam | base_size=1024, image_size=640, crop_mode=True | 64–100 | BF16 | 16–24 GB | 4090 (24 GB), L40S (48 GB), A100 (40 GB) | 2 |
| Gundam-M (200 dpi) | same as Gundam, higher-dpi inputs | 64–100 | BF16 | 24 GB | L40S (48 GB), A100 (40/80 GB), H100 (80 GB) | 1–2 |
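The Image Setup column maps directly onto the base_size, image_size, and crop_mode arguments that the model's infer() helper accepts (per the Hugging Face model card; see the full example in Step 14). A minimal sketch of the presets as a Python dict:

```python
# Variant presets from the table above. The keys of each inner dict mirror
# the arguments accepted by model.infer() per the Hugging Face model card.
VARIANTS = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),
}
```

Gundam-M uses the same arguments as Gundam; only the input DPI changes.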
Environment & Flags (Tested Combo)
| Component | Setting |
|---|---|
| Python / CUDA | Python 3.12, CUDA 11.8 (cu118 wheels) |
| PyTorch | torch==2.6.0 (cu118) |
| Transformers / Tokenizers | transformers==4.46.3, tokenizers==0.20.3 |
| FlashAttention | flash-attn==2.7.3, installed with --no-build-isolation |
| Model load hints | _attn_implementation='flash_attention_2', model.eval().cuda().to(torch.bfloat16) |
| Perf tips | Set CUDA_VISIBLE_DEVICES; use pin_memory=True in data loaders; try --bf16 if serving with vLLM; keep vision tokens at 64–100 for the best compression/speed tradeoff |
Resources
Link: https://huggingface.co/deepseek-ai/DeepSeek-OCR
Step-by-Step Process to Install & Run DeepSeek-OCR Locally
For the purposes of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at scale while meeting GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running DeepSeek-OCR, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- The full CUDA toolkit, including nvcc (a quick sanity check follows this list)
- Proper support for building and running GPU-based models like DeepSeek-OCR
- Compatibility with CUDA 12.1.1, required by certain model operations
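To confirm the toolkit is actually present once the VM is up, here is a small check you can run from Python inside the container (it only shells out to nvcc, which the -devel image provides):

```python
# Sanity check: confirm the CUDA compiler from the -devel image is on PATH.
import shutil
import subprocess

nvcc = shutil.which("nvcc")
if nvcc is None:
    raise SystemExit("nvcc not found: are you on a -devel image rather than -runtime?")
print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```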
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like DeepSeek-OCR.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that the DeepSeek-OCR runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.12 and Pip (VM already has Python 3.10; We Update It)
Run the following command to check the available Python version:
python3 --version
The system ships with Python 3.10.12 by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.12, Pip, and Wheel:
apt install -y python3.12 python3.12-venv python3.12-dev
python3.12 -m ensurepip --upgrade
python3.12 -m pip install --upgrade pip setuptools wheel
python3.12 --version
python3.12 -m pip --version
Step 9: Create and Activate a Python 3.12 Virtual Environment
Run the following commands to create and activate a Python 3.12 virtual environment:
python3.12 -m venv ~/.venvs/ocr
source ~/.venvs/ocr/bin/activate
python --version
pip --version
Step 10: Clone the DeepSeek-OCR Repo
Run the following command to clone the DeepSeek-OCR repo:
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
Step 11: Install PyTorch 2.6.0 (CUDA 11.8)
Activate your venv (if not already), then install the CUDA 11.8 wheels:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
--index-url https://download.pytorch.org/whl/cu118
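Before moving on, it is worth a quick check that the CUDA-enabled build actually landed (plain PyTorch, nothing DeepSeek-specific):

```python
# Verify the cu118 wheels installed and the GPU is visible to PyTorch.
import torch

print(torch.__version__)              # expect 2.6.0+cu118
print(torch.version.cuda)             # expect 11.8
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA RTX A6000"
```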
Step 12: Install All Dependencies from requirements.txt
After successfully installing PyTorch in Step 11, you’ll now install the remaining dependencies required for DeepSeek-OCR:
pip install -r requirements.txt
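To confirm that the pinned versions from the tested combo above were picked up, another quick check:

```python
# Confirm the versions match the tested combination listed earlier.
import tokenizers
import transformers

print(transformers.__version__)  # expect 4.46.3
print(tokenizers.__version__)    # expect 0.20.3
```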
Step 13: Install Wheel and Flash Attention
Run the following commands to install wheel and flash attention:
pip install packaging ninja wheel
pip install "flash-attn==2.7.3" --no-build-isolation --no-binary flash-attn
Step 14: Run the Model
Run the model with the following command:
python infer.py
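If the repository you cloned does not already provide an infer.py, the following minimal sketch reproduces the usage shown on the Hugging Face model card (the prompt string and the infer() arguments come from the card; the image and output paths are placeholders to replace with your own). It uses the Gundam setup from the config table above:

```python
# infer.py: minimal inference sketch following the Hugging Face model card.
import os

import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # pin the process to a single GPU

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",  # needs flash-attn from Step 13
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

# "Gundam" setup from the config table: 1024 base, 640 tiles, cropping on.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
model.infer(
    tokenizer,
    prompt=prompt,
    image_file="your_image.jpg",  # placeholder: your document image
    output_path="./output",       # placeholder: results directory
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```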
Conclusion
DeepSeek-OCR represents a major step forward in document intelligence, combining vision-language reasoning with high-speed GPU inference. Its ability to compress, recognize, and structure multilingual documents with high precision at roughly 10× visual-text compression (see the Fox benchmark table above) places it at the forefront of OCR technology. Whether you're parsing invoices, digitizing academic papers, or processing multilingual PDFs, DeepSeek-OCR delivers excellent efficiency with FlashAttention 2 acceleration. By following this tutorial, you've set up a complete GPU-powered DeepSeek-OCR environment ready for large-scale OCR and document understanding tasks.