DeepSeek-OCR is a cutting-edge vision-language model from DeepSeek AI designed for intelligent optical character recognition and document understanding. Built on the DeepSeek-VL-v2 architecture, it fuses visual perception with contextual text reasoning to accurately convert complex images, documents, and charts into structured text or Markdown formats. Optimized for GPU inference with FlashAttention 2, DeepSeek-OCR offers exceptional speed and precision in multilingual OCR, document layout parsing, and visual-text compression — making it a powerful tool for next-generation document intelligence.
Compression on Fox Benchmark
| Text Tokens per Page (Ground Truth) | Precision (%), 64 vision tokens | Precision (%), 100 vision tokens | Compression (×), 64 vision tokens | Compression (×), 100 vision tokens |
|---|---|---|---|---|
| 600–700 | 96.5 | 98.5 | 10.5 | 6.7 |
| 700–800 | 93.8 | 97.3 | 11.8 | 7.5 |
| 800–900 | 83.8 | 96.8 | 13.2 | 8.5 |
| 900–1000 | 85.8 | 96.8 | 15.1 | 9.7 |
| 1000–1100 | 79.3 | 91.5 | 16.5 | 10.6 |
| 1100–1200 | 76.3 | 89.8 | 17.7 | 11.3 |
| 1200–1300 | 59.1 | 87.1 | 19.7 | 12.6 |
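These ratios follow directly from the definition: compression is ground-truth text tokens divided by vision tokens, so a page of roughly 670 text tokens decoded from 64 vision tokens gives 670 / 64 ≈ 10.5×, matching the first row. The pattern worth noting is that precision stays above 90% up to roughly 10× compression and degrades steadily beyond about 15×.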
Performance on OmniDocBench

| Model | Avg. Vision Tokens per Image | Overall Edit Distance (↓) | Notes |
|---|---|---|---|
| DeepSeek-OCR (Gundam-M, 200 dpi) | ~1200 | < 0.25 | High accuracy (best region) |
| DeepSeek-OCR (Gundam) | ~1500 | < 0.25 | High accuracy |
| DeepSeek-OCR (Large) | ~1000 | < 0.25 | High accuracy |
| DeepSeek-OCR (Base) | ~1000 | < 0.25 | High accuracy |
| DeepSeek-OCR (Small) | ~900 | ≈ 0.3 | |
| DeepSeek-OCR (Tiny) | ~800 | ≈ 0.35 | |
| GOT-OCR2.0 | ~800 | ≈ 0.35 | |
| dots.ocr (200 dpi) | ~4500 | ≈ 0.2 | |
| dots.ocr | ~3000 | ≈ 0.25 | |
| Qwen2.5-VL-72B | ~3500 | ≈ 0.25 | |
| Qwen2.5-VL-7B | ~2500 | ≈ 0.35 | |
| OCRFlux-3B | ~3500 | ≈ 0.25 | |
| MinerU2.0 | ~4500 | ≈ 0.2 | |
| InternVL3-78B | ~2500 | ≈ 0.3 | |
| InternVL2-76B | ~6500 | ≈ 0.45 | |
| OLMOCR | ~2500 | ≈ 0.4 | |
| SmolDocling | ~300 | ≈ 0.5 | Very low token usage |
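Read together, the table shows the DeepSeek-OCR Large and Base variants reaching the same sub-0.25 edit-distance region as dots.ocr, MinerU2.0, and Qwen2.5-VL-72B while using roughly a quarter to a third of their vision tokens.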
DeepSeek-OCR — Suggested GPU Configs
| Variant | Image Setup | Vision Tokens | Dtype | Min VRAM (safe) | Recommended GPUs | Typical Batch |
|---|---|---|---|---|---|---|
| Tiny | base_size=512, image_size=512, crop_mode=False | 64–100 | BF16 | 12 GB | RTX 4090 (24 GB), L4 (24 GB), A5000 (24 GB) | 2–4 |
| Small | base_size=640, image_size=640, crop_mode=False | 64–100 | BF16 | 16 GB | 4090 (24 GB), L40S (48 GB), A6000 (48 GB) | 2–3 |
| Base | base_size=1024, image_size=1024, crop_mode=False | 64–100 | BF16 | 24 GB | 4090 (24 GB) ✓, A100 (40/80 GB), L40S (48 GB) | 1–2 |
| Large | base_size=1280, image_size=1280, crop_mode=False | 64–100 | BF16 | 32–40 GB | L40S (48 GB), A100 (40/80 GB), H100 (80 GB) | 1–2 |
| Gundam | base_size=1024, image_size=640, crop_mode=True | 64–100 | BF16 | 16–24 GB | 4090 (24 GB), L40S (48 GB), A100 (40 GB) | 2 |
| Gundam-M (200 dpi) | same as Gundam, higher-dpi inputs | 64–100 | BF16 | 24 GB | L40S (48 GB), A100 (40/80 GB), H100 (80 GB) | 1–2 |
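The Image Setup column maps directly onto the base_size, image_size, and crop_mode arguments that the model's infer() helper accepts (per the Hugging Face model card; see the full example in Step 14). A minimal sketch of the presets as a Python dict:

```python
# Variant presets from the table above. The keys of each inner dict mirror
# the arguments accepted by model.infer() per the Hugging Face model card.
VARIANTS = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),
}
```

Gundam-M uses the same arguments as Gundam; only the input DPI changes.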
Environment & Flags (Tested Combo)
| Component | Setting |
|---|---|
| Python / CUDA | Python 3.12, CUDA 11.8 (cu118 wheels) |
| PyTorch | torch==2.6.0 (cu118) |
| Transformers / Tokenizers | transformers==4.46.3, tokenizers==0.20.3 |
| FlashAttention | flash-attn==2.7.3, installed with --no-build-isolation |
| Model load hints | _attn_implementation='flash_attention_2', model.eval().cuda().to(torch.bfloat16) |
| Perf tips | Set CUDA_VISIBLE_DEVICES; use pin_memory=True in data loaders; try --bf16 if serving with vLLM; keep vision tokens at 64–100 for the best compression/speed tradeoff |
Resources
Link: https://huggingface.co/deepseek-ai/DeepSeek-OCR
Step-by-Step Process to Install & Run DeepSeek-OCR Locally
For the purposes of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at scale while meeting GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running DeepSeek-OCR, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- The full CUDA toolkit, including nvcc (a quick sanity check follows this list)
- Proper support for building and running GPU-based models like DeepSeek-OCR
- Compatibility with CUDA 12.1.1, required by certain model operations
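To confirm the toolkit is actually present once the VM is up, here is a small check you can run from Python inside the container (it only shells out to nvcc, which the -devel image provides):

```python
# Sanity check: confirm the CUDA compiler from the -devel image is on PATH.
import shutil
import subprocess

nvcc = shutil.which("nvcc")
if nvcc is None:
    raise SystemExit("nvcc not found: are you on a -devel image rather than -runtime?")
print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```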
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like DeepSeek-OCR.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that the DeepSeek-OCR runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.12 and Pip (VM already has Python 3.10; We Update It)
Run the following command to check the available Python version:
python3 --version
The system ships with Python 3.10.12 by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.12, Pip, and Wheel:
apt install -y python3.12 python3.12-venv python3.12-dev
python3.12 -m ensurepip --upgrade
python3.12 -m pip install --upgrade pip setuptools wheel
python3.12 --version
python3.12 -m pip --version
Step 9: Create and Activate a Python 3.12 Virtual Environment
Run the following commands to create and activate a Python 3.12 virtual environment:
python3.12 -m venv ~/.venvs/ocr
source ~/.venvs/ocr/bin/activate
python --version
pip --version
Step 10: Clone the DeepSeek-OCR Repo
Run the following command to clone the DeepSeek-OCR repo:
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
Step 11: Install PyTorch 2.6.0 (CUDA 11.8)
Activate your venv (if not already), then install the CUDA 11.8 wheels:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
--index-url https://download.pytorch.org/whl/cu118
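Before moving on, it is worth a quick check that the CUDA-enabled build actually landed (plain PyTorch, nothing DeepSeek-specific):

```python
# Verify the cu118 wheels installed and the GPU is visible to PyTorch.
import torch

print(torch.__version__)              # expect 2.6.0+cu118
print(torch.version.cuda)             # expect 11.8
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA RTX A6000"
```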
Step 12: Install All Dependencies from requirements.txt
After successfully installing PyTorch in Step 11, you’ll now install the remaining dependencies required for DeepSeek-OCR:
pip install -r requirements.txt
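To confirm that the pinned versions from the tested combo above were picked up, another quick check:

```python
# Confirm the versions match the tested combination listed earlier.
import tokenizers
import transformers

print(transformers.__version__)  # expect 4.46.3
print(tokenizers.__version__)    # expect 0.20.3
```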
Step 13: Install Wheel and Flash Attention
Run the following commands to install wheel and flash attention:
pip install packaging ninja wheel
pip install "flash-attn==2.7.3" --no-build-isolation --no-binary flash-attn
Step 14: Run the Model
Run the model with the following command:
python infer.py
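If the repository you cloned does not already provide an infer.py, the following minimal sketch reproduces the usage shown on the Hugging Face model card (the prompt string and the infer() arguments come from the card; the image and output paths are placeholders to replace with your own). It uses the Gundam setup from the config table above:

```python
# infer.py: minimal inference sketch following the Hugging Face model card.
import os

import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # pin the process to a single GPU

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",  # needs flash-attn from Step 13
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

# "Gundam" setup from the config table: 1024 base, 640 tiles, cropping on.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
model.infer(
    tokenizer,
    prompt=prompt,
    image_file="your_image.jpg",  # placeholder: your document image
    output_path="./output",       # placeholder: results directory
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```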
Conclusion
DeepSeek-OCR represents a major step forward in document intelligence, combining vision-language reasoning with high-speed GPU inference. Its ability to compress, recognize, and structure multilingual documents with high precision at roughly 10× visual-text compression (see the Fox benchmark table above) places it at the forefront of OCR technology. Whether you're parsing invoices, digitizing academic papers, or processing multilingual PDFs, DeepSeek-OCR delivers excellent efficiency with FlashAttention 2 acceleration. By following this tutorial, you've set up a complete GPU-powered DeepSeek-OCR environment ready for large-scale OCR and document understanding tasks.