MiniCPM-V 4.5 is the latest milestone in the MiniCPM Vision-Language series by OpenBMB. Built on Qwen3-8B with a SigLIP2-400M vision encoder, this model delivers GPT-4o-level multimodal performance with only ~8.7B parameters. It outperforms models like GPT-4o-latest and Gemini 2.0 Pro in OCR, document parsing, and video understanding—all while being lightweight enough to run on your phone.
Key Highlights
- Best-in-class MLLM under 30B: Tops OpenCompass with a 77.0 avg score
- 96× video token compression via 3D-Resampler for long & high-FPS video
- Switchable fast vs. deep thinking modes via hybrid RL training
- Top-tier OCR & document reasoning using UHD vision input
- Multilingual understanding (30+ languages) and trustworthy outputs (via RLAIF-V)
It supports various deployment methods: Llama.cpp, Ollama, vLLM, SGLang, Gradio UI, and even iOS demos—making it one of the most accessible MLLMs ever.
Scenario | GPU(s) | VRAM per GPU | Total VRAM | Precision | Inference Time | Min Disk | RAM (Sys) | Notes |
---|---|---|---|---|---|---|---|---|
Full Precision (FP16/bf16) | 1× A100 / H100 | 40–80 GB | 40–80 GB | bfloat16 / FP16 | ~0.26h (Video-MME) | 60 GB | 32–64 GB | Recommended for research + eval |
Quantized (AWQ / GGUF Int4) | 1× RTX 3090 / A6000 | 24 GB | 24 GB | INT4 / INT8 | ~2× faster | 40 GB | 16–32 GB | For local/Ollama use on consumer GPUs (see the 4-bit loading sketch below this table) |
SGLang / vLLM Server (Batch) | 2× A100 | 40 GB each | 80 GB | bfloat16 | Highly optimized | 80 GB | 64+ GB | For high-throughput deployment |
Mobile Demo (iOS/iPad M4) | M4 Chip | Shared Memory | – | INT4 | Realtime | – | – | iOS app optimized for interactive use |
CPU-only (Llama.cpp/Ollama) | No GPU | – | – | INT4 | Slowest | 30 GB | 16+ GB | For testing only, not recommended for video |
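As a preview of the quantized scenario above, here is a minimal sketch of loading the model in 4-bit with bitsandbytes (the same approach used in the Streamlit app later in this guide; it assumes bitsandbytes is installed and roughly 24 GB of VRAM is available):
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (keeps activations in bfloat16)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)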
Inference Efficiency
OpenCompass
Model | Size | Avg Score ↑ | Total Inference Time ↓ |
---|---|---|---|
GLM-4.1V-9B-Thinking | 10.3B | 76.6 | 17.5h |
MiMo-VL-7B-RL | 8.3B | 76.4 | 11h |
MiniCPM-V 4.5 | 8.7B | 77.0 | 7.5h |
Video-MME
Model | Size | Avg Score ↑ | Total Inference Time ↓ | GPU Mem ↓ |
---|---|---|---|---|
Qwen2.5-VL-7B-Instruct | 8.3B | 71.6 | 3h | 60G |
GLM-4.1V-9B-Thinking | 10.3B | 73.6 | 2.63h | 32G |
MiniCPM-V 4.5 | 8.7B | 73.5 | 0.26h | 28G |
Both Video-MME and OpenCompass were evaluated on 8× A100 GPUs. The reported Video-MME inference time covers full model-side computation and excludes the external cost of video frame extraction (which depends on the specific extraction tool) for a fair comparison.
Resources
Link: https://huggingface.co/openbmb/MiniCPM-V-4_5
Step-by-Step Process to Install & Run MiniCPM-V-4_5 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running MiniCPM-V-4_5, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like MiniCPM-V-4_5.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like MiniCPM-V-4_5.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the MiniCPM-V-4_5 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available. Check if pip exists:
pip3 --version
If you get an error like command not found, install pip manually via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system. You may see a warning about running as root; that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv minicpm
source minicpm/bin/activate
Step 10: Install PyTorch with CUDA Support
Run the following command to install PyTorch with CUDA support:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
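Once the install finishes, a quick sanity check (run inside the activated virtual environment) confirms that the CUDA build of PyTorch can see the GPU; the exact version string and device name will vary:
import torch

print(torch.__version__)              # e.g. 2.x.x+cu121
print(torch.cuda.is_available())      # should be True on the GPU VM
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA H100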
Step 11: Clone The Repo
Run the following command to clone the MiniCPM-V repo:
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
Step 12: Install Required Packages
Run the following command to install required packages:
pip install -r requirements.txt
Step 13: Install Additional Packages Depending on Usage
Run the following command to install additional packages:
pip install transformers accelerate pillow decord scikit-learn scipy
Step 14: Install Decord
If you want to process videos, run the following command to install decord:
pip install decord # For video frame loading
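To verify that decord can read your video, a minimal check (assuming a local file named whale.mp4, the same file used in the video inference step later) prints the frame count and average FPS:
from decord import VideoReader, cpu

vr = VideoReader("whale.mp4", ctx=cpu(0))
print(len(vr), "frames at", round(vr.get_avg_fps(), 1), "fps")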
Step 15: Connect to Your GPU VM with a Code Editor
Before you start running transformer and streamlit scripts with the MiniCPM-V-4_5 model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 16: Create app.py and Load the Model
Step 1: Create a new file named app.py
- Open your preferred code editor (e.g., VS Code, PyCharm, or any text editor).
- Create a new file and name it app.py.
Step 2: Add the following code to app.py
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "openbmb/MiniCPM-V-4_5"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"  # Automatically uses GPU if available
).eval()
Step 3: Run the script to download the model
python app.py
This will:
- Download the tokenizer and model from Hugging Face.
- Load the model into memory (using GPU if available); a quick sanity check is sketched below.
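To confirm the weights actually landed on the GPU, you can append this optional check to app.py; the printed parameter count should be roughly the ~8.7B mentioned earlier:
# Optional sanity check: parameter count and device placement
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e9:.2f}B")
print("First parameter on:", next(model.parameters()).device)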
Step 17: Run Image Inference with MiniCPM-V-4_5
Create a new file named app2.py and add the following code to it:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
# Load model
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
# Load image
image = Image.open("Whale.jpg").convert("RGB")
# Ask a question
msgs = [{'role': 'user', 'content': [image, "Describe this image"]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=False
)
print(res)
Step 18: Run the script
Now, run the script using the following command in your terminal:
python app2.py
The model will process the image and generate a description like:
A whale's tail emerging from deep blue ocean water, creating ripples and splashes.
Expected Output
The output will be a natural language description of the image based on the model’s vision-language understanding.
Example:
A large whale tail is visible above the surface of the dark blue ocean, with water splashing around it. The scene captures the moment just before the whale dives back down.
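The same script can also exercise the deep thinking mode mentioned in the highlights. Assuming the chat API shown above, the only change in app2.py is the enable_thinking flag; expect a longer, more deliberate answer at the cost of extra latency:
# Deep thinking mode: same msgs, only the flag changes
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=True
)
print(res)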
Step 20: Run Video Inference with MiniCPM-V-4_5
Create a new file named app3.py and add the following code to it:
from PIL import Image
from decord import VideoReader, cpu
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
# --- Load Model ---
print("Loading model...")
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
model = model.eval().cuda() # Remove .cuda() if no GPU
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
# --- Video Frame Loader ---
def load_video_frames(video_path, num_frames=48):
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f).convert("RGB") for f in frames]
# --- Load Frames ---
print("Loading video frames...")
frames = load_video_frames("whale.mp4", num_frames=48)
print(f"Loaded {len(frames)} frames.")
# --- Pack Frames into Groups (e.g., 6 frames per group) ---
packing_num = 6 # Must be between 1 and 6
grouped_frames = [frames[i:i + packing_num] for i in range(0, len(frames), packing_num)]
# Create temporal_ids: each group of frames shares the same temporal ID
temporal_ids = []
for i, group in enumerate(grouped_frames):
    temporal_ids.extend([i] * len(group))  # All frames in group i get temporal ID = i
# Flatten frames list
flat_frames = [img for group in grouped_frames for img in group]
# --- Prepare Message ---
msgs = [
    {
        'role': 'user',
        'content': flat_frames + ["Describe the video in detail."]
    }
]
# --- Get Response ---
print("Generating description...")
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    temporal_ids=[temporal_ids],  # Must be a list of lists; for a single video → [temporal_ids]
    max_slice_nums=1,
    # enable_thinking=False
)
print("🤖 Model Response:")
print(answer)
Step 21: Run the Script
Now, run the script using the following command in your terminal:
python3 app3.py
What This Does
- Loads the MiniCPM-V-4_5 model.
- Uses decord to efficiently sample 48 frames from the video.
- Groups frames into batches (e.g., 6 frames per group) for input compatibility; see the short sketch after this list.
- Sends the video frames to the model with the prompt: “Describe the video in detail.”
- Prints a natural language description of the video content.
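To make the packing step concrete, here is the same grouping logic from app3.py in isolation, with integers standing in for the PIL frames (48 frames packed 6 per group yields 8 temporal groups):
num_frames, packing_num = 48, 6
frames = list(range(num_frames))  # stand-ins for the PIL images
grouped = [frames[i:i + packing_num] for i in range(0, num_frames, packing_num)]
temporal_ids = [i for i, group in enumerate(grouped) for _ in group]
print(len(grouped))       # 8 groups
print(temporal_ids[:12])  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]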
Step 22: Install Streamlit
Run the following command to install streamlit:
pip install streamlit
Step 23: Create the Streamlit App Script (app_streamlit.py)
We’ll write a full Streamlit UI that lets you generate responses from the model in the browser.
Create app_streamlit.py in your VM (inside your project folder) and add the following code:
# app_streamlit.py
import os
import torch
import streamlit as st
from PIL import Image
from decord import VideoReader, cpu
import numpy as np
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# --- Page Config ---
st.set_page_config(page_title="MiniCPM-V 4.5", page_icon="🖼️", layout="centered")
st.title("🖼️ MiniCPM-V-4.5: Vision & Video Understanding")
st.markdown("Ask questions about images or videos using the powerful **MiniCPM-V-4.5** model.")

# --- Load Model ---
@st.cache_resource
def load_model():
    st.info("Loading model... This may take a minute.")

    # Optional: 4-bit quantization (recommended for 24GB VRAM or less)
    try:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        use_quant = True
        st.write("✅ Using 4-bit quantization (saves VRAM)")
    except Exception:
        st.warning("bitsandbytes not available. Running in FP16 (requires ~18-24GB VRAM)")
        bnb_config = None
        use_quant = False

    model = AutoModel.from_pretrained(
        "openbmb/MiniCPM-V-4_5",
        trust_remote_code=True,
        quantization_config=bnb_config if use_quant else None,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="sdpa"
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
    st.success("Model loaded successfully!")
    return model, tokenizer

# --- Load Video Frames ---
def load_video_frames(video_path, num_frames=48):
    vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(vr)
    frame_idx = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f).convert("RGB") for f in frames]

# --- Initialize Model ---
try:
    model, tokenizer = load_model()
except Exception as e:
    st.error(f"Failed to load model: {e}")
    st.stop()

# --- Sidebar Inputs ---
st.sidebar.header("Input Settings")
input_type = st.sidebar.radio("Input Type", ["Image", "Video"])
enable_thinking = st.sidebar.checkbox("Enable Deep Thinking Mode", False)

# --- Main Input ---
question = st.text_input("💬 Your Question", placeholder="E.g., Describe this scene or What is happening?")

uploaded_file = None
if input_type == "Image":
    uploaded_file = st.file_uploader("📤 Upload an Image", type=["png", "jpg", "jpeg"])
else:
    uploaded_file = st.file_uploader("📹 Upload a Video (MP4)", type=["mp4"])

# --- Process & Predict ---
if st.button("🚀 Generate Response"):
    if not uploaded_file:
        st.error("Please upload an image or video.")
    elif not question.strip():
        st.error("Please enter a question.")
    else:
        with st.spinner("🧠 Model is thinking..."):
            try:
                if input_type == "Image":
                    # Open image
                    image = Image.open(uploaded_file).convert("RGB")
                    msgs = [{'role': 'user', 'content': [image, question]}]

                    response = model.chat(
                        msgs=msgs,
                        tokenizer=tokenizer,
                        enable_thinking=enable_thinking
                    )

                    # Display result
                    st.image(image, caption="Uploaded Image", use_container_width=True)
                    st.markdown(f"**Q:** {question}")
                    st.markdown(f"**A:** {response}")

                elif input_type == "Video":
                    # Save uploaded video temporarily
                    temp_video_path = "temp_video.mp4"
                    with open(temp_video_path, "wb") as f:
                        f.write(uploaded_file.read())

                    frames = load_video_frames(temp_video_path, num_frames=48)

                    # Pack frames into groups of 6
                    packing_num = 6
                    grouped_frames = [frames[i:i + packing_num] for i in range(0, len(frames), packing_num)]

                    # 🔥 CRITICAL FIX: Create temporal_ids as torch.LongTensor
                    temporal_ids_list = []
                    for i in range(len(grouped_frames)):
                        temporal_ids_list.extend([i] * len(grouped_frames[i]))
                    temporal_ids = torch.tensor(temporal_ids_list, dtype=torch.long)  # Must be torch.long

                    flat_frames = [img for group in grouped_frames for img in group]
                    msgs = [{'role': 'user', 'content': flat_frames + [question]}]

                    response = model.chat(
                        msgs=msgs,
                        tokenizer=tokenizer,
                        use_image_id=False,
                        temporal_ids=[temporal_ids],  # Now correct dtype
                        max_slice_nums=1
                    )

                    st.video(temp_video_path)
                    st.markdown(f"**Q:** {question}")
                    st.markdown(f"**A:** {response}")

                    # Cleanup
                    if os.path.exists(temp_video_path):
                        os.remove(temp_video_path)

            except torch.cuda.OutOfMemoryError:
                st.error("❌ CUDA Out of Memory! Try reducing `num_frames` or use a smaller model.")
            except Exception as e:
                st.error(f"❌ Error during inference: {str(e)}")
                import traceback
                st.code(traceback.format_exc())
Step 24: Launch the Streamlit App
Now that we’ve written our app_streamlit.py Streamlit script, the next step is to launch the app from the terminal.
Run the following command inside your VM:
streamlit run app_streamlit.py
Once executed, Streamlit will start the web server and you’ll see a message:
You can now view your Streamlit app in your browser.
URL: http://0.0.0.0:8501
Step 25: Access the Streamlit App in Browser
After launching the app, you’ll see the interface in your browser.
Go to:
http://localhost:8501
If the app is running on a remote VM, replace localhost with the VM’s public IP (with port 8501 open) or forward the port over SSH.
Step 26: Upload an Image or Video and Generate a Response
Upload an image or video, enter your question, and click Generate Response to view the model’s answer.
Conclusion
MiniCPM-V 4.5 proves that cutting-edge multimodal intelligence doesn’t need massive infrastructure to deliver world-class results. With its state-of-the-art OCR, document reasoning, and high-FPS video understanding, it stands tall among giants like GPT-4o and Gemini 2.0 Pro—yet remains lightweight enough to run on a phone or local GPU.
Whether you’re a researcher, developer, or hobbyist, MiniCPM-V 4.5 offers unmatched flexibility through Llama.cpp, Ollama, vLLM, Streamlit UI, and even iOS apps, making it one of the most accessible MLLMs available today. By following the step-by-step guide above, you can deploy and interact with this powerful model on NodeShift Cloud or any GPU environment with ease.
In short: MiniCPM-V 4.5 isn’t just a model—it’s a complete vision-language ecosystem that brings frontier performance right to your fingertips.