MiniCPM-V 4.5 is the latest milestone in the MiniCPM Vision-Language series by OpenBMB. Built on Qwen3-8B with a SigLIP2-400M vision encoder, this model delivers GPT-4o-level multimodal performance with only ~8.7B parameters. It outperforms models like GPT-4o-latest and Gemini 2.0 Pro in OCR, document parsing, and video understanding—all while being lightweight enough to run on your phone.
Key Highlights
- Best-in-class MLLM under 30B: Tops OpenCompass with a 77.0 avg score
- 96× video token compression via 3D-Resampler for long & high-FPS video
- Switchable fast vs. deep thinking modes via hybrid RL training
- Top-tier OCR & document reasoning using UHD vision input
- Multilingual understanding (30+ languages) and trustworthy outputs (via RLAIF-V)
It supports various deployment methods: Llama.cpp, Ollama, vLLM, SGLang, Gradio UI, and even iOS demos—making it one of the most accessible MLLMs ever.
Scenario | GPU(s) | VRAM per GPU | Total VRAM | Precision | Inference Time | Min Disk | RAM (Sys) | Notes |
---|---|---|---|---|---|---|---|---|
Full Precision (FP16/bf16) | 1× A100 / H100 | 40–80 GB | 40–80 GB | bfloat16 / FP16 | ~0.26h (Video-MME) | 60 GB | 32–64 GB | Recommended for research + eval |
Quantized (AWQ / GGUF Int4) | 1× RTX 3090 / A6000 | 24 GB | 24 GB | INT4 / INT8 | ~2× faster | 40 GB | 16–32 GB | For local/Ollama use on consumer GPUs (see the 4-bit loading sketch below this table) |
SGLang / vLLM Server (Batch) | 2× A100 | 40 GB each | 80 GB | bfloat16 | Highly optimized | 80 GB | 64+ GB | For high-throughput deployment |
Mobile Demo (iOS/iPad M4) | M4 Chip | Shared Memory | – | INT4 | Realtime | – | – | iOS app optimized for interactive use |
CPU-only (Llama.cpp/Ollama) | No GPU | – | – | INT4 | Slowest | 30 GB | 16+ GB | For testing only, not recommended for video |
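As a preview of the quantized scenario above, here is a minimal sketch of loading the model in 4-bit with bitsandbytes (the same approach used in the Streamlit app later in this guide; it assumes bitsandbytes is installed and roughly 24 GB of VRAM is available):
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (keeps activations in bfloat16)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)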
Inference Efficiency
OpenCompass
Model | Size | Avg Score ↑ | Total Inference Time ↓ |
---|---|---|---|
GLM-4.1V-9B-Thinking | 10.3B | 76.6 | 17.5h |
MiMo-VL-7B-RL | 8.3B | 76.4 | 11h |
MiniCPM-V 4.5 | 8.7B | 77.0 | 7.5h |
Video-MME
Model | Size | Avg Score ↑ | Total Inference Time ↓ | GPU Mem ↓ |
---|---|---|---|---|
Qwen2.5-VL-7B-Instruct | 8.3B | 71.6 | 3h | 60G |
GLM-4.1V-9B-Thinking | 10.3B | 73.6 | 2.63h | 32G |
MiniCPM-V 4.5 | 8.7B | 73.5 | 0.26h | 28G |
Both Video-MME and OpenCompass were evaluated on 8× A100 GPUs. The reported Video-MME inference time covers full model-side computation and excludes the external cost of video frame extraction (which depends on the specific extraction tool) for a fair comparison.
Resources
Link: https://huggingface.co/openbmb/MiniCPM-V-4_5
Step-by-Step Process to Install & Run MiniCPM-V-4_5 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running MiniCPM-V-4_5, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like MiniCPM-V-4_5.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like MiniCPM-V-4_5.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the MiniCPM-V-4_5 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available. Check if pip exists:
pip3 --version
If you get an error like command not found, install pip manually via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system. You may see a warning about running as root; that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv minicpm
source minicpm/bin/activate
Step 10: Install PyTorch with CUDA Support
Run the following command to install PyTorch with CUDA support:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
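Once the install finishes, a quick sanity check (run inside the activated virtual environment) confirms that the CUDA build of PyTorch can see the GPU; the exact version string and device name will vary:
import torch

print(torch.__version__)              # e.g. 2.x.x+cu121
print(torch.cuda.is_available())      # should be True on the GPU VM
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA H100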
Step 11: Clone The Repo
Run the following command to clone the MiniCPM-V repo:
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
Step 12: Install Required Packages
Run the following command to install required packages:
pip install -r requirements.txt
Step 13: Install Additional Packages Depending on Usage
Run the following command to install additional packages:
pip install transformers accelerate pillow decord scikit-learn scipy
Step 14: Install Decord
If you want to process videos, run the following command to install decord:
pip install decord # For video frame loading
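To verify that decord can read your video, a minimal check (assuming a local file named whale.mp4, the same file used in the video inference step later) prints the frame count and average FPS:
from decord import VideoReader, cpu

vr = VideoReader("whale.mp4", ctx=cpu(0))
print(len(vr), "frames at", round(vr.get_avg_fps(), 1), "fps")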
Step 15: Connect to Your GPU VM with a Code Editor
Before you start running transformer and streamlit scripts with the MiniCPM-V-4_5 model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 16: Create app.py and Load the Model
Step 1: Create a new file named app.py
- Open your preferred code editor (e.g., VS Code, PyCharm, or any text editor).
- Create a new file and name it app.py.
Step 2: Add the following code to app.py
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "openbmb/MiniCPM-V-4_5"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"  # Automatically uses GPU if available
).eval()
Step 3: Run the script to download the model
python app.py
This will:
- Download the tokenizer and model from Hugging Face.
- Load the model into memory (using GPU if available); a quick sanity check is sketched below.
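To confirm the weights actually landed on the GPU, you can append this optional check to app.py; the printed parameter count should be roughly the ~8.7B mentioned earlier:
# Optional sanity check: parameter count and device placement
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e9:.2f}B")
print("First parameter on:", next(model.parameters()).device)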
Step 17: Run Image Inference with MiniCPM-V-4_5
Create a new file named app2.py and add the following code to it:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
# Load model
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
# Load image
image = Image.open("Whale.jpg").convert("RGB")
# Ask a question
msgs = [{'role': 'user', 'content': [image, "Describe this image"]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=False
)
print(res)
Step 18: Run the script
Now, run the script using the following command in your terminal:
python app2.py
The model will process the image and generate a description like:
A whale's tail emerging from deep blue ocean water, creating ripples and splashes.
Expected Output
The output will be a natural language description of the image based on the model’s vision-language understanding.
Example:
A large whale tail is visible above the surface of the dark blue ocean, with water splashing around it. The scene captures the moment just before the whale dives back down.
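The same script can also exercise the deep thinking mode mentioned in the highlights. Assuming the chat API shown above, the only change in app2.py is the enable_thinking flag; expect a longer, more deliberate answer at the cost of extra latency:
# Deep thinking mode: same msgs, only the flag changes
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=True
)
print(res)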
Step 20: Run Video Inference with MiniCPM-V-4_5
Create a new file named app3.py and add the following code to it:
from PIL import Image
from decord import VideoReader, cpu
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
# --- Load Model ---
print("Loading model...")
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
model = model.eval().cuda() # Remove .cuda() if no GPU
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
# --- Video Frame Loader ---
def load_video_frames(video_path, num_frames=48):
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f).convert("RGB") for f in frames]
# --- Load Frames ---
print("Loading video frames...")
frames = load_video_frames("whale.mp4", num_frames=48)
print(f"Loaded {len(frames)} frames.")
# --- Pack Frames into Groups (e.g., 6 frames per group) ---
packing_num = 6 # Must be between 1 and 6
grouped_frames = [frames[i:i + packing_num] for i in range(0, len(frames), packing_num)]
# Create temporal_ids: each group of frames shares the same temporal ID
temporal_ids = []
for i, group in enumerate(grouped_frames):
    temporal_ids.extend([i] * len(group))  # All frames in group i get temporal ID = i
# Flatten frames list
flat_frames = [img for group in grouped_frames for img in group]
# --- Prepare Message ---
msgs = [
    {
        'role': 'user',
        'content': flat_frames + ["Describe the video in detail."]
    }
]
# --- Get Response ---
print("Generating description...")
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    temporal_ids=[temporal_ids],  # Must be a list of lists; for a single video → [temporal_ids]
    max_slice_nums=1,
    # enable_thinking=False
)
print("🤖 Model Response:")
print(answer)
Step 21: Run the Script
Now, run the script using the following command in your terminal:
python3 app3.py
What This Does
- Loads the MiniCPM-V-4_5 model.
- Uses decord to efficiently sample 48 frames from the video.
- Groups frames into batches (e.g., 6 frames per group) for input compatibility; see the short sketch after this list.
- Sends the video frames to the model with the prompt: “Describe the video in detail.”
- Prints a natural language description of the video content.
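To make the packing step concrete, here is the same grouping logic from app3.py in isolation, with integers standing in for the PIL frames (48 frames packed 6 per group yields 8 temporal groups):
num_frames, packing_num = 48, 6
frames = list(range(num_frames))  # stand-ins for the PIL images
grouped = [frames[i:i + packing_num] for i in range(0, num_frames, packing_num)]
temporal_ids = [i for i, group in enumerate(grouped) for _ in group]
print(len(grouped))       # 8 groups
print(temporal_ids[:12])  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]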
Step 22: Install Streamlit
Run the following command to install streamlit:
pip install streamlit
Step 23: Create the Streamlit App Script (app_streamlit.py)
We’ll write a full Streamlit UI that lets you generate responses from the model in the browser.
Create app_streamlit.py in your VM (inside your project folder) and add the following code:
# app_streamlit.py
import os
import torch
import streamlit as st
from PIL import Image
from decord import VideoReader, cpu
import numpy as np
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# --- Page Config ---
st.set_page_config(page_title="MiniCPM-V 4.5", page_icon="🖼️", layout="centered")
st.title("🖼️ MiniCPM-V-4.5: Vision & Video Understanding")
st.markdown("Ask questions about images or videos using the powerful **MiniCPM-V-4.5** model.")

# --- Load Model ---
@st.cache_resource
def load_model():
    st.info("Loading model... This may take a minute.")

    # Optional: 4-bit quantization (recommended for 24GB VRAM or less)
    try:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        use_quant = True
        st.write("✅ Using 4-bit quantization (saves VRAM)")
    except Exception:
        st.warning("bitsandbytes not available. Running in FP16 (requires ~18-24GB VRAM)")
        bnb_config = None
        use_quant = False

    model = AutoModel.from_pretrained(
        "openbmb/MiniCPM-V-4_5",
        trust_remote_code=True,
        quantization_config=bnb_config if use_quant else None,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="sdpa"
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
    st.success("Model loaded successfully!")
    return model, tokenizer

# --- Load Video Frames ---
def load_video_frames(video_path, num_frames=48):
    vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(vr)
    frame_idx = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f).convert("RGB") for f in frames]

# --- Initialize Model ---
try:
    model, tokenizer = load_model()
except Exception as e:
    st.error(f"Failed to load model: {e}")
    st.stop()

# --- Sidebar Inputs ---
st.sidebar.header("Input Settings")
input_type = st.sidebar.radio("Input Type", ["Image", "Video"])
enable_thinking = st.sidebar.checkbox("Enable Deep Thinking Mode", False)

# --- Main Input ---
question = st.text_input("💬 Your Question", placeholder="E.g., Describe this scene or What is happening?")

uploaded_file = None
if input_type == "Image":
    uploaded_file = st.file_uploader("📤 Upload an Image", type=["png", "jpg", "jpeg"])
else:
    uploaded_file = st.file_uploader("📹 Upload a Video (MP4)", type=["mp4"])

# --- Process & Predict ---
if st.button("🚀 Generate Response"):
    if not uploaded_file:
        st.error("Please upload an image or video.")
    elif not question.strip():
        st.error("Please enter a question.")
    else:
        with st.spinner("🧠 Model is thinking..."):
            try:
                if input_type == "Image":
                    # Open image
                    image = Image.open(uploaded_file).convert("RGB")
                    msgs = [{'role': 'user', 'content': [image, question]}]

                    response = model.chat(
                        msgs=msgs,
                        tokenizer=tokenizer,
                        enable_thinking=enable_thinking
                    )

                    # Display result
                    st.image(image, caption="Uploaded Image", use_container_width=True)
                    st.markdown(f"**Q:** {question}")
                    st.markdown(f"**A:** {response}")

                elif input_type == "Video":
                    # Save uploaded video temporarily
                    temp_video_path = "temp_video.mp4"
                    with open(temp_video_path, "wb") as f:
                        f.write(uploaded_file.read())

                    frames = load_video_frames(temp_video_path, num_frames=48)

                    # Pack frames into groups of 6
                    packing_num = 6
                    grouped_frames = [frames[i:i + packing_num] for i in range(0, len(frames), packing_num)]

                    # 🔥 CRITICAL FIX: Create temporal_ids as torch.LongTensor
                    temporal_ids_list = []
                    for i in range(len(grouped_frames)):
                        temporal_ids_list.extend([i] * len(grouped_frames[i]))
                    temporal_ids = torch.tensor(temporal_ids_list, dtype=torch.long)  # Must be torch.long

                    flat_frames = [img for group in grouped_frames for img in group]
                    msgs = [{'role': 'user', 'content': flat_frames + [question]}]

                    response = model.chat(
                        msgs=msgs,
                        tokenizer=tokenizer,
                        use_image_id=False,
                        temporal_ids=[temporal_ids],  # Now correct dtype
                        max_slice_nums=1
                    )

                    st.video(temp_video_path)
                    st.markdown(f"**Q:** {question}")
                    st.markdown(f"**A:** {response}")

                    # Cleanup
                    if os.path.exists(temp_video_path):
                        os.remove(temp_video_path)

            except torch.cuda.OutOfMemoryError:
                st.error("❌ CUDA Out of Memory! Try reducing `num_frames` or use a smaller model.")
            except Exception as e:
                st.error(f"❌ Error during inference: {str(e)}")
                import traceback
                st.code(traceback.format_exc())
Step 24: Launch the Streamlit App
Now that we’ve written our app_streamlit.py Streamlit script, the next step is to launch the app from the terminal.
Run the following command inside your VM:
streamlit run app_streamlit.py
Once executed, Streamlit will start the web server and you’ll see a message:
You can now view your Streamlit app in your browser.
URL: http://0.0.0.0:8501
Step 25: Access the Streamlit App in Browser
After launching the app, you’ll see the interface in your browser.
Go to:
http://localhost:8501
If the app is running on a remote VM, replace localhost with the VM’s public IP (with port 8501 open) or forward the port over SSH.
Step 26: Upload an Image or Video and Generate a Response
Upload an image or video, enter your question, and click Generate Response to view the model’s answer.
Conclusion
MiniCPM-V 4.5 proves that cutting-edge multimodal intelligence doesn’t need massive infrastructure to deliver world-class results. With its state-of-the-art OCR, document reasoning, and high-FPS video understanding, it stands tall among giants like GPT-4o and Gemini 2.0 Pro—yet remains lightweight enough to run on a phone or local GPU.
Whether you’re a researcher, developer, or hobbyist, MiniCPM-V 4.5 offers unmatched flexibility through Llama.cpp, Ollama, vLLM, Streamlit UI, and even iOS apps, making it one of the most accessible MLLMs available today. By following the step-by-step guide above, you can deploy and interact with this powerful model on NodeShift Cloud or any GPU environment with ease.
In short: MiniCPM-V 4.5 isn’t just a model—it’s a complete vision-language ecosystem that brings frontier performance right to your fingertips.