Ovis2.5-9B is a state-of-the-art Multimodal Large Language Model (MLLM) developed by AIDC-AI. It brings together native-resolution vision perception via NaViT (a native-resolution vision transformer) and deep multimodal reasoning that combines Chain-of-Thought (CoT) with reflective thinking. What sets it apart is its ability to process images at their original resolution, which is crucial for tasks like chart and document OCR, layout understanding, video QA, and complex visual reasoning.
With support for a “thinking mode” and “thinking budget,” the model balances accuracy and latency by optionally allowing multiple rounds of internal reasoning. It is ranked among the top-performing open-source models under 40B parameters and delivers powerful performance even on resource-constrained setups—following the “small model, big performance” philosophy.
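As a quick illustration of how these two controls fit together, here is a minimal sketch; the variable names mirror the inference script in Step 15, and the `+ 25` headroom is the constraint noted in the official sample code:

```python
# Minimal sketch of the thinking-mode controls used later in Step 15.
enable_thinking = True     # let the model reason internally before answering
thinking_budget = 2048     # tokens reserved for that internal reasoning
max_new_tokens = 3072      # total generation budget; keep it > thinking_budget + 25
```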
Ovis2.5-9B vs Other MLLMs – Benchmark Scores Table
| Benchmark | Ovis2.5-9B | Ovis2.5-2B | Ovis2-8B | Qwen2.5-VL-7B | GLM-4.1V-9B-Thinking | GPT-4o |
|---|---|---|---|---|---|---|
| OpenCompass | 78.3 | 73.9 | 71.8 | 70.9 | 76.1 | 75.4 |
| MMMU | 71.2 | 59.8 | 57.4 | 58.0 | 68.0 | 72.9 |
| MathVista | 83.4 | 81.4 | 71.8 | 68.1 | 80.7 | 71.6 |
| OCRBench v2 | 60.7 | 57.3 | 51.2 | 45.5 | 59.0 | 39.4 |
| ChartQA Pro | 63.8 | 59.6 | 53.1 | 51.6 | 56.2 | 44.6 |
| BLINK | 67.3 | 65.7 | 55.0 | 56.4 | 65.1 | 66.4 |
GPU Configuration Table for Ovis2.5-9B
| Configuration | Recommended Minimum | Recommended Optimal |
|---|---|---|
| GPU | 1× A100 (40GB) or H100 (40GB) | 1× A100 (80GB) / 2× A6000 (48GB) |
| vRAM (GPU Memory) | ≥ 40 GB | 60–80 GB |
| CPU | ≥ 16 vCPU | 32 vCPU |
| RAM (System Memory) | ≥ 64 GB | 128 GB |
| Disk Storage | ≥ 100 GB SSD | 200–500 GB SSD |
| CUDA Version | 12.0 or above | 12.1 or above |
| Torch Version | 2.4.0 | 2.4.0 (compiled with flash-attn) |
| Flash Attention | flash-attn==2.7.0.post2 | With --no-build-isolation |
| Inference Interface | Terminal + Python (stream/static) | Supports GUI with Gradio/WebUI |
OpenCompass Evaluation Suite – General Multimodal Benchmarks
| Model | MMB | MMS | MMMU | MathVista | HB | AI2D | OCR | MMVet | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | 88.3 | 73.6 | 74.7 | 80.9 | 64.1 | 89.5 | 86.2 | 83.3 | 80.1 |
| GPT-4o | 86.0 | 70.2 | 72.9 | 71.6 | 57.0 | 86.3 | 82.2 | 76.9 | 75.4 |
| Ovis2-8B | 83.6 | 64.6 | 57.4 | 71.8 | 56.3 | 86.6 | 89.1 | 65.1 | 71.8 |
| Qwen2.5-VL-7B | 82.2 | 64.1 | 58.0 | 68.1 | 51.9 | 84.3 | 88.8 | 69.7 | 70.9 |
| InternVL3-8B | 82.1 | 68.7 | 62.2 | 70.5 | 49.0 | 85.1 | 88.4 | 82.8 | 73.6 |
| MiMo-VL-7B-RL-2508 | 83.9* | 72.7* | 70.6 | 79.7* | 65.3* | 85.3* | 88.6 | 73.4 | 77.4* |
| Keye-VL-8B | 79.4* | 75.5 | 71.4 | 80.7 | 67.0 | 86.7 | 85.1 | 67.6 | 76.7* |
| GLM-4.1V-9B-Thinking | 85.3 | 72.9 | 68.0 | 80.7 | 63.7* | 87.9 | 84.2 | 66.2 | 76.1* |
| Ovis2.5-9B | 84.9 | 72.4 | 71.2 | 83.4 | 65.1 | 87.7 | 87.9 | 74.0 | 78.3 |

Abbreviations: MMB = MMBench, MMS = MMStar, HB = HallusionBench, OCR = OCRBench, MMVet = MM-Vet.
Multimodal Reasoning Benchmarks
| Model | MMMU | MPro | MathVista | MathVerse | MathVision | LV | WM | DM |
|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | 74.7 | – | 80.9 | 76.9 | 69.1 | 73.8 | 78.0 | 56.3 |
| GPT-4o | 72.9 | – | 71.6 | 49.9 | 43.8 | 64.4 | 50.6 | 48.5 |
| Ovis2-8B | 57.4 | 34.9 | 71.8 | 42.3 | 25.9 | 39.4 | 27.2 | 20.4 |
| Qwen2.5-VL-7B | 58.0 | 38.3 | 68.1 | 41.1 | 25.4 | 47.9 | 36.2 | 21.8 |
| InternVL3-8B | 62.2 | 42.3* | 70.5 | 38.5 | 30.0 | 44.5 | 39.5 | 25.7 |
| MiMo-VL-7B-RL-2508 | 70.6 | 45.7* | 79.7* | 71.6* | 58.5* | 64.5 | 65.6* | 48.3* |
| Keye-VL-8B | 71.4 | 39.0* | 80.7 | 59.8 | 46.0 | 54.8 | 60.7 | 37.3 |
| GLM-4.1V-9B-Thinking | 68.0 | 57.1 | 80.7 | 68.8* | 49.4* | 54.1* | 63.8 | 38.9* |
| Ovis2.5-9B | 71.2 | 54.4 | 83.4 | 71.1 | 53.9 | 61.5 | 66.7 | 44.1 |

Abbreviations: MPro = MMMU-Pro, LV = LogicVista, WM = WeMath, DM = DynaMath.
Step-by-Step Process to Install & Run Ovis2.5-9B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
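If you opt for SSH keys, a typical key pair can be generated locally like this (NodeShift's documentation covers the exact format they expect; the email comment below is a placeholder):

```bash
ssh-keygen -t ed25519 -C "your_email@example.com"   # generate a key pair (placeholder email)
cat ~/.ssh/id_ed25519.pub                           # public key to paste into NodeShift
```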
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Ovis2.5-9B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
```
nvidia/cuda:12.1.1-devel-ubuntu22.04
```
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based applications like Ovis2.5-9B
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Ovis2.5-9B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Ovis2.5-9B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
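The exact command is shown in NodeShift's Connect dialog; it generally looks like the following sketch, where the key path, user, IP, and port are all placeholders taken from that dialog:

```bash
ssh -i ~/.ssh/id_ed25519 user@<vm-ip> -p <port>   # all values come from the Connect dialog
```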
Next, if you want to check the GPU details, run the command below:
```bash
nvidia-smi
```
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the Python version currently available:
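```bash
python3 --version
```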
The system ships with Python 3.8.1 by default. To install a higher version of Python, you'll need to use the deadsnakes PPA. Run the following commands to add it:
```bash
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
```
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
```bash
sudo apt install -y python3.11 python3.11-venv python3.11-dev
```
Step 10: Update the Default python3 Version
Now, run the following commands to register both versions and link the new Python version as the default `python3`:
```bash
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
```
Then, run the following command to verify that the new Python version is active:
```bash
python3 --version
```
Step 11: Install and Update Pip
Run the following commands to install and update pip:
```bash
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
```
Then, run the following command to check the version of pip:
```bash
pip --version
```
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
```bash
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv ovis
source ovis/bin/activate
```
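To confirm the virtual environment is active, you can check which interpreter the shell now resolves:

```bash
which python
python --version   # should report Python 3.11.x while the venv is active
```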
Step 13: Install Dependencies
Run the following command to install dependencies:
```bash
pip install torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3
pip install wheel
pip install flash-attn==2.7.0.post2 --no-build-isolation
```
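Optionally, before moving on, you can run a quick sanity check to confirm that PyTorch sees the GPU and that flash-attn built and imports cleanly (run inside the activated venv):

```bash
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"
```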
Step 14: Connect to your GPU VM using Remote SSH
- Open VS Code, Cursor, or your code editor of choice on your Mac.
- Press `Cmd + Shift + P`, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you'll see `SSH: 149.7.4.3` (your VM IP) in the bottom-left status bar (like in the image).
Step 15: Create a New Python Script (ovis.py) and Add the Following Code
Create a new Python script (example: `ovis.py`) and add the following code:
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM

MODEL_PATH = "AIDC-AI/Ovis2.5-9B"

# Thinking mode & budget
enable_thinking = True
enable_thinking_budget = True  # Only effective if enable_thinking is True.

# Total tokens for thinking + answer. Ensure: max_new_tokens > thinking_budget + 25
max_new_tokens = 3072
thinking_budget = 2048

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open(requests.get("https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png", stream=True).raw)},
        {"type": "text", "text": "Calculate the sum of the numbers in the middle box in figure (c)."},
    ],
}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=enable_thinking,
)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
)

response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Step 16: Run the Script
Run the script with the following command:
```bash
python3 ovis.py
```
Up until now, we’ve been running and interacting with our model directly from the terminal. That worked fine for quick tests, but now let’s make things smoother and more user-friendly by running it inside a browser interface. For that, we’ll use Streamlit, a lightweight Python framework that lets us build interactive web apps in just a few lines of code.
Step 17: Install Required Libraries for Browser App
Run the following command to install required libraries for browser app:
```bash
pip install streamlit torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3 flash-attn==2.7.0.post2 --no-build-isolation
```
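Note that `--no-build-isolation` applies to every package on that line. Since everything except streamlit was already installed in Step 13, you can equivalently split the install and keep the flag scoped to flash-attn:

```bash
pip install streamlit
pip install flash-attn==2.7.0.post2 --no-build-isolation
```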
Step 18: Create the Streamlit App Script (app.py)
We'll write a full Streamlit UI that lets you upload an image, ask a question, run Ovis2.5-9B, and view clean output.
Create `app.py` in your VM (inside your project folder) and add the following code:
```python
import streamlit as st
import torch
from transformers import AutoModelForCausalLM
from PIL import Image

st.set_page_config(page_title="Ovis2.5-9B Visual Reasoning", layout="wide")
st.title("📊 Ovis2.5-9B - Visual Reasoning with Image & Text")

# Load model once and cache it across Streamlit reruns
@st.cache_resource(show_spinner="Loading Ovis2.5-9B Model...")
def load_model():
    model = AutoModelForCausalLM.from_pretrained(
        "AIDC-AI/Ovis2.5-9B",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    ).cuda()
    return model

model = load_model()

# Input widgets
uploaded_image = st.file_uploader("Upload an Image (chart, diagram, document, etc.)", type=["png", "jpg", "jpeg"])
question = st.text_area("Ask a Question about the Image", placeholder="e.g. What is the total of the values in the middle box of figure (c)?")
enable_thinking = st.checkbox("Enable Deep Thinking Mode", value=True)
enable_thinking_budget = st.checkbox("Enable Thinking Budget", value=True)
max_new_tokens = st.slider("Max Tokens (total)", min_value=256, max_value=4096, value=3072, step=64)
thinking_budget = st.slider("Thinking Budget", min_value=0, max_value=3072, value=2048, step=64)

if st.button("Run Inference") and uploaded_image and question:
    image = Image.open(uploaded_image).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]

    st.info("⏳ Generating response...")
    with torch.no_grad():
        input_ids, pixel_values, grid_thws = model.preprocess_inputs(
            messages=messages,
            add_generation_prompt=True,
            enable_thinking=enable_thinking,
        )
        input_ids = input_ids.cuda()
        pixel_values = pixel_values.cuda() if pixel_values is not None else None
        grid_thws = grid_thws.cuda() if grid_thws is not None else None

        outputs = model.generate(
            inputs=input_ids,
            pixel_values=pixel_values,
            grid_thws=grid_thws,
            enable_thinking=enable_thinking,
            enable_thinking_budget=enable_thinking_budget,
            max_new_tokens=max_new_tokens,
            thinking_budget=thinking_budget,
        )

    decoded_output = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
    st.success("✅ Response Generated")
    st.text_area("Model Output", value=decoded_output, height=300)
```
Step 19: Run the App
```bash
streamlit run app.py
```
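Streamlit listens on port 8501 by default. Because the app runs on a remote VM, you'll need to reach that port from your local browser; two common options (the user/host values below are placeholders for your VM's SSH details):

```bash
# Option 1: forward the port over SSH from your local machine
ssh -L 8501:localhost:8501 user@your-vm-ip   # placeholders: your VM's SSH user/IP

# Option 2: bind Streamlit to all interfaces on the VM, then open http://<vm-ip>:8501
streamlit run app.py --server.address 0.0.0.0 --server.port 8501
```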
Step 20: Upload an Image and Ask a Question
In the Streamlit UI:
- Drag & drop your image (e.g., charts, graphs, OCR-based images, etc.).
- Enter a question related to the image in the text area. Example: `What is the total of the values in the middle box of figure (c)? End your response with 'Final answer: '`
- Select:
  - Enable Deep Thinking Mode (recommended for complex reasoning).
  - Enable Thinking Budget.
- Adjust:
  - Max Tokens to 3072.
  - Thinking Budget to 2048.
- Click Run Inference.
Step 21: View the Output
After clicking Run Inference, the model will:
- Process the image.
- Interpret the question.
- Run the Ovis2.5-9B model with visual + text reasoning.
- Display the output in a scrollable text area.
Conclusion
In a world where visuals speak louder than words, Ovis2.5-9B gives us the power to not just see images—but to understand them, reason through them, and extract structured insight from them like never before. Whether you’re decoding complex charts, making sense of scanned documents, or asking deep questions about visual layouts, this model brings a new dimension to multimodal intelligence.
With just a few commands, a powerful GPU VM, and a streamlined Streamlit interface, you’ve built a full-blown visual reasoning system—accessible right from your browser. The “thinking mode” and “thinking budget” features make this model truly next-gen, giving you fine-grained control over accuracy vs. speed.
And the best part? It runs entirely on your terms—your machine, your interface, your control.