Ovis2.5-9B is a state-of-the-art Multimodal Large Language Model (MLLM) developed by AIDC-AI. It brings together native-resolution vision perception via NaViT (a native-resolution vision transformer) and deep multimodal reasoning that combines Chain-of-Thought (CoT) with reflective thinking. What sets it apart is its ability to process images at their original resolution, which is crucial for tasks like chart and document OCR, layout understanding, video QA, and complex visual reasoning.
With support for a “thinking mode” and “thinking budget,” the model balances accuracy and latency by optionally allowing multiple rounds of internal reasoning. It is ranked among the top-performing open-source models under 40B parameters and delivers powerful performance even on resource-constrained setups—following the “small model, big performance” philosophy.
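As a quick illustration of how these two controls fit together, here is a minimal sketch; the variable names mirror the inference script in Step 15, and the `+ 25` headroom is the constraint noted in the official sample code:

```python
# Minimal sketch of the thinking-mode controls used later in Step 15.
enable_thinking = True     # let the model reason internally before answering
thinking_budget = 2048     # tokens reserved for that internal reasoning
max_new_tokens = 3072      # total generation budget; keep it > thinking_budget + 25
```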
Ovis2.5-9B vs Other MLLMs – Benchmark Scores Table
| Benchmark | Ovis2.5-9B | Ovis2.5-2B | Ovis2-8B | Qwen2.5-VL-7B | GLM-4.1V-9B-Thinking | GPT-4o |
|---|---|---|---|---|---|---|
| OpenCompass | 78.3 | 73.9 | 71.8 | 70.9 | 76.1 | 75.4 |
| MMMU | 71.2 | 59.8 | 57.4 | 58.0 | 68.0 | 72.9 |
| MathVista | 83.4 | 81.4 | 71.8 | 68.1 | 80.7 | 71.6 |
| OCRBench v2 | 60.7 | 57.3 | 51.2 | 45.5 | 59.0 | 39.4 |
| ChartQA Pro | 63.8 | 59.6 | 53.1 | 51.6 | 56.2 | 44.6 |
| BLINK | 67.3 | 65.7 | 55.0 | 56.4 | 65.1 | 66.4 |
GPU Configuration Table for Ovis2.5-9B
| Configuration | Recommended Minimum | Recommended Optimal |
|---|---|---|
| GPU | 1× A100 (40GB) or H100 (40GB) | 1× A100 (80GB) / 2× A6000 (48GB) |
| vRAM (GPU Memory) | ≥ 40 GB | 60–80 GB |
| CPU | ≥ 16 vCPU | 32 vCPU |
| RAM (System Memory) | ≥ 64 GB | 128 GB |
| Disk Storage | ≥ 100 GB SSD | 200–500 GB SSD |
| CUDA Version | 12.0 or above | 12.1 or above |
| Torch Version | 2.4.0 | 2.4.0 (compiled with flash-attn) |
| Flash Attention | flash-attn==2.7.0.post2 | With --no-build-isolation |
| Inference Interface | Terminal + Python (stream/static) | Supports GUI with Gradio/WebUI |
OpenCompass Evaluation Suite – General Multimodal Benchmarks
| Model | MMB | MMS | MMMU | MathVista | HB | AI2D | OCR | MMVet | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | 88.3 | 73.6 | 74.7 | 80.9 | 64.1 | 89.5 | 86.2 | 83.3 | 80.1 |
| GPT-4o | 86.0 | 70.2 | 72.9 | 71.6 | 57.0 | 86.3 | 82.2 | 76.9 | 75.4 |
| Ovis2-8B | 83.6 | 64.6 | 57.4 | 71.8 | 56.3 | 86.6 | 89.1 | 65.1 | 71.8 |
| Qwen2.5-VL-7B | 82.2 | 64.1 | 58.0 | 68.1 | 51.9 | 84.3 | 88.8 | 69.7 | 70.9 |
| InternVL3-8B | 82.1 | 68.7 | 62.2 | 70.5 | 49.0 | 85.1 | 88.4 | 82.8 | 73.6 |
| MiMo-VL-7B-RL-2508 | 83.9* | 72.7* | 70.6 | 79.7* | 65.3* | 85.3* | 88.6 | 73.4 | 77.4* |
| Keye-VL-8B | 79.4* | 75.5 | 71.4 | 80.7 | 67.0 | 86.7 | 85.1 | 67.6 | 76.7* |
| GLM-4.1V-9B-Thinking | 85.3 | 72.9 | 68.0 | 80.7 | 63.7* | 87.9 | 84.2 | 66.2 | 76.1* |
| Ovis2.5-9B | 84.9 | 72.4 | 71.2 | 83.4 | 65.1 | 87.7 | 87.9 | 74.0 | 78.3 |

Abbreviations: MMB = MMBench, MMS = MMStar, HB = HallusionBench, OCR = OCRBench, MMVet = MM-Vet.
Multimodal Reasoning Benchmarks
| Model | MMMU | MPro | MathVista | MathVerse | MathVision | LV | WM | DM |
|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | 74.7 | – | 80.9 | 76.9 | 69.1 | 73.8 | 78.0 | 56.3 |
| GPT-4o | 72.9 | – | 71.6 | 49.9 | 43.8 | 64.4 | 50.6 | 48.5 |
| Ovis2-8B | 57.4 | 34.9 | 71.8 | 42.3 | 25.9 | 39.4 | 27.2 | 20.4 |
| Qwen2.5-VL-7B | 58.0 | 38.3 | 68.1 | 41.1 | 25.4 | 47.9 | 36.2 | 21.8 |
| InternVL3-8B | 62.2 | 42.3* | 70.5 | 38.5 | 30.0 | 44.5 | 39.5 | 25.7 |
| MiMo-VL-7B-RL-2508 | 70.6 | 45.7* | 79.7* | 71.6* | 58.5* | 64.5 | 65.6* | 48.3* |
| Keye-VL-8B | 71.4 | 39.0* | 80.7 | 59.8 | 46.0 | 54.8 | 60.7 | 37.3 |
| GLM-4.1V-9B-Thinking | 68.0 | 57.1 | 80.7 | 68.8* | 49.4* | 54.1* | 63.8 | 38.9* |
| Ovis2.5-9B | 71.2 | 54.4 | 83.4 | 71.1 | 53.9 | 61.5 | 66.7 | 44.1 |

Abbreviations: MPro = MMMU-Pro, LV = LogicVista, WM = WeMath, DM = DynaMath.
Step-by-Step Process to Install & Run Ovis2.5-9B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
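If you opt for SSH keys, a typical key pair can be generated locally like this (NodeShift's documentation covers the exact format they expect; the email comment below is a placeholder):

```bash
ssh-keygen -t ed25519 -C "your_email@example.com"   # generate a key pair (placeholder email)
cat ~/.ssh/id_ed25519.pub                           # public key to paste into NodeShift
```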
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Ovis2.5-9B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
```
nvidia/cuda:12.1.1-devel-ubuntu22.04
```
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based applications like Ovis2.5-9B
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Ovis2.5-9B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Ovis2.5-9B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
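The exact command is shown in NodeShift's Connect dialog; it generally looks like the following sketch, where the key path, user, IP, and port are all placeholders taken from that dialog:

```bash
ssh -i ~/.ssh/id_ed25519 user@<vm-ip> -p <port>   # all values come from the Connect dialog
```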
Next, if you want to check the GPU details, run the command below:
```bash
nvidia-smi
```
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the Python version currently available:
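```bash
python3 --version
```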
The system ships with Python 3.8.1 by default. To install a higher version of Python, you'll need to use the deadsnakes PPA. Run the following commands to add it:
```bash
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
```
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
```bash
sudo apt install -y python3.11 python3.11-venv python3.11-dev
```
Step 10: Update the Default python3 Version
Now, run the following commands to register both versions and link the new Python version as the default `python3`:
```bash
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
```
Then, run the following command to verify that the new Python version is active:
```bash
python3 --version
```
Step 11: Install and Update Pip
Run the following commands to install and update pip:
```bash
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
```
Then, run the following command to check the version of pip:
```bash
pip --version
```
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
```bash
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv ovis
source ovis/bin/activate
```
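To confirm the virtual environment is active, you can check which interpreter the shell now resolves:

```bash
which python
python --version   # should report Python 3.11.x while the venv is active
```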
Step 13: Install Dependencies
Run the following command to install dependencies:
```bash
pip install torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3
pip install wheel
pip install flash-attn==2.7.0.post2 --no-build-isolation
```
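Optionally, before moving on, you can run a quick sanity check to confirm that PyTorch sees the GPU and that flash-attn built and imports cleanly (run inside the activated venv):

```bash
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"
```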
Step 14: Connect to your GPU VM using Remote SSH
- Open VS Code, Cursor, or your code editor of choice on your Mac.
- Press `Cmd + Shift + P`, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you'll see `SSH: 149.7.4.3` (your VM IP) in the bottom-left status bar (like in the image).
Step 15: Create a New Python Script (ovis.py) and Add the Following Code
Create a new Python script (example: `ovis.py`) and add the following code:
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM

MODEL_PATH = "AIDC-AI/Ovis2.5-9B"

# Thinking mode & budget
enable_thinking = True
enable_thinking_budget = True  # Only effective if enable_thinking is True.

# Total tokens for thinking + answer. Ensure: max_new_tokens > thinking_budget + 25
max_new_tokens = 3072
thinking_budget = 2048

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open(requests.get("https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png", stream=True).raw)},
        {"type": "text", "text": "Calculate the sum of the numbers in the middle box in figure (c)."},
    ],
}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=enable_thinking,
)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
)

response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Step 16: Run the Script
Run the script with the following command:
```bash
python3 ovis.py
```
Up until now, we’ve been running and interacting with our model directly from the terminal. That worked fine for quick tests, but now let’s make things smoother and more user-friendly by running it inside a browser interface. For that, we’ll use Streamlit, a lightweight Python framework that lets us build interactive web apps in just a few lines of code.
Step 17: Install Required Libraries for Browser App
Run the following command to install required libraries for browser app:
```bash
pip install streamlit torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3 flash-attn==2.7.0.post2 --no-build-isolation
```
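Note that `--no-build-isolation` applies to every package on that line. Since everything except streamlit was already installed in Step 13, you can equivalently split the install and keep the flag scoped to flash-attn:

```bash
pip install streamlit
pip install flash-attn==2.7.0.post2 --no-build-isolation
```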
Step 18: Create the Streamlit App Script (app.py)
We'll write a full Streamlit UI that lets you upload an image, ask a question, run Ovis2.5-9B, and view clean output.
Create `app.py` in your VM (inside your project folder) and add the following code:
```python
import streamlit as st
import torch
from transformers import AutoModelForCausalLM
from PIL import Image

st.set_page_config(page_title="Ovis2.5-9B Visual Reasoning", layout="wide")
st.title("📊 Ovis2.5-9B - Visual Reasoning with Image & Text")

# Load model once and cache it across Streamlit reruns
@st.cache_resource(show_spinner="Loading Ovis2.5-9B Model...")
def load_model():
    model = AutoModelForCausalLM.from_pretrained(
        "AIDC-AI/Ovis2.5-9B",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    ).cuda()
    return model

model = load_model()

# Input widgets
uploaded_image = st.file_uploader("Upload an Image (chart, diagram, document, etc.)", type=["png", "jpg", "jpeg"])
question = st.text_area("Ask a Question about the Image", placeholder="e.g. What is the total of the values in the middle box of figure (c)?")
enable_thinking = st.checkbox("Enable Deep Thinking Mode", value=True)
enable_thinking_budget = st.checkbox("Enable Thinking Budget", value=True)
max_new_tokens = st.slider("Max Tokens (total)", min_value=256, max_value=4096, value=3072, step=64)
thinking_budget = st.slider("Thinking Budget", min_value=0, max_value=3072, value=2048, step=64)

if st.button("Run Inference") and uploaded_image and question:
    image = Image.open(uploaded_image).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]

    st.info("⏳ Generating response...")
    with torch.no_grad():
        input_ids, pixel_values, grid_thws = model.preprocess_inputs(
            messages=messages,
            add_generation_prompt=True,
            enable_thinking=enable_thinking,
        )
        input_ids = input_ids.cuda()
        pixel_values = pixel_values.cuda() if pixel_values is not None else None
        grid_thws = grid_thws.cuda() if grid_thws is not None else None

        outputs = model.generate(
            inputs=input_ids,
            pixel_values=pixel_values,
            grid_thws=grid_thws,
            enable_thinking=enable_thinking,
            enable_thinking_budget=enable_thinking_budget,
            max_new_tokens=max_new_tokens,
            thinking_budget=thinking_budget,
        )

    decoded_output = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
    st.success("✅ Response Generated")
    st.text_area("Model Output", value=decoded_output, height=300)
```
Step 19: Run the App
```bash
streamlit run app.py
```
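Streamlit listens on port 8501 by default. Because the app runs on a remote VM, you'll need to reach that port from your local browser; two common options (the user/host values below are placeholders for your VM's SSH details):

```bash
# Option 1: forward the port over SSH from your local machine
ssh -L 8501:localhost:8501 user@your-vm-ip   # placeholders: your VM's SSH user/IP

# Option 2: bind Streamlit to all interfaces on the VM, then open http://<vm-ip>:8501
streamlit run app.py --server.address 0.0.0.0 --server.port 8501
```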
Step 20: Upload an Image and Ask a Question
In the Streamlit UI:
- Drag & drop your image (e.g., charts, graphs, OCR-based images, etc.).
- Enter a question related to the image in the text area. Example: `What is the total of the values in the middle box of figure (c)? End your response with 'Final answer: '`
- Select:
  - Enable Deep Thinking Mode (recommended for complex reasoning).
  - Enable Thinking Budget.
- Adjust:
  - Max Tokens to 3072.
  - Thinking Budget to 2048.
- Click Run Inference.
Step 21: View the Output
After clicking Run Inference, the model will:
- Process the image.
- Interpret the question.
- Run the Ovis2.5-9B model with visual + text reasoning.
- Display the output in a scrollable text area.
Conclusion
In a world where visuals speak louder than words, Ovis2.5-9B gives us the power to not just see images—but to understand them, reason through them, and extract structured insight from them like never before. Whether you’re decoding complex charts, making sense of scanned documents, or asking deep questions about visual layouts, this model brings a new dimension to multimodal intelligence.
With just a few commands, a powerful GPU VM, and a streamlined Streamlit interface, you’ve built a full-blown visual reasoning system—accessible right from your browser. The “thinking mode” and “thinking budget” features make this model truly next-gen, giving you fine-grained control over accuracy vs. speed.
And the best part? It runs entirely on your terms—your machine, your interface, your control.