R-4B is a multimodal large language model designed to introduce general-purpose auto-thinking. Unlike traditional models that either always perform step-by-step reasoning or skip it entirely, R-4B can adaptively switch between thinking and non-thinking modes depending on task complexity. This is achieved through its Bi-mode Annealing training (to build both capabilities) and Bi-mode Policy Optimization (to dynamically balance them during inference).
This flexibility allows R-4B to handle everything from quick Q&A to complex logical or scientific reasoning while keeping efficiency high. With recent integration into vLLM, R-4B also enables fast, scalable deployments and exposes a simple API for manual or automatic control over its “thinking mode.” It already tops multiple OpenCompass multimodal leaderboards, making it one of the most advanced open-source reasoning-capable MLLMs under 20B parameters.
R-4B Benchmark Comparison
| Dataset | R-4B [AutoThink] | Keye-VL-8B [AutoThink] | InternVL3.5-4B | Kimi-VL-A3B-Thinking-2506 | InternVL3-8B | Qwen2.5-VL-7B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 68.1 | 66.8 | 66.6 | 64.0 | 62.2 | 58.0 |
| MMStar | 73.1 | 72.8 | 65.0 | 70.4 | 68.7 | 64.1 |
| CharXiV (RQ) | 56.8 | 40.0 | 39.6 | 47.7 | 37.6 | 42.5 |
| MathVerse-Vision | 64.9 | 40.8 | 61.7 | 57.4 | 32.4 | 41.2 |
| DynaMath | 39.5 | 35.3 | 35.7 | 27.1 | 23.9 | 20.1 |
| LogicVista | 59.1 | 50.6 | 56.4 | 51.0 | 43.6 | 44.5 |
Experimental Results
GPU Configuration (What Actually Works)
| Scenario | Precision | Min VRAM | Recommended VRAM | Example GPUs | Notes |
| --- | --- | --- | --- | --- | --- |
| Light tasks (short Q&A, single image description) | FP16 / BF16 | 24 GB | 32 GB | NVIDIA L4 (24 GB), RTX 4090 (24 GB) | Suitable for short outputs, batch size 1. |
| Medium tasks (VQA, reasoning chains, multi-turn chat) | FP16 / BF16 | 40 GB | 48 GB | A6000 (48 GB), A100 (40 GB) | Good balance between reasoning length and efficiency. |
| Heavy tasks (long auto-thinking, large images, 16K+ tokens) | FP16 / BF16 | 80 GB | 96 GB+ | H100 (80 GB), H200 (94 GB) | Needed for extended context and long reasoning sequences. |
| Tensor parallel inference (vLLM server) | FP16 / BF16 | 8 × 16 GB | 8 × 24 GB | Multi-GPU clusters with 8× A100 (40 GB) or 8× H100 (80 GB) | Use tensor-parallel size = 8 for distributed workloads. |
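To see where these VRAM figures come from, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes a roughly 4B-parameter model held in 16-bit precision and ignores the vision encoder, activations, and KV cache, which is exactly the overhead that pushes the recommended numbers above the bare weight size.

PARAMS = 4e9          # approx. parameter count for R-4B (assumption: ~4B)
BYTES_PER_PARAM = 2   # FP16 / BF16

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"Weights alone: ~{weights_gib:.1f} GiB")  # roughly 7.5 GiB

# Activations, the vision tower, and the KV cache (which grows with long
# auto-thinking traces and large images) sit on top of this, which is why
# 24 GB is a practical floor and heavy workloads want 80 GB or more.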
Resources
Link: https://huggingface.co/YannQi/R-4B
Step-by-Step Process to Install & Run R-4B: Auto-Thinking Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running R-4B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like R-4B
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like R-4B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the R-4B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, install pip manually via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system. You may see a warning about running as root; that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv r4b
source r4b/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
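After the install finishes, a quick sanity check (run inside the activated venv) confirms that the CUDA-enabled build of PyTorch can see the GPU. This is a minimal sketch; the exact version string will differ on your machine:

import torch

print("torch:", torch.__version__)                 # expect a +cu121 build
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))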
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install --upgrade transformers accelerate pillow huggingface_hub
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running scripts with the R-4B model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Download and Load the Model
Create a file (ex: r4b_transformers_demo.py) and add the following code:
from transformers import AutoModel, AutoProcessor
import torch
model_id = "YannQi/R-4B"
model = AutoModel.from_pretrained(
model_id,
dtype=torch.float32, # 👈 FP32 to satisfy LayerNorm
trust_remote_code=True,
# optional: pin a specific commit to avoid surprise updates
# revision="<commit-sha>"
).to("cuda")
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True,
# revision="<commit-sha>"
)
Then, run the script with the following command:
python3 r4b_transformers_demo.py
Step 14: Run the Model and Generate Response
After the download completes, replace the code in the same script with the following:
import requests
from PIL import Image
import torch
from transformers import AutoModel, AutoProcessor
model_id = "YannQi/R-4B"
# Load in FP32 so projector LayerNorm (float) matches activations
model = AutoModel.from_pretrained(
model_id,
dtype=torch.float32, # (use dtype, not torch_dtype)
trust_remote_code=True,
).to("cuda")
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True,
use_fast=False,  # keep the slow processor to avoid fast/slow processor mismatch warnings
)
# Quick sanity
assert torch.cuda.is_available(), "CUDA not available"
print("GPU:", torch.cuda.get_device_name(0))
print("Model param dtype:", next(model.parameters()).dtype)
messages = [{
"role": "user",
"content": [
{"type": "image", "image": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": "Describe this image briefly."},
],
}]
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
thinking_mode="auto", # auto | long | short
)
image = Image.open(requests.get(messages[0]["content"][0]["image"], stream=True).raw)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
# Keep this modest on smaller cards
generated = model.generate(**inputs, max_new_tokens=512)
out_ids = generated[0][len(inputs.input_ids[0]):]
text = processor.decode(out_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("\n=== OUTPUT ===\n", text)
Then, run the script with the following command:
python3 r4b_transformers_demo.py
This script generates a response and prints the output in the terminal.
Option B — vLLM high-throughput server (recommended)
R-4B added native vLLM support in Aug 2025; install vLLM from source to get the latest VLM kernels. The model card shows the canonical commands; vLLM also documents using precompiled kernels for faster editable installs.
Step 1: Install uv (Fast Pip)
Run the following commands to install uv:
# Install uv (one-liner installer)
curl -LsSf https://astral.sh/uv/install.sh | sh
# current shell
source $HOME/.local/bin/env
# also add it for future logins
echo 'source $HOME/.local/bin/env' >> ~/.bashrc
# verify
uv --version
Step 2: Install Wheel
Run the following command to install wheel:
pip install --upgrade pip wheel
Step 3: Clone and Install vLLM (editable) with Precompiled Kernels
Run the following commands to clone and install vLLM (editable):
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
Step 4: Install Build Deps (GCC + Python Headers)
Run the following commands to install build deps (gcc + python headers):
sudo apt-get update
sudo apt-get install -y build-essential python3-dev python3.10-dev ninja-build
Step 5: Serve R-4B
Run the following command to serve R-4B:
# stop any running server (Ctrl+C), then:
vllm serve \
yannqi/R-4B \
--served-model-name r4b \
--host 0.0.0.0 --port 8000 \
--gpu-memory-utilization 0.85 \
--trust-remote-code
What the flags mean
- yannqi/R-4B – the Hugging Face repo to load (with custom modeling code).
- --served-model-name r4b – the name clients use as "model": "r4b".
- --host 0.0.0.0 --port 8000 – bind on all interfaces, port 8000.
- --gpu-memory-utilization 0.85 – let vLLM use ~85% of VRAM (leave headroom for kernels/OS).
- --trust-remote-code – required because the repo ships custom code.
- If you hit compile issues on some boxes, add --enforce-eager (disables the torch.compile JIT).
What “healthy” startup looks like
You’ll see lines like:
INFO ... vLLM API server version ...
INFO ... Resolved architecture: RForConditionalGeneration
INFO ... Route: /v1/chat/completions, Methods: POST
INFO ... Started server process [PID]
INFO ... Application startup complete.
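Before sending real requests, you can confirm the server responds by listing the served models. Below is a minimal sketch using the requests library; the host, port, and served model name are taken from the serve command above, so adjust them if your setup differs:

import requests

# Query the OpenAI-compatible /v1/models endpoint exposed by vLLM.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print("Serving:", model["id"])  # expect "r4b" (from --served-model-name)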
Step 6: Query the R-4B API (text + image)
Your server is running at http://<HOST>:8000/v1. We’ll use the OpenAI-compatible /chat/completions route.
Tip: if you don’t have jq, either install it (apt-get install -y jq) or use the Python one-liner extractors shown below.
6.1 Minimal text sanity check
With jq:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"messages": [{"role": "user", "content": "In one sentence, what is R-4B?"}],
"max_tokens": 128
}' | jq -r '.choices[0].message.content'
Without jq (Python extractor):
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"messages": [{"role": "user", "content": "In one sentence, what is R-4B?"}],
"max_tokens": 128
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
If you see odd answers (e.g., “rocket”), add a system message, lower temperature, and keep top_p around 0.9:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"temperature": 0.2,
"top_p": 0.9,
"messages": [
{"role":"system","content":"You are R-4B, a multimodal LLM. Be concise and factual. If unsure, say you do not know."},
{"role":"user","content":"In one sentence, what is R-4B?"}
],
"max_tokens": 128
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
6.2 Thinking modes (auto / long / short)
R-4B can auto-decide when to think, or you can force it. Pass the knob via chat_template_kwargs: with raw curl it goes at the top level of the request JSON, and with the OpenAI Python SDK you wrap it in extra_body so the SDK merges it into the request body.
# Auto (default)
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"r4b",
"messages":[{"role":"user","content":"Summarize Transformers in one sentence."}],
"max_tokens": 128,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "auto" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# Force deep reasoning
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"r4b",
"messages":[{"role":"user","content":"Explain attention in 2–3 sentences for a beginner."}],
"max_tokens": 256,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "long" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# Force non-thinking (fast)
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"r4b",
"messages":[{"role":"user","content":"Give a one-line definition of KV cache."}],
"max_tokens": 64,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "short" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
Seeing stray </think> tags at the start? Two quick fixes:
- keep thinking_mode: short for non-thinking responses, or
- add a stop sequence to trim: "stop": ["</think>"] inside the top-level JSON.
6.3 Image + text (VLM)
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"messages": [{
"role": "user",
"content": [
{"type":"image_url","image_url":{"url":"http://images.cocodataset.org/val2017/000000039769.jpg"}},
{"type":"text","text":"Describe this image briefly."}
]
}],
"max_tokens": 512,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "auto" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
6.4 Python client (OpenAI SDK)
Run the following command to install dependencies:
pip install "openai>=1.44.0" pillow requests
Run:
python3 r4b_client_demo.py
Conclusion
R-4B brings “auto-thinking” to multimodal LLMs, switching between step-by-step reasoning and fast direct answers to match task complexity, so you get strong accuracy without wasting compute. It’s open-source, tops key OpenCompass MLLM benchmarks under 20B parameters, and is easy to run locally via Transformers or serve at scale with vLLM. Use thinking_mode (auto/long/short) to control behavior, keep token budgets modest on smaller GPUs, and pin a revision for stability. If you need throughput, vLLM with tensor parallelism makes it production-ready. In short: R-4B is a practical, high-quality choice for vision-language apps that need both speed and serious reasoning.