R-4B is a multimodal large language model designed to introduce general-purpose auto-thinking. Unlike traditional models that either always perform step-by-step reasoning or skip it entirely, R-4B can adaptively switch between thinking and non-thinking modes depending on task complexity. This is achieved through its Bi-mode Annealing training (to build both capabilities) and Bi-mode Policy Optimization (to dynamically balance them during inference).
This flexibility allows R-4B to handle everything from quick Q&A to complex logical or scientific reasoning while keeping efficiency high. With recent integration into vLLM, R-4B also enables fast, scalable deployments and exposes a simple API for manual or automatic control over its “thinking mode.” It already tops multiple OpenCompass multimodal leaderboards, making it one of the most advanced open-source reasoning-capable MLLMs under 20B parameters.
R-4B Benchmark Comparison
| Dataset | R-4B [AutoThink] | Keye-VL-8B [AutoThink] | InternVL3.5-4B | Kimi-VL-A3B-Thinking-2506 | InternVL3-8B | Qwen2.5-VL-7B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 68.1 | 66.8 | 66.6 | 64.0 | 62.2 | 58.0 |
| MMStar | 73.1 | 72.8 | 65.0 | 70.4 | 68.7 | 64.1 |
| CharXiV (RQ) | 56.8 | 40.0 | 39.6 | 47.7 | 37.6 | 42.5 |
| MathVerse-Vision | 64.9 | 40.8 | 61.7 | 57.4 | 32.4 | 41.2 |
| DynaMath | 39.5 | 35.3 | 35.7 | 27.1 | 23.9 | 20.1 |
| LogicVista | 59.1 | 50.6 | 56.4 | 51.0 | 43.6 | 44.5 |
Experimental Results
GPU Configuration (What Actually Works)
| Scenario | Precision | Min VRAM | Recommended VRAM | Example GPUs | Notes |
| --- | --- | --- | --- | --- | --- |
| Light tasks (short Q&A, single image description) | FP16 / BF16 | 24 GB | 32 GB | NVIDIA L4 (24 GB), RTX 4090 (24 GB) | Suitable for short outputs, batch size 1. |
| Medium tasks (VQA, reasoning chains, multi-turn chat) | FP16 / BF16 | 40 GB | 48 GB | A6000 (48 GB), A100 (40 GB) | Good balance between reasoning length and efficiency. |
| Heavy tasks (long auto-thinking, large images, 16K+ tokens) | FP16 / BF16 | 80 GB | 96 GB+ | H100 (80 GB), H200 (94 GB) | Needed for extended context and long reasoning sequences. |
| Tensor parallel inference (vLLM server) | FP16 / BF16 | 8 × 16 GB | 8 × 24 GB | Multi-GPU clusters with 8× A100 (40 GB) or 8× H100 (80 GB) | Use tensor-parallel size = 8 for distributed workloads. |
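To see where these VRAM figures come from, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes a roughly 4B-parameter model held in 16-bit precision and ignores the vision encoder, activations, and KV cache, which is exactly the overhead that pushes the recommended numbers above the bare weight size.

PARAMS = 4e9          # approx. parameter count for R-4B (assumption: ~4B)
BYTES_PER_PARAM = 2   # FP16 / BF16

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"Weights alone: ~{weights_gib:.1f} GiB")  # roughly 7.5 GiB

# Activations, the vision tower, and the KV cache (which grows with long
# auto-thinking traces and large images) sit on top of this, which is why
# 24 GB is a practical floor and heavy workloads want 80 GB or more.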
Resources
Link: https://huggingface.co/YannQi/R-4B
Step-by-Step Process to Install & Run R-4B: Auto-Thinking Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running R-4B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like R-4B
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like R-4B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the R-4B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, install pip manually via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system. You may see a warning about running as root; that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv r4b
source r4b/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
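After the install finishes, a quick sanity check (run inside the activated venv) confirms that the CUDA-enabled build of PyTorch can see the GPU. This is a minimal sketch; the exact version string will differ on your machine:

import torch

print("torch:", torch.__version__)                 # expect a +cu121 build
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))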
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install --upgrade transformers accelerate pillow huggingface_hub
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running scripts with the R-4B model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Download and Load the Model
Create a file (ex: r4b_transformers_demo.py) and add the following code:
from transformers import AutoModel, AutoProcessor
import torch
model_id = "YannQi/R-4B"
model = AutoModel.from_pretrained(
model_id,
dtype=torch.float32, # 👈 FP32 to satisfy LayerNorm
trust_remote_code=True,
# optional: pin a specific commit to avoid surprise updates
# revision="<commit-sha>"
).to("cuda")
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True,
# revision="<commit-sha>"
)
Then, run the script with the following command:
python3 r4b_transformers_demo.py
Step 14: Run the Model and Generate Response
After the download completes, replace the code in the same script with the following:
import requests
from PIL import Image
import torch
from transformers import AutoModel, AutoProcessor
model_id = "YannQi/R-4B"
# Load in FP32 so projector LayerNorm (float) matches activations
model = AutoModel.from_pretrained(
model_id,
dtype=torch.float32, # (use dtype, not torch_dtype)
trust_remote_code=True,
).to("cuda")
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True,
use_fast=False,  # keep the slow processor to avoid fast/slow processor mismatch warnings
)
# Quick sanity
assert torch.cuda.is_available(), "CUDA not available"
print("GPU:", torch.cuda.get_device_name(0))
print("Model param dtype:", next(model.parameters()).dtype)
messages = [{
"role": "user",
"content": [
{"type": "image", "image": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": "Describe this image briefly."},
],
}]
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
thinking_mode="auto", # auto | long | short
)
image = Image.open(requests.get(messages[0]["content"][0]["image"], stream=True).raw)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
# Keep this modest on smaller cards
generated = model.generate(**inputs, max_new_tokens=512)
out_ids = generated[0][len(inputs.input_ids[0]):]
text = processor.decode(out_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("\n=== OUTPUT ===\n", text)
Then, run the script with the following command:
python3 r4b_transformers_demo.py
This script generates a response and prints the output in the terminal.
Option B — vLLM high-throughput server (recommended)
R-4B added native vLLM support in Aug 2025; install vLLM from source to get the latest VLM kernels. The model card shows the canonical commands; vLLM also documents using precompiled kernels for faster editable installs.
Step 1: Install uv (Fast Pip)
Run the following commands to install uv:
# Install uv (one-liner installer)
curl -LsSf https://astral.sh/uv/install.sh | sh
# current shell
source $HOME/.local/bin/env
# also add it for future logins
echo 'source $HOME/.local/bin/env' >> ~/.bashrc
# verify
uv --version
Step 2: Install Wheel
Run the following command to install wheel:
pip install --upgrade pip wheel
Step 3: Clone and Install vLLM (editable) with Precompiled Kernels
Run the following commands to clone and install vLLM (editable):
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
Step 4: Install Build Deps (GCC + Python Headers)
Run the following commands to install build deps (gcc + python headers):
sudo apt-get update
sudo apt-get install -y build-essential python3-dev python3.10-dev ninja-build
Step 5: Serve R-4B
Run the following command to serve R-4B:
# stop any running server (Ctrl+C), then:
vllm serve \
yannqi/R-4B \
--served-model-name r4b \
--host 0.0.0.0 --port 8000 \
--gpu-memory-utilization 0.85 \
--trust-remote-code
What the flags mean
- yannqi/R-4B – the Hugging Face repo to load (with custom modeling code).
- --served-model-name r4b – the name clients use as "model": "r4b".
- --host 0.0.0.0 --port 8000 – bind on all interfaces, port 8000.
- --gpu-memory-utilization 0.85 – let vLLM use ~85% of VRAM (leave headroom for kernels/OS).
- --trust-remote-code – required because the repo ships custom code.
- If you hit compile issues on some boxes, add --enforce-eager (disables the torch.compile JIT).
What “healthy” startup looks like
You’ll see lines like:
INFO ... vLLM API server version ...
INFO ... Resolved architecture: RForConditionalGeneration
INFO ... Route: /v1/chat/completions, Methods: POST
INFO ... Started server process [PID]
INFO ... Application startup complete.
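Before sending real requests, you can confirm the server responds by listing the served models. Below is a minimal sketch using the requests library; the host, port, and served model name are taken from the serve command above, so adjust them if your setup differs:

import requests

# Query the OpenAI-compatible /v1/models endpoint exposed by vLLM.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print("Serving:", model["id"])  # expect "r4b" (from --served-model-name)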
Step 6: Query the R-4B API (text + image)
Your server is running at http://<HOST>:8000/v1. We’ll use the OpenAI-compatible /chat/completions route.
Tip: if you don’t have jq, either install it (apt-get install -y jq) or use the Python one-liner extractors shown below.
6.1 Minimal text sanity check
With jq:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"messages": [{"role": "user", "content": "In one sentence, what is R-4B?"}],
"max_tokens": 128
}' | jq -r '.choices[0].message.content'
Without jq (Python extractor):
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"messages": [{"role": "user", "content": "In one sentence, what is R-4B?"}],
"max_tokens": 128
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
If you see odd answers (e.g., “rocket”), add a system message, lower temperature, and keep top_p around 0.9:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"temperature": 0.2,
"top_p": 0.9,
"messages": [
{"role":"system","content":"You are R-4B, a multimodal LLM. Be concise and factual. If unsure, say you do not know."},
{"role":"user","content":"In one sentence, what is R-4B?"}
],
"max_tokens": 128
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
6.2 Thinking modes (auto / long / short)
R-4B can auto-decide when to think, or you can force it. Pass the knob via chat_template_kwargs: with raw curl it goes at the top level of the request JSON, and with the OpenAI Python SDK you wrap it in extra_body so the SDK merges it into the request body.
# Auto (default)
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"r4b",
"messages":[{"role":"user","content":"Summarize Transformers in one sentence."}],
"max_tokens": 128,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "auto" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# Force deep reasoning
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"r4b",
"messages":[{"role":"user","content":"Explain attention in 2–3 sentences for a beginner."}],
"max_tokens": 256,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "long" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# Force non-thinking (fast)
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"r4b",
"messages":[{"role":"user","content":"Give a one-line definition of KV cache."}],
"max_tokens": 64,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "short" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
Seeing stray </think> tags at the start? Two quick fixes:
- keep thinking_mode: short for non-thinking responses, or
- add a stop sequence to trim: "stop": ["</think>"] inside the top-level JSON.
6.3 Image + text (VLM)
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r4b",
"messages": [{
"role": "user",
"content": [
{"type":"image_url","image_url":{"url":"http://images.cocodataset.org/val2017/000000039769.jpg"}},
{"type":"text","text":"Describe this image briefly."}
]
}],
"max_tokens": 512,
"extra_body": { "chat_template_kwargs": { "thinking_mode": "auto" } }
}' | python3 -c 'import sys,json;print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
6.4 Python client (OpenAI SDK)
Run the following command to install dependencies:
pip install "openai>=1.44.0" pillow requests
Run:
python3 r4b_client_demo.py
Conclusion
R-4B brings “auto-thinking” to multimodal LLMs, switching between step-by-step reasoning and fast direct answers to match task complexity, so you get strong accuracy without wasting compute. It’s open-source, tops key OpenCompass MLLM benchmarks under 20B parameters, and is easy to run locally via Transformers or serve at scale with vLLM. Use thinking_mode (auto/long/short) to control behavior, keep token budgets modest on smaller GPUs, and pin a revision for stability. If you need throughput, vLLM with tensor parallelism makes it production-ready. In short: R-4B is a practical, high-quality choice for vision-language apps that need both speed and serious reasoning.