A 21B-parameter text MoE (Mixture-of-Experts) model with 3B activated parameters per token, post-trained for deep reasoning. It adds stronger tool use, long context (131,072 tokens), and higher pass@1/accuracy on math/logic, coding, science, and academic benchmarks. Weights are released in PyTorch/Transformers format in BF16/FP32, and the model can be run via FastDeploy (recommended) or standard transformers. Function calling is supported; vLLM parsers for reasoning/tool calls are in progress.
Key config: 28 layers, 20 Q heads / 4 KV heads, 64 text experts (6 active), 2 shared experts. License: Apache-2.0.
| Benchmark | ERNIE-4.5-21B-A3B-Thinking | DeepSeek-R1-0528 | ERNIE-X1.1 | Gemini2.5-Pro |
|---|---|---|---|---|
| AIME2025 (Avg@32) | 78.02 | 87.62 | 82.6 | 90.05 |
| BFCL (Accuracy) | 65.00 | 66.04 | 72.0 | 62.89 |
| ZebraLogic (Accuracy) | 89.8 | 95.1 | 94.7 | 92.29 |
| MUSR (Accuracy) | 86.71 | 94.33 | 88.16 | 83.13 |
| BBH (Accuracy) | 87.77 | 90.97 | 93.42 | 91.28 |
| HumanEval+ (Pass@1) | 90.85 | 89.45 | 93.29 | 94.51 |
| MBPP (Pass@1) | 80.16 | 78.31 | 80.49 | 79.8 |
| IFEval (Prompt Strict Accuracy) | 84.29 | 80.22 | 92.24 | 90.37 |
| Multi-IF (Accuracy) | 63.29 | 69.0 | 82.4 | 76.13 |
| ChineseSimpleQA (Accuracy) | 49.06 | 67.17 | 82.86 | 74.5 |
| WritingBench (critic score, max = 10) | 8.65 | 8.61 | 8.76 | 8.79 |
Model Overview
ERNIE-4.5-21B-A3B-Thinking is a text MoE post-trained model, with 21B total parameters and 3B activated parameters for each token. The following are the model configuration details:
| Key | Value |
|---|---|
| Modality | Text |
| Training Stage | Post-training |
| Params (Total / Activated) | 21B / 3B |
| Layers | 28 |
| Heads (Q / KV) | 20 / 4 |
| Text Experts (Total / Activated) | 64 / 6 |
| Vision Experts (Total / Activated) | 64 / 6 |
| Shared Experts | 2 |
| Context Length | 131,072 |
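To confirm these values against the released checkpoint, a minimal sketch like the one below loads only the Hugging Face config (no weights) and prints the standard fields; MoE-specific attribute names can differ in ERNIE's config, so treat it as illustrative:
from transformers import AutoConfig

# Load only the config (no weights are downloaded).
cfg = AutoConfig.from_pretrained("baidu/ERNIE-4.5-21B-A3B-Thinking")

# Standard Hugging Face config fields; guard the less universal ones with getattr.
print("layers:", cfg.num_hidden_layers)
print("attention heads (Q):", cfg.num_attention_heads)
print("KV heads:", getattr(cfg, "num_key_value_heads", "n/a"))
print("max context:", getattr(cfg, "max_position_embeddings", "n/a"))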
GPU Configuration Guide (Practical Setups)
Notes
• The official FastDeploy example assumes a single 80 GB GPU.
• Long context greatly increases KV-cache memory; reduce max sequence length or batch size if you run out of VRAM.
• INT8/4-bit options depend on your stack; prefer FastDeploy or carefully validated transformers quantization.
| Scenario | Precision / Stack | Min VRAM that works* | Recommended | Example setup | Tips |
|---|---|---|---|---|---|
| Single-GPU, standard context (≤8k–16k), batch 1 | BF16, FastDeploy 2.2+ | 80 GB | 80–96 GB | 1× A100/H100 80 GB | Use the sample command (--tensor-parallel-size 1; --max-model-len 131072 is adjustable). |
| Multi-GPU tensor parallel | BF16, FastDeploy | 2×40 GB | 2×40 GB – 4×24 GB | 2× A100 40 GB, or 4× L40S 24 GB | Set --tensor-parallel-size to the number of GPUs; lower --max-model-len for stability. |
| Transformers (no server), inference only | BF16 | 48–80 GB | 80 GB | 1× 80 GB; or 2× 40 GB with device_map="auto" | Start with max_new_tokens ≤ 1024, batch 1; watch CPU RAM for MoE routing buffers. |
| Transformers w/ 8-bit weights (experimental) | INT8 / LLM.int8() | 32–48 GB | 48–64 GB | 1× 48 GB (RTX 6000 Ada / 4090) | Quantize weights only; the KV cache remains BF16/FP16, so limit sequence length. |
| vLLM (parsers WIP) | BF16 | 80 GB | 80–96 GB | 1× 80 GB | Until the ERNIE reasoning/tool parsers land, treat it as standard CausalLM serving. |
| Long-context (≥64k, batch 1) | BF16 | 80–120 GB | 120–160 GB (or multi-GPU) | 2× 80 GB | KV cache dominates; reduce --max-model-len or use a paged KV cache if available. |
| Low-VRAM fallback | CPU offload + 8-bit | 24–32 GB GPU + large CPU RAM | 32–48 GB | 1× 24–32 GB + fast NVMe | Very slow; keep max_model_len small and batch size 1. |
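As a rough illustration of why the KV cache dominates at long context, here is a back-of-the-envelope estimate in Python; the head dimension is an assumption for illustration only (check the model config for the real value):
# Rough KV-cache size per sequence: 2 (K and V) x layers x KV heads x head_dim x seq_len x bytes/element
layers, kv_heads = 28, 4      # from the model card
head_dim = 128                # assumption for illustration; read it from the config
bytes_per_elem = 2            # BF16
for seq_len in (8_192, 32_768, 131_072):
    kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    print(f"{seq_len:>7} tokens ≈ {kv_bytes / 1024**3:.2f} GiB of KV cache per sequence")
Multiply by the number of concurrent sequences and add the model weights (roughly 42 GB in BF16 for 21B parameters) to see why the long-context rows above call for 80 GB-class GPUs or multi-GPU setups.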
Resources
Link: https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
Step-by-Step Process to Install & Run ERNIE-4.5-21B-A3B-Thinking Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side. Select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running ERNIE-4.5-21B-A3B-Thinking, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like ERNIE-4.5-21B-A3B-Thinking
- Compatibility with CUDA 12.1.1, which certain model operations require
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like ERNIE-4.5-21B-A3B-Thinking.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel version contains the full CUDA toolkit with nvcc.
This setup ensures that ERNIE-4.5-21B-A3B-Thinking runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we'll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, install pip manually.
Install pip via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv ernie
source ernie/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
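Optionally, run a quick sanity check to confirm that the CUDA build of PyTorch was installed and can see the GPU:
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"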
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install --upgrade transformers accelerate pillow huggingface_hub
pip install --upgrade pip setuptools wheel
pip install --upgrade blobfile
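Optionally, you can pre-download the model weights now so the script in Step 13 doesn't block on a long download. This is a small sketch using huggingface_hub, which was installed above:
from huggingface_hub import snapshot_download

# Downloads (or resumes) all files for the repo into the local Hugging Face cache.
snapshot_download("baidu/ERNIE-4.5-21B-A3B-Thinking")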
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running the model script with the ERNIE-4.5-21B-A3B-Thinking model, it's a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., app.py) and add the following code:
import re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
name = "baidu/ERNIE-4.5-21B-A3B-Thinking"
# Silence the legacy tokenizer warning:
tok = AutoTokenizer.from_pretrained(name, legacy=False)
# Newer HF warns: use dtype instead of torch_dtype
model = AutoModelForCausalLM.from_pretrained(
    name, dtype=torch.bfloat16, device_map="auto"
)
messages = [
{"role":"system","content":"You are a helpful assistant. Respond in English only. Do NOT include <think> content."},
{"role":"user","content":"Give me a short introduction to large language models."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], add_special_tokens=False, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.9)
gen = tok.decode(out[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
def extract_response(s: str) -> str:
    # strip <think>…</think> if present
    s = re.sub(r"<think>.*?</think>\s*", "", s, flags=re.DOTALL|re.IGNORECASE)
    # prefer only the <response>…</response> block
    m = re.search(r"<response>\s*(.*?)\s*</response>", s, flags=re.DOTALL|re.IGNORECASE)
    return (m.group(1).strip() if m else s.strip())
print(extract_response(gen))
What the Script Does:
Imports & model ID
- Brings in re, torch, and the Transformers helpers.
- Sets name = "baidu/ERNIE-4.5-21B-A3B-Thinking" so every call uses that HF repo.
Load the tokenizer (modern behavior)
- AutoTokenizer.from_pretrained(..., legacy=False) turns off the old LLaMA-style tokenization behavior and the noisy warning.
- Downloads the tokenizer files if not cached.
Load the model on your GPU(s)
- AutoModelForCausalLM.from_pretrained(..., dtype=torch.bfloat16, device_map="auto")
- dtype=torch.bfloat16 → good numerical stability on H100/H200 while saving VRAM.
- device_map="auto" → automatically places weights on available GPU(s) (and CPU if needed); no manual .to("cuda") required.
- Pulls 9 safetensors shards (first run) and builds the model.
Build a chat prompt with the model's template
- messages = [...] defines a system rule (English only, no <think>) and a user prompt.
- tok.apply_chat_template(..., add_generation_prompt=True) converts those into the exact string/tokens ERNIE expects for chat (adds special tokens and the assistant prefix).
Tokenize & move to the right device
- tok([...], return_tensors="pt").to(model.device) turns the text into input IDs and puts them where the model lives (GPU).
Generate a response
- model.generate(..., max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.9)
- Up to 512 new tokens.
- Sampling on (temperature/top-p) for a slightly varied but controlled answer. (Set do_sample=False for deterministic/greedy outputs.)
Decode only the newly generated tokens
- Slices off the prompt IDs and decodes just the continuation:
gen = tok.decode(out[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
Clean up ERNIE's "thinking" format
- Defines extract_response():
  - Strips any <think> ... </think> block with a regex.
  - If present, extracts the content inside <response> ... </response>.
  - Falls back to the raw text if no <response> tags are found.
Print the final, user-friendly answer
- print(extract_response(gen)) → you see only the polished reply, without the hidden reasoning.
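To make the clean-up step concrete, here is a quick illustration of extract_response() on a made-up generation (the exact tag layout ERNIE emits can vary):
raw = "<think>Weighing how much detail to include...</think>\n<response>LLMs are neural networks trained on huge text corpora to predict the next token.</response>"
print(extract_response(raw))
# -> LLMs are neural networks trained on huge text corpora to predict the next token.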
Step 14: Run the Script
Run the script from the following command:
python3 app.py
This will download the model and generate a response in the terminal.
When you run the script with python3 app.py, the ERNIE-4.5 model weights are downloaded (on the first run) and a response is printed directly in your terminal. By default, the response may appear in Chinese, as the model is multilingual and often defaults to its training language distribution. If you'd like the output in English or another specific language, you must explicitly instruct the model through the system prompt. We will experiment with this in Steps 15 & 16, where you'll learn how to guide ERNIE to respond in the language of your choice.
Step 15: Rewrite app.py for English
Update the system prompt in your script to include "Reply in English only" and set legacy=False in the tokenizer to avoid warnings — this will ensure the model responds in English.
Add the following code to the file:
import re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
name = "baidu/ERNIE-4.5-21B-A3B-Thinking"
# Silence the legacy tokenizer warning:
tok = AutoTokenizer.from_pretrained(name, legacy=False)
# Newer HF warns: use dtype instead of torch_dtype
model = AutoModelForCausalLM.from_pretrained(
    name, dtype=torch.bfloat16, device_map="auto"
)
messages = [
{"role":"system","content":"You are a helpful assistant. Respond in English only. Do NOT include <think> content."},
{"role":"user","content":"Give me a short introduction to large language models."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], add_special_tokens=False, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.9)
gen = tok.decode(out[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
def extract_response(s: str) -> str:
    # strip <think>…</think> if present
    s = re.sub(r"<think>.*?</think>\s*", "", s, flags=re.DOTALL|re.IGNORECASE)
    # prefer only the <response>…</response> block
    m = re.search(r"<response>\s*(.*?)\s*</response>", s, flags=re.DOTALL|re.IGNORECASE)
    return (m.group(1).strip() if m else s.strip())
print(extract_response(gen))
Step 16: Run the Script
Run the script from the following command:
python3 app.py
Now, this will generate the response in your terminal in English. If you want the output in any other language, simply modify the system prompt in the script to specify your desired language — for example, use "Reply in Spanish only" or "Answer in Hindi" as needed. The model will follow the instruction accordingly, so you can customize the language of the response directly within the prompt text in your code.
Step 17: Install vLLM
Run the following command to install vLLM:
pip install "vllm>=0.6.0"
Step 18: Install Python 3.10 Toolchain + Headers
Run the following command to install the Python 3.10 toolchain and headers:
# Python 3.10 toolchain + headers needed by vLLM
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Note: we already have Python 3.10 installed, but we add python3.10-dev (and the toolchain) because vLLM builds/uses native CUDA extensions and needs the Python 3.10 headers and libs to compile / load wheels correctly.
Step 19: Start the vLLM Server
Run the following command to start the vLLM server:
vllm serve baidu/ERNIE-4.5-21B-A3B-Thinking --max-model-len 8192
vllm serve
Starts a FastAPI HTTP server powered by vLLM's engine. It exposes OpenAI-compatible endpoints (e.g., /v1/chat/completions, /v1/completions) so you can call it with normal OpenAI-style requests.
baidu/ERNIE-4.5-21B-A3B-Thinking
Tells vLLM to pull this model from Hugging Face (first run downloads weights & tokenizer) and keep it in GPU memory for inference.
--max-model-len 8192
Sets the maximum total token window per request (prompt + tools + system + new tokens) to 8,192 tokens.
- Lower value → less KV cache memory, higher throughput and more concurrent requests.
- Higher value → more VRAM used per request, fewer concurrent sequences possible.
- This is an upper bound; you can still request smaller contexts.
What vLLM does under the hood
- Loads the tokenizer and weights; picks an efficient dtype automatically (BF16 on H100/H200; FP16 otherwise).
- Uses PagedAttention with a paged KV cache so multiple requests can run concurrently without massive fragmentation.
- Spawns an engine worker and an HTTP app; the default host is 0.0.0.0 and the port is 8000 (unless you pass --host/--port).
- Supports streaming responses (Server-Sent Events) and batched decoding for throughput.
What you can call after it’s up
- Chat endpoint (recommended): send messages=[...] with a system+user chat format; vLLM applies the model's chat template for you (see the Python client sketch below).
- Completions endpoint: send plain prompts if you prefer classic completion style.
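For example, here is a minimal Python sketch of the chat endpoint using the OpenAI client (pip install openai; the API key is a placeholder since no auth is configured, and the stop string mirrors the </response> convention noted later):
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="baidu/ERNIE-4.5-21B-A3B-Thinking",
    messages=[
        {"role": "system", "content": "Reply in English. Do not include <think>."},
        {"role": "user", "content": "Give me a 2-line intro to large language models."},
    ],
    max_tokens=256,
    temperature=0.2,
    stop=["</response>"],  # trims ERNIE's thinking/response wrapper if it appears
)
print(resp.choices[0].message.content)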
Performance/VRAM implications
- 8k context is light for your H200; you can raise to 16k/32k if you need longer context, at the cost of VRAM & throughput.
- Concurrency scales with free KV cache: a larger max_model_len or longer outputs → fewer parallel requests.
Defaults you didn’t specify (good to know)
- --tensor-parallel-size defaults to 1 (single GPU).
- --dtype is auto.
- --served-model-name defaults to the HF id; change it if you want a shorter API model name.
- --api-key is off by default; add one if you want auth on the server.
An example launch command with these flags spelled out is shown below.
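For instance, a launch that makes these defaults and options explicit (the served model name here is just an illustrative alias) could look like:
vllm serve baidu/ERNIE-4.5-21B-A3B-Thinking \
  --max-model-len 8192 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --served-model-name ernie-thinking \
  --port 8000
# add --api-key <YOUR_KEY> to require authentication
Note that if you set --served-model-name, API requests should use that alias as the "model" field instead of the full repo id.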
Success criteria (what you should see) for ERNIE-4.5-21B-A3B-Thinking with vLLM:
- Resolved architecture: Qwen2ForCausalLM (ERNIE 4.5 PT weights load via the Qwen2 causal LM class).
- Model load: lines like Loading checkpoint shards … followed by Loaded baidu/ERNIE-4.5-21B-A3B-Thinking.
- Routes listed: e.g. /v1/chat/completions, /v1/completions, /v1/models, /metrics (OpenAI-compatible).
- Started server process [PID] — vLLM engine + HTTP app spawned.
- Application startup complete.
- (Normal) You may see: torch_dtype is deprecated! Use dtype instead!
- Port: vLLM listens on 0.0.0.0:8000 by default (change with --port).
- (ERNIE-specific note) If you query directly, generations may include <think>…</think> and <response>…</response>; that's expected. Add a client-side stop at </response> or strip <think> if you want only the final answer.
Step 20: Verify the server is serving your model
# models list
curl http://localhost:8000/v1/models
You should see an entry like:
"id": "baidu/ERNIE-4.5-21B-A3B-Thinking", "max_model_len": 8192, ...
Ask a question (OpenAI chat endpoint)
curl -s http://$HOST:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "baidu/ERNIE-4.5-21B-A3B-Thinking",
"messages": [
{"role":"system","content":"Reply in English. Do not include <think>."},
{"role":"user","content":"Give me a 2-line intro to large language models."}
],
"max_tokens": 256,
"temperature": 0.2,
"stop": ["</response>"]
}'
stop: ["</response>"]
trims ERNIE’s thinking trace if it appears.
- If you enabled an API key at server start, add:
-H "Authorization: Bearer <YOUR_KEY>"
Conclusion
ERNIE-4.5-21B-A3B-Thinking is a powerful open-source Mixture-of-Experts language model designed for advanced reasoning, coding, and academic tasks. With support for 131K token context, strong function-calling, and multilingual capabilities, it excels in complex generation workflows. Thanks to its efficient 3B expert activation, it delivers high performance on modern GPUs like H100/H200 without overloading memory. Whether you’re using Transformers or deploying via vLLM or FastDeploy, ERNIE-4.5 offers flexibility, speed, and accuracy—making it an ideal choice for developers, researchers, and production-grade AI systems.