A 21B-parameter text MoE (Mixture-of-Experts) model with 3B activated parameters per token, post-trained for deep reasoning. It adds stronger tool use, long context (131,072 tokens), and higher pass@1/accuracy on math/logic, coding, science, and academic benchmarks. Weights are released in PyTorch/Transformers format in BF16/FP32, and the model can be run via FastDeploy (recommended) or standard transformers. Function calling is supported; vLLM parsers for reasoning/tool calls are in progress.
Key config: 28 layers, 20 Q heads / 4 KV heads, 64 text experts (6 active), 2 shared experts. License: Apache-2.0.
| Benchmark | ERNIE-4.5-21B-A3B-Thinking | DeepSeek-R1-0528 | ERNIE-X1.1 | Gemini2.5-Pro |
|---|---|---|---|---|
| AIME2025 (Avg@32) | 78.02 | 87.62 | 82.6 | 90.05 |
| BFCL (Accuracy) | 65.00 | 66.04 | 72.0 | 62.89 |
| ZebraLogic (Accuracy) | 89.8 | 95.1 | 94.7 | 92.29 |
| MUSR (Accuracy) | 86.71 | 94.33 | 88.16 | 83.13 |
| BBH (Accuracy) | 87.77 | 90.97 | 93.42 | 91.28 |
| HumanEval+ (Pass@1) | 90.85 | 89.45 | 93.29 | 94.51 |
| MBPP (Pass@1) | 80.16 | 78.31 | 80.49 | 79.8 |
| IFEval (Prompt Strict Accuracy) | 84.29 | 80.22 | 92.24 | 90.37 |
| Multi-IF (Accuracy) | 63.29 | 69.0 | 82.4 | 76.13 |
| ChineseSimpleQA (Accuracy) | 49.06 | 67.17 | 82.86 | 74.5 |
| WritingBench (critic score, max = 10) | 8.65 | 8.61 | 8.76 | 8.79 |
Model Overview
ERNIE-4.5-21B-A3B-Thinking is a text MoE post-trained model, with 21B total parameters and 3B activated parameters for each token. The following are the model configuration details:
| Key | Value |
|---|---|
| Modality | Text |
| Training Stage | Post-training |
| Params (Total / Activated) | 21B / 3B |
| Layers | 28 |
| Heads (Q / KV) | 20 / 4 |
| Text Experts (Total / Activated) | 64 / 6 |
| Vision Experts (Total / Activated) | 64 / 6 |
| Shared Experts | 2 |
| Context Length | 131,072 |
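To confirm these values against the released checkpoint, a minimal sketch like the one below loads only the Hugging Face config (no weights) and prints the standard fields; MoE-specific attribute names can differ in ERNIE's config, so treat it as illustrative:
from transformers import AutoConfig

# Load only the config (no weights are downloaded).
cfg = AutoConfig.from_pretrained("baidu/ERNIE-4.5-21B-A3B-Thinking")

# Standard Hugging Face config fields; guard the less universal ones with getattr.
print("layers:", cfg.num_hidden_layers)
print("attention heads (Q):", cfg.num_attention_heads)
print("KV heads:", getattr(cfg, "num_key_value_heads", "n/a"))
print("max context:", getattr(cfg, "max_position_embeddings", "n/a"))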
GPU Configuration Guide (Practical Setups)
Notes
• The official FastDeploy example assumes a single 80 GB GPU.
• Long context greatly increases KV-cache memory; reduce max sequence length or batch size if you run out of VRAM.
• INT8/4-bit options depend on your stack; prefer FastDeploy or carefully validated transformers quantization.
| Scenario | Precision / Stack | Min VRAM that works* | Recommended | Example setup | Tips |
|---|---|---|---|---|---|
| Single-GPU, standard context (≤8k–16k), batch 1 | BF16, FastDeploy 2.2+ | 80 GB | 80–96 GB | 1× A100/H100 80 GB | Use the sample command (--tensor-parallel-size 1; --max-model-len 131072 is adjustable). |
| Multi-GPU tensor parallel | BF16, FastDeploy | 2×40 GB | 2×40 GB – 4×24 GB | 2× A100 40 GB, or 4× L40S 24 GB | Set --tensor-parallel-size to the number of GPUs; lower --max-model-len for stability. |
| Transformers (no server), inference only | BF16 | 48–80 GB | 80 GB | 1× 80 GB; or 2× 40 GB with device_map="auto" | Start with max_new_tokens ≤ 1024, batch 1; watch CPU RAM for MoE routing buffers. |
| Transformers w/ 8-bit weights (experimental) | INT8 / LLM.int8() | 32–48 GB | 48–64 GB | 1× 48 GB (RTX 6000 Ada / 4090) | Quantize weights only; the KV cache remains BF16/FP16, so limit sequence length. |
| vLLM (parsers WIP) | BF16 | 80 GB | 80–96 GB | 1× 80 GB | Until the ERNIE reasoning/tool parsers land, treat it as standard CausalLM serving. |
| Long-context (≥64k, batch 1) | BF16 | 80–120 GB | 120–160 GB (or multi-GPU) | 2× 80 GB | KV cache dominates; reduce --max-model-len or use a paged KV cache if available. |
| Low-VRAM fallback | CPU offload + 8-bit | 24–32 GB GPU + large CPU RAM | 32–48 GB | 1× 24–32 GB + fast NVMe | Very slow; keep max_model_len small and batch size 1. |
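As a rough illustration of why the KV cache dominates at long context, here is a back-of-the-envelope estimate in Python; the head dimension is an assumption for illustration only (check the model config for the real value):
# Rough KV-cache size per sequence: 2 (K and V) x layers x KV heads x head_dim x seq_len x bytes/element
layers, kv_heads = 28, 4      # from the model card
head_dim = 128                # assumption for illustration; read it from the config
bytes_per_elem = 2            # BF16
for seq_len in (8_192, 32_768, 131_072):
    kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    print(f"{seq_len:>7} tokens ≈ {kv_bytes / 1024**3:.2f} GiB of KV cache per sequence")
Multiply by the number of concurrent sequences and add the model weights (roughly 42 GB in BF16 for 21B parameters) to see why the long-context rows above call for 80 GB-class GPUs or multi-GPU setups.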
Resources
Link: https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking
Step-by-Step Process to Install & Run ERNIE-4.5-21B-A3B-Thinking Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side. Select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running ERNIE-4.5-21B-A3B-Thinking, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like ERNIE-4.5-21B-A3B-Thinking
- Compatibility with CUDA 12.1.1, which certain model operations require
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like ERNIE-4.5-21B-A3B-Thinking.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel version contains the full CUDA toolkit with nvcc.
This setup ensures that ERNIE-4.5-21B-A3B-Thinking runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we'll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, install pip manually.
Install pip via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv ernie
source ernie/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
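Optionally, run a quick sanity check to confirm that the CUDA build of PyTorch was installed and can see the GPU:
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"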
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install --upgrade transformers accelerate pillow huggingface_hub
pip install --upgrade pip setuptools wheel
pip install --upgrade blobfile
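Optionally, you can pre-download the model weights now so the script in Step 13 doesn't block on a long download. This is a small sketch using huggingface_hub, which was installed above:
from huggingface_hub import snapshot_download

# Downloads (or resumes) all files for the repo into the local Hugging Face cache.
snapshot_download("baidu/ERNIE-4.5-21B-A3B-Thinking")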
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running the model script with the ERNIE-4.5-21B-A3B-Thinking model, it's a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., app.py) and add the following code:
import re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
name = "baidu/ERNIE-4.5-21B-A3B-Thinking"
# Silence the legacy tokenizer warning:
tok = AutoTokenizer.from_pretrained(name, legacy=False)
# Newer HF warns: use dtype instead of torch_dtype
model = AutoModelForCausalLM.from_pretrained(
    name, dtype=torch.bfloat16, device_map="auto"
)
messages = [
{"role":"system","content":"You are a helpful assistant. Respond in English only. Do NOT include <think> content."},
{"role":"user","content":"Give me a short introduction to large language models."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], add_special_tokens=False, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.9)
gen = tok.decode(out[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
def extract_response(s: str) -> str:
    # strip <think>…</think> if present
    s = re.sub(r"<think>.*?</think>\s*", "", s, flags=re.DOTALL|re.IGNORECASE)
    # prefer only the <response>…</response> block
    m = re.search(r"<response>\s*(.*?)\s*</response>", s, flags=re.DOTALL|re.IGNORECASE)
    return (m.group(1).strip() if m else s.strip())
print(extract_response(gen))
What the Script Does:
Imports & model ID
- Brings in re, torch, and the Transformers helpers.
- Sets name = "baidu/ERNIE-4.5-21B-A3B-Thinking" so every call uses that HF repo.
Load the tokenizer (modern behavior)
- AutoTokenizer.from_pretrained(..., legacy=False) turns off the old LLaMA-style tokenization behavior and the noisy warning.
- Downloads the tokenizer files if not cached.
Load the model on your GPU(s)
- AutoModelForCausalLM.from_pretrained(..., dtype=torch.bfloat16, device_map="auto")
- dtype=torch.bfloat16 → good numerical stability on H100/H200 while saving VRAM.
- device_map="auto" → automatically places weights on available GPU(s) (and CPU if needed); no manual .to("cuda") required.
- Pulls 9 safetensors shards (first run) and builds the model.
Build a chat prompt with the model's template
- messages = [...] defines a system rule (English only, no <think>) and a user prompt.
- tok.apply_chat_template(..., add_generation_prompt=True) converts those into the exact string/tokens ERNIE expects for chat (adds special tokens and the assistant prefix).
Tokenize & move to the right device
- tok([...], return_tensors="pt").to(model.device) turns the text into input IDs and puts them where the model lives (GPU).
Generate a response
- model.generate(..., max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.9)
- Up to 512 new tokens.
- Sampling on (temperature/top-p) for a slightly varied but controlled answer. (Set do_sample=False for deterministic/greedy outputs.)
Decode only the newly generated tokens
- Slices off the prompt IDs and decodes just the continuation:
gen = tok.decode(out[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
Clean up ERNIE's "thinking" format
- Defines extract_response():
  - Strips any <think> ... </think> block with a regex.
  - If present, extracts the content inside <response> ... </response>.
  - Falls back to the raw text if no <response> tags are found.
Print the final, user-friendly answer
- print(extract_response(gen)) → you see only the polished reply, without the hidden reasoning.
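To make the clean-up step concrete, here is a quick illustration of extract_response() on a made-up generation (the exact tag layout ERNIE emits can vary):
raw = "<think>Weighing how much detail to include...</think>\n<response>LLMs are neural networks trained on huge text corpora to predict the next token.</response>"
print(extract_response(raw))
# -> LLMs are neural networks trained on huge text corpora to predict the next token.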
Step 14: Run the Script
Run the script from the following command:
python3 app.py
This will download the model and generate a response in the terminal.
When you run the script with python3 app.py, the ERNIE-4.5 model weights are downloaded (on the first run) and a response is printed directly in your terminal. By default, the response may appear in Chinese, as the model is multilingual and often defaults to its training language distribution. If you'd like the output in English or another specific language, you must explicitly instruct the model through the system prompt. We will experiment with this in Steps 15 & 16, where you'll learn how to guide ERNIE to respond in the language of your choice.
Step 15: Rewrite app.py for English
Update the system prompt in your script to include "Reply in English only" and set legacy=False in the tokenizer to avoid warnings — this will ensure the model responds in English.
Add the following code to the file:
import re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
name = "baidu/ERNIE-4.5-21B-A3B-Thinking"
# Silence the legacy tokenizer warning:
tok = AutoTokenizer.from_pretrained(name, legacy=False)
# Newer HF warns: use dtype instead of torch_dtype
model = AutoModelForCausalLM.from_pretrained(
    name, dtype=torch.bfloat16, device_map="auto"
)
messages = [
{"role":"system","content":"You are a helpful assistant. Respond in English only. Do NOT include <think> content."},
{"role":"user","content":"Give me a short introduction to large language models."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], add_special_tokens=False, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.9)
gen = tok.decode(out[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
def extract_response(s: str) -> str:
    # strip <think>…</think> if present
    s = re.sub(r"<think>.*?</think>\s*", "", s, flags=re.DOTALL|re.IGNORECASE)
    # prefer only the <response>…</response> block
    m = re.search(r"<response>\s*(.*?)\s*</response>", s, flags=re.DOTALL|re.IGNORECASE)
    return (m.group(1).strip() if m else s.strip())
print(extract_response(gen))
Step 16: Run the Script
Run the script from the following command:
python3 app.py
Now, this will generate the response in your terminal in English. If you want the output in any other language, simply modify the system prompt in the script to specify your desired language — for example, use "Reply in Spanish only" or "Answer in Hindi" as needed. The model will follow the instruction accordingly, so you can customize the language of the response directly within the prompt text in your code.
Step 17: Install vLLM
Run the following command to install vLLM:
pip install "vllm>=0.6.0"
Step 18: Install Python 3.10 Toolchain + Headers
Run the following command to install the Python 3.10 toolchain and headers:
# Python 3.10 toolchain + headers needed by vLLM
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Note: we already have Python 3.10 installed, but we add python3.10-dev (and the toolchain) because vLLM builds/uses native CUDA extensions and needs the Python 3.10 headers and libs to compile / load wheels correctly.
Step 19: Start the vLLM Server
Run the following command to start the vLLM server:
vllm serve baidu/ERNIE-4.5-21B-A3B-Thinking --max-model-len 8192
vllm serve
Starts a FastAPI HTTP server powered by vLLM's engine. It exposes OpenAI-compatible endpoints (e.g., /v1/chat/completions, /v1/completions) so you can call it with normal OpenAI-style requests.
baidu/ERNIE-4.5-21B-A3B-Thinking
Tells vLLM to pull this model from Hugging Face (first run downloads weights & tokenizer) and keep it in GPU memory for inference.
--max-model-len 8192
Sets the maximum total token window per request (prompt + tools + system + new tokens) to 8,192 tokens.
- Lower value → less KV cache memory, higher throughput and more concurrent requests.
- Higher value → more VRAM used per request, fewer concurrent sequences possible.
- This is an upper bound; you can still request smaller contexts.
What vLLM does under the hood
- Loads the tokenizer and weights; picks an efficient dtype automatically (BF16 on H100/H200; FP16 otherwise).
- Uses PagedAttention with a paged KV cache so multiple requests can run concurrently without massive fragmentation.
- Spawns an engine worker and an HTTP app; the default host is 0.0.0.0 and the port is 8000 (unless you pass --host/--port).
- Supports streaming responses (Server-Sent Events) and batched decoding for throughput.
What you can call after it’s up
- Chat endpoint (recommended): send messages=[...] with a system+user chat format; vLLM applies the model's chat template for you (see the Python client sketch below).
- Completions endpoint: send plain prompts if you prefer classic completion style.
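For example, here is a minimal Python sketch of the chat endpoint using the OpenAI client (pip install openai; the API key is a placeholder since no auth is configured, and the stop string mirrors the </response> convention noted later):
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="baidu/ERNIE-4.5-21B-A3B-Thinking",
    messages=[
        {"role": "system", "content": "Reply in English. Do not include <think>."},
        {"role": "user", "content": "Give me a 2-line intro to large language models."},
    ],
    max_tokens=256,
    temperature=0.2,
    stop=["</response>"],  # trims ERNIE's thinking/response wrapper if it appears
)
print(resp.choices[0].message.content)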
Performance/VRAM implications
- 8k context is light for your H200; you can raise to 16k/32k if you need longer context, at the cost of VRAM & throughput.
- Concurrency scales with free KV cache: a larger max_model_len or longer outputs → fewer parallel requests.
Defaults you didn’t specify (good to know)
- --tensor-parallel-size defaults to 1 (single GPU).
- --dtype is auto.
- --served-model-name defaults to the HF id; change it if you want a shorter API model name.
- --api-key is off by default; add one if you want auth on the server.
An example launch command with these flags spelled out is shown below.
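For instance, a launch that makes these defaults and options explicit (the served model name here is just an illustrative alias) could look like:
vllm serve baidu/ERNIE-4.5-21B-A3B-Thinking \
  --max-model-len 8192 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --served-model-name ernie-thinking \
  --port 8000
# add --api-key <YOUR_KEY> to require authentication
Note that if you set --served-model-name, API requests should use that alias as the "model" field instead of the full repo id.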
Success criteria (what you should see) for ERNIE-4.5-21B-A3B-Thinking with vLLM:
- Resolved architecture: Qwen2ForCausalLM (ERNIE 4.5 PT weights load via the Qwen2 causal LM class).
- Model load: lines like Loading checkpoint shards … followed by Loaded baidu/ERNIE-4.5-21B-A3B-Thinking.
- Routes listed: e.g. /v1/chat/completions, /v1/completions, /v1/models, /metrics (OpenAI-compatible).
- Started server process [PID] — vLLM engine + HTTP app spawned.
- Application startup complete.
- (Normal) You may see: torch_dtype is deprecated! Use dtype instead!
- Port: vLLM listens on 0.0.0.0:8000 by default (change with --port).
- (ERNIE-specific note) If you query directly, generations may include <think>…</think> and <response>…</response>; that's expected. Add a client-side stop at </response> or strip <think> if you want only the final answer.
Step 20: Verify the server is serving your model
# models list
curl http://localhost:8000/v1/models
You should see an entry like:
"id": "baidu/ERNIE-4.5-21B-A3B-Thinking", "max_model_len": 8192, ...
Ask a question (OpenAI chat endpoint)
curl -s http://$HOST:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "baidu/ERNIE-4.5-21B-A3B-Thinking",
"messages": [
{"role":"system","content":"Reply in English. Do not include <think>."},
{"role":"user","content":"Give me a 2-line intro to large language models."}
],
"max_tokens": 256,
"temperature": 0.2,
"stop": ["</response>"]
}'
stop: ["</response>"]
trims ERNIE’s thinking trace if it appears.
- If you enabled an API key at server start, add:
-H "Authorization: Bearer <YOUR_KEY>"
Conclusion
ERNIE-4.5-21B-A3B-Thinking is a powerful open-source Mixture-of-Experts language model designed for advanced reasoning, coding, and academic tasks. With support for 131K token context, strong function-calling, and multilingual capabilities, it excels in complex generation workflows. Thanks to its efficient 3B expert activation, it delivers high performance on modern GPUs like H100/H200 without overloading memory. Whether you’re using Transformers or deploying via vLLM or FastDeploy, ERNIE-4.5 offers flexibility, speed, and accuracy—making it an ideal choice for developers, researchers, and production-grade AI systems.