Qwen3-235B-A22B-Instruct-2507 is a powerful language model designed to follow instructions, solve complex problems, and generate well-structured content across a wide range of topics. It is a mixture-of-experts (MoE) model with 235 billion total parameters, of which 22 billion are activated per token, an approach that keeps inference efficient without sacrificing quality.
This version delivers standout improvements in reasoning, multilingual support, long-context comprehension (up to 256,000 tokens), and subjective response quality. Whether it’s tackling math, science, creative writing, or multi-step tasks, it responds with clarity and intent. Optimized for real-world applications, it fits perfectly into agent-based systems, coding assistants, and any setup that demands deep understanding and reliable text generation.
If you’re building next-gen tools, handling complex tasks, or just exploring the limits of what’s possible with advanced text models—this one’s worth checking out.
Performance
| Benchmark | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 Non-thinking | Kimi K2 | Qwen3-235B-A22B Non-thinking | Qwen3-235B-A22B-Instruct-2507 |
|---|---|---|---|---|---|---|
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | – | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | – | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | – | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | – | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |
Qwen3-235B-A22B-Instruct-2507 GPU VM Configuration Table
| Level | GPU(s) | GPU Memory | vCPUs | RAM | Disk (SSD/NVMe) | Expected Use Case | Notes |
|---|---|---|---|---|---|---|---|
| Minimum (Working) | 4× A100 80GB | 320 GB | 64 vCPUs | 256 GB | 300 GB | Slow but stable inference (~1.5–2.5x slower) | Must use bf16/fp16, device_map=auto; long load time |
| Intermediate | 2× H100 80GB | 160 GB | 64–96 vCPUs | 256–384 GB | 500 GB | May need aggressive offloading or quantized weights | Might OOM with longer token generation or high batch size |
| Recommended | 4× H100 80GB | 320 GB | 96–128 vCPUs | 512 GB | 500 GB+ | Fast inference with full model support | Smooth runtime with transformers ≥ 4.51.0, no quantization required |
| Maximum (Production) | 8× H100 80GB | 640 GB | 128–192 vCPUs | 768–1024 GB | 1 TB+ | Enterprise workloads, batch inference, chat APIs | Supports larger max_tokens, concurrent users, faster throughput |
| Extreme Benchmarking | 8× H100 SXM + NVLink | 640 GB (NVLink) | 192–256 vCPUs | 1 TB+ | 1 TB+ (NVMe RAID) | Red teaming, eval runs, token throughput testing | NVLink helps with faster inter-GPU communication (vLLM/vLLM-MoE) |
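The tiers above assume the plain transformers workflow used in this tutorial. For the Maximum and Extreme tiers, a dedicated serving stack such as vLLM is usually a better fit. As a rough sketch only (the flags are assumptions to adjust for your setup, using 8-way tensor parallelism and the model's full 256K context of 262,144 tokens):
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 --tensor-parallel-size 8 --max-model-len 262144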
Resources
Link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
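If you prefer to pre-download the checkpoint from this repository later on the VM, so the first run doesn't block on a download of several hundred GB, a minimal sketch using huggingface_hub (assuming it is installed and the target disk has enough free space):
pip install -U huggingface_hub
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-235B-A22B-Instruct-2507')"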
Step-by-Step Process to Install & Run Qwen3-235B-A22B-Instruct-2507 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button, and configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 4× H100 SXM GPUs for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
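If you don't have a key pair yet, standard OpenSSH tooling works; a minimal sketch (the comment string is just a placeholder label):
ssh-keygen -t ed25519 -C "your_email@example.com"
You then paste the contents of the generated .pub file into the SSH Key option when creating the VM.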
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Qwen3-235B-A22B-Instruct-2507, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Qwen3-235B-A22B-Instruct-2507
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations, which is perfect for installing dependencies, running benchmarks, and launching Qwen3-235B-A22B-Instruct-2507.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Qwen3-235B-A22B-Instruct-2507 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
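The exact command is shown on the instance's Connect page; it follows the usual SSH pattern, roughly like the placeholder below (user, port, IP, and key path will differ for your deployment):
ssh -i ~/.ssh/id_ed25519 -p <PORT> <USER>@<PROXY_OR_DIRECT_SSH_IP>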
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the Python version currently available on the system:
python3 --version
By default, the system has Python 3.8.1 installed. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following command to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv qwen3-env
source qwen3-env/bin/activate
Step 13: Install Python Dependencies
Run the following command to install dependencies:
pip install --upgrade transformers accelerate einops
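The Recommended tier in the table above calls for transformers ≥ 4.51.0 (earlier releases may not load Qwen3 MoE checkpoints), and torch is pulled in automatically as a dependency of these packages if the image doesn't already provide it. A quick sanity check before moving on:
python3 -c "import torch, transformers; print(transformers.__version__, torch.__version__, torch.cuda.is_available())"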
Step 14: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you'll see SSH: 209.137.198.14 (your VM IP) in the bottom-left status bar (like in the image).
Step 15: Create the Python File
Create a Python script (e.g., run_qwen3.py) and add the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",  # Automatically use all 4x H100 GPUs
    trust_remote_code=True
)
# Prompt
prompt = "Give me a short introduction to large language model."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Tokenize input
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Generate
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
# Decode output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("Generated content:\n", content)
Step 16: Set Environment for MoE Stability
Before running your Python script, set this (helps reduce CUDA fragmentation):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Step 17: Run the Script
Run the script to generate a response:
python3 run_qwen3.py
Check Output
Conclusion
And there you have it — a complete walkthrough to get Qwen3-235B-A22B-Instruct-2507 up and running on a high-performance virtual machine. From setting up your GPU-powered environment to generating your first piece of output, this guide should help you unlock the full capabilities of one of the most advanced language models available today.
What makes this model stand out isn’t just its scale — it’s the blend of speed, precision, and its ability to follow through on complex tasks like reasoning, writing, and multilingual responses. Whether you’re building interactive tools, exploring long-form generation, or integrating the model into a larger system, this setup ensures a smooth and powerful experience.