Qwen3-235B-A22B-Instruct-2507 is a powerful language model designed to follow instructions, solve complex problems, and generate well-structured content across a wide range of topics. It is a mixture-of-experts (MoE) model with 235 billion total parameters, of which 22 billion are activated per token, an approach that keeps inference efficient without sacrificing quality.
This version delivers standout improvements in reasoning, multilingual support, long-context comprehension (up to 256,000 tokens), and subjective response quality. Whether it’s tackling math, science, creative writing, or multi-step tasks, it responds with clarity and intent. Optimized for real-world applications, it fits perfectly into agent-based systems, coding assistants, and any setup that demands deep understanding and reliable text generation.
If you’re building next-gen tools, handling complex tasks, or just exploring the limits of what’s possible with advanced text models—this one’s worth checking out.
Performance
| Benchmark | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 Non-thinking | Kimi K2 | Qwen3-235B-A22B Non-thinking | Qwen3-235B-A22B-Instruct-2507 |
|---|---|---|---|---|---|---|
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | – | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | – | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | – | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | – | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |
Qwen3-235B-A22B-Instruct-2507 GPU VM Configuration Table
| Level | GPU(s) | GPU Memory | vCPUs | RAM | Disk (SSD/NVMe) | Expected Use Case | Notes |
|---|---|---|---|---|---|---|---|
| Minimum (Working) | 4× A100 80GB | 320 GB | 64 vCPUs | 256 GB | 300 GB | Slow but stable inference (~1.5–2.5x slower) | Must use bf16/fp16, device_map=auto; long load time |
| Intermediate | 2× H100 80GB | 160 GB | 64–96 vCPUs | 256–384 GB | 500 GB | May need aggressive offloading or quantized weights | Might OOM with longer token generation or high batch size |
| Recommended | 4× H100 80GB | 320 GB | 96–128 vCPUs | 512 GB | 500 GB+ | Fast inference with full model support | Smooth runtime with transformers ≥ 4.51.0, no quantization required |
| Maximum (Production) | 8× H100 80GB | 640 GB | 128–192 vCPUs | 768–1024 GB | 1 TB+ | Enterprise workloads, batch inference, chat APIs | Supports larger max_tokens, concurrent users, faster throughput |
| Extreme Benchmarking | 8× H100 SXM + NVLink | 640 GB (NVLink) | 192–256 vCPUs | 1 TB+ | 1 TB+ (NVMe RAID) | Red teaming, eval runs, token throughput testing | NVLink helps with faster inter-GPU communication (vLLM/vLLM-MoE) |
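The tiers above assume the plain transformers workflow used in this tutorial. For the Maximum and Extreme tiers, a dedicated serving stack such as vLLM is usually a better fit. As a rough sketch only (the flags are assumptions to adjust for your setup, using 8-way tensor parallelism and the model's full 256K context of 262,144 tokens):
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 --tensor-parallel-size 8 --max-model-len 262144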
Resources
Link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
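If you prefer to pre-download the checkpoint from this repository later on the VM, so the first run doesn't block on a download of several hundred GB, a minimal sketch using huggingface_hub (assuming it is installed and the target disk has enough free space):
pip install -U huggingface_hub
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-235B-A22B-Instruct-2507')"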
Step-by-Step Process to Install & Run Qwen3-235B-A22B-Instruct-2507 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button, and configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 4× H100 SXM GPUs for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
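If you don't have a key pair yet, standard OpenSSH tooling works; a minimal sketch (the comment string is just a placeholder label):
ssh-keygen -t ed25519 -C "your_email@example.com"
You then paste the contents of the generated .pub file into the SSH Key option when creating the VM.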
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Qwen3-235B-A22B-Instruct-2507, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Qwen3-235B-A22B-Instruct-2507
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations, which is perfect for installing dependencies, running benchmarks, and launching Qwen3-235B-A22B-Instruct-2507.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Qwen3-235B-A22B-Instruct-2507 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
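The exact command is shown on the instance's Connect page; it follows the usual SSH pattern, roughly like the placeholder below (user, port, IP, and key path will differ for your deployment):
ssh -i ~/.ssh/id_ed25519 -p <PORT> <USER>@<PROXY_OR_DIRECT_SSH_IP>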
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the Python version currently available on the system:
python3 --version
By default, the system has Python 3.8.1 installed. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following command to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv qwen3-env
source qwen3-env/bin/activate
Step 13: Install Python Dependencies
Run the following command to install dependencies:
pip install --upgrade transformers accelerate einops
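The Recommended tier in the table above calls for transformers ≥ 4.51.0 (earlier releases may not load Qwen3 MoE checkpoints), and torch is pulled in automatically as a dependency of these packages if the image doesn't already provide it. A quick sanity check before moving on:
python3 -c "import torch, transformers; print(transformers.__version__, torch.__version__, torch.cuda.is_available())"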
Step 14: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you'll see SSH: 209.137.198.14 (your VM IP) in the bottom-left status bar (like in the image).
Step 15: Create the Python File
Create a Python script (e.g., run_qwen3.py) and add the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",  # Automatically use all 4x H100 GPUs
    trust_remote_code=True
)
# Prompt
prompt = "Give me a short introduction to large language model."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Tokenize input
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Generate
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
# Decode output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("Generated content:\n", content)
Step 16: Set Environment for MoE Stability
Before running your Python script, set this (helps reduce CUDA fragmentation):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Step 17: Run the Script
Run the script to generate a response:
python3 run_qwen3.py
Check Output
Conclusion
And there you have it — a complete walkthrough to get Qwen3-235B-A22B-Instruct-2507 up and running on a high-performance virtual machine. From setting up your GPU-powered environment to generating your first piece of output, this guide should help you unlock the full capabilities of one of the most advanced language models available today.
What makes this model stand out isn’t just its scale — it’s the blend of speed, precision, and its ability to follow through on complex tasks like reasoning, writing, and multilingual responses. Whether you’re building interactive tools, exploring long-form generation, or integrating the model into a larger system, this setup ensures a smooth and powerful experience.