GLM-4.5 and GLM-4.5-Air are large-scale, cutting-edge language models designed to power a new generation of intelligent digital assistants, tools, and workflows. Built for both depth and efficiency, these models offer top-tier results across tasks like coding, problem solving, and natural conversation—making them perfect for teams building smart apps or anyone who wants advanced reasoning and real-world utility.
What sets them apart?
- Sheer Scale and Flexibility:
GLM-4.5 leads with an impressive 355 billion parameters (32 billion active at once), giving it the muscle for heavy reasoning and multi-step logic. GLM-4.5-Air is its lighter, more nimble sibling, featuring 106 billion parameters (12 billion active), ideal for faster, more efficient deployments without sacrificing performance.
- Two Modes for Real-World Use:
Both models feature dual operation styles. You get a “thinking mode” for complex, tool-using workflows—think longform problem-solving, coding, or advanced agent actions. When speed is the priority, just switch to “immediate response mode” for instant answers and quick turnarounds.
- Full Open Access:
Everything is released under the MIT license—meaning you can use these models commercially, modify them, or integrate them into your own products with no red tape. You’ll find base models, hybrid reasoning versions, and high-efficiency FP8 variants, ready for all sorts of secondary development.
- Serious Performance:
These models shine on industry benchmarks, with GLM-4.5 consistently ranking among the top three in head-to-head testing—outperforming or matching the best open and proprietary models in coding, reasoning, and intelligent task completion. GLM-4.5-Air delivers close results with a fraction of the hardware.
- Designed for Agents, Tools, and Real Apps:
Whether you’re powering coding assistants, building custom chatbots, or looking for robust backends for new products, these models are optimized for agent-like tool use and advanced logic.
Overall LLM Performance (12 Benchmarks)
Rank | Model | Score |
---|---|---|
1 | o3 | 65.0 |
2 | Grok 4 | 63.6 |
3 | GLM-4.5 | 63.2 |
4 | Claude 4 Opus | 60.9 |
5 | o4-mini (high) | 60.4 |
6 | GLM-4.5-Air | 59.8 |
7 | Claude 4 Sonnet | 59.2 |
8 | Gemini 2.5 Pro | 58.8 |
9 | Qwen3-235B-A22B-Thinking-2507 | 56.5 |
10 | DeepSeek-R1-0528 | 55.9 |
11 | Kimi K2 | 53.1 |
12 | GPT-4.1 | 48.7 |
13 | DeepSeek-V3-0324 | 46.3 |
Agentic Benchmark
Rank | Model | Score |
---|---|---|
1 | o3 | 61.1 |
2 | GLM-4.5 | 58.1 |
3 | Grok 4 | 55.4 |
4 | GLM-4.5-Air | 55.2 |
5 | Claude 4 Opus | 54.6 |
6 | Claude 4 Sonnet | 53.0 |
7 | o4-mini (high) | 51.0 |
8 | Qwen3-235B-A22B-Thinking-2507 | 47.2 |
9 | Kimi K2 | 47.2 |
10 | GPT-4.1 | 45.0 |
Reasoning Benchmark
Rank | Model | Score |
---|---|---|
1 | Grok 4 | 74.2 |
2 | Gemini 2.5 Pro | 71.4 |
3 | o4-mini (high) | 71.3 |
4 | o3 | 71.0 |
5 | Qwen3-235B-A22B-Thinking-2507 | 70.7 |
6 | DeepSeek-R1-0528 | 69.4 |
7 | GLM-4.5 | 68.8 |
8 | GLM-4.5-Air | 66.1 |
9 | Claude 4 Opus | 65.1 |
10 | Claude 4 Sonnet | 63.5 |
Coding Benchmark
Rank | Model | Score |
---|---|---|
1 | Claude 4 Opus | 55.5 |
2 | Claude 4 Sonnet | 53.0 |
3 | GLM-4.5 | 50.9 |
4 | o3 | 49.7 |
5 | Kimi K2 | 45.2 |
6 | GLM-4.5-Air | 41.5 |
7 | GPT-4.1 | 39.5 |
8 | Gemini 2.5 Pro | 37.2 |
9 | o4-mini (high) | 36.7 |
Model Downloads
You can try the model directly on Hugging Face or ModelScope, or download it by following the links below.
System Requirements
Inference
We provide minimum and recommended configurations for “full-featured” model inference. The data in the table below is based on the following conditions:
- All models use MTP layers and specify --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 to ensure competitive inference speed.
- The cpu-offload parameter is not used.
- Inference batch size does not exceed 8.
- All are executed on devices that natively support FP8 inference, ensuring both weights and cache are in FP8 format.
- Server memory must exceed 1 TB to ensure normal model loading and operation.
The models can run under the configurations in the table below:
Model | Precision | GPU Type and Count | Test Framework |
---|---|---|---|
GLM-4.5 | BF16 | H100 x 16 / H200 x 8 | sglang |
GLM-4.5 | FP8 | H100 x 8 / H200 x 4 | sglang |
GLM-4.5-Air | BF16 | H100 x 4 / H200 x 2 | sglang |
GLM-4.5-Air | FP8 | H100 x 2 / H200 x 1 | sglang |
Under the configurations in the table below, the models can utilize their full 128K context length:
Model | Precision | GPU Type and Count | Test Framework |
---|---|---|---|
GLM-4.5 | BF16 | H100 x 32 / H200 x 16 | sglang |
GLM-4.5 | FP8 | H100 x 16 / H200 x 8 | sglang |
GLM-4.5-Air | BF16 | H100 x 8 / H200 x 4 | sglang |
GLM-4.5-Air | FP8 | H100 x 4 / H200 x 2 | sglang |
Fine-tuning
The code can run under the configurations in the table below using Llama Factory:
Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
---|---|---|---|
GLM-4.5 | H100 x 16 | LoRA | 1 |
GLM-4.5-Air | H100 x 4 | LoRA | 1 |
The code can run under the configurations in the table below using Swift:
Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
---|---|---|---|
GLM-4.5 | H20 (96GiB) x 16 | LoRA | 1 |
GLM-4.5-Air | H20 (96GiB) x 4 | LoRA | 1 |
GLM-4.5 | H20 (96GiB) x 128 | SFT | 1 |
GLM-4.5-Air | H20 (96GiB) x 32 | SFT | 1 |
GLM-4.5 | H20 (96GiB) x 128 | RL | 1 |
GLM-4.5-Air | H20 (96GiB) x 32 | RL | 1 |
For this setup, we’re rolling with GLM-4.5-Air (FP8 precision)—a streamlined, high-efficiency model that’s designed to deliver powerful reasoning and coding capabilities with just two H100 GPUs or a single H200. Running under the sglang framework, this configuration hits the sweet spot for teams and builders who want industry-level performance without needing a massive GPU cluster. The FP8 variant squeezes out extra speed and memory efficiency, making it perfect for both experimentation and real-world deployment. If you want advanced capabilities in a compact, accessible package, this is the model to choose.
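As a rough sanity check on those GPU counts (an illustrative estimate, not an official sizing guide), the FP8 weights alone account for most of the memory footprint:
# Back-of-the-envelope VRAM estimate for GLM-4.5-Air in FP8.
# Illustrative only: ignores activations, KV cache, and framework overhead.
total_params = 106e9        # total parameters reported for GLM-4.5-Air
bytes_per_param_fp8 = 1     # FP8 stores one byte per parameter
weight_gb = total_params * bytes_per_param_fp8 / 1e9
print(f"Approximate weight memory: {weight_gb:.0f} GB")  # ~106 GB
# Two 80 GB H100s (160 GB total) or one 141 GB H200 can hold the weights,
# which matches the configuration table above.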
Resources
Link: https://github.com/zai-org/GLM-4.5
Step-by-Step Process to Install & Run GLM-4.5 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 4 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running GLM-4.5, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like GLM-4.5
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like GLM-4.5.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that GLM-4.5 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the available Python version:
python3 --version
The system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv wan
source wan/bin/activate
Step 13: Clone the GLM 4.5 Repository
Run the following command to clone the GLM 4.5 repository:
git clone https://github.com/THUDM/GLM-4.5.git
cd GLM-4.5
Step 14: Install Python Dependencies
Run the following command to install PyTorch built for CUDA 12.1:
pip install torch --extra-index-url https://download.pytorch.org/whl/cu121
Step 15: Install Requirements File
Run the following command to install the packages from the requirements.txt file:
pip install -r requirements.txt
Step 16: Install Missing Dependencies
Run the following command to install all the missing dependencies:
pip install sgl-kernel
pip install orjson
sudo apt update
sudo apt install -y numactl
pip install torchao
Step 17: Download the GLM-4.5-Air-FP8 Model
Option 1: Download from HuggingFace
pip install huggingface_hub
huggingface-cli download zai-org/GLM-4.5-Air-FP8 --local-dir ./glm-4.5-air-fp8
Option 2: Direct link from ModelScope or HuggingFace.
We will go with Option 1.
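If you prefer to script the download instead of using the CLI, a minimal sketch with the huggingface_hub Python API (same repo ID and target directory as above) looks like this:
# Minimal sketch: download the FP8 weights via the huggingface_hub Python API.
# Repo ID and target directory match the CLI command shown above.
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="zai-org/GLM-4.5-Air-FP8",
    local_dir="./glm-4.5-air-fp8",
)
print("Model downloaded to ./glm-4.5-air-fp8")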
Step 18: Start the Model with SGLang (FP8)
Run the following command to start the model server with SGLang (use --tp-size 4 instead of 2 if you have 4 GPUs):
python3 -m sglang.launch_server \
--model-path ./glm-4.5-air-fp8 \
--tp-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--disable-shared-experts-fusion \
--served-model-name glm-4.5-air-fp8 \
--host 0.0.0.0 \
--port 8000
What You’ll See
Once you execute this command, the terminal will display several initialization logs. When the process is complete and successful, you should see:
- INFO logs showing the model loading progress and memory usage.
- Confirmation that the server has started and is running on http://0.0.0.0:8000
- Log messages like:
[INFO] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[INFO] The server is fired up and ready to roll!
This confirms that your GLM-4.5 model server is up and running, ready to handle requests on port 8000.
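If you prefer to confirm readiness programmatically rather than by reading logs, a small sketch like the one below pings the server from Python. It assumes SGLang's standard /health and /get_model_info endpoints are exposed by your build; adjust the host and port if you changed them in the launch command.
# Quick readiness check for the SGLang server started above (assumes the
# standard /health and /get_model_info endpoints are available).
import requests
base_url = "http://localhost:8000"
health = requests.get(f"{base_url}/health", timeout=10)
print("Health status code:", health.status_code)   # 200 means the server is up
info = requests.get(f"{base_url}/get_model_info", timeout=10)
print("Model info:", info.json())                   # should reference glm-4.5-air-fp8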
Step 19: Test the Model with a cURL Request
Now that your GLM-4.5 model server is running, you can easily test it from the terminal using a simple curl command. This lets you send prompts and see the raw JSON response directly, which is perfect for quick validation before integrating with code or a UI.
Example command:
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "What is GLM-4.5? Give a one-line summary.",
    "sampling_params": {
      "max_new_tokens": 64
    }
  }'
What You'll See
- The model will respond in JSON format, including the generated answer and meta information.
- In this example, the request asks for a concise, one-line summary of GLM-4.5.
- The generated answer appears in the "text" field of the response; because /generate sends the raw prompt without a chat template, the output may include a reasoning trace wrapped in <think>...</think> tags.
Tip: To get direct answers without the verbose reasoning trace, call the OpenAI-compatible chat endpoint with "chat_template_kwargs": {"enable_thinking": false}, as shown in the Python sketch below.
Try different prompts.
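Here is a minimal sketch of that chat-style call using the openai Python client (install it first with pip install openai). It assumes the server from Step 18 is still running on port 8000 and that your SGLang build accepts chat_template_kwargs passed through extra_body, as described in the GLM-4.5 documentation; the API key is a placeholder because the local server does not check it.
# Minimal sketch: query the OpenAI-compatible chat endpoint exposed by SGLang.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="glm-4.5-air-fp8",  # matches --served-model-name from Step 18
    messages=[
        {"role": "user", "content": "What is GLM-4.5? Give a one-line summary."}
    ],
    max_tokens=64,
    # Assumed to disable the reasoning trace so the model answers directly.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)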
Step 20: Automate Model Testing with Python
To make interacting with your GLM-4.5 model even easier, you can write a quick Python script to send prompts and process answers programmatically. This is especially useful for batch testing, custom workflows, or integrating with your own applications.
Example script (glm_test.py):
import requests
# Send a prompt to the local SGLang /generate endpoint started in Step 18.
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "text": "Only output a single, direct line and nothing else: What is GLM-4.5?",
        "sampling_params": {"max_new_tokens": 32}
    }
)
# The generated completion is returned in the "text" field.
result = response.json()["text"]
# Keep only the part before "<think>" if a reasoning trace is present.
answer = result.split('<think>')[0].strip()
print(answer)
Try different prompts.
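Since the endpoint is plain HTTP, batch testing is a small extension of the same script. The sketch below loops over a few illustrative prompts (the list is made up for demonstration) and prints each cleaned answer:
# Minimal sketch: batch-test several prompts against the same /generate endpoint.
import requests
prompts = [
    "Summarize what GLM-4.5 is in one sentence.",
    "List two tasks GLM-4.5-Air is suited for.",
    "Explain FP8 precision in one line.",
]
for prompt in prompts:
    response = requests.post(
        "http://localhost:8000/generate",
        json={"text": prompt, "sampling_params": {"max_new_tokens": 64}},
    )
    result = response.json()["text"]
    answer = result.split("<think>")[0].strip()
    print(f"Q: {prompt}\nA: {answer}\n")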
Conclusion
With GLM-4.5 and GLM-4.5-Air, you’re not just running another large language model—you’re unlocking new possibilities for intelligent apps, coding assistants, and real-world AI workflows. Whether you’re an enthusiast experimenting on a single node or an engineer scaling up for production, this open-source stack gives you state-of-the-art performance without the usual barriers.
By following the steps above, you’ve gone from cloud VM provisioning all the way to sending prompts and scripting model tests in Python. Now you’re ready to build, automate, or even integrate with custom UIs. The rest is up to your imagination!