GLM-4.5 and GLM-4.5-Air are large-scale, cutting-edge language models designed to power a new generation of intelligent digital assistants, tools, and workflows. Built for both depth and efficiency, these models offer top-tier results across tasks like coding, problem solving, and natural conversation—making them perfect for teams building smart apps or anyone who wants advanced reasoning and real-world utility.
What sets them apart?
- Sheer Scale and Flexibility:
GLM-4.5 leads with an impressive 355 billion parameters (32 billion active at once), giving it the muscle for heavy reasoning and multi-step logic. GLM-4.5-Air is its lighter, more nimble sibling, featuring 106 billion parameters (12 billion active), ideal for faster, more efficient deployments without sacrificing performance.
- Two Modes for Real-World Use:
Both models feature dual operation styles. You get a “thinking mode” for complex, tool-using workflows—think longform problem-solving, coding, or advanced agent actions. When speed is the priority, just switch to “immediate response mode” for instant answers and quick turnarounds.
- Full Open Access:
Everything is released under the MIT license—meaning you can use these models commercially, modify them, or integrate them into your own products with no red tape. You’ll find base models, hybrid reasoning versions, and high-efficiency FP8 variants, ready for all sorts of secondary development.
- Serious Performance:
These models shine on industry benchmarks, with GLM-4.5 consistently ranking among the top three in head-to-head testing—outperforming or matching the best open and proprietary models in coding, reasoning, and intelligent task completion. GLM-4.5-Air delivers close results with a fraction of the hardware.
- Designed for Agents, Tools, and Real Apps:
Whether you’re powering coding assistants, building custom chatbots, or looking for robust backends for new products, these models are optimized for agent-like tool use and advanced logic.
Overall LLM Performance (12 Benchmarks)
Rank | Model | Score |
---|---|---|
1 | o3 | 65.0 |
2 | Grok 4 | 63.6 |
3 | GLM-4.5 | 63.2 |
4 | Claude 4 Opus | 60.9 |
5 | o4-mini (high) | 60.4 |
6 | GLM-4.5-Air | 59.8 |
7 | Claude 4 Sonnet | 59.2 |
8 | Gemini 2.5 Pro | 58.8 |
9 | Qwen3-235B-A22B-Thinking-2507 | 56.5 |
10 | DeepSeek-R1-0528 | 55.9 |
11 | Kimi K2 | 53.1 |
12 | GPT-4.1 | 48.7 |
13 | DeepSeek-V3-0324 | 46.3 |
Agentic Benchmark
Rank | Model | Score |
---|---|---|
1 | o3 | 61.1 |
2 | GLM-4.5 | 58.1 |
3 | Grok 4 | 55.4 |
4 | GLM-4.5-Air | 55.2 |
5 | Claude 4 Opus | 54.6 |
6 | Claude 4 Sonnet | 53.0 |
7 | o4-mini (high) | 51.0 |
8 | Qwen3-235B-A22B-Thinking-2507 | 47.2 |
9 | Kimi K2 | 47.2 |
10 | GPT-4.1 | 45.0 |
Reasoning Benchmark
Rank | Model | Score |
---|---|---|
1 | Grok 4 | 74.2 |
2 | Gemini 2.5 Pro | 71.4 |
3 | o4-mini (high) | 71.3 |
4 | o3 | 71.0 |
5 | Qwen3-235B-A22B-Thinking-2507 | 70.7 |
6 | DeepSeek-R1-0528 | 69.4 |
7 | GLM-4.5 | 68.8 |
8 | GLM-4.5-Air | 66.1 |
9 | Claude 4 Opus | 65.1 |
10 | Claude 4 Sonnet | 63.5 |
Coding Benchmark
Rank | Model | Score |
---|---|---|
1 | Claude 4 Opus | 55.5 |
2 | Claude 4 Sonnet | 53.0 |
3 | GLM-4.5 | 50.9 |
4 | o3 | 49.7 |
5 | Kimi K2 | 45.2 |
6 | GLM-4.5-Air | 41.5 |
7 | GPT-4.1 | 39.5 |
8 | Gemini 2.5 Pro | 37.2 |
9 | o4-mini (high) | 36.7 |
Model Downloads
You can try the model directly on Hugging Face or ModelScope, or download it by following the links below.
System Requirements
Inference
We provide minimum and recommended configurations for “full-featured” model inference. The data in the table below is based on the following conditions:
- All models use MTP layers and specify --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 to ensure competitive inference speed.
- The cpu-offload parameter is not used.
- Inference batch size does not exceed 8.
- All are executed on devices that natively support FP8 inference, ensuring both weights and cache are in FP8 format.
- Server memory must exceed 1 TB to ensure normal model loading and operation.
The models can run under the configurations in the table below:
Model | Precision | GPU Type and Count | Test Framework |
---|---|---|---|
GLM-4.5 | BF16 | H100 x 16 / H200 x 8 | sglang |
GLM-4.5 | FP8 | H100 x 8 / H200 x 4 | sglang |
GLM-4.5-Air | BF16 | H100 x 4 / H200 x 2 | sglang |
GLM-4.5-Air | FP8 | H100 x 2 / H200 x 1 | sglang |
Under the configurations in the table below, the models can utilize their full 128K context length:
Model | Precision | GPU Type and Count | Test Framework |
---|---|---|---|
GLM-4.5 | BF16 | H100 x 32 / H200 x 16 | sglang |
GLM-4.5 | FP8 | H100 x 16 / H200 x 8 | sglang |
GLM-4.5-Air | BF16 | H100 x 8 / H200 x 4 | sglang |
GLM-4.5-Air | FP8 | H100 x 4 / H200 x 2 | sglang |
Fine-tuning
The code can run under the configurations in the table below using Llama Factory:
Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
---|---|---|---|
GLM-4.5 | H100 x 16 | LoRA | 1 |
GLM-4.5-Air | H100 x 4 | LoRA | 1 |
The code can run under the configurations in the table below using Swift:
Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
---|---|---|---|
GLM-4.5 | H20 (96GiB) x 16 | LoRA | 1 |
GLM-4.5-Air | H20 (96GiB) x 4 | LoRA | 1 |
GLM-4.5 | H20 (96GiB) x 128 | SFT | 1 |
GLM-4.5-Air | H20 (96GiB) x 32 | SFT | 1 |
GLM-4.5 | H20 (96GiB) x 128 | RL | 1 |
GLM-4.5-Air | H20 (96GiB) x 32 | RL | 1 |
For this setup, we’re rolling with GLM-4.5-Air (FP8 precision)—a streamlined, high-efficiency model that’s designed to deliver powerful reasoning and coding capabilities with just two H100 GPUs or a single H200. Running under the sglang framework, this configuration hits the sweet spot for teams and builders who want industry-level performance without needing a massive GPU cluster. The FP8 variant squeezes out extra speed and memory efficiency, making it perfect for both experimentation and real-world deployment. If you want advanced capabilities in a compact, accessible package, this is the model to choose.
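As a rough sanity check on those GPU counts (an illustrative estimate, not an official sizing guide), the FP8 weights alone account for most of the memory footprint:
# Back-of-the-envelope VRAM estimate for GLM-4.5-Air in FP8.
# Illustrative only: ignores activations, KV cache, and framework overhead.
total_params = 106e9        # total parameters reported for GLM-4.5-Air
bytes_per_param_fp8 = 1     # FP8 stores one byte per parameter
weight_gb = total_params * bytes_per_param_fp8 / 1e9
print(f"Approximate weight memory: {weight_gb:.0f} GB")  # ~106 GB
# Two 80 GB H100s (160 GB total) or one 141 GB H200 can hold the weights,
# which matches the configuration table above.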
Resources
Link: https://github.com/zai-org/GLM-4.5
Step-by-Step Process to Install & Run GLM-4.5 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 4 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running GLM-4.5, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like GLM-4.5
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like GLM-4.5.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that GLM-4.5 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the available Python version:
python3 --version
The system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv wan
source wan/bin/activate
Step 13: Clone the GLM 4.5 Repository
Run the following command to clone the GLM 4.5 repository:
git clone https://github.com/THUDM/GLM-4.5.git
cd GLM-4.5
Step 14: Install Python Dependencies
Run the following command to install PyTorch built for CUDA 12.1:
pip install torch --extra-index-url https://download.pytorch.org/whl/cu121
Step 15: Install Requirements File
Run the following command to install the packages from the requirements.txt file:
pip install -r requirements.txt
Step 16: Install Missing Dependencies
Run the following command to install all the missing dependencies:
pip install sgl-kernel
pip install orjson
sudo apt update
sudo apt install -y numactl
pip install torchao
Step 17: Download the GLM-4.5-Air-FP8 Model
Option 1: Download from HuggingFace
pip install huggingface_hub
huggingface-cli download zai-org/GLM-4.5-Air-FP8 --local-dir ./glm-4.5-air-fp8
Option 2: Direct link from ModelScope or HuggingFace.
We will go with Option 1.
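If you prefer to script the download instead of using the CLI, a minimal sketch with the huggingface_hub Python API (same repo ID and target directory as above) looks like this:
# Minimal sketch: download the FP8 weights via the huggingface_hub Python API.
# Repo ID and target directory match the CLI command shown above.
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="zai-org/GLM-4.5-Air-FP8",
    local_dir="./glm-4.5-air-fp8",
)
print("Model downloaded to ./glm-4.5-air-fp8")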
Step 18: Start the Model with SGLang (FP8)
Run the following command to start the model server with SGLang (use --tp-size 4 instead of 2 if you have 4 GPUs):
python3 -m sglang.launch_server \
--model-path ./glm-4.5-air-fp8 \
--tp-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--disable-shared-experts-fusion \
--served-model-name glm-4.5-air-fp8 \
--host 0.0.0.0 \
--port 8000
What You’ll See
Once you execute this command, the terminal will display several initialization logs. When the process is complete and successful, you should see:
- INFO logs showing the model loading progress and memory usage.
- Confirmation that the server has started and is running on http://0.0.0.0:8000
- Log messages like:
[INFO] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[INFO] The server is fired up and ready to roll!
This confirms that your GLM-4.5 model server is up and running, ready to handle requests on port 8000.
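If you prefer to confirm readiness programmatically rather than by reading logs, a small sketch like the one below pings the server from Python. It assumes SGLang's standard /health and /get_model_info endpoints are exposed by your build; adjust the host and port if you changed them in the launch command.
# Quick readiness check for the SGLang server started above (assumes the
# standard /health and /get_model_info endpoints are available).
import requests
base_url = "http://localhost:8000"
health = requests.get(f"{base_url}/health", timeout=10)
print("Health status code:", health.status_code)   # 200 means the server is up
info = requests.get(f"{base_url}/get_model_info", timeout=10)
print("Model info:", info.json())                   # should reference glm-4.5-air-fp8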
Step 19: Test the Model with a cURL Request
Now that your GLM-4.5 model server is running, you can easily test it from the terminal using a simple curl command. This lets you send prompts and see the raw JSON response directly, which is perfect for quick validation before integrating with code or a UI.
Example command:
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "What is GLM-4.5? Give a one-line summary.",
    "sampling_params": {
      "max_new_tokens": 64
    }
  }'
What You'll See
- The model will respond in JSON format, including the generated answer and meta information.
- In this example, the request asks for a concise, one-line summary of GLM-4.5.
- The generated answer appears in the "text" field of the response; because /generate sends the raw prompt without a chat template, the output may include a reasoning trace wrapped in <think>...</think> tags.
Tip: To get direct answers without the verbose reasoning trace, call the OpenAI-compatible chat endpoint with "chat_template_kwargs": {"enable_thinking": false}, as shown in the Python sketch below.
Try different prompts.
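Here is a minimal sketch of that chat-style call using the openai Python client (install it first with pip install openai). It assumes the server from Step 18 is still running on port 8000 and that your SGLang build accepts chat_template_kwargs passed through extra_body, as described in the GLM-4.5 documentation; the API key is a placeholder because the local server does not check it.
# Minimal sketch: query the OpenAI-compatible chat endpoint exposed by SGLang.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="glm-4.5-air-fp8",  # matches --served-model-name from Step 18
    messages=[
        {"role": "user", "content": "What is GLM-4.5? Give a one-line summary."}
    ],
    max_tokens=64,
    # Assumed to disable the reasoning trace so the model answers directly.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)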
Step 20: Automate Model Testing with Python
To make interacting with your GLM-4.5 model even easier, you can write a quick Python script to send prompts and process answers programmatically. This is especially useful for batch testing, custom workflows, or integrating with your own applications.
Example script (glm_test.py):
import requests
# Send a prompt to the local SGLang /generate endpoint started in Step 18.
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "text": "Only output a single, direct line and nothing else: What is GLM-4.5?",
        "sampling_params": {"max_new_tokens": 32}
    }
)
# The generated completion is returned in the "text" field.
result = response.json()["text"]
# Keep only the part before "<think>" if a reasoning trace is present.
answer = result.split('<think>')[0].strip()
print(answer)
Try different prompts.
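Since the endpoint is plain HTTP, batch testing is a small extension of the same script. The sketch below loops over a few illustrative prompts (the list is made up for demonstration) and prints each cleaned answer:
# Minimal sketch: batch-test several prompts against the same /generate endpoint.
import requests
prompts = [
    "Summarize what GLM-4.5 is in one sentence.",
    "List two tasks GLM-4.5-Air is suited for.",
    "Explain FP8 precision in one line.",
]
for prompt in prompts:
    response = requests.post(
        "http://localhost:8000/generate",
        json={"text": prompt, "sampling_params": {"max_new_tokens": 64}},
    )
    result = response.json()["text"]
    answer = result.split("<think>")[0].strip()
    print(f"Q: {prompt}\nA: {answer}\n")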
Conclusion
With GLM-4.5 and GLM-4.5-Air, you’re not just running another large language model—you’re unlocking new possibilities for intelligent apps, coding assistants, and real-world AI workflows. Whether you’re an enthusiast experimenting on a single node or an engineer scaling up for production, this open-source stack gives you state-of-the-art performance without the usual barriers.
By following the steps above, you’ve gone from cloud VM provisioning all the way to sending prompts and scripting model tests in Python. Now you’re ready to build, automate, or even integrate with custom UIs. The rest is up to your imagination!