KAT-Dev-32B (Kwaipilot/KAT-Dev) is a 32.8B-parameter coding assistant based on Qwen3-32B, purpose-tuned for software engineering. It’s trained in three phases—mid-training (core skills), SFT + RFT (curated tasks with teacher trajectories), and large-scale agentic RL (prefix caching + trajectory pruning + scalable infra). On SWE-Bench Verified, it reports 62.4% resolved, placing it among the strongest open-source code models at its scale. It supports HF Transformers and vLLM, uses a Qwen-style chat template, and is well-suited for repo-level reasoning, tool use, and multi-turn debugging.
| Rank | Model | SWE-Bench Verified (%) |
|---|---|---|
| 1 | GPT-5-Codex | 74.5% |
| 2 | KAT-Coder | 73.4% |
| 3 | GPT-5 | 72.8% |
| 4 | Claude Sonnet 4 | 72.7% |
| 5 | Gemini 2.5 Pro | 67.2% |
GPU Configuration (Inference, Rule-of-Thumb)
| Scenario | Precision / Load | Min VRAM that works | Comfortable VRAM | Typical Setup | Notes / Tips |
|---|---|---|---|---|---|
| Single-GPU (unquantized) | BF16/FP16 | 80 GB | 96–120 GB | 1× H100 80GB (SXM/PCIe) | Pure BF16 weights ~65 GB; add KV cache + activations ⇒ ~80 GB needed for any headroom. Keep max_new_tokens moderate. |
| Dual-GPU (tensor parallel) | BF16/FP16, TP=2 | 2× 40 GB | 2× 80 GB | 2× A100 40GB (TP=2) | Use device_map="auto", accelerate/deepspeed, or vllm --tensor-parallel-size 2. NVLink preferred. |
| Quad-GPU (tensor parallel) | BF16/FP16, TP=4 | 4× 24–48 GB | 4× 48 GB | 4× L40S/A6000 48GB | Gives comfortable headroom for longer generations. Ensure a fast interconnect for stability. |
| Quantized (memory-saving) | 8-bit (bnb) / 4-bit | 24–48 GB | 48–80 GB | 1× A6000 48GB or 2× 3090/4090 | Use bitsandbytes (load_in_8bit/load_in_4bit) to trade a bit of quality for fit. Great for prototyping. |
| CPU offload hybrid | BF16/FP16 + offload | 24–40 GB + fast CPU/RAM/NVMe | 48 GB+ | Mixed GPU+CPU | Slower but workable if GPU is tight. Use accelerate max_memory mapping or device_map="auto". |
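If you only have a 24–48 GB card, a quantized load is the usual way to make the model fit. Here is a minimal sketch using bitsandbytes 4-bit quantization through Transformers; it assumes bitsandbytes is installed (pip install bitsandbytes), and the exact quality/VRAM trade-off will vary with your workload.

```python
# Minimal sketch: load KAT-Dev in 4-bit with bitsandbytes (assumes `pip install bitsandbytes`)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 usually preserves quality well
    bnb_4bit_compute_dtype=torch.bfloat16,   # do the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/KAT-Dev")
model = AutoModelForCausalLM.from_pretrained(
    "Kwaipilot/KAT-Dev",
    quantization_config=bnb_config,
    device_map="auto",   # spreads layers across available GPUs / CPU
)
```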
Resources
Link: https://huggingface.co/Kwaipilot/KAT-Dev
Step-by-Step Process to Install & Run KAT-Dev Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image (Use the Jupyter Template)
We’ll use the Jupyter image from NodeShift’s gallery so you don’t have to install Jupyter Notebook/Lab manually. This image is GPU-ready and comes with a preconfigured Python + Jupyter environment—perfect for testing and serving KAT-Dev.
What you’ll do
- pick the Jupyter template,
- (optionally) pick a CUDA/PyTorch variant if the UI offers it,
- open JupyterLab in your browser,
- install the few project-specific Python packages inside that environment.
How to select it
- In the Create VM flow, go to Choose an Image → Templates.
- Click Jupyter (see screenshot). You’ll see a short description like “A web-based interactive computing platform for data science.”
- If a version/stack dropdown appears, choose the latest CUDA 12.x / PyTorch variant (or “GPU-enabled” build).
- Click Create (or Next) to proceed to sizing and networking.
Why this image
- JupyterLab is already installed and enabled as a service, so the VM boots straight into a working notebook server.
- GPU drivers + CUDA runtime are aligned with the template, so PyTorch will detect your GPU out of the box.
- You can manage everything (terminals, notebooks, file browser) from the Jupyter UI—no extra desktop or VNC needed.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Access Your Deployment
Once your GPU VM is in the RUNNING state, you’ll see a control menu (three dots on the right side of the deployment card). This menu gives you multiple ways to access and manage your deployment.
Available Options
- Edit Name
Rename your deployment for easier identification (e.g., “KAT-Dev”).
- Open Jupyter Notebook
- Click this to launch the pre-installed Jupyter environment directly in your browser.
- You’ll be taken to JupyterLab, where you can open notebooks, create terminals, and run code cells to set up KAT-Dev.
- This is the most user-friendly way to start working immediately without additional setup.
- Connect with SSH
- Choose this if you prefer command-line access.
- You’ll get the SSH connection string (e.g., ssh -i <your-key> user@<vm-ip>).
- Use this method for advanced management, server setups (like vLLM/SGLang), or installing additional system packages.
- Show Logs
- View system/service logs for debugging (useful if something isn’t starting correctly).
- Helps verify GPU initialization or catch errors during startup.
- Update Tags
- Add labels or tags to organize multiple deployments.
- Example: tag by project, model type, or experiment.
- Destroy Unit
- This permanently shuts down and deletes your VM.
- Use only when you are done, as this action cannot be undone.
Recommended Path for KAT-Dev
- For beginners / testing: Use Open Jupyter Notebook → open a Terminal inside JupyterLab → install the required Python packages → run a quick generation test.
- For production / serving APIs: Use Connect with SSH → start vLLM or SGLang on the VM → expose ports (8000/30000) → connect via API clients (see the sketch below).
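For the production path, here is a sketch of serving KAT-Dev through vLLM's OpenAI-compatible server. It assumes vLLM is installed (pip install vllm); on multi-GPU boxes, add --tensor-parallel-size 2 (or 4) to shard the model.

```bash
# Serve KAT-Dev on port 8000 via the OpenAI-compatible API (assumes `pip install vllm`)
vllm serve Kwaipilot/KAT-Dev --port 8000

# From another terminal, send a test chat completion:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Kwaipilot/KAT-Dev", "messages": [{"role": "user", "content": "Write a Python function that reverses a linked list."}]}'
```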
Step 8: Open Jupyter Notebook
Once your VM is running, you can directly access the Jupyter Notebook environment provided by NodeShift. This will be your main workspace for running KAT-Dev.
1. Click Open Jupyter Notebook
- From the My GPU Deployments panel, click the three-dot menu on your deployment card.
- Select Open Jupyter Notebook.
This will open a new browser tab pointing to your VM’s Jupyter instance.
2. Handle the Browser Security Warning
Since the Jupyter server is running with a self-signed SSL certificate, your browser may show a “Your connection is not private” warning.
- Click Advanced.
- Then, click Proceed to <your-vm-ip> (unsafe).
Don’t worry — this is expected. You’re connecting directly to your VM’s Jupyter server, not a public website.
3. JupyterLab Interface Opens
Once you proceed, you’ll land inside JupyterLab. Here you’ll see:
- Notebook options (Python 3, Python 3.10, etc.)
- Console options (interactive shells)
- Other tools like a Terminal, Text File, and Markdown File.
You can now use the Terminal inside JupyterLab to install dependencies and start working with KAT-Dev.
Step 9: Open Python 3.10 Notebook and Rename
Now that JupyterLab is running, let’s create a notebook where we will set up and run KAT-Dev.
1. Open a Python 3.10 Notebook
- In the Launcher screen, under Notebook, click on Python3.10 (python_310).
- This will open a new notebook editor with an empty code cell where you can type commands.
2. Rename the Notebook
- By default, the notebook will open as something like Untitled.ipynb.
- To rename:
- Right-click on the notebook tab name at the top.
- Select Rename Notebook….
- Enter a meaningful name such as KAT-Dev.ipynb and press Enter to confirm.
3. Verify the Editor
- You should now see an empty notebook named KAT-Dev.ipynb with a code cell ready.
- This is where you’ll run all the setup commands (installing dependencies, loading the model, and generating a first test response).
Step 10: Verify GPU Availability
Before installing and running KAT-Dev-32B, it’s important to confirm that your VM has successfully attached the GPU and that CUDA is working.
1. Run nvidia-smi
In your Jupyter Notebook cell, type:
!nvidia-smi
2. Check the Output
You should see information about your GPU, similar to the screenshot:
- GPU Name → NVIDIA H100 80GB HBM3
- Driver Version → 560.xx or similar
- CUDA Version → 12.x (here it shows 12.6)
- Memory Usage → confirms available VRAM (e.g., ~81 GB)
- Temperature / Power → current GPU status
3. Why This Step Matters
- Confirms that the GPU drivers are properly installed.
- Ensures the CUDA runtime matches your environment.
- Prevents wasted time later if the model fails to load due to GPU issues.
With GPU verified, you’re ready to proceed to the next step: installing the required Python libraries (Transformers, vLLM, SGLang, etc.) inside the notebook.
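If the Jupyter template ships with PyTorch preinstalled (NodeShift's GPU images typically do), you can also cross-check CUDA visibility from Python; otherwise, come back to this cell after installing the libraries in Step 11.

```python
# Optional cross-check that PyTorch sees the GPU
import torch

print("CUDA available:", torch.cuda.is_available())   # expect: True
print("Device:", torch.cuda.get_device_name(0))       # e.g., NVIDIA H100 80GB HBM3
print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))
```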
Step 11: Install Required Libraries (Torch + Transformers for KAT‑Dev‑32B)
Open a Terminal in JupyterLab
Go to: Launcher → Terminal
(or run the equivalent command in a Notebook cell, as shown below)
Paste the following into a notebook cell to install all the necessary libraries:

```python
# Install into the same Python environment that the notebook kernel uses
import sys
!{sys.executable} -m pip install torch transformers accelerate einops
```

If you are working in a plain terminal instead, the equivalent is pip install torch transformers accelerate einops.
This will install:
- torch – Core PyTorch library for GPU inference
- transformers – Hugging Face Transformers (used to load KAT-Dev-32B and apply the chat template)
- accelerate – Helps manage model offloading and device mapping on multi-GPU setups
- einops – Efficient tensor operations and rearrangement utilities used inside large models
These packages are essential to load and run Kwaipilot/KAT‑Dev‑32B inside a Jupyter Notebook.
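As a quick sanity check that everything landed in the kernel's environment, you can print the installed versions:

```python
# Confirm the installs are visible to this notebook kernel
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```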
Step 12: Download Model in Notebook
In your .ipynb file:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Kwaipilot/KAT-Dev"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use bf16 for H100
    device_map="auto",
)
```
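The bf16 weights alone are roughly 65 GB, so the first download and load will take a while. Once loading finishes, a quick check (assuming the device_map="auto" load above) shows where the layers landed and how much memory the weights occupy:

```python
# Inspect device placement and approximate weight memory
print(model.hf_device_map)                                  # layer -> device mapping
print(f"{model.get_memory_footprint() / 1024**3:.1f} GB")   # weights resident in memory
```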
Step 13: Prepare Chat Input
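KAT-Dev uses a Qwen-style chat template, so prompts are built with tokenizer.apply_chat_template rather than raw strings. A minimal sketch, reusing the tokenizer and model from Step 12 (the system prompt here is just an illustrative placeholder):

```python
# Build a chat-formatted prompt using the model's built-in template
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},  # illustrative placeholder
    {"role": "user", "content": "Explain what a Python decorator is."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn marker so the model replies
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
```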
Step 14: Generate Response
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Kwaipilot/KAT-Dev"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536,  # generous upper bound; lower it (e.g., 1024) for faster interactive runs
)

# keep only the newly generated tokens, then decode
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```
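Waiting for the full completion before seeing any output can feel slow in a notebook. One option, sketched here with Transformers' built-in TextStreamer, is to print tokens as they are generated:

```python
# Stream tokens to stdout as they are generated
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=1024, streamer=streamer)
```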
Conclusion
KAT-Dev-32B shows how far open-source coding models have come—combining scale, thoughtful training, and real-world evaluation on benchmarks like SWE-Bench. With NodeShift’s GPU-powered VMs, setting it up is straightforward whether you’re experimenting in Jupyter or serving it with vLLM. If you’re looking for a strong, open alternative for software engineering tasks—repo-level reasoning, bug fixing, or tool-augmented workflows—KAT-Dev is well worth trying out.