KAT-Dev-32B (Kwaipilot/KAT-Dev) is a 32.8B-parameter coding assistant based on Qwen3-32B, purpose-tuned for software engineering. It’s trained in three phases—mid-training (core skills), SFT + RFT (curated tasks with teacher trajectories), and large-scale agentic RL (prefix caching + trajectory pruning + scalable infra). On SWE-Bench Verified, it reports 62.4% resolved, placing it among the strongest open-source code models at its scale. It supports HF Transformers and vLLM, uses a Qwen-style chat template, and is well-suited for repo-level reasoning, tool use, and multi-turn debugging.
| Rank | Model | SWE-Bench Verified (%) |
|---|---|---|
| 1 | GPT-5-Codex | 74.5% |
| 2 | KAT-Coder | 73.4% |
| 3 | GPT-5 | 72.8% |
| 4 | Claude Sonnet 4 | 72.7% |
| 5 | Gemini 2.5 Pro | 67.2% |
GPU Configuration (Inference, Rule-of-Thumb)
| Scenario | Precision / Load | Min VRAM that works | Comfortable VRAM | Typical Setup | Notes / Tips |
|---|---|---|---|---|---|
| Single-GPU (unquantized) | BF16/FP16 | 80 GB | 96–120 GB | 1× H100 80GB (SXM/PCIe) | Pure BF16 weights ~65 GB; add KV cache + activations ⇒ ~80 GB needed for any headroom. Keep max_new_tokens moderate. |
| Dual-GPU (tensor parallel) | BF16/FP16, TP=2 | 2× 40 GB | 2× 80 GB | 2× A100 40GB (TP=2) | Use device_map="auto", accelerate/deepspeed, or vllm --tensor-parallel-size 2. NVLink preferred. |
| Quad-GPU (tensor parallel) | BF16/FP16, TP=4 | 4× 24–48 GB | 4× 48 GB | 4× L40S/A6000 48GB | Gives comfortable headroom for longer generations. Ensure a fast interconnect for stability. |
| Quantized (memory-saving) | 8-bit (bnb) / 4-bit | 24–48 GB | 48–80 GB | 1× A6000 48GB or 2× 3090/4090 | Use bitsandbytes (load_in_8bit/load_in_4bit) to trade a bit of quality for fit. Great for prototyping. |
| CPU offload hybrid | BF16/FP16 + offload | 24–40 GB + fast CPU/RAM/NVMe | 48 GB+ | Mixed GPU+CPU | Slower but workable if GPU is tight. Use accelerate max_memory mapping or device_map="auto". |
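If you only have a 24–48 GB card, a quantized load is the usual way to make the model fit. Here is a minimal sketch using bitsandbytes 4-bit quantization through Transformers; it assumes bitsandbytes is installed (pip install bitsandbytes), and the exact quality/VRAM trade-off will vary with your workload.

```python
# Minimal sketch: load KAT-Dev in 4-bit with bitsandbytes (assumes `pip install bitsandbytes`)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 usually preserves quality well
    bnb_4bit_compute_dtype=torch.bfloat16,   # do the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/KAT-Dev")
model = AutoModelForCausalLM.from_pretrained(
    "Kwaipilot/KAT-Dev",
    quantization_config=bnb_config,
    device_map="auto",   # spreads layers across available GPUs / CPU
)
```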
Resources
Link: https://huggingface.co/Kwaipilot/KAT-Dev
Step-by-Step Process to Install & Run KAT-Dev Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image (Use the Jupyter Template)
We’ll use the Jupyter image from NodeShift’s gallery so you don’t have to install Jupyter Notebook/Lab manually. This image is GPU-ready and comes with a preconfigured Python + Jupyter environment—perfect for testing and serving KAT-Dev.
What you’ll do
- pick the Jupyter template,
- (optionally) pick a CUDA/PyTorch variant if the UI offers it,
- open JupyterLab in your browser,
- install the few project-specific Python packages inside that environment.
How to select it
- In the Create VM flow, go to Choose an Image → Templates.
- Click Jupyter (see screenshot). You’ll see a short description like “A web-based interactive computing platform for data science.”
- If a version/stack dropdown appears, choose the latest CUDA 12.x / PyTorch variant (or “GPU-enabled” build).
- Click Create (or Next) to proceed to sizing and networking.
Why this image
- JupyterLab is already installed and enabled as a service, so the VM boots straight into a working notebook server.
- GPU drivers + CUDA runtime are aligned with the template, so PyTorch will detect your GPU out of the box.
- You can manage everything (terminals, notebooks, file browser) from the Jupyter UI—no extra desktop or VNC needed.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Access Your Deployment
Once your GPU VM is in the RUNNING state, you’ll see a control menu (three dots on the right side of the deployment card). This menu gives you multiple ways to access and manage your deployment.
Available Options
- Edit Name
Rename your deployment for easier identification (e.g., “KAT-Dev”).
- Open Jupyter Notebook
- Click this to launch the pre-installed Jupyter environment directly in your browser.
- You’ll be taken to JupyterLab, where you can open notebooks, create terminals, and run code cells to set up KAT-Dev.
- This is the most user-friendly way to start working immediately without additional setup.
- Connect with SSH
- Choose this if you prefer command-line access.
- You’ll get the SSH connection string (e.g., ssh -i <your-key> user@<vm-ip>).
- Use this method for advanced management, server setups (like vLLM/SGLang), or installing additional system packages.
- Show Logs
- View system/service logs for debugging (useful if something isn’t starting correctly).
- Helps verify GPU initialization or catch errors during startup.
- Update Tags
- Add labels or tags to organize multiple deployments.
- Example: tag by project, model type, or experiment.
- Destroy Unit
- This permanently shuts down and deletes your VM.
- Use only when you are done, as this action cannot be undone.
Recommended Path for KAT-Dev
- For beginners / testing: Use Open Jupyter Notebook → open a Terminal inside JupyterLab → install the required Python packages → run a quick generation test.
- For production / serving APIs: Use Connect with SSH → start vLLM or SGLang on the VM → expose ports (8000/30000) → connect via API clients (see the sketch below).
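For the production path, here is a sketch of serving KAT-Dev through vLLM's OpenAI-compatible server. It assumes vLLM is installed (pip install vllm); on multi-GPU boxes, add --tensor-parallel-size 2 (or 4) to shard the model.

```bash
# Serve KAT-Dev on port 8000 via the OpenAI-compatible API (assumes `pip install vllm`)
vllm serve Kwaipilot/KAT-Dev --port 8000

# From another terminal, send a test chat completion:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Kwaipilot/KAT-Dev", "messages": [{"role": "user", "content": "Write a Python function that reverses a linked list."}]}'
```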
Step 8: Open Jupyter Notebook
Once your VM is running, you can directly access the Jupyter Notebook environment provided by NodeShift. This will be your main workspace for running KAT-Dev.
1. Click Open Jupyter Notebook
- From the My GPU Deployments panel, click the three-dot menu on your deployment card.
- Select Open Jupyter Notebook.
This will open a new browser tab pointing to your VM’s Jupyter instance.
2. Handle the Browser Security Warning
Since the Jupyter server is running with a self-signed SSL certificate, your browser may show a “Your connection is not private” warning.
- Click Advanced.
- Then, click Proceed to <your-vm-ip> (unsafe).
Don’t worry — this is expected. You’re connecting directly to your VM’s Jupyter server, not a public website.
3. JupyterLab Interface Opens
Once you proceed, you’ll land inside JupyterLab. Here you’ll see:
- Notebook options (Python 3, Python 3.10, etc.)
- Console options (interactive shells)
- Other tools like a Terminal, Text File, and Markdown File.
You can now use the Terminal inside JupyterLab to install dependencies and start working with KAT-Dev.
Step 9: Open Python 3.10 Notebook and Rename
Now that JupyterLab is running, let’s create a notebook where we will set up and run KAT-Dev.
1. Open a Python 3.10 Notebook
- In the Launcher screen, under Notebook, click on Python3.10 (python_310).
- This will open a new notebook editor with an empty code cell where you can type commands.
2. Rename the Notebook
- By default, the notebook will open as something like Untitled.ipynb.
- To rename:
- Right-click on the notebook tab name at the top.
- Select Rename Notebook….
- Enter a meaningful name such as KAT-Dev.ipynb and press Enter to confirm.
3. Verify the Editor
- You should now see an empty notebook named KAT-Dev.ipynb with a code cell ready.
- This is where you’ll run all the setup commands (installing dependencies, loading the model, and generating a first test response).
Step 10: Verify GPU Availability
Before installing and running KAT-Dev-32B, it’s important to confirm that your VM has successfully attached the GPU and that CUDA is working.
1. Run nvidia-smi
In your Jupyter Notebook cell, type:
!nvidia-smi
2. Check the Output
You should see information about your GPU, similar to the screenshot:
- GPU Name → NVIDIA H100 80GB HBM3
- Driver Version → 560.xx or similar
- CUDA Version → 12.x (here it shows 12.6)
- Memory Usage → confirms available VRAM (e.g., ~81 GB)
- Temperature / Power → current GPU status
3. Why This Step Matters
- Confirms that the GPU drivers are properly installed.
- Ensures the CUDA runtime matches your environment.
- Prevents wasted time later if the model fails to load due to GPU issues.
With GPU verified, you’re ready to proceed to the next step: installing the required Python libraries (Transformers, vLLM, SGLang, etc.) inside the notebook.
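If the Jupyter template ships with PyTorch preinstalled (NodeShift's GPU images typically do), you can also cross-check CUDA visibility from Python; otherwise, come back to this cell after installing the libraries in Step 11.

```python
# Optional cross-check that PyTorch sees the GPU
import torch

print("CUDA available:", torch.cuda.is_available())   # expect: True
print("Device:", torch.cuda.get_device_name(0))       # e.g., NVIDIA H100 80GB HBM3
print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))
```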
Step 11: Install Required Libraries (Torch + Transformers for KAT‑Dev‑32B)
Open a Terminal in JupyterLab
Go to: Launcher → Terminal
(or run the equivalent command in a Notebook cell, as shown below)
Paste the following into a notebook cell to install all the necessary libraries:

```python
# Install into the same Python environment that the notebook kernel uses
import sys
!{sys.executable} -m pip install torch transformers accelerate einops
```

If you are working in a plain terminal instead, the equivalent is pip install torch transformers accelerate einops.
This will install:
- torch – Core PyTorch library for GPU inference
- transformers – Hugging Face Transformers (used to load KAT-Dev-32B and apply the chat template)
- accelerate – Helps manage model offloading and device mapping on multi-GPU setups
- einops – Efficient tensor operations and rearrangement utilities used inside large models
These packages are essential to load and run Kwaipilot/KAT‑Dev‑32B inside a Jupyter Notebook.
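As a quick sanity check that everything landed in the kernel's environment, you can print the installed versions:

```python
# Confirm the installs are visible to this notebook kernel
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```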
Step 12: Download Model in Notebook
In your .ipynb file:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Kwaipilot/KAT-Dev"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use bf16 for H100
    device_map="auto",
)
```
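The bf16 weights alone are roughly 65 GB, so the first download and load will take a while. Once loading finishes, a quick check (assuming the device_map="auto" load above) shows where the layers landed and how much memory the weights occupy:

```python
# Inspect device placement and approximate weight memory
print(model.hf_device_map)                                  # layer -> device mapping
print(f"{model.get_memory_footprint() / 1024**3:.1f} GB")   # weights resident in memory
```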
Step 13: Prepare Chat Input
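KAT-Dev uses a Qwen-style chat template, so prompts are built with tokenizer.apply_chat_template rather than raw strings. A minimal sketch, reusing the tokenizer and model from Step 12 (the system prompt here is just an illustrative placeholder):

```python
# Build a chat-formatted prompt using the model's built-in template
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},  # illustrative placeholder
    {"role": "user", "content": "Explain what a Python decorator is."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn marker so the model replies
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
```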
Step 14: Generate Response
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Kwaipilot/KAT-Dev"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536,  # generous upper bound; lower it (e.g., 1024) for faster interactive runs
)

# keep only the newly generated tokens, then decode
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```
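Waiting for the full completion before seeing any output can feel slow in a notebook. One option, sketched here with Transformers' built-in TextStreamer, is to print tokens as they are generated:

```python
# Stream tokens to stdout as they are generated
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=1024, streamer=streamer)
```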
Conclusion
KAT-Dev-32B shows how far open-source coding models have come—combining scale, thoughtful training, and real-world evaluation on benchmarks like SWE-Bench. With NodeShift’s GPU-powered VMs, setting it up is straightforward whether you’re experimenting in Jupyter or serving it with vLLM. If you’re looking for a strong, open alternative for software engineering tasks—repo-level reasoning, bug fixing, or tool-augmented workflows—KAT-Dev is well worth trying out.