KAT-Dev-72B-Exp stands as Kwaipilot’s most ambitious open-source model to date — a massive 72-billion-parameter large language model purpose-built for software engineering, debugging, and automated code reasoning. It represents the experimental reinforcement-learning (RL) variant of the proprietary KAT-Coder model, opening a rare window into the techniques and design philosophies that power some of the world’s strongest coding assistants.
At its core, KAT-Dev-72B-Exp pushes the boundaries of reinforcement learning for code generation. The Kwaipilot team rewrote key attention kernels and re-engineered the training engine to support shared prefix trajectories, enabling faster and more stable RL training — particularly on scaffolded tasks that demand precise context management. To prevent the common issue of exploration collapse in RL training, they also introduced a novel advantage redistribution technique, which dynamically balances exploration by amplifying high-variance trajectories and soft-penalizing low-exploration ones.
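Kwaipilot has not published the exact formulation of this technique, but the core idea can be illustrated with a short sketch: within a group of sampled trajectories, scale each trajectory's advantage by a measure of how much exploration it contributed. The function below is an illustrative assumption (the names, the entropy signal, and the tanh weighting are ours, not the team's implementation):

import numpy as np

def redistribute_advantages(advantages, entropies, alpha=0.5):
    """Sketch of exploration-aware advantage redistribution.
    Boosts advantages of high-exploration trajectories and softly
    down-weights low-exploration ones, keeping the group mean unchanged."""
    advantages = np.asarray(advantages, dtype=float)
    entropies = np.asarray(entropies, dtype=float)

    # Normalize per-trajectory exploration (e.g., mean token entropy) within the group
    z = (entropies - entropies.mean()) / (entropies.std() + 1e-8)

    # Weight > 1 for exploratory trajectories, < 1 for low-exploration ones
    weights = 1.0 + alpha * np.tanh(z)

    reweighted = advantages * weights
    # Re-center so redistribution does not shift the group's average advantage
    return reweighted - reweighted.mean() + advantages.mean()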
These innovations translate directly into real-world performance. On SWE-Bench Verified, a demanding benchmark that evaluates a model's ability to understand, reason about, and patch real GitHub issues, KAT-Dev-72B-Exp resolves an impressive 74.6% of issues when tested strictly under the SWE-agent scaffold. This places it among the most capable open-source developer models currently available.
The model is distributed under the Apache-2.0 license, ensuring that researchers, developers, and organizations can freely explore, adapt, and integrate its architecture into their own projects. The accompanying Transformers quickstart snippet makes it easy to run on both local and cloud environments, while the evaluation parameters — temperature 0.6, max_turns 150, and history_processors.n 100 — ensure reproducibility across experiments.
In short, KAT-Dev-72B-Exp is not just another large language model. It is a deep dive into how reinforcement learning can be scaled safely and efficiently for large-context, multi-turn software engineering workflows — bridging the gap between academic research and production-grade coding intelligence.
Performance of Open-Source Models on SWE-Bench Verified
Model Name | Model Size (Billions of Parameters) | % Resolved (SWE-Bench Verified) |
---|---|---
KAT-Dev-72B-Exp | 72B | 74.6% |
KAT-Dev-32B | 32B | 61% |
Kimi-Dev | 32B | 60% |
Devstral-Small-2507 | 25B | 52% |
GLM-4.6 | 460B | 70% |
Qwen3-Coder | 500B | 71% |
DeepSeek-V3.1 | 600B | 68% |
DeepSeek-R1-0528 | 650B | 58% |
Kimi-K2 | 1000B | 66% |
GPU Configuration (Inference)
Use case | Precision / format | Framework | Minimum that’s practical | Recommended for smooth throughput | Notes |
---|---|---|---|---|---
Local experimentation (CPU offload OK) | EXL3 3.0–3.5 bpw or GGUF Q4–Q5 | llama.cpp / exllamaV3 | 1× 48 GB (RTX 6000 Ada / A6000) | 2× 48 GB or 1× 80 GB (H100/A100/H200 class) | 72B at ~3.0 bpw typically needs ~24–30 GB for weights + KV/cache headroom; Q4 GGUF often lands ~40–44 GB for 70–72B. EXL3 2.5 can squeeze to ~24 GB but with quality loss. Community EXL3/GGUF builds exist. (Hugging Face) |
Power user single-node | INT8 / NF4 / FP8 (where supported) | vLLM / SGLang | 2× 80 GB (tensor-parallel=2) | 4× 80 GB (TP=4) | Gives headroom for KV cache and high tokens/s. If using FP8 checkpoints, ensure kernel support in your stack. The collection lists FP8 variants. (Hugging Face) |
Production API (high QPS) | BF16/FP16 | vLLM / TGI / SGLang | 3× 80 GB (TP=3, careful with ctx) | 4–8× 80 GB (TP=4–8) | Full-precision 72B weights are ~140–160 GB; multiple 80 GB GPUs leave ample KV/cache for longer contexts and batching. |
Budget multi-GPU (PCIe) | EXL3 3.0–3.5 bpw | exllamaV3 | 2× 24 GB (4090/RTX 6000 Ada split) | 3–4× 24 GB | Quantized weights sharded across consumer GPUs; watch PCIe bandwidth and set --rope-scaling /KV cache carefully for long prompts. |
Long-context evals (SWE-agent-style) | Match card: temp 0.6, max_turns 150, history n=100 | Your agent runner | ≥ 160–200 GB VRAM total | ≥ 320 GB VRAM total | Longer histories blow up KV/cache. Prefer 4×80 GB with paged KV (vLLM) and tensor/pp parallelism. (Hugging Face) |
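For a concrete instance of the single-node rows above, the model can be served with vLLM's offline Python API and tensor parallelism across multiple 80 GB GPUs. The settings below (tensor_parallel_size, gpu_memory_utilization) are assumptions to tune for your hardware, not vendor-recommended values:

from vllm import LLM, SamplingParams

# Shard the 72B weights across 4 GPUs; adjust tensor_parallel_size to your node
llm = LLM(
    model="Kwaipilot/KAT-Dev-72B-Exp",
    tensor_parallel_size=4,
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)

# Sampling settings follow the model card's evaluation temperature of 0.6
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Write a Python function that reverses a linked list."], params)
print(outputs[0].outputs[0].text)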
Resources
Link: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp
Step-by-Step Process to Install & Run KAT-Dev-72B-Exp Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side and select the GPU Nodes option. In the Dashboard, click the Create GPU Node button to create and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running KAT-Dev-72B-Exp, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like KAT-Dev-72B-Exp.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like KAT-Dev-72B-Exp.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the KAT-Dev-72B-Exp runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we upgrade it)
Run the following command to check the currently available Python version:
python3 --version
The system has Python 3.10.12 available by default. To install a newer version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
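Before downloading the 72B weights, it's worth a quick sanity check that this PyTorch build can actually see the GPU (the exact versions printed will depend on your image and driver):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
If the last value is False, revisit the CUDA image and driver setup before proceeding.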
Step 11: Install the Utilities
Run the following command to install utilities:
pip install "transformers>=4.44" "accelerate>=0.34" bitsandbytes einops sentencepiece
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running model script with the KAT-Dev-72B-Exp model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., inference_bf16.py) and add the following code:
import torch, os
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Kwaipilot/KAT-Dev-72B-Exp"
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
msgs = [{"role":"user","content":"Give me a short introduction to large language model."}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inp = tok([text], return_tensors="pt").to(model.device)
out = model.generate(**inp, max_new_tokens=512, do_sample=True, temperature=0.6)  # enable sampling so the temperature setting takes effect
print(tok.decode(out[0][inp.input_ids.shape[1]:], skip_special_tokens=True))
Step 14: Run the Script
Run the script with the following command:
python inference_bf16.py
This will load the model and print the generated response in the terminal.
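If you are running on a single GPU with limited VRAM, a 4-bit quantized load via bitsandbytes (installed in Step 11) is an alternative to the BF16 script above. This is a minimal sketch under the assumption that NF4 quantization is acceptable for your use case; expect some quality and throughput trade-offs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Kwaipilot/KAT-Dev-72B-Exp"
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# NF4 4-bit quantization with bfloat16 compute; weight memory drops to roughly a quarter of BF16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)

The chat templating and generate call from the script above work unchanged with this model object.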
Step 15: Install Dependencies
Run the following command to install the dependencies for the Streamlit app:
pip install --upgrade streamlit transformers accelerate bitsandbytes einops sentencepiece
Step 16: Create the Streamlit App Script
Create a file (e.g., app.py) and add the following code:
import os
import threading
import time
from typing import List, Dict, Optional
import streamlit as st
import torch
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TextIteratorStreamer,
BitsAndBytesConfig,
)
MODEL_NAME = os.environ.get("KAT_MODEL", "Kwaipilot/KAT-Dev-72B-Exp")
USE_4BIT = os.environ.get("USE_4BIT", "false").lower() in {"1", "true", "yes"}
st.set_page_config(page_title="KAT-Dev-72B-Exp Chat", page_icon="🤖", layout="wide")
@st.cache_resource(show_spinner=True)
def load_model_and_tokenizer(model_name: str, use_4bit: bool):
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if use_4bit:
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb,
device_map="auto",
trust_remote_code=True,
)
else:
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
device_map="auto",
trust_remote_code=True,
)
# Some Qwen-family tokenizers don’t ship a pad token. Align pad with eos for generation UI.
if tok.pad_token_id is None and tok.eos_token_id is not None:
tok.pad_token = tok.eos_token
return tok, model
tok, model = load_model_and_tokenizer(MODEL_NAME, USE_4BIT)
def apply_chat(messages: List[Dict[str, str]]) -> str:
"""
Uses the model's chat template so roles are respected:
[{"role":"system"|"user"|"assistant", "content":"..."}]
"""
return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
def generate_stream(
prompt_text: str,
max_new_tokens: int,
temperature: float,
top_p: float,
repetition_penalty: float,
stop_strs: Optional[List[str]] = None,
):
inputs = tok([prompt_text], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
gen_kwargs = dict(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
repetition_penalty=repetition_penalty,
do_sample=True if temperature > 0 else False,
streamer=streamer,
pad_token_id=tok.pad_token_id,
eos_token_id=tok.eos_token_id,
)
# Background generation thread
thread = threading.Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()
partial = ""
for token in streamer:
partial += token
# rudimentary stop strings handling
if stop_strs:
for s in stop_strs:
if s and s in partial:
yield partial.split(s)[0]
return
yield partial
def sidebar():
with st.sidebar:
st.markdown("## ⚙️ Generation Settings")
temperature = st.slider("Temperature", 0.0, 1.5, 0.6, 0.05)
top_p = st.slider("Top-p", 0.0, 1.0, 0.95, 0.01)
repetition_penalty = st.slider("Repetition Penalty", 1.0, 2.0, 1.1, 0.01)
max_new_tokens = st.slider("Max New Tokens", 16, 4096, 512, 16)
stop_txt = st.text_input("Stop sequences (comma-separated)", value="")
sys_prompt = st.text_area(
"System Prompt (optional)",
value="You are KAT-Dev-72B-Exp, an expert software engineering assistant.",
height=100,
)
st.divider()
st.caption("Model: " + MODEL_NAME)
st.caption(f"Device map: auto | 4-bit: {USE_4BIT}")
if torch.cuda.is_available():
mem_lines = []
for i in range(torch.cuda.device_count()):
mem = torch.cuda.mem_get_info(i)
free_gb = mem[0] / (1024**3)
total_gb = mem[1] / (1024**3)
mem_lines.append(f"GPU {i}: {free_gb:.1f} / {total_gb:.1f} GB free")
st.caption(" | ".join(mem_lines))
return temperature, top_p, repetition_penalty, max_new_tokens, stop_txt, sys_prompt
if "chat" not in st.session_state:
st.session_state.chat = []
if "last_response" not in st.session_state:
st.session_state.last_response = ""
st.title("🤖 KAT-Dev-72B-Exp — Web Chat")
temperature, top_p, repetition_penalty, max_new_tokens, stop_txt, sys_prompt = sidebar()
stop_strs = [s.strip() for s in stop_txt.split(",") if s.strip()]
# Chat history render
for m in st.session_state.chat:
with st.chat_message(m["role"]):
st.markdown(m["content"])
# User input
user_msg = st.chat_input("Type your message…")
# Handle a new turn
if user_msg:
st.session_state.chat.append({"role": "user", "content": user_msg})
with st.chat_message("user"):
st.markdown(user_msg)
# Build messages with optional system prompt
msgs = []
if sys_prompt.strip():
msgs.append({"role": "system", "content": sys_prompt.strip()})
msgs.extend(st.session_state.chat)
# Prepare prompt using the model's chat template
prompt_text = apply_chat(msgs)
# Stream the response
with st.chat_message("assistant"):
placeholder = st.empty()
acc = ""
for chunk in generate_stream(
prompt_text,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
repetition_penalty=repetition_penalty,
stop_strs=stop_strs,
):
acc = chunk
# minor throttle to keep UI smooth
placeholder.markdown(acc)
time.sleep(0.01)
st.session_state.last_response = acc
st.session_state.chat.append({"role": "assistant", "content": acc})
# Small footer
st.caption("Tip: set CUDA_VISIBLE_DEVICES for multi-GPU. Use USE_4BIT=1 for single-GPU 4-bit.")
Step 17: Launch the Streamlit UI
Run Streamlit:
streamlit run app.py --server.address 0.0.0.0 --server.port 7860
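The app reads the USE_4BIT environment variable (see the footer caption in app.py), so on a single GPU with limited VRAM you can launch it in 4-bit mode instead:
USE_4BIT=1 streamlit run app.py --server.address 0.0.0.0 --server.port 7860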
Step 18: Access the Streamlit App
Streamlit will report that it is listening on 0.0.0.0:7860. Open the app in your browser at:
http://<your-vm-ip>:7860/
Replace <your-vm-ip> with your VM's public IP, or forward the port over SSH (for example, ssh -L 7860:localhost:7860 <user>@<vm-ip>) and open http://localhost:7860/.
Play with the Model
Conclusion
KAT-Dev-72B-Exp isn’t just a model drop—it’s a glimpse into how large-scale RL can harden a coding assistant for real-world software engineering. With its shared-prefix training engine, exploration-aware advantage shaping, and strong SWE-Bench Verified score, it’s built for long-context debugging, refactoring, and multi-turn repair. The setup above gets you from VM to terminal to a clean Streamlit web UI, so you can test locally and then scale to vLLM/TGI for production. If you’re running on NodeShift, you’ll get the GPU headroom and security posture to benchmark confidently. Try the prompts, push it with your repos, and share your traces—your results will help shape the next wave of open developer models.