K2-Think is a 32B open-weights reasoning model focused on tough math/logic, code, and science tasks. It is trained for long chain-of-thought and integrates reinforcement learning with verifiable rewards and agentic planning. Despite its relatively modest size, it targets high efficiency: the team reports ~2,000 tok/s on Cerebras WSE with speculative decoding (vs. ~200 tok/s on typical H100/H200 setups), along with strong scores on AIME’24/’25, HMMT’25, OMNI-Math-HARD, GPQA-Diamond, and LiveCodeBench. The weights are released under Apache-2.0 and hosted on Hugging Face.
Evaluation & Performance
Detailed evaluation results are reported in our Tech Report.
Benchmarks (pass@1, averaged over 16 runs)

| Domain | Benchmark | K2-Think |
|---|---|---|
| Math | AIME 2024 | 90.83 |
| Math | AIME 2025 | 81.24 |
| Math | HMMT 2025 | 73.75 |
| Math | OMNI-Math-HARD | 60.73 |
| Code | LiveCodeBench v5 | 63.97 |
| Science | GPQA-Diamond | 71.08 |
Inference Speed
| Platform | Throughput (tokens/sec) | Example: 32k-token response (time) |
|---|---|---|
| Cerebras WSE (our deployment) | ~2,000 | ~16 s |
| Typical H100/H200 GPU setup | ~200 | ~160 s |
Safety Evaluation
Aggregated across four safety dimensions (Safety-4):
| Aspect | Macro-Avg |
|---|---|
| High-Risk Content Refusal | 0.83 |
| Conversational Robustness | 0.89 |
| Cybersecurity & Data Protection | 0.56 |
| Jailbreak Resistance | 0.72 |
| Safety-4 Macro (avg) | 0.75 |
GPU Configuration (What Actually Works)
| Scenario | Precision / Loader | Min setup that works | Recommended | Notes |
|---|---|---|---|---|
| Single-GPU, native precision | BF16/FP16 (Transformers/vLLM) | 1× 80 GB (A100/H100 80GB) | 1× 80 GB | 32B × 2 bytes ≈ 64 GB for weights; leave headroom for KV cache & activations. Best latency & simplicity. |
| Dual-GPU, tensor parallel | BF16/FP16, TP=2 | 2× 40 GB (e.g., A100 40GB) | 2× 48–80 GB | Split weights across 2 GPUs; enable tensor parallelism in vLLM or TGI. Good balance when 80 GB cards aren’t available. |
| Quad-GPU, prosumer | INT4/INT8 (AWQ/GPTQ) + TP=4 | 4× 24 GB (RTX 4090 / Ada 24GB) | 4× 24–48 GB | Quantization required. Expect some quality/latency trade-offs; keep context modest and batch size 1. |
| CPU-offload hybrid | INT4 + paged KV offload | 1× 24 GB + fast CPU/RAM | 1× 24–48 GB | Last resort; slower. Tune `max_new_tokens` and use attention/KV offload to fit. |
| Wafer-scale (Cerebras) | Native with speculative decoding | Managed service | Managed service | ~2,000 tok/s on WSE cited by the authors; ideal for very long chain-of-thought (e.g., 32k-token responses). (k2think-about.pages.dev) |
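For the quantized rows in the table above, one option is on-the-fly 4-bit loading with bitsandbytes rather than a prebuilt AWQ/GPTQ checkpoint. The following is a minimal sketch under that assumption (it is not the authors' reference setup, and exact VRAM needs will depend on context length and batch size):

```python
# Hypothetical 4-bit loading sketch for smaller GPUs (assumes: pip install bitsandbytes)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "LLM360/K2-Think"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 for stability
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available GPUs
)
```

If a prequantized AWQ/GPTQ checkpoint is available, it is usually the better choice: smaller download and no quantization pass at load time.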
Resources
Link: https://huggingface.co/LLM360/K2-Think
Step-by-Step Process to Install & Run K2-Think Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running K2-Think, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based models like K2-Think
- Compatibility with CUDA 12.1.1, which certain model operations require
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like K2-Think.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that K2-Think runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and make sure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check whether pip exists:
pip3 --version
If you get an error like `command not found`, install pip manually via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This downloads and installs pip on your system. You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv k2
source k2/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
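Optionally, before downloading any model weights, you can confirm that this PyTorch build actually sees the GPU. This quick check is not part of the original steps; it just uses standard PyTorch calls:

```python
# quick_gpu_check.py — optional sanity check that CUDA is visible to PyTorch
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))
```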
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install "transformers>=4.45" "accelerate>=0.34" sentencepiece "huggingface_hub>=0.24"
Step 12: Run a Tiny Check (Downloads ~67 GB of Weights)
Run the following snippet as a quick sanity check (the first run downloads ~67 GB of weights):
python - << 'PY'
from transformers import pipeline
model_id = "LLM360/K2-Think"
pipe = pipeline("text-generation", model=model_id, torch_dtype="auto", device_map="auto")
msgs = [{"role": "user", "content": "what is the next prime number after 2600?"}]
out = pipe(msgs, max_new_tokens=256) # keep small for first run
print(out[0]["generated_text"][-1])
PY
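If your root disk is small, you may want to point the Hugging Face cache at a larger volume before this download. A minimal sketch, assuming a bigger disk is mounted at `/data` (a hypothetical path; adjust it to your VM):

```python
# set_hf_cache.py — optional: redirect the Hugging Face cache before the big download
import os
os.environ["HF_HOME"] = "/data/hf-cache"  # must be set before importing transformers

from transformers import pipeline

pipe = pipeline("text-generation", model="LLM360/K2-Think",
                torch_dtype="auto", device_map="auto")
```

Alternatively, export `HF_HOME` in your shell before running any of the scripts below.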
Step 13: Connect to Your GPU VM with a Code Editor
Before you start running model scripts with the K2-Think model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 14: Create the Script
Create a file (ex: run_k2think.py) and add the following code:
# run_k2think.py
from transformers import pipeline


def main():
    model_id = "LLM360/K2-Think"
    pipe = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype="auto",
        device_map="auto",
    )

    # You can change this message or make it interactive later
    messages = [
        {"role": "user", "content": "what is the next prime number after 2600?"}
    ]

    outputs = pipe(messages, max_new_tokens=256)

    # print the model's full reply (last assistant message)
    print(outputs[0]["generated_text"][-1])


if __name__ == "__main__":
    main()
What the Script Does:
`from transformers import pipeline`
- Imports Hugging Face’s high-level pipeline helper, which wraps the tokenizer, model, and generation logic into one object.
`def main():`
- Defines the main function that will run your inference.
`model_id = "LLM360/K2-Think"`
- Sets the model repo on the Hugging Face Hub. This tells `pipeline` what to download/load.
`pipe = pipeline("text-generation", model=model_id, torch_dtype="auto", device_map="auto")`
- Creates a text-generation pipeline for chat/instruction generation.
- `model=model_id`: fetches the config, tokenizer, and weights for `LLM360/K2-Think`. On the first run it downloads ~67 GB of model shards to the local HF cache, then loads them into GPU memory.
- `torch_dtype="auto"`: lets Transformers choose an appropriate compute dtype (BF16/FP16/FP32) for your GPU. (Note: `torch_dtype` is deprecated in newer versions; `dtype="auto"` is the replacement. Your code still works—it just shows a warning.)
- `device_map="auto"`: automatically places the model on the available GPU(s) (or falls back to CPU). On multi-GPU nodes, it may split layers across devices.
`messages = [{"role": "user", "content": "what is the next prime number after 2600?"}]`
- Builds a chat-style input expected by Qwen-family chat templates (the pipeline applies the template under the hood).
`outputs = pipe(messages, max_new_tokens=256)`
- Runs generation:
  - Applies the model’s chat template (system/user/assistant formatting).
  - Tokenizes the input, runs forward passes, and samples tokens until stopping or hitting 256 new tokens.
  - Maintains a KV cache (memory of previous tokens) to speed up decoding.
  - Returns a Python object with the full conversation, including the newly generated assistant turn.
`print(outputs[0]["generated_text"][-1])`
- `outputs` is a list of results (one per input); `outputs[0]["generated_text"]` is the list of chat messages after generation; `[-1]` selects the last message—the assistant’s reply (often reasoning + final answer).
`if __name__ == "__main__": main()`
- Ensures `main()` runs only when the file is executed directly (not when imported).
Then, run the script with the following command:
python3 run_k2think.py
What the Command Does:
`python3 run_k2think.py`
- Invokes the Python 3 interpreter on your script (using your current venv if activated).
- Python executes the file:
  - Imports `pipeline`.
  - Enters `main()`.
  - Downloads the model files on the first run (shows “Loading checkpoint shards: 100% …”).
  - Loads the model onto the GPU(s) (`device_map="auto"`).
  - Generates up to 256 tokens answering your prompt.
  - Prints the assistant’s final message to stdout (your terminal).
  - Returns exit code 0 if successful.
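If you want more control than the defaults, the pipeline also accepts standard generation parameters. As a sketch, you could replace the `outputs = pipe(messages, max_new_tokens=256)` line in run_k2think.py with the version below (the values are illustrative, not official recommendations for K2-Think):

```python
# Illustrative sampling settings — assumes `pipe` and `messages` from run_k2think.py above
outputs = pipe(
    messages,
    max_new_tokens=1024,   # reasoning models often need room for long answers
    do_sample=True,        # enable sampling instead of greedy decoding
    temperature=0.6,       # illustrative value
    top_p=0.95,
)
print(outputs[0]["generated_text"][-1])
```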
Step 15: Create the Chat Script
Create a file (ex: chat_k2think.py) and add the following code:
# chat_k2think.py
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="LLM360/K2-Think",
    torch_dtype="auto",
    device_map="auto",
)

while True:
    user_input = input("User: ")
    if user_input.lower() in {"quit", "exit"}:
        break
    messages = [{"role": "user", "content": user_input}]
    out = pipe(messages, max_new_tokens=512)
    print("Assistant:", out[0]["generated_text"][-1])
What the Script Does:
`from transformers import pipeline`
- Imports Hugging Face’s high-level helper that bundles the tokenizer + model + generate into one object.
`pipe = pipeline(..., model="LLM360/K2-Think", torch_dtype="auto", device_map="auto")`
- Builds a text-generation pipeline for K2-Think.
- Downloads the model the first time (≈67 GB) and caches it; later runs load from the cache.
- `torch_dtype="auto"` lets Transformers choose a good compute dtype (bf16/fp16/fp32). (Note: newer Transformers prefers `dtype="auto"`; yours still works but shows a deprecation warning.)
- `device_map="auto"` places the model on the available GPU(s) automatically (or falls back to CPU).
`while True:` … `input("User: ")`
- Starts an infinite REPL loop that waits for your prompt on the terminal.
`if user_input.lower() in {"quit", "exit"}: break`
- Lets you end the chat by typing quit or exit.
`messages = [{"role": "user", "content": user_input}]`
- Wraps your text into the chat format expected by Qwen-style models.
- (Important: this version sends only the current turn—no history. A history-keeping variant is sketched after this list.)
`out = pipe(messages, max_new_tokens=512)`
- Runs generation (up to 512 new tokens). The pipeline applies the chat template, tokenizes, decodes, and returns the conversation with the new assistant turn appended.
`print("Assistant:", out[0]["generated_text"][-1])`
- Prints just the last message from the generated conversation—the assistant’s reply.
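As noted above, the loop sends only the current turn. A minimal sketch of a history-keeping variant (same pipeline, just accumulating the messages list across turns; keep in mind that long chain-of-thought replies grow the context quickly):

```python
# chat_k2think_history.py — sketch: keep conversation history across turns
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="LLM360/K2-Think",
    torch_dtype="auto",
    device_map="auto",
)

messages = []  # accumulated conversation (user + assistant turns)

while True:
    user_input = input("User: ")
    if user_input.lower() in {"quit", "exit"}:
        break
    messages.append({"role": "user", "content": user_input})
    out = pipe(messages, max_new_tokens=512)
    # the pipeline returns the whole conversation with the new assistant turn appended
    messages = out[0]["generated_text"]
    print("Assistant:", messages[-1]["content"])
```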
Then, run the chat with the following command:
python3 chat_k2think.py
What the Command Does:
- Your shell launches the Python 3 interpreter found on your `PATH`. If a virtualenv is active, it uses that interpreter and its installed packages.
- Python loads and executes the file `chat_k2think.py` as a script (`__main__`).
- Top of the script: `from transformers import pipeline` imports Hugging Face’s high-level generation helper.
- The script builds a text-generation pipeline:
  - `model="LLM360/K2-Think"` tells it which HF model to use.
  - `device_map="auto"` places the weights on your available GPU(s) (with CPU fallback).
  - `torch_dtype="auto"` picks a compute dtype suited to your hardware (may show a deprecation warning; `dtype="auto"` is the new name).
  - First run only: downloads ~67 GB of weights to your HF cache, then loads them into GPU memory.
- After the pipeline is ready, the script enters an infinite REPL loop:
  - Prints `User:` and waits for your input on stdin.
  - If you type `quit` or `exit` (any case), it breaks the loop and ends.
  - For any other input:
    - Wraps your text in a chat message (`{"role": "user", "content": ...}`).
    - Calls `pipe(..., max_new_tokens=512)` to generate a reply (up to 512 new tokens).
    - The pipeline applies the model’s chat template, tokenizes, runs decoding, and returns the conversation with the new assistant turn.
    - The script prints `Assistant: <model reply>`.
  - The loop repeats for the next prompt.
- Exit behavior:
  - Normal end → exit code 0.
  - `Ctrl+C` (SIGINT) or errors → non-zero exit code.
- Side effects / resources:
  - Uses GPU VRAM heavily (it’s a 32B model).
  - Writes model files to your Hugging Face cache (e.g., `~/.cache/huggingface`).
  - Uses network bandwidth only for the first download (later runs read from the cache).
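If you prefer to separate the ~67 GB download from the first inference run, you can pre-fetch the weights into that same cache with `huggingface_hub` (installed in Step 11). A minimal sketch:

```python
# prefetch_k2think.py — optional: download the weights ahead of time
from huggingface_hub import snapshot_download

# Downloads all model files into the local Hugging Face cache
# (~/.cache/huggingface by default) and prints the resulting path.
local_path = snapshot_download(repo_id="LLM360/K2-Think")
print("Model cached at:", local_path)
```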
Step 16: Install Streamlit
Run the following command to install streamlit:
pip install streamlit
Step 17: Create an app.py
Create a file (ex: app.py) and add the following code:
# app.py
import streamlit as st
from transformers import pipeline
st.set_page_config(page_title="K2-Think Chat", page_icon="🧠", layout="wide")
# ---- Sidebar controls ----
st.sidebar.title("⚙️ Settings")
model_id = st.sidebar.text_input("Model", value="LLM360/K2-Think")
dtype_opt = st.sidebar.selectbox("torch_dtype", ["auto", "bfloat16", "float16", "float32"], index=0)
max_new_tokens = st.sidebar.slider("Max new tokens", min_value=64, max_value=32768, value=512, step=64)
temperature = st.sidebar.slider("Temperature", min_value=0.0, max_value=2.0, value=0.2, step=0.05)
top_p = st.sidebar.slider("Top-p", min_value=0.05, max_value=1.0, value=0.9, step=0.05)
repetition_penalty = st.sidebar.slider("Repetition penalty", min_value=1.0, max_value=2.0, value=1.05, step=0.01)
st.sidebar.markdown("---")
system_prompt = st.sidebar.text_area("System prompt (optional)", value="", height=80)
st.sidebar.caption("Tip: Keep max tokens moderate if your GPU is <80GB.")
# ---- Session state ----
if "pipe" not in st.session_state:
st.session_state.pipe = None
if "history" not in st.session_state:
st.session_state.history = []
# ---- Lazy-load model (first request) ----
def get_pipe():
    if st.session_state.pipe is None:
        with st.spinner(f"Loading model: {model_id} … (first time can be slow)"):
            # Map dtype string to actual arg
            torch_dtype = dtype_opt if dtype_opt != "auto" else "auto"
            st.session_state.pipe = pipeline(
                "text-generation",
                model=model_id,
                torch_dtype=torch_dtype,
                device_map="auto",
            )
    return st.session_state.pipe
# ---- Header ----
st.title("🧠 K2-Think — Streamlit Chat")
st.caption("Qwen2.5-32B finetune for math/reasoning. This UI runs via 🤗 Transformers.")
# ---- Chat history display ----
for msg in st.session_state.history:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])
# ---- User input ----
user_input = st.chat_input("Type your question…")
if user_input:
    # Build conversational messages (optionally include a system prompt)
    messages = []
    if system_prompt.strip():
        messages.append({"role": "system", "content": system_prompt.strip()})
    for m in st.session_state.history:
        messages.append({"role": m["role"], "content": m["content"]})
    messages.append({"role": "user", "content": user_input})

    # Echo user
    st.session_state.history.append({"role": "user", "content": user_input})
    with st.chat_message("user"):
        st.markdown(user_input)

    # Generate
    with st.chat_message("assistant"):
        placeholder = st.empty()
        try:
            pipe = get_pipe()
            outputs = pipe(
                messages,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                repetition_penalty=repetition_penalty,
            )
            # last message is the assistant turn; take its text content
            reply = outputs[0]["generated_text"][-1]["content"]
        except Exception as e:
            reply = f"⚠️ Error: {e}\n\n• Try lowering max_new_tokens\n• Close other GPU apps\n• Use quantization or multi-GPU if VRAM is tight."
        placeholder.markdown(reply)
        st.session_state.history.append({"role": "assistant", "content": reply})
# ---- Utilities ----
col1, col2, col3 = st.columns(3)
with col1:
    if st.button("🧹 Clear chat"):
        st.session_state.history = []
        st.rerun()  # st.experimental_rerun() is removed in newer Streamlit releases
with col2:
    if st.button("♻️ Reload model"):
        st.session_state.pipe = None
        st.rerun()
with col3:
    st.download_button(
        "⬇️ Export chat (Markdown)",
        data="\n\n".join([f"**{m['role'].title()}**: {m['content']}" for m in st.session_state.history]),
        file_name="k2think_chat.md",
        mime="text/markdown",
    )
Step 18: Launch Streamlit
Run the following command to launch streamlit:
streamlit run app.py
Step 19: Access the Web UI in Your Browser
Once Streamlit is running, it will display three links:
- Local URL → `http://localhost:8501` (works if you’re running on your own machine).
- Network URL → `http://<internal-ip>:8501` (for internal access inside your VM network).
- External URL → `http://<your-vm-public-ip>:8501` (use this one to open the app from your laptop/PC browser).
Open the External URL in your browser.
Example:
http://38.29.145.10:8501
Step 20: What You Can Do on the Page
- Center panel (chat):
  - A message box that says “Type your question…”.
  - Type a prompt (e.g., “what is the next prime number after 1800?”) and press Enter.
  - The model’s reply appears as a chat bubble (the app can hide `<think>...</think>` reasoning if you add a cleaner like the one sketched after this list).
- Sidebar (left):
  - Model (defaults to `LLM360/K2-Think`)
  - dtype / torch_dtype, Max new tokens, Temperature, Top-p, Repetition penalty
  - System prompt (optional) to steer behavior
- Buttons:
  - Clear chat – wipes the history
  - Reload model – re-initializes the pipeline (useful after changing dtype/model)
  - Export chat (Markdown) – saves the conversation
- Why Streamlit / a UI vs. the terminal:
  - No terminal clutter: you read answers like a chat, not raw logs.
  - Controls at your fingertips: sliders for tokens/temperature, a system prompt box, reset/export buttons.
  - Shareable demo: easy for teammates/non-CLI users; you can run it behind a domain/reverse proxy.
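The “cleaner” mentioned above is not part of app.py as written. A minimal sketch, assuming K2-Think wraps its reasoning in `<think>...</think>` tags, is a small post-processing helper you call on `reply` before displaying and storing it:

```python
# think_cleaner.py — hypothetical helper to hide chain-of-thought before display
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace from a reply."""
    return THINK_RE.sub("", text).strip()

# Usage inside app.py, right after extracting the assistant text:
#   reply = strip_think(reply)
```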
Step 21: Install vLLM
Run the following command to install vLLM:
pip install "vllm>=0.10.1"
Step 22: Start the vLLM Server and Confirm It’s Up
- Run (safe defaults for 1× H100-80GB):
vllm serve LLM360/K2-Think \
--dtype bfloat16 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--max-num-seqs 2
Success criteria (what you should see):
- `Resolved architecture: Qwen2ForCausalLM`
- Routes listed (e.g., `/v1/chat/completions`, `/models`, `/metrics`)
- `Started server process [PID]`
- `Application startup complete.`
- (A “torch_dtype is deprecated! Use dtype instead!” line is normal.)
Port: vLLM listens on `0.0.0.0:8000` by default.
Step 23: Health checks (in another terminal)
# models list
curl http://localhost:8000/v1/models
# quick chat call
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-local" \
-d '{
"model":"LLM360/K2-Think",
"messages":[{"role":"user","content":"What is the next prime after 2600?"}],
"max_tokens":256
}'
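Because vLLM exposes an OpenAI-compatible API, you can also query it from Python. A minimal sketch using the official `openai` client (install it with `pip install openai`; the `sk-local` key is a placeholder, since the server above was started without authentication):

```python
# query_vllm.py — sketch: call the local vLLM server via its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the vLLM server started in Step 22
    api_key="sk-local",                   # placeholder; no auth configured on the server
)

response = client.chat.completions.create(
    model="LLM360/K2-Think",
    messages=[{"role": "user", "content": "What is the next prime after 2600?"}],
    max_tokens=256,
)

print(response.choices[0].message.content)
```

The same client works from your Streamlit app if you later point it at the vLLM endpoint instead of loading the model in-process.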
Conclusion
K2-Think proves you don’t need a giant cluster to get serious reasoning performance—just the right setup. In this guide we picked a GPU VM (H100/A100 recommended), installed PyTorch and dependencies, verified the weights with simple Transformers scripts, and then upgraded the experience with a Streamlit web UI for easy, shareable chats. Finally, we productionized inference using vLLM, giving us faster decoding, efficient memory use, and an OpenAI-compatible API with simple health checks.
From here, you can hook your Streamlit app to the vLLM endpoint for streaming responses, add auth and HTTPS behind a reverse proxy, and, if needed, enable speculative decoding for extra speed. Whether you stick with the local pipeline for quick experiments or vLLM for serving, you now have a clean path from zero to a reliable K2-Think deployment.