Arch-Router-1.5B is a compact, preference-aligned routing model from Katanemo. It reads a conversation plus a user-defined set of "routes" (domain/action pairs) and outputs the single best route as JSON (e.g., {"route": "bug_fixing"}). The design emphasizes transparent, controllable routing for multi-model stacks, letting you encode preferences per domain/action and swap target models without retraining the router. It's small, fast, and production-oriented, making it a great fit for low-latency gateways, agents, and API proxies.
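To make this concrete, here is an illustrative route set and the kind of reply the router produces (route names are examples mirroring the quickstart later in this guide, not a fixed schema):

routes = [
    {"name": "bug_fixing", "description": "Find and fix errors in provided code"},
    {"name": "code_generation", "description": "Generate code from requirements"},
]
# Given a conversation whose latest turn is an error report, the model emits:
# {"route": "bug_fixing"}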
GPU Configuration (Practical Estimates)
| Setup | Precision / Quant | Min GPU VRAM (approx.) | When to use | Notes |
|---|---|---|---|---|
| CPU only | FP16/FP32 (auto-cast) | — | Dev/test, CI | Works but slower; set torch.set_num_threads sensibly. |
| Single GPU (PyTorch) | FP16/BF16 | 4–5 GB | Standard deployment | ~3 GB params + runtime headroom; fastest to integrate. |
| Single GPU (bitsandbytes) | INT8 | 2–3 GB | Memory-lean servers | Slight quality/latency tradeoff vs FP16; easy drop-in with load_in_8bit=True. |
| Single GPU (bitsandbytes) | INT4 | 1–1.5 GB | Edge/smaller GPUs (e.g., 4–6 GB cards) | Largest memory savings; minor accuracy loss; load_in_4bit=True. |
| vLLM (FP16/BF16) | FP16/BF16 | 5–6 GB | High-throughput routing API | Extra VRAM for paged KV cache & scheduler; shines with concurrency. |
| Multi-GPU | FP16 | N/A | Not needed | Model is small; keep it on one GPU for simplicity. |
Tips
- For most servers, FP16 on a 6–8 GB GPU is the sweet spot (headroom for longer context or more concurrency).
- If you’re packaging in an agent gateway, consider vLLM to batch many tiny routing calls.
- Quantized (INT8/INT4) loads are ideal for 4 GB GPUs or mixed CPU–GPU environments; verify outputs on your route set (a minimal load sketch follows below).
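Here is a minimal sketch of that quantized load path, assuming bitsandbytes is installed (pip install bitsandbytes); swap load_in_4bit for load_in_8bit to match the INT8 row:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# INT4 load: largest memory savings per the table above
quant_cfg = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "katanemo/Arch-Router-1.5B",
    device_map="auto",
    quantization_config=quant_cfg,
)
tokenizer = AutoTokenizer.from_pretrained("katanemo/Arch-Router-1.5B")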
Resources
Link: https://huggingface.co/katanemo/Arch-Router-1.5B
Step-by-Step Process to Install & Run Katanemo Arch-Router-1.5B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Katanemo Arch-Router-1.5B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Katanemo Arch-Router-1.5B.
- Compatibility with CUDA 12.1.1, which certain model operations require.
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Katanemo Arch-Router-1.5B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that Katanemo Arch-Router-1.5B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10; we upgrade it)
First, run the following command to check the Python version available on the VM:
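python3 --version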
The system has Python 3.10.12 available by default. To install a newer version, you'll need to use the deadsnakes PPA. Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
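Optionally, verify that the CUDA build landed correctly before moving on (a quick sanity check, not part of the original steps):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

On the GPU VM this should print a +cu121 build and True.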
Step 11: Install the Utilities
Run the following command to install utilities:
pip install "transformers>=4.37.0" accelerate sentencepiece
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Katanemo Arch-Router-1.5B model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., quickstart.py) and add the following code:
import json
from typing import Any, Dict, List
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "katanemo/Arch-Router-1.5B"
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
TASK_INSTRUCTION = """
You are a helpful assistant designed to find the best suited route.
You are provided with route description within <routes></routes> XML tags:
<routes>
{routes}
</routes>
<conversation>
{conversation}
</conversation>
"""
FORMAT_PROMPT = """
Your task is to decide which route is best suit with user intent on the conversation in <conversation></conversation> XML tags.
1. If the latest intent is irrelevant or fulfilled, respond: {"route": "other"}.
2. Analyze the route descriptions and find the best match.
3. Respond only with JSON: {"route": "route_name"} using an exact route name.
"""
def format_prompt(route_config: List[Dict[str, Any]], conversation: List[Dict[str, Any]]):
return TASK_INSTRUCTION.format(
routes=json.dumps(route_config, ensure_ascii=False),
conversation=json.dumps(conversation, ensure_ascii=False),
) + FORMAT_PROMPT
route_config = [
{"name": "code_generation", "description": "Generate code from requirements"},
{"name": "bug_fixing", "description": "Find and fix errors in provided code"},
{"name": "performance_optimization", "description": "Make code faster/cleaner"},
{"name": "api_help", "description": "Use/understand external APIs & SDKs"},
{"name": "programming", "description": "General programming Q&A/best practices"},
]
conversation = [
{"role": "user", "content": "fix this: 'torch.utils._pytree' has no attribute 'register_pytree_node'."}
]
route_prompt = format_prompt(route_config, conversation)
messages = [{"role": "user", "content": route_prompt}]
input_ids = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# Generate the route decision, then decode only the newly generated tokens
gen = model.generate(input_ids=input_ids, max_new_tokens=256)
prompt_len = input_ids.shape[1]
out = gen[0][prompt_len:]
text = tokenizer.decode(out, skip_special_tokens=True)
print("\nMODEL RESPONSE:\n", text)
Step 14: Run the Script
Run the script with the following command:
python quickstart.py
This will load the model and print its routing decision in the terminal.
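For the sample conversation in the script, the output should end with something like this (illustrative; exact text can vary between runs):

MODEL RESPONSE:
 {"route": "bug_fixing"}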
Step 15: Install Dependencies
Run the following command to install dependencies:
pip install streamlit pydantic "transformers>=4.37.0" accelerate sentencepiece
Step 16: Create the Script
Create a file (e.g., arch_router_ui.py) and add the following code:
import json, torch, time
import streamlit as st
from typing import List, Dict, Any
from pydantic import BaseModel, ValidationError
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "katanemo/Arch-Router-1.5B"
@st.cache_resource(show_spinner=True)
def load_model():
tok = AutoTokenizer.from_pretrained(MODEL_ID)
    mdl = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
    ).eval()
# Make sure pad token is set; create attention_mask explicitly later
if tok.pad_token is None:
tok.pad_token = tok.eos_token
return tok, mdl
# ---------- UI ----------
# st.set_page_config must be the first Streamlit command, so call it before load_model()
st.set_page_config(page_title="Arch-Router-1.5B WebUI", page_icon="🧭", layout="wide")
st.title("🧭 Arch-Router-1.5B — Browser UI")
tokenizer, model = load_model()
with st.sidebar:
st.header("⚙️ Settings")
max_new_tokens = st.slider("max_new_tokens", 32, 1024, 256, 32)
temperature = st.slider("temperature (sampling)", 0.0, 1.0, 0.2, 0.05)
top_p = st.slider("top_p", 0.1, 1.0, 1.0, 0.05)
st.caption("Tip: low temperature is good here to keep JSON consistent.")
st.subheader("🧩 Route Config")
default_routes = [
{"name":"code_generation","description":"Generate code from requirements"},
{"name":"bug_fixing","description":"Find and fix errors in provided code"},
{"name":"performance_optimization","description":"Make code faster/cleaner"},
{"name":"api_help","description":"Use/understand external APIs & SDKs"},
{"name":"programming","description":"General programming Q&A/best practices"}
]
routes_json = st.text_area(
"Edit routes as JSON array", value=json.dumps(default_routes, indent=2), height=220
)
st.subheader("💬 Conversation")
st.caption("Add turns; last user turn is routed. Keep system prompt minimal—model expects the XML prompt wrapper.")
# Conversation builder
if "conversation" not in st.session_state:
st.session_state.conversation = [
{"role": "user", "content": "fix this module 'torch.utils._pytree' has no attribute 'register_pytree_node'."}
]
colA, colB = st.columns([3,1])
with colA:
new_role = st.selectbox("Role", ["user","assistant"], index=0, key="role_sel")
new_content = st.text_area("Content", height=120, key="content_ta")
with colB:
if st.button("➕ Add turn", use_container_width=True):
if new_content.strip():
st.session_state.conversation.append({"role": new_role, "content": new_content.strip()})
st.success("Added.")
else:
st.warning("Write something first.")
# Show current conversation
st.write("**Current conversation (JSON):**")
st.code(json.dumps(st.session_state.conversation, indent=2, ensure_ascii=False), language="json")
# Arch-Router prompt templates (from model card guidance)
TASK_INSTRUCTION = """
You are a helpful assistant designed to find the best suited route.
You are provided with route description within <routes></routes> XML tags:
<routes>
{routes}
</routes>
<conversation>
{conversation}
</conversation>
"""
FORMAT_PROMPT = """
Your task is to decide which route is best suit with user intent on the conversation in <conversation></conversation> XML tags. Follow the instruction:
1. If the latest intent from user is irrelevant or user intent is full filled, response with other route {"route": "other"}.
2. You must analyze the route descriptions and find the best match route for user latest intent.
3. You only response the name of the route that best matches the user's request, use the exact name in the <routes></routes>.
Based on your analysis, provide your response in the following JSON formats if you decide to match any route:
{"route": "route_name"}
"""
def format_prompt(route_config: List[Dict[str, Any]], conversation: List[Dict[str, Any]]):
return (
TASK_INSTRUCTION.format(
routes=json.dumps(route_config, ensure_ascii=False),
conversation=json.dumps(conversation, ensure_ascii=False),
) + FORMAT_PROMPT
)
# Run button
run = st.button("🧭 Route it")
if run:
# Parse routes
try:
route_cfg = json.loads(routes_json)
assert isinstance(route_cfg, list) and all("name" in r and "description" in r for r in route_cfg)
except Exception as e:
st.error(f"Route config must be a JSON array of objects with 'name' and 'description'. Error: {e}")
st.stop()
if not st.session_state.conversation:
st.warning("Conversation is empty.")
st.stop()
route_prompt = format_prompt(route_cfg, st.session_state.conversation)
messages = [{"role": "user", "content": route_prompt}]
# Apply chat template → ids; build attention_mask explicitly to avoid warnings
input_ids = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
attention_mask = torch.ones_like(input_ids) # avoids pad/eos warning
with st.spinner("Thinking..."):
t0 = time.time()
with torch.no_grad():
out = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=max_new_tokens,
do_sample=(temperature > 0),
temperature=float(temperature),
top_p=float(top_p),
)
gen = out[0][input_ids.shape[1]:]
txt = tokenizer.decode(gen, skip_special_tokens=True).strip()
dt = time.time() - t0
st.subheader("🧾 Raw model text")
st.code(txt)
st.subheader("✅ Parsed JSON")
try:
obj = json.loads(txt)
st.json(obj)
except Exception:
st.warning("Could not parse JSON exactly; falling back to raw text above.")
st.caption(f"Latency: {dt:.2f}s")
st.divider()
st.caption("Security note: this demo runs locally and trusts input JSON. For multi-tenant deployments, add validation, auth, and rate limiting.")
Step 17: Launch the Streamlit UI
Run Streamlit:
streamlit run arch_router_ui.py --server.port 8501 --server.address 0.0.0.0
Step 18: Access the Streamlit App
Access the Streamlit app in your browser at:
http://0.0.0.0:8501/
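If the port isn't reachable from your machine, you can forward it over SSH instead (an assumption about your setup; substitute your VM's SSH user and IP):

ssh -L 8501:localhost:8501 user@your-vm-ip

Then open http://localhost:8501 in your local browser.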
Play with the Model
Conclusion
You’ve now successfully installed and run Katanemo Arch-Router-1.5B — a lightweight, preference-aligned routing model designed to intelligently select the best route for multi-model systems. From creating a GPU-enabled NodeShift VM to launching a full Streamlit WebUI, you’ve built an environment that’s fast, transparent, and production-ready.
This setup lets you visually test routing logic in the browser, tweak domain/action configurations in real time, and integrate routing outputs directly into larger agent or API stacks. Whether you’re building a multi-model gateway, an evaluation framework, or a full-scale orchestration service, Arch-Router-1.5B provides a simple yet powerful way to connect intent with the right model — efficiently and reliably.