Arch-Router-1.5B is a compact, preference-aligned routing model from Katanemo. It reads a conversation plus a user-defined set of "routes" (domain/action pairs) and outputs the single best route as JSON (e.g., {"route": "bug_fixing"}). The design emphasizes transparent, controllable routing for multi-model stacks, letting you encode preferences per domain/action and swap target models without retraining the router. It's small, fast, and production-oriented, making it a great fit for low-latency gateways, agents, and API proxies.
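To make this concrete, here is an illustrative route set and the kind of reply the router produces (route names are examples mirroring the quickstart later in this guide, not a fixed schema):

routes = [
    {"name": "bug_fixing", "description": "Find and fix errors in provided code"},
    {"name": "code_generation", "description": "Generate code from requirements"},
]
# Given a conversation whose latest turn is an error report, the model emits:
# {"route": "bug_fixing"}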
GPU Configuration (Practical Estimates)
| Setup | Precision / Quant | Min GPU VRAM (approx.) | When to use | Notes |
|---|---|---|---|---|
| CPU only | FP16/FP32 (auto-cast) | — | Dev/test, CI | Works but slower; set torch.set_num_threads sensibly. |
| Single GPU (PyTorch) | FP16/BF16 | 4–5 GB | Standard deployment | ~3 GB params + runtime headroom; fastest to integrate. |
| Single GPU (bitsandbytes) | INT8 | 2–3 GB | Memory-lean servers | Slight quality/latency tradeoff vs FP16; easy drop-in with load_in_8bit=True. |
| Single GPU (bitsandbytes) | INT4 | 1–1.5 GB | Edge/smaller GPUs (e.g., 4–6 GB cards) | Largest memory savings; minor accuracy loss; load_in_4bit=True. |
| vLLM (FP16/BF16) | FP16/BF16 | 5–6 GB | High-throughput routing API | Extra VRAM for paged KV cache & scheduler; shines with concurrency. |
| Multi-GPU | FP16 | N/A | Not needed | Model is small; keep it on one GPU for simplicity. |
Tips
- For most servers, FP16 on a 6–8 GB GPU is the sweet spot (headroom for longer context or more concurrency).
- If you’re packaging in an agent gateway, consider vLLM to batch many tiny routing calls.
- Quantized (INT8/INT4) loads are ideal for 4 GB GPUs or mixed CPU–GPU environments; verify outputs on your route set (a minimal load sketch follows below).
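Here is a minimal sketch of that quantized load path, assuming bitsandbytes is installed (pip install bitsandbytes); swap load_in_4bit for load_in_8bit to match the INT8 row:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# INT4 load: largest memory savings per the table above
quant_cfg = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "katanemo/Arch-Router-1.5B",
    device_map="auto",
    quantization_config=quant_cfg,
)
tokenizer = AutoTokenizer.from_pretrained("katanemo/Arch-Router-1.5B")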
Resources
Link: https://huggingface.co/katanemo/Arch-Router-1.5B
Step-by-Step Process to Install & Run Katanemo Arch-Router-1.5B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Katanemo Arch-Router-1.5B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Katanemo Arch-Router-1.5B.
- Compatibility with CUDA 12.1.1, which certain model operations require.
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Katanemo Arch-Router-1.5B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that Katanemo Arch-Router-1.5B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10; we upgrade it)
First, run the following command to check the Python version available on the VM:
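python3 --version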
The system has Python 3.10.12 available by default. To install a newer version, you'll need to use the deadsnakes PPA. Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
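Optionally, verify that the CUDA build landed correctly before moving on (a quick sanity check, not part of the original steps):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

On the GPU VM this should print a +cu121 build and True.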
Step 11: Install the Utilities
Run the following command to install utilities:
pip install "transformers>=4.37.0" accelerate sentencepiece
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Katanemo Arch-Router-1.5B model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create the Script
Create a file (e.g., quickstart.py) and add the following code:
import json
from typing import Any, Dict, List
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "katanemo/Arch-Router-1.5B"
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
TASK_INSTRUCTION = """
You are a helpful assistant designed to find the best suited route.
You are provided with route description within <routes></routes> XML tags:
<routes>
{routes}
</routes>
<conversation>
{conversation}
</conversation>
"""
FORMAT_PROMPT = """
Your task is to decide which route is best suit with user intent on the conversation in <conversation></conversation> XML tags.
1. If the latest intent is irrelevant or fulfilled, respond: {"route": "other"}.
2. Analyze the route descriptions and find the best match.
3. Respond only with JSON: {"route": "route_name"} using an exact route name.
"""
def format_prompt(route_config: List[Dict[str, Any]], conversation: List[Dict[str, Any]]):
return TASK_INSTRUCTION.format(
routes=json.dumps(route_config, ensure_ascii=False),
conversation=json.dumps(conversation, ensure_ascii=False),
) + FORMAT_PROMPT
route_config = [
{"name": "code_generation", "description": "Generate code from requirements"},
{"name": "bug_fixing", "description": "Find and fix errors in provided code"},
{"name": "performance_optimization", "description": "Make code faster/cleaner"},
{"name": "api_help", "description": "Use/understand external APIs & SDKs"},
{"name": "programming", "description": "General programming Q&A/best practices"},
]
conversation = [
{"role": "user", "content": "fix this: 'torch.utils._pytree' has no attribute 'register_pytree_node'."}
]
route_prompt = format_prompt(route_config, conversation)
messages = [{"role": "user", "content": route_prompt}]
input_ids = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# Generate the route decision, then decode only the newly generated tokens
gen = model.generate(input_ids=input_ids, max_new_tokens=256)
prompt_len = input_ids.shape[1]
out = gen[0][prompt_len:]
text = tokenizer.decode(out, skip_special_tokens=True)
print("\nMODEL RESPONSE:\n", text)
Step 14: Run the Script
Run the script with the following command:
python quickstart.py
This will load the model and print its routing decision in the terminal.
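For the sample conversation in the script, the output should end with something like this (illustrative; exact text can vary between runs):

MODEL RESPONSE:
 {"route": "bug_fixing"}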
Step 15: Install Dependencies
Run the following command to install dependencies:
pip install streamlit pydantic "transformers>=4.37.0" accelerate sentencepiece
Step 16: Create the Script
Create a file (e.g., arch_router_ui.py) and add the following code:
import json, torch, time
import streamlit as st
from typing import List, Dict, Any
from pydantic import BaseModel, ValidationError
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "katanemo/Arch-Router-1.5B"
@st.cache_resource(show_spinner=True)
def load_model():
tok = AutoTokenizer.from_pretrained(MODEL_ID)
    mdl = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
    ).eval()
# Make sure pad token is set; create attention_mask explicitly later
if tok.pad_token is None:
tok.pad_token = tok.eos_token
return tok, mdl
# ---------- UI ----------
# st.set_page_config must be the first Streamlit command, so call it before load_model()
st.set_page_config(page_title="Arch-Router-1.5B WebUI", page_icon="🧭", layout="wide")
st.title("🧭 Arch-Router-1.5B — Browser UI")
tokenizer, model = load_model()
with st.sidebar:
st.header("⚙️ Settings")
max_new_tokens = st.slider("max_new_tokens", 32, 1024, 256, 32)
temperature = st.slider("temperature (sampling)", 0.0, 1.0, 0.2, 0.05)
top_p = st.slider("top_p", 0.1, 1.0, 1.0, 0.05)
st.caption("Tip: low temperature is good here to keep JSON consistent.")
st.subheader("🧩 Route Config")
default_routes = [
{"name":"code_generation","description":"Generate code from requirements"},
{"name":"bug_fixing","description":"Find and fix errors in provided code"},
{"name":"performance_optimization","description":"Make code faster/cleaner"},
{"name":"api_help","description":"Use/understand external APIs & SDKs"},
{"name":"programming","description":"General programming Q&A/best practices"}
]
routes_json = st.text_area(
"Edit routes as JSON array", value=json.dumps(default_routes, indent=2), height=220
)
st.subheader("💬 Conversation")
st.caption("Add turns; last user turn is routed. Keep system prompt minimal—model expects the XML prompt wrapper.")
# Conversation builder
if "conversation" not in st.session_state:
st.session_state.conversation = [
{"role": "user", "content": "fix this module 'torch.utils._pytree' has no attribute 'register_pytree_node'."}
]
colA, colB = st.columns([3,1])
with colA:
new_role = st.selectbox("Role", ["user","assistant"], index=0, key="role_sel")
new_content = st.text_area("Content", height=120, key="content_ta")
with colB:
if st.button("➕ Add turn", use_container_width=True):
if new_content.strip():
st.session_state.conversation.append({"role": new_role, "content": new_content.strip()})
st.success("Added.")
else:
st.warning("Write something first.")
# Show current conversation
st.write("**Current conversation (JSON):**")
st.code(json.dumps(st.session_state.conversation, indent=2, ensure_ascii=False), language="json")
# Arch-Router prompt templates (from model card guidance)
TASK_INSTRUCTION = """
You are a helpful assistant designed to find the best suited route.
You are provided with route description within <routes></routes> XML tags:
<routes>
{routes}
</routes>
<conversation>
{conversation}
</conversation>
"""
FORMAT_PROMPT = """
Your task is to decide which route is best suit with user intent on the conversation in <conversation></conversation> XML tags. Follow the instruction:
1. If the latest intent from user is irrelevant or user intent is full filled, response with other route {"route": "other"}.
2. You must analyze the route descriptions and find the best match route for user latest intent.
3. You only response the name of the route that best matches the user's request, use the exact name in the <routes></routes>.
Based on your analysis, provide your response in the following JSON formats if you decide to match any route:
{"route": "route_name"}
"""
def format_prompt(route_config: List[Dict[str, Any]], conversation: List[Dict[str, Any]]):
return (
TASK_INSTRUCTION.format(
routes=json.dumps(route_config, ensure_ascii=False),
conversation=json.dumps(conversation, ensure_ascii=False),
) + FORMAT_PROMPT
)
# Run button
run = st.button("🧭 Route it")
if run:
# Parse routes
try:
route_cfg = json.loads(routes_json)
assert isinstance(route_cfg, list) and all("name" in r and "description" in r for r in route_cfg)
except Exception as e:
st.error(f"Route config must be a JSON array of objects with 'name' and 'description'. Error: {e}")
st.stop()
if not st.session_state.conversation:
st.warning("Conversation is empty.")
st.stop()
route_prompt = format_prompt(route_cfg, st.session_state.conversation)
messages = [{"role": "user", "content": route_prompt}]
# Apply chat template → ids; build attention_mask explicitly to avoid warnings
input_ids = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
attention_mask = torch.ones_like(input_ids) # avoids pad/eos warning
with st.spinner("Thinking..."):
t0 = time.time()
with torch.no_grad():
out = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=max_new_tokens,
do_sample=(temperature > 0),
temperature=float(temperature),
top_p=float(top_p),
)
gen = out[0][input_ids.shape[1]:]
txt = tokenizer.decode(gen, skip_special_tokens=True).strip()
dt = time.time() - t0
st.subheader("🧾 Raw model text")
st.code(txt)
st.subheader("✅ Parsed JSON")
try:
obj = json.loads(txt)
st.json(obj)
except Exception:
st.warning("Could not parse JSON exactly; falling back to raw text above.")
st.caption(f"Latency: {dt:.2f}s")
st.divider()
st.caption("Security note: this demo runs locally and trusts input JSON. For multi-tenant deployments, add validation, auth, and rate limiting.")
Step 17: Launch the Streamlit UI
Run Streamlit:
streamlit run arch_router_ui.py --server.port 8501 --server.address 0.0.0.0
Step 18: Access the Streamlit App
Access the Streamlit app in your browser at:
http://0.0.0.0:8501/
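If the port isn't reachable from your machine, you can forward it over SSH instead (an assumption about your setup; substitute your VM's SSH user and IP):

ssh -L 8501:localhost:8501 user@your-vm-ip

Then open http://localhost:8501 in your local browser.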
Play with the Model
Conclusion
You’ve now successfully installed and run Katanemo Arch-Router-1.5B — a lightweight, preference-aligned routing model designed to intelligently select the best route for multi-model systems. From creating a GPU-enabled NodeShift VM to launching a full Streamlit WebUI, you’ve built an environment that’s fast, transparent, and production-ready.
This setup lets you visually test routing logic in the browser, tweak domain/action configurations in real time, and integrate routing outputs directly into larger agent or API stacks. Whether you’re building a multi-model gateway, an evaluation framework, or a full-scale orchestration service, Arch-Router-1.5B provides a simple yet powerful way to connect intent with the right model — efficiently and reliably.