How to Install & Run Servicenow Apriel-1.5-15b-Thinker Locally?

by Ayush Kumar | October 9, 2025

Ready to build cheaper?

Custom CPU plans from as little as $0.012/hour.

Apriel-1.5-15B-Thinker is ServiceNow’s open-weights multimodal reasoning model (image-text-to-text) built with an emphasis on mid-training/continual pre-training and high-quality text SFT—no RL. Despite its compact 15B size, it posts strong results (e.g., 52 on the Artificial Analysis Intelligence Index) and is designed to fit on a single GPU. It ships with an OpenAI-compatible vLLM recipe (custom parser for tools + reasoning) and an MIT license, making it practical for on-prem and research workflows.

Results Reported by Artificial Analysis

Model	Score
GPT-5 (high)	68
Grok 4	65
Claude 4.5 Sonnet	61
Grok 4 Fast	60
Gemini 2.5 Pro	60
gpt-oss-120B (high)	58
DeepSeek V3.1 Terminus	58
Qwen3 235B 2507	57
Qwen3 Next 8B A3B	54
DeepSeek R1 0528	52
Apriel-v1.5-15B-Thinker	52
Gemini 2.5 Flash	51
Kimi K2 0905	50
GLM-4.5	49
Llama Nemotron Super-49B V1.6	45
GPT-5 (minimal)	43
Qwen3 4B 2507	43
gpt-oss-20B (high)	43
Magistral Small 1.2	43
Solar Pro 2	38
Llama 4 Maverick	36

Model	Intelligence Index	Total Parameters (Billions)
Apriel-v1.5-15B-Thinker	52	15
Magistral Small 1.2	43	12
gpt-oss-20B (high)	43	20
Llama Nemotron Super 49B v1.5	45	49
Gemma 3 27B	32	27
Qwen3 4B 2507	43	4
Qwen3 Next 80B A3B	54	80
gpt-oss-120B (high)	58	120
Qwen3 235B 2507	57	235
DeepSeek V3.1 Terminus	58	250
Kimi K2 0905	50	90
DeepSeek R1 0528	52	1000
GLM-4.5	49	512
Llama 4 Maverick	36	1000

GPU configuration (Inference, Rule-of-Thumb)

Assumes batch size 1–2, typical prompts, and moderate image resolution. Vision adds ~2–4 GB transient overhead vs text-only. Use CPU/NVMe offload if tight on VRAM.

Scenario	Precision / Quant	Min VRAM (works)	Comfy VRAM	Example GPUs	Notes & Tips
Single-GPU, Transformers (unquantized)	BF16 / FP16	~32 GB (with some offload)	40–48 GB	RTX A6000 48 GB, H100 80 GB	15B ≈ ~30 GB weights in BF16; keep context ≤8–16k, cap `max_new_tokens`.
Single-GPU, 8-bit	INT8 (bnb)	~20–22 GB	24–32 GB	RTX 4090 24 GB, L40 24 GB	Good quality/speed balance; enable KV-cache offload if needed.
Single-GPU, 4-bit	Q4 / AWQ / GPTQ	~12–14 GB	16–24 GB	RTX 3080/4080 16 GB, A5000 24 GB	Best for 16 GB cards; slightly slower/less precise than 8-bit.
vLLM server (OpenAI API)	BF16 (paged)	~32–40 GB	40–80 GB	A6000 48 GB, H100 80 GB	Use the custom vLLM image & parsers; for very long contexts rely on paged KV + CPU/NVMe. (Hugging Face)
Multi-GPU (tensor parallel)	BF16 / FP16	2×16–24 GB	2×24–40 GB	2× 24 GB class	Splits weights; helps with longer contexts or higher throughput.

Step-by-Step Process to Install & Run Servicenow Apriel-1.5-15b-Thinker Locally

For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.

Step 1: Sign Up and Set Up a NodeShift Cloud Account

Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.

Follow the account setup process and provide the necessary details and information.

Step 2: Create a GPU Node (Virtual Machine)

GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.

Navigate to the menu on the left side. Select the GPU Nodes option, create a GPU Node in the Dashboard, click the Create GPU Node button, and create your first Virtual Machine deploy

Step 3: Select a Model, Region, and Storage

In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.

We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.

Step 4: Select Authentication Method

There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.

Step 5: Choose an Image

In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Servicenow Apriel-1.5-15b-Thinker, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.

We chose the following image:

nvidia/cuda:12.1.1-devel-ubuntu22.04

This image is essential because it includes:

Full CUDA toolkit (including nvcc)
Proper support for building and running GPU-based applications like Servicenow Apriel-1.5-15b-Thinker
Compatibility with CUDA 12.1.1 required by certain model operations

Launch Mode

We selected:

Interactive shell server

This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Servicenow Apriel-1.5-15b-Thinker.

Docker Repository Authentication

We left all fields empty here.

Since the Docker image is publicly available on Docker Hub, no login credentials are required.

Identification

Template Name:

nvidia/cuda:12.1.1-devel-ubuntu22.04

CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.

This setup ensures that the Servicenow Apriel-1.5-15b-Thinker runs in a GPU-enabled environment with proper CUDA access and high compute performance.

After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.

Step 6: Virtual Machine Successfully Deployed

You will get visual confirmation that your node is up and running.

Step 7: Connect to GPUs using SSH

NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.

Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.

Now open your terminal and paste the proxy SSH IP or direct SSH IP.

Next, If you want to check the GPU details, run the command below:

nvidia-smi

Step 8: Install Python 3.11 and Pip (VM already has Python 3.10; We Update It)

Run the following commands to check the available Python version.

If you check the version of the python, system has Python 3.10.12 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.

Run the following commands to add the deadsnakes PPA:

apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update

Now, run the following commands to install Python 3.11, Pip and Wheel:

apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version

Step 9: Created and Activated Python 3.11 Virtual Environment

Run the following commands to created and activated Python 3.11 virtual environment:

python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version

Step 10: Install PyTorch for CUDA

Run the following command to install PyTorch:

pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio

Step 11: Install the Utilities

Run the following command to install utilities:

pip install -U "transformers>=4.48" accelerate pillow safetensors timm einops huggingface-hub

Step 12: Pre-Download The Weights

Run the following code to pre-download the weights:

python - << 'PY'
from huggingface_hub import snapshot_download
snapshot_download("ServiceNow-AI/Apriel-1.5-15b-Thinker", local_dir="~/hf_apriel15", local_dir_use_symlinks=False)
PY

Step 13: Connect to Your GPU VM with a Code Editor

Before you start running model script with the Servicenow Apriel-1.5-15b-Thinker model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.

You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
In this example, we’re using cursor code editor.
Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.

Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.

Step 14: Create the Script

Create a file (ex: # run_apriel.py) and add the following code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Apriel-1.5-15B-Thinker quickrunner (no Docker)
- Text + (optional) Image demo
- Safe dtype casting for the vision tower (fixes Float32 vs BF16 mismatch)
- Simple CLI flags for prompts, tokens, temp, and image URL

Usage:
  python3 run_apriel.py
  python3 run_apriel.py --prompt-text "Explain transformers in 5 bullets."
  python3 run_apriel.py --image-url https://picsum.photos/id/237/400/300
  python3 run_apriel.py --model-id ServiceNow-AI/Apriel-1.5-15b-Thinker --max-new-tokens 512 --temperature 0.6
"""

import argparse
import re
import sys
from typing import Optional

import torch
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModelForImageTextToText


DEFAULT_MODEL = "ServiceNow-AI/Apriel-1.5-15b-Thinker"


def extract_final(text: str) -> str:
    """
    Pull the final answer between [BEGIN FINAL RESPONSE] ... [END FINAL RESPONSE].
    If not present, return the whole decoded string (some short prompts do this).
    """
    m = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", text, re.DOTALL)
    return m[0].strip() if m else text.strip()


def cast_batch_to_device_dtype(batch: dict, device, dtype) -> dict:
    """
    Cast all floating-point tensors to `dtype` and move everything to `device`.
    (Important for pixel_values -> bfloat16 to match vision tower weights.)
    """
    out = {}
    for k, v in batch.items():
        if isinstance(v, torch.Tensor):
            if v.is_floating_point():
                out[k] = v.to(device=device, dtype=dtype)
            else:
                out[k] = v.to(device=device)
        else:
            out[k] = v
    return out


def run_text_only(model, processor, prompt_text: str, max_new_tokens: int, temperature: float) -> str:
    chat = [{"role": "user", "content": [{"type": "text", "text": prompt_text}]}]
    inputs = processor.apply_chat_template(
        chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
    )
    # move & sanitize
    inputs = {k: (v.to(model.device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}
    inputs.pop("token_type_ids", None)

    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)

    gen = out[:, inputs["input_ids"].shape[1]:]
    decoded = processor.decode(gen[0], skip_special_tokens=True)
    return extract_final(decoded)


def run_image_question(
    model,
    processor,
    image_url: str,
    question_text: str,
    max_new_tokens: int,
    temperature: float,
) -> str:
    img = Image.open(requests.get(image_url, stream=True, timeout=60).raw).convert("RGB")
    chat = [{"role": "user", "content": [{"type": "text", "text": question_text}, {"type": "image"}]}]
    prompt = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)

    inputs = processor(text=prompt, images=[img], return_tensors="pt")
    inputs.pop("token_type_ids", None)

    # CRITICAL: cast pixel_values -> model.dtype (BF16) and move to GPU
    inputs = cast_batch_to_device_dtype(inputs, device=model.device, dtype=model.dtype)

    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)

    gen = out[:, inputs["input_ids"].shape[1]:]
    decoded = processor.decode(gen[0], skip_special_tokens=True)
    return extract_final(decoded)


def main():
    parser = argparse.ArgumentParser(description="Run Apriel-1.5-15B-Thinker (no Docker)")
    parser.add_argument("--model-id", default=DEFAULT_MODEL, help="Hugging Face model id")
    parser.add_argument("--prompt-text", default="Give me 3 quirky startup ideas for Pune.", help="Text-only prompt")
    parser.add_argument("--image-url", default=None, help="If set, also do an image+text run with this URL")
    parser.add_argument("--image-question", default="Which animal is this?", help="Question to ask about the image")
    parser.add_argument("--max-new-tokens", type=int, default=256)
    parser.add_argument("--temperature", type=float, default=0.6)
    parser.add_argument("--dtype", choices=["bf16", "fp16"], default="bf16", help="Model compute dtype")
    args = parser.parse_args()

    # Enable TF32 on Ampere+/Hopper (H100) for a little speed-up
    torch.backends.cuda.matmul.allow_tf32 = True

    # Resolve dtype
    dtype = torch.bfloat16 if args.dtype == "bf16" else torch.float16

    # Load model & processor
    print(f"[INFO] Loading model: {args.model_id} (dtype={args.dtype})")
    model = AutoModelForImageTextToText.from_pretrained(
        args.model_id,
        dtype=dtype,            # (was torch_dtype=..., now the new arg)
        device_map="auto",
        trust_remote_code=True, # Apriel uses a LLaVA-style wrapper
    )
    processor = AutoProcessor.from_pretrained(args.model_id, use_fast=True)

    # ---- Text-only run ----
    text_out = run_text_only(
        model, processor, prompt_text=args.prompt_text,
        max_new_tokens=args.max_new_tokens, temperature=args.temperature
    )
    print("\n=== TEXT FINAL ===")
    print(text_out)

    # ---- Image+text run (optional) ----
    if args.image_url:
        try:
            img_out = run_image_question(
                model, processor, image_url=args.image_url, question_text=args.image_question,
                max_new_tokens=args.max_new_tokens, temperature=args.temperature
            )
            print("\n=== IMAGE FINAL ===")
            print(img_out)
        except Exception as e:
            print("\n[WARN] Image run failed:", repr(e), file=sys.stderr)
            print("[HINT] If VRAM is tight, try: --max-new-tokens 128 or --dtype fp16", file=sys.stderr)


if __name__ == "__main__":
    main()

What This Script Does

Loads Apriel-1.5-15B-Thinker locally with Transformers (no Docker), choosing bf16/fp16 and auto-placing on GPU.
Provides a text-only generation path using the model’s chat template and prints the final response block.
Provides an image+text path: downloads an image by URL, fixes dtype for the vision tower (casts to model dtype), and returns the final answer.
Exposes CLI flags for model id, prompt text, image URL, image question, max tokens, temperature, and dtype.
Adds small QoL features: TF32 enable for speed, safe handling of token_type_ids, and clear error hints if the vision run fails.

Step 15: Run the Script

Run the script from the following command:

python3 run_apriel.py

This will download the model and generate response on terminal.

Step 16: Run a quick text-only sanity test

Execute:

python3 run_apriel.py --prompt-text "Explain transformers in 5 bullets."

What you’ll see:

First run may download shards; then load on GPU.
At the end, look for:

=== TEXT FINAL ===
• ...
• ...
• ...
• ...
• ...

Tweak if needed:

Shorter output: add --max-new-tokens 128
More/less randomness: use --temperature 0.4 (safer) or 0.9 (creative)
Different prompt: change the string after --prompt-text

If it hangs or errors:

Press Ctrl+C and rerun.
Ensure your venv is active and GPU visible: nvidia-smi
If you see VRAM issues, reduce tokens or try --dtype fp16.

Step 17: Run an Image + Question Sanity Test

Execute:

python3 run_apriel.py --image-url https://picsum.photos/id/237/400/300

What it does:

Downloads the image URL, applies the chat template with your default question (“Which animal is this?”), runs the VLM, and prints:

=== IMAGE FINAL ===
<final answer here>

Customize the question (optional):

python3 run_apriel.py \
  --image-url https://picsum.photos/id/237/400/300 \
  --image-question "Describe the scene in one sentence."

Control output length / creativity (optional):

# shorter output
python3 run_apriel.py --image-url ... --max-new-tokens 128
# more conservative or creative
python3 run_apriel.py --image-url ... --temperature 0.4
python3 run_apriel.py --image-url ... --temperature 0.9

Step 18: Run with custom generation settings

Execute:

python3 run_apriel.py --model-id ServiceNow-AI/Apriel-1.5-15b-Thinker --max-new-tokens 512 --temperature 0.6

What this does:

Uses the specified model ID.
Increases the token budget to 512 (longer answers, more VRAM + time).
Sets temperature 0.6 (balanced creativity vs. stability).

Expected output:

=== TEXT FINAL ===
<your longer final response here…>

Tips:

If you see VRAM pressure or slower generation, try --max-new-tokens 256 or --dtype fp16.
Want more deterministic outputs? Lower to --temperature 0.2.
Prefer spicier/creative? Raise to --temperature 0.9.

Optional variations:

# Shorter but snappy
python3 run_apriel.py --max-new-tokens 128 --temperature 0.5

# Keep length but cheaper on memory (if needed)
python3 run_apriel.py --max-new-tokens 512 --temperature 0.6 --dtype fp16

Step 19: Install Streamlit and dependencies

Run the install command to install streamlit and dependencies:

pip install streamlit "transformers>=4.48" torch pillow timm einops safetensors huggingface-hub requests

What this does:

Streamlit → launches the web interface.
Transformers 4.48+ → ensures compatibility with Apriel.
Torch (GPU build) → enables CUDA acceleration.
Pillow → handles images.
Timm + Einops → support for the vision tower backbone.
Safetensors → faster/lighter weight loading.
Huggingface Hub → manages model downloads.
Requests → fetches images from URLs.

Step 20: Create the Script

Create a file (ex: # app.py) and add the following code:

#!/usr/bin/env python3
# Streamlit UI for ServiceNow-AI/Apriel-1.5-15b-Thinker (no Docker)
import re
import io
import requests
from typing import Optional

import streamlit as st
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

DEFAULT_MODEL = "ServiceNow-AI/Apriel-1.5-15b-Thinker"

def extract_final(text: str) -> str:
    m = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", text, re.DOTALL)
    return m[0].strip() if m else text.strip()

@st.cache_resource(show_spinner=True)
def load_model(model_id: str, use_bf16: bool = True):
    torch.backends.cuda.matmul.allow_tf32 = True
    dtype = torch.bfloat16 if use_bf16 else torch.float16
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        dtype=dtype,              # new arg (replaces deprecated torch_dtype)
        device_map="auto",
        trust_remote_code=True,   # uses a LLaVA-style wrapper
    )
    processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
    return model, processor

def cast_batch_to_device_dtype(batch: dict, device, dtype) -> dict:
    out = {}
    for k, v in batch.items():
        if isinstance(v, torch.Tensor):
            out[k] = v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device=device)
        else:
            out[k] = v
    return out

def run_text(model, processor, prompt: str, max_new_tokens: int, temperature: float) -> str:
    chat = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
    inputs = processor.apply_chat_template(
        chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
    )
    inputs = {k: (v.to(model.device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}
    inputs.pop("token_type_ids", None)

    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)
    gen = out[:, inputs["input_ids"].shape[1]:]
    decoded = processor.decode(gen[0], skip_special_tokens=True)
    return extract_final(decoded)

def run_vision(model, processor, question: str, image: Image.Image, max_new_tokens: int, temperature: float) -> str:
    chat = [{"role": "user", "content": [{"type": "text", "text": question}, {"type": "image"}]}]
    prompt = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    inputs.pop("token_type_ids", None)
    inputs = cast_batch_to_device_dtype(inputs, device=model.device, dtype=model.dtype)

    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)
    gen = out[:, inputs["input_ids"].shape[1]:]
    decoded = processor.decode(gen[0], skip_special_tokens=True)
    return extract_final(decoded)

# ---------------- UI ----------------
st.set_page_config(page_title="Apriel-1.5-15B-Thinker UI", page_icon="🧠", layout="centered")
st.title("🧠 Apriel-1.5-15B-Thinker — Local UI (no Docker)")

with st.sidebar:
    st.subheader("Model & Settings")
    model_id = st.text_input("Model ID", value=DEFAULT_MODEL)
    dtype_choice = st.selectbox("Compute dtype", ["bf16 (recommended)", "fp16"])
    max_new_tokens = st.slider("Max new tokens", 64, 2048, 256, step=64)
    temperature = st.slider("Temperature", 0.0, 1.5, 0.6, step=0.1)
    load_btn = st.button("Load / Reload Model")

if "loaded" not in st.session_state or load_btn or st.session_state.get("model_id") != model_id or st.session_state.get("dtype") != dtype_choice:
    with st.spinner("Loading model… (first time may download weights)"):
        use_bf16 = dtype_choice.startswith("bf16")
        model, processor = load_model(model_id, use_bf16=use_bf16)
        st.session_state["model"] = model
        st.session_state["processor"] = processor
        st.session_state["model_id"] = model_id
        st.session_state["dtype"] = dtype_choice
        st.session_state["loaded"] = True
    st.success("Model ready ✅")

model = st.session_state.get("model", None)
processor = st.session_state.get("processor", None)

tab_text, tab_vision = st.tabs(["💬 Text Chat", "🖼️ Image + Text"])

with tab_text:
    st.subheader("Text-only")
    prompt = st.text_area("Your prompt", "Give me 3 quirky startup ideas for Pune.", height=140)
    if st.button("Generate (Text)"):
        if not model:
            st.error("Load the model from the sidebar first.")
        else:
            with st.spinner("Thinking…"):
                try:
                    out = run_text(model, processor, prompt, max_new_tokens, temperature)
                    st.markdown("**Final Response**")
                    st.write(out)
                except Exception as e:
                    st.exception(e)

with tab_vision:
    st.subheader("Image + Question")
    col1, col2 = st.columns(2)
    with col1:
        uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg", "webp"])
    with col2:
        image_url = st.text_input("…or image URL (optional)", value="")
        fetch = st.button("Fetch URL")

    img: Optional[Image.Image] = None
    if uploaded is not None:
        img = Image.open(io.BytesIO(uploaded.read())).convert("RGB")
    elif fetch and image_url.strip():
        try:
            img = Image.open(requests.get(image_url.strip(), stream=True, timeout=60).raw).convert("RGB")
        except Exception as e:
            st.error(f"Could not fetch image: {e}")

    if img is not None:
        st.image(img, caption="Selected image", use_column_width=True)

    question = st.text_input("Your question about the image", "Which animal is this?")
    if st.button("Generate (Vision)"):
        if img is None:
            st.error("Please upload an image or provide a valid image URL.")
        elif not model:
            st.error("Load the model from the sidebar first.")
        else:
            with st.spinner("Analyzing image…"):
                try:
                    out = run_vision(model, processor, question, img, max_new_tokens, temperature)
                    st.markdown("**Final Response**")
                    st.write(out)
                except Exception as e:
                    st.exception(e)

What This Script Does

Loads Apriel-1.5-15B-Thinker with Transformers (BF16/FP16) and caches it for reuse in a Streamlit app.
Provides two tabs: Text Chat and Image + Text (upload or URL).
Applies the model’s chat template and extracts only the [FINAL RESPONSE] block for clean output.
Fixes the VLM dtype issue by casting image tensors to the model’s dtype before generation.
Lets you tune max tokens, temperature, and dtype from the sidebar; includes reload and progress/spinner UI.

Step 21: Launch the Streamlit UI

Run Streamlit

streamlit run app.py --server.address 0.0.0.0 --server.port 7861

Step 22: Access the Streamlit App

Access the streamlit app on:

http://0.0.0.0:7860/

Play with Model

Conclusion

Apriel-1.5-15B-Thinker proves you don’t need frontier-scale hardware to ship a smart, multimodal reasoner. With a clean Transformers setup and an optional Streamlit UI, you can run text and image workflows fully on-prem—no Docker, no mystery glue—while keeping control of cost, data, and latency. The model’s mid-training + strong text SFT makes it punch above its 15B weight, and the simple dtype cast fixes keep the vision path smooth on modern GPUs.

From here, you can harden the stack: add auth to the UI, enable logging/metrics, wire in tool use via vLLM/OpenAI API compatibility, and layer Promptfoo tests for quality. If you hit VRAM ceilings, tune max_new_tokens or drop to FP16/8-bit; for throughput, consider paged KV or multi-GPU.

Spin it up, iterate on real tasks, and share what you build. Happy hacking!

Relevant blog posts

October 10, 2025

How to Install & Run Microsoft UserLM-8B Locally?

UserLM-8b is Microsoft’s open-weight large language model uniquely designed to simulate the “user” role in conversations. Unlike most LLMs that play the assistant role, UserLM-8b was fine-tuned on the WildChat-1M dataset to generate realistic user utterances. This makes it particularly useful for evaluating assistant LLMs, synthetic data generation, and research on user behavior modeling. Built on top of Llama-3.1-8B-Base, the model was fully fine-tuned with 227 hours of training on NVIDIA RTX A6000 GPUs. UserLM-8b can: Generate first-turn user queries given a task intent. Simulate multi-turn follow-up responses across long conversations. Signal the natural end of a conversation with a special token. Its evaluations show that UserLM-8b achieves lower perplexity, stronger distributional alignment, and more realistic conversational diversity compared to assistant-based simulators. While not designed as an assistant model, UserLM-8b helps researchers stress-test assistants under a wide range of conversational conditions, making it a valuable tool for robustness and evaluation studies.

October 8, 2025

How to Install & Run Facebook CWM Locally?

The Code World Model (CWM) is a 32B parameter dense autoregressive LLM developed by Meta FAIR CodeGen Team. Unlike traditional code models, it has been mid-trained on Python execution traces, memory trajectories, and containerized agentic interactions, making it uniquely suited for reasoning about how code affects computational environments. CWM was further post-trained with multi-task reinforcement learning (RL) for verifiable coding, math reasoning, and multi-turn software engineering tasks. It is research-only (non-commercial license) and is not designed as a general-purpose chatbot, but as a strong agentic code reasoning model for researchers.

October 7, 2025

How to Install & Run IBM Granite 4.0 H Tiny, Small and Micro Locally?

Granite 4.0-H models are instruction-tuned, tool-calling–ready LLMs built for real enterprise assistants. They keep Granite’s clean chat template and safety alignment, add strong multilingual skills (EN/DE/ES/FR/JA/PT/AR/CS/IT/KO/NL/ZH), and push long-context (up to 1M tokens on the H variants) for document-heavy workflows, RAG, and agent loops. Why “H”? The H line uses a hybrid stack (Transformer attention + Mamba-2 sequence modules) to boost efficiency on long inputs while preserving quality—great for fast tool plans, structured outputs, and retrieval-style prompts. Pick the right size Micro-H (3B, 1M ctx) Lightweight, snappy, and budget-friendly. Ideal for routing, information extraction, form/JSON outputs, short multilingual chat, and FIM code completions on modest GPUs or edge boxes. Tiny-H (7B, 1M ctx) The sweet spot. Better reasoning and multilingual dialogue with solid tool-calling—good for multi-turn assistants, analytics summaries, light coding, and compact RAG pipelines. Small-H (32B, 1M ctx) Muscle for tougher tasks. Stronger reasoning/code synthesis, deeper instruction following, and long-doc comprehension—fit for agentic workflows, complex business logic, and high-fidelity answers. What they’re good at Summarization • text classification/extraction • Q&A/RAG • code (incl. Fill-In-the-Middle) • function/tool calling • multilingual dialogue.

See all posts

Ready to build
with us?

The ideal way for organizations young and old to ease their way into the distributed and affordable cloud at their own pace.

Stay Tuned!

Stay up to date with the latest updates, news, and hotfixes for our product.

NodeShift creates a vital link between developers and affordable cloud.

Switch theme

English (EN)
Arabic (AR)
Chinese (ZH-CN)
German (DE)
Korean (KO)
Russian (RU)
French (FR)
Spanish (ES)
Portuguese (PT)
Japanese (JA)

JavaScript is disabled in your browser. For a better experience, please enable JavaScript.Learn how to enable JavaScript.