IBM Granite 4.0-H Family (Micro • Tiny • Small)
Granite 4.0-H models are instruction-tuned, tool-calling–ready LLMs built for real enterprise assistants. They keep Granite’s clean chat template and safety alignment, add strong multilingual skills (EN/DE/ES/FR/JA/PT/AR/CS/IT/KO/NL/ZH), and push long-context (up to 1M tokens on the H variants) for document-heavy workflows, RAG, and agent loops.
Why “H”? The H line uses a hybrid stack (Transformer attention + Mamba-2 sequence modules) to boost efficiency on long inputs while preserving quality—great for fast tool plans, structured outputs, and retrieval-style prompts.
Pick the Right Size
- Micro-H (3B, 1M ctx)
Lightweight, snappy, and budget-friendly. Ideal for routing, information extraction, form/JSON outputs, short multilingual chat, and FIM code completions on modest GPUs or edge boxes.
- Tiny-H (7B, 1M ctx)
The sweet spot. Better reasoning and multilingual dialogue with solid tool-calling—good for multi-turn assistants, analytics summaries, light coding, and compact RAG pipelines.
- Small-H (32B, 1M ctx)
Muscle for tougher tasks. Stronger reasoning/code synthesis, deeper instruction following, and long-doc comprehension—fit for agentic workflows, complex business logic, and high-fidelity answers.
What they’re good at
Summarization • text classification/extraction • Q&A/RAG • code (incl. Fill-In-the-Middle) • function/tool calling • multilingual dialogue.
One-liners (Ollama / Open WebUI)
ollama run granite4:micro-h # 3B
ollama run granite4:tiny-h # 7B
ollama run granite4:small-h # 32B
GPU Configuration (Inference Rule-of-Thumb)
granite4:micro-h (3B)
| Scenario | Precision / Quant | Min VRAM that runs | Comfortable VRAM | Typical setup | Notes |
|---|---|---|---|---|---|
| Local chat, short ctx (≤8k) | 4-bit (Q4) | 4–6 GB | 6–8 GB | RTX 4060 8GB / 3060 12GB / T4 16GB | Fast, great for JSON/IE, routing |
| Assistant, medium ctx (8–32k) | 4-bit (Q4/Q5) | 6–8 GB | 8–12 GB | 3060 12GB / 4070 12GB | Keep num_ctx ≤ 32k |
| Higher fidelity | 8-bit | 10–12 GB | 12–16 GB | 3060 12GB / L4 24GB | Better precision; slower than 4-bit |
| Unquantized experiments | BF16 | 12–16 GB | 16–24 GB | L4 24GB / A10 24GB | Weights ≈ 6 GB; cache adds overhead |
granite4:tiny-h (7B)
| Scenario | Precision / Quant | Min VRAM that runs | Comfortable VRAM | Typical setup | Notes |
|---|---|---|---|---|---|
| Local chat, short ctx (≤8k) | 4-bit (Q4) | 8–10 GB | 10–12 GB | 3060 12GB / 4070 12GB | Good quality vs size |
| Assistant, medium ctx (8–32k) | 4-/5-bit | 10–12 GB | 12–16 GB | 4070/4080 / L4 24GB | Solid multi-turn + tools |
| Higher fidelity | 8-bit | 16–20 GB | 20–24 GB | 4090 24GB / L4 24GB | Better coding/reasoning |
| Unquantized experiments | BF16 | 24–28 GB | 28–40 GB | 4090 24GB (tight) / L40S 48GB | Headroom needed for cache |
granite4:small-h (32B)
| Scenario | Precision / Quant | Min VRAM that runs | Comfortable VRAM | Typical setup | Notes |
|---|---|---|---|---|---|
| Local chat, short ctx (≤8k) | 4-bit (Q4) | 24 GB | 32–40 GB | 4090 24GB (tight) / L40S 48GB | Works on 24GB with care |
| Assistant, medium ctx (8–32k) | 4-/5-bit | 32 GB | 40–48 GB | L40S 48GB / A5000 24GB×2 (TP) | Better throughput & ctx |
| Higher fidelity | 8-bit | 48–64 GB | 64 GB+ | A100 40/80GB / 2×A5000 | For higher-quality outputs |
| Unquantized | BF16 | 80 GB | 80 GB+ / multi-GPU | H100 80GB / 2×A100 40GB (TP) | Weights ≈ 64 GB alone |
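As a rough cross-check on the tables above, weight memory alone scales as parameters times bytes per parameter (about 0.5 bytes at 4-bit, 1 byte at 8-bit, 2 bytes at BF16); KV/state cache and runtime overhead come on top, especially at long context. A minimal sketch of that arithmetic, using the parameter counts Ollama reports later in this guide:

# Rough rule-of-thumb: weight memory ≈ parameters × bytes per parameter.
# KV/state cache, activations, and runtime overhead are NOT included,
# which is why the table figures above run higher.
PARAMS_B = {"micro-h": 3.2, "tiny-h": 6.9, "small-h": 32.2}   # billions of params
BYTES_PER_PARAM = {"4-bit": 0.5, "8-bit": 1.0, "BF16": 2.0}

for name, params in PARAMS_B.items():
    estimates = ", ".join(f"{q}: ~{params * b:.1f} GB" for q, b in BYTES_PER_PARAM.items())
    print(f"granite4:{name}: {estimates}")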
Resources
Link 1: https://huggingface.co/ibm-granite/granite-4.0-h-micro
Link 2: https://huggingface.co/ibm-granite/granite-4.0-h-tiny
Link 3: https://huggingface.co/ibm-granite/granite-4.0-h-small
Link 4: https://ollama.com/library/granite4
Note on GPUs
We’re standardizing on 1× NVIDIA H200 because a single Hopper-class card with very large HBM3e memory (≈141 GB) and high bandwidth lets us run all three Granite 4.0-H models (Micro-H 3B, Tiny-H 7B, Small-H 32B) on the same GPU—and even run two processes (e.g., Transformers/vLLM service + Ollama/Open WebUI) side-by-side—without paging, fragile offload, or tensor-parallel sharding. The extra headroom absorbs long context (up to 1M tokens) where KV-cache dominates, keeps BF16 quality for Small-H while still serving Tiny/Micro in 4–8-bit with high throughput, and simplifies ops: one node, no cross-GPU latency, easier scheduling/restarts, and cleaner observability. In short, H200 gives us capacity + speed + simplicity now and headroom for future heavier prompts/agents. If you only need to run a single model, you can drop to cheaper GPUs based on preference—e.g., Micro/Tiny on 12–24 GB class cards (RTX 3060/4070, L4, A10) and Small-H via 4–5-bit on 24–32 GB or full BF16 on an 80 GB class card.
Step-by-Step Process to Install & Run IBM Granite 4.0 H Tiny, Small and Micro Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running IBM Granite 4.0 H Tiny, Small and Micro, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like IBM Granite 4.0 H Tiny, Small and Micro
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like IBM Granite 4.0 H Tiny, Small and Micro.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the IBM Granite 4.0 H Tiny, Small and Micro runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH or direct SSH command shown for your instance.
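For example, a direct SSH connection typically looks like this (placeholders shown; use the exact host, port, and key from your deployment):

ssh -i ~/.ssh/<your_key> root@<Your_VM_IP> -p <SSH_PORT>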
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we upgrade it)
Run the following command to check the Python version currently available on the VM:
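python3 --version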
The system has Python 3.10.12 available by default. To install a newer version of Python, you'll need the deadsnakes PPA. Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install Ollama
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Step 11: Serve Ollama
Run the following command to start the Ollama server so models can be pulled and served:
ollama serve
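Leave this running in its own terminal. From a second terminal you can confirm the Ollama API is reachable (this is the same local endpoint, port 11434, that Open WebUI will talk to):

curl -s http://localhost:11434/api/tags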
Step 12: Install Open-WebUI
Run the following command to install open-webui:
pip install open-webui
Step 13: Serve Open-WebUI
In your activated Python environment, start the Open-WebUI server by running:
open-webui serve
- Wait for the server to complete all database migrations and set up initial files. You’ll see a series of INFO logs and a large “OPEN WEBUI” banner in the terminal.
- When setup is complete, the WebUI will be available and ready for you to access via your browser.
Step 14: Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -L 8080:localhost:8080 -p 18685 root@Your_VM_IP
This forwards:
Local localhost:8080
→ Remote VM 127.0.0.1:8080 (the Open WebUI port)
Step 15: Access Open-WebUI in Your Browser
Go to:
http://localhost:8080
- You should see the Open-WebUI login or setup page.
- Log in or create a new account if this is your first time.
- You’re now ready to use Open-WebUI to interact with your models!
Step 16: Pull Granite 4.0 H models in Open WebUI (via Ollama)
- In Open WebUI, click Select a model (top bar).
- In the search box, type the first model name exactly: granite4:micro-h.
- When it says No results found, click Pull “granite4:micro-h” from Ollama.com.
- Wait for the download to finish; the model will appear under Local.
- Repeat steps 2–4 for the other two models, one by one:
granite4:tiny-h (IBM Granite 4.0 H Tiny)
granite4:small-h (IBM Granite 4.0 H Small)
- After each pull completes, you can select that model and start chatting.
Tip (what you should see): exactly as in the screenshot, the search shows "No results found" and a line offering to Pull "<model>" from Ollama.com. Click that line.
CLI fallback (same result):
ollama pull granite4:micro-h
ollama pull granite4:tiny-h
ollama pull granite4:small-h
Quick checks / fixes if the Pull option doesn’t appear:
- Make sure Ollama is running (ollama serve) and Open WebUI is connected to it.
- Double-check spelling (it must be granite4:micro-h, granite4:tiny-h, or granite4:small-h).
- Refresh the Open WebUI page after each pull.
Step 17: Check all Granite models are ready
In Open WebUI
- Click Select a model ▾ → Local.
- You should see all three entries listed (as in the screenshot):
- granite4:tiny-h — 6.9B
- granite4:small-h — 32.2B
- granite4:micro-h — 3.2B
- If any are missing, click the refresh icon (top-right) or reload the page.
From terminal (double-check via Ollama)
# All pulled models should appear here
ollama list
# Optional: view basic metadata
ollama show granite4:micro-h | head -n 20
ollama show granite4:tiny-h | head -n 20
ollama show granite4:small-h | head -n 20
Quick sanity test (one-liner per model)
printf 'Reply EXACTLY: READY micro-h' | ollama run granite4:micro-h
printf 'Reply EXACTLY: READY tiny-h' | ollama run granite4:tiny-h
printf 'Reply EXACTLY: READY small-h' | ollama run granite4:small-h
- Expected: each returns the exact READY ... text, which confirms the model loads and generates.
If a model still doesn’t show up
- Confirm the pull finished: ollama pull granite4:<tag> (again).
- Ensure Ollama is running and reachable: curl -s http://localhost:11434/api/tags | jq '.models[].name'.
- Restart Open WebUI (or refresh the browser).
Step 18: Results
Link: https://drive.google.com/file/d/1Jsl_VAQisSJ2h-1_9j7vA0_0VXqbv4-p/view?usp=sharing
Up to this point, we’ve installed the IBM Granite 4.0 H models via Ollama + Open WebUI: searched and pulled granite4:micro-h, granite4:tiny-h, and granite4:small-h, verified they appear under Local, and ran quick sanity prompts to confirm they load correctly. Now we’ll switch to the Hugging Face + Transformers route—setting up the CUDA-enabled Python environment, pulling the same models from HF, and showing both BF16 and 4-bit runs (plus a minimal chat/tool-calling script) so you can use Granite directly in code.
Step 19: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 \
torch torchvision torchaudio
Step 20: Install the Utilities
Run the following command to install utilities:
pip install "transformers>=4.44" accelerate sentencepiece
Step 21: Install Wheel and Flash Attention
Run the following commands to install wheel and flash attention:
python -m pip install --upgrade pip wheel
pip install --no-build-isolation flash-attn
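Flash Attention is optional; Transformers only uses it when you request it at load time. A minimal sketch of opting in (assuming the installed flash-attn build is compatible with Granite's attention layers; if loading fails, simply drop the attn_implementation argument):

import torch
from transformers import AutoModelForCausalLM

# Optional: request FlashAttention 2 for the attention layers.
# If flash-attn is missing or incompatible, remove the attn_implementation
# argument and Transformers falls back to its default attention.
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-h-micro",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)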
Step 22: Install Bitsandbytes
Run the following command to install bitsandbytes:
pip install bitsandbytes
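Before writing any model code, it's worth confirming that the CUDA build of PyTorch can see the GPU. A quick check, run inside the py311 virtual environment from Step 9:

import torch

# Prints the PyTorch version, the CUDA version it was built against,
# whether CUDA is available, and the detected GPU name (e.g., an H200).
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))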
Step 23: Connect to Your GPU VM with a Code Editor
Before you start running model scripts with the IBM Granite 4.0 H Tiny, Small, and Micro models, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 24: Create the Script
Create a file (e.g., app.py) and add the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-micro"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # let HF place on your GPU(s)
)

# minimal chat
chat = [{"role": "user",
         "content": "List one IBM Research lab in the US (name, location only)."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)  # uses Granite 4.0 template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(out[0], skip_special_tokens=False))
What This Script Does
- Loads the Granite-4.0-H-Micro model and tokenizer, placing the model on your GPU automatically and using bfloat16.
- Builds a one-turn chat (“user” asks for one IBM Research lab).
- Converts that chat to the model’s official chat template to form the prompt.
- Tokenizes the prompt, runs generation for up to 64 new tokens in torch.inference_mode() (no gradients).
- Decodes and prints the raw output (role tags kept because skip_special_tokens=False).
Step 25: Run the Script
Run the script with the following command:
python3 app.py
This will download the model and generate a response in the terminal.
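If you are on a smaller GPU, the same script can load the weights in 4-bit through bitsandbytes (installed in Step 22) instead of BF16. A minimal sketch of the change, keeping the rest of app.py identical (4-bit NF4 here is an illustrative choice, not an official recommendation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ibm-granite/granite-4.0-h-micro"

# Quantize the linear weights to 4-bit NF4; compute still runs in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)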
Step 26: Tool Calling Script
Create a script for tool calling (e.g., granite_tool_call.py) and add the following code:
# granite_tool_call.py
import json, re, sys, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-4.0-h-micro"


def parse_tool_calls(text: str):
    """Return all valid JSON dicts from <tool_call>...</tool_call> blocks."""
    calls = []
    for m in re.finditer(r"<tool_call\b[^>]*>(.*?)</tool_call>", text, flags=re.S | re.I):
        inner = m.group(1)
        i = inner.find("{")
        if i == -1:
            continue
        depth, start, end = 0, i, None
        for j, ch in enumerate(inner[i:], start=i):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = j + 1
                    break
        if end is None:
            continue
        try:
            calls.append(json.loads(inner[start:end]))
        except json.JSONDecodeError:
            continue
    return calls


def fake_get_current_weather(city: str):
    # Replace this with a real API call
    return {"city": city, "temperature_c": 22, "condition": "Clear", "source": "demo-stub"}


def main():
    print("Loading model...", file=sys.stderr)
    tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        dtype=torch.bfloat16,  # Granite weights are BF16
        device_map="auto",
    ).eval()

    tools = [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get current weather for a city.",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
        }
    }]

    user_msg = "What's the weather like in Boston right now?"
    chat = [{"role": "user", "content": user_msg}]

    # 1) Ask model; it should emit a <tool_call> ... JSON ...
    prompt = tok.apply_chat_template(chat, tokenize=False, tools=tools, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out1 = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    decoded1 = tok.decode(out1[0], skip_special_tokens=False)
    print("\n=== Raw model output (turn 1) ===\n")
    print(decoded1)

    calls = parse_tool_calls(decoded1)
    if not calls:
        print("\n(No valid <tool_call> found — model may have answered directly.)")
        return
    call = calls[-1]
    print("\n=== Parsed tool_call (last valid) ===\n", json.dumps(call, indent=2))

    # 2) Run the tool
    result = None
    if call.get("name") == "get_current_weather":
        city = (call.get("arguments") or {}).get("city", "Unknown")
        result = fake_get_current_weather(city)
    else:
        print("\n(No demo handler for this tool.)")
        return
    print("\n=== (Demo) Tool result ===\n", json.dumps(result, indent=2))

    # 3) Feed a <tool_response> back and continue generation
    #
    # We append a new role turn for the tool and then cue the assistant again.
    # This mirrors the structure Granite used in turn 1.
    tool_block = (
        "<|start_of_role|>tool<|end_of_role|>"
        "<tool_response>"
        + json.dumps({"name": call["name"], "arguments": call.get("arguments", {}), "results": result})
        + "</tool_response><|end_of_text|>"
        "<|start_of_role|>assistant<|end_of_role|>"
    )
    continuation_text = decoded1 + tool_block
    cont_inputs = tok(continuation_text, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out2 = model.generate(
            **cont_inputs,
            max_new_tokens=128,
            do_sample=False
        )
    decoded2 = tok.decode(out2[0], skip_special_tokens=False)
    print("\n=== Final assistant reply (after tool_response) ===\n")
    # Show only the tail after our appended assistant cue, for readability
    tail = decoded2.split(tool_block)[-1]
    print(tail)


if __name__ == "__main__":
    main()
What This Script Does
- Loads Granite-4.0-H-Micro in BF16 with device_map="auto" and prepares the tokenizer.
- Builds a one-turn chat + tools list using Granite’s chat template, then generates a first reply expecting a <tool_call>.
- Parses all <tool_call> blocks, picks the last valid JSON, and extracts the function name/args.
- Runs a demo tool fake_get_current_weather(city) and prints the tool result.
- Appends a <tool_response> turn and regenerates to get the model’s final natural-language answer, then prints it.
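For reference, the parser above expects turn 1 to contain something like the following (illustrative only; the exact role and wrapper tokens are produced by Granite's chat template at runtime, and the parser only relies on the <tool_call> JSON with "name" and "arguments" keys):

<|start_of_role|>assistant<|end_of_role|><tool_call>{"name": "get_current_weather", "arguments": {"city": "Boston"}}</tool_call><|end_of_text|>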
Step 27: Run the Script
Run the script with the following command:
python granite_tool_call.py
This will load the model and generate a response in the terminal.
To install and run Granite 4.0-H Micro (3B) on a GPU VM, verify CUDA works (nvidia-smi), create an env (python3 -m venv granite && source granite/bin/activate && pip install -qU pip wheel), then install CUDA-enabled PyTorch (e.g., pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio) and libs (pip install "transformers>=4.44" accelerate sentencepiece, plus optional flash-attn for speed and bitsandbytes for 4-bit). Test with a minimal script: load MODEL_ID="ibm-granite/granite-4.0-h-micro" via AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16, device_map="auto"), build a chat using tokenizer.apply_chat_template(..., add_generation_prompt=True), generate with model.generate(max_new_tokens=128), and print the decoded text; for small GPUs, swap to 4-bit by passing a BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16). If you prefer serving, start vLLM (pip install vllm, then python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-4.0-h-micro --dtype bfloat16 --max-model-len 32768) and call the OpenAI-style /v1/chat/completions endpoint; or use Ollama/Open WebUI with ollama pull granite4:micro-h and then ollama run granite4:micro-h. For Tiny-H (7B) and Small-H (32B), the steps are identical—just change the model reference to granite4:tiny-h or granite4:small-h in Ollama, and swap MODEL_ID to the corresponding HF model from IBM’s Granite 4.0 collection when using Transformers/vLLM (everything else—drivers, Python env, packages, and code—stays the same).
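As a quick illustration of the vLLM route mentioned above, once the OpenAI-compatible server is running you can call it from Python (a sketch, assuming vLLM's default port 8000 and the requests package):

import requests

# Chat completion request against vLLM's OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "ibm-granite/granite-4.0-h-micro",
        "messages": [{"role": "user", "content": "Summarize Granite 4.0-H in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])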
Conclusion
Granite 4.0-H (Micro, Tiny, Small) gives you one family, three gears—lightweight JSON/IE on Micro, balanced reasoning on Tiny, and deep, long-doc chops on Small. We walked through two clean paths—Ollama + Open WebUI for fast chats and Transformers/vLLM for production services—plus realistic GPU guides and why a single H200 keeps everything smooth (long context, BF16, and dual processes on one box). From here, you can pull the models, drop in our tough prompt pack, and wire up tool-calling to your APIs. Start small, benchmark with our scripts, then scale the same workflow across your stack—no rewrites, just more headroom.