IBM Granite 4.0-H Family (Micro • Tiny • Small)
Granite 4.0-H models are instruction-tuned, tool-calling–ready LLMs built for real enterprise assistants. They keep Granite’s clean chat template and safety alignment, add strong multilingual skills (EN/DE/ES/FR/JA/PT/AR/CS/IT/KO/NL/ZH), and push long-context (up to 1M tokens on the H variants) for document-heavy workflows, RAG, and agent loops.
Why “H”? The H line uses a hybrid stack (Transformer attention + Mamba-2 sequence modules) to boost efficiency on long inputs while preserving quality—great for fast tool plans, structured outputs, and retrieval-style prompts.
Pick the Right Size
- Micro-H (3B, 1M ctx)
Lightweight, snappy, and budget-friendly. Ideal for routing, information extraction, form/JSON outputs, short multilingual chat, and FIM code completions on modest GPUs or edge boxes.
- Tiny-H (7B, 1M ctx)
The sweet spot. Better reasoning and multilingual dialogue with solid tool-calling—good for multi-turn assistants, analytics summaries, light coding, and compact RAG pipelines.
- Small-H (32B, 1M ctx)
Muscle for tougher tasks. Stronger reasoning/code synthesis, deeper instruction following, and long-doc comprehension—fit for agentic workflows, complex business logic, and high-fidelity answers.
What they’re good at
Summarization • text classification/extraction • Q&A/RAG • code (incl. Fill-In-the-Middle) • function/tool calling • multilingual dialogue.
One-liners (Ollama / Open WebUI)
ollama run granite4:micro-h # 3B
ollama run granite4:tiny-h # 7B
ollama run granite4:small-h # 32B
GPU Configuration (Inference Rule-of-Thumb)
granite4:micro-h (3B)
| Scenario | Precision / Quant | Min VRAM that runs | Comfortable VRAM | Typical setup | Notes |
|---|---|---|---|---|---|
| Local chat, short ctx (≤8k) | 4-bit (Q4) | 4–6 GB | 6–8 GB | RTX 4060 8GB / 3060 12GB / T4 16GB | Fast, great for JSON/IE, routing |
| Assistant, medium ctx (8–32k) | 4-bit (Q4/Q5) | 6–8 GB | 8–12 GB | 3060 12GB / 4070 12GB | Keep num_ctx ≤ 32k |
| Higher fidelity | 8-bit | 10–12 GB | 12–16 GB | 3060 12GB / L4 24GB | Better precision; slower than 4-bit |
| Unquantized experiments | BF16 | 12–16 GB | 16–24 GB | L4 24GB / A10 24GB | Weights ≈ 6 GB; cache adds overhead |
granite4:tiny-h (7B)
| Scenario | Precision / Quant | Min VRAM that runs | Comfortable VRAM | Typical setup | Notes |
|---|---|---|---|---|---|
| Local chat, short ctx (≤8k) | 4-bit (Q4) | 8–10 GB | 10–12 GB | 3060 12GB / 4070 12GB | Good quality vs size |
| Assistant, medium ctx (8–32k) | 4-/5-bit | 10–12 GB | 12–16 GB | 4070/4080 / L4 24GB | Solid multi-turn + tools |
| Higher fidelity | 8-bit | 16–20 GB | 20–24 GB | 4090 24GB / L4 24GB | Better coding/reasoning |
| Unquantized experiments | BF16 | 24–28 GB | 28–40 GB | 4090 24GB (tight) / L40S 48GB | Headroom needed for cache |
granite4:small-h (32B)
| Scenario | Precision / Quant | Min VRAM that runs | Comfortable VRAM | Typical setup | Notes |
|---|---|---|---|---|---|
| Local chat, short ctx (≤8k) | 4-bit (Q4) | 24 GB | 32–40 GB | 4090 24GB (tight) / L40S 48GB | Works on 24GB with care |
| Assistant, medium ctx (8–32k) | 4-/5-bit | 32 GB | 40–48 GB | L40S 48GB / A5000 24GB×2 (TP) | Better throughput & ctx |
| Higher fidelity | 8-bit | 48–64 GB | 64 GB+ | A100 40/80GB / 2×A5000 | For higher-quality outputs |
| Unquantized | BF16 | 80 GB | 80 GB+ / multi-GPU | H100 80GB / 2×A100 40GB (TP) | Weights ≈ 64 GB alone |
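As a rough cross-check on the tables above, weight memory alone scales as parameters times bytes per parameter (about 0.5 bytes at 4-bit, 1 byte at 8-bit, 2 bytes at BF16); KV/state cache and runtime overhead come on top, especially at long context. A minimal sketch of that arithmetic, using the parameter counts Ollama reports later in this guide:

# Rough rule-of-thumb: weight memory ≈ parameters × bytes per parameter.
# KV/state cache, activations, and runtime overhead are NOT included,
# which is why the table figures above run higher.
PARAMS_B = {"micro-h": 3.2, "tiny-h": 6.9, "small-h": 32.2}   # billions of params
BYTES_PER_PARAM = {"4-bit": 0.5, "8-bit": 1.0, "BF16": 2.0}

for name, params in PARAMS_B.items():
    estimates = ", ".join(f"{q}: ~{params * b:.1f} GB" for q, b in BYTES_PER_PARAM.items())
    print(f"granite4:{name}: {estimates}")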
Resources
Link 1: https://huggingface.co/ibm-granite/granite-4.0-h-micro
Link 2: https://huggingface.co/ibm-granite/granite-4.0-h-tiny
Link 3: https://huggingface.co/ibm-granite/granite-4.0-h-small
Link 4: https://ollama.com/library/granite4
Note on GPUs
We’re standardizing on 1× NVIDIA H200 because a single Hopper-class card with very large HBM3e memory (≈141 GB) and high bandwidth lets us run all three Granite 4.0-H models (Micro-H 3B, Tiny-H 7B, Small-H 32B) on the same GPU—and even run two processes (e.g., Transformers/vLLM service + Ollama/Open WebUI) side-by-side—without paging, fragile offload, or tensor-parallel sharding. The extra headroom absorbs long context (up to 1M tokens) where KV-cache dominates, keeps BF16 quality for Small-H while still serving Tiny/Micro in 4–8-bit with high throughput, and simplifies ops: one node, no cross-GPU latency, easier scheduling/restarts, and cleaner observability. In short, H200 gives us capacity + speed + simplicity now and headroom for future heavier prompts/agents. If you only need to run a single model, you can drop to cheaper GPUs based on preference—e.g., Micro/Tiny on 12–24 GB class cards (RTX 3060/4070, L4, A10) and Small-H via 4–5-bit on 24–32 GB or full BF16 on an 80 GB class card.
Step-by-Step Process to Install & Run IBM Granite 4.0 H Tiny, Small and Micro Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running IBM Granite 4.0 H Tiny, Small and Micro, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like IBM Granite 4.0 H Tiny, Small and Micro
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like IBM Granite 4.0 H Tiny, Small and Micro.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the IBM Granite 4.0 H Tiny, Small and Micro runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH or direct SSH command shown for your instance.
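For example, a direct SSH connection typically looks like this (placeholders shown; use the exact host, port, and key from your deployment):

ssh -i ~/.ssh/<your_key> root@<Your_VM_IP> -p <SSH_PORT>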
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we upgrade it)
Run the following command to check the Python version currently available on the VM:
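python3 --version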
The system has Python 3.10.12 available by default. To install a newer version of Python, you'll need the deadsnakes PPA. Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install Ollama
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Step 11: Serve Ollama
Run the following command to start the Ollama server so models can be pulled and served:
ollama serve
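Leave this running in its own terminal. From a second terminal you can confirm the Ollama API is reachable (this is the same local endpoint, port 11434, that Open WebUI will talk to):

curl -s http://localhost:11434/api/tags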
Step 12: Install Open-WebUI
Run the following command to install open-webui:
pip install open-webui
Step 13: Serve Open-WebUI
In your activated Python environment, start the Open-WebUI server by running:
open-webui serve
- Wait for the server to complete all database migrations and set up initial files. You’ll see a series of INFO logs and a large “OPEN WEBUI” banner in the terminal.
- When setup is complete, the WebUI will be available and ready for you to access via your browser.
Step 14: Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -L 8080:localhost:8080 -p 18685 root@Your_VM_IP
This forwards:
Local localhost:8080
→ Remote VM 127.0.0.1:8080 (the Open WebUI port)
Step 15: Access Open-WebUI in Your Browser
Go to:
http://localhost:8080
- You should see the Open-WebUI login or setup page.
- Log in or create a new account if this is your first time.
- You’re now ready to use Open-WebUI to interact with your models!
Step 16: Pull Granite 4.0 H models in Open WebUI (via Ollama)
- In Open WebUI, click Select a model (top bar).
- In the search box, type the first model name exactly: granite4:micro-h.
- When it says No results found, click Pull “granite4:micro-h” from Ollama.com.
- Wait for the download to finish; the model will appear under Local.
- Repeat steps 2–4 for the other two models, one by one:
granite4:tiny-h (IBM Granite 4.0 H Tiny)
granite4:small-h (IBM Granite 4.0 H Small)
- After each pull completes, you can select that model and start chatting.
Tip (what you should see): exactly as in the screenshot, the search shows "No results found" and a line offering to Pull "<model>" from Ollama.com. Click that line.
CLI fallback (same result):
ollama pull granite4:micro-h
ollama pull granite4:tiny-h
ollama pull granite4:small-h
Quick checks / fixes if the Pull option doesn’t appear:
- Make sure Ollama is running (ollama serve) and Open WebUI is connected to it.
- Double-check spelling (it must be granite4:micro-h, granite4:tiny-h, or granite4:small-h).
- Refresh the Open WebUI page after each pull.
Step 17: Check all Granite models are ready
In Open WebUI
- Click Select a model ▾ → Local.
- You should see all three entries listed (as in the screenshot):
- granite4:tiny-h — 6.9B
- granite4:small-h — 32.2B
- granite4:micro-h — 3.2B
- If any are missing, click the refresh icon (top-right) or reload the page.
From terminal (double-check via Ollama)
# All pulled models should appear here
ollama list
# Optional: view basic metadata
ollama show granite4:micro-h | head -n 20
ollama show granite4:tiny-h | head -n 20
ollama show granite4:small-h | head -n 20
Quick sanity test (one-liner per model)
printf 'Reply EXACTLY: READY micro-h' | ollama run granite4:micro-h
printf 'Reply EXACTLY: READY tiny-h' | ollama run granite4:tiny-h
printf 'Reply EXACTLY: READY small-h' | ollama run granite4:small-h
- Expected: each returns the exact READY ... text, which confirms the model loads and generates.
If a model still doesn’t show up
- Confirm the pull finished: ollama pull granite4:<tag> (again).
- Ensure Ollama is running and reachable: curl -s http://localhost:11434/api/tags | jq '.models[].name'.
- Restart Open WebUI (or refresh the browser).
Step 18: Results
Link: https://drive.google.com/file/d/1Jsl_VAQisSJ2h-1_9j7vA0_0VXqbv4-p/view?usp=sharing
Up to this point, we’ve installed the IBM Granite 4.0 H models via Ollama + Open WebUI: searched and pulled granite4:micro-h, granite4:tiny-h, and granite4:small-h, verified they appear under Local, and ran quick sanity prompts to confirm they load correctly. Now we’ll switch to the Hugging Face + Transformers route—setting up the CUDA-enabled Python environment, pulling the same models from HF, and showing both BF16 and 4-bit runs (plus a minimal chat/tool-calling script) so you can use Granite directly in code.
Step 19: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 \
torch torchvision torchaudio
Step 20: Install the Utilities
Run the following command to install utilities:
pip install "transformers>=4.44" accelerate sentencepiece
Step 21: Install Wheel and Flash Attention
Run the following commands to install wheel and flash attention:
python -m pip install --upgrade pip wheel
pip install --no-build-isolation flash-attn
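Flash Attention is optional; Transformers only uses it when you request it at load time. A minimal sketch of opting in (assuming the installed flash-attn build is compatible with Granite's attention layers; if loading fails, simply drop the attn_implementation argument):

import torch
from transformers import AutoModelForCausalLM

# Optional: request FlashAttention 2 for the attention layers.
# If flash-attn is missing or incompatible, remove the attn_implementation
# argument and Transformers falls back to its default attention.
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-h-micro",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)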
Step 22: Install Bitsandbytes
Run the following command to install bitsandbytes:
pip install bitsandbytes
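Before writing any model code, it's worth confirming that the CUDA build of PyTorch can see the GPU. A quick check, run inside the py311 virtual environment from Step 9:

import torch

# Prints the PyTorch version, the CUDA version it was built against,
# whether CUDA is available, and the detected GPU name (e.g., an H200).
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))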
Step 23: Connect to Your GPU VM with a Code Editor
Before you start running model scripts with the IBM Granite 4.0 H Tiny, Small, and Micro models, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 24: Create the Script
Create a file (e.g., app.py) and add the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-micro"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # let HF place on your GPU(s)
)

# minimal chat
chat = [{"role": "user",
         "content": "List one IBM Research lab in the US (name, location only)."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)  # uses Granite 4.0 template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(out[0], skip_special_tokens=False))
What This Script Does
- Loads the Granite-4.0-H-Micro model and tokenizer, placing the model on your GPU automatically and using bfloat16.
- Builds a one-turn chat (“user” asks for one IBM Research lab).
- Converts that chat to the model’s official chat template to form the prompt.
- Tokenizes the prompt, runs generation for up to 64 new tokens in torch.inference_mode() (no gradients).
- Decodes and prints the raw output (role tags kept because skip_special_tokens=False).
Step 25: Run the Script
Run the script with the following command:
python3 app.py
This will download the model and generate a response in the terminal.
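If you are on a smaller GPU, the same script can load the weights in 4-bit through bitsandbytes (installed in Step 22) instead of BF16. A minimal sketch of the change, keeping the rest of app.py identical (4-bit NF4 here is an illustrative choice, not an official recommendation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ibm-granite/granite-4.0-h-micro"

# Quantize the linear weights to 4-bit NF4; compute still runs in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)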
Step 26: Tool Calling Script
Create a script for tool calling (e.g., granite_tool_call.py) and add the following code:
# granite_tool_call.py
import json, re, sys, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ibm-granite/granite-4.0-h-micro"


def parse_tool_calls(text: str):
    """Return all valid JSON dicts from <tool_call>...</tool_call> blocks."""
    calls = []
    for m in re.finditer(r"<tool_call\b[^>]*>(.*?)</tool_call>", text, flags=re.S | re.I):
        inner = m.group(1)
        i = inner.find("{")
        if i == -1:
            continue
        depth, start, end = 0, i, None
        for j, ch in enumerate(inner[i:], start=i):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = j + 1
                    break
        if end is None:
            continue
        try:
            calls.append(json.loads(inner[start:end]))
        except json.JSONDecodeError:
            continue
    return calls


def fake_get_current_weather(city: str):
    # Replace this with a real API call
    return {"city": city, "temperature_c": 22, "condition": "Clear", "source": "demo-stub"}


def main():
    print("Loading model...", file=sys.stderr)
    tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        dtype=torch.bfloat16,  # Granite weights are BF16
        device_map="auto",
    ).eval()

    tools = [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get current weather for a city.",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
        }
    }]

    user_msg = "What's the weather like in Boston right now?"
    chat = [{"role": "user", "content": user_msg}]

    # 1) Ask model; it should emit a <tool_call> ... JSON ...
    prompt = tok.apply_chat_template(chat, tokenize=False, tools=tools, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out1 = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    decoded1 = tok.decode(out1[0], skip_special_tokens=False)
    print("\n=== Raw model output (turn 1) ===\n")
    print(decoded1)

    calls = parse_tool_calls(decoded1)
    if not calls:
        print("\n(No valid <tool_call> found — model may have answered directly.)")
        return
    call = calls[-1]
    print("\n=== Parsed tool_call (last valid) ===\n", json.dumps(call, indent=2))

    # 2) Run the tool
    result = None
    if call.get("name") == "get_current_weather":
        city = (call.get("arguments") or {}).get("city", "Unknown")
        result = fake_get_current_weather(city)
    else:
        print("\n(No demo handler for this tool.)")
        return
    print("\n=== (Demo) Tool result ===\n", json.dumps(result, indent=2))

    # 3) Feed a <tool_response> back and continue generation
    #
    # We append a new role turn for the tool and then cue the assistant again.
    # This mirrors the structure Granite used in turn 1.
    tool_block = (
        "<|start_of_role|>tool<|end_of_role|>"
        "<tool_response>"
        + json.dumps({"name": call["name"], "arguments": call.get("arguments", {}), "results": result})
        + "</tool_response><|end_of_text|>"
        "<|start_of_role|>assistant<|end_of_role|>"
    )
    continuation_text = decoded1 + tool_block
    cont_inputs = tok(continuation_text, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out2 = model.generate(
            **cont_inputs,
            max_new_tokens=128,
            do_sample=False
        )
    decoded2 = tok.decode(out2[0], skip_special_tokens=False)
    print("\n=== Final assistant reply (after tool_response) ===\n")
    # Show only the tail after our appended assistant cue, for readability
    tail = decoded2.split(tool_block)[-1]
    print(tail)


if __name__ == "__main__":
    main()
What This Script Does
- Loads Granite-4.0-H-Micro in BF16 with device_map="auto" and prepares the tokenizer.
- Builds a one-turn chat + tools list using Granite’s chat template, then generates a first reply expecting a <tool_call>.
- Parses all <tool_call> blocks, picks the last valid JSON, and extracts the function name/args.
- Runs a demo tool fake_get_current_weather(city) and prints the tool result.
- Appends a <tool_response> turn and regenerates to get the model’s final natural-language answer, then prints it.
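For reference, the parser above expects turn 1 to contain something like the following (illustrative only; the exact role and wrapper tokens are produced by Granite's chat template at runtime, and the parser only relies on the <tool_call> JSON with "name" and "arguments" keys):

<|start_of_role|>assistant<|end_of_role|><tool_call>{"name": "get_current_weather", "arguments": {"city": "Boston"}}</tool_call><|end_of_text|>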
Step 27: Run the Script
Run the script with the following command:
python granite_tool_call.py
This will load the model and generate a response in the terminal.
To install and run Granite 4.0-H Micro (3B) on a GPU VM, verify CUDA works (nvidia-smi), create an env (python3 -m venv granite && source granite/bin/activate && pip install -qU pip wheel), then install CUDA-enabled PyTorch (e.g., pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio) and libs (pip install "transformers>=4.44" accelerate sentencepiece, plus optional flash-attn for speed and bitsandbytes for 4-bit). Test with a minimal script: load MODEL_ID="ibm-granite/granite-4.0-h-micro" via AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16, device_map="auto"), build a chat using tokenizer.apply_chat_template(..., add_generation_prompt=True), generate with model.generate(max_new_tokens=128), and print the decoded text; for small GPUs, swap to 4-bit by passing a BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16). If you prefer serving, start vLLM (pip install vllm, then python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-4.0-h-micro --dtype bfloat16 --max-model-len 32768) and call the OpenAI-style /v1/chat/completions endpoint; or use Ollama/Open WebUI with ollama pull granite4:micro-h and then ollama run granite4:micro-h. For Tiny-H (7B) and Small-H (32B), the steps are identical—just change the model reference to granite4:tiny-h or granite4:small-h in Ollama, and swap MODEL_ID to the corresponding HF model from IBM’s Granite 4.0 collection when using Transformers/vLLM (everything else—drivers, Python env, packages, and code—stays the same).
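As a quick illustration of the vLLM route mentioned above, once the OpenAI-compatible server is running you can call it from Python (a sketch, assuming vLLM's default port 8000 and the requests package):

import requests

# Chat completion request against vLLM's OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "ibm-granite/granite-4.0-h-micro",
        "messages": [{"role": "user", "content": "Summarize Granite 4.0-H in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])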
Conclusion
Granite 4.0-H (Micro, Tiny, Small) gives you one family, three gears—lightweight JSON/IE on Micro, balanced reasoning on Tiny, and deep, long-doc chops on Small. We walked through two clean paths—Ollama + Open WebUI for fast chats and Transformers/vLLM for production services—plus realistic GPU guides and why a single H200 keeps everything smooth (long context, BF16, and dual processes on one box). From here, you can pull the models, drop in our tough prompt pack, and wire up tool-calling to your APIs. Start small, benchmark with our scripts, then scale the same workflow across your stack—no rewrites, just more headroom.