LFM2-VL-450M — Lightweight Vision-Language Model for Edge Devices
LFM2-VL-450M is the most compact and efficient model in Liquid AI’s LFM2-VL family, designed for low-latency multimodal inference on edge and cloud GPUs. With roughly 450M parameters (a ~350M language backbone plus an ~86M vision encoder), it delivers reliable image-text reasoning at up to 2× faster inference than typical VLMs in its size range. It supports native 512×512 resolution and dynamic vision token handling, and can be easily fine-tuned for domain-specific visual understanding tasks such as product tagging, document OCR, and quick caption generation. Its minimal footprint makes it ideal for real-time multimodal inference on affordable GPUs.
Performance
| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL3-2B | 65.10 | 38.49 | 66.10 | 831 | 53.10 | 61.10 | 48.70 | 57.60 | 75.00 | 67.00 | 2186.40 | 64.80 |
| InternVL3-1B | 57.00 | 31.14 | 54.94 | 798 | 43.00 | 52.30 | 43.20 | 46.90 | 71.20 | 58.70 | 1912.40 | 49.80 |
| SmolVLM2-2.2B | 57.50 | 19.42 | 37.75 | 725 | 42.30 | 46.00 | 41.60 | 51.50 | 71.30 | 34.90 | 1792.50 | – |
| LFM2-VL-1.6B | 65.23 | 37.66 | 58.68 | 742 | 44.40 | 49.53 | 38.44 | 51.10 | 71.97 | 48.07 | 1753.04 | 50.99 |
LFM2-VL-1.6B — Balanced Model for General Multimodal Tasks
LFM2-VL-1.6B strikes a strong balance between accuracy and efficiency, offering a notable upgrade in visual reasoning over LFM2-VL-450M while maintaining fast runtime.
It pairs a 1.2B-parameter language backbone with a SigLIP2 NaFlex (400M) vision encoder, enabling better detail comprehension, structured scene understanding, and improved OCR performance. Trained on extensive text-image datasets with joint fine-tuning, it’s optimized for context-rich multimodal tasks such as infographic reading, visual QA, and descriptive captioning. This model is best suited for users who want higher visual fidelity without the large GPU demands of multi-billion parameter models.
Performance
| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL3-2B | 65.10 | 38.49 | 66.10 | 831 | 53.10 | 61.10 | 48.70 | 57.60 | 75.00 | 67.00 | 2186.40 | 64.80 |
| InternVL3-1B | 57.00 | 31.14 | 54.94 | 798 | 43.00 | 52.30 | 43.20 | 46.90 | 71.20 | 58.70 | 1912.40 | 49.80 |
| SmolVLM2-2.2B | 57.50 | 19.42 | 37.75 | 725 | 42.30 | 46.00 | 41.60 | 51.50 | 71.30 | 34.90 | 1792.50 | – |
| LFM2-VL-1.6B | 65.23 | 37.66 | 58.68 | 742 | 44.40 | 49.53 | 38.44 | 51.10 | 71.97 | 48.07 | 1753.04 | 50.99 |
LFM2-VL-3B — Advanced Vision-Language Model for Precision Reasoning
LFM2-VL-3B is the latest and most capable model in the LFM2-VL lineup, built for fine-grained visual reasoning and multilingual multimodal comprehension (supports up to 10 languages). It combines a 2.6B-parameter text tower with a large SigLIP2 NaFlex vision encoder (400M), achieving near state-of-the-art results among compact open-weight VLMs. Despite its scale, it retains impressive inference efficiency, dynamic image token allocation, and flexible speed-quality tuning. LFM2-VL-3B is ideal for research, detailed visual understanding, multi-object recognition, and captioning complex scenes where precision and depth matter most.
Performance
| Model | Average | MMStar | RealWorldQA | MM-IFEval | BLINK | MMBench (dev en) | OCRBench | POPE |
|---|---|---|---|---|---|---|---|---|
| InternVL3_5-2B | 66.50 | 57.67 | 60.78 | 47.31 | 50.97 | 78.18 | 834.00 | 87.17 |
| Qwen2.5-VL-3B | 65.42 | 56.13 | 65.23 | 38.62 | 48.97 | 80.41 | 824.00 | 86.17 |
| InternVL3-2B | 67.44 | 61.10 | 65.10 | 38.49 | 53.10 | 81.10 | 831.00 | 90.10 |
| SmolVLM2-2.2B | 56.01 | 46.00 | 57.50 | 19.42 | 42.30 | 69.24 | 725.00 | 85.10 |
| LFM2-VL-3B | 69.00 | 57.73 | 71.37 | 51.83 | 51.03 | 79.81 | 822.00 | 89.01 |
GPU Configuration Table
| Model | Parameters (Total) | Vision Encoder | Recommended GPU | Min VRAM (GB) | Recommended VRAM (GB) | Precision | Context Length (Text) | When to Use |
|---|---|---|---|---|---|---|---|---|
| LFM2-VL-450M | ~0.45B (350M LM + 86M Vision) | SigLIP2 NaFlex Base | T4 / L4 / A10 | 6–8 GB | 12–16 GB | FP16 / BF16 | 32,768 tokens | For lightweight, real-time multimodal tasks on edge/cloud GPUs |
| LFM2-VL-1.6B | ~1.6B (1.2B LM + 400M Vision) | SigLIP2 NaFlex Shape-Optimized | A10 / L40S / RTX 4090 | 12–16 GB | 20–24 GB | BF16 preferred | 32,768 tokens | For balanced multimodal reasoning and visual QA |
| LFM2-VL-3B | ~3.0B (2.6B LM + 400M Vision) | SigLIP2 NaFlex Large | A100 / H100 | 24 GB (min) | 40–80 GB | BF16 / FP16 | 32,768 tokens | For fine-grained, multilingual, and research-grade image-text reasoning |
Notes
- All models natively support up to 512×512 px images with automatic patch-splitting for larger resolutions.
- Use `bfloat16` on Ampere or newer GPUs for best throughput and stable precision.
- For low-VRAM setups, resize inputs to ≤512 px and limit `max_new_tokens` (e.g., 64–96); see the sketch below.
- All three support Hugging Face `transformers` ≥ v4.57, with LFM2-VL-3B requiring a specific source commit for compatibility.
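Taken together, these notes boil down to a dtype choice at load time plus conservative generation settings. The snippet below is only a minimal sketch of that pattern (the full scripts later in this guide follow the same approach):
import torch
# bfloat16 on Ampere or newer GPUs; fall back to float16 on older cards such as the T4
dtype = torch.bfloat16 if (torch.cuda.is_available() and torch.cuda.is_bf16_supported()) else torch.float16
# Low-VRAM friendly generation settings: short outputs, deterministic decoding
gen_kwargs = dict(max_new_tokens=96, do_sample=False)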
Resources
Link 1: https://huggingface.co/LiquidAI/LFM2-VL-450M
Link 2: https://huggingface.co/LiquidAI/LFM2-VL-1.6B
Link 3: https://huggingface.co/LiquidAI/LFM2-VL-3B
Step-by-Step Process to Install & Run LiquidAI LFM2-VL Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running LiquidAI LFM2-VL, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based models like LiquidAI LFM2-VL.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like LiquidAI LFM2-VL.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the LiquidAI LFM2-VL runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, If you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10 by default, so we upgrade it)
Run `python3 --version` to check the Python version available on the system. The VM comes with Python 3.10.12 by default, so to install a higher version of Python you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --upgrade "torch>=2.3" "torchvision" --index-url https://download.pytorch.org/whl/cu121
Step 11: Install Core Libs
Run the following command to install core libs:
pip install --upgrade pillow accelerate safetensors einops
Step 12: Install Transformers
Run the following command to install transformers:
pip install --upgrade "transformers>=4.57" huggingface_hub
Step 13: Quick Smoke Test (GPU + BF16 Support)
python - <<'PY'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
print("GPU:", torch.cuda.get_device_name(0))
print("BF16 supported:", torch.cuda.is_bf16_supported())
PY
- BF16 supported = True → we’ll use `bfloat16`.
- False (e.g., T4) → use `float16` instead.
Step 14: Connect to Your GPU VM with a Code Editor
Before you start running model script with the LFM2-VL models, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 15: Create the Script
Create a file (e.g., run_lfm2vl.py) and add the following code:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from PIL import ImageOps

MODEL_ID = "LiquidAI/LFM2-VL-450M"

# dtype selection
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
print(f"Loading {MODEL_ID} with dtype={dtype} ...")

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    dtype=dtype,  # use dtype (no deprecation warning)
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Load image and pre-resize to reduce image tokens (optional but speeds up)
img_url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(img_url)
# Keep aspect ratio, cap long side at 512
image = ImageOps.contain(image, (512, 512))

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image? Keep it under 1 sentence."},
        ],
    },
]

gen_kwargs = dict(
    max_new_tokens=64,
    do_sample=False,  # deterministic; set True + temperature for sampling
    repetition_penalty=1.05,
)

# Build inputs from chat template
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

with torch.autocast("cuda", dtype=dtype):
    outputs = model.generate(**inputs, **gen_kwargs)

text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("\n=== MODEL OUTPUT ===")
print(text)
What This Script Does
- Loads the LFM2-VL-450M model and processor on GPU with bf16 (or fp16 fallback).
- Downloads the stop-sign image and resizes it to ≤512 px to reduce vision tokens.
- Builds a ChatML-style conversation (user = image + question) via `apply_chat_template`.
- Runs deterministic generation (`do_sample=False`, `max_new_tokens=64`, `repetition_penalty=1.05`) under CUDA autocast.
- Decodes and prints the model’s one-sentence description of the image.
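One thing to keep in mind: decoding the full `outputs` tensor also returns the chat-template text that preceded the answer. If you prefer to print only the newly generated tokens, a common pattern (a sketch, assuming `inputs` contains `input_ids` as in the script above) is to slice them off before decoding:
new_tokens = outputs[:, inputs["input_ids"].shape[-1]:]  # drop the prompt portion
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])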
Step 16: Run the Script
Run the script with the following command:
python run_lfm2vl.py
This will load the model and generate the response in the terminal.
Step 17: Install Gradio
Run the following command to install gradio:
pip install gradio
Step 18: Tiny Gradio UI (Drag-and-Drop Images)
Up to Step 16, we interacted with the model purely through the terminal, sending text and image prompts via Python scripts and reading the generated responses in the console. Now we move to a more user-friendly experience by building a tiny Gradio interface, which lets us interact with the model visually: drag and drop images, type questions, adjust sliders for generation parameters, and instantly see the model’s answers in a web UI instead of the command line. Create a file (e.g., app.py) and add the following code:
import torch, gradio as gr
from PIL import ImageOps
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "LiquidAI/LFM2-VL-450M"

# dtype selection
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
DTYPE = torch.bfloat16 if use_bf16 else torch.float16

# Load model/processor
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", dtype=DTYPE  # use dtype (no deprecation warning)
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

SYSTEM_PROMPT = "You are a helpful multimodal assistant by Liquid AI."

def preprocess_image(img, cap_long_side=True):
    if img is None:
        return None
    if cap_long_side:
        # Keep aspect ratio; cap long side at 512 to reduce vision tokens
        img = ImageOps.contain(img, (512, 512))
    return img

def infer(image, question, max_new_tokens, temp, cap_long_side):
    image = preprocess_image(image, cap_long_side=cap_long_side)
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question or "Describe this image."},
            ],
        },
    ]
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        tokenize=True,
    ).to(model.device)

    # Build generation kwargs (don't pass vision knobs to generate)
    gen_kwargs = {
        "max_new_tokens": int(max_new_tokens),
        "repetition_penalty": 1.05,
    }
    if float(temp) > 0:
        gen_kwargs.update({"do_sample": True, "temperature": float(temp)})
    else:
        gen_kwargs.update({"do_sample": False})

    with torch.autocast("cuda", dtype=DTYPE):
        outputs = model.generate(**inputs, **gen_kwargs)
    text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return text

demo = gr.Interface(
    fn=infer,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(label="Question", value="Describe this image."),
        gr.Slider(8, 512, value=96, step=1, label="Max new tokens"),
        gr.Slider(0.0, 1.0, value=0.0, step=0.05, label="Temperature"),
        gr.Checkbox(value=True, label="Fast resize to 512px (speed-up)"),
    ],
    outputs=gr.Textbox(label="Answer"),
    title="LFM2-VL-450M (Liquid AI)",
    description="Lightweight VLM • Uses chat template • Resize toggle to control vision token load.",
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
What This Script Does
- Loads LFM2-VL-450M on GPU with bf16 (or fp16 fallback) and its processor.
- Optionally resizes images to ≤512px (toggleable) to cut vision tokens and speed up inference.
- Builds a ChatML-style conversation (system + user with image + question) via `apply_chat_template`.
- Generates an answer with controllable max_new_tokens and temperature (deterministic when temp=0).
- Serves a Gradio UI (image, question, sliders, checkbox) and displays the model’s text Answer box.
Step 19: Launch the Gradio App
Run Gradio:
python app.py
Step 20: Access the Gradio App
Access the Gradio app at:
http://0.0.0.0:7860/
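If your browser can’t reach the VM’s IP on port 7860 directly (for example, when only SSH access is open), a common workaround is to tunnel the port over SSH and then open http://localhost:7860 locally. The username and host below are placeholders for your own VM details:
ssh -L 7860:localhost:7860 your_user@your_vm_ip
Alternatively, Gradio’s `share=True` option in `demo.launch()` can generate a temporary public link if outbound access is allowed.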
Play with Model
Up to this point, we’ve successfully installed and run the LFM2-VL-450M model — the smallest and most lightweight version of the LFM2-VL family, perfect for testing and quick image-to-text interactions. Now, we’ll move ahead to explore the more powerful variants — LFM2-VL-1.6B and LFM2-VL-3B — running them one by one to experience their enhanced visual reasoning, accuracy, and multilingual capabilities, while following a similar setup and inference process.
Step 21: Write Script for LFM2-VL-1.6B Version
Create a file (e.g., run_lfm2vl16b.py) and add the following code:
# save as run_lfm2vl16b.py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from PIL import ImageOps

MODEL_ID = "LiquidAI/LFM2-VL-1.6B"

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
print(f"Loading {MODEL_ID} with dtype={dtype}")

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", dtype=dtype
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

img_url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(img_url)
image = ImageOps.contain(image, (512, 512))

conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in one line."}
    ]}
]

inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True,
    return_tensors="pt", return_dict=True, tokenize=True
).to(model.device)

gen_kwargs = dict(max_new_tokens=64, do_sample=False, repetition_penalty=1.05)

with torch.autocast("cuda", dtype=dtype):
    outputs = model.generate(**inputs, **gen_kwargs)

text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("\n=== MODEL OUTPUT ===\n", text)
What This Script Does
- Loads the LFM2-VL-1.6B model and its processor on GPU using bfloat16 or float16 precision.
- Downloads and resizes the sample stop-sign image to 512 px to optimize performance.
- Builds a ChatML-style conversation combining the image and a text prompt.
- Runs text generation deterministically (`do_sample=False`) with up to 64 new tokens.
- Decodes and prints the model’s one-line description of the image in the terminal.
Step 22: Run the Script
Run the script with the following command:
python run_lfm2vl16b.py
This will load the model and generate the response in the terminal.
Step 23: Write Script for LFM2-VL-3B Version
Create a file (e.g., run_lfm2vl3b.py) and add the following code:
import torch
from PIL import ImageOps
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

MODEL_ID = "LiquidAI/LFM2-VL-3B"

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
print(f"Loading {MODEL_ID} with dtype={dtype} ...")

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    dtype=dtype,  # use dtype (not torch_dtype)
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Sample image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(url)
# Keep aspect ratio; cap long side at 512 to control vision tokens
image = ImageOps.contain(image, (512, 512))

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in one concise sentence."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

gen_kwargs = dict(max_new_tokens=96, do_sample=False, repetition_penalty=1.05)

with torch.autocast("cuda", dtype=dtype):
    out = model.generate(**inputs, **gen_kwargs)

print("\n=== OUTPUT ===")
print(processor.batch_decode(out, skip_special_tokens=True)[0])
What This Script Does
- Loads the LFM2-VL-3B multimodal model and its processor on GPU using bfloat16 or float16 precision.
- Downloads a sample street-scene image, resizing it to 512 px on the long side to limit vision tokens.
- Constructs a ChatML-style conversation containing both the image and a concise text prompt.
- Runs deterministic text generation (`do_sample=False`, `max_new_tokens=96`) to produce the model’s reply.
- Decodes and prints the generated one-sentence image description in the terminal output.
Step 24: Run the Script
Run the script with the following command:
python run_lfm2vl3b.py
This will load the model and generate the response in the terminal.
Conclusion
You’ve gone end-to-end—from provisioning a GPU VM on NodeShift to installing CUDA-aligned PyTorch, setting up a clean Python 3.11 env, and running LiquidAI’s LFM2-VL models at three scales (450M, 1.6B, and 3B). You validated terminal inference, then upgraded the experience with a lightweight Gradio UI for drag-and-drop image queries. With this foundation, you can tune speed/quality via precision and token limits, swap GPUs based on budget and latency needs, and confidently scale from quick prototyping to production-grade multimodal apps. Next steps: wire in your own images/datasets, add LoRA fine-tuning for your domain, and wrap the app with basic auth/logging to ship it safely.