LFM2-VL-450M — Lightweight Vision-Language Model for Edge Devices
LFM2-VL-450M is the most compact and efficient model in Liquid AI’s LFM2-VL family, designed for low-latency multimodal inference on edge and cloud GPUs. With roughly 450M parameters (a ~350M language backbone plus an ~86M vision encoder), it delivers reliable image-text reasoning at up to 2× faster inference than typical VLMs in its size range. It supports native 512×512 resolution and dynamic vision token handling, and can be easily fine-tuned for domain-specific visual understanding tasks such as product tagging, document OCR, and quick caption generation. Its minimal footprint makes it ideal for real-time multimodal inference on affordable GPUs.
Performance
| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL3-2B | 65.10 | 38.49 | 66.10 | 831 | 53.10 | 61.10 | 48.70 | 57.60 | 75.00 | 67.00 | 2186.40 | 64.80 |
| InternVL3-1B | 57.00 | 31.14 | 54.94 | 798 | 43.00 | 52.30 | 43.20 | 46.90 | 71.20 | 58.70 | 1912.40 | 49.80 |
| SmolVLM2-2.2B | 57.50 | 19.42 | 37.75 | 725 | 42.30 | 46.00 | 41.60 | 51.50 | 71.30 | 34.90 | 1792.50 | – |
| LFM2-VL-1.6B | 65.23 | 37.66 | 58.68 | 742 | 44.40 | 49.53 | 38.44 | 51.10 | 71.97 | 48.07 | 1753.04 | 50.99 |
LFM2-VL-1.6B — Balanced Model for General Multimodal Tasks
LFM2-VL-1.6B strikes a strong balance between accuracy and efficiency, offering a notable upgrade in visual reasoning over LFM2-VL-450M while maintaining fast runtime.
It pairs a 1.2B-parameter language backbone with a SigLIP2 NaFlex (400M) vision encoder, enabling better detail comprehension, structured scene understanding, and improved OCR performance. Trained on extensive text-image datasets with joint fine-tuning, it’s optimized for context-rich multimodal tasks such as infographic reading, visual QA, and descriptive captioning. This model is best suited for users who want higher visual fidelity without the large GPU demands of multi-billion parameter models.
Performance
| Model | RealWorldQA | MM-IFEval | InfoVQA (Val) | OCRBench | BLINK | MMStar | MMMU (Val) | MathVista | SEEDBench_IMG | MMVet | MME | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL3-2B | 65.10 | 38.49 | 66.10 | 831 | 53.10 | 61.10 | 48.70 | 57.60 | 75.00 | 67.00 | 2186.40 | 64.80 |
| InternVL3-1B | 57.00 | 31.14 | 54.94 | 798 | 43.00 | 52.30 | 43.20 | 46.90 | 71.20 | 58.70 | 1912.40 | 49.80 |
| SmolVLM2-2.2B | 57.50 | 19.42 | 37.75 | 725 | 42.30 | 46.00 | 41.60 | 51.50 | 71.30 | 34.90 | 1792.50 | – |
| LFM2-VL-1.6B | 65.23 | 37.66 | 58.68 | 742 | 44.40 | 49.53 | 38.44 | 51.10 | 71.97 | 48.07 | 1753.04 | 50.99 |
LFM2-VL-3B — Advanced Vision-Language Model for Precision Reasoning
LFM2-VL-3B is the latest and most capable model in the LFM2-VL lineup, built for fine-grained visual reasoning and multilingual multimodal comprehension (supports up to 10 languages). It combines a 2.6B-parameter text tower with a large SigLIP2 NaFlex vision encoder (400M), achieving near state-of-the-art results among compact open-weight VLMs. Despite its scale, it retains impressive inference efficiency, dynamic image token allocation, and flexible speed-quality tuning. LFM2-VL-3B is ideal for research, detailed visual understanding, multi-object recognition, and captioning complex scenes where precision and depth matter most.
Performance
| Model | Average | MMStar | RealWorldQA | MM-IFEval | BLINK | MMBench (dev en) | OCRBench | POPE |
|---|---|---|---|---|---|---|---|---|
| InternVL3_5-2B | 66.50 | 57.67 | 60.78 | 47.31 | 50.97 | 78.18 | 834.00 | 87.17 |
| Qwen2.5-VL-3B | 65.42 | 56.13 | 65.23 | 38.62 | 48.97 | 80.41 | 824.00 | 86.17 |
| InternVL3-2B | 67.44 | 61.10 | 65.10 | 38.49 | 53.10 | 81.10 | 831.00 | 90.10 |
| SmolVLM2-2.2B | 56.01 | 46.00 | 57.50 | 19.42 | 42.30 | 69.24 | 725.00 | 85.10 |
| LFM2-VL-3B | 69.00 | 57.73 | 71.37 | 51.83 | 51.03 | 79.81 | 822.00 | 89.01 |
GPU Configuration Table
| Model | Parameters (Total) | Vision Encoder | Recommended GPU | Min VRAM (GB) | Recommended VRAM (GB) | Precision | Context Length (Text) | When to Use |
|---|---|---|---|---|---|---|---|---|
| LFM2-VL-450M | ~0.45B (350M LM + 86M Vision) | SigLIP2 NaFlex Base | T4 / L4 / A10 | 6–8 GB | 12–16 GB | FP16 / BF16 | 32,768 tokens | For lightweight, real-time multimodal tasks on edge/cloud GPUs |
| LFM2-VL-1.6B | ~1.6B (1.2B LM + 400M Vision) | SigLIP2 NaFlex Shape-Optimized | A10 / L40S / RTX 4090 | 12–16 GB | 20–24 GB | BF16 preferred | 32,768 tokens | For balanced multimodal reasoning and visual QA |
| LFM2-VL-3B | ~3.0B (2.6B LM + 400M Vision) | SigLIP2 NaFlex Large | A100 / H100 | 24 GB (min) | 40–80 GB | BF16 / FP16 | 32,768 tokens | For fine-grained, multilingual, and research-grade image-text reasoning |
Notes
- All models natively support up to 512×512 px images with automatic patch-splitting for larger resolutions.
- Use `bfloat16` on Ampere or newer GPUs for best throughput and stable precision.
- For low-VRAM setups, resize inputs to ≤512 px and limit `max_new_tokens` (e.g., 64–96); see the sketch below.
- All three support Hugging Face `transformers` ≥ v4.57, with LFM2-VL-3B requiring a specific source commit for compatibility.
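Taken together, these notes boil down to a dtype choice at load time plus conservative generation settings. The snippet below is only a minimal sketch of that pattern (the full scripts later in this guide follow the same approach):
import torch
# bfloat16 on Ampere or newer GPUs; fall back to float16 on older cards such as the T4
dtype = torch.bfloat16 if (torch.cuda.is_available() and torch.cuda.is_bf16_supported()) else torch.float16
# Low-VRAM friendly generation settings: short outputs, deterministic decoding
gen_kwargs = dict(max_new_tokens=96, do_sample=False)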
Resources
Link 1: https://huggingface.co/LiquidAI/LFM2-VL-450M
Link 2: https://huggingface.co/LiquidAI/LFM2-VL-1.6B
Link 3: https://huggingface.co/LiquidAI/LFM2-VL-3B
Step-by-Step Process to Install & Run LiquidAI LFM2-VL Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running LiquidAI LFM2-VL, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based models like LiquidAI LFM2-VL.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like LiquidAI LFM2-VL.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the LiquidAI LFM2-VL runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, If you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10 by default, so we upgrade it)
Run `python3 --version` to check the Python version available on the system. The VM comes with Python 3.10.12 by default, so to install a higher version of Python you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --upgrade "torch>=2.3" "torchvision" --index-url https://download.pytorch.org/whl/cu121
Step 11: Install Core Libs
Run the following command to install core libs:
pip install --upgrade pillow accelerate safetensors einops
Step 12: Install Transformers
Run the following command to install transformers:
pip install --upgrade "transformers>=4.57" huggingface_hub
Step 13: Quick Smoke Test (GPU + BF16 Support)
python - <<'PY'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
print("GPU:", torch.cuda.get_device_name(0))
print("BF16 supported:", torch.cuda.is_bf16_supported())
PY
- BF16 supported = True → we’ll use `bfloat16`.
- False (e.g., T4) → use `float16` instead.
Step 14: Connect to Your GPU VM with a Code Editor
Before you start running model script with the LFM2-VL models, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 15: Create the Script
Create a file (e.g., run_lfm2vl.py) and add the following code:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from PIL import ImageOps

MODEL_ID = "LiquidAI/LFM2-VL-450M"

# dtype selection
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
print(f"Loading {MODEL_ID} with dtype={dtype} ...")

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    dtype=dtype,  # use dtype (no deprecation warning)
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Load image and pre-resize to reduce image tokens (optional but speeds up)
img_url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(img_url)
# Keep aspect ratio, cap long side at 512
image = ImageOps.contain(image, (512, 512))

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image? Keep it under 1 sentence."},
        ],
    },
]

gen_kwargs = dict(
    max_new_tokens=64,
    do_sample=False,  # deterministic; set True + temperature for sampling
    repetition_penalty=1.05,
)

# Build inputs from chat template
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

with torch.autocast("cuda", dtype=dtype):
    outputs = model.generate(**inputs, **gen_kwargs)

text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("\n=== MODEL OUTPUT ===")
print(text)
What This Script Does
- Loads the LFM2-VL-450M model and processor on GPU with bf16 (or fp16 fallback).
- Downloads the stop-sign image and resizes it to ≤512 px to reduce vision tokens.
- Builds a ChatML-style conversation (user = image + question) via `apply_chat_template`.
- Runs deterministic generation (`do_sample=False`, `max_new_tokens=64`, `repetition_penalty=1.05`) under CUDA autocast.
- Decodes and prints the model’s one-sentence description of the image.
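One thing to keep in mind: decoding the full `outputs` tensor also returns the chat-template text that preceded the answer. If you prefer to print only the newly generated tokens, a common pattern (a sketch, assuming `inputs` contains `input_ids` as in the script above) is to slice them off before decoding:
new_tokens = outputs[:, inputs["input_ids"].shape[-1]:]  # drop the prompt portion
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])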
Step 16: Run the Script
Run the script with the following command:
python run_lfm2vl.py
This will load the model and generate the response in the terminal.
Step 17: Install Gradio
Run the following command to install gradio:
pip install gradio
Step 18: Tiny Gradio UI (Drag-and-Drop Images)
Up to Step 16, we interacted with the model purely through the terminal, sending text and image prompts via Python scripts and reading the generated responses in the console. Now we move to a more user-friendly experience by building a tiny Gradio interface, which lets us interact with the model visually: drag and drop images, type questions, adjust sliders for generation parameters, and instantly see the model’s answers in a web UI instead of the command line. Create a file (e.g., app.py) and add the following code:
import torch, gradio as gr
from PIL import ImageOps
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "LiquidAI/LFM2-VL-450M"

# dtype selection
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
DTYPE = torch.bfloat16 if use_bf16 else torch.float16

# Load model/processor
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", dtype=DTYPE  # use dtype (no deprecation warning)
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

SYSTEM_PROMPT = "You are a helpful multimodal assistant by Liquid AI."

def preprocess_image(img, cap_long_side=True):
    if img is None:
        return None
    if cap_long_side:
        # Keep aspect ratio; cap long side at 512 to reduce vision tokens
        img = ImageOps.contain(img, (512, 512))
    return img

def infer(image, question, max_new_tokens, temp, cap_long_side):
    image = preprocess_image(image, cap_long_side=cap_long_side)
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question or "Describe this image."},
            ],
        },
    ]
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        tokenize=True,
    ).to(model.device)

    # Build generation kwargs (don't pass vision knobs to generate)
    gen_kwargs = {
        "max_new_tokens": int(max_new_tokens),
        "repetition_penalty": 1.05,
    }
    if float(temp) > 0:
        gen_kwargs.update({"do_sample": True, "temperature": float(temp)})
    else:
        gen_kwargs.update({"do_sample": False})

    with torch.autocast("cuda", dtype=DTYPE):
        outputs = model.generate(**inputs, **gen_kwargs)
    text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return text

demo = gr.Interface(
    fn=infer,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(label="Question", value="Describe this image."),
        gr.Slider(8, 512, value=96, step=1, label="Max new tokens"),
        gr.Slider(0.0, 1.0, value=0.0, step=0.05, label="Temperature"),
        gr.Checkbox(value=True, label="Fast resize to 512px (speed-up)"),
    ],
    outputs=gr.Textbox(label="Answer"),
    title="LFM2-VL-450M (Liquid AI)",
    description="Lightweight VLM • Uses chat template • Resize toggle to control vision token load.",
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
What This Script Does
- Loads LFM2-VL-450M on GPU with bf16 (or fp16 fallback) and its processor.
- Optionally resizes images to ≤512px (toggleable) to cut vision tokens and speed up inference.
- Builds a ChatML-style conversation (system + user with image + question) via `apply_chat_template`.
- Generates an answer with controllable max_new_tokens and temperature (deterministic when temp=0).
- Serves a Gradio UI (image, question, sliders, checkbox) and displays the model’s text Answer box.
Step 19: Launch the Gradio App
Run Gradio:
python app.py
Step 20: Access the Gradio App
Access the Gradio app at:
http://0.0.0.0:7860/
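If your browser can’t reach the VM’s IP on port 7860 directly (for example, when only SSH access is open), a common workaround is to tunnel the port over SSH and then open http://localhost:7860 locally. The username and host below are placeholders for your own VM details:
ssh -L 7860:localhost:7860 your_user@your_vm_ip
Alternatively, Gradio’s `share=True` option in `demo.launch()` can generate a temporary public link if outbound access is allowed.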
Play with Model
Up to this point, we’ve successfully installed and run the LFM2-VL-450M model — the smallest and most lightweight version of the LFM2-VL family, perfect for testing and quick image-to-text interactions. Now, we’ll move ahead to explore the more powerful variants — LFM2-VL-1.6B and LFM2-VL-3B — running them one by one to experience their enhanced visual reasoning, accuracy, and multilingual capabilities, while following a similar setup and inference process.
Step 21: Write Script for LFM2-VL-1.6B Version
Create a file (e.g., run_lfm2vl16b.py) and add the following code:
# save as run_lfm2vl16b.py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from PIL import ImageOps

MODEL_ID = "LiquidAI/LFM2-VL-1.6B"

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
print(f"Loading {MODEL_ID} with dtype={dtype}")

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", dtype=dtype
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

img_url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(img_url)
image = ImageOps.contain(image, (512, 512))

conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in one line."}
    ]}
]

inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True,
    return_tensors="pt", return_dict=True, tokenize=True
).to(model.device)

gen_kwargs = dict(max_new_tokens=64, do_sample=False, repetition_penalty=1.05)

with torch.autocast("cuda", dtype=dtype):
    outputs = model.generate(**inputs, **gen_kwargs)

text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("\n=== MODEL OUTPUT ===\n", text)
What This Script Does
- Loads the LFM2-VL-1.6B model and its processor on GPU using bfloat16 or float16 precision.
- Downloads and resizes the sample stop-sign image to 512 px to optimize performance.
- Builds a ChatML-style conversation combining the image and a text prompt.
- Runs text generation deterministically (`do_sample=False`) with up to 64 new tokens.
- Decodes and prints the model’s one-line description of the image in the terminal.
Step 22: Run the Script
Run the script with the following command:
python run_lfm2vl16b.py
This will load the model and generate the response in the terminal.
Step 23: Write Script for LFM2-VL-3B Version
Create a file (e.g., run_lfm2vl3b.py) and add the following code:
import torch
from PIL import ImageOps
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

MODEL_ID = "LiquidAI/LFM2-VL-3B"

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
print(f"Loading {MODEL_ID} with dtype={dtype} ...")

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    dtype=dtype,  # use dtype (not torch_dtype)
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Sample image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(url)
# Keep aspect ratio; cap long side at 512 to control vision tokens
image = ImageOps.contain(image, (512, 512))

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in one concise sentence."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

gen_kwargs = dict(max_new_tokens=96, do_sample=False, repetition_penalty=1.05)

with torch.autocast("cuda", dtype=dtype):
    out = model.generate(**inputs, **gen_kwargs)

print("\n=== OUTPUT ===")
print(processor.batch_decode(out, skip_special_tokens=True)[0])
What This Script Does
- Loads the LFM2-VL-3B multimodal model and its processor on GPU using bfloat16 or float16 precision.
- Downloads a sample street-scene image, resizing it to 512 px on the long side to limit vision tokens.
- Constructs a ChatML-style conversation containing both the image and a concise text prompt.
- Runs deterministic text generation (`do_sample=False`, `max_new_tokens=96`) to produce the model’s reply.
- Decodes and prints the generated one-sentence image description in the terminal output.
Step 24: Run the Script
Run the script with the following command:
python run_lfm2vl3b.py
This will load the model and generate the response in the terminal.
Conclusion
You’ve gone end-to-end—from provisioning a GPU VM on NodeShift to installing CUDA-aligned PyTorch, setting up a clean Python 3.11 env, and running LiquidAI’s LFM2-VL models at three scales (450M, 1.6B, and 3B). You validated terminal inference, then upgraded the experience with a lightweight Gradio UI for drag-and-drop image queries. With this foundation, you can tune speed/quality via precision and token limits, swap GPUs based on budget and latency needs, and confidently scale from quick prototyping to production-grade multimodal apps. Next steps: wire in your own images/datasets, add LoRA fine-tuning for your domain, and wrap the app with basic auth/logging to ship it safely.