LFM2-2.6B by Liquid AI is a next-generation hybrid model designed for edge AI and on-device deployment. With 2.6B parameters, it combines multiplicative gates and short convolutions for high efficiency, speed, and quality. The model supports eight major languages and introduces dynamic hybrid reasoning for complex or multilingual prompts. It runs smoothly across CPU, GPU, and NPU, making it flexible for use on smartphones, laptops, or vehicles. Optimized for tasks like data extraction, RAG, creative writing, and conversational agents, LFM2-2.6B delivers competitive performance while remaining lightweight and resource-efficient.
Model Details
Property | LFM2-350M | LFM2-700M | LFM2-1.2B | LFM2-2.6B |
---|---|---|---|---|
Parameters | 354,483,968 | 742,489,344 | 1,170,340,608 | 2,569,272,320 |
Layers | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 30 (22 conv + 8 attn) |
Context length | 32,768 tokens | 32,768 tokens | 32,768 tokens | 32,768 tokens |
Vocabulary size | 65,536 | 65,536 | 65,536 | 65,536 |
Precision | bfloat16 | bfloat16 | bfloat16 | bfloat16 |
Training budget | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens |
License | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 |
Performance
Model | MMLU | GPQA | IFEval | IFBench | GSM8K | MGSM | MMMLU |
---|---|---|---|---|---|---|---|
LFM2-2.6B | 64.42 | 26.57 | 79.56 | 22.19 | 82.41 | 74.32 | 55.39 |
Llama-3.2-3B-Instruct | 60.35 | 30.6 | 71.43 | 20.78 | 75.21 | 61.68 | 47.92 |
SmolLM3-3B | 59.84 | 26.31 | 72.44 | 17.93 | 81.12 | 68.72 | 50.02 |
gemma-3-4b-it | 58.35 | 29.51 | 76.85 | 23.53 | 89.92 | 87.28 | 50.14 |
Qwen3-4B-Instruct-2507 | 72.25 | 34.85 | 85.62 | 30.28 | 68.46 | 81.76 | 60.67 |
GPU Configuration (Inference Rule-of-Thumb)
Target use | Precision / Quant | Min VRAM (works) | Comfortable VRAM | Example GPUs (min → comfy) | Notes / Tips |
---|---|---|---|---|---|
Transformers, local testing | BF16/FP16 | 6–7 GB | 8–10 GB | RTX 2060 6GB (tight), RTX 3050/3060 8–12GB | Weights ≈ 2.6B params × 2 bytes ≈ 5.2 GB; leave headroom for KV cache/activations. Keep max_new_tokens modest. |
Transformers (FlashAttention 2) | BF16 + FA2 | 7–8 GB | 8–12 GB | RTX 3060 12GB, RTX 4060/4070, T4 16GB | Enable with attn_implementation="flash_attention_2" on supported GPUs for speed + a bit more mem. |
Quantized (4-bit) | Int4 / Q4 (bnb/AWQ) | 3–4 GB | 4–6 GB | GTX 1650 4GB (tight), RTX 3050/2060 | Great for laptops; slight quality drop. Use load_in_4bit=True, bnb_4bit_quant_type="nf4". |
Quantized (8-bit) | Int8 | 4–5 GB | 6–8 GB | RTX 3050/2060/3060 | Good speed/quality balance on low-VRAM cards. |
vLLM single-GPU serving | BF16 | 10–12 GB | 16–24 GB | RTX 3060 12GB → L4 24GB / A10 24GB | Paged KV cache improves throughput; memory scales with concurrent tokens. Set --max-model-len sanely. |
Throughput (small batches) | BF16 | 12–16 GB | 20–24 GB | T4 16GB, L4 24GB, A10 24GB | For small batch or longer outputs on a single card. Pin memory, use tensor parallel=1. |
Latency-focused | BF16 | 16–24 GB | 24–40 GB | L4 24GB, A5000 24GB, A100 40GB | Headroom reduces GC stalls; helpful for 32k contexts (KV grows ~linearly with tokens). |
llama.cpp (GGUF) | Q4_K*_GGUF | 2–3 GB | 3–4 GB | iGPU/low-end dGPU | Ultra-light; use for CPU/offload or tiny dGPUs. Slightly slower than PyTorch on GPU. |
Edge / NPU (Android/Apple) | Int4/Int8 (delegate) | N/A GPU | N/A | (NPU/ANE) | Feasible with vendor delegates; prefer short prompts/outputs. Quality ~ 4-bit PyTorch. |
Quick Guidance
- Sweet spot: RTX 3060 12GB (or T4 16GB/L4 24GB) runs BF16 comfortably; use FA2 if supported.
- Tight VRAM? Go 4-bit; you’ll fit in 4–6GB with minor quality loss (see the 4-bit loading sketch after this list).
- Long context (up to 32k): KV cache dominates memory. Reduce max_model_len, max_new_tokens, or use vLLM to manage KV efficiently.
- Suggested defaults: temperature=0.3, min_p=0.15, repetition_penalty=1.05.
- Transformers snippet: set torch_dtype="bfloat16", optionally attn_implementation="flash_attention_2" on Ampere or Ada Lovelace GPUs with FA2 wheels installed.
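For the 4-bit route above, here is a minimal sketch using bitsandbytes through Transformers. It assumes bitsandbytes and accelerate are installed; the NF4 quant type mirrors the table's suggestion, and the BF16 compute dtype is an assumption you can tune (use float16 on older GPUs).

# Hedged sketch: load LFM2-2.6B in 4-bit NF4 via bitsandbytes (assumes: pip install bitsandbytes accelerate)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "LiquidAI/LFM2-2.6B"
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # as suggested in the table above
    bnb_4bit_compute_dtype=torch.bfloat16,   # assumption: BF16 compute; swap for float16 on older cards
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_cfg, device_map="auto"
)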
Resources
Link: https://huggingface.co/LiquidAI/LFM2-2.6B
Step-by-Step Process to Install & Run LFM2-2.6B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running LFM2-2.6B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like LFM2-2.6B.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like LFM2-2.6B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that LFM2-2.6B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Base System Packages (Ubuntu)
Install the essentials you’ll need for LFM2-2.6B: Python 3.10 venv/pip, Git + LFS, FFmpeg, OpenGL libs, and build tools.
Run the following commands to install base system packages:
sudo apt update
sudo apt install -y python3.10-venv python3-pip git git-lfs ffmpeg libgl1 libglib2.0-0 build-essential
git lfs install
Step 9: Create & Activate a Python Virtual Environment
Isolate everything for LFM2-2.6B in its own venv, then upgrade the basic build tools.
Run the following commands to create and activate a Python virtual environment:
python3.10 -m venv ~/lfm
source ~/lfm/bin/activate
python -m pip install -U pip wheel setuptools
(Option A) Run with Transformers (Quick Functional Test)
Step 10: Install the Utilities
Run the following command to install utilities:
pip install -U "transformers>=4.55" accelerate
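Optionally, confirm that PyTorch (pulled in as a dependency of accelerate) can see the GPU before running the model; a quick check, assuming the install above completed cleanly:

python -c "import torch, transformers; print(transformers.__version__, torch.cuda.is_available())"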
Step 11: Quick Sanity Test with Transformers (One-Shot Script)
python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-2.6B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

prompt = "What is C. elegans?"
ids = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(
    ids, do_sample=True, temperature=0.3, max_new_tokens=256,
    repetition_penalty=1.05
)
print(tok.decode(out[0], skip_special_tokens=False))
PY
What This Script Does
- Run a one-shot Python heredoc: python - <<'PY' ... PY executes the whole snippet inline from your shell—no file needed.
- Load tokenizer & model: pulls LiquidAI/LFM2-2.6B, sets BF16 and device_map="auto" so Accelerate places weights on your GPU/CPU automatically.
- Build a chat-formatted prompt: apply_chat_template(...) wraps your user text in LFM2’s ChatML style and moves tensors to the model’s device.
- Generate a reply: calls model.generate(...) with temperature=0.3, repetition_penalty=1.05, and max_new_tokens=256 for a short, clean response.
- Decode the output: tok.decode(..., skip_special_tokens=False) prints the raw ChatML blocks (use True if you want clean text only).
- (Note) API tweak: torch_dtype is deprecated—use dtype=torch.bfloat16 in future scripts to silence the warning (see the adjusted call below).
(Option B) Serve with vLLM (Fast & Scalable)
Step 12: Install vLLM
Run the following command to install vLLM:
pip install "vllm==0.10.2" --extra-index-url https://wheels.vllm.ai/0.10.2/
Step 13: Start the vLLM Server
Run the following command to start the vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model LiquidAI/LFM2-2.6B \
--dtype bfloat16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--port 8000
What This Command Does
- Starts a local vLLM server that mimics the OpenAI API (e.g., /v1/chat/completions) for the LiquidAI/LFM2-2.6B model.
- Loads the model in bfloat16 (--dtype bfloat16) for faster, lower-memory inference on modern GPUs.
- Limits the maximum context length to 4096 tokens (--max-model-len 4096), which controls KV-cache size/VRAM use.
- Lets vLLM use up to 90% of your GPU memory (--gpu-memory-utilization 0.90) to reduce OOM risk while maximizing throughput.
- Listens on port 8000 (--port 8000), so you can send requests via HTTP (e.g., curl or SDKs) to http://127.0.0.1:8000; an example request follows below.
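As an example of the "curl or SDKs" option, here is a minimal request against the server started above. The prompt text is just a placeholder, and the sampling fields mirror the suggested defaults (min_p and repetition_penalty can also be passed, as the agent script later does):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-2.6B",
        "messages": [{"role": "user", "content": "Give me a one-line summary of LFM2."}],
        "temperature": 0.3,
        "max_tokens": 128
      }'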
Step 14: Create an Agent File
Create a file (e.g., agent_lfm2.py) and add the following code:
# agent_lfm2.py
# Minimal, safe(ish) tool-use agent for LiquidAI/LFM2-2.6B
# Supports: Transformers (local) or vLLM OpenAI server

import json, re, time, ast
from typing import Any, Dict, List

USE_VLLM = True  # True: talk to the vLLM server on :8000; False: load the model locally with Transformers

# ----------------------------
# 1) Define your tools here
# ----------------------------
def get_time(zone: str = "Asia/Kolkata") -> str:
    # tiny demo tool
    return time.strftime("%Y-%m-%d %H:%M:%S") + f" ({zone})"

def add(a: float, b: float) -> float:
    return float(a) + float(b)

def search_docs(query: str) -> str:
    # stub for RAG; replace with your actual retrieval pipeline
    # e.g., use Qwen3-Embedding + FAISS/Chroma and return top snippets
    return f"[stubbed RAG] Top hits for '{query}':\n1) Doc A…\n2) Doc B…"

TOOL_REGISTRY = {
    "get_time": {"fn": get_time, "sig": {"type": "object", "properties": {"zone": {"type": "string"}}}, "desc": "Current time in a timezone"},
    "add": {"fn": add, "sig": {"type": "object", "properties": {"a": {"type": "number"}, "b": {"type": "number"}}, "required": ["a", "b"]}, "desc": "Add two numbers"},
    "search_docs": {"fn": search_docs, "sig": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}, "desc": "Search internal docs (RAG)"},
}

def tool_list_json() -> str:
    lst = []
    for name, meta in TOOL_REGISTRY.items():
        lst.append({
            "name": name,
            "description": meta["desc"],
            "parameters": meta["sig"],
        })
    return json.dumps(lst, ensure_ascii=False)

# ----------------------------
# 2) Model I/O helpers
# ----------------------------
SYSTEM_PROMPT = f"""You are a helpful assistant with tool-use.
List of tools: <|tool_list_start|>{tool_list_json()}<|tool_list_end|>
Guidelines:
- If a tool is helpful, emit a tool call as a Python list between <|tool_call_start|> and <|tool_call_end|>,
  e.g. [add(a=2,b=3)] or [search_docs(query="vector db")].
- If multiple steps are needed, call tools in sequence (one call per turn).
- After a tool result is returned (in <|tool_response_start|>...<|tool_response_end|>), use it to answer the user.
- When you are done, reply to the user normally (no further tool calls).
"""

# Chat template for LFM2 (ChatML-like)
def apply_chat_template(messages: List[Dict[str, str]]) -> str:
    # Simple template good enough for local testing; Transformers has .apply_chat_template() too.
    def block(role, content):
        return f"<|im_start|>{role}\n{content}<|im_end|>\n"
    s = "<|startoftext|>"
    for m in messages:
        s += block(m["role"], m["content"])
    return s

# ----------------------------
# 3) Backends (Transformers / vLLM)
# ----------------------------
if USE_VLLM:
    import requests

    VLLM_URL = "http://127.0.0.1:8000/v1/chat/completions"

    def generate(messages: List[Dict[str, str]], max_new_tokens=512):
        payload = {
            "model": "LiquidAI/LFM2-2.6B",
            "messages": messages,
            "temperature": 0.3,
            "min_p": 0.15,
            "repetition_penalty": 1.05,
            "max_tokens": max_new_tokens,
        }
        r = requests.post(VLLM_URL, json=payload, timeout=120)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]
else:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    MODEL_ID = "LiquidAI/LFM2-2.6B"
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, dtype=torch.bfloat16, device_map="auto"
    )

    def generate(messages: List[Dict[str, str]], max_new_tokens=512):
        # End the prompt with an open assistant turn so generation continues from there.
        prompt = apply_chat_template(messages) + "<|im_start|>assistant\n"
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **ids, do_sample=True, temperature=0.3, max_new_tokens=max_new_tokens,
            repetition_penalty=1.05
        )
        text = tok.decode(out[0], skip_special_tokens=False)
        # Return only the last assistant block
        m = re.findall(r"<\|im_start\|>assistant\n(.*?)(?=<\|im_end\|>)", text, flags=re.S)
        return m[-1].strip() if m else text

# ----------------------------
# 4) Tool-call parsing & execution
# ----------------------------
TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.S)

def parse_tool_call(content: str) -> List[Dict[str, Any]]:
    """
    The model emits a Python list like:
      [add(a=2,b=3)]
      [search_docs(query="hello"), get_time(zone="UTC")]
    We parse each entry into {"name": str, "kwargs": dict}
    """
    calls = []
    m = TOOL_CALL_RE.search(content)
    if not m:
        return calls
    raw = m.group(1).strip()
    # Split the list into individual calls, then parse each call's kwargs
    items = [x.strip() for x in raw.strip("[]").split("),") if x.strip()]
    for item in items:
        if not item.endswith(")"):
            item += ")"
        name = item.split("(", 1)[0].strip()
        argstr = item[item.find("(") + 1:-1].strip()
        kwargs = {}
        if argstr:
            # convert the kwargs string to a dict via ast.literal_eval
            # turn a=2,b=3 -> {"a":2,"b":3}
            pairs = []
            for kv in argstr.split(","):
                if not kv.strip():
                    continue
                k, v = kv.split("=", 1)
                pairs.append(f'"{k.strip()}":{v.strip()}')
            dict_src = "{" + ",".join(pairs) + "}"
            kwargs = ast.literal_eval(dict_src)
        calls.append({"name": name, "kwargs": kwargs})
    return calls

def execute_tools(calls: List[Dict[str, Any]]) -> str:
    results = []
    for c in calls:
        name = c["name"]
        if name not in TOOL_REGISTRY:
            results.append({"tool": name, "ok": False, "error": "Unknown tool"})
            continue
        try:
            fn = TOOL_REGISTRY[name]["fn"]
            out = fn(**c["kwargs"])
            results.append({"tool": name, "ok": True, "result": out})
        except Exception as e:
            results.append({"tool": name, "ok": False, "error": repr(e)})
    return json.dumps(results, ensure_ascii=False)

# ----------------------------
# 5) Agent loop
# ----------------------------
def run_agent(user_message: str, max_turns=4):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_turns):
        assistant = generate(messages)
        messages.append({"role": "assistant", "content": assistant})
        calls = parse_tool_call(assistant)
        if not calls:
            # no tool call → final answer
            break
        # execute tools and append tool response
        tool_json = execute_tools(calls)
        tool_block = f"<|tool_response_start|>{tool_json}<|tool_response_end|>"
        messages.append({"role": "tool", "content": tool_block})
    # Return the last assistant message (final)
    return messages[-1]["content"]

if __name__ == "__main__":
    # Try a multi-step task that forces tool use
    query = "What time is it for me in IST, then add 7+35, and summarize both."
    print(run_agent(query))
What This Script Does
- Registers tools (get_time, add, search_docs) in a whitelist, builds a JSON tool list, and advertises it to the model via special LFM2 tokens inside the system prompt.
- Talks to the model through vLLM’s OpenAI API (because USE_VLLM = True), or falls back to Transformers locally if set to False.
- Generates replies: sends chat messages; the model may emit a tool call between <|tool_call_start|> and <|tool_call_end|>, like [add(a=2,b=3)].
- Parses tool calls with a regex + safe arg parsing, executes only allowed tools from the registry, and wraps results in <|tool_response_start|>…<|tool_response_end|> for the model to read.
- Loops for a few turns (up to max_turns): model → optional tool call → tool execution → model continues; stops when no tool call is emitted.
- Prints the final answer for the demo query (“IST time + 7+35 + summary”), showing the agentic behavior working end-to-end.
- Extras: includes a minimal ChatML-style template (used only in the Transformers path) and a stubbed RAG tool you can later replace with a real retriever (a hedged sketch follows below).
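To replace the stubbed search_docs with a real retriever, one option is FAISS plus an embedding model loaded through sentence-transformers, in the spirit of the "Qwen3-Embedding + FAISS/Chroma" comment in the script. This is only a sketch: it assumes pip install sentence-transformers faiss-cpu, the Qwen/Qwen3-Embedding-0.6B checkpoint name is an assumption you can swap for any embedder, and DOCS is a placeholder corpus.

# Hedged sketch of a drop-in replacement for search_docs()
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

DOCS = [
    "Doc A: LFM2 models target edge and on-device deployment.",
    "Doc B: vLLM exposes an OpenAI-compatible API on port 8000.",
]  # placeholder corpus; load your own snippets here

_embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")   # assumed embedding model
_doc_vecs = _embedder.encode(DOCS, normalize_embeddings=True)
_index = faiss.IndexFlatIP(_doc_vecs.shape[1])                 # cosine similarity via inner product
_index.add(np.asarray(_doc_vecs, dtype="float32"))

def search_docs(query: str, k: int = 2) -> str:
    """Return the top-k document snippets for the query."""
    q = _embedder.encode([query], normalize_embeddings=True)
    _, idx = _index.search(np.asarray(q, dtype="float32"), k)
    hits = [f"{rank + 1}) {DOCS[i]}" for rank, i in enumerate(idx[0])]
    return f"Top hits for '{query}':\n" + "\n".join(hits)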
Step 15: Run the Agent
Run the agent to generate a response in the terminal.
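With the vLLM server from Step 13 still running (the script has USE_VLLM = True), execute the agent file you created:

python agent_lfm2.py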
Conclusion
LFM2-2.6B makes “edge-ready” truly practical: a compact, multilingual model that’s fast, memory-savvy, and easy to stand up on a single GPU. In this guide you spun it up two ways—a quick sanity check with Transformers and a production-style vLLM server—then leveled up to a tool-using agent that can call functions and plug into RAG. With sensible VRAM targets and defaults (BF16, 4–8K context), it runs smoothly on mainstream cards while staying flexible for laptops or heavier servers. Your next steps: swap the stubbed RAG for your FAISS/Qwen embeddings, add more domain tools, and wrap the agent in FastAPI for a clean HTTP service. Ship it, measure latency/throughput, and iterate—this stack is ready for real workflows.
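As a starting point for that FastAPI wrapper, here is a minimal sketch. It assumes pip install fastapi uvicorn, that agent_lfm2.py from Step 14 is on the import path, and the app.py filename and /agent route are just example names.

# app.py - hedged sketch of an HTTP wrapper around run_agent()
from fastapi import FastAPI
from pydantic import BaseModel
from agent_lfm2 import run_agent  # the agent file created in Step 14

app = FastAPI()

class Query(BaseModel):
    message: str

@app.post("/agent")
def agent_endpoint(q: Query):
    # Delegates to the agent loop and returns its final answer as JSON
    return {"answer": run_agent(q.message)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8080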