LFM2-2.6B by Liquid AI is a next-generation hybrid model designed for edge AI and on-device deployment. With 2.6B parameters, it combines multiplicative gates and short convolutions for high efficiency, speed, and quality. The model supports eight major languages and introduces dynamic hybrid reasoning for complex or multilingual prompts. It runs smoothly across CPU, GPU, and NPU, making it flexible for use on smartphones, laptops, or vehicles. Optimized for tasks like data extraction, RAG, creative writing, and conversational agents, LFM2-2.6B delivers competitive performance while remaining lightweight and resource-efficient.
Model Details
Property | LFM2-350M | LFM2-700M | LFM2-1.2B | LFM2-2.6B |
---|---|---|---|---|
Parameters | 354,483,968 | 742,489,344 | 1,170,340,608 | 2,569,272,320 |
Layers | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 30 (22 conv + 8 attn) |
Context length | 32,768 tokens | 32,768 tokens | 32,768 tokens | 32,768 tokens |
Vocabulary size | 65,536 | 65,536 | 65,536 | 65,536 |
Precision | bfloat16 | bfloat16 | bfloat16 | bfloat16 |
Training budget | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens |
License | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 |
Performance
Model | MMLU | GPQA | IFEval | IFBench | GSM8K | MGSM | MMMLU |
---|---|---|---|---|---|---|---|
LFM2-2.6B | 64.42 | 26.57 | 79.56 | 22.19 | 82.41 | 74.32 | 55.39 |
Llama-3.2-3B-Instruct | 60.35 | 30.6 | 71.43 | 20.78 | 75.21 | 61.68 | 47.92 |
SmolLM3-3B | 59.84 | 26.31 | 72.44 | 17.93 | 81.12 | 68.72 | 50.02 |
gemma-3-4b-it | 58.35 | 29.51 | 76.85 | 23.53 | 89.92 | 87.28 | 50.14 |
Qwen3-4B-Instruct-2507 | 72.25 | 34.85 | 85.62 | 30.28 | 68.46 | 81.76 | 60.67 |
GPU Configuration (Inference Rule-of-Thumb)
Target use | Precision / Quant | Min VRAM (works) | Comfortable VRAM | Example GPUs (min → comfy) | Notes / Tips |
---|---|---|---|---|---|
Transformers, local testing | BF16/FP16 | 6–7 GB | 8–10 GB | RTX 2060 6GB (tight), RTX 3050/3060 8–12GB | Weights ≈ 2.6B params × 2 bytes ≈ 5.2 GB; leave headroom for KV cache/activations. Keep max_new_tokens modest. |
Transformers (FlashAttention 2) | BF16 + FA2 | 7–8 GB | 8–12 GB | RTX 3060 12GB, RTX 4060/4070, T4 16GB | Enable with attn_implementation="flash_attention_2" on supported GPUs for speed + a bit more mem. |
Quantized (4-bit) | Int4 / Q4 (bnb/AWQ) | 3–4 GB | 4–6 GB | GTX 1650 4GB (tight), RTX 3050/2060 | Great for laptops; slight quality drop. Use load_in_4bit=True, bnb_4bit_quant_type="nf4". |
Quantized (8-bit) | Int8 | 4–5 GB | 6–8 GB | RTX 3050/2060/3060 | Good speed/quality balance on low-VRAM cards. |
vLLM single-GPU serving | BF16 | 10–12 GB | 16–24 GB | RTX 3060 12GB → L4 24GB / A10 24GB | Paged KV cache improves throughput; memory scales with concurrent tokens. Set --max-model-len sanely. |
Throughput (small batches) | BF16 | 12–16 GB | 20–24 GB | T4 16GB, L4 24GB, A10 24GB | For small batch or longer outputs on a single card. Pin memory, use tensor parallel=1. |
Latency-focused | BF16 | 16–24 GB | 24–40 GB | L4 24GB, A5000 24GB, A100 40GB | Headroom reduces GC stalls; helpful for 32k contexts (KV grows ~linearly with tokens). |
llama.cpp (GGUF) | Q4_K*_GGUF | 2–3 GB | 3–4 GB | iGPU/low-end dGPU | Ultra-light; use for CPU/offload or tiny dGPUs. Slightly slower than PyTorch on GPU. |
Edge / NPU (Android/Apple) | Int4/Int8 (delegate) | N/A GPU | N/A | (NPU/ANE) | Feasible with vendor delegates; prefer short prompts/outputs. Quality ~ 4-bit PyTorch. |
Quick Guidance
- Sweet spot: RTX 3060 12GB (or T4 16GB/L4 24GB) runs BF16 comfortably; use FA2 if supported.
- Tight VRAM? Go 4-bit; you’ll fit in 4–6GB with minor quality loss (see the 4-bit loading sketch after this list).
- Long context (up to 32k): KV cache dominates memory. Reduce max_model_len, max_new_tokens, or use vLLM to manage KV efficiently.
- Suggested defaults: temperature=0.3, min_p=0.15, repetition_penalty=1.05.
- Transformers snippet: set torch_dtype="bfloat16", optionally attn_implementation="flash_attention_2" on Ampere or Ada Lovelace GPUs with FA2 wheels installed.
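For the 4-bit route above, here is a minimal sketch using bitsandbytes through Transformers. It assumes bitsandbytes and accelerate are installed; the NF4 quant type mirrors the table's suggestion, and the BF16 compute dtype is an assumption you can tune (use float16 on older GPUs).

# Hedged sketch: load LFM2-2.6B in 4-bit NF4 via bitsandbytes (assumes: pip install bitsandbytes accelerate)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "LiquidAI/LFM2-2.6B"
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # as suggested in the table above
    bnb_4bit_compute_dtype=torch.bfloat16,   # assumption: BF16 compute; swap for float16 on older cards
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_cfg, device_map="auto"
)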
Resources
Link: https://huggingface.co/LiquidAI/LFM2-2.6B
Step-by-Step Process to Install & Run LFM2-2.6B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running LFM2-2.6B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like LFM2-2.6B.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like LFM2-2.6B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that LFM2-2.6B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Base System Packages (Ubuntu)
Install the essentials you’ll need for LFM2-2.6B: Python 3.10 venv/pip, Git + LFS, FFmpeg, OpenGL libs, and build tools.
Run the following commands to install base system packages:
sudo apt update
sudo apt install -y python3.10-venv python3-pip git git-lfs ffmpeg libgl1 libglib2.0-0 build-essential
git lfs install
Step 9: Create & Activate a Python Virtual Environment
Isolate everything for LFM2-2.6B in its own venv, then upgrade the basic build tools.
Run the following commands to create and activate a Python virtual environment:
python3.10 -m venv ~/lfm
source ~/lfm/bin/activate
python -m pip install -U pip wheel setuptools
(Option A) Run with Transformers (Quick Functional Test)
Step 10: Install the Utilities
Run the following command to install utilities:
pip install -U "transformers>=4.55" accelerate
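Optionally, confirm that PyTorch (pulled in as a dependency of accelerate) can see the GPU before running the model; a quick check, assuming the install above completed cleanly:

python -c "import torch, transformers; print(transformers.__version__, torch.cuda.is_available())"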
Step 11: Quick Sanity Test with Transformers (One-Shot Script)
python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-2.6B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

prompt = "What is C. elegans?"
ids = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(
    ids, do_sample=True, temperature=0.3, max_new_tokens=256,
    repetition_penalty=1.05
)
print(tok.decode(out[0], skip_special_tokens=False))
PY
What This Script Does
- Run a one-shot Python heredoc: python - <<'PY' ... PY executes the whole snippet inline from your shell—no file needed.
- Load tokenizer & model: pulls LiquidAI/LFM2-2.6B, sets BF16 and device_map="auto" so Accelerate places weights on your GPU/CPU automatically.
- Build a chat-formatted prompt: apply_chat_template(...) wraps your user text in LFM2’s ChatML style and moves tensors to the model’s device.
- Generate a reply: calls model.generate(...) with temperature=0.3, repetition_penalty=1.05, and max_new_tokens=256 for a short, clean response.
- Decode the output: tok.decode(..., skip_special_tokens=False) prints the raw ChatML blocks (use True if you want clean text only).
- (Note) API tweak: torch_dtype is deprecated—use dtype=torch.bfloat16 in future scripts to silence the warning (see the adjusted call below).
(Option B) Serve with vLLM (Fast & Scalable)
Step 12: Install vLLM
Run the following command to install vLLM:
pip install "vllm==0.10.2" --extra-index-url https://wheels.vllm.ai/0.10.2/
Step 13: Start the vLLM Server
Run the following command to start the vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model LiquidAI/LFM2-2.6B \
--dtype bfloat16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--port 8000
What This Command Does
- Starts a local vLLM server that mimics the OpenAI API (e.g., /v1/chat/completions) for the LiquidAI/LFM2-2.6B model.
- Loads the model in bfloat16 (--dtype bfloat16) for faster, lower-memory inference on modern GPUs.
- Limits the maximum context length to 4096 tokens (--max-model-len 4096), which controls KV-cache size/VRAM use.
- Lets vLLM use up to 90% of your GPU memory (--gpu-memory-utilization 0.90) to reduce OOM risk while maximizing throughput.
- Listens on port 8000 (--port 8000), so you can send requests via HTTP (e.g., curl or SDKs) to http://127.0.0.1:8000; an example request follows below.
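As an example of the "curl or SDKs" option, here is a minimal request against the server started above. The prompt text is just a placeholder, and the sampling fields mirror the suggested defaults (min_p and repetition_penalty can also be passed, as the agent script later does):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-2.6B",
        "messages": [{"role": "user", "content": "Give me a one-line summary of LFM2."}],
        "temperature": 0.3,
        "max_tokens": 128
      }'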
Step 14: Create an Agent File
Create a file (e.g., agent_lfm2.py) and add the following code:
# agent_lfm2.py
# Minimal, safe(ish) tool-use agent for LiquidAI/LFM2-2.6B
# Supports: Transformers (local) or vLLM OpenAI server

import json, re, time, ast
from typing import Any, Dict, List

USE_VLLM = True  # True: talk to the vLLM server on :8000; False: load the model locally with Transformers

# ----------------------------
# 1) Define your tools here
# ----------------------------
def get_time(zone: str = "Asia/Kolkata") -> str:
    # tiny demo tool
    return time.strftime("%Y-%m-%d %H:%M:%S") + f" ({zone})"

def add(a: float, b: float) -> float:
    return float(a) + float(b)

def search_docs(query: str) -> str:
    # stub for RAG; replace with your actual retrieval pipeline
    # e.g., use Qwen3-Embedding + FAISS/Chroma and return top snippets
    return f"[stubbed RAG] Top hits for '{query}':\n1) Doc A…\n2) Doc B…"

TOOL_REGISTRY = {
    "get_time": {"fn": get_time, "sig": {"type": "object", "properties": {"zone": {"type": "string"}}}, "desc": "Current time in a timezone"},
    "add": {"fn": add, "sig": {"type": "object", "properties": {"a": {"type": "number"}, "b": {"type": "number"}}, "required": ["a", "b"]}, "desc": "Add two numbers"},
    "search_docs": {"fn": search_docs, "sig": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}, "desc": "Search internal docs (RAG)"},
}

def tool_list_json() -> str:
    lst = []
    for name, meta in TOOL_REGISTRY.items():
        lst.append({
            "name": name,
            "description": meta["desc"],
            "parameters": meta["sig"],
        })
    return json.dumps(lst, ensure_ascii=False)

# ----------------------------
# 2) Model I/O helpers
# ----------------------------
SYSTEM_PROMPT = f"""You are a helpful assistant with tool-use.
List of tools: <|tool_list_start|>{tool_list_json()}<|tool_list_end|>
Guidelines:
- If a tool is helpful, emit a tool call as a Python list between <|tool_call_start|> and <|tool_call_end|>,
  e.g. [add(a=2,b=3)] or [search_docs(query="vector db")].
- If multiple steps are needed, call tools in sequence (one call per turn).
- After a tool result is returned (in <|tool_response_start|>...<|tool_response_end|>), use it to answer the user.
- When you are done, reply to the user normally (no further tool calls).
"""

# Chat template for LFM2 (ChatML-like)
def apply_chat_template(messages: List[Dict[str, str]]) -> str:
    # Simple template good enough for local testing; Transformers has .apply_chat_template() too.
    def block(role, content):
        return f"<|im_start|>{role}\n{content}<|im_end|>\n"
    s = "<|startoftext|>"
    for m in messages:
        s += block(m["role"], m["content"])
    return s

# ----------------------------
# 3) Backends (Transformers / vLLM)
# ----------------------------
if USE_VLLM:
    import requests

    VLLM_URL = "http://127.0.0.1:8000/v1/chat/completions"

    def generate(messages: List[Dict[str, str]], max_new_tokens=512):
        payload = {
            "model": "LiquidAI/LFM2-2.6B",
            "messages": messages,
            "temperature": 0.3,
            "min_p": 0.15,
            "repetition_penalty": 1.05,
            "max_tokens": max_new_tokens,
        }
        r = requests.post(VLLM_URL, json=payload, timeout=120)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]
else:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    MODEL_ID = "LiquidAI/LFM2-2.6B"
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, dtype=torch.bfloat16, device_map="auto"
    )

    def generate(messages: List[Dict[str, str]], max_new_tokens=512):
        # End the prompt with an open assistant turn so generation continues from there.
        prompt = apply_chat_template(messages) + "<|im_start|>assistant\n"
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **ids, do_sample=True, temperature=0.3, max_new_tokens=max_new_tokens,
            repetition_penalty=1.05
        )
        text = tok.decode(out[0], skip_special_tokens=False)
        # Return only the last assistant block
        m = re.findall(r"<\|im_start\|>assistant\n(.*?)(?=<\|im_end\|>)", text, flags=re.S)
        return m[-1].strip() if m else text

# ----------------------------
# 4) Tool-call parsing & execution
# ----------------------------
TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.S)

def parse_tool_call(content: str) -> List[Dict[str, Any]]:
    """
    The model emits a Python list like:
      [add(a=2,b=3)]
      [search_docs(query="hello"), get_time(zone="UTC")]
    We parse each entry into {"name": str, "kwargs": dict}
    """
    calls = []
    m = TOOL_CALL_RE.search(content)
    if not m:
        return calls
    raw = m.group(1).strip()
    # Split the list into individual calls, then parse each call's kwargs
    items = [x.strip() for x in raw.strip("[]").split("),") if x.strip()]
    for item in items:
        if not item.endswith(")"):
            item += ")"
        name = item.split("(", 1)[0].strip()
        argstr = item[item.find("(") + 1:-1].strip()
        kwargs = {}
        if argstr:
            # convert the kwargs string to a dict via ast.literal_eval
            # turn a=2,b=3 -> {"a":2,"b":3}
            pairs = []
            for kv in argstr.split(","):
                if not kv.strip():
                    continue
                k, v = kv.split("=", 1)
                pairs.append(f'"{k.strip()}":{v.strip()}')
            dict_src = "{" + ",".join(pairs) + "}"
            kwargs = ast.literal_eval(dict_src)
        calls.append({"name": name, "kwargs": kwargs})
    return calls

def execute_tools(calls: List[Dict[str, Any]]) -> str:
    results = []
    for c in calls:
        name = c["name"]
        if name not in TOOL_REGISTRY:
            results.append({"tool": name, "ok": False, "error": "Unknown tool"})
            continue
        try:
            fn = TOOL_REGISTRY[name]["fn"]
            out = fn(**c["kwargs"])
            results.append({"tool": name, "ok": True, "result": out})
        except Exception as e:
            results.append({"tool": name, "ok": False, "error": repr(e)})
    return json.dumps(results, ensure_ascii=False)

# ----------------------------
# 5) Agent loop
# ----------------------------
def run_agent(user_message: str, max_turns=4):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_turns):
        assistant = generate(messages)
        messages.append({"role": "assistant", "content": assistant})
        calls = parse_tool_call(assistant)
        if not calls:
            # no tool call → final answer
            break
        # execute tools and append tool response
        tool_json = execute_tools(calls)
        tool_block = f"<|tool_response_start|>{tool_json}<|tool_response_end|>"
        messages.append({"role": "tool", "content": tool_block})
    # Return the last assistant message (final)
    return messages[-1]["content"]

if __name__ == "__main__":
    # Try a multi-step task that forces tool use
    query = "What time is it for me in IST, then add 7+35, and summarize both."
    print(run_agent(query))
What This Script Does
- Registers tools (get_time, add, search_docs) in a whitelist, builds a JSON tool list, and advertises it to the model via special LFM2 tokens inside the system prompt.
- Talks to the model through vLLM’s OpenAI API (because USE_VLLM = True), or falls back to Transformers locally if set to False.
- Generates replies: sends chat messages; the model may emit a tool call between <|tool_call_start|> and <|tool_call_end|>, like [add(a=2,b=3)].
- Parses tool calls with a regex + safe arg parsing, executes only allowed tools from the registry, and wraps results in <|tool_response_start|>…<|tool_response_end|> for the model to read.
- Loops for a few turns (up to max_turns): model → optional tool call → tool execution → model continues; stops when no tool call is emitted.
- Prints the final answer for the demo query (“IST time + 7+35 + summary”), showing the agentic behavior working end-to-end.
- Extras: includes a minimal ChatML-style template (used only in the Transformers path) and a stubbed RAG tool you can later replace with a real retriever (a hedged sketch follows below).
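To replace the stubbed search_docs with a real retriever, one option is FAISS plus an embedding model loaded through sentence-transformers, in the spirit of the "Qwen3-Embedding + FAISS/Chroma" comment in the script. This is only a sketch: it assumes pip install sentence-transformers faiss-cpu, the Qwen/Qwen3-Embedding-0.6B checkpoint name is an assumption you can swap for any embedder, and DOCS is a placeholder corpus.

# Hedged sketch of a drop-in replacement for search_docs()
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

DOCS = [
    "Doc A: LFM2 models target edge and on-device deployment.",
    "Doc B: vLLM exposes an OpenAI-compatible API on port 8000.",
]  # placeholder corpus; load your own snippets here

_embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")   # assumed embedding model
_doc_vecs = _embedder.encode(DOCS, normalize_embeddings=True)
_index = faiss.IndexFlatIP(_doc_vecs.shape[1])                 # cosine similarity via inner product
_index.add(np.asarray(_doc_vecs, dtype="float32"))

def search_docs(query: str, k: int = 2) -> str:
    """Return the top-k document snippets for the query."""
    q = _embedder.encode([query], normalize_embeddings=True)
    _, idx = _index.search(np.asarray(q, dtype="float32"), k)
    hits = [f"{rank + 1}) {DOCS[i]}" for rank, i in enumerate(idx[0])]
    return f"Top hits for '{query}':\n" + "\n".join(hits)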
Step 15: Run the Agent
Run the agent to generate a response in the terminal.
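With the vLLM server from Step 13 still running (the script has USE_VLLM = True), execute the agent file you created:

python agent_lfm2.py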
Conclusion
LFM2-2.6B makes “edge-ready” truly practical: a compact, multilingual model that’s fast, memory-savvy, and easy to stand up on a single GPU. In this guide you spun it up two ways—a quick sanity check with Transformers and a production-style vLLM server—then leveled up to a tool-using agent that can call functions and plug into RAG. With sensible VRAM targets and defaults (BF16, 4–8K context), it runs smoothly on mainstream cards while staying flexible for laptops or heavier servers. Your next steps: swap the stubbed RAG for your FAISS/Qwen embeddings, add more domain tools, and wrap the agent in FastAPI for a clean HTTP service. Ship it, measure latency/throughput, and iterate—this stack is ready for real workflows.
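As a starting point for that FastAPI wrapper, here is a minimal sketch. It assumes pip install fastapi uvicorn, that agent_lfm2.py from Step 14 is on the import path, and the app.py filename and /agent route are just example names.

# app.py - hedged sketch of an HTTP wrapper around run_agent()
from fastapi import FastAPI
from pydantic import BaseModel
from agent_lfm2 import run_agent  # the agent file created in Step 14

app = FastAPI()

class Query(BaseModel):
    message: str

@app.post("/agent")
def agent_endpoint(q: Query):
    # Delegates to the agent loop and returns its final answer as JSON
    return {"answer": run_agent(q.message)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8080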