MobileLLM-Pro is Meta’s 1.08B-parameter, on-device–first LLM with a 128k context window and local-global attention (3:1) for faster prefill and tiny KV cache. It ships as base and instruction-tuned variants, plus near-lossless int4 quantization (CPU & accelerator ready), delivering competitive quality vs other ~1B models while fitting comfortably on phones, edge accelerators, and low-VRAM GPUs.
Base Pretrained Model
| Benchmark | P1 (FP) | P1 (Q-CPU) | P1 (Q-Acc) | Gemma 3 1B | Llama 3.2 1B |
|---|---|---|---|---|---|
| HellaSwag | 67.11% | 64.89% | 65.10% | 62.30% | 65.69% |
| BoolQ | 76.24% | 77.49% | 76.36% | 63.20% | 62.51% |
| PIQA | 76.55% | 76.66% | 75.52% | 73.80% | 75.14% |
| SocialIQA | 50.87% | 51.18% | 50.05% | 48.90% | 45.60% |
| TriviaQA | 39.85% | 37.26% | 36.42% | 39.80% | 23.81% |
| NatQ | 15.76% | 15.43% | 13.19% | 9.48% | 5.48% |
| ARC-c | 52.62% | 52.45% | 51.24% | 38.40% | 38.28% |
| ARC-e | 76.28% | 76.58% | 75.73% | 73.00% | 63.47% |
| WinoGrande | 62.83% | 62.43% | 61.96% | 58.20% | 61.09% |
| OBQA | 43.60% | 44.20% | 40.40% | | 37.20% |
| NIH | 100.00% | 96.44% | 98.67% | | |
- FP = full precision (bf16)
- Q-CPU = int4, group-wise quantized (for CPU)
- Q-Acc = int4, channel-wise quantized (for accelerators: ANE & HTP)
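To make the difference between the two int4 variants concrete, here is a minimal sketch in plain PyTorch (not Meta's actual quantization pipeline) that fake-quantizes a weight matrix with group-wise scales (group size 32, as in Q-CPU) versus a single scale per output channel (as in Q-Acc) and compares the reconstruction error:

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int | None = 32) -> torch.Tensor:
    """Symmetric int4 fake-quantization of a [out, in] weight matrix.

    group_size=32   -> group-wise scales (Q-CPU style)
    group_size=None -> one scale per output channel (Q-Acc style)
    """
    out_ch, in_ch = w.shape
    gs = in_ch if group_size is None else group_size
    wg = w.reshape(out_ch, in_ch // gs, gs)                      # split each row into groups
    scale = (wg.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)  # int4 range: [-8, 7]
    q = torch.clamp(torch.round(wg / scale), -8, 7)              # quantize to the int4 grid
    return (q * scale).reshape(out_ch, in_ch)                    # dequantize back to float

w = torch.randn(2048, 2048)
for name, gs in [("group-wise (gs=32)", 32), ("channel-wise", None)]:
    err = (w - fake_quant_int4(w, gs)).abs().mean()
    print(f"{name}: mean abs error = {err:.5f}")
```

Finer groups track the weight distribution more closely (lower error) at the cost of storing more scales, which is why the CPU path uses group-wise scales while the accelerator path trades a little accuracy for the simpler channel-wise layout.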
Instruction Tuned Model
| Benchmark | P1 (IFT) | Gemma 3 1B (IFT) | Llama 3.2 1B (IFT) |
|---|---|---|---|
| MMLU | 44.8% | 29.9% | 49.3% |
| IFEval | 62.0% | 80.2% | 59.5% |
| MBPP | 46.8% | 35.2% | 39.6% |
| HumanEval | 59.8% | 41.5% | 37.8% |
| ARC-C | 62.7% | | 59.4% |
| HellaSwag | 58.4% | | 41.2% |
| BFCL v2 | 29.4% | | 25.7% |
| Open Rewrite | 51.0% | | 41.6% |
| TLDR9+ | 16.8% | | 16.8% |
Latency Benchmarking
| Metric / Prompt length | 2k | 4k | 8k |
|---|---|---|---|
| CPU prefill latency (s) | 8.9 | 24.8 | 63.5 |
| CPU decode speed (tok/s) | 33.6 | 24.8 | 19.7 |
| HTP prefill latency (s) | 1.96 | 3.38 | 9.82 |
| HTP decode speed (tok/s) | 31.60 | 28.95 | 22.77 |
| KV cache size (MB) | 14 | 23 | 40 |
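As a rough back-of-the-envelope, total response time ≈ prefill latency + new tokens ÷ decode speed. Here is a tiny sketch using the HTP numbers above with an assumed 256-token reply (decode speed degrades as the context grows, so treat this as an estimate):

```python
# Rough end-to-end estimate from the HTP row above: 2k-token prompt, 256-token reply.
prefill_s = 1.96          # HTP prefill latency at a 2k-token prompt (from the table)
decode_tok_per_s = 31.60  # HTP decode speed at 2k context (from the table)
new_tokens = 256          # assumed reply length

total_s = prefill_s + new_tokens / decode_tok_per_s
print(f"~{total_s:.1f} s for the full reply")  # ~10.1 s
```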
GPU / Device Configuration Cheatsheet
| Mode / Use Case | Precision | Min VRAM/RAM (weights + KV @ 8k / 32k / 128k) | Suggested Device | Practical Max Context | Suggested Engine / Flags | Notes |
|---|---|---|---|---|---|---|
| Mobile CPU (on-device) | int4 (group-wise, gs=32) + int8 acts/KV | ~0.59 GB model (~1.0–1.5 GB total with headroom) | High-end phone CPU (ExecuTorch + XNNPACK) | 8k–32k | ExecuTorch export | From card: ~8.9 s prefill @2k, 33.6 tok/s decode (CPU). Great for offline tasks. |
| Mobile Accelerator (HTP/ANE) | int4 (per-channel) + int8 KV | ~0.6–0.8 GB total | S24 HTP / iPhone ANE class | 8k–32k | ExecuTorch HTP/ANE backend | From card: HTP 1.96 s prefill @2k, ~22.8 tok/s decode @8k. |
| Edge GPU (4–6 GB) | int4 | 0.54 GB + KV (40/160/640 MB) → ~1.2 GB @128k; budget 2–3 GB total | 4 GB GPUs (e.g., RTX 3050 Laptop, older GTX 1650), T4 16 GB (underutilized) | Up to 128k (single stream) | Transformers (CUDA) load_in_4bit=True or torchao QAT convert | Plenty of headroom; bump batch a bit; great for kiosk/edge boxes. |
| General GPU (8 GB) | FP16/BF16 | 2.16 GB + KV (0.04/0.16/0.64 GB) → ~2.8–3.2 GB | RTX 3060/4060 8 GB, L4 24 GB (shared) | 128k | vLLM --dtype bfloat16 --max-model-len 131072 | Room for moderate batch (2–4). BF16 recommended. |
| Throughput GPU (12–16 GB) | FP16/BF16 | Same as above; allocate ~5–8 GB incl. activations/batch | RTX 4070/4080, A10 24 GB | 128k | vLLM or SGLang; enable paged KV | Higher batch (8–16), concurrent users, long docs. |
| Server GPU (24–48 GB) | FP16/BF16 | Memory trivial; focus on concurrency | L40S 48 GB, A5000 24 GB, A6000 48 GB | 128k | vLLM --tensor-parallel-size=1 (or multi-instancing) | Run many replicas or large batches; great for API serving. |
| CPU Server (x86) | int4 (XNNPACK) | ~0.6–1.5 GB | 8–32 core CPU boxes | 8k–32k | HF Transformers + torchao int4 | Lower TPS but easy deploy; no GPU required. |
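The VRAM column is just weights + KV cache plus headroom. Here is a small sketch of the same arithmetic; the weight sizes (2.16 GB bf16, 0.54 GB int4) and the 40 MB KV cache at 8k are taken from the tables above, and linear KV growth is an assumption that matches the cheatsheet's 40/160/640 MB figures:

```python
# Rough memory budget: weights + KV cache + headroom (figures from the cheatsheet above).
WEIGHTS_GB = {"bf16": 2.16, "int4": 0.54}   # ~1.08B params at 2 bytes vs ~0.5 byte per weight
KV_GB_AT_8K = 0.04                           # ~40 MB at 8k context (from the latency table)

def budget_gb(precision: str, context_tokens: int, headroom_gb: float = 1.0) -> float:
    # Assumes KV grows linearly with context, matching the 40/160/640 MB cheatsheet figures.
    kv = KV_GB_AT_8K * (context_tokens / 8192)
    return WEIGHTS_GB[precision] + kv + headroom_gb

for ctx in (8_192, 32_768, 131_072):
    print(f"bf16 @ {ctx:>7,} tokens: ~{budget_gb('bf16', ctx):.2f} GB")
    print(f"int4 @ {ctx:>7,} tokens: ~{budget_gb('int4', ctx):.2f} GB")
```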
Resources
Link: https://huggingface.co/facebook/MobileLLM-Pro
Step-by-Step Process to Install & Run Facebook MobileLLM-Pro Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Request and Get Access to MobileLLM-Pro on Hugging Face
Before you can download or run Meta’s MobileLLM-Pro, you must request gated access on Hugging Face.
- Go to the model page: https://huggingface.co/facebook/MobileLLM-Pro
- You’ll see a notice: “You need to agree to share your contact information to access this model.”
- Fill in the required form:
  - First Name & Last Name
  - Date of Birth
  - Country
  - Affiliation (e.g., “DevRel Engineer (NodeShift)”)
  - Job Title (e.g., “AI Developer/Engineer”)
- Check the confirmation box to accept the license and Meta’s research-use terms.
- Click Submit.
After submission, your request goes to Meta for review. Once approved, the model card will update with the label:
“Gated model – You have been granted access to this model.”
Step 2: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 3: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 4: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 5: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 6: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Facebook MobileLLM-Pro, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Facebook MobileLLM-Pro
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Facebook MobileLLM-Pro.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Facebook MobileLLM-Pro runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 7: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 8: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 9: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we upgrade it)
Run the following command to check which Python version is currently available:
python3 --version
The system has Python 3.10.12 by default, so to install a newer version you'll need the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 10: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 11: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu121
Step 12: Install the Utilities
Run the following command to install utilities:
pip install -U transformers accelerate sentencepiece bitsandbytes
Step 13: Install Hugging Face Hub & Authenticate (For Gated MobileLLM-Pro)
Install / upgrade the CLI
python -m pip install -U huggingface_hub
Log in to Hugging Face
huggingface-cli login
Paste your Access Token from Settings → Access Tokens (the token must have access to facebook/MobileLLM-Pro; “Read” scope is sufficient for downloads).
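If you prefer a non-interactive login (for scripts or CI), one option is to export the token as HF_TOKEN and authenticate from Python with huggingface_hub; the model_info call below is just a quick sanity check that your gated-access request was approved (the filename hf_auth_check.py is arbitrary):

```python
# hf_auth_check.py -- optional, non-interactive alternative to `huggingface-cli login`
import os
from huggingface_hub import login, model_info

# Assumes you exported your token first:  export HF_TOKEN=hf_xxx
login(token=os.environ["HF_TOKEN"])

# Raises an error if your gated-access request hasn't been approved yet.
info = model_info("facebook/MobileLLM-Pro")
print("Access OK:", info.id)
```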
Step 14: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Facebook MobileLLM-Pro model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 15: Create the Script
Create a file (e.g., run_mobilellm_pro.py) and add the following code:
```python
# run_mobilellm_pro.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "facebook/MobileLLM-Pro"
VARIANT = "instruct"  # "base" or "instruct"

dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, subfolder=VARIANT)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, subfolder=VARIANT, torch_dtype=dtype, device_map="auto"
).eval()

def chat(prompt: str):
    msgs = [{"role": "user", "content": prompt}]
    inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
    out = model.generate(inputs, max_new_tokens=256)
    print(tok.decode(out[0], skip_special_tokens=True))

if __name__ == "__main__":
    chat("In one paragraph, why are on-device LMs useful?")
```
What this Script Does
- Loads the instruct subfolder of facebook/MobileLLM-Pro (tokenizer + model) with trust_remote_code=True.
- Picks the compute dtype automatically (bfloat16 on GPU, else float32) and places weights with device_map="auto".
- Wraps your prompt using the model’s chat template (apply_chat_template) for proper chat formatting.
- Calls model.generate(..., max_new_tokens=256) to produce a response.
- Decodes and prints the final text (no special tokens), running the model in .eval() mode.
Step 16: Run the Script
Run the script with the following command:
python run_mobilellm_pro.py
This will load the model and generate a response in the terminal.
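If you'd rather watch tokens appear as they're generated instead of waiting for the full reply, you can optionally swap in Transformers' TextStreamer; this sketch assumes the same tok and model objects defined in run_mobilellm_pro.py above:

```python
# Optional streaming variant of chat() for run_mobilellm_pro.py
from transformers import TextStreamer

def chat_stream(prompt: str):
    # Reuses the tok and model objects loaded earlier in the script.
    msgs = [{"role": "user", "content": prompt}]
    inputs = tok.apply_chat_template(
        msgs, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    # Prints tokens to stdout as they are generated, skipping the prompt itself.
    streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    model.generate(inputs, max_new_tokens=256, streamer=streamer)
```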
Step 17: Install Streamlit and Other Dependencies
Run the following commands to install streamlit and other dependencies:
pip install -U streamlit transformers accelerate bitsandbytes
Step 18: Create the Script
Create a file (e.g., mobilellm_streamlit.py) and add the following code:
```python
# mobilellm_streamlit.py
import os, threading
import streamlit as st
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer

# -------- Settings (you can tweak from the sidebar) ----------
DEFAULT_MODEL_ID = "facebook/MobileLLM-Pro"

st.set_page_config(page_title="MobileLLM-Pro (Streamlit)", page_icon="🤖", layout="wide")
st.title("MobileLLM-Pro — Browser Chat (Streamlit)")

with st.sidebar:
    st.markdown("### Model Settings")
    model_id = st.text_input("Model ID", value=DEFAULT_MODEL_ID)
    variant = st.selectbox("Variant (subfolder)", ["instruct", "base"], index=0)
    use_4bit = st.checkbox("Load in 4-bit (bitsandbytes)", value=False,
                           help="Saves VRAM; slower prefill; great for small GPUs.")
    sys_prompt = st.text_area("System Prompt (optional)",
                              value="",
                              placeholder="You are a concise, helpful on-device assistant.",
                              height=80)
    temperature = st.slider("Temperature", 0.0, 1.5, 0.8, 0.05)
    top_p = st.slider("Top-p", 0.1, 1.0, 0.95, 0.01)
    top_k = st.slider("Top-k", 1, 200, 40, 1)
    max_new = st.slider("Max new tokens", 16, 2048, 256, 16)
    st.caption("Tip: toggle 4-bit if you hit OOM. Make sure `HF_TOKEN` is exported for gated repo access.")

@st.cache_resource(show_spinner=True)
def load_model_and_tokenizer(model_id: str, variant: str, use_4bit: bool):
    dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, subfolder=variant)
    if use_4bit:
        from transformers import BitsAndBytesConfig
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_id, trust_remote_code=True, subfolder=variant,
            quantization_config=quant_cfg, device_map="auto"
        ).eval()
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id, trust_remote_code=True, subfolder=variant,
            torch_dtype=dtype, device_map="auto"
        ).eval()
    return tok, model

tok, model = load_model_and_tokenizer(model_id, variant, use_4bit)

# --------- Chat state ----------
if "history" not in st.session_state:
    st.session_state.history = []  # list of (user, assistant) tuples

def format_messages(system_prompt: str, history, user_msg: str):
    msgs = []
    if system_prompt and system_prompt.strip():
        msgs.append({"role": "system", "content": system_prompt.strip()})
    for u, a in history:
        if u: msgs.append({"role": "user", "content": u})
        if a: msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": user_msg})
    return msgs

# --------- UI: show history ----------
for u, a in st.session_state.history:
    with st.chat_message("user"):
        st.markdown(u)
    with st.chat_message("assistant"):
        st.markdown(a if a else "")

# --------- Input box ----------
prompt = st.chat_input("Type your message…")

def stream_generate(msgs, temperature, top_p, top_k, max_new):
    inputs = tok.apply_chat_template(
        msgs, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    gen_kwargs = dict(
        inputs=inputs,
        max_new_tokens=int(max_new),
        do_sample=True,
        temperature=float(temperature),
        top_p=float(top_p),
        top_k=int(top_k),
        streamer=streamer
    )
    t = threading.Thread(target=model.generate, kwargs=gen_kwargs)
    t.start()
    for piece in streamer:
        yield piece

if prompt:
    # show the user message immediately
    with st.chat_message("user"):
        st.markdown(prompt)
    # reserve an assistant message container
    with st.chat_message("assistant"):
        placeholder = st.empty()
        partial = ""
        msgs = format_messages(sys_prompt, st.session_state.history, prompt)
        for chunk in stream_generate(msgs, temperature, top_p, top_k, max_new):
            partial += chunk
            placeholder.markdown(partial)
    # commit turn to history
    st.session_state.history.append((prompt, partial))
    st.rerun()
```
What this Script Does
- Builds a chat UI with a sidebar to pick the model/subfolder, toggle 4-bit loading, set a system prompt, and tune temperature/top-p/top-k/max_new_tokens.
- Caches and loads the tokenizer + model via st.cache_resource, supporting either BF16/FP16 or bitsandbytes 4-bit with device_map="auto".
- Keeps multi-turn chat history in st.session_state.history and renders past user/assistant turns with st.chat_message.
- Formats messages using the model’s chat template and streams tokens live to the UI via TextIteratorStreamer while model.generate runs in a background thread.
- On submit, shows the user message instantly, streams the reply, then appends the turn to history and refreshes the app to persist state.
Step 19: Launch the Streamlit UI
Run Streamlit
streamlit run mobilellm_streamlit.py --server.address 0.0.0.0 --server.port 8501
Step 20: Access the Streamlit App
The app listens on all interfaces (0.0.0.0:8501). Open http://<your-VM-IP>:8501/ in your browser, or forward port 8501 over SSH and browse to http://localhost:8501/.
Play with the Model
Conclusion
Meta’s MobileLLM-Pro redefines what’s possible for efficient, high-quality language modeling on edge devices. With its 1.08B-parameter lightweight architecture, 128k-token context window, and local-global attention design, it delivers exceptional reasoning and comprehension — all while running comfortably on phones, accelerators, or low-VRAM GPUs.
By following this end-to-end setup — from NodeShift GPU deployment to Streamlit chat UI — you now have a complete on-device–first workflow for experimenting, fine-tuning, or integrating MobileLLM-Pro into your own applications. It’s fast, secure, private, and developer-friendly — a true step toward bringing powerful intelligence closer to the edge.