MobileLLM-R1-950M is Meta’s new reasoning-focused model in the MobileLLM family, optimized for math, programming (Python/C++), and scientific problems. Despite its small scale (<1B parameters), it rivals or outperforms comparable open-source models such as Qwen3-0.6B and much larger ones such as SmolLM2-1.7B on benchmarks including MATH, GSM8K, MMLU, and LiveCodeBench. With a 32K context window, an efficient training pipeline, and open training recipes, it’s designed to be lightweight yet powerful for reasoning-heavy workloads.
Model Architecture
| Model | # Layers | # Attention Heads | # KV Heads | Dim | Hidden Dim | Params |
|---|---|---|---|---|---|---|
| MobileLLM-R1-140M | 15 | 9 | 3 | 576 | 2048 | 140M |
| MobileLLM-R1-360M | 15 | 16 | 4 | 1024 | 4096 | 359M |
| MobileLLM-R1-950M | 22 | 24 | 6 | 1536 | 6144 | 949M |
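If you want to confirm these numbers against the released checkpoint, the configuration can be inspected without downloading the full weights. A minimal sketch (it assumes you already have access to the gated repo and are logged in to Hugging Face, which is covered in the steps below):

# Inspect the model configuration without downloading the full weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/MobileLLM-R1-950M")
print(config)  # prints layer count, attention/KV heads, hidden sizes, etc.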
Evaluation
MobileLLM-R1 Base Model
| Model | Size | MATH500 (4-shot em) | GSM8K (8-shot em) | MBPP (3-shot pass@1) | HumanEval (0-shot pass@1) | CommonSense Avg. (0-shot acc.) | MMLU (5-shot acc.) |
|---|---|---|---|---|---|---|---|
| <150M | | | | | | | |
| SmolLM2-135M-base | 135M | 0.4 | 1.8 | 3.8 | 0.0 | 50.7 | — |
| MobileLLM-R1-140M-base | 140M | 4.6 | 16.3 | 5.4 | 15.9 | 44.3 | — |
| 150M – 400M | | | | | | | |
| Gemma-3-270M-pt | 268M | 0.6 | 1.1 | 2.0 | 3.1 | 48.4 | 26.5 |
| SmolLM2-360M-base | 362M | 1.8 | 5.0 | 19.4 | 0.0 | 56.6 | 24.7 |
| MobileLLM-R1-360M-base | 359M | 13.4 | 39.4 | 20.8 | 32.9 | 51.0 | 26.8 |
| 400M – 1B | | | | | | | |
| Qwen2.5-0.5B-base | 494M | 14.8 | 41.8 | 29.6 | 28.1 | 52.3 | 47.5 |
| Qwen3-0.6B-base | 596M | 29.8 | 60.9 | 39.0 | 30.5 | 55.3 | 52.4 |
| MobileLLM-R1-950M-base | 949M | 26.8 | 61.6 | 39.2 | 46.3 | 58.6 | 47.4 |
| > 1B | | | | | | | |
| Gemma-3-1B-pt | 1.0B | 0.6 | 2.4 | 9.4 | 6.1 | 57.3 | 26.1 |
| LLaMA3.2-1B-base | 1.24B | 1.6 | 6.8 | 26.6 | 17.1 | 58.4 | 32.0 |
| OLMo-2-0425-1B-base | 1.48B | 5.2 | 39.8 | 7.8 | 6.7 | 61.0 | 42.4 |
| Qwen2.5-1.5B-base | 1.54B | 31.0 | 68.4 | 44.6 | 36.6 | 58.7 | 61.2 |
| SmolLM2-1.7B-base | 1.71B | 11.6 | 31.8 | 35.4 | 0.6 | 62.9 | 50.0 |
| Qwen3-1.7B-base | 2.03B | 38.5 | 76.2 | 56.4 | 47.6 | 60.9 | 62.1 |
MobileLLM-R1 Post-Trained Model
| Model | Size | MATH500 (0-shot pass@1) | GSM8K (0-shot pass@1) | AIME’24 (0-shot pass@1, n=64) | AIME’25 (0-shot pass@1, n=64) | LiveCodeBench-v6 (0-shot pass@1, n=16) |
|---|---|---|---|---|---|---|
| <150M | | | | | | |
| SmolLM2-135M-Instruct | 135M | 3.0 | 2.4 | — | — | 0.0 |
| MobileLLM-R1-140M | 140M | 6.2 | 4.1 | — | — | 1.7 |
| 150M – 400M | | | | | | |
| Gemma-3-270m-it | 268M | 6.8 | 8.4 | — | — | 0.0 |
| SmolLM2-360M-Instruct | 362M | 3.4 | 8.1 | — | — | 0.7 |
| MobileLLM-R1-360M | 359M | 28.4 | 24.5 | — | — | 5.1 |
| 400M – 1B | | | | | | |
| Qwen2.5-0.5B-Instruct | 494M | 31.2 | 48.1 | 0.1 | 0.3 | 3.6 |
| Qwen3-0.6B | 596M | 73.0 | 79.2 | 11.3 | 17.0 | 14.9 |
| MobileLLM-R1-950M | 949M | 74.0 | 67.5 | 15.5 | 16.3 | 19.9 |
| > 1B | | | | | | |
| Gemma-3-1B-it | 1.0B | 45.4 | 62.9 | 0.9 | 0.0 | 2.0 |
| LLaMA3.2-1B-Instruct | 1.24B | 24.8 | 38.8 | 1.1 | 0.2 | 4.1 |
| OLMo-2-0425-1B-Instruct | 1.48B | 19.2 | 69.7 | 0.6 | 0.1 | 0.0 |
| OpenReasoning-Nemotron-1.5B | 1.54B | 83.4 | 76.7 | 49.7 | 40.4 | 28.3 |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.54B | 83.2 | 77.3 | 29.1 | 23.4 | 19.9 |
| Qwen2.5-1.5B-Instruct | 1.54B | 54.0 | 70.0 | 2.5 | 0.9 | 7.9 |
| SmolLM2-1.7B-Instruct | 1.71B | 19.2 | 41.8 | 0.3 | 0.1 | 4.4 |
| Qwen3-1.7B | 2.03B | 89.4 | 90.3 | 47.0 | 37.0 | 29.8 |
Training Stages and Hyperparameter Details
| Stage | Phase | Tokens / Samples | BS | Sequence Length | Steps | LR | #GPUs | Training Time |
|---|---|---|---|---|---|---|---|---|
| Pre-training | Phase 1 | 2T tokens | 16 | 2k | 500k | 4.00E-03 | 16 x 8 | 4-5 days |
| Pre-training | Phase 2 | 2T tokens | 16 | 2k | 500k | 4.00E-03 | 16 x 8 | 4-5 days |
| Mid-training | Phase 1 | 100B tokens | 4 | 4k | 50k | 3.60E-04 | 16 x 8 | 1-2 days |
| Mid-training | Phase 2 | 100B tokens | 4 | 4k | 50k | 3.60E-04 | 16 x 8 | 1-2 days |
| Post-training | General SFT | 866K samples | 4 | 4k | 2 epochs | 5.00E-06 | 16 x 8 | ~2 h |
| Post-training | Reasoning SFT | 6.2M samples | 8 | 32k | 4 epochs | 8.00E-05 | 16 x 8 | ~2.5 days |
Data Mix
Pre-Training
| Dataset | Rows | Tokens (B) | Phase 1 Mix Ratio | Phase 2 Mix Ratio |
|---|---|---|---|---|
| StarCoder | 206,640,114 | 263.8 | 10.66% | 0.52% |
| OpenWebMath | 6,117,786 | 12.6 | 6.93% | 23.33% |
| FineWeb-Edu | 1,279,107,432 | 1300 | 63.75% | 54.83% |
| Wiki | 7,222,303 | 3.7 | 5.03% | 0.14% |
| Arxiv | 1,533,917 | 28 | 6.36% | 1.32% |
| StackExchange | 29,249,120 | 19.6 | 5.03% | 0.86% |
| Algebraic stack | 3,404,331 | 12.6 | 2.25% | 1.26% |
| Nemotron science | 708,920 | 2 | — | 0.03% |
| Nemotron code | 10,108,883 | 16 | — | 0.72% |
| Nemotron math | 22,066,397 | 15 | — | 3.01% |
| Cosmopedia | 31,064,744 | 25 | — | 2.70% |
| Facebook natural reasoning | 1,145,824 | 1.8 | — | 3.18% |
| FineMath | 48,283,984 | 34 | — | 8.01% |
| peS2o | 38,800,000 | 50 | — | 0.08% |
| Total | | | 100% | 100% |
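To make the mix ratios concrete, here is an illustrative calculation (ours, not from the model card) of roughly how the 2T-token Phase 1 budget splits across datasets, assuming each ratio is the sampling share of the total budget:

# Illustrative only: convert Phase 1 mix ratios into approximate token budgets.
# Budget (2T tokens) and ratios come from the tables above; the split is our reading of them.
phase1_budget = 2e12  # 2T tokens (pre-training Phase 1)
phase1_mix = {
    "StarCoder": 0.1066,
    "OpenWebMath": 0.0693,
    "FineWeb-Edu": 0.6375,
    "Wiki": 0.0503,
    "Arxiv": 0.0636,
    "StackExchange": 0.0503,
    "Algebraic stack": 0.0225,
}

for name, ratio in phase1_mix.items():
    print(f"{name}: ~{phase1_budget * ratio / 1e9:.0f}B tokens")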
Mid-Training
Post-Training
GPU Configuration (Rule of Thumb)
| Scenario | Min VRAM | Recommended GPUs | Precision | Typical Settings | Notes |
|---|---|---|---|---|---|
| Entry (single inference, small batch) | 12–16 GB | RTX 3090 (24G), RTX 4090 (24G), L4 (24G) | bf16 / fp16 | 4K–8K tokens, batch=1 | Works with device_map="auto" + offloading. |
| Standard (longer context, multi-batch) | 24–32 GB | A100 40G, L40S 48G | bf16 | Up to 32K tokens, batch=2–4 | Good balance of speed & memory. |
| Pro (research & heavy workloads) | 40–80 GB | A100 80G, H100 80G | bf16 | 32K tokens, batch=8+ | Best for benchmark replication & fast training runs. |
| Multi-GPU scaling | 2×24 GB+ | Dual RTX 4090, 2×A100 40G | bf16 | Use tensor parallelism in vLLM/Transformers | Required for very large batch or 32K-long runs at high speed. |
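If even bf16 is too tight for your setup, a 4-bit quantized load via bitsandbytes (installed in Step 11 below) is a reasonable fallback. A minimal sketch using the standard Transformers BitsAndBytesConfig API; exact memory use still depends on context length and batch size:

# Sketch: load MobileLLM-R1-950M in 4-bit NF4 to reduce weight memory roughly 4x vs fp16.
# Activations and KV cache still grow with context length, so long contexts need more VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/MobileLLM-R1-950M",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")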
Resources
Link: https://huggingface.co/facebook/MobileLLM-R1-950M
Step-by-Step Process to Install & Run Facebook MobileLLM-R1-950M Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Access + License (Once)
The model is gated under Meta’s FAIR Noncommercial Research License. You’ll be asked for your legal name, date of birth, and organization, and you must agree to use the model non-commercially. Approvals are tied to your Hugging Face account.
Step 2: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 3: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 4: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 5: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 6: Choose An Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Facebook MobileLLM-R1-950M, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Facebook MobileLLM-R1-950M
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Facebook MobileLLM-R1-950M.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Facebook MobileLLM-R1-950M runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 7: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 8: Connect To GPUs Using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 9: Install Python
Run the following command to install the Python 3.10 venv module and Git:
sudo apt update && sudo apt install -y python3.10-venv git
Step 10: Create and Activate a Python 3.10 Virtual Environment, Then Install Pip and Wheel
Run the following commands to create and activate a Python 3.10 virtual environment and upgrade pip and wheel:
python3 -m venv r1 && source r1/bin/activate
python -m pip install -U pip wheel
Step 11: Install PyTorch (CUDA 12.1 Wheels) + Dependencies
Run the following commands to install PyTorch (CUDA 12.1 wheels) and the remaining dependencies:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
pip install -U transformers accelerate huggingface_hub einops sentencepiece
pip install -U bitsandbytes
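Before moving on, it’s worth confirming that the CUDA wheels actually see the GPU. A quick sanity check you can run in a Python shell:

# Quick sanity check that PyTorch was installed with working CUDA support.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))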
Step 12: Install HuggingFace Hub CLI
Run the following command to install huggingface_hub[cli]:
pip install "huggingface_hub[cli]"
Step 13: Authenticate To Hugging Face Hub (Paste Your Token)
- Create a token
- Log in from the VM (interactive – recommended)
# New command (the old `huggingface-cli login` is deprecated)
hf auth login
# paste your token when asked
hf whoami # quick sanity check
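If you prefer a non-interactive login (for example, inside a provisioning script), the huggingface_hub Python API also works. A small sketch, assuming you have exported your token as the HF_TOKEN environment variable:

# Non-interactive alternative to `hf auth login`.
# Assumes the token is exported as HF_TOKEN before running this.
import os
from huggingface_hub import login, whoami

login(token=os.environ["HF_TOKEN"])
print(whoami()["name"])  # confirms which account the token belongs to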
Step 14: Connect to Your GPU VM with a Code Editor
Before you start running the model script with the Facebook MobileLLM-R1-950M model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 15: Create the Script
Create a file (e.g., run_r1.py) and add the following code:
# save as run_r1.py
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "facebook/MobileLLM-R1-950M"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# torch_dtype="auto" loads the weights in the checkpoint dtype (bf16/fp16 on GPU when available)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"  # places the model on GPU; offloads to CPU if VRAM is tight
)

# Build the prompt with the model's chat template
inputs = tokenizer.apply_chat_template(
    [{"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
     {"role": "user", "content": "Compute: $1-2+3-4+5- \\dots +99-100$."}],
    add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=True is required for the temperature setting to take effect
out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Step 16: Run the Script
Run the script with the following command:
python3 run_r1.py
This will download the model and print the generated response in the terminal.
Step 17: Install vLLM
Run the following command to install vLLM:
pip install -U vllm
Step 18: Install Python 3.10 Toolchain + Headers
Run the following command to install python 3.10 toolchain + headers:
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Note: we already have Python 3.10 installed, but we add python3.10-dev (and the toolchain) because vLLM builds/uses native CUDA extensions and needs the Python 3.10 headers and libs to compile / load wheels correctly.
Step 19: Start the vLLM Server
Run the following command to start the vLLM server:
vllm serve facebook/MobileLLM-R1-950M \
--dtype auto \
--max-model-len 32768 \
--host 0.0.0.0 --port 8000
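The first start can take a minute while the weights download and load. A small readiness check (ours, not part of vLLM) that polls the server before you send requests; it uses the requests package, which is pulled in by the libraries installed earlier — install it explicitly if it’s missing:

# Poll the vLLM OpenAI-compatible server until it reports a loaded model.
import time
import requests

url = "http://localhost:8000/v1/models"
for _ in range(60):
    try:
        r = requests.get(url, timeout=2)
        if r.ok:
            print("Server ready:", [m["id"] for m in r.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
else:
    print("Server did not become ready in time.")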
Step 20: Ask a question (OpenAI Chat Endpoint)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/MobileLLM-R1-950M",
"messages": [
{
"role":"system",
"content":"Do all reasoning silently. Do NOT use <think>. Return only the final result as \\\\boxed{...}."
},
{
"role":"user",
"content":"Compute: 1-2+3-4+...+99-100"
}
],
"temperature": 0,
"max_tokens": 64
}'
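Because the endpoint is OpenAI-compatible, you can also call it from Python with the official openai client (pip install openai; the api_key value is arbitrary, since vLLM does not check it by default):

# Call the local vLLM server through the OpenAI-compatible chat API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="facebook/MobileLLM-R1-950M",
    messages=[
        {"role": "system", "content": "Do all reasoning silently. Return only the final result as \\boxed{...}."},
        {"role": "user", "content": "Compute: 1-2+3-4+...+99-100"},
    ],
    temperature=0,
    max_tokens=64,
)
print(resp.choices[0].message.content)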
Conclusion
MobileLLM-R1-950M hits a sweet spot: tiny (<1B) yet genuinely useful for math, coding, and science—with a roomy 32K context and simple, reproducible setup on a GPU VM. You can run single-query inference comfortably on 12–16 GB VRAM, and scale concurrency or longer contexts with 24–40 GB+. In this guide, we covered the whole path—CUDA-ready base image, PyTorch + Transformers, gated access, and a vLLM API—plus tricks to keep outputs clean and fast.
If you’re experimenting for research (FAIR Noncommercial license), spin this up on your NodeShift GPU node, try 4-bit loading for tight VRAM, and share your tok/s + VRAM numbers. Lightweight, reproducible, and ready for real reasoning workloads—exactly what a sub-billion model should be.
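If you do want to report tok/s and VRAM, here is a rough measurement sketch for the single-request Transformers path (vLLM will be faster); it assumes model, tokenizer, and inputs are already defined as in run_r1.py:

# Rough throughput / memory measurement for one generation.
# Run after loading `model`, `tokenizer`, and `inputs` as in run_r1.py.
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.time()
out = model.generate(inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs.shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")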