MobileLLM-R1-950M is Meta’s new reasoning-focused model in the MobileLLM family, optimized for math, programming (Python/C++), and scientific problems. Despite its small scale (<1B parameters), it rivals or outperforms comparable open-source models such as Qwen3-0.6B and much larger ones such as SmolLM2-1.7B on benchmarks including MATH, GSM8K, MMLU, and LiveCodeBench. With a 32K context window, an efficient training pipeline, and open training recipes, it’s designed to be lightweight yet powerful for reasoning-heavy workloads.
Model Architecture
| Model | # Layers | # Attention Heads | # KV Heads | Dim | Hidden Dim | Params |
|---|---|---|---|---|---|---|
| MobileLLM-R1-140M | 15 | 9 | 3 | 576 | 2048 | 140M |
| MobileLLM-R1-360M | 15 | 16 | 4 | 1024 | 4096 | 359M |
| MobileLLM-R1-950M | 22 | 24 | 6 | 1536 | 6144 | 949M |
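If you want to confirm these numbers against the released checkpoint, the configuration can be inspected without downloading the full weights. A minimal sketch (it assumes you already have access to the gated repo and are logged in to Hugging Face, which is covered in the steps below):

# Inspect the model configuration without downloading the full weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/MobileLLM-R1-950M")
print(config)  # prints layer count, attention/KV heads, hidden sizes, etc.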
Evaluation
MobileLLM-R1 Base Model
| Model | Size | MATH500 (4-shot em) | GSM8K (8-shot em) | MBPP (3-shot pass@1) | HumanEval (0-shot pass@1) | CommonSense Avg. (0-shot acc.) | MMLU (5-shot acc.) |
|---|---|---|---|---|---|---|---|
| <150M | | | | | | | |
| SmolLM2-135M-base | 135M | 0.4 | 1.8 | 3.8 | 0.0 | 50.7 | — |
| MobileLLM-R1-140M-base | 140M | 4.6 | 16.3 | 5.4 | 15.9 | 44.3 | — |
| 150M – 400M | | | | | | | |
| Gemma-3-270M-pt | 268M | 0.6 | 1.1 | 2.0 | 3.1 | 48.4 | 26.5 |
| SmolLM2-360M-base | 362M | 1.8 | 5.0 | 19.4 | 0.0 | 56.6 | 24.7 |
| MobileLLM-R1-360M-base | 359M | 13.4 | 39.4 | 20.8 | 32.9 | 51.0 | 26.8 |
| 400M – 1B | | | | | | | |
| Qwen2.5-0.5B-base | 494M | 14.8 | 41.8 | 29.6 | 28.1 | 52.3 | 47.5 |
| Qwen3-0.6B-base | 596M | 29.8 | 60.9 | 39.0 | 30.5 | 55.3 | 52.4 |
| MobileLLM-R1-950M-base | 949M | 26.8 | 61.6 | 39.2 | 46.3 | 58.6 | 47.4 |
| > 1B | | | | | | | |
| Gemma-3-1B-pt | 1.0B | 0.6 | 2.4 | 9.4 | 6.1 | 57.3 | 26.1 |
| LLaMA3.2-1B-base | 1.24B | 1.6 | 6.8 | 26.6 | 17.1 | 58.4 | 32.0 |
| OLMo-2-0425-1B-base | 1.48B | 5.2 | 39.8 | 7.8 | 6.7 | 61.0 | 42.4 |
| Qwen2.5-1.5B-base | 1.54B | 31.0 | 68.4 | 44.6 | 36.6 | 58.7 | 61.2 |
| SmolLM2-1.7B-base | 1.71B | 11.6 | 31.8 | 35.4 | 0.6 | 62.9 | 50.0 |
| Qwen3-1.7B-base | 2.03B | 38.5 | 76.2 | 56.4 | 47.6 | 60.9 | 62.1 |
MobileLLM-R1 Post-Trained Model
| Model | Size | MATH500 (0-shot pass@1) | GSM8K (0-shot pass@1) | AIME’24 (0-shot pass@1, n=64) | AIME’25 (0-shot pass@1, n=64) | LiveCodeBench-v6 (0-shot pass@1, n=16) |
|---|---|---|---|---|---|---|
| <150M | | | | | | |
| SmolLM2-135M-Instruct | 135M | 3.0 | 2.4 | — | — | 0.0 |
| MobileLLM-R1-140M | 140M | 6.2 | 4.1 | — | — | 1.7 |
| 150M – 400M | | | | | | |
| Gemma-3-270m-it | 268M | 6.8 | 8.4 | — | — | 0.0 |
| SmolLM2-360M-Instruct | 362M | 3.4 | 8.1 | — | — | 0.7 |
| MobileLLM-R1-360M | 359M | 28.4 | 24.5 | — | — | 5.1 |
| 400M – 1B | | | | | | |
| Qwen2.5-0.5B-Instruct | 494M | 31.2 | 48.1 | 0.1 | 0.3 | 3.6 |
| Qwen3-0.6B | 596M | 73.0 | 79.2 | 11.3 | 17.0 | 14.9 |
| MobileLLM-R1-950M | 949M | 74.0 | 67.5 | 15.5 | 16.3 | 19.9 |
| > 1B | | | | | | |
| Gemma-3-1B-it | 1.0B | 45.4 | 62.9 | 0.9 | 0.0 | 2.0 |
| LLaMA3.2-1B-Instruct | 1.24B | 24.8 | 38.8 | 1.1 | 0.2 | 4.1 |
| OLMo-2-0425-1B-Instruct | 1.48B | 19.2 | 69.7 | 0.6 | 0.1 | 0.0 |
| OpenReasoning-Nemotron-1.5B | 1.54B | 83.4 | 76.7 | 49.7 | 40.4 | 28.3 |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.54B | 83.2 | 77.3 | 29.1 | 23.4 | 19.9 |
| Qwen2.5-1.5B-Instruct | 1.54B | 54.0 | 70.0 | 2.5 | 0.9 | 7.9 |
| SmolLM2-1.7B-Instruct | 1.71B | 19.2 | 41.8 | 0.3 | 0.1 | 4.4 |
| Qwen3-1.7B | 2.03B | 89.4 | 90.3 | 47.0 | 37.0 | 29.8 |
Training Stages and Hyperparameter Details
| Stage | Phase | Tokens / Samples | BS | Sequence Length | Steps | LR | #GPUs | Training Time |
|---|---|---|---|---|---|---|---|---|
| Pre-training | Phase 1 | 2T tokens | 16 | 2k | 500k | 4.00E-03 | 16 x 8 | 4-5 days |
| Pre-training | Phase 2 | 2T tokens | 16 | 2k | 500k | 4.00E-03 | 16 x 8 | 4-5 days |
| Mid-training | Phase 1 | 100B tokens | 4 | 4k | 50k | 3.60E-04 | 16 x 8 | 1-2 days |
| Mid-training | Phase 2 | 100B tokens | 4 | 4k | 50k | 3.60E-04 | 16 x 8 | 1-2 days |
| Post-training | General SFT | 866K samples | 4 | 4k | 2 epochs | 5.00E-06 | 16 x 8 | ~2 h |
| Post-training | Reasoning SFT | 6.2M samples | 8 | 32k | 4 epochs | 8.00E-05 | 16 x 8 | ~2.5 days |
Data Mix
Pre-Training
| Dataset | Rows | Tokens (B) | Phase 1 Mix Ratio | Phase 2 Mix Ratio |
|---|---|---|---|---|
| StarCoder | 206,640,114 | 263.8 | 10.66% | 0.52% |
| OpenWebMath | 6,117,786 | 12.6 | 6.93% | 23.33% |
| FineWeb-Edu | 1,279,107,432 | 1300 | 63.75% | 54.83% |
| Wiki | 7,222,303 | 3.7 | 5.03% | 0.14% |
| Arxiv | 1,533,917 | 28 | 6.36% | 1.32% |
| StackExchange | 29,249,120 | 19.6 | 5.03% | 0.86% |
| Algebraic stack | 3,404,331 | 12.6 | 2.25% | 1.26% |
| Nemotron science | 708,920 | 2 | — | 0.03% |
| Nemotron code | 10,108,883 | 16 | — | 0.72% |
| Nemotron math | 22,066,397 | 15 | — | 3.01% |
| Cosmopedia | 31,064,744 | 25 | — | 2.70% |
| Facebook natural reasoning | 1,145,824 | 1.8 | — | 3.18% |
| FineMath | 48,283,984 | 34 | — | 8.01% |
| peS2o | 38,800,000 | 50 | — | 0.08% |
| Total | | | 100% | 100% |
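To make the mix ratios concrete, here is an illustrative calculation (ours, not from the model card) of roughly how the 2T-token Phase 1 budget splits across datasets, assuming each ratio is the sampling share of the total budget:

# Illustrative only: convert Phase 1 mix ratios into approximate token budgets.
# Budget (2T tokens) and ratios come from the tables above; the split is our reading of them.
phase1_budget = 2e12  # 2T tokens (pre-training Phase 1)
phase1_mix = {
    "StarCoder": 0.1066,
    "OpenWebMath": 0.0693,
    "FineWeb-Edu": 0.6375,
    "Wiki": 0.0503,
    "Arxiv": 0.0636,
    "StackExchange": 0.0503,
    "Algebraic stack": 0.0225,
}

for name, ratio in phase1_mix.items():
    print(f"{name}: ~{phase1_budget * ratio / 1e9:.0f}B tokens")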
Mid-Training
Post-Training
GPU Configuration (Rule of Thumb)
| Scenario | Min VRAM | Recommended GPUs | Precision | Typical Settings | Notes |
|---|---|---|---|---|---|
| Entry (single inference, small batch) | 12–16 GB | RTX 3090 (24G), RTX 4090 (24G), L4 (24G) | bf16 / fp16 | 4K–8K tokens, batch=1 | Works with device_map="auto" + offloading. |
| Standard (longer context, multi-batch) | 24–32 GB | A100 40G, L40S 48G | bf16 | Up to 32K tokens, batch=2–4 | Good balance of speed & memory. |
| Pro (research & heavy workloads) | 40–80 GB | A100 80G, H100 80G | bf16 | 32K tokens, batch=8+ | Best for benchmark replication & fast training runs. |
| Multi-GPU scaling | 2×24 GB+ | Dual RTX 4090, 2×A100 40G | bf16 | Use tensor parallelism in vLLM/Transformers | Required for very large batch or 32K-long runs at high speed. |
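If even bf16 is too tight for your setup, a 4-bit quantized load via bitsandbytes (installed in Step 11 below) is a reasonable fallback. A minimal sketch using the standard Transformers BitsAndBytesConfig API; exact memory use still depends on context length and batch size:

# Sketch: load MobileLLM-R1-950M in 4-bit NF4 to reduce weight memory roughly 4x vs fp16.
# Activations and KV cache still grow with context length, so long contexts need more VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/MobileLLM-R1-950M",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")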
Resources
Link: https://huggingface.co/facebook/MobileLLM-R1-950M
Step-by-Step Process to Install & Run Facebook MobileLLM-R1-950M Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Access + License (Once)
The model is gated under Meta’s FAIR Noncommercial Research License. You’ll be asked for your legal name, date of birth, and organization, and you must agree to use the model non-commercially. Approvals are tied to your Hugging Face account.
Step 2: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 3: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 4: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 5: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 6: Choose An Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Facebook MobileLLM-R1-950M, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Facebook MobileLLM-R1-950M
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Facebook MobileLLM-R1-950M.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Facebook MobileLLM-R1-950M runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 7: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 8: Connect To GPUs Using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 9: Install Python
Run the following command to install the Python 3.10 venv module and Git:
sudo apt update && sudo apt install -y python3.10-venv git
Step 10: Create and Activate a Python 3.10 Virtual Environment, Then Install Pip and Wheel
Run the following commands to create and activate a Python 3.10 virtual environment and upgrade pip and wheel:
python3 -m venv r1 && source r1/bin/activate
python -m pip install -U pip wheel
Step 11: Install PyTorch (CUDA 12.1 Wheels) + Dependencies
Run the following commands to install PyTorch (CUDA 12.1 wheels) and the remaining dependencies:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
pip install -U transformers accelerate huggingface_hub einops sentencepiece
pip install -U bitsandbytes
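Before moving on, it’s worth confirming that the CUDA wheels actually see the GPU. A quick sanity check you can run in a Python shell:

# Quick sanity check that PyTorch was installed with working CUDA support.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))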
Step 12: Install HuggingFace Hub CLI
Run the following command to install huggingface_hub[cli]:
pip install "huggingface_hub[cli]"
Step 13: Authenticate To Hugging Face Hub (Paste Your Token)
- Create a token
- Log in from the VM (interactive – recommended)
# New command (the old `huggingface-cli login` is deprecated)
hf auth login
# paste your token when asked
hf whoami # quick sanity check
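If you prefer a non-interactive login (for example, inside a provisioning script), the huggingface_hub Python API also works. A small sketch, assuming you have exported your token as the HF_TOKEN environment variable:

# Non-interactive alternative to `hf auth login`.
# Assumes the token is exported as HF_TOKEN before running this.
import os
from huggingface_hub import login, whoami

login(token=os.environ["HF_TOKEN"])
print(whoami()["name"])  # confirms which account the token belongs to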
Step 14: Connect to Your GPU VM with a Code Editor
Before you start running the model script with the Facebook MobileLLM-R1-950M model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 15: Create the Script
Create a file (e.g., run_r1.py) and add the following code:
# save as run_r1.py
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "facebook/MobileLLM-R1-950M"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# torch_dtype="auto" loads the weights in the checkpoint dtype (bf16/fp16 on GPU when available)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"  # places the model on GPU; offloads to CPU if VRAM is tight
)

# Build the prompt with the model's chat template
inputs = tokenizer.apply_chat_template(
    [{"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
     {"role": "user", "content": "Compute: $1-2+3-4+5- \\dots +99-100$."}],
    add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=True is required for the temperature setting to take effect
out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Step 16: Run the Script
Run the script with the following command:
python3 run_r1.py
This will download the model and print the generated response in the terminal.
Step 17: Install vLLM
Run the following command to install vLLM:
pip install -U vllm
Step 18: Install Python 3.10 Toolchain + Headers
Run the following command to install python 3.10 toolchain + headers:
sudo apt update
sudo apt install -y \
build-essential \
python3.10 python3.10-venv python3.10-dev python3-pip \
python3-dev
Note: we already have Python 3.10 installed, but we add python3.10-dev (and the toolchain) because vLLM builds/uses native CUDA extensions and needs the Python 3.10 headers and libs to compile / load wheels correctly.
Step 19: Start the vLLM Server
Run the following command to start the vLLM server:
vllm serve facebook/MobileLLM-R1-950M \
--dtype auto \
--max-model-len 32768 \
--host 0.0.0.0 --port 8000
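The first start can take a minute while the weights download and load. A small readiness check (ours, not part of vLLM) that polls the server before you send requests; it uses the requests package, which is pulled in by the libraries installed earlier — install it explicitly if it’s missing:

# Poll the vLLM OpenAI-compatible server until it reports a loaded model.
import time
import requests

url = "http://localhost:8000/v1/models"
for _ in range(60):
    try:
        r = requests.get(url, timeout=2)
        if r.ok:
            print("Server ready:", [m["id"] for m in r.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
else:
    print("Server did not become ready in time.")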
Step 20: Ask a question (OpenAI Chat Endpoint)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/MobileLLM-R1-950M",
"messages": [
{
"role":"system",
"content":"Do all reasoning silently. Do NOT use <think>. Return only the final result as \\\\boxed{...}."
},
{
"role":"user",
"content":"Compute: 1-2+3-4+...+99-100"
}
],
"temperature": 0,
"max_tokens": 64
}'
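Because the endpoint is OpenAI-compatible, you can also call it from Python with the official openai client (pip install openai; the api_key value is arbitrary, since vLLM does not check it by default):

# Call the local vLLM server through the OpenAI-compatible chat API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="facebook/MobileLLM-R1-950M",
    messages=[
        {"role": "system", "content": "Do all reasoning silently. Return only the final result as \\boxed{...}."},
        {"role": "user", "content": "Compute: 1-2+3-4+...+99-100"},
    ],
    temperature=0,
    max_tokens=64,
)
print(resp.choices[0].message.content)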
Conclusion
MobileLLM-R1-950M hits a sweet spot: tiny (<1B) yet genuinely useful for math, coding, and science—with a roomy 32K context and simple, reproducible setup on a GPU VM. You can run single-query inference comfortably on 12–16 GB VRAM, and scale concurrency or longer contexts with 24–40 GB+. In this guide, we covered the whole path—CUDA-ready base image, PyTorch + Transformers, gated access, and a vLLM API—plus tricks to keep outputs clean and fast.
If you’re experimenting for research (FAIR Noncommercial license), spin this up on your NodeShift GPU node, try 4-bit loading for tight VRAM, and share your tok/s + VRAM numbers. Lightweight, reproducible, and ready for real reasoning workloads—exactly what a sub-billion model should be.
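If you do want to report tok/s and VRAM, here is a rough measurement sketch for the single-request Transformers path (vLLM will be faster); it assumes model, tokenizer, and inputs are already defined as in run_r1.py:

# Rough throughput / memory measurement for one generation.
# Run after loading `model`, `tokenizer`, and `inputs` as in run_r1.py.
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.time()
out = model.generate(inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs.shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")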