The Code World Model (CWM) is a 32B parameter dense autoregressive LLM developed by the Meta FAIR CodeGen team. Unlike traditional code models, it has been mid-trained on Python execution traces, memory trajectories, and containerized agentic interactions, making it uniquely suited for reasoning about how code affects computational environments.
CWM was further post-trained with multi-task reinforcement learning (RL) for verifiable coding, math reasoning, and multi-turn software engineering tasks. It is research-only (non-commercial license) and is not designed as a general-purpose chatbot, but as a strong agentic code reasoning model for researchers.
Evaluation
Below we report results for CWM and compare them to similar state-of-the-art models on common benchmarks.
Model | LCBv5 | LCBv6 | Math-500 | AIME24 | AIME25 |
---|---|---|---|---|---|
Magistral-small-2509-24B | 70.0 | 61.6 | — | 86.1 | 77.3 |
Qwen3-32B | 65.7 | 61.9 | 97.2 | 81.4 | 72.9 |
gpt-oss-20B (low) | 54.2 | 47.3 | — | 42.1 | 37.1 |
gpt-oss-20B (med) | 66.9 | 62.0 | — | 80.0 | 72.1 |
CWM | 68.6 | 63.5 | 96.6 | 76.0 | 68.2 |
Model | SWE-bench Verified |
---|---|
Devstral-1.1-2507-24B | 53.6 |
Qwen3-Coder-32B | 51.6 |
gpt-oss-20B (low / med / high)* | 37.4 / 53.2 / 60.7 |
CWM / CWM + tts (test-time scaling) | 53.9 / 65.8 |
GPU Configuration (Inference, Rule-of-Thumb)
Scenario | Precision / Quantization | Min VRAM (Works) | Comfortable VRAM | Example GPUs | Notes |
---|---|---|---|---|---|
Single GPU (unquantized) | BF16 / FP16 | 80 GB | 96–120 GB | 1× H100 80GB SXM / A100 80GB | Pure weights ~65 GB; KV-cache + activations push close to 80 GB |
Multi-GPU (tensor parallel) | BF16 / FP16 | 2× 40 GB | 2× 80 GB | 2× A100 40GB / 2× H100 80GB | Split across ranks; requires high-bandwidth interconnect (NVLink/IB) |
Quantized (4-bit / 8-bit) | Q4 / Q8 | 24–40 GB | 48 GB+ | RTX 6000 Ada (48 GB), A6000 (48 GB) | Useful for local/researcher setups; speed vs. accuracy trade-off |
High-throughput serving | BF16 | 2× 80 GB+ | 4× 80 GB+ | 2–4× H100 SXM / A100 80GB | For vLLM / Fastgen serving with long sequences & multiple users |
Long-context experiments (131k ctx) | BF16 | 120 GB+ | 160 GB+ | 2× H100 SXM (80 GB) or more | Heavy memory load due to KV-cache scaling with context length |
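The numbers above are rules of thumb rather than measured requirements. A rough back-of-the-envelope sketch of where the memory goes is below; the layer count, KV-head count, and head dimension are illustrative assumptions, not the published CWM architecture:
# Back-of-the-envelope VRAM estimate for serving a 32B dense model in BF16.
# NOTE: num_layers / num_kv_heads / head_dim are assumed values for illustration,
# not the official CWM architecture.
params = 32e9
bytes_per_param = 2                          # BF16
weights_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.0f} GB")      # ~64 GB, matching the table above
# KV-cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
num_layers, num_kv_heads, head_dim = 64, 8, 128   # assumed values
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_param
ctx = 32768                                  # tokens of context
kv_gb = kv_per_token * ctx / 1e9
print(f"KV cache @ {ctx} ctx: ~{kv_gb:.1f} GB per sequence")   # ~8.6 GB
With these assumptions, weights plus a single 32k-token KV cache already approach the 80 GB of a single H100/A100, which is why the table recommends quantization, tensor parallelism, or more headroom for serving and long-context work.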
Step-by-Step Process to Install & Run Facebook CWM Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Request and Get Access to CWM on Hugging Face
Before you can download or run Meta’s Code World Model (CWM), you must request gated access on Hugging Face.
- Go to the model page: facebook/cwm.
- You’ll see a notice: “You need to agree to share your contact information to access this model.”
- Fill in the required form:
- First Name & Last Name
- Date of Birth
- Country
- Affiliation (e.g., “DevRel Engineer (NodeShift)”)
- Job Title (e.g., “AI Developer/Engineer”)
- Check the confirmation box to accept the license and Meta’s research-use terms.
- Click Submit.
👉 After submission, your request goes to Meta for review. Once approved, the model card will update with the label:
“Gated model – You have been granted access to this model.”
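If you want to confirm access programmatically once huggingface_hub is installed (Step 11), here is a minimal sketch; it assumes a Hugging Face read token is available in the HF_TOKEN environment variable:
# Optional: verify that your account has been granted access to the gated repo.
# Assumes huggingface_hub is installed (Step 11) and HF_TOKEN holds a read token.
import os
from huggingface_hub import model_info

info = model_info("facebook/cwm", token=os.environ.get("HF_TOKEN"))
print("Access granted to:", info.id)  # raises a gated-repo error if access is still pending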
Step 2: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 3: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 4: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 5: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 6: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Facebook CWM, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Facebook CWM
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Facebook CWM.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Facebook CWM runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 7: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 8: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, If you want to check the GPU details, run the command below:
nvidia-smi
Step 9: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we upgrade it)
Check the available Python version:
python3 --version
The system ships with Python 3.10.12 by default. To install a higher version of Python, you'll need the deadsnakes PPA. Run the following commands to add it:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 10: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 11: Install Hugging Face Hub & authenticate (for gated CWM)
Install / upgrade the CLI
python -m pip install -U huggingface_hub
Log in to Hugging Face
huggingface-cli login
Paste your Access Token from Settings → Access Tokens (the token must have access to facebook/cwm; “Read” scope is sufficient for downloads).
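Optionally, you can pre-download the model weights now so the first vLLM launch doesn't stall on a large download. A minimal sketch using the huggingface_hub API:
# Optional: pre-fetch the gated CWM weights into the local Hugging Face cache.
# Requires that you are logged in (huggingface-cli login) and access has been granted.
from huggingface_hub import snapshot_download

local_path = snapshot_download("facebook/cwm")  # downloads to ~/.cache/huggingface by default
print("Model files cached at:", local_path)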
Step 12: Install vLLM & Transformers
Now that Hugging Face Hub is set up, install vLLM (for serving) and Transformers (for integration):
pip install -U vllm transformers
- vllm → high-throughput inference engine for large LLMs like CWM.
- transformers → Hugging Face library for model/tokenizer support.
This ensures your environment can both serve the CWM model via vLLM and interact with it programmatically.
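As an optional sanity check, you can verify that the GPU is visible to PyTorch (installed as a vLLM dependency) and that the gated CWM tokenizer can be fetched with your Hugging Face login. A small sketch, assuming your access request has already been approved:
# Sanity check: GPU visibility and tokenizer access for the gated repo.
import torch
from transformers import AutoTokenizer

print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
# Loads only the tokenizer files (a small download); fails if access is not granted.
tok = AutoTokenizer.from_pretrained("facebook/cwm")
print("Tokenizer loaded, vocab size:", tok.vocab_size)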
Step 13: Serve CWM with vLLM (start the API)
Run the server:
vllm serve facebook/cwm \
--tensor-parallel-size 1 \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
You should see logs like “Resolved architecture… Using max model len 32768 … tokenizer.json: 100%”.
When it finishes loading, the OpenAI-compatible endpoint is live at http://<YOUR_IP>:8000/v1/.
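Before the chat smoke test in the next step, you can confirm the server is up by listing the models it serves; vLLM's OpenAI-compatible API exposes this at /v1/models. A minimal check using only the Python standard library:
# Quick liveness check: the OpenAI-compatible server lists its models at /v1/models.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8000/v1/models", timeout=10) as resp:
    data = json.load(resp)
for m in data.get("data", []):
    print("Serving model:", m["id"])   # expect "facebook/cwm"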
Step 14: Quick Smoke Test
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/cwm",
"messages": [
{"role":"system","content":"You are a helpful AI assistant. You always reason before responding, using the format:\n<think>\n...\n</think>\nresponse"},
{"role":"user","content":"Write a haiku about recursion."}
],
"chat_template_kwargs": {"enable_thinking": true, "preserve_previous_think": true}
}'
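You can make the same request from Python through the OpenAI-compatible API. A minimal sketch using the openai client (install it with pip install openai; the api_key value is a placeholder, since vLLM does not require one by default):
# Minimal chat call against the local vLLM server via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")  # key is unused by default

resp = client.chat.completions.create(
    model="facebook/cwm",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant. You always reason before "
                                      "responding, using the format:\n<think>\n...\n</think>\nresponse"},
        {"role": "user", "content": "Write a haiku about recursion."},
    ],
    # Passed through to the chat template, same as the curl example above
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "preserve_previous_think": True}},
)
print(resp.choices[0].message.content)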
Step 15: Install Streamlit and Requests
Run the following commands to install streamlit and requests:
pip install -U streamlit requests
Step 16: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Facebook CWM model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 17: Create the Script
Create a file (e.g., app.py) and add the following code:
import os, requests, json, streamlit as st

st.set_page_config(page_title="CWM via vLLM", page_icon="🧠")

# vLLM's OpenAI-compatible endpoint and the model name it serves
OPENAI_BASE = os.environ.get("OPENAI_API_BASE", "http://localhost:8000/v1")
MODEL_NAME = os.environ.get("MODEL_NAME", "facebook/cwm")

st.title("🧠 Code World Model — Streamlit Chat")

with st.sidebar:
    system_prompt = st.text_area("System Prompt", "You are a helpful AI assistant.", height=120)
    enable_think = st.checkbox("Enable thinking mode", value=False)
    preserve_prev = st.checkbox("Preserve previous <think>", value=False)
    show_think = st.checkbox("Show <think> content", value=False)

# Initialize the chat history with the system prompt
if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": system_prompt}]

# Replay previous user/assistant turns
for m in st.session_state.get("messages", []):
    if m["role"] in ("user", "assistant"):
        with st.chat_message(m["role"]):
            st.markdown(m["content"])

def chat_request():
    payload = {
        "model": MODEL_NAME,
        "messages": st.session_state.messages,
        "chat_template_kwargs": {
            "enable_thinking": bool(enable_think),
            "preserve_previous_think": bool(preserve_prev),
        },
    }
    r = requests.post(f"{OPENAI_BASE}/chat/completions",
                      headers={"Content-Type": "application/json"},
                      data=json.dumps(payload), timeout=300)
    r.raise_for_status()
    content = r.json()["choices"][0]["message"]["content"]
    # Hide the <think> block unless the user opted to see it
    if not show_think and "</think>" in content:
        content = content.split("</think>", 1)[-1].lstrip()
    return content

user_input = st.chat_input("Message")
if user_input:
    st.session_state.messages.append({"role": "user", "content": user_input})
    with st.chat_message("user"):
        st.markdown(user_input)
    with st.chat_message("assistant"):
        with st.spinner("Thinking…"):
            reply = chat_request()
            st.markdown(reply)
    st.session_state.messages.append({"role": "assistant", "content": reply})
Step 18: Launch the Streamlit UI
Run Streamlit
streamlit run app.py --server.address 0.0.0.0 --server.port 7861
Step 19: Access the Streamlit App
Access the Streamlit app in your browser at:
http://<YOUR_VM_IP>:7861/ (use your VM's public IP, or forward the port over SSH).
Play with the Model
Conclusion
The Code World Model (CWM) is more than just another large code LLM — it’s a research-first system designed to reason about how code interacts with real computational environments. By combining execution traces, memory trajectories, and reinforcement learning for verifiable tasks, CWM stands out as a powerful tool for agentic code reasoning and multi-turn software engineering.
With this guide, you now have everything you need to set up CWM on a GPU-powered VM, serve it with vLLM, and interact with it through a simple UI. While it’s released under a non-commercial license and isn’t meant as a general chatbot, CWM provides researchers with a unique opportunity to explore the future of reasoning-driven code models.