UserLM-8b is Microsoft’s open-weight large language model uniquely designed to simulate the “user” role in conversations. Unlike most LLMs that play the assistant role, UserLM-8b was fine-tuned on the WildChat-1M dataset to generate realistic user utterances. This makes it particularly useful for evaluating assistant LLMs, synthetic data generation, and research on user behavior modeling.
Built on top of Llama-3.1-8B-Base, the model was fully fine-tuned with 227 hours of training on NVIDIA RTX A6000 GPUs. UserLM-8b can:
- Generate first-turn user queries given a task intent.
- Simulate multi-turn follow-up responses across long conversations.
- Signal the natural end of a conversation with a special <|endconversation|> token.
In Microsoft's evaluations, UserLM-8b achieves lower perplexity, stronger distributional alignment, and more realistic conversational diversity than assistant-based user simulators. While not designed as an assistant model, UserLM-8b helps researchers stress-test assistants under a wide range of conversational conditions, making it a valuable tool for robustness and evaluation studies.
GPU Configuration Table for UserLM-8B
| Scenario | Precision | Min VRAM (works) | Comfortable VRAM | Example GPUs | Notes |
|---|---|---|---|---|---|
| Single-GPU (standard inference) | FP16/BF16 | 16 GB | 24–32 GB | RTX 4090 (24 GB), A6000 (48 GB) | Handles typical inference with moderate context (2k tokens). |
| Single-GPU (efficient inference, quantized) | INT4/INT8 | 8–12 GB | 16 GB | RTX 3060 (12 GB), RTX 3090 (24 GB) | Quantization allows running on consumer GPUs with smaller VRAM. |
| Research / Training (full fine-tuning) | FP32 | 40–48 GB | 48 GB+ | A6000 (48 GB), A100 80 GB, H100 80 GB | Matches Microsoft's training setup (4× A6000, 227 hrs). |
| Multi-GPU (sharded inference) | FP16/BF16 | 8–12 GB per GPU | 16–24 GB per GPU | 2× RTX 3090, 2× L40S | Useful for scaling batch sizes or larger contexts beyond 2k tokens. |
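If you are unsure which row applies to your machine, a quick way to check the GPUs and VRAM visible to PyTorch (assuming PyTorch is already installed, as in Step 10 below) is a short Python snippet like this:

import torch

# List every CUDA device PyTorch can see, with its total VRAM,
# so you can match your hardware against the table above.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA-capable GPU detected.")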
Resources
Link: https://huggingface.co/microsoft/UserLM-8b
Step-by-Step Process to Install & Run Microsoft UserLM-8B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to configure and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Microsoft UserLM-8B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Microsoft UserLM-8B.
- Compatibility with CUDA 12.1.1, which certain model operations require.
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Microsoft UserLM-8B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
These are the official CUDA and cuDNN images from gitlab.com/nvidia/cuda; the devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that the Microsoft UserLM-8B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Install Python 3.11 and Pip (the VM ships with Python 3.10, so we update it)
Run the following command to check the Python version currently available on the VM:
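python3 --version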
The system has Python 3.10.12 available by default. To install a newer version of Python, you'll need the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
apt update && apt install -y software-properties-common curl ca-certificates
add-apt-repository -y ppa:deadsnakes/ppa
apt update
Now, run the following commands to install Python 3.11, Pip and Wheel:
apt install -y python3.11 python3.11-venv python3.11-dev
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 --version
python3.11 -m pip --version
Step 9: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
python3.11 -m venv ~/.venvs/py311
source ~/.venvs/py311/bin/activate
python --version
pip --version
Step 10: Install PyTorch for CUDA
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio
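Optionally, a quick sanity check confirms that the CUDA build of PyTorch sees the GPU (run it inside the activated virtual environment):

import torch

# Print the installed PyTorch build, CUDA availability, and the first GPU's name.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))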
Step 11: Install the Utilities
Run the following command to install the utilities (bitsandbytes is included because the 4-bit quantized loading in the scripts below depends on it):
pip install -U "transformers>=4.44" accelerate bitsandbytes huggingface_hub hf_transfer
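Optionally, you can pre-fetch the model weights so the first script run doesn't stall on the download. A minimal sketch using huggingface_hub (setting HF_HUB_ENABLE_HF_TRANSFER simply switches downloads to the faster hf_transfer backend installed above):

import os

# Optional: use the faster hf_transfer download backend.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Pull the full microsoft/UserLM-8b repository into the local Hugging Face cache.
snapshot_download(repo_id="microsoft/UserLM-8b")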
Step 12: Connect to Your GPU VM with a Code Editor
Before you start running scripts with the Microsoft UserLM-8B model, it's a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 13: Create userlm_first_turn.py and test a first user utterance
- Create the script: userlm_first_turn.py
- Add the following code:
import torch, os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "microsoft/UserLM-8b"

# Choose ONE of the following:

# A) 4-bit quantized load (fits on ~6–8 GB VRAM)
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
quant_args = dict(quantization_config=bnb_cfg, device_map="auto")

# B) 8-bit quantized load (fits on ~10–12 GB VRAM)
# bnb_cfg = BitsAndBytesConfig(load_in_8bit=True)
# quant_args = dict(quantization_config=bnb_cfg, device_map="auto")

# C) Full-precision/bfloat16 (needs large VRAM; downloads ~32 GB unless using a quantized repo)
# quant_args = dict(torch_dtype=torch.bfloat16, device_map="auto")

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True, **quant_args)

# The model expects a single "task intent" as a system message.
messages = [{
    "role": "system",
    "content": "You are a user trying to book an affordable, pet-friendly hotel in Bengaluru for 2 nights near Indiranagar."
}]

inputs = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)

end_token = "<|eot_id|>"
end_conv_token = "<|endconversation|>"
end_token_id = tok.encode(end_token, add_special_tokens=False)
end_conv_token_id = tok.encode(end_conv_token, add_special_tokens=False)

with torch.no_grad():
    out = model.generate(
        input_ids=inputs,
        do_sample=True, top_p=0.8, temperature=1.0,
        max_new_tokens=96,
        eos_token_id=end_token_id,
        pad_token_id=tok.eos_token_id,
        # Guardrail: avoid prematurely ending the whole conversation
        bad_words_ids=[[tid] for tid in end_conv_token_id],
    )

print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
- Run it:
python userlm_first_turn.py
What This Script Does
- Loads UserLM-8B with a VRAM-friendly 4-bit quantized config.
- Builds a single-turn “task intent” as a system message using the model’s chat template.
- Generates a first user utterance that matches the intent (ends at <|eot_id|>).
- Blocks the <|endconversation|> token so it won’t prematurely end the dialogue.
- Prints the simulated user’s first message, ready to feed into your assistant.
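Because sampling is enabled (do_sample=True), re-running the script produces different phrasings of the same intent. If you want several candidate first turns in one go, e.g. for synthetic data generation, a minimal sketch that reuses the tok, model, inputs, end_token_id, and end_conv_token_id objects defined in the script above could look like this:

# Sketch: sample a few first-turn variants for the same task intent.
# Assumes tok, model, inputs, end_token_id, and end_conv_token_id from
# userlm_first_turn.py are already defined in the same script/session.
first_turns = []
for seed in range(3):
    torch.manual_seed(seed)  # change the seed to change the sample
    with torch.no_grad():
        out = model.generate(
            input_ids=inputs,
            do_sample=True, top_p=0.8, temperature=1.0,
            max_new_tokens=96,
            eos_token_id=end_token_id,
            pad_token_id=tok.eos_token_id,
            bad_words_ids=[[tid] for tid in end_conv_token_id],
        )
    first_turns.append(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

for i, text in enumerate(first_turns):
    print(f"[variant {i}] {text}")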
Step 14: Create userlm_next_turn.py (patched) and run a follow-up user turn
1. Create the script: userlm_next_turn.py
2. Add the following code (note the closing parenthesis on the last print line and the explicit attention_mask):
# userlm_next_turn.py (patched)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "microsoft/UserLM-8b"

# 4-bit load (adjust as you like)
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, quantization_config=bnb_cfg, device_map="auto"
)

# Optional: ensure pad token (UserLM uses eos as pad)
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

messages = [
    {"role": "system", "content": "You are a user who wants to implement a Fibonacci-plus-1 sequence; first two numbers are 1 and 1."},
    {"role": "assistant", "content": "Hi! Could you clarify what language you want to use?"},
    {"role": "user", "content": "Python, please."},
    {"role": "assistant", "content": "Do you want an iterative or recursive implementation?"},
]

# Build inputs
inputs = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)
# Explicit attention mask (all ones since there is no padding)
attention_mask = torch.ones_like(inputs)

eot_id = tok.encode("<|eot_id|>", add_special_tokens=False)
endconv_id = tok.encode("<|endconversation|>", add_special_tokens=False)

with torch.no_grad():
    out = model.generate(
        input_ids=inputs,
        attention_mask=attention_mask,  # key line to silence the warning
        do_sample=True,
        top_p=0.85,
        temperature=0.9,
        max_new_tokens=96,
        min_new_tokens=8,               # avoid super-short blurts
        eos_token_id=eot_id,
        pad_token_id=tok.pad_token_id,
        bad_words_ids=[[tid] for tid in endconv_id],  # block conversation end
    )

print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
3. Run it:
python userlm_next_turn.py
What This Script Does
- Loads UserLM-8B on GPU with a 4-bit (bitsandbytes) quantized config to save VRAM.
- Builds a conversation history (system intent + assistant/user turns) using the model’s chat template.
- Adds an explicit attention_mask to avoid pad/EOS warnings and ensure correct generation.
- Generates the next simulated user turn, with guardrails: stop at <|eot_id|> and block <|endconversation|>.
- Prints the user’s follow-up message so you can feed it to your assistant in the next step.
Step 15: Create userlm_loop.py for multi-turn simulation and plug in your assistant later
1. Create the script: userlm_loop.py
2. Add the following code:
import torch, random
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "microsoft/UserLM-8b"

# 4-bit load; adjust if you prefer 8-bit / bf16
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, quantization_config=bnb_cfg, device_map="auto"
)

# make sure pad token exists
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

EOT = tok.encode("<|eot_id|>", add_special_tokens=False)
END_CONV = tok.encode("<|endconversation|>", add_special_tokens=False)

def next_user_turn(messages, allow_end=False, temp=0.9, top_p=0.85):
    """
    messages: list of {"role": "system"|"assistant"|"user", "content": "..."}
    returns: (user_text, ended)
    """
    inputs = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(inputs)
    gen_kwargs = dict(
        input_ids=inputs,
        attention_mask=attention_mask,
        do_sample=True,
        temperature=temp,
        top_p=top_p,
        max_new_tokens=120,
        min_new_tokens=16,
        eos_token_id=EOT,
        pad_token_id=tok.pad_token_id,
        repetition_penalty=1.05,
    )
    # block end-of-conversation unless allowed for this turn
    if not allow_end:
        gen_kwargs["bad_words_ids"] = [[tid] for tid in END_CONV]
    with torch.no_grad():
        out = model.generate(**gen_kwargs)
    text = tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True).strip()
    ended = any(tid in out[0] for tid in END_CONV) and allow_end
    return text, ended

if __name__ == "__main__":
    # seed for smoother reproducibility
    torch.manual_seed(42); random.seed(42)

    messages = [
        {"role": "system",
         "content": "You are a user who wants a Python function for a Fibonacci-plus-1 sequence (a[n]=a[n-1]+a[n-2]+1) with a[0]=1, a[1]=1."},
        {"role": "assistant", "content": "Hi! Could you clarify what language you want to use?"},
        {"role": "user", "content": "Python, please."},
        {"role": "assistant", "content": "Do you want an iterative or recursive implementation?"},
    ]

    for turn_idx in range(1, 8):
        # allow ending only after turn 3 (tweak as you like)
        allow_end = (turn_idx >= 3)
        user_text, ended = next_user_turn(messages, allow_end=allow_end)
        print(f"\n[User turn {turn_idx}] {user_text}")
        messages.append({"role": "user", "content": user_text})
        if ended:
            print("\n<conversation ended by simulator>")
            break

        # --- your assistant would generate a reply here ---
        # For demo, we’ll reply minimally and reflect constraints the simulator mentioned.
        assistant_reply = "Got it. I’ll implement a recursive version that supports very large integers using Python’s built-in big ints. Anything else?"
        print(f"[Assistant] {assistant_reply}")
        messages.append({"role": "assistant", "content": assistant_reply})
3. Run it:
python userlm_loop.py
What This Script Does
- Loads UserLM-8B (4-bit) and builds a running chat history.
- Generates the next simulated user turn each loop with next_user_turn(...).
- Blocks <|endconversation|> until a chosen turn, then allows graceful endings.
- Prints turns so you can inspect behavior or feed them into a real assistant.
- Provides easy knobs (temperature, top-p, min/max tokens) to shape user style.
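To plug in your assistant later, as the step title says, replace the hard-coded assistant_reply string inside the loop with a call to a real assistant model. Below is a minimal sketch under the assumption that you load a separate instruction-tuned chat model; Qwen/Qwen2.5-7B-Instruct and the assistant_reply_fn helper are illustrative choices, not something the tutorial above prescribes:

# Sketch: generate assistant replies with a real model instead of a fixed string.
# The assistant model below is an illustrative assumption; any chat model works.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ASSISTANT_ID = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical choice
a_tok = AutoTokenizer.from_pretrained(ASSISTANT_ID)
a_model = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def assistant_reply_fn(messages):
    # The user simulator's system "task intent" describes the user's goal,
    # so it is dropped before handing the history to the assistant.
    history = [m for m in messages if m["role"] != "system"]
    inputs = a_tok.apply_chat_template(
        history, add_generation_prompt=True, return_tensors="pt"
    ).to(a_model.device)
    with torch.no_grad():
        out = a_model.generate(
            inputs,
            attention_mask=torch.ones_like(inputs),
            max_new_tokens=200,
            do_sample=False,
        )
    return a_tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True).strip()

# In userlm_loop.py, replace the fixed string with:
# assistant_reply = assistant_reply_fn(messages)

Note that running two 8B-class models on one GPU may require quantizing one or both of them, exactly as the scripts above do for UserLM-8B.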
Conclusion
Microsoft’s UserLM-8B stands out as a research-first model that flips the script — instead of being an assistant, it simulates the user role in conversations. By fine-tuning on the WildChat-1M dataset, it provides researchers and developers with a powerful way to stress-test assistants, generate synthetic dialogue data, and study user behavior patterns.
With the step-by-step setup guide, GPU configuration table, and ready-to-run scripts (userlm_first_turn.py, userlm_next_turn.py, and userlm_loop.py), you now have everything you need to spin up UserLM-8B locally on a GPU VM and start experimenting.
If you’re building or evaluating assistant LLMs, UserLM-8B offers a more realistic, diverse, and challenging simulation environment, helping you uncover edge cases and strengthen the robustness of your systems. It’s not just a model—it’s a testbed for better assistants.