DeepSeek-V3.1 is the latest upgrade in the DeepSeek family, designed as a hybrid reasoning model supporting both thinking and non-thinking modes. Unlike earlier versions, it integrates smarter tool-calling, higher efficiency in structured reasoning, and long-context handling up to 128K tokens.
It was further trained on an additional 630B + 209B tokens for long-context extension using the UE8M0 FP8 scale format, making it compatible with modern microscaling approaches. Benchmarks show major jumps in math, coding, reasoning, and agent-style tasks, with results competitive with DeepSeek-R1 while being more efficient.
The GGUF quants by Unsloth ship with fixed chat templates for llama.cpp backends (--jinja is required) and come with recommended runtime settings (temperature = 0.6, top_p = 0.95).
Evaluation
| Category | Benchmark (Metric) | DeepSeek V3.1-NonThinking | DeepSeek V3 0324 | DeepSeek V3.1-Thinking | DeepSeek R1 0528 |
|---|---|---|---|---|---|
| General | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity’s Last Exam (Pass@1) | – | – | 15.9 | 17.7 |
| Search Agent | BrowseComp | – | – | 30.0 | 8.9 |
| | BrowseComp_zh | – | – | 49.2 | 35.7 |
| | Humanity’s Last Exam (Python + Search) | – | – | 29.8 | 24.8 |
| | SimpleQA | – | – | 93.4 | 92.3 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | – | – | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | SWE Verified (Agent mode) | 66.0 | 45.4 | – | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | – | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | – | 5.7 |
| Math | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |
GPU Configuration Table for DeepSeek-V3.1-GGUF
| Scenario | GPUs | VRAM / GPU | Total VRAM | Context Length | Precision | Disk (Min → Rec) | System RAM | Notes |
|---|---|---|---|---|---|---|---|---|
| Production (UD-Q2_K_XL Quant) | 8× NVIDIA H200 | 141 GB | 1.13 TB | 128K | FP8 (microscaling) | 500 GB → 1 TB | 128–256 GB | Best accuracy, recommended for enterprise workloads |
| High-end Research (FP8) | 8× NVIDIA H100 | 80 GB | 640 GB | 128K | FP8 | 500 GB → 1 TB | 128–192 GB | Minimum recommended setup for full-context runs |
| Optimized Quant (Q4_K_M / Q5_0) | 4× NVIDIA A100 | 80 GB | 320 GB | 128K | INT4 / INT5 | 350 GB → 700 GB | 96–128 GB | Works with smaller quants, slower for long-context |
| Single-node Testing (Q2_K) | 1× NVIDIA RTX 6000 Ada / A6000 | 48 GB | 48 GB | 32K–64K | INT2 | 200 GB | 64–96 GB | For experimentation only, reduced accuracy |
| CPU-only (Not recommended) | – | – | – | ≤8K | INT2 | 500 GB+ | 256 GB+ | Extremely slow, only for validation |
Recommendation: If you want to actually use DeepSeek-V3.1 in production or research, go with 8× H200 (141 GB each) for UD-Q2_K_XL quant. For minimum viable large-context usage, 8× H100 (80 GB each) is acceptable. Smaller quants (Q4/Q5) make it usable on 4× A100s or a single A6000, but with reduced reasoning fidelity.
Step-by-Step Process to Install & Run Unsloth DeepSeek-V3.1-GGUF Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 4× H200 GPUs for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Unsloth DeepSeek-V3.1-GGUF, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Unsloth DeepSeek-V3.1-GGUF
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations, which is perfect for installing dependencies, running benchmarks, and launching tools like Unsloth DeepSeek-V3.1-GGUF.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that the Unsloth DeepSeek-V3.1-GGUF runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the currently available Python version:
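python3 --version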
If you check the Python version, the system has Python 3.8.1 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv deepseek
source deepseek/bin/activate
Step 13: Build llama.cpp (CUDA on)
Run the following commands to build llama.cpp with CUDA enabled:
apt-get update
apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev git
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
Step 14: Grab the Recommended Unsloth Quant
Run the following commands to install the Hugging Face CLI and grab the recommended Unsloth quant:
pip install -U "huggingface_hub[cli]"
mkdir -p ~/models/deepseek-v3.1 && cd ~/models/deepseek-v3.1
huggingface-cli download unsloth/DeepSeek-V3.1-GGUF \
--include "DeepSeek-V3.1-UD-Q2_K_XL.gguf" \
--local-dir . --local-dir-use-symlinks False
Step 15: Download the Model
Run the following command to download the model:
cd ~/models/deepseek-v3.1
hf download unsloth/DeepSeek-V3.1-GGUF \
--include "UD-Q2_K_XL/*" \
--local-dir .
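If you prefer to stay inside Python, the same shards can also be fetched with the huggingface_hub API instead of the CLI. A minimal sketch, assuming the same target directory as above:
import os
from huggingface_hub import snapshot_download

# Download only the UD-Q2_K_XL shards from the Unsloth GGUF repo
snapshot_download(
    repo_id="unsloth/DeepSeek-V3.1-GGUF",
    allow_patterns=["UD-Q2_K_XL/*"],
    local_dir=os.path.expanduser("~/models/deepseek-v3.1"),
)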
Step 16: Run Model Directly from the Shards
Run the model directly from the shards:
~/llama.cpp/build/bin/llama-server \
-m ~/models/deepseek-v3.1/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
--host 0.0.0.0 --port 8080 \
--ctx-size 32768 --jinja -np 2 \
--temp 0.6 --top-p 0.95
This will start the server on port 8080.
Step 17: Quick Tests and Run Prompts
Non-thinking (default)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer sk-123" \
-d '{
"model":"deepseek-v3.1",
"temperature":0.6, "top_p":0.95,
"messages":[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Explain KV cache in 2 lines."}
]
}'
Thinking
(Seed a thinking turn before your question.)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer sk-123" \
-d '{
"model":"deepseek-v3.1",
"temperature":0.6, "top_p":0.95,
"messages":[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Who are you?"},
{"role":"assistant","content":"<think>"},
{"role":"user","content":"1+1 = ? Keep it brief."}
]
}'
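If you would rather script these checks than type raw curl, here is a minimal Python sketch of the same two calls. It assumes the llama-server from Step 16 is still listening on localhost:8080; the ask helper simply mirrors the payloads above, seeding a <think> assistant turn when thinking mode is enabled.
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # llama-server from Step 16

def ask(prompt, thinking=False):
    # Build the message list; the <think> assistant turn mirrors the curl example above.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if thinking:
        messages += [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "<think>"},
        ]
    messages.append({"role": "user", "content": prompt})
    resp = requests.post(
        API_URL,
        headers={"Authorization": "Bearer sk-123"},
        json={
            "model": "deepseek-v3.1",
            "temperature": 0.6,
            "top_p": 0.95,
            "messages": messages,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("Explain KV cache in 2 lines."))           # non-thinking (default)
print(ask("1+1 = ? Keep it brief.", thinking=True))  # thinking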
Up to this point, we have been interacting with the DeepSeek-V3.1 model directly through the terminal, using the curl command to send prompts and receive responses. This allowed us to test basic completions, streaming outputs, and verify that the model was running correctly via the llama-server API on port 8080. Now, we are moving one step further and setting up a Streamlit-based browser interface. This UI will make it easier and more interactive to chat with the model directly from the browser, including toggling Thinking Mode and adjusting temperature, top-p, context size, and other settings, all without manually entering API calls in the terminal.
Step 18: Connect to Your GPU VM with a Code Editor
Before you start running streamlit scripts with the DeepSeek-V3.1-GGUF models, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 19: Create the Streamlit App Script (app.py)
We’ll write a full Streamlit UI that lets you chat with the model and generate responses in the browser.
Create app.py in your VM (inside your project folder) and add the following code:
import os, json, time
import requests
import streamlit as st

st.set_page_config(page_title="DeepSeek-V3.1 (llama.cpp)", page_icon="🦙", layout="wide")

# --- Sidebar: connection & settings ---
st.sidebar.title("Server & Settings")
base_url = st.sidebar.text_input(
    "llama.cpp API base URL",
    value=os.getenv("LLAMA_API_BASE", "http://localhost:8080/v1"),
    help="Your llama-server endpoint (OpenAI-compatible).",
)
api_key = st.sidebar.text_input(
    "API key (if any)", value=os.getenv("LLAMA_API_KEY", "sk-anything"), type="password"
)
model = st.sidebar.text_input("Model name", value="deepseek-v3.1")
stream = st.sidebar.checkbox("Stream output", value=True)
thinking = st.sidebar.checkbox(
    "Enable Thinking Mode", value=False,
    help="Uses Unsloth Jinja template to switch to <think> mode.",
)
temperature = st.sidebar.slider("Temperature", 0.0, 1.5, 0.6, 0.05)
top_p = st.sidebar.slider("Top-P", 0.0, 1.0, 0.95, 0.01)
max_tokens = st.sidebar.number_input("Max tokens", 16, 16384, 1024, 16)

# --- Session state for conversation ---
if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": "You are a helpful assistant."}]

st.title("🦙 DeepSeek-V3.1 (GGUF via llama.cpp)")
st.caption("Chat UI for your local llama-server. Toggle Thinking mode on the left.")

# --- Display history ---
for m in st.session_state.messages:
    if m["role"] == "user":
        with st.chat_message("user"):
            st.markdown(m["content"])
    elif m["role"] == "assistant":
        with st.chat_message("assistant"):
            st.markdown(m["content"])

# --- Compose input ---
prompt = st.chat_input("Type your prompt…")

def post_chat(messages, enable_thinking, stream=False):
    url = f"{base_url}/chat/completions" if base_url.endswith("/v1") else f"{base_url}/v1/chat/completions"
    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"}
    # llama.cpp understands OpenAI-style payloads. Unsloth’s Jinja template in the GGUF
    # checks 'enable_thinking' and uses an assistant prefix turn to flip <think>/</think>.
    payload = {
        "model": model,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "max_tokens": int(max_tokens),
        "messages": messages.copy(),
    }
    if enable_thinking:
        payload["enable_thinking"] = True
        payload["messages"].append({"role": "assistant", "prefix": True})
    if stream:
        payload["stream"] = True
        with requests.post(url, headers=headers, data=json.dumps(payload), stream=True, timeout=300) as r:
            r.raise_for_status()
            full = ""
            for line in r.iter_lines(decode_unicode=True):
                if not line:
                    continue
                if line.startswith("data: "):
                    data = line[6:]
                else:
                    data = line
                if data.strip() == "[DONE]":
                    break
                try:
                    chunk = json.loads(data)
                    delta = chunk["choices"][0]["delta"].get("content", "")
                    if delta:
                        full += delta
                        yield delta
                except Exception:
                    # non-chunk line; ignore
                    pass
            yield {"__full__": full}
    else:
        resp = requests.post(url, headers=headers, json=payload, timeout=600)
        resp.raise_for_status()
        out = resp.json()["choices"][0]["message"]["content"]
        # post_chat is a generator (it contains yield), so hand back the final text
        # the same way the streaming path does instead of returning it.
        yield {"__full__": out}

# --- Handle submit ---
if prompt:
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        if stream:
            spot = st.empty()
            acc = ""
            for piece in post_chat(st.session_state.messages, thinking, stream=True):
                if isinstance(piece, dict) and "__full__" in piece:
                    acc = piece["__full__"]
                    break
                acc += piece
                spot.markdown(acc)
            st.session_state.messages.append({"role": "assistant", "content": acc})
        else:
            out = ""
            for piece in post_chat(st.session_state.messages, thinking, stream=False):
                if isinstance(piece, dict) and "__full__" in piece:
                    out = piece["__full__"]
            st.markdown(out)
            st.session_state.messages.append({"role": "assistant", "content": out})

# --- Utilities ---
with st.sidebar.expander("Utilities"):
    if st.button("🔄 New chat"):
        st.session_state.messages = [{"role": "system", "content": "You are a helpful assistant."}]
        st.rerun()
    st.write("Tip: Start `llama-server` with large ctx & GPU offload for best perf.")
Step 20: Create the requirements.txt File
Create a requirements.txt file and add the following packages:
streamlit==1.37.1
requests==2.32.3
Step 21: Install Dependencies
Run the following command to install dependencies:
pip install -r requirements.txt
Step 22: Run It
Run the server with the following command:
streamlit run app.py --server.port 7860 --server.headless true
Once executed, Streamlit will start the web server and you’ll see a message:
You can now view your Streamlit app in your browser.
Local URL: http://localhost:7860
Network URL: http://172.17.0.2:7860
External URL: http://50.222.102.252:7860
Step 23: Access the Streamlit App in Browser
After launching the app, you’ll see the interface in your browser.
http://localhost:7860
Enter prompts and generate responses.
Conclusion
DeepSeek-V3.1 is a next-generation hybrid reasoning model that combines thinking and non-thinking modes, offering exceptional performance in math, coding, tool usage, and agent-based tasks. With support for 128K context length, smarter tool-calling, and optimized GGUF quantization from Unsloth, it delivers enterprise-grade efficiency and accuracy.
We initially interacted with the model from the terminal, using curl and llama-server to test completions and streaming outputs. Later, we integrated a Streamlit-based chat UI, enabling a clean, browser-friendly interface to communicate with the model, toggle Thinking Mode, and adjust parameters like temperature, top-p, and context size effortlessly.
With its flexibility, speed, and scalability, DeepSeek-V3.1 is well-suited for research, production, and advanced reasoning workloads, especially when deployed on powerful multi-GPU systems like H200 or H100 clusters.