DeepSeek-V3.1 is the latest upgrade in the DeepSeek family, designed as a hybrid reasoning model supporting both thinking and non-thinking modes. Unlike earlier versions, it integrates smarter tool-calling, higher efficiency in structured reasoning, and long-context handling up to 128K tokens.
It was further trained on an additional 630B + 209B tokens for long-context extension using the UE8M0 FP8 scale format, making it compatible with modern microscaling approaches. Benchmarks show major jumps in math, coding, reasoning, and agent-style tasks, with results competitive with DeepSeek-R1 while being more efficient.
The GGUF quants by Unsloth ship with fixed chat templates for llama.cpp backends (--jinja is required) and come with recommended runtime settings (temperature = 0.6, top_p = 0.95).
Evaluation
| Category | Benchmark (Metric) | DeepSeek V3.1-NonThinking | DeepSeek V3 0324 | DeepSeek V3.1-Thinking | DeepSeek R1 0528 |
|---|---|---|---|---|---|
| General | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity’s Last Exam (Pass@1) | – | – | 15.9 | 17.7 |
| Search Agent | BrowseComp | – | – | 30.0 | 8.9 |
| | BrowseComp_zh | – | – | 49.2 | 35.7 |
| | Humanity’s Last Exam (Python + Search) | – | – | 29.8 | 24.8 |
| | SimpleQA | – | – | 93.4 | 92.3 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | – | – | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | SWE Verified (Agent mode) | 66.0 | 45.4 | – | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | – | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | – | 5.7 |
| Math | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |
GPU Configuration Table for DeepSeek-V3.1-GGUF
| Scenario | GPUs | VRAM / GPU | Total VRAM | Context Length | Precision | Disk (Min → Rec) | System RAM | Notes |
|---|---|---|---|---|---|---|---|---|
| Production (UD-Q2_K_XL Quant) | 8× NVIDIA H200 | 141 GB | 1.13 TB | 128K | FP8 (microscaling) | 500 GB → 1 TB | 128–256 GB | Best accuracy, recommended for enterprise workloads |
| High-end Research (FP8) | 8× NVIDIA H100 | 80 GB | 640 GB | 128K | FP8 | 500 GB → 1 TB | 128–192 GB | Minimum recommended setup for full-context runs |
| Optimized Quant (Q4_K_M / Q5_0) | 4× NVIDIA A100 | 80 GB | 320 GB | 128K | INT4 / INT5 | 350 GB → 700 GB | 96–128 GB | Works with smaller quants, slower for long-context |
| Single-node Testing (Q2_K) | 1× NVIDIA RTX 6000 Ada / A6000 | 48 GB | 48 GB | 32K–64K | INT2 | 200 GB | 64–96 GB | For experimentation only, reduced accuracy |
| CPU-only (Not recommended) | – | – | – | ≤8K | INT2 | 500 GB+ | 256 GB+ | Extremely slow, only for validation |
Recommendation: If you want to actually use DeepSeek-V3.1 in production or research, go with 8× H200 (141 GB each) for UD-Q2_K_XL quant. For minimum viable large-context usage, 8× H100 (80 GB each) is acceptable. Smaller quants (Q4/Q5) make it usable on 4× A100s or a single A6000, but with reduced reasoning fidelity.
Step-by-Step Process to Install & Run Unsloth DeepSeek-V3.1-GGUF Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 4× H200 GPUs for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Unsloth DeepSeek-V3.1-GGUF, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Unsloth DeepSeek-V3.1-GGUF
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations, which is perfect for installing dependencies, running benchmarks, and launching tools like Unsloth DeepSeek-V3.1-GGUF.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that the Unsloth DeepSeek-V3.1-GGUF runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the currently available Python version:
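python3 --version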
If you check the Python version, the system has Python 3.8.1 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv deepseek
source deepseek/bin/activate
Step 13: Build llama.cpp (CUDA on)
Run the following commands to build llama.cpp with CUDA enabled:
apt-get update
apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev git
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
Step 14: Grab the Recommended Unsloth Quant
Run the following commands to install the Hugging Face CLI and grab the recommended Unsloth quant:
pip install -U "huggingface_hub[cli]"
mkdir -p ~/models/deepseek-v3.1 && cd ~/models/deepseek-v3.1
huggingface-cli download unsloth/DeepSeek-V3.1-GGUF \
--include "DeepSeek-V3.1-UD-Q2_K_XL.gguf" \
--local-dir . --local-dir-use-symlinks False
Step 15: Download the Model
Run the following command to download the model:
cd ~/models/deepseek-v3.1
hf download unsloth/DeepSeek-V3.1-GGUF \
--include "UD-Q2_K_XL/*" \
--local-dir .
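If you prefer to stay inside Python, the same shards can also be fetched with the huggingface_hub API instead of the CLI. A minimal sketch, assuming the same target directory as above:
import os
from huggingface_hub import snapshot_download

# Download only the UD-Q2_K_XL shards from the Unsloth GGUF repo
snapshot_download(
    repo_id="unsloth/DeepSeek-V3.1-GGUF",
    allow_patterns=["UD-Q2_K_XL/*"],
    local_dir=os.path.expanduser("~/models/deepseek-v3.1"),
)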
Step 16: Run Model Directly from the Shards
Run the model directly from the shards:
~/llama.cpp/build/bin/llama-server \
-m ~/models/deepseek-v3.1/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
--host 0.0.0.0 --port 8080 \
--ctx-size 32768 --jinja -np 2 \
--temp 0.6 --top-p 0.95
This will start the server on port 8080.
Step 17: Quick Tests and Run Prompts
Non-thinking (default)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer sk-123" \
-d '{
"model":"deepseek-v3.1",
"temperature":0.6, "top_p":0.95,
"messages":[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Explain KV cache in 2 lines."}
]
}'
Thinking
(Seed a thinking turn before your question.)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer sk-123" \
-d '{
"model":"deepseek-v3.1",
"temperature":0.6, "top_p":0.95,
"messages":[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Who are you?"},
{"role":"assistant","content":"<think>"},
{"role":"user","content":"1+1 = ? Keep it brief."}
]
}'
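If you would rather script these checks than type raw curl, here is a minimal Python sketch of the same two calls. It assumes the llama-server from Step 16 is still listening on localhost:8080; the ask helper simply mirrors the payloads above, seeding a <think> assistant turn when thinking mode is enabled.
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # llama-server from Step 16

def ask(prompt, thinking=False):
    # Build the message list; the <think> assistant turn mirrors the curl example above.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if thinking:
        messages += [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "<think>"},
        ]
    messages.append({"role": "user", "content": prompt})
    resp = requests.post(
        API_URL,
        headers={"Authorization": "Bearer sk-123"},
        json={
            "model": "deepseek-v3.1",
            "temperature": 0.6,
            "top_p": 0.95,
            "messages": messages,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("Explain KV cache in 2 lines."))           # non-thinking (default)
print(ask("1+1 = ? Keep it brief.", thinking=True))  # thinking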
Up to this point, we have been interacting with the DeepSeek-V3.1 model directly through the terminal, using the curl command to send prompts and receive responses. This allowed us to test basic completions, streaming outputs, and verify that the model was running correctly via the llama-server API on port 8080. Now, we are moving one step further and setting up a Streamlit-based browser interface. This UI will make it easier and more interactive to chat with the model directly from the browser, including toggling Thinking Mode and adjusting temperature, top-p, context size, and other settings, all without manually entering API calls in the terminal.
Step 18: Connect to Your GPU VM with a Code Editor
Before you start running streamlit scripts with the DeepSeek-V3.1-GGUF models, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 19: Create the Streamlit App Script (app.py)
We’ll write a full Streamlit UI that lets you chat with the model and generate responses in the browser.
Create app.py in your VM (inside your project folder) and add the following code:
import os, json, time
import requests
import streamlit as st

st.set_page_config(page_title="DeepSeek-V3.1 (llama.cpp)", page_icon="🦙", layout="wide")

# --- Sidebar: connection & settings ---
st.sidebar.title("Server & Settings")
base_url = st.sidebar.text_input(
    "llama.cpp API base URL",
    value=os.getenv("LLAMA_API_BASE", "http://localhost:8080/v1"),
    help="Your llama-server endpoint (OpenAI-compatible).",
)
api_key = st.sidebar.text_input(
    "API key (if any)", value=os.getenv("LLAMA_API_KEY", "sk-anything"), type="password"
)
model = st.sidebar.text_input("Model name", value="deepseek-v3.1")
stream = st.sidebar.checkbox("Stream output", value=True)
thinking = st.sidebar.checkbox(
    "Enable Thinking Mode", value=False,
    help="Uses Unsloth Jinja template to switch to <think> mode.",
)
temperature = st.sidebar.slider("Temperature", 0.0, 1.5, 0.6, 0.05)
top_p = st.sidebar.slider("Top-P", 0.0, 1.0, 0.95, 0.01)
max_tokens = st.sidebar.number_input("Max tokens", 16, 16384, 1024, 16)

# --- Session state for conversation ---
if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "system", "content": "You are a helpful assistant."}]

st.title("🦙 DeepSeek-V3.1 (GGUF via llama.cpp)")
st.caption("Chat UI for your local llama-server. Toggle Thinking mode on the left.")

# --- Display history ---
for m in st.session_state.messages:
    if m["role"] == "user":
        with st.chat_message("user"):
            st.markdown(m["content"])
    elif m["role"] == "assistant":
        with st.chat_message("assistant"):
            st.markdown(m["content"])

# --- Compose input ---
prompt = st.chat_input("Type your prompt…")

def post_chat(messages, enable_thinking, stream=False):
    url = f"{base_url}/chat/completions" if base_url.endswith("/v1") else f"{base_url}/v1/chat/completions"
    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"}
    # llama.cpp understands OpenAI-style payloads. Unsloth’s Jinja template in the GGUF
    # checks 'enable_thinking' and uses an assistant prefix turn to flip <think>/</think>.
    payload = {
        "model": model,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "max_tokens": int(max_tokens),
        "messages": messages.copy(),
    }
    if enable_thinking:
        payload["enable_thinking"] = True
        payload["messages"].append({"role": "assistant", "prefix": True})
    if stream:
        payload["stream"] = True
        with requests.post(url, headers=headers, data=json.dumps(payload), stream=True, timeout=300) as r:
            r.raise_for_status()
            full = ""
            for line in r.iter_lines(decode_unicode=True):
                if not line:
                    continue
                if line.startswith("data: "):
                    data = line[6:]
                else:
                    data = line
                if data.strip() == "[DONE]":
                    break
                try:
                    chunk = json.loads(data)
                    delta = chunk["choices"][0]["delta"].get("content", "")
                    if delta:
                        full += delta
                        yield delta
                except Exception:
                    # non-chunk line; ignore
                    pass
            yield {"__full__": full}
    else:
        resp = requests.post(url, headers=headers, json=payload, timeout=600)
        resp.raise_for_status()
        out = resp.json()["choices"][0]["message"]["content"]
        # post_chat is a generator (it contains yield), so hand back the final text
        # the same way the streaming path does instead of returning it.
        yield {"__full__": out}

# --- Handle submit ---
if prompt:
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        if stream:
            spot = st.empty()
            acc = ""
            for piece in post_chat(st.session_state.messages, thinking, stream=True):
                if isinstance(piece, dict) and "__full__" in piece:
                    acc = piece["__full__"]
                    break
                acc += piece
                spot.markdown(acc)
            st.session_state.messages.append({"role": "assistant", "content": acc})
        else:
            out = ""
            for piece in post_chat(st.session_state.messages, thinking, stream=False):
                if isinstance(piece, dict) and "__full__" in piece:
                    out = piece["__full__"]
            st.markdown(out)
            st.session_state.messages.append({"role": "assistant", "content": out})

# --- Utilities ---
with st.sidebar.expander("Utilities"):
    if st.button("🔄 New chat"):
        st.session_state.messages = [{"role": "system", "content": "You are a helpful assistant."}]
        st.rerun()
    st.write("Tip: Start `llama-server` with large ctx & GPU offload for best perf.")
Step 20: Create the requirements.txt File
Create a requirements.txt file and add the following packages:
streamlit==1.37.1
requests==2.32.3
Step 21: Install Dependencies
Run the following command to install dependencies:
pip install -r requirements.txt
Step 22: Run It
Run the server with the following command:
streamlit run app.py --server.port 7860 --server.headless true
Once executed, Streamlit will start the web server and you’ll see a message:
You can now view your Streamlit app in your browser.
Local URL: http://localhost:7860
Network URL: http://172.17.0.2:7860
External URL: http://50.222.102.252:7860
Step 23: Access the Streamlit App in Browser
After launching the app, you’ll see the interface in your browser.
http://localhost:7860
Enter prompts and generate responses.
Conclusion
DeepSeek-V3.1 is a next-generation hybrid reasoning model that combines thinking and non-thinking modes, offering exceptional performance in math, coding, tool usage, and agent-based tasks. With support for 128K context length, smarter tool-calling, and optimized GGUF quantization from Unsloth, it delivers enterprise-grade efficiency and accuracy.
We initially interacted with the model from the terminal, using curl and llama-server to test completions and streaming outputs. Later, we integrated a Streamlit-based chat UI, enabling a clean, browser-friendly interface to communicate with the model, toggle Thinking Mode, and adjust parameters like temperature, top-p, and context size effortlessly.
With its flexibility, speed, and scalability, DeepSeek-V3.1 is well-suited for research, production, and advanced reasoning workloads, especially when deployed on powerful multi-GPU systems like H200 or H100 clusters.