DeepSeek-V3.1 Terminus GGUF takes the capabilities of the acclaimed DeepSeek-V3.1 to the next level, offering a finely-tuned hybrid model designed for both reasoning and agentic tasks with remarkable precision. This update focuses on language consistency, significantly reducing mixed Chinese-English outputs and abnormal characters, while optimizing the performance of the Code Agent and Search Agent. Benchmark evaluations reflect notable improvements: from higher scores in MMLU-Pro (85.0), GPQA-Diamond (80.7), and Humanity’s Last Exam (21.7), to enhanced agentic tool use such as BrowseComp (38.5) and Terminal-bench (36.7). The search agent’s template and tool-set have also been refined, enabling more accurate, context-aware searches. With a robust structure identical to DeepSeek-V3.1 and an updated inference demo in the repository, the model empowers developers to run complex reasoning, coding, and search tasks locally, all while maintaining high efficiency through selective layer optimizations and precise attention mechanisms.
Getting started with DeepSeek-V3.1 Terminus GGUF is straightforward: this guide walks through demo code and step-by-step instructions so you can quickly set up the model locally using llama.cpp and the quants made available by Unsloth. Whether you are building chat agents, code assistants, or research tools, this release provides the enhanced performance, reliable outputs, and versatile templates needed to integrate cutting-edge AI into your projects with ease.
Prerequisites
The system requirements for running DeepSeek-V3.1 are:
- GPU: Multiple H100s or H200s (the count varies with the chosen quantization)
- Storage: 1TB+ (preferable)
- NVIDIA CUDA installed
- Anaconda set up
- Disk space requirements for each quantized variant are as follows:

Source: Unsloth
We recommend taking a screenshot of this chart and saving it so you can quickly look up the disk-space requirement before trying a specific quantized version.
For this article, we’ll download the 2.71-bit version (recommended).
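Once you're on the machine you plan to use, you can quickly confirm it meets these requirements with standard tools (output varies by system, and nvcc is only available once the CUDA toolkit is installed):
nvidia-smi
nvcc --version
df -h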
Step-by-step process to install DeepSeek-V3.1-Terminus GGUF
For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.
Step 1: Setting up a NodeShift Account
Visit app.nodeshift.com and create an account by filling in basic details, or continue signing up with your Google/GitHub account.
If you already have an account, log in straight to your dashboard.
Step 2: Create a GPU Node
After accessing your account, you should see a dashboard (see image). Now:
- Navigate to the menu on the left side.
- Click on the GPU Nodes option.
- Click on Start to start creating your very first GPU node.
These GPU nodes are GPU-powered virtual machines provided by NodeShift. They are highly customizable and let you control the environment configuration, from the GPU type (H100s, A100s, and more) to CPUs, RAM, and storage, according to your needs.
Step 3: Selecting configuration for GPU (model, region, storage)
- For this tutorial, we’ll be using a 2x H200 GPU; however, you can choose any GPU that meets the prerequisites.
- Similarly, we’ll opt for 5TB of storage by sliding the bar. You can also select the region where you want your GPU to reside from the available options.
Step 4: Choose GPU Configuration and Authentication method
1. After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 2x H200 140GB GPU node with 192vCPUs/504GB RAM/5TB SSD.
2. Next, you’ll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our official documentation.
Step 5: Choose an Image
The final step is to choose an image for the VM, which in our case is NVIDIA CUDA.
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for a CUDA-dependent workflow like building llama.cpp and running DeepSeek-V3.1-Terminus, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications
- Compatibility with CUDA 12.1.1 required by certain model operations
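If you want to verify this image on any Docker-capable machine before deploying, you can pull and test it yourself (optional; the --gpus all flag assumes the NVIDIA Container Toolkit is installed on that machine):
docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
docker run --rm --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 nvcc --version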
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
That’s it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click Create to deploy the node.
Step 6: Connect to active Compute Node using SSH
- As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status Running in green, meaning the compute node is ready to use!
- Once your GPU shows this status, navigate to the three dots on the right, click on Connect with SSH, and copy the SSH details that appear.
Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:
1. Open your terminal, paste the SSH command, and run it.
2. In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.
3. A prompt will request a password. Type the SSH password, and you should be connected.
Output:
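For reference, the copied SSH command typically follows a pattern like the one below (the key path, user, host, and port here are placeholders; use the exact command from your dashboard):
ssh -i ~/.ssh/<your-private-key> <user>@<node-ip> -p <port>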
Step 7: Install and build LLaMA.cpp
llama.cpp is a C++ library for running LLaMA and other large language models efficiently on GPUs, CPUs, and edge devices.
We’ll first install llama.cpp, as we’ll use it to download and run DeepSeek-V3.1-Terminus.
1. Start by creating a virtual environment using Anaconda.
conda create -n deepseek python=3.11 -y && conda activate deepseek
Output:
2. Once inside the environment, update the Ubuntu package lists to fetch the latest repository updates and patches.
apt-get update
3. Install the dependencies for llama.cpp.
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
Output:
4. Clone the official llama.cpp repository.
git clone https://github.com/ggml-org/llama.cpp
Output:
5. Generate llama.cpp’s build files with CMake.
In the command below, set -DGGML_CUDA=OFF instead of -DGGML_CUDA=ON if you’re running on a non-GPU system.
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
Output:
6. Build llama.cpp from the build directory.
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
Output:
7. Finally, we’ll copy all the executables from llama.cpp/build/bin/ that start with llama- into the llama.cpp directory.
cp llama.cpp/build/bin/llama-* llama.cpp
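Optionally, you can sanity-check the build before moving on; recent llama.cpp builds accept a --version flag that prints the build info (the exact output will differ on your system):
./llama.cpp/llama-cli --version
./llama.cpp/llama-server --version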
Step 8: Download the Model Files
We’ll download the model files from Hugging Face using a Python script.
1. To do that, let’s first install the Hugging Face Python packages.
pip install huggingface_hub hf_transfer
- huggingface_hub – Provides an interface to interact with the Hugging Face Hub, allowing you to download, upload, and manage models, datasets, and other resources.
- hf_transfer – A tool optimized for faster uploads and downloads of large files (e.g., LLaMA, DeepSeek models) from the Hugging Face Hub using a more efficient transfer protocol.
Output:
2. Run the model installation script with Python.
The script below will download all the checkpoints for the specified quant (UD-Q2_K_XL) from unsloth/DeepSeek-V3.1-Terminus-GGUF.
python -c "import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER']='0'; from huggingface_hub import snapshot_download; snapshot_download(repo_id='unsloth/DeepSeek-V3.1-Terminus-GGUFunsloth/DeepSeek-V3.1-Terminus-GGUF', local_dir='unsloth/DeepSeek-V3.1-Terminus-GGUF', allow_patterns=['*UD-Q2_K_XL*'])"
Output:
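If you prefer something more readable than the one-liner, the same download can be written as a short script (identical repo, pattern, and target directory; the filename download_terminus.py is just an example):
# download_terminus.py
import os
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '0'  # set to '1' to try the faster hf_transfer backend
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='unsloth/DeepSeek-V3.1-Terminus-GGUF',
    local_dir='unsloth/DeepSeek-V3.1-Terminus-GGUF',
    allow_patterns=['*UD-Q2_K_XL*'],  # only fetch the 2.71-bit UD-Q2_K_XL shards
)
Once it finishes, you can confirm the shards are in place with ls -lh unsloth/DeepSeek-V3.1-Terminus-GGUF/UD-Q2_K_XL/.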
Step 9: Run the Model for Inference
Finally, once all checkpoints are downloaded, we can proceed to the inference part.
1. In the command below, we’ll run the model as a server using llama.cpp’s llama-server tool.
~/llama.cpp/build/bin/llama-server \
-m ~/unsloth/DeepSeek-V3.1-Terminus-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-Terminus-UD-Q2_K_XL-00001-of-00006.gguf \
--host 0.0.0.0 --port 8080 \
--ctx-size 32768 --jinja -np 2 \
--temp 0.6 --top-p 0.95
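Depending on your build’s defaults, model layers may not be offloaded to the GPUs automatically; if nvidia-smi shows little VRAM usage while the model loads, consider adding --n-gpu-layers (for example, --n-gpu-layers 99) to the command above. Once the server reports that it is listening, you can check it is healthy from another terminal before sending a full prompt (recent llama-server builds expose a simple health endpoint):
curl http://localhost:8080/health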
2. Now we’ll send a request to the model with our prompt, using a curl command from a separate SSH terminal.
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer sk-123" \
-d '{
"model":"unsloth/DeepSeek-V3.1-Terminus-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-Terminus-UD-Q2_K_XL-00001-of-00006.gguf",
"temperature":0.6,
"top_p":0.95,
"messages":[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Explain black holes in 2 lines simply."}
]
}'
It should return a response like this:
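Since llama-server exposes an OpenAI-compatible API, you can also call it from Python instead of curl. Below is a minimal sketch, assuming the server above is still running on port 8080 and that you have run pip install openai; the model field is just a label that the local server echoes back:
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8080/v1', api_key='sk-123')  # any non-empty key works for a local server
resp = client.chat.completions.create(
    model='DeepSeek-V3.1-Terminus-GGUF',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain black holes in 2 lines simply.'},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(resp.choices[0].message.content)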
Conclusion
DeepSeek-V3.1 Terminus GGUF exemplifies the next level of hybrid AI models, delivering enhanced reasoning, agentic capabilities, and language consistency that empower developers to tackle complex coding, search, and research tasks efficiently. With NodeShift, users can seamlessly deploy and run this powerful model without the overhead of local setup, harnessing scalable compute resources and optimized infrastructure to access its full potential. Together, the model’s advanced features and NodeShift’s cloud platform ensure that building intelligent agents, chat assistants, and automation tools becomes faster, more reliable, and highly accessible for developers across diverse projects.