DeepSeek-V3.1 Terminus GGUF takes the capabilities of the acclaimed DeepSeek-V3.1 to the next level, offering a finely-tuned hybrid model designed for both reasoning and agentic tasks with remarkable precision. This update focuses on language consistency, significantly reducing mixed Chinese-English outputs and abnormal characters, while optimizing the performance of the Code Agent and Search Agent. Benchmark evaluations reflect notable improvements: from higher scores in MMLU-Pro (85.0), GPQA-Diamond (80.7), and Humanity’s Last Exam (21.7), to enhanced agentic tool use such as BrowseComp (38.5) and Terminal-bench (36.7). The search agent’s template and tool-set have also been refined, enabling more accurate, context-aware searches. With a robust structure identical to DeepSeek-V3.1 and an updated inference demo in the repository, the model empowers developers to run complex reasoning, coding, and search tasks locally, all while maintaining high efficiency through selective layer optimizations and precise attention mechanisms.
Getting started with DeepSeek-V3.1 Terminus GGUF is straightforward: this guide walks through demo code and step-by-step instructions so you can quickly set up the model locally using llama.cpp and the quants made available by Unsloth. Whether you are building chat agents, code assistants, or research tools, this release provides the enhanced performance, reliable outputs, and versatile templates needed to integrate cutting-edge AI into your projects with ease.
Prerequisites
The system requirements for running DeepSeek-V3.1 are:
- GPU: Multiple H100s or H200s (the count varies with the chosen quantization)
- Storage: 1TB+ (preferable)
- NVIDIA CUDA installed
- Anaconda set up
- Disk space requirements for each quantized variant are as follows:

Source: Unsloth
We recommend taking a screenshot of this chart and saving it so you can quickly look up the disk-space requirement before trying a specific quantized version.
For this article, we’ll download the 2.71-bit version (recommended).
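Once you're on the machine you plan to use, you can quickly confirm it meets these requirements with standard tools (output varies by system, and nvcc is only available once the CUDA toolkit is installed):
nvidia-smi
nvcc --version
df -h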
Step-by-step process to install DeepSeek-V3.1-Terminus GGUF
For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.
Step 1: Setting up a NodeShift Account
Visit app.nodeshift.com and create an account by filling in basic details, or continue signing up with your Google/GitHub account.
If you already have an account, log in straight to your dashboard.
Step 2: Create a GPU Node
After accessing your account, you should see a dashboard (see image). Now:
- Navigate to the menu on the left side.
- Click on the GPU Nodes option.
- Click on Start to start creating your very first GPU node.
These GPU nodes are GPU-powered virtual machines provided by NodeShift. They are highly customizable and let you control the environment configuration, from the GPU type (H100s, A100s, and more) to CPUs, RAM, and storage, according to your needs.
Step 3: Selecting configuration for GPU (model, region, storage)
- For this tutorial, we’ll be using a 2x H200 GPU; however, you can choose any GPU that meets the prerequisites.
- Similarly, we’ll opt for 5TB of storage by sliding the bar. You can also select the region where you want your GPU to reside from the available options.
Step 4: Choose GPU Configuration and Authentication method
1. After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 2x H200 140GB GPU node with 192vCPUs/504GB RAM/5TB SSD.
2. Next, you’ll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our official documentation.
Step 5: Choose an Image
The final step is to choose an image for the VM, which in our case is NVIDIA CUDA.
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for a CUDA-dependent workflow like building llama.cpp and running DeepSeek-V3.1-Terminus, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications
- Compatibility with CUDA 12.1.1 required by certain model operations
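If you want to verify this image on any Docker-capable machine before deploying, you can pull and test it yourself (optional; the --gpus all flag assumes the NVIDIA Container Toolkit is installed on that machine):
docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
docker run --rm --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 nvcc --version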
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
That’s it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click Create to deploy the node.
Step 6: Connect to active Compute Node using SSH
- As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status Running in green, meaning the compute node is ready to use!
- Once your GPU shows this status, navigate to the three dots on the right, click on Connect with SSH, and copy the SSH details that appear.
Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:
1. Open your terminal, paste the SSH command, and run it.
2. In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.
3. A prompt will request a password. Type the SSH password, and you should be connected.
Output:
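For reference, the copied SSH command typically follows a pattern like the one below (the key path, user, host, and port here are placeholders; use the exact command from your dashboard):
ssh -i ~/.ssh/<your-private-key> <user>@<node-ip> -p <port>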
Step 7: Install and build LLaMA.cpp
llama.cpp is a C++ library for running LLaMA and other large language models efficiently on GPUs, CPUs, and edge devices.
We’ll first install llama.cpp, as we’ll use it to download and run DeepSeek-V3.1-Terminus.
1. Start by creating a virtual environment using Anaconda.
conda create -n deepseek python=3.11 -y && conda activate deepseek
Output:
2. Once inside the environment, update the Ubuntu package lists to fetch the latest repository updates and patches.
apt-get update
3. Install the dependencies for llama.cpp.
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
Output:
4. Clone the official llama.cpp repository.
git clone https://github.com/ggml-org/llama.cpp
Output:
5. Generate llama.cpp’s build files with CMake.
In the command below, set -DGGML_CUDA=OFF instead of -DGGML_CUDA=ON if you’re running on a non-GPU system.
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
Output:
6. Build llama.cpp from the build directory.
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
Output:
7. Finally, we’ll copy all the executables from llama.cpp/build/bin/ that start with llama- into the llama.cpp directory.
cp llama.cpp/build/bin/llama-* llama.cpp
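Optionally, you can sanity-check the build before moving on; recent llama.cpp builds accept a --version flag that prints the build info (the exact output will differ on your system):
./llama.cpp/llama-cli --version
./llama.cpp/llama-server --version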
Step 8: Download the Model Files
We’ll download the model files from Hugging Face using a Python script.
1. To do that, let’s first install the Hugging Face Python packages.
pip install huggingface_hub hf_transfer
- huggingface_hub – Provides an interface to interact with the Hugging Face Hub, allowing you to download, upload, and manage models, datasets, and other resources.
- hf_transfer – A tool optimized for faster uploads and downloads of large files (e.g., LLaMA, DeepSeek models) from the Hugging Face Hub using a more efficient transfer protocol.
Output:
2. Run the model installation script with Python.
The script below will download all the checkpoints for the specified quant (UD-Q2_K_XL) from unsloth/DeepSeek-V3.1-Terminus-GGUF.
python -c "import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER']='0'; from huggingface_hub import snapshot_download; snapshot_download(repo_id='unsloth/DeepSeek-V3.1-Terminus-GGUFunsloth/DeepSeek-V3.1-Terminus-GGUF', local_dir='unsloth/DeepSeek-V3.1-Terminus-GGUF', allow_patterns=['*UD-Q2_K_XL*'])"
Output:
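If you prefer something more readable than the one-liner, the same download can be written as a short script (identical repo, pattern, and target directory; the filename download_terminus.py is just an example):
# download_terminus.py
import os
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '0'  # set to '1' to try the faster hf_transfer backend
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='unsloth/DeepSeek-V3.1-Terminus-GGUF',
    local_dir='unsloth/DeepSeek-V3.1-Terminus-GGUF',
    allow_patterns=['*UD-Q2_K_XL*'],  # only fetch the 2.71-bit UD-Q2_K_XL shards
)
Once it finishes, you can confirm the shards are in place with ls -lh unsloth/DeepSeek-V3.1-Terminus-GGUF/UD-Q2_K_XL/.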
Step 9: Run the Model for Inference
Finally, once all checkpoints are downloaded, we can proceed to the inference part.
1. In the command below, we’ll run the model as a server using llama.cpp’s llama-server tool.
~/llama.cpp/build/bin/llama-server \
-m ~/unsloth/DeepSeek-V3.1-Terminus-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-Terminus-UD-Q2_K_XL-00001-of-00006.gguf \
--host 0.0.0.0 --port 8080 \
--ctx-size 32768 --jinja -np 2 \
--temp 0.6 --top-p 0.95
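Depending on your build’s defaults, model layers may not be offloaded to the GPUs automatically; if nvidia-smi shows little VRAM usage while the model loads, consider adding --n-gpu-layers (for example, --n-gpu-layers 99) to the command above. Once the server reports that it is listening, you can check it is healthy from another terminal before sending a full prompt (recent llama-server builds expose a simple health endpoint):
curl http://localhost:8080/health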
2. Now we’ll send a request to the model with our prompt, using a curl command from a separate SSH terminal.
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer sk-123" \
-d '{
"model":"unsloth/DeepSeek-V3.1-Terminus-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-Terminus-UD-Q2_K_XL-00001-of-00006.gguf",
"temperature":0.6,
"top_p":0.95,
"messages":[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Explain black holes in 2 lines simply."}
]
}'
It should return a response like this:
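Since llama-server exposes an OpenAI-compatible API, you can also call it from Python instead of curl. Below is a minimal sketch, assuming the server above is still running on port 8080 and that you have run pip install openai; the model field is just a label that the local server echoes back:
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8080/v1', api_key='sk-123')  # any non-empty key works for a local server
resp = client.chat.completions.create(
    model='DeepSeek-V3.1-Terminus-GGUF',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain black holes in 2 lines simply.'},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(resp.choices[0].message.content)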
Conclusion
DeepSeek-V3.1 Terminus GGUF exemplifies the next level of hybrid AI models, delivering enhanced reasoning, agentic capabilities, and language consistency that empower developers to tackle complex coding, search, and research tasks efficiently. With NodeShift, users can seamlessly deploy and run this powerful model without the overhead of local setup, harnessing scalable compute resources and optimized infrastructure to access its full potential. Together, the model’s advanced features and NodeShift’s cloud platform ensure that building intelligent agents, chat assistants, and automation tools becomes faster, more reliable, and highly accessible for developers across diverse projects.