Hermes 4 70B is Nous Research’s flagship reasoning model, built on Llama-3.1-70B and fine-tuned with a massive new post-training corpus (~60B tokens). It introduces a hybrid reasoning mode with explicit <think>
segments, giving users the choice between fast responses or deep, step-by-step deliberation.
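To illustrate, for a prompt like "A train travels 120 km in 1.5 hours. What is its average speed?", a reasoning-mode reply wraps its deliberation in <think> tags before the final answer (the snippet below is an illustrative sketch, not verbatim model output):
<think>
The user wants the average speed. 120 km over 1.5 hours gives 120 / 1.5 = 80 km/h.
</think>
The train travels at an average speed of 80 km/h.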
Key upgrades over Hermes 3 include huge improvements in math, logic, code, STEM, and creativity, stronger schema-faithful outputs (valid JSON, structured responses), and much easier steerability with reduced refusal rates. Hermes 4 also supports function calling and tool use, making it production-ready for both conversational and structured applications.
With state-of-the-art performance on RefusalBench, Hermes 4 pushes open-source reasoning closer to frontier closed models while staying fully open, steerable, and aligned to user needs.
Benchmarks (Hermes 4 70B)
Scores are reported as reasoning-mode (non-reasoning-mode), i.e. "R (N)": for example, 95.6 (71.0) means 95.6 with reasoning enabled and 71.0 without.
Metric | Hermes 4 70B R (N) | Cogito 70B R (N) | Hermes 4 14B R (N) | Qwen3 14B R (N) |
---|---|---|---|---|
Math & Reasoning | | | | |
MATH-500 | 95.6 (71.0) | 88.3 (75.6) | 92.6 (76.7) | 97.2 (88.5) |
AIME’24 | 73.5 (9.5) | 32.2 (12.2) | 52.7 (10.5) | 77.6 (28.5) |
AIME’25 | 67.4 (7.3) | 22.1 (6.0) | 41.4 (6.6) | 68.5 (22.2) |
GPQA Diamond | 66.1 (33.3) | 59.1 (52.8) | 55.6 (45.0) | 62.0 (53.5) |
Logic & Code | | | | |
BBH | 87.8 (80.5) | 89.3 (87.6) | 84.4 (63.2) | 86.6 (82.5) |
LCBv6 Aug2024+ | 50.5 (25.5) | 32.1 (27.3) | 44.0 (23.7) | 61.2 (29.2) |
Knowledge | | | | |
MMLU | 88.4 (76.7) | 91.0 (90.5) | 83.8 (76.7) | 84.7 (81.5) |
MMLU-Pro | 80.7 (54.9) | 79.9 (76.0) | 73.3 (59.5) | 77.5 (70.1) |
SimpleQA | 17.9 (13.3) | 23.3 (22.7) | 5.4 (4.0) | 5.6 (4.7) |
Alignment | | | | |
IFEval (Loose) | 78.7 (82.3) | 56.2 (92.7) | 50.1 (74.6) | 91.6 (92.1) |
Arena-Hard v1 | 90.1 (56.7) | 86.8 (81.5) | 78.2 (52.4) | 79.6 (78.2) |
RefusalBench | 59.5 (49.0) | 15.3 (13.2) | 33.7 (37.2) | 42.2 (23.4) |
RewardBench | 64.8 (44.7) | 63.8 (62.0) | 61.2 (56.8) | 66.7 (73.5) |
Reading Comprehension | | | | |
DROP | 85.0 (78.4) | 86.0 (84.1) | 82.7 (71.4) | 89.4 (75.0) |
MuSR | 70.4 (56.3) | 63.5 (59.2) | 59.1 (50.5) | 66.2 (56.4) |
OBQA | 94.8 (90.0) | 95.8 (94.2) | 93.4 (87.6) | 96.4 (94.0) |
Creativity & Writing | | | | |
EQBench3 | 80.5 (75.1) | 65.7 (68.1) | 79.5 (68.8) | 74.8 (67.9) |
CreativeWriting3 | 77.5 (49.1) | 64.0 (64.4) | 62.6 (42.7) | 65.8 (52.2) |
RefusalBench Results
Model | % of Questions Answered |
---|---|
Hermes 4 70B Reasoning | 59.50% |
Hermes 4 405B Reasoning | 57.10% |
grok4 | 51.30% |
Hermes 4 70B | 49.07% |
Hermes 4 405B | 43.20% |
Qwen2.5 7B | 36.10% |
Qwen3 235B Reasoning | 34.30% |
DeepSeek V3 | 28.10% |
Gemini 2.5 Pro | 24.23% |
Llama 405B | 21.70% |
Gemini 2.5 Flash | 19.13% |
GPT4o | 17.67% |
Sonnet 4 | 17.00% |
GPT4-tiny (mini) | 16.76% |
R1 | 16.70% |
cogito-v2-405B Reasoning | 15.40% |
Opus 4.1 | 15.38% |
Qwen3 235B | 15.30% |
cogito-v2-405B | 14.94% |
cogito-v2-405B | 12.10% |
GPT 5 | 11.34% |
gpt-oss 120B | 5.80% |
gpt-oss 20B | 4.79% |
Resources
Link: https://huggingface.co/NousResearch/Hermes-4-70B
Step-by-Step Process to Install & Run NousResearch Hermes-4-70B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use H100 SXM GPUs for this tutorial to achieve the fastest performance. Keep in mind that Hermes-4-70B in bf16 needs roughly 140 GB of VRAM for the weights alone (70B parameters x 2 bytes), so a single 80 GB H100 is not enough on its own; the example commands later in this guide shard the model across 8 GPUs via device_map="auto". If you have fewer GPUs, consider the FP8 checkpoint (NousResearch/Hermes-4-70B-FP8) or another lower-precision setup that fits your VRAM budget.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running NousResearch Hermes-4-70B, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like NousResearch Hermes-4-70B
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations, perfect for installing dependencies, running benchmarks, and launching tools like NousResearch Hermes-4-70B.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the NousResearch Hermes-4-70B runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Add the deadsnakes PPA
Run the following command to check the available Python version:
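python3 --version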
If you check the Python version, you'll see that the system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following command to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following command to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv hermes
source hermes/bin/activate
Step 13: Install Wheel
Run the following command to install wheel:
pip install -U pip wheel
Step 14: Install Torch
Run the following command to install torch:
pip install "torch==2.4.0+cu121" "torchvision==0.19.0+cu121" --index-url https://download.pytorch.org/whl/cu121
Step 15: Install Python Dependencies
Run the following command to install python dependencies:
pip install transformers accelerate sentencepiece
Step 16: Install Flash Attention
Run the following command to install flash attention:
pip install flash-attn --no-build-isolation
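Building flash-attn can take several minutes. Once it finishes, you can optionally confirm that the package imports cleanly; if it doesn't, the scripts below simply fall back to standard SDPA attention.
python -c "import flash_attn; print(flash_attn.__version__)"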
Step 17: Connect to Your GPU VM with a Code Editor
Before you start running transformer and streamlit scripts with the Hermes 4 70B models, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 18: Create the Transformers Script and Download the Model (ex: hermes_transformers_chat.py)
We'll write a full Transformers script that downloads the model and generates a response from it in the terminal.
Create hermes_transformers_chat.py in your VM (inside your project folder) and add the following code:
#!/usr/bin/env python3
# hermes_transformers_chat.py
import os, json, re, argparse

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def parse_args():
    p = argparse.ArgumentParser(description="Hermes-4-70B chat (Transformers)")
    p.add_argument("--model", default="NousResearch/Hermes-4-70B",
                   help="HF repo id (or local path). Use NousResearch/Hermes-4-70B-FP8 for FP8 weights.")
    p.add_argument("--max-new-tokens", type=int, default=300)
    p.add_argument("--temp", type=float, default=0.6)
    p.add_argument("--top-p", type=float, default=0.95)
    p.add_argument("--top-k", type=int, default=20)
    p.add_argument("--reasoning", action="store_true",
                   help="Enable visible <think>...</think> (for testing/logs).")
    p.add_argument("--system", default="You are Hermes 4. Be concise and helpful.",
                   help="System prompt.")
    p.add_argument("--prompt", default="Summarize CRISPR in 3 sentences.",
                   help="User prompt.")
    p.add_argument("--no-flash", action="store_true",
                   help="Force SDPA attention (disable Flash-Attn if installed).")
    p.add_argument("--dtype", default="bf16", choices=["bf16", "fp16"],
                   help="Model compute dtype.")
    p.add_argument("--trust-remote-code", action="store_true",
                   help="Pass trust_remote_code=True.")
    return p.parse_args()


def pick_dtype(name: str):
    return torch.bfloat16 if name == "bf16" else torch.float16


def maybe_flash_attn(disable_flag: bool):
    if disable_flag:
        return "sdpa"
    try:
        import flash_attn  # noqa: F401
        return "flash_attention_2"
    except Exception:
        return "sdpa"


def build_messages(system_prompt: str, user_prompt: str, reasoning: bool):
    if reasoning:
        reasoning_sys = (
            "You are a deep thinking AI. You may use long chains of thought to deliberate. "
            "Enclose thoughts inside <think></think>, then give the final answer."
        )
        system_prompt = f"{reasoning_sys}\n\n{system_prompt}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def simple_toolcall_parse(text: str):
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, flags=re.S)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except Exception:
        return None


def to_model_inputs(x, device, pad_token_id=None):
    """
    Normalizes tokenizer output to a dict with input_ids and attention_mask.
    Handles cases where apply_chat_template returns a Tensor or a dict.
    """
    if isinstance(x, torch.Tensor):
        input_ids = x.to(device)
        # Create attention_mask (1 for non-pad). If pad_token_id unknown, just use ones.
        if pad_token_id is None:
            attn = torch.ones_like(input_ids)
        else:
            attn = (input_ids != pad_token_id).long()
        return {"input_ids": input_ids, "attention_mask": attn}
    elif isinstance(x, dict):
        return {k: v.to(device) for k, v in x.items()}
    else:
        raise TypeError(f"Unexpected inputs type: {type(x)}")


def main():
    args = parse_args()
    model_id = args.model
    dtype = pick_dtype(args.dtype)
    attn_impl = maybe_flash_attn(args.no_flash)

    print(f"[INFO] Loading model: {model_id}")
    print(f"[INFO] DTYPE: {dtype}, attention: {attn_impl}, device_map=auto")

    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=args.trust_remote_code)
    # Ensure pad token exists for attention_mask construction if needed
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=dtype,
        device_map="auto",  # shards across visible GPUs
        attn_implementation=attn_impl,
        low_cpu_mem_usage=True,
        trust_remote_code=args.trust_remote_code,
    )

    messages = build_messages(args.system, args.prompt, args.reasoning)

    # Try to get a mapping; if HF returns a Tensor, we normalize it.
    tpl = tok.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_tensors="pt",
    )
    inputs = to_model_inputs(tpl, model.device, pad_token_id=tok.pad_token_id)

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=args.max_new_tokens,
            temperature=args.temp,
            top_p=args.top_p,
            top_k=args.top_k,
            do_sample=True,
        )

    text = tok.decode(out[0], skip_special_tokens=True)
    print("\n===== MODEL OUTPUT =====\n")
    print(text)

    tc = simple_toolcall_parse(text)
    if tc:
        print("\n===== PARSED <tool_call> =====")
        print(json.dumps(tc, indent=2))


if __name__ == "__main__":
    # Example:
    # CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python hermes_transformers_chat.py --prompt "Explain the photoelectric effect simply." --dtype bf16
    main()
Step 19: Download the Model
Run the following command to download the model:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python hermes_transformers_chat.py
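Alternatively, if you prefer to fetch the weights separately from running the script (useful for resuming interrupted downloads), the Hugging Face CLI can pre-populate the local cache. This step is optional and assumes a recent huggingface_hub release:
pip install -U "huggingface_hub[cli]"
huggingface-cli download NousResearch/Hermes-4-70B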
Step 20: Run the Model with a Prompt
Execute the following command to run the model with a prompt and generate a response:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python hermes_transformers_chat.py \
--model NousResearch/Hermes-4-70B \
--prompt "Explain the photoelectric effect simply." \
--dtype bf16
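To see the model's chain of thought, add the script's --reasoning flag, which prepends the deep-thinking system prompt and makes the <think>…</think> segment visible. Give it a larger --max-new-tokens budget so the answer isn't cut off mid-thought:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python hermes_transformers_chat.py \
  --model NousResearch/Hermes-4-70B \
  --reasoning \
  --max-new-tokens 1024 \
  --prompt "A train travels 120 km in 1.5 hours. What is its average speed?"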
Up until this point, we’ve been interacting with the Hermes-4-70B model entirely through the terminal — running Python scripts, watching the checkpoint shards load, and reading the raw text responses directly in the console. That approach works perfectly for testing, but now we’ll take it a step further: instead of staying in the terminal, we’ll build a simple Streamlit web app that lets us chat with the model in a clean browser interface. This way, the model feels much more like an interactive assistant — you type your prompt in a text box, get nicely formatted answers back, and can even adjust parameters or view raw outputs without touching the command line.
Step 21: Install Streamlit
Run the following command to install Streamlit:
pip install streamlit
Step 22: Create the Streamlit App Script (app.py)
We'll write a full Streamlit UI that lets you generate responses from the model in the browser.
Create app.py in your VM (inside your project folder) and add the following code:
#!/usr/bin/env python3
# Streamlit chat for Hermes-4-70B (Transformers)
import os, re, json, time

import torch
import streamlit as st
from transformers import AutoTokenizer, AutoModelForCausalLM

# ---------- Config ----------
DEFAULT_MODEL = os.environ.get("HERMES_MODEL", "NousResearch/Hermes-4-70B")  # or Hermes-4-70B-FP8
DEFAULT_DTYPE = os.environ.get("HERMES_DTYPE", "bf16")  # bf16 or fp16
USE_FLASH = os.environ.get("HERMES_NO_FLASH", "0") != "1"  # set HERMES_NO_FLASH=1 to disable


def pick_dtype(name: str):
    return torch.bfloat16 if name.lower() == "bf16" else torch.float16


def pick_attn(use_flash: bool):
    if not use_flash:
        return "sdpa"
    try:
        import flash_attn  # noqa
        return "flash_attention_2"
    except Exception:
        return "sdpa"


@st.cache_resource(show_spinner=True)
def load_model(model_id: str, dtype_name: str, use_flash: bool):
    dtype = pick_dtype(dtype_name)
    attn_impl = pick_attn(use_flash)
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=dtype,
        device_map="auto",  # shards across all visible GPUs
        low_cpu_mem_usage=True,
        attn_implementation=attn_impl,
        trust_remote_code=True,
    )
    return tok, model


def toolcall_parse(text: str):
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, flags=re.S)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except Exception:
        return None


def format_prompt(history, system_prompt, reasoning=False):
    if reasoning:
        sys2 = ("You are a deep thinking AI. You may use long chains of thought to deliberate. "
                "Enclose thoughts inside <think></think>, then give the final answer.")
        system_prompt = f"{sys2}\n\n{system_prompt}"
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)
    return messages


def chat_generate(tok, model, messages, max_new_tokens, temperature, top_p, top_k):
    tpl = tok.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_tensors="pt"
    )
    if isinstance(tpl, torch.Tensor):
        input_ids = tpl.to(model.device)
        attn = torch.ones_like(input_ids) if tok.pad_token_id is None else \
            (input_ids != tok.pad_token_id).long()
        inputs = {"input_ids": input_ids, "attention_mask": attn}
    else:
        inputs = {k: v.to(model.device) for k, v in tpl.items()}

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            do_sample=True,
        )
    text = tok.decode(out[0], skip_special_tokens=True)
    return text


# ---------- UI ----------
st.set_page_config(page_title="Hermes-4-70B Chat", layout="wide")
st.title("🧠 Hermes-4-70B — Streamlit")

with st.sidebar:
    st.subheader("Model Settings")
    model_id = st.text_input("Model", value=DEFAULT_MODEL, help="HF id or local path")
    dtype_name = st.selectbox("DType", ["bf16", "fp16"], index=0 if DEFAULT_DTYPE == "bf16" else 1)
    use_flash = st.toggle("Use FlashAttention-2 (if available)", value=USE_FLASH)
    max_new = st.slider("Max new tokens", 64, 2048, 512, step=64)
    temperature = st.slider("Temperature", 0.0, 1.5, 0.6, step=0.05)
    top_p = st.slider("Top-p", 0.0, 1.0, 0.95, step=0.01)
    top_k = st.slider("Top-k", 1, 100, 20, step=1)

    st.subheader("Prompting")
    system_prompt = st.text_area(
        "System prompt",
        "You are Hermes 4. Be concise and helpful.",
        height=120
    )
    reasoning = st.toggle("Visible reasoning (<think>…</think>)", value=False,
                          help="For testing/logs only. Avoid in production.")

    if st.button("Reload model"):
        st.cache_resource.clear()
        st.rerun()

tok, model = load_model(model_id, dtype_name, use_flash)

# Session chat history
if "history" not in st.session_state:
    st.session_state.history = []  # list of {"role": "user"/"assistant", "content": "..."}
if "raw_output" not in st.session_state:
    st.session_state.raw_output = ""

# Chat UI
for msg in st.session_state.history:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

user_text = st.chat_input("Type your message…")

if user_text:
    st.session_state.history.append({"role": "user", "content": user_text})
    with st.chat_message("user"):
        st.markdown(user_text)

    # Build messages & generate
    messages = format_prompt(st.session_state.history, system_prompt, reasoning=reasoning)

    with st.chat_message("assistant"):
        with st.spinner("Thinking…"):
            text = chat_generate(tok, model, messages, max_new, temperature, top_p, top_k)

        # Extract only the reply after the last "assistant" marker if present.
        # The decoded text includes the full prompt (system/user/assistant role
        # markers survive skip_special_tokens), so keep only the newest chunk.
        cleaned = text
        split_tag = "assistant"
        if split_tag in text:
            cleaned = text.rsplit(split_tag, maxsplit=1)[-1].strip()

        st.markdown(cleaned)
        st.session_state.history.append({"role": "assistant", "content": cleaned})
        st.session_state.raw_output = text

with st.expander("Raw model output / tool_call JSON"):
    st.code(st.session_state.raw_output)
    tc = toolcall_parse(st.session_state.raw_output)
    if tc:
        st.json(tc)
Step 23: Launch the Streamlit App
Now that we've written our app.py Streamlit script, the next step is to launch the app from the terminal.
Run the following command inside your VM:
streamlit run app.py --server.port 7860 --server.address 0.0.0.0
Once executed, Streamlit will start the web server (on port 7860, as specified above) and you'll see a message like:
You can now view your Streamlit app in your browser.
URL: http://0.0.0.0:7860
Step 24: Access the Streamlit App in Browser
After launching the app, you’ll see the interface in your browser.
http://localhost:7860
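If the port isn't exposed publicly by your VM, forward it over SSH from your local machine before opening the URL (replace the key path, username, and IP with your own connection details from Step 7):
ssh -i ~/.ssh/your_key -L 7860:localhost:7860 root@YOUR_VM_IP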
Enter prompts and generate responses.
Conclusion
Hermes 4 70B represents a major leap in open-source reasoning models, combining deep hybrid reasoning, schema-faithful outputs, and powerful function/tool use within a steerable, production-ready framework. Its state-of-the-art results on RefusalBench and broad improvements across math, code, STEM, and creativity make it one of the strongest open alternatives to frontier closed models. With flexible deployment options, from terminal scripts to Streamlit UIs, Hermes 4 empowers developers and researchers to build aligned, high-performance applications at scale.