Grok 2, the flagship AI model from Elon Musk’s xAI, is now officially open source. Announced by Musk himself, this move gives developers free access to enterprise-level AI for the first time. The model is already available on Hugging Face, making it easy to download, experiment with, and run locally. This is a golden chance to explore cutting-edge AI without cost barriers and prepare for what’s next—especially with Grok 3 also set to go open source in just six months.
GPU Configuration Table for Grok 2
Scenario | GPUs | VRAM / GPU | Total VRAM | TP | Precision | Disk (min) | Disk (rec) | System RAM (rec) | CUDA / Driver | PyTorch | Kernels | Notes
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Recommended (production) | 8× NVIDIA H200 | 141 GB | 1.13 TB | 8 | FP8 (w8a8) | 500 GB | 1 TB | 128–256 GB | CUDA 12.1+ | 2.4.0 (cu121) | FlashInfer + sgl-kernel | Fastest single-node option; ample headroom for long context and batching.
Minimum supported (official) | 8× NVIDIA H100 | 80 GB | 640 GB | 8 | FP8 (w8a8) | 500 GB | 1 TB | 128–256 GB | CUDA 12.1+ | 2.4.0 (cu121) | FlashInfer + sgl-kernel | Baseline cluster many clouds offer; meets the Grok 2 TP=8 requirement.
Alternative (works but tighter) | 8× NVIDIA A100 | 80 GB | 640 GB | 8 | FP8 (w8a8) | 500 GB | 1 TB | 128–256 GB | CUDA 12.1+ | 2.4.0 (cu121) | FlashInfer + sgl-kernel | Works with FP8; expect lower throughput vs. H100/H200.
Fixed software settings (matching our working run)
- SGLang: ≥ 0.5.1
- Quantization: `--quantization fp8`
- Attention backend: `--attention-backend triton`
- Tensor parallel: `--tp 8` (required by this checkpoint)
- Environment settings: `NCCL_IB_DISABLE=1`, `NCCL_P2P_DISABLE=0`, `NCCL_DEBUG=INFO`, and disable or patch the AMX check for Torch 2.4 GPU runs (see the launch sketch below).
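For reference, here is a minimal launch sketch in Python that applies these environment settings before starting the server. It assumes the weights are already downloaded to /local/grok-2 (the path used later in this guide); adjust it to your setup.

```python
import os
import subprocess

# NCCL settings from the list above: disable InfiniBand transport,
# keep peer-to-peer GPU copies enabled, and log NCCL activity.
env = os.environ.copy()
env.update({
    "NCCL_IB_DISABLE": "1",
    "NCCL_P2P_DISABLE": "0",
    "NCCL_DEBUG": "INFO",
})

# Launch SGLang with the flags this checkpoint requires (TP=8, FP8, Triton).
subprocess.run(
    [
        "python3", "-m", "sglang.launch_server",
        "--model", "/local/grok-2",
        "--tokenizer-path", "/local/grok-2/tokenizer.tok.json",
        "--tp", "8",
        "--quantization", "fp8",
        "--attention-backend", "triton",
    ],
    env=env,
    check=True,
)
```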
Storage & Networking
- Weights size: ~500 GB (42–44 files). Plan for 1 TB if you want logs, caches, upgraded checkpoints, or multiple quant configs.
- Throughput scaling is sensitive to PCIe/NVLink/NVSwitch topology; H100/H200 NVLink/NVSwitch gives the best results.
Resources
Link: https://huggingface.co/xai-org/grok-2
Step-by-Step Process to Install & Run Grok 2 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 8× H200 GPUs for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Grok 2, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including `nvcc`)
- Proper support for building and running GPU-based applications like Grok 2
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Grok 2.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Grok 2 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Step 8: Check the Available Python Version and Install a Newer Version
Run the following command to check the available Python version:
python3 --version
If you check the Python version, you'll see that the system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default `python3`:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv grok2
source grok2/bin/activate
Step 13: Install Hugging Face Hub
Run the following command to install `huggingface_hub`:
pip install huggingface_hub
Step 14: Install PyTorch and Dependencies
Run the following command to install PyTorch and dependencies:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
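After the install completes, a quick sanity check can save debugging time later. This is a minimal sketch that simply confirms the CUDA build of PyTorch is active and all eight GPUs are visible:

```python
import torch

# Confirm the CUDA build of PyTorch is active and all GPUs are visible.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
```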
Step 15: Install Accelerate
Run the following command to install accelerate:
pip install accelerate
Step 16: Install Safetensors
Run the following command to install safetensors:
pip install safetensors
Step 17: Install Sentencepiece
Run the following command to install sentencepiece:
pip install sentencepiece
Step 18: Install Transformers
Run the following command to install transformers:
pip install transformers
Step 19: Install Missing Dependencies
Run the following command to install missing dependencies:
pip install fastapi uvicorn pydantic numpy scipy transformers flash-attn xformers
Step 20: Download the Model
Run the following command to download the model:
hf download xai-org/grok-2 --local-dir /local/grok-2
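If you prefer to script the download instead of using the CLI, the `huggingface_hub` Python API does the same thing. This sketch mirrors the command above, targeting the same local directory:

```python
from huggingface_hub import snapshot_download

# Download all Grok 2 weight files (~500 GB) to the same path
# used by the launch command later in this guide.
snapshot_download(
    repo_id="xai-org/grok-2",
    local_dir="/local/grok-2",
)
```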
Step 21: Install and Clone the Latest SGLang Inference Engine
Run the following commands to install the latest SGLang inference engine and clone its repository:
pip install uv
uv pip install "sglang[all]>=0.5.1.post2"
git clone https://github.com/sgl-project/sglang.git
cd sglang
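To verify that the install picked up a new enough version (the checkpoint needs ≥ 0.5.1), you can check the installed package metadata:

```python
from importlib.metadata import version

# Grok 2 support requires SGLang >= 0.5.1 (see the requirements above).
print("sglang:", version("sglang"))
```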
Step 22: Install Flashinfer-Python
Run the following command to install flashinfer-python:
pip install flashinfer-python
Step 23: Launch a Server
Run the following command to launch a server:
python3 -m sglang.launch_server --model /local/grok-2 --tokenizer-path /local/grok-2/tokenizer.tok.json --tp 8 --quantization fp8 --attention-backend triton
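Loading ~500 GB of weights takes a while, so the server is not usable immediately. A small polling script, sketched here under the assumption that SGLang's `/health` endpoint is available on the default port 30000 (the same port the API calls below use), can tell you when it is ready:

```python
import time
import requests

# Poll the SGLang server until it reports healthy (or we give up).
URL = "http://127.0.0.1:30000/health"
for attempt in range(120):
    try:
        if requests.get(URL, timeout=5).status_code == 200:
            print("Server is ready.")
            break
    except requests.exceptions.RequestException:
        pass  # Server not accepting connections yet.
    time.sleep(10)
else:
    print("Server did not become healthy in time.")
```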
Step 24: Confirm Grok 2 Is Generating Responses Correctly via the API
Test the server with a simple generation request:
curl -X POST "http://127.0.0.1:30000/generate" \
-H "Content-Type: application/json" \
-d '{
"text": "Write a short poem about open-source AI in 3 lines",
"max_new_tokens": 50,
"temperature": 0.7
}'
The output looks like this:
{
"text": "(preferably rhyming). Here's my attempt:\n\nAI's wisdom...",
"output_ids": [11036, 421, 160, ...],
"meta_info": {...},
"finish_reason": {...}
}
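The same request can be made from Python, which is handy for scripts. This sketch mirrors the curl payload above and prints just the generated text:

```python
import requests

# Same payload as the curl example: prompt text plus sampling settings.
payload = {
    "text": "Write a short poem about open-source AI in 3 lines",
    "max_new_tokens": 50,
    "temperature": 0.7,
}
response = requests.post("http://127.0.0.1:30000/generate", json=payload)
response.raise_for_status()
print(response.json()["text"])
```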
Step 25: Install Streamlit
Make sure you’re inside your grok2 environment:
pip install streamlit requests
Step 26: Create the Streamlit App Script (`app.py`)
We'll write a full Streamlit UI that lets you generate responses from the model in your browser.
Create `app.py` in your VM (inside your project folder) and add the following code:
```python
import streamlit as st
import requests

st.set_page_config(page_title="🤖 Grok-2 Chatbot", layout="wide")
st.title("🤖 Grok-2 Chatbot")

# Larger text area for better visibility
prompt = st.text_area(
    "Enter your prompt:",
    height=300,
    placeholder="Write your prompt here...",
)

max_tokens = st.number_input("Max Tokens:", min_value=1, max_value=5000, value=200)
temperature = st.slider("Temperature:", min_value=0.0, max_value=1.0, value=0.7, step=0.1)

if st.button("Generate"):
    if prompt.strip():
        url = "http://127.0.0.1:30000/generate"

        # The API expects "text" (not "prompt") and "max_new_tokens",
        # matching the curl request from Step 24.
        payload = {
            "text": prompt,
            "max_new_tokens": max_tokens,
            "temperature": temperature,
        }

        with st.spinner("Generating response..."):
            try:
                response = requests.post(url, json=payload)
                if response.status_code == 200:
                    result = response.json()
                    st.subheader("Grok-2 Response:")
                    # Show just the generated text; fall back to the raw JSON.
                    st.write(result.get("text", result))
                else:
                    st.error(f"Error {response.status_code}: {response.text}")
            except Exception as e:
                st.error(f"Request failed: {e}")
    else:
        st.warning("Please enter a prompt before generating.")
```
Step 27: Launch the Streamlit App
Now that we've written our `app.py` Streamlit script, the next step is to launch the app from the terminal.
Run the following command inside your VM:
streamlit run app.py
Once executed, Streamlit will start the web server and you’ll see a message:
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://172.17.0.2:8501
External URL: http://50.222.102.252:8501
Step 28: Access the Streamlit App in the Browser
After launching the app, open the interface in your browser:
http://localhost:8501
If your VM is remote, use the External URL shown above, or forward the port over SSH first (for example, `ssh -L 8501:localhost:8501 <user>@<vm-ip>`, substituting your own connection details). Enter prompts and generate responses.
Conclusion
Grok 2’s open-source release is more than just another model drop—it’s a landmark moment for developers. With full access on Hugging Face and the ability to run it locally, you can now experiment, innovate, and build without the usual barriers of cost or closed access. For Indian techies, this is the perfect opportunity to sharpen skills, explore enterprise-level AI, and get ready for what’s next—with Grok 3 already on the horizon.