Build a Local Chat‑GPT Experience with Ollama and the gpt‑oss:20b Model

Abhineet | Aug 6, 2025


If you’ve been chasing the promise of large language models (LLMs) but don’t want to hand your data off to the cloud, you’re not alone. With the rise of open‑source LLMs and powerful container runtimes, you can spin up a full‑featured chatbot on your own laptop or workstation in minutes. This post walks you through:

  • Installing Ollama – a lightweight runtime (packaged here as a Docker container) that can pull and serve dozens of open‑source LLMs.
  • Setting up GPU‑accelerated inference on NVIDIA and AMD cards.
  • Pulling and running the gpt‑oss:20b model locally.
  • Building a simple, live chat UI with Streamlit that streams the model’s output.

TL;DR – Docker → Ollama → gpt‑oss:20b → Streamlit → Chatbot

Large Language Models (LLMs) like gpt-oss, Qwen, LLaMA, Phi-3, and others have revolutionized AI, but running them on cloud platforms can be expensive and raise privacy concerns. Enter Ollama, a powerful tool that simplifies deploying LLMs locally, even on CPU-only systems. In this blog, we’ll walk through setting up Ollama on a CPU, explain the benefits of local LLM execution, and highlight why this approach is ideal for experimentation, cost savings, and security.

Why Run LLMs Locally?

  1. Cost Efficiency

Cloud providers charge hefty fees for running LLMs, especially for large models. By running models locally, you eliminate recurring costs and save money, especially for experimentation or small-scale projects.

  2. Data Privacy & Security

Sending sensitive data to external servers poses risks of data leaks or misuse. Running LLMs locally ensures your data stays on your machine, making it ideal for industries like healthcare, finance, or research where privacy is critical.

  3. Experimentation Freedom

Local setups allow you to tweak hyperparameters, test model variants, or fine-tune models without relying on third-party infrastructure. This flexibility is invaluable for developers and researchers.

  4. No Dependency on Cloud Providers

You’re not tied to any cloud provider’s API limits or service outages. Your LLM runs entirely on your hardware, giving you full control.

Prerequisites

  • Docker Engine 20.10+ – Linux, Windows, or macOS.
  • GPU (optional) – NVIDIA RTX 3080 or higher, for GPU‑accelerated inference.
  • Python 3.9+ – needed for the Streamlit UI.

💡 Tip – If you’re on Windows, install Docker Desktop (the WSL 2 backend is recommended). Docker Desktop also works on macOS, but containers there can’t use the Apple GPU, so inference stays on the CPU; if you want GPU acceleration on a Mac, the native Ollama app is the better route.
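
A quick sanity check before you start (the nvidia-smi line applies only if you have an NVIDIA GPU):

docker --version        # should report 20.10 or newer
python3 --version       # should report 3.9 or newer
nvidia-smi              # optional: confirms the NVIDIA driver can see your GPU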

1. Install Ollama

Ollama ships as a Docker image that bundles the model runtime, a REST API, and a minimal CLI. It works out of the box on CPU or GPU.

CPU‑Only (Docker)

docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
  • -v ollama:/root/.ollama – persists your model downloads across container restarts.
  • -p 11434:11434 – Ollama exposes its REST API on port 11434 (you can verify it with the quick check below).
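
To confirm the server is up, hit the API root; a healthy instance answers with a short plain-text status (typically “Ollama is running”):

curl http://localhost:11434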

GPU‑Enabled

⚠️ The steps below assume you’re on Linux. Windows users can follow the same Docker commands – just install the NVIDIA Container Toolkit (or ROCm) on your host first.

NVIDIA GPUs

Install with Apt

  1. Configure the repository (full instructions: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
  2. Install the NVIDIA Container Toolkit packages
sudo apt-get install -y nvidia-container-toolkit

Install with Yum or Dnf

  1. Configure the repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
    | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
  2. Install the NVIDIA Container Toolkit packages
sudo yum install -y nvidia-container-toolkit

Configure Docker to use the NVIDIA runtime

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
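
With the runtime configured, start the Ollama container with GPU access. This is the same command as the CPU-only run above, plus the --gpus=all flag (the flag the troubleshooting section below refers to); if the CPU-only container is still running under the same name, remove it first with docker rm -f ollama.

docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama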

AMD GPU

Start the container

Use the rocm tag and expose the ROCm devices:

docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
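
Whichever GPU path you took, it’s worth checking that Ollama actually detected the card. The startup logs usually mention the discovered compute devices, so a quick filter over the container logs is a reasonable sanity check:

docker logs ollama 2>&1 | grep -i -E "gpu|cuda|rocm"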

2. Pull and Run a Model

# Pull the 20B gpt‑oss model
docker exec -it ollama ollama pull gpt-oss:20b

Tip – Use ollama list to see downloaded models and ollama rm <model> to remove them (run them inside the container with docker exec -it ollama ..., as above).

Run a Model

docker exec -it ollama ollama run gpt-oss:20b

You’ll see an interactive prompt waiting for input. Try asking a simple question:

> What is the capital of France?
The capital of France is Paris.
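
The model is also reachable over the same REST API on port 11434 that the Streamlit app in the next section uses. A quick non-interactive test with streaming disabled returns a single JSON object containing the answer:

curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "What is the capital of France?",
  "stream": false
}'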

3. Build a Streamlit Chatbot UI

Now that the model is up and listening on localhost:11434, let’s build a live chat interface that streams responses in real time.

Package Dependencies – Install Streamlit and requests (json is part of the Python standard library, and the ollama client is optional in this snippet). You can pin the versions in a requirements.txt:

streamlit==1.30.0
requests==2.31.0
ollama==0.5.1
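
Then install the dependencies, ideally inside a virtual environment:

python3 -m venv .venv && source .venv/bin/activate   # optional, keeps the install isolated
pip install -r requirements.txt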

Streamlit Chatbot Code

# ----------------------------------------------------
# Imports
# ----------------------------------------------------
import streamlit as st          # The web‑app framework
import requests                # To make HTTP calls to the Ollama API
import json                    # For JSON parsing
import ollama                  # (Optional) The Ollama Python client – not used in this snippet

# ----------------------------------------------------
# Page setup
# ----------------------------------------------------
st.title("Ollama Chatbot")                       # Page title shown at the top
st.write("Select a model and start chatting!")   # Short instructions for the user

# ----------------------------------------------------
# Sidebar: model selection
# ----------------------------------------------------
# List of models that are available locally in your Ollama installation
model_options = ["qwen3:8b", "deepseek-r1:latest", "gpt-oss:20b"]

# Create a drop‑down in the sidebar so the user can pick a model.
# The selected value is stored in `selected_model`
selected_model = st.sidebar.selectbox("Choose a model", model_options)

# ----------------------------------------------------
# Function that talks to the Ollama API
# ----------------------------------------------------
def call_ollama_model(prompt):
    """
    Sends a prompt to the Ollama local server and streams back the
    model's answer line‑by‑line.
    """
    try:
        # POST to the local Ollama generation endpoint.
        # - "model": name of the model to use (selected by the user)
        # - "prompt": the user’s question
        # - "stream": tells Ollama to stream partial responses
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": selected_model, "prompt": prompt, "stream": True},
            stream=True                      # Keep the HTTP connection open for streaming
        )

        output = ""          # Accumulate the full response here
        # Iterate over each line that the API yields.
        for line in response.iter_lines():
            if line:          # Non‑empty line
                # Each line is a JSON blob containing a part of the answer
                data = json.loads(line.decode('utf-8'))
                # Append the partial answer to the running total
                output += data.get("response", "")
                # Yield the current state so the UI can update progressively
                yield output

        return
    except Exception as e:
        # If anything goes wrong, show an error inside the Streamlit app
        st.error(f"Error calling Ollama API: {e}")
        return None

# ----------------------------------------------------
# Session state: keep the chat history
# ----------------------------------------------------
# `st.session_state` persists across reruns – perfect for chat memory.
if "messages" not in st.session_state:
    st.session_state.messages = []      # List of dicts with keys: role, content

# ----------------------------------------------------
# Render the existing chat history
# ----------------------------------------------------
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):    # `st.chat_message` provides a nice bubble UI
        st.markdown(msg["content"])      # Display the message text

# ----------------------------------------------------
# User input: ask a new question
# ----------------------------------------------------
# `st.chat_input` automatically creates an input box in the chat layout.
if prompt := st.chat_input("Type your message here..."):
    # Append the user message to the session state
    st.session_state.messages.append({"role": "user", "content": prompt})
    # Show the user’s message immediately
    with st.chat_message("user"):
        st.markdown(prompt)

    # ----------------------------------------------------
    # Get and display the assistant’s reply
    # ----------------------------------------------------
    with st.chat_message("assistant"):
        # `st.empty()` gives us a placeholder we can update inside the loop
        response_box = st.empty()
        full_response = ""
        # Call the API – we iterate over partial results to stream them.
        for partial in call_ollama_model(prompt):
            full_response = partial          # Keep the latest full string
            response_box.markdown(full_response)  # Update the UI in real time

        # After streaming finishes, store the full answer in session state
        st.session_state.messages.append({"role": "assistant", "content": full_response})

Run the UI

streamlit run chat_with_ollama.py

Open the URL that Streamlit prints (usually http://localhost:8501). Pick gpt-oss:20b from the sidebar and start chatting. You’ll see the assistant stream its output as it’s generated, much like the ChatGPT web interface.

4. Exploring More Models

Ollama’s public library hosts dozens of models: LLaMA, Mistral, Qwen, and many more. Browse them at the official Ollama library (https://ollama.com/library). The commands below work inside the container; from the host, prefix each with docker exec -it ollama.

ollama list          # View downloaded models
ollama show <model>  # See model metadata
ollama pull <model>  # Download a new one

Feel free to experiment with different sizes and architectures—just remember larger models may need more RAM or GPU VRAM.
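
For example, to make the other sidebar options in the Streamlit app above usable, pull those models into the container as well (the names come from the model_options list in the snippet):

docker exec -it ollama ollama pull qwen3:8b
docker exec -it ollama ollama pull deepseek-r1:latest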

5. Common Pitfalls & Troubleshooting

  • Docker container dies immediately – missing GPU runtime or wrong image tag. Verify --gpus=all (NVIDIA) or the :rocm tag (AMD).
  • requests times out – Ollama isn’t listening on port 11434. Check docker ps and docker logs ollama.
  • Low GPU memory usage – the model isn’t actually using the GPU. Ensure you ran docker run with --gpus=all.
  • “Model not found” error – model name typo, or the model hasn’t been pulled yet. Run ollama pull <model> first.

6. Take‑away

You now have a fully local, end‑to‑end ChatGPT‑like stack:

  • Ollama runs a local (optionally GPU‑accelerated) inference server.
  • The gpt‑oss:20b model is your powerful, open‑source LLM.
  • Streamlit gives you a polished UI that streams responses.

All of this runs on a single machine—no cloud provider, no API key, no billing. Drop in your own prompts, fine‑tune the model, or integrate it into a larger application. The sky’s the limit.

Happy chatting! 🚀