Feature: Integrate GPU Node and Deploy Self-Hosted AI Stack (RAG) #3789

@binaryn3xus

Description

🚀 Overview

Expand the HomeOps cluster to support hardware-accelerated AI workloads. This involves bringing the existing Linux machine with the NVIDIA P2000 into the cluster as a dedicated worker node and deploying a Retrieval-Augmented Generation (RAG) stack to leverage the ProductDescription dataset.

🎯 Goals

  • Infrastructure: Join the P2000 machine to the Talos cluster with NVIDIA extensions.
  • Orchestration: Configure NVIDIA device plugins for Kubernetes resource scheduling.
  • Inference: Deploy a self-hosted LLM engine (Ollama) optimized for CUDA.
  • Data: Deploy a Vector Database (Qdrant) to handle embeddings and chunks for the product data.

🏗️ Proposed Architecture

We will follow a RAG (Retrieval-Augmented Generation) pattern. The P2000 will handle the heavy lifting of inference and embedding generation, while Flux manages the lifecycle of the AI components.
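
At query time the flow looks roughly like this; a minimal sketch, assuming in-cluster Service names, a product-descriptions collection, and llama3 as the generation model (all placeholders, none decided in this issue):

```python
import requests
from qdrant_client import QdrantClient

OLLAMA_URL = "http://ollama.apps.svc.cluster.local:11434"  # assumed Service name
QDRANT_URL = "http://qdrant.apps.svc.cluster.local:6333"   # assumed Service name


def answer(question: str) -> str:
    # 1. Embed the question with the same local model used at ingest time.
    emb = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": question},
        timeout=60,
    ).json()["embedding"]

    # 2. Retrieve the nearest product-description chunks from Qdrant.
    hits = QdrantClient(url=QDRANT_URL).search(
        collection_name="product-descriptions", query_vector=emb, limit=5
    )
    context = "\n".join(h.payload["chunk"] for h in hits)

    # 3. Generate an answer grounded in the retrieved chunks.
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "llama3",  # placeholder; any Ollama-served LLM works
            "prompt": f"Answer using this context:\n{context}\n\nQuestion: {question}",
            "stream": False,
        },
        timeout=300,
    )
    return resp.json()["response"]
```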


📋 Implementation Task List

Phase 1: Hardware & Node Prep (Talos)

  • Generate a new Talos machine configuration using the nonfree-kmod-nvidia and nvidia-container-toolkit system extensions (see the sketch after this list).
  • Provision the P2000 machine and join it to the cluster.
  • Label the node: kubectl label node <node-name> hardware-type=gpu.
  • (Optional) Taint the node to prevent non-AI workloads from consuming GPU-node resources: node-role.kubernetes.io/gpu:NoSchedule.
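
A minimal sketch of the Phase 1 artifacts, assuming the Image Factory workflow from the Talos NVIDIA guide linked below (the extension names come from this issue; everything else is a placeholder):

```yaml
# Image Factory schematic: bakes the NVIDIA extensions into the installer image.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
---
# Machine config patch for the GPU node: load the NVIDIA kernel modules at boot.
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
```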

Phase 2: GPU Operator & Plugins

  • Add the nvidia-device-plugin HelmRelease to kubernetes/apps (a sketch follows below).
  • Verify GPU availability:
    kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

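A minimal Flux sketch for the plugin (the chart repository URL is NVIDIA's public one; names, namespaces, the version range, and the Flux API versions are assumptions against a current Flux release):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: nvidia-device-plugin
  namespace: flux-system
spec:
  interval: 1h
  url: https://nvidia.github.io/k8s-device-plugin
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: nvidia-device-plugin
      version: ">=0.16.0"        # assumed range; pin to the current release
      sourceRef:
        kind: HelmRepository
        name: nvidia-device-plugin
        namespace: flux-system
  values:
    runtimeClassName: nvidia     # assumed chart value; check the chart's schema
    nodeSelector:
      hardware-type: gpu         # label applied in Phase 1
    tolerations:
      - key: node-role.kubernetes.io/gpu
        operator: Exists
        effect: NoSchedule       # matches the optional Phase 1 taint
---
# RuntimeClass pointing pods at the runtime that the nvidia-container-toolkit
# extension installs (per the Talos NVIDIA guide).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
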
Phase 3: AI Stack Deployment (via Flux)

  • [ ] Inference Engine: Deploy Ollama via Helm. Ensure the container requests nvidia.com/gpu: 1 in its resource limits (see the pod-spec sketch after this list).
  • [ ] Vector Database: Deploy Qdrant in the apps namespace.
  • [ ] Frontend: Deploy Open WebUI as an internal interface for interacting with the models.
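
For reference, these are the scheduling knobs any GPU workload on this node will need, shown as a bare pod-spec fragment; the Ollama chart exposes equivalents through its values, so treat the field placement as a sketch rather than the chart's schema:

```yaml
# Pod-spec fragment for a GPU-backed workload on the P2000 node.
spec:
  runtimeClassName: nvidia        # RuntimeClass created in Phase 2
  nodeSelector:
    hardware-type: gpu            # label applied in Phase 1
  tolerations:
    - key: node-role.kubernetes.io/gpu
      operator: Exists
      effect: NoSchedule          # matches the optional Phase 1 taint
  containers:
    - name: ollama
      image: ollama/ollama:latest # placeholder tag; pin a version in practice
      ports:
        - containerPort: 11434    # Ollama's default API port
      resources:
        limits:
          nvidia.com/gpu: "1"     # the whole GPU; the device plugin does not split it
```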

Phase 4: Data Integration (ProductDescription)

  • [ ] Create a workflow to ingest data from the ProductDescription table.
  • [ ] Perform chunking and generate embeddings using a local model (e.g., nomic-embed-text).
  • [ ] Upsert the embeddings and chunks into the Qdrant collection (a sketch of the whole pipeline follows below).
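
A hypothetical end-to-end sketch of this phase, assuming in-cluster Service DNS names, a fetch_rows() stub in place of the real ProductDescription query, 768-dimensional nomic-embed-text vectors, and a collection named product-descriptions (none of these names come from this issue):

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

OLLAMA_URL = "http://ollama.apps.svc.cluster.local:11434"  # assumed Service name
QDRANT_URL = "http://qdrant.apps.svc.cluster.local:6333"   # assumed Service name
COLLECTION = "product-descriptions"                        # assumed collection name
DIM = 768  # nomic-embed-text output dimension


def fetch_rows():
    """Stub for the real ProductDescription query; replace with the actual source."""
    yield {"id": 1, "text": "Example product description ..."}


def chunk(text: str, size: int = 512, overlap: int = 64):
    """Naive fixed-size character chunking with overlap between neighbours."""
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + size]


def embed(text: str) -> list[float]:
    """Embed one chunk with the local nomic-embed-text model served by Ollama."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]


def main():
    client = QdrantClient(url=QDRANT_URL)
    if not client.collection_exists(COLLECTION):
        client.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
        )
    points, next_id = [], 0
    for row in fetch_rows():
        for piece in chunk(row["text"]):
            points.append(PointStruct(
                id=next_id,
                vector=embed(piece),
                payload={"product_id": row["id"], "chunk": piece},
            ))
            next_id += 1
    client.upsert(collection_name=COLLECTION, points=points)


if __name__ == "__main__":
    main()
```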

📚 References

  • Talos Linux NVIDIA Guide
  • Ollama Helm Chart
  • NVIDIA Device Plugin

Notes: The P2000 provides 5GB of dedicated VRAM, which is significantly better for this task than the shared-memory Iris Xe iGPUs on the MS-01 nodes.
