Feature: Integrate GPU Node and Deploy Self-Hosted AI Stack (RAG) #3789

@binaryn3xus

Description

🚀 Overview

Expand the HomeOps cluster to support hardware-accelerated AI workloads. This involves bringing the existing Linux machine with the NVIDIA P2000 into the cluster as a dedicated worker node and deploying a Retrieval-Augmented Generation (RAG) stack to leverage the ProductDescription dataset.

🎯 Goals

  • Infrastructure: Join the P2000 machine to the Talos cluster with NVIDIA extensions.
  • Orchestration: Configure NVIDIA device plugins for Kubernetes resource scheduling.
  • Inference: Deploy a self-hosted LLM engine (Ollama) optimized for CUDA.
  • Data: Deploy a Vector Database (Qdrant) to handle embeddings and chunks for the product data.

🏗️ Proposed Architecture

We will follow a RAG (Retrieval-Augmented Generation) pattern. The P2000 will handle the heavy lifting of inference and embedding generation, while Flux manages the lifecycle of the AI components.
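
At query time the flow looks roughly like this; a minimal sketch, assuming in-cluster Service names, a product-descriptions collection, and llama3 as the generation model (all placeholders, none decided in this issue):

```python
import requests
from qdrant_client import QdrantClient

OLLAMA_URL = "http://ollama.apps.svc.cluster.local:11434"  # assumed Service name
QDRANT_URL = "http://qdrant.apps.svc.cluster.local:6333"   # assumed Service name


def answer(question: str) -> str:
    # 1. Embed the question with the same local model used at ingest time.
    emb = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": question},
        timeout=60,
    ).json()["embedding"]

    # 2. Retrieve the nearest product-description chunks from Qdrant.
    hits = QdrantClient(url=QDRANT_URL).search(
        collection_name="product-descriptions", query_vector=emb, limit=5
    )
    context = "\n".join(h.payload["chunk"] for h in hits)

    # 3. Generate an answer grounded in the retrieved chunks.
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "llama3",  # placeholder; any Ollama-served LLM works
            "prompt": f"Answer using this context:\n{context}\n\nQuestion: {question}",
            "stream": False,
        },
        timeout=300,
    )
    return resp.json()["response"]
```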


📋 Implementation Task List

Phase 1: Hardware & Node Prep (Talos)

  • Generate a new Talos machine configuration using the nonfree-kmod-nvidia and nvidia-container-toolkit system extensions (see the sketch after this list).
  • Provision the P2000 machine and join it to the cluster.
  • Label the node: kubectl label node <node-name> hardware-type=gpu.
  • (Optional) Taint the node to prevent non-AI workloads from consuming GPU-node resources: node-role.kubernetes.io/gpu:NoSchedule.
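
A minimal sketch of the Phase 1 artifacts, assuming the Image Factory workflow from the Talos NVIDIA guide linked below (the extension names come from this issue; everything else is a placeholder):

```yaml
# Image Factory schematic: bakes the NVIDIA extensions into the installer image.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
---
# Machine config patch for the GPU node: load the NVIDIA kernel modules at boot.
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
```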

Phase 2: GPU Operator & Plugins

  • Add the nvidia-device-plugin HelmRelease to kubernetes/apps (a sketch follows below).
  • Verify GPU availability:
    kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

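A minimal Flux sketch for the plugin (the chart repository URL is NVIDIA's public one; names, namespaces, the version range, and the Flux API versions are assumptions against a current Flux release):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: nvidia-device-plugin
  namespace: flux-system
spec:
  interval: 1h
  url: https://nvidia.github.io/k8s-device-plugin
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: nvidia-device-plugin
      version: ">=0.16.0"        # assumed range; pin to the current release
      sourceRef:
        kind: HelmRepository
        name: nvidia-device-plugin
        namespace: flux-system
  values:
    runtimeClassName: nvidia     # assumed chart value; check the chart's schema
    nodeSelector:
      hardware-type: gpu         # label applied in Phase 1
    tolerations:
      - key: node-role.kubernetes.io/gpu
        operator: Exists
        effect: NoSchedule       # matches the optional Phase 1 taint
---
# RuntimeClass pointing pods at the runtime that the nvidia-container-toolkit
# extension installs (per the Talos NVIDIA guide).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
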
Phase 3: AI Stack Deployment (via Flux)

  • [ ] Inference Engine: Deploy Ollama via Helm. Ensure the container requests nvidia.com/gpu: 1 in its resource limits (see the pod-spec sketch after this list).
  • [ ] Vector Database: Deploy Qdrant in the apps namespace.
  • [ ] Frontend: Deploy Open WebUI as an internal interface for interacting with the models.
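
For reference, these are the scheduling knobs any GPU workload on this node will need, shown as a bare pod-spec fragment; the Ollama chart exposes equivalents through its values, so treat the field placement as a sketch rather than the chart's schema:

```yaml
# Pod-spec fragment for a GPU-backed workload on the P2000 node.
spec:
  runtimeClassName: nvidia        # RuntimeClass created in Phase 2
  nodeSelector:
    hardware-type: gpu            # label applied in Phase 1
  tolerations:
    - key: node-role.kubernetes.io/gpu
      operator: Exists
      effect: NoSchedule          # matches the optional Phase 1 taint
  containers:
    - name: ollama
      image: ollama/ollama:latest # placeholder tag; pin a version in practice
      ports:
        - containerPort: 11434    # Ollama's default API port
      resources:
        limits:
          nvidia.com/gpu: "1"     # the whole GPU; the device plugin does not split it
```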

Phase 4: Data Integration (ProductDescription)

  • [ ] Create a workflow to ingest data from the ProductDescription table.
  • [ ] Perform chunking and generate embeddings using a local model (e.g., nomic-embed-text).
  • [ ] Upsert the embeddings and chunks into the Qdrant collection (a sketch of the whole pipeline follows below).
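
A hypothetical end-to-end sketch of this phase, assuming in-cluster Service DNS names, a fetch_rows() stub in place of the real ProductDescription query, 768-dimensional nomic-embed-text vectors, and a collection named product-descriptions (none of these names come from this issue):

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

OLLAMA_URL = "http://ollama.apps.svc.cluster.local:11434"  # assumed Service name
QDRANT_URL = "http://qdrant.apps.svc.cluster.local:6333"   # assumed Service name
COLLECTION = "product-descriptions"                        # assumed collection name
DIM = 768  # nomic-embed-text output dimension


def fetch_rows():
    """Stub for the real ProductDescription query; replace with the actual source."""
    yield {"id": 1, "text": "Example product description ..."}


def chunk(text: str, size: int = 512, overlap: int = 64):
    """Naive fixed-size character chunking with overlap between neighbours."""
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + size]


def embed(text: str) -> list[float]:
    """Embed one chunk with the local nomic-embed-text model served by Ollama."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]


def main():
    client = QdrantClient(url=QDRANT_URL)
    if not client.collection_exists(COLLECTION):
        client.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
        )
    points, next_id = [], 0
    for row in fetch_rows():
        for piece in chunk(row["text"]):
            points.append(PointStruct(
                id=next_id,
                vector=embed(piece),
                payload={"product_id": row["id"], "chunk": piece},
            ))
            next_id += 1
    client.upsert(collection_name=COLLECTION, points=points)


if __name__ == "__main__":
    main()
```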

📚 References

  • Talos Linux NVIDIA Guide
  • Ollama Helm Chart
  • NVIDIA Device Plugin

Notes: The P2000 provides 5GB of dedicated VRAM, which is significantly better for this task than the shared-memory Iris Xe iGPUs on the MS-01 nodes.
