## 🚀 Overview
Expand the HomeOps cluster to support hardware-accelerated AI workloads. This involves bringing the existing Linux machine with the NVIDIA P2000 into the cluster as a dedicated worker node and deploying a Retrieval-Augmented Generation (RAG) stack to leverage the ProductDescription dataset.
## 🎯 Goals
- **Infrastructure:** Join the P2000 machine to the Talos cluster with NVIDIA extensions.
- **Orchestration:** Configure the NVIDIA device plugin so Kubernetes can schedule GPU resources.
- **Inference:** Deploy a self-hosted LLM engine (Ollama) optimized for CUDA.
- **Data:** Deploy a vector database (Qdrant) to store the embeddings and chunks for the product data.
## 🏗️ Proposed Architecture
We will follow a RAG pattern: the P2000 handles the heavy lifting of inference and embedding generation, while Flux manages the lifecycle of the AI components.
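At a glance, the flow through the components proposed below looks roughly like this:

```
ingest:  ProductDescription ──► chunk + embed (Ollama, GPU) ──► Qdrant (vectors + text payload)
query:   user ──► Open WebUI ──► retrieve top-k chunks (Qdrant) ──► Ollama LLM (GPU) ──► answer
```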
## 📋 Implementation Task List
### Phase 1: Hardware & Node Prep (Talos)
- [ ] Generate a new Talos machine configuration using the `nonfree-kmod-nvidia` and `nvidia-container-toolkit` extensions (see the sketch after this list).
- [ ] Provision the P2000 machine and join it to the cluster.
- [ ] Label the node: `kubectl label node <node-name> hardware-type=gpu`.
- [ ] (Optional) Taint the node to prevent non-AI workloads from consuming GPU-node resources: `node-role.kubernetes.io/gpu:NoSchedule`.
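As a starting sketch, the NVIDIA bits land in two places: the Image Factory schematic (which bakes the extensions into the installer image) and a machine-config patch (which loads the kernel modules). The extension and module names below follow the Talos Linux NVIDIA Guide referenced at the end; confirm them against the cluster's Talos version.

```yaml
# Image Factory schematic (sketch): include the proprietary NVIDIA extensions.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
---
# Machine-config patch (sketch): load the NVIDIA kernel modules on the GPU node.
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
```

For the optional taint, the matching command would be `kubectl taint nodes <node-name> node-role.kubernetes.io/gpu=:NoSchedule`, paired with a toleration on the AI workloads in Phase 3.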
### Phase 2: GPU Operator & Plugins
- [ ] Add the `nvidia-device-plugin` HelmRelease to `kubernetes/apps` (a sketch follows this list).
- [ ] Verify GPU availability: `kubectl describe node <gpu-node-name> | grep nvidia.com/gpu`
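A minimal HelmRelease sketch, assuming the upstream chart from NVIDIA's k8s-device-plugin project and a matching HelmRepository source named `nvidia-device-plugin` in `flux-system`; names and namespaces are placeholders to adapt to the layout under `kubernetes/apps`:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: nvidia-device-plugin
      sourceRef:
        kind: HelmRepository
        name: nvidia-device-plugin
        namespace: flux-system
  values:
    # Run the plugin DaemonSet only on the node labeled in Phase 1.
    nodeSelector:
      hardware-type: gpu
    # Tolerate the optional GPU taint from Phase 1.
    tolerations:
      - key: node-role.kubernetes.io/gpu
        operator: Exists
        effect: NoSchedule
```

Once reconciled, the verification step above should report `nvidia.com/gpu: 1` under the node's Capacity and Allocatable.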
### Phase 3: AI Stack Deployment (via Flux)
- [ ] **Inference Engine:** Deploy Ollama via Helm, ensuring the pod's resource limits request `nvidia.com/gpu: 1` (see the sketch after this list).
- [ ] **Vector Database:** Deploy Qdrant in the `apps` namespace.
- [ ] **Frontend:** Deploy Open WebUI as an internal interface for interacting with the models.
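For the Ollama release, the GPU claim would be wired up roughly as below. The `values` keys are illustrative: chart layouts differ (e.g., the community otwld/ollama-helm chart nests some of these under `ollama`), so map them onto whichever chart is chosen.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ollama
  namespace: apps
spec:
  interval: 30m
  chart:
    spec:
      chart: ollama
      sourceRef:
        kind: HelmRepository
        name: ollama
        namespace: flux-system
  values:
    # Pin the pod to the GPU node and tolerate its taint.
    nodeSelector:
      hardware-type: gpu
    tolerations:
      - key: node-role.kubernetes.io/gpu
        operator: Exists
        effect: NoSchedule
    # Claim the single P2000; a GPU request only needs to appear in limits.
    resources:
      limits:
        nvidia.com/gpu: 1
```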
### Phase 4: Data Integration (ProductDescription)
- [ ] Create a workflow to ingest data from the ProductDescription table.
- [ ] Perform chunking and generate embeddings using a local model (e.g., `nomic-embed-text`).
- [ ] Upsert the embeddings and chunks into the Qdrant collection (see the smoke-test sketch after this list).
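As an end-to-end smoke test of that pipeline, the hypothetical Job below creates a Qdrant collection sized for `nomic-embed-text` (768-dimensional vectors), embeds one hard-coded sample chunk via Ollama's `/api/embeddings` endpoint, and upserts the result through Qdrant's REST API. The in-cluster service names (`ollama.apps.svc`, `qdrant.apps.svc`) and the collection name are assumptions; the real workflow would iterate over rows from the ProductDescription table instead of the sample string.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: rag-ingest-smoke-test
  namespace: apps
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ingest
          image: curlimages/curl:8.7.1
          command:
            - /bin/sh
            - "-ec"
            - |
              # Create the collection (768-dim Cosine vectors, matching nomic-embed-text).
              curl -s -X PUT http://qdrant.apps.svc:6333/collections/product_descriptions \
                -H 'Content-Type: application/json' \
                -d '{"vectors":{"size":768,"distance":"Cosine"}}'
              # Embed one sample chunk with Ollama.
              VECTOR=$(curl -s http://ollama.apps.svc:11434/api/embeddings \
                -d '{"model":"nomic-embed-text","prompt":"Sample ProductDescription chunk"}' \
                | sed 's/.*"embedding":\[\([^]]*\)\].*/\1/')
              # Upsert the vector with the original text as payload.
              curl -s -X PUT http://qdrant.apps.svc:6333/collections/product_descriptions/points \
                -H 'Content-Type: application/json' \
                -d "{\"points\":[{\"id\":1,\"vector\":[${VECTOR}],\"payload\":{\"text\":\"Sample ProductDescription chunk\"}}]}"
```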
## 📚 References
- Talos Linux NVIDIA Guide
- Ollama Helm Chart
- NVIDIA Device Plugin
**Note:** The P2000 provides 5 GB of dedicated VRAM, which makes it significantly better suited to this task than the shared-memory Iris Xe iGPUs on the MS-01 nodes.