The final resting place for obsolete CUDA kernels.
Autonomous AI that discovers faster, more energy-efficient kernels than any human engineer – burying cuBLAS, FlashAttention, and Triton in eternal obsolescence. No mercy, no weekends, just relentless optimization loops.
- GEMM 8k×8k (vs cuBLAS Lt 12.6) – Target: +25% TFLOPS, -40% watts on H100
- FlashAttention-3 kernels (vs xformers) – Roofline domination
- Llama.cpp GGUF inference (Q4_K_M + imatrix) – CPU+GPU hybrid speedup
- Custom LLVM passes for SPECfp – Auto-discover unrolling magic
Each kill includes:
- Full
.cusource + git patch - Benchmark tables (TFLOPS, latency, power draw)
- Flamegraphs + nsight reports
- Arrogant auto-blogpost: "We just obsoleted 5 years of NVIDIA sweat."
- Clone:
git clone https://github.com/zedarvates/CUDAgraveyard.git - Setup:
pip install -r requirements.txt(Grok-4/Claude API, shinka-evolve, nsight-sdk) - Launch a demon:
python launch_demon.py --template templates/gemm_toon_v1.json --gpus 0 - Watch the graveyard fill: Logs in
graves/(dead kernels + their epitaphs).
- Core: AI-Scientist-v2 (SakanaAI fork) + ShinkaEvolve for genetic mutations
- Tools: nvcc 12.6+, nsight-compute, hyperfine, nvidia-smi power monitoring
- AI Brain: Grok-4 / Claude-3.5-Sonnet-2025 / o3-pro (reasoning + code gen)
- Hardware: Tested on H100/L40S/MI300X; Docker for sandbox kills
- Safety: Strict timeouts + ethical watcher (no infinite loops or self-hacks)
Warning: Running this may void warranties, melt PSUs, and put perf engineers out of work. Proceed with devilish intent.
- Q4 2025: GEMM graveyard opening
- Q1 2026: Multi-kernel massacre (attention + transformers)
- Q2 2026: Hybrid CPU-GPU-Quantique integration (link to CosmosCluster)
- Beyond: Self-optimizing the optimizer (Darwin Gödel vibes)
Fork it, contribute kills, or just watch the jobs vanish.
Built by zedarvates – because why hire when AI slays?