feat: add GPU buffer loader for IndexProvider integration#175

Open
cluster2600 wants to merge 26 commits into alibaba:main from cluster2600:feat/gpu-buffer-loader

Conversation

cluster2600 commented Feb 25, 2026

Summary

  • GpuBufferLoader (gpu_buffer_loader.h): streams vectors from any IndexProvider into contiguous GPU-ready float32 buffers
  • Metal C++ docs (docs/METAL_CPP.md): architecture overview and kernel reference

Replaces #174 (now closed), which incorrectly used a standalone RocksDB store. This PR integrates with zvec's existing storage architecture via IndexProvider::Iterator.

Follow-up to #166 ("Future Work: Integration with storage").

How it works

IndexProvider (Flat/HNSW/IVF)
    |
    +-- Iterator -> GpuBufferLoader::load() -> GpuBuffer
                                                  |
                                        +---------+----------+
                                        |                    |
                                  Metal device buf      cudaMemcpy
```cpp
auto provider = index->create_provider();
auto buffer = zvec::GpuBufferLoader::load(provider);

// buffer.vectors is a contiguous (N x dim) float32 array,
// ready for Metal newBufferWithBytes or cudaMemcpy.
```
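Conceptually, load() just walks the provider's iterator and packs each vector into one contiguous array. A minimal numpy sketch of that streaming step (the helper name and iterator protocol here are illustrative, not zvec's actual C++ API):

```python
import numpy as np

def load_contiguous(vectors, dim):
    """Pack an iterable of per-vector arrays into one contiguous
    (N x dim) float32 buffer, mirroring what a GPU loader hands to
    newBufferWithBytes or cudaMemcpy."""
    rows = [np.asarray(v, dtype=np.float32).reshape(dim) for v in vectors]
    if not rows:
        return np.empty((0, dim), dtype=np.float32)
    return np.ascontiguousarray(np.stack(rows))

# Example: three 4-dim vectors of mixed input precision.
buf = load_contiguous(
    [[1, 2, 3, 4], np.float16([0.5, 0.25, 0, 1]), [9, 8, 7, 6]], dim=4
)
assert buf.dtype == np.float32 and buf.flags["C_CONTIGUOUS"]
```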

Features

  • load() — stream all vectors into a single contiguous buffer
  • load_chunk() — chunked loading for datasets exceeding GPU memory
  • Automatic type conversion — FP16, INT8 -> FP32
  • Works with all index types — Flat, HNSW, IVF providers
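The FP16/INT8 → FP32 conversion is a widening operation, so no precision is lost on the way to the GPU buffer. A hedged numpy illustration of the semantics (not the header's actual code; real INT8 dequantization typically also applies a scale factor, which this omits):

```python
import numpy as np

# FP16 -> FP32 is lossless widening: every finite half-precision value
# is exactly representable in single precision.
half = np.array([0.5, -1.25, 65504.0], dtype=np.float16)  # 65504 = max finite FP16
as32 = half.astype(np.float32)
assert np.array_equal(as32, half.astype(np.float64).astype(np.float32))

# INT8 -> FP32 is also exact: all 256 int8 values fit in a float32
# mantissa, so the round trip recovers the original codes.
q = np.arange(-128, 128, dtype=np.int8)
assert np.array_equal(q.astype(np.float32).astype(np.int8), q)
```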

Why not RocksDB?

zvec already has a complete storage stack: IndexProvider -> Iterator -> block-based segments with mmap/buffer pool backends. A parallel RocksDB store would duplicate this. GpuBufferLoader sits on top of the existing pipeline instead.

Merge order

This PR shares a common base with #172, #173, #176. Recommended merge order: #172 → #173 → #175 → #176. Merging any one brings in the shared base commits; the rest then apply cleanly.

Test plan

  • Header compiles with clang++ C++17
  • Integrates with existing IndexProvider / IndexHolder::Iterator interfaces
  • End-to-end: Flat provider -> GpuBufferLoader -> Metal compute pipeline
  • Benchmark: load throughput for 1M+ vectors
  • Test FP16 and INT8 conversion paths

Commits

- backends/detect.py: Hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: Search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: Benchmarks

Internal sprint work - not for upstream PR.
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x lower latency with dynamic batching, 8x throughput
- Both support fallback when cuVS not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t

2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared memory version

3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected

2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration

3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection

2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search

3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation

2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout

3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
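The "PQ distance tables" item above refers to asymmetric distance computation: per-subspace lookup tables are built once per query, after which each code's approximate distance is just M table lookups summed. A small numpy sketch of the idea (codebook sizes are illustrative, not the ones used in batch.h):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, dsub = 4, 256, 8  # subquantizers, centroids per subquantizer, sub-dim
codebooks = rng.standard_normal((M, K, dsub)).astype(np.float32)
query = rng.standard_normal(M * dsub).astype(np.float32)

# Per-subspace tables of squared L2 distance between the query slice
# and every centroid: shape (M, K), built once per query.
tables = np.stack([
    ((codebooks[m] - query[m * dsub:(m + 1) * dsub]) ** 2).sum(axis=1)
    for m in range(M)
])

# A PQ code is one centroid id per subspace; its approximate distance
# to the query is M table lookups summed -- no float math per code.
code = np.array([3, 200, 17, 99])
approx = tables[np.arange(M), code].sum()
```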
Add GpuBufferLoader that bridges zvec's segment-based storage with GPU
compute pipelines (Metal, CUDA/cuVS). Streams vectors through the
existing IndexProvider::Iterator into contiguous float32 buffers ready
for direct GPU transfer.

GpuBufferLoader (gpu_buffer_loader.h):
- load(): stream all vectors from any IndexProvider into GpuBuffer
- load_chunk(): chunked loading for datasets larger than GPU memory
- Automatic FP16/INT8 → FP32 conversion
- Works with Flat, HNSW, and IVF index providers

Replaces the previous standalone RocksDB VectorStorage approach (PR
alibaba#174, now closed) with proper integration into zvec's existing storage
architecture.

Also adds Metal C++ backend documentation (docs/METAL_CPP.md) with
updated architecture diagram showing the IndexProvider → GpuBuffer →
Metal/CUDA pipeline.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and
  cagra.search(SearchParams, index, queries, k) instead of
  the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path
  (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays
  before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Add CUVS_AVAILABLE and CPP_CUVS_AVAILABLE flags to detect.py.
Update get_optimal_backend() priority chain:
  C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
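The priority chain in get_optimal_backend() amounts to first-available selection over the detection flags. A sketch of that logic (flag names other than CUVS_AVAILABLE and CPP_CUVS_AVAILABLE are assumed for illustration, not necessarily detect.py's):

```python
def get_optimal_backend(flags):
    """Return the highest-priority backend whose availability flag is set.

    `flags` maps flag names to booleans. Only CUVS_AVAILABLE and
    CPP_CUVS_AVAILABLE come from the commit above; the others are
    hypothetical stand-ins for the remaining detection flags.
    """
    priority = [
        ("CPP_CUVS_AVAILABLE", "cpp_cuvs"),
        ("CUVS_AVAILABLE", "cuvs"),
        ("FAISS_GPU_AVAILABLE", "faiss_gpu"),
        ("MPS_AVAILABLE", "mps"),
        ("FAISS_CPU_AVAILABLE", "faiss_cpu"),
    ]
    for flag, backend in priority:
        if flags.get(flag):
            return backend
    return "numpy"  # always-available fallback

assert get_optimal_backend({"CUVS_AVAILABLE": True, "MPS_AVAILABLE": True}) == "cuvs"
assert get_optimal_backend({}) == "numpy"
```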