GigaVector


GigaVector is a high-performance, production-ready vector database library written in C with Python bindings. It comprises 78,000+ lines of C across 101 modules covering indexing, search, storage, graphs, networking, security, and AI integration.


Feature Overview

Index Algorithms (8 types)

| Index | Type | Training | Best For |
|-------|------|----------|----------|
| KD-Tree | Exact | No | Low-dimensional data (< 20D) |
| HNSW | Approximate | No | General-purpose, high recall |
| IVF-PQ | Approximate | Yes | Large-scale, memory-efficient |
| IVF-Flat | Approximate | Yes | Large-scale, higher accuracy than IVF-PQ |
| Flat | Exact (brute-force) | No | Small datasets, baseline/ground-truth |
| PQ | Approximate | Yes | Compressed-domain search |
| LSH | Approximate | No | Fast hash-based approximate search |
| Sparse | Exact | No | Sparse vectors (NLP, BoW) |

Distance Metrics (5 types)

Euclidean, Cosine, Dot Product, Manhattan, Hamming -- all with SIMD-optimized implementations (SSE4.2, AVX2, AVX-512F, FMA).

Search Capabilities

  • k-NN search with configurable distance metrics
  • Range search -- find all vectors within a radius
  • Batch search -- multiple queries in one call
  • Filtered search -- metadata-based pre/post filtering
  • Dynamic search params -- per-query ef_search, nprobe, rerank tuning
  • Hybrid search -- combine vector similarity with BM25 full-text ranking (RRF, weighted, Borda fusion); see the sketch after this list
  • Scroll/pagination -- iterate over stored vectors with offset/limit
  • Score threshold filtering -- return only results above a distance/similarity cutoff
  • Grouped search -- group results by metadata field with per-group limits
  • Geo-spatial filtering -- radius and bounding-box queries on lat/lon fields
  • Late interaction / ColBERT -- multi-vector MaxSim scoring for token-level matching
  • Recommendation API -- positive/negative example-based recommendations with strategy selection
  • Delete/update by filter -- bulk delete, metadata update, and count by filter expression
  • MMR diversity reranking -- maximal marginal relevance for diverse result sets
  • Custom ranking expressions -- expression parser with decay functions (exp/gauss/linear) and score boosting
  • SQL query interface -- query with SELECT ... WHERE ... ORDER BY vector_distance(...) syntax
  • Phased ranking pipeline -- multi-stage ANN → rerank → filter pipeline with per-phase stats
  • Learned sparse index -- SPLADE-style token-weighted inverted index with WAND acceleration
  • Full-text search -- Porter stemming, multilingual (6 languages), BlockMax WAND, phrase matching
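
A minimal sketch of hybrid search, as referenced in the list above. HybridSearch, FusionMethod, and the keyword arguments below are hypothetical names chosen to mirror the feature description; consult the Python bindings for the exact interface.

from gigavector import Database, IndexType
# Hypothetical names -- illustrative assumptions, not confirmed bindings.
from gigavector import HybridSearch, FusionMethod  # assumed import

db = Database.open(None, dimension=128, index=IndexType.HNSW)
db.add_vector([0.1] * 128, metadata={"text": "vector databases in C"})

# Combine vector similarity with BM25 keyword ranking via RRF fusion.
hybrid = HybridSearch(db)  # assumed wrapper
results = hybrid.search(vector=[0.1] * 128,
                        text="vector database",   # BM25 full-text query
                        k=10,
                        fusion=FusionMethod.RRF)  # or weighted / Borda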

Storage and Persistence

  • Write-Ahead Logging (WAL) -- crash-safe durability with automatic replay
  • Snapshot persistence -- save/load full database state
  • Point-in-time snapshots -- create immutable snapshots for historical queries
  • Collection versioning -- version datasets with diff/compare/rollback
  • Memory-mapped I/O -- efficient file-backed storage
  • Incremental backup -- full and incremental backup with compression and CRC verification; sketched after this list
  • JSON import/export -- NDJSON format for interoperability
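
A minimal sketch of the incremental backup workflow referenced above. BackupManager and its methods are hypothetical names for illustration; the shipped API may differ.

from gigavector import Database, IndexType
# Hypothetical backup API -- names are illustrative assumptions.
from gigavector import BackupManager  # assumed import

db = Database.open("example.db", dimension=128, index=IndexType.HNSW)

backup = BackupManager(db, directory="backups/")
backup.create_full()           # full backup with compression + CRC verification
db.add_vector([0.3] * 128)
backup.create_incremental()    # captures only changes since the last backup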

Data Management

  • Rich metadata -- key-value pairs per vector with typed metadata support
  • Payload indexing -- sorted indexes for int/float/string/bool fields with range queries
  • Schema evolution -- versioned schemas with validation, diff, and compatibility checking
  • Upsert operations -- insert-or-update semantics
  • Batch delete -- delete multiple vectors in one call
  • Vector deduplication -- LSH-based near-duplicate detection
  • BM25 full-text search -- TF-IDF style keyword search on text fields
  • TTL (Time-to-Live) -- automatic expiry of vectors; sketched after this list
  • User-defined point IDs -- string/UUID IDs with bidirectional mapping to internal indices
  • Named vectors -- multiple named vector fields per point with independent dimensions
  • Collection aliases -- create, swap, and manage aliases that point to collections
  • Payload compression -- zlib/LZ4/zstd compression for stored vector payloads
  • JSON path indexing -- index nested JSON fields with dot-notation, type-aware range queries
  • CDC stream -- change data capture with ring buffer, polling cursors, and subscriber callbacks
  • Conditional updates -- CAS-style optimistic concurrency with per-vector versioning
  • Time-travel queries -- auto-versioned append-only log, query at any version or timestamp
  • Multimodal storage -- SHA-256 content-addressable blob storage for images, audio, video, documents
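
A minimal sketch of TTL expiry, as referenced above. The ttl_seconds keyword is an assumption for illustration; check the bindings for the exact parameter name.

from gigavector import Database, IndexType

db = Database.open(None, dimension=128, index=IndexType.FLAT)

# Hypothetical ttl_seconds keyword -- the real parameter may differ.
db.add_vector([0.1] * 128, metadata={"kind": "session"}, ttl_seconds=60)

# Once the TTL elapses, the vector is expired automatically and stops
# appearing in search results.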

Transactions and Concurrency

  • MVCC transactions -- snapshot isolation with begin/commit/rollback
  • Thread-safe -- reader-writer locks for concurrent access
  • Client-side caching -- LRU/LFU cache with TTL and mutation-based invalidation

Quantization and Compression

  • Product Quantization (PQ) -- codebook-based compression
  • Scalar Quantization -- configurable bit-width reduction
  • Binary Quantization -- 1-bit compression for HNSW; sketched after this list
  • Codebook sharing -- train once, share PQ codebooks across collections
  • Advanced quantization -- 1.5-bit (ternary), 2-bit, 4-bit, 8-bit; RaBitQ with Householder rotations; symmetric/asymmetric modes
  • Inline HNSW + incremental rebuild -- quantized vectors embedded in graph nodes with prefetch, background rebuild
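
A minimal sketch of binary quantization on HNSW, as referenced above. The quantization keyword, QuantizationConfig, and QuantizationType are hypothetical names; the shipped configuration surface may differ.

from gigavector import Database, IndexType, HNSWConfig
# Hypothetical quantization config -- names are illustrative assumptions.
from gigavector import QuantizationConfig, QuantizationType  # assumed import

db = Database.open(None, dimension=128, index=IndexType.HNSW,
                   hnsw_config=HNSWConfig(M=16, efConstruction=200, efSearch=64),
                   quantization=QuantizationConfig(type=QuantizationType.BINARY))

# 1-bit codes live inline in the graph nodes; candidates are scored on the
# compressed codes and can be reranked against the full-precision vectors.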

Distributed Architecture

  • HTTP REST server -- embedded server with rate limiting, CORS, and API key auth
  • gRPC API -- binary protocol server with connection pooling and streaming support
  • TLS/HTTPS -- TLS 1.2/1.3 transport encryption with certificate management
  • Sharding -- hash/range-based data partitioning
  • Replication -- leader-follower with automatic failover and election; sketched after this list
  • Read replica load balancing -- round-robin, least-lag, and random routing policies
  • Cluster management -- multi-node coordination
  • Namespace / multi-tenancy -- isolated collections within a single instance
  • Configurable consistency -- eventual, quorum, and strong consistency levels
  • Tenant quotas -- per-tenant limits on vector count, memory, and QPS
  • Tiered multitenancy -- shared/dedicated/premium tiers with auto-promote/demote and QPS tracking
  • Embedded / edge mode -- lightweight in-process database with memory budget, mmap, and quantization
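
A minimal sketch of a leader-follower deployment with replica read routing, as referenced above. ReplicationManager, RoutingPolicy, and the methods below are hypothetical names chosen to mirror the feature list; consult the bindings for the real API.

from gigavector import Database, IndexType
# Hypothetical replication API -- names are illustrative assumptions.
from gigavector import ReplicationManager, RoutingPolicy  # assumed import

db = Database.open("primary.db", dimension=128, index=IndexType.HNSW)

repl = ReplicationManager(db, role="leader")
repl.add_follower("10.0.0.2:8080")
repl.add_follower("10.0.0.3:8080")
repl.set_read_routing(RoutingPolicy.LEAST_LAG)  # or ROUND_ROBIN / RANDOM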

Security

  • Authentication -- API key and JWT-based auth
  • RBAC -- fine-grained role-based access control with per-collection permissions; sketched after this list
  • Cryptographic primitives -- SHA-256, HMAC for secure token handling
  • Enterprise SSO -- OIDC discovery, JWT validation, SAML XML parsing for enterprise identity providers
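
A minimal sketch of API-key auth with RBAC, as referenced above. RBACManager, Role, and Permission are hypothetical names mirroring the feature list; the concrete API lives in the bindings.

# Hypothetical RBAC API -- names are illustrative assumptions.
from gigavector import RBACManager, Role, Permission  # assumed import

rbac = RBACManager()

# A role with read-only access to a single collection.
reader = Role("analyst")
reader.grant("products", Permission.READ)

rbac.add_role(reader)
rbac.assign_api_key("gv_key_abc123", role="analyst")

# Enforced per request: writes to "products" with this key are rejected.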

Graph and Knowledge Graph

  • Property graph database -- nodes with labels/properties, directed weighted edges, hash table storage
  • Graph traversal -- BFS, DFS, Dijkstra shortest path, all-paths enumeration
  • Graph analytics -- PageRank (iterative power method), connected components, clustering coefficient, degree centrality
  • Knowledge graph -- entities with embeddings, SPO triple store with wildcard queries
  • Semantic entity search -- cosine similarity search over entity embeddings
  • Entity resolution -- name match + embedding similarity deduplication, entity merging
  • Link prediction -- embedding similarity + structural patterns (shared neighbors)
  • Hybrid graph+vector search -- embedding similarity filtered by entity type and predicate
  • Subgraph extraction -- BFS-based k-hop subgraph with entity and relation IDs
  • Graph persistence -- binary save/load for both graph DB ("GVGR") and knowledge graph ("GVKG")

AI Integration

  • LLM support -- OpenAI, Anthropic, Google Gemini (chat completions, streaming)
  • Embedding services -- OpenAI, Google, HuggingFace embedding APIs with caching
  • Auto-embedding -- server-side text-to-vector with configurable providers and batching; sketched after this list
  • Semantic memory layer -- extract, store, consolidate memories from conversations
  • Context graphs -- entity-relationship extraction for context-aware retrieval
  • Importance scoring -- rank memories by relevance and recency
  • ONNX model serving -- load ONNX models for inference, reranking, and embedding in the search pipeline
  • Agentic interfaces -- LLM-powered natural language query, data transformation, and personalization agents
  • MUVERA encoding -- compress ColBERT multi-vectors into single dense vectors via random projections
  • Integrated inference -- text-in/results-out API that combines auto-embedding with vector search
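
A minimal sketch of auto-embedding plus integrated inference (text in, results out), as referenced above. EmbeddingProvider, set_embedding_provider, add_text, and search_text are hypothetical names; an OPENAI_API_KEY is assumed to be configured (see Environment Variables below).

from gigavector import Database, IndexType
# Hypothetical auto-embedding API -- names are illustrative assumptions.
from gigavector import EmbeddingProvider  # assumed import

db = Database.open(None, dimension=1536, index=IndexType.HNSW)
db.set_embedding_provider(EmbeddingProvider.OPENAI,
                          model="text-embedding-3-small")

# Server-side text-to-vector on insert ...
db.add_text("GigaVector is a vector database written in C.",
            metadata={"source": "readme"})

# ... and text-in/results-out search.
results = db.search_text("fast vector search in C", k=5)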

Observability and Operations

  • Query optimizer -- cost-based strategy selection (exact scan vs index vs oversample+filter)
  • Query tracing -- span-level timing for search pipeline profiling
  • Bloom filter indexes -- probabilistic skip indexes for fast set membership
  • Index migration -- background thread rebuilds index while old one continues serving
  • DiskANN -- on-disk approximate nearest neighbor index with Vamana graph
  • Async vacuum -- background compaction with configurable thresholds and scheduling
  • Webhooks -- event-driven notifications for insert/delete/update operations; sketched after this list
  • GPU acceleration -- CUDA-based distance computation and batch search (optional)
  • Database statistics -- insert/query counts, latency tracking
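
A minimal sketch of webhook registration, as referenced above. WebhookManager and the event names are hypothetical; the actual registration API may differ.

from gigavector import Database, IndexType
# Hypothetical webhook API -- names are illustrative assumptions.
from gigavector import WebhookManager  # assumed import

db = Database.open("example.db", dimension=128, index=IndexType.HNSW)

hooks = WebhookManager(db)
hooks.register(url="https://example.com/gv-events",
               events=["insert", "delete", "update"])

# Each mutation now POSTs a JSON payload describing the event to the URL.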

Build

Make (default)

make lib        # static + shared libraries -> build/lib/
make c-test     # run all C tests (21 test suites)
make python-test # run Python test suite

CMake

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
cd build && ctest

CMake Options:

  • -DBUILD_SHARED_LIBS=ON/OFF -- shared library (default: ON)
  • -DBUILD_TESTS=ON/OFF -- test executables (default: ON)
  • -DBUILD_BENCHMARKS=ON/OFF -- benchmark executables (default: ON)
  • -DENABLE_SANITIZERS=ON/OFF -- ASAN, TSAN, UBSAN (default: OFF)
  • -DENABLE_COVERAGE=ON/OFF -- code coverage (default: OFF)

Sanitizer and Coverage Testing

make test-asan      # AddressSanitizer
make test-tsan      # ThreadSanitizer
make test-ubsan     # UndefinedBehaviorSanitizer
make test-valgrind  # Valgrind memory check
make test-coverage  # gcov/lcov coverage report
make test-all       # run everything

Python Bindings

Install

pip install gigavector        # from PyPI
cd python && pip install .    # from source

Quick Start

from gigavector import Database, DistanceType, IndexType

# Open / create a database
with Database.open("example.db", dimension=128, index=IndexType.HNSW) as db:
    # Add vectors with metadata
    db.add_vector([0.1] * 128, metadata={"category": "example"})

    # Search
    results = db.search([0.1] * 128, k=10, distance=DistanceType.COSINE)
    for hit in results:
        print(f"  index={hit.index}, distance={hit.distance:.4f}")

    # Save to disk
    db.save("example.db")

Index Types

# Flat (brute-force exact search)
db = Database.open(None, dimension=128, index=IndexType.FLAT)

# HNSW with custom config
from gigavector import HNSWConfig
db = Database.open(None, dimension=128, index=IndexType.HNSW,
                   hnsw_config=HNSWConfig(M=32, efConstruction=200, efSearch=100))

# IVF-PQ (requires training)
db = Database.open(None, dimension=128, index=IndexType.IVFPQ)
db.train_ivfpq(training_vectors)

# IVF-Flat (requires training)
from gigavector import IVFFlatConfig
db = Database.open(None, dimension=128, index=IndexType.IVFFLAT,
                   ivfflat_config=IVFFlatConfig(nlist=64, nprobe=8))
db.train_ivfflat(training_vectors)

# LSH (no training needed)
from gigavector import LSHConfig
db = Database.open(None, dimension=128, index=IndexType.LSH,
                   lsh_config=LSHConfig(num_tables=8, num_hash_bits=16))

Advanced Features

from gigavector import (
    SearchParams, BloomFilter, Cache, CacheConfig,
    Schema, SchemaFieldType, MVCCManager, QueryOptimizer,
    PayloadIndex, FieldType, DedupIndex, MultiVecIndex,
    SnapshotManager, VersionManager, Codebook, QueryTrace,
)

# Dynamic search parameters
results = db.search_with_params([0.1] * 128, k=10,
    distance=DistanceType.COSINE,
    params=SearchParams(ef_search=200, nprobe=16))

# Bloom filter for fast membership checks
bf = BloomFilter(expected_items=10000, fp_rate=0.01)
bf.add_string("hello")
assert "hello" in bf

# Client-side result caching
cache = Cache(CacheConfig(max_entries=1024, ttl_seconds=30))

# Schema validation
schema = Schema(version=1)
schema.add_field("name", SchemaFieldType.STRING, required=True)
schema.add_field("score", SchemaFieldType.FLOAT)
assert schema.validate({"name": "test", "score": 0.95})

# MVCC transactions
mvcc = MVCCManager(dimension=128)
with mvcc.begin() as txn:
    txn.add_vector([0.1] * 128)
    txn.add_vector([0.2] * 128)
    # auto-commits on exit, or auto-rolls-back on exception

# Query optimizer
opt = QueryOptimizer()
plan = opt.plan(k=10, has_filter=True, filter_selectivity=0.05)
print(f"Strategy: {plan.strategy.name}, ef_search={plan.ef_search}")

# Payload indexing
idx = PayloadIndex()
idx.add_field("category", FieldType.STRING)
idx.insert_string(0, "category", "science")

# Vector deduplication
dedup = DedupIndex(dimension=128)
dedup.insert([0.1] * 128)
is_duplicate = dedup.check([0.1] * 128)

# Multi-vector documents
mv = MultiVecIndex(dimension=128)
mv.add_document(doc_id=1, chunks=[[0.1]*128, [0.2]*128, [0.3]*128])
results = mv.search([0.15]*128, k=5)

# Point-in-time snapshots
snap_mgr = SnapshotManager(max_snapshots=10)

# Collection versioning
ver_mgr = VersionManager(max_versions=20)

# Codebook sharing (train once, reuse)
cb = Codebook(dimension=128, m=8, nbits=8)
cb.train(training_data)
cb.save("shared_codebook.bin")

# Query tracing
with QueryTrace() as trace:
    trace.span_start("search")
    results = db.search([0.1]*128, k=10)
    trace.span_end()

New Features (v0.8)

from gigavector import (
    PointIDMap, NamedVectorStore, VectorFieldConfig,
    GeoIndex, GeoPoint, GroupedSearch, GroupSearchConfig,
    DiskANNIndex, DiskANNConfig, Recommender, RecommendConfig,
    AliasManager, VacuumManager, ConsistencyLevel, ConsistencyManager,
    search_with_threshold, delete_by_filter, count_by_filter,
)

# User-defined string/UUID point IDs
id_map = PointIDMap()
id_map["doc-abc-123"] = 0
id_map["doc-def-456"] = 1
print(id_map["doc-abc-123"])  # 0

# Named vectors (multiple vector fields per point)
store = NamedVectorStore()
store.add_field(VectorFieldConfig(name="title", dimension=128))
store.add_field(VectorFieldConfig(name="content", dimension=256))
store.insert("title", 0, [0.1] * 128)
store.insert("content", 0, [0.2] * 256)

# Score threshold filtering
results = search_with_threshold(db, [0.1]*128, k=10, threshold=0.5)

# Delete/update by filter
deleted = delete_by_filter(db, 'category == "old"')
count = count_by_filter(db, 'status == "active"')

# Geo-spatial search
geo = GeoIndex()
geo.add(0, GeoPoint(lat=40.7128, lon=-74.0060))
nearby = geo.search_radius(GeoPoint(lat=40.71, lon=-74.01), radius_km=1.0, limit=10)

# Grouped search
gs = GroupedSearch(db)
groups = gs.search([0.1]*128, group_by="category",
                   config=GroupSearchConfig(group_size=3, num_groups=5))

# DiskANN on-disk index
disk_idx = DiskANNIndex(DiskANNConfig(dimension=128, max_degree=64, search_list_size=128))

# Recommendation
rec = Recommender(db)
results = rec.recommend(positive_ids=[0, 1], negative_ids=[5],
                        config=RecommendConfig(limit=10))

# Collection aliases
aliases = AliasManager()
aliases.create("production", "vectors_v2")
aliases.swap("production", "vectors_v3")

# Vacuum / compaction
vacuum = VacuumManager(db)
vacuum.run()

# Consistency levels
cm = ConsistencyManager()
cm.set_level(ConsistencyLevel.QUORUM)

Graph Database and Knowledge Graph

from gigavector import GraphDB, GraphDBConfig, KnowledgeGraph, KGConfig

# Property graph database
g = GraphDB(GraphDBConfig(node_bucket_count=4096))
alice = g.add_node("Person")
bob = g.add_node("Person")
g.set_node_prop(alice, "name", "Alice")
g.set_node_prop(bob, "name", "Bob")
e = g.add_edge(alice, bob, "KNOWS", weight=1.0)

# Traversal and analytics
visited = g.bfs(alice, max_depth=3)
path = g.shortest_path(alice, bob)
pr = g.pagerank(alice, iterations=20, damping=0.85)
cc = g.clustering_coefficient(alice)
components = g.connected_components()

# Persistence
g.save("social.gvgr")
g2 = GraphDB.load("social.gvgr")

# Knowledge graph with embeddings
kg = KnowledgeGraph(KGConfig(embedding_dimension=128))
e1 = kg.add_entity("Alice", "Person", embedding=[0.1] * 128)
e2 = kg.add_entity("Anthropic", "Company", embedding=[0.2] * 128)
kg.add_relation(e1, "works_at", e2, weight=1.0)

# SPO triple queries (None = wildcard)
triples = kg.query_triples(predicate="works_at")

# Semantic search over entity embeddings
results = kg.search_similar([0.15] * 128, k=5)

# Hybrid search (vector + type/predicate filters)
results = kg.hybrid_search([0.1] * 128, entity_type="Person",
                            predicate_filter="works_at", k=10)

# Entity resolution and link prediction
resolved_id = kg.resolve_entity("Alice Smith", "Person", embedding=[0.1] * 128)
predictions = kg.predict_links(e1, k=5)
subgraph = kg.extract_subgraph(center=e1, radius=2)

# Persistence
kg.save("knowledge.gvkg")
kg2 = KnowledgeGraph.load("knowledge.gvkg")

JSON Import/Export

db.export_json("vectors.ndjson")
db.import_json("vectors.ndjson")

Upsert and Batch Operations

db.upsert(index=0, data=[0.5] * 128, metadata={"updated": "true"})
deleted = db.delete_vectors([0, 1, 2])
entries = db.scroll(offset=0, limit=100)

REST API

GigaVector includes an embedded HTTP server for remote access.

from gigavector import Server, ServerConfig

config = ServerConfig(port=8080, thread_pool_size=4, enable_cors=True,
                      max_requests_per_second=100.0)
server = Server(db, config)
server.start()

Endpoints:

| Method | Path | Description |
|--------|------|-------------|
| GET | /health | Health check |
| GET | /stats | Database statistics |
| POST | /vectors | Add vector(s) |
| GET | /vectors/{id} | Get vector by index |
| PUT | /vectors/{id} | Update vector |
| DELETE | /vectors/{id} | Delete vector |
| POST | /search | k-NN search |
| POST | /search/range | Range search |
| POST | /search/batch | Batch search |
| POST | /compact | Trigger compaction |
| POST | /save | Save database to disk |
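
A minimal Python client against the embedded server, assuming the snippet above is running on port 8080. The JSON request and response shapes are assumptions for illustration, not a documented wire format.

import json
import urllib.request

# Assumed request body shape -- consult the server for the exact schema.
body = json.dumps({"vector": [0.1] * 128, "k": 10}).encode()

req = urllib.request.Request(
    "http://localhost:8080/search",
    data=body,
    headers={"Content-Type": "application/json",
             "X-API-Key": "your-api-key"},  # if API key auth is enabled
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))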

Environment Variables

cp .env.example .env   # copy and edit with your keys

| Variable | Required | Description |
|----------|----------|-------------|
| OPENAI_API_KEY | For LLM/embedding tests | OpenAI API key |
| ANTHROPIC_API_KEY | For Anthropic tests | Anthropic/Claude API key |
| GOOGLE_API_KEY | Optional | Google Gemini/embeddings |
| GV_WAL_DIR | Optional | Override WAL directory |

Project Structure

GigaVector/
├── include/gigavector/   # 83 public C headers
├── src/                  # 78,000+ lines of C implementation
├── tests/                # 21 C test suites
├── python/               # Python CFFI bindings (11,000+ lines)
├── benchmarks/           # SIMD and index benchmarks
├── docs/                 # Documentation
└── scripts/              # Build and utility scripts

Documentation

Full documentation lives in the docs/ directory.

License

This project is licensed under the DBaJ-NC-CFL License.
