Releases: mivertowski/RustCompute
v0.4.2: Warp-Shuffle Reductions, __nanosleep, libcu++ Atomics
What's New
This release upgrades the CUDA codegen with practical findings from CUDA hardware research, targeting CC 6.0+ GPUs with the existing cudarc 0.18.2 runtime.
Warp-Shuffle Block Reductions
- Two-phase warp-shuffle reduction replaces tree reduction in all generated CUDA reduction code
- Phase 1: Intra-warp __shfl_down_sync(0xFFFFFFFF, val, offset), with zero __syncthreads() calls
- Phase 2: Cross-warp reduction via shared memory, with one __syncthreads() call
- Reduces barrier count from O(log N) to 1 per block reduction (e.g., 9 → 1 for 512-thread blocks)
- Applied to: persistent FDTD energy reduction, standalone block/grid reduce helpers, and all inline reduction generators
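The two-phase scheme can be modelled on the CPU to see why one barrier is enough. The sketch below is a plain-Rust emulation, not the generated CUDA code: shfl_down, warp_reduce_sum, block_reduce_sum and tree_reduction_barriers are illustrative names, and the barrier count is bookkeeping rather than a real __syncthreads().

```rust
const WARP_SIZE: usize = 32;

/// Emulates `__shfl_down_sync` over one warp: lane i reads the value held by
/// lane i + offset; lanes whose source is out of range keep their own value.
fn shfl_down(lanes: &[f32; WARP_SIZE], offset: usize) -> [f32; WARP_SIZE] {
    let mut out = *lanes;
    for lane in 0..WARP_SIZE {
        if lane + offset < WARP_SIZE {
            out[lane] = lanes[lane + offset];
        }
    }
    out
}

/// Phase 1: intra-warp sum using shuffles only, so no barrier is needed.
/// After the loop, lane 0 holds the total for the warp.
fn warp_reduce_sum(mut lanes: [f32; WARP_SIZE]) -> f32 {
    let mut offset = WARP_SIZE / 2;
    while offset > 0 {
        let shifted = shfl_down(&lanes, offset);
        for lane in 0..WARP_SIZE {
            lanes[lane] += shifted[lane];
        }
        offset /= 2;
    }
    lanes[0]
}

/// Two-phase block reduction. Returns the sum and the number of barriers
/// the GPU version would execute.
fn block_reduce_sum(values: &[f32]) -> (f32, u32) {
    assert!(values.len() % WARP_SIZE == 0 && values.len() <= WARP_SIZE * WARP_SIZE);
    // Stand-in for shared memory: one partial sum per warp, unused slots stay 0.
    let mut shared = [0.0f32; WARP_SIZE];
    for (warp_id, chunk) in values.chunks(WARP_SIZE).enumerate() {
        let mut lanes = [0.0f32; WARP_SIZE];
        lanes.copy_from_slice(chunk);
        shared[warp_id] = warp_reduce_sum(lanes);
    }
    // The single barrier separates writing the partials from reading them.
    let barriers = 1;
    // Phase 2: the first warp reduces the per-warp partials.
    (warp_reduce_sum(shared), barriers)
}

/// Barriers used by a classic shared-memory tree reduction: one per halving step.
fn tree_reduction_barriers(block_size: u32) -> u32 {
    assert!(block_size.is_power_of_two());
    block_size.trailing_zeros()
}
```

For a 512-thread block, tree_reduction_barriers(512) gives 9 while block_reduce_sum reports 1, matching the figures quoted above.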
__nanosleep() Power Efficiency
- Persistent FDTD idle spin-wait now uses __nanosleep() instead of a volatile counter loop
- Software grid barrier spin-loop uses __nanosleep(100) to reduce power consumption
- Configurable via PersistentFdtdConfig::with_idle_sleep(ns) (default: 1000ns)
libcu++ Ordered Atomics (opt-in)
- Opt-in cuda::atomic_ref support for H2K/K2H queue operations and software barriers
- Uses memory_order_acquire/memory_order_release instead of __threadfence_system() pairs
- Software barrier uses cuda::thread_scope_device (narrower scope) with memory_order_acq_rel
- Compile-time CUDA 11.0+ version guard
- Enable via PersistentFdtdConfig::with_libcupp_atomics(true)
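The acquire/release pairing has a direct host-side analogue in Rust's std::sync::atomic. The sketch below is a CPU illustration of the ordering contract only, not the generated device code; publish_and_consume is a hypothetical helper, and the two atomics stand in for a queue slot and its ready flag.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// A producer writes a payload and then sets a flag with Release ordering.
/// The consumer spins on the flag with Acquire ordering, so once it sees
/// `true` the payload write is guaranteed to be visible without a full fence.
fn publish_and_consume(payload: u32) -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(payload, Ordering::Relaxed);
            // Release: everything written before this store is published with it.
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire: pairs with the Release store above.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    seen
}
```

Calling publish_and_consume(42) returns 42 on every run; with Relaxed ordering on the flag that guarantee would no longer hold.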
Files Changed
- crates/ringkernel-cuda-codegen/src/persistent_fdtd.rs: config fields, nanosleep, warp-shuffle reduction, libcu++ atomics
- crates/ringkernel-cuda-codegen/src/reduction_intrinsics.rs: warp-shuffle upgrade for all reduction helpers
Test Results
- 215 codegen unit tests + 12 integration tests — all passing
- 6 CUDA GPU execution tests — verified on RTX 2000 Ada (CC 8.9)
- Full workspace — zero failures
Full Changelog: v0.4.1...v0.4.2
v0.4.1
What's New
Property-Based Testing
- 13 proptest property tests for queue invariants (FIFO ordering, capacity bounds, stats consistency) and HLC properties (total ordering, causality preservation, pack/unpack round-trip)
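As an illustration of what such properties check, here is a dependency-free sketch of the pack/unpack round-trip and ordering properties. The 48-bit physical / 16-bit logical layout, the Hlc struct and the helper names are assumptions made for this example, not RingKernel's actual types, and a hand-rolled xorshift generator stands in for proptest's input generation.

```rust
/// Hypothetical HLC: 48-bit physical time plus a 16-bit logical counter.
/// Field order matters: the derived Ord compares physical first, then logical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    physical: u64,
    logical: u16,
}

/// Packs the timestamp into one u64 (physical in the high 48 bits).
fn pack(ts: Hlc) -> u64 {
    (ts.physical << 16) | ts.logical as u64
}

/// Inverse of `pack`.
fn unpack(raw: u64) -> Hlc {
    Hlc { physical: raw >> 16, logical: (raw & 0xFFFF) as u16 }
}

/// Small deterministic xorshift64 generator used to produce test inputs.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Checks two properties over `cases` random pairs:
/// 1. unpack(pack(ts)) == ts (round-trip)
/// 2. packing preserves the total order of timestamps
fn check_hlc_properties(cases: u32) -> bool {
    let mut rng = 0x9E37_79B9_7F4A_7C15u64;
    for _ in 0..cases {
        let a = Hlc { physical: xorshift(&mut rng) >> 16, logical: xorshift(&mut rng) as u16 };
        let b = Hlc { physical: xorshift(&mut rng) >> 16, logical: xorshift(&mut rng) as u16 };
        if unpack(pack(a)) != a || unpack(pack(b)) != b {
            return false;
        }
        if (a < b) != (pack(a) < pack(b)) {
            return false;
        }
    }
    true
}
```

With proptest the loop would be replaced by a proptest! block and generated inputs would be shrunk on failure; the properties themselves stay the same.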
Ecosystem Feature Bundles
- web = axum + tower + grpc
- data = arrow + polars
- monitoring = tracing-integration + prometheus
Codebase Consolidation
- Shared DSL marker functions: 27 functions deduplicated across CUDA and WGSL codegen backends (~300 lines removed)
- unavailable_backend! macro: single macro replaces triplicated backend stubs (~100 lines removed)
- Structured logging: replaced eprintln! with tracing macros across 6 crates
- Unsafe documentation: added // SAFETY: comments on all ~80+ unsafe blocks in GPU code
- Hot-path #[inline]: queue operations, HLC timestamps, control block accessors
Bug Fixes
- Tenant suspension now correctly deactivates tenants (was a no-op)
- Handler registration returns Result instead of panicking on duplicate ID
- TLS session resumption stores actual session ticket data
- CloudWatch audit sink returns explicit error instead of silently dropping events
Security Upgrades
- jsonwebtoken 9.2 → 10.3.0 (type confusion auth bypass)
- pyo3 0.22 → 0.24.2 (buffer overflow in PyString)
- iced 0.13 → 0.14.0 (fixes lru Stacked Borrows violation)
- bytes 1.11.0 → 1.11.1 (integer overflow in BytesMut)
- time 0.3.44 → 0.3.47 (stack exhaustion DoS)
Stats
- 1,416 tests passing, 0 failures, 96 GPU-only ignored
- Zero clippy warnings
- Net -224 lines of code (consolidation)
Install
[dependencies]
ringkernel = "0.4.1"
Full Changelog: v0.4.0...v0.4.1
v0.4.0: GPU Infrastructure Generalization & Python Bindings
Highlights
This release extracts ~7,000+ lines of proven GPU infrastructure from RustGraph into RingKernel, making these capabilities available to all RingKernel users.
New: Python Bindings (ringkernel-python)
PyO3-based Python wrapper with full async/await support:
import ringkernel
import asyncio

async def main():
    runtime = await ringkernel.RingKernel.create(backend="cpu")
    kernel = await runtime.launch("processor", ringkernel.LaunchOptions())
    await kernel.terminate()
    await runtime.shutdown()

asyncio.run(main())
Features:
- Async/await with sync fallbacks
- HLC timestamps and K2K messaging
- CUDA device enumeration and GPU memory pool management
- Benchmark framework with regression detection
- Hybrid CPU/GPU dispatcher with adaptive thresholds
- Resource guard for memory limit enforcement
- Type stubs for IDE support
New: PTX Compilation Cache
Disk-based PTX caching for faster kernel loading with SHA-256 content hashing and compute capability awareness.
New: GPU Stratified Memory Pool
Size-stratified GPU VRAM pool with 6 size classes (256B-256KB), O(1) allocation from free lists.
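A minimal sketch of the size-class idea follows. It assumes the six classes are spaced by a factor of four (256B, 1KB, 4KB, 16KB, 64KB, 256KB), which fits the stated range but is not confirmed by these notes; StratifiedPool and its methods are illustrative names, and plain offsets stand in for device pointers.

```rust
/// Assumed size classes: six buckets from 256 B to 256 KB, each 4x the previous.
const SIZE_CLASSES: [usize; 6] = [256, 1 << 10, 4 << 10, 16 << 10, 64 << 10, 256 << 10];

/// Toy pool: one free list per size class, holding offsets of released blocks.
struct StratifiedPool {
    free_lists: [Vec<usize>; 6],
    next_offset: usize,
}

impl StratifiedPool {
    fn new() -> Self {
        Self { free_lists: Default::default(), next_offset: 0 }
    }

    /// Smallest class that can hold `size`, or None if it exceeds the largest class.
    fn class_for(size: usize) -> Option<usize> {
        SIZE_CLASSES.iter().position(|&class_size| size <= class_size)
    }

    /// O(1) when the class has a free block: pop from its free list.
    /// Otherwise carve a fresh block from the end of the backing region.
    fn alloc(&mut self, size: usize) -> Option<(usize, usize)> {
        let class = Self::class_for(size)?;
        let offset = match self.free_lists[class].pop() {
            Some(reused) => reused,
            None => {
                let fresh = self.next_offset;
                self.next_offset += SIZE_CLASSES[class];
                fresh
            }
        };
        Some((class, offset))
    }

    /// Returns a block to its class; the next alloc of that class reuses it.
    fn free(&mut self, class: usize, offset: usize) {
        self.free_lists[class].push(offset);
    }
}
```

Rounding every request up to its class wastes some memory per block, which is the usual trade for constant-time allocation and no fragmentation within a class.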
New: Multi-Stream Execution Manager
Multi-stream CUDA execution for compute/transfer overlap with event-based synchronization.
New: Benchmark Framework
Comprehensive benchmarking with regression detection, baseline comparison, and multiple report formats (Markdown, JSON, LaTeX).
New: Hybrid CPU-GPU Dispatcher
Intelligent workload routing with adaptive threshold learning between CPU and GPU execution.
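One possible shape for adaptive threshold learning is sketched below. This is an assumed scheme written for illustration, not the dispatcher's actual algorithm: Dispatcher, Target, route and observe are hypothetical names, and the update rule (move the threshold part of the way toward half or double the misrouted size) is just one simple choice.

```rust
#[derive(Debug, PartialEq)]
enum Target {
    Cpu,
    Gpu,
}

/// Routes workloads by element count and nudges the cut-over point
/// whenever a measurement shows the current routing was wrong.
struct Dispatcher {
    threshold: f64,
    learning_rate: f64,
}

impl Dispatcher {
    /// Workloads at or above the threshold go to the GPU.
    fn route(&self, elements: usize) -> Target {
        if elements as f64 >= self.threshold {
            Target::Gpu
        } else {
            Target::Cpu
        }
    }

    /// Feeds back timings for a workload measured on both devices.
    fn observe(&mut self, elements: usize, cpu_secs: f64, gpu_secs: f64) {
        let n = elements as f64;
        let target = if gpu_secs < cpu_secs && n < self.threshold {
            // GPU won but we would have chosen the CPU: pull the threshold down.
            n / 2.0
        } else if cpu_secs < gpu_secs && n >= self.threshold {
            // CPU won but we would have chosen the GPU: push the threshold up.
            n * 2.0
        } else {
            return; // routing was already correct
        };
        self.threshold += self.learning_rate * (target - self.threshold);
    }
}
```

Starting from a threshold of 1000 with a learning rate of 0.5, one observation where the CPU wins at 1000 elements moves the threshold to 1500, so the same workload is routed to the CPU next time.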
New: Resource Guard
Memory limit enforcement with safety margins and RAII reservation patterns.
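The RAII reservation pattern can be sketched with a Drop implementation. The types below are stand-ins written for this example (the real ResourceGuard API is not shown in these notes): a reservation succeeds only if it fits under the limit minus the safety margin, and the bytes are released automatically when the reservation goes out of scope.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical guard: tracks reserved bytes against `limit - safety_margin`.
struct ResourceGuard {
    limit: usize,
    safety_margin: usize,
    used: Arc<AtomicUsize>,
}

/// RAII handle: holds its share of the budget until dropped.
struct Reservation {
    bytes: usize,
    used: Arc<AtomicUsize>,
}

impl ResourceGuard {
    fn new(limit: usize, safety_margin: usize) -> Self {
        Self { limit, safety_margin, used: Arc::new(AtomicUsize::new(0)) }
    }

    /// Atomically claims `bytes` if the total stays within budget.
    fn try_reserve(&self, bytes: usize) -> Option<Reservation> {
        let budget = self.limit.saturating_sub(self.safety_margin);
        self.used
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                let next = current.checked_add(bytes)?;
                if next <= budget { Some(next) } else { None }
            })
            .ok()?;
        Some(Reservation { bytes, used: Arc::clone(&self.used) })
    }

    fn used(&self) -> usize {
        self.used.load(Ordering::Acquire)
    }
}

impl Drop for Reservation {
    /// Gives the bytes back when the reservation leaves scope.
    fn drop(&mut self) {
        self.used.fetch_sub(self.bytes, Ordering::AcqRel);
    }
}
```

Because release happens in Drop, an early return or a panic in the code holding the reservation cannot leak budget.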
New: Kernel Mode Selector
Intelligent kernel launch configuration based on workload profile and GPU architecture.
See CHANGELOG.md for full details.
v0.3.2: GPU Profiling Infrastructure
What's New
GPU Profiling Infrastructure
- CUDA event-based timing and NVTX markers
- Memory allocation tracking
- Chrome trace export for visualization
Publishing Fixes
- Fixed publish script to add User-Agent header for crates.io API
- Updated dependency versions across all crates for v0.3.2 publishing
- ringkernel-ir, ringkernel-graph, ringkernel-montecarlo now use workspace versions
Crates Published
- ringkernel-core, ringkernel-cuda-codegen, ringkernel-wgpu-codegen
- ringkernel-derive, ringkernel-cpu, ringkernel-cuda, ringkernel-wgpu, ringkernel-metal
- ringkernel-codegen, ringkernel-ecosystem, ringkernel-audio-fft
- ringkernel (main crate)
See crates.io/crates/ringkernel for the published crates.
v0.3.1: Enterprise Readiness
RingKernel v0.3.1: Enterprise Readiness
This release adds comprehensive enterprise-grade features for production deployments.
🔐 Enterprise Security
- Real Cryptography: AES-256-GCM, ChaCha20-Poly1305, Argon2 key derivation
- Secrets Management: SecretStore trait with key rotation, caching, and chained stores
- K2K Message Encryption: Kernel-to-kernel encryption with forward secrecy
- TLS/mTLS Support: Full TLS with rustls, certificate rotation, SNI resolution
🔑 Authentication & Authorization
- Authentication Providers: ApiKeyAuth, JwtAuth (RS256/HS256), ChainedAuthProvider
- RBAC: Role-based access control with deny-by-default PolicyEvaluator
- Multi-tenancy: TenantContext, ResourceQuota, usage tracking
📊 Observability
- OpenTelemetry: OTLP export to Jaeger, Honeycomb, Datadog, Grafana Cloud
- Structured Logging: Multi-sink logger with trace correlation (JSON/Text)
- Alert Routing: Severity-based routing with deduplication (Slack, Teams, PagerDuty)
- Remote Audit Sinks: Syslog, CloudWatch Logs, Elasticsearch
⚡ Rate Limiting
- Algorithms: TokenBucket, SlidingWindow, LeakyBucket
- Builder API: Fluent configuration with RateLimiterBuilder
- Distributed: SharedRateLimiter for multi-instance deployments
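For readers unfamiliar with the token bucket algorithm, a minimal standalone sketch follows. It is not the ringkernel-core implementation: the TokenBucket struct here is a hypothetical stand-in, and time is passed in explicitly (in seconds) so the behaviour is deterministic.

```rust
/// Tokens refill continuously at `rate_per_sec` up to `burst`;
/// each accepted request spends one token.
struct TokenBucket {
    rate_per_sec: f64,
    burst: f64,
    tokens: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    /// Starts full, so an initial burst of `burst` requests is allowed.
    fn new(rate_per_sec: f64, burst: f64) -> Self {
        Self { rate_per_sec, burst, tokens: burst, last_refill_secs: 0.0 }
    }

    /// Refills based on elapsed time, then tries to spend one token.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.burst);
        // Never move the clock backwards if callers pass a stale timestamp.
        self.last_refill_secs = self.last_refill_secs.max(now_secs);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With a rate of 10 per second and a burst of 2, two requests at t = 0 succeed, a third is rejected, and after one idle second the bucket is full again (capped at the burst size, not at 10).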
🔧 Operational Excellence
- Automatic Recovery: Configurable policies per failure type (Restart, Migrate, Checkpoint, Notify, Escalate, Circuit)
- Operation Timeouts: Deadline propagation with Timeout and Deadline types
- Recovery Manager: Retry tracking, cooldown periods, automatic escalation
📦 Feature Flags
[dependencies]
ringkernel-core = { version = "0.3.1", features = ["enterprise"] }
# Or select specific features:
ringkernel-core = { version = "0.3.1", features = ["crypto", "auth", "tls", "rate-limiting", "alerting"] }
📈 Metrics
- Test Coverage: 900+ tests (up from 825+)
- Crates Published: 21 crates to crates.io
🚀 Quick Start
use ringkernel_core::prelude::*;
// Enterprise runtime with production preset
let runtime = RuntimeBuilder::new()
    .production()
    .build()?;
// API key authentication
let auth = ApiKeyAuth::new()
    .add_key("sk-prod-abc123", Identity::new("service-a"));
// Rate limiting
let limiter = RateLimiterBuilder::new()
    .algorithm(RateLimitAlgorithm::TokenBucket)
    .rate(1000)
    .burst(100)
    .build();
Full Changelog
See CHANGELOG.md for complete details.
v0.3.0: Multi-Kernel Dispatch, Memory Pools, Global Reductions
RingKernel v0.3.0
GPU-native persistent actor model framework for Rust. This release adds multi-kernel dispatch, memory pools, global reduction primitives, and two new crates.
Highlights
- 21 crates published to crates.io - Full workspace now available
- 825+ tests across the workspace
- cudarc 0.18.2 and wgpu 27.0 support
New Features
Multi-Kernel Dispatch and Persistent Message Routing
- #[derive(PersistentMessage)] macro for GPU kernel dispatch
- KernelDispatcher component with builder pattern and metrics
- CUDA handler dispatch code generator (CudaDispatchTable)
- Queue tiering system (QueueTier, QueueFactory, QueueMonitor)
Memory Pool Management
- StratifiedMemoryPool with 5 size buckets (256B to 64KB)
- AnalyticsContext for grouped buffer lifecycle
- PressureHandler for memory pressure monitoring
- CUDA ReductionBufferCache and WebGPU StagingBufferPool
Global Reduction Primitives
- ReductionOp enum: Sum, Min, Max, And, Or, Xor, Product
- ReductionBuffer<T> using mapped memory (zero-copy host read)
- Multi-phase kernel execution with SyncMode (Cooperative, SoftwareBarrier, MultiLaunch)
- PageRank example with dangling node handling
CUDA NVRTC Compilation
- compile_ptx() function for runtime CUDA compilation
- Downstream crates can compile CUDA without direct cudarc dependency
Domain System
- 20 business domains with reserved type ID ranges
- #[message(domain = "FraudDetection")] attribute
- Domains: GraphAnalytics, FraudDetection, ProcessIntelligence, Banking, etc.
New Crates
- ringkernel-montecarlo - Philox RNG, antithetic variates, control variates, importance sampling
- ringkernel-graph - CSR matrix, BFS, SCC (Tarjan/Kosaraju), Union-Find, SpMV
Breaking Changes
- cudarc API updated to 0.18.2 (module loading, kernel launch builder pattern)
- wgpu API updated to 27.0 (Arc-based resources)
Installation
[dependencies]
ringkernel = "0.3.0"
# Optional backends
ringkernel-cuda = "0.3.0"
ringkernel-wgpu = "0.3.0"
Documentation
Full Changelog: v0.2.0...v0.3.0
RingKernel v0.2.0
What's Changed
- Claude/persistent kernel implementation d nc3 o by @mivertowski in #9
Full Changelog: v0.1.3...v0.2.0
v0.1.3 - Dependency Updates & CI Fixes
Highlights
- wgpu 27.0 - Major update with Arc-based resource tracking (~40% performance improvement in some workloads)
- Dependency updates - tokio 1.48, axum 0.8, tonic 0.14, egui 0.31, winit 0.30
- CI/CD fixes - Workspace builds without CUDA/nvcc installed
What's Changed
Dependencies Updated
| Package | From | To |
|---|---|---|
| wgpu | 0.19 | 27.0 |
| tokio | 1.35 | 1.48 |
| thiserror | 1.0 | 2.0 |
| axum | 0.7 | 0.8 |
| tower | 0.4 | 0.5 |
| tonic | 0.11 | 0.14 |
| prost | 0.12 | 0.14 |
| egui/egui-wgpu/egui-winit | 0.27 | 0.31 |
| winit | 0.29 | 0.30 |
| glam | 0.27 | 0.29 |
| metal | 0.27 | 0.31 |
| arrow | 52 | 54 |
| polars | 0.39 | 0.46 |
| rayon | 1.10 | 1.11 |
| actix-rt | 2.9 | 2.10 |
Deferred Updates
- iced: Kept at 0.13 (0.14 requires major application API rewrite)
- rkyv: Kept at 0.7 (0.8 has incompatible data format)
CI/CD Improvements
- CUDA features are now opt-in (not default)
- Workspace builds succeed without nvcc installed
- Feature-gated CUDA tests with #[cfg(feature = "cuda")]
See CHANGELOG.md for full details.
v0.1.2
Release v0.1.2
- WaveSim3D - 3D acoustic wave simulation with realistic physics
  - Full 3D FDTD wave propagation solver
  - Binaural audio rendering with HRTF support
  - Volumetric ray marching visualization
  - GPU-native actor system for distributed simulation
- Expanded GPU intrinsics from ~45 to 120+ operations across 13 categories
  - Atomic operations: and, or, xor, inc, dec
  - 3D stencil intrinsics: up, down, at(dx, dy, dz)
  - Warp match/reduce operations (Volta+/SM 8.0+)
  - Bit manipulation, memory, special, and timing ops
- 171 tests (up from 143)
- Added required-features to CUDA-only wavesim binaries
- Updated GitHub Actions release workflow
See CHANGELOG.md for full details.
v0.1.1 - AccNet & ProcInt Showcase Applications
What's New
New Showcase Applications
AccNet - GPU-Accelerated Accounting Network Analytics
- Network visualization with force-directed graph layout
- Fraud detection: circular flows, threshold clustering, Benford's Law violations
- GAAP compliance checking for accounting rule violations
- Temporal analysis for seasonality, trends, and behavioral anomalies
- GPU kernels: Suspense detection, GAAP violation, Benford analysis, PageRank
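As background for the Benford's Law check, the expected share of amounts with leading digit d is log10(1 + 1/d). The sketch below is a CPU illustration with hypothetical helper names, not the AccNet GPU kernel; it scores a batch of amounts by the mean absolute deviation from that distribution, which is one simple metric among several that could be used.

```rust
/// Most significant decimal digit of a positive amount; None for zero.
fn leading_digit(mut amount: u64) -> Option<u32> {
    if amount == 0 {
        return None;
    }
    while amount >= 10 {
        amount /= 10;
    }
    Some(amount as u32)
}

/// Benford's Law: expected frequency of leading digit `d` (1..=9).
fn benford_expected(d: u32) -> f64 {
    (1.0 + 1.0 / d as f64).log10()
}

/// Mean absolute deviation between observed and expected leading-digit
/// frequencies. Larger values suggest the amounts deserve a closer look.
fn benford_deviation(amounts: &[u64]) -> f64 {
    let mut counts = [0u32; 10];
    let mut total = 0u32;
    for &amount in amounts {
        if let Some(d) = leading_digit(amount) {
            counts[d as usize] += 1;
            total += 1;
        }
    }
    if total == 0 {
        return 0.0;
    }
    (1..=9u32)
        .map(|d| (counts[d as usize] as f64 / total as f64 - benford_expected(d)).abs())
        .sum::<f64>()
        / 9.0
}
```

Naturally occurring amounts, such as successive powers of two, score low, while a ledger where every entry starts with 9 scores high.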
ProcInt - GPU-Accelerated Process Intelligence
- DFG (Directly-Follows Graph) mining from event streams
- Pattern detection: bottlenecks, loops, rework, long-running activities
- Conformance checking with fitness and precision metrics
- Timeline view with partial order traces and concurrent activity visualization
- Multi-sector templates: Healthcare, Manufacturing, Finance, IT
- GPU kernels: DFG construction, pattern detection, partial order derivation, conformance checking
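DFG mining reduces to counting how often one activity directly follows another within the same case. The sketch below is a sequential illustration of that idea, not the ProcInt GPU kernel; mine_dfg and the (case id, activity) event shape are assumptions made for the example, and events are assumed to arrive in timestamp order.

```rust
use std::collections::HashMap;

/// Builds a Directly-Follows Graph from a time-ordered event stream.
/// Each event is (case id, activity name); the result maps an edge
/// (from activity, to activity) to the number of times it was observed.
fn mine_dfg(events: &[(u32, &'static str)]) -> HashMap<(&'static str, &'static str), u32> {
    // Last activity seen so far for each case.
    let mut last_activity: HashMap<u32, &'static str> = HashMap::new();
    let mut edges: HashMap<(&'static str, &'static str), u32> = HashMap::new();
    for &(case_id, activity) in events {
        // `insert` returns the previous activity of this case, if any,
        // which is exactly the source node of the new edge.
        if let Some(previous) = last_activity.insert(case_id, activity) {
            *edges.entry((previous, activity)).or_insert(0) += 1;
        }
    }
    edges
}
```

Interleaved cases are handled naturally because edges are tracked per case id; a self-loop such as ("B", "B") shows up as a rework pattern.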
Changes
- Updated showcase documentation with AccNet and ProcInt sections
- Updated CI workflow to exclude CUDA tests on runners without GPU hardware
Fixes
- Fixed 14 clippy warnings in ringkernel-accnet
- Fixed benchmark API compatibility in ringkernel-accnet
- Fixed code formatting issues across showcase applications
Run the Applications
# AccNet - Accounting Network Analytics
cargo run -p ringkernel-accnet --release
# ProcInt - Process Intelligence
cargo run -p ringkernel-procint --release
Full Changelog: v0.1.0...v0.1.1