Releases: mivertowski/RustCompute

v0.4.2: Warp-Shuffle Reductions, __nanosleep, libcu++ Atomics

06 Feb 22:13

What's New

This release upgrades the CUDA codegen with practical findings from CUDA hardware research, targeting CC 6.0+ GPUs with the existing cudarc 0.18.2 runtime.

Warp-Shuffle Block Reductions

  • Two-phase warp-shuffle reduction replaces tree reduction in all generated CUDA reduction code
  • Phase 1: Intra-warp __shfl_down_sync(0xFFFFFFFF, val, offset) — zero __syncthreads() calls
  • Phase 2: Cross-warp reduction via shared memory — one __syncthreads() call
  • Reduces barrier count from O(log N) to 1 per block reduction (e.g., 9 → 1 for 512-thread blocks)
  • Applied to: persistent FDTD energy reduction, standalone block/grid reduce helpers, and all inline reduction generators
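
The two phases can be followed on the CPU. The sketch below is a plain-Rust emulation written for illustration, not the generated CUDA: arrays stand in for warp lanes and shared memory, and all function names are hypothetical. It assumes 32-lane warps and a block size that is a multiple of 32.

```rust
const WARP_SIZE: usize = 32;

/// Emulates a shuffle-down sum across one 32-lane warp.
fn warp_reduce_sum(lanes: &mut [f32; WARP_SIZE]) -> f32 {
    let mut offset = WARP_SIZE / 2;
    while offset > 0 {
        // Snapshot first: on the GPU every lane reads its neighbour's value
        // before any lane overwrites its own register.
        let snapshot = *lanes;
        // Lanes past WARP_SIZE - offset never feed lane 0, so they are skipped.
        for lane in 0..WARP_SIZE - offset {
            lanes[lane] += snapshot[lane + offset];
        }
        offset /= 2;
    }
    lanes[0] // lane 0 ends up holding the warp total
}

/// Reduces a block of up to 1024 values (a multiple of 32).
/// Returns the sum and the number of block-wide barriers a GPU version would need.
fn block_reduce_sum(values: &[f32]) -> (f32, u32) {
    assert!(!values.is_empty());
    assert!(values.len() % WARP_SIZE == 0 && values.len() <= WARP_SIZE * WARP_SIZE);

    // Phase 1: each warp reduces privately and writes one partial to "shared memory".
    let mut shared = [0.0f32; WARP_SIZE];
    for (warp_id, chunk) in values.chunks(WARP_SIZE).enumerate() {
        let mut lanes = [0.0f32; WARP_SIZE];
        lanes.copy_from_slice(chunk);
        shared[warp_id] = warp_reduce_sum(&mut lanes);
    }

    // The only barrier: partials must be visible before phase 2 reads them.
    let barriers = 1;

    // Phase 2: the first warp reduces the per-warp partials.
    let total = warp_reduce_sum(&mut shared);
    (total, barriers)
}

/// Barriers used by a classic shared-memory tree reduction: one per halving step.
fn tree_reduction_barriers(block_size: u32) -> u32 {
    assert!(block_size.is_power_of_two());
    block_size.trailing_zeros()
}
```

Calling block_reduce_sum on the values 1.0 through 512.0 returns (131328.0, 1), while tree_reduction_barriers(512) returns 9, which matches the 9 → 1 figure quoted above.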

__nanosleep() Power Efficiency

  • Persistent FDTD idle spin-wait now uses __nanosleep() instead of a volatile counter loop
  • Software grid barrier spin-loop uses __nanosleep(100) to reduce power consumption
  • Configurable via PersistentFdtdConfig::with_idle_sleep(ns) (default: 1000ns)

libcu++ Ordered Atomics (opt-in)

  • Opt-in cuda::atomic_ref support for H2K/K2H queue operations and software barriers
  • Uses memory_order_acquire/memory_order_release instead of __threadfence_system() pairs
  • Software barrier uses cuda::thread_scope_device (narrower scope) with memory_order_acq_rel
  • Compile-time CUDA 11.0+ version guard
  • Enable via PersistentFdtdConfig::with_libcupp_atomics(true)
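
For readers less familiar with acquire/release ordering, the sketch below shows the same publish/consume pattern with Rust's standard atomics, which follow the C++ memory model that libcu++ builds on. It is a host-side analogy only: std atomics have no thread scopes, and the Slot type and function names are hypothetical rather than part of RingKernel.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// A one-slot mailbox: `payload` is published by setting `full`.
struct Slot {
    payload: AtomicU64,
    full: AtomicBool,
}

/// Producer side: write the payload, then publish it with a release store.
fn publish(slot: &Slot, value: u64) {
    slot.payload.store(value, Ordering::Relaxed);
    // Release: everything written above is visible to whoever acquires `full`.
    slot.full.store(true, Ordering::Release);
}

/// Consumer side: spin until the flag is acquired, then read the payload.
fn consume(slot: &Slot) -> u64 {
    while !slot.full.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // The acquire load above synchronizes with the release store in `publish`.
    slot.payload.load(Ordering::Relaxed)
}

/// Sends one value from a producer thread to the calling thread.
fn handoff(value: u64) -> u64 {
    let slot = Arc::new(Slot {
        payload: AtomicU64::new(0),
        full: AtomicBool::new(false),
    });
    let producer_slot = Arc::clone(&slot);
    let producer = thread::spawn(move || publish(&producer_slot, value));
    let received = consume(&slot);
    producer.join().unwrap();
    received
}
```

The ordering is attached to the flag operation itself, so no separate fence call is needed on either side. That is the same trade the release describes for the queue and barrier code.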

Files Changed

  • crates/ringkernel-cuda-codegen/src/persistent_fdtd.rs — config fields, nanosleep, warp-shuffle reduction, libcu++ atomics
  • crates/ringkernel-cuda-codegen/src/reduction_intrinsics.rs — warp-shuffle upgrade for all reduction helpers

Test Results

  • 215 codegen unit tests + 12 integration tests — all passing
  • 6 CUDA GPU execution tests — verified on RTX 2000 Ada (CC 8.9)
  • Full workspace — zero failures

Full Changelog: v0.4.1...v0.4.2

v0.4.1

06 Feb 21:19

What's New

Property-Based Testing

  • 13 proptest property tests for queue invariants (FIFO ordering, capacity bounds, stats consistency) and HLC properties (total ordering, causality preservation, pack/unpack round-trip)
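
As an illustration of what two of the HLC properties assert, here is a std-only sketch. The bit layout (48-bit physical time, 16-bit logical counter) is an assumption made for the example and may differ from RingKernel's actual encoding, and a small xorshift generator stands in for proptest's input strategies.

```rust
/// Hypothetical HLC layout for illustration: 48-bit physical time, 16-bit logical counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    physical: u64, // must fit in 48 bits
    logical: u16,
}

fn pack(ts: Hlc) -> u64 {
    debug_assert!(ts.physical < (1u64 << 48));
    (ts.physical << 16) | ts.logical as u64
}

fn unpack(bits: u64) -> Hlc {
    Hlc {
        physical: bits >> 16,
        logical: (bits & 0xFFFF) as u16,
    }
}

/// Tiny deterministic generator (xorshift64), used here in place of proptest.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn random_hlc(state: &mut u64) -> Hlc {
    let physical = xorshift64(state) >> 16; // keep the top 48 bits
    let logical = xorshift64(state) as u16;
    Hlc { physical, logical }
}

/// Checks both properties over `cases` generated inputs; returns the number checked.
fn check_properties(seed: u64, cases: usize) -> usize {
    let mut state = seed.max(1); // xorshift must not start at zero
    for _ in 0..cases {
        let (a, b) = (random_hlc(&mut state), random_hlc(&mut state));
        // Round-trip: unpacking a packed timestamp returns the original.
        assert_eq!(unpack(pack(a)), a);
        // Ordering: comparing packed integers agrees with (physical, logical) order.
        assert_eq!(pack(a).cmp(&pack(b)), a.cmp(&b));
    }
    cases
}
```

With proptest, the loop and generator would be replaced by the proptest! macro and its strategies, which also shrink failing inputs to a minimal case.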

Ecosystem Feature Bundles

  • web = axum + tower + grpc
  • data = arrow + polars
  • monitoring = tracing-integration + prometheus

Codebase Consolidation

  • Shared DSL marker functions — 27 functions deduplicated across CUDA and WGSL codegen backends (~300 lines removed)
  • unavailable_backend! macro — single macro replaces triplicated backend stubs (~100 lines removed)
  • Structured logging — replaced eprintln! with tracing macros across 6 crates
  • Unsafe documentation: // SAFETY: comments on all 80+ unsafe blocks in GPU code
  • Hot-path #[inline] — queue operations, HLC timestamps, control block accessors

Bug Fixes

  • Tenant suspension now correctly deactivates tenants (was a no-op)
  • Handler registration returns Result instead of panicking on duplicate ID
  • TLS session resumption stores actual session ticket data
  • CloudWatch audit sink returns explicit error instead of silently dropping events

Security Upgrades

  • jsonwebtoken 9.2 → 10.3.0 (type confusion auth bypass)
  • pyo3 0.22 → 0.24.2 (buffer overflow in PyString)
  • iced 0.13 → 0.14.0 (fixes lru Stacked Borrows violation)
  • bytes 1.11.0 → 1.11.1 (integer overflow in BytesMut)
  • time 0.3.44 → 0.3.47 (stack exhaustion DoS)

Stats

  • 1,416 tests passing, 0 failures, 96 GPU-only ignored
  • Zero clippy warnings
  • Net -224 lines of code (consolidation)

Install

[dependencies]
ringkernel = "0.4.1"

Full Changelog: v0.4.0...v0.4.1

v0.4.0: GPU Infrastructure Generalization & Python Bindings

25 Jan 21:23

Highlights

This release extracts ~7,000+ lines of proven GPU infrastructure from RustGraph into RingKernel, making these capabilities available to all RingKernel users.

New: Python Bindings (ringkernel-python)

PyO3-based Python wrapper with full async/await support:

import ringkernel
import asyncio

async def main():
    runtime = await ringkernel.RingKernel.create(backend="cpu")
    kernel = await runtime.launch("processor", ringkernel.LaunchOptions())
    await kernel.terminate()
    await runtime.shutdown()

asyncio.run(main())

Features:

  • Async/await with sync fallbacks
  • HLC timestamps and K2K messaging
  • CUDA device enumeration and GPU memory pool management
  • Benchmark framework with regression detection
  • Hybrid CPU/GPU dispatcher with adaptive thresholds
  • Resource guard for memory limit enforcement
  • Type stubs for IDE support

New: PTX Compilation Cache

Disk-based PTX caching for faster kernel loading with SHA-256 content hashing and compute capability awareness.
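
A rough sketch of how a cache entry might be keyed on both the kernel source and the target architecture. The file-name scheme and function names are assumptions for illustration. To keep the example std-only, FNV-1a is swapped in for the SHA-256 hash the release actually uses, and it is not a suitable replacement in practice.

```rust
/// FNV-1a 64-bit hash. Stand-in only: the release describes SHA-256 content hashing.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical cache file name: content hash plus compute capability.
/// The same source compiled for a different architecture gets a different entry.
fn cache_file_name(cuda_source: &str, cc_major: u32, cc_minor: u32) -> String {
    format!(
        "{:016x}_sm{}{}.ptx",
        fnv1a64(cuda_source.as_bytes()),
        cc_major,
        cc_minor
    )
}
```

Including the compute capability in the key means PTX built for one GPU generation is never served to another, while any edit to the source changes the hash and forces a recompile.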

New: GPU Stratified Memory Pool

Size-stratified GPU VRAM pool with 6 size classes (256B-256KB), O(1) allocation from free lists.
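
A minimal sketch of the idea, assuming the six classes are powers of four (256 B, 1 KB, 4 KB, 16 KB, 64 KB, 256 KB). The release states only the count and the range, so the intermediate sizes are a guess. Buffers are modelled as offsets into an imaginary arena, and the type and method names are hypothetical.

```rust
/// Assumed class sizes: six powers of four from 256 B to 256 KB.
const CLASS_SIZES: [usize; 6] = [256, 1 << 10, 4 << 10, 16 << 10, 64 << 10, 256 << 10];

/// Picks the smallest class that fits, or None for oversized requests.
fn size_class(bytes: usize) -> Option<usize> {
    CLASS_SIZES.iter().position(|&cap| bytes <= cap)
}

/// One free list per class; "buffers" are plain offsets into an imaginary VRAM arena.
struct StratifiedPool {
    free_lists: [Vec<usize>; 6],
    next_offset: usize,
}

impl StratifiedPool {
    fn new() -> Self {
        StratifiedPool {
            free_lists: Default::default(),
            next_offset: 0,
        }
    }

    /// O(1): pop a recycled buffer, or bump-allocate a fresh one.
    /// Returns (class index, offset), or None if the request exceeds the largest class.
    fn alloc(&mut self, bytes: usize) -> Option<(usize, usize)> {
        let class = size_class(bytes)?;
        let offset = match self.free_lists[class].pop() {
            Some(recycled) => recycled,
            None => {
                let fresh = self.next_offset;
                self.next_offset += CLASS_SIZES[class];
                fresh
            }
        };
        Some((class, offset))
    }

    /// O(1): push the buffer back onto its class's free list.
    fn free(&mut self, class: usize, offset: usize) {
        self.free_lists[class].push(offset);
    }
}
```

Rounding every request up to a class size wastes some memory, but it makes any freed buffer reusable for any later request in the same class. That is what allows allocation to be a single pop.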

New: Multi-Stream Execution Manager

Multi-stream CUDA execution for compute/transfer overlap with event-based synchronization.

New: Benchmark Framework

Comprehensive benchmarking with regression detection, baseline comparison, and multiple report formats (Markdown, JSON, LaTeX).

New: Hybrid CPU-GPU Dispatcher

Intelligent workload routing with adaptive threshold learning between CPU and GPU execution.

New: Resource Guard

Memory limit enforcement with safety margins and RAII reservation patterns.
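
The RAII reservation pattern can be sketched as follows. This is a single-threaded illustration with hypothetical names and an integer-percent safety margin. It is not the RingKernel API, and a real guard would use atomics to be thread-safe.

```rust
use std::cell::Cell;

/// Tracks reserved bytes against a limit reduced by a safety margin.
struct ResourceGuard {
    usable: u64,
    reserved: Cell<u64>,
}

/// Holds a reservation; dropping it returns the bytes to the guard.
struct Reservation<'a> {
    guard: &'a ResourceGuard,
    bytes: u64,
}

impl ResourceGuard {
    /// `margin_percent` is the share of `limit` that is always kept free.
    fn new(limit: u64, margin_percent: u64) -> Self {
        assert!(margin_percent <= 100);
        let usable = limit - limit * margin_percent / 100;
        ResourceGuard {
            usable,
            reserved: Cell::new(0),
        }
    }

    /// Succeeds only if the reservation keeps total usage within the usable budget.
    fn try_reserve(&self, bytes: u64) -> Option<Reservation<'_>> {
        let after = self.reserved.get().checked_add(bytes)?;
        if after > self.usable {
            return None;
        }
        self.reserved.set(after);
        Some(Reservation { guard: self, bytes })
    }

    fn in_use(&self) -> u64 {
        self.reserved.get()
    }
}

impl Drop for Reservation<'_> {
    fn drop(&mut self) {
        // Release happens automatically when the reservation goes out of scope.
        self.guard.reserved.set(self.guard.reserved.get() - self.bytes);
    }
}
```

With a 1000-byte limit and a 10% margin, a 600-byte reservation succeeds, a further 400 bytes is refused while it is held, and the budget returns to zero as soon as the first reservation is dropped.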

New: Kernel Mode Selector

Intelligent kernel launch configuration based on workload profile and GPU architecture.


See CHANGELOG.md for full details.

v0.3.2: GPU Profiling Infrastructure

21 Jan 09:54

What's New

GPU Profiling Infrastructure

  • CUDA event-based timing and NVTX markers
  • Memory allocation tracking
  • Chrome trace export for visualization

Publishing Fixes

  • Fixed publish script to add User-Agent header for crates.io API
  • Updated dependency versions across all crates for v0.3.2 publishing
  • ringkernel-ir, ringkernel-graph, ringkernel-montecarlo now use workspace versions

Crates Published

  • ringkernel-core, ringkernel-cuda-codegen, ringkernel-wgpu-codegen
  • ringkernel-derive, ringkernel-cpu, ringkernel-cuda, ringkernel-wgpu, ringkernel-metal
  • ringkernel-codegen, ringkernel-ecosystem, ringkernel-audio-fft
  • ringkernel (main crate)

See crates.io/crates/ringkernel for the published crates.

v0.3.1: Enterprise Readiness

19 Jan 20:16

RingKernel v0.3.1: Enterprise Readiness

This release adds comprehensive enterprise-grade features for production deployments.

🔐 Enterprise Security

  • Real Cryptography: AES-256-GCM, ChaCha20-Poly1305, Argon2 key derivation
  • Secrets Management: SecretStore trait with key rotation, caching, and chained stores
  • K2K Message Encryption: Kernel-to-kernel encryption with forward secrecy
  • TLS/mTLS Support: Full TLS with rustls, certificate rotation, SNI resolution

🔑 Authentication & Authorization

  • Authentication Providers: ApiKeyAuth, JwtAuth (RS256/HS256), ChainedAuthProvider
  • RBAC: Role-based access control with deny-by-default PolicyEvaluator
  • Multi-tenancy: TenantContext, ResourceQuota, usage tracking

📊 Observability

  • OpenTelemetry: OTLP export to Jaeger, Honeycomb, Datadog, Grafana Cloud
  • Structured Logging: Multi-sink logger with trace correlation (JSON/Text)
  • Alert Routing: Severity-based routing with deduplication (Slack, Teams, PagerDuty)
  • Remote Audit Sinks: Syslog, CloudWatch Logs, Elasticsearch

⚡ Rate Limiting

  • Algorithms: TokenBucket, SlidingWindow, LeakyBucket
  • Builder API: Fluent configuration with RateLimiterBuilder
  • Distributed: SharedRateLimiter for multi-instance deployments
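
For reference, the token bucket algorithm itself is small. The sketch below is a generic implementation with an explicit clock argument so its behaviour is deterministic. It is not RingKernel's TokenBucket type, and the field and method names are illustrative.

```rust
/// Generic token bucket driven by an explicit clock (seconds as f64).
struct TokenBucket {
    rate_per_sec: f64, // tokens added per second
    burst: f64,        // maximum tokens the bucket can hold
    tokens: f64,
    last_secs: f64,
}

impl TokenBucket {
    /// Starts full, so an initial burst of `burst` requests is allowed.
    fn new(rate_per_sec: f64, burst: f64) -> Self {
        TokenBucket {
            rate_per_sec,
            burst,
            tokens: burst,
            last_secs: 0.0,
        }
    }

    /// Refills for the time elapsed since the last call, then tries to take one token.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.burst);
        self.last_secs = self.last_secs.max(now_secs);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

The rate bounds long-run throughput and the burst bounds how many requests can pass back to back. With a rate of 10 and a burst of 2, two calls at the same instant succeed and a third is rejected until time advances.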

🔧 Operational Excellence

  • Automatic Recovery: Configurable policies per failure type (Restart, Migrate, Checkpoint, Notify, Escalate, Circuit)
  • Operation Timeouts: Deadline propagation with Timeout and Deadline types
  • Recovery Manager: Retry tracking, cooldown periods, automatic escalation

📦 Feature Flags

[dependencies]
ringkernel-core = { version = "0.3.1", features = ["enterprise"] }

# Or select specific features:
ringkernel-core = { version = "0.3.1", features = ["crypto", "auth", "tls", "rate-limiting", "alerting"] }

📈 Metrics

  • Test Coverage: 900+ tests (up from 825+)
  • Crates Published: 21 crates to crates.io

🚀 Quick Start

use ringkernel_core::prelude::*;

// Enterprise runtime with production preset
let runtime = RuntimeBuilder::new()
    .production()
    .build()?;

// API key authentication
let auth = ApiKeyAuth::new()
    .add_key("sk-prod-abc123", Identity::new("service-a"));

// Rate limiting
let limiter = RateLimiterBuilder::new()
    .algorithm(RateLimitAlgorithm::TokenBucket)
    .rate(1000)
    .burst(100)
    .build();

Full Changelog

See CHANGELOG.md for complete details.

v0.3.0: Multi-Kernel Dispatch, Memory Pools, Global Reductions

19 Jan 09:34

RingKernel v0.3.0

GPU-native persistent actor model framework for Rust. This release adds multi-kernel dispatch, memory pools, global reduction primitives, and two new crates.

Highlights

  • 21 crates published to crates.io - Full workspace now available
  • 825+ tests across the workspace
  • cudarc 0.18.2 and wgpu 27.0 support

New Features

Multi-Kernel Dispatch and Persistent Message Routing

  • #[derive(PersistentMessage)] macro for GPU kernel dispatch
  • KernelDispatcher component with builder pattern and metrics
  • CUDA handler dispatch code generator (CudaDispatchTable)
  • Queue tiering system (QueueTier, QueueFactory, QueueMonitor)

Memory Pool Management

  • StratifiedMemoryPool with 5 size buckets (256B to 64KB)
  • AnalyticsContext for grouped buffer lifecycle
  • PressureHandler for memory pressure monitoring
  • CUDA ReductionBufferCache and WebGPU StagingBufferPool

Global Reduction Primitives

  • ReductionOp enum: Sum, Min, Max, And, Or, Xor, Product
  • ReductionBuffer<T> using mapped memory (zero-copy host read)
  • Multi-phase kernel execution with SyncMode (Cooperative, SoftwareBarrier, MultiLaunch)
  • PageRank example with dangling node handling
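
Each of the listed operations is associative and has an identity element, which is what lets partial results from different blocks be combined in any order. The sketch below mirrors the variant names from the list above, but the methods (identity, combine, reduce) and the choice of u32 are assumptions for illustration and may not match the crate's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReductionOp {
    Sum,
    Min,
    Max,
    And,
    Or,
    Xor,
    Product,
}

impl ReductionOp {
    /// Identity element: combining it with any x returns x.
    fn identity(self) -> u32 {
        match self {
            ReductionOp::Sum | ReductionOp::Or | ReductionOp::Xor => 0,
            ReductionOp::Product => 1,
            ReductionOp::Min | ReductionOp::And => u32::MAX,
            ReductionOp::Max => u32::MIN,
        }
    }

    /// Combines two partial results; wrapping arithmetic avoids overflow panics.
    fn combine(self, a: u32, b: u32) -> u32 {
        match self {
            ReductionOp::Sum => a.wrapping_add(b),
            ReductionOp::Product => a.wrapping_mul(b),
            ReductionOp::Min => a.min(b),
            ReductionOp::Max => a.max(b),
            ReductionOp::And => a & b,
            ReductionOp::Or => a | b,
            ReductionOp::Xor => a ^ b,
        }
    }

    /// Sequential reference reduction: fold from the identity.
    fn reduce(self, values: &[u32]) -> u32 {
        values
            .iter()
            .fold(self.identity(), |acc, &v| self.combine(acc, v))
    }
}
```

Starting every partial accumulator at the identity means idle threads and empty inputs contribute nothing to the final result.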

CUDA NVRTC Compilation

  • compile_ptx() function for runtime CUDA compilation
  • Downstream crates can compile CUDA without direct cudarc dependency

Domain System

  • 20 business domains with reserved type ID ranges
  • #[message(domain = "FraudDetection")] attribute
  • Domains: GraphAnalytics, FraudDetection, ProcessIntelligence, Banking, etc.

New Crates

  • ringkernel-montecarlo - Philox RNG, antithetic variates, control variates, importance sampling
  • ringkernel-graph - CSR matrix, BFS, SCC (Tarjan/Kosaraju), Union-Find, SpMV

Breaking Changes

  • cudarc API updated to 0.18.2 (module loading, kernel launch builder pattern)
  • wgpu API updated to 27.0 (Arc-based resources)

Installation

[dependencies]
ringkernel = "0.3.0"

# Optional backends
ringkernel-cuda = "0.3.0"
ringkernel-wgpu = "0.3.0"

Full Changelog: v0.2.0...v0.3.0

RingKernel v0.2.0

14 Jan 16:48

What's Changed

  • Claude/persistent kernel implementation d nc3 o by @mivertowski in #9

Full Changelog: v0.1.3...v0.2.0

v0.1.3 - Dependency Updates & CI Fixes

17 Dec 14:18

Highlights

  • wgpu 27.0 - Major update with Arc-based resource tracking (~40% performance improvement in some workloads)
  • Dependency updates - tokio 1.48, axum 0.8, tonic 0.14, egui 0.31, winit 0.30
  • CI/CD fixes - Workspace builds without CUDA/nvcc installed

What's Changed

Dependencies Updated

Package                      From    To
wgpu                         0.19    27.0
tokio                        1.35    1.48
thiserror                    1.0     2.0
axum                         0.7     0.8
tower                        0.4     0.5
tonic                        0.11    0.14
prost                        0.12    0.14
egui/egui-wgpu/egui-winit    0.27    0.31
winit                        0.29    0.30
glam                         0.27    0.29
metal                        0.27    0.31
arrow                        52      54
polars                       0.39    0.46
rayon                        1.10    1.11
actix-rt                     2.9     2.10

Deferred Updates

  • iced: Kept at 0.13 (0.14 requires major application API rewrite)
  • rkyv: Kept at 0.7 (0.8 has incompatible data format)

CI/CD Improvements

  • CUDA features are now opt-in (not default)
  • Workspace builds succeed without nvcc installed
  • Feature-gated CUDA tests with #[cfg(feature = "cuda")]

See CHANGELOG.md for full details.

v0.1.2

11 Dec 09:55

Release v0.1.2

  • WaveSim3D - 3D acoustic wave simulation with realistic physics
      • Full 3D FDTD wave propagation solver
      • Binaural audio rendering with HRTF support
      • Volumetric ray marching visualization
      • GPU-native actor system for distributed simulation

  • Expanded GPU intrinsics from ~45 to 120+ operations across 13 categories
  • Atomic operations: and, or, xor, inc, dec
  • 3D stencil intrinsics: up, down, at(dx, dy, dz)
  • Warp match operations (Volta+) and warp reduce operations (SM 8.0+)
  • Bit manipulation, memory, special, and timing ops
  • 171 tests (up from 143)

  • Added required-features to CUDA-only wavesim binaries
  • Updated GitHub Actions release workflow

See CHANGELOG.md for full details.

v0.1.1 - AccNet & ProcInt Showcase Applications

04 Dec 15:40

What's New

New Showcase Applications

AccNet - GPU-Accelerated Accounting Network Analytics

  • Network visualization with force-directed graph layout
  • Fraud detection: circular flows, threshold clustering, Benford's Law violations
  • GAAP compliance checking for accounting rule violations
  • Temporal analysis for seasonality, trends, and behavioral anomalies
  • GPU kernels: Suspense detection, GAAP violation, Benford analysis, PageRank

ProcInt - GPU-Accelerated Process Intelligence

  • DFG (Directly-Follows Graph) mining from event streams
  • Pattern detection: bottlenecks, loops, rework, long-running activities
  • Conformance checking with fitness and precision metrics
  • Timeline view with partial order traces and concurrent activity visualization
  • Multi-sector templates: Healthcare, Manufacturing, Finance, IT
  • GPU kernels: DFG construction, pattern detection, partial order derivation, conformance checking
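
To make the DFG idea concrete, here is a small CPU sketch: an edge (a, b) counts how often activity b directly follows activity a within a trace. The function name and the string-based trace representation are illustrative only. ProcInt's GPU kernels are not shown in these notes and will differ.

```rust
use std::collections::BTreeMap;

/// Counts how often activity `a` is directly followed by activity `b` within a trace.
fn mine_dfg(traces: &[Vec<&str>]) -> BTreeMap<(String, String), u32> {
    let mut edges = BTreeMap::new();
    for trace in traces {
        // Each adjacent pair of events in a trace contributes one edge occurrence.
        for pair in trace.windows(2) {
            let edge = (pair[0].to_string(), pair[1].to_string());
            *edges.entry(edge).or_insert(0) += 1;
        }
    }
    edges
}

/// Activities that directly follow themselves, a simple signal for loops or rework.
fn self_loops(dfg: &BTreeMap<(String, String), u32>) -> Vec<String> {
    dfg.keys()
        .filter(|(from, to)| from == to)
        .map(|(from, _)| from.clone())
        .collect()
}
```

For the traces register → triage → treat and register → triage → triage → discharge, the graph has four distinct edges, register → triage has a count of 2, and triage is reported as a self-loop.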

Changes

  • Updated showcase documentation with AccNet and ProcInt sections
  • Updated CI workflow to exclude CUDA tests on runners without GPU hardware

Fixes

  • Fixed 14 clippy warnings in ringkernel-accnet
  • Fixed benchmark API compatibility in ringkernel-accnet
  • Fixed code formatting issues across showcase applications

Run the Applications

# AccNet - Accounting Network Analytics
cargo run -p ringkernel-accnet --release

# ProcInt - Process Intelligence
cargo run -p ringkernel-procint --release

Full Changelog: v0.1.0...v0.1.1