
@drxddy commented on Jan 22, 2026

Summary

Add rANS entropy coding on top of quantized weights for 1.3-2x additional lossless compression.

Closes #3043

Motivation

Quantized LLM weights have entropy significantly below their bit-width:

  • 4-bit weights: ~2.17 bits entropy → 1.84x compression potential
  • 8-bit weights: ~4-5 bits entropy → 1.6-2x compression potential

This PR closes that gap with lossless entropy coding, reducing memory bandwidth during inference.
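For a concrete sense of where the numbers above come from, here is a small, self-contained sketch (not part of this PR) that estimates the empirical entropy of synthetic 4-bit quantized weights and the lossless compression ratio it implies. The weight distribution is an assumption chosen for illustration only.

```python
import numpy as np

# Illustrative only: estimate the Shannon entropy of 4-bit quantized weights
# and the lossless compression ratio it implies. The Laplacian weight
# distribution below is an assumption, not taken from this PR.
rng = np.random.default_rng(0)
w = rng.laplace(scale=0.05, size=1_000_000).astype(np.float32)

# 4-bit affine quantization into 16 bins
lo, hi = w.min(), w.max()
q = np.clip(np.round((w - lo) / (hi - lo) * 15), 0, 15).astype(np.uint8)

# Entropy of the quantized symbols, in bits per weight
counts = np.bincount(q, minlength=16)
p = counts / counts.sum()
entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()

print(f"entropy ~ {entropy:.2f} bits/weight")
print(f"compression potential ~ {4 / entropy:.2f}x over plain 4-bit storage")
```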

Implementation

New Primitives

  • EntropyCodedMatmul - Fused per-row decode+GEMV
  • EntropyDecodeAsync - Async GPU decode for prefetching

Metal Kernel

  • entropy_coded.h - Per-row fused decode+dequant+GEMV kernel

Python API

```python
from mlx.nn.layers import EntropyCodedLinear

# Convert a quantized layer
ec_layer = EntropyCodedLinear.from_linear(linear, decode_mode="fused")
```
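For context, a hedged end-to-end sketch of how the converted layer might be used. It assumes `from_linear` accepts a standard `nn.QuantizedLinear` and that the result is a drop-in replacement at call time; neither is spelled out by the snippet above.

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.nn.layers import EntropyCodedLinear

# Assumption: a 4-bit QuantizedLinear is a valid input to from_linear.
linear = nn.Linear(4096, 4096)
qlinear = nn.QuantizedLinear.from_linear(linear, group_size=64, bits=4)
ec_layer = EntropyCodedLinear.from_linear(qlinear, decode_mode="fused")

x = mx.random.normal((1, 4096))
y = ec_layer(x)  # per-row decode + dequantize + GEMV in one fused kernel
print(y.shape)   # (1, 4096)
```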

Decode Modes

| Mode | Memory | Speed | Use case |
|---|---|---|---|
| `fused` | 1.3-2x smaller | 1.1-1.5x overhead | Memory-constrained |
| `cached` | Same as quantized | 1.0x | Speed-critical |
| `gpu_async` | 1.3-2x smaller | ~1.0x | Deep models |
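To make the table concrete, here is a hedged sketch showing one conversion per mode; the layer setup is illustrative and the comments paraphrase the trade-offs above rather than measured behavior.

```python
import mlx.nn as nn
from mlx.nn.layers import EntropyCodedLinear

# Illustrative input layer; any quantized linear would do.
qlinear = nn.QuantizedLinear.from_linear(nn.Linear(4096, 4096), bits=4)

# fused: smallest footprint; weights stay entropy-coded and are decoded
# inside every matmul (memory-constrained setups).
ec_fused = EntropyCodedLinear.from_linear(qlinear, decode_mode="fused")

# cached: decode once up front; memory matches the plain quantized layer
# but there is no per-call decode overhead (speed-critical setups).
ec_cached = EntropyCodedLinear.from_linear(qlinear, decode_mode="cached")

# gpu_async: decode upcoming weights on the GPU while the current layer
# computes, hiding decode latency in deep models.
ec_async = EntropyCodedLinear.from_linear(qlinear, decode_mode="gpu_async")
```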

Testing

  • Build passes (cmake + make)
  • All 235 existing tests pass
  • Benchmarked on M3 Pro with synthetic and real model weights

@angeloskath (Member) commented:

This looks great, but I think it is a bit too niche to be merged into MLX core.

I believe it would be great as a standalone project and a very good showcase of custom C++ extensions for MLX as well (https://ml-explore.github.io/mlx/build/html/dev/extensions.html).
