Sub-Segment Range Read for FlatLayout #6991
Replies: 3 comments
relative to: #6974
Hi @jiaqizho - thank you for this, it is very detailed and super helpful. I'd like to try and unblock you here, while also figuring out how we want to manage this long term. The two concerns I have at the moment are:
While the approach in (2) should work, it introduces I/O into arrays, which isn't ideal and would change a lot of APIs. A workaround would be for the layout reader to do a second pass over the array tree and resolve all the buffers. Given you're very familiar with this problem, would you have any thoughts on this as an approach?
Hi @gatesn, thanks for the detailed feedback! I've spent some time studying the relevant code paths.

Regarding concern 1: understood.

Regarding concern 2: I agree with the BufferHandle-based approach. It should also be compatible with the GPU path, since IO is deferred until the buffers are materialized.

Confirming my understanding

The two approaches share the same core mechanism (lazy BufferHandle + SliceReduce pushdown). The difference is when IO happens:

Compromise approach: The layout reader resolves all lazy buffers before returning (same as today: upper layers always receive fully materialized arrays). The resolve pass happens entirely inside the layout reader.

Full lazy approach: Same core mechanism, but the layout reader returns the array with unresolved lazy buffers directly to upper layers. Upper layers operate on these arrays normally; SliceReduce pushdown narrows the byte ranges as operations are applied. IO only happens when the buffer data is actually needed (e.g., to_host() or to_contiguous()). This gives upper layers full control over when and how buffers are materialized, but requires API changes since any buffer access may trigger IO.

Is that right? Does this match what you had in mind?
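The lazy-buffer mechanism under discussion can be sketched roughly as follows. This is an illustrative sketch only: the names (LazyBuffer, slice_bytes, resolve, read_range) are hypothetical stand-ins, not the actual Vortex BufferHandle API. It shows the core point that SliceReduce pushdown can narrow a pending byte range with pure arithmetic, leaving IO to a single resolve step.

```rust
use std::ops::Range;

// Hypothetical sketch: a buffer handle that records a byte range into a
// segment instead of holding bytes, so slicing narrows the range without IO.
#[derive(Debug, Clone, PartialEq)]
enum LazyBuffer {
    /// Bytes already in memory.
    Resolved(Vec<u8>),
    /// A pending read: segment id plus the byte range still needed.
    Pending { segment_id: u64, byte_range: Range<u64> },
}

impl LazyBuffer {
    /// SliceReduce pushdown: narrowing a pending buffer is pure arithmetic.
    fn slice_bytes(self, rel: Range<u64>) -> LazyBuffer {
        match self {
            LazyBuffer::Resolved(bytes) => {
                LazyBuffer::Resolved(bytes[rel.start as usize..rel.end as usize].to_vec())
            }
            LazyBuffer::Pending { segment_id, byte_range } => LazyBuffer::Pending {
                segment_id,
                byte_range: (byte_range.start + rel.start)..(byte_range.start + rel.end),
            },
        }
    }

    /// The resolve pass: IO happens here, once, with the narrowest range
    /// accumulated so far. `read_range` stands in for segment IO.
    fn resolve(self, read_range: impl Fn(u64, Range<u64>) -> Vec<u8>) -> Vec<u8> {
        match self {
            LazyBuffer::Resolved(bytes) => bytes,
            LazyBuffer::Pending { segment_id, byte_range } => read_range(segment_id, byte_range),
        }
    }
}

fn main() {
    let buf = LazyBuffer::Pending { segment_id: 7, byte_range: 0..1_048_576 };
    let buf = buf.slice_bytes(4096..4608); // pure arithmetic, no IO yet
    let bytes = buf.resolve(|_seg, range| vec![0u8; (range.end - range.start) as usize]);
    println!("resolved {} bytes instead of 1 MiB", bytes.len());
}
```

Under the compromise approach the layout reader would call resolve on every pending buffer before returning; under the full lazy approach the caller would hold Pending handles and resolve only at materialization time.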
Sub-Segment Range Read for FlatLayout
Problem
Vortex uses Segments as the minimum IO unit. The default write pipeline produces segments with
block_size_minimum = 1MB. For take() queries, this causes severe read amplification. On local storage this can be masked by mmap. But on S3 cold queries, every byte of read amplification translates directly to network latency and cost. This is especially painful for high-concurrency take() query workloads.
Real-world benchmark (S3, AWS VPC)
Setup: m8a.4xlarge, us-west-2, S3 same-region VPC. Schema: 8 columns (7x int64 + 1x 512B FixedSizeList embedding), 3,025,177 rows. Vortex 0.56.0 vs Lance 1.0.0 (format v2.1).
Single-row take(1), single column (id), S3 cold query: Vortex reads 1.38 MB for a single int64 value — the entire segment. Parquet and Lance read only the needed bytes.
Multi-reader take(10) with embedding column (512B FixedSizeList), S3:
Vortex saturates at ~25 takes/s due to S3 bandwidth ceiling (~13.5 Gbps). Each take reads the full ~4MB embedding segment. Lance reads only ~0.68 MB/take and scales to ~55 takes/s, bottlenecked on S3 IOPS instead of bandwidth.
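As a sanity check on the quoted figures (assuming decimal megabytes and gigabits), Lance's per-second traffic is a small fraction of the stated bandwidth ceiling, consistent with it being IOPS-bound rather than bandwidth-bound:

```rust
/// Convert gigabits per second to decimal megabytes per second.
fn gbps_to_mb_s(gbps: f64) -> f64 {
    gbps * 1000.0 / 8.0
}

fn main() {
    let ceiling = gbps_to_mb_s(13.5); // ~13.5 Gbps S3 ceiling ≈ 1687.5 MB/s
    let lance = 0.68 * 55.0;          // ~0.68 MB/take at ~55 takes/s ≈ 37.4 MB/s
    println!(
        "Lance moves {:.1} MB/s, {:.1}% of the bandwidth ceiling",
        lance,
        100.0 * lance / ceiling
    );
}
```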
In a previous discussion, the suggestion was to reduce segment size. However, segment size and metadata size are a trade-off — smaller segments mean more metadata overhead in the footer, which increases footer IO and memory usage. For S3 cold take() queries, the only viable approach is to reduce unnecessary IO within existing segments.
The core issue: Vortex's segment-granularity IO wastes bandwidth on S3, limiting throughput for take() queries. Sub-segment range read would reduce IO per take by 100-1000x for compressed integer columns and ~2000x for the embedding column, shifting the bottleneck from bandwidth to IOPS — where S3 has much higher headroom.
Proposal: VTable-based plan_range_read

Add an optional method to the array VTable trait that lets each encoding describe how to sub-range its buffers for a given row range. The layout planner dispatches via vtable and recursively walks the encoding tree -- no centralized match on encoding names.

New VTable method
Each encoding returns an EncodingRangeRead describing the byte sub-ranges to read from its buffers, plus decode_len and post_slice for the decoder.

How it works
The planner in FlatReader calls plan_range_read via vtable dispatch and recursively walks the encoding tree. The planner:
- calls vtable.plan_range_read() for the root encoding
- recursively resolves each RangeDecodeInfo (leaf or delegate-to-child with optional divisor)

Prerequisite: FLAT_LAYOUT_INLINE_ARRAY_NODE

Range read requires the Array encoding tree (flatbuffer) to be inlined in the layout footer metadata. This existing mechanism (the FLAT_LAYOUT_INLINE_ARRAY_NODE env var) stores the encoding tree in the footer so the planner can inspect it without reading the segment first. The planning phase is pure in-memory arithmetic -- zero IO.

Fallback safety
plan_range_read returns None → full segment read (existing path)

Currently supported encodings (16)
This covers ~95% of BtrBlocks compression output (FoR→BitPacked, BitPacked, ZigZag→FoR→BitPacked, Constant, Dict→FoR→BitPacked).
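To make the planning step concrete, here is a sketch of what a plan_range_read implementation could look like for a fixed-width case such as BitPacked. The struct fields, the plan_bitpacked helper, and the 1024-row lane size are illustrative assumptions; only the names EncodingRangeRead, decode_len, and post_slice come from the proposal.

```rust
use std::ops::Range;

// Illustrative only: the field layout is an assumption, not the actual
// Vortex definition. The idea from the text: read whole decodable units,
// then slice off the rows the caller did not ask for.
struct EncodingRangeRead {
    /// Byte sub-range of the encoding's data buffer to fetch.
    buffer_range: Range<u64>,
    /// How many rows the decoder will produce from those bytes.
    decode_len: u64,
    /// Row range to slice out of the decoded output.
    post_slice: Range<u64>,
}

/// Assume the bit-packed buffer decodes in fixed 1024-row lanes (an
/// illustrative choice); a row range maps to the lanes that cover it.
fn plan_bitpacked(rows: Range<u64>, bit_width: u64) -> EncodingRangeRead {
    const LANE: u64 = 1024;
    let first_lane = rows.start / LANE;
    let last_lane = (rows.end + LANE - 1) / LANE; // exclusive
    let lane_bytes = LANE * bit_width / 8;
    EncodingRangeRead {
        buffer_range: first_lane * lane_bytes..last_lane * lane_bytes,
        decode_len: (last_lane - first_lane) * LANE,
        post_slice: (rows.start - first_lane * LANE)..(rows.end - first_lane * LANE),
    }
}

fn main() {
    let plan = plan_bitpacked(1_500_000..1_500_001, 3);
    println!(
        "read bytes {:?}, decode {} rows, slice {:?}",
        plan.buffer_range, plan.decode_len, plan.post_slice
    );
}
```

For take(1) at row 1,500,000 with 3-bit packing, this plans a single 384-byte lane read instead of the whole segment, with post_slice picking row 864 out of the 1,024 decoded rows.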
Not yet supported
FSST, RunEnd, RLE, Sparse, Pco, Zstd, VarBin/VarBinView, List/ListView. These require more complex sub-ranging (binary search, page/frame decompression, offset indirection). They safely fall back to full segment reads.
Measured results
Per-column take(1) at row 1,500,000 in a 3M-row file, footer pre-cached: FixedSizeList(512B) take(1) reads 512 B vs 1,048,804 B → a 2,048x reduction. Columns that show 1.0x (id, monotonic_i64) have small segments where range read offers no benefit. Nullable columns with validity children correctly fall back.
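The headline figure can be reproduced from the quoted byte counts (all numbers taken directly from the measurement above):

```rust
/// Integer bytes-read reduction factor, rounded down as the text does.
fn reduction(full_bytes: u64, ranged_bytes: u64) -> u64 {
    full_bytes / ranged_bytes
}

fn main() {
    // 1,048,804 B full segment vs 512 B ranged read, from the measurement.
    println!("{}x", reduction(1_048_804, 512)); // prints "2048x"
}
```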
Future: Nullable array support
The current implementation falls back when validity children are present. The vtable design naturally supports extending this -- no planner changes needed. Each encoding would include the validity child as ChildRangeRead::Recurse with the appropriate row range. The planner recursively resolves Bool's byte sub-range for the validity bitmap.

Future: Two-phase IO for variable-length encodings
These encodings need a two-phase IO approach: read an index structure first, then use its content to compute the data range for a second IO. They currently fall back to full segment reads.
- VarBin: offsets[row_start..row_end+1], then data[offset_start..offset_end]
- VarBinView: views[row_start..row_end]
- List: offsets[row_start..row_end+1], then elements[offset_start..offset_end]
- FSST: codes_offsets[row_start..row_end+1], then codes[offset_start..offset_end] (symbols table included fully)
- Sparse: patch_indices, then the patch_values sub-range
- RunEnd: ends, then the values sub-range

The VTable design supports this extension: BufferSubRange can be extended with a Deferred variant that expresses "read this buffer first, then compute the target range." The planner's execute_range_read would issue two request_range calls instead of one. No VTable interface changes needed beyond the new variant. At most two IOs are needed — there is no "index of index" structure in Vortex encodings.

Not supported: Block-level compression (Pco, Zstd)
Pco and Zstd use opaque compressed blocks (pages/frames) that must be fully decompressed. Sub-ranging within a compressed block is not possible. These are the only encodings that fundamentally cannot benefit from range read and will always fall back to full segment reads.
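The Deferred two-phase scheme for variable-length encodings described above could take roughly this shape. All names and layouts here are illustrative assumptions (u64 offsets, the plan_varbin_offsets and varbin_data_range helpers); only BufferSubRange, the Deferred variant, and the offsets-then-data flow come from the text.

```rust
use std::ops::Range;

// Illustrative sketch of the proposed `Deferred` variant: phase 1 reads a
// slice of the offsets buffer; phase 2 uses its content to compute the data
// buffer's byte range.
#[allow(dead_code)]
enum BufferSubRange {
    /// Byte range known at plan time.
    Direct(Range<u64>),
    /// Read `index_range` first, then derive the data range from it.
    Deferred { index_range: Range<u64> },
}

/// Phase 1 for a VarBin-style encoding: rows [start, end) need
/// offsets[start..end+1], i.e. (end - start + 1) offsets of 8 bytes each
/// (u64 offsets assumed for illustration).
fn plan_varbin_offsets(rows: Range<u64>) -> BufferSubRange {
    BufferSubRange::Deferred { index_range: rows.start * 8..(rows.end + 1) * 8 }
}

/// Phase 2: with the fetched offsets in hand, the data byte range is simply
/// [first_offset, last_offset).
fn varbin_data_range(offsets: &[u64]) -> Range<u64> {
    offsets[0]..offsets[offsets.len() - 1]
}

fn main() {
    if let BufferSubRange::Deferred { index_range } = plan_varbin_offsets(10..12) {
        println!("phase 1: read offsets bytes {index_range:?}");
    }
    println!("phase 2: read data bytes {:?}", varbin_data_range(&[100, 150, 260]));
}
```

In this sketch, execute_range_read would issue the first request_range for the Deferred index range, then compute and issue the second from the fetched offsets; as the text notes, two IOs always suffice.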