Sub-Segment Range Read for FlatLayout #6991
Replies: 3 comments
relative to: #6974
Hi @jiaqizho - thank you for this, it is very detailed and super helpful. I'd like to try and unblock you here, while also figuring out how we want to manage this long term. The two concerns I have at the moment are:
While the approach in (2) should work, it introduces I/O into arrays, which isn't ideal and would change a lot of APIs. A workaround would be for the layout reader to do a second pass over the array tree and resolve all the buffers. Given you're very familiar with this problem, would you have any thoughts on this as an approach?
Hi @gatesn, thanks for the detailed feedback! I've spent some time studying the relevant code paths.

Regarding concern 1: understood.

Regarding concern 2: I agree with the BufferHandle-based approach. It should also be compatible with the GPU path, since IO is deferred until the buffers are materialized.

Confirming my understanding

The two approaches share the same core mechanism (lazy BufferHandle + SliceReduce pushdown). The difference is when IO happens:

Compromise approach: The layout reader resolves all lazy buffers before returning (same as today: upper layers always receive fully materialized arrays). The resolve pass happens entirely inside the layout reader.

Full lazy approach: Same core mechanism, but the layout reader returns the array with unresolved lazy buffers directly to upper layers. Upper layers operate on these arrays normally; SliceReduce pushdown narrows the byte ranges as operations are applied. IO only happens when the buffer data is actually needed (e.g., to_host() or to_contiguous()). This gives upper layers full control over when and how buffers are materialized, but requires API changes since any buffer access may trigger IO.

Is that right? Does this match what you had in mind?
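The lazy-buffer mechanism under discussion can be sketched roughly as follows. This is an illustrative sketch only: the names (LazyBuffer, slice_bytes, resolve, read_range) are hypothetical stand-ins, not the actual Vortex BufferHandle API. It shows the core point that SliceReduce pushdown can narrow a pending byte range with pure arithmetic, leaving IO to a single resolve step.

```rust
use std::ops::Range;

// Hypothetical sketch: a buffer handle that records a byte range into a
// segment instead of holding bytes, so slicing narrows the range without IO.
#[derive(Debug, Clone, PartialEq)]
enum LazyBuffer {
    /// Bytes already in memory.
    Resolved(Vec<u8>),
    /// A pending read: segment id plus the byte range still needed.
    Pending { segment_id: u64, byte_range: Range<u64> },
}

impl LazyBuffer {
    /// SliceReduce pushdown: narrowing a pending buffer is pure arithmetic.
    fn slice_bytes(self, rel: Range<u64>) -> LazyBuffer {
        match self {
            LazyBuffer::Resolved(bytes) => {
                LazyBuffer::Resolved(bytes[rel.start as usize..rel.end as usize].to_vec())
            }
            LazyBuffer::Pending { segment_id, byte_range } => LazyBuffer::Pending {
                segment_id,
                byte_range: (byte_range.start + rel.start)..(byte_range.start + rel.end),
            },
        }
    }

    /// The resolve pass: IO happens here, once, with the narrowest range
    /// accumulated so far. `read_range` stands in for segment IO.
    fn resolve(self, read_range: impl Fn(u64, Range<u64>) -> Vec<u8>) -> Vec<u8> {
        match self {
            LazyBuffer::Resolved(bytes) => bytes,
            LazyBuffer::Pending { segment_id, byte_range } => read_range(segment_id, byte_range),
        }
    }
}

fn main() {
    let buf = LazyBuffer::Pending { segment_id: 7, byte_range: 0..1_048_576 };
    let buf = buf.slice_bytes(4096..4608); // pure arithmetic, no IO yet
    let bytes = buf.resolve(|_seg, range| vec![0u8; (range.end - range.start) as usize]);
    println!("resolved {} bytes instead of 1 MiB", bytes.len());
}
```

Under the compromise approach the layout reader would call resolve on every pending buffer before returning; under the full lazy approach the caller would hold Pending handles and resolve only at materialization time.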
Sub-Segment Range Read for FlatLayout
Problem
Vortex uses Segments as the minimum IO unit. The default write pipeline produces segments with
block_size_minimum = 1MB. For take() queries, this causes severe read amplification. On local storage this can be masked by mmap. But on S3 cold queries, every byte of read amplification translates directly to network latency and cost. This is especially painful for high-concurrency take() query workloads.
Real-world benchmark (S3, AWS VPC)
Setup: m8a.4xlarge, us-west-2, S3 same-region VPC. Schema: 8 columns (7x int64 + 1x 512B FixedSizeList embedding), 3,025,177 rows. Vortex 0.56.0 vs Lance 1.0.0 (format v2.1).
Single-row take(1), single column (id), S3 cold query: Vortex reads 1.38 MB for a single int64 value — the entire segment. Parquet and Lance read only the needed bytes.
Multi-reader take(10) with embedding column (512B FixedSizeList), S3:
Vortex saturates at ~25 takes/s due to S3 bandwidth ceiling (~13.5 Gbps). Each take reads the full ~4MB embedding segment. Lance reads only ~0.68 MB/take and scales to ~55 takes/s, bottlenecked on S3 IOPS instead of bandwidth.
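As a sanity check on the quoted figures (assuming decimal megabytes and gigabits), Lance's per-second traffic is a small fraction of the stated bandwidth ceiling, consistent with it being IOPS-bound rather than bandwidth-bound:

```rust
/// Convert gigabits per second to decimal megabytes per second.
fn gbps_to_mb_s(gbps: f64) -> f64 {
    gbps * 1000.0 / 8.0
}

fn main() {
    let ceiling = gbps_to_mb_s(13.5); // ~13.5 Gbps S3 ceiling ≈ 1687.5 MB/s
    let lance = 0.68 * 55.0;          // ~0.68 MB/take at ~55 takes/s ≈ 37.4 MB/s
    println!(
        "Lance moves {:.1} MB/s, {:.1}% of the bandwidth ceiling",
        lance,
        100.0 * lance / ceiling
    );
}
```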
In a previous discussion, the suggestion was to reduce segment size. However, segment size and metadata size are a trade-off — smaller segments mean more metadata overhead in the footer, which increases footer IO and memory usage. For S3 cold take() queries, the only viable approach is to reduce unnecessary IO within existing segments.
The core issue: Vortex's segment-granularity IO wastes bandwidth on S3, limiting throughput for take() queries. Sub-segment range read would reduce IO per take by 100-1000x for compressed integer columns and ~2000x for the embedding column, shifting the bottleneck from bandwidth to IOPS — where S3 has much higher headroom.
Proposal: VTable-based plan_range_read

Add an optional method to the array VTable trait that lets each encoding describe how to sub-range its buffers for a given row range. The layout planner dispatches via vtable and recursively walks the encoding tree -- no centralized match on encoding names.

New VTable method
Each encoding returns an EncodingRangeRead describing the byte sub-ranges to read from its buffers, plus decode_len and post_slice for the decoder.

How it works
The planner in FlatReader calls plan_range_read via vtable dispatch and recursively walks the encoding tree. The planner:
- calls vtable.plan_range_read() for the root encoding
- recursively resolves each RangeDecodeInfo (leaf or delegate-to-child with optional divisor)

Prerequisite: FLAT_LAYOUT_INLINE_ARRAY_NODE

Range read requires the Array encoding tree (flatbuffer) to be inlined in the layout footer metadata. This existing mechanism (the FLAT_LAYOUT_INLINE_ARRAY_NODE env var) stores the encoding tree in the footer so the planner can inspect it without reading the segment first. The planning phase is pure in-memory arithmetic -- zero IO.

Fallback safety
plan_range_read returns None → full segment read (existing path)

Currently supported encodings (16)
This covers ~95% of BtrBlocks compression output (FoR→BitPacked, BitPacked, ZigZag→FoR→BitPacked, Constant, Dict→FoR→BitPacked).
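To make the planning step concrete, here is a sketch of what a plan_range_read implementation could look like for a fixed-width case such as BitPacked. The struct fields, the plan_bitpacked helper, and the 1024-row lane size are illustrative assumptions; only the names EncodingRangeRead, decode_len, and post_slice come from the proposal.

```rust
use std::ops::Range;

// Illustrative only: the field layout is an assumption, not the actual
// Vortex definition. The idea from the text: read whole decodable units,
// then slice off the rows the caller did not ask for.
struct EncodingRangeRead {
    /// Byte sub-range of the encoding's data buffer to fetch.
    buffer_range: Range<u64>,
    /// How many rows the decoder will produce from those bytes.
    decode_len: u64,
    /// Row range to slice out of the decoded output.
    post_slice: Range<u64>,
}

/// Assume the bit-packed buffer decodes in fixed 1024-row lanes (an
/// illustrative choice); a row range maps to the lanes that cover it.
fn plan_bitpacked(rows: Range<u64>, bit_width: u64) -> EncodingRangeRead {
    const LANE: u64 = 1024;
    let first_lane = rows.start / LANE;
    let last_lane = (rows.end + LANE - 1) / LANE; // exclusive
    let lane_bytes = LANE * bit_width / 8;
    EncodingRangeRead {
        buffer_range: first_lane * lane_bytes..last_lane * lane_bytes,
        decode_len: (last_lane - first_lane) * LANE,
        post_slice: (rows.start - first_lane * LANE)..(rows.end - first_lane * LANE),
    }
}

fn main() {
    let plan = plan_bitpacked(1_500_000..1_500_001, 3);
    println!(
        "read bytes {:?}, decode {} rows, slice {:?}",
        plan.buffer_range, plan.decode_len, plan.post_slice
    );
}
```

For take(1) at row 1,500,000 with 3-bit packing, this plans a single 384-byte lane read instead of the whole segment, with post_slice picking row 864 out of the 1,024 decoded rows.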
Not yet supported
FSST, RunEnd, RLE, Sparse, Pco, Zstd, VarBin/VarBinView, List/ListView. These require more complex sub-ranging (binary search, page/frame decompression, offset indirection). They safely fall back to full segment reads.
Measured results
Per-column take(1) at row 1,500,000 in a 3M-row file, footer pre-cached: FixedSizeList(512B) take(1) reads 512 B vs 1,048,804 B → a 2,048x reduction. Columns that show 1.0x (id, monotonic_i64) have small segments where range read offers no benefit. Nullable columns with validity children correctly fall back.
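The headline figure can be reproduced from the quoted byte counts (all numbers taken directly from the measurement above):

```rust
/// Integer bytes-read reduction factor, rounded down as the text does.
fn reduction(full_bytes: u64, ranged_bytes: u64) -> u64 {
    full_bytes / ranged_bytes
}

fn main() {
    // 1,048,804 B full segment vs 512 B ranged read, from the measurement.
    println!("{}x", reduction(1_048_804, 512)); // prints "2048x"
}
```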
Future: Nullable array support
The current implementation falls back when validity children are present. The vtable design naturally supports extending this -- no planner changes needed. Each encoding would include the validity child as ChildRangeRead::Recurse with the appropriate row range. The planner recursively resolves Bool's byte sub-range for the validity bitmap.

Future: Two-phase IO for variable-length encodings
These encodings need a two-phase IO approach: read an index structure first, then use its content to compute the data range for a second IO. They currently fall back to full segment reads.
- VarBin: offsets[row_start..row_end+1], then data[offset_start..offset_end]
- VarBinView: views[row_start..row_end]
- List: offsets[row_start..row_end+1], then elements[offset_start..offset_end]
- FSST: codes_offsets[row_start..row_end+1], then codes[offset_start..offset_end] (symbols table included fully)
- Sparse: patch_indices, then the patch_values sub-range
- RunEnd: ends, then the values sub-range

The VTable design supports this extension: BufferSubRange can be extended with a Deferred variant that expresses "read this buffer first, then compute the target range." The planner's execute_range_read would issue two request_range calls instead of one. No VTable interface changes needed beyond the new variant. At most two IOs are needed — there is no "index of index" structure in Vortex encodings.

Not supported: Block-level compression (Pco, Zstd)
Pco and Zstd use opaque compressed blocks (pages/frames) that must be fully decompressed. Sub-ranging within a compressed block is not possible. These are the only encodings that fundamentally cannot benefit from range read and will always fall back to full segment reads.
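The Deferred two-phase scheme for variable-length encodings described above could take roughly this shape. All names and layouts here are illustrative assumptions (u64 offsets, the plan_varbin_offsets and varbin_data_range helpers); only BufferSubRange, the Deferred variant, and the offsets-then-data flow come from the text.

```rust
use std::ops::Range;

// Illustrative sketch of the proposed `Deferred` variant: phase 1 reads a
// slice of the offsets buffer; phase 2 uses its content to compute the data
// buffer's byte range.
#[allow(dead_code)]
enum BufferSubRange {
    /// Byte range known at plan time.
    Direct(Range<u64>),
    /// Read `index_range` first, then derive the data range from it.
    Deferred { index_range: Range<u64> },
}

/// Phase 1 for a VarBin-style encoding: rows [start, end) need
/// offsets[start..end+1], i.e. (end - start + 1) offsets of 8 bytes each
/// (u64 offsets assumed for illustration).
fn plan_varbin_offsets(rows: Range<u64>) -> BufferSubRange {
    BufferSubRange::Deferred { index_range: rows.start * 8..(rows.end + 1) * 8 }
}

/// Phase 2: with the fetched offsets in hand, the data byte range is simply
/// [first_offset, last_offset).
fn varbin_data_range(offsets: &[u64]) -> Range<u64> {
    offsets[0]..offsets[offsets.len() - 1]
}

fn main() {
    if let BufferSubRange::Deferred { index_range } = plan_varbin_offsets(10..12) {
        println!("phase 1: read offsets bytes {index_range:?}");
    }
    println!("phase 2: read data bytes {:?}", varbin_data_range(&[100, 150, 260]));
}
```

In this sketch, execute_range_read would issue the first request_range for the Deferred index range, then compute and issue the second from the fetched offsets; as the text notes, two IOs always suffice.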