Random sampling at the scan layer #6987

thorfour · 2026-03-16T19:26:51Z


We run aggregations over large nested columns, which is very memory- and compute-intensive.

To avoid spending all of that CPU and memory on these aggregations, we want to sample the rows that match the filters. Today we do this with reservoir sampling implemented as a DataFusion UDF.

The problem with this approach is that we still read all of the large columns and then discard most of those values during sampling. Ideally, we would evaluate the filters while reading only the filter columns, run reservoir sampling over the matching row indices, and only then read the remaining columns for the sampled rows.

We propose adding a sampled scan implementation to the Vortex scan module that performs this two-phase sampling during the scan.
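
To make the two-phase idea concrete, here is a minimal sketch in plain Python. It is not Vortex or DataFusion code: `sampled_scan`, `reservoir_sample_indices`, and the `read_wide_row` callback are hypothetical names used only for illustration, and the reservoir step uses the textbook Algorithm R, which may differ from what our UDF does today. The point is only that the expensive reader is called for the sampled rows and nothing else.

```python
import random
from typing import Any, Callable, Iterable, List, Sequence, Tuple


def reservoir_sample_indices(indices: Iterable[int], k: int,
                             rng: random.Random) -> List[int]:
    """Uniformly sample up to k items from a stream (Algorithm R)."""
    reservoir: List[int] = []
    for seen, idx in enumerate(indices):
        if seen < k:
            # Fill the reservoir with the first k matches.
            reservoir.append(idx)
        else:
            # Keep the new item with probability k / (seen + 1).
            j = rng.randint(0, seen)  # inclusive on both ends
            if j < k:
                reservoir[j] = idx
    return reservoir


def sampled_scan(filter_column: Sequence[Any],
                 predicate: Callable[[Any], bool],
                 read_wide_row: Callable[[int], Any],
                 k: int,
                 seed: int = 0) -> Tuple[List[int], List[Any]]:
    """Hypothetical two-phase scan: filter and sample first, read wide columns last."""
    rng = random.Random(seed)

    # Phase 1: touch only the cheap filter column and collect matching row indices.
    matching = (i for i, value in enumerate(filter_column) if predicate(value))
    sampled = sorted(reservoir_sample_indices(matching, k, rng))

    # Phase 2: read the expensive columns only for the sampled rows.
    rows = [read_wide_row(i) for i in sampled]
    return sampled, rows
```

Sorting the sampled indices before phase 2 is an assumption on my part: it keeps the second read sequential, which is usually friendlier to a columnar layout than random access in reservoir order.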
