Random sampling at the scan layer #6987
thorfour started this conversation in Feature Requests
We perform aggregations on large nested columns, which is very memory- and compute-intensive.
To avoid spending all of that CPU and memory on these aggregations, we want to sample the data that matches the filters. Today we implement this sampling with reservoir sampling in a DataFusion UDF.
The problem with this approach is that we still read all of the large columns and then throw many of the values away during sampling. Ideally we would evaluate the filters while reading only the filterable columns, and then run reservoir sampling over the matching row indices. Only after sampling would we read the remaining columns, and only for the sampled rows.
We propose adding a sampled scan implementation to the Vortex scan module that performs this two-phase sampling during the scan.
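To make the two-phase idea concrete, here is a minimal sketch in Python. It is not the Vortex or DataFusion API: `reservoir_sample`, `sampled_scan`, `read_large`, and the in-memory `filter_col` are all hypothetical names invented for illustration, and the reservoir step uses the classic Algorithm R, which may differ from what our UDF does today.

```python
import random


def reservoir_sample(indices, k, rng):
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    reservoir = []
    for n, idx in enumerate(indices):
        if n < k:
            # Fill the reservoir with the first k items.
            reservoir.append(idx)
        else:
            # Keep the new item with probability k / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = idx
    return reservoir


def sampled_scan(filter_column, read_large_row, predicate, k, seed=0):
    """Hypothetical two-phase scan (illustration only, not a real Vortex API).

    Phase 1 reads only the filter column and samples matching row indices.
    Phase 2 reads the expensive columns for the sampled rows only.
    """
    rng = random.Random(seed)
    # Phase 1: filter on the cheap column, streaming matching row indices.
    matching = (i for i, v in enumerate(filter_column) if predicate(v))
    sampled = sorted(reservoir_sample(matching, k, rng))
    # Phase 2: fetch the large columns only for the sampled row indices.
    return [(i, read_large_row(i)) for i in sampled]


# Usage: count how many times the "large" column is actually read.
reads = []


def read_large(i):
    reads.append(i)
    return {"row": i, "payload": [i] * 4}  # stand-in for a large nested value


filter_col = [i % 3 for i in range(1000)]
rows = sampled_scan(filter_col, read_large, lambda v: v == 0, k=10)
```

In this sketch the large column is touched 10 times rather than once per matching row, which is the saving the proposal is after. Sorting the sampled indices before phase 2 is a design choice made here so that the second read is sequential; a real implementation would have to decide how to map sampled indices back onto the file layout.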