From cff5796cfe07b41fa6a2e6bb89aa8e606a08781f Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Tue, 17 Mar 2026 14:19:35 +0000 Subject: [PATCH 1/3] Update variant RFC with nullability and scalar changes Signed-off-by: Adam Gutglick --- accepted/0015-variant-type.md | 37 +++++++++++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 4 deletions(-) diff --git a/accepted/0015-variant-type.md b/accepted/0015-variant-type.md index 50b0660..7cf6aae 100644 --- a/accepted/0015-variant-type.md +++ b/accepted/0015-variant-type.md @@ -23,11 +23,15 @@ enum Variant { } ``` +Here the semantic `null` value inside the variant payload is represented as +`Scalar::null(DType::Null)`. That is distinct from the outer nullability of the +`Variant` dtype itself. + Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema. Variant types are usually stored in two ways - values that aren't accessed often in some system-specific binary encoding, and some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns. -This document proposed adding a new `DType` variant named `Variant`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below). +This document proposes adding a new `DType::Variant(Nullability)`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below). 
### Arrow representation @@ -37,9 +41,27 @@ Supporting extension types requires replacing the target `DataType` and nullabil ### Nullability -In order to support data with a changing or unexpected schema, Variant arrays are always nullable, even for a specific key/path, its value might change type between items which will cause null values in shredded children. +`Variant` should follow the same top-level nullability model as every other Vortex dtype: +`DType::Variant(Nullability)` can be nullable or non-nullable. A nullable variant allows the +array slot itself to be absent. A non-nullable variant guarantees that the slot is present, but it +does **not** guarantee that extracted paths will be non-null. + +This is distinct from the semantic null value inside the variant payload, which I'll call +`variantnull` here to match the implementation discussion in +[vortex-data/vortex#6912](https://github.com/vortex-data/vortex/pull/6912). A `variantnull` is a +present variant value whose payload is `null`, while an outer null is the absence of the variant +value itself. In scalar form this is the difference between +`Scalar::null(DType::Variant(Nullability::Nullable))` and +`Scalar::variant(Scalar::null(DType::Null))`. -Combined with shredding, handling nulls can be complex and is encoding dependent (Like this [parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) for handling arrays). +Typed extraction from a variant should therefore still return nullable arrays even when the source +variant column is non-nullable. A path can be missing in a given row, have an unexpected type, or +evaluate to `variantnull`, and each of those cases becomes null in the extracted child. 
+ +Combined with shredding, handling nulls can still be complex and is encoding dependent (like this +[parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) +for handling arrays), but that is separate from whether the outer `Variant` column itself is +nullable. ### Expressions @@ -54,7 +76,14 @@ Every variant encoding will need to be able to dispatch these behaviors, returni ### Scalar -While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above. +While there has been talk for a long time of converting the Vortex scalar system from an enum to +length 1 arrays, I do believe the current system actually works very well for variants. A variant +scalar can simply wrap another row-specific `Scalar`, rather than needing a dedicated scalar enum +just for variants. + +That model also makes the null semantics explicit. `Scalar::null(DType::Variant(Nullability::Nullable))` +means the variant scalar itself is missing. `Scalar::variant(Scalar::null(DType::Null))` means the +variant is present and its payload is `variantnull`. Just like when extracting child arrays, Variant's need to support an additional expression, `get_variant_scalar(idx, path, dtype)` that will indicate the desired dtype. 
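As a rough illustration of the null semantics this patch describes, the sketch below models a nullable variant slot with hypothetical stand-in types (`Payload`, `VariantSlot`, `get_i64`, `is_variant_null`). None of these are Vortex APIs, and the real `Scalar` and `DType` types are deliberately not used. It only shows that an outer null and `variantnull` are different states, and that typed extraction yields null for a missing path, an unexpected type, or `variantnull`.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for illustration only; NOT the real Vortex types.
#[derive(Debug, Clone, PartialEq)]
enum Payload {
    /// `variantnull`: the variant is present and its payload is null.
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Payload>),
}

/// One slot of a nullable variant column (assumed model).
/// `None` is the outer null: the variant value itself is absent.
type VariantSlot = Option<Payload>;

/// True only for a present variant whose payload is `variantnull`.
fn is_variant_null(slot: &VariantSlot) -> bool {
    matches!(slot, Some(Payload::Null))
}

/// Typed extraction of `key` as i64. The result is always nullable,
/// even if every slot in the source column is present.
fn get_i64(slot: &VariantSlot, key: &str) -> Option<i64> {
    match slot {
        Some(Payload::Object(fields)) => match fields.get(key) {
            Some(Payload::Int(v)) => Some(*v),
            // Missing key, `variantnull`, or an unexpected type.
            _ => None,
        },
        // Outer null, or a payload that is not an object.
        _ => None,
    }
}
```

In this model, `is_variant_null(&None)` is false while `is_variant_null(&Some(Payload::Null))` is true, which is the same distinction the RFC draws between the two scalar forms.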
From 211cf8d6436e98ee6b73b4542900e8295e6752f6 Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Tue, 17 Mar 2026 14:23:26 +0000 Subject: [PATCH 2/3] Update Variant RFC with nullability and Scalar changes Signed-off-by: Adam Gutglick --- accepted/0015-variant-type.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/accepted/0015-variant-type.md b/accepted/0015-variant-type.md index 7cf6aae..955c99f 100644 --- a/accepted/0015-variant-type.md +++ b/accepted/0015-variant-type.md @@ -47,12 +47,10 @@ array slot itself to be absent. A non-nullable variant guarantees that the slot does **not** guarantee that extracted paths will be non-null. This is distinct from the semantic null value inside the variant payload, which I'll call -`variantnull` here to match the implementation discussion in -[vortex-data/vortex#6912](https://github.com/vortex-data/vortex/pull/6912). A `variantnull` is a -present variant value whose payload is `null`, while an outer null is the absence of the variant -value itself. In scalar form this is the difference between -`Scalar::null(DType::Variant(Nullability::Nullable))` and -`Scalar::variant(Scalar::null(DType::Null))`. +`variantnull`. A `variantnull` is a present variant value whose payload is +`null`, while an outer null is the absence of the variant value itself. +In scalar form this is the difference between `Scalar::null(DType::Variant(Nullability::Nullable))`gst +and `Scalar::variant(Scalar::null(DType::Null))`. Typed extraction from a variant should therefore still return nullable arrays even when the source variant column is non-nullable. A path can be missing in a given row, have an unexpected type, or @@ -142,7 +140,7 @@ As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type - Iceberg seems to support the variant type (as described in [this](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) proposal), but the docs are minimal. 
- Datafusion's variant support is being developed [here](https://github.com/datafusion-contrib/datafusion-variant), its unclear to me how much effort is going into it and whether its going to be merged upstream. - DuckDB doesn't support a variant type. It does have a [Union](https://duckdb.org/docs/stable/sql/data_types/union) type, but its basically a struct. It also seems to have support for Parquet's shredding, but I can't find any docs and seems like PRs are being merged as I'm looking through their issues. -- Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions). +- Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions), and their docs show a [good example](https://docs.databricks.com/aws/en/sql/language-manual/functions/is_variant_null) of null vs variant null. ## Unresolved Questions From 44c03972e9dc4a30590444d70113efde01410d41 Mon Sep 17 00:00:00 2001 From: Adam Gutglick Date: Tue, 17 Mar 2026 15:19:53 +0000 Subject: [PATCH 3/3] cr comments Signed-off-by: Adam Gutglick --- accepted/0015-variant-type.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/accepted/0015-variant-type.md b/accepted/0015-variant-type.md index 955c99f..e7a4768 100644 --- a/accepted/0015-variant-type.md +++ b/accepted/0015-variant-type.md @@ -23,7 +23,7 @@ enum Variant { } ``` -Here the semantic `null` value inside the variant payload is represented as +Here the `variantnull` value inside the variant payload is represented as `Scalar::null(DType::Null)`. That is distinct from the outer nullability of the `Variant` dtype itself. @@ -49,7 +49,7 @@ does **not** guarantee that extracted paths will be non-null. This is distinct from the semantic null value inside the variant payload, which I'll call `variantnull`. 
A `variantnull` is a present variant value whose payload is `null`, while an outer null is the absence of the variant value itself. -In scalar form this is the difference between `Scalar::null(DType::Variant(Nullability::Nullable))`gst +In scalar form this is the difference between `Scalar::null(DType::Variant(Nullability::Nullable))` and `Scalar::variant(Scalar::null(DType::Null))`. Typed extraction from a variant should therefore still return nullable arrays even when the source