diff --git a/accepted/0015-variant-type.md b/accepted/0015-variant-type.md
index 50b0660..e7a4768 100644
--- a/accepted/0015-variant-type.md
+++ b/accepted/0015-variant-type.md
@@ -23,11 +23,15 @@ enum Variant {
 }
 ```
 
+Here `variantnull` value inside the variant payload is represented as
+`Scalar::null(DType::Null)`. That is distinct from the outer nullability of the
+`Variant` dtype itself.
+
 Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema.
 
 Variant types are usually stored in two ways - values that aren't accessed often in some system-specific binary encoding, and some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns.
 
-This document proposed adding a new `DType` variant named `Variant`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below).
+This document proposes adding a new `DType::Variant(Nullability)`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below).
 
 ### Arrow representation
 
@@ -37,9 +41,25 @@ Supporting extension types requires replacing the target `DataType` and nullabil
 
 ### Nullability
 
-In order to support data with a changing or unexpected schema, Variant arrays are always nullable, even for a specific key/path, its value might change type between items which will cause null values in shredded children.
+`Variant` should follow the same top-level nullability model as every other Vortex dtype:
+`DType::Variant(Nullability)` can be nullable or non-nullable. A nullable variant allows the
+array slot itself to be absent. A non-nullable variant guarantees that the slot is present, but it
+does **not** guarantee that extracted paths will be non-null.
+
+This is distinct from the semantic null value inside the variant payload, which I'll call
+`variantnull`. A `variantnull` is a present variant value whose payload is
+`null`, while an outer null is the absence of the variant value itself.
+In scalar form this is the difference between `Scalar::null(DType::Variant(Nullability::Nullable))`
+and `Scalar::variant(Scalar::null(DType::Null))`.
 
-Combined with shredding, handling nulls can be complex and is encoding dependent (Like this [parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) for handling arrays).
+Typed extraction from a variant should therefore still return nullable arrays even when the source
+variant column is non-nullable. A path can be missing in a given row, have an unexpected type, or
+evaluate to `variantnull`, and each of those cases becomes null in the extracted child.
+
+Combined with shredding, handling nulls can still be complex and is encoding dependent (like this
+[parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays)
+for handling arrays), but that is separate from whether the outer `Variant` column itself is
+nullable.
 
 ### Expressions
 
@@ -54,7 +74,14 @@ Every variant encoding will need to be able to dispatch these behaviors, returni
 
 ### Scalar
 
-While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above.
+While there has been talk for a long time of converting the Vortex scalar system from an enum to
+length 1 arrays, I do believe the current system actually works very well for variants. A variant
+scalar can simply wrap another row-specific `Scalar`, rather than needing a dedicated scalar enum
+just for variants.
+
+That model also makes the null semantics explicit. `Scalar::null(DType::Variant(Nullability::Nullable))`
+means the variant scalar itself is missing. `Scalar::variant(Scalar::null(DType::Null))` means the
+variant is present and its payload is `variantnull`.
 
 Just like when extracting child arrays, Variant's need to support an additional expression, `get_variant_scalar(idx, path, dtype)` that will indicate the desired dtype.
 
@@ -113,7 +140,7 @@ As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type
 - Iceberg seems to support the variant type (as described in [this](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) proposal), but the docs are minimal.
 - Datafusion's variant support is being developed [here](https://github.com/datafusion-contrib/datafusion-variant), its unclear to me how much effort is going into it and whether its going to be merged upstream.
 - DuckDB doesn't support a variant type. It does have a [Union](https://duckdb.org/docs/stable/sql/data_types/union) type, but its basically a struct. It also seems to have support for Parquet's shredding, but I can't find any docs and seems like PRs are being merged as I'm looking through their issues.
-- Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions).
+- Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions), and their docs show a [good example](https://docs.databricks.com/aws/en/sql/language-manual/functions/is_variant_null) of null vs variant null.
 
 ## Unresolved Questions