Merged
37 changes: 32 additions & 5 deletions accepted/0015-variant-type.md
@@ -23,11 +23,15 @@ enum Variant {
}
```

Here, a `variantnull` value inside the variant payload is represented as
`Scalar::null(DType::Null)`. That is distinct from the outer nullability of the
`Variant` dtype itself.

Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema at all.

Variant types are usually stored in two ways - values that aren't accessed often in some system-specific binary encoding, and some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns.
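As a rough illustration of the shredding idea above, here is a minimal sketch that pulls one key out of each row into a dense typed column and leaves everything else in a residual value. The `Value` enum, the `Shredded` struct, and `shred_i64` are hypothetical toy names for this example only; they are not Vortex types and do not reflect any system's actual binary encoding or shredding policy.

```rust
use std::collections::BTreeMap;

// Toy variant value, for illustration only (not a Vortex type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null, // the semantic null inside a variant
    Int(i64),
    Text(String),
    Object(BTreeMap<String, Value>),
}

// Hypothetical result of shredding a single key as i64.
struct Shredded {
    // Dense typed column: `Some` only where the key existed with the expected type.
    typed: Vec<Option<i64>>,
    // Whatever was not moved into the typed column, one entry per row.
    residual: Vec<Value>,
}

// Move `key` out of every object row when it holds an integer.
// Rows where the key is missing or has another type keep their data in `residual`.
fn shred_i64(rows: Vec<Value>, key: &str) -> Shredded {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Value::Object(mut fields) => {
                let hit = match fields.get(key) {
                    Some(Value::Int(v)) => Some(*v),
                    _ => None,
                };
                if hit.is_some() {
                    fields.remove(key);
                }
                typed.push(hit);
                residual.push(Value::Object(fields));
            }
            other => {
                typed.push(None);
                residual.push(other);
            }
        }
    }
    Shredded { typed, residual }
}
```

Reads of the shredded key only need to touch `typed`, which is why a commonly accessed subfield can behave like a first-class column while the residual keeps the schema flexible.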

This document proposed adding a new `DType` variant named `Variant`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below).
This document proposes adding a new `DType::Variant(Nullability)`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below).

### Arrow representation

@@ -37,9 +41,25 @@ Supporting extension types requires replacing the target `DataType` and nullabil

### Nullability

In order to support data with a changing or unexpected schema, Variant arrays are always nullable, even for a specific key/path, its value might change type between items which will cause null values in shredded children.
`Variant` should follow the same top-level nullability model as every other Vortex dtype:
`DType::Variant(Nullability)` can be nullable or non-nullable. A nullable variant allows the
array slot itself to be absent. A non-nullable variant guarantees that the slot is present, but it
does **not** guarantee that extracted paths will be non-null.

This is distinct from the semantic null value inside the variant payload, which I'll call
`variantnull`. A `variantnull` is a present variant value whose payload is
`null`, while an outer null is the absence of the variant value itself.
In scalar form this is the difference between `Scalar::null(DType::Variant(Nullability::Nullable))`
and `Scalar::variant(Scalar::null(DType::Null))`.
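The distinction between an outer null and `variantnull` can be sketched with toy types. `Payload`, `VariantSlot`, and the two helper functions are assumptions made up for this example; they only mirror the idea and are not the Vortex `Scalar` or `DType` API.

```rust
// Toy stand-ins for illustration only; these are NOT the Vortex types.
#[derive(Debug, Clone, PartialEq)]
enum Payload {
    // The semantic null inside a variant (`variantnull`).
    Null,
    Int(i64),
}

// A slot in a nullable variant column:
// `None` is the outer null (the variant value itself is absent),
// `Some(Payload::Null)` is a present variant whose payload is `variantnull`.
type VariantSlot = Option<Payload>;

// True only when the variant value itself is missing.
fn is_outer_null(slot: &VariantSlot) -> bool {
    slot.is_none()
}

// True only when the variant is present and its payload is `variantnull`.
fn is_variant_null(slot: &VariantSlot) -> bool {
    matches!(slot, Some(Payload::Null))
}
```

In this toy model a non-nullable variant column would simply store `Payload` directly instead of `Option<Payload>`, so `variantnull` stays representable even when the outer null is not.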

Combined with shredding, handling nulls can be complex and is encoding dependent (Like this [parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) for handling arrays).
Typed extraction from a variant should therefore still return nullable arrays even when the source
variant column is non-nullable. A path can be missing in a given row, have an unexpected type, or
evaluate to `variantnull`, and each of those cases becomes null in the extracted child.
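A minimal sketch of why typed extraction stays nullable, assuming a toy `Value` enum and hypothetical helpers `extract_i64` and `extract_i64_column` (none of these are Vortex APIs). The input slice has no outer nulls, yet three of the four rows still extract to `None`.

```rust
use std::collections::BTreeMap;

// Toy variant value, for illustration only (not a Vortex type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null, // `variantnull`
    Int(i64),
    Text(String),
    Object(BTreeMap<String, Value>),
}

// Extract `key` from one row as an i64.
// A missing key, an unexpected type, and `variantnull` all collapse to `None`.
fn extract_i64(row: &Value, key: &str) -> Option<i64> {
    match row {
        Value::Object(fields) => match fields.get(key) {
            Some(Value::Int(v)) => Some(*v),
            _ => None,
        },
        _ => None,
    }
}

// Extract over a column with no outer nulls: every row is a present `Value`,
// but the result still has to be a nullable column.
fn extract_i64_column(rows: &[Value], key: &str) -> Vec<Option<i64>> {
    rows.iter().map(|row| extract_i64(row, key)).collect()
}
```

For rows `{a: 1}`, `{b: 2}`, `{a: "x"}`, and `{a: variantnull}`, extracting `a` yields one value followed by three nulls, one for each of the cases listed above.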

Combined with shredding, handling nulls can still be complex and is encoding dependent (like this
[parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays)
for handling arrays), but that is separate from whether the outer `Variant` column itself is
nullable.

### Expressions

@@ -54,7 +74,14 @@ Every variant encoding will need to be able to dispatch these behaviors, returni

### Scalar

While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above.
While there has been talk for a long time of converting the Vortex scalar system from an enum to
length 1 arrays, I do believe the current system actually works very well for variants. A variant
scalar can simply wrap another row-specific `Scalar`, rather than needing a dedicated scalar enum
just for variants.

Review comment (Contributor, on the first line of the paragraph above): I think we are moving away from this now

Reply (Collaborator, Author): I'll leave it here as an historical artifact? IDK where that's discussed 🤷

That model also makes the null semantics explicit. `Scalar::null(DType::Variant(Nullability::Nullable))`
means the variant scalar itself is missing. `Scalar::variant(Scalar::null(DType::Null))` means the
variant is present and its payload is `variantnull`.

Just like when extracting child arrays, Variants need to support an additional expression, `get_variant_scalar(idx, path, dtype)`, that indicates the desired dtype.
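One possible shape for such a lookup, as a sketch only: the name `get_variant_scalar` and its three arguments come from the text, but the signature, the `Value` enum, and the `WantedType` hint are assumptions for illustration and not the proposed Vortex API. The sketch walks a path of object keys in the row at `idx` and returns the value only if it matches the requested type.

```rust
use std::collections::BTreeMap;

// Toy variant value, for illustration only (not a Vortex type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null, // `variantnull`
    Int(i64),
    Text(String),
    Object(BTreeMap<String, Value>),
}

// Hypothetical stand-in for the `dtype` argument.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WantedType {
    Int,
    Text,
}

// Look up `path` in row `idx` and return the value if it has the wanted type.
// In this sketch an out-of-range index, a missing key, a non-object along the
// path, a type mismatch, and `variantnull` all come back as `None`.
fn get_variant_scalar(
    rows: &[Value],
    idx: usize,
    path: &[&str],
    dtype: WantedType,
) -> Option<Value> {
    let mut current = rows.get(idx)?;
    for key in path {
        current = match current {
            Value::Object(fields) => fields.get(*key)?,
            _ => return None,
        };
    }
    match (dtype, current) {
        (WantedType::Int, Value::Int(_)) | (WantedType::Text, Value::Text(_)) => {
            Some(current.clone())
        }
        _ => None,
    }
}
```

Passing the desired type up front lets the caller get a typed, nullable result without inspecting the payload itself, which matches how child array extraction is described.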

@@ -113,7 +140,7 @@ As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type
- Iceberg seems to support the variant type (as described in [this](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) proposal), but the docs are minimal.
- Datafusion's variant support is being developed [here](https://github.com/datafusion-contrib/datafusion-variant); it's unclear to me how much effort is going into it and whether it's going to be merged upstream.
- DuckDB doesn't support a variant type. It does have a [Union](https://duckdb.org/docs/stable/sql/data_types/union) type, but it's basically a struct. It also seems to have support for Parquet's shredding, but I can't find any docs, and it seems like PRs are being merged as I'm looking through their issues.
- Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions).
- Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions), and their docs show a [good example](https://docs.databricks.com/aws/en/sql/language-manual/functions/is_variant_null) of null vs variant null.

## Unresolved Questions
