Conversation
…den parity tests

Rewrites the core hashing algorithm to match the byte-layout spec:

- `hash_array` now uses the same recursive BTreeMap decomposition as the record-batch path (spec Section 6), fixing list and struct hashing
- Lists decompose into separate validity-only, structural, and data entries (spec Section 3.4) instead of a single combined entry
- Recursive traversal properly AND-propagates struct nulls to descendants (spec Section 3.5)
- Adds 45 golden-hash parity tests covering all primitive types, temporal types, variable-length types, decimals, lists, structs, nested structs, multi-type batches, and streaming — all verified against Rust starfix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
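As a rough illustration of the decomposition this commit describes, a nullable `List<Int32>` column can split into three digest entries keyed in sorted (BTreeMap-style) order. This is a minimal sketch with invented entry-key suffixes (`@nulls`, `@offsets`, `@data`), not the actual starfix entry naming or byte layout:

```python
def decompose_list(path, validity, offsets, values):
    """Split a list column into validity-only, structural, and data entries.

    The key suffixes here are illustrative only; sorted keys mimic the
    deterministic BTreeMap iteration order described in the spec.
    """
    entries = {
        f"{path}@nulls": bytes(validity),  # validity-only entry
        f"{path}@offsets": b"".join(o.to_bytes(4, "little") for o in offsets),
        f"{path}@data": b"".join(
            v.to_bytes(4, "little", signed=True) for v in values
        ),
    }
    # Deterministic, BTreeMap-style iteration: sorted by key
    return dict(sorted(entries.items()))

entries = decompose_list(
    "col", validity=[1, 0, 1], offsets=[0, 2, 2, 3], values=[10, 20, 30]
)
```

Hashing each entry separately (rather than one combined buffer) is what lets the list's nulls, offsets, and values contribute independently to the digest.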
Pull request overview
This PR updates the Python ArrowDigester implementation to follow the byte-layout spec’s recursive BTreeMap-style decomposition for hashing Arrow arrays and record batches, and adds a comprehensive golden-parity test suite to ensure byte-for-byte hash equality with the Rust implementation.
Changes:

- Reworked `ArrowDigester` hashing to use a recursive traversal + decomposed digest-entry model for lists/structs (shared between `hash_array` and record-batch hashing).
- Added extensive golden parity tests covering many Arrow types and spec examples, with expected hashes generated from the Rust crate.
- Updated the existing list-array golden hash to reflect the corrected spec-compliant behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/starfix/arrow_digester.py | Major rewrite of hashing logic and decomposition/traversal implementation to match the spec. |
| tests/test_golden_parity.py | New golden parity regression suite (Rust-authoritative expected hashes). |
| tests/test_arrow_digester.py | Updates an existing list hash expectation to the new correct value. |
| .python-version | Pins local Python version to 3.12. |
src/starfix/arrow_digester.py
Outdated
```python
"""Pure-Python implementation of the starfix Arrow logical hasher.

Produces identical hashes to the Rust implementation for all supported types.
Implements the byte-layout specification from ``docs/byte-layout-spec.md``.
```
The module docstring references docs/byte-layout-spec.md, but there is no docs/ directory (and no file with that name) in this repository. This makes the reference non-actionable for maintainers; please either add the spec file at that path or update the reference to the correct location (e.g., a README section or an in-repo spec file).
Suggested change:

```diff
- Implements the byte-layout specification from ``docs/byte-layout-spec.md``.
+ Implements the Arrow byte-layout specification used by starfix.
```
Good catch — the spec lives in the Rust repo, not here. Fixed the docstring to reference nauticalab/starfix docs/byte-layout-spec.md instead of a non-existent local path.
```python
children.sort(key=lambda f: f.name)
for child in children:
    child_path = f"{path}{DELIMITER}{child.name}" if path else child.name
    _extract_type_entries(child.type, child.nullable, child_path, out)
```
Struct null propagation seems incomplete at the type-decomposition stage: when a struct field is nullable, all descendant entries become effectively nullable (struct validity AND child validity per spec 3.5). Here _extract_type_entries() recurses into children using only child.nullable, so non-nullable children of a nullable struct will not get a null_bits buffer and their propagated null positions won’t be represented in the final digest (only skipped in data hashing). Consider recursing with child.nullable or nullable so child entries include null_bits whenever any ancestor struct is nullable.
Suggested change:

```diff
- _extract_type_entries(child.type, child.nullable, child_path, out)
+ _extract_type_entries(child.type, child.nullable or nullable, child_path, out)
```
This is intentional and matches both the spec and the Rust implementation.
The spec (Section 3.5, Example M) says struct-level nulls are AND-propagated at traversal time, not at entry-creation time. Non-nullable children of a nullable struct get data-only entries (no null_bits). The null propagation is handled by _traverse_struct which passes combined ancestor nulls down — _traverse_leaf then skips null rows when hashing data bytes.
From spec Example M:

> Entry "a" (Int32, non-nullable → data-only): Struct nulls propagated: rows 0, 2 valid → data: [10, 30]
The Rust extract_type_entries also passes child_field.is_nullable() (not child.nullable || parent.nullable) — see arrow_digester_core.rs:934. All 126 parity tests (including nullable struct cases like r2_nullable_struct_batch and r2_nested_nullable_inner_struct) confirm this produces correct hashes.
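The traversal-time propagation described in this reply can be sketched as follows. These helper names are illustrative, not the actual `_traverse_struct`/`_traverse_leaf` implementation:

```python
def combined_child_validity(struct_valid, child_valid):
    """AND-propagate struct-level nulls into a child's effective validity.

    A child row is effectively valid only if both the struct slot and the
    child slot are valid; a non-nullable child (validity None) simply
    inherits the struct's validity. Names here are illustrative.
    """
    if child_valid is None:  # non-nullable child: inherit struct validity
        return list(struct_valid)
    return [s and c for s, c in zip(struct_valid, child_valid)]

def data_rows(values, effective_valid):
    # Leaf hashing skips rows that are null after propagation.
    return [v for v, ok in zip(values, effective_valid) if ok]

# Spec Example M style: struct valid at rows 0 and 2, non-nullable Int32 child
eff = combined_child_validity([True, False, True], None)
rows = data_rows([10, 20, 30], eff)  # → [10, 30]
```

The key point is that the entry itself stays data-only; the struct's nulls affect which rows feed the data hash, not whether the entry carries a `null_bits` buffer.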
src/starfix/arrow_digester.py
Outdated
```python
inner = _normalize_field(dt.key_field)
return pa.map_(inner.type, dt.item_field.type)
```
_normalize_data_type()’s Map handling drops important information and may produce an incorrect canonical type: it only normalizes the key field (not the item/value field), it doesn’t preserve keys_sorted, and it reconstructs the map using pa.map_() defaults. If maps are intended to be supported, this should normalize both key/item fields and carry over keys_sorted to avoid schema/type hashes diverging from Arrow/Rust for map schemas.
Suggested change:

```diff
- inner = _normalize_field(dt.key_field)
- return pa.map_(inner.type, dt.item_field.type)
+ key_field = _normalize_field(dt.key_field)
+ item_field = _normalize_field(dt.item_field)
+ return pa.map_(key_field.type, item_field.type, keys_sorted=dt.keys_sorted)
```
Good catch — the item field was not being normalized. Fixed to normalize both key and item fields, and preserve keys_sorted. Map types are not yet exercised by the test suite, but this is the correct normalization per the spec.
…sorting

Adds test_golden_parity_r2.py with 41 additional golden-hash tests covering:

- Empty and all-null arrays
- Multi-byte bitvec (17-element boolean)
- Single-element edge cases
- Nullable struct via hash_array (struct-level nulls)
- Three-level nested structs (batch, schema, hash_array)
- Struct with list children
- List<Struct> and nullable List<Struct> via hash_array
- List<List<Int32>> nested lists (hash_array, batch, schema)
- Mixed struct+list+scalar record batches
- Unicode strings and varied-size binary
- Float64 special values (NaN, ±inf, ±0.0)
- Multi-batch streaming with struct and list columns
- Complex schema-only hashes (12 fields, 4-level nesting)
- Realistic geospatial-style record batch

Also fixes a bug where hash_array did not sort JSON keys in the type metadata for struct types, causing struct hash_array to diverge from Rust. The spec requires all JSON keys sorted alphabetically at every nesting level.

NOTE: One Rust bug discovered and filed as PLT-1047 — Rust panics on hash_array with List<nullable leaf> when inner elements contain nulls (sliced array null_count mismatch in traverse_leaf).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
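The JSON-key-sorting fix mentioned in this commit is the behavior of `json.dumps(..., sort_keys=True)`, which sorts dict keys at every nesting level. The metadata dict shape below is invented for illustration; the spec's actual type-metadata layout differs:

```python
import json

# Nested struct type metadata; this dict shape is invented for
# illustration -- the spec's actual metadata layout differs.
meta = {"type": "struct",
        "fields": [{"name": "b", "type": "int32"},
                   {"name": "a", "type": "utf8"}]}

# Same content, different key insertion order at every level.
reordered = {"fields": [{"type": "int32", "name": "b"},
                        {"type": "utf8", "name": "a"}],
             "type": "struct"}

# sort_keys=True sorts dict keys at every nesting level, so both
# dicts serialize to one canonical byte sequence.
canonical = json.dumps(meta, sort_keys=True, separators=(",", ":"))
same = json.dumps(reordered, sort_keys=True, separators=(",", ":"))
```

Without `sort_keys=True`, the serialized bytes depend on insertion order, which is exactly the kind of divergence that made struct `hash_array` disagree with Rust.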
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
```python
if pa.types.is_struct(canonical):
    # Struct is transparent — no entry for the struct itself.
    # Recurse into children sorted alphabetically (spec Section 3.5).
    children = [canonical.field(i) for i in range(canonical.num_fields)]
    children.sort(key=lambda f: f.name)
    for child in children:
        child_path = f"{path}{DELIMITER}{child.name}" if path else child.name
        _extract_type_entries(child.type, child.nullable, child_path, out)
```
_extract_type_entries() treats structs as “transparent” and only uses each child field’s declared nullable to decide whether to allocate null_bits. If the parent struct itself is nullable, nulls must still be AND-propagated to descendants (spec 3.5), which requires descendant entries to include validity (null_bits) even when the child field is declared non-nullable. As written, nullable structs with non-nullable children will lose null position information (children have null_bits=None). Consider propagating nullable into the recursive calls (e.g., pass nullable or child.nullable) so descendants can record effective validity.
Same point as the earlier review — addressed in the reply there. This is intentional per the spec: struct null propagation happens at traversal time (via _traverse_struct passing combined ancestor nulls), not at entry-creation time. Non-nullable children of a nullable struct get data-only entries. See spec Example M and Rust extract_type_entries at line 934. Verified by 126 parity tests.
```python
from decimal import Decimal

import pyarrow as pa
import pytest
```
pytest is imported but not used in this test module. Removing unused imports helps keep the test code clean and avoids lint warnings in environments that enforce them.
Suggested change:

```diff
- import pytest
```
This was already removed in a subsequent commit — neither test_golden_parity.py nor test_golden_parity_r2.py imports pytest in the current version.
```python
from __future__ import annotations

import pyarrow as pa
import pytest
```
pytest is imported but not used in this test module. Consider removing it to avoid unused-import warnings in linted environments.
Suggested change:

```diff
- import pytest
```
Same as above — pytest is not imported in the current version of this file.
- Fix docstring to reference spec in the Rust repo (not local path)
- Normalize both key and item fields in Map type, preserve keys_sorted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response to Copilot review

All three comments addressed in commit 4bde6a1.
Runs pytest across Python 3.10, 3.11, 3.12 using uv on push to main and on pull requests. Covers all 126 golden parity + unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary

- Rewrites `arrow_digester.py` to implement the byte-layout spec's BTreeMap decomposition for `hash_array` and the record-batch path: `hash_array` now uses the same recursive decomposition as `hash_record_batch` (spec Section 6)
- Adds `tests/test_golden_parity.py` with 45 tests (36 unique golden hashes) verified against the Rust `starfix` crate
- Fixes a list `hash_array` bug: previously used a single combined entry instead of the spec's decomposed entries, producing incorrect hashes for any list-typed array

Test plan

- `cargo run --bin emit_golden` in Rust starfix (post-PR #11)

Resolves PLT-1043
🤖 Generated with Claude Code