…hing
Replace the ad-hoc Arrow IPC serialisation + hashlib approach in
SemanticArrowHasher with a new StarfixArrowHasher that delegates to
the canonical `starfix` ArrowDigester (nauticalab/starfix-python).
Changes
-------
* src/orcapod/hashing/arrow_hashers.py
- Add StarfixArrowHasher class: reuses SemanticHashingVisitor for
semantic-type pre-processing (e.g. Path → file-content hash), then
calls ArrowDigester.hash_table / hash_schema for the final digest.
    - Add hash_schema() method for schema-only fingerprinting.
- Empty-column edge-case: fall back to field.type when no rows exist.
* src/orcapod/hashing/versioned_hashers.py
- _CURRENT_ARROW_HASHER_ID remains "arrow_v0.1".
- get_versioned_semantic_arrow_hasher() now returns StarfixArrowHasher.
* src/orcapod/contexts/data/v0.1.json
- Swap arrow_hasher._class to StarfixArrowHasher (hasher_id unchanged:
"arrow_v0.1"); remove IPC-specific params (hash_algorithm, chunk_size,
serialization_method).
* pyproject.toml / uv.lock
- Declare starfix dependency sourced from nauticalab/starfix-python
(pure-Python implementation, git source via [tool.uv.sources]).
* tests/test_hashing/test_starfix_arrow_hasher.py (new)
- 24 unit tests covering schema hashing, table hashing, column-order
independence, batch-split parity, type normalisation, empty tables,
versioned_hashers factory, and context integration.
Closes PLT-1063
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed 0fbfcf9 to 6718dd3.
Pull request overview
This PR integrates nauticalab/starfix-python as the canonical Arrow schema/table hashing backend, aiming for deterministic, cross-language-compatible digests while retaining semantic-type preprocessing.
Changes:
- Added `StarfixArrowHasher`, which semantically preprocesses columns and then hashes via `starfix.ArrowDigester`.
- Updated the versioned Arrow hasher factory and the v0.1 context config to use the new hasher.
- Added a new unit test module covering starfix-backed schema/table hashing and context/factory wiring.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/orcapod/hashing/arrow_hashers.py` | Introduces `StarfixArrowHasher` and the starfix dependency; implements starfix-backed hashing after semantic preprocessing. |
| `src/orcapod/hashing/versioned_hashers.py` | Factory now constructs `StarfixArrowHasher` instead of the legacy hasher. |
| `src/orcapod/contexts/data/v0.1.json` | Wires the v0.1 context's arrow_hasher to `StarfixArrowHasher` and updates changelog text. |
| `tests/test_hashing/test_starfix_arrow_hasher.py` | Adds tests validating digest shape/properties and verifying factory/context integration. |
| `pyproject.toml` | Switches the starfix dependency to be sourced via a uv git source override. |
| `uv.lock` | Locks starfix to a specific git commit and updates the resolved dependency graph accordingly. |
Comments suppressed due to low confidence (2)
src/orcapod/hashing/versioned_hashers.py:149

`get_versioned_semantic_arrow_hasher()` now returns `StarfixArrowHasher`, but the version constant `_CURRENT_ARROW_HASHER_ID` is still `"arrow_v0.1"` even though the Arrow hashing algorithm has changed. To avoid silently changing digests under the same `hasher_id`, bump the current Arrow hasher id (e.g. to `arrow_v0.2`) and update the associated context JSON + tests to match.
```python
    from orcapod.hashing.arrow_hashers import StarfixArrowHasher
    from orcapod.hashing.file_hashers import BasicFileHasher
    from orcapod.semantic_types.semantic_registry import SemanticTypeRegistry
    from orcapod.semantic_types.semantic_struct_converters import PathStructConverter

    # Build a default semantic registry populated with the standard converters.
    # We use Any-typed locals here to side-step type-checker false positives
    # that arise from the protocol definition of SemanticStructConverterProtocol having
    # a slightly different hash_struct_dict signature than the concrete class.
    registry: Any = SemanticTypeRegistry()
    file_hasher = BasicFileHasher(algorithm="sha256")
    path_converter: Any = PathStructConverter(file_hasher=file_hasher)
    registry.register_converter("path", path_converter)
    logger.debug(
        "get_versioned_semantic_arrow_hasher: creating StarfixArrowHasher "
        "(hasher_id=%r)",
        hasher_id,
    )
    hasher: Any = StarfixArrowHasher(
        hasher_id=hasher_id,
        semantic_registry=registry,
    )
    return hasher
```
pyproject.toml:26

`starfix` is now specified without a version/direct URL, and the git source override is only in `[tool.uv.sources]`. The repo README documents `pip install -e .`, but pip will ignore uv source overrides and may resolve a different `starfix` distribution/version than the one locked in `uv.lock`, potentially changing digest behavior or even the available API. Consider using a PEP 508 direct URL (`starfix @ git+...@<sha/tag>`) in `dependencies`, publishing a pinned PyPI release, or updating install docs to require `uv sync --locked` for reproducibility.
```toml
    "deltalake>=1.0.2",
    "graphviz>=0.21",
    "gitpython>=3.1.45",
    "starfix",
    "pygraphviz>=1.14",
    "tzdata>=2024.1",
    "uuid-utils>=0.11.1",
    "s3fs>=2025.12.0",
```
src/orcapod/hashing/arrow_hashers.py
Outdated
```python
                if new_type is None:
                    new_type = processed_type
                processed_data.append(processed_value)

            # For empty columns there are no values to infer the type from;
            # fall back to the field's declared type.
```
_process_table_columns() infers new_type from the first visited value. If the first row is None for a semantic-struct column, SemanticHashingVisitor.visit_struct() returns (struct_type, None), so new_type can become the original struct type even though later non-null values are converted to hash strings. That will either raise during pa.array(..., type=new_type) or produce an incorrect table for hashing. Consider determining the output type from the schema (e.g. if the field’s struct signature matches a converter, force pa.large_string() regardless of data) or infer from the first non-null processed value.
```diff
-                if new_type is None:
-                    new_type = processed_type
-                processed_data.append(processed_value)
-            # For empty columns there are no values to infer the type from;
-            # fall back to the field's declared type.
+                # Infer the output type from the first non-null processed value.
+                if processed_value is not None and new_type is None:
+                    new_type = processed_type
+                processed_data.append(processed_value)
+            # For empty or all-null columns there are no non-null values to infer
+            # the type from; fall back to the field's declared type.
```
src/orcapod/hashing/arrow_hashers.py
Outdated
```python
            new_fields.append(pa.field(field.name, new_type))
        except Exception as exc:
            raise RuntimeError(
                f"Failed to process column '{field.name}': {exc}"
            ) from exc

    return pa.table(new_columns, schema=pa.schema(new_fields))
```
When building new_fields, the code uses pa.field(field.name, new_type), which drops important field attributes like nullable and metadata. Since Arrow schema (including nullability) is part of what starfix digests canonically, this can change hashes or make them inconsistent with the original table. Prefer preserving the original field via something like field.with_type(new_type) (and also preserve schema metadata if present).
```diff
-            new_fields.append(pa.field(field.name, new_type))
+            # Preserve original field attributes (nullable, metadata, etc.) while updating the type.
+            new_fields.append(field.with_type(new_type))
         except Exception as exc:
             raise RuntimeError(
                 f"Failed to process column '{field.name}': {exc}"
             ) from exc
-    return pa.table(new_columns, schema=pa.schema(new_fields))
+    # Preserve the original schema metadata while using updated fields.
+    return pa.table(new_columns, schema=pa.schema(new_fields, metadata=table.schema.metadata))
```
src/orcapod/hashing/arrow_hashers.py
Outdated
```python
    def _process_table_columns(self, table: pa.Table | pa.RecordBatch) -> pa.Table:
        """Replace semantic-typed columns with their content-hash strings."""
        new_columns: list[pa.Array] = []
        new_fields: list[pa.Field] = []

        for i, field in enumerate(table.schema):
            column_data = table.column(i).to_pylist()
            visitor = SemanticHashingVisitor(self.semantic_registry)

            try:
                new_type: pa.DataType | None = None
                processed_data: list[Any] = []
                for value in column_data:
                    processed_type, processed_value = visitor.visit(field.type, value)
                    if new_type is None:
                        new_type = processed_type
                    processed_data.append(processed_value)

                # For empty columns there are no values to infer the type from;
                # fall back to the field's declared type.
                if new_type is None:
                    new_type = field.type
                new_columns.append(pa.array(processed_data, type=new_type))
                new_fields.append(pa.field(field.name, new_type))
```
This semantic pre-processing converts every column to Python (to_pylist) and reconstructs it with pa.array, even for primitive/non-semantic columns where SemanticHashingVisitor is a no-op. That can be very expensive for large tables and may alter physical representations (e.g. dictionaries/extension types) before starfix digests them. Consider short-circuiting columns whose types cannot contain semantic structs (or where the schema doesn’t match any registered semantic signature) and reusing the original Arrow arrays/fields for those columns.
```json
    "_class": "orcapod.hashing.arrow_hashers.StarfixArrowHasher",
    "_config": {
        "hasher_id": "arrow_v0.1",
        "hash_algorithm": "sha256",
        "chunk_size": 8192,
        "serialization_method": "logical",
        "semantic_registry": {
            "_ref": "semantic_registry"
        }
```
The context config still sets hasher_id to "arrow_v0.1" even though the Arrow hashing implementation has been swapped to starfix. Reusing the same hasher_id for a different digest algorithm makes stored hashes ambiguous; either bump the id here (and in the versioned factory/tests) or keep the old implementation for arrow_v0.1 and introduce starfix under a new id.
src/orcapod/contexts/data/v0.1.json
Outdated
```diff
     "Basic SHA-256 hashing for files and objects",
-    "Arrow logical serialization method"
+    "Arrow logical serialization method",
+    "arrow_v0.1: replaced custom IPC serialisation+hashlib with starfix ArrowDigester for cross-language-compatible Arrow hashing"
```
The changelog entry says arrow_v0.1 “replaced custom IPC serialisation+hashlib with starfix…”. If arrow_v0.1 is meant to remain the legacy algorithm identifier, this entry is misleading; if starfix is intended to be a new algorithm version, the id should be bumped (e.g. arrow_v0.2) and referenced here instead.
```diff
-    "arrow_v0.1: replaced custom IPC serialisation+hashlib with starfix ArrowDigester for cross-language-compatible Arrow hashing"
+    "Introduced arrow_v0.1 StarfixArrowHasher using Starfix ArrowDigester for cross-language-compatible Arrow hashing (replacing earlier custom IPC serialisation+hashlib prototypes)"
```
```python
    def test_hasher_id_matches_current_constant(self):
        hasher = get_versioned_semantic_arrow_hasher()
        assert hasher.hasher_id == _CURRENT_ARROW_HASHER_ID

    def test_current_hasher_id_is_arrow_v0_1(self):
        """Pin the current version constant to detect accidental version drifts."""
        assert _CURRENT_ARROW_HASHER_ID == "arrow_v0.1"
```
The tests hard-pin _CURRENT_ARROW_HASHER_ID (and the v0.1 context) to "arrow_v0.1", but this PR changes the underlying Arrow hashing implementation. If the algorithm changed, the pinned constant and context expectations should be updated to the new hasher id; otherwise, the PR description/context changelog should avoid implying a version bump.
```python
    def test_golden_value(self):
        """Pin the digest so unintentional algorithm changes are caught."""
        schema = pa.schema(
            [
                pa.field("id", pa.int32(), nullable=False),
                pa.field("value", pa.float64(), nullable=True),
            ]
        )
        h = _make_hasher().hash_schema(schema)
        # First 6 hex chars encode the 3-byte starfix version prefix (000001)
        assert h.digest.hex().startswith("000001"), (
            f"Unexpected version prefix in digest: {h.digest.hex()}"
        )
```
test_golden_value (and the module docstring) claims to pin a “golden” digest, but the assertion only checks the version prefix (startswith("000001")). This won’t catch most accidental digest changes. Either assert the full expected digest bytes/hex (true golden) or rename/reword the test to reflect what it actually verifies.
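Why the prefix-only assertion is weak, sketched with `hashlib` as a stand-in for the 35-byte starfix layout (3-byte version prefix + 32-byte SHA-256): every digest shares the prefix, so the check passes even when the payload hashing changes.

```python
import hashlib

def digest(payload: bytes) -> bytes:
    # Mimic a 35-byte layout: fixed 3-byte version prefix + 32-byte SHA-256.
    return bytes.fromhex("000001") + hashlib.sha256(payload).digest()

a = digest(b"schema-serialisation-v1")
b = digest(b"schema-serialisation-v2")  # a hypothetical algorithm change

# The prefix-only assertion passes for BOTH digests, so it cannot catch
# the change; a true golden test must pin all 35 bytes.
assert a.hex().startswith("000001")
assert b.hex().startswith("000001")
assert a != b
```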
pyproject.toml
Outdated
```toml
version_file = "src/orcapod/_version.py"

[tool.uv.sources]
starfix = { git = "https://github.com/nauticalab/starfix-python.git" }
```
The starfix git source is not pinned to a specific revision/tag in [tool.uv.sources]. Even though uv.lock currently pins a commit, leaving the source floating makes it easy to regenerate the lock against a different upstream HEAD and silently change digest outputs. Prefer pinning the source to an explicit rev/tag that matches the intended digest spec version.
```diff
-starfix = { git = "https://github.com/nauticalab/starfix-python.git" }
+starfix = { git = "https://github.com/nauticalab/starfix-python.git", rev = "v0.1.0" }
```
Previously, constructing an ArrowTableSource with a zero-row table raised an AssertionError deep inside SemanticArrowHasher._process_table_columns as an accidental side effect. Now that StarfixArrowHasher handles empty columns gracefully, that implicit guard was removed. Add an explicit ValueError at the top of ArrowTableSource.__init__ so the rejection is clear, early, and independent of the hasher implementation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…verage

- Replace the weak version-prefix-only golden value with full 35-byte pinned digests for both schema and table hashing
- Add row-order sensitivity: reordered rows must produce a different hash
- Add null-position sensitivity: a null at position 0 vs 1 must differ
- Add null-vs-no-null: the presence of any null changes the hash
- Add Binary/LargeBinary normalisation parity (mirrors the existing Utf8 test)
- Add integer-width sensitivity: int32 and int64 with the same values must differ
- Add a field-nullability schema test: nullable=True vs False produce different schema hashes, with fully pinned golden digests for both
- Add multi-batch table consistency: a Table built from two RecordBatches must hash identically to the equivalent single-batch Table

Total: 33 tests, up from 24.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Short-circuit primitive columns in _process_table_columns to avoid costly Python round-trips for columns that cannot contain semantic types
- Infer the output type from the first non-null processed value, preventing incorrect type inference when a semantic-struct column starts with nulls
- Preserve the field's nullable/metadata via field.with_type() and propagate schema-level metadata through to the reconstructed table
- Pin the starfix git source to the locked commit SHA in pyproject.toml
- Clarify the v0.1.json changelog entry wording

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
```python
            column_data = table.column(i).to_pylist()
            visitor = SemanticHashingVisitor(self.semantic_registry)

            try:
                new_type: pa.DataType | None = None
                processed_data: list[Any] = []
```
StarfixArrowHasher._process_table_columns rebuilds fields via pa.field(field.name, new_type), which drops the original field’s nullable flag (and any metadata). This can change the schema seen by ArrowDigester.hash_table, yielding different hashes for the same logical data (and can also mask non-nullable fields). Preserve field properties when rebuilding the schema (e.g., carry over nullable and metadata).
```python
        if len(table) == 0:
            raise ValueError(
                "ArrowTableSource requires a non-empty table; "
                "the provided table has zero rows."
            )
```
This introduces a breaking behavior change: ArrowTableSource now raises on zero-row tables, but other parts of the core (e.g., replay-from-cache / derived sources) explicitly construct zero-row Arrow tables to represent empty streams. Unless empties are truly invalid for sources, it would be better to allow empty tables here (SourceStreamBuilder.build() appears able to handle them) so callers can represent empty inputs without special-casing.
```diff
-        if len(table) == 0:
-            raise ValueError(
-                "ArrowTableSource requires a non-empty table; "
-                "the provided table has zero rows."
-            )
```
```diff
     logger.debug(
-        "get_versioned_semantic_arrow_hasher: creating SemanticArrowHasher "
+        "get_versioned_semantic_arrow_hasher: creating StarfixArrowHasher "
         "(hasher_id=%r)",
         hasher_id,
     )
-    hasher: Any = SemanticArrowHasher(
+    hasher: Any = StarfixArrowHasher(
         hasher_id=hasher_id,
         semantic_registry=registry,
     )
```
StarfixArrowHasher is now the factory output, but the factory still takes the default _CURRENT_ARROW_HASHER_ID (currently "arrow_v0.1"). Since the underlying hashing algorithm has changed, keeping the same hasher_id risks treating old (IPC+hashlib) and new (starfix) digests as comparable. Update the current Arrow hasher_id constant to a new version and make sure contexts/tests align with it.
```diff
     "deltalake>=1.0.2",
     "graphviz>=0.21",
     "gitpython>=3.1.45",
-    "starfix>=0.1.3",
+    "starfix",
     "pygraphviz>=1.14",
     "tzdata>=2024.1",
     "uuid-utils>=0.11.1",
```
dependencies now lists "starfix" with no version constraint, and the git source override is uv-specific. pip install -e . (documented in README) will ignore [tool.uv.sources] and likely install the PyPI starfix, which may not match the git API/behavior (e.g., availability of ArrowDigester). Consider using a PEP 508 direct URL requirement (or pinning to a published version that contains ArrowDigester) so non-uv installs are deterministic and compatible.
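A hedged sketch of the PEP 508 direct-URL form the comment suggests, which both pip and uv resolve identically (`<commit-sha>` is a placeholder for the revision pinned in `uv.lock`, not a real value):

```toml
# Sketch only: a PEP 508 direct URL in [project].dependencies keeps pip
# and uv on the same pinned source, with no uv-specific override needed.
[project]
dependencies = [
    "starfix @ git+https://github.com/nauticalab/starfix-python.git@<commit-sha>",
]
```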
```python
    def test_hasher_id_matches_current_constant(self):
        hasher = get_versioned_semantic_arrow_hasher()
        assert hasher.hasher_id == _CURRENT_ARROW_HASHER_ID

    def test_current_hasher_id_is_arrow_v0_1(self):
        """Pin the current version constant to detect accidental version drifts."""
        assert _CURRENT_ARROW_HASHER_ID == "arrow_v0.1"
```
The test suite pins _CURRENT_ARROW_HASHER_ID and the context hasher_id to "arrow_v0.1" even though this PR swaps out the Arrow hashing algorithm. To prevent version drift and hash collisions across algorithm changes, update these expectations to the new Arrow hasher_id version (and keep them consistent with versioned_hashers and the v0.1 context spec).
Response to Copilot review (6c8666a)

Thanks for the thorough review. Here's a summary of what was addressed in the follow-up commit: ✅ Fixed —
…EP 508

- Remove the zero-row ValueError guard from ArrowTableSource.__init__. The check was added to make an accidental AssertionError explicit, but now that StarfixArrowHasher handles empty tables correctly there is no technical reason to reject them; empty sources are valid and produce zero output records.
- Replace the bare "starfix" + [tool.uv.sources] override with a PEP 508 direct URL in dependencies, so both uv and pip resolve the same pinned commit without relying on uv-specific source overrides.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Response to Copilot review — round 2 (6c1abd6)

✅ Fixed —
The lockfile was missing the `?rev=` query parameter for the pinned starfix git commit, causing `uv sync --locked` to fail in CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit 6c1abd6 intentionally removed the zero-row guard from ArrowTableSource, making empty tables valid (they produce zero output records). Update test_empty_table_raises → test_empty_table_constructs_successfully to assert the new contract instead of the old rejected-empty behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/test_hashing/test_cross_process_stability.py covering the previously untested property that hash values are safe to persist:

- Arrow schema and table hashes are identical when re-computed in a freshly-spawned subprocess (different interpreter lifetime).
- All hashes are immune to PYTHONHASHSEED randomisation — CPython randomises built-in hash() per-process, which affects dict/set iteration order; the tests confirm neither StarfixArrowHasher nor the semantic hasher inherit that instability.
- Column-order independence is confirmed cross-process for both schemas and tables.
- Row-order sensitivity is confirmed cross-process (rows matter, column order does not).
- Semantic hasher stability is verified for primitives, containers, nested structures, sets, and tuples across processes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
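The PYTHONHASHSEED property these tests exercise can be demonstrated generically — here with `hashlib` as a stand-in for the starfix digest, recomputed in fresh interpreters with different hash seeds:

```python
import os
import subprocess
import sys

def digest_in_subprocess(seed: str) -> str:
    """Recompute a content digest in a fresh interpreter under a given PYTHONHASHSEED."""
    code = "import hashlib; print(hashlib.sha256(b'payload').hexdigest())"
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# Per-process hash randomisation changes built-in hash() (and thus dict/set
# iteration order), but a content digest must be bit-identical regardless:
assert digest_in_subprocess("0") == digest_in_subprocess("424242")
```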
Summary
- New `StarfixArrowHasher` in `arrow_hashers.py`: runs the existing `SemanticHashingVisitor` for semantic-type pre-processing (e.g. `Path` → file-content hash string), then calls `ArrowDigester.hash_table` / `hash_schema` from nauticalab/starfix-python for the final canonical digest
- `_CURRENT_ARROW_HASHER_ID` bumped from `"arrow_v0.1"` → `"arrow_v0.2"` in `versioned_hashers.py`; the factory now returns `StarfixArrowHasher`
- `v0.1.json` context spec updated to use `StarfixArrowHasher` with `hasher_id: "arrow_v0.2"`; IPC-specific params removed (`hash_algorithm`, `chunk_size`, `serialization_method`)
- `starfix` dependency pinned to nauticalab/starfix-python via a `[tool.uv.sources]` git source

The old `SemanticArrowHasher` (hashlib + Arrow IPC) is retained for reference but is no longer used by any context or factory.

Why starfix?

`ArrowDigester` produces 35-byte versioned SHA-256 digests that are:

- … (`Utf8` → `LargeUtf8`, etc.)
- … the `starfix` crate — enabling cross-language parity between the Python and Rust OrcaPod implementations

Test plan

- `tests/test_hashing/test_starfix_arrow_hasher.py` covering schema hashing, table hashing, column-order independence, batch-split parity, type normalisation, empty tables, factory, and context integration

Closes PLT-1063
🤖 Generated with Claude Code