Phase 2 make pyspark optional #67

Merged
ryanseq-gyg merged 2 commits into main from phase-2-make-pyspark-optional
Mar 18, 2026
Conversation


@ryanseq-gyg ryanseq-gyg commented Mar 18, 2026

Make PySpark an optional dependency

Closes #54

Users running PySpark in managed environments (Databricks, EMR, etc.) typically have PySpark
pre-installed and cannot or do not want the library to reinstall it. Previously, pyspark was
a hard dependency, making dataframe-expectations incompatible with those environments.

This PR makes PySpark fully optional while preserving all existing behaviour for users who do
have it installed.


What changed

1. PySpark moved to an optional extra (pyproject.toml)

pip install dataframe-expectations           # pandas only
pip install dataframe-expectations[pyspark]  # includes pyspark

pandas, pydantic, and tabulate remain hard dependencies. pyspark is now under
[project.optional-dependencies].
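For illustration, the split between hard dependencies and the new extra in pyproject.toml plausibly looks like this (a sketch; the version specifiers shown are placeholders, not the pins used in the actual file):

```toml
[project]
dependencies = [
    "pandas",
    "pydantic",
    "tabulate",
]

[project.optional-dependencies]
# Installed only via: pip install dataframe-expectations[pyspark]
pyspark = ["pyspark>=3.0"]
```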

2. Lazy PySpark loading in production code (pyspark_utils.py)

All PySpark imports are deferred behind @lru_cache helpers:

  • get_pyspark_functions() — returns real pyspark.sql.functions when PySpark is available,
    or a _MissingPySparkFunctions proxy that raises a clear ImportError only when a PySpark
    code path is actually executed.
  • _get_pyspark_dataframe_types() / is_pyspark_dataframe() — runtime type detection with
    graceful fallback when PySpark is absent.
  • The one remaining module-level reference (PySparkConnectDataFrame in expectation.py) is
    guarded with try/except ImportError and is kept for backward compatibility and test patching.

This means importing dataframe_expectations never touches PySpark at all when it isn't installed.

3. Tests split by marker

All PySpark test cases are decorated with @pytest.mark.pyspark and separated into their own
parametrize blocks. --strict-markers is enforced in pyproject.toml so unregistered markers
cause an immediate failure rather than being silently ignored. Tests can now be run without
PySpark present:

pytest -m "not pyspark"   # no PySpark required
pytest -m pyspark          # requires PySpark
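With `--strict-markers`, the `pyspark` marker must be registered or collection fails. In pyproject.toml that registration plausibly looks like this (a sketch, not the verbatim configuration):

```toml
[tool.pytest.ini_options]
addopts = "--strict-markers"
markers = [
    "pyspark: tests that require a PySpark installation",
]
```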

4. CI updated to cover three install scenarios

| Job | How PySpark is present | Tests run |
| --- | --- | --- |
| tests-without-pyspark | Not installed | `-m "not pyspark"` |
| tests-with-pyspark-extra | `pip install .[pyspark]` | All |
| tests-with-external-pyspark | Pre-installed externally | All |

The external-pyspark job specifically validates the case from issue #54 — that the library works
correctly when PySpark is already present in the environment and was not installed by this package.
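A hedged sketch of how the three jobs from the table above might be laid out in GitHub Actions (the actual workflow file is not shown in this PR summary; the `[dev]` extra and checkout/setup steps are assumptions):

```yaml
jobs:
  tests-without-pyspark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install .[dev]           # pyspark extra NOT installed
      - run: pytest -m "not pyspark"

  tests-with-pyspark-extra:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install .[pyspark,dev]   # pyspark pulled in by the extra
      - run: pytest

  tests-with-external-pyspark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pyspark          # simulate a managed environment
      - run: pip install .[dev]           # package itself does not install pyspark
      - run: pytest
```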

5. Docs updated

Installation instructions updated to document both install paths and explain when each is appropriate.

Checklist

  • Tests have been added in the prescribed format
  • Commit messages follow Conventional Commits format
  • Pre-commit hooks pass locally

@ryanseq-gyg ryanseq-gyg requested a review from a team as a code owner March 18, 2026 13:14
@gyg-pr-tool gyg-pr-tool bot requested a review from gygAlexWeiss March 18, 2026 13:14
@ryanseq-gyg ryanseq-gyg merged commit f689b76 into main Mar 18, 2026
14 checks passed
@ryanseq-gyg ryanseq-gyg deleted the phase-2-make-pyspark-optional branch March 18, 2026 13:17