Phase 2 make pyspark optional #67

Merged
ryanseq-gyg merged 2 commits into main from phase-2-make-pyspark-optional
Mar 18, 2026
Conversation


@ryanseq-gyg ryanseq-gyg commented Mar 18, 2026

Make PySpark an optional dependency

Closes #54

Users running PySpark in managed environments (Databricks, EMR, etc.) typically have PySpark
pre-installed and cannot or do not want the library to reinstall it. Previously, pyspark was
a hard dependency, making dataframe-expectations incompatible with those environments.

This PR makes PySpark fully optional while preserving all existing behaviour for users who do
have it installed.


What changed

1. PySpark moved to an optional extra (pyproject.toml)

pip install dataframe-expectations           # pandas only
pip install dataframe-expectations[pyspark]  # includes pyspark

pandas, pydantic, and tabulate remain hard dependencies. pyspark is now under
[project.optional-dependencies].
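For illustration, the split between hard dependencies and the new extra in pyproject.toml plausibly looks like this (a sketch; the version specifiers shown are placeholders, not the pins used in the actual file):

```toml
[project]
dependencies = [
    "pandas",
    "pydantic",
    "tabulate",
]

[project.optional-dependencies]
# Installed only via: pip install dataframe-expectations[pyspark]
pyspark = ["pyspark>=3.0"]
```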

2. Lazy PySpark loading in production code (pyspark_utils.py)

All PySpark imports are deferred behind @lru_cache helpers:

  • get_pyspark_functions() — returns real pyspark.sql.functions when PySpark is available,
    or a _MissingPySparkFunctions proxy that raises a clear ImportError only when a PySpark
    code path is actually executed.
  • _get_pyspark_dataframe_types() / is_pyspark_dataframe() — runtime type detection with
    graceful fallback when PySpark is absent.
  • The one remaining module-level reference (PySparkConnectDataFrame in expectation.py) is
    guarded with try/except ImportError and is kept for backward compatibility and test patching.

This means importing dataframe_expectations never touches PySpark at all when it isn't installed.

3. Tests split by marker

All PySpark test cases are decorated with @pytest.mark.pyspark and separated into their own
parametrize blocks. --strict-markers is enforced in pyproject.toml so unregistered markers
cause an immediate failure rather than being silently ignored. Tests can now be run without
PySpark present:

pytest -m "not pyspark"   # no PySpark required
pytest -m pyspark          # requires PySpark
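With `--strict-markers`, the `pyspark` marker must be registered or collection fails. In pyproject.toml that registration plausibly looks like this (a sketch, not the verbatim configuration):

```toml
[tool.pytest.ini_options]
addopts = "--strict-markers"
markers = [
    "pyspark: tests that require a PySpark installation",
]
```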

4. CI updated to cover three install scenarios

| Job | How PySpark is present | Tests run |
| --- | --- | --- |
| tests-without-pyspark | Not installed | `-m "not pyspark"` |
| tests-with-pyspark-extra | `pip install .[pyspark]` | All |
| tests-with-external-pyspark | Pre-installed externally | All |

The external-pyspark job specifically validates the case from issue #54 — that the library works
correctly when PySpark is already present in the environment and was not installed by this package.
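A hedged sketch of how the three jobs from the table above might be laid out in GitHub Actions (the actual workflow file is not shown in this PR summary; the `[dev]` extra and checkout/setup steps are assumptions):

```yaml
jobs:
  tests-without-pyspark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install .[dev]           # pyspark extra NOT installed
      - run: pytest -m "not pyspark"

  tests-with-pyspark-extra:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install .[pyspark,dev]   # pyspark pulled in by the extra
      - run: pytest

  tests-with-external-pyspark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pyspark          # simulate a managed environment
      - run: pip install .[dev]           # package itself does not install pyspark
      - run: pytest
```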

5. Docs updated

Installation instructions updated to document both install paths and explain when each is appropriate.

Checklist

  • Tests have been added in the prescribed format
  • Commit messages follow Conventional Commits format
  • Pre-commit hooks pass locally

@ryanseq-gyg ryanseq-gyg requested a review from a team as a code owner March 18, 2026 13:14
@gyg-pr-tool gyg-pr-tool bot requested a review from gygAlexWeiss March 18, 2026 13:14
@ryanseq-gyg ryanseq-gyg merged commit f689b76 into main Mar 18, 2026
14 checks passed
@ryanseq-gyg ryanseq-gyg deleted the phase-2-make-pyspark-optional branch March 18, 2026 13:17