[python] support ray data sink to paimon #6883
Conversation
```python
from pypaimon.write.ray_datasink import PaimonDatasink
datasink = PaimonDatasink(dataset, overwrite=overwrite)
dataset.write_datasink(datasink, concurrency=parallelism)
```
We can name it write_ray, just like write_pandas, write_arrow, and so on.
However, the Dataset class is defined by Ray, and so are its write_xxx methods. I think that would require a PR against Ray? The API in Paimon is called TableWrite.write_raydata.
> However, the Dataset class is defined by Ray, and so are its write_xxx methods. I think that would require a PR against Ray? The API in Paimon is called TableWrite.write_raydata.

Maybe I commented on the wrong line; it should be line 71.
```python
table_write = self.writer_builder.new_write()
for block in blocks:
    block_arrow: pa.Table = BlockAccessor.for_block(block).to_arrow()
    table_write.write_arrow(block_arrow)
```
I am afraid to_arrow will consume a lot of memory; we could introduce a streaming or iterable approach instead.
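A minimal sketch of that idea, assuming TableWrite exposes a batch-level entry point (write_arrow_batch is an assumption here; only write_arrow appears in the diff):

```python
import pyarrow as pa
from ray.data.block import BlockAccessor

def write_blocks_streaming(table_write, blocks) -> None:
    for block in blocks:
        block_arrow: pa.Table = BlockAccessor.for_block(block).to_arrow()
        # Feed the writer batch by batch rather than as one large pa.Table,
        # so no whole-block copy has to be buffered at once.
        for batch in block_arrow.to_batches():
            table_write.write_arrow_batch(batch)  # assumed batch-level API
```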
+1
```python
    staging bucket in S3.
    """
    self.writer_builder: WriteBuilder = self.table.new_batch_write_builder()
    if self.overwrite:
```
Could you please add a test showing that writer_builder is serializable?
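A minimal sketch of such a test, assuming a `table` fixture like those in the existing test suite (the fixture name is illustrative):

```python
import pickle

def test_writer_builder_is_serializable(table):
    writer_builder = table.new_batch_write_builder()
    # Ray pickles the datasink (and therefore this builder) when it ships
    # write tasks to worker processes, so a round trip must succeed.
    restored = pickle.loads(pickle.dumps(writer_builder))
    assert restored.new_write() is not None
```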
| """ | ||
| table_commit = self.writer_builder.new_commit() | ||
| table_commit.commit([commit_message for commit_messages in write_result.write_returns for commit_message in commit_messages]) | ||
| table_commit.close() |
We should handle the write-failure case too.
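A hedged sketch of that handling, assuming the commit object exposes an abort() cleanup hook as the Java TableCommit does (its presence in pypaimon is an assumption):

```python
all_messages = [
    msg for msgs in write_result.write_returns for msg in msgs
]
table_commit = self.writer_builder.new_commit()
try:
    table_commit.commit(all_messages)
except Exception:
    # Clean up staged files from the failed write before re-raising.
    table_commit.abort(all_messages)  # assumed cleanup hook
    raise
finally:
    table_commit.close()
```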
```python
def write_raydata(self, dataset, overwrite=False, parallelism=1):
    from pypaimon.write.ray_datasink import PaimonDatasink
    datasink = PaimonDatasink(dataset, overwrite=overwrite)
    dataset.write_datasink(datasink, concurrency=parallelism)
```
`dataset` is provided here, but PaimonDatasink's init method needs a table.
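A hedged sketch of the fix, assuming this method lives on the Table class so that `self` is the table the datasink should write into:

```python
def write_raydata(self, dataset, overwrite=False, parallelism=1):
    from pypaimon.write.ray_datasink import PaimonDatasink
    # Pass the target table, not the dataset being written.
    datasink = PaimonDatasink(self, overwrite=overwrite)
    dataset.write_datasink(datasink, concurrency=parallelism)
```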
```python
    return self

def write_raydata(self, dataset, overwrite=False, parallelism=1):
    from pypaimon.write.ray_datasink import PaimonDatasink
```
How can a user get the dataset in non-test code? Could you add sample code for that?
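A hedged example of how a user might obtain a dataset outside of tests, using a standard ray.data reader (the path and parallelism are illustrative):

```python
import ray

ray.init()
# Any ray.data source works; reading Parquet is a common case.
dataset = ray.data.read_parquet("s3://my-bucket/input/")  # illustrative path
# `table` is a pypaimon Table; write_raydata is the API added in this PR.
table.write_raydata(dataset, overwrite=False, parallelism=4)
```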
```python
        self.overwrite = overwrite

    def on_write_start(self) -> None:
        """Callback for when a write job starts.
```
Force-pushed 078a485 to 26fc28d: [python] update api, add test · fix code format · fix code format
Force-pushed 26fc28d to 0dda717
+1
* upstream/master: (35 commits)
  - [spark] Spark support vector search (apache#6950)
  - [doc] update Apache Doris document with DLF 3.0 (apache#6954)
  - [variant] Fix reading empty shredded variant via variantAccess (apache#6953)
  - [python] support alterTable (apache#6952)
  - [python] support ray data sink to paimon (apache#6883)
  - [python] Rename to TableScan.withSlice to specific start_pos and end_pos
  - [python] sync to_ray method args with ray data api (apache#6948)
  - [python] light refactor for stats collect (apache#6941)
  - [doc] Update cdc ingestion related docs
  - [rest] Add tagNamePrefix definition for listTagsPaged (apache#6947)
  - [python] support table scan with row range (apache#6944)
  - [spark] Fix EqualNullSafe is not correct when column has null value. (apache#6943)
  - [python] fix value_stats containing system fields for primary key tables (apache#6945)
  - [test][rest] add test case for two sessions with cache for rest commitTable (apache#6438)
  - [python] do not retry for connect exception in rest (apache#6942)
  - [spark] Fix read shredded and unshredded variant both (apache#6936)
  - [python] Let Python write file without value stats by default (apache#6940)
  - [python] ray version compatible (apache#6937)
  - [core] Unify conflict detect in FileStoreCommitImpl (apache#6932)
  - [test] Fix unstable case in CompactActionITCase
  - ...

Purpose
Linked issue: close #xxx
Tests
API and Format
Documentation