
spark: Don't use table FileIO for checkpointing files #15239

Open · c2zwdjnlcg wants to merge 1 commit into apache:main from c2zwdjnlcg:fix-checkpoint-fs-impl

Conversation

@c2zwdjnlcg

Fixes: #14762

@github-actions github-actions bot added the spark label Feb 5, 2026
@c2zwdjnlcg c2zwdjnlcg force-pushed the fix-checkpoint-fs-impl branch from f994146 to 13f02a1 Compare February 19, 2026 23:44
@c2zwdjnlcg
Author

@nastra could you take a look at this PR and see if you are aligned with separating the checkpoint IO from the table IO?

@nastra
Contributor

nastra commented Feb 20, 2026

@c2zwdjnlcg I currently don't have any cycles to review this. Maybe @huaxingao, @RussellSpitzer or @aokolnychyi have some time to review it

@RussellSpitzer
Member

First look: please do larger changes like this on a single module first, then backport to the others in a follow-up. Duplicated changes make reviewing more difficult. Taking an in-depth pass now.

@RussellSpitzer (Member) left a comment:

Rather than changing the IO here to something a user wouldn't expect, I think it's probably better for us to change InitialOffsetStore itself directly.

Since Spark checkpoints are expected to go through Hadoop FS, we should probably just use Hadoop FileSystem directly instead of the Iceberg FileIO class. This is of course a breaking change, so we probably also need to gate it, at least initially.

Maybe build two OffsetStores with the same interface and allow users to opt to Hadoop based with a spark read conf property?

interface InitialOffsetStore {
  StreamingOffset initialOffset();

  class TableIOOffsetStore implements InitialOffsetStore {
  }

  class HadoopOffsetStore implements InitialOffsetStore {
  }
}
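A minimal, self-contained sketch of the shape this suggestion could take (class and method names are illustrative, not the actual Iceberg classes): two offset stores behind one interface, with a factory that keeps the table-IO path as the default behind an opt-in flag.

```java
// Illustrative sketch only; not the Iceberg implementation.
public class OffsetStores {
    interface InitialOffsetStore {
        String initialOffset();
    }

    // Reads the initial offset through the table's FileIO (current behavior).
    static class TableIOOffsetStore implements InitialOffsetStore {
        @Override
        public String initialOffset() {
            return "offset-from-table-io";
        }
    }

    // Reads the initial offset through Hadoop FileSystem, which Spark
    // checkpointing normally expects.
    static class HadoopOffsetStore implements InitialOffsetStore {
        @Override
        public String initialOffset() {
            return "offset-from-hadoop-fs";
        }
    }

    // Gate the new behavior behind a flag (e.g. a Spark read conf property),
    // so existing deployments keep the table-IO path unless they opt out.
    static InitialOffsetStore create(boolean useTableIO) {
        return useTableIO ? new TableIOOffsetStore() : new HadoopOffsetStore();
    }
}
```

The point of the factory is that both stores share one interface, so the streaming source only decides once, at construction time, which storage backs the checkpoint offsets.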

@c2zwdjnlcg c2zwdjnlcg force-pushed the fix-checkpoint-fs-impl branch 3 times, most recently from 257b264 to 0f5ead4 Compare February 24, 2026 23:37
@c2zwdjnlcg
Copy link
Author

@RussellSpitzer Thanks for the review.

please do larger changes like this only on a single module first, then backport to the others in a follow up.

Sorry about that, I'll keep it in mind for next time.

Hopefully this is more in line with what you were thinking.

I named the setting streaming-checkpoint-use-table-io. If you are generally ok with this approach and name I'll also add documentation to this PR.

Comment on lines 381 to 384:

protected StreamingOffset readOffset() {
  try (FSDataInputStream inputStream = fileSystem.open(initialOffsetPath);
      InputStreamReader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8)) {
    String json = CharStreams.toString(reader);
Member:

Don't we already have a StreamingOffset.fromJson(InputStream in) ?

Author:

done
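For context on the suggestion above: the original code drained the stream into a String and parsed that, whereas parsing straight from the InputStream (as a `fromJson(InputStream)` overload allows) skips the intermediate copy. A standalone JDK-only sketch of the pattern being replaced (no Guava CharStreams; names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadAllDemo {
    // JDK equivalent of CharStreams.toString(reader): drain the stream
    // into a UTF-8 String, which a JSON parser would then consume.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        in.transferTo(buf); // JDK 9+: copies all remaining bytes
        return buf.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
            "{\"version\":3}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in));
    }
}
```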

  private final Long fromTimestamp;

-  InitialOffsetStore(Table table, String checkpointLocation, Long fromTimestamp) {
+  BaseOffsetStore(Table table, String checkpointLocation, Long fromTimestamp) {
Member:

Could be "long fromTimestamp"

Author:

This was carried over from the previous implementation, but I've changed it; I don't think it will do any harm.


// Controls whether streaming checkpoint operations use table FileIO or Hadoop FileSystem
public static final String STREAMING_CHECKPOINT_USE_TABLE_IO =
    "streaming-checkpoint-use-table-io";
Member:

I'm not sure if it should be use-table-io or use-hdfs ... I think either is probably fine but maybe I slightly prefer use-hdfs because I know the opposite should be using the table io?

I do think using an enum may be overboard here but maybe that's just cleaner all around
streaming-checkpoint = {table-io, hdfs}
streaming-checkpoint_default = table-io
?

Wdyt?

Author:

I went with

streaming-checkpoint-storage = {table-io, hadoop-fs}

since I didn't want it to seem like it was just for hdfs. But can revert to your original naming if you prefer.
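A hedged sketch of how an enum-valued option like this could be parsed. The property name `streaming-checkpoint-storage` and the values `table-io` / `hadoop-fs` come from the discussion above; everything else (class name, method names) is assumed for illustration:

```java
// Illustrative only; not the merged Iceberg API.
public class CheckpointStorageOption {
    // Values for the proposed streaming-checkpoint-storage read option.
    public enum CheckpointStorage {
        TABLE_IO("table-io"),
        HADOOP_FS("hadoop-fs");

        private final String optionValue;

        CheckpointStorage(String optionValue) {
            this.optionValue = optionValue;
        }

        // Map the string from the read conf to the enum, rejecting
        // anything outside the two supported values.
        public static CheckpointStorage fromOption(String value) {
            for (CheckpointStorage storage : values()) {
                if (storage.optionValue.equalsIgnoreCase(value)) {
                    return storage;
                }
            }
            throw new IllegalArgumentException(
                "Invalid streaming-checkpoint-storage value: " + value);
        }
    }
}
```

An enum keeps the option closed to exactly the supported backends, which is the main advantage over a boolean if a third storage option ever appears.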

@RussellSpitzer (Member) left a comment:

Looks pretty close to me! I have some thoughts about the parameter name and whether it should be a boolean.


@Override
public OutputFile newOutputFile(String path) {
  if (path.contains("/offsets/")) {
Member:

nit: could use a constant here and in the tests above

Author:

done

@c2zwdjnlcg c2zwdjnlcg force-pushed the fix-checkpoint-fs-impl branch from 0f5ead4 to 1e60a05 Compare February 26, 2026 00:54
@github-actions github-actions bot added the docs label Feb 26, 2026
@c2zwdjnlcg c2zwdjnlcg force-pushed the fix-checkpoint-fs-impl branch from 1e60a05 to bf98b1f Compare February 26, 2026 01:01


Successfully merging this pull request may close this issue:

Spark Iceberg streaming - checkpoint leverages S3fileIo signer path instead of hadoop's S3AFileSystem
