Core: Support parallel execution when scanning entries in ManifestGroup by dramaticlly · Pull Request #15426 · apache/iceberg

dramaticlly · 2026-02-24T01:21:21Z

Currently, executorService is only used in ManifestGroup::plan

iceberg/core/src/main/java/org/apache/iceberg/ManifestGroup.java

Lines 215 to 219 in a97b4ec

    
    if (executorService != null) {
      return new ParallelIterable<>(tasks, executorService);
    } else {
      return CloseableIterable.concat(tasks);
    }

But scanning entries across multiple manifests can benefit from it as well.

Also apply a defensive copy to entries, because ManifestReader reuses entry containers (reuseContainers) across iterations.

@stevenzwu @RussellSpitzer

RussellSpitzer · 2026-02-24T02:17:39Z

Could you elaborate more on the previous behavior, and how you are changing that in this PR?

dramaticlly · 2026-02-24T03:15:43Z

Could you elaborate more on the previous behavior, and how you are changing that in this PR?

Happy to! Before this change, scanning manifest entries through ManifestGroup read the manifests sequentially using CloseableIterable.concat, even when an executorService was provided.

After this change, manifests are scanned on the thread pool when the ManifestGroup is given an executorService via planWith(). One caveat is that parallel scanning also requires a defensive copy of each entry, because with reuseContainers in ManifestReader the same entry object is reused across all iterations within a manifest.
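The container-reuse caveat described above can be illustrated with a small, self-contained sketch. It does not use Iceberg classes: `Entry`, `reusingReader`, `collectWithoutCopy`, and `collectWithCopy` are hypothetical names invented for illustration, and the reader here only mimics the idea of handing out one mutable object per iteration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReuseContainerDemo {
  /** Hypothetical stand-in for a manifest entry; not an Iceberg class. */
  public static final class Entry {
    public String path;

    public Entry copy() {
      Entry copied = new Entry();
      copied.path = this.path;
      return copied;
    }
  }

  /** Iterates over the paths but hands out the same Entry instance on every call to next(). */
  static Iterable<Entry> reusingReader(List<String> paths) {
    Entry container = new Entry();
    return () ->
        new Iterator<Entry>() {
          private final Iterator<String> source = paths.iterator();

          @Override
          public boolean hasNext() {
            return source.hasNext();
          }

          @Override
          public Entry next() {
            container.path = source.next(); // overwrite the shared container in place
            return container;
          }
        };
  }

  /** Buffers the entries as-is, which is what happens when they are queued for another thread. */
  public static List<String> collectWithoutCopy(List<String> paths) {
    List<Entry> buffered = new ArrayList<>();
    for (Entry entry : reusingReader(paths)) {
      buffered.add(entry); // every element is the same object
    }
    return pathsOf(buffered);
  }

  /** Buffers a defensive copy of each entry before the reader advances. */
  public static List<String> collectWithCopy(List<String> paths) {
    List<Entry> buffered = new ArrayList<>();
    for (Entry entry : reusingReader(paths)) {
      buffered.add(entry.copy());
    }
    return pathsOf(buffered);
  }

  private static List<String> pathsOf(List<Entry> entries) {
    List<String> result = new ArrayList<>();
    for (Entry entry : entries) {
      result.add(entry.path);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> paths = List.of("a.parquet", "b.parquet", "c.parquet");
    System.out.println(collectWithoutCopy(paths)); // [c.parquet, c.parquet, c.parquet]
    System.out.println(collectWithCopy(paths)); // [a.parquet, b.parquet, c.parquet]
  }
}
```

Any consumer that holds on to an entry past the next call to `next()`, such as a queue feeding another thread, sees the same overwrite problem, which is why the copy has to happen before the entry leaves the reading loop.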

deniskuzZ · 2026-02-24T10:40:04Z

core/src/test/java/org/apache/iceberg/TestFindFiles.java

+
+    AtomicInteger planThreadsIndex = new AtomicInteger(0);
+    ExecutorService executorService =
+        Executors.newFixedThreadPool(


Should we consider migrating to Executors.newVirtualThreadPerTaskExecutor() (JDK 21)?

I think we still build against Java 17, so we can migrate later, once JDK 21 becomes the minimum supported version.

RussellSpitzer · 2026-02-24T15:59:06Z

core/src/main/java/org/apache/iceberg/ManifestGroup.java

+      // copy entries to avoid object reuse issues when scanning manifests in parallel,
+      // as ManifestReader reuses entry objects during iteration
+      Iterable<CloseableIterable<ManifestEntry<DataFile>>> entryIterables =
+          entries((manifest, entries) -> CloseableIterable.transform(entries, ManifestEntry::copy));


I'm a little worried about introducing a sort of hidden "materialize everything" setting.

We have one internal use where this could come up, and with this change we would effectively be copying twice:

FindFiles

iceberg/core/src/main/java/org/apache/iceberg/FindFiles.java

Lines 201 to 223 in 8417225

  public CloseableIterable<DataFile> collect() {
    Snapshot snapshot =
        snapshotId != null ? ops.current().snapshot(snapshotId) : ops.current().currentSnapshot();

    // snapshot could be null when the table just gets created
    if (snapshot == null) {
      return CloseableIterable.empty();
    }

    // when snapshot is not null
    CloseableIterable<ManifestEntry<DataFile>> entries =
        new ManifestGroup(ops.io(), snapshot.dataManifests(ops.io()))
            .specsById(ops.current().specsById())
            .filterData(rowFilter)
            .filterFiles(fileFilter)
            .filterPartitions(partitionFilter)
            .ignoreDeleted()
            .caseSensitive(caseSensitive)
            .planWith(executorService)
            .entries();

    return CloseableIterable.transform(entries, entry -> entry.file().copy(includeColumnStats));
  }

So, not even counting external users of this library, we are already introducing a regression. I'm not saying we shouldn't do this, but we should be very careful here to avoid accidentally doubling (or greatly increasing) the memory consumption of a downstream consumer.

Thanks Russell, you raise a good point: existing consumers of this API might already apply a defensive copy, so we don't want to double the memory. I took a stab at following what the existing

iceberg/core/src/main/java/org/apache/iceberg/ManifestGroup.java

Line 180 in 9534c9b

public <T extends ScanTask> CloseableIterable<T> plan(CreateTasksFunction<T> createTasksFunc) {

is doing, and added a new method that takes a transform function.

So testManifestGroupEntriesWithParallelExecution() illustrates that if we only need the file path, we can skip the defensive copy, similar to how we collect DataFiles in FindFiles::collect.
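The idea of letting the caller supply the transform can be sketched as follows. This is a minimal illustration under stated assumptions, not the method added in this PR: `Entry`, `scan`, and `scanPaths` are hypothetical names, each "manifest" is modeled as a plain list of paths, and `recordCount` is a made-up field. The point is that the transform runs while the reused container still holds the current row, so a caller that only needs the path never pays for a full entry copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class TransformInsideScanDemo {
  /** Hypothetical mutable entry that a reader would reuse; not an Iceberg class. */
  public static final class Entry {
    public String path;
    public long recordCount;
  }

  /**
   * Scans each "manifest" on the pool. The caller's transform is applied inside the reading loop,
   * before the container is overwritten, so its result is safe to keep and hand to other threads.
   */
  public static <T> List<T> scan(
      List<List<String>> manifests, Function<Entry, T> transform, ExecutorService pool) {
    List<Future<List<T>>> futures = new ArrayList<>();
    for (List<String> manifest : manifests) {
      futures.add(
          pool.submit(
              () -> {
                Entry container = new Entry(); // one reused container per manifest
                List<T> out = new ArrayList<>();
                for (String path : manifest) {
                  container.path = path;
                  container.recordCount = path.length(); // placeholder value for the sketch
                  out.add(transform.apply(container));
                }
                return out;
              }));
    }

    List<T> results = new ArrayList<>();
    try {
      for (Future<List<T>> future : futures) {
        results.addAll(future.get()); // results are gathered in manifest order
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while scanning", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Scan task failed", e);
    }
    return results;
  }

  /** Extracts only the path: String is immutable, so no entry copy is needed. */
  public static List<String> scanPaths(List<List<String>> manifests) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      return scan(manifests, entry -> entry.path, pool);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(scanPaths(List.of(List.of("a.parquet", "b.parquet"), List.of("c.parquet"))));
  }
}
```

A caller that needs the whole entry would pass a copying transform instead, so the cost of copying is paid once and only by the consumers that actually need it.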

Core: Support parallel execution when scanning entries in ManifestGroup

84875a1

Also apply defensive copy on entries as ManifestReader reuseContainers across iteration

github-actions bot added the core label Feb 24, 2026

deniskuzZ reviewed Feb 24, 2026

RussellSpitzer reviewed Feb 24, 2026

stevenzwu reviewed Feb 24, 2026

dramaticlly added 2 commits February 24, 2026 11:42

Added a new method to transform manifest entries using a provided function

dc7f76c

Updated the existing entries method to streamline parallel execution without defensive copying, as the caller is now responsible for managing entry copies.

Move findFiles to use entries with function

4eef0dc

stevenzwu approved these changes Feb 24, 2026

dramaticlly wants to merge 3 commits into apache:main from dramaticlly:manifestGroupEntriesParallel

4 participants
