feat(table): roll parquet files based on actual compressed size #759
twuebi wants to merge 5 commits into apache:main
Conversation
zeroshade left a comment
A few questions based on my reading of the code
// current row group — matching the size estimate used by iceberg-java and
// iceberg-rust to make rolling decisions.
func (w *ParquetFileWriter) BytesWritten() int64 {
	return w.counter.Count + w.pqWriter.RowGroupTotalCompressedBytes()
Isn't this going to double count, since RowGroupTotalCompressedBytes would also count what has been flushed, not only what's still buffered?
// RowGroupTotalCompressedBytes returns the total number of bytes after compression
// that have been written to the current row group so far.
func (fw *FileWriter) RowGroupTotalCompressedBytes() int64 {
if fw.rgw != nil {
return fw.rgw.TotalCompressedBytes()
}
return 0
}
// RowGroupTotalBytesWritten returns the total number of bytes written and flushed out in
// the current row group.
func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
if fw.rgw != nil {
return fw.rgw.TotalBytesWritten()
}
return 0
}

Since we're just using RowGroupTotalCompressedBytes here, I don't think we're double counting: w.counter wraps the FileWriter, so it should only count what has been flushed, and RowGroupTotalCompressedBytes only counts what has been compressed into the current row group, in contrast to RowGroupTotalBytesWritten, which would also include the flushed bytes.
Also added a test, TestBytesWrittenNoDoubleCountAcrossRowGroups.
TotalCompressedBytes would include what has already been flushed plus what is still in memory. The important bytes it won't count are the footers. RowGroupTotalBytesWritten only counts what has been flushed.
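To make the counting scheme under discussion concrete, here is a minimal, self-contained sketch (the types and names are illustrative stand-ins, not the actual iceberg-go or arrow-go code): the wrapping counter only advances when a row group is flushed to the underlying sink, while the in-progress row group reports only its own buffered compressed bytes, so summing the two never counts a byte twice.

```go
package main

import "fmt"

// countingSink models w.counter: it records only bytes actually
// flushed to the underlying file.
type countingSink struct{ Count int64 }

func (c *countingSink) Write(p []byte) (int, error) {
	c.Count += int64(len(p))
	return len(p), nil
}

// rowGroup models the in-progress row group: compressed bytes buffered
// in memory but not yet flushed to the sink.
type rowGroup struct{ compressed int64 }

// mockFileWriter ties the two together the way the PR's BytesWritten does.
type mockFileWriter struct {
	sink *countingSink
	rg   *rowGroup
}

func (w *mockFileWriter) WriteBatch(compressedLen int64) {
	w.rg.compressed += compressedLen
}

// FlushRowGroup moves the buffered bytes to the sink and starts a fresh
// row group, so the same bytes are never reported by both sources.
func (w *mockFileWriter) FlushRowGroup() {
	buf := make([]byte, w.rg.compressed)
	w.sink.Write(buf)
	w.rg = &rowGroup{}
}

// BytesWritten = flushed bytes + current (unflushed) row group bytes.
func (w *mockFileWriter) BytesWritten() int64 {
	return w.sink.Count + w.rg.compressed
}

func main() {
	w := &mockFileWriter{sink: &countingSink{}, rg: &rowGroup{}}
	w.WriteBatch(100)
	fmt.Println(w.BytesWritten()) // 100: all buffered, nothing flushed
	w.FlushRowGroup()
	fmt.Println(w.BytesWritten()) // still 100: all flushed, nothing buffered
	w.WriteBatch(50)
	fmt.Println(w.BytesWritten()) // 150: 100 flushed + 50 buffered
}
```

A test like the PR's TestBytesWrittenNoDoubleCountAcrossRowGroups would assert exactly this invariant: the reported size is unchanged across a flush boundary.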
table/rolling_data_writer.go (outdated)
	fileSchema = sanitized
}

format := tblutils.GetFileFormat(iceberg.ParquetFile)
This shouldn't be hardcoded as Parquet, should it? Shouldn't we get this from a table config/write config property?
table/rolling_data_writer.go (outdated)
	ID:          cnt,
	PartitionID: partitionID,
	FileCount:   fileCount,
}.GenerateDataFileName("parquet")
Same as above: shouldn't we pull "parquet" from the config options instead of hardcoding it?
	}
}()

binPackedRecords := binPackRecords(recordIter, defaultBinPackLookback, r.factory.targetFileSize)
Why are we dropping the bin-packing of the records? I don't see it being used elsewhere. Unless the user explicitly says they are providing sorted records and sorted data, we should allow the bin-packing to keep file sizes down, right? Or is there a reason why we're removing it?
My understanding is that the bin-packing was primarily used to write files equal to or smaller than the target file size, without considering compression. With this change, we're tracking actual file sizes and no longer need to bin-pack records based on estimates. This is modeled after iceberg-java's RollingFileWriter / BaseRollingWriter.
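The rolling pattern described here can be sketched as follows. This is a simplified model under stated assumptions, not the PR's implementation: the writer and its size reporting are stand-ins, and the point is only the control flow of rolling on actual written size rather than bin-packing up front on estimated size.

```go
package main

import "fmt"

// fileWriter stands in for a single data-file writer that can report
// its actual compressed size so far (the role BytesWritten plays above).
type fileWriter struct{ written int64 }

func (f *fileWriter) writeBatch(compressed int64) { f.written += compressed }
func (f *fileWriter) bytesWritten() int64         { return f.written }

// rollingWriter rolls to a new file whenever the current file's actual
// written size reaches the target, instead of partitioning records in
// advance based on in-memory size estimates.
type rollingWriter struct {
	target  int64
	current *fileWriter
	closed  []int64 // sizes of completed files
}

func (r *rollingWriter) write(batchCompressed int64) {
	if r.current == nil {
		r.current = &fileWriter{}
	}
	r.current.writeBatch(batchCompressed)
	// Roll after the write, once the actual size crosses the target.
	if r.current.bytesWritten() >= r.target {
		r.roll()
	}
}

// roll closes the current file, if any, and records its final size.
func (r *rollingWriter) roll() {
	if r.current != nil {
		r.closed = append(r.closed, r.current.bytesWritten())
		r.current = nil
	}
}

func main() {
	w := &rollingWriter{target: 100}
	for _, batch := range []int64{40, 40, 40, 40, 40} {
		w.write(batch)
	}
	w.roll() // close the tail file, if any
	fmt.Println(w.closed) // [120 80]
}
```

Note the trade-off visible even in this toy: because the check runs after each batch, a file can overshoot the target by up to one batch, which matches rolling on actual size rather than pre-sizing via bin-packing.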
This change refactors data file writing to use the actual written file size, as iceberg-java and iceberg-rust do, instead of the in-memory size.