storage, overlay: stop using symlinks for new layers#666

Open
giuseppe wants to merge 2 commits into containers:main from giuseppe:drop-overlay-symlink

Conversation

@giuseppe
Member

The overlay driver maintained an l/ directory of symlinks solely to shorten mount-option strings below the kernel's page-size limit.
This is unnecessary since mountOverlayFrom() already handles long mount options via fd-based mounting.

New layers now store relative diff paths (e.g. "layer-id/diff") directly in the lower file instead of short symlink references.

Old-format layers with l/ symlinks continue to work.

@github-actions github-actions bot added the storage Related to "storage" package label Feb 18, 2026
@giuseppe giuseppe force-pushed the drop-overlay-symlink branch from 9827306 to dff75cb Compare February 18, 2026 11:07
@giuseppe
Member Author

@mtrmac I believe you are going to like this PR :)

@giuseppe giuseppe marked this pull request as ready for review February 18, 2026 12:34
@giuseppe giuseppe changed the title [WIP] storage, overlay: stop using symlinks for new layers storage, overlay: stop using symlinks for new layers Feb 18, 2026
@mtrmac
Contributor

mtrmac commented Feb 18, 2026

At a glance, yes I do :)

Member

@Luap99 Luap99 left a comment

I wonder, do we have a test case that mounts a 500 layer overlay? If not should we have one to ensure mounting always works?

@giuseppe
Member Author

Mounting with the new code always gives shorter paths, since we use the fd number as the full path instead of a symlink that is still an absolute path. I think this code was left there just for historical reasons; there is no case where it is better.

Contributor

@kolyshkin kolyshkin left a comment

Maybe it makes sense to add comments marking the remaining "legacy" code paths (dealing with l/ symlinks) as such, so they can be removed in the future. Otherwise it's hard to grok for future readers as it is -- we don't create any links but we handle them.

One other thing: I'm not sure we need the entire fork/exec to open the lower fds. This is a further optimization, but maybe we can just do it in the same process? I mean, the fork/exec was done because of chdir, but it looks like we no longer need it (yet it's left there, and that's another part of the "legacy" code, I guess). So maybe we should call mountOverlayFrom only when working with legacy l/ symlinks, and go via fds in the same process otherwise?

As for the tests, I think this is covered by the existing @test "allow storing images with more than 127 layers", which creates 300 layers (although I haven't checked whether it actually crosses the pagesize threshold).


func (d *Driver) create(id, parent string, opts *graphdriver.CreateOpts, readOnly bool) (retErr error) {
-	dir, homedir, _ := d.dir2(id, readOnly)
+	dir, _, _ := d.dir2(id, readOnly)
Contributor

BTW this removes the only user of the second value returned from dir2, so maybe it makes sense to drop it (perhaps in a separate subsequent commit).

Member Author

thanks for spotting it, fixed now!

@kolyshkin
Contributor

> I wonder, do we have a test case that mounts a 500 layer overlay? If not should we have one to ensure mounting always works?

We do have one for 300 layers, see storage/tests/layers.bats, although as I said I haven't checked that we're actually crossing the pagesize threshold with it.

@TomSweeneyRedHat
Member

@giuseppe do you spin up a PR in Podman to see how this does there?
Other than @kolyshkin 's comments, LGTM, and happy green test buttons.

@giuseppe
Member Author

> @giuseppe do you spin up a PR in Podman to see how this does there?
> Other than @kolyshkin 's comments, LGTM, and happy green test buttons.

fixed, and opened a test PR here: containers/podman#28164

@TomSweeneyRedHat
Member

I restarted the failing test here, it looked to be a time out flake.
The Podman PR, however, is very red.

@giuseppe
Member Author

> I restarted the failing test here, it looked to be a time out flake.
> The Podman PR, however, is very red.

that failure is expected because we are not rebuilding Buildah. Buildah fails to find the symlink since we are not creating it anymore.

@giuseppe
Member Author

giuseppe commented Mar 4, 2026

@Luap99 @mtrmac @kolyshkin PTAL

@Luap99
Member

Luap99 commented Mar 4, 2026

> that failure is expected because we are not rebuilding Buildah. Buildah fails to find the symlink since we are not creating it anymore.

Well, how do you expect us to get CI passing then? We should not merge this if it breaks CI tests.
Also, we cannot really control the update timings of podman/buildah/skopeo, so if mixing them breaks them completely we have a big problem for the distributions shipping these tools. We can only control Fedora/RHEL.

For example, this test is using skopeo:

[+0054s] not ok 39 [010] podman pull image with additional store in 1071ms
         # (in test file [test/system/010-images.bats, line 395](https://github.com/containers/podman/blob/2839aa038a596ef18fcf60a9731517fa8a1a6dc5/test/system/010-images.bats#L395))
         #   `skopeo copy containers-storage:$IMAGE \' failed
         #
         # # podman image exists quay.io/libpod/testimage:20241011
         # Getting image source signatures
         # Copying blob sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef
         # Copying blob sha256:b66a884aaf08f1c410c61682a7072d68a0d837ba8dc16ff13b9509bdceb32fd2
         # time="2026-02-27T11:30:24-06:00" level=fatal msg="reading blob sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef: creating file-getter: readlink /var/lib/containers/storage/overlay/b66a884aaf08f1c410c61682a7072d68a0d837ba8dc16ff13b9509bdceb32fd2/diff: invalid argument"
         # # [teardown]
         # rm: cannot remove '/tmp/CI_M4p5/podman_bats.ozboK6/imagestore/root/overlay': Device or resource busy

@giuseppe
Member Author

giuseppe commented Mar 4, 2026

OK, let's do it in two phases: I'll change this PR to not use the symlinks but still create them.

@giuseppe giuseppe force-pushed the drop-overlay-symlink branch 3 times, most recently from a1bf17b to 171a0bc Compare March 4, 2026 21:08
giuseppe added a commit to giuseppe/libpod that referenced this pull request Mar 7, 2026
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe
Member Author

giuseppe commented Mar 9, 2026

Podman tests pass now

@giuseppe giuseppe force-pushed the drop-overlay-symlink branch from 171a0bc to b8f4b1f Compare March 17, 2026 13:54
Member

@Luap99 Luap99 left a comment

LGTM

Contributor

@kolyshkin kolyshkin left a comment

Assuming the new functionality is covered by existing tests, LGTM

Write a new "lower-layers" file alongside the existing "lower" file.
It stores layer IDs directly instead of l/ symlink references; the
reading side appends "/diff" itself.  When present, lower-layers is
preferred over lower.  The old lower file is still written so that
older tools (buildah, skopeo) continue to work.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the drop-overlay-symlink branch from e345324 to 823a7b8 Compare March 20, 2026 06:56
this is the limit hardcoded in the kernel.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@mtrmac
Contributor

mtrmac commented Mar 20, 2026

I think this is not immediately urgent if this PR will keep creating the link files, but per #703 (and code linked from there), the l subdirectory might be the only thing preventing us from deleting the driver’s whole directory, even on success paths if we currently have no layers (and that’s important at least for ScanPriorDrivers).

That would need a re-think/re-design before we remove the linkDir code, and probably a warning now so that we don’t forget.

@giuseppe
Member Author

> I think this is not immediately urgent if this PR will keep creating the link files, but per #703 (and code linked from there), the l subdirectory might be the only thing preventing us from deleting the driver’s whole directory, even on success paths if we currently have no layers (and that’s important at least for ScanPriorDrivers).
>
> That would need a re-think/re-design before we remove the linkDir code, and probably a warning now so that we don’t forget.

Once we decide to completely drop the l subdirectory, should we just temporarily add some code in Init to forcefully rm -rf l? Would that be enough?

@mtrmac
Contributor

mtrmac commented Mar 20, 2026

The problem is the other way: if there is no l, and the user deleted all layers, we have code to delete the whole overlay subdirectory, and that will now succeed. That breaks our designed workflow for setting a quota on container storage.

@mtrmac
Contributor

mtrmac commented Mar 20, 2026

… and the way we detect which graph driver is “currently in use”.

@giuseppe
Member Author

> The problem is the other way: if there is no l, and the user deleted all layers, we have code to delete the whole overlay subdirectory, and that will now succeed. That breaks our designed workflow for setting a quota on container storage.

I see. Could we use staging and tempdirs? We would need to make sure these are never deleted (at least one of them).

@mtrmac
Contributor

mtrmac commented Mar 20, 2026

That all works, but those are opaque implementation hacks. What are we doing here? If the directory's presence matters, it should be consciously managed as a specific goal. Or move the information elsewhere, into a real metadata file, perhaps layers.json.

@giuseppe
Member Author

Sure, we can handle that with an explicit file to prevent rmdir, or find a better way to mark the graph as used. Having that behavior because of l/ is just a side effect today.

If there are no layers/containers, is the storage driver really used? Is triggering the heuristic to pick a driver again really the wrong thing to do? Upgrading from vfs to overlay (when possible) looks like a feature.
