Skip to content

Conversation

@cgwalters
Copy link
Collaborator

Instead of using systemd.volatile=overlay which overlaid all of / with a single tmpfs-backed overlayfs, set up /etc and /var separately:

  • /etc: overlayfs with tmpfs upper (transient changes, lost on reboot)
  • /var: real tmpfs with content copied from image (not overlayfs)

The key benefit is that /var is now a real tmpfs, allowing podman to use overlayfs for container storage inside /var/lib/containers. With the old approach, the nested overlayfs caused "too many levels of symbolic links" errors.

Implementation uses systemd credentials to inject units that run in the initramfs before switch-root:

  • sysroot-etc.mount: overlay on /sysroot/etc
  • bcvk-var-ephemeral.service: copies /sysroot/var to tmpfs and bind mounts

Both units use ConditionPathExists=/etc/initrd-release to only run in the initramfs context.

This is Phase 1 of issue #22, making ephemeral VMs more bootc-like. SELinux is still disabled (selinux=0); Phase 2 will add composefs support to enable proper SELinux labeling.

xref: #22 (Phase 1)
Assisted-by: OpenCode (Sonnet 4)

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the broad systemd.volatile=overlay with more granular, fine-grained tmpfs and overlayfs mounts for /var and /etc respectively. This is a well-reasoned change that solves a real-world issue with nested overlayfs when using podman inside the ephemeral VM. The implementation using systemd credentials injected via SMBIOS is clean and effective. The addition of an integration test to verify the new mount layout is excellent. My feedback is minor, focusing on improving code consistency and fixing a small typo in a comment.

@cgwalters cgwalters force-pushed the ephemeral-prodlike branch 3 times, most recently from fe6c2e0 to b3cd27b Compare January 16, 2026 20:50
@cgwalters cgwalters enabled auto-merge (rebase) January 16, 2026 21:53
@cgwalters cgwalters force-pushed the ephemeral-prodlike branch 2 times, most recently from 67695e6 to 5516e29 Compare January 16, 2026 22:22
Instead of using systemd.volatile=overlay which overlaid all of / with
a single tmpfs-backed overlayfs, set up /etc and /var separately:

- /etc: overlayfs with tmpfs upper (transient changes, lost on reboot)
- /var: real tmpfs with content copied from image (not overlayfs)

The key benefit is that /var is now a real tmpfs, allowing podman to
use overlayfs for container storage inside /var/lib/containers. With
the old approach, the nested overlayfs caused "too many levels of
symbolic links" errors.

Implementation:

The initramfs units are embedded in a CPIO archive that gets appended
to the existing initramfs. This uses the Linux kernel's ability to
concatenate multiple CPIO archives.

Services running in initramfs (before switch-root):
- bcvk-etc-overlay.service: Sets up overlay on /sysroot/etc using a
  bind-mounted lowerdir to avoid self-referential mount issues on
  older kernels. Uses index=off,metacopy=off for virtiofs compat.
- bcvk-var-ephemeral.service: Copies /sysroot/var to tmpfs and bind
  mounts it back.
- bcvk-copy-units.service: Copies bcvk-journal-stream.service to
  /run/systemd/system/ for systemd <256 compatibility. The /run tmpfs
  is preserved across switch-root via MS_MOVE.

For systemd 256+, the journal-stream unit is created via SMBIOS
credentials (systemd.extra-unit.*). For older versions like CentOS
Stream 9 (systemd 252), the unit is copied from the initramfs since
credential-based unit creation isn't supported.

The execute service target is changed from default.target to
multi-user.target with ConditionPathExists=!/etc/initrd-release to
ensure it runs after switch-root, not in the initramfs.

This is Phase 1 of issue bootc-dev#22, making ephemeral VMs more bootc-like.
SELinux is still disabled (selinux=0); Phase 2 will add composefs
support to enable proper SELinux labeling.

Closes: bootc-dev#22 (Phase 1)
Assisted-by: OpenCode (Sonnet 4)
Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters
Copy link
Collaborator Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the ephemeral VM setup to use fine-grained mounts for /etc and /var instead of a single systemd.volatile=overlay. This change is crucial for allowing Podman to use overlayfs for container storage inside /var/lib/containers, resolving previous 'too many levels of symbolic links' errors. The implementation introduces a new cpio module to embed systemd units directly into the initramfs, ensuring compatibility with older systemd versions. Integration tests have been added to verify the new mount layout. Overall, the changes are well-structured and address a significant functional limitation.

Comment on lines +36 to +39
let buf = write_directory(buf, "usr").unwrap();
let buf = write_directory(buf, "usr/lib").unwrap();
let buf = write_directory(buf, "usr/lib/systemd").unwrap();
let buf = write_directory(buf, "usr/lib/systemd/system").unwrap();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of unwrap() here can lead to a panic with a generic message if directory creation fails. While Vec<u8> as a writer should not fail, the io::Result<W> return type indicates a potential failure. Consider replacing unwrap() with expect() to provide more context in case of a panic, or implement more robust error handling if these operations could genuinely fail in a recoverable way.

For example: write_directory(buf, "usr").expect("Failed to write 'usr' directory to CPIO archive")

Suggested change
let buf = write_directory(buf, "usr").unwrap();
let buf = write_directory(buf, "usr/lib").unwrap();
let buf = write_directory(buf, "usr/lib/systemd").unwrap();
let buf = write_directory(buf, "usr/lib/systemd/system").unwrap();
let buf = write_directory(buf, "usr").expect("Failed to write 'usr' directory to CPIO archive");
let buf = write_directory(buf, "usr/lib").expect("Failed to write 'usr/lib' directory to CPIO archive");
let buf = write_directory(buf, "usr/lib/systemd").expect("Failed to write 'usr/lib/systemd' directory to CPIO archive");
let buf = write_directory(buf, "usr/lib/systemd/system").expect("Failed to write 'usr/lib/systemd/system' directory to CPIO archive");

Comment on lines +41 to +91
let buf = write_file(
buf,
"usr/lib/systemd/system/bcvk-etc-overlay.service",
etc_overlay_content.as_bytes(),
)
.unwrap();

let buf = write_file(
buf,
"usr/lib/systemd/system/bcvk-var-ephemeral.service",
var_ephemeral_content.as_bytes(),
)
.unwrap();

let buf = write_file(
buf,
"usr/lib/systemd/system/bcvk-copy-units.service",
copy_units_content.as_bytes(),
)
.unwrap();

let buf = write_file(
buf,
"usr/lib/systemd/system/bcvk-journal-stream.service",
journal_stream_content.as_bytes(),
)
.unwrap();

// Drop-in configs to pull units into initrd-fs.target
let buf = write_directory(buf, "usr/lib/systemd/system/initrd-fs.target.d").unwrap();

let buf = write_file(
buf,
"usr/lib/systemd/system/initrd-fs.target.d/bcvk-etc-overlay.conf",
b"[Unit]\nWants=bcvk-etc-overlay.service\n",
)
.unwrap();

let buf = write_file(
buf,
"usr/lib/systemd/system/initrd-fs.target.d/bcvk-var-ephemeral.conf",
b"[Unit]\nWants=bcvk-var-ephemeral.service\n",
)
.unwrap();

let buf = write_file(
buf,
"usr/lib/systemd/system/initrd-fs.target.d/bcvk-copy-units.conf",
b"[Unit]\nWants=bcvk-copy-units.service\n",
)
.unwrap();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There's a lot of repetition in creating CPIO entries for directories and files. This block could be refactored into a helper function or by iterating over a list of paths and contents. This would improve readability and maintainability, especially if more units or directories need to be added in the future.

For example, you could define a list of (path, content_option) tuples and loop through them.

    let entries = [
        ("usr/lib/systemd/system/bcvk-etc-overlay.service", Some(etc_overlay_content.as_bytes())),
        ("usr/lib/systemd/system/bcvk-var-ephemeral.service", Some(var_ephemeral_content.as_bytes())),
        ("usr/lib/systemd/system/bcvk-copy-units.service", Some(copy_units_content.as_bytes())),
        ("usr/lib/systemd/system/bcvk-journal-stream.service", Some(journal_stream_content.as_bytes())),
        ("usr/lib/systemd/system/initrd-fs.target.d/bcvk-etc-overlay.conf", Some(b"[Unit]\nWants=bcvk-etc-overlay.service\n")),
        ("usr/lib/systemd/system/initrd-fs.target.d/bcvk-var-ephemeral.conf", Some(b"[Unit]\nWants=bcvk-var-ephemeral.service\n")),
        ("usr/lib/systemd/system/initrd-fs.target.d/bcvk-copy-units.conf", Some(b"[Unit]\nWants=bcvk-copy-units.service\n")),
    ];

    let mut buf = write_directory(buf, "usr").expect("Failed to write 'usr' directory");
    buf = write_directory(buf, "usr/lib").expect("Failed to write 'usr/lib' directory");
    buf = write_directory(buf, "usr/lib/systemd").expect("Failed to write 'usr/lib/systemd' directory");
    buf = write_directory(buf, "usr/lib/systemd/system").expect("Failed to write 'usr/lib/systemd/system' directory");
    buf = write_directory(buf, "usr/lib/systemd/system/initrd-fs.target.d").expect("Failed to write 'initrd-fs.target.d' directory");

    for (path, content) in entries {
        if let Some(c) = content {
            buf = write_file(buf, path, c).expect(&format!("Failed to write file {} to CPIO archive", path));
        } else {
            buf = write_directory(buf, path).expect(&format!("Failed to write directory {} to CPIO archive", path));
        }
    }

let source_initramfs_path = initramfs_path
.ok_or_else(|| eyre!("No initramfs found in /run/source-image/usr/lib/modules"))?;

fs::File::create(&kernel_mount)?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line creates an empty file at kernel_mount. Since the very next step is a mount --bind command that will mount the actual kernel file to this location, creating an empty file beforehand is redundant and unnecessary. The mount command will handle the creation of the mount point if it doesn't exist, or use the existing one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant