Bug 2269385 - rhgb breaks custom/minimal install on most filesystem layouts [NEEDINFO]
Summary: rhgb breaks custom/minimal install on most filesystem layouts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: plymouth
Version: 40
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ray Strode [halfline]
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
Depends On:
Blocks: BetaBlocker, F40BetaBlocker
Reported: 2024-03-13 15:04 UTC by Kamil Páral
Modified: 2024-03-20 09:28 UTC (History)

Fixed In Version: plymouth-24.004.60-4.fc40
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 04:23:06 UTC
Type: Bug
Embargoed:
kparal: needinfo? (rstrode)


Attachments
journal (broken boot) (219.92 KB, text/plain)
2024-03-13 15:06 UTC, Kamil Páral
journal (ok boot) (187.50 KB, text/plain)
2024-03-13 15:06 UTC, Kamil Páral
list of rpms installed (14.23 KB, text/plain)
2024-03-13 15:07 UTC, Kamil Páral


Links
System ID Private Priority Status Summary Last Updated
freedesktop.org Gitlab plymouth plymouth issues 249 0 None opened Commit 48881ba2 breaks screen refreshing on minimal installs 2024-03-18 10:48:15 UTC

Description Kamil Páral 2024-03-13 15:04:47 UTC
Description of problem:
In certain system installs, the booted system seems to freeze during boot. But it doesn't freeze, it's just that the screen stops refreshing. You can either see a black screen, or the text that was present when grub exited, i.e. a text saying "Booting $grub-item-name", or just a text cursor in the top left corner. Underneath, the system boots just fine, and can even be operated blindly (I can log in and run commands blindly, it works), but the screen never updates.

But if you remove "rhgb" from the grub command line, the system boots perfectly fine, screen updates as expected.

These are the conditions under which the bug occurs:

1. The install must be performed from the Everything netinst image. (Server netinst and Server DVD work just fine.)
2. You must install either Fedora Custom Operating System (custom-environment) or Minimal Install (minimal-environment).
3. It only happens on bare metal. (I tested two bare metal machines, both exhibit it. It doesn't happen in virtual machines).
4. It happens for both UEFI and BIOS installs.


It really confuses me that the same environment behaves differently when installed from Everything netinst vs Server netinst. I tried to compare the package sets, and found a few extra packages when installed from Server netinst, but adding them to the Everything netinst installation doesn't resolve the problem. Even installing the whole server-product-environment group (which works just fine when installed from both Server and Everything netinst) on an affected system doesn't resolve the problem.

So perhaps this is not about the package set but about some bootloader/grub files that differ between Server and Everything? I really don't know where to look.


Version-Release number of selected component (if applicable):
plymouth-24.004.60-3.fc40.x86_64
grub2-common-2.06-119.fc40.noarch
Fedora-Everything-netinst-x86_64-40_Beta-1.2.iso  # broken
Fedora-Server-netinst-x86_64-40_Beta-1.2.iso      # works

How reproducible:
always

Steps to Reproduce:
1. use a bare metal machine
2. boot the Everything netinst image
3. install either Fedora Custom Operating System or Minimal Install
4. reboot after install
5. see the grub menu, let it time out
6. see that the default "Booting $grub-menu-item" text never disappears, login prompt never appears
7. you can use Ctrl+Alt+Del to reboot the machine. Or ssh in. Or log in blindly and reboot.
8. Boot again, this time remove "rhgb" from grub
9. See that it boots normally, you can see boot messages and the login prompt
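
The workaround in step 8 amounts to dropping one token from the kernel command line. A minimal Python sketch of that edit follows. The function name `drop_kernel_arg` and the sample command line are illustrative assumptions, and the code only transforms a string; it does not touch the bootloader configuration.

```python
def drop_kernel_arg(cmdline: str, arg: str = "rhgb") -> str:
    """Return cmdline with every standalone occurrence of arg removed."""
    # Split on whitespace so that only whole tokens match
    # (e.g. "rhgb", but not a hypothetical "foo=rhgb").
    tokens = cmdline.split()
    return " ".join(t for t in tokens if t != arg)


# Hypothetical command line; the real one depends on the install.
sample = "root=UUID=0000 ro rhgb quiet"
print(drop_kernel_arg(sample))
```

Matching whole tokens rather than substrings keeps other arguments intact even if they happen to contain the same letters.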

Comment 1 Kamil Páral 2024-03-13 15:06:19 UTC
Created attachment 2021429 [details]
journal (broken boot)

This is a journal from a broken boot (with rhgb). The system is operational, just the screen never updates. The system was rebooted with Ctrl+Alt+Del after a while.

Comment 2 Kamil Páral 2024-03-13 15:06:54 UTC
Created attachment 2021430 [details]
journal (ok boot)

This is a journal from an OK boot (rhgb removed). Screen updates as expected.

Comment 3 Kamil Páral 2024-03-13 15:07:18 UTC
Created attachment 2021431 [details]
list of rpms installed

Comment 4 Kamil Páral 2024-03-13 15:08:18 UTC
Everything netinst is a release blocking deliverable, proposing for a blocker discussion.

Comment 5 Adam Williamson 2024-03-13 16:21:52 UTC
Oh jeez, I actually saw this on my test system when verifying the firmware RAID bug fix, but I was in a hurry and didn't think much of it, figured it was just a weird blip...

Can you produce 'success' by doing an install from Everything boot iso, but going through custom partitioning and setting the filesystem to XFS (but otherwise letting it create the partitions for you)?

Comment 6 Kamil Páral 2024-03-13 18:43:04 UTC
Uhhhh, very nice! The difference is really in the partition layout! It doesn't matter whether you use Everything or Server netinst (I checked), it only depends on the target layout:

WORKS:
/boot  ext4
/      lvm -> xfs

/boot  xfs
/      lvm -> xfs

/boot  ext4
/      lvm -> ext4


DOESN'T WORK:
/boot  ext4
/      btrfs

/boot  xfs
/      btrfs

/boot  ext4
/      ext4

/boot  ext4
/      xfs

/boot  xfs
/      xfs


Eh, it looks like it depends on the / partition, not /boot (as I assumed), and it only works if the / partition is inside LVM!

Also, once you install the full Server package set, any partition layout (most probably, haven't checked everything) works. The bug only affects Custom and Minimal sets.
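
The layouts listed above can be written down as data to check the "root inside LVM" observation mechanically. This is only a sketch: the tuple layout and the names `OBSERVED` and `predicted_ok` are assumptions made for illustration, and the check describes a correlation in these eight tests, not a root cause.

```python
# (boot filesystem, root filesystem, root is on LVM, screen works)
OBSERVED = [
    ("ext4", "xfs", True, True),
    ("xfs", "xfs", True, True),
    ("ext4", "ext4", True, True),
    ("ext4", "btrfs", False, False),
    ("xfs", "btrfs", False, False),
    ("ext4", "ext4", False, False),
    ("ext4", "xfs", False, False),
    ("xfs", "xfs", False, False),
]


def predicted_ok(root_on_lvm: bool) -> bool:
    """Hypothesis from the tests above: it works only with / inside LVM."""
    return root_on_lvm


# Rows where the hypothesis disagrees with what was observed.
mismatches = [row for row in OBSERVED if predicted_ok(row[2]) != row[3]]

# Outcomes seen per /boot filesystem; both outcomes for each filesystem
# means /boot alone does not predict the result.
boot_outcomes = {
    fs: {ok for boot, _, _, ok in OBSERVED if boot == fs}
    for fs in ("ext4", "xfs")
}

print(mismatches)
print(boot_outcomes)
```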

Comment 7 Zbigniew Jędrzejewski-Szmek 2024-03-13 21:46:46 UTC
I suspect that the file system is not a direct cause, but instead it just changes
timing and exposes the issue in some other component.

One thing that clearly fails is this:
Mar 13 15:55:09 fedora systemd[1]: Starting systemd-vconsole-setup.service - Virtual Console Setup...
Mar 13 15:55:09 fedora systemd[1]: Mounted sys-kernel-config.mount - Kernel Configuration File System.
Mar 13 15:55:09 fedora systemd-vconsole-setup[537]: setfont: ERROR kdfontop.c:183 put_font_kdfontop: Unable to load such font with such kernel version
Mar 13 15:55:09 fedora systemd-vconsole-setup[534]: /usr/bin/setfont failed with a "system error" (EX_OSERR), ignoring.
Mar 13 15:55:09 fedora systemd-vconsole-setup[534]: Setting source virtual console failed, ignoring remaining ones.
Mar 13 15:55:09 fedora systemd[1]: Finished systemd-vconsole-setup.service - Virtual Console Setup.

But this should only cause the text console not to get the right fonts, it shouldn't
interfere with getting a text console. There were a few patches in systemd after v255
to handle this better, and so far we didn't backport them because it didn't seem
important enough. But if you don't figure out a different reason, we can try, at least
to see if it makes a difference.
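
To pull such failures out of a saved journal, a short Python filter is enough. The sketch below assumes the syslog-style line format seen in the excerpt above; the names `LINE_RE` and `messages_from` are made up for illustration, and the sample text is a subset of the quoted lines.

```python
import re

JOURNAL = """\
Mar 13 15:55:09 fedora systemd[1]: Starting systemd-vconsole-setup.service - Virtual Console Setup...
Mar 13 15:55:09 fedora systemd-vconsole-setup[537]: setfont: ERROR kdfontop.c:183 put_font_kdfontop: Unable to load such font with such kernel version
Mar 13 15:55:09 fedora systemd-vconsole-setup[534]: /usr/bin/setfont failed with a "system error" (EX_OSERR), ignoring.
Mar 13 15:55:09 fedora systemd[1]: Finished systemd-vconsole-setup.service - Virtual Console Setup.
"""

# Assumed layout: "<Mon> <day> <time> <host> <ident>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^\w{3} +\d+ [\d:]+ \S+ (?P<ident>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def messages_from(journal_text, ident):
    """Return the messages logged by the given syslog identifier."""
    out = []
    for line in journal_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group("ident") == ident:
            out.append(m.group("msg"))
    return out


# Keep only messages that look like failures.
failures = [
    msg
    for msg in messages_from(JOURNAL, "systemd-vconsole-setup")
    if "ERROR" in msg or "failed" in msg
]
for msg in failures:
    print(msg)
```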

Comment 8 Adam Williamson 2024-03-14 16:34:55 UTC
+4 in https://pagure.io/fedora-qa/blocker-review/issue/1521 , marking accepted.

Comment 9 Kamil Páral 2024-03-14 16:38:32 UTC
I have a couple of interesting findings.

First, changing which plymouth theme is active doesn't have any effect.

Second, installing plymouth-graphics-libs resolves the problem, the bootsplash appears and the login prompt is usable.

Third, while this was true yesterday, it's not true today:

> Even installing the whole server-product-environment group on an affected system doesn't resolve the problem

Today, when I install the server group, bootsplash appears, login prompt works. When I uninstall it, it's back to the broken state. I tried to bisect for a package that flips the behavior, and it looks like if I install nfs-utils and certain iwlwifi-*-firmware packages together, it flips to a working state. But it's weird and inconsistent.

From all these bits, I have a feeling that this is really a race condition, as Zbigniew suggested. And having different filesystems, or different processes/services running during boot, or having boot files large enough that they take longer to load, changes the timing, which changes whether the race condition occurs.

Comment 10 Adam Williamson 2024-03-14 17:40:57 UTC
OK, this is really bugging me now because it sounds *super* familiar - I swear I remember the name plymouth-graphics-libs in the context of a very similar bug before. But I can't find it. I'll keep looking.

Comment 11 Adam Williamson 2024-03-14 18:13:03 UTC
ooh, okay, so I kinda suspect the changes from https://src.fedoraproject.org/rpms/plymouth/c/e08eb228aef455106511b0eb6155e17e09aced29?branch=rawhide (they were rolled into the next major version release, so they no longer exist as patches in the package, but they are in the upstream). We should probably try reverting those selectively...

Comment 12 Kamil Páral 2024-03-15 09:40:13 UTC
So, first I checked F39 Everything netinst, just to be sure - it works just fine, as expected.

Now, on F40, I tried downgrading plymouth. It changes things! So it really seems to be a regression in plymouth.

plymouth-22.02.122-6.fc40 [1] is the last plymouth that works.
plymouth-23.358.4-6.fc40 [2] is the first plymouth that doesn't work.

So it broke somewhere between those versions. I'll see if I can narrow it down more. But at this point I believe we need Ray to start looking into it.

[1] https://koji.fedoraproject.org/koji/buildinfo?buildID=2322964
[2] https://koji.fedoraproject.org/koji/buildinfo?buildID=2337638

Comment 13 Kamil Páral 2024-03-15 10:00:40 UTC
So even more precise is that this commit works:
https://src.fedoraproject.org/rpms/plymouth/c/6534ca93a154ef3c49bbfe7406a63aac5120d2cf?branch=f40
(that's plymouth-22.02.122-6.fc40)

And this commit doesn't:
https://src.fedoraproject.org/rpms/plymouth/c/9c15b6a28ab0a8ede11b24cfa8486e534a0aa492?branch=f40
(that's plymouth-23.356.9-4.fc40)

There are no further actionable commits between those two. So in order to dig further, I'd have to try git bisect on the upstream source code.
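
The bisection idea can be sketched in a few lines of Python. This is a generic binary search, not git's implementation: it assumes a linear history where the first entry is known good, the last is known bad, and everything after the first bad entry is also bad. The commit labels and the `is_bad` test are hypothetical stand-ins for "build this snapshot, boot it, look at the screen".

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad entry, testing only O(log n) candidates."""
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # breakage is at mid or earlier
        else:
            lo = mid  # breakage is after mid
    return candidates[hi]


# Hypothetical linear history; "c5" stands in for the offending commit.
history = [f"c{i}" for i in range(10)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 5)
print(culprit)
```

Because this bug looks timing-dependent, each "good" verdict in a real bisection is worth confirming with more than one boot.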

Comment 14 Adam Williamson 2024-03-15 17:28:06 UTC
Yeah, that would be my next step, set up the spec file to build git snapshots then just bisect it. I will probably do this over the weekend or on Monday if nobody else gets to it first.

Comment 15 Adam Williamson 2024-03-17 00:20:47 UTC
Bisected to https://gitlab.freedesktop.org/plymouth/plymouth/-/commit/48881ba2ef3d25fd27fd150d4d5957d4df9868e0 . Will see if that reverts cleanly.

Comment 16 Fedora Update System 2024-03-17 00:49:52 UTC
FEDORA-2024-adf0027989 (plymouth-24.004.60-4.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-adf0027989

Comment 17 Fedora Update System 2024-03-18 01:24:24 UTC
FEDORA-2024-adf0027989 has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-adf0027989`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-adf0027989

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 18 Kamil Páral 2024-03-18 10:43:31 UTC
(In reply to Fedora Update System from comment #16)
> FEDORA-2024-adf0027989 (plymouth-24.004.60-4.fc40) has been submitted as an
> update to Fedora 40.
> https://bodhi.fedoraproject.org/updates/FEDORA-2024-adf0027989

This fixes the problem on my hardware.

Comment 19 Kamil Páral 2024-03-18 10:48:16 UTC
Reported upstream:
https://gitlab.freedesktop.org/plymouth/plymouth/-/issues/249

Comment 20 Lukas Ruzicka 2024-03-18 11:29:52 UTC
Yeah, works for me, too.

Comment 21 Fedora Update System 2024-03-19 04:23:06 UTC
FEDORA-2024-adf0027989 (plymouth-24.004.60-4.fc40) has been pushed to the Fedora 40 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 22 spaceboy60 2024-03-19 05:40:53 UTC
Still not working here

Comment 23 Adam Williamson 2024-03-19 06:00:48 UTC
what is not working? did you test with a fresh image? if you only updated the system, you also need to run `dracut -f` and reboot to see the fix.
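
A quick way to tell whether a given plymouth build is at least the fixed one is to compare its version and release against plymouth-24.004.60-4.fc40. The Python sketch below uses a simplified segment comparison, not RPM's own version comparison algorithm, and the helper names are assumptions. It says nothing about whether the initramfs was regenerated afterwards.

```python
import re

# Taken from the "Fixed In Version" field: plymouth-24.004.60-4.fc40
FIXED_VERSION, FIXED_RELEASE = "24.004.60", "4.fc40"


def _segments(text):
    """Split into digit runs and letter runs; digit runs sort above letter runs."""
    return [
        (1, int(tok), "") if tok.isdigit() else (0, 0, tok)
        for tok in re.findall(r"\d+|[A-Za-z]+", text)
    ]


def has_fix(version, release):
    """True if (version, release) is at least the fixed plymouth build."""
    installed = (_segments(version), _segments(release))
    fixed = (_segments(FIXED_VERSION), _segments(FIXED_RELEASE))
    return installed >= fixed


print(has_fix("24.004.60", "3.fc40"))  # the build from the original report
print(has_fix("24.004.60", "4.fc40"))  # the build from the update
```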

Comment 24 spaceboy60 2024-03-19 13:59:43 UTC
It boots to the login screen and then crashes to a blank blinking screen; from there, all you can do is get to the command line with Ctrl+Alt+F2.

It was updated, not installed from a fresh image. I will run `dracut -f` later to see if it fixes the issue.

Comment 25 Adam Williamson 2024-03-19 14:57:58 UTC
That does not sound like this bug. With this bug, you saw *nothing at all* on the screen. No login prompt, no blinking cursor.

Comment 26 spaceboy60 2024-03-19 19:40:21 UTC
`dracut -f` didn't help. Maybe I need to file a new bug.

Comment 27 spaceboy60 2024-03-19 19:40:59 UTC
`dracut -f` didn't help. Maybe I need to file a new bug.

Comment 28 Adam Williamson 2024-03-19 22:36:05 UTC
Yeah, from your description I would say so.

Comment 29 Kamil Páral 2024-03-20 09:28:01 UTC
spaceboy60, please link to your new bug here, and also include information on whether downgrading the plymouth* packages to some older version (and running `sudo dracut -f`) resolves the problem for you. We'll discuss there. Thanks!

