Bug 1518464 - Installer images fail to boot fully on UEFI VMs using 'std' driver since anaconda-28.11-1.fc28
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: plymouth
Version: 28
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Ray Strode [halfline]
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-29 00:53 UTC by Adam Williamson
Modified: 2018-06-10 19:13 UTC

Fixed In Version: plymouth-0.9.3-9.fc28
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-10 19:13:10 UTC
Type: Bug
Embargoed:


Attachments
plymouth debugging output from boot attempt with wip patch (62.46 KB, text/plain)
2018-06-01 18:14 UTC, Adam Williamson
more complete plymouth debug log (102.60 KB, text/plain)
2018-06-01 18:47 UTC, Adam Williamson
proposed patch for plymouth (4.04 KB, patch)
2018-06-07 01:36 UTC, Adam Williamson

Description Adam Williamson 2017-11-29 00:53:09 UTC
Since Rawhide compose Fedora-Rawhide-20171128.n.0, dedicated installer images fail to boot in UEFI mode in openQA testing. They boot OK in BIOS mode.

The boot process appears to just hang part of the way through. In one openQA test that runs with 'debug' on the kernel cmdline, we get a bit more logging:

https://openqa.fedoraproject.org/tests/176235
https://openqa.fedoraproject.org/tests/176235#step/_boot_to_anaconda/14

Note that it seems to hang immediately after the kernel logs "fb: switching to bochsdrmfb from EFI VGA".

This doesn't seem to affect bare metal; I just tested booting the same image on my test box in UEFI mode and it worked OK. There, the message logged is:

fb: switching to radeondrmfb from EFI VGA

note, different driver. Similarly it doesn't affect a VM with qxl video, where the driver is 'qxldrmfb', or virtio video, where the driver is 'virtiodrmfb'. It only appears to affect VMs using '-vga std' (this is what openQA does), or with the video adapter set to 'VGA' in virt-manager - or to put it another way, any case where the kernel framebuffer driver is 'bochsdrmfb', seemingly.

Booting with 'plymouth.enable=0' works fine in a configuration which otherwise fails. So I'm assigning this to plymouth. plymouth was bumped from plymouth-0.9.3-0.9.20160620git0e65b86c.fc27 to plymouth-0.9.3-1.fc28 in the affected compose.

Comment 2 Adam Williamson 2017-12-11 23:44:21 UTC
Still the case with Fedora-Rawhide-20171211.n.0.

Comment 3 Ray Strode [halfline] 2017-12-15 20:43:29 UTC
what if you add "rhgb" to the kernel command line? does it work then?

Comment 4 Adam Williamson 2017-12-15 21:23:02 UTC
Hey, good call! Yes, it does. 'rhgb' is not on the installer cmdline by default, and adding it does result in a successful boot to the graphical installer.

Comment 5 Ray Strode [halfline] 2017-12-18 20:41:39 UTC
so there's a plymouth bug with the text mode fallback plugin I guess, but also a problem with the lorax config? Not sure how to make sure rhgb gets added to the kernel command line again. bcl, do you know?

Comment 6 Adam Williamson 2017-12-18 20:46:03 UTC
Well, it seems that rhgb being in the installer cmdline isn't new at all; I tried a few images from earlier releases, and it wasn't there at least as far back as F20. It may in fact be intentional - I suspect we might have decided at some point that we don't *want* graphical boot for installer images.

Comment 7 Adam Williamson 2017-12-18 20:46:41 UTC
Sigh - I meant "rhgb **NOT** being in the installer cmdline". I haven't yet found a release which *did* have it in the cmdline for installer images.

Comment 8 Ray Strode [halfline] 2017-12-18 20:55:09 UTC
okay, well we definitely want graphical boot on install images. showing boot spew and kernel debug messages as the first thing someone sees when they want to try fedora out is not a good idea!

Comment 9 Brian Lane 2017-12-18 23:15:52 UTC
As far as I can see from the lorax commits we have never had rhgb on the cmdline (I'm going to guess that's because it would break on some subset of systems). I'm fine with adding it as long as the Anaconda team is ok with it.

To add it, all of the config files under https://github.com/rhinstaller/lorax/tree/master/share/templates.d/99-generic/config_files need to have it added. I'm not sure which arches it is appropriate for. s390? etc.

Comment 10 Adam Williamson 2018-01-09 18:23:30 UTC
So, where are we going with this? bcl, are you waiting on someone to sign off on it, or submit a PR, or what? Thanks!

Comment 11 Brian Lane 2018-01-09 18:35:27 UTC
Yeah, I'd like to get the anaconda team's input here before I make any changes.

Comment 12 Adam Williamson 2018-01-09 18:37:48 UTC
In that case we'd better CC them =)

Comment 13 Fedora End Of Life 2018-02-20 15:30:52 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 14 Adam Williamson 2018-03-15 20:48:25 UTC
Ping? Anyone?

Comment 15 Adam Williamson 2018-04-10 22:43:25 UTC
...still ping? I'm still stuck using three different video drivers in three different openQA test configs and encountering different bugs all over the place, it's getting annoying, and this is one of the causes...

Comment 16 Martin Kolman 2018-05-30 15:57:17 UTC
Is this about adding rhgb always unconditionally to the kernel boot command line? Also why exactly is it needed now on UEFI instead of the kernel resolving this automatically (i.e. aren't we just papering over a bug in a different component)?

Supposing we do this, these are the possible issues/questions that come to my mind:
- what if the installation environment fails to start: will users see an appropriate error message, or would they see a frozen graphical boot screen forever?
- I don't think anyone has really tested a graphical boot screen for the installation images; would it be possible to run some automated tests with the boot option added to see if it leads to a spectacular explosion or works fine?
- is it safe to introduce the option on all platforms unconditionally, i.e. could it cause some mayhem rather than just being ignored on platforms that don't support graphical output?
- if you request text installation with boot options or kickstart, there could be a weird effect where you get a graphical boot screen only to have it replaced by the Anaconda text interface (not that big of an issue though)

Comment 17 Adam Williamson 2018-05-30 21:38:50 UTC
"Is this about adding rhgb always unconditionally to the kernel boot command line?"

That's the proposal, yes.

"Also why exactly is it needed now on UEFI?"

Well, the closest we got to determining that was around #c5: "so there's a plymouth bug with text mode fallback plugin I guess". But after that Ray seemed more interested in suggesting that rhgb should always be used for installer boot than in trying to fix that issue.

Your questions all seem like sensible ones to me. I will try to build a modified image with 'rhgb' added to the boot params and run that through staging openQA soon if I can.

I suppose the other thing we can do is just go back again and try to find the bug in plymouth; I did at least narrow it down to having been introduced somewhere between 0.9.3-0.9.20160620git0e65b86c.fc27 and 0.9.3-1.fc28.

Comment 18 Ray Strode [halfline] 2018-05-31 14:02:34 UTC
to be clear, we should fix both problems.  we definitely want graphical boot for the installer, and we definitely want text mode plymouth splashes to work

Comment 19 Adam Williamson 2018-05-31 17:08:11 UTC
OK, I think I've bisected down to the single commit that broke this in plymouth - it's e4f86e3c, "renderer: export device name from plugin". An image built with a plymouth RPM at that commit fails; an image built with a plymouth RPM at the previous commit (fdda9af2) succeeds.

Comment 20 Ray Strode [halfline] 2018-05-31 17:27:22 UTC
hey thanks for bisecting.

I don't have time right now to deep dive, but this work-in-progress commit might fix the problem:

https://cgit.freedesktop.org/plymouth/commit/?h=wip/fix-text-splash&id=843b2fd8dc0b13bc84eeff503e68400c4876e19a

Comment 21 Adam Williamson 2018-05-31 22:24:52 UTC
OK, will test that.

Note I tried booting with plymouth:debug. I can't see *all* the messages, as I can't get at the scrollback (I haven't figured out how to redirect the boot messages to a serial console without affecting anaconda's behaviour and making the bug go away), but the output ends with:

ply_event_loop_handle_disconnect_for_source:calling disconnected_handler 0x55aef7c19ba0 for fd11
ply_event_loop_handle_disconnect_for_source:done calling disconnected_handler 0x55aef7c19ba0 for fd11
ply_event_loop_free_destinations_for_source:freeing destination (1, 0x55aef7c1a070, 0x55aef7c19ba0) of fd 11
ply_event_loop_remove_source_node:failed to delete fd 11 from epoll watch list: Bad file descriptor

Comment 22 Adam Williamson 2018-05-31 23:04:47 UTC
Thanks! Unfortunately that patch doesn't seem to help (after fixing the obvious typo in it, s/render/renderer/).

Comment 23 Ray Strode [halfline] 2018-06-01 17:55:08 UTC
try plymouth.debug=stream:/dev/ttyS0 to get debug messages to go to serial console

Comment 24 Ray Strode [halfline] 2018-06-01 17:55:23 UTC
(with no console= lines)

Comment 25 Adam Williamson 2018-06-01 18:01:22 UTC
OK, will try that once I'm done with this eternal phone queue to talk to my bank...

Comment 26 Adam Williamson 2018-06-01 18:14:30 UTC
Created attachment 1446774
plymouth debugging output from boot attempt with wip patch

Comment 27 Adam Williamson 2018-06-01 18:47:35 UTC
Created attachment 1446787
more complete plymouth debug log

Comment 28 Fedora Update System 2018-06-01 21:31:02 UTC
plymouth-0.9.3-7.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-2e38d03103

Comment 29 Adam Williamson 2018-06-01 21:32:02 UTC
Sent out an F28 update just so any rebuilt installer images will not have the bug, but the main thing is I already sent -7 to Rawhide, so Rawhide composes after 20180601.n.1 should not have the bug. Thanks Ray!

Comment 30 Fedora Update System 2018-06-02 22:33:56 UTC
plymouth-0.9.3-7.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-2e38d03103

Comment 31 Adam Williamson 2018-06-05 01:17:40 UTC
Some notes here:

We figured out that this bug happens when the 'frame-buffer' renderer is present but the 'drm' renderer isn't. With plymouth-0.9.3-6 and earlier, the 'frame-buffer' renderer is in the core package but the 'drm' renderer is in the graphics-libs subpackage (since pbrobinson moved it there back in 2015). Ray says it's really a bug in plymouth - it should cope more gracefully with all the possible cases of some renderers being present and others not - but he can't fix it immediately.

So our initial workaround for this, in -7, was to move frame-buffer into graphics-libs alongside drm, so either both are present or neither is. This turned out to be a bad idea, as there are problems on boot if 'rhgb' is on the kernel command line but the 'frame-buffer' renderer isn't present - which can now happen, since it's not in the core plymouth package any more. In the worst case, if you did a minimal encrypted install, you couldn't boot it properly as you couldn't enter the encryption passphrase on boot.

So I unpushed the F28 update while we chewed on this, and for Rawhide, sent a -8 with the 'drm' and 'frame-buffer' renderers both moved back to the main package, the way things were before Pete's change in 2015. This should hopefully mean neither bug happens any longer, but unfortunately it adds at least one package to a minimal install, 'libdrm', as whichever package the 'drm' renderer is in depends on libdrm. This is why Pete moved the 'drm' renderer into the sub-package in the first place (to avoid that dependency for minimal installs), and he understandably isn't happy with the dep getting moved back to the main package.

We're still chewing over what's the best thing to do for now. Obviously it'd be ideal if Ray could fix up plymouth so it behaves reasonably in all cases of renderers being available or not available, so we can go back to the -6 or -7 state without buggy behaviour.

If we can't fix plymouth itself quickly, another approach would be to go back to the -6 split ('frame-buffer' in the main package, 'drm' in graphics-libs) and pull graphics-libs into the installer environment...
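
To keep the two failure modes straight, the conditions described in this comment can be written down as a small truth-table function. This is only a sketch of the reasoning above, not anything from plymouth or the packaging: the enum, the function name and the parameters are all made up for illustration, and the "hang" case assumes the environment where it was observed (a UEFI VM using 'std' / 'vga' video).

```c
#include <stdbool.h>

/* Hypothetical outcome labels, invented for this sketch. */
typedef enum {
    SKETCH_OUTCOME_OK,
    SKETCH_OUTCOME_UEFI_STD_HANG,     /* the bug originally reported here */
    SKETCH_OUTCOME_RHGB_BOOT_PROBLEM  /* the regression seen with the -7 layout */
} sketch_outcome_t;

/* Model which problem (if any) the thread describes for a given
 * combination of installed renderers and kernel command line.
 * Assumes a UEFI VM with 'std' video for the hang case. */
static sketch_outcome_t
sketch_boot_outcome(bool frame_buffer_present,
                    bool drm_present,
                    bool rhgb_on_cmdline)
{
    /* 'rhgb' requested but no frame-buffer renderer installed. */
    if (rhgb_on_cmdline && !frame_buffer_present)
        return SKETCH_OUTCOME_RHGB_BOOT_PROBLEM;

    /* No 'rhgb', frame-buffer installed, drm missing. */
    if (!rhgb_on_cmdline && frame_buffer_present && !drm_present)
        return SKETCH_OUTCOME_UEFI_STD_HANG;

    return SKETCH_OUTCOME_OK;
}
```

For example, the installer environment with the -6 layout corresponds to sketch_boot_outcome(true, false, false), and a minimal install with the -7 layout booted with 'rhgb' corresponds to sketch_boot_outcome(false, false, true).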

Comment 32 Adam Williamson 2018-06-06 00:08:48 UTC
Bit more detail for anyone playing along at home (halfline already had this figured out two days ago but his IRC breadcrumbs were cryptic to me till I figured it out myself :>):

When 'rhgb' isn't on the cmdline, we go down a particular path in plymouth main.c:

        if (!plymouth_should_show_default_splash (&state)) {
                /* don't bother listening for udev events if we're forcing details */
                device_manager_flags |= PLY_DEVICE_MANAGER_FLAGS_IGNORE_UDEV;

The 'if' there is satisfied when 'rhgb' is not on the cmdline, so we get the flag PLY_DEVICE_MANAGER_FLAGS_IGNORE_UDEV set. The commit that introduced this logic is:

https://cgit.freedesktop.org/plymouth/commit/?id=382305e4a9f7b4c221968fcba7394b1cc03b454e

"main: disable hotplug events and splash delay if details forced

There's no point in waiting for a graphics device if details are
forced, and we shouldn't ever delay showing details.  If details
are requested, we shouldn't be hiding them."
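
As a rough illustration of the condition being discussed, here is a minimal C sketch of "is 'rhgb' on the kernel command line, and if not, ignore udev". It is not plymouth's code: both function names are hypothetical, and the real plymouth_should_show_default_splash may well check more than this one token.

```c
#include <stdbool.h>
#include <string.h>

/* True for characters that separate kernel command line arguments
 * (or end the string). */
static bool
sketch_is_separator(char c)
{
    return c == '\0' || c == ' ' || c == '\t' || c == '\n';
}

/* Return true if `token` appears in `cmdline` as a whole word.
 * Hypothetical helper, not plymouth's own parser. */
static bool
sketch_cmdline_has_token(const char *cmdline, const char *token)
{
    size_t token_len = strlen(token);
    const char *p = cmdline;

    if (token_len == 0)
        return false;

    while ((p = strstr(p, token)) != NULL) {
        bool starts_word = (p == cmdline) || sketch_is_separator(p[-1]);
        bool ends_word = sketch_is_separator(p[token_len]);

        if (starts_word && ends_word)
            return true;
        p++; /* keep scanning after a partial match such as "norhgb" */
    }
    return false;
}

/* Mirrors the reasoning above: without 'rhgb', details are forced,
 * and that is the case in which udev events get ignored. */
static bool
sketch_should_ignore_udev(const char *cmdline)
{
    return !sketch_cmdline_has_token(cmdline, "rhgb");
}
```

With the installer's default command line (no 'rhgb'), sketch_should_ignore_udev returns true, which corresponds to the path where PLY_DEVICE_MANAGER_FLAGS_IGNORE_UDEV gets set.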

Basically, what it does is tell the plymouth device manager to not bother hooking into udev and looking out for graphics device-related events and doing stuff in response to them showing up and disappearing. That sounds peachy and sorta makes sense, except that there's a fallback path too. When it's told to ignore udev, the device manager calls a function 'create_fallback_devices', which is *not* hit when it's using udev. 'create_fallback_devices' then calls 'create_devices_for_terminal_and_renderer_type' with the renderer type set to 'PLY_RENDERER_TYPE_AUTO'...which basically results in plymouth trying each of the x11, drm, and frame-buffer renderers (in order) until it hits one that works.
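
The "try each renderer in order until one works" behaviour can be sketched like this. The struct and function are invented for illustration and do not match plymouth's real types; "installed" and "device opens" are simplified stand-ins for whatever checks the real renderer loading performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one renderer plugin. Not plymouth's real type. */
typedef struct {
    const char *name;
    bool        plugin_installed; /* is the plugin file present at all? */
    bool        device_opens;     /* can it open its device (e.g. /dev/fb0)? */
} sketch_renderer_candidate_t;

/* Walk the candidates in order and return the first usable one,
 * or NULL if none of them can be used. */
static const sketch_renderer_candidate_t *
sketch_pick_first_working_renderer(const sketch_renderer_candidate_t *candidates,
                                   size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (candidates[i].plugin_installed && candidates[i].device_opens)
            return &candidates[i];
    }
    return NULL;
}
```

Given the order x11, drm, frame-buffer, and an environment where only the frame-buffer plugin is installed and its device can be opened, this loop settles on frame-buffer, which is the situation described in the next paragraph.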

As we noted, in the installer environment (before we started trying to fix this by moving things around in the packages), the drm and x11 renderer plugins are not present, so those obviously fail...but the frame-buffer renderer *is* present, so plymouth tries it, and finds that /dev/fb0 is present, so it goes ahead and loads it.

Unfortunately /dev/fb0 at this point is backed by the kernel EFI framebuffer driver ('efifb'). The kernel expects, during boot, to switch from efifb to another driver (bochsdrmfb)...but because plymouth is holding the device, it cannot. That is why the boot hangs at "fb: switching to bochsdrmfb from EFI VGA".

One outstanding mystery for me at least is why this only fails (apparently) with 'vga' / 'std' but works with other drivers, because it seems like the same stuff happens there. From the plymouth debug logs plymouth still loads the frame-buffer driver and uses /dev/fb0...but somehow, the kernel doesn't get stuck trying to unregister efifb when the new driver is qxldrmfb rather than bochsdrmfb. I'm really not sure why not.

Comment 33 Adam Williamson 2018-06-06 00:17:36 UTC
I'm sort of wondering if it maybe would make more sense to try and send the device manager down a path where it just goes straight to PLY_RENDERER_TYPE_NONE, rather than PLY_RENDERER_TYPE_AUTO, for this specific codepath. That seems to fit the initial intention of not bothering with a 'graphics device' a bit better?
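
To illustrate the idea (and only the idea: this is not the attached patch and not plymouth's real code), the choice could look something like the following. The enum and function are hypothetical stand-ins for PLY_RENDERER_TYPE_NONE / PLY_RENDERER_TYPE_AUTO and for wherever the fallback path picks its renderer type.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the renderer types named above. */
typedef enum {
    SKETCH_RENDERER_TYPE_NONE,
    SKETCH_RENDERER_TYPE_AUTO
} sketch_renderer_type_t;

/* If details are forced (no 'rhgb', udev ignored), do not probe for
 * any graphical renderer at all; otherwise keep auto-probing. */
static sketch_renderer_type_t
sketch_fallback_renderer_type(bool details_forced)
{
    return details_forced ? SKETCH_RENDERER_TYPE_NONE
                          : SKETCH_RENDERER_TYPE_AUTO;
}
```

The point of the design is that when no graphical splash is wanted, nothing ever opens /dev/fb0, so the kernel is free to switch framebuffer drivers.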

Comment 34 Adam Williamson 2018-06-07 01:36:20 UTC
Created attachment 1448545
proposed patch for plymouth

Comment 36 Fedora Update System 2018-06-07 18:28:52 UTC
plymouth-0.9.3-9.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-2e38d03103

Comment 37 Adam Williamson 2018-06-07 18:33:55 UTC
For the record, -9 has this setup:

* https://cgit.freedesktop.org/plymouth/commit/?id=014c2158898067176738ec36c9c90cc266a7e35b backported - it actually fixes the initial bug here (UEFI boot with 'std' / 'vga', frame-buffer present but drm not)
* frame-buffer and drm renderers both moved back to the graphics-libs subpackage, because we think that's ultimately the 'most correct' layout if we want to avoid a libdrm dep in the main package
* https://cgit.freedesktop.org/plymouth/commit/?id=bdfcf889f8cda47190d98fa8a3e401a1db38074c backported - it fixes the bug that showed up when we initially tried moving frame-buffer to the subpackage, that booting with 'rhgb' on the cmdline but no renderers installed gets stuck if an encryption passphrase is needed

basically, it should hopefully fix everything and lead us all to a glorious land of honey. So please tell me when it blows up your system instead...

Comment 38 Fedora Update System 2018-06-08 12:57:50 UTC
plymouth-0.9.3-9.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-2e38d03103

Comment 39 Fedora Update System 2018-06-10 19:13:10 UTC
plymouth-0.9.3-9.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

