Bug 1691909 - GDM fallback from Wayland to X11 no longer works because it takes too long to start gnome-shell (affects 'basic graphics mode' / nomodeset, maybe other cases)
Summary: GDM fallback from Wayland to X11 no longer works because it takes too long to...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gdm
Version: 30
Hardware: All
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: Ray Strode [halfline]
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://fedoraproject.org/wiki/Common...
Depends On:
Blocks: F30BetaFreezeException F30FinalBlocker
TreeView+ depends on / blocked
 
Reported: 2019-03-22 19:34 UTC by Adam Williamson
Modified: 2019-04-17 21:44 UTC (History)
15 users (show)

Fixed In Version: gdm-3.32.0-3.fc30
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-17 16:04:40 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
journal messages from an affected F30 boot, with precise timestamps (488.76 KB, text/plain)
2019-03-22 19:36 UTC, Adam Williamson
no flags Details

Description Adam Williamson 2019-03-22 19:34:22 UTC
This is a follow-up to https://bugzilla.redhat.com/show_bug.cgi?id=1683197 . After we figured out that bug, I tested GDM with the offending patch reverted, and found that...booting with 'nomodeset' on BIOS systems *still* doesn't work, at least neither my test VM nor my test bare metal system.

Today Ray and I dug into this a bit and we think we know what the issue is.

What's *expected* to happen when you boot with 'nomodeset' is that you hit a fallback mechanism in GDM. GDM will first attempt to start gnome-shell in native Wayland session mode; this will fail as the whole native Wayland backend *only* works with KMS. Then GDM should catch that gnome-shell failed, and automatically try it in X11 mode instead.

The way GDM decides whether the native Wayland session 'failed' is simply a timer: if the session goes away 3 seconds or less after GDM started managing the display, that's counted as a 'failure' and the fallback mechanism kicks in. Otherwise it just assumes the session ran and exited in some sort of normal way and no fallback is needed.

In Fedora 28, this all works as expected, if you boot with 'nomodeset' (both on bare metal and in a VM with 'vga' graphics, in my testing). It takes about a second between GDM starting to 'manage' the display and the first gnome-shell failure:

Mar 22 12:13:48 localhost.localdomain gdm[848]: GdmDisplay: Managing display...
Mar 22 12:13:49 localhost.localdomain gnome-session-binary[970]: Unrecoverable failure in required component org.gnome.Shell.desktop

Then we see this message, which is a smoking gun that the session was considered a failure (it's only logged in that case):

Mar 22 12:13:49 localhost.localdomain gdm[848]: GdmDisplay: display lasted 0.780129 seconds

and then GDM tries again with an X11 session, and that works, and all is well.

However, in Fedora 29 and Fedora 30, for some reason, whatever happens between 'Managing display...' and gnome-shell actually getting launched takes *way* longer. Here's a Fedora 30 log extract:

Mar 22 11:42:10 localhost.localdomain gdm[992]: GdmDisplay: Managing display:
Mar 22 11:42:13 localhost.localdomain gdm-wayland-session[1112]: Enabling debugging
Mar 22 11:42:15 localhost.localdomain /usr/libexec/gdm-wayland-session[1112]: gnome-session-binary[1121]: DEBUG(+): Enabling debugging
Mar 22 11:42:16 localhost.localdomain /usr/libexec/gdm-wayland-session[1112]: gnome-session-binary[1121]: DEBUG(+): GsmAutostartApp: starting org.gnome.Shell.desktop: command=/usr/bin/gnome-shell startup-id=1059bd5b609205c94d155328013624473700000011210000
Mar 22 11:42:17 localhost.localdomain gnome-shell[1236]: Failed to create backend: No GPUs found with udev
Mar 22 11:42:17 localhost.localdomain gdm[992]: GdmDisplay: unmanage display

that last message is approximately where we *would* have hit the fallback - but as you can see, it's taken 7 seconds to get from 'Managing display' to that point, which is 4 seconds too long for the fallback to kick in.

We don't even reach gdm-wayland-session until 11:42:13, gnome-session-binary until 11:42:15, or gnome-shell proper until 11:42:16, which is far too late already.

We haven't really dug deep into what's taking so long or why yet, that may be the next step (the easy option, of course, is just to kick the timeout a bit, or possibly replace it with a more robust mechanism).

I'm filing this against F30 because that's the release we're currently working on, but it does affect F29 too, and we should probably send any fix to F29 as well.

Proposing as a Beta FE / Final blocker as the impact of this is just the same as https://bugzilla.redhat.com/show_bug.cgi?id=1683197 .

Comment 1 Adam Williamson 2019-03-22 19:36:57 UTC
Created attachment 1547109 [details]
journal messages from an affected F30 boot, with precise timestamps

Comment 2 satellitgo 2019-03-23 01:35:00 UTC
due to 
https://fedoraproject.org/wiki/Changes/Login_Screen_Over_Wayland
bios boot install  of workstation stops at gdm

Work around: 
install cinnamon then workstation as 2nd DE with dnf groupinstall ......
reboot
log in to xorgGNOME

Comment 3 satellitgo 2019-03-23 01:36:30 UTC
similar:
https://bugzilla.redhat.com/show_bug.cgi?id=1691918

Comment 4 Geoffrey Marr 2019-03-25 18:37:02 UTC
Discussed during the 2019-03-25 blocker review meeting: [1]

The decision to delay the classification of this as a blocker bug was made as we felt we couldn't vote on this without a clearer understanding of what a possible fix might look like. We will delay for now for more information and vote again.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-03-25/f30-blocker-review.2019-03-25-16.01.txt

Comment 5 Geoffrey Marr 2019-03-25 18:48:01 UTC
Discussed during the 2019-03-25 blocker review meeting: [1]

The decision to delay the classification of this as a blocker bug was made as this issue affects 'basic graphics' on BIOS, so in part we are delaying the decision to after we decide whether the 'basic graphics' criterion will continue to apply to Final. We also want to look further into what other cases still rely on the X11 fallback.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-03-25/f30-blocker-review.2019-03-25-16.01.txt

Comment 6 Julen Landa Alustiza 2019-03-26 07:17:10 UTC
Just tested to boot 1.7 with basic video mode on uefi with ryzen 5 1600X + nvidia 1060. The box is not frozen (I can switch ttys and use console mode without any issues), but the first tty ends with an ethernal blinking cursor after "gdm starts" systemd notice.

It works properly with nouveau if basic video mode is not set.

Comment 7 Adam Williamson 2019-03-26 14:51:21 UTC
jlanda: you're probably running into some subsequent bug, like I did testing on bare metal. Check the journal messages, there'll probably be some kinda error in there.

Comment 8 Geoffrey Marr 2019-04-02 08:26:56 UTC
Discussed during the 2019-04-01 blocker review meeting: [1]

The decision to classify this bug as an "AcceptedBlocker" was made as it is a violation of the 'basic' graphics requirement for Workstation on BIOS. It likely also affects other cases and is a showstopping bug whenever it happens.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-04-01/f30-blocker-review.2019-04-01-16.00.txt

Comment 9 Ray Strode [halfline] 2019-04-11 20:30:05 UTC
I'm hoping to look into this tomorrow afternoon or monday. if someone wants to investigate in the interim, it would be useful to know if there's the gigantic boot slow down with an f29 kernel too.


An easy workaround for this bug would be a udev rule that disables wayland if nomodeset is on the kernel command line so we aren't relying on fallback.
Of course, that's not going to help in other situations where fallback is required.

We could also ressurrect a patch I have upstream to do in-session registration via an autostart file, and use that registration as a marker for success instead of assuming N seconds without failure is success.

But if we can root out why it's taking 8 seconds to fallback when in f29 it took less than a second, then that's best.  I mean, if f30 is shipping with slow boot that's a problem in its own right.

I'll try doing a fresh install and reproduce soonish.

Comment 10 Adam Williamson 2019-04-11 20:32:22 UTC
You're one version out on the window - the bug affects both F29 and F30. F28 is the last release where this worked.

So, you would find results on F30 but with an F28 kernel interesting?

Comment 11 Ray Strode [halfline] 2019-04-12 20:09:02 UTC
well f29 is failing because of the cangraphical patch right? not slow boot?

if boot is slow in f29 too, then yea f28 kernel i guess.

Comment 12 Adam Williamson 2019-04-12 21:16:39 UTC
f29 has both bugs, just like f30 does.

Comment 13 Ray Strode [halfline] 2019-04-15 16:01:38 UTC
So i'm a little at a loss for this bug.

I can't reproduce the slow down on a fresh install.

I ran some filtering over the log.  First i trimmed everything before 

Mar 22 11:42:10 localhost.localdomain gdm[992]: GdmDisplay: Managing display:

and everything after:

Mar 22 11:42:17 localhost.localdomain gnome-shell[1236]: Failed to create backend: No GPUs found with udev

then I filtered out all the audit messages where weren't strictly related to boot progression, and finally I sorted the log by succeeding lines that are the longest apart from each other.

$ cat journal-log | while read line; do 
  STAMP=$(echo $line |awk '{ print $3 }' | awk -F: '{ print $3 }');
  [ -z "$LAST_STAMP" ] && LAST_STAMP=$STAMP; 
  diff=$(echo "$STAMP - $LAST_STAMP" |bc); 
  echo -e "\v{$diff seconds\v$LAST_LINE →\v$line}"; 
  LAST_STAMP=$STAMP; 
  LAST_LINE="$line"; 
done |sort -rn | fpaste
https://paste.fedoraproject.org/paste/hNL41NHQxiSVivjWSthxIg

The biggest recurring offender is:

{.317214 seconds
 Mar 22 11:42:13.632235 localhost.localdomain dbus-broker-launch[891]: Activation request for 'org.freedesktop.resolve1' failed: Unit dbus-org.freedesktop.resolve1.service not found. 
 Mar 22 11:42:13.949449 localhost.localdomain dbus-broker-launch[891]: Activation request for 'org.freedesktop.resolve1' failed: Unit dbus-org.freedesktop.resolve1.service not found.
}

So I think there's some problem with the installation.  I don't know why resolve wouldn't be found though, it's shipped in systemd proper.  Might be an selinux problem or a dbus-broker problem.

Anyway, I've built a change to gdm to skip fallback logic entirely if nomodeset is on kernel command line.  that should sidestep the bug (at least in the nomodeset case).

would be good to get to the bottom of the resolved problem, but I don't see it here, so maybe it's not widespread? 

moving to MODIFIED given the gdm workaround, but we may want to clone for the slowdown problem.

Comment 14 Fedora Update System 2019-04-15 16:25:57 UTC
gdm-3.32.0-3.fc30 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-de11d64b9e

Comment 15 Fedora Update System 2019-04-16 01:35:33 UTC
gdm-3.32.0-3.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-de11d64b9e

Comment 16 Fedora Update System 2019-04-17 16:04:40 UTC
gdm-3.32.0-3.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 17 Zbigniew Jędrzejewski-Szmek 2019-04-17 21:44:57 UTC
>  Mar 22 11:42:13.632235 localhost.localdomain dbus-broker-launch[891]: Activation request for 'org.freedesktop.resolve1' failed: Unit dbus-org.freedesktop.resolve1.service not found.

There was a change wrt to systemd-resolved.service recently. Before, by mistake, systemd-resolved.service was partially
enabled (in the "systemctl enable" sense), because the symlink in multi-user.service.wants/ was not present, but
the dbus activation symlink was present. This means resolved wouldn't be started automatically by systemd, but
it could be auto-activated by dbus. I assumed it doesn't make sense to have systemd-resolved enabled like
and cleaned this up to provide neither symlink, based on the assumption that the main "entry point" for resolved
is though the nss-resolve module, which is not present in /etc/nsswitch.conf in the the default configuration.

If you think systemd-resolved.service should be enabled by default, or maybe just dbus-activatable, please
open a new bug. But on the surface of things, I'd expect all consumers to use nss and not to depend on
systemd-resolved specifically.


Note You need to log in before you can comment on or make changes to this bug.