Bug 1579067

Summary: GNOME on Wayland hangs after using Firefox for a bit with XWayland 1.20
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: xorg-x11-server
Assignee: X/OpenGL Maintenance List <xgl-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: rawhide
CC: alexl, bskeggs, caillon+fedoraproject, gecko-bugs-nobody, jan.steffens, jglisse, jhorak, john.j5live, jsmith.fedora, kengert, kevin, kinodont, mikhail.v.gavrilov, ofourdan, pjasicek, pmenzel+bugzilla.redhat.com, rhughes, rstrode, sandmann, vondruch, xgl-maint
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-02 01:46:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Adam Williamson 2018-05-16 22:10:21 UTC
Both nirik (Kevin Fenzi) and I are seeing this on current Rawhide with GNOME on Wayland. If you run Firefox and use it for a bit (loading a new page often seems to trigger it), the desktop session locks up. For a while the pointer can be moved but clicking on anything doesn't work; then the pointer sticks too.

For me (Kevin can confirm if this is true for him) the system is not locked, I can ssh into it from another system and poke around fine, but the graphical session seems to be irretrievably lost - I can't recover it by killing firefox from the ssh'ed in session, for instance.

Kevin reports that downgrading Firefox to 60.0-1 seems to fix this for him. I can say that running on GNOME-on-X11 instead of GNOME-on-Wayland seems to prevent the bug from occurring.

For both of us, nothing relevant seems to appear in system logs at the time the bug occurs.

I'm going to test downgrading to 60.0-3, on the theory that perhaps the bug was introduced in 60.0-4, by "Firefox 60 build 2", whatever that means exactly.

Comment 1 Adam Williamson 2018-05-17 00:09:51 UTC
This actually seems to be in Xwayland. The bug can be reproduced with firefox-59.0.2-1.fc28 on Rawhide, but not with firefox-59.0.2 or firefox-60 on F28. So it's not anything that changed in Firefox.

The bug does not happen with xorg-x11-server downgraded to 1.19.6-8.fc28 on Rawhide, so it's definitely XWayland at fault. So far the delta I have is that it broke somewhere between 1.19.6-8.fc28 and 1.20.0-1.fc29; I'm going to try the 1.19.99 builds now to try and get a more precise delta.

Comment 2 Adam Williamson 2018-05-17 00:36:12 UTC
Well, 1.19.99.903-1 is affected. That was the first 1.20 RC that was actually built as a Fedora package, so the smallest delta I can get just from testing builds in Koji is: it broke somewhere between 1.19.6 and 1.19.99.903.

I can narrow it down a bit further tomorrow by doing my own builds, if necessary, but it'd be great if someone can maybe point at some suspect commits, or suggest some debugging steps, or just fix this magically...:)

Note, a bit more precise description of what happens when the bug occurs: Firefox is basically hung, can move the cursor over its window, but can't cause anything to happen in it. Can click around an Evolution window open next to it, and type things into text entry fields...but it seems that when I click on the 'File' menu, that triggers the complete desktop hang. Once that happens I can only get in via ssh.

Comment 3 Jared Smith 2018-05-18 19:55:12 UTC
I'm seeing the same thing in Rawhide with Firefox 60 and Wayland -- and usually it seems to be triggered by screen redraws in Firefox, such as when opening a link in a new tab.

Comment 4 Vít Ondruch 2018-05-19 18:57:55 UTC
I am experiencing the same issue :/ Opening a new tab makes Xwayland run at 100% CPU. Now I am trying the F28 version of Xwayland:

~~~
$ sudo dnf downgrade --disablerepo=* --enablerepo=updates-testing --release 28 xorg-x11-server-Xwayland
~~~

It cannot be worse, right? ;)

Comment 5 Vít Ondruch 2018-05-19 21:43:47 UTC
3h later and FF still runs with the older Xwayland ...

Comment 6 Adam Williamson 2018-05-21 15:59:42 UTC
Vit: read up. I already narrowed it down to "it broke somewhere between 1.19.6 and 1.19.99.903".

Comment 7 Olivier Fourdan 2018-05-22 07:00:07 UTC
Quick question, do other X11 clients still work, like xterm for example? Does the same occur with F28?

(I've been using the Xserver release candidates from pre-1.20 and now 1.20 and never had such a problem on F28)

Comment 8 Olivier Fourdan 2018-05-22 07:34:28 UTC
Some more questions, comment 0 states that “the pointer can be moved but clicking on anything doesn't work; then the pointer sticks” - You mean it won't budge at all or only in X11 apps?

If the former, then it may well be a Wayland compositor issue.

Do any of the FF, Xwayland or gnome-shell processes take an unusual amount of CPU or memory when this occurs? Anything suspicious from those in journalctl?

(Meanwhile I switched back to firefox-59.0.2-1.fc28.x86_64 also using xorg-x11-server-Xwayland-1.20.0-1.fc28.x86_64 to see if I can reproduce)
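
One way to answer the CPU question from an ssh session is to filter `ps` output for the processes named above. This is only a sketch: the helper name `busy_procs` and the 90% default threshold are made up for illustration, and the process names passed to `ps -C` are assumptions about how they appear on the affected system.

```shell
# Print "PID %CPU COMMAND" for every process at or above a CPU threshold.
# Reads ps-style output on stdin and skips the header line.
# busy_procs and the default threshold of 90 are illustrative, not from the bug.
busy_procs() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 && $2 + 0 >= t { print $1, $2, $3 }'
}
```

It could be fed with something like `ps -C Xwayland,gnome-shell,firefox -o pid,pcpu,comm | busy_procs 90`. Note that `ps` reports CPU usage averaged over the lifetime of the process, so a process that only recently started spinning may show a lower figure than `top` does.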

Comment 9 Vít Ondruch 2018-05-22 11:48:08 UTC
(In reply to Olivier Fourdan from comment #8)
> Do any of the FF, Xwayland or gnome-shell processes take an unusual amount of CPU

Yes, XWayland consumes 100% CPU.

Comment 10 Olivier Fourdan 2018-05-22 12:09:11 UTC
(In reply to Vít Ondruch from comment #9)
> Yes, XWayland consumes 100% CPU.

Can you try to spot where it spins in the code?

Comment 11 Adam Williamson 2018-05-22 19:15:16 UTC
Olivier: as I mentioned above, I tend to have an Evo window open next to the Firefox window. I can do some stuff in the Evo window after Firefox has gone non-responsive - like click in the search box above the message list, and type some stuff. But it seems that trying to open the File menu in the Evo window reliably triggers the complete UI freeze; after that point I can't move the pointer any more (but can still ssh into the system).

Comment 12 Olivier Fourdan 2018-05-23 07:30:11 UTC
I've been running Firefox (various versions, whatever comes along with Fedora 28) with Xwayland from server 1.20 (and before that, every single release candidate of Xserver 1.19.99.90x) without ever having such an issue with Firefox.

Yesterday, I downgraded to firefox-59.0.2-1.fc28.x86_64 to match the version mentioned in comment 1 and I've been running just fine with dozens of tabs open on Fedora 28 since then.

So there is probably more to it than just Firefox and Xwayland. Also, the fact that the entire session freezes makes me think that the Wayland compositor (mutter/gnome-shell) may well play a role (rawhide uses mutter 3.29.1 whereas F28 uses 3.28.2).

One possibility, since Vít mentioned in comment 9 that Xwayland is taking 100% CPU, would be to find out what it's busy doing. Running “gstack $(pidof Xwayland)” a couple of times (with debuginfo installed) might give a hint as to whereabouts in the code it's wandering.
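
The repeated-sampling idea can be wrapped in a small loop, so that frames which show up in every sample stand out as the likely spin location. A minimal sketch, assuming `gstack` (shipped with gdb) and the matching debuginfo are installed; the function name `sample_stacks` and its defaults are made up for illustration.

```shell
# Take several stack samples of one process, a fixed interval apart.
# Arguments: PID, number of samples (default 3), seconds between samples (default 1).
# sample_stacks is an illustrative helper, not part of gdb or Xwayland.
sample_stacks() {
    pid="$1"
    count="${2:-3}"
    interval="${3:-1}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== sample $i of $count ==="
        gstack "$pid"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

From the ssh session this would be run as, for example, `sample_stacks "$(pidof Xwayland)" 5 > xwayland-stacks.txt`, and the resulting file attached to the bug.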

Comment 13 Adam Williamson 2018-05-23 15:37:47 UTC
Well, yes, I did report the bug as "GNOME on Wayland hangs", after all. :)

Comment 14 Adam Williamson 2018-05-23 15:38:11 UTC
I'll try to ssh in and get some info on the hung process later.

Comment 15 Jan Alexander Steffens 2018-05-24 11:08:06 UTC
Same bug on Arch Linux: https://bugs.archlinux.org/task/58705

It was triggered by the upgrade of mesa 18.0.4 to 18.1.0.

Comment 16 kinodont 2018-05-24 17:47:29 UTC
I have added XWayland and Firefox backtraces to the Arch Linux bug tracker: https://bugs.archlinux.org/task/58705#comment169691

Also included is a patch that fixed the issue for me.
So if this is indeed the same bug, the information there may be helpful to you.

Comment 17 Adam Williamson 2018-05-24 18:20:46 UTC
Sounds plausible indeed, I'll see if that patch also 'works' here. Thanks a lot for the cross-reference!

Comment 18 Adam Williamson 2018-05-24 21:56:44 UTC
That patch does seem to 'fix' the bug for me too. Like you I don't know if it's the correct fix or not, but certainly seems like at minimum we're hitting the same thing and you've identified the cause, so thanks much for that. Olivier, can you take it from there?

Comment 19 Adam Williamson 2018-05-25 04:09:51 UTC
well...after running fine all afternoon with the patch, I just saw a rather similar hang, but I was using calibre (the ebook handling software) at the time the system hung, rather than firefox. It seems I can start an ssh connection to the system this time - I get as far as the "Last login:" message - but it does not complete, it does not reach the shell, so I can't look where Xwayland is stuck (assuming that is what happened). I'll see if this happens again...

Comment 20 Jan Alexander Steffens 2018-05-25 04:40:37 UTC
Using magic SysRq+R to reset the keyboard mode should allow you to switch away from the VT with the hanging compositor.

Comment 21 Olivier Fourdan 2018-05-25 07:57:08 UTC
Can you please check if the following patches fix the issue:

https://patchwork.freedesktop.org/series/43618/

Comment 22 Adam Williamson 2018-05-25 15:40:39 UTC
Jan: all sysrq 'magic' besides sync is disabled on Fedora by default, so that wouldn't have helped. See https://fedoraproject.org/wiki/QA/Sysrq . I suppose I could enable them locally, but meh - I find they rarely actually do anything useful anyway.
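
For reference, the kernel exposes this setting in `/proc/sys/kernel/sysrq` as a bitmask: 1 enables every function, 16 permits only sync, and 4 covers keyboard control, which includes SysRq+R. A minimal sketch of checking it, assuming the bit values from the kernel's sysrq documentation; the function name `sysrq_allows_unraw` is made up for illustration.

```shell
# Succeed if the given kernel.sysrq value permits SysRq+R (unraw).
# A value of 1 enables all functions; otherwise bit 4 ("keyboard control")
# must be set. Bit values are taken from the kernel sysrq documentation.
sysrq_allows_unraw() {
    value="$1"
    [ "$value" -eq 1 ] && return 0
    [ $(( value & 4 )) -ne 0 ]
}
```

On a live system it could be called as `sysrq_allows_unraw "$(cat /proc/sys/kernel/sysrq)" && echo "SysRq+R permitted"`. With a sync-only value of 16 the check fails, which matches the behaviour described in this comment.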

Comment 23 Adam Williamson 2018-05-25 15:42:33 UTC
Olivier: will do, thanks.

Comment 24 Adam Williamson 2018-05-25 20:10:13 UTC
OK, so looks like the hang I hit when running calibre was probably unrelated - from the logs of that boot the kernel hit a GPF on plug or unplug of my book reader, so that was probably the issue there.

I'm now running with the patches from https://patchwork.freedesktop.org/series/43618/ , will report if anything happens. So far it's OK and I've visited the sites that usually trigger the hang.

Comment 25 Kevin Fenzi 2018-05-27 20:01:49 UTC
I've been running with https://patchwork.freedesktop.org/series/43618/ here since yesterday with no lockups.

Comment 27 Adam Williamson 2018-05-28 18:51:01 UTC
Can you backport them to Rawhide, or do you mind if I do it? Or will there be a new release soon? Thanks!

Comment 28 Adam Williamson 2018-06-02 01:46:19 UTC
I've done this now. https://koji.fedoraproject.org/koji/buildinfo?buildID=1088120