Red Hat Bugzilla – Bug 1553979
Xwayland server crash
Last modified: 2018-05-28 09:31:01 EDT
Description of problem:
Xwayland server crashes unexpectedly.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Please note that the retrace/abort server has quite a few reports of this crash, but apparently no bugzilla ticket yet.
I thought this was an artifact of the gjs issue with the Gnome Places menu, which should have been solved with 1.50.4-1 (but which I still encounter). I have no idea what happened here; I hope the retrace logs are sufficient.
Can you indicate the exact version of the xorg-x11-server-Xwayland package?
Sorry. Of course, yes:
The backtrace shows a call to ProcXTestFakeInput(), were you using an X client emulating key events (for example antimicro) at the time of the crash?
I can more or less get the code path which leads to the crash from the FAF report:
ProcXTestFakeInput() in Xext/xtest.c line 429
mieqProcessDeviceEvent() in mi/mieq.c
ProcessKeyboardEvent() in xkb/xkbPrKeyEv.c
ProcessOtherEvent() in Xi/exevents.c line 1873
ProcessDeviceEvent() in Xi/exevents.c line 1709
UpdateDeviceState() in Xi/exevents.c line 807
ChangeMasterDeviceClasses() in Xi/exevents.c line 727
DeepCopyDeviceClasses() in Xi/exevents.c line 670
DeepCopyKeyboardClasses() in Xi/exevents.c line 373
CopyKeyClass() in Xi/exevents.c line 233
XkbDeviceApplyKeymap() in xkb/xkbUtils.c ???
But that doesn't tell us the actual values that cause the crash; I would rather have a backtrace from gdb for that. I tried to create a reproducer sending XTestFakeKeyEvent(), but that did not cause any crash here.
Xwayland generates a core file in case of a crash; could you please provide a full backtrace from that core file?
1. Make sure you have the debuginfo installed
dnf debuginfo-install xorg-x11-server-Xwayland
2. Use “coredumpctl list” to get the list of available core files
3. Find the one for Xwayland in the list
4. Use “coredumpctl gdb <pid>” to attach gdb to that core file
5. run “bt full” and post the output in this bug.
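A non-interactive variant of the steps above can be sketched as follows. This is only a sketch: the function names and the output file names are made up for illustration, and it assumes that `coredumpctl dump` accepts an executable path as a match and that `gdb -batch -ex "bt full"` is available on the system.

```python
import subprocess


def backtrace_commands(exe="/usr/bin/Xwayland", core_path="xwayland.core"):
    """Build the two command lines without running them.

    The default file names are hypothetical; adjust them as needed.
    """
    # Extract the most recent core dump whose executable matches `exe`.
    dump_cmd = ["coredumpctl", "dump", exe, f"--output={core_path}"]
    # Load the executable plus the core file and print a full backtrace.
    gdb_cmd = ["gdb", "-batch", "-ex", "bt full", exe, core_path]
    return dump_cmd, gdb_cmd


def collect_backtrace(exe="/usr/bin/Xwayland",
                      core_path="xwayland.core",
                      out_path="xwayland-bt-full.txt"):
    """Run both commands and save the gdb output to `out_path`."""
    dump_cmd, gdb_cmd = backtrace_commands(exe, core_path)
    subprocess.run(dump_cmd, check=True)
    with open(out_path, "w") as out:
        subprocess.run(gdb_cmd, check=True, stdout=out,
                       stderr=subprocess.STDOUT)
    return out_path
```

Calling `collect_backtrace()` on an affected machine should leave a text file that can be attached to the bug. The debuginfo packages from step 1 are still needed for a useful trace.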
Looks like this is affecting Xorg as well (i.e. not specific to Xwayland), see https://retrace.fedoraproject.org/faf/reports/2058692/
Created attachment 1407402 [details]
bt per instructions
No apps running, closed laptop lid, opened laptop lid, got logged off.
Created attachment 1407406 [details]
bt full per instructions
Similar issue https://retrace.fedoraproject.org/faf/reports/2056747/ and bug 1548737
ChangeMasterDeviceClasses in the backtrace means this happened on the switch between slave devices, i.e. typing on device one, then typing on device two. Since one device is the XTest device, the other keyboard device is the 'culprit'.
For anyone not seeing this under Xwayland but regular Xorg:
I'll need an evemu-record of that device and any keyboard configuration you have. If you have other device nodes that may send key events (multimedia keys, maybe a tablet), please attach the evemu-record as well. And a full Xorg.log so we can check what configuration is applied.
For Xwayland: we'll need any keyboard configuration and whether you have tablet devices. It's likely triggered by some compositor change in the keymap.
Created attachment 1408408 [details]
Created attachment 1408409 [details]
Created attachment 1408410 [details]
just a shot of the coredump list
I have made a point of closing all apps before I close the laptop lid. Still, I get the gnome-shell/Xwayland crash on open.
Hard to tell if it crashes going into suspend, or coming out.
If you would like to run a script to test a theory, let me know.
Quote from comment 10: "For Xwayland: we'll need any keyboard configuration and whether you have tablet devices. It's likely triggered by some compositor change in the keymap."
Created attachment 1412042 [details]
name plate from Acer laptop
Created attachment 1412043 [details]
keyboard & track pad (disabled, button, red led on left)
The only other device is LogiTech M310 wireless mouse.
I use wireless WiFi at home AND 2nd VGA monitor
Sometimes apps are open, sometimes not (evince, open office, files)
Go to office.
I use wired ethernet at office AND 2nd VGA monitor
Gnome3 logs out
Notice comes up that Xwayland crashes
I previously had problems where WiFi would not work if I opened the lid without connecting ethernet FIRST, that "seems to have been fixed".
Keyboard is US / English config
BTW, I live in Clayton, work in Raleigh if you want to do a "hands on".
Is there anything more specific that would be useful regarding keyboard configuration?
I have this crash on resume, about five times so far. I use XWayland (per Fedora default) under Gnome; that's what crashes. https://bugzilla.redhat.com/show_bug.cgi?id=1557682
I never use an external keyboard. This is a laptop with a UK keyboard. GNOME is configured to use a UK keyboard layout. (I don't switch keyboard layouts, and I don't have any icon in the gnome-shell toolbar which I could click accidentally to do so).
I often use an external USB mouse. The laptop has a touchpad (and I do not disable it). The laptop also has a pointing stick. The laptop is a Dell Latitude E5450.
I do not use an external tablet, the only external input device I use is the USB mouse.
No tablet. Using an HP EliteBook 850 G1 laptop, with the installed keyboard and touch pad. German keyboard, English and Norwegian layouts installed (and sometimes in use).
Any other particular information required? Not a vague request, but explicit commands please. ;-)
I've been puzzling over the associated backtraces, and I have a partial finding of woe.
There are actually several different reports (from my computer and others). Some less far, some slightly further on, compared to the call chain in comment 4. Some have secondary crashes inside xorg_backtrace(), some don't.
It's as if the common factor is a bad page fault (always SIGBUS), when we try to perform a fetch that *should* be allowed.
Because all of the crashes seem to point to the *first* instruction in the last function called... And when I looked into my _dl_fixup() crash, it was trying to make a legitimate read inside the existing mapping of /usr/bin/Xwayland (which has r-x permissions)!
For gdb extracts of this, see my comment https://bugzilla.redhat.com/show_bug.cgi?id=1557682#c9
Am I being paranoid, or is this another microcode update f***up by Intel? One of the issues with some initial microcode updates was "reports of unexpected page faults". I really hope there is some proof that can rule that out :(.
Following is core dump history ... prior to this start date, it was one every now and again (less than weekly) on various apps.
Sad thing is the abrt-applet is dying (when used) and does not help you guys at all ...
Fri 2018-02-16 16:06:10 EST /usr/lib64/firefox/firefox
Sun 2018-02-25 12:42:42 EST /usr/bin/Xwayland
Sun 2018-02-25 12:43:08 EST /usr/bin/gnome-shell
Mon 2018-02-26 08:50:11 EST /usr/bin/Xwayland
Mon 2018-02-26 08:50:31 EST /usr/bin/gnome-shell
Tue 2018-02-27 20:43:47 EST /usr/lib64/firefox/firefox
Wed 2018-02-28 20:48:11 EST /usr/bin/Xwayland
Wed 2018-02-28 20:48:11 EST /usr/bin/Xwayland
Wed 2018-02-28 20:48:50 EST /usr/bin/gnome-shell
Wed 2018-02-28 20:48:50 EST /usr/bin/gnome-shell
Thu 2018-03-01 09:30:34 EST /usr/bin/Xwayland
Thu 2018-03-01 09:30:45 EST /usr/bin/gnome-shell
Thu 2018-03-01 09:33:11 EST /usr/bin/abrt-applet
Thu 2018-03-01 22:33:04 EST /usr/bin/Xwayland
Thu 2018-03-01 22:33:27 EST /usr/bin/gnome-shell
Mon 2018-03-05 06:20:13 EST /usr/bin/Xwayland
Mon 2018-03-05 06:20:30 EST /usr/bin/gnome-shell
Mon 2018-03-05 08:54:12 EST /usr/bin/Xwayland
Mon 2018-03-05 08:54:36 EST /usr/bin/gnome-shell
Sun 2018-03-11 19:32:25 EDT /usr/bin/Xwayland
Sun 2018-03-11 19:33:10 EDT /usr/bin/gnome-shell
Mon 2018-03-12 18:25:24 EDT /usr/bin/Xwayland
Mon 2018-03-12 18:25:34 EDT /usr/bin/gnome-shell
Mon 2018-03-12 20:03:13 EDT /usr/lib64/firefox/firefox
Wed 2018-03-14 06:34:27 EDT /usr/bin/Xwayland
Wed 2018-03-14 06:35:07 EDT /usr/bin/gnome-shell
Wed 2018-03-14 17:06:51 EDT /usr/lib64/firefox/firefox
Wed 2018-03-14 21:20:26 EDT /usr/bin/Xwayland
Wed 2018-03-14 21:21:37 EDT /usr/bin/gnome-shell
Thu 2018-03-15 06:28:03 EDT /usr/bin/Xwayland
Thu 2018-03-15 06:28:54 EDT /usr/bin/gnome-shell
Thu 2018-03-15 09:21:45 EDT /usr/bin/Xwayland
Thu 2018-03-15 09:22:05 EDT /usr/bin/gnome-shell
Thu 2018-03-15 16:26:42 EDT /usr/lib64/firefox/firefox
Sat 2018-03-17 20:04:40 EDT /usr/bin/Xwayland
Sat 2018-03-17 20:05:13 EDT /usr/bin/gnome-shell
Mon 2018-03-19 07:48:48 EDT /usr/lib64/firefox/firefox
Mon 2018-03-19 07:52:15 EDT /usr/lib64/firefox/firefox
Mon 2018-03-19 07:52:15 EDT /usr/lib64/firefox/firefox
Mon 2018-03-19 07:52:22 EDT /usr/lib64/firefox/firefox
Tue 2018-03-20 09:58:45 EDT /usr/bin/abrt-applet
Thu 2018-03-22 08:50:43 EDT /usr/bin/Xwayland
Thu 2018-03-22 08:51:22 EDT /usr/bin/gnome-shell
Please can you assign this bug to the kernel. Or otherwise start looking at the kernel. My analysis shows it is not working as intended.
I put a call out for nitpickers on StackOverflow. My analysis received 4 endorsements, and no-one found a gap in it. https://stackoverflow.com/questions/49477340/can-i-rule-out-that-sigbus-is-raised-by-a-minor-page-fault-kernel-log-has-no
The faulting addresses are within the text mapping of Xwayland, as shown by $_siginfo in the coredumps. The SIGBUS cannot be an alignment error, because siginfo shows si_code == BUS_ADRERR (2), not BUS_ADRALN (1).
Therefore SIGBUS should only mean one thing: a failure to read a text page of /usr/bin/Xwayland from disk.
But the kernel log does not show any disk error.
$ gdb coredump
(gdb) p $_siginfo.si_signum
$1 = 7
(gdb) p $_siginfo.si_code
$2 = 2
(gdb) p $_siginfo._sifields._sigfault.si_addr
$3 = (void *) 0x41bd80
Dump of assembler code for function _dl_fixup:
=> 0x00007fc0be0c8bf9 <+41>: mov 0x8(%r8),%rcx
(gdb) p/x $r8
$4 = 0x41bd78
(gdb) p/x $r8 + 8
$5 = 0x41bd80
$ cat maps
00400000-0060b000 r-xp 00000000 fd:00 1708508 /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508 /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508 /usr/bin/Xwayland
$ size -x /usr/bin/Xwayland
text data bss dec hex filename
0x209ffb 0xbe9d 0x1f3e0 2314872 235278 /usr/bin/Xwayland
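The check made in this comment (does the faulting address fall inside a file-backed, readable mapping, and which si_code was reported) can be sketched as follows. The helper names are hypothetical; the maps lines, the address and the si_code values are the ones quoted above.

```python
# si_code values for SIGBUS on Linux, as quoted in the analysis above.
BUS_ADRALN = 1  # invalid address alignment
BUS_ADRERR = 2  # nonexistent physical address

# The three mappings of /usr/bin/Xwayland from the maps file above.
MAPS = """\
00400000-0060b000 r-xp 00000000 fd:00 1708508 /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508 /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508 /usr/bin/Xwayland
"""


def find_mapping(maps_text, addr):
    """Return the mapping that contains `addr`, or None if there is none."""
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:
            return {
                "start": start,
                "end": end,
                "perms": fields[1],
                "offset": int(fields[2], 16),
                "path": fields[5] if len(fields) > 5 else "",
            }
    return None


def file_offset(mapping, addr):
    """Offset within the backing file that `addr` corresponds to."""
    return mapping["offset"] + (addr - mapping["start"])


fault_addr = 0x41BD80  # si_addr from the core dump
mapping = find_mapping(MAPS, fault_addr)
print(mapping["perms"], mapping["path"], hex(file_offset(mapping, fault_addr)))
```

With the values above, the address lands in the r-xp text mapping, so a BUS_ADRERR there points at a failed page-in of the file rather than at a bad pointer.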
(In reply to Alan Jenkins from comment #24)
> Please can you assign this bug to the kernel. Or otherwise start looking at
> the kernel. My analysis shows it is not working as intended.
I'd rather search a bit further on the X server side first.
I understand you can reproduce at will (more or less), would it be possible to run Xwayland within valgrind in that case?
That requires some tweaking and it will be slow but it's doable.
The idea is to “replace” the original Xwayland with a script that will run Xwayland within valgrind.
1. Make sure the debuginfo packages are installed for Xwayland
2. Install valgrind if not done yet
3. Rename the existing executable Xwayland as Xwayland.bin
$ sudo mv /usr/bin/Xwayland /usr/bin/Xwayland.bin
4. Create a shell script which “pretends” to be Xwayland and runs Xwayland.bin within valgrind instead:
$ sudo vim /usr/bin/Xwayland
and copy the following content:
#!/bin/sh
exec valgrind --track-origins=yes --free-fill=00 --log-file=xwayland.%p /usr/bin/Xwayland.bin "$@"
5. Make the script executable
$ sudo chmod a+x /usr/bin/Xwayland
6. Try to reproduce and check for the content of the "xwayland.xxxxx" log files created to see if there is a use-after-free related to Xkb somehow.
To revert to normal, simply copy /usr/bin/Xwayland.bin back as /usr/bin/Xwayland
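Scanning the resulting log files for use-after-free reports that involve Xkb could be sketched like this. The sample log is fabricated for illustration (addresses, PID, line numbers and `SomeOtherFunction` are made up), and the helper names are hypothetical. It relies only on the usual memcheck layout: `==PID==` prefixes, blank separator lines, and "Invalid read/write" headers followed by a "free'd" note.

```python
import re

# Fabricated sample in memcheck's usual layout; not taken from a real run.
SAMPLE_LOG = """\
==4242== Invalid read of size 8
==4242==    at 0x4A1B2C: XkbDeviceApplyKeymap (xkbUtils.c:1)
==4242==    by 0x4C3D4E: CopyKeyClass (exevents.c:233)
==4242==  Address 0x1234560 is 16 bytes inside a block of size 64 free'd
==4242==    at 0x4C2EDEB: free (vg_replace_malloc.c:530)
==4242== 
==4242== Invalid write of size 4
==4242==    at 0x50A0B0: SomeOtherFunction (other.c:10)
==4242==  Address 0x2234560 is 0 bytes after a block of size 8 alloc'd
==4242== 
"""

PREFIX_RE = re.compile(r"^==\d+== ?")


def split_blocks(log_text):
    """Split a memcheck log into error blocks separated by blank lines."""
    blocks, current = [], []
    for raw in log_text.splitlines():
        line = PREFIX_RE.sub("", raw).rstrip()
        if line:
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def xkb_use_after_free(log_text):
    """Return blocks that report an invalid access to freed memory
    and mention Xkb somewhere in their stack frames."""
    hits = []
    for block in split_blocks(log_text):
        text = "\n".join(block)
        invalid_access = block[0].startswith(("Invalid read", "Invalid write"))
        if invalid_access and "free'd" in text and "xkb" in text.lower():
            hits.append(block)
    return hits


for block in xkb_use_after_free(SAMPLE_LOG):
    print("\n".join(block))
```

On a real run, the text of each `xwayland.<pid>` file would be passed in place of `SAMPLE_LOG`.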
It sounds like Brent might have a good pattern that reproduces this, maybe one in four or something - on his commute to work :-). I haven't identified one for myself though :-(. It happened maybe once in three days with my historical usage. I've just been collecting coredumps, as well as looking at other peoples reports.
I do want to provide solid data you can use, but I don't think I'd be able to maintain my usage pattern to try and reproduce it, for a ~3 day period, while running Xwayland under valgrind. (Firefox uses X, if nothing else).
I learned a bit more data though. The first v4.15 kernel accepted to stable was kernel-4.15.3-300.fc27, on the date 2018-02-16 17:47:46.
On 2018-02-17, there's a marked jump in the number of crash buckets. And it's not just because of the day of the week.
That day has the first instances of this pattern of backtrace: SIGBUS (unusual on x86) + _dl_fixup() (+ ChangeMasterDeviceClasses + ProcessKeyboardEvent()).
There's enough data to clearly pin it to this 24 hour period. You see the number of buckets per day jump on that date, and it only rises from there. Because this bug inherently manifests with a variation in where it strikes along the call-path.
There is a common factor that links my slightly different crashes: the SIGBUS happens when executing the first instruction of the function, whichever function it was. IOW, the SIGBUS happens at the same time as the first access to the page containing that function.
(It's not a stack issue: if you look at the details I posted of my crashes, you can see the fault address is not in the stack. It's in the text segment. And in my first posted analysis, the instruction at fault was actually `test %rdi, %rdi`, which does not involve any memory access other than fetching that instruction).
Comment 25 instructions followed -- the old gal ain't the speed demon she used to be, but, working well so far.
Off to work ....
GOTCHA. Does anyone else see "Read-error on swap-device" / "EXT4-fs error" / "Buffer I/O error" if they look in their system logs, that happen around suspend/resume time?
To view historical kernel logs together with SIGBUS reports:
$ journalctl _TRANSPORT=kernel + COREDUMP_SIGNAL_NAME=SIGBUS
I suggest using a search for the error text (`/` key). It doesn't look like it happens every time.
A recent one triggered a fsck, which made me pay attention. Disk errors would make the SIGBUS analysis a lot less mysterious. So I'm now very suspicious of the SATA ALPM work that was merged in v4.15.
Mar 02 18:47:03 alan-laptop kernel: sd 1:0:0:0: [sda] Starting disk
Mar 02 18:47:03 alan-laptop kernel: ata1: SATA link down (SStatus 0 SControl 300)
Mar 02 18:47:03 alan-laptop kernel: PM: resume devices took 1.017 seconds
Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:580256)
Mar 02 18:47:04 alan-laptop kernel: Read-error on swap-device (253:1:580264)
Mar 02 18:47:04 alan-laptop kernel: Read-error on swap-device (253:1:580272)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
Mar 02 18:47:09 alan-laptop kernel: done.
Mar 02 18:47:14 alan-laptop systemd-coredump: Process 1356 (Xwayland) of user 42 dumped core.
Stack trace of thread 1356:
#0 0x00007fe4daf3a2de n/a (ld-linux-x86-64.so.2)
#1 0x0000000000000000 n/a (n/a)
But it's not the obvious effect of the SATA LPM changes in v4.15, because I don't have it enabled. Even when I'm running on battery power. (And I haven't been playing with `powertop` at all in this time period).
$ head /sys/class/scsi_host/host*/link_power_management_policy
==> /sys/class/scsi_host/host0/link_power_management_policy <==
==> /sys/class/scsi_host/host1/link_power_management_policy <==
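The log search described in this comment can also be sketched as a small script that pairs each core dump with I/O errors logged shortly before it. The sample lines are the ones from the excerpt above. The function names, the 60-second window and the year (the log lines carry none) are assumptions, and month names are assumed to be in English.

```python
import re
from datetime import datetime, timedelta

IO_ERROR_PATTERNS = ("Read-error on swap-device", "EXT4-fs error",
                     "Buffer I/O error")
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (\S+) ([^:]+): (.*)$")

SAMPLE = """\
Mar 02 18:47:03 alan-laptop kernel: sd 1:0:0:0: [sda] Starting disk
Mar 02 18:47:03 alan-laptop kernel: ata1: SATA link down (SStatus 0 SControl 300)
Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:580256)
Mar 02 18:47:04 alan-laptop kernel: Read-error on swap-device (253:1:580264)
Mar 02 18:47:04 alan-laptop kernel: Read-error on swap-device (253:1:580272)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
Mar 02 18:47:09 alan-laptop kernel: done.
Mar 02 18:47:14 alan-laptop systemd-coredump: Process 1356 (Xwayland) of user 42 dumped core.
""".splitlines()


def parse(line, year):
    """Return (timestamp, source, message), or None for unparsable lines."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return ts, m.group(3), m.group(4)


def crashes_after_io_errors(lines, year=2018, window=timedelta(seconds=60)):
    """List (coredump message, count of I/O errors in the preceding window)."""
    io_errors, hits = [], []
    for line in lines:
        rec = parse(line, year)
        if rec is None:
            continue
        ts, source, msg = rec
        if source == "kernel" and any(p in msg for p in IO_ERROR_PATTERNS):
            io_errors.append(ts)
        elif source.startswith("systemd-coredump") and "dumped core" in msg:
            recent = [t for t in io_errors if timedelta(0) <= ts - t <= window]
            if recent:
                hits.append((msg, len(recent)))
    return hits


print(crashes_after_io_errors(SAMPLE))
```

For the excerpt above, this pairs the Xwayland core dump with the five swap read errors logged in the preceding seconds.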
I confirm those findings:
on a Lenovo Thinkpad Yoga S1 : Haswell i7 4500U
Current kernel is upstream 4.16-rc7, xwayland 1.19.6-1 gnome-shell 3.28.
IO scheduler is BFQ (multiqueue).
I get the swap errors, the Xwayland SIGBUS on resume from suspend (random, but at least once every 3 days) and the ext4 errors (I have my ext4 over LVM). All of those are at different points in time but in sequence.
NB: I have two LVs. Only the root LV gets the I/O error on resume from suspend (at times). The root is split over two PVs, one of which is shared with the home LV. smartctl says the whole disk has no bad sectors or pending remapping.
On Debian here.
But I cannot ascertain 4.15. I discovered this crash around that time (24th of January 2018).
But I had an infinite loop in libinput on resume beforehand (fixed in libinput 1.8.4 October 2017).
I thought the swap issue was unrelated, though you uncovered that it might not be.
Mind my swap is bare metal (but on the same disk as the LVM PVs).
This is a Debian box out of the upstream kernel.
Created attachment 1416222 [details]
Valgrind & XWayland
Sorry about the lean run of Valgrind & Xwayland ... the machine is rendered near useless for anything GUI related ...
Please can you re-evaluate the need for kernel investigation at this point?
There's also a short&sweet thread on the Arch Linux Forums that pins this on kernel v4.15. (Noticed after a kernel upgrade; successfully avoided by downgrading the kernel).
*** Bug 1548737 has been marked as a duplicate of this bug. ***
I proposed a fix in the upstream kernel.
(I was able to trace over suspend/resume using ftrace (`trace-cmd`), it's amazing really. The output from existing block tracepoints alone was enough to pin down where the IO error comes from. Though it helped that I had already stared at the v4.15 commit regarding block device suspend, to know what was supposed to be happening).
As a workaround, until a kernel fix gets into Fedora via -stable, I'm booting with this kernel option:
It fixes the suspend test case in the patch message. You may notice an extra delay of about a second, when resuming from suspend.
The problem occurs too with Wayland sessions: https://www.dropbox.com/s/y28ojryzpytr34d/journal.log?dl=1
j1simon: sorry but I think you have an issue which needs processing separately. This issue is 100% diagnosed and known to be caused by suspend aka sleep. From your log, I don't see any sign of system suspend.
Please avoid confusing this issue, by creating a new one instead. You may add <email@example.com> on the "CC List" if you wish to continue talking to me.
If at all possible, prefer using the report feature of the ABRT app to create a bugzilla issue for your _gnome-shell_ crash. (You can search the app list for ABRT, but the actual name is "Problem Reporting".) ABRT will include a wealth of detail about the crash automatically, prompt you for further details, etc.
I think you're right that your crash is with gnome-shell. The Xwayland crash in your log can be ignored. It is preceded by the message
(EE) failed to read Wayland events: Connection reset by peer
This is an X/Xwayland message - the "(EE)" is distinctive.
I think it could help if someone with bugzilla privs can assign this to the kernel package, so its state can be accurately tracked.
The kernel patch linked above, which solves these SIGBUS crashes (or rather the underlying IO error), has been accepted by the linux-block maintainer.
It's not in Linus' git tree yet. I'm roughly assuming it will get into 4.17 though, since the commit is in the "for-linus" branch of "linux-block", and not the "for-next" branch.
Patch accepted: https://www.spinics.net/lists/linux-block/msg24710.html
linux-block tree: http://git.kernel.dk/cgit/linux-block
(again, look in "for-linus", commit "block: do not use interruptible wait anywhere").
I have not had a GNOME Xwayland crash in a while ... something improved.
Well, tease the devil and see what happens ... don't know if this helps or not ...
[104177.468477] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
This was on the console after Gnome crashed.
The only help needed now is if you can disprove either the patch, which is accepted for the next upstream kernels 4.17 and 4.16.8, or the workaround.
If you want to test the upstream release candidate, the patch is present in 4.17-rc3 and above, which I assume can be installed as a Fedora binary package from the usual place.
The crash is sensitive to memory pressure, for obvious reasons. Read the description at the top of the patch :).
It stopped happening to me while I stopped running VMs for a bit, and came back afterwards. Also I think the Xwayland crash is due to SIGALRM, so I suspect that case also requires you to be suspended for a certain amount of time. When I deliberately tested massive memory pressure, I don't think I ever saw the Xwayland crash when I suspended+resumed immediately... but it happened later when I broke for lunch :).
There's a Fedora doc on how to set a kernel boot option here:
hum, sorry for that "obvious" there, that's not true :).
The SIGBUS is delivered due to the failure of a read operation, paging part of Xwayland in from disk. The point about memory pressure, is that if you have enough free RAM, basically the whole of Xwayland is more likely to stay in memory.
The fix is not to return failure from read operations, in the event that a signal e.g. SIGALRM is received while the disk is still being resumed.
The bug / fix is pretty clear once you know what it is. So while testing is always welcome, I'm not specifically asking for it. What I most wanted was to post the workaround above, for anyone affected.
Should be fixed now.
Kernel 4.16.9-200.fc27.x86_64 is in Fedora 27, and includes the patch. F28 is also fixed, and passes the test in the patch description. I'm currently running F28 without the workaround ("scsi_mod.scan=sync").
4.16.9-200.fc27.x86_64 installed (with patch), will disable patch later today.
Before I could even check for 4.16.9-200.fc27.x86_64, 4.16.11-2xx.fc27.x86_64 already hit the repos. I can now report that I have not experienced this or similar crashes since I updated to said version.
(Similar crashes means: Before the update, I had subsequent crashes after Xwayland had restarted, where it crashed up to four times in a row directly after login. At that point, SCSI should have long been stable. I searched here for their backtraces, and all tickets eventually pointed back to this one.)
Thanks for investigating!