Bug 1553979 - Xwayland server crash
Status: NEW
Product: Fedora
Classification: Fedora
Component: xorg-x11-server
Version: 27
Assigned To: X/OpenGL Maintenance List
QA Contact: Fedora Extras Quality Assurance
Duplicates: 1548737
Reported: 2018-03-10 07:18 EST by barsnick
Modified: 2018-05-28 09:31 EDT

Attachments
bt per instructions (17.87 KB, text/plain), 2018-03-12 18:50 EDT, Brent R Brian
bt full per instructions (12.51 KB, text/plain), 2018-03-12 19:34 EDT, barsnick
Xwayland (5.68 KB, text/plain), 2018-03-15 06:47 EDT, Brent R Brian
gnome-shell (15.67 KB, text/plain), 2018-03-15 06:49 EDT, Brent R Brian
coredump list (4.54 KB, text/plain), 2018-03-15 06:52 EDT, Brent R Brian
name plate from Acer laptop (3.43 MB, image/jpeg), 2018-03-23 06:41 EDT, Brent R Brian
keyboard & track pad (disabled, button, red led on left) (3.97 MB, image/jpeg), 2018-03-23 06:42 EDT, Brent R Brian
Valgrind & XWayland (8.30 KB, text/plain), 2018-04-02 06:40 EDT, Brent R Brian

Description barsnick 2018-03-10 07:18:50 EST
Description of problem:
Xwayland server crashes unexpectedly.

Version-Release number of selected component (if applicable):
xorg-x11-server-Xwayland

How reproducible:
unknown

Steps to Reproduce:
1. unknown

Actual results:
https://retrace.fedoraproject.org/faf/reports/2055378/

https://retrace.fedoraproject.org/faf/reports/bthash/4d8d374dc2e5db696453d2819011c015c4b929b9


Expected results:
No crash.

Additional info:
Please note that the retrace/ABRT server has quite a few reports of this crash, but apparently there is no Bugzilla ticket yet.

I thought this was an artifact of the gjs issue with the GNOME Places menu, which should have been solved with 1.50.4-1 (but which I still encounter). I have no idea what happened here; I hope the retrace logs are sufficient.
Comment 1 Olivier Fourdan 2018-03-10 09:04:48 EST
Can you indicate the exact version of the xorg-x11-server-Xwayland package?
Comment 2 barsnick 2018-03-10 11:38:10 EST
Sorry. Of course, yes:

xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64

Others are
libwayland-server-1.14.0-2.fc27.x86_64
gjs-1.50.4-2.fc27.x86_64
Comment 3 Olivier Fourdan 2018-03-12 05:30:44 EDT
The backtrace shows a call to ProcXTestFakeInput(). Were you using an X client that emulates key events (for example antimicro) at the time of the crash?
Comment 4 Olivier Fourdan 2018-03-12 08:25:03 EDT
I can more or less get the code path which leads to the crash from the FAF report:

  ProcXTestFakeInput() in Xext/xtest.c line 429
  mieqProcessDeviceEvent() in mi/mieq.c
  ProcessKeyboardEvent() in xkb/xkbPrKeyEv.c
  ProcessOtherEvent() in Xi/exevents.c line 1873
  ProcessDeviceEvent() in Xi/exevents.c line 1709
  UpdateDeviceState()  in Xi/exevents.c line 807
  ChangeMasterDeviceClasses() in Xi/exevents.c line 727
  DeepCopyDeviceClasses() in Xi/exevents.c line 670
  DeepCopyKeyboardClasses() in Xi/exevents.c line 373
  CopyKeyClass() in Xi/exevents.c line 233
  XkbDeviceApplyKeymap() in xkb/xkbUtils.c ???

But that doesn't tell us the actual values which cause the crash; I would rather have a backtrace from gdb for that. I tried to create a reproducer sending XTestFakeKeyEvent(), but that did not cause any crash here.

Xwayland generates a core file in case of a crash. Could you please provide a full backtrace from that core file?

1. Make sure you have the debuginfo installed
   dnf debuginfo-install xorg-x11-server-Xwayland
2. Use “coredumpctl list” to get the list of available core files
3. Find the one for Xwayland in the list
4. Use “coredumpctl gdb <pid>” to attach gdb to that core file
5. Run “bt full” and post the output in this bug.

Thanks!
Comment 5 Olivier Fourdan 2018-03-12 10:22:21 EDT
Looks like this is affecting Xorg as well (i.e. not specific to Xwayland), see https://retrace.fedoraproject.org/faf/reports/2058692/
Comment 6 Brent R Brian 2018-03-12 18:50 EDT
Created attachment 1407402 [details]
bt per instructions
Comment 7 Brent R Brian 2018-03-12 18:51:49 EDT
No apps running, closed laptop lid, opened laptop lid, got logged off.
Comment 8 barsnick 2018-03-12 19:34 EDT
Created attachment 1407406 [details]
bt full per instructions

Same here.
Comment 9 Olivier Fourdan 2018-03-13 05:43:51 EDT
Similar issue https://retrace.fedoraproject.org/faf/reports/2056747/ and bug 1548737
Comment 10 Peter Hutterer 2018-03-14 04:49:23 EDT
ChangeMasterDeviceClasses in the backtrace means this happened on the switch between slave devices, i.e. typing on device one, then typing on device two. Since one device is the XTest device, the other keyboard device is the 'culprit'. 

For anyone not seeing this under Xwayland but regular Xorg:
I'll need an evemu-record of that device and any keyboard configuration you have. If you have other device nodes that may send key events (multimedia keys, maybe a tablet), please attach the evemu-record as well. And a full Xorg.log so we can check what configuration is applied.

For Xwayland: we'll need any keyboard configuration and whether you have tablet devices. It's likely triggered by some compositor change in the keymap.
Comment 11 Brent R Brian 2018-03-15 06:47 EDT
Created attachment 1408408 [details]
Xwayland
Comment 12 Brent R Brian 2018-03-15 06:49 EDT
Created attachment 1408409 [details]
gnome-shell
Comment 13 Brent R Brian 2018-03-15 06:52 EDT
Created attachment 1408410 [details]
coredump list

just a shot of the coredump list
Comment 14 Brent R Brian 2018-03-15 06:58:16 EDT
I have made a point of closing all apps before I close the laptop lid.  Still, I get the gnome-shell/Xwayland crash on open.

Hard to tell if it crashes going into suspend, or coming out.

If you would like to run a script to test a theory, let me know.
Comment 15 Peter Hutterer 2018-03-23 00:16:10 EDT
Quote from comment 10: "For Xwayland: we'll need any keyboard configuration and whether you have tablet devices. It's likely triggered by some compositor change in the keymap."
Comment 16 Brent R Brian 2018-03-23 06:41 EDT
Created attachment 1412042 [details]
name plate from Acer laptop
Comment 17 Brent R Brian 2018-03-23 06:42 EDT
Created attachment 1412043 [details]
keyboard & track pad (disabled, button, red led on left)
Comment 18 Brent R Brian 2018-03-23 06:52:05 EDT
The only other device is a Logitech M310 wireless mouse.

I use wireless WiFi at home AND 2nd VGA monitor
Sometimes apps are open, sometimes not (evince, open office, files)
Disconnect VGA
Close lid
Go to office.


I use wired ethernet at office AND 2nd VGA monitor
Connect ethernet
Connect VGA
Open lid
Gnome3 logs out
Notice comes up that Xwayland crashes


I previously had problems where WiFi would not work if I opened the lid without connecting ethernet FIRST, that "seems to have been fixed".

Keyboard is US / English config

B
Comment 19 Brent R Brian 2018-03-23 06:53:15 EDT
BTW, I live in Clayton, work in Raleigh if you want to do a "hands on".

B
Comment 20 Alan Jenkins 2018-03-24 08:39:59 EDT
Is there anything more specific that would be useful regarding keyboard configuration?

I have had this crash on resume about five times so far.  I use Xwayland (per Fedora default) under GNOME; that's what crashes.  https://bugzilla.redhat.com/show_bug.cgi?id=1557682

I never use an external keyboard.  This is a laptop with a UK keyboard.  GNOME is configured to use a UK keyboard layout.  (I don't switch keyboard layouts, and I don't have any icon in the gnome-shell toolbar which I could click accidentally to do so).

I often use an external USB mouse.  The laptop has a touchpad (and I do not disable it).  The laptop also has a pointing stick.  The laptop is a Dell Latitude E5450.

I do not use an external tablet, the only external input device I use is the USB mouse.
Comment 21 barsnick 2018-03-24 09:26:04 EDT
No tablet. Using an HP EliteBook 850 G1 laptop, with the installed keyboard and touch pad. German keyboard, English and Norwegian layouts installed (and sometimes in use).

Any other particular information required? Not a vague request, but explicit commands please. ;-)
Comment 22 Alan Jenkins 2018-03-24 10:43:16 EDT
I've been puzzling over the associated backtraces, and I have a partial finding of woe.

There are actually several different reports (from my computer and others).  Some less far, some slightly further on, compared to the call chain in comment 4.  Some have secondary crashes inside xorg_backtrace(), some don't.

It's as if the common factor is a bad page fault (always SIGBUS), when we try to perform a fetch that *should* be allowed.

Because all of the crashes seem to point to the *first* instruction in the last function called...  And when I looked into my _dl_fixup() crash, it was trying to make a legitimate read inside the existing mapping of /usr/bin/Xwayland (which has r-x permissions)!

For gdb extracts of this, see my comment https://bugzilla.redhat.com/show_bug.cgi?id=1557682#c9

Am I being paranoid, or is this another microcode update f***up by Intel?  One of the issues with some initial microcode updates was "reports of unexpected page faults".  I really hope there is some proof that can rule that out :(.
Comment 23 Brent R Brian 2018-03-24 11:41:18 EDT
Following is core dump history ... prior to this start date, it was one every now and again (less than weekly) on various apps.

Sad thing is the abrt-applet is dying (when used) and does not help you guys at all ...

Fri 2018-02-16 16:06:10 EST   /usr/lib64/firefox/firefox
Sun 2018-02-25 12:42:42 EST   /usr/bin/Xwayland
Sun 2018-02-25 12:43:08 EST   /usr/bin/gnome-shell
Mon 2018-02-26 08:50:11 EST   /usr/bin/Xwayland
Mon 2018-02-26 08:50:31 EST   /usr/bin/gnome-shell
Tue 2018-02-27 20:43:47 EST   /usr/lib64/firefox/firefox
Wed 2018-02-28 20:48:11 EST   /usr/bin/Xwayland
Wed 2018-02-28 20:48:11 EST   /usr/bin/Xwayland
Wed 2018-02-28 20:48:50 EST   /usr/bin/gnome-shell
Wed 2018-02-28 20:48:50 EST   /usr/bin/gnome-shell
Thu 2018-03-01 09:30:34 EST   /usr/bin/Xwayland
Thu 2018-03-01 09:30:45 EST   /usr/bin/gnome-shell
Thu 2018-03-01 09:33:11 EST   /usr/bin/abrt-applet
Thu 2018-03-01 22:33:04 EST   /usr/bin/Xwayland
Thu 2018-03-01 22:33:27 EST   /usr/bin/gnome-shell
Mon 2018-03-05 06:20:13 EST   /usr/bin/Xwayland
Mon 2018-03-05 06:20:30 EST   /usr/bin/gnome-shell
Mon 2018-03-05 08:54:12 EST   /usr/bin/Xwayland
Mon 2018-03-05 08:54:36 EST   /usr/bin/gnome-shell
Sun 2018-03-11 19:32:25 EDT   /usr/bin/Xwayland
Sun 2018-03-11 19:33:10 EDT   /usr/bin/gnome-shell
Mon 2018-03-12 18:25:24 EDT   /usr/bin/Xwayland
Mon 2018-03-12 18:25:34 EDT   /usr/bin/gnome-shell
Mon 2018-03-12 20:03:13 EDT   /usr/lib64/firefox/firefox
Wed 2018-03-14 06:34:27 EDT   /usr/bin/Xwayland
Wed 2018-03-14 06:35:07 EDT   /usr/bin/gnome-shell
Wed 2018-03-14 17:06:51 EDT   /usr/lib64/firefox/firefox
Wed 2018-03-14 21:20:26 EDT   /usr/bin/Xwayland
Wed 2018-03-14 21:21:37 EDT   /usr/bin/gnome-shell
Thu 2018-03-15 06:28:03 EDT   /usr/bin/Xwayland
Thu 2018-03-15 06:28:54 EDT   /usr/bin/gnome-shell
Thu 2018-03-15 09:21:45 EDT   /usr/bin/Xwayland
Thu 2018-03-15 09:22:05 EDT   /usr/bin/gnome-shell
Thu 2018-03-15 16:26:42 EDT   /usr/lib64/firefox/firefox
Sat 2018-03-17 20:04:40 EDT   /usr/bin/Xwayland
Sat 2018-03-17 20:05:13 EDT   /usr/bin/gnome-shell
Mon 2018-03-19 07:48:48 EDT   /usr/lib64/firefox/firefox
Mon 2018-03-19 07:52:15 EDT   /usr/lib64/firefox/firefox
Mon 2018-03-19 07:52:15 EDT   /usr/lib64/firefox/firefox
Mon 2018-03-19 07:52:22 EDT   /usr/lib64/firefox/firefox
Tue 2018-03-20 09:58:45 EDT   /usr/bin/abrt-applet
Thu 2018-03-22 08:50:43 EDT   /usr/bin/Xwayland
Thu 2018-03-22 08:51:22 EDT   /usr/bin/gnome-shell
Comment 24 Alan Jenkins 2018-03-27 07:23:22 EDT
Please can you assign this bug to the kernel.  Or otherwise start looking at the kernel.  My analysis shows it is not working as intended.

I put a call out for nitpickers on StackOverflow.  My analysis received 4 endorsements, and no-one found a gap in it.  https://stackoverflow.com/questions/49477340/can-i-rule-out-that-sigbus-is-raised-by-a-minor-page-fault-kernel-log-has-no


The faulting addresses are within the text mapping of Xwayland, as shown by $_siginfo in the coredumps.  The SIGBUS cannot be an alignment error, because siginfo shows si_code == BUS_ADRERR (2), not BUS_ADRALN (1).

Therefore SIGBUS should only mean one thing: a failure to read a text page of /usr/bin/Xwayland from disk.  

But the kernel log does not show any disk error.


$ gdb coredump
...
(gdb) p $_siginfo.si_signum
$1 = 7
(gdb) p $_siginfo.si_code
$2 = 2
(gdb) p $_siginfo._sifields._sigfault.si_addr
$3 = (void *) 0x41bd80
(gdb) disassemble
Dump of assembler code for function _dl_fixup:
...
=> 0x00007fc0be0c8bf9 <+41>:    mov    0x8(%r8),%rcx
(gdb) p/x $r8
$4 = 0x41bd78
(gdb) p/x $r8 + 8
$5 = 0x41bd80

$ cat maps
00400000-0060b000 r-xp 00000000 fd:00 1708508                            /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508                            /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508                            /usr/bin/Xwayland
...

$ size -x /usr/bin/Xwayland
   text    data     bss     dec     hex filename
0x209ffb     0xbe9d 0x1f3e0 2314872  235278 /usr/bin/Xwayland
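
For anyone who wants to repeat the address check on their own coredump, here is a minimal Python sketch of the reasoning above. The helper name `find_mapping` is hypothetical; the maps lines, register value and si_code are copied from the gdb session and the maps excerpt in this comment, and the si_code names are the Linux values for SIGBUS.

```python
# Maps excerpt copied from the comment above.
MAPS = """\
00400000-0060b000 r-xp 00000000 fd:00 1708508                            /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508                            /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508                            /usr/bin/Xwayland
"""

# si_code values for SIGBUS on Linux.
BUS_CODES = {1: "BUS_ADRALN", 2: "BUS_ADRERR", 3: "BUS_OBJERR"}


def find_mapping(maps_text, addr):
    """Return (start, end, perms, file_offset, path) for the mapping that
    contains addr, or None if no mapping covers it. Hypothetical helper."""
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:
            # Offset of the faulting byte inside the backing file.
            file_offset = int(fields[2], 16) + (addr - start)
            path = fields[5] if len(fields) > 5 else ""
            return start, end, fields[1], file_offset, path
    return None


# The faulting instruction reads 0x8(%r8), with r8 = 0x41bd78.
fault_addr = 0x41bd78 + 8
mapping = find_mapping(MAPS, fault_addr)

print("fault address:", hex(fault_addr))
print("si_code 2 is", BUS_CODES[2])
print("mapping:", mapping)
```

The address lands in the r-xp mapping of /usr/bin/Xwayland, which is why the read should have been legitimate.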
Comment 25 Olivier Fourdan 2018-03-27 09:15:50 EDT
(In reply to Alan Jenkins from comment #24)
> Please can you assign this bug to the kernel.  Or otherwise start looking at
> the kernel.  My analysis shows it is not working as intended.

I'd rather search a bit further on the X server side first.

I understand you can reproduce at will (more or less). Would it be possible to run Xwayland within valgrind in that case?

That requires some tweaking and it will be slow but it's doable.

The idea is to “replace” the original Xwayland with a script that will run Xwayland within valgrind.

1. Make sure the debuginfo packages are installed for Xwayland

2. Install valgrind if not done yet

3. Rename the existing executable Xwayland as Xwayland.bin

   $ sudo mv /usr/bin/Xwayland /usr/bin/Xwayland.bin

4. Create a shell script which “pretends” to be Xwayland and runs Xwayland.bin within valgrind instead:

   $ sudo vim /usr/bin/Xwayland

   and copy the following content:

   #!/bin/bash
   exec valgrind --track-origins=yes --free-fill=00 --log-file=xwayland.%p /usr/bin/Xwayland.bin "$@"

5. Make the script executable

   $ sudo chmod a+x /usr/bin/Xwayland

6. Try to reproduce and check for the content of the "xwayland.xxxxx" log files created to see if there is a use-after-free related to Xkb somehow.

To revert to normal, simply copy /usr/bin/Xwayland.bin back as /usr/bin/Xwayland
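
For step 6, a rough way to sift the log files is sketched below in Python. It assumes the usual memcheck wording ("Invalid read of size N" followed by an "Address ... inside a block of size N free'd" line); the function name `find_use_after_free` is hypothetical and the sample text is synthetic, made up only to exercise the parser. It is not output from this bug.

```python
import re

# Memcheck reports an invalid access first, then describes the address.
INVALID_ACCESS = re.compile(r"Invalid (read|write) of size (\d+)")
FREED_BLOCK = re.compile(r"inside a block of size \d+ free'd")


def find_use_after_free(log_text):
    """Return (kind, size) for every invalid access whose address is
    described as lying inside an already freed block."""
    hits = []
    pending = None
    for line in log_text.splitlines():
        match = INVALID_ACCESS.search(line)
        if match:
            # Remember the access until its address description shows up.
            pending = (match.group(1), int(match.group(2)))
        elif pending and FREED_BLOCK.search(line):
            hits.append(pending)
            pending = None
    return hits


# Synthetic sample, not taken from any attachment in this bug.
SAMPLE = """\
==1234== Invalid read of size 8
==1234==    at 0x401000: example_function (example.c:1)
==1234==  Address 0x5000040 is 16 bytes inside a block of size 64 free'd
"""

print(find_use_after_free(SAMPLE))
```

To use it on real logs, read each xwayland.* file and pass its contents to the function; any hit is worth posting here together with the surrounding stack.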
Comment 26 Alan Jenkins 2018-03-27 09:30:07 EDT
It sounds like Brent might have a good pattern that reproduces this, maybe one in four or something - on his commute to work :-).  I haven't identified one for myself though :-(.  It happened maybe once in three days with my historical usage.  I've just been collecting coredumps, as well as looking at other people's reports.
Comment 27 Alan Jenkins 2018-03-27 15:15:26 EDT
I do want to provide solid data you can use, but I don't think I'd be able to maintain my usage pattern to try and reproduce it, for a ~3 day period, while running Xwayland under valgrind.  (Firefox uses X, if nothing else).

---

I learned a bit more data though.  The first v4.15 kernel accepted to stable was kernel-4.15.3-300.fc27, on the date 2018-02-16 17:47:46.

https://bodhi.fedoraproject.org/updates/kernel-4.15.3-300.fc27#comment-732580

On 2018-02-17, there's a marked jump in the number of crash buckets.  And it's not just because of the day of the week.

https://retrace.fedoraproject.org/faf/reports/?opsysreleases=125&component_names=xorg-x11-server&associate=__None&first_occurrence_daterange=2018-02-01%3A2018-02-18&last_occurrence_daterange=&order_by=first_occurrence

That day has the first instances of this pattern of backtrace: SIGBUS (unusual on x86) + _dl_fixup() (+ ChangeMasterDeviceClasses + ProcessKeyboardEvent()).

There's enough data to clearly pin it to this 24 hour period.  You see the number of buckets per day jump on that date, and it only rises from there, because this bug inherently manifests with a variation in where it strikes along the call path.

There is a common factor that links my slightly different crashes: the SIGBUS happens when executing the first instruction of the function, whichever function it was.  IOW, the SIGBUS happens at the same time as the first access to the page containing that function.

(It's not a stack issue: if you look at the details I posted of my crashes, you can see the fault address is not in the stack.  It's in the text segment.  And in my first posted analysis, the instruction at fault was actually `test %rdi, %rdi`, which does not involve any memory access other than fetching that instruction).
Comment 28 Brent R Brian 2018-03-28 08:06:25 EDT
 Comment 25 instructions followed -- the old gal ain't the speed demon she used to be, but, working well so far.

Off to work ....
Comment 29 Alan Jenkins 2018-03-29 04:54:45 EDT
GOTCHA.  Does anyone else see "Read-error on swap-device" / "EXT4-fs error" / "Buffer I/O error" if they look in their system logs, that happen around suspend/resume time?

To view historical kernel logs together with SIGBUS reports: 

    $ journalctl _TRANSPORT=kernel + COREDUMP_SIGNAL_NAME=SIGBUS

I suggest using a search for the error text (`/` key).  It doesn't look like it happens every time.

A recent one triggered a fsck, which made me pay attention.  Disk errors would make the SIGBUS analysis a lot less mysterious.  So I'm now very suspicious of the SATA ALPM work that was merged in v4.15.


Mar 02 18:47:03 alan-laptop kernel: sd 1:0:0:0: [sda] Starting disk
...
Mar 02 18:47:03 alan-laptop kernel: ata1: SATA link down (SStatus 0 SControl 300)
...
Mar 02 18:47:03 alan-laptop kernel: PM: resume devices took 1.017 seconds
...
Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ... 
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:580256)
Mar 02 18:47:04 alan-laptop kernel: Read-error on swap-device (253:1:580264)
Mar 02 18:47:04 alan-laptop kernel: Read-error on swap-device (253:1:580272)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
Mar 02 18:47:09 alan-laptop kernel: done.
...
Mar 02 18:47:14 alan-laptop systemd-coredump[755]: Process 1356 (Xwayland) of user 42 dumped core.
                                                   
                                                   Stack trace of thread 1356:
                                                   #0  0x00007fe4daf3a2de n/a (ld-linux-x86-64.so.2)
                                                   #1  0x0000000000000000 n/a (n/a)
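
If paging through the journal by hand is tedious, the same search can be sketched in Python. This is only an illustration: the function name `summarize` is hypothetical, the patterns are the three error strings quoted above plus the systemd-coredump line, and the sample lines are taken from the excerpt in this comment.

```python
import re

# Error strings mentioned at the top of this comment.
IO_ERROR = re.compile(r"Read-error on swap-device|EXT4-fs error|Buffer I/O error")
# systemd-coredump line, as seen in the excerpt above.
CORE_DUMP = re.compile(r"Process \d+ \((?P<name>[^)]+)\) of user \d+ dumped core")


def summarize(lines):
    """Count I/O error lines and collect the names of processes that dumped core."""
    io_errors = 0
    dumped = []
    for line in lines:
        if IO_ERROR.search(line):
            io_errors += 1
        match = CORE_DUMP.search(line)
        if match:
            dumped.append(match.group("name"))
    return io_errors, dumped


log = [
    "Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ... ",
    "Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)",
    "Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:580256)",
    "Mar 02 18:47:09 alan-laptop kernel: done.",
    "Mar 02 18:47:14 alan-laptop systemd-coredump[755]: Process 1356 (Xwayland) of user 42 dumped core.",
]

print(summarize(log))
```

Feeding it the saved output of the journalctl command above should show quickly whether I/O errors and core dumps cluster around the same resume.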
Comment 30 Alan Jenkins 2018-03-29 05:36:03 EDT
But it's not the obvious effect of the SATA LPM changes in v4.15, because I don't have it enabled.  Even when I'm running on battery power.  (And I haven't been playing with `powertop` at all in this time period).


$ head /sys/class/scsi_host/host*/link_power_management_policy
==> /sys/class/scsi_host/host0/link_power_management_policy <==
max_performance

==> /sys/class/scsi_host/host1/link_power_management_policy <==
max_performance
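
The same check can be scripted, for example to compare several machines. A minimal Python sketch, assuming the sysfs layout shown above; the function name `read_lpm_policies` and its `sysfs_root` parameter are hypothetical.

```python
import glob
import os


def read_lpm_policies(sysfs_root="/sys/class/scsi_host"):
    """Map each SCSI host name to its link power management policy.
    Hosts without the attribute, or with an unreadable one, are skipped."""
    policies = {}
    pattern = os.path.join(sysfs_root, "host*", "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        host = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as handle:
                policies[host] = handle.read().strip()
        except OSError:
            continue
    return policies


for host, policy in read_lpm_policies().items():
    print(host, policy)
```

A value other than max_performance would mean link power management is active on that host.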
Comment 31 Alban Browaeys 2018-03-31 19:33:02 EDT
I confirm those findings:

on a Lenovo Thinkpad Yoga S1 : Haswell i7 4500U
Current kernel is upstream 4.16-rc7, xwayland 1.19.6-1 gnome-shell 3.28.
IO scheduler is BFQ (multiqueue).

I get the swap errors, the Xwayland SIGBUS on resume from suspend (random, but at least once every 3 days) and the ext4 errors (I have my ext4 over LVM). All of those are at different points in time but in sequence.

NB: I have two LVs. Only the root LV gets the I/O error on resume from suspend (at times). The root is split over two PVs, one of which is shared with the home LV. smartctl reports that the whole disk has no bad sectors or pending remappings.

On Debian here.
But I cannot ascertain 4.15. I discovered this crash around that time (24th of January 2018).
But I had an infinite loop in libinput on resume beforehand (fixed in libinput 1.8.4 October 2017).

I thought the swap issue was unrelated, though you uncovered that it might not be.
Mind that my swap is bare metal (but on the same disk as the LVM PVs).
This is a Debian box out of the upstream kernel.
Comment 32 Brent R Brian 2018-04-02 06:40 EDT
Created attachment 1416222 [details]
Valgrind & XWayland
Comment 33 Brent R Brian 2018-04-02 06:44:06 EDT
Sorry about the lean run of Valgrind & Xwayland ... the machine is rendered near useless for anything GUI related ...
Comment 34 Alan Jenkins 2018-04-04 06:41:27 EDT
Please can you re-evaluate the need for kernel investigation at this point?

There's also a short&sweet thread on the Arch Linux Forums that pins this on kernel v4.15.  (Noticed after a kernel upgrade; successfully avoided by downgrading the kernel).

https://bbs.archlinux.org/viewtopic.php?id=235027
Comment 35 Adam Williamson 2018-04-12 15:34:04 EDT
*** Bug 1548737 has been marked as a duplicate of this bug. ***
Comment 36 Alan Jenkins 2018-04-12 15:34:42 EDT
I proposed a fix in the upstream kernel.

https://marc.info/?l=linux-block&m=152355676230208&w=2

(I was able to trace over suspend/resume using ftrace (`trace-cmd`), it's amazing really.  The output from existing block tracepoints alone was enough to pin down where the IO error comes from.  Though it helped that I had already stared at the v4.15 commit regarding block device suspend, to know what was supposed to be happening).
Comment 37 Alan Jenkins 2018-04-13 08:56:26 EDT
As a workaround, until a kernel fix gets into Fedora via -stable, I'm booting with this kernel option:

    scsi_mod.scan=sync

It fixes the suspend test case in the patch message.  You may notice an extra delay of about a second, when resuming from suspend.
Comment 38 j1simon 2018-04-21 09:47:07 EDT
The problem occurs too with Wayland sessions: https://www.dropbox.com/s/y28ojryzpytr34d/journal.log?dl=1
Comment 39 Alan Jenkins 2018-04-21 11:51:51 EDT
j1simon: sorry but I think you have an issue which needs processing separately.  This issue is 100% diagnosed and known to be caused by suspend aka sleep.  From your log, I don't see any sign of system suspend.

Please avoid confusing this issue, by creating a new one instead.  You may add <alan.christopher.jenkins@gmail.com> on the "CC List" if you wish to continue talking to me.

If at all possible, prefer using the report feature of the ABRT app to create a Bugzilla issue for your _gnome-shell_ crash.  (You can search the app list for ABRT, but the actual name is "Problem Reporting").  ABRT will include a wealth of detail about the crash automatically, prompt you for further details, etc.

I think you're right that your crash is with gnome-shell.  The Xwayland crash in your log can be ignored.  It is preceded by the message

    (EE) failed to read Wayland events: Connection reset by peer

This is an X/Xwayland message - the "(EE)" is distinctive.
Comment 40 Alan Jenkins 2018-04-21 11:56:25 EDT
I think it could help if someone with bugzilla privs can assign this to the kernel package, so its state can be accurately tracked.  

The kernel patch linked above, which solves these SIGBUS crashes (or rather the underlying IO error), has been accepted by the linux-block maintainer.

It's not in Linus' git tree yet.  I'm roughly assuming it will get into 4.17 though, since the commit is in the "for-linus" branch of "linux-block", and not the "for-next" branch.

Patch accepted: https://www.spinics.net/lists/linux-block/msg24710.html

---

linux-block tree: http://git.kernel.dk/cgit/linux-block
(again, look in "for-linus", commit "block: do not use interruptible wait anywhere").
Comment 41 Brent R Brian 2018-05-02 11:55:46 EDT
I have not had a GNOME Xwayland crash in a while ... something improved.
Comment 42 Brent R Brian 2018-05-06 18:58:16 EDT
Well, tease the devil and see what happens ... don't know if this helps or not ...

[104177.468477] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

This was on the console after Gnome crashed.
Comment 43 Alan Jenkins 2018-05-07 03:59:41 EDT
The only help needed now, is if you can disprove either the patch which is accepted for the next upstream kernels 4.17 [1] and 4.16.8 [2], or the workaround [3].  

If you want to test the upstream release candidate, the patch is present in  4.17-rc3 and above, which I assume can be installed as a Fedora binary package from the usual place [4].

---

The crash is sensitive to memory pressure, for obvious reasons. Read the description at the top of the patch :).

It stopped happening to me while I stopped running VMs for a bit, and came back afterwards.  Also I think the Xwayland crash is due to SIGALRM, so I suspect that case also requires you to be suspended for a certain amount of time.  When I deliberately tested massive memory pressure, I don't think I ever saw the Xwayland crash when I suspended+resumed immediately... but it happened later when I broke for lunch :).

---

[1] 
https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428

[2]
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.16.y&id=7859056bc73dea2c3714b00c83b253d4c22bf7b6

[3]
https://bugzilla.redhat.com/show_bug.cgi?id=1553979#c37

There's a Fedora doc on how to set a kernel boot option here:

https://docs.fedoraproject.org/f27/system-administrators-guide/kernel-module-driver-configuration/Working_with_the_GRUB_2_Boot_Loader.html#sec-Editing_a_Menu_Entry

[4] https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
Comment 44 Alan Jenkins 2018-05-07 04:13:02 EDT
hum, sorry for that "obvious" there, that's not true :).

The SIGBUS is delivered due to the failure of a read operation, paging part of Xwayland in from disk. The point about memory pressure, is that if you have enough free RAM, basically the whole of Xwayland is more likely to stay in memory.

The fix is not to return failure from read operations, in the event that a signal e.g. SIGALRM is received while the disk is still being resumed.

The bug / fix is pretty clear once you know what it is.  So while testing is always welcome, I'm not specifically asking for it. What I most wanted was to post the workaround above, for anyone affected.
Comment 45 Alan Jenkins 2018-05-24 06:04:52 EDT
Should be fixed now.

Kernel 4.16.9-200.fc27.x86_64 is in Fedora 27, and includes the patch.  F28 is also fixed, and passes the test in the patch description.  I'm currently running F28 without the workaround ("scsi_mod.scan=sync").
Comment 46 Brent R Brian 2018-05-24 10:31:15 EDT
4.16.9-200.fc27.x86_64 installed (with patch), will disable patch later today.
Comment 47 barsnick 2018-05-28 09:31:01 EDT
Before I could even check for 4.16.9-200.fc27.x86_64, 4.16.11-2xx.fc27.x86_64 already hit the repos. I can now report that I have not experienced this or similar crashes since I updated to said version.

(Similar crashes means: Before the update, I had subsequent crashes after Xwayland had restarted, where it crashed up to four times in a row directly after login. At that point, SCSI should have long been stable. I searched here for their backtraces, and all tickets eventually pointed back to this one.)

Thanks for investigating!
