Bug 1535080 - [Wayland] gnome-shell crash and process stay eating 100% CPU
Summary: [Wayland] gnome-shell crash and process stay eating 100% CPU
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: gnome-shell   
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Florian Müllner
QA Contact: Desktop QE
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2018-01-16 15:49 UTC by Tomas Pelka
Modified: 2018-05-17 14:39 UTC (History)

Fixed In Version: mutter-3.26.2-7.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-10 13:10:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---



External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0770 None None None 2018-04-10 13:11 UTC
Red Hat Bugzilla 1529175 None CLOSED Gnome Wayland session cannot start 2019-03-21 13:00 UTC

Internal Trackers: 1529175

Description Tomas Pelka 2018-01-16 15:49:22 UTC
Description of problem:
This happens in the Wayland session.

Version-Release number of selected component (if applicable):
gnome-shell-3.26.2-2.el7.x86_64
kernel-3.10.0-830.el7.x86_64
gnome-session-wayland-session-3.26.1-8.el7.x86_64
libwayland-server-1.14.0-2.el7.x86_64
xorg-x11-server-Xwayland-1.19.5-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Log in to a Wayland session

Actual results:
gnome-shell crashes and the process keeps using 100% CPU

Expected results:


Additional info:
Set of coredumps I collected today - http://download.englab.brq.redhat.com/scratch/tpelka/gnome-shell/

I can also see the following SELinux audit messages:
time->Tue Jan 16 16:35:36 2018
type=ANOM_ABEND msg=audit(1516116936.111:246): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=16672 comm="Xwayland" reason="memory violation" sig=6
----
time->Tue Jan 16 16:36:13 2018
type=ANOM_ABEND msg=audit(1516116973.418:266): auid=1000 uid=1000 gid=1000 ses=4 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=19663 comm="gnome-shell" reason="memory violation" sig=11
----
time->Tue Jan 16 16:36:20 2018
type=ANOM_ABEND msg=audit(1516116980.141:267): auid=1000 uid=1000 gid=1000 ses=4 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=19670 comm="Xwayland" reason="memory violation" sig=6

Note that after switching SELinux to Permissive, gnome-shell still crashes.
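
For reference, the comm and sig fields of these ANOM_ABEND records can be pulled out mechanically. The following is only a minimal C sketch, not part of any audit tooling: parse_abend(), abend_signal_name() and abend_self_check() are names made up for illustration, and only the two signal values seen in the records above are given names.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extract comm="..." and sig=N from one audit line.
 * Returns 0 on success, -1 if a field is missing or comm does not fit. */
int parse_abend(const char *line, char *comm, size_t comm_len, int *sig)
{
    static const char comm_key[] = "comm=\"";
    static const char sig_key[] = " sig=";
    const char *c = strstr(line, comm_key);
    const char *s = strstr(line, sig_key);
    const char *end;
    size_t len;

    if (c == NULL || s == NULL)
        return -1;
    c += strlen(comm_key);
    end = strchr(c, '"');
    if (end == NULL)
        return -1;
    len = (size_t)(end - c);
    if (len >= comm_len)
        return -1;
    memcpy(comm, c, len);
    comm[len] = '\0';
    *sig = atoi(s + strlen(sig_key));
    return 0;
}

/* Name only the two signals that appear in this report. */
const char *abend_signal_name(int sig)
{
    switch (sig) {
    case SIGABRT: return "SIGABRT";
    case SIGSEGV: return "SIGSEGV";
    default:      return "other";
    }
}

/* Usage example on a shortened copy of the gnome-shell record. */
void abend_self_check(void)
{
    char comm[32];
    int sig = 0;

    assert(parse_abend("pid=19663 comm=\"gnome-shell\" "
                       "reason=\"memory violation\" sig=11",
                       comm, sizeof comm, &sig) == 0);
    assert(strcmp(comm, "gnome-shell") == 0);
    assert(strcmp(abend_signal_name(sig), "SIGSEGV") == 0);
}
```

Read this way, the records above show Xwayland ending with signal 6 and gnome-shell with signal 11, which on Linux x86_64 are SIGABRT and SIGSEGV respectively.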

Comment 3 Olivier Fourdan 2018-01-17 09:17:47 UTC
All of the core files from the link in comment 0 are from Xwayland.

Would you have the core files from gnome-shell when it crashes?

Comment 4 Olivier Fourdan 2018-01-17 10:28:13 UTC
(In reply to Olivier Fourdan from comment #3)
> All of the core files from the link in comment 0 are from Xwayland.
> 
> Would you have the core files from gnome-shell when it crashes?

All of the core files point toward an xwl_read_events() (i.e. gnome-shell dead) *but* those two:

 · core.13915
 · core.19594

The backtrace of those is similar:

#0  0x00007fd8ce8e41a7 in raise () from /usr/lib64/libc.so.6
#1  0x00007fd8ce8e5898 in abort () from /usr/lib64/libc.so.6
#2  0x000000000058f1da in OsAbort () at utils.c:1361
#3  0x0000000000594ce3 in AbortServer () at log.c:877
#4  0x0000000000595b2d in FatalError (f=f@entry=0x5b7490 "Caught signal %d (%s). Server aborting\n") at log.c:1015
#5  0x000000000058c43c in OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>) at osinit.c:154
#6  <signal handler called>
#7  xwl_glamor_pixmap_get_wl_buffer (pixmap=pixmap@entry=0x2994320) at xwayland-glamor.c:162
#8  0x0000000000424da5 in xwl_screen_post_damage (xwl_screen=0x215c750) at xwayland.c:514
#9  block_handler (data=0x215c750, timeout=<optimized out>) at xwayland.c:665
#10 0x0000000000557e46 in BlockHandler (pTimeout=pTimeout@entry=0x7ffc2ec84a04) at dixutils.c:388
#11 0x0000000000585ed9 in WaitForSomething (are_ready=0) at WaitFor.c:219
#12 0x00000000005531e1 in Dispatch () at dispatch.c:422
#13 0x000000000055744a in dix_main (argc=11, argv=0x7ffc2ec84be8, envp=<optimized out>) at main.c:287
#14 0x00007fd8ce8d0377 in __libc_start_main () from /usr/lib64/libc.so.6
#15 0x00000000004240fe in _start ()

(gdb) f 8
#8  0x0000000000424da5 in xwl_screen_post_damage (xwl_screen=0x215c750) at xwayland.c:514
514	            buffer = xwl_glamor_pixmap_get_wl_buffer(pixmap);
(gdb) p *pixmap
$1 = {drawable = {type = 1 '\001', class = 0 '\000', depth = 24 '\030', bitsPerPixel = 32 ' ', id = 0, x = 0, y = 0, width = 0, 
    height = 0, pScreen = 0x215c200, serialNumber = 1}, devPrivates = 0x2994368, refcnt = 1, devKind = 0, devPrivate = {ptr = 0x29943f0, 
    val = 43598832, uval = 43598832, fptr = 0x29943f0}, screen_x = 0, screen_y = 0, usage_hint = 0, master_pixmap = 0x0}

So the window pixmap is empty and xwl_pixmap_get() returns NULL, which causes the NULL pointer dereference.

The question is how we end up in post_damage() with a window whose pixmap is 0×0 and has no buffer.
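
To make the failure mode concrete, here is a small standalone C model of it. It is not Xwayland code and not the patch referenced later in this report: the sketch_* types and functions are hypothetical stand-ins, with a NULL private pointer playing the role of xwl_pixmap_get() returning NULL. It only shows the general shape of a lookup that reports "no buffer" instead of dereferencing a missing private.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real Xwayland structures. */
struct sketch_buffer { int id; };
struct sketch_xwl_pixmap { struct sketch_buffer *buffer; };
struct sketch_pixmap {
    int width;
    int height;
    struct sketch_xwl_pixmap *xwl_pixmap; /* NULL models xwl_pixmap_get() == NULL */
};

/* Return the buffer, or NULL when the pixmap has no private data. */
struct sketch_buffer *sketch_get_wl_buffer(const struct sketch_pixmap *pixmap)
{
    if (pixmap == NULL || pixmap->xwl_pixmap == NULL)
        return NULL;
    return pixmap->xwl_pixmap->buffer;
}

/* Returns 1 if damage could be posted, 0 if the window was skipped. */
int sketch_post_damage(const struct sketch_pixmap *pixmap)
{
    struct sketch_buffer *buffer = sketch_get_wl_buffer(pixmap);

    if (buffer == NULL)
        return 0; /* nothing to attach for this window */
    return 1;
}

/* Usage example: the 0x0 pixmap with no private data is skipped. */
void sketch_post_damage_check(void)
{
    struct sketch_pixmap empty = { 0, 0, NULL };

    assert(sketch_post_damage(&empty) == 0);
}
```

Whether skipping such a window is the right behaviour, or whether the empty pixmap should never reach this point at all, is exactly the open question above.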

Worth noting: in both cases, the window that triggered the crash was "hexchat" (found by walking down the window's userProps):

(gdb) x /s xwl_window->window->optional->userProps->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->data
0x215bcf0:	"hexchat"

There is a weird bug in F27 (which uses the same version of gnome) with gnome-shell and hexchat (bug 1525861), so I wonder if that's the same.

Comment 5 Tomas Pelka 2018-01-18 06:52:29 UTC
OK I removed hexchat (I had it from flathub, FYI) from autostart apps and I'm back in wayland session with no gnome crashes.

Just note that I still have one app (dropbox) that is started on login, so this is definitely a hexchat+gnome-shell issue.

Comment 6 Olivier Fourdan 2018-01-18 10:49:49 UTC
(In reply to Olivier Fourdan from comment #4)
> All of the core files point toward an xwl_read_events() (i.e. gnome-shell
> dead) *but* those two:
> 
>  · core.13915
>  · core.19594
> 
> The backtrace of those is similar [...]

For that, I just posted the following patch upstream:

   https://patchwork.freedesktop.org/series/36683/

But I would still like to get the gnome-shell core files to investigate the gnome-shell/mutter side as well.

Comment 7 Olivier Fourdan 2018-01-19 10:20:56 UTC
(In reply to Tomas Pelka from comment #5)
> OK I removed hexchat (I had it from flathub, FYI) from autostart apps and
> I'm back in wayland session with no gnome crashes.
> 
> Just note that I still have one app (dropbox) that is started on login, so
> this is definitely a hexchat+gnome-shell issue.

I am trying to reproduce the issue by installing hexchat via flatpak and adding it to the autostarted apps, with no success so far. Several login attempts all worked fine, with no crash...

Comment 8 Olivier Fourdan 2018-01-22 09:15:25 UTC
While investigating bug 1529175, Matěj was able to reproduce this bug with pidgin (exact same Xwayland backtrace) and capture a core file for both gnome-shell and Xwayland, so this is not related specifically to hexchat.

Looking at the core file, I see that gnome-shell crashes in save_phase_2_callback() in the X session management code:

#0  0x00007f20a83ae941 in meta_workspace_index (workspace=0x0) at core/workspace.c:670
#1  0x00007f20a83b9a89 in save_phase_2_callback () at x11/session.c:953
#2  0x00007f20a83b9a89 in save_phase_2_callback (smc_conn=<optimized out>, client_data=0x1) at x11/session.c:455
#3  0x00007f2099c254be in _SmcProcessMessage () at /lib64/libSM.so.6
#4  0x00007f2099a15ea7 in IceProcessMessages () at /lib64/libICE.so.6
#5  0x00007f20a83b9390 in process_ice_messages (channel=<optimized out>, condition=<optimized out>, client_data=<optimized out>)
    at x11/session.c:96
#6  0x00007f20a23998f9 in g_main_context_dispatch (context=0xf4adf0) at gmain.c:3146
#7  0x00007f20a23998f9 in g_main_context_dispatch (context=context@entry=0xf4adf0) at gmain.c:3811
#8  0x00007f20a2399c58 in g_main_context_iterate (context=0xf4adf0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at gmain.c:3884
#9  0x00007f20a2399f2a in g_main_loop_run (loop=0x11a8ca0) at gmain.c:4080
#10 0x00007f20a8395f4c in meta_run () at core/main.c:652
#11 0x0000000000402584 in main (argc=1, argv=0x7ffc42235328) at main.c:539

The window->workspace is 0x0, thus the null pointer dereference in meta_workspace_index().
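
As an illustration only, the following standalone C sketch models a session-save step that tolerates a window without a workspace. The sketch_* names are hypothetical and this is not mutter code; falling back to the window's initial workspace is an assumption made for the example, not a statement about how the issue was actually fixed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fields read while saving the session. */
struct sketch_workspace { int index; };
struct sketch_session_window {
    struct sketch_workspace *workspace; /* may be NULL, as in the core file */
    int initial_workspace;
    int initial_workspace_set;
};

/* Pick the workspace number to record, or -1 if none is known. */
int sketch_saved_workspace_index(const struct sketch_session_window *w)
{
    if (w->workspace != NULL)
        return w->workspace->index;
    if (w->initial_workspace_set)
        return w->initial_workspace; /* assumed fallback, for illustration */
    return -1;
}

/* Usage example: a window with no workspace but an initial workspace of 0. */
void sketch_saved_workspace_check(void)
{
    struct sketch_session_window w = { NULL, 0, 1 };

    assert(sketch_saved_workspace_index(&w) == 0);
}
```

The point is only that the workspace pointer has to be checked before its index is read; which fallback value makes sense is a separate decision.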

Looking at the window, we see:

  window->type = META_WINDOW_NORMAL
  window->rect = {x = 2008, y = 73, width = 357, height = 814}
  window->monitor = 0x1180000,
  window->override_redirect = 0,
  window->unmanaging = 0,
  window->workspace = 0x0,
  window->always_sticky = 0,
  window->initial_workspace = 0,
  window->on_all_workspaces = 1,
  window->on_all_workspaces_requested = 0,
  window->initial_workspace_set = 1

So we have window->on_all_workspaces TRUE but window->on_all_workspaces_requested FALSE, and the window is located quite far on the right, so chances are that it's on a secondary monitor while the primary is on the right.

Looking at the code, “window->on_all_workspaces” is set from “should_be_on_all_workspaces (window)”, and now things get interesting, because it reads:

static gboolean
should_be_on_all_workspaces (MetaWindow *window)
{
  if (window->always_sticky)
    return TRUE;

  if (window->on_all_workspaces_requested)
    return TRUE;

  if (window->override_redirect)
    return TRUE;

  if (meta_prefs_get_workspaces_only_on_primary () &&
      !window->unmanaging &&
      window->monitor &&
      !meta_window_is_on_primary_monitor (window))
    return TRUE;

  return FALSE;
}

-> So I strongly suspect the issue occurs with windows on the second monitor (not primary) with workspaces_only_on_primary set (the default being off means we may hit a seldom tested case here).
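
The reasoning can be checked with a small standalone model of the predicate, fed with the field values from the core file. This is a sketch, not mutter code: the struct and function names are hypothetical, the preference is passed in as a plain parameter, and the window being off the primary monitor is the suspicion under test rather than something read from the core.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the MetaWindow fields the predicate reads. */
struct sketch_window {
    bool always_sticky;
    bool on_all_workspaces_requested;
    bool override_redirect;
    bool unmanaging;
    bool has_monitor;        /* models window->monitor != NULL */
    bool on_primary_monitor; /* models meta_window_is_on_primary_monitor() */
};

/* Same branch order as the function quoted above. */
bool sketch_should_be_on_all_workspaces(const struct sketch_window *w,
                                        bool workspaces_only_on_primary)
{
    if (w->always_sticky)
        return true;
    if (w->on_all_workspaces_requested)
        return true;
    if (w->override_redirect)
        return true;
    if (workspaces_only_on_primary &&
        !w->unmanaging &&
        w->has_monitor &&
        !w->on_primary_monitor)
        return true;
    return false;
}

/* Values recorded in the core file; on_primary_monitor is assumed false. */
struct sketch_window sketch_window_from_core(void)
{
    struct sketch_window w = {
        .always_sticky = false,
        .on_all_workspaces_requested = false,
        .override_redirect = false,
        .unmanaging = false,
        .has_monitor = true,
        .on_primary_monitor = false,
    };
    return w;
}

/* Usage example: only the last branch can make the predicate true here. */
void sketch_predicate_check(void)
{
    struct sketch_window w = sketch_window_from_core();

    assert(sketch_should_be_on_all_workspaces(&w, true));
    assert(!sketch_should_be_on_all_workspaces(&w, false));
}
```

With the first three flags all zero, the only way to end up with on_all_workspaces set is the last branch, which requires the preference to be enabled and the window to be off the primary monitor.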

Comment 9 Tomas Pelka 2018-01-22 09:21:08 UTC
Nice catch, Olivier! What you said is true: my hexchat was started on the secondary display.

Sorry for not mentioning it earlier.

Comment 10 Olivier Fourdan 2018-01-22 09:25:40 UTC
And what gives:

  $ dconf read /org/gnome/mutter/workspaces-only-on-primary

on your account?

Comment 11 Tomas Pelka 2018-01-22 09:51:24 UTC
(In reply to Olivier Fourdan from comment #10)
> And what gives:
> 
>   $ dconf read /org/gnome/mutter/workspaces-only-on-primary
> 
> on your account?

Nothing

Comment 12 Olivier Fourdan 2018-01-22 09:55:08 UTC
(In reply to Tomas Pelka from comment #11)
> Nothing

Ah sorry, what about:

  $ gsettings get org.gnome.mutter workspaces-only-on-primary

Comment 13 Tomas Pelka 2018-01-22 10:17:56 UTC
(In reply to Olivier Fourdan from comment #12)
> (In reply to Tomas Pelka from comment #11)
> > Nothing
> 
> Ah sorry, what about:
> 
>   $ gsettings get org.gnome.mutter workspaces-only-on-primary

False for me, what about you Mateji?

Comment 14 Matěj Cepl 2018-01-22 12:43:06 UTC
(In reply to Tomas Pelka from comment #13)
> False for me, what about you Mateji?

False as well.

Comment 15 Olivier Fourdan 2018-01-24 09:03:58 UTC
Oh, sorry, Jonas and Florian pointed out this setting can be overridden, what gives:

  $ gsettings get org.gnome.shell.overrides workspaces-only-on-primary

Comment 16 Matěj Cepl 2018-01-24 09:32:31 UTC
(In reply to Olivier Fourdan from comment #15)
> Oh, sorry, Jonas and Florian pointed out this setting can be overridden,
> what gives:
> 
>   $ gsettings get org.gnome.shell.overrides workspaces-only-on-primary

That's true
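
To spell out why the two gsettings answers differ in effect, here is a tiny C model of the lookup. It is a sketch under an explicit assumption: that when the override is in effect, the value from org.gnome.shell.overrides is used instead of the one from org.gnome.mutter. The struct and function names are hypothetical and do not correspond to any mutter or GSettings API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two places the key can come from. */
struct sketch_prefs {
    bool mutter_value;    /* org.gnome.mutter workspaces-only-on-primary */
    bool override_value;  /* org.gnome.shell.overrides workspaces-only-on-primary */
    bool override_active; /* assumption: the override is in effect */
};

/* Value the window manager would act on under the assumption above. */
bool sketch_workspaces_only_on_primary(const struct sketch_prefs *p)
{
    return p->override_active ? p->override_value : p->mutter_value;
}

/* Usage example with the values reported in this thread. */
void sketch_prefs_check(void)
{
    struct sketch_prefs reported = {
        .mutter_value = false,
        .override_value = true,
        .override_active = true,
    };

    assert(sketch_workspaces_only_on_primary(&reported));
}
```

With the values reported here (false in org.gnome.mutter, true in the overrides schema), the effective setting comes out as true, which matches the suspicion from comment 8.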

Comment 18 Matěj Cepl 2018-01-25 18:22:20 UTC
Just wondering whether bug 1538756 is a duplicate of this one. Just something abrt dug up.

Comment 19 Olivier Fourdan 2018-01-31 10:51:37 UTC
(In reply to Matěj Cepl from comment #18)
> Just wondering whether bug 1538756 is a duplicate of this one. Just
> something abrt dug up.

Not a duplicate, but a consequence of this bug I reckon.

Comment 25 errata-xmlrpc 2018-04-10 13:10:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0770

