Description of problem: Several times now XWayland has crashed, when my laptop resumes from suspend. Of course this crash brings down the entire Gnome session :(. Version-Release number of selected component (if applicable): xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64 How reproducible: randomly? :( Steps to Reproduce: Use laptop. Put it to sleep. Wake it up, sometimes it blows up. Additional info: ABRT doesn't like this crash, it uploads the coredump but the analysis says it's bad. Unfortunately the coredump doesn't make sense to me. I hope someone can point out something I've missed to analyze (or something more I need to include in the report). There seem to be five instances of this trace on FAF. I'd suggest they're all mine. Although the backtrace seems to be missing info for one layer. Also it's missing a load of line numbers, which might explain why ABRT doesn't like it. The line numbers seem to show OK locally. (Locally, I'm using systemd's coredumpctl). https://retrace.fedoraproject.org/faf/reports/2078718/ This time it looks like both the gnome sessions I had open were affected, but I think this also varies in apparent randomness. Both gdb, and Xorg's funny attempt at a backtrace, say the SIGBUS is at 0x5394d0. I thought the signal handler was supposed to get a reasonably accurate PC, which can in principle be used to resume the program. But I can't see why it's crashing there. Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: (EE) Bus error at address 0x5394d0 Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: (EE) Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: Fatal server error: Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: (EE) Caught signal 7 (Bus error). Server aborting (gdb) p XkbGeomRealloc $10 = {Bool (void **, int, int, int, XkbGeomClearance)} 0x5394d0 <XkbGeomRealloc> ... #5 0x0000000000591f0e in OsSigHandler (signo=7, sip=<optimized out>, unused=<optimized out>) at osinit.c:154 #6 <signal handler called> #7 XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405 #8 0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439 list 400 { 401 void *items; 402 int clearBegin; 403 404 /* Check validity of arguments. */ 405 if (!buffer) 406 return FALSE; 407 items = *buffer; 408 if (!((items && (szItems > 0)) || (!items && !szItems))) 409 return FALSE; *buffer seems fine, but we haven't reached that code anyway. (gdb) p *buffer $14 = (void *) 0x0 (gdb) p/x $pc $17 = 0x5394d0 (gdb) disassemble XkbGeomRealloc Dump of assembler code for function XkbGeomRealloc: => 0x00000000005394d0 <+0>: test %rdi,%rdi 0x00000000005394d3 <+3>: je 0x5395a0 <XkbGeomRealloc+208> ... I can't see a problem with accessing the stack, gdb seems quite happy to access it (gdb) p $sp $13 = (void *) 0x7ffff14ffd58 (gdb) x/16 $sp 0x7ffff14ffd58: 0x00520a7a 0x00000000 0xf14ffdf0 0x00007fff 0x7ffff14ffd68: 0x00000000 0x00000001 0x00030001 0x00000000 0x7ffff14ffd78: 0x02f0dd30 0x00000000 0x02ef7c98 0x00000000 0x7ffff14ffd88: 0x02f0dd30 0x00000000 0x02eb9a10 0x00000000 (gdb) x/16 $sp - 4*16 0x7ffff14ffd18: 0x0000001c 0x00000000 0x02f0b140 0x00000000 0x7ffff14ffd28: 0xefd07d1a 0x00007ff3 0x02f39f98 0x00000000 0x7ffff14ffd38: 0x02eb5880 0x00000000 0x02f3a060 0x00000000 0x7ffff14ffd48: 0x02f0b0f0 0x00000000 0x02f38d80 0x00000000 and naturally gdb shows the value shown at the top of the stack as consistent with the caller (gdb) down #8 0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439 1439 if (!XkbGeomRealloc (gdb) p/x $pc $18 = 0x520a7a (gdb) disassemble ... 0x0000000000520a75 <+2661>: callq 0x5394d0 <XkbGeomRealloc> => 0x0000000000520a7a <+2666>: test %eax,%eax ... (gdb) list 1434 free(dprop->value); 1435 } 1436 } 1437 1438 /* Reallocate and clear all new items if the buffer grows. */ 1439 if (!XkbGeomRealloc 1440 ((void **) &dst->geom->properties, dst->geom->sz_properties, 1441 src->geom->num_properties, sizeof(XkbPropertyRec), 1442 XKB_GEOM_CLEAR_EXCESS)) 1443 return FALSE; I also attach "thread apply all bt" and "bt full". Thread 1 (Thread 0x7ff3f2fa3a80 (LWP 3532)): #0 0x00007ff3efcb9660 in raise () from /lib64/libc.so.6 #1 0x00007ff3efcbac41 in abort () from /lib64/libc.so.6 #2 0x0000000000594e1a in OsAbort () at utils.c:1361 #3 0x0000000000599fe3 in AbortServer () at log.c:877 #4 0x000000000059ae05 in FatalError (f=f@entry=0x5be070 "Caught signal %d (%s). Server aborting\n") at log.c:1015 #5 0x0000000000591f0e in OsSigHandler (signo=7, sip=<optimized out>, unused=<optimized out>) at osinit.c:154 #6 <signal handler called> #7 XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405 #8 0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439 #9 XkbCopyKeymap (dst=0x2f0b0f0, src=0x2f38d80) at xkbUtils.c:1992 #10 0x0000000000523e5c in XkbCopyKeymap (src=<optimized out>, dst=<optimized out>) at xkbUtils.c:1965 #11 XkbDeviceApplyKeymap (dst=dst@entry=0x2f0dd30, desc=<optimized out>) at xkbUtils.c:2025 #12 0x00000000004fd052 in CopyKeyClass (device=device@entry=0x2eb9a10, master=master@entry=0x2f0dd30) at exevents.c:233 #13 0x00000000004fd45a in DeepCopyKeyboardClasses (to=0x2f0dd30, from=0x2eb9a10) at exevents.c:427 #14 DeepCopyDeviceClasses (from=0x2eb9a10, to=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:670 #15 0x0000000000500446 in ChangeMasterDeviceClasses (device=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:727 #16 0x0000000000500684 in UpdateDeviceState (device=0x2f0dd30, event=0x7ffff1500bb0) at exevents.c:807 #17 0x0000000000500b1c in ProcessDeviceEvent (ev=ev@entry=0x7ffff1500bb0, device=device@entry=0x2f0dd30) at exevents.c:1709 #18 0x0000000000501353 in ProcessOtherEvent (ev=0x7ffff1500bb0, device=0x2f0dd30) at exevents.c:1873 #19 0x000000000052e6f2 in ProcessKeyboardEvent (ev=<optimized out>, keybd=0x2f0dd30) at xkbPrKeyEv.c:165 #20 0x0000000000470e83 in mieqProcessDeviceEvent (dev=0x2eb9a10, event=0x7ff3f2ea0010, screen=0x25776c0) at mieq.c:496 #21 0x0000000000470f52 in mieqProcessDeviceEvent (dev=<optimized out>, event=<optimized out>, screen=<optimized out>) at mieq.c:498 #22 0x000000000048af20 in ProcXTestFakeInput (client=0x30ece40) at xtest.c:429 #23 0x0000000000558208 in Dispatch () at dispatch.c:479 #24 0x000000000055c250 in dix_main (argc=11, argv=0x7ffff1501ce8, envp=<optimized out>) at main.c:287 #25 0x00007ff3efca5f2a in __libc_start_main () from /lib64/libc.so.6 #26 0x0000000000422aaa in _start () Thread 5 (Thread 0x7ff3e515d700 (LWP 3534)): #0 0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so #2 0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so #3 0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0 #4 0x00007ff3efd7a16f in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7ff3dffff700 (LWP 3536)): #0 0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so #2 0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so #3 0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0 #4 0x00007ff3efd7a16f in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7ff3e495c700 (LWP 3535)): #0 0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so #2 0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so #3 0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0 #4 0x00007ff3efd7a16f in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7ff3e595e700 (LWP 3533)): #0 0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so #2 0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so #3 0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0 #4 0x00007ff3efd7a16f in clone () from /lib64/libc.so.6 --- #0 0x00007ff3efcb9660 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00007ff3efcbac41 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000594e1a in OsAbort () at utils.c:1361 No locals. #3 0x0000000000599fe3 in AbortServer () at log.c:877 No locals. #4 0x000000000059ae05 in FatalError (f=f@entry=0x5be070 "Caught signal %d (%s). Server aborting\n") at log.c:1015 args = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 0x7ffff14ff7a0, reg_save_area = 0x7ffff14ff6d0}} args2 = {{gp_offset = 8, fp_offset = 48, overflow_arg_area = 0x7ffff14ff7a0, reg_save_area = 0x7ffff14ff6d0}} beenhere = 1 #5 0x0000000000591f0e in OsSigHandler (signo=7, sip=<optimized out>, unused=<optimized out>) at osinit.c:154 unused = <optimized out> sip = <optimized out> signo = 7 #6 <signal handler called> No symbol table info available. #7 XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405 items = <optimized out> clearBegin = <optimized out> #8 0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439 i = <optimized out> scolor = 0x0 j = 0 sdoodad = 0x0 sprop = 0x0 dsection = 0x0 dshape = 0x0 tmp = 0x0 k = 0 ddoodad = 0x0 doutline = 0x0 dprop = <optimized out> drow = 0x0 ssection = 0x0 sshape = 0x0 dcolor = 0x0 soutline = 0x0 srow = 0x0 #9 XkbCopyKeymap (dst=0x2f0b0f0, src=0x2f38d80) at xkbUtils.c:1992 No locals. #10 0x0000000000523e5c in XkbCopyKeymap (src=<optimized out>, dst=<optimized out>) at xkbUtils.c:1965 No locals. #11 XkbDeviceApplyKeymap (dst=dst@entry=0x2f0dd30, desc=<optimized out>) at xkbUtils.c:2025 nkn = {type = 0 '\000', xkbType = 0 '\000', sequenceNumber = 0, time = 0, deviceID = 3 '\003', oldDeviceID = 3 '\003', minKeyCode = 8 '\b', maxKeyCode = 255 '\377', oldMinKeyCode = 8 '\b', oldMaxKeyCode = 255 '\377', requestMajor = 135 '\207', requestMinor = 9 '\t', changed = 3, detail = 0 '\000', pad1 = 0 '\000', pad2 = 0, pad3 = 0, pad4 = 0} ret = <optimized out> #12 0x00000000004fd052 in CopyKeyClass (device=device@entry=0x2eb9a10, master=master@entry=0x2f0dd30) at exevents.c:233 mk = <optimized out> #13 0x00000000004fd45a in DeepCopyKeyboardClasses (to=0x2f0dd30, from=0x2eb9a10) at exevents.c:427 classes = <optimized out> #14 DeepCopyDeviceClasses (from=0x2eb9a10, to=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:670 No locals. #15 0x0000000000500446 in ChangeMasterDeviceClasses (device=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:727 slave = 0x2eb9a10 rc = <optimized out> #16 0x0000000000500684 in UpdateDeviceState (device=0x2f0dd30, event=0x7ffff1500bb0) at exevents.c:807 No locals. #17 0x0000000000500b1c in ProcessDeviceEvent (ev=ev@entry=0x7ffff1500bb0, device=device@entry=0x2f0dd30) at exevents.c:1709 grab = <optimized out> deactivateDeviceGrab = 0 key = 0 rootX = 0 rootY = 0 b = <optimized out> ret = 0 corestate = 0 mouse = 0x2f0b370 kbd = 0x2f0dd30 event = 0x7ffff1500bb0 __func__ = "ProcessDeviceEvent" #18 0x0000000000501353 in ProcessOtherEvent (ev=0x7ffff1500bb0, device=0x2f0dd30) at exevents.c:1873 No locals. #19 0x000000000052e6f2 in ProcessKeyboardEvent (ev=<optimized out>, keybd=0x2f0dd30) at xkbPrKeyEv.c:165 keyc = <optimized out> xkbi = 0x0 backup_proc = 0x52e690 <ProcessKeyboardEvent> event = <optimized out> is_press = <optimized out> is_release = <optimized out> #20 0x0000000000470e83 in mieqProcessDeviceEvent (dev=0x2eb9a10, event=0x7ff3f2ea0010, screen=0x25776c0) at mieq.c:496 handler = 0x0 master = 0x2f0dd30 mevent = {any = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162}, device_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, detail = {button = 3, key = 3}, touchid = 5, root_x = 0, root_x_frac = 0, root_y = 0, root_y_frac = 0, buttons = '\000' <repeats 31 times>, valuators = {mask = "\000\000\000\000", mode = "\000\000\000\000", data = {0 <repeats 36 times>}}, mods = {base = 0, latched = 0, locked = 0, effective = 0}, group = {base = 0 '\000', latched = 0 '\000', locked = 0 '\000', effective = 0 '\000'}, root = 0, corestate = 0, key_repeat = 0, flags = 0, resource = 0, source_type = EVENT_SOURCE_NORMAL}, changed_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, flags = 10, masterid = 3, sourceid = 5, buttons = {num_buttons = 0, names = {0 <repeats 256 times>}}, num_valuators = 0, valuators = {{min = 0, max = 0, value = 0, resolution = 0, mode = 0 '\000', name = 0, scroll = {type = SCROLL_TYPE_NONE, increment = 0, flags = 0}} <repeats 36 times>}, keys = {min_keycode = 8, max_keycode = 255}}, touch_ownership_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, touchid = 3, reason = 5 '\005', resource = 0, flags = 0}, barrier_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, barrierid = 3, window = 5, root = 0, dx = 0, dy = 0, root_x = 0, root_y = 0, dt = 0, event_id = 0, flags = 0}, dga_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, subtype = 3, detail = 10, dx = 3, dy = 5, screen = 0, state = 0}, raw_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, detail = {button = 3, key = 3}, valuators = {mask = "\000\000\000\000", data = {0 <repeats 36 times>}, data_raw = {0 <repeats 36 times>}}, flags = 0}} #21 0x0000000000470f52 in mieqProcessDeviceEvent (dev=<optimized out>, event=<optimized out>, screen=<optimized out>) at mieq.c:498 No locals. #22 0x000000000048af20 in ProcXTestFakeInput (client=0x30ece40) at xtest.c:429 stuff = <optimized out> nev = <optimized out> n = <optimized out> type = 2 rc = <optimized out> ev = <optimized out> dev = 0x2eb9a10 root = 0xd extension = <optimized out> mask = {last_bit = 16 '\020', has_unaccelerated = 124 '|', mask = "W\002\000\000", valuators = {1.9410843188760561e-316, 2.5757278463123209e-316, 6.9507978547510833e-310, 6.9533436336337801e-310, 1.0185579797423812e-312, 6.9533436336337801e-310, 6.9533436336235035e-310, 1.8527461719046745e-321, -1.6529107122258923e-145, 3.1620201333839779e-322, 1.9427617705568163e-316, 6.953343633619551e-310, 1.943370064179976e-316, 1.9429645350978695e-316, 6.9507978550305762e-310, 3.1620201333839779e-322, 6.9533436336250845e-310, 6.953343633624294e-310, -1.6529107122258923e-145, 0, 0, 0, 2.5544521938448468e-316, 6.9507986183937675e-310, 2.8844288562024277e-317, 6.9505972133067495e-310, 1.5810100666919889e-322, 6.9533436336385231e-310, 1.5810100666919889e-322, 6.9533436336250845e-310, 2.5344200057948259e-316, 2.5544632609153136e-316, 1.5810100666919889e-322, 6.9533436336385231e-310, 1.5810100666919889e-322, 6.9533436336298276e-310}, unaccelerated = {6.9533436696760666e-310, 6.9533436336337801e-310, 2.1473399910232206e-314, 6.9533436336314086e-310, -1.6529107122258923e-145, 6.9533436336353611e-310, 2.5344200057948259e-316, 1.5810100666919889e-322, 6.9533436336385231e-310, 2.9063455094388162e-316, 0, 2.5544632609153136e-316, 2.8851605674239186e-317, 6.9533436336353611e-310, 6.950796022429684e-310, 0, 6.950796022429684e-310, 6.9533436336381279e-310, 6.9507960176135815e-310, 0, 6.9507978549310714e-310, 1.9410764138257226e-316, 6.95079302300611e-310, 2.0202549789757241e-316, 2.8914169206972063e-317, 0, -1.6529107122258923e-145, 0, 2.534653204779663e-316, 2.5270619849444414e-316, 1.9762625833649862e-323, 2.5353219720378737e-316, 6.4228533959362051e-323, 0, 2.8835192813484339e-317, 2.121995791459338e-314}} valuators = {0 <repeats 36 times>} numValuators = 0 firstValuator = <optimized out> nevents = <optimized out> i = <optimized out> base = <optimized out> flags = 0 need_ptr_update = <optimized out> #23 0x0000000000558208 in Dispatch () at dispatch.c:479 result = <optimized out> start_tick = 4705 #24 0x000000000055c250 in dix_main (argc=11, argv=0x7ffff1501ce8, envp=<optimized out>) at main.c:287 i = <optimized out> alwaysCheckForInput = {0, 1} #25 0x00007ff3efca5f2a in __libc_start_main () from /lib64/libc.so.6 No symbol table info available. #26 0x0000000000422aaa in _start () No symbol table info available.
coredumpctl shows a comparative slight difference in the traces that happened 2 days ago: This time: #4 0x000000000059ae05 FatalError (Xwayland) #5 0x0000000000591f0e OsSigHandler (Xwayland) #6 0x00007ff3f004d1b0 __restore_rt (libpthread.so.0) #7 0x00000000005394d0 XkbGeomRealloc (Xwayland) #8 0x0000000000520a7a XkbCopyKeymap.part.8 (Xwayland) #9 0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland) 2 days ago, crash is directly in XkbCopyKeymap(). coredump is deleted already, sorry. #2 0x000000000058e2cd xorg_backtrace (Xwayland) #3 0x0000000000591eb9 OsSigHandler (Xwayland) #4 0x00007f0db1f121b0 __restore_rt (libpthread.so.0) #5 0x0000000000520010 XkbCopyKeymap.part.8 (Xwayland) #6 0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland) Note: 0x520010 is the first instruction in XkbCopyKeymap()! Seems very suspicious! 2 weeks 3 days ago, it's back to XkbGeomRealloc() again: #4 0x00007f98dd06baf0 __restore_rt (libpthread.so.0) #5 0x00000000005394d0 XkbGeomRealloc (Xwayland) #6 0x0000000000520a7a XkbCopyKeymap.part.8 (Xwayland) #7 0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland) And there's a maybe unrelated one: 2 weeks 2 days ago: #4 0x00007f48c6128af0 __restore_rt (libpthread.so.0) #5 0x000000000051bdf0 ProcXkbGetMap (Xwayland) #6 0x0000000000558208 Dispatch (Xwayland) #7 0x000000000055c250 dix_main (Xwayland)
The "maps" file collected by ABRT also doesn't show a problem with the stack, afaict :(. So the apparent pattern of SIGBUS on the first instruction, of those two different functions... I am quite confused by this. maps file: 7ffff14e3000-7ffff1504000 rw-p 00000000 00:00 0 [stack] (gdb) #7 XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405 405 if (!buffer) (gdb) p/x $sp $5 = 0x7ffff14ffd58
*** Bug 1557688 has been marked as a duplicate of this bug. ***
Getting ABRT to do it's analysis locally instead, allowed ABRT to submit full automatic crash dump information. So see bug #1557688 for those attachments.
*** Bug 1558940 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1548737 ***
Another one, ABRT merged but it's slightly different. The backtrace doesn't involve XkbDeviceApplyKeymap(). https://bugzilla.redhat.com/show_bug.cgi?id=1548737#c17 #4 0x00007f7ad052f1b0 __restore_rt (libpthread.so.0) #5 0x0000000000520010 XkbCopyKeymap (Xwayland) #6 0x0000000000523e5c XkbCopyKeymap (Xwayland) #7 0x00000000004fd052 CopyKeyClass (Xwayland) #8 0x00000000004fd45a DeepCopyKeyboardClasses (Xwayland) #9 0x0000000000500446 ChangeMasterDeviceClasses (Xwayland) #10 0x0000000000500684 UpdateDeviceState (Xwayland) #11 0x0000000000500b1c ProcessDeviceEvent (Xwayland) #12 0x0000000000501353 ProcessOtherEvent (Xwayland) #13 0x000000000052e6f2 ProcessKeyboardEvent (Xwayland) #14 0x0000000000470e83 mieqProcessDeviceEvent (Xwayland) #15 0x000000000048af20 ProcXTestFakeInput (Xwayland) The fault is again allegedly in the first instruction, of XkbCopyKeymap(). There's another oddity here as it claims this is a recursive call, but I don't see a recursive call in the source. (gdb) #5 XkbCopyKeymap (dst=0x271d9d0, src=0x2744210) at xkbUtils.c:1957 1957 XkbCopyKeymap(XkbDescPtr dst, XkbDescPtr src) (gdb) up #6 0x0000000000523e5c in XkbCopyKeymap (src=<optimized out>, dst=<optimized out>) at xkbUtils.c:1965 1965 if (src == dst) (gdb) list 1960 if (!src || !dst) { 1961 DebugF("XkbCopyKeymap: src (%p) or dst (%p) is NULL\n", src, dst); 1962 return FALSE; 1963 } 1964 1965 if (src == dst) 1966 return TRUE; 1967 1968 if (!_XkbCopyClientMap(src, dst)) { 1969 DebugF("XkbCopyKeymap: failed to copy client map\n");
F***, really hope this isn't another bug in the spectre microcode updates ala http://vninja.net/news/curious-case-intel-microcode-part-2-gets-better-worse/ The dates are somewhat suspicious, but the connection is not strong enough for me to figure it out. > 3. (Broadwell E, H, U/Y; Haswell standard, Core Extreme, ULT) Symptom: Intel has received reports of **unexpected page faults**, which they are currently investigating. Out of an abundance of caution, Intel requested Lenovo to stop distributing this firmware. I have an i5-5300U, which falls under Broadwell U/Y. The first crash I have in `coredumpctl` with ProcXTestFakeInput -> ProcessKeyboardEvent -> UpdateDeviceState is on 2018-02-28. I have such crashes five times since then. journalctl shows ucode update to rev 0x28 on Jan 15. On Feb 15, kernel starts detecting it as "Intel Spectre v2 broken microcode detected; disabling Speculation Control". On Mar 21, I get an update to 0x2a, and it stops being detected as broken. My most recent of these crashes is 2018-03-23.
So I have a number of crashes in slightly different places along this call chain, but always in the first instruction of the called function. And now look at the secondary crashes in _dl_fixup() (and _Uelf64_lookup_symbol - https://bugzilla.redhat.com/show_bug.cgi?id=1548737). It's as if the common factor that for these mulitiple different SIGBUS crashes, is a read inside the mapping of /usr/bin/Xwayland. In the initial crash, it's a read of the first instruction when jumping to a new function. When there's a secondary crash, it's when we try to read some of the ELF data to generate a backtrace. Co-incidence? I looked at the dl_fixup() crash in detail. It fetches from (0x41bd78 + 0x8), which *should* be in-bounds for reading. Just like we think that 0x5394d0 <XkbGeomRealloc> should have been in-bounds for executing. But read dies with SIGBUS. 00400000-0060b000 r-xp 00000000 fd:00 1708508 /usr/bin/Xwayland (gdb) #0 _dl_fixup (l=0x7f7ad34c6130, reloc_arg=203) at ../elf/dl-runtime.c:73 73 const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)]; (gdb) list 68 = (const void *) D_PTR (l, l_info[DT_SYMTAB]); 69 const char *strtab = (const void *) D_PTR (l, l_info[DT_STRTAB]); 70 71 const PLTREL *const reloc 72 = (const void *) (D_PTR (l, l_info[DT_JMPREL]) + reloc_offset); 73 const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)]; (gdb) disassemble Dump of assembler code for function _dl_fixup: 0x00007f7ad32aebd0 <+0>: push %rbx 0x00007f7ad32aebd1 <+1>: mov %rdi,%r10 0x00007f7ad32aebd4 <+4>: mov %esi,%esi 0x00007f7ad32aebd6 <+6>: lea (%rsi,%rsi,2),%rdx 0x00007f7ad32aebda <+10>: sub $0x10,%rsp 0x00007f7ad32aebde <+14>: mov 0x68(%rdi),%rax 0x00007f7ad32aebe2 <+18>: mov 0x8(%rax),%rdi 0x00007f7ad32aebe6 <+22>: mov 0xf8(%r10),%rax 0x00007f7ad32aebed <+29>: mov 0x8(%rax),%rax 0x00007f7ad32aebf1 <+33>: lea (%rax,%rdx,8),%r8 0x00007f7ad32aebf5 <+37>: mov 0x70(%r10),%rax => 0x00007f7ad32aebf9 <+41>: mov 0x8(%r8),%rcx (gdb) info registers ... r8 0x41bd78 4308344 ... (gdb) p reloc $1 = (const Elf64_Rela * const) 0x41bd78 (gdb) p *reloc $2 = {r_offset = 8443504, r_info = 936302870535, r_addend = 0}
The good news is it sounds more like it started after the last update to X / Xwayland. My first crash was on 2018-02-28, but mine have been less frequent than Brian's. $ rpm -q --last xorg-x11-server-Xwayland xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64 Wed 21 Feb 2018 22:41:54 GMT
In all the coredumps I have, $_siginfo (and the `sip` param in OsSigHandler()) seems to have ->si_addr consistent with the above (i.e. AFAICT it should really be in-bounds), and ->si_code == 2... #define BUS_ADRERR 2 /* non-existent physical address */ I think this is supposed to mean that you have mapped a file, you accessed within the mapping, and within the permissions of the mapping, but the page is actually beyond the end of the file. Or, reading the page from disk failed (IO error). I don't get how I can have this *variety* of backtraces, each internally consistent, unless there's a defect somewhere in the underlying platform :(. Cpu bug, kernel bug, hardware fault? ** Kernel upgrade 4.14 -> 4.15 occurred on 2018-02-18. HMM. ** Just to make sure: I have already run `rpm --verify xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64` and `rpm --verify --all`, without finding any problem with my installed files. And *gdb* seems to be able to access these addresses OK, when debugging the coredump.
https://bugzilla.redhat.com/show_bug.cgi?id=1557899 is another report, currently not marked as duplicate, from someone with crash-on-resume symptoms, originally on Xwayland, but who switched to Xorg. Note that this is again a bus error, and again the address which faults (si_addr) is 0x4b7ac0 - xf86CrtcRotate+0x0 - inside the text mapping of /usr/bin/Xorg specifically.
There are tens of buckets of this pattern in FAF. After sorting by "count", the earliest I found so far started on 2018-02-18. So that pushes it back a little bit. That date matches *exactly* when I upgraded my kernel from v4.14 to v4.15. I often apply (and reboot into) kernel upgrades within 24 hours of availability, so I think this is a plausible match. (To disclose inconvenient points also: that first crash is only 3 days after when I upgraded Xwayland to 1.19.6-5.fc27. I think the kernel is a better match based on the overall evidence though). https://retrace.fedoraproject.org/faf/reports/2050554/ (All you have to do is look up xorg-x11-server in FAF, and look for _dl_fixup(), _Ux86..., or _Uelf... Every one of those is a SIGBUS, and they're all somewhere along this same call chain. Or they're the Xorg one. Some of the buckets contain hundreds of individual reports).
So why would Xorg and Xwayland be so special? Xwayland in particular should be very boring. Well, notice how the fault is always in the text segment, which is at the lowest mapped address. Xorg and Xwayland are the *only* daemon binaries on my system that `file` describes as "ELF 64-bit LSB executable". The others are all "shared object"s - meaning relocatable code. They get loaded to much higher addresess - looks like ASLR. The X servers not being built as relocatable code (and the reason) is noted here: https://bugzilla.redhat.com/show_bug.cgi?id=1543960