Bug 1557682

Summary:	SIGBUS apparently at first instruction in XkbGeomRealloc(), when resuming from suspend
Product:	[Fedora] Fedora	Reporter:	Alan Jenkins <alan.christopher.jenkins>
Component:	xorg-x11-server	Assignee:	X/OpenGL Maintenance List <xgl-maint>
Status:	CLOSED DUPLICATE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	27	CC:	alexl, bskeggs, caillon+fedoraproject, jglisse, john.j5live, ofourdan, rhughes, rstrode, sandmann, xgl-maint
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-03-21 11:40:54 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Alan Jenkins 2018-03-17 22:06:04 UTC

Description of problem:

Several times now XWayland has crashed, when my laptop resumes from suspend.  Of course this crash brings down the entire Gnome session :(.

Version-Release number of selected component (if applicable):
xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64

How reproducible: randomly? :(

Steps to Reproduce:
Use laptop.  Put it to sleep.  Wake it up, sometimes it blows up.


Additional info:

ABRT doesn't like this crash, it uploads the coredump but the analysis says it's bad.

Unfortunately the coredump doesn't make sense to me.  I hope someone can point out something I've missed to analyze (or something more I need to include in the report).

There seem to be five instances of this trace on FAF.  I'd suggest they're all mine.   Although the backtrace seems to be missing info for one layer.  Also it's missing a load of line numbers, which might explain why ABRT doesn't like it.  The line numbers seem to show OK locally.  (Locally, I'm using systemd's coredumpctl).

https://retrace.fedoraproject.org/faf/reports/2078718/

This time it looks like both the gnome sessions I had open were affected, but I think this also varies in apparent randomness.


Both gdb, and Xorg's funny attempt at a backtrace, say the SIGBUS is at 0x5394d0.  I thought the signal handler was supposed to get a reasonably accurate PC, which can in principle be used to resume the program.  But I can't see why it's crashing there.


Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: (EE) Bus error at address 0x5394d0
Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: (EE)
Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: Fatal server error:
Mar 17 18:08:39 alan-laptop org.gnome.Shell.desktop[3494]: (EE) Caught signal 7 (Bus error). Server aborting

(gdb) p XkbGeomRealloc
$10 = {Bool (void **, int, int, int, XkbGeomClearance)} 0x5394d0 <XkbGeomRealloc>

...

#5  0x0000000000591f0e in OsSigHandler (signo=7, sip=<optimized out>, unused=<optimized out>) at osinit.c:154
#6  <signal handler called>
#7  XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405
#8  0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439

list 
400	{
401	    void *items;
402	    int clearBegin;
403	
404	    /* Check validity of arguments. */
405	    if (!buffer)
406	        return FALSE;
407	    items = *buffer;
408	    if (!((items && (szItems > 0)) || (!items && !szItems)))
409	        return FALSE;

*buffer seems fine, but we haven't reached that code anyway.

(gdb) p *buffer
$14 = (void *) 0x0

(gdb) p/x $pc
$17 = 0x5394d0

(gdb) disassemble XkbGeomRealloc
Dump of assembler code for function XkbGeomRealloc:
=> 0x00000000005394d0 <+0>:	test   %rdi,%rdi
   0x00000000005394d3 <+3>:	je     0x5395a0 <XkbGeomRealloc+208>
...

I can't see a problem with accessing the stack, gdb seems quite happy to access it

(gdb) p $sp
$13 = (void *) 0x7ffff14ffd58
(gdb) x/16 $sp
0x7ffff14ffd58:	0x00520a7a	0x00000000	0xf14ffdf0	0x00007fff
0x7ffff14ffd68:	0x00000000	0x00000001	0x00030001	0x00000000
0x7ffff14ffd78:	0x02f0dd30	0x00000000	0x02ef7c98	0x00000000
0x7ffff14ffd88:	0x02f0dd30	0x00000000	0x02eb9a10	0x00000000
(gdb) x/16 $sp - 4*16
0x7ffff14ffd18:	0x0000001c	0x00000000	0x02f0b140	0x00000000
0x7ffff14ffd28:	0xefd07d1a	0x00007ff3	0x02f39f98	0x00000000
0x7ffff14ffd38:	0x02eb5880	0x00000000	0x02f3a060	0x00000000
0x7ffff14ffd48:	0x02f0b0f0	0x00000000	0x02f38d80	0x00000000

and naturally gdb shows the value shown at the top of the stack as consistent with the caller

(gdb) down
#8  0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439
1439	            if (!XkbGeomRealloc
(gdb) p/x $pc
$18 = 0x520a7a
(gdb) disassemble
...
   0x0000000000520a75 <+2661>:	callq  0x5394d0 <XkbGeomRealloc>
=> 0x0000000000520a7a <+2666>:	test   %eax,%eax
...
(gdb) list
1434	                    free(dprop->value);
1435	                }
1436	            }
1437	
1438	            /* Reallocate and clear all new items if the buffer grows. */
1439	            if (!XkbGeomRealloc
1440	                ((void **) &dst->geom->properties, dst->geom->sz_properties,
1441	                 src->geom->num_properties, sizeof(XkbPropertyRec),
1442	                 XKB_GEOM_CLEAR_EXCESS))
1443	                return FALSE;


I also attach "thread apply all bt" and "bt full".

Thread 1 (Thread 0x7ff3f2fa3a80 (LWP 3532)):
#0  0x00007ff3efcb9660 in raise () from /lib64/libc.so.6
#1  0x00007ff3efcbac41 in abort () from /lib64/libc.so.6
#2  0x0000000000594e1a in OsAbort () at utils.c:1361
#3  0x0000000000599fe3 in AbortServer () at log.c:877
#4  0x000000000059ae05 in FatalError (f=f@entry=0x5be070 "Caught signal %d (%s). Server aborting\n") at log.c:1015
#5  0x0000000000591f0e in OsSigHandler (signo=7, sip=<optimized out>, unused=<optimized out>) at osinit.c:154
#6  <signal handler called>
#7  XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405
#8  0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439
#9  XkbCopyKeymap (dst=0x2f0b0f0, src=0x2f38d80) at xkbUtils.c:1992
#10 0x0000000000523e5c in XkbCopyKeymap (src=<optimized out>, dst=<optimized out>) at xkbUtils.c:1965
#11 XkbDeviceApplyKeymap (dst=dst@entry=0x2f0dd30, desc=<optimized out>) at xkbUtils.c:2025
#12 0x00000000004fd052 in CopyKeyClass (device=device@entry=0x2eb9a10, master=master@entry=0x2f0dd30) at exevents.c:233
#13 0x00000000004fd45a in DeepCopyKeyboardClasses (to=0x2f0dd30, from=0x2eb9a10) at exevents.c:427
#14 DeepCopyDeviceClasses (from=0x2eb9a10, to=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:670
#15 0x0000000000500446 in ChangeMasterDeviceClasses (device=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:727
#16 0x0000000000500684 in UpdateDeviceState (device=0x2f0dd30, event=0x7ffff1500bb0) at exevents.c:807
#17 0x0000000000500b1c in ProcessDeviceEvent (ev=ev@entry=0x7ffff1500bb0, device=device@entry=0x2f0dd30) at exevents.c:1709
#18 0x0000000000501353 in ProcessOtherEvent (ev=0x7ffff1500bb0, device=0x2f0dd30) at exevents.c:1873
#19 0x000000000052e6f2 in ProcessKeyboardEvent (ev=<optimized out>, keybd=0x2f0dd30) at xkbPrKeyEv.c:165
#20 0x0000000000470e83 in mieqProcessDeviceEvent (dev=0x2eb9a10, event=0x7ff3f2ea0010, screen=0x25776c0) at mieq.c:496
#21 0x0000000000470f52 in mieqProcessDeviceEvent (dev=<optimized out>, event=<optimized out>, screen=<optimized out>) at mieq.c:498
#22 0x000000000048af20 in ProcXTestFakeInput (client=0x30ece40) at xtest.c:429
#23 0x0000000000558208 in Dispatch () at dispatch.c:479
#24 0x000000000055c250 in dix_main (argc=11, argv=0x7ffff1501ce8, envp=<optimized out>) at main.c:287
#25 0x00007ff3efca5f2a in __libc_start_main () from /lib64/libc.so.6
#26 0x0000000000422aaa in _start ()

Thread 5 (Thread 0x7ff3e515d700 (LWP 3534)):
#0  0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so
#2  0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so
#3  0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff3efd7a16f in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7ff3dffff700 (LWP 3536)):
#0  0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so
#2  0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so
#3  0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff3efd7a16f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7ff3e495c700 (LWP 3535)):
#0  0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so
#2  0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so
#3  0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff3efd7a16f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7ff3e595e700 (LWP 3533)):
#0  0x00007ff3f004882d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ff3ea1247ab in thread_function () from /usr/lib64/dri/swrast_dri.so
#2  0x00007ff3ea124627 in impl_thrd_routine () from /usr/lib64/dri/swrast_dri.so
#3  0x00007ff3f004250b in start_thread () from /lib64/libpthread.so.0
#4  0x00007ff3efd7a16f in clone () from /lib64/libc.so.6

---

#0  0x00007ff3efcb9660 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ff3efcbac41 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000594e1a in OsAbort () at utils.c:1361
No locals.
#3  0x0000000000599fe3 in AbortServer () at log.c:877
No locals.
#4  0x000000000059ae05 in FatalError (f=f@entry=0x5be070 "Caught signal %d (%s). Server aborting\n") at log.c:1015
        args = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 0x7ffff14ff7a0, reg_save_area = 0x7ffff14ff6d0}}
        args2 = {{gp_offset = 8, fp_offset = 48, overflow_arg_area = 0x7ffff14ff7a0, reg_save_area = 0x7ffff14ff6d0}}
        beenhere = 1
#5  0x0000000000591f0e in OsSigHandler (signo=7, sip=<optimized out>, unused=<optimized out>) at osinit.c:154
        unused = <optimized out>
        sip = <optimized out>
        signo = 7
#6  <signal handler called>
No symbol table info available.
#7  XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405
        items = <optimized out>
        clearBegin = <optimized out>
#8  0x0000000000520a7a in _XkbCopyGeom (src=0x2f38d80, dst=0x2f0b0f0) at xkbUtils.c:1439
        i = <optimized out>
        scolor = 0x0
        j = 0
        sdoodad = 0x0
        sprop = 0x0
        dsection = 0x0
        dshape = 0x0
        tmp = 0x0
        k = 0
        ddoodad = 0x0
        doutline = 0x0
        dprop = <optimized out>
        drow = 0x0
        ssection = 0x0
        sshape = 0x0
        dcolor = 0x0
        soutline = 0x0
        srow = 0x0
#9  XkbCopyKeymap (dst=0x2f0b0f0, src=0x2f38d80) at xkbUtils.c:1992
No locals.
#10 0x0000000000523e5c in XkbCopyKeymap (src=<optimized out>, dst=<optimized out>) at xkbUtils.c:1965
No locals.
#11 XkbDeviceApplyKeymap (dst=dst@entry=0x2f0dd30, desc=<optimized out>) at xkbUtils.c:2025
        nkn = {type = 0 '\000', xkbType = 0 '\000', sequenceNumber = 0, time = 0, deviceID = 3 '\003', oldDeviceID = 3 '\003', minKeyCode = 8 '\b', maxKeyCode = 255 '\377', oldMinKeyCode = 8 '\b', oldMaxKeyCode = 255 '\377', requestMajor = 135 '\207', requestMinor = 9 '\t', changed = 3, detail = 0 '\000', pad1 = 0 '\000', pad2 = 0, pad3 = 0, pad4 = 0}
        ret = <optimized out>
#12 0x00000000004fd052 in CopyKeyClass (device=device@entry=0x2eb9a10, master=master@entry=0x2f0dd30) at exevents.c:233
        mk = <optimized out>
#13 0x00000000004fd45a in DeepCopyKeyboardClasses (to=0x2f0dd30, from=0x2eb9a10) at exevents.c:427
        classes = <optimized out>
#14 DeepCopyDeviceClasses (from=0x2eb9a10, to=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:670
No locals.
#15 0x0000000000500446 in ChangeMasterDeviceClasses (device=0x2f0dd30, dce=0x7ffff1500bb0) at exevents.c:727
        slave = 0x2eb9a10
        rc = <optimized out>
#16 0x0000000000500684 in UpdateDeviceState (device=0x2f0dd30, event=0x7ffff1500bb0) at exevents.c:807
No locals.
#17 0x0000000000500b1c in ProcessDeviceEvent (ev=ev@entry=0x7ffff1500bb0, device=device@entry=0x2f0dd30) at exevents.c:1709
        grab = <optimized out>
        deactivateDeviceGrab = 0
        key = 0
        rootX = 0
        rootY = 0
        b = <optimized out>
        ret = 0
        corestate = 0
        mouse = 0x2f0b370
        kbd = 0x2f0dd30
        event = 0x7ffff1500bb0
        __func__ = "ProcessDeviceEvent"
#18 0x0000000000501353 in ProcessOtherEvent (ev=0x7ffff1500bb0, device=0x2f0dd30) at exevents.c:1873
No locals.
#19 0x000000000052e6f2 in ProcessKeyboardEvent (ev=<optimized out>, keybd=0x2f0dd30) at xkbPrKeyEv.c:165
        keyc = <optimized out>
        xkbi = 0x0
        backup_proc = 0x52e690 <ProcessKeyboardEvent>
        event = <optimized out>
        is_press = <optimized out>
        is_release = <optimized out>
#20 0x0000000000470e83 in mieqProcessDeviceEvent (dev=0x2eb9a10, event=0x7ff3f2ea0010, screen=0x25776c0) at mieq.c:496
        handler = 0x0
        master = 0x2f0dd30
        mevent = {any = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162}, device_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, detail = {button = 3, key = 3}, touchid = 5, root_x = 0, root_x_frac = 0, root_y = 0, root_y_frac = 0, buttons = '\000' <repeats 31 times>, valuators = {mask = "\000\000\000\000", mode = "\000\000\000\000", data = {0 <repeats 36 times>}}, mods = {base = 0, latched = 0, locked = 0, effective = 0}, group = {base = 0 '\000', latched = 0 '\000', locked = 0 '\000', effective = 0 '\000'}, root = 0, corestate = 0, key_repeat = 0, flags = 0, resource = 0, source_type = EVENT_SOURCE_NORMAL}, changed_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, flags = 10, masterid = 3, sourceid = 5, buttons = {num_buttons = 0, names = {0 <repeats 256 times>}}, num_valuators = 0, valuators = {{min = 0, max = 0, value = 0, resolution = 0, mode = 0 '\000', name = 0, scroll = {type = SCROLL_TYPE_NONE, increment = 0, flags = 0}} <repeats 36 times>}, keys = {min_keycode = 8, max_keycode = 255}}, touch_ownership_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, touchid = 3, reason = 5 '\005', resource = 0, flags = 0}, barrier_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, barrierid = 3, window = 5, root = 0, dx = 0, dy = 0, root_x = 0, root_y = 0, dt = 0, event_id = 0, flags = 0}, dga_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, subtype = 3, detail = 10, dx = 3, dy = 5, screen = 0, state = 0}, raw_event = {header = 255 '\377', type = ET_DeviceChanged, length = 3088, time = 26650162, deviceid = 3, sourceid = 10, detail = {button = 3, key = 3}, valuators = {mask = "\000\000\000\000", data = {0 <repeats 36 times>}, data_raw = {0 <repeats 36 times>}}, flags = 0}}
#21 0x0000000000470f52 in mieqProcessDeviceEvent (dev=<optimized out>, event=<optimized out>, screen=<optimized out>) at mieq.c:498
No locals.
#22 0x000000000048af20 in ProcXTestFakeInput (client=0x30ece40) at xtest.c:429
        stuff = <optimized out>
        nev = <optimized out>
        n = <optimized out>
        type = 2
        rc = <optimized out>
        ev = <optimized out>
        dev = 0x2eb9a10
        root = 0xd
        extension = <optimized out>
        mask = {last_bit = 16 '\020', has_unaccelerated = 124 '|', mask = "W\002\000\000", valuators = {1.9410843188760561e-316, 2.5757278463123209e-316, 6.9507978547510833e-310, 6.9533436336337801e-310, 1.0185579797423812e-312, 6.9533436336337801e-310, 6.9533436336235035e-310, 1.8527461719046745e-321, -1.6529107122258923e-145, 3.1620201333839779e-322, 1.9427617705568163e-316, 6.953343633619551e-310, 1.943370064179976e-316, 1.9429645350978695e-316, 6.9507978550305762e-310, 3.1620201333839779e-322, 6.9533436336250845e-310, 6.953343633624294e-310, -1.6529107122258923e-145, 0, 0, 0, 2.5544521938448468e-316, 6.9507986183937675e-310, 2.8844288562024277e-317, 6.9505972133067495e-310, 1.5810100666919889e-322, 6.9533436336385231e-310, 1.5810100666919889e-322, 6.9533436336250845e-310, 2.5344200057948259e-316, 2.5544632609153136e-316, 1.5810100666919889e-322, 6.9533436336385231e-310, 1.5810100666919889e-322, 6.9533436336298276e-310}, unaccelerated = {6.9533436696760666e-310, 6.9533436336337801e-310, 2.1473399910232206e-314, 6.9533436336314086e-310, -1.6529107122258923e-145, 6.9533436336353611e-310, 2.5344200057948259e-316, 1.5810100666919889e-322, 6.9533436336385231e-310, 2.9063455094388162e-316, 0, 2.5544632609153136e-316, 2.8851605674239186e-317, 6.9533436336353611e-310, 6.950796022429684e-310, 0, 6.950796022429684e-310, 6.9533436336381279e-310, 6.9507960176135815e-310, 0, 6.9507978549310714e-310, 1.9410764138257226e-316, 6.95079302300611e-310, 2.0202549789757241e-316, 2.8914169206972063e-317, 0, -1.6529107122258923e-145, 0, 2.534653204779663e-316, 2.5270619849444414e-316, 1.9762625833649862e-323, 2.5353219720378737e-316, 6.4228533959362051e-323, 0, 2.8835192813484339e-317, 2.121995791459338e-314}}
        valuators = {0 <repeats 36 times>}
        numValuators = 0
        firstValuator = <optimized out>
        nevents = <optimized out>
        i = <optimized out>
        base = <optimized out>
        flags = 0
        need_ptr_update = <optimized out>
#23 0x0000000000558208 in Dispatch () at dispatch.c:479
        result = <optimized out>
        start_tick = 4705
#24 0x000000000055c250 in dix_main (argc=11, argv=0x7ffff1501ce8, envp=<optimized out>) at main.c:287
        i = <optimized out>
        alwaysCheckForInput = {0, 1}
#25 0x00007ff3efca5f2a in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#26 0x0000000000422aaa in _start ()
No symbol table info available.

Comment 1 Alan Jenkins 2018-03-17 22:21:54 UTC

coredumpctl shows a comparative slight difference in the traces that happened 2 days ago:

This time:

                #4  0x000000000059ae05 FatalError (Xwayland)
                #5  0x0000000000591f0e OsSigHandler (Xwayland)
                #6  0x00007ff3f004d1b0 __restore_rt (libpthread.so.0)
                #7  0x00000000005394d0 XkbGeomRealloc (Xwayland)
                #8  0x0000000000520a7a XkbCopyKeymap.part.8 (Xwayland)
                #9  0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland)

2 days ago, crash is directly in XkbCopyKeymap().  coredump is deleted already, sorry.

                #2  0x000000000058e2cd xorg_backtrace (Xwayland)
                #3  0x0000000000591eb9 OsSigHandler (Xwayland)
                #4  0x00007f0db1f121b0 __restore_rt (libpthread.so.0)
                #5  0x0000000000520010 XkbCopyKeymap.part.8 (Xwayland)
                #6  0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland)

Note: 0x520010 is the first instruction in XkbCopyKeymap()!  Seems very suspicious!

2 weeks 3 days ago, it's back to XkbGeomRealloc() again:

                #4  0x00007f98dd06baf0 __restore_rt (libpthread.so.0)
                #5  0x00000000005394d0 XkbGeomRealloc (Xwayland)
                #6  0x0000000000520a7a XkbCopyKeymap.part.8 (Xwayland)
                #7  0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland)

And there's a maybe unrelated one:
2 weeks 2 days ago:

                #4  0x00007f48c6128af0 __restore_rt (libpthread.so.0)
                #5  0x000000000051bdf0 ProcXkbGetMap (Xwayland)
                #6  0x0000000000558208 Dispatch (Xwayland)
                #7  0x000000000055c250 dix_main (Xwayland)

Comment 2 Alan Jenkins 2018-03-17 22:40:33 UTC

The "maps" file collected by ABRT also doesn't show a problem with the stack, afaict :(.  So the apparent pattern of SIGBUS on the first instruction, of those two different functions... I am quite confused by this.

maps file:

7ffff14e3000-7ffff1504000 rw-p 00000000 00:00 0                          [stack]

(gdb) 
#7  XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS)
    at XKBGAlloc.c:405
405	    if (!buffer)
(gdb) p/x $sp
$5 = 0x7ffff14ffd58

Comment 3 Alan Jenkins 2018-03-17 22:55:23 UTC

*** Bug 1557688 has been marked as a duplicate of this bug. ***

Comment 4 Alan Jenkins 2018-03-17 22:57:09 UTC

Getting ABRT to do it's analysis locally instead, allowed ABRT to submit full automatic crash dump information.  So see bug #1557688 for those attachments.

Comment 5 Alan Jenkins 2018-03-21 11:13:35 UTC

*** Bug 1558940 has been marked as a duplicate of this bug. ***

Comment 6 Alan Jenkins 2018-03-21 11:40:54 UTC


*** This bug has been marked as a duplicate of bug 1548737 ***

Comment 7 Alan Jenkins 2018-03-23 23:08:45 UTC

Another one, ABRT merged but it's slightly different.  The backtrace doesn't involve XkbDeviceApplyKeymap().

https://bugzilla.redhat.com/show_bug.cgi?id=1548737#c17

                #4  0x00007f7ad052f1b0 __restore_rt (libpthread.so.0)
                #5  0x0000000000520010 XkbCopyKeymap (Xwayland)
                #6  0x0000000000523e5c XkbCopyKeymap (Xwayland)
                #7  0x00000000004fd052 CopyKeyClass (Xwayland)
                #8  0x00000000004fd45a DeepCopyKeyboardClasses (Xwayland)
                #9  0x0000000000500446 ChangeMasterDeviceClasses (Xwayland)
                #10 0x0000000000500684 UpdateDeviceState (Xwayland)
                #11 0x0000000000500b1c ProcessDeviceEvent (Xwayland)
                #12 0x0000000000501353 ProcessOtherEvent (Xwayland)
                #13 0x000000000052e6f2 ProcessKeyboardEvent (Xwayland)
                #14 0x0000000000470e83 mieqProcessDeviceEvent (Xwayland)
                #15 0x000000000048af20 ProcXTestFakeInput (Xwayland)

The fault is again allegedly in the first instruction, of XkbCopyKeymap().

There's another oddity here as it claims this is a recursive call, but I don't see a recursive call in the source.

(gdb) 
#5  XkbCopyKeymap (dst=0x271d9d0, src=0x2744210) at xkbUtils.c:1957
1957	XkbCopyKeymap(XkbDescPtr dst, XkbDescPtr src)
(gdb) up
#6  0x0000000000523e5c in XkbCopyKeymap (src=<optimized out>, dst=<optimized out>) at xkbUtils.c:1965
1965	    if (src == dst)
(gdb) list
1960	    if (!src || !dst) {
1961	        DebugF("XkbCopyKeymap: src (%p) or dst (%p) is NULL\n", src, dst);
1962	        return FALSE;
1963	    }
1964	
1965	    if (src == dst)
1966	        return TRUE;
1967	
1968	    if (!_XkbCopyClientMap(src, dst)) {
1969	        DebugF("XkbCopyKeymap: failed to copy client map\n");

Comment 8 Alan Jenkins 2018-03-24 12:31:41 UTC

F***, really hope this isn't another bug in the spectre microcode updates ala http://vninja.net/news/curious-case-intel-microcode-part-2-gets-better-worse/

The dates are somewhat suspicious, but the connection is not strong enough for me to figure it out.

> 3. (Broadwell E, H, U/Y; Haswell standard, Core Extreme, ULT) Symptom: Intel has received reports of **unexpected page faults**, which they are currently investigating. Out of an abundance of caution, Intel requested Lenovo to stop distributing this firmware.

I have an i5-5300U, which falls under Broadwell U/Y.

The first crash I have in `coredumpctl` with ProcXTestFakeInput -> ProcessKeyboardEvent -> UpdateDeviceState is on 2018-02-28.  I have such crashes five times since then.

journalctl shows ucode update to rev 0x28 on Jan 15.  On Feb 15, kernel starts detecting it as "Intel Spectre v2 broken microcode detected; disabling Speculation Control".  On Mar 21, I get an update to 0x2a, and it stops being detected as broken.

My most recent of these crashes is 2018-03-23.

Comment 9 Alan Jenkins 2018-03-24 14:26:22 UTC

So I have a number of crashes in slightly different places along this call chain, but always in the first instruction of the called function.

And now look at the secondary crashes in _dl_fixup() (and _Uelf64_lookup_symbol - https://bugzilla.redhat.com/show_bug.cgi?id=1548737).

It's as if the common factor that for these mulitiple different SIGBUS crashes, is a read inside the mapping of /usr/bin/Xwayland.  In the initial crash, it's a read of the first instruction when jumping to a new function.  When there's a secondary crash, it's when we try to read some of the ELF data to generate a backtrace.

Co-incidence?

I looked at the dl_fixup() crash in detail.

It fetches from (0x41bd78 + 0x8), which *should* be in-bounds for reading.  Just like we think that 0x5394d0 <XkbGeomRealloc> should have been in-bounds for executing.  But read dies with SIGBUS.

00400000-0060b000 r-xp 00000000 fd:00 1708508                            /usr/bin/Xwayland

(gdb)
#0  _dl_fixup (l=0x7f7ad34c6130, reloc_arg=203) at ../elf/dl-runtime.c:73
73	  const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];
(gdb) list
68	    = (const void *) D_PTR (l, l_info[DT_SYMTAB]);
69	  const char *strtab = (const void *) D_PTR (l, l_info[DT_STRTAB]);
70	
71	  const PLTREL *const reloc
72	    = (const void *) (D_PTR (l, l_info[DT_JMPREL]) + reloc_offset);
73	  const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];

(gdb) disassemble
Dump of assembler code for function _dl_fixup:
   0x00007f7ad32aebd0 <+0>:	push   %rbx
   0x00007f7ad32aebd1 <+1>:	mov    %rdi,%r10
   0x00007f7ad32aebd4 <+4>:	mov    %esi,%esi
   0x00007f7ad32aebd6 <+6>:	lea    (%rsi,%rsi,2),%rdx
   0x00007f7ad32aebda <+10>:	sub    $0x10,%rsp
   0x00007f7ad32aebde <+14>:	mov    0x68(%rdi),%rax
   0x00007f7ad32aebe2 <+18>:	mov    0x8(%rax),%rdi
   0x00007f7ad32aebe6 <+22>:	mov    0xf8(%r10),%rax
   0x00007f7ad32aebed <+29>:	mov    0x8(%rax),%rax
   0x00007f7ad32aebf1 <+33>:	lea    (%rax,%rdx,8),%r8
   0x00007f7ad32aebf5 <+37>:	mov    0x70(%r10),%rax
=> 0x00007f7ad32aebf9 <+41>:	mov    0x8(%r8),%rcx

(gdb) info registers
...
r8             0x41bd78	4308344
...
(gdb) p reloc
$1 = (const Elf64_Rela * const) 0x41bd78
(gdb) p *reloc
$2 = {r_offset = 8443504, r_info = 936302870535, r_addend = 0}

Comment 10 Alan Jenkins 2018-03-24 15:54:18 UTC

The good news is it sounds more like it started after the last update to X / Xwayland.  My first crash was on 2018-02-28, but mine have been less frequent than Brian's.

$ rpm -q --last xorg-x11-server-Xwayland
xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64 Wed 21 Feb 2018 22:41:54 GMT

Comment 11 Alan Jenkins 2018-03-24 19:44:28 UTC

In all the coredumps I have, $_siginfo (and the `sip` param in OsSigHandler()) seems to have ->si_addr consistent with the above (i.e. AFAICT it should really be in-bounds), and ->si_code == 2...

#define BUS_ADRERR	2	/* non-existent physical address */

I think this is supposed to mean that you have mapped a file, you accessed within the mapping, and within the permissions of the mapping, but the page is actually beyond the end of the file.  Or, reading the page from disk failed (IO error).

I don't get how I can have this *variety* of backtraces, each internally consistent, unless there's a defect somewhere in the underlying platform :(.  Cpu bug, kernel bug, hardware fault?

** Kernel upgrade 4.14 -> 4.15 occurred on 2018-02-18.  HMM. **

Just to make sure: I have already run `rpm --verify xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64` and `rpm --verify --all`, without finding any problem with my installed files.  And *gdb* seems to be able to access these addresses OK, when debugging the coredump.

Comment 12 Alan Jenkins 2018-03-27 14:03:09 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1557899 is another report, currently not marked as duplicate, from someone with crash-on-resume symptoms, originally on Xwayland, but who switched to Xorg.  Note that this is again a bus error, and again the address which faults (si_addr) is 0x4b7ac0 - xf86CrtcRotate+0x0 - inside the text mapping of /usr/bin/Xorg specifically.

Comment 13 Alan Jenkins 2018-03-27 16:09:36 UTC

There are tens of buckets of this pattern in FAF.  After sorting by "count", the earliest I found so far started on 2018-02-18.  So that pushes it back a little bit.

That date matches *exactly* when I upgraded my kernel from v4.14 to v4.15.

I often apply (and reboot into) kernel upgrades within 24 hours of availability, so I think this is a plausible match.

(To disclose inconvenient points also: that first crash is only 3 days after when I upgraded Xwayland to 1.19.6-5.fc27.  I think the kernel is a better match based on the overall evidence though).

https://retrace.fedoraproject.org/faf/reports/2050554/

(All you have to do is look up xorg-x11-server in FAF, and look for _dl_fixup(), _Ux86..., or _Uelf...  Every one of those is a SIGBUS, and they're all somewhere along this same call chain.  Or they're the Xorg one.  Some of the buckets contain hundreds of individual reports).

Comment 14 Alan Jenkins 2018-03-28 10:26:31 UTC

So why would Xorg and Xwayland be so special?  Xwayland in particular should be very boring.

Well, notice how the fault is always in the text segment, which is at the lowest mapped address.

Xorg and Xwayland are the *only* daemon binaries on my system that `file` describes as "ELF 64-bit LSB executable".  The others are all "shared object"s - meaning relocatable code.  They get loaded to much higher addresess - looks like ASLR.

The X servers not being built as relocatable code (and the reason) is noted here: https://bugzilla.redhat.com/show_bug.cgi?id=1543960