Bug 1557682
Summary: | SIGBUS apparently at first instruction in XkbGeomRealloc(), when resuming from suspend | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Alan Jenkins <alan.christopher.jenkins> |
Component: | xorg-x11-server | Assignee: | X/OpenGL Maintenance List <xgl-maint> |
Status: | CLOSED DUPLICATE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Priority: | unspecified |
Version: | 27 | CC: | alexl, bskeggs, caillon+fedoraproject, jglisse, john.j5live, ofourdan, rhughes, rstrode, sandmann, xgl-maint |
Hardware: | Unspecified | OS: | Unspecified |
Last Closed: | 2018-03-21 11:40:54 UTC | Type: | Bug |
Description
Alan Jenkins
2018-03-17 22:06:04 UTC
coredumpctl shows a slight difference from the trace that happened 2 days ago.

This time:

#4 0x000000000059ae05 FatalError (Xwayland)
#5 0x0000000000591f0e OsSigHandler (Xwayland)
#6 0x00007ff3f004d1b0 __restore_rt (libpthread.so.0)
#7 0x00000000005394d0 XkbGeomRealloc (Xwayland)
#8 0x0000000000520a7a XkbCopyKeymap.part.8 (Xwayland)
#9 0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland)

2 days ago, the crash was directly in XkbCopyKeymap(). The coredump is deleted already, sorry.

#2 0x000000000058e2cd xorg_backtrace (Xwayland)
#3 0x0000000000591eb9 OsSigHandler (Xwayland)
#4 0x00007f0db1f121b0 __restore_rt (libpthread.so.0)
#5 0x0000000000520010 XkbCopyKeymap.part.8 (Xwayland)
#6 0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland)

Note: 0x520010 is the first instruction in XkbCopyKeymap()! Seems very suspicious!

2 weeks 3 days ago, it's back to XkbGeomRealloc() again:

#4 0x00007f98dd06baf0 __restore_rt (libpthread.so.0)
#5 0x00000000005394d0 XkbGeomRealloc (Xwayland)
#6 0x0000000000520a7a XkbCopyKeymap.part.8 (Xwayland)
#7 0x0000000000523e5c XkbDeviceApplyKeymap (Xwayland)

And there's a maybe unrelated one, 2 weeks 2 days ago:

#4 0x00007f48c6128af0 __restore_rt (libpthread.so.0)
#5 0x000000000051bdf0 ProcXkbGetMap (Xwayland)
#6 0x0000000000558208 Dispatch (Xwayland)
#7 0x000000000055c250 dix_main (Xwayland)

The "maps" file collected by ABRT also doesn't show a problem with the stack, afaict :(. So the apparent pattern is a SIGBUS on the first instruction of those two different functions... I am quite confused by this.

maps file:

7ffff14e3000-7ffff1504000 rw-p 00000000 00:00 0 [stack]

(gdb) #7 XkbGeomRealloc (buffer=buffer@entry=0x3711aa8, szItems=0, nrItems=1, itemSize=itemSize@entry=16, clearance=clearance@entry=XKB_GEOM_CLEAR_EXCESS) at XKBGAlloc.c:405
405     if (!buffer)
(gdb) p/x $sp
$5 = 0x7ffff14ffd58

*** Bug 1557688 has been marked as a duplicate of this bug. ***

Getting ABRT to do its analysis locally instead allowed ABRT to submit full automatic crash dump information. So see bug #1557688 for those attachments.

*** Bug 1558940 has been marked as a duplicate of this bug. ***

*** This bug has been marked as a duplicate of bug 1548737 ***

Another one. ABRT merged it, but it's slightly different. The backtrace doesn't involve XkbDeviceApplyKeymap().

https://bugzilla.redhat.com/show_bug.cgi?id=1548737#c17

#4 0x00007f7ad052f1b0 __restore_rt (libpthread.so.0)
#5 0x0000000000520010 XkbCopyKeymap (Xwayland)
#6 0x0000000000523e5c XkbCopyKeymap (Xwayland)
#7 0x00000000004fd052 CopyKeyClass (Xwayland)
#8 0x00000000004fd45a DeepCopyKeyboardClasses (Xwayland)
#9 0x0000000000500446 ChangeMasterDeviceClasses (Xwayland)
#10 0x0000000000500684 UpdateDeviceState (Xwayland)
#11 0x0000000000500b1c ProcessDeviceEvent (Xwayland)
#12 0x0000000000501353 ProcessOtherEvent (Xwayland)
#13 0x000000000052e6f2 ProcessKeyboardEvent (Xwayland)
#14 0x0000000000470e83 mieqProcessDeviceEvent (Xwayland)
#15 0x000000000048af20 ProcXTestFakeInput (Xwayland)

The fault is again allegedly in the first instruction of XkbCopyKeymap(). There's another oddity here: it claims this is a recursive call, but I don't see a recursive call in the source.
(gdb) #5 XkbCopyKeymap (dst=0x271d9d0, src=0x2744210) at xkbUtils.c:1957
1957 XkbCopyKeymap(XkbDescPtr dst, XkbDescPtr src)
(gdb) up
#6 0x0000000000523e5c in XkbCopyKeymap (src=<optimized out>, dst=<optimized out>) at xkbUtils.c:1965
1965     if (src == dst)
(gdb) list
1960     if (!src || !dst) {
1961         DebugF("XkbCopyKeymap: src (%p) or dst (%p) is NULL\n", src, dst);
1962         return FALSE;
1963     }
1964
1965     if (src == dst)
1966         return TRUE;
1967
1968     if (!_XkbCopyClientMap(src, dst)) {
1969         DebugF("XkbCopyKeymap: failed to copy client map\n");

F***, really hope this isn't another bug in the Spectre microcode updates, à la http://vninja.net/news/curious-case-intel-microcode-part-2-gets-better-worse/ The dates are somewhat suspicious, but the connection is not strong enough for me to figure it out.

> 3. (Broadwell E, H, U/Y; Haswell standard, Core Extreme, ULT) Symptom: Intel has received reports of **unexpected page faults**, which they are currently investigating. Out of an abundance of caution, Intel requested Lenovo to stop distributing this firmware.

I have an i5-5300U, which falls under Broadwell U/Y.

The first crash I have in `coredumpctl` with ProcXTestFakeInput -> ProcessKeyboardEvent -> UpdateDeviceState is on 2018-02-28. I have had such crashes five times since then.

journalctl shows a ucode update to rev 0x28 on Jan 15. On Feb 15, the kernel starts detecting it as "Intel Spectre v2 broken microcode detected; disabling Speculation Control". On Mar 21, I get an update to 0x2a, and it stops being detected as broken. My most recent of these crashes is 2018-03-23.

So I have a number of crashes in slightly different places along this call chain, but always in the first instruction of the called function.

And now look at the secondary crashes in _dl_fixup() (and _Uelf64_lookup_symbol - https://bugzilla.redhat.com/show_bug.cgi?id=1548737). It's as if the common factor for these multiple different SIGBUS crashes is a read inside the mapping of /usr/bin/Xwayland.
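The "read inside the mapping" observation can be checked mechanically. The sketch below is not part of the original report: it takes the two maps lines and the faulting addresses quoted in this bug and tests each address against the mapped ranges. The helper names (`parse_maps_line`, `find_mapping`) are made up for illustration, and it assumes the text mapping was the same in every dump, which is plausible for a binary loaded at a fixed address but has not been verified here.

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into (start, end, perms, path)."""
    fields = line.split(None, 5)
    start_hex, end_hex = fields[0].split("-")
    # The path column is optional (anonymous mappings have none).
    path = fields[5].strip() if len(fields) > 5 else ""
    return int(start_hex, 16), int(end_hex, 16), fields[1], path


def find_mapping(maps_lines, addr):
    """Return (perms, path) of the mapping containing addr, or None."""
    for line in maps_lines:
        start, end, perms, path = parse_maps_line(line)
        if start <= addr < end:  # the end address is exclusive
            return perms, path
    return None


# Maps lines exactly as quoted in this report.
MAPS = [
    "00400000-0060b000 r-xp 00000000 fd:00 1708508 /usr/bin/Xwayland",
    "7ffff14e3000-7ffff1504000 rw-p 00000000 00:00 0 [stack]",
]

# Addresses quoted in this report.
ADDRESSES = {
    "XkbGeomRealloc (first instruction)": 0x5394D0,
    "XkbCopyKeymap (first instruction)": 0x520010,
    "_dl_fixup read (0x41bd78 + 0x8)": 0x41BD78 + 0x8,
    "$sp in the XkbGeomRealloc frame": 0x7FFFF14FFD58,
}

for label, addr in ADDRESSES.items():
    print(f"{label}: {addr:#x} -> {find_mapping(MAPS, addr)}")
```

With only these two lines as input, any address outside both ranges comes back as `None`; on a real dump the full maps file would be passed in instead.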
In the initial crash, it's a read of the first instruction when jumping to a new function. When there's a secondary crash, it's when we try to read some of the ELF data to generate a backtrace. Coincidence?

I looked at the _dl_fixup() crash in detail. It fetches from (0x41bd78 + 0x8), which *should* be in-bounds for reading. Just like we think that 0x5394d0 <XkbGeomRealloc> should have been in-bounds for executing. But the read dies with SIGBUS.

00400000-0060b000 r-xp 00000000 fd:00 1708508 /usr/bin/Xwayland

(gdb) #0 _dl_fixup (l=0x7f7ad34c6130, reloc_arg=203) at ../elf/dl-runtime.c:73
73   const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];
(gdb) list
68     = (const void *) D_PTR (l, l_info[DT_SYMTAB]);
69   const char *strtab = (const void *) D_PTR (l, l_info[DT_STRTAB]);
70
71   const PLTREL *const reloc
72     = (const void *) (D_PTR (l, l_info[DT_JMPREL]) + reloc_offset);
73   const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];
(gdb) disassemble
Dump of assembler code for function _dl_fixup:
   0x00007f7ad32aebd0 <+0>: push %rbx
   0x00007f7ad32aebd1 <+1>: mov %rdi,%r10
   0x00007f7ad32aebd4 <+4>: mov %esi,%esi
   0x00007f7ad32aebd6 <+6>: lea (%rsi,%rsi,2),%rdx
   0x00007f7ad32aebda <+10>: sub $0x10,%rsp
   0x00007f7ad32aebde <+14>: mov 0x68(%rdi),%rax
   0x00007f7ad32aebe2 <+18>: mov 0x8(%rax),%rdi
   0x00007f7ad32aebe6 <+22>: mov 0xf8(%r10),%rax
   0x00007f7ad32aebed <+29>: mov 0x8(%rax),%rax
   0x00007f7ad32aebf1 <+33>: lea (%rax,%rdx,8),%r8
   0x00007f7ad32aebf5 <+37>: mov 0x70(%r10),%rax
=> 0x00007f7ad32aebf9 <+41>: mov 0x8(%r8),%rcx
(gdb) info registers
...
r8 0x41bd78 4308344
...
(gdb) p reloc
$1 = (const Elf64_Rela * const) 0x41bd78
(gdb) p *reloc
$2 = {r_offset = 8443504, r_info = 936302870535, r_addend = 0}

The good news is it sounds more like it started after the last update to X / Xwayland. My first crash was on 2018-02-28, but mine have been less frequent than Brian's.
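The arithmetic behind the faulting `mov 0x8(%r8),%rcx` can be replayed from the gdb values above. This is a sketch under stated assumptions rather than anything taken from the report: it assumes the standard 24-byte `Elf64_Rela` layout (which matches the two `lea` instructions scaling `reloc_arg` by 3 and then by 8), and the `jmprel` base it prints is derived by subtraction, not read from the coredump. The function names are illustrative.

```python
RELA_SIZE = 24           # sizeof(Elf64_Rela): r_offset, r_info, r_addend, 8 bytes each
R_INFO_OFFSET = 8        # r_info is the second 8-byte field
R_X86_64_JUMP_SLOT = 7   # PLT relocation type on x86-64


def rela_entry_address(jmprel, reloc_arg):
    """Address of PLT relocation entry number reloc_arg."""
    return jmprel + reloc_arg * RELA_SIZE


def elf64_r_sym(r_info):
    """Symbol table index: upper 32 bits of r_info (ELF64_R_SYM)."""
    return r_info >> 32


def elf64_r_type(r_info):
    """Relocation type: lower 32 bits of r_info (ELF64_R_TYPE)."""
    return r_info & 0xFFFFFFFF


# Values copied from the gdb session quoted above.
reloc = 0x41BD78
reloc_arg = 203
r_offset = 8443504
r_info = 936302870535

# Derived, not observed: where DT_JMPREL would have to start.
jmprel = reloc - reloc_arg * RELA_SIZE
fault_addr = reloc + R_INFO_OFFSET

print(f"derived jmprel base : {jmprel:#x}")
print(f"faulting read       : {fault_addr:#x}")
print(f"r_offset            : {r_offset:#x}")
print(f"symbol index / type : {elf64_r_sym(r_info)} / {elf64_r_type(r_info)}")
```

The relocation type decodes to 7 (`R_X86_64_JUMP_SLOT`), which is what a lazy PLT fixup would be expected to process, so the entry gdb reads back from the coredump looks well-formed even though the live read of it faulted.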
$ rpm -q --last xorg-x11-server-Xwayland
xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64 Wed 21 Feb 2018 22:41:54 GMT

In all the coredumps I have, $_siginfo (and the `sip` param in OsSigHandler()) seems to have ->si_addr consistent with the above (i.e. AFAICT it should really be in-bounds), and ->si_code == 2...

#define BUS_ADRERR 2 /* non-existent physical address */

I think this is supposed to mean that you have mapped a file, you accessed within the mapping, and within the permissions of the mapping, but the page is actually beyond the end of the file. Or, reading the page from disk failed (IO error).

I don't get how I can have this *variety* of backtraces, each internally consistent, unless there's a defect somewhere in the underlying platform :(. CPU bug, kernel bug, hardware fault?

** Kernel upgrade 4.14 -> 4.15 occurred on 2018-02-18. HMM. **

Just to make sure: I have already run `rpm --verify xorg-x11-server-Xwayland-1.19.6-5.fc27.x86_64` and `rpm --verify --all`, without finding any problem with my installed files. And *gdb* seems to be able to access these addresses OK, when debugging the coredump.

https://bugzilla.redhat.com/show_bug.cgi?id=1557899 is another report, currently not marked as duplicate, from someone with crash-on-resume symptoms, originally on Xwayland, but who switched to Xorg. Note that this is again a bus error, and the address which faults (si_addr) is 0x4b7ac0 - xf86CrtcRotate+0x0 - again inside the text mapping, this time of /usr/bin/Xorg specifically.

There are tens of buckets of this pattern in FAF. After sorting by "count", the earliest I found so far started on 2018-02-18. So that pushes it back a little bit. That date matches *exactly* when I upgraded my kernel from v4.14 to v4.15. I often apply (and reboot into) kernel upgrades within 24 hours of availability, so I think this is a plausible match. (To disclose inconvenient points also: that first crash is only 3 days after I upgraded Xwayland to 1.19.6-5.fc27. I think the kernel is a better match based on the overall evidence though).

https://retrace.fedoraproject.org/faf/reports/2050554/

(All you have to do is look up xorg-x11-server in FAF, and look for _dl_fixup(), _Ux86..., or _Uelf... Every one of those is a SIGBUS, and they're all somewhere along this same call chain. Or they're the Xorg one. Some of the buckets contain hundreds of individual reports).

So why would Xorg and Xwayland be so special? Xwayland in particular should be very boring. Well, notice how the fault is always in the text segment, which is at the lowest mapped address. Xorg and Xwayland are the *only* daemon binaries on my system that `file` describes as "ELF 64-bit LSB executable". The others are all "shared object"s - meaning position-independent (relocatable) code. They get loaded to much higher addresses - looks like ASLR. The X servers not being built as relocatable code (and the reason) is noted here: https://bugzilla.redhat.com/show_bug.cgi?id=1543960
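The "executable" versus "shared object" distinction that `file` reports comes from the `e_type` field of the ELF header, so it can also be read directly. The following is a minimal sketch, not something used in this report: `elf_type` and `elf_type_of_file` are hypothetical helper names, and only the two type values relevant here are decoded.

```python
import struct

ET_EXEC = 2  # fixed-address executable ("executable" in file's output)
ET_DYN = 3   # position-independent object ("shared object" in file's output)


def elf_type(header):
    """Decode e_type from the first 18 bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[5] is EI_DATA: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    names = {ET_EXEC: "ET_EXEC", ET_DYN: "ET_DYN"}
    return names.get(e_type, f"other ({e_type})")


def elf_type_of_file(path):
    """Read just enough of the file at path to decode its e_type."""
    with open(path, "rb") as f:
        return elf_type(f.read(18))
```

Calling `elf_type_of_file("/usr/bin/Xwayland")` on an affected system would be expected to return `"ET_EXEC"` if the observation above holds, while the other daemons would return `"ET_DYN"`; that has not been run as part of this report.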