| Summary: | [abrt] xscreensaver-gl-extras-5.15-3.fc16: slow2: Process /usr/libexec/xscreensaver/atlantis was killed by signal 11 (SIGSEGV) | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | fabian <fsanrame> | ||||||||
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||||||
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||
| Severity: | unspecified | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 16 | CC: | bobkaiser1, gansalmon, itamar, jakub, jonathan, kernel-maint, law, madhu.chinakonda, mtasaka, schwab | ||||||||
| Target Milestone: | --- | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | x86_64 | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | abrt_hash:b79475c4da063d3f404d846e50f41882d51d7951 | ||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2012-09-04 17:47:19 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Attachments: |
Description
fabian
2012-04-09 23:44:35 UTC
Created attachment 576334 [details]
File: maps
Created attachment 576335 [details]
File: backtrace
Looks like calling sin(-1.5705) caused the segfault? Asking the glibc maintainer for help for now.
Well, I certainly can't trigger that behaviour with the obvious testcase. It's possible that -1.5707 isn't the actual value causing the problem. What's strange here is that for -1.5705 we shouldn't get into the "slow2" routine to start with, at least not in my testing. Once in slow2, the actual fault occurs because an index into the sin/cos table is out of range.
=> 0x000000358681ce80 <+3104>: movsd (%rax,%rcx,8),%xmm14
rax 0x35868725c0 229890270656
rcx 0x6e978d50 1855425872
$rax corresponds to the sin/cos table; $rcx should be the index into the table. The effective address is $rcx * 8 + $rax = 0x38fb439040, which isn't part of any mapped area. The table ought to be contained within this address range:
3586800000-3586883000 r-xp 00000000 08:01 3793 /lib64/libm-2.14.90.so
Looking backwards from the fault we have:
0x000000358681ce6b <+3083>: movslq %edx,%rcx
$rdx has the value:
rdx 0x6e978d50 1855425872
Continuing to work backwards in the insn stream we have:
0x000000358681ce57 <+3063>: mov 0x8(%rsp),%rdx
0x000000358681ce68 <+3080>: shl $0x2,%edx
Which looks like a standard index computation using whatever was at $rsp + 0x8.
0x000000358681ce4c <+3052>: movsd %xmm1,0x8(%rsp)
Where %xmm1 is the result of arithmetic on other xmm regs. Unfortunately the backtrace file doesn't include the xmm register data. Is there still a core file anywhere we could use to extract that information? The core file would also tell us whether -1.5705 is the actual value causing the problem or some value very close to -1.5705.
FWIW, I can't trigger the failure using -1.5705. Is there any chance the rounding mode has been changed by atlantis or its component libraries?
*** Bug 810687 has been marked as a duplicate of this bug. ***
*** Bug 808846 has been marked as a duplicate of this bug. ***
*** Bug 808847 has been marked as a duplicate of this bug. ***
*** Bug 810684 has been marked as a duplicate of this bug. ***
Could you possibly bundle up the contents of /var/spool/abrt and attach them to this BZ, or send them to me privately (law)? There's information I need to debug this further that is in those files but not provided by abrt.
Inserted blank DVD+R. Selected Open CD/DVD Creator at prompt. Was working in CD/DVD Creator when ABRT displayed gnome-system-monitor crash message.
backtrace_rating: 4
Package: gnome-system-monitor-3.2.1-2.fc16
OS Release: Fedora release 16 (Verne)
Created attachment 578769 [details]
File: backtrace
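The effective-address arithmetic from the first analysis above (the faulting movsd with $rax as the table base and $rcx as the index) can be double-checked with a few lines of Python. This is only a sketch: the register values and the libm mapping are copied from the quoted backtrace and maps output, and the helper names are made up for illustration.

```python
# Values copied from the backtrace and maps excerpts in this report.
RAX = 0x35868725C0          # base register: should point at the sin/cos table
RCX = 0x6E978D50            # index register used by the faulting movsd
LIBM_START = 0x3586800000   # start of the libm-2.14.90.so r-xp mapping
LIBM_END = 0x3586883000     # end of that mapping (exclusive)


def effective_address(base, index, scale=8):
    """Address formed by the (%rax,%rcx,8) operand, wrapped to 64 bits."""
    return (base + index * scale) & 0xFFFFFFFFFFFFFFFF


def in_mapping(addr, start, end):
    """True if addr falls inside the half-open range [start, end)."""
    return start <= addr < end


addr = effective_address(RAX, RCX)
print("table base inside libm:", in_mapping(RAX, LIBM_START, LIBM_END))
print("faulting address:", hex(addr))
print("faulting address inside libm:", in_mapping(addr, LIBM_START, LIBM_END))
```

The base register lands inside the libm mapping while the computed address does not, which is consistent with a good table pointer combined with a bogus index.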
Bob, what I really need are the contents of /var/spool/abrt. The backtraces produced by the abrt tool are missing information that is critical to fully analyzing this problem. I really can't make any more progress without the actual core dumps.
*** Bug 813724 has been marked as a duplicate of this bug. ***
Bob sent me a core dump offline and it's been very helpful, but I still don't know exactly what's happening.
The analysis will be specific to the core dump Bob sent, but I'm confident that the underlying problem, whatever it is, is common to all these bug reports.
Looking at the relevant source in sin.c we have:
134 else if (k < 0x400368fd ) {
136 y = (m>0)? hp0.x-x:hp0.x+x;
137 if (y>=0) {
138 u.x = big.x+y;
139 y = (y-(u.x-big.x))+hp1.x;
140 }
[ ... ]
148 k=u.i[LOW_HALF]<<2;
149 sn=sincos.x[k];
150 ssn=sincos.x[k+1];
151 cs=sincos.x[k+2];
152 ccs=sincos.x[k+3];
It's worth noting lines #136, #138 & #148. I'm actually going to work backwards from the fault point, which occurs when we access the sincos array.
The fault is because of an out-of-range memory access due to a bogus index into the sincos array.
0x3679c1c4e8 <__sin+648>: shl $0x2,%edx
[ ... ]
0x3679c1c50f <__sin+687>: lea 0x2(%rdx),%esi
[ ... ]
0x3679c1c528 <__sin+712>: movslq %esi,%rsi
=> 0x3679c1c52b <__sin+715>: movsd (%rax,%rsi,8),%xmm14
$rsi has the value:
$65 = 0xffffffffe0b5cd9a
$rsi was set at __sin+712 where %esi had the value:
$67 = 0xe0b5cd9a
%rsi was set at __sin+687 where %rdx had the value
$68 = 0xe0b5cd98
%rdx had been set at __sin+648 and we can deduce its prior value to be
$73 = 0x382d7366 ($68 >> 2)
The value in %edx should come from __sin+2645:
0x3679c1cc98 <__sin+2616>: movapd %xmm1,%xmm0
0x3679c1cc9c <__sin+2620>: movsd 0x2855b(%rip),%xmm12 # 0x3679c45200 <hpi1>
0x3679c1cca5 <__sin+2629>: addsd %xmm11,%xmm0
0x3679c1ccaa <__sin+2634>: movsd %xmm0,0x8(%rsp)
0x3679c1ccb0 <__sin+2640>: subsd %xmm11,%xmm0
0x3679c1ccb5 <__sin+2645>: mov 0x8(%rsp),%rdx
0x3679c1ccba <__sin+2650>: subsd %xmm0,%xmm1
0x3679c1ccbe <__sin+2654>: addsd %xmm12,%xmm1
0x3679c1ccc3 <__sin+2659>: jmpq 0x3679c1c4e3 <__sin+643>
And the value *($sp + 8) is:
0x7fffe144fec8: 0x382d7366
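The register chain above can be replayed forwards from the stack word to confirm that it reproduces the faulting $rsi. This is a rough sketch that models only the three instructions quoted (shl, lea, movslq) with 32-bit and 64-bit wraparound; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend_32_to_64(value):
    """Model movslq: sign-extend a 32-bit value into a 64-bit register."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32
    return value & MASK64


def replay_index(stack_word):
    """Replay the index computation starting from the low word at 8(%rsp)."""
    edx = (stack_word << 2) & MASK32   # shl $0x2,%edx
    esi = (edx + 2) & MASK32           # lea 0x2(%rdx),%esi
    rsi = sign_extend_32_to_64(esi)    # movslq %esi,%rsi
    return edx, esi, rsi


edx, esi, rsi = replay_index(0x382D7366)
print(hex(edx), hex(esi), hex(rsi))
```

Starting from the word found on the stack, the replay yields the same %rdx, %esi and $rsi values that gdb reported, so the bad index is fully explained by what was stored at 8(%rsp).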
We can see that *(sp + 8) was set from $xmm0, which is unfortunate as $xmm0 can't be recovered. However, $xmm11 is still available and is particularly interesting. $xmm11 should be the value "big" as set at __sin+569:
0x3679c1c499 <__sin+569>: movsd 0x560ee(%rip),%xmm11 # 0x3679c72590 <big>
0x3679c1c4a2 <__sin+578>: ucomisd %xmm3,%xmm1
0x3679c1c4a6 <__sin+582>: jae 0x3679c1cc98 <__sin+2616>
(gdb) p $xmm11
$74 = {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {
0 <repeats 16 times>}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0,
0, 0, 0}, v2_int64 = {0, 0}, uint128 = 0}
And to verify the value of big in memory is correct:
(gdb) p big
$75 = {i = {0, 1120403456}, x = 52776558133248}
(gdb) p &big
$76 = (const mynumber *) 0x3679c72590
Yow! I've confirmed there should be no path from when we set $xmm11 to using it for "big" where it could possibly be clobbered. This is very significant. Continuing the process of working backwards:
The value at *(sp + 8) is
(gdb) p *(double *)($sp + 8)
$61 = 0.52359877559829893
Which coincidentally is hp0.x - x (see line #136)
(gdb) p hp0.x - x
$78 = 0.52359877559829893
Which is exactly the value I would expect given the incorrect value in $xmm11. So in effect, by clobbering $xmm11 line #138 becomes a copy from y into u.x.
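To illustrate why a zeroed "big" turns line #138 into a plain copy and wrecks the index, here is a small sketch of lines #138 and #148 only. It assumes IEEE-754 doubles with the low 32 bits of the representation playing the role of u.i[LOW_HALF], as on x86_64; the function names are invented, and the rest of sin.c is not modelled.

```python
import struct

BIG = 52776558133248.0        # value of big.x as printed by gdb
Y = 0.52359877559829893       # value found at sp + 8 (hp0.x - x)


def low_word(x):
    """Low 32 bits of the IEEE-754 double x (u.i[LOW_HALF] on x86_64)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & 0xFFFFFFFF


def table_index(big, y):
    """Model u.x = big.x + y; k = u.i[LOW_HALF] << 2 with a 32-bit int k."""
    k = (low_word(big + y) << 2) & 0xFFFFFFFF
    return k - (1 << 32) if k & 0x80000000 else k


print("index with the real big:", table_index(BIG, Y))
print("index with big clobbered to 0:", table_index(0.0, Y))
```

With the real constant, the addition pushes y's leading bits into the low word and yields a small non-negative index. With big equal to zero, the low word is simply the low half of y's own bit pattern, which matches the 0x382d7366 seen on the stack and becomes a large negative index after the shift.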
The only conclusion I can reach given this data is that something has clobbered the value of $xmm11 between the point where we loaded it at address sin+569 and its use at sin+2629. It's the clobbering of $xmm11 which causes the computations to produce the wrong result, ultimately producing a wrong index into the sincos array.
Now it may look like sin+569 to sin+2629 is a large window. But in terms of actual instructions executed it's just 6 actual instructions (after loading $xmm11 we branch to sin+2616).
This really looks like register $xmm11 is getting clobbered by another thread/process and not getting properly restored by the kernel. The 2 reporters are using an AMD 5200 and AMD 4200 (there's 4 reports, but 2 unique reporters). So perhaps it's something specific to that line of AMD processors.
Reassigning to the kernel team. I can be contacted offline for the core dump used in this analysis.
Does this start with a particular kernel version? There was a big rework of the x86 FPU layers by Linus recently.
Related to bug 810668?
Can you reproduce this with the current kernel update?