Bug 768300
Summary: | [abrt] bash-4.2.20-1.fc16: _dl_x86_64_restore_sse: Process /bin/bash was killed by signal 4 (SIGILL) | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Bas <basvansummeren> | ||||||||||||
Component: | glibc | Assignee: | Jeff Law <law> | ||||||||||||
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||||
Severity: | unspecified | Docs Contact: | |||||||||||||
Priority: | unspecified | ||||||||||||||
Version: | 16 | CC: | aoliva, fweimer, jakub, law, maxamillion, mishu, pmachata, rrakus, schwab | ||||||||||||
Target Milestone: | --- | ||||||||||||||
Target Release: | --- | ||||||||||||||
Hardware: | x86_64 | ||||||||||||||
OS: | Unspecified | ||||||||||||||
Whiteboard: | abrt_hash:7ff8871fa03c9becb11540a50e91308f60486937 | ||||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||
Clone Of: | Environment: | ||||||||||||||
Last Closed: | 2012-03-05 19:28:23 UTC | Type: | --- | ||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||
Documentation: | --- | CRM: | |||||||||||||
Verified Versions: | Category: | --- | |||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
Embargoed: | |||||||||||||||
Attachments: |
|
Description
Bas
2011-12-16 09:42:57 UTC
Created attachment 547676 [details]
File: coredump
Created attachment 547677 [details]
File: smolt_data
Created attachment 547678 [details]
File: var_log_messages
Created attachment 547679 [details]
File: maps
Created attachment 547680 [details]
File: backtrace
*** Bug 768306 has been marked as a duplicate of this bug. *** *** Bug 768343 has been marked as a duplicate of this bug. *** Seems like bug in glibc (or gcc?). The offending instruction is in AVX block. Could it be that glibc was built on AVX-enabled machine, but the reporter's system doesn't have that support? (Note that the process was killed by SIGILL.) Hm, except that in that case the check before the block should have fired. *** Bug 768305 has been marked as a duplicate of this bug. *** To original reporter: could you please show an output from /proc/cpuinfo? (gdb) x/gx 0x3475222f78 0x3475222f78: 0x0000000000000000 0x3475222f78 is an address where have_avx is stored. This is consistent in all these duplicate core dumps. Zero implies that that field hasn't been initialized yet. We are hitting this inside _dl_x86_64_restore_sse. I went through the code of _dl_x86_64_save_sse, and the AVX detection seems correct (in line with Intel documentation anyway, I don't have a machine with that feature handy). In any case, it should end in initializing that value to either +1 or -1. Maybe we are just neglecting to call _*save_sse under some circumstances? I'll have to look in. Another curious thing that I noticed: %fs is set to 0. I don't understand how this could be, given the code that follows (vmovdqa %fs:0x80,%ymm0 etc.) And finally, there's some strange prof_agent.so mapped in the address space. Any chance this BZ is related to this upstream bug report: http://sourceware.org/bugzilla/show_bug.cgi?id=13583 What's interesting here is the fix for 13583 is it twiddles the how we check for YMM_Usable in init_cpu_features. It's a related area, but I don't think it's directly related. The code that I'm looking into is cited as an example of how to do the check correctly. I'll have to look into the overall logic some more, to connect the dots between _save_sse and _restore_sse. My current working theory is that _save_sse is somehow not called, because if it were called, the value of have_avx would have to be non-zero. (As a side note, this doesn't seem to be a result of a race. Even if several threads ended up in this region at once, they would just all consistently conclude whether the CPU supports AVX, and write that result in that memory location. It's a waste of cycles in the worst case.) Presumably all we have to work with are the core dumps and haven't been able to reproduce in-house. I wonder if there's any nonobvious ways to get into that restore code. Anyway, I'll keep monitoring and throw in any ideas that come up. Is there any chance this is related to 720176? I realize this bug is in the restore path and the other is in the save path, but it's an awful interesting coincidence... Whoops. I meant to ask if this was related to 752122 which shows a similar failure, but in the save path. Jeff How reproducible is this? Like Petr I'm really confused as to how we can get into this hunk of code with the value of have_avx being zero. If you can consistently trigger this failure, what happens if you run it under valgrind? Can you get an strace of the process from start to finish? Any chance you could give Petr or myself access to whatever machine where this is happening? One of the things I find myself wondering is perhaps there's a wild pointer clobbering have_avx or something along those lines. The only way to find that is likely to be able reproduce the failure. FWIW, I tried running this under valgrind on the same theory (there's a test binary in one of the duplicates) but it came out clean. FWIW, from the data in Smolt it seems the machine does not support AVX, but that of course doesn't explain anything. Petr, I missed that tidbit. So the illegal instruction fault makes sense... Clearly we shouldn't be getting into this code. The fact that have_avx is zero I think means one of two things. One something might be clobbering have_avx. valgrind isn't likely to catch that unless something has passed bogus args to memcpy or something similar. Or two that the return address on the stack got twiddled causing a return insn to jump into this code rather than the proper return address. Perhaps we could set a gdb hardware watchpoint on the have_avx location and see who changes it over time using whatever testcase you have in the duplicate.. Or maybe the reporter used hibernation? Which could be similar to bug #551570. I doubt this is related to 551570; the overall structure of the backtraces is radically different. I think we also have to consider the possibility that rtld_must_xmm_save is getting clobbered. If it were somehow set to zero after RTLD_ENABLE_FOREIGN_CALL, but before RTLD_FINALIZE_FOREIGN_CALL without a corresponding call to save_sse, then we might get this behaviour. I've been looking into this, and one think I found puzzling is that in all cases the bug hit, _dl_fixup has old_rtld_must_xmm_save == 1, even though AFAICT this should only ever happen when we're within another _dl_fixup call upthread, but there's no such additional _dl_fixup upthread. Furthermore, all cases involve the dlight profiler, which might be messing things up somehow. Maybe we should have a look at its source code? I'd missed the fact that for old_rtld_must_xmm_save to have a nonzero value that we must already have a thread inside dl_fixup. Hmm, I wonder if there's some kind of race which causes us to muck up those state variables and by the time we see the fault the other thread has left dl_fixup. A comment prior to the definition of RTLD_ENABLE_FOREIGN_CALL hints that locking is not needed, but perhaps it's wrong. /* NB: Don't use the xchg operation because that would imply a lock prefix which is expensive and unnecessary. The cache line is also not contested at all. */ This is thread-local state, so whatever other threads do (short of scribbling over other threads' static TLS segments:-) is irrelevant. It's the failing thread's own thread-local storage segment that seems to have been messed up. If we had a reliable reproducer, setting a hardware watchpoint on that TLS location would help us figure out what's going wrong. Do we have one? No reliable reproducer. Petr got access to a similar machine but was unable to trigger the failure there. I'm not sure how long it's been since we've heard from the original submitter, but I'm starting to think we may have to close as INSUFFICIENT_DATA if we don't get some response from them soon. Bas: What we really need is a semi-reliable way to reproduce this problem. To date nobody has been able to identify a way to get into the failing routine without there having been memory corruption or similar problem. Without a way to reproduce this particular failure it's unlikely we'll be able to resolve it. |