Bug 154221
Summary: | Thread exits silently via __RESTORE_ALL exception for iret | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | David Simms <david.simms> |
Component: | kernel | Assignee: | Roland McGrath <roland> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.0 | CC: | ihse, jbaron, johan.walles, riel, sta_larsen |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-06-08 15:14:07 UTC | Type: | --- |
Bug Blocks: | 154451, 154972 | ||
Description
David Simms
2005-04-08 15:56:14 UTC
Created attachment 112859 [details]
syslog results of instrumentation
Created attachment 112860 [details]
Instrumentation patch
Added this to the kernel-2.6 spec and built with rpmbuild -bb --target i686.
It is possible for us to supply a repro case involving the JVM.

The exception happens because of trying to restore a bad value into a segment register, %ds or %es. From the registers in your trace output, it looks like %ds and %es both have bogus values (normally both should match %ss, 0x7b). If you clobbered the %ds before the syscall, I think that would cause a trap at the offending segment register load--and it's unlikely you do any segment register loads anyway (unless you are running garbage instructions). If the syscall is a sigreturn, then that will try to load whatever %ds value is in the struct sigcontext you passed it, so that would explain it if you clobbered the struct sigcontext on the stack, either before the call or potentially in a race where a different thread actually clobbers part of your stack.

I'd certainly call it a bug that the process just silently dies here. It ought to give you the same signal it would if you tried the same bad segment register load in your own code. The upstream kernel still does this the same way, I'll look into making it give a signal instead.

Oh sorry, I misread the code--you are certainly hitting a fault in the iret itself rather than the segment register loads. I believe there will be a trap at the iret instruction if the %cs or %ss being restored is invalid, or if the PC being returned to is outside the segment limits. In the Red Hat kernels, %cs has a segment limit to implement the exec-shield functionality when the processor does not support the NX page table bit (which is new in processors in the last year or less). So, a bad PC value here can get you a GP fault in the iret rather than just a page fault after the iret returns to user mode, but with the same meaning as the SIGBUS you'll get there if the PC is invalid but below the segment limit (which all PCs are when the NX page table protection is available).

This still means the likelihood is a clobbered sigcontext being passed to sigreturn--but it's the cs, ss, and eip fields that are what get to this failure mode.

Created attachment 112875 [details]
simple reproducer
This test case reproduces the process dying with a reported SIGSEGV, but
without the signal actually being generated. You can tell (after ulimit -c
unlimited) because it doesn't dump a core. Compile the test with
-DWHATSIG=SIGSEGV, and you can tell because it doesn't iterate back into the
signal handler after returning the first time.
On x86_64, with this program built either 32-bit or 64-bit, the same problem
exists but the wait status is totally bogus (113, which is -9999 & 0x7f) rather
than 11 (SIGSEGV).
Created attachment 112876 [details]
second reproducer using ptrace instead of sigreturn
This is another way to reproduce the same bug, which can also manifest if the
bogus state is poked in by ptrace rather than by a signal handler returning
after clobbering the sigcontext.
Created attachment 112881 [details]
replacement for ptrace reproducer, also bites on native x86_64
This one produces a fault in iret (I think) on x86_64--the failure mode is even
worse there.
Just a note about the signal reproducer: in our complex case we have a sigaltstack for SIGSEGV, and we know we are trying to cause SIGSEGV via an "outs" instruction generating a GP fault, yet the signal stack seems completely unused. Instrumenting our user code showed us that this particular thread never entered any signal handler (none that we know of), so we are worried it wasn't a simple sigcontext clobber.

Which brings me to "part 2" of our problem: what really caused the iret fault? Assuming the first part of the problem gets fixed by providing some kind of diagnostics (signal/panic) showing the real problem, we'd be happy to test such a patch, or to act on suggestions or requests for further information.

Created attachment 112989 [details]
RHEL4 kernel patch
This patch makes the RHEL4 i686 kernel report a proper signal for this
scenario, so it can be debugged via ptrace or produce a core dump.
I can't much speculate on the underlying problem with the information at hand. The patch I've attached should make it easier to diagnose what's going on.

Created attachment 113101 [details]
gdb session with crashed process
Still running the JVM repro, spawning and waiting on threads, performing
illegal outs and handling the signals...seems we get SIGSEGV trapno 13 in
mmap...
Created attachment 113142 [details]
Repro source, executable (RHAS21 compiled) and core file
Finally managed to create a "simple reproducer". The magic here was to compile
the code on RHAS2.1 and then run on RHEL4. Crashes within the minute (included
example core file from iretfault patch RHEL4). I get the same SIGSEGV
(trapno=13) on my illegal code.
Running the test program on the standard (unpatched) kernel reproduces the
previous problem of a thread going "missing", i.e. do_exit from the iret fault.
I can reproduce it using the 2.1 built binary, and will look into what's going on. But I'd like to move that case into a separate bug report. This bug 154221 is for the failure to produce a proper core dump, which we now have a fix for. There should be a separate report for the comment #15 problem, which is still under investigation. If that leads to a kernel fix, it will be a separate one from the 154221 fix.

Sure, no problem, please let us know the new bug number. Thanks for your work on the iret fault btw.

See bug 154972 looking at the "mtbadcode" test case. This bug will track the fix for properly dumping core when all such faults happen.

Created attachment 113589 [details]
RHEL4 kernel patch
This patch for RHEL4 kernels fixes this problem in a way that interacts well
with the exec-shield support code not in upstream kernels.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2005-420.html