Bug 154221 - Thread exits silently via __RESTORE_ALL exception for iret
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Roland McGrath
QA Contact: Brian Brock
Depends On:
Blocks: 154451 154972
Reported: 2005-04-08 11:56 EDT by David Simms
Modified: 2007-11-30 17:07 EST

Doc Type: Bug Fix
Last Closed: 2005-06-08 11:14:07 EDT


Attachments
syslog results of instrumentation (153.66 KB, text/plain)
2005-04-08 11:59 EDT, David Simms
Instrumentation patch (2.27 KB, patch)
2005-04-08 12:01 EDT, David Simms
simple reproducer (687 bytes, text/plain)
2005-04-08 16:33 EDT, Roland McGrath
second reproducer using ptrace instead of sigreturn (1016 bytes, text/plain)
2005-04-08 16:35 EDT, Roland McGrath
replacement for ptrace reproducer, also bites on native x86_64 (1.01 KB, text/plain)
2005-04-08 17:13 EDT, Roland McGrath
RHEL4 kernel patch (4.51 KB, patch)
2005-04-11 16:28 EDT, Roland McGrath
gdb session with crashed process (10.20 KB, text/plain)
2005-04-13 11:19 EDT, David Simms
Repro source, executable (RHAS21 compiled) and core file (31.85 KB, application/octet-stream)
2005-04-14 04:57 EDT, David Simms
RHEL4 kernel patch (4.41 KB, patch)
2005-04-23 20:51 EDT, Roland McGrath

Description David Simms 2005-04-08 11:56:14 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
Related to bug 152012: I instrumented the kernel to find out how our thread "disappeared" when running a Java thread test with JRockit. That is, the thread did not complete but was destroyed by the OS.

The last thing the thread did was try to execute an illegal instruction, for which we have a signal handler installed (SA_RESTART|SA_SIGINFO|SA_ONSTACK).

Sure enough, the thread went through do_exit, but the exit did not originate from a system call or a signal. After exhausting all other possibilities, I instrumented the one remaining candidate: the entry.S macro __RESTORE_ALL has a kernel exception fixup (333,666) for a failed iret (I will attach the instrumentation patch).

We are basically faced with two problems:

1) The __RESTORE_ALL fixup code simply calls "do_exit(11)", causing my thread to EXIT SILENTLY. This can't be right. We need some notification or explanation: a warning in syslog, a core dump, a kernel dump?

2) We have no idea why the iret instruction is failing. Furthermore, we have no simple one-page repro case; we currently need to run a JVM to reproduce it.
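
For readers unfamiliar with the mechanism behind problem 1: the kernel keeps a table that pairs each kernel instruction allowed to fault with the address of its fixup code, and the fault handler looks the faulting address up in that table. A minimal Python model of that lookup follows. The addresses and the names EX_TABLE and find_fixup are hypothetical, chosen only for illustration; the real table lives in the kernel's __ex_table section and is emitted by the assembler.

```python
import bisect

# Hypothetical entries: (address of an instruction that may fault,
#                        address of the fixup code to run instead).
EX_TABLE = sorted([
    (0xC0102F40, 0xC0103A00),
    (0xC0102F58, 0xC0103A10),   # imagine this one is the iret
])


def find_fixup(fault_ip, table=EX_TABLE):
    """Return the fixup address registered for fault_ip, or None."""
    keys = [insn for insn, _fixup in table]
    i = bisect.bisect_left(keys, fault_ip)   # table is sorted, so binary search
    if i < len(keys) and keys[i] == fault_ip:
        return table[i][1]
    return None                               # no entry: a genuine kernel bug


if __name__ == "__main__":
    for ip in (0xC0102F58, 0xC0000000):
        fixup = find_fixup(ip)
        print(hex(ip), "->", hex(fixup) if fixup is not None else "no fixup")
```

In the kernel described in this report, the fixup reached this way for the iret simply called do_exit(11), which is why the thread vanished without any signal being delivered.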

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-5.EL

How reproducible:
Always

Steps to Reproduce:
1. Run Java thread tests.
2. Start causing general protection faults, via use of an illegal instruction.


Actual Results:  Signal stack is completely unused, the thread has exited, and its last instruction is 0x6f (outs).

Expected Results:  Expecting to catch SIGSEGV in a signal handler, and handle the illegal instruction.

Additional info:
Comment 1 David Simms 2005-04-08 11:59:34 EDT
Created attachment 112859
syslog results of instrumentation
Comment 2 David Simms 2005-04-08 12:01:04 EDT
Created attachment 112860
Instrumentation patch

Added this to the kernel-2.6 spec and built with rpmbuild -bb --target i686.
Comment 3 David Simms 2005-04-08 12:43:02 EDT
It is possible for us to supply a repro case involving the JVM.
Comment 4 Roland McGrath 2005-04-08 13:39:01 EDT
The exception happens because of trying to restore a bad value into a segment
register, %ds or %es.  From the registers in your trace output, it looks like
%ds and %es both have bogus values (normally both should match %ss, 0x7b). If
you clobbered the %ds before the syscall, I think that would cause a trap at the
offending segment register load--and it's unlikely you do any segment register
loads anyway (unless you are running garbage instructions).  If the syscall is a
sigreturn, then that will try to load whatever %ds value is in the struct
sigcontext you passed it, so that would explain it if you clobbered the struct
sigcontext on the stack, either before the call or potentially in a race where a
different thread actually clobbers part of your stack.

I'd certainly call it a bug that the process just silently dies here.
It ought to give you the same signal it would if you tried the same bad segment
register load in your own code.  The upstream kernel still does this the same
way; I'll look into making it give a signal instead.
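
To make the 0x7b value above concrete, here is a small Python sketch that splits an x86 segment selector into its three fields, using the standard layout (bits 0-1 requested privilege level, bit 2 table indicator, bits 3-15 descriptor index). The function name decode_selector is made up for this example.

```python
def decode_selector(selector):
    """Split a 16-bit x86 segment selector into (index, table, rpl)."""
    rpl = selector & 0x3                          # requested privilege level
    table = "LDT" if selector & 0x4 else "GDT"    # table indicator bit
    index = selector >> 3                         # descriptor index in that table
    return index, table, rpl


if __name__ == "__main__":
    print(decode_selector(0x7B))   # (15, 'GDT', 3)
```

So 0x7b names GDT entry 15 at privilege level 3, i.e. an ordinary user-mode selector.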
Comment 5 Roland McGrath 2005-04-08 14:00:04 EDT
Oh sorry, I misread the code--you are certainly hitting a fault in the iret
itself rather than the segment register loads.  I believe there will be a trap
at the iret instruction if the %cs or %ss being restored is invalid, or if the
PC being returned to is outside the segment limits.  In the Red Hat kernels, %cs
has a segment limit to implement the exec-shield functionality when the
processor does not support the NX page table bit (which is new in processors in
the last year or less).  So, a bad PC value here can get you a GP fault in the
iret rather than just a page fault after the iret returns to user mode, but with
the same meaning as the SIGBUS you'll get there if the PC is invalid but below
the segment limit (which all PCs are when the NX page table protection is
available).  This still means the likelihood is a clobbered sigcontext being
passed to sigreturn--but it's the cs, ss, and eip fields that are what get to
this failure mode.
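
The distinction drawn here can be summarised with a toy model in Python. It is not kernel code: classify_user_return, its parameters, and the example limit are invented for illustration, and it covers only the instruction-pointer check, not the invalid %cs or %ss cases.

```python
def classify_user_return(eip, cs_limit, is_mapped):
    """Toy model: what happens when the kernel returns to user address eip.

    cs_limit  -- highest offset allowed by the user code segment
    is_mapped -- callable telling whether an address is backed by a page
    """
    if eip > cs_limit:
        # The limit check is made by iret itself, so the fault is
        # taken while still in kernel mode.
        return "GP fault in iret"
    if not is_mapped(eip):
        # iret succeeds; the first user-mode instruction fetch then faults.
        return "page fault in user mode"
    return "resumes normally"


if __name__ == "__main__":
    limit = 0x0FFFFFFF                                   # hypothetical segment limit
    mapped = lambda addr: 0x08048000 <= addr < 0x08049000
    for eip in (0x08048123, 0x00001000, 0x40000000):
        print(hex(eip), "->", classify_user_return(eip, limit, mapped))
```

Only the first outcome goes through the iret fixup path that this bug is about.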
Comment 6 Roland McGrath 2005-04-08 16:33:56 EDT
Created attachment 112875
simple reproducer

This test case reproduces the process dying with a reported status of SIGSEGV
without the signal actually being generated.  You can tell (after ulimit -c
unlimited) because it doesn't dump a core.  Compile the test with
-DWHATSIG=SIGSEGV, and you can tell because it doesn't iterate back into the
signal handler after returning the first time.
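
The same check (did the process really die from a signal, and did it dump core?) can also be made from a parent process. A rough Python sketch, not the attached reproducer: it forks a child that kills itself with a genuinely delivered SIGSEGV and decodes the wait status. The helper name run_and_report is hypothetical, and the child sets its core limit to 0 so the example does not leave core files behind on systems that write them straight to disk.

```python
import os
import resource
import signal


def run_and_report(signum=signal.SIGSEGV):
    """Fork a child that dies from signum; return the decoded wait status."""
    pid = os.fork()
    if pid == 0:
        # Child: default disposition, no core file, then signal ourselves.
        try:
            resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
            signal.signal(signum, signal.SIG_DFL)
            os.kill(os.getpid(), signum)
        finally:
            os._exit(99)          # only reached if the signal did not kill us
    _, status = os.waitpid(pid, 0)
    signaled = os.WIFSIGNALED(status)
    return {
        "signaled": signaled,
        "signal": os.WTERMSIG(status) if signaled else None,
        "core_flag": os.WCOREDUMP(status),
    }


if __name__ == "__main__":
    print(run_and_report())
```

With a nonzero core limit, a genuine SIGSEGV would normally set the core flag as well; the failure described in this bug reports signal 11 without it, which is why the status alone is not enough to tell the two apart.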

On x86_64, with this program built either 32-bit or 64-bit, the same problem
exists but the wait status is totally bogus (113, which is -9999&0x7f) rather
than 11 (SIGSEGV).
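
The arithmetic behind the 113 can be checked directly. The sketch below assumes the conventional Linux wait-status layout for a signal death (low 7 bits hold the signal number, bit 7 is the core-dump flag) and assumes the raw value passed to do_exit is what the parent sees; decode_termination is a name invented for this example.

```python
def decode_termination(raw_status):
    """Decode the signal-death fields of a wait status given as a Python int."""
    low = raw_status & 0xFF      # & on a negative Python int behaves as two's complement
    return {"signal": low & 0x7F, "core_flag": bool(low & 0x80)}


if __name__ == "__main__":
    print(decode_termination(11))      # {'signal': 11, 'core_flag': False}
    print(decode_termination(-9999))   # {'signal': 113, 'core_flag': True}
    print(-9999 & 0xFF)                # 241
```

The low byte of -9999 is 241; masking off the core-dump bit leaves 113, the signal number that gets reported.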
Comment 7 Roland McGrath 2005-04-08 16:35:34 EDT
Created attachment 112876
second reproducer using ptrace instead of sigreturn

This is another way to reproduce the same bug, which can also manifest if the
bogus state is poked in by ptrace rather than by a signal handler returning
after clobbering the sigcontext.
Comment 8 Roland McGrath 2005-04-08 17:13:59 EDT
Created attachment 112881
replacement for ptrace reproducer, also bites on native x86_64

This one produces a fault in iret (I think) on x86_64--the failure mode is even
worse there.
Comment 10 David Simms 2005-04-10 12:32:47 EDT
Just a note about the signal reproducer: in our complex case we have a
sigaltstack for SIGSEGV, and we know we are trying to cause SIGSEGV via an "outs"
instruction generating a GP fault, yet the signal stack seems completely unused.
Instrumenting our user code showed that this particular thread never entered any
signal handler (none that we know of), so we are worried it wasn't a simple
sigcontext clobber.

Which brings me to "part 2" of our problem: what really caused the iret fault?
Assuming the first part of the problem is fixed by providing some kind of
diagnostics (signal/panic) showing the real problem, we'd be happy to test such
a patch or to respond to suggestions or requests for further information.
Comment 12 Roland McGrath 2005-04-11 16:28:33 EDT
Created attachment 112989
RHEL4 kernel patch

This patch makes the RHEL4 i686 kernel report a proper signal for this
scenario, so it can be debugged via ptrace or produce a core dump.
Comment 13 Roland McGrath 2005-04-11 16:49:10 EDT
I can't speculate much on the underlying problem with the information at hand.
The patch I've attached should make it easier to diagnose what's going on.
Comment 14 David Simms 2005-04-13 11:19:38 EDT
Created attachment 113101
gdb session with crashed process

Still running the JVM repro, spawning and waiting on threads, performing
illegal outs and handling the signals...seems we get SIGSEGV trapno 13 in
mmap...
Comment 15 David Simms 2005-04-14 04:57:47 EDT
Created attachment 113142
Repro source, executable (RHAS21 compiled) and core file

Finally managed to create a "simple reproducer". The magic here was to compile
the code on RHAS2.1 and then run it on RHEL4. It crashes within a minute (I have
included an example core file from a RHEL4 kernel with the iretfault patch). I
get the same SIGSEGV (trapno=13) on my illegal code.

Running the test program on the standard (unpatched) kernel reproduces the
previous problem of a thread going "missing", i.e. do_exit from the iret fault.
Comment 17 Roland McGrath 2005-04-14 18:49:01 EDT
I can reproduce it using the 2.1 built binary, and will look into what's going on.
But I'd like to move that case into a separate bug report.  This bug 154221 is
for the failure to produce a proper core dump, which we now have a fix for.
There should be a separate report for the comment #15 problem, which is still
under investigation.  If that leads to a kernel fix, it will be a separate one
from the 154221 fix.
Comment 18 David Simms 2005-04-15 03:14:59 EDT
Sure, no problem, please let us know the new bug number.

Thanks for your work on the iret fault btw.
Comment 19 Roland McGrath 2005-04-15 03:25:10 EDT
See bug 154972, which looks at the "mtbadcode" test case.
This bug will track the fix for properly dumping core when all such faults happen.
Comment 23 Roland McGrath 2005-04-23 20:51:30 EDT
Created attachment 113589
RHEL4 kernel patch

This patch for RHEL4 kernels fixes this problem in a way that interacts well
with the exec-shield support code not in upstream kernels.
Comment 28 Tim Powers 2005-06-08 11:14:07 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-420.html
