Bug 139437

Summary: [RHEL3-U4][crash] bt -a hangs
Product: Red Hat Enterprise Linux 3 Reporter: Yuuichi Nagahama <nagahama>
Component: netdump Assignee: Tatsuo Uchida <tuchida>
Status: CLOSED ERRATA QA Contact: David Lawrence <dkl>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: aimamura, akiyama.nobuyuk, anderson, indou.takao, linux-scsi, makita, mayuzumi.masa, nagahama, ntachino, tburke, tuchida, watanabe.mas-20
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2005-186 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-05-19 12:47:06 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yuuichi Nagahama 2004-11-15 22:58:47 UTC
Description of problem:
crash-3.8.3
bt -a hangs. Prompt doesn't come back.


Version-Release number of selected component (if applicable):
RHEL3-U4

How reproducible: 50%


Steps to Reproduce:
1. crash
2. bt -a
  
Actual results: hang after bt -a. Prompt doesn't come back.


Expected results: bt -a works.
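
Because the hang reproduces only about half the time, a timeout wrapper can help tell a hang apart from a slow run. The following is a minimal sketch, not part of the original report. It assumes crash reads its commands from standard input when it is not attached to a terminal; the helper name run_with_timeout and the timeout value are illustrative.

```python
import subprocess


def run_with_timeout(argv, stdin_text, timeout_s):
    """Run argv, feed it stdin_text, and return (hung, stdout).

    hung is True when the process did not exit within timeout_s seconds;
    subprocess.run kills the child before raising TimeoutExpired.
    """
    try:
        done = subprocess.run(
            argv,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # No prompt came back in time: treat this run as a hang.
        return True, ""
    return False, done.stdout
```

A call such as run_with_timeout(["crash", vmlinux_path, dumpfile_path], "bt -a\nquit\n", 120) would then report hung=True for the case described here; both path arguments are placeholders for the local namelist and dumpfile.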


Additional info:
Dave Anderson investigated. His comment is
----------
This bt "hang" was the result of the IPI that was sent
from the diskdump process catching PID 3822 just after it had
entered the kernel to do a system_call, but before it had a chance
to call the actual system call handler function.  I have *never*
seen this before, and the backtrace code is not equipped to
even handle such a situation!

I fixed it with a kludge (the /usr/bin/crash on 192.168.78.227 has
been updated to a temporary version 3.8-5.6a).  The trace looks
like this:

crash> bt 3822
PID: 3822   TASK: f050c000  CPU: 7   COMMAND: "dd"
 #0 [f050df84] smp_call_function_interrupt at 211d18f
 #1 [f050df8c] call_call_function_interrupt at 23eee2f
    EAX: 00000004  EBX: 00000001  ECX: 084cb000  EDX: 00101000  EBP: feffa968
    DS:  0068      ESI: 084cb000  ES:  0068      EDI: 00000000
    CS:  0060      EIP: fffd7027  ERR: fffffffb  EFLAGS: 00000286
 #2 [f050dfc0] system_call at 23ee027
    EAX: 00000004  EBX: 00000001  ECX: 084cb000  EDX: 00000200
    DS:  002b      ESI: 084cb000  ES:  002b      EDI: 00000000
    SS:  002b      ESP: feffa948  EBP: feffa968
    CS:  0023      EIP: 001df9fe  ERR: 00000004  EFLAGS: 00000246
crash>

It's interesting -- never have I seen two exceptions happen
so close together without an intervening function call.
The hang was caused by a function that was trying to determine
the stack frame size of "system_call", and complicated
by the fact that the interrupted EIP (0xfffd7027) is
the hugemem trampoline address for the real system_call address
of 0x23ee027.

Anyway, this will be quite difficult to reproduce, so I
won't be updating my people site with this fix
until something else comes along.
-----------------
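
The quoted comment notes that the interrupted EIP (0xfffd7027) is a hugemem trampoline alias of the real system_call address (0x23ee027). The sketch below only illustrates that relationship and is not the code in crash. The window base, window size, and real text base are hypothetical values chosen so that the two addresses from the report line up. It assumes the alias is a fixed-offset mapping of one page.

```python
# Hypothetical values, chosen only so that the two addresses quoted in
# the report correspond; they are not taken from the kernel or from crash.
TRAMPOLINE_BASE = 0xFFFD7000  # assumed start of the trampoline alias window
TRAMPOLINE_SIZE = 0x1000      # assumed window size (one page)
REAL_TEXT_BASE = 0x023EE000   # assumed start of the aliased kernel text


def normalize_eip(eip):
    """Map an EIP inside the trampoline window to its real text address.

    Addresses outside the window are returned unchanged.
    """
    offset = eip - TRAMPOLINE_BASE
    if 0 <= offset < TRAMPOLINE_SIZE:
        return REAL_TEXT_BASE + offset
    return eip


print(hex(normalize_eip(0xFFFD7027)))  # prints 0x23ee027
```

Normalizing the address before any symbol or frame-size lookup is one way a tool could avoid searching for a symbol at an address that has no entry of its own. Whether the actual fix works this way is not stated in this report.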

Comment 1 Dave Anderson 2004-11-16 13:44:16 UTC
I'm confused here.  Is this a new report?  I fixed the x86 backtrace
issue in my public crash utility release on people.redhat.com
in version 3.8-5.7, and that will be carried forward into an
update of crash for RHEL3-U5.

But this BZ states that it happens on "All" Platforms, and
that it happens 50% of the time.  Are we talking specifically
about the situation that I investigated? 

  

Comment 2 Yuuichi Nagahama 2004-11-16 19:40:10 UTC
Dave,

I filed this bug report from an old status without checking the latest one.
Tatsuo and I will check the latest version, 3.8-5.7, tomorrow (11/17),
and I will post the result.


Comment 3 Tatsuo Uchida 2004-11-17 19:50:31 UTC
I confirmed the fix with crash-3.8.5-11.
It works without hanging.


Comment 4 Dave Anderson 2005-02-18 16:28:20 UTC
Fix checked into CVS:

RHEL3:

* Fri Feb 04 2005 Dave Anderson <anderson> 3.10-4
- Fixes potential "bt -a" hang on dumpfile where netdump IPI
  interrupted an x86 process while executing the instructions just
  after it had entered the kernel for a syscall, but before calling
  the handler.  BZ #139437

RHEL4:
* Thu Feb 10 2005 Dave Anderson <anderson> 3.10-7
- Fixes potential "bt -a" hang on dumpfile where netdump IPI
  interrupted an x86 process while executing the instructions just
  after it had entered the kernel for a syscall, but before calling
  the handler.  BZ #139437



Comment 5 Tim Powers 2005-05-19 12:47:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-184.html