Bug 165412 - [RHEL3] The system hangs when SysRq + c is pressed
[RHEL3] The system hangs when SysRq + c is pressed
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel (Show other bugs)
3.0
i686 Linux
medium Severity medium
: ---
: ---
Assigned To: Larry Woodman
Brian Brock
: FutureFeature
Depends On:
Blocks: 168424
  Show dependency treegraph
 
Reported: 2005-08-08 23:16 EDT by Issue Tracker
Modified: 2010-10-21 23:14 EDT (History)
3 users (show)

See Also:
Fixed In Version: RHSA-2006-0144
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-03-15 11:21:55 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
stack limit patch proposed by lwoodman on rhkernel-list (1.26 KB, patch)
2005-08-24 11:27 EDT, Dave Anderson
no flags Details | Diff

  None (edit)
Description Issue Tracker 2005-08-08 23:16:08 EDT
Escalated to Bugzilla from IssueTracker
Comment 35 Ernie Petrides 2005-08-09 14:31:32 EDT
Please clarify whether this bug report is an enhancement request to make
show_trace() and show_stack() resilient against non-4-byte-aligned stack
pointers or whether this is a bug report about the stack pointer becoming
unaligned (which is likely due to corruption of stack data).

Then please update the bug summary appropriately.  Thanks in advance.
Comment 37 Dave Anderson 2005-08-09 15:10:06 EDT
If you look at the sample vmcore that shows this:

[<c01b3169>] handle_sysrq [kernel] 0x59 (0xc038be56)

and you do a "rd c038be56", does it show c01b3169?

In other words, do real return addresses show up on addresses
that are not 4-byte aligned?

This is confouding to me -- I don't believe I've *ever* seen a
i386 return address on anything but a 4-byte boundary.  Here's an
old trace, for example:

crash> bt
PID: 0      TASK: c03f8000  CPU: 0   COMMAND: "swapper"
 #0 [c03f9ed0] sysrq_handle_crash at c01ce8b0
 #1 [c03f9ecc] __handle_sysrq_nolock at c01cef38
 #2 [c03f9eec] handle_sysrq at c01cee98
 #3 [c03f9f10] handle_scancode at c01cc9c6
 #4 [c03f9f38] handle_kbd_event at c01cdba8
 #5 [c03f9f50] keyboard_interrupt at c01cdbc7
 #6 [c03f9f54] handle_IRQ_event at c010dd57
 #7 [c03f9f74] do_IRQ at c010df94
 #8 [c03f9f9c] call_do_IRQ at c03fa322
    EAX: 00000000  EBX: c0109100  ECX: c0443280  EDX: c03f8000  EBP: c0109100
    DS:  0068      ESI: c03f8000  ES:  0068      EDI: c03f8000
    CS:  0060      EIP: c0109129  ERR: ffffff01  EFLAGS: 00000246
 #9 [c03f9fd0] default_idle at c0109129
#10 [c03f9fd4] cpu_idle at c01091c0

I've never seen a stack address (the second number in each frame)
ever be anything other than 4-byte aligned.

Has something changed in the RHEL3 build environment recently?
Why has this only shown up now?
Comment 39 Dave Anderson 2005-08-09 16:23:05 EDT
In the original trace we see:

__handle_sysrq_nolock [kernel] 0x7a (0xc038be36)
[<c01b3159>] handle_sysrq [kernel] 0x59 (0xc038be56)
[<c01b213c>] handle_kbd_event [kernel] 0xbc (0xc038bea2)
[<c01b214f>] keyboard_interrupt [kernel] 0xf (0xc038beba)
[<c010cff5>] handle_IRQ_event [kernel] 0x45 (0xc038bebe)
[<c010d110>] do_IRQ [kernel] 0x0 (0xc038befe)
[<c010d110>] do_IRQ [kernel] 0x0 (0xc038c31e)
[<c01071a0>] do_linuxrc [kernel] 0x0 (0xc038e596)
[<c01904eb>] devpts_read_super [kernel] 0x3b (0xc038ea56)
[<c01090b0>] poll_idle [kernel] 0x0 (0xc038ea7e)
[<c010fe60>] disable_8259A_irq [kernel] 0x0 (0xc038eea6)
[<c0158b1c>] insert_super [kernel] 0x2c (0xc038f202)
[<c01d89db>] __ide_dma_timeout [kernel] 0x1b (0xc038fb12)
[<d08810ca>] bus_readdir [usbcore] 0xca (0xc0390966)

(snip) many messages printed like above.

-------

But where did it "go bad" originally?  Perhaps there's a clue there?
In other words, we've got what appears to be "correct" return addresses
(perhaps back at least back to poll_idle()?), but it would seem that if
the stack pointer got trashed, we would see the point at which the
references start looking bizarre.  So when you say "many messages
printed like above", what exactly were they?  Did it just keep going
with non-aligned return addresses until it ran off the stack page into
whatever was above it? (i.e. before the trace goes beyond the top of the
stack, which would be at c038b000+8k)

I'm just wondering whether it *ever* was 4-byte aligned.
Comment 40 Dave Anderson 2005-08-09 16:41:23 EDT
From the first trace:

> Process swapper (pid: 0, stackpage=c038b000)

I was incorrect when stating the top of the stack
would be c038b000+8k -- actually the end of stack would be
at c038b000+4k.

I never noticed until now, but the i386 show_registers()
prints "stackpage=" as the second of the two page stack,
i.e., current+4096:

        printk("Process %s (pid: %d, stackpage=%08lx)",
                current->comm, current->pid, 4096+(unsigned long)current);

Weird, and confusing, if you were actually running in the
first page of the stack...



Comment 41 Ernie Petrides 2005-08-09 16:50:37 EDT
Right, so then at the logical-bottom-of-stack was do_IRQ().  The
2nd do_IRQ() and the do_linuxrc() frames are fictitious (because
their alleged frames are beyond the range of valid stack addresses).
Comment 44 Dave Anderson 2005-08-10 13:57:04 EDT
Where can I find the vmcore and vmlinux/vmlinux.debug files?
(The location references above don't seem to be valid anymore.)
Comment 45 Ernie Petrides 2005-08-10 17:33:01 EDT
Changing severity to "enhancement" based on answer in comment #42.
Comment 47 Dave Anderson 2005-08-11 11:26:21 EDT
Thanks, I'll take a look.  I want to try to dig out the exception frame
(pt_regs) when the keyboard interrupt was received.  I don't understand
why there could be a "transient" situation where the stack pointer is
not aligned properly.  I think we can all agree that your patch may just
mask the real issue by accepting that non-aligned stack addresses are
legitimate.  So it makes more sense to me now to first try to figure out
why/when/how this can happen in the first place.

BTW, I appreciate your debugging efforts in this case, and thought I'd maybe
make things a bit easier.  In your crash output above, you show this:

> crash> foreach task | grep -e "PID" -e "esp0 =" -e "esp =" | grep -v "saved_esp"
>
> PID: 0      TASK: c038a000  CPU: 0   COMMAND: "swapper"
>    esp0 = 3224944640,
>    esp = 3224944540,

Above you're dumping stack addresses from the task's thread_struct, and
because the esp0 and esp fields are declared as "long unsigned int", gdb's
output shows them in decimal by default.  However, if you're not aware
of it, you can flip the output radix during a crash session using the "hex"
and "dec" aliases.

It's decimal by default:

crash> task | grep -e "PID" -e " esp0 =" -e " esp ="
PID: 0      TASK: c03f8000  CPU: 0   COMMAND: "swapper"
    esp0 = 3225395200,
    esp = 3225395064,

That's ugly, so flip it to hexadecimal:

crash> hex
output radix: 16 (hex)
crash> task | grep -e "PID" -e " esp0 =" -e " esp ="
PID: 0      TASK: c03f8000  CPU: 0   COMMAND: "swapper"
    esp0 = 0xc03fa000,
    esp = 0xc03f9f78,

and if you want then restore it to decimal:

crash> dec
output radix: 10 (decimal)
crash> task | grep -e "PID" -e " esp0 =" -e " esp ="
PID: 0      TASK: c03f8000  CPU: 0   COMMAND: "swapper"
    esp0 = 3225395200,
    esp = 3225395064,
crash>

Excuse me if you're already aware of this capability!  But the decimal output
for stack addresses in your example was kind of ugly, and I wanted to make sure
you know that you can change it.   ;-)  

BTW, hex/dec changes the output format of several crash commands, not just 
gdb structure output.  I keep the gdb decimal default in place, because
normally it makes the most sense, especially given that *pointer* variables
are still shown in hexadecimal even when the output radix is decimal.  But
I find myself flipping hex/dec quite often depending upon what I want to see.
Note that the crash "help" command refers to the "help output" option, which
in turn explains hex/dec.
Comment 48 Dave Anderson 2005-08-11 15:35:12 EDT
I can't really come up with a really "clean" pt_regs exception frame,
but it seems pretty obvious that the issue here has something to do with
the alternative idle process this machine is using.

Here's why...

This is the oops output -- note the stack addresses are at the end (far right)
of each line, and the last text address on a non-aligned stack address
is the do_IRQ reference:

Call Trace:   [<c01b320a>] __handle_sysrq_nolock [kernel] 0x7a (0xc038be36)
[<c01b3169>] handle_sysrq [kernel] 0x59 (0xc038be56)
[<c01b0fbb>] handle_scancode [kernel] 0x2ab (0xc038be7a)
[<c01b214c>] handle_kbd_event [kernel] 0xbc (0xc038bea2)
[<c01b215f>] keyboard_interrupt [kernel] 0xf (0xc038beba)
[<c010d005>] handle_IRQ_event [kernel] 0x45 (0xc038bebe)
[<c010d1a4>] do_IRQ [kernel] 0x84 (0xc038bede)
[<c010d120>] do_IRQ [kernel] 0x0 (0xc038befe)

When running crash, "bt" alone can't find the proper (aligned) starting point,
so it searches the stack until it finds something useful.  The first aligned
return address is found way up near the top of the stack at c038bf9c -- note 
that crash stack frames are printed showing the stack addresses on the left:

crash> bt
PID: 0      TASK: c038a000  CPU: 0   COMMAND: "swapper"
 #0 [c038bf9c] context_switch at c011e63b
 #1 [c038bfb8] apm_cpu_idle at c0118dfd
 #2 [c038bfd4] cpu_idle at c0109120
crash>

Now, if I manually plug in the starting EIP/ESP "hooks" from the first
line of oops trace:

 Call Trace:   [<c01b320a>] __handle_sysrq_nolock [kernel] 0x7a (0xc038be36)

I can come up with this at least:

crash> bt -S 0xc038be36 -I c01b320a
PID: 0      TASK: c038a000  CPU: 0   COMMAND: "swapper"
 #0 [c038be36] __handle_sysrq_nolock at c01b320a
 #1 [c038be56] handle_sysrq at c01b3164
 #2 [c038be7a] handle_scancode at c01b0fb6
 #3 [c038bea2] handle_kbd_event at c01b2147
 #4 [c038beba] keyboard_interrupt at c01b215a
 #5 [c038bebe] handle_IRQ_event at c010d003
 #6 [c038bede] do_IRQ at c010d19f
 #7 [c038bf02] call_do_IRQ at c038c322
    EAX: 000000b0  EBX: 00000000  ECX: 00000000  EDX: 00000000
    DS:  00b8      ESI: 00000000  ES:  0000      EDI: 0000000a
    SS:  74e4      ESP: 725d0046  EBP: c038bf44
    CS:  00b0      EIP: 000077da  ERR: ffffff01  EFLAGS: 00000246
crash>

Now, the exception frame is very strange, and crash won't go past it
back to cpu_idle() since it thinks the EIP is user address, and therefore
thinks its a user-mode exception frame.  To contrast this with an example 
from another alt-sysrq-c dump I have on hand, here's what it normally looks
like when alt-sysrq-c is taken when the idle task is running:

crash> bt
PID: 0      TASK: c03f8000  CPU: 0   COMMAND: "swapper"
 #0 [c03f9ed0] sysrq_handle_crash at c01ce8b0
 #1 [c03f9ecc] __handle_sysrq_nolock at c01cef38
 #2 [c03f9eec] handle_sysrq at c01cee98
 #3 [c03f9f10] handle_scancode at c01cc9c6
 #4 [c03f9f38] handle_kbd_event at c01cdba8
 #5 [c03f9f50] keyboard_interrupt at c01cdbc7
 #6 [c03f9f54] handle_IRQ_event at c010dd57
 #7 [c03f9f74] do_IRQ at c010df94
 #8 [c03f9f9c] call_do_IRQ at c03fa322
    EAX: 00000000  EBX: c0109100  ECX: c0443280  EDX: c03f8000  EBP: c0109100
    DS:  0068      ESI: c03f8000  ES:  0068      EDI: c03f8000
    CS:  0060      EIP: c0109129  ERR: ffffff01  EFLAGS: 00000246
 #9 [c03f9fd0] default_idle at c0109129
#10 [c03f9fd4] cpu_idle at c01091c0
crash>

Note that EIP above shows the default_idle() address that was executing
when the keyboard interrupt was taken.  

Now, again, looking at this vmcore, we have at least this:

crash> bt
PID: 0      TASK: c038a000  CPU: 0   COMMAND: "swapper"
 #0 [c038bf9c] context_switch at c011e63b
 #1 [c038bfb8] apm_cpu_idle at c0118dfd
 #2 [c038bfd4] cpu_idle at c0109120
crash>

where apm_cpu_idle() is in the trace instead of default_idle().

The swapper process on each cpu spends its time here in cpu_idle(): 

void cpu_idle (void)
{
        /* endless idle loop with no priority at all */

        while (1) {
                void (*idle)(void) = pm_idle;
                if (!idle)
                        idle = default_idle;
                /*
                 * We use the last_run timestamp to measure the idleness
                 * of a CPU.
                 */
                current->last_run = jiffies;

                while (!current->need_resched)
                        idle();
                schedule();
                check_pgt_cache();
        }
}

Note that unless the global pointer "pm_idle" has been set to a alternative
idle function, "default_idle" is used.

And as expected, in this dump, pm_idle() has been set to apm_cpu_idle:

crash> p pm_idle
pm_idle = $6 = (void (*)(void)) 0xc0118d20 <apm_cpu_idle>
crash>

As it turns out, the call trace would be apm_cpu_idle() calling apm_do_idle()
calling apm_bios_call_simple(), which makes an lcall into the segment-offset
location called apm_bios_entry:

crash> p apm_bios_entry
apm_bios_entry = $7 = {
  offset = 0x7511,
  segment = 0xa8
}

Note that apm_bios_call_simple() is pretty ugly:

static u8 apm_bios_call(u32 func, u32 ebx_in, u32 ecx_in,
        u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, u32 *esi)
{
        APM_DECL_SEGS
        unsigned long   flags;
        unsigned long cpus = apm_save_cpus();
        struct desc_struct      save_desc_40;

        __save_flags(flags);
        APM_DO_CLI;
        APM_DO_SAVE_SEGS;
        /*
         * N.B. We do NOT need a cld after the BIOS call
         * because we always save and restore the flags.
         */
        __asm__ __volatile__(APM_DO_ZERO_SEGS
                "pushl %%edi\n\t"
                "pushl %%ebp\n\t"
                "lcall %%cs:" SYMBOL_NAME_STR(apm_bios_entry) "\n\t"
                "setc %%al\n\t"
                "popl %%ebp\n\t"
                "popl %%edi\n\t"
                APM_DO_POP_SEGS
                : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx),
                  "=S" (*esi)
                : "a" (func), "b" (ebx_in), "c" (ecx_in)
                : "memory", "cc");
        APM_DO_RESTORE_SEGS;
        __restore_flags(flags);

        apm_restore_cpus(cpus);

        cpu_gdt_table[smp_processor_id()][0x40 / 8] = save_desc_40;

        return *eax & 0xff;
}

How this all relates to exception frames and stack register manipulation
is well beyond my understanding.  I've disassembled the set apm-related
functions, and don't see any manipulation of esp other than the standard
aligned modifications coming in and out of the functions.  But I have to
believe this has something to do with the issue at hand.  It's unclear
to me what happens on the stack when an interrupt comes in while the cpu
is executing in the BIOS.

I don't understand exactly how it works, but it seems that the apm module
must be built into the kernel?  That being the case, "smp" defaults to 0,
and "idle_threshold" is 95:

#ifdef CONFIG_APM_CPU_IDLE
#define DEFAULT_IDLE_THRESHOLD  95
#else
#define DEFAULT_IDLE_THRESHOLD  100
#endif
#define DEFAULT_IDLE_PERIOD     (100 / 3)

static int                      idle_threshold = DEFAULT_IDLE_THRESHOLD;

static int                      smp = 0;

That being the case, we would only see this in a single cpu system, because
the apm_init() function is responsible for setting pm_idle() here:

        if (smp_num_cpus > 1 && !smp) {
                printk(KERN_NOTICE
                   "apm: disabled - APM is not SMP safe (power off active).\n");
                return 0;
        }

        misc_register(&apm_device);

        if (HZ != 100)
                idle_period = (idle_period * HZ) / 100;
        if (idle_threshold < 100) {
                original_pm_idle = pm_idle;
                pm_idle  = apm_cpu_idle;
                set_pm_idle = 1;
        }

        return 0;
}

I do wonder whether the idle process on this machine is *always* unaligned
once the new idle process is installed and executed?  If you do a series
of alt-sysrq-p entries on this machine when quiescent, do you see the stack
aligned or unaligned when you catch the idle process running?

Comment 49 Dave Anderson 2005-08-11 16:05:09 EDT
Also, in addition to the alt-sysrq-p question I raised above,
note that if the problem is in fact due to the apm_cpu_idle issue,
you can work around it by adding "idle=poll" onto your kernel's
boot command line.  

By doing so, pm_idle will be set to "poll_idle", overriding the
"apm_cpu_idle" that gets set by default on your machine.
Comment 50 Dave Anderson 2005-08-11 16:44:35 EDT
I'm still interested in the alt-sysrq-p output, but...

Presuming that this is a BIOS bug/issue, your patch concept would be necessary.
I see that you've somewhat used 2.6's way of doing things as a reference:

--- linux-2.4.21/arch/i386/kernel/traps.c.orig	2005-08-11 09:27:46.334491977 +0900
+++ linux-2.4.21/arch/i386/kernel/traps.c	2005-08-11 09:29:21.577869431 +0900
@@ -141,6 +141,7 @@
 	unsigned long addr;
 	/* static to not take up stackspace; if we race here too bad */
 	static char buffer[512];
+	void *context;
 
 	if (!stack)
 		stack = (unsigned long*)&stack;
@@ -163,7 +164,8 @@
 out:
 #else
 	i = 1;
-	while (((long) stack & (THREAD_SIZE-1)) != 0) {
+	context = (void*)((unsigned long)stack & ~(THREAD_SIZE-1));
+	while (context < stack && stack < context + THREAD_SIZE - 3) {
 		addr = *stack++;
 		if (kernel_text_address(addr)) {
 			lookup_symbol(addr, buffer, 512);

A couple questions:

1. Is there a reason you made the context variable a "void *" instead of an 
   "unsigned long *"?
2. Is there any mathematical possibility that "context < stack" would not
   always be true?

Comment 51 Dave Anderson 2005-08-24 11:27:15 EDT
Created attachment 118073 [details]
stack limit patch proposed by lwoodman on rhkernel-list


This is the proposed stack display patch being proposed on rhkernel-list.
Comment 52 Ernie Petrides 2005-08-24 16:57:17 EDT
Reassigning to Larry and adding to U7 ProposedList.
Comment 54 Ernie Petrides 2005-09-15 00:08:49 EDT
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.2.EL).
Comment 57 Eido Inoue 2005-09-26 16:46:03 EDT
*** Bug 163176 has been marked as a duplicate of this bug. ***
Comment 62 Red Hat Bugzilla 2006-03-15 11:21:55 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html

Note You need to log in before you can comment on or make changes to this bug.