Escalated to Bugzilla from IssueTracker
Please clarify whether this bug report is an enhancement request to make show_trace() and show_stack() resilient against non-4-byte-aligned stack pointers or whether this is a bug report about the stack pointer becoming unaligned (which is likely due to corruption of stack data). Then please update the bug summary appropriately. Thanks in advance.
If you look at the sample vmcore that shows this: [<c01b3169>] handle_sysrq [kernel] 0x59 (0xc038be56) and you do a "rd c038be56", does it show c01b3169? In other words, do real return addresses show up on addresses that are not 4-byte aligned? This is confouding to me -- I don't believe I've *ever* seen a i386 return address on anything but a 4-byte boundary. Here's an old trace, for example: crash> bt PID: 0 TASK: c03f8000 CPU: 0 COMMAND: "swapper" #0 [c03f9ed0] sysrq_handle_crash at c01ce8b0 #1 [c03f9ecc] __handle_sysrq_nolock at c01cef38 #2 [c03f9eec] handle_sysrq at c01cee98 #3 [c03f9f10] handle_scancode at c01cc9c6 #4 [c03f9f38] handle_kbd_event at c01cdba8 #5 [c03f9f50] keyboard_interrupt at c01cdbc7 #6 [c03f9f54] handle_IRQ_event at c010dd57 #7 [c03f9f74] do_IRQ at c010df94 #8 [c03f9f9c] call_do_IRQ at c03fa322 EAX: 00000000 EBX: c0109100 ECX: c0443280 EDX: c03f8000 EBP: c0109100 DS: 0068 ESI: c03f8000 ES: 0068 EDI: c03f8000 CS: 0060 EIP: c0109129 ERR: ffffff01 EFLAGS: 00000246 #9 [c03f9fd0] default_idle at c0109129 #10 [c03f9fd4] cpu_idle at c01091c0 I've never seen a stack address (the second number in each frame) ever be anything other than 4-byte aligned. Has something changed in the RHEL3 build environment recently? Why has this only shown up now?
In the original trace we see: __handle_sysrq_nolock [kernel] 0x7a (0xc038be36) [<c01b3159>] handle_sysrq [kernel] 0x59 (0xc038be56) [<c01b213c>] handle_kbd_event [kernel] 0xbc (0xc038bea2) [<c01b214f>] keyboard_interrupt [kernel] 0xf (0xc038beba) [<c010cff5>] handle_IRQ_event [kernel] 0x45 (0xc038bebe) [<c010d110>] do_IRQ [kernel] 0x0 (0xc038befe) [<c010d110>] do_IRQ [kernel] 0x0 (0xc038c31e) [<c01071a0>] do_linuxrc [kernel] 0x0 (0xc038e596) [<c01904eb>] devpts_read_super [kernel] 0x3b (0xc038ea56) [<c01090b0>] poll_idle [kernel] 0x0 (0xc038ea7e) [<c010fe60>] disable_8259A_irq [kernel] 0x0 (0xc038eea6) [<c0158b1c>] insert_super [kernel] 0x2c (0xc038f202) [<c01d89db>] __ide_dma_timeout [kernel] 0x1b (0xc038fb12) [<d08810ca>] bus_readdir [usbcore] 0xca (0xc0390966) (snip) many messages printed like above. ------- But where did it "go bad" originally? Perhaps there's a clue there? In other words, we've got what appears to be "correct" return addresses (perhaps back at least back to poll_idle()?), but it would seem that if the stack pointer got trashed, we would see the point at which the references start looking bizarre. So when you say "many messages printed like above", what exactly were they? Did it just keep going with non-aligned return addresses until it ran off the stack page into whatever was above it? (i.e. before the trace goes beyond the top of the stack, which would be at c038b000+8k) I'm just wondering whether it *ever* was 4-byte aligned.
From the first trace: > Process swapper (pid: 0, stackpage=c038b000) I was incorrect when stating the top of the stack would be c038b000+8k -- actually the end of stack would be at c038b000+4k. I never noticed until now, but the i386 show_registers() prints "stackpage=" as the second of the two page stack, i.e., current+4096: printk("Process %s (pid: %d, stackpage=%08lx)", current->comm, current->pid, 4096+(unsigned long)current); Weird, and confusing, if you were actually running in the first page of the stack...
Right, so then at the logical-bottom-of-stack was do_IRQ(). The 2nd do_IRQ() and the do_linuxrc() frames are fictitious (because their alleged frames are beyond the range of valid stack addresses).
Where can I find the vmcore and vmlinux/vmlinux.debug files? (The location references above don't seem to be valid anymore.)
Changing severity to "enhancement" based on answer in comment #42.
Thanks, I'll take a look. I want to try to dig out the exception frame (pt_regs) when the keyboard interrupt was received. I don't understand why there could be a "transient" situation where the stack pointer is not aligned properly. I think we can all agree that your patch may just mask the real issue by accepting that non-aligned stack addresses are legitimate. So it makes more sense to me now to first try to figure out why/when/how this can happen in the first place. BTW, I appreciate your debugging efforts in this case, and thought I'd maybe make things a bit easier. In your crash output above, you show this: > crash> foreach task | grep -e "PID" -e "esp0 =" -e "esp =" | grep -v "saved_esp" > > PID: 0 TASK: c038a000 CPU: 0 COMMAND: "swapper" > esp0 = 3224944640, > esp = 3224944540, Above you're dumping stack addresses from the task's thread_struct, and because the esp0 and esp fields are declared as "long unsigned int", gdb's output shows them in decimal by default. However, if you're not aware of it, you can flip the output radix during a crash session using the "hex" and "dec" aliases. It's decimal by default: crash> task | grep -e "PID" -e " esp0 =" -e " esp =" PID: 0 TASK: c03f8000 CPU: 0 COMMAND: "swapper" esp0 = 3225395200, esp = 3225395064, That's ugly, so flip it to hexadecimal: crash> hex output radix: 16 (hex) crash> task | grep -e "PID" -e " esp0 =" -e " esp =" PID: 0 TASK: c03f8000 CPU: 0 COMMAND: "swapper" esp0 = 0xc03fa000, esp = 0xc03f9f78, and if you want then restore it to decimal: crash> dec output radix: 10 (decimal) crash> task | grep -e "PID" -e " esp0 =" -e " esp =" PID: 0 TASK: c03f8000 CPU: 0 COMMAND: "swapper" esp0 = 3225395200, esp = 3225395064, crash> Excuse me if you're already aware of this capability! But the decimal output for stack addresses in your example was kind of ugly, and I wanted to make sure you know that you can change it. ;-) BTW, hex/dec changes the output format of several crash commands, not just gdb structure output. I keep the gdb decimal default in place, because normally it makes the most sense, especially given that *pointer* variables are still shown in hexadecimal even when the output radix is decimal. But I find myself flipping hex/dec quite often depending upon what I want to see. Note that the crash "help" command refers to the "help output" option, which in turn explains hex/dec.
I can't really come up with a really "clean" pt_regs exception frame, but it seems pretty obvious that the issue here has something to do with the alternative idle process this machine is using. Here's why... This is the oops output -- note the stack addresses are at the end (far right) of each line, and the last text address on a non-aligned stack address is the do_IRQ reference: Call Trace: [<c01b320a>] __handle_sysrq_nolock [kernel] 0x7a (0xc038be36) [<c01b3169>] handle_sysrq [kernel] 0x59 (0xc038be56) [<c01b0fbb>] handle_scancode [kernel] 0x2ab (0xc038be7a) [<c01b214c>] handle_kbd_event [kernel] 0xbc (0xc038bea2) [<c01b215f>] keyboard_interrupt [kernel] 0xf (0xc038beba) [<c010d005>] handle_IRQ_event [kernel] 0x45 (0xc038bebe) [<c010d1a4>] do_IRQ [kernel] 0x84 (0xc038bede) [<c010d120>] do_IRQ [kernel] 0x0 (0xc038befe) When running crash, "bt" alone can't find the proper (aligned) starting point, so it searches the stack until it finds something useful. The first aligned return address is found way up near the top of the stack at c038bf9c -- note that crash stack frames are printed showing the stack addresses on the left: crash> bt PID: 0 TASK: c038a000 CPU: 0 COMMAND: "swapper" #0 [c038bf9c] context_switch at c011e63b #1 [c038bfb8] apm_cpu_idle at c0118dfd #2 [c038bfd4] cpu_idle at c0109120 crash> Now, if I manually plug in the starting EIP/ESP "hooks" from the first line of oops trace: Call Trace: [<c01b320a>] __handle_sysrq_nolock [kernel] 0x7a (0xc038be36) I can come up with this at least: crash> bt -S 0xc038be36 -I c01b320a PID: 0 TASK: c038a000 CPU: 0 COMMAND: "swapper" #0 [c038be36] __handle_sysrq_nolock at c01b320a #1 [c038be56] handle_sysrq at c01b3164 #2 [c038be7a] handle_scancode at c01b0fb6 #3 [c038bea2] handle_kbd_event at c01b2147 #4 [c038beba] keyboard_interrupt at c01b215a #5 [c038bebe] handle_IRQ_event at c010d003 #6 [c038bede] do_IRQ at c010d19f #7 [c038bf02] call_do_IRQ at c038c322 EAX: 000000b0 EBX: 00000000 ECX: 00000000 EDX: 00000000 DS: 00b8 ESI: 00000000 ES: 0000 EDI: 0000000a SS: 74e4 ESP: 725d0046 EBP: c038bf44 CS: 00b0 EIP: 000077da ERR: ffffff01 EFLAGS: 00000246 crash> Now, the exception frame is very strange, and crash won't go past it back to cpu_idle() since it thinks the EIP is user address, and therefore thinks its a user-mode exception frame. To contrast this with an example from another alt-sysrq-c dump I have on hand, here's what it normally looks like when alt-sysrq-c is taken when the idle task is running: crash> bt PID: 0 TASK: c03f8000 CPU: 0 COMMAND: "swapper" #0 [c03f9ed0] sysrq_handle_crash at c01ce8b0 #1 [c03f9ecc] __handle_sysrq_nolock at c01cef38 #2 [c03f9eec] handle_sysrq at c01cee98 #3 [c03f9f10] handle_scancode at c01cc9c6 #4 [c03f9f38] handle_kbd_event at c01cdba8 #5 [c03f9f50] keyboard_interrupt at c01cdbc7 #6 [c03f9f54] handle_IRQ_event at c010dd57 #7 [c03f9f74] do_IRQ at c010df94 #8 [c03f9f9c] call_do_IRQ at c03fa322 EAX: 00000000 EBX: c0109100 ECX: c0443280 EDX: c03f8000 EBP: c0109100 DS: 0068 ESI: c03f8000 ES: 0068 EDI: c03f8000 CS: 0060 EIP: c0109129 ERR: ffffff01 EFLAGS: 00000246 #9 [c03f9fd0] default_idle at c0109129 #10 [c03f9fd4] cpu_idle at c01091c0 crash> Note that EIP above shows the default_idle() address that was executing when the keyboard interrupt was taken. Now, again, looking at this vmcore, we have at least this: crash> bt PID: 0 TASK: c038a000 CPU: 0 COMMAND: "swapper" #0 [c038bf9c] context_switch at c011e63b #1 [c038bfb8] apm_cpu_idle at c0118dfd #2 [c038bfd4] cpu_idle at c0109120 crash> where apm_cpu_idle() is in the trace instead of default_idle(). The swapper process on each cpu spends its time here in cpu_idle(): void cpu_idle (void) { /* endless idle loop with no priority at all */ while (1) { void (*idle)(void) = pm_idle; if (!idle) idle = default_idle; /* * We use the last_run timestamp to measure the idleness * of a CPU. */ current->last_run = jiffies; while (!current->need_resched) idle(); schedule(); check_pgt_cache(); } } Note that unless the global pointer "pm_idle" has been set to a alternative idle function, "default_idle" is used. And as expected, in this dump, pm_idle() has been set to apm_cpu_idle: crash> p pm_idle pm_idle = $6 = (void (*)(void)) 0xc0118d20 <apm_cpu_idle> crash> As it turns out, the call trace would be apm_cpu_idle() calling apm_do_idle() calling apm_bios_call_simple(), which makes an lcall into the segment-offset location called apm_bios_entry: crash> p apm_bios_entry apm_bios_entry = $7 = { offset = 0x7511, segment = 0xa8 } Note that apm_bios_call_simple() is pretty ugly: static u8 apm_bios_call(u32 func, u32 ebx_in, u32 ecx_in, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, u32 *esi) { APM_DECL_SEGS unsigned long flags; unsigned long cpus = apm_save_cpus(); struct desc_struct save_desc_40; __save_flags(flags); APM_DO_CLI; APM_DO_SAVE_SEGS; /* * N.B. We do NOT need a cld after the BIOS call * because we always save and restore the flags. */ __asm__ __volatile__(APM_DO_ZERO_SEGS "pushl %%edi\n\t" "pushl %%ebp\n\t" "lcall %%cs:" SYMBOL_NAME_STR(apm_bios_entry) "\n\t" "setc %%al\n\t" "popl %%ebp\n\t" "popl %%edi\n\t" APM_DO_POP_SEGS : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx), "=S" (*esi) : "a" (func), "b" (ebx_in), "c" (ecx_in) : "memory", "cc"); APM_DO_RESTORE_SEGS; __restore_flags(flags); apm_restore_cpus(cpus); cpu_gdt_table[smp_processor_id()][0x40 / 8] = save_desc_40; return *eax & 0xff; } How this all relates to exception frames and stack register manipulation is well beyond my understanding. I've disassembled the set apm-related functions, and don't see any manipulation of esp other than the standard aligned modifications coming in and out of the functions. But I have to believe this has something to do with the issue at hand. It's unclear to me what happens on the stack when an interrupt comes in while the cpu is executing in the BIOS. I don't understand exactly how it works, but it seems that the apm module must be built into the kernel? That being the case, "smp" defaults to 0, and "idle_threshold" is 95: #ifdef CONFIG_APM_CPU_IDLE #define DEFAULT_IDLE_THRESHOLD 95 #else #define DEFAULT_IDLE_THRESHOLD 100 #endif #define DEFAULT_IDLE_PERIOD (100 / 3) static int idle_threshold = DEFAULT_IDLE_THRESHOLD; static int smp = 0; That being the case, we would only see this in a single cpu system, because the apm_init() function is responsible for setting pm_idle() here: if (smp_num_cpus > 1 && !smp) { printk(KERN_NOTICE "apm: disabled - APM is not SMP safe (power off active).\n"); return 0; } misc_register(&apm_device); if (HZ != 100) idle_period = (idle_period * HZ) / 100; if (idle_threshold < 100) { original_pm_idle = pm_idle; pm_idle = apm_cpu_idle; set_pm_idle = 1; } return 0; } I do wonder whether the idle process on this machine is *always* unaligned once the new idle process is installed and executed? If you do a series of alt-sysrq-p entries on this machine when quiescent, do you see the stack aligned or unaligned when you catch the idle process running?
Also, in addition to the alt-sysrq-p question I raised above, note that if the problem is in fact due to the apm_cpu_idle issue, you can work around it by adding "idle=poll" onto your kernel's boot command line. By doing so, pm_idle will be set to "poll_idle", overriding the "apm_cpu_idle" that gets set by default on your machine.
I'm still interested in the alt-sysrq-p output, but... Presuming that this is a BIOS bug/issue, your patch concept would be necessary. I see that you've somewhat used 2.6's way of doing things as a reference: --- linux-2.4.21/arch/i386/kernel/traps.c.orig 2005-08-11 09:27:46.334491977 +0900 +++ linux-2.4.21/arch/i386/kernel/traps.c 2005-08-11 09:29:21.577869431 +0900 @@ -141,6 +141,7 @@ unsigned long addr; /* static to not take up stackspace; if we race here too bad */ static char buffer[512]; + void *context; if (!stack) stack = (unsigned long*)&stack; @@ -163,7 +164,8 @@ out: #else i = 1; - while (((long) stack & (THREAD_SIZE-1)) != 0) { + context = (void*)((unsigned long)stack & ~(THREAD_SIZE-1)); + while (context < stack && stack < context + THREAD_SIZE - 3) { addr = *stack++; if (kernel_text_address(addr)) { lookup_symbol(addr, buffer, 512); A couple questions: 1. Is there a reason you made the context variable a "void *" instead of an "unsigned long *"? 2. Is there any mathematical possibility that "context < stack" would not always be true?
Created attachment 118073 [details] stack limit patch proposed by lwoodman on rhkernel-list This is the proposed stack display patch being proposed on rhkernel-list.
Reassigning to Larry and adding to U7 ProposedList.
A fix for this problem has just been committed to the RHEL3 U7 patch pool this evening (in kernel version 2.4.21-37.2.EL).
*** Bug 163176 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0144.html