Description of problem: bt command doesn't work. There are 2 reasons. The first reason is that there is only one init_handler stack(Each CPU shared it). The second reason is that crash doesn't work with INIT interrupt. Version-Release number of selected component (if applicable): RHEL3-U4. How reproducible: Everytime. Steps to Reproduce: 1. Push OS_INIT switch 2. Diskdump works 3. Use bt with crash Actual results: bt doesn't work Expected results: bt works Additional info: Tatsuo Uchida posted the patch(linux-2.4.21-init-trigger-switch- 3.patch) to fix this issue on 11/4/04. It wasn't incorporated in RHEL3-U4.
NEC wrote a patch for the crash utility to do backtraces for the active tasks when the INIT interrupt is received. However, with the NEC patch applied, the bt command still does not work correctly, and I cannot accept it into the crash utility as is. I am waiting for NEC to fix the problems with the crash utility patch that they forwarded. These were my comments on the patch, which were forwarded to NEC: Re: NEC's crash utility patch -- when I said, "The other parts of the patch look good", I was a bit premature. There are definite problems with both "active" backtraces: crash> bt -a PID: 0 TASK: e0000000049f4000 CPU: 0 COMMAND: "swapper" #0 [BSP:e0000000049f53d0] try_crashdump at e000000004486190 #1 [BSP:e0000000049f53a0] init_handler_platform at e00000000444bee0 EFRAME: e000000004b98740 B0: e000000004412ce0 CR_IIP: e000000004412550 CR_IPSR: 0000121008022018 CR_IFS: 800000000000040b AR_PFS: 000000000000048c AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000000 B6: e0000000044032d0 B7: e0000000044151e0 PR: 80000000afb5e967 R1: e000000004cb3d00 R2: 0000000000000001 R3: 0000000000000001 R8: 0000000000000000 R9: e000000004b82488 R10: 0000000000000001 R11: 00000000000e4c38 R12: e0000000049fbcb0 R13: e0000000049f4000 R14: 0000000000000000 R15: e000000004cb263c R16: e000000004b82488 R17: e000000004cb2280 R18: 0000000000000000 R19: 0000000000000000 R20: 0000000000000000 R21: e000000004cb2280 R22: e0000000044111c0 R23: e000000004990068 R24: e0000000049f403c R25: e000000004b83648 R26: e000000004b83590 R27: e000000004b83590 R28: e000000004b7dac8 R29: 0000000000000000 R30: 0000000000000000 R31: e0000000049f4028 F6: 1003e0000000000000000 F7: 1003e0000000000000001 F8: 1003e0000000000000004 F9: 1003e0000000000000001 #2 [BSP:e0000000049f5320] handle_IRQ_event at e000000004412550 #3 [BSP:e0000000049f52d8] do_IRQ at e000000004412ce0 #4 [BSP:e0000000049f5298] ia64_handle_irq at e000000004414f20 #5 [BSP:e0000000049f5298] ia64_leave_kernel at e00000000440ea20 EFRAME: e000000004b98740 B0: e000000004412ce0 CR_IIP: e000000004412550 CR_IPSR: 0000121008022018 CR_IFS: 800000000000040b AR_PFS: 000000000000048c AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000000 B6: e0000000044032d0 B7: e0000000044151e0 PR: 80000000afb5e967 R1: e000000004cb3d00 R2: 0000000000000001 R3: 0000000000000001 R8: 0000000000000000 R9: e000000004b82488 R10: 0000000000000001 R11: 00000000000e4c38 R12: e0000000049fbcb0 R13: e0000000049f4000 R14: 0000000000000000 R15: e000000004cb263c R16: e000000004b82488 R17: e000000004cb2280 R18: 0000000000000000 R19: 0000000000000000 R20: 0000000000000000 R21: e000000004cb2280 R22: e0000000044111c0 R23: e000000004990068 R24: e0000000049f403c R25: e000000004b83648 R26: e000000004b83590 R27: e000000004b83590 R28: e000000004b7dac8 R29: 0000000000000000 R30: 0000000000000000 R31: e0000000049f4028 F6: 1003e0000000000000000 F7: 1003e0000000000000001 F8: 1003e0000000000000004 F9: 1003e0000000000000001 #6 [BSP:e0000000049f5240] handle_IRQ_event at e000000004412550 bt: unwind: bsp (0xe0000000049f50c8) out of range [0xe0000000049f5150-0xe0000000049f5378] The trace above for the "swapper" thread is clearly incorrect. The CR_IIP address in the init_stack's EFRAME at e000000004b98740 shows that took the INIT interrupt while it was operating in handle_IRQ_event at address e000000004412550. It then continues on to frame #5, where it should dump another in-kernel exception frame, which should have a CR_IIP of a function where it was operating before it took the first (other) interrupt. But instead it *repeats* the init_stack EFRAME at e000000004b98740, which makes no sense at all. It should find its way back to "schedule", or "cpu_idle". I don't know what the "out of range" error message is all about. Again I don't understand what the sequence should be exactly -- is the ia64_leave_kernel->ia64_handle_IRQ->do_IRQ->handle_IRQ_event sequence associated with the INIT interrupt itself? The second trace is also strange, because it doesn't even show the kernel going through handle_IRQ_event, do_IRQ, ia64_handle_IRQ and ia64_leave_kernel. Perhaps it is possible to get to init_handler_platform without passing through those interrupt functions?: PID: 3818 TASK: e000000006f00000 CPU: 1 COMMAND: "cat" #0 [BSP:e000000006f012f0] init_handler_platform at e00000000444c090 EFRAME: e000000004ba0740 B0: a000000000538260 CR_IIP: a000000000538260 CR_IPSR: 0000101008022018 CR_IFS: 8000000000000183 AR_PFS: 0000000000000183 AR_RSC: 0000000000000003 AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000 AR_CCV: 0000000000000000 AR_FPSR: 0009804c0270033f LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000000 B6: e000000004411c80 B7: a0000000005381c0 PR: 80000000a5a9a69b R1: e000000004cb3d00 R2: 0000000000000000 R3: 0000001008026018 R8: 0000000000004000 R9: 0000000000000000 R10: a000000000008000 R11: e000000004b82488 R12: e000000006f07e50 R13: e000000006f00000 R14: 0000000000000000 R15: 0000000000000002 R16: 0000000000000002 R17: e000000004a04000 R18: 0000000005f5e100 R19: e000000004b82488 R20: e000000004b188b8 R21: e000000006f0003c R22: e000000004abcf40 R23: e000000004b82488 R24: 0000000000000000 R25: e0000000061cee60 R26: e000000001619248 R27: 0000000000000001 R28: 0000000000004000 R29: 00000000000000e0 R30: e000000004b7f408 R31: e000000004ad2d20 F6: 1003e4ec4ec4ec4ec4ec5 F7: 1003e0000000000000001 F8: 1003e000000000000ebdd F9: 1003e00000000000bfa39 #1 [BSP:e000000006f012b0] cliloop_read_proc at a000000000538260 #2 [BSP:e000000006f01228] proc_file_read at e000000004586650 #3 [BSP:e000000006f011a0] sys_read at e000000004518ba0 #4 [BSP:e000000006f011a0] ia64_ret_from_syscall at e00000000440ea00 EFRAME: e000000004ba0750 B0: 4ec4ec4ec4ec4ec5 CR_IIP: 0000000000000000 CR_IPSR: 8000000000000183 CR_IFS: 0000000000000183 AR_PFS: 0000000000000000 AR_RSC: 0000000000000000 AR_UNAT: 0000000000000003 AR_RNAT: 80000000a5a9a69b AR_CCV: a000000000538260 AR_FPSR: a0000000005381c0 LOADRS: 0000000000000000 AR_BSPSTORE: e000000004411c80 B6: e000000004cb3d00 B7: 000000000001003e PR: 0000000000000000 R1: 0000001008026018 R2: e000000006f07e50 R3: e000000006f00000 R8: a000000000008000 R9: e000000004b82488 R10: 0000000000000002 R11: e000000004a04000 R12: 0000000000000000 R13: 0000000000000002 R14: 0000000000004000 R15: 0000000000000000 R16: 0000000005f5e100 R17: e000000004b82488 R18: e000000004b188b8 R19: e000000006f0003c R20: e000000004abcf40 R21: e000000004b82488 R22: 0000000000000000 R23: e0000000061cee60 R24: e000000001619248 R25: 0000000000000001 R26: 0000000000004000 R27: 00000000000000e0 R28: e000000004b7f408 R29: e000000004ad2d20 R30: 0000000000000000 R31: 0009804c0270033f F6: 1003e0000000000000001 F7: 1003e000000000000ebdd F8: 1003e00000000000bfa39 F9: 000000000000000000000 Anyway, even if we accept that the trace is legitimate until frame #4, clearly the second (kernel-entry) EFRAME at e000000004ba0750 is incorrect. That address is in the init_stack, only 16 bytes above the first EFRAME (!), and obviously is displaying garbage. That EFRAME should be in the process stack. All ia64 user program kernel-entry exception frames are found at the exact same offset from the top of the stack, i.e. at "stack top - sizeof(pt_regs)". So, for example this "cat" task whose 32k process stack begins at e000000006f00000, would have its kernel-entry pt_regs at: e000000006f00000 + 32k - sizeof(pt_regs), or at e000000006f07e70. (If you dump that address as a pt_regs, you'll see the correct EFRAME data that should be displayed.) So clearly there's some major problems with NEC crash utility patch such that I cannot accept it in its current state. Can you give NEC this particular dumpfile, the kernel and its debuginfo file, and have them answer my questions? Dave
It should be noted that, until NEC can come up with a new patch, that the "-t" option to "bt" may be used to determine where the active tasks were running when the INIT switch was entered. For example, using the same dump as above, here's what the -t option, displays, consisting of the text addresses in the BSP area of each task. Note that the function list is "backwards" from a real backtrace, and includes "stale" entries from prior kernel entries: crash> bt -at PID: 0 TASK: e0000000049f4000 CPU: 0 COMMAND: "swapper" START: try_crashdump at e000000004486190 [e0000000049f5208] skip_stack_dtr at e000000004408140 [e0000000049f5218] start_kernel at e000000004998d60 [e0000000049f5288] rest_init at e000000004408bf0 [e0000000049f52c8] ia64_leave_kernel at e00000000440ea20 [e0000000049f5310] ia64_handle_irq at e000000004414f20 [e0000000049f5350] do_IRQ at e000000004412ce0 [e0000000049f5370] cpu_idle at e000000004416350 [e0000000049f5390] return_from_init at e00000000444ece0 [e0000000049f53c0] ia64_init_handler at e00000000444d550 [e0000000049f53e8] init_handler_platform at e00000000444bee0 [e0000000049f5410] try_crashdump at e000000004486190 [e0000000049f5440] unw_init_running at e00000000440eff0 [e0000000049f54e0] ia64_start_dump at e0000000044179f0 [e0000000049f55a8] hid_input_report at a0000000002caba0 [e0000000049f55f0] hid_input_field at a0000000002ca500 [e0000000049f5630] tcp_v4_rcv at e000000004796ee0 [e0000000049f5690] tcp_v4_do_rcv at e000000004795d10 [e0000000049f56c0] tcp_rcv_established at e00000000477ebb0 [e0000000049f57b0] __tcp_data_snd_check at e00000000477d620 [e0000000049f57c8] ip_queue_xmit at e000000004754780 [e0000000049fbaf8] rt_run_flush__thr at e00000000473e550 [e0000000049fbb28] mpt_interrupt at a00000000018c4c0 [e0000000049fbb38] __copy_user at e000000004809e60 [e0000000049fbcc8] cpu_idle at e0000000044162e0 [e0000000049fbd08] break_fault at e000000004402f70 [e0000000049fbd88] default_idle at e0000000044161c0 [e0000000049fbe00] cpu_idle at e0000000044162e0 [e0000000049fbe08] default_idle at e0000000044161c0 PID: 3818 TASK: e000000006f00000 CPU: 1 COMMAND: "cat" START: init_handler_platform at e00000000444c090 [e000000006f01218] ia64_ret_from_syscall at e00000000440ea00 [e000000006f012a0] sys_read at e000000004518ba0 [e000000006f012b8] proc_file_read at e000000004586650 [e000000006f012e0] return_from_init at e00000000444ece0 [e000000006f01310] ia64_init_handler at e00000000444d550 [e000000006f01338] cliloop_read_proc at a000000000538260 [e000000006f01368] init_handler_platform at e00000000444be00 [e000000006f01480] ia64_ret_from_syscall at e00000000440ea00 [e000000006f01510] load_elf_binary at e000000004579eb0 [e000000006f01538] load_elf_interp at e000000004578fc0 [e000000006f01580] load_elf_interp at e0000000045790f0 [e000000006f015d8] exec_mmap at e000000004533080 [e000000006f01780] io_schedule at e00000000447eb80 [e000000006f01798] rw_intr at a000000000065fe0 [e000000006f07788] schedule at e00000000447ad40 [e000000006f077c8] __divsi3 at e000000004805a60 [e000000006f078c0] schedule at e00000000447acb0 [e000000006f078c8] ia64_switch_to at e00000000440e410 [e000000006f07948] __do_clear_user at e000000004806880 [e000000006f079c8] zap_page_range at e0000000044bdd40 [e000000006f079f8] __do_clear_user at e000000004806880 [e000000006f07a38] kfree at e0000000044e4340 [e000000006f07b30] padzero at e000000004578340 [e000000006f07b38] generic_file_mmap at e0000000044d3680 [e000000006f07cd8] __strncpy_from_user at e000000004806940 [e000000006f07d18] break_fault at e000000004402f70 crash> It's a cheap work-around for now.
Created attachment 108027 [details] Patch for crash 3.8-5.11 Sorry. There was a bug in the calculation of the stack top. Attached patch fixes the problem. (patch for crash 3.8-5.11)
Hello Jun'ichi, Very nice work! I will adapt this patch to 3.10-1 (the current version of crash) with a few minor changes: 1. As I described above, your GET_STACK_ULONG() change won't work for all processors, so I changed unwind.c to replace all usages of the generic GET_STACK_ULONG() macro with a new ia64-specific macro called IA64_GET_STACK_ULONG(): #define IA64_GET_STACK_ULONG(OFF) ((INSTACK(OFF,bt)) ? \ (GET_STACK_ULONG(OFF)) : get_init_stack_ulong((unsigned long)OFF)) ...where get_init_stack_ulong() is a new function created from your GET_STACK_ULONG_SLOW() inline function. 2. This part of your in_initstack() function is not quite right: + if (addr < init_stack_addr || + addr >= init_stack_addr + STACKSIZE()*NR_CPUS) + return 0; ...because NR_CPUS from the perspective of the crash utility is 512 for ia64. That's easy enough to fix -- in ia64_post_init() I've done this: if (symbol_exists("ia64_init_stack")) ms->ia64_init_stack_size = get_array_length("ia64_init_stack", NULL, 0); if (ia64_in_init_stack(SWITCH_STACK_ADDR(CURRENT_TASK()))) machdep->flags |= INIT; I've renamed in_initstack() to ia64_in_init_stack() and moved it to ia64.c, and made its stack address test do this: if ((addr < init_stack_addr) || (addr >= (init_stack_addr + machdep->machspec->ia64_init_stack_size))) return FALSE; Lastly, the machdep->flags "INIT" setting makes each active task's STATE information show that they were running when the INIT button was hit: crash> foreach active set PID: 0 TASK: e0000000049f4000 CPU: 0 COMMAND: "swapper" PID: 0 COMMAND: "swapper" TASK: e0000000049f4000 (1 of 2) CPU: 0 STATE: TASK_RUNNING (INIT) PID: 3818 TASK: e000000006f00000 CPU: 1 COMMAND: "cat" PID: 3818 COMMAND: "cat" TASK: e000000006f00000 CPU: 1 STATE: TASK_RUNNING (INIT) crash> The "old" way would show "TASK_RUNNING (PANIC)" for the "swapper" process, and "TASK_RUNNING (ACTIVE)" for the "cat" process, which doesn't make much sense. So again, I thank you very much for this patch -- it's really a nice addition!
Fix checked into CVS. RHEL3: * Fri Feb 04 2005 Dave Anderson <anderson> 3.10-4 - Update to handle backtraces in dumpfiles generated on IA64 with the INIT switch (functionality intro'd in RHEL3-U5 kernel). BZ #139429 RHEL4: * Thu Feb 10 2005 Dave Anderson <anderson> 3.10-7 - Update to handle backtraces in dumpfiles generated on IA64 with the INIT switch (functionality intro'd in RHEL3-U5 kernel). BZ #139429
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-184.html