Bug 436426
Summary: | [5.2][kdump] kdump not work due to SAL error processing
---|---
Product: | Red Hat Enterprise Linux 5
Reporter: | Qian Cai <qcai>
Component: | kexec-tools
Assignee: | Neil Horman <nhorman>
Status: | CLOSED DUPLICATE
Severity: | medium
Priority: | low
Version: | 5.2
CC: | ddomingo, jarod
Target Milestone: | rc
Keywords: | Reopened
Target Release: | ---
Hardware: | ia64
OS: | Linux
Doc Type: | Bug Fix
Doc Text: |
(ia64)
Some Itanium systems cannot properly produce console output from the kexec purgatory code. This code contains instructions for backing up the first 640k of memory after a crash.
While purgatory console output can be useful in diagnosing problems, it is not needed for kdump to properly function. As such, if your Itanium system resets during a kdump operation, disable console output in purgatory by adding --noio to the KEXEC_ARGS variable in /etc/sysconfig/kdump.
|
Last Closed: | 2008-03-19 12:07:03 UTC
Bug Blocks: | 391221, 454962
Description
Qian Cai
2008-03-07 04:29:19 UTC
I think you're going to need to get up with Intel on this one. Looking at the problem, it's reminiscent of bz 277531, although the SAL callout event is different. We have one cpu (the monarch) managing to handle an OS INIT event from the SAL firmware. All the other cpus are supposed to handle the INIT event as slaves, reporting their status as ready (or rendezvoused) to the monarch so the SAL event can be handled without other cpus racing through the same code. It appears that the other cpus either never received the SAL event, or are stuck in the SAL firmware. Most likely it's the former, since you're getting a reset. The other cpus are probably still executing, and hit some odd bit of code that caused a subsequent fault or some such. Either way, I don't see any evidence here of the other cpus handling this event, and clearly they aren't rendezvoused. Intel is going to have to take a look into this, I think, as it's most likely a firmware issue.

Neil, can we get an on-site Intel engineer to have a look at this one?

We could, but I'm down in Raleigh, and we don't have any on-site Intel people here. Are you up in Westford? I thought we had someone up there. If you know who that is, let me know and I'll ask them to look over this.

No, I am not in Westford. I can only find some information from http://intranet.corp.redhat.com/ic/intranet/KernelBugzillaAssignment.html:

Bruce Allan (ballan): Intel, network drivers
Geoff Gustafson (grgustaf): Intel
Lu Yuming (luyu): Intel, ia64

Jarod, do you know if there is any on-site Intel engineer in Westford who can help us here?

Lu is occasionally in Westford, not sure if he is right now. The person to talk to though might be Doug Chapman (dchapman), who is an on-site HP engineer who primarily does ia64 stuff.

Thanks Jarod. Doug, can you take a look at this please and tell me if it's a firmware issue or not? I don't see how it can't be, but my understanding of the ia64 SAL is limited at best.
I have found a similar old bug, "kdump not functional on HP rx8640": https://bugzilla.redhat.com/show_bug.cgi?id=213273

If KEXEC_ARGS="--noio" is added here, it gets much further without a reset, until the capture kernel panics:

Red Hat Enterprise Linux Server release 5.2 Beta (Tikanga)
Kernel 2.6.18-84.el5 on an ia64

hp-olympia1.rhts.boston.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18-84.el5 (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-39)) #1 SMP Fri Feb 29 16:27:31 EST 2008
Ignoring memory below 128MB
Ignoring memory above 640MB
EFI v1.10 by HP: SALsystab=0x70fff7f86c8 ACPI 2.0=0x70fffbc0000 HCDP=0x70fffc072f0 SMBIOS=0x7fffe000
booting generic kernel on platform dig
PCDP: v3 at 0x70fffc072f0
Early serial console at MMIO 0xf0000018000 (options '9600n8')
Number of logical nodes in system = 4
Number of memory chunks in system = 8
rsvd_region[0]: [0xe000000008000000, 0xe000000008db2170)
rsvd_region[1]: [0xe000000008dc0000, 0xe000000008dc0030)
rsvd_region[2]: [0xe000000027a2c000, 0xe000000027fbce28)
rsvd_region[3]: [0xe000000027fc4000, 0xe000000027fc40af)
rsvd_region[4]: [0xe000000027fcc000, 0xe000000027fccea0)
rsvd_region[5]: [0xe000000027fd4000, 0xe000000027fd4050)
rsvd_region[6]: [0xffffffffffffffff, 0xffffffffffffffff)
Initial ramdisk at: 0xe000000027a2c000 (5836328 bytes)
SAL 3.2: HP Orca/IPF version 3.66
SAL Platform features: None
SAL: AP wakeup using external interrupt vector 0xff
No logical to physical processor mapping available
ACPI: Local APIC address c0000000fee00000
GSI 16 (level, low) -> CPU 0 (0x0c00) vector 48
9 CPUs available, 9 CPUs total
MCA related initialization done
SMP: Allowing 9 CPUs, 0 hotplug CPUs
Built 4 zonelists. Total pages: 30866
Kernel command line: BOOT_IMAGE=scsi0:EFI\redhat\vmlinuz-2.6.18-84.el5 root=/dev/VolGroup00/LogVol00 ro irqpoll maxcpus=1 reset_devices machvec=dig elfcorehdr=655248K max_addr=640M min_addr=128M
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
PID hash table entries: 2048 (order: 11, 16384 bytes)
Console: colour dummy device 80x25
Placing software IO TLB between 0x11370000 - 0x15370000
Memory: 291056k/493856k available (6370k code, 216848k reserved, 3837k data, 448k init)
McKinley Errata 9 workaround not needed; disabling it
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Dentry cache hash table entries: 65536 (order: 5, 524288 bytes)
Inode-cache hash table entries: 32768 (order: 4, 262144 bytes)
Mount-cache hash table entries: 1024
ACPI: Core revision 20060707
Boot processor id 0x0/0xc00
Brought up 1 CPUs
Total of 1 processors activated (1945.60 BogoMIPS).
checking if image is initramfs... it is
Freeing initrd memory: 5696kB freed
Bad page state in process 'swapper'
page:e00000001121ae68 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:0 (Not tainted)
Trying to fix it up, but a reboot is needed
Backtrace:

Call Trace:
[<a000000100013ae0>] show_stack+0x40/0xa0 sp=e000000015af7b40 bsp=e000000015af12a8
[<a000000100013b70>] dump_stack+0x30/0x60 sp=e000000015af7d10 bsp=e000000015af1290
[<a00000010010a260>] bad_page+0xe0/0x160 sp=e000000015af7d10 bsp=e000000015af1248
[<a00000010010aa30>] free_hot_cold_page+0x110/0x320 sp=e000000015af7d20 bsp=e000000015af1200
[<a00000010010ad70>] free_hot_page+0x30/0x60 sp=e000000015af7d20 bsp=e000000015af11d8
[<a00000010010d010>] __free_pages+0xb0/0x100 sp=e000000015af7d20 bsp=e000000015af11b0
[<a00000010010d1e0>] free_pages+0x180/0x1a0 sp=e000000015af7d20 bsp=e000000015af1188
[<a000000100760dc0>] free_initrd_mem+0x1e0/0x2e0 sp=e000000015af7d20 bsp=e000000015af1160
[<a000000100753410>] free_initrd+0x130/0x180 sp=e000000015af7d30 bsp=e000000015af1128
[<a000000100756460>] populate_rootfs+0x1e0/0x200 sp=e000000015af7d30 bsp=e000000015af10f8
[<a0000001007487d0>] init+0x3d0/0x780 sp=e000000015af7d30 bsp=e000000015af10c8
[<a0000001000121b0>] kernel_thread_helper+0x30/0x60 sp=e000000015af7e30 bsp=e000000015af10a0
[<a0000001000090c0>] start_kernel_thread+0x20/0x40 sp=e000000015af7e30 bsp=e000000015af10a0
Unable to handle kernel NULL pointer dereference (address 0000000000000050)
swapper[1]: Oops 8813272891392 [1]
Modules linked in:

Pid: 1, CPU 0, comm: swapper
psr : 00001010085a6010 ifs : 800000000000048d ip : [<a00000010010aae0>] Tainted: G B
ip is at free_hot_cold_page+0x1c0/0x320
unat: 0000000000000000 pfs : 000000000000048d rsc : 0000000000000003
rnat: a0000001009ecb08 bsps: a000000100928fe0 pr : 0000000000005941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000010010aa30 b6 : a0000001002ca200 b7 : a00000010000c690
f6 : 0fffafffffffff0000000 f7 : 0ffdd8000000000000000
f8 : 100018000000000000000 f9 : 100038000000000000000
f10 : 0fffcfffffffff0000000 f11 : 1003e0000000000000000
r1 : a000000100be0270 r2 : a0000001009f70c4 r3 : fffffffffff5670e
r8 : ffffffffffffffff r9 : a0000001009f8110 r10 : 0000000000000000
r11 : 0000000000000000 r12 : e000000015af7d20 r13 : e000000015af0000
r14 : 0000000000000000 r15 : 0000000000000000 r16 : a0000001009f8464
r17 : 0000000000000000 r18 : e00000001121ae74 r19 : 0000000000000000
r20 : e000000015af1054 r21 : e000000015af7e30 r22 : a0000001000090c0
r23 : e000000015af10a0 r24 : a000000100928fe0 r25 : a000000100928fe0
r26 : a0000001009e0a10 r27 : 0000000000000000 r28 : 0000000000000034
r29 : 0000000000000034 r30 : 0000000000000000 r31 : a0000001009f846c

Call Trace:
[<a000000100013ae0>] show_stack+0x40/0xa0 sp=e000000015af78b0 bsp=e000000015af1358
[<a0000001000143e0>] show_regs+0x840/0x880 sp=e000000015af7a80 bsp=e000000015af1300
[<a000000100037bc0>] die+0x1c0/0x2c0 sp=e000000015af7a80 bsp=e000000015af12b8
[<a0000001006361e0>] ia64_do_page_fault+0x8e0/0xa20 sp=e000000015af7aa0 bsp=e000000015af1268
[<a00000010000c020>] __ia64_leave_kernel+0x0/0x280 sp=e000000015af7b50 bsp=e000000015af1268
[<a00000010010aae0>] free_hot_cold_page+0x1c0/0x320 sp=e000000015af7d20 bsp=e000000015af1200
[<a00000010010ad70>] free_hot_page+0x30/0x60 sp=e000000015af7d20 bsp=e000000015af11d8
[<a00000010010d010>] __free_pages+0xb0/0x100 sp=e000000015af7d20 bsp=e000000015af11b0
[<a00000010010d1e0>] free_pages+0x180/0x1a0 sp=e000000015af7d20 bsp=e000000015af1188
[<a000000100760dc0>] free_initrd_mem+0x1e0/0x2e0 sp=e000000015af7d20 bsp=e000000015af1160
[<a000000100753410>] free_initrd+0x130/0x180 sp=e000000015af7d30 bsp=e000000015af1128
[<a000000100756460>] populate_rootfs+0x1e0/0x200 sp=e000000015af7d30 bsp=e000000015af10f8
[<a0000001007487d0>] init+0x3d0/0x780 sp=e000000015af7d30 bsp=e000000015af10c8
[<a0000001000121b0>] kernel_thread_helper+0x30/0x60 sp=e000000015af7e30 bsp=e000000015af10a0
[<a0000001000090c0>] start_kernel_thread+0x20/0x40 sp=e000000015af7e30 bsp=e000000015af10a0
<0>Kernel panic - not syncing: Fatal exception

It can be reproduced reliably.

Cai, which version of kexec-tools did you do this with?

1.102pre-12.el5 + patch from BZ 434927#c28

Can you try it with just 1.102pre-12.el5? Without the patch from 434927?

Apart from hitting bug #434927 with a zero-size vmcore, everything works fine without the patch, with 1.102pre-12.el5 and KEXEC_ARGS="--noio".

Created attachment 298402: capture kernel console output
Then that needs to be our solution. --noio disables console output in purgatory, which is useful for debugging, but in the event that it's causing problems, I'll just add it to the default arg list. Requesting 5.2 exception.

Actually, I'm clearing the exception. This is probably best handled as a release note, as not all systems are subject to this. Release note text:

Some IA64 systems have difficulty producing console output from the kexec purgatory code. While purgatory console output can be useful in diagnosing problems (which is why it is kept on), it is not needed for proper kdump function. If you find that your ia64 system resets during kdump operation, adding --noio to the KEXEC_ARGS variable in /etc/sysconfig/kdump may solve the issue.

Thanks Neil, adding to RHEL5.2 release notes under "Known Issues":

<quote>
(ia64)
Some Itanium systems cannot properly produce console output from the kexec purgatory code. This code contains instructions for backing up the first 640k of memory after a crash.
While purgatory console output can be useful in diagnosing problems, it is not needed for kdump to properly function. As such, if your Itanium system resets during a kdump operation, disable console output in purgatory by adding --noio to the KEXEC_ARGS variable in /etc/sysconfig/kdump.
</quote>

Please advise if any further revisions are required. Thanks!

For the record, kdump works fine with --noio and the latest kexec-tools 1.102pre.16.el5 on this box.

Hi, the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained. A mockup of the RHEL5.2 release notes can be viewed at the following link: http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html Please use the aforementioned link to verify if your bugzilla is already in the release notes (if it needs to be).
Each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number. Cheers, Don

Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes. This Release Note is currently located in the Known Issues section.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

For the record, hp-matterhorn1.rhts.bos.redhat.com was also affected.

SysRq : Trigger a crashdump
MCA EVENT occurred : SAL error processing
Logging XBC Errors .... Size of XBC Errors : 0x0 Complete
Finish the Error Event Logging .... Complete
Flush the cpu cache .... Complete
ReEnabling CPU Poison Check .... Complete
Cpu8: MCA Rendez Always flag set to 1.
Cpu8: Perform rendezvous started.
Cpu8: Sent Rendez vector number 0xe8 to 0 cpus.
Cpu8: Rendezvous timeout : 20000
Cpu8: Didn't have to send MCA vector.
Cpu8: Perform rendezvous complete, RendezState : 0x1 .
Cell local CallGate Pointer 0x723ff3a7c00
CallGate Pointer after moving to CoreCell 0x723ff3a7c00
Firmware executing from Main Memory .......
Calling OS_MCA at 0x00000000040493a0...
...

https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5321586

As well as hp-rx8640-02.rhts.bos.redhat.com.

*** This bug has been marked as a duplicate of bug 277531 ***
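
For anyone applying the workaround from the release note by script rather than by hand, the following is a minimal sketch of adding --noio to KEXEC_ARGS. It is not part of kexec-tools: add_noio is a hypothetical helper name, and the sketch assumes the config file stores KEXEC_ARGS as a single double-quoted assignment at the start of a line. If no such line exists, it appends a new assignment, which would override any earlier unquoted one.

```shell
#!/bin/sh
# add_noio FILE: make sure KEXEC_ARGS in FILE contains --noio.
# Hypothetical helper for illustration; assumes a KEXEC_ARGS="..." line.
add_noio() {
    conf="$1"

    # Already present: leave the file untouched so the call is idempotent.
    if grep -q '^KEXEC_ARGS=.*--noio' "$conf"; then
        return 0
    fi

    if grep -q '^KEXEC_ARGS="' "$conf"; then
        # Append --noio inside the existing double-quoted value.
        # Written via a temporary file to avoid relying on sed -i.
        sed 's/^KEXEC_ARGS="\(.*\)"/KEXEC_ARGS="\1 --noio"/' "$conf" > "$conf.tmp" &&
            mv "$conf.tmp" "$conf"
    else
        # No quoted assignment found: add one at the end of the file.
        echo 'KEXEC_ARGS="--noio"' >> "$conf"
    fi
}
```

On an affected system this would be called as root with `add_noio /etc/sysconfig/kdump`, followed by `service kdump restart` so the capture kernel is reloaded with the new arguments. Because the file is rewritten through a temporary copy, its ownership and permissions are worth checking afterwards.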