Bug 436426

Summary: [5.2][kdump] kdump not work due to SAL error processing
Product: Red Hat Enterprise Linux 5
Reporter: Qian Cai <qcai>
Component: kexec-tools
Assignee: Neil Horman <nhorman>
Status: CLOSED DUPLICATE
Severity: medium
Priority: low
Version: 5.2
CC: ddomingo, jarod
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: ia64
OS: Linux
Doc Type: Bug Fix
Doc Text:
(ia64) Some Itanium systems cannot properly produce console output from the kexec purgatory code. This code contains instructions for backing up the first 640k of memory after a crash. While purgatory console output can be useful in diagnosing problems, it is not needed for kdump to properly function. As such, if your Itanium system resets during a kdump operation, disable console output in purgatory by adding --noio to the KEXEC_ARGS variable in /etc/sysconfig/kdump.
Last Closed: 2008-03-19 12:07:03 UTC
Bug Blocks: 391221, 454962    
Attachments:
Description Flags
capture kernel console output none

Description Qian Cai 2008-03-07 04:29:19 UTC
Description of problem:
SysRq : Trigger a crashdump

 MCA EVENT occurred : SAL error processing 

  Logging XBC Errors ....
    Size of XBC Errors : 0x0

						Complete
 Finish the Error Event Logging ....

						Complete
 Flush the cpu cache ....

						Complete
 ReEnabling CPU Poison Check ....

						Complete
Cpu18: MCA Rendez Always flag set to 1. 
Cpu18: Perform rendezvous started. 
Cpu18: Sent Rendez vector number 0xe8 to 1 cpus. 
Cpu18: Rendezvous timeout : 20000 
Cpu18: Waiting processors to acknowledge MCA vector.
Cpu18: Sending INIT to other processors for Rendezvous.
Cpu18: Waiting processors to acknowledge INIT.
Entered OS INIT handler. PSP=ffe301a0 cpu=0 monarch=1
Delaying for 5 seconds...
mlogbuf_finish: printing switched to urgent mode, MCA/INIT might be dodgy or fail.
OS INIT slave did not rendezvous on cpu 1 2 3 4 5 6 7 8
Processes interrupted by INIT - 0 (cpu 0 task 0xa0000001007b8000)


Backtrace of pid 1 (init)

Call Trace:
 [<a00000010062d1b0>] schedule+0x1db0/0x20a0
                                sp=e0000040face79d0 bsp=e0000040face1310
 [<a00000010062e7d0>] schedule_timeout+0x110/0x180
                                sp=e0000040face7a60 bsp=e0000040face12e0
 [<a000000100194630>] do_select+0x2d0/0x7e0
                                sp=e0000040face7a90 bsp=e0000040face1200
 [<a0000001001950d0>] sys_select+0x590/0x960
                                sp=e0000040face7ce0 bsp=e0000040face1160
 [<a00000010000be80>] ia64_ret_from_syscall+0x0/0x40
                                sp=e0000040face7e30 bsp=e0000040face1160
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e0000040face8000 bsp=e0000040face1160

...

Then, the machine was reset.

Full log:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2128481

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080303.0
kernel-2.6.18-84.el5
kexec-tools-1.102pre-11.el5 with patch from BZ434927

How reproducible:
Always on hp-olympia1.rhts.boston.redhat.com

Steps to Reproduce:
1. configure kdump with crashkernel=512M@256M
2. SysRq-c
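
A minimal sketch of how the two steps above might be checked and carried out from a root shell. The helper name crashkernel_value is hypothetical; /proc/cmdline and /proc/sysrq-trigger are the standard kernel interfaces. The crash trigger is left as a comment because it takes the machine down immediately.

```shell
# Print the crashkernel= value found in a kernel command line string,
# or nothing if no reservation was requested.
crashkernel_value() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

# Step 1 check (read-only): confirm the reservation on the running kernel.
#   crashkernel_value "$(cat /proc/cmdline)"    # 512M@256M was used in this report
#
# Step 2 (destructive): equivalent of SysRq-c from a shell.
#   echo c > /proc/sysrq-trigger
```

If the function prints nothing, the kernel was booted without a crashkernel= reservation and kdump has no memory to load the capture kernel into.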

Comment 1 Neil Horman 2008-03-07 12:37:55 UTC
I think you're going to need to get in touch with Intel on this one.  Looking
at the problem, it's reminiscent of bz 277531, although the SAL callout event is
different.  We have one cpu (the monarch) managing to handle an OS INIT event
from the SAL firmware.  All the other cpus are supposed to handle the INIT event
as slaves, reporting their status as ready (or rendezvoused) to the monarch so
the SAL event can be handled without other cpus racing through the same code.
It appears that the other cpus either never received the SAL event, or are stuck
in the SAL firmware.  Most likely it's the former, since you're getting a reset.
The other cpus are probably still executing, and hit some odd bit of code that
caused a subsequent fault or some such.  Either way, I don't see any evidence
here of the other cpus handling this event, and clearly they haven't
rendezvoused.  Intel is going to have to take a look into this, I think, as it's
most likely a firmware issue.
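
For triage, the cpus that missed the rendezvous can be pulled out of a saved console log. This is only a sketch, assuming the log contains the "OS INIT slave did not rendezvous on cpu ..." line in the form shown in the description; the function name is hypothetical.

```shell
# Read a console log on stdin and print, one per line, the cpu numbers
# the monarch reported as not having rendezvoused.
unrendezvoused_cpus() {
    sed -n 's/^OS INIT slave did not rendezvous on cpu //p' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}

# Usage (console.log is a placeholder file name):
#   unrendezvoused_cpus < console.log | wc -l
```

Comparing that count against the number of cpus in the system shows whether any slave responded at all.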

Comment 2 Qian Cai 2008-03-07 14:57:22 UTC
Neil, can we get an on-site Intel engineer to have a look at this one?

Comment 3 Neil Horman 2008-03-07 15:34:20 UTC
We could, but I'm down in Raleigh, and we don't have any on-site Intel people
here.  Are you up in Westford?  I thought we had someone up there.  If you know
who that is, let me know and I'll ask them to look over this.

Comment 4 Qian Cai 2008-03-07 16:01:00 UTC
No, I am not in Westford. The only information I can find is at
http://intranet.corp.redhat.com/ic/intranet/KernelBugzillaAssignment.html:

Bruce Allan             ballan               Intel, network drivers
Geoff Gustafson         grgustaf             Intel
Lu Yuming               luyu                 Intel, ia64




Comment 5 Qian Cai 2008-03-07 16:09:23 UTC
Jarod, do you know if there is any on-site Intel engineer in Westford who can
help us here?

Comment 6 Jarod Wilson 2008-03-07 16:18:26 UTC
Lu is occasionally in Westford, not sure if he is right now. The person to talk
to though might be Doug Chapman (dchapman), who is an on-site HP
engineer who primarily does ia64 stuff.

Comment 7 Neil Horman 2008-03-07 16:29:58 UTC
Thanks Jarod.  Doug, can you take a look at this please and tell me whether
it's a firmware issue or not?  I don't see how it can't be, but my understanding
of the ia64 SAL is limited at best.

Comment 8 Qian Cai 2008-03-18 09:44:02 UTC
I have found a similar old bug:

kdump not functional on HP rx8640
https://bugzilla.redhat.com/show_bug.cgi?id=213273

If KEXEC_ARGS="--noio" is added here, it gets much further without a reset,
until the capture kernel panics:

Red Hat Enterprise Linux Server release 5.2 Beta (Tikanga)
Kernel 2.6.18-84.el5 on an ia64

hp-olympia1.rhts.boston.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18-84.el5 (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-39)) #1 SMP Fri Feb 29 16:27:31 EST 2008
Ignoring memory below 128MB
Ignoring memory above 640MB
EFI v1.10 by HP: SALsystab=0x70fff7f86c8 ACPI 2.0=0x70fffbc0000 HCDP=0x70fffc072f0 SMBIOS=0x7fffe000
booting generic kernel on platform dig
PCDP: v3 at 0x70fffc072f0
Early serial console at MMIO 0xf0000018000 (options '9600n8')
Number of logical nodes in system = 4
Number of memory chunks in system = 8
rsvd_region[0]: [0xe000000008000000, 0xe000000008db2170)
rsvd_region[1]: [0xe000000008dc0000, 0xe000000008dc0030)
rsvd_region[2]: [0xe000000027a2c000, 0xe000000027fbce28)
rsvd_region[3]: [0xe000000027fc4000, 0xe000000027fc40af)
rsvd_region[4]: [0xe000000027fcc000, 0xe000000027fccea0)
rsvd_region[5]: [0xe000000027fd4000, 0xe000000027fd4050)
rsvd_region[6]: [0xffffffffffffffff, 0xffffffffffffffff)
Initial ramdisk at: 0xe000000027a2c000 (5836328 bytes)
SAL 3.2: HP Orca/IPF version 3.66
SAL Platform features: None
SAL: AP wakeup using external interrupt vector 0xff
No logical to physical processor mapping available
ACPI: Local APIC address c0000000fee00000
GSI 16 (level, low) -> CPU 0 (0x0c00) vector 48
9 CPUs available, 9 CPUs total
MCA related initialization done
SMP: Allowing 9 CPUs, 0 hotplug CPUs
Built 4 zonelists.  Total pages: 30866
Kernel command line: BOOT_IMAGE=scsi0:EFI\redhat\vmlinuz-2.6.18-84.el5 root=/dev/VolGroup00/LogVol00  ro irqpoll maxcpus=1 reset_devices machvec=dig elfcorehdr=655248K max_addr=640M min_addr=128M
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
PID hash table entries: 2048 (order: 11, 16384 bytes)
Console: colour dummy device 80x25
Placing software IO TLB between 0x11370000 - 0x15370000
Memory: 291056k/493856k available (6370k code, 216848k reserved, 3837k data, 448k init)
McKinley Errata 9 workaround not needed; disabling it
Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Dentry cache hash table entries: 65536 (order: 5, 524288 bytes)
Inode-cache hash table entries: 32768 (order: 4, 262144 bytes)
Mount-cache hash table entries: 1024
ACPI: Core revision 20060707
Boot processor id 0x0/0xc00
Brought up 1 CPUs
Total of 1 processors activated (1945.60 BogoMIPS).
checking if image is initramfs... it is
Freeing initrd memory: 5696kB freed
Bad page state in process 'swapper'
page:e00000001121ae68 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:0 (Not tainted)
Trying to fix it up, but a reboot is needed
Backtrace:

Call Trace:
 [<a000000100013ae0>] show_stack+0x40/0xa0
                                sp=e000000015af7b40 bsp=e000000015af12a8
 [<a000000100013b70>] dump_stack+0x30/0x60
                                sp=e000000015af7d10 bsp=e000000015af1290
 [<a00000010010a260>] bad_page+0xe0/0x160
                                sp=e000000015af7d10 bsp=e000000015af1248
 [<a00000010010aa30>] free_hot_cold_page+0x110/0x320
                                sp=e000000015af7d20 bsp=e000000015af1200
 [<a00000010010ad70>] free_hot_page+0x30/0x60
                                sp=e000000015af7d20 bsp=e000000015af11d8
 [<a00000010010d010>] __free_pages+0xb0/0x100
                                sp=e000000015af7d20 bsp=e000000015af11b0
 [<a00000010010d1e0>] free_pages+0x180/0x1a0
                                sp=e000000015af7d20 bsp=e000000015af1188
 [<a000000100760dc0>] free_initrd_mem+0x1e0/0x2e0
                                sp=e000000015af7d20 bsp=e000000015af1160
 [<a000000100753410>] free_initrd+0x130/0x180
                                sp=e000000015af7d30 bsp=e000000015af1128
 [<a000000100756460>] populate_rootfs+0x1e0/0x200
                                sp=e000000015af7d30 bsp=e000000015af10f8
 [<a0000001007487d0>] init+0x3d0/0x780
                                sp=e000000015af7d30 bsp=e000000015af10c8
 [<a0000001000121b0>] kernel_thread_helper+0x30/0x60
                                sp=e000000015af7e30 bsp=e000000015af10a0
 [<a0000001000090c0>] start_kernel_thread+0x20/0x40
                                sp=e000000015af7e30 bsp=e000000015af10a0
Unable to handle kernel NULL pointer dereference (address 0000000000000050)
swapper[1]: Oops 8813272891392 [1]
Modules linked in:

Pid: 1, CPU 0, comm:              swapper
psr : 00001010085a6010 ifs : 800000000000048d ip  : [<a00000010010aae0>]    Tainted: G    B
ip is at free_hot_cold_page+0x1c0/0x320
unat: 0000000000000000 pfs : 000000000000048d rsc : 0000000000000003
rnat: a0000001009ecb08 bsps: a000000100928fe0 pr  : 0000000000005941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a00000010010aa30 b6  : a0000001002ca200 b7  : a00000010000c690
f6  : 0fffafffffffff0000000 f7  : 0ffdd8000000000000000
f8  : 100018000000000000000 f9  : 100038000000000000000
f10 : 0fffcfffffffff0000000 f11 : 1003e0000000000000000
r1  : a000000100be0270 r2  : a0000001009f70c4 r3  : fffffffffff5670e
r8  : ffffffffffffffff r9  : a0000001009f8110 r10 : 0000000000000000
r11 : 0000000000000000 r12 : e000000015af7d20 r13 : e000000015af0000
r14 : 0000000000000000 r15 : 0000000000000000 r16 : a0000001009f8464
r17 : 0000000000000000 r18 : e00000001121ae74 r19 : 0000000000000000
r20 : e000000015af1054 r21 : e000000015af7e30 r22 : a0000001000090c0
r23 : e000000015af10a0 r24 : a000000100928fe0 r25 : a000000100928fe0
r26 : a0000001009e0a10 r27 : 0000000000000000 r28 : 0000000000000034
r29 : 0000000000000034 r30 : 0000000000000000 r31 : a0000001009f846c

Call Trace:
 [<a000000100013ae0>] show_stack+0x40/0xa0
                                sp=e000000015af78b0 bsp=e000000015af1358
 [<a0000001000143e0>] show_regs+0x840/0x880
                                sp=e000000015af7a80 bsp=e000000015af1300
 [<a000000100037bc0>] die+0x1c0/0x2c0
                                sp=e000000015af7a80 bsp=e000000015af12b8
 [<a0000001006361e0>] ia64_do_page_fault+0x8e0/0xa20
                                sp=e000000015af7aa0 bsp=e000000015af1268
 [<a00000010000c020>] __ia64_leave_kernel+0x0/0x280
                                sp=e000000015af7b50 bsp=e000000015af1268
 [<a00000010010aae0>] free_hot_cold_page+0x1c0/0x320
                                sp=e000000015af7d20 bsp=e000000015af1200
 [<a00000010010ad70>] free_hot_page+0x30/0x60
                                sp=e000000015af7d20 bsp=e000000015af11d8
 [<a00000010010d010>] __free_pages+0xb0/0x100
                                sp=e000000015af7d20 bsp=e000000015af11b0
 [<a00000010010d1e0>] free_pages+0x180/0x1a0
                                sp=e000000015af7d20 bsp=e000000015af1188
 [<a000000100760dc0>] free_initrd_mem+0x1e0/0x2e0
                                sp=e000000015af7d20 bsp=e000000015af1160
 [<a000000100753410>] free_initrd+0x130/0x180
                                sp=e000000015af7d30 bsp=e000000015af1128
 [<a000000100756460>] populate_rootfs+0x1e0/0x200
                                sp=e000000015af7d30 bsp=e000000015af10f8
 [<a0000001007487d0>] init+0x3d0/0x780
                                sp=e000000015af7d30 bsp=e000000015af10c8
 [<a0000001000121b0>] kernel_thread_helper+0x30/0x60
                                sp=e000000015af7e30 bsp=e000000015af10a0
 [<a0000001000090c0>] start_kernel_thread+0x20/0x40
                                sp=e000000015af7e30 bsp=e000000015af10a0
 <0>Kernel panic - not syncing: Fatal exception

It can be reproduced reliably.

Comment 9 Neil Horman 2008-03-18 11:21:56 UTC
Cai, which version of kexec-tools did you do this with?

Comment 10 Qian Cai 2008-03-18 11:29:13 UTC
1.102pre-12.el5 + patch from BZ 434927#c28

Comment 11 Neil Horman 2008-03-18 12:42:43 UTC
can you try it with just 1.102pre-12.el5?  Without the patch from 434927?

Comment 12 Qian Cai 2008-03-18 14:55:15 UTC
Apart from hitting bug #434927 with a zero-size vmcore, everything works fine
with 1.102pre-12.el5 without the patch, and with KEXEC_ARGS="--noio".

Comment 13 Qian Cai 2008-03-18 14:56:30 UTC
Created attachment 298402 [details]
capture kernel console output

Comment 14 Neil Horman 2008-03-18 15:18:50 UTC
Then that needs to be our solution.  --noio disables console output in
purgatory, which is useful for debugging, but in the event that it's causing
problems, I'll just add it to the default arg list.  Requesting 5.2 exception.

Comment 15 Neil Horman 2008-03-18 15:22:34 UTC
Actually, I'm clearing the exception.  This is probably best handled as a
release note, as not all systems are subject to this.

Release note text:

Some ia64 systems have difficulty producing console output from the kexec
purgatory code.  While purgatory console output can be useful in diagnosing
problems (which is why it is kept on), it is not needed for proper kdump
function.  If you find that your ia64 system resets during kdump operation,
adding --noio to the KEXEC_ARGS variable in /etc/sysconfig/kdump may solve the
issue.
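
A minimal sketch of applying the workaround described in the release note text. The function name add_noio is hypothetical, and the sketch assumes that any existing KEXEC_ARGS value in the file is written on one line inside double quotes; an unquoted assignment would need to be edited by hand. It relies on GNU sed's -i option.

```shell
# Ensure KEXEC_ARGS in a sysconfig-style file contains --noio.
# $1 is the file to edit; on the affected systems that is /etc/sysconfig/kdump.
add_noio() {
    conf="$1"
    if grep -q '^KEXEC_ARGS=.*--noio' "$conf"; then
        return 0    # already present, nothing to do
    fi
    if grep -q '^KEXEC_ARGS="' "$conf"; then
        # Append to the existing double-quoted value, keeping current arguments.
        sed -i 's/^KEXEC_ARGS="\(.*\)"/KEXEC_ARGS="\1 --noio"/' "$conf"
    else
        # No quoted assignment found (assumption: none exists); add one.
        printf 'KEXEC_ARGS="--noio"\n' >> "$conf"
    fi
}

# Usage (as root), followed by a restart so the capture kernel is reloaded:
#   add_noio /etc/sysconfig/kdump
#   service kdump restart
```

The function is idempotent, so running it twice leaves a single --noio in place.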

Comment 16 Don Domingo 2008-03-18 23:04:24 UTC
thanks Neil, adding to RHEL5.2 release notes under "Known Issues":

<quote>
(ia64) Some Itanium systems cannot properly produce console output from the
kexec purgatory code. This code contains instructions for backing up the first
640k of memory after a crash.

While purgatory console output can be useful in diagnosing problems, it is not
needed for kdump to properly function. As such, if your Itanium system resets
during a kdump operation, disable console output in purgatory by adding --noio
to the KEXEC_ARGS variable in /etc/sysconfig/kdump.
</quote>

please advise if any further revisions are required. thanks!

Comment 18 Qian Cai 2008-03-27 10:39:29 UTC
For the record, kdump works fine with --noio and the latest kexec-tools
1.102pre.16.el5 on this box.

Comment 19 Don Domingo 2008-04-02 02:12:50 UTC
Hi,
the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.

a mockup of the RHEL5.2 release notes can be viewed at the following link:
http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don

Comment 20 Ryan Lerch 2008-08-11 01:10:30 UTC
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes. 

This Release Note is currently located in the Known Issues section.

Comment 21 Ryan Lerch 2008-08-11 01:10:30 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Comment 23 Qian Cai 2008-11-27 05:07:03 UTC
For the record, hp-matterhorn1.rhts.bos.redhat.com was also affected.

SysRq : Trigger a crashdump

 MCA EVENT occurred : SAL error processing 

  Logging XBC Errors ....
    Size of XBC Errors : 0x0
						Complete
 Finish the Error Event Logging ....
						Complete
 Flush the cpu cache ....
						Complete
 ReEnabling CPU Poison Check ....
						Complete
Cpu8: MCA Rendez Always flag set to 1. 
Cpu8: Perform rendezvous started. 
Cpu8: Sent Rendez vector number 0xe8 to 0 cpus. 
Cpu8: Rendezvous timeout : 20000 
Cpu8: Didn't have to send MCA vector.
Cpu8: Perform rendezvous complete, RendezState : 0x1 . 
  Cell local CallGate Pointer 0x723ff3a7c00 
  CallGate Pointer after moving to CoreCell 0x723ff3a7c00 
 Firmware executing from Main Memory ....... 
Calling OS_MCA at 0x00000000040493a0...
...

https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5321586

Comment 24 Qian Cai 2008-11-27 05:08:42 UTC
As well as hp-rx8640-02.rhts.bos.redhat.com.

Comment 25 Qian Cai 2008-11-27 11:06:21 UTC

*** This bug has been marked as a duplicate of bug 277531 ***