Bug 368981

Summary: kexec/kdump doesn't work on nfs root on QS21
Product: Red Hat Enterprise Linux 5
Reporter: Scott Moser <smoser>
Component: kexec-tools
Assignee: Neil Horman <nhorman>
Status: CLOSED DUPLICATE
Severity: low
Priority: low
Version: 5.1
CC: ahecox, bpeters, ddomingo, hannsj_uhl, jjarvis, mohd.omar
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
(all architectures) Executing kdump on an IBM Bladecenter QS21 or QS22 configured with NFS root will fail. To avoid this, specify an NFS dump target in /etc/kdump.conf.
Last Closed: 2008-03-07 19:08:22 UTC
Bug Blocks: 391221, 454962    
Attachments:
qs21-kdump-53.el5.log: log from 2.6.18-53.el5 with modified mkdumprd (flags: none)
qs21-kdump-58.el5.rhel5u2.sm12.log: same as above with sm12 kernel (flags: none)
netdump-log for RHEL5.2-Beta1 on QS21 (flags: none)


Description Scott Moser 2007-11-06 21:48:58 UTC
Description of problem:

When trying to test the fix in bug 313731, I tried to execute kexec/kdump on
a QS21 configured with NFS root (QS21 has no local storage).

 - when running mkdumprd, the handle_netdev function does not exist (bug 368941)
 - fixing the above produces a working 'service kdump start', but the dump is
   not taken
 - the same happens with the 'enforcing=0' kernel parameter:
  $ cat /proc/cmdline
   crashkernel=256M@16M enforcing=0
  $ touch /etc/kdump.conf
  $ service kdump restart
  $ echo c > /proc/sysrq-trigger
  - the kernel starts to boot, then the SOL console drops
   - the last thing you see is 'md: bitmap version 4.39'
  - my assumption is that the booted kdump kernel somehow dies after that
  - soon after (<1 minute), the tftp server sees a request for a boot
    image from the QS21, and the QS21 "comes back" but is now booting the
    non-kdump kernel
  - there is no crash dump in /var/crash
  - after re-connecting with 'console' it eventually comes back, but SOL is
    messed up and generally unusable.  I have to power off the system,
    detach from the bladecenter and power it back on to get the SOL back.
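
A quick pre-flight check for the steps above could look roughly like the
sketch below. It is only an illustration: has_crashkernel and
crashkernel_value are hypothetical helper names, not part of kexec-tools,
and they only inspect the kernel command line. They do not prove that the
memory reservation succeeded or that a kdump kernel is loaded.

```shell
#!/bin/sh
# Hypothetical helper (not part of kexec-tools): succeed if the given
# kernel command line file contains a crashkernel= option.
# Defaults to /proc/cmdline when no file is given.
has_crashkernel() {
    cmdline_file="${1:-/proc/cmdline}"
    grep -q 'crashkernel=' "$cmdline_file"
}

# Hypothetical helper: print the value of the crashkernel= option,
# for example 256M@16M for the command line shown in this report.
crashkernel_value() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each option on its own line, then strip the option name.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p'
}
```

Usage would be something like 'has_crashkernel || echo "no crashkernel= on
the command line"' before running 'service kdump restart'.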

Comment 2 Neil Horman 2007-12-18 19:50:02 UTC
contents of /etc/kdump.conf?

Can you provide a log of what the serial console did manage to capture before it
dropped?

I expect what happened is that, due to bz 313731, mkdumprd got rather confused
and produced a bad initramfs for kdump.  Given the error message above, it
likely thinks that you are using a software RAID setup of some sort (the md
utility), at which point it fails setup and reboots the system back to the
original kernel.  My guess is this is a duplicate of bz 368941.  I'll leave it
open until we're sure.

No idea why the SOL would drop during reboot.  Isn't the SOL run independent of
the system in question?  I thought crashes/reboots weren't supposed to affect
the management interfaces.



Comment 3 Scott Moser 2007-12-18 22:59:47 UTC
Created attachment 289959 [details]
qs21-kdump-53.el5.log: log from 2.6.18-53.el5 with modified mkdumprd

The above is a console log of a kdump attempt on qs21 running 2.6.18-53.el5.
The suggested patch from 368941 (attachment 289946 [details]) is applied to mkdumprd.

Comment 4 Scott Moser 2007-12-18 23:02:09 UTC
Created attachment 289961 [details]
qs21-kdump-58.el5.rhel5u2.sm12.log : same as above with sm12 kernel

This is the log from running the sm12 kernel (my development kernel) from
http://people.redhat.com/smoser/rhel5u2/sm12 . It contains all Cell-related
fixes for RHEL5u2 (among other things).  There is no real
difference in the log other than the kernel used.

Comment 5 Neil Horman 2007-12-19 00:32:56 UTC
hmm, ok, this may not be a dupe after all.  Judging by those logs, we either:
1) may not be getting into the initrd at all (i.e. hanging prior to
loading/running /init in the initramfs)

2) Somehow not getting messages to the console properly, even though we are
functioning properly otherwise.

Scott, did you say I could get access to this machine to test on?  It would
probably be easiest if I could just have direct access to tinker for a bit, if
that's possible.  Thanks!

Comment 6 Scott Moser 2007-12-19 01:02:46 UTC
(In reply to comment #5)
> Scott, did you say I could get access to this machine to test on?  It would
> probably be easiest if I could just have direct access to tinker for a bit, if
> that's possible.  Thanks!

I've forwarded you info.

Comment 7 Neil Horman 2007-12-19 01:06:08 UTC
That's right, I remember now.  Thanks!

Comment 8 Neil Horman 2007-12-19 17:39:03 UTC
FWIW, it looks from my tinkering like we're not getting into the initramfs at
all yet on this system.  Depending on the iteration, we either jump back to BIOS
halfway through kernel init, or we try to access the initramfs but seem to fail
the sys_access call in init().  Blocking this on 313731.

Comment 9 Neil Horman 2007-12-19 20:36:36 UTC
As per conversation with Scott, I'm moving this bug to be dependent on the
correct kexec/cell bug.

Comment 18 Don Domingo 2008-02-19 00:02:46 UTC
thanks Neil, adding to RHEL5.2 release notes under "known issues":

<quote>
Executing kdump on a QS21 configured with NFS root will fail. To avoid this,
specify an NFS dump target in /etc/kdump.conf.
</quote>

please advise if any revisions are required. 

Comment 19 omar 2008-03-07 10:37:48 UTC
Created attachment 297165 [details]
netdump-log for RHEL5.2-Beta1 on QS21

I have tried to verify kdump support for RHEL5.2-Beta1 (2.6.18-84.el5) on QS21.
I found that the secondary kernel has the same problem booting on a diskless
QS21 machine.

I performed the following steps:
=================================
- install RHEL5.2-Beta1 on QS21 (2.6.18-84.el5)
- install kernel-kdump
  (http://people.redhat.com/dzickus/el5/84.el5/ppc64/kernel-kdump-2.6.18-84.el5.ppc64.rpm)
- reboot with crashkernel added to the kernel command line
  (boot net crashkernel=256M@32M)
- set up kdump to dump to an NFS mount point:
  echo "net your.host.here:/your/exported/dir" >> /etc/kdump.conf
- service kdump restart
- echo 'c' > /proc/sysrq-trigger
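
Before triggering the crash, it may help to confirm that the "net" line from
the steps above actually ended up in the config file. The sketch below is
only an illustration: kdump_net_target is a hypothetical helper name, not
something shipped with kexec-tools, and it assumes the dump target is given
on a line of the form "net <target>" as in the echo command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of kexec-tools): print the target of the
# last uncommented "net" line in a kdump config file, or fail if there is
# none.  Defaults to /etc/kdump.conf when no file is given.
kdump_net_target() {
    conf_file="${1:-/etc/kdump.conf}"
    # Commented lines start with "#", so their first field is never "net".
    awk '$1 == "net" { target = $2 }
         END { if (target != "") print target; else exit 1 }' "$conf_file"
}
```

For example, 'kdump_net_target || echo "no net dump target configured"'
would flag an empty /etc/kdump.conf like the one created with touch in the
original description.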

The secondary kernel is loaded and starts booting, then the system reboots.
I found /var/crash is empty.

*Attaching the log.

--Regards
  Omar M

Comment 20 Neil Horman 2008-03-07 19:08:22 UTC
I'm closing this as a dupe of bz 368941, as they're both tracking the same
issue, and the other bz has an additional patch in it already to clean some
other cruft up.

*** This bug has been marked as a duplicate of 368941 ***

Comment 21 Don Domingo 2008-03-19 00:54:24 UTC
minor release note revision as per BZ#438030:

<quote>
Executing kdump on an IBM Bladecenter QS21 or QS22 configured with NFS root will
fail. To avoid this, specify an NFS dump target in /etc/kdump.conf.
</quote>

please advise if any further revisions are required. thanks!

Comment 22 omar 2008-03-19 07:00:28 UTC
But it didn't work for me, as I said in comment #19.

Comment 23 Neil Horman 2008-03-19 11:37:48 UTC
What you encountered in comment 19 was a different problem, one which IBM is
investigating.  The RHEL5 kernel was booting as of kernel release -65.el5, but
stopped again sometime between -65.el5 and -84.el5.  IIRC IBM is bisecting to
determine the release in which it initially (re-)broke.  If you try to boot with
kernel -65.el5 and use the config suggested by Don's release note, then all
should work quite well.

Comment 24 Don Domingo 2008-04-02 02:12:06 UTC
Hi,
the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.

a mockup of the RHEL5.2 release notes can be viewed at the following link:
http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don

Comment 25 Ryan Lerch 2008-08-08 01:00:10 UTC
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes. 

This Release Note is currently located in the Known Issues section.

Comment 26 Ryan Lerch 2008-08-08 01:00:10 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.