
Bug 602395

Summary: bonded interface hangs during booting of kdump kernel
Product: Red Hat Enterprise Linux 5
Component: kexec-tools
Version: 5.4
Hardware: ppc64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Reporter: Joe Pope <pope_svr4>
Assignee: Cong Wang <amwang>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: amwang, cww, nhorman, phan, qcai, rkhan, rprice, solgato, tao
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2012-06-26 11:33:51 UTC
Bug Depends On: 809207
Attachments:
Description                Flags
page 1 kdump boot output   none
page 2 kdump boot output   none
kdump boot 6-10 pg1        none
kdump boot 6-10 pg2        none
kdump boot 6-10 pg3        none
Proposed patch             none
Latest kexec-tools         none

Description Joe Pope 2010-06-09 18:59:35 UTC
Description of problem:
After executing "echo c > /proc/sysrq-trigger" the server panics and boots the kdump kernel. When the boot sequence (kdump kernel) tries to bring up the bonded interface, the server hangs indefinitely and does not continue booting. The bonded interface (bond0) consists of eth0/eth1. The NICs are 10G Emulex cards. The server is an IBM P570.

Version-Release number of selected component (if applicable):
kernel - 2.6.18-194.3.1.el5
kexec-tools - 1.102pre-96.el5_5.1
kernel-kdump - 2.6.18-194.3.1.el5

How reproducible:
Initiate a kernel panic

Steps to Reproduce:
1. Bond eth0/eth1.
2. Initiate a kernel panic.
3. The server hangs while trying to bring up bond0.
  
Actual results:
The server hangs indefinitely when trying to bring up the bonded interface

Expected results:
The kdump kernel will boot and dump a vmcore file

Additional info:
kdump works fine with non-bonded interfaces
Bonded interface options: mode=1, arp_interval=1000, arp_ip_target=x.x.x.x

Comment 1 Neil Horman 2010-06-09 19:28:49 UTC
Can you please provide the serial console output of the kdump kernel boot process? Thanks!

Comment 2 Joe Pope 2010-06-09 23:20:19 UTC
This is a closed network so I attached the output as a PDF. The IBM P570 server is connected to an HMC and that is where I copied the output from.

Comment 3 Joe Pope 2010-06-09 23:21:09 UTC
Created attachment 422728 [details]
page 1 kdump boot output

Comment 4 Joe Pope 2010-06-09 23:22:00 UTC
Created attachment 422729 [details]
page 2 kdump boot output

Comment 5 Joe Pope 2010-06-10 12:43:04 UTC
Correction: The NICs are 10G IBM onboard and the HBAs are Emulex. I made a typo in the description above.

Comment 6 Neil Horman 2010-06-10 13:27:29 UTC
Looks like it hit an error trying to access files in sysfs created by the bonding module. Looking at the start of the log, it appears the bonding module never loaded. Does the problem go away if you add this:
extra_modules bonding
to the kdump.conf file?

That would suggest we need to fix up our module dependency checking in mkdumprd
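
One way to confirm whether a module such as bonding is actually loaded is to look for it in /proc/modules. The following is only a minimal sketch: module_loaded is a hypothetical helper name, and its optional second argument (a listing file) is an assumption added so the function can be tried against a saved copy of the listing.

```shell
# Sketch: report whether a kernel module appears in a /proc/modules-style listing.
# module_loaded is a hypothetical helper; the optional second argument defaults
# to /proc/modules and exists only so a saved listing can be checked instead.
module_loaded() {
    mod=$1
    listing=${2:-/proc/modules}
    # /proc/modules lists one module per line, with the module name in field 1.
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

# Usage:
#   module_loaded bonding && echo "bonding is loaded"
```

The function returns success only on an exact name match, so a module whose name merely contains "bonding" is not counted.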

Comment 7 Joe Pope 2010-06-11 00:35:03 UTC
I added that line to kdump.conf and I saw the bonding module load. I attached another console output. The server is running the versions of kernel, kexec-tools and kernel-kdump as mentioned above but I have not done a full "yum update" to RHEL 5.5. Is it possible there is another package that is out of date causing the issue?

Comment 8 Joe Pope 2010-06-11 00:35:30 UTC
The boot still hangs at bringing up bond0.

Comment 9 Joe Pope 2010-06-11 00:36:13 UTC
Created attachment 423097 [details]
kdump boot 6-10 pg1

Comment 10 Joe Pope 2010-06-11 00:36:43 UTC
Created attachment 423098 [details]
kdump boot 6-10 pg2

Comment 11 Joe Pope 2010-06-11 00:37:15 UTC
Created attachment 423099 [details]
kdump boot 6-10 pg3

Comment 12 Joe Pope 2010-06-11 02:05:45 UTC
I had been running the 2.6.18-164.11.1 kernel. After reading some docs about kdump I updated the three packages noted above.

Comment 13 Neil Horman 2010-06-11 11:13:52 UTC
Well, there have been several bonding deadlock fixes, although I'm not 100% certain why they would show up in a kdump kernel but not in the normal kernel. There are test kernels here:
http://people.redhat.com/agospoda/

Please give them a try if you can.

Comment 14 Joe Pope 2010-06-11 21:13:21 UTC
The fix was to add "extra_modules ehea" to kdump.conf. Thanks for your help.
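
For anyone applying the same workaround across several machines, the change can be scripted. This is a minimal sketch, not part of kexec-tools: ensure_extra_module is a hypothetical helper name, and the config path is a parameter so the function can be tried on a copy before touching /etc/kdump.conf.

```shell
# Sketch: make sure "extra_modules <module>" is present in a kdump.conf-style file.
# ensure_extra_module is a hypothetical helper, not something shipped with kexec-tools.
ensure_extra_module() {
    mod=$1
    conf=${2:-/etc/kdump.conf}
    # Succeed early if an uncommented extra_modules line already names the module.
    if awk -v m="$mod" '
        $1 == "extra_modules" { for (i = 2; i <= NF; i++) if ($i == m) found = 1 }
        END { exit !found }' "$conf"; then
        return 0
    fi
    # Otherwise append a new directive at the end of the file.
    printf 'extra_modules %s\n' "$mod" >> "$conf"
}

# Usage (try on a copy first):
#   cp /etc/kdump.conf /tmp/kdump.conf.test
#   ensure_extra_module ehea /tmp/kdump.conf.test
```

The kdump initrd has to be regenerated afterwards for the change to take effect; on RHEL 5 this is typically done by restarting the kdump service.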

Comment 15 Cong Wang 2010-06-15 03:04:06 UTC
(In reply to comment #14)
> The fix was to add "extra_modules ehea" to kdump.conf. Thanks for your help.    

Oh, you missed the net driver. :)
So can we close this as NOTABUG?

Comment 16 Joe Pope 2010-06-15 12:12:29 UTC
I do not know if you want to close this as NOTABUG because without adding the "extra_modules ehea" line to kdump.conf, the bonded interface would not come up when booting the kdump kernel. If the interface was not bonded the extra_modules line was not needed.

Comment 17 Robin R. Price II 2010-06-21 21:07:01 UTC
IBM would still like to see this issue resolved so that EHEA NICs don't have to be a special case. Other customers may not know to add "extra_modules ehea" to kdump.conf.

Will we be looking at a fix for this or is this just the way it is?

~rp

Comment 18 Cong Wang 2010-06-22 02:24:18 UTC
(In reply to comment #17)
> IBM would like to see this resolved.   We would still like to see this issue
> resolved so that EHEA NICs don't have to be a special case.  Other customers
> may not know to add the extra_modules ehea in kdump.conf
> 
> Will we be looking at a fix for this or is this just the way it is?
> 

According to comment #16, this sounds like a bug, so we should fix it.

Comment 19 Cong Wang 2010-06-22 06:46:18 UTC
Joe, what is the bonding configuration of your system?

Comment 20 Joe Pope 2010-06-22 13:22:32 UTC
modprobe.conf:

alias bond0 bonding
options bond0 mode=1 arp_interval=1000 arp_ip_target=x.x.x.x

Comment 21 Cong Wang 2010-06-23 08:41:26 UTC
(In reply to comment #20)
> modprobe.conf:
> 
> alias bond0 bonding
> options bond0 mode=1 arp_interval=1000 arp_ip_target=x.x.x.x    

And what does 'cat  /sys/class/net/bond0/bonding/slaves' say?

Comment 22 Joe Pope 2010-06-23 11:57:14 UTC
eth0 eth1
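
The slaves file holds a single space-separated line, so it can be split into one interface per line for further checks. This is a minimal sketch: bond_slaves is a hypothetical helper name, and the optional sysfs root argument is an assumption added so the function can be tried against a test directory.

```shell
# Sketch: print the slave interfaces of a bond, one per line.
# bond_slaves is a hypothetical helper; the optional second argument is the
# sysfs mount point and defaults to /sys.
bond_slaves() {
    bond=$1
    sysfs=${2:-/sys}
    f="$sysfs/class/net/$bond/bonding/slaves"
    [ -r "$f" ] || return 1
    # Split the space-separated list into lines and drop any empty ones.
    tr -s ' ' '\n' < "$f" | sed '/^$/d'
}

# Usage: show driver information for every slave of bond0.
#   for s in $(bond_slaves bond0); do ethtool -i "$s"; done
```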

Comment 23 Cong Wang 2010-06-24 03:21:12 UTC
(In reply to comment #22)
> eth0 eth1    

So either eth0 or eth1 is an ehea NIC, right? Hmm, if so, it should be included in the kdump initrd and get loaded during the reboot.

Comment 24 Kevin W. Rudd 2010-06-24 14:56:03 UTC
Actually, not by default.  The kdump (or even standard) initrd doesn't usually contain network modules as they are not considered critical for the initial access to the root filesystem.  Listing them as "extra_modules" gets them added to the initrd, and this works around the odd timing issue that seems to be happening when kdump tries to bring up bonding, but we would still like to know why kdump is having trouble bringing up bonding with 10G ehea devices when other network devices come up just fine without needing their modules included in the initrd.
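
Whether a given module made it into an initrd can be checked by listing the image contents. This is a minimal sketch: listing_has_module is a hypothetical helper name, and the initrd path and the gzip-compressed cpio format in the usage comment are assumptions that may not hold on every system.

```shell
# Sketch: succeed if an initrd file listing (read on stdin) contains <module>.ko.
# listing_has_module is a hypothetical helper name, not part of kexec-tools.
listing_has_module() {
    # Match "<module>.ko" as a complete file name at the end of a path.
    grep -Eq "(^|/)$1\.ko\$"
}

# Usage, assuming a gzip-compressed cpio initrd at the path below (adjust as needed):
#   zcat "/boot/initrd-$(uname -r)kdump.img" | cpio -it 2>/dev/null | listing_has_module ehea \
#       && echo "ehea is in the kdump initrd"
```

Keeping the check separate from the extraction step means the same function works with whatever tool produces the file listing.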

Comment 25 Cong Wang 2010-06-29 02:10:42 UTC
(In reply to comment #24)
> Actually, not by default.  The kdump (or even standard) initrd doesn't usually
> contain network modules as they are not considered critical for the initial
> access to the root filesystem.  Listing them as "extra_modules" gets them added
> to the initrd, and this works around the odd timing issue that seems to be
> happening when kdump tries to bring up bonding, but we would still like to know
> why kdump is having trouble bringing up bonding with 10G ehea devices when
> other network devices come up just fine without needing their modules included
> in the initrd.    


Well, I believe Joe configured kdump to dump over the network via a bonding interface. In that case, the related network modules should be included. From the code, I can't see any reason why the ehea driver is special here.

I will ask QA to find an ehea NIC to see if we can reproduce the problem here.

Comment 27 Cong Wang 2010-06-29 06:42:26 UTC
Created attachment 427584 [details]
Proposed patch

Joe, could you please try this patch?
Just save this patch, then do:

cd /sbin; patch -p1 < this_patch.diff;

Then have kdump regenerate the initrd and do a crash reboot to see if it works now.

Thanks.

Comment 28 Cong Wang 2010-06-29 06:48:48 UTC
Kevin, Neil,

The problem is that on ppc, the following method of getting net driver doesn't work:

cat /sys/class/net/$netdev/device/modalias

So, with this patch, it will fall into the other method:

ethtool -i $netdev | awk '/^driver:/ {print $2}'

This patch is not the best fix, but at least it should work.
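
The ethtool-based lookup can be tried on its own. This is a minimal sketch and not the code from the patch: parse_ethtool_driver and net_driver are hypothetical helper names, and the sketch relies on `ethtool -i` printing a line that begins with `driver:`.

```shell
# Sketch: pull the driver name out of `ethtool -i` style output read on stdin.
# parse_ethtool_driver is a hypothetical helper name.
parse_ethtool_driver() {
    # The driver name is the second field of the line starting with "driver:".
    awk '/^driver:/ { print $2; exit }'
}

# Hypothetical wrapper: ask ethtool for the driver behind a network device.
net_driver() {
    ethtool -i "$1" 2>/dev/null | parse_ethtool_driver
}

# Usage:
#   net_driver eth0
```

Splitting the parsing from the ethtool call makes it possible to check the parsing against captured output from a machine that cannot be touched directly.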

Comment 29 Neil Horman 2010-06-29 11:32:33 UTC
Hmm, I think that's actually a fine fix. In fact, I wonder if it wouldn't be worth completely ripping out the conditional that checks /sys and replacing that logic with your ethtool query, as I expect ethtool will work in all cases for all device interfaces.

Comment 30 Joe Pope 2010-06-30 12:16:29 UTC
The fix we were given before, the "extra_modules" line in kdump.conf, fixed our issue. I am not able to apply the patch to this system at this time.

Comment 31 Cong Wang 2010-07-01 07:31:08 UTC
(In reply to comment #28)
> Kevin, Neil,
> 
> The problem is that on ppc, the following method of getting net driver doesn't
> work:
> 
> cat /sys/class/net/$netdev/device/modalias
> 

Oops! On RHEL5 this file doesn't even exist, but it is true on RHEL6.
So on RHEL5 it actually falls back to using 'ethtool -i' to get the driver name.

When QA tried to reproduce this problem, we saw that the ehea driver _did_ get loaded in the second kernel (without extra_modules). So this problem looks really odd to me, since we are using the same versions of the kernel and kexec-tools.

Comment 32 Joe Pope 2010-07-01 12:12:02 UTC
If I "unbonded" the interfaces the kdump kernel DID boot fine. It was when the kdump kernel tried to bring up bond0, made up of eth0/eth1, that the system would hang and never continue the boot. The eth0/eth1 interfaces are ehea. We are dumping the vmcore to a local file system so network was not required, but with the server hanging on bringing up bond0, we never got the vmcore dumped. The addition of "extra_modules ehea" was the only way the kdump kernel would completely boot and dump vmcore.

Comment 35 RHEL Program Management 2011-05-31 14:37:54 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 39 RHEL Program Management 2012-04-02 10:44:48 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 40 Cong Wang 2012-06-25 07:35:13 UTC
Created attachment 594120 [details]
Latest kexec-tools

This is the latest RHEL5 kexec-tools ppc64 rpm. There have been many changes since the last update of this BZ, so please retest this bug.

Thanks!

Comment 41 Joe Pope 2012-06-26 11:30:37 UTC
Thanks for the follow-up. We have since moved to RHEL6 and we are not using kdump anymore. I do not have any RHEL5 servers left to retest this bug.

Comment 42 Cong Wang 2012-06-26 11:33:51 UTC
Thanks, Joe! Then let's close this bug...

Comment 43 Joe Pope 2012-06-26 14:24:05 UTC
works for me.