Bug 602395
| Summary: | bonded interface hangs during booting of kdump kernel | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Joe Pope <pope_svr4> |
| Component: | kexec-tools | Assignee: | Cong Wang <amwang> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.4 | CC: | amwang, cww, nhorman, phan, qcai, rkhan, rprice, solgato, tao |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | ppc64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-06-26 11:33:51 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 809207 | | |
| Bug Blocks: | | | |
| Attachments: | | | |

Description
Joe Pope
2010-06-09 18:59:35 UTC
Can you please provide the serial console output of the kdump kernel boot process? Thanks!

This is a closed network so I attached the output as a PDF. The IBM P570 server is connected to an HMC and that is where I copied the output from.

Created attachment 422728 [details]
page 1 kdump boot output
Created attachment 422729 [details]
page 2 kdump boot output
Correction: The NICs are 10G IBM onboard and the HBAs are Emulex. I mistyped this in the description above.

Looks like it hit an error trying to access files in sysfs created by the bonding module. Looking at the start of the log, it appears the bonding module never loaded. Does the problem go away if you add this:

extra_modules bonding

to the kdump.conf file? That would suggest we need to fix up our module dependency checking in mkdumprd.

I added that line to kdump.conf and I saw the bonding module load. I attached another console output. The server is running the versions of kernel, kexec-tools and kernel-kdump as mentioned above but I have not done a full "yum update" to RHEL 5.5. Is it possible there is another package that is out of date causing the issue? The boot still hangs at bringing up bond0.

Created attachment 423097 [details]
kdump boot 6-10 pg1
Created attachment 423098 [details]
kdump boot 6-10 pg2
Created attachment 423099 [details]
kdump boot 6-10 pg3
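
The `extra_modules` suggestion above can be applied without duplicating the directive on repeated runs. The following is a minimal sketch, not part of kexec-tools: the function name `add_extra_module` and its arguments are hypothetical, and it assumes mkdumprd accepts more than one `extra_modules` line in kdump.conf.

```shell
#!/bin/sh
# Sketch only: add_extra_module is a hypothetical helper, not a kexec-tools command.
# It appends "extra_modules <module>" to a kdump.conf-style file unless an
# uncommented extra_modules line already lists that module.
add_extra_module() {
    conf=$1
    module=$2
    # Scan existing extra_modules lines for the module as a whole word.
    if awk -v m="$module" '
        $1 == "extra_modules" { for (i = 2; i <= NF; i++) if ($i == m) found = 1 }
        END { exit found ? 0 : 1 }' "$conf"
    then
        return 0
    fi
    # Assumption: a separate extra_modules line per module is acceptable.
    printf 'extra_modules %s\n' "$module" >> "$conf"
}
```

For example, `add_extra_module /etc/kdump.conf bonding` would add the line once; the kdump initrd still has to be regenerated afterwards for the change to take effect.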
I had been running the 2.6.18-164.11.1 kernel. After reading some docs about kdump I updated the three packages noted above.

Well, there have been several bonding deadlock fixes, although I'm not 100% certain why they would present in a kdump kernel but not the normal kernel. There are test kernels here:

http://people.redhat.com/agospoda/

if you can give them a try.

The fix was to add "extra_modules ehea" to kdump.conf. Thanks for your help.

(In reply to comment #14)
> The fix was to add "extra_modules ehea" to kdump.conf. Thanks for your help.

Oh, you missed the net driver. :) So can we close this as NOTABUG?

I do not know if you want to close this as NOTABUG because without adding the "extra_modules ehea" line to kdump.conf, the bonded interface would not come up when booting the kdump kernel. If the interface was not bonded the extra_modules line was not needed.

IBM would like to see this resolved. We would still like to see this issue resolved so that EHEA NICs don't have to be a special case. Other customers may not know to add the extra_modules ehea in kdump.conf.

Will we be looking at a fix for this or is this just the way it is?

~rp

(In reply to comment #17)
> IBM would like to see this resolved. We would still like to see this issue
> resolved so that EHEA NICs don't have to be a special case. Other customers
> may not know to add the extra_modules ehea in kdump.conf
>
> Will we be looking at a fix for this or is this just the way it is?
>

According to comment #16, this sounds like a bug, so we should fix it.

Joe, what is your bonding configuration of your system?

modprobe.conf:

alias bond0 bonding
options bond0 mode=1 arp_interval=1000 arp_ip_target=x.x.x.x

(In reply to comment #20)
> modprobe.conf:
>
> alias bond0 bonding
> options bond0 mode=1 arp_interval=1000 arp_ip_target=x.x.x.x

And what does 'cat /sys/class/net/bond0/bonding/slaves' say?

eth0 eth1

(In reply to comment #22)
> eth0 eth1

So either eth0 or eth1 is an ehea NIC, right?
Hmm, if so, it should be included in the kdump initrd and get loaded during reboot.

Actually, not by default. The kdump (or even standard) initrd doesn't usually contain network modules as they are not considered critical for the initial access to the root filesystem. Listing them as "extra_modules" gets them added to the initrd, and this works around the odd timing issue that seems to be happening when kdump tries to bring up bonding, but we would still like to know why kdump is having trouble bringing up bonding with 10G ehea devices when other network devices come up just fine without needing their modules included in the initrd.

(In reply to comment #24)
> Actually, not by default. The kdump (or even standard) initrd doesn't usually
> contain network modules as they are not considered critical for the initial
> access to the root filesystem. Listing them as "extra_modules" gets them added
> to the initrd, and this works around the odd timing issue that seems to be
> happening when kdump tries to bring up bonding, but we would still like to know
> why kdump is having trouble bringing up bonding with 10G ehea devices when
> other network devices come up just fine without needing their modules included
> in the initrd.

Well, I believe Joe configures kdump to dump over the network via a bonding interface; in that case, the related network modules should be included. From the code, I can't see any reason why the ehea driver is special here. I will ask QA to find an ehea NIC to see if we can reproduce the problem here.

Created attachment 427584 [details]
Proposed patch
Joe, could you please try this patch?
Just save this patch, then do:
cd /sbin; patch -p1 < this_patch.diff;
and then have kdump regenerate the initrd and do a crash reboot to see whether it works now.
Thanks.
Kevin, Neil,
The problem is that on ppc, the following method of getting the net driver doesn't work:
cat /sys/class/net/$netdev/device/modalias
So, with this patch, it will fall into the other method:
ethtool -i $netdev | awk '/^driver:/ {print $2}'
This patch is not the best fix, but at least it should work.
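
To make the two lookup methods discussed above concrete, here is a minimal sketch of the fallback order: read the sysfs modalias when it is present, otherwise parse the `driver:` line that `ethtool -i` prints. This is not the mkdumprd code; the helper names (`driver_from_ethtool_output`, `find_net_driver`, `bond_slave_drivers`) and the `SYSFS_NET` override are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: helper names and the SYSFS_NET override are hypothetical,
# not taken from mkdumprd.

# Print the driver name from `ethtool -i` output supplied on stdin.
driver_from_ethtool_output() {
    awk '/^driver:/ { print $2; exit }'
}

# First method: print the device's modalias when sysfs exposes one.
# Second method: otherwise ask ethtool for the driver name.
find_net_driver() {
    netdev=$1
    modalias_file="${SYSFS_NET:-/sys/class/net}/$netdev/device/modalias"
    if [ -r "$modalias_file" ]; then
        cat "$modalias_file"
    else
        ethtool -i "$netdev" | driver_from_ethtool_output
    fi
}

# Print "<slave> <lookup result>" for every slave of a bonding device.
bond_slave_drivers() {
    bond=$1
    slaves_file="${SYSFS_NET:-/sys/class/net}/$bond/bonding/slaves"
    for slave in $(cat "$slaves_file"); do
        printf '%s %s\n' "$slave" "$(find_net_driver "$slave")"
    done
}
```

Running `bond_slave_drivers bond0` on the affected system would show which lookup result each slave yields. Note that the first method returns a modalias string that still has to be resolved to a module, while the second returns the driver name directly.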
Hmm, I think that's actually a fine fix. In fact, I wonder if it wouldn't be worth completely ripping out the conditional that checks /sys, and replacing that logic with your ethtool query, as I expect ethtool will work in all cases for all device interfaces.

The fix we were given before, the "extra_modules" line in kdump.conf, fixed our issue. I am not able to apply the patch to this system at this time.

(In reply to comment #28)
> Kevin, Neil,
>
> The problem is that on ppc, the following method of getting net driver doesn't
> work:
>
> cat /sys/class/net/$netdev/device/modalias
>

Oops! On RHEL5 this file doesn't even exist, but it is true on RHEL6. So, actually on RHEL5 it will fall back to using 'ethtool -i' to get the driver name.

When QA tried to reproduce this problem, we saw that the ehea driver _did_ get loaded in the second kernel (without extra_modules). So this problem looks really odd to me; we are using the same version of kernel and kexec-tools.

If I "unbonded" the interfaces the kdump kernel DID boot fine. It was when the kdump kernel tried to bring up bond0, made up of eth0/eth1, that the system would hang and never continue the boot. The eth0/eth1 interfaces are ehea. We are dumping the vmcore to a local file system so network was not required, but with the server hanging on bringing up bond0, we never got the vmcore dumped. The addition of "extra_modules ehea" was the only way the kdump kernel would completely boot and dump vmcore.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Created attachment 594120 [details]
Latest kexec-tools
This is the latest RHEL5 kexec-tools, ppc64 rpm. There have been lots of changes since the last update of this BZ; please retest this bug.
Thanks!
Thanks for the follow-up. We have since moved to RHEL6 and we are not using kdump anymore. I do not have any RHEL5 servers left to retest this bug.

Thanks, Joe! Then let's close this bug...

Works for me.