Bug 602325

Summary: kdump to network target fails over bridge device
Product: Red Hat Enterprise Linux 6
Reporter: Dave Maley <dmaley>
Component: kexec-tools
Assignee: Cong Wang <amwang>
Status: CLOSED CURRENTRELEASE
QA Contact: Chao Ye <cye>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.0
CC: amwang, cye, jboggs, jolsa, nhorman, phan, qcai, rkhan, tao, tgraf, vbenes
Target Milestone: rc
Target Release: 6.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: kexec-tools-2_0_0-120_el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-11-15 14:29:31 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 506995, 578501
Attachments:
Description                                  Flags
serial console log during kdump using scp    none
Proposed patch                               none
related tcpdump output                       none

Description Dave Maley 2010-06-09 15:35:30 UTC
Created attachment 422599
serial console log during kdump using scp

Description of problem:
kdump to a network target (scp or nfs) fails over a bridge device.  The messages below are output when restarting the kdump service:

   # service kdump restart
   Stopping kdump:                                            [  OK  ]
   Detected /etc/kdump.conf or /boot/vmlinuz-2.6.32-28.el6.x86_64 change
   Rebuilding /boot/initrd-2.6.32-28.el6.x86_64kdump.img
   ls: cannot access /sys/class/net/br0/device: No such file or directory
   Starting kdump:                                            [  OK  ]


Version-Release number of selected component (if applicable):
- kernel-2.6.32-28.el6
- kexec-tools-2.0.0-69.el6


How reproducible:
always


Steps to Reproduce:
1. create bridge device
2. configure kdump.conf to use scp or nfs
3. restart kdump service
4. panic the server


Actual results:
vmcore is not captured on remote host


Expected results:
vmcore captured on remote host


Additional info:
- the problem occurs on both i386 and x86_64
- configuring kdump.conf to dump to the local host succeeds
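
Editorial sketch: the "ls: cannot access /sys/class/net/br0/device" message in the restart output is consistent with how sysfs represents bridges. A bridge is a virtual interface with no backing 'device' entry; the kernel instead exposes a 'bridge' directory for it and a 'brif' directory listing its ports. The following is a minimal sketch of how a script could tell the two cases apart. The helper names (is_bridge, bridge_ports) and the optional sysfs-root argument are hypothetical and are not taken from mkdumprd.

```shell
#!/bin/sh
# Sketch only: is_bridge and bridge_ports are hypothetical helper names,
# not functions from mkdumprd. The optional second argument (sysfs root)
# is an assumption added so the helpers can be tried against a fake tree.

# Return 0 if the interface is a bridge: the kernel creates
# /sys/class/net/<iface>/bridge for bridge devices only.
is_bridge() {
    iface=$1
    sysroot=${2:-/sys}
    [ -d "$sysroot/class/net/$iface/bridge" ]
}

# Print the ports enslaved to a bridge, one per line. Each port shows up
# as an entry under /sys/class/net/<bridge>/brif/.
bridge_ports() {
    iface=$1
    sysroot=${2:-/sys}
    for port in "$sysroot/class/net/$iface/brif"/*; do
        # Skip the unexpanded glob when the directory is missing or empty.
        [ -e "$port" ] && basename "$port"
    done
    return 0
}
```

With the configuration from this report, 'is_bridge br0' would be expected to succeed and 'bridge_ports br0' to print eth0, which is the physical interface whose driver actually needs to be included in the kdump initrd.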

Comment 1 Cong Wang 2010-07-02 08:22:54 UTC
Created attachment 428763
Proposed patch

This works, but unfortunately udhcpc ultimately fails to get a dynamic IP for br0 and eth0. I can't figure out whether this is a problem with udhcpc itself.

Comment 2 Cong Wang 2010-07-19 09:48:48 UTC
(In reply to comment #1)
> This works, but unfortunately udhcpc ultimately fails to get a dynamic IP for
> br0 and eth0. I can't figure out whether this is a problem with udhcpc itself.

This problem is caused by our network configuration inside RH: we shouldn't set up br0 and eth0 with DHCP at the same time. So I think the patch is OK.

Comment 6 Cong Wang 2010-07-21 06:06:53 UTC
Hi, please attach your ifcfg-br0 and ifcfg-eth0. Do they work correctly in the
first kernel after 'service network restart'?

Comment 8 Issue Tracker 2010-07-21 07:29:29 UTC
Event posted on 07-21-2010 04:29pm JST by mfuruta

This event sent from IssueTracker by mfuruta
issue 959923
it_file 882273

Comment 9 Cong Wang 2010-07-22 02:39:13 UTC
Hmm, this still looks like a udhcpc problem...

Comment 10 Cong Wang 2010-07-28 09:05:17 UTC
This seems to be a bug in udhcpc: it doesn't bind port 68, which causes an ICMP unreachable error. I don't know why this only happens on a bridge.

Comment 12 Neil Horman 2010-07-28 12:40:43 UTC
What makes you think that udhcpc doesn't bind to port 68?  Why would it be binding to a port at all?  DHCP clients use raw sockets when requesting network addresses; binding to a port in that state is meaningless.

Looking at the log, I'd say this is actually your problem:
mapping br0 to lo

For some reason mkdumprd has gotten confused and thinks that br0 should actually be lo (the loopback interface).  That's what needs fixing.

Comment 14 Cong Wang 2010-07-29 03:01:44 UTC
(In reply to comment #12)
> What makes you think that udhcpc doesn't bind to port 68?  Why would it be
> binding to a port at all?  DHCP clients use raw sockets when requesting network
> addresses; binding to a port in that state is meaningless.

Sorry, I meant open port 68, not bind; the ICMP unreachable error said this port is not open.

> Looking at the log, I'd say this is actually your problem:
> mapping br0 to lo
>
> For some reason mkdumprd has gotten confused and thinks that br0 should
> actually be lo (the loopback interface).  That's what needs fixing.

Where? I didn't see this on my machine. The log above says:

mapping br0 to br0
mapping eth0 to eth0
udhcpc (v1.15.1) started
Sending discover...
Sending discover...
Sending discover...
No lease, failing
br0 failed to come up

This problem can also be reproduced in the first kernel by running 'busybox udhcpc br0' manually.

Comment 15 Cong Wang 2010-07-29 03:06:31 UTC
Created attachment 435180
related tcpdump output

Comment 16 Neil Horman 2010-08-04 11:23:59 UTC
Ah, well, that's different.  It seems like the original reporter and you are seeing (at least in part) different issues, then.  The tcpdump shows that you're getting DHCP offer replies from the server, from which we can assume that we've successfully sent a DHCP discover message (although the tcpdump doesn't show that). If we're not sending a DHCP request message in response to those offers, my first assumption would be that something in the network stack is dropping those frames, perhaps iptables rules?

That would be odd, however, as iptables rules wouldn't be in effect in the second kernel.

Comment 19 Cong Wang 2010-08-12 07:13:55 UTC
Solved: we need to put eth0 into promiscuous mode in this case.
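
Editorial sketch: as a rough illustration of the fix described in this comment, bringing a bridge up by hand in a minimal environment might use a command sequence like the one below. This is not the actual kexec-tools patch. The helper name bridge_up_cmds is hypothetical, it only prints the commands instead of running them, and the availability of brctl, ifconfig and udhcpc (with its -i, -n and -q options) in the dump environment is an assumption.

```shell
#!/bin/sh
# Sketch only: bridge_up_cmds is a hypothetical helper, not code from the
# kexec-tools patch. It prints the commands (dry run) so they can be
# reviewed before anything touches the network configuration.
bridge_up_cmds() {
    bridge=$1
    port=$2
    # Create the bridge and enslave the physical port.
    echo "brctl addbr $bridge"
    echo "brctl addif $bridge $port"
    # The port carries no address of its own; promiscuous mode lets it
    # accept frames addressed to the bridge rather than to the port.
    echo "ifconfig $port 0.0.0.0 promisc up"
    echo "ifconfig $bridge up"
    # Request a lease on the bridge; -n exits if no lease is obtained,
    # -q quits once a lease has been obtained.
    echo "udhcpc -i $bridge -n -q"
}
```

Running 'bridge_up_cmds br0 eth0' prints the five commands for the configuration in this report; they would need root privileges to execute.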

Comment 20 Cong Wang 2010-08-12 07:31:47 UTC
Fixed in kexec-tools-2_0_0-138_el6.

Comment 21 Chao Ye 2010-08-20 09:54:30 UTC
Tested with -142.el6 on hp-ml370g4-01.rhts.eng.bos.redhat.com. The results are strange: sometimes the dump succeeds and sometimes it fails. I tried more than six times and it succeeded only twice.
===============================================================================
[root@hp-ml370g4-01 ~]# rpm -q kexec-tools
kexec-tools-2.0.0-142.el6.i686
[root@hp-ml370g4-01 ~]# tail /etc/kdump.conf 
#core_collector cp --sparse=always
#link_delay 60
#kdump_post /var/crash/scripts/kdump-post.sh
#extra_bins /usr/bin/lftp
#disk_timeout 30
#extra_modules gfs2
#options modulename options
#default shell
net nest.test.redhat.com:/mnt/qa
link_delay 60
[root@hp-ml370g4-01 ~]# touch /etc/kdump.conf 
[root@hp-ml370g4-01 ~]# service kdump restart
Stopping kdump:                                            [  OK  ]
Detected change(s) the following file(s):
  
  /etc/kdump.conf
Rebuilding /boot/initrd-2.6.32-66.el6.i686kdump.img
Netmask is missed!
Starting kdump:                                            [  OK  ]
[root@hp-ml370g4-01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-br0 
DEVICE=br0
TYPE=Bridge
BOOTPROTO=dhcp
ONBOOT=yes
[root@hp-ml370g4-01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
BRIDGE=br0
--------------------------------------------------------------------------------
Failed console output:

Free memory/Total memory (free %): 94912 / 118644 ( 79.9973 )
Scanning logical volumes
  Reading all physical volumes.  This may take a while...
  Found volume group "vg_hpml370g401" using metadata type lvm2
Activating logical volumes
  2 logical volume(s) in volume group "vg_hpml370g401" now active
Free memory/Total memory (free %): 94468 / 118644 ( 79.6231 )
mapping br0 to br0
mapping eth0 to eth0
br0 Link Up.  Waiting 60 Seconds
Continuing
device eth0 entered promiscuous mode
ADDRCONF(NETDEV_UP): eth0: link is not ready
udhcpc (v1.15.1) started
Sending discover...
Sending discover...
Sending discover...
No lease, failing
br0 failed to come up
md: stopping all md devices.
Restarting system.
machine restart

------------------------------------------------------------------------------
Successful dump output:

Free memory/Total memory (free %): 94468 / 118644 ( 79.6231 )
mapping br0 to br0
mapping eth0 to eth0
br0 Link Up.  Waiting 60 Seconds
Continuing
device eth0 entered promiscuous mode
ADDRCONF(NETDEV_UP): eth0: link is not ready
udhcpc (v1.15.1) started
Sending discover...
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
br0: port 1(eth0) entering learning state
Sending discover...
br0: port 1(eth0) entering forwarding state
Sending discover...
Sending select for 10.16.65.55...
Lease of 10.16.65.55 obtained, lease time 86400
deleting routers
adding dns 10.16.36.29
adding dns 10.16.255.2
adding dns 10.16.255.3
Saving to remote location nest.test.redhat.com:/mnt/qa
Free memory/Total memory (free %): 93840 / 118644 ( 79.0938 )
Copying data                       : [100 %]
Saving core complete
md: stopping all md devices.
Restarting system.
===============================================================================

I also tested on ibm-x3655-05.ovirt.rhts.eng.bos.redhat.com with RHEL6.0-20100811.2_nfs-Server-x86_64 and ibm-js22-03.rhts.eng.bos.redhat.com with RHEL6.0-20100805.0_nfs-Server-ppc64. Both work. The intermittent failure may be specific to the first machine, possibly hardware related.

Comment 22 Cong Wang 2010-08-23 11:20:51 UTC
(In reply to comment #21)
> mapping br0 to br0
> mapping eth0 to eth0
> br0 Link Up.  Waiting 60 Seconds
> Continuing
> device eth0 entered promiscuous mode
> ADDRCONF(NETDEV_UP): eth0: link is not ready

This seems to be a tg3 driver issue: the link is still not ready after 'ifconfig eth0 up' and a 60-second sleep.
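
Editorial sketch: in the two logs from comment 21, the successful run shows the tg3 "Link is up" message before udhcpc gives up, while the failed run never shows it. One way to make that condition explicit is to poll the carrier state in sysfs instead of sleeping for a fixed time. The sketch below assumes a hypothetical helper, wait_for_carrier, with an optional sysfs-root argument added only so it can be tried against a fake tree; it is not part of kexec-tools.

```shell
#!/bin/sh
# Sketch only: wait_for_carrier is a hypothetical helper name.
# Returns 0 as soon as /sys/class/net/<iface>/carrier reads 1,
# or 1 if that has not happened after <timeout> one-second polls.
wait_for_carrier() {
    iface=$1
    timeout=$2
    sysroot=${3:-/sys}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Reading carrier fails while the interface is administratively
        # down, so errors are discarded and treated as "no link yet".
        if [ "$(cat "$sysroot/class/net/$iface/carrier" 2>/dev/null)" = "1" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

For example, 'wait_for_carrier eth0 60 || echo "eth0: no link"' would report the condition seen in the failed run, where the link is still down when the wait ends.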

Comment 25 releng-rhel@redhat.com 2010-11-15 14:29:31 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.