Bug 269281

Summary: kdump to remote dump server doesn't transfer anything
Product: Red Hat Enterprise Linux 5
Reporter: Maarten Broekman <maarten>
Component: kexec-tools
Assignee: Neil Horman <nhorman>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium    
Version: 5.0   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.18-8.1.8
Doc Type: Bug Fix
Last Closed: 2007-09-04 17:16:48 UTC
Attachments:
Description     Flags
initrd          none
tcpdump         none
kernel-panic    none

Description Maarten Broekman 2007-08-30 20:56:47 UTC
Description of problem:
When I force a crash on my system, kdump doesn't successfully transfer any 
data to my remote dump server.

Version-Release number of selected component (if applicable):
kexec-tools 1.101-192.el5
kernel-PAE 2.6.18-8.1.8.el5

How reproducible:
100%

Steps to Reproduce:
1. Configure host to dump to remote dump server via scp
2. service kdump restart
3. echo c>/proc/sysrq-trigger
  
Actual results:
...
mapping eth0 to eth0
eth0 Link Up. Waiting 60 Seconds
Continuing
route: resolving dev
Saving to remote location netdump.xxx.xxx
1+0 records in
1+0 records out
/system hangs here/

A check on the dump server shows no new files.  The log files on the dump 
server show no connection attempt from my test server.

Expected results:
1. At least a connection attempt in either /var/log/secure or /var/log/messages
2. A new file / directory in /var/crash on the dump server from the test 
server.

Additional info:
/etc/kdump.conf:
net netdump.xxx.xxx
link_delay 60
default reboot

/etc/sysconfig/kdump is the default kdump file.

scp from the test system to the dump server works fine and without passwords 
(authorized_keys):
root@atl-lxblade21 ~ # scp disk_map netdump.xxx.xxx:/var/crash
disk_map                                     100%  593     0.6KB/s   00:00

Comment 1 Neil Horman 2007-08-31 13:31:48 UTC
Can you please send in:
a binary tcpdump taken from the server during the client's dump process
the kdump initrd you are using

Also, while I look those over, can you add this command:
default shell
to the end of your /etc/kdump.conf file, restart the service and crash the system?

That should place you at a shell prompt in the initramfs after the attempt to
transfer the vmcore to your ssh server.  From there you can try to ssh over to
the remote system manually and record any errors that you get in the attempt.
Thanks!
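
For readers following along, the kdump.conf change requested above can be sketched as a small shell helper. This is only an illustration: the function name set_kdump_default is hypothetical and not part of kexec-tools, and it assumes GNU sed (for -i) and a kdump.conf with at most one "default" line.

```shell
#!/bin/sh
# Hypothetical helper (not part of kexec-tools): set the "default" action
# in a kdump.conf-style file, replacing an existing line or appending one.
set_kdump_default() {
    conf=$1      # path to the config file, e.g. /etc/kdump.conf
    action=$2    # e.g. "shell" or "reboot"
    if grep -q '^default[[:space:]]' "$conf"; then
        # Rewrite the existing "default ..." line in place (GNU sed -i).
        sed -i "s/^default[[:space:]].*/default ${action}/" "$conf"
    else
        # No default line yet: add one at the end of the file.
        printf 'default %s\n' "$action" >> "$conf"
    fi
}

# Example use on the test system, as described in this comment:
#   set_kdump_default /etc/kdump.conf shell && service kdump restart
```

With the configuration quoted in the description, this would turn "default reboot" into "default shell" and leave the net and link_delay lines untouched.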

Comment 2 Maarten Broekman 2007-08-31 14:40:56 UTC
Created attachment 183461 [details]
initrd

Comment 3 Maarten Broekman 2007-08-31 14:42:13 UTC
Created attachment 183481 [details]
tcpdump

tcpdump from before the crash through the reboot and including a service kdump
propagate after the fact (just to show some traffic).

Comment 4 Maarten Broekman 2007-08-31 14:48:01 UTC
I changed 'default reboot' to 'default shell' and I'm trying to determine why 
I can't type anything when in the crash shell (HP ilo2 console interface).

Comment 5 Maarten Broekman 2007-08-31 15:38:07 UTC
Seems like this is a problem with the ilo2 console interface (case opened with 
HP on this).  I set up a serial console port and I can get a shell prompt.  If 
I try to ssh or ping my dump server, I get nothing.

mapping eth0 to eth0
route: resolving dev
Saving to remote location netdump.40.81
1+0 records in
1+0 records out
lost connection
dropping to initramfs shell
exiting this shell will reboot your system
root:/> ping 10.105.40.81
PING 10.105.40.81 (10.105.40.81): 56 data bytes


Nothing happens for several minutes and then the system reboots.




Comment 6 Maarten Broekman 2007-08-31 15:44:27 UTC
I tried this on a different system (HP BL25p G1 vs BL465c G1).  The BL25p 
worked fine.  It still shows the same "1+0 records in / out" messages, but 
then it actually starts transferring data (although nothing on the console 
indicates that a transfer is in progress).  The BL465c has multiple NICs and 
it appears that the problem may be related to which NIC gets picked for the 
transfer.  How can I check / change that?

Comment 7 Maarten Broekman 2007-08-31 16:47:26 UTC
On the BL465c G1, I checked the network settings being used in the crash 
kernel and the settings used by the regular kernel.

Both kernels are using the same network interface but the regular kernel is 
able to talk over it while the crash kernel isn't.

Comment 8 Neil Horman 2007-08-31 17:08:40 UTC
Are you using the same NIC driver on both systems?  IIRC we had a tg3 issue
with some chip variants that caused problems resetting the NIC when the
module was re-inserted on a kdump boot.  You may want to try booting with the
RHEL5.1 beta kernel, as we incorporated a tg3 update to correct that problem.
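
One way to answer the driver question on each system is to read the driver symlinks that sysfs exposes per interface. The sketch below is a minimal illustration, assuming the usual /sys/class/net/<iface>/device/driver layout; the function name list_nic_drivers is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print "<interface> <driver>" for every network
# interface that is backed by a device with a bound driver.
list_nic_drivers() {
    root=${1:-/sys/class/net}   # overridable so the function can be tested
    for dev in "$root"/*; do
        # Skip virtual interfaces such as lo, which have no device/driver link.
        [ -e "$dev/device/driver" ] || continue
        printf '%s %s\n' \
            "$(basename "$dev")" \
            "$(basename "$(readlink "$dev/device/driver")")"
    done
}

# Example use on a live system:
#   list_nic_drivers
```

Running it in both the regular kernel and the kdump shell (if readlink is available there) would show whether the same driver is bound to the interface in each case.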

Comment 9 Maarten Broekman 2007-08-31 17:36:00 UTC
I tried the -36.el5PAE kernel from the RHEL5 Beta channel (Red Hat Enterprise 
Linux (v. 5 for 32-bit x86) Beta).  I downloaded the kernel, kernel-devel, 
kernel-headers, kernel-PAE, and kernel-PAE-devel RPMs.  I was unable to get 
the corresponding debuginfo packages as the links from RHN to the debuginfo 
site seemed to be incorrect.

I installed all 5 RPMs and rebooted.  The system booted fine.  I double-checked 
that kdump was operational (it was), checked kdump.conf, and crashed the 
system.  The system panicked at this point.  See new attachment.

Comment 10 Maarten Broekman 2007-08-31 17:37:51 UTC
Created attachment 183721 [details]
kernel-panic

kernel panic on crash after upgrading to 2.6.18-36.el5PAE

Comment 11 Neil Horman 2007-08-31 19:14:21 UTC
Sorry, you need to add reset_devices to KEXEC_COMMANDLINE_APPEND in
/etc/sysconfig/kdump
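
As a rough sketch of that edit, the helper below appends an argument to a quoted VAR="..." line in a sysconfig-style file, and does nothing if the argument is already there. The function name add_kernel_arg is hypothetical, GNU sed is assumed, and the variable name is passed in as a parameter on purpose: this comment names KEXEC_COMMANDLINE_APPEND, while the /etc/sysconfig/kdump files I have seen use KDUMP_COMMANDLINE_APPEND, so check which one your file actually contains.

```shell
#!/bin/sh
# Hypothetical helper: make sure ARG appears in the quoted value of VAR
# inside FILE (a sysconfig-style file with lines like VAR="a b c").
add_kernel_arg() {
    file=$1 var=$2 arg=$3
    # Already present as a whole word on the VAR line: nothing to do.
    if grep -Eq "^${var}=\"([^\"]* )?${arg}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    if grep -q "^${var}=\"" "$file"; then
        # Append ARG just before the closing quote (GNU sed -i).
        sed -i "s|^\(${var}=\"[^\"]*\)\"|\1 ${arg}\"|" "$file"
    else
        # VAR is not set at all: add a new line for it.
        printf '%s="%s"\n' "$var" "$arg" >> "$file"
    fi
}

# Example use (variable name is an assumption, verify it in your file):
#   add_kernel_arg /etc/sysconfig/kdump KDUMP_COMMANDLINE_APPEND reset_devices
#   service kdump restart
```

The kdump service needs a restart afterwards so the crash kernel is reloaded with the new command line.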

Comment 12 Maarten Broekman 2007-08-31 19:43:05 UTC
It dumped over the network, but the system rebooted after 10 minutes, so it 
didn't copy the entire dump (I still have a vmcore-incomplete).
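
To spot interrupted transfers like this one on the dump server, something along these lines could be used. It is only a sketch: the function name find_incomplete_dumps is hypothetical, and it assumes nothing about the directory layout under /var/crash beyond the vmcore-incomplete file name mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list every vmcore-incomplete file below a crash
# directory, which indicates a dump that never finished copying.
find_incomplete_dumps() {
    dir=${1:-/var/crash}   # crash directory on the dump server
    find "$dir" -type f -name 'vmcore-incomplete' | sort
}

# Example use on the dump server:
#   find_incomplete_dumps /var/crash
```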

Comment 13 Neil Horman 2007-08-31 20:07:08 UTC
That could be any number of things.  Did you get an error message on the serial
console prior to the reboot?  If so, what was it?  If you didn't get any error,
and the system just seemed to reboot spontaneously, that could be an iLO issue.
Do you normally use any health monitoring modules from HP?  Or do you have any
system activity monitor configured in iLO?  It could be treating the system as
hung during the kdump period and NMI-ing the box inappropriately.  If you can
disable iLO completely and use a plain serial console to test with, you should
be able to confirm this.

Comment 14 Maarten Broekman 2007-09-04 15:57:31 UTC
As you suspected, ASR was enabled and that was the cause of the reboot.  With 
ASR disabled, everything is working as expected now.  Looks like this is 
resolved with 5.1.