Bug 517584

Summary: kdump not working in RHEL6 alpha
Product: Red Hat Enterprise Linux 6
Reporter: Dave Maley <dmaley>
Component: kexec-tools
Assignee: Neil Horman <nhorman>
Status: CLOSED CURRENTRELEASE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 6.0
CC: ahecox, bernhard.furtmueller, caiqian, phan, syeghiay, tao
Target Milestone: alpha
Keywords: Regression, Reopened, TestBlocker
Target Release: 6.0
Hardware: All
OS: Linux
Fixed In Version: kexec-tools-2.0.0-35.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 524820 (view as bug list) Environment:
Last Closed: 2010-07-02 19:21:02 UTC
Bug Depends On:    
Bug Blocks: 391521, 524820    
Attachments:
Description Flags
patch to make find follow symlinks so we can pull in modules correctly.
none
kdump config used for testing
none
initrd from testing
none
patched mkdumprd
none
additional patch to force load all present modules
none
omnibus patch
none
initrd from testing
none
new mkdumprd patch
none
new patch
none
modified patch
none
log from failed kdump
none
new patch to avoid recursive loop from broken modprobe command
none
new updated patch to work around infinite recursion from broken modprobe
none
modified patch v2
none

Description Dave Maley 2009-08-14 19:50:32 UTC
Description of problem:
kdump does not work in RHEL6 alpha.  This has been reproduced by both the partner and the TAM.

(from I-T)
The kdump stopped at the start of 2nd kernel.  
Here is a log of that time as follows.  
---
(snip)
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
usbhid: v2.6: USB HID core driver
nf_conntrack version 0.5.0 (12277 buckets 49108 max)
CONFIG_NF_CT_ACCT is deprecated and will be removed soon Please
use nf_conntrack.acct=1 kernel parameter, acct=1 nf_conntrack module
option or sysctl net.netfilter.nf.conntrack.acct=1 to enable it
iptables: C 2000-2006 Netfilter Core Team
TCP cubic registered
Initializing XFRM netlink socket
Using IPI No shortcut mode
registered taskstats version 1
  Magic number: 1 933:590
Freeing used kernel Memory: 439k freed
Write protecting the kernel read-only data: 1536k
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Waiting for required block device discovery
Waiting for sde...input:AT Translated Set 2 keyboard as /device/platform/i8042/
serio1/input/input2
---
The 2nd kernel hangs up here.


Version-Release number of selected component (if applicable):
kernel-2.6.29.4-1.el6.i686.PAE
kexec-tools-2.0.0-17.el6-i586


How reproducible:
100%


Steps to Reproduce:
1. configure kdump
2. crash system
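
A minimal sketch of the two steps on a disposable test machine, assuming memory was reserved with a crashkernel= boot parameter and the kdump service is installed. The helper name kdump_armed is hypothetical; the two paths are parameters only so the check can be exercised against sample files.

```shell
# Hypothetical pre-flight check before crashing the box on purpose.
# $1: kernel command line file (default /proc/cmdline)
# $2: crash-kernel-loaded flag file (default /sys/kernel/kexec_crash_loaded)
kdump_armed() {
    ka_cmdline=${1:-/proc/cmdline}
    ka_loaded=${2:-/sys/kernel/kexec_crash_loaded}
    # Memory for the capture kernel must have been reserved at boot.
    grep -q -E '(^| )crashkernel=' "$ka_cmdline" || return 1
    # The kernel reports 1 here once a capture kernel has been loaded.
    [ "$(cat "$ka_loaded" 2>/dev/null)" = "1" ]
}

# Step 2, run as root on a disposable system only:
#   kdump_armed && echo c > /proc/sysrq-trigger
```

The sysrq 'c' trigger is what produces the "SysRq : Trigger a crashdump" line seen in the console logs in this report.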


Actual results:
kdump hangs attempting to boot into kdump kernel


Expected results:
kdump successfully captures vmcore


Additional info:

Hardware configuration:
Model: PRIMERGY H450
CPU Info: Intel Xeon CPU 1.60GHz
Memory Info: 8GB

Partner reports this has also been hit on ia64

Comment 2 Neil Horman 2009-08-14 23:21:53 UTC
Sounds like something I fixed in RHEL5 and forgot to forward port to rawhide.  Dave, could you do me a favor: on the test system, replace the /sbin/mkdumprd utility with the mkdumprd utility from an up-to-date RHEL5 system (the latest 5.4 release) and see if the same problem occurs?  I'd appreciate it.

Thanks!

Comment 3 Qian Cai 2009-08-25 04:31:40 UTC
Looks like the same problem here,

Red Hat Enterprise Linux release 6.0 Beta (Six)
Kernel 2.6.29.4-1.el6.i686.PAE on an i686

hp-xw4550-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
tick-broadcast: ignoring broadcast for offline CPU #1
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading tg3.ko module
Waiting for required block device discovery
Waiting for sda...

Comment 4 Qian Cai 2009-08-25 05:00:15 UTC
Using mkdumprd from RHEL5.4 1.102pre-77.el5 kexec-tools results in the same problem.

# service kdump restart
Stopping kdump:[  OK  ]
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.29.4-1.el6.i686.PAE change
Rebuilding /boot/initrd-2.6.29.4-1.el6.i686.PAEkdump.img
/sbin/mkdumprd: line 723: /etc/modprobe.conf: No such file or directory
cp: cannot stat `/sbin/lvm.static': No such file or directory
cp: cannot stat `/sbin/dmsetup.static': No such file or directory
cp: cannot stat `/sbin/kpartx.static': No such file or directory
Starting kdump:[  OK  ]

hp-xw4550-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
tick-broadcast: ignoring broadcast for offline CPU #1
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading tg3.ko module
Waiting for required block device discovery
Waiting for sda...

Comment 5 Qian Cai 2009-08-25 05:26:52 UTC
Sometimes it looks like it found some USB devices:

Waiting for required block device discovery
Waiting for sda...usb 3-1: New USB device found, idVendor=0624, idProduct=0200
usb 3-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 3-1: Product: USB DSRIQ
usb 3-1: Manufacturer: Avocent
usb 3-1: configuration #1 chosen from 1 choice
input: Avocent USB DSRIQ as /devices/pci0000:00/0000:00:13.1/usb3/3-1/3-1:1.0/input/input3
generic-usb 0003:0624:0200.0001: input,hidraw0: USB HID v1.10 Keyboard [Avocent USB DSRIQ] on usb-0000:00:13.1-1/input0
input: Avocent USB DSRIQ as /devices/pci0000:00/0000:00:13.1/usb3/3-1/3-1:1.1/input/input4
generic-usb 0003:0624:0200.0002: input,hidraw1: USB HID v1.10 Mouse [Avocent USB DSRIQ] on usb-0000:00:13.1-1/input1

Comment 6 Neil Horman 2009-08-25 11:05:39 UTC
OK, this is something new.  And from the above, it sounds like what you're saying is that it doesn't always hang in the same place, which is bad.  What system are you reproducing this on?  I'll need to reserve it in rhts and tinker with it for a bit.

Comment 7 Qian Cai 2009-08-25 11:33:05 UTC
I guess the question is for me. The machine is,

  hp-xw4550-01.rhts.bos.redhat.com

It has another problem recorded in,

  Bug 519094 - Kdump kernel BUG at arch/x86/kernel/irq_32.c:219!

Comment 8 Neil Horman 2009-08-25 13:36:54 UTC
Hmm, it seems we're not including any modules in the initramfs, which is odd, to say the least.  Investigating...

Comment 9 Neil Horman 2009-08-25 17:51:04 UTC
It would appear that some things have changed in sysfs, and the nash-find routine in nash no longer works properly, so drivers weren't getting found (specifically, some directories changed to symlinks, so nash wasn't following them anymore).  I'm attaching a patch to mkdumprd that worked on my test system.  It's not perfect, but it should get the job done.  Cai, if you can try it out please and confirm, I'd appreciate it.  Thanks!
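
To illustrate why symlinks matter here, a rough sketch (not the attached patch) of finding a device's drivers by resolving sysfs symlinks rather than assuming real directories. The function name modules_for_device is hypothetical, and the layout assumed is the usual one where a device directory has a driver link whose target has a module link.

```shell
# Hypothetical helper: list kernel modules backing a device by walking up its
# resolved sysfs path and following each driver/module symlink.
# $1: a sysfs device path such as /sys/block/sda/device
modules_for_device() {
    # Resolve every symlink first; a walk that does not follow links stops
    # at the first one and never reaches the driver entries.
    md_path=$(readlink -f "$1") || return 1
    while [ -n "$md_path" ] && [ "$md_path" != "/" ]; do
        if [ -e "$md_path/driver/module" ]; then
            basename "$(readlink -f "$md_path/driver/module")"
        fi
        md_path=$(dirname "$md_path")
    done
}
```

On a live system this would be called as modules_for_device /sys/block/sda/device, printing one module name per line from the device up towards the bus.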

Comment 10 Neil Horman 2009-08-25 17:51:53 UTC
Created attachment 358611 [details]
patch to make find follow symlinks so we can pull in modules correctly.

Comment 11 Neil Horman 2009-08-25 17:52:36 UTC
Sorry, when I said Cai above, I meant you, Dave.  If either of you could test/confirm, I'd appreciate it.

Comment 12 Dave Maley 2009-08-26 17:19:07 UTC
Neil - apologies for the delayed response here.  I finally got a lab system set up and was able to seemingly reproduce the initial report myself.  It's hanging waiting for the storage dev:

-----
Creating /dev
Creating initial device nodes
Waiting for required block device discovery
Waiting for sda...
-----

I then applied the changes from the patch in comment 10; however, the test system still hung in the same place/way.

For completeness I tested using a RHEL5.4 mkdumprd (-77) which also failed.

I'm going to continue to look into this myself, let me know if you have anything further you'd like me to test.

Comment 14 Neil Horman 2009-08-26 17:57:04 UTC
Please attach the kdump.conf and resultant initramfs from the system you tested after applying the patch.  It worked fine on my test system.

Comment 15 Dave Maley 2009-08-26 19:40:17 UTC
Created attachment 358760 [details]
kdump config used for testing

I tested using a completely stock kdump.conf (ie. everything still commented out) as well as w/ the attached config.  I'll attach the initrd momentarily.

Comment 16 Dave Maley 2009-08-26 19:41:42 UTC
Created attachment 358761 [details]
initrd from testing

Comment 17 Dave Maley 2009-08-26 19:46:48 UTC
Created attachment 358762 [details]
patched mkdumprd

Comment 18 Neil Horman 2009-08-26 20:14:29 UTC
Created attachment 358765 [details]
additional patch to force load all present modules

Great, the dependencies for scsi modules are all messed up.  Let's just keep this thing breathing long enough for us to move to the new config system.  This patch should apply on top of the previous one and will force all modules to be included.  Please apply it on top of the previous patch and let me know how it goes.  Thanks!

Comment 20 Dave Maley 2009-08-28 15:36:14 UTC
Neil - With the additional patch applied I can't start kdump:

[root@sun-x4600-1 ~]# service kdump restart
Stopping kdump:                                            [  OK  ]
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.29.4-1.el6.i686.PAE change
Rebuilding /boot/initrd-2.6.29.4-1.el6.i686.PAEkdump.img
No module dm_multipath found for kernel 2.6.29.4-1.el6.i686.PAE, aborting.
Failed to run mkdumprd

So findmodule() is failing to locate dm-multipath.ko.  Does dm_multipath need to be special-cased?  Doesn't seem like something that should be needed .....

Comment 21 Neil Horman 2009-08-28 18:17:23 UTC
Well, I can see why it's happening, but what I don't understand is how we can let that happen.  The mkdumprd script is looking for dm_multipath.ko, which is what's listed in the output of lsmod, but the module file itself appears to be named dm-multipath.ko (note the dash rather than the underscore), so we don't find the module when we go looking for it.  It appears to be that way with all of the dm modules, in both RHEL5 and RHEL6 (we never required the dm modules previously, so we didn't hit this).  Normally, the name in /proc/modules and lsmod is taken from the module structure that's linked into the module at build time, so there's no strict requirement for the module name to match the file name of the module, but for the sake of sanity (for cases exactly like this), they really need to be the same.  I would suggest that you open a new bug against the kernel, pointing out the discrepancy between the name in /proc/modules and the file name, and set this bug to block on it, as that's going to cause all sorts of problems.
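
One way a lookup could tolerate the dash/underscore mismatch is sketched below. This is an illustration only, not the code in mkdumprd: find_module_file is a hypothetical name, and it assumes module files end in .ko and live somewhere under the directory passed in.

```shell
# Hypothetical helper: locate the .ko file for a module name as printed by
# lsmod, treating '-' and '_' as interchangeable.
# $1: module name (e.g. dm_multipath)   $2: directory tree to search
find_module_file() {
    fm_name=$1
    fm_root=$2
    # Turn every '-' or '_' into the glob class [-_] so either spelling matches.
    fm_pattern=$(printf '%s' "$fm_name" | sed 's/[-_]/[-_]/g')
    # -L makes find follow symlinked directories as well.
    find -L "$fm_root" -name "$fm_pattern.ko" -print | head -n 1
}
```

For example, find_module_file dm_multipath "/lib/modules/$(uname -r)" would print the path of dm-multipath.ko if that is how the file is named on disk.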

Comment 23 Dave Maley 2009-08-31 23:08:18 UTC
Neil - there are a number of precedents wrt _ vs. - differences between lsmod and module names, such as all the alsa (snd_) modules, usb_storage, kvm_* just to name a few.  While I agree that this doesn't seem like a good practice to follow, I don't think we'll be able to make an argument to get the dm_* modules changed, especially considering that, based on comment 18, this is just to get things limping along until the new config system is in place.

If you'd still like me to open the new bug and set blocker I'll do so.

Comment 24 Neil Horman 2009-09-01 11:14:49 UTC
No, I see your point, although I'd be very interested to know what is converting '-' characters to '_' characters, and why.  If that's the only name shift, though, that's something we can likely handle.  I'll update the patch to handle those cases.

Comment 25 Neil Horman 2009-09-01 13:20:17 UTC
Created attachment 359386 [details]
omnibus patch

OK, here's an omnibus patch that rolls in all the previous ones and should fix the -/_ conversion.  Let me know.  Thanks!

Comment 26 Dave Maley 2009-09-01 16:45:28 UTC
Neil - service kdump start still fails w/ the new patch.  For any modules w/ a - or _ in their lsmod listing 1 of the 2 calls to findmodule will fail.  I was able to get around this by adding --skiperrors to both findmodule calls.  I have no idea if this is the right thing to do here ....

That being said, it's still not working.  It hangs in a similar place as originally reported; this time sda is found but it hangs trying to find sdb.  However, there are numerous errors from insmod, as seen below:

---
Creating /dev
Creating initial device nodes
Loading ata_generic.ko module
Loading ata_piix.ko module
Loading nfs.ko module
insmod: cannot insert '/lib/nfs.ko': unknown symbol in module or invalid parameter
Loading lockd.ko module
insmod: cannot insert '/lib/lockd.ko': unknown symbol in module or invalid parameter
Loading nfs_acl.ko module
insmod: cannot insert '/lib/nfs_acl.ko': unknown symbol in module or invalid parameter
Loading auth_rpcgss.ko module
insmod: cannot insert '/lib/auth_rpcgss.ko': unknown symbol in module or invalid parameter
Loading dm-crypt.ko module
Loading dm-multipath.ko module
Loading sco.ko module
insmod: cannot insert '/lib/sco.ko': unknown symbol in module or invalid parameter
Loading bridge.ko module
insmod: cannot insert '/lib/bridge.ko': unknown symbol in module or invalid parameter
Loading stp.ko module
insmod: cannot insert '/lib/stp.ko': unknown symbol in module or invalid parameter
Loading llc.ko module
Loading bnep.ko module
insmod: cannot insert '/lib/bnep.ko': unknown symbol in module or invalid parameter
Loading l2cap.ko module
insmod: cannot insert '/lib/l2cap.ko': unknown symbol in module or invalid parameter
Loading bluetooth.ko module
Loading sunrpc.ko module
Loading ipv6.ko module
Loading uinput.ko module
Loading pata_acpi.ko module
Loading pcspkr.ko module
Loading serio_raw.ko module
Loading k8temp.ko module
insmod: cannot insert '/lib/k8temp.ko': unknown symbol in module or invalid parameter
Loading hwmon.ko module
Loading e1000.ko module
Loading usb-storage.ko module
Waiting 8 seconds for driver initialization.
Loading i2c-nforce2.ko module
ata1: ACPI get timing mode failed (AE 0x300b)
ata2: ACPI get timing mode failed (AE 0x300b)
insmod: cannot insert '/lib/i2c-nforce2.ko': unknown symbol in module or invalid parameter
Loading joydev.ko module
Loading pata_amd.ko module
Loading i2c-core.ko module
Loading mptsas.ko module
insmod: cannot insert '/lib/mptsas.ko': unknown symbol in module or invalid parameter
Loading mptscsih.ko module
insmod: cannot insert '/lib/mptscsih.ko': unknown symbol in module or invalid parameter
Loading mptbase.ko module
Loading scsi_transport_sas.ko module
Waiting for required block device discovery
Waiting for sda...Found
Waiting for sdb...
---

Comment 27 Dave Maley 2009-09-01 16:56:39 UTC
Created attachment 359430 [details]
initrd from testing

here's the initrd from my latest test using the (modified) omnibus.patch

Comment 28 Neil Horman 2009-09-01 18:29:37 UTC
gaarhh!  Dang it!  Yeah, I forgot the skiperrors, I'll update the patch for that.

As for the errors above, I'm confused.  It appears that the module load list is being processed backwards, i.e. every module is getting loaded before its dependencies, which is odd to say the least, since findmodule has been working to date.  I'll have to look into that further.

Comment 29 Neil Horman 2009-09-01 20:01:44 UTC
Think I figured it out.  If you add modules individually in the reverse of their dependency order, the dependency won't get corrected in the module list.  I'll have a patch to fix that shortly.
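
As a sketch of the ordering problem (not the depsolver from the patch), the function below takes lsmod-style output and prints the modules so that each one comes after everything it depends on. The names load_order and lo_visit are hypothetical. It relies on the fourth lsmod column listing the users of a module, and it keeps a visited set so a circular listing cannot recurse forever.

```shell
# Hypothetical helper: depth-first visit that prints a module after its deps.
lo_visit() {
    case "$lo_seen" in
        *" $1 "*) return 0 ;;    # already handled, or a cycle: stop here
    esac
    lo_seen="$lo_seen$1 "
    # Every "user dependency" pair whose user is $1 names one dependency.
    for lo_dep in $(printf '%s\n' "$lo_edges" | awk -v m="$1" '$1 == m { print $2 }'); do
        lo_visit "$lo_dep"
    done
    printf '%s\n' "$1"
}

# Reads lsmod-style lines ("name size refcount user1,user2") on stdin.
load_order() {
    lo_edges=""    # newline-separated "user dependency" pairs
    lo_names=""    # every module, in input order
    lo_seen=" "    # space-delimited set of visited modules
    while read -r lo_name lo_size lo_refs lo_users; do
        [ -z "$lo_name" ] && continue
        [ "$lo_name" = "Module" ] && continue    # skip the lsmod header
        lo_names="$lo_names $lo_name"
        # Column 4 lists the modules that use (depend on) this one.
        for lo_user in $(printf '%s' "$lo_users" | tr ',' ' '); do
            lo_edges=$(printf '%s\n%s %s' "$lo_edges" "$lo_user" "$lo_name")
        done
    done
    for lo_mod in $lo_names; do
        lo_visit "$lo_mod"
    done
}
```

Fed the bridge, stp and llc lines from an lsmod listing, this prints llc, then stp, then bridge, which is the reverse of the order in which the failing initrd tried to insert them.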

Comment 30 Neil Horman 2009-09-02 20:21:31 UTC
Created attachment 359592 [details]
new mkdumprd patch

OK, this is working for me on the sun system.  I had to rewrite the depsolver for modules to handle several heretofore unseen bugs (grr).  Anyhow, it's still got a wart or two, but I was able to boot and run the kdump initramfs with this change.  Please test and confirm.  Thanks!

Comment 31 Dave Maley 2009-09-09 18:00:53 UTC
in my testing I was seeing the following errors:

 FATAL: Module ide_disk not found.
 FATAL: Module ext4 not found.
 FATAL: Module dm_mod not found.
 FATAL: Module dm_mirror not found.
 FATAL: Module dm_zero not found.
 FATAL: Module dm_snapshot not found.

Chatted w/ nhorman.  He looked into it quickly and determined they are caused by an awk misspecification.  He will be posting an updated patch shortly, with which I'll continue my testing.

Comment 32 Neil Horman 2009-09-09 18:55:08 UTC
Created attachment 360322 [details]
new patch

new patch fixing the echos and removing the FATAL warnings

Comment 33 Dave Maley 2009-09-10 15:16:01 UTC
My testing of this latest patch has been positive.  I'm no longer seeing the errors reported in comment 31, and I was able to successfully capture a vmcore.

I'll get a test pkg built for the partner to also test/verify.

Comment 34 Dave Maley 2009-09-10 16:30:17 UTC
I've provided a test pkg to the TAM and requested that the partner validate the fix.

Comment 35 Neil Horman 2009-09-10 17:32:13 UTC
ok.  Please let me know the results.

Comment 36 Qian Cai 2009-09-12 07:36:00 UTC
This patch fails on my rawhide system.

# service  kdump restart
Stopping kdump:                                            [  OK  ]
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.31-2.fc12.x86_64 change
Rebuilding /boot/initrd-2.6.31-2.fc12.x86_64kdump.img
No module ata_piix found for kernel 2.6.31-2.fc12.x86_64, aborting.
Failed to run mkdumprd
Starting kdump:                                            [FAILED]

There is no such module ata_piix indeed.

Comment 37 Qian Cai 2009-09-12 08:04:35 UTC
Even after temporarily removing this line from the patch against kexec-tools-2.0.0-25.fc12,

  findmodule ata_piix

the kdump service has not finished starting after 7 minutes.

# service kdump stop   
Stopping kdump:                                            [  OK  ]
[root@localhost sbin]# time service kdump start
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.31-2.fc12.x86_64 change
Rebuilding /boot/initrd-2.6.31-2.fc12.x86_64kdump.img
^C

real	7m34.225s
user	1m4.337s
sys	6m25.909s

Looks like it was running into an infinite loop when resolving module dependencies.

Comment 38 Qian Cai 2009-09-12 08:13:07 UTC
# lsmod
Module                  Size  Used by
fuse                   70408  2 
rfcomm                 76072  4 
sco                    21624  2 
bridge                 62792  0 
stp                     3108  1 bridge
llc                     7056  2 bridge,stp
bnep                   20352  2 
l2cap                  41424  16 rfcomm,bnep
sunrpc                216968  1 
ip6t_REJECT             6016  2 
nf_conntrack_ipv6      23288  2 
ip6table_filter         4256  1 
ip6_tables             20528  1 ip6table_filter
ipv6                  330216  20 ip6t_REJECT,nf_conntrack_ipv6
cpufreq_ondemand        8992  1 
powernow_k8            18420  0 
freq_table              5312  2 cpufreq_ondemand,powernow_k8
dm_multipath           19600  0 
uinput                 10520  0 
arc4                    2320  2 
ecb                     3632  2 
b43                   163176  0 
mac80211              204628  1 b43
ppdev                  10568  0 
snd_atiixp_modem       15548  0 
snd_atiixp             20180  2 
snd_ac97_codec        128872  2 snd_atiixp_modem,snd_atiixp
ac97_bus                2224  1 snd_ac97_codec
btusb                  20460  2 
bluetooth             103028  9 rfcomm,sco,bnep,l2cap,btusb
parport_pc             29304  0 
amd64_edac_mod         30864  0 
snd_pcm                90568  3 snd_atiixp_modem,snd_atiixp,snd_ac97_codec
cfg80211               96272  2 b43,mac80211
snd_timer              24544  1 snd_pcm
firewire_ohci          25612  0 
rfkill                 23736  2 bluetooth,cfg80211
parport                37964  2 ppdev,parport_pc
tifm_7xx1               8832  0 
joydev                 13152  0 
snd                    74936  9 snd_atiixp_modem,snd_atiixp,snd_ac97_codec,snd_pcm,snd_timer
firewire_core          55192  1 firewire_ohci
edac_core              50096  1 amd64_edac_mod
sdhci_pci               9664  0 
sdhci                  23176  1 sdhci_pci
yenta_socket           30804  1 
tifm_core              10120  1 tifm_7xx1
serio_raw               7228  0 
k8temp                  5784  0 
hwmon                   4000  1 k8temp
rsrc_nonstatic         11648  1 yenta_socket
crc_itu_t               2176  1 firewire_core
soundcore               7952  1 snd
wmi                     7792  0 
shpchp                 38200  0 
mmc_core               66080  1 sdhci
snd_page_alloc         10464  3 snd_atiixp_modem,snd_atiixp,snd_pcm
i2c_piix4              14976  0 
pata_acpi               5632  0 
ata_generic             6116  0 
tg3                   114744  0 
ssb                    49640  1 b43
pata_atiixp             6256  2 
video                  25324  0 
output                  3624  1 video
radeon                516480  0 
ttm                    47720  1 radeon
drm_kms_helper         23552  1 radeon
drm                   190768  3 radeon,ttm,drm_kms_helper
i2c_algo_bit            6676  1 radeon
i2c_core               30760  4 i2c_piix4,radeon,drm,i2c_algo_bit

Comment 39 Neil Horman 2009-09-12 12:25:08 UTC
Cai, why are you testing on rawhide?  You should be testing on RHEL6 alpha.  That's what this is written for.  rawhide is getting a completely new configuration system (as RHEL6 will be soon too, I hope).

Comment 40 Neil Horman 2009-09-14 20:16:09 UTC
Dave, any test results?  I'd like to get this committed asap, or schedule some time to keep working on it if need be.

Comment 41 Dave Maley 2009-09-14 20:43:40 UTC
Neil - I haven't heard anything back yet from Fujitsu on their testing.  IBM has also reported this same issue (it 336128) and they have been provided the patch as well, however we have yet to hear back from them either.

Comment 42 Issue Tracker 2009-09-15 13:26:26 UTC
Event posted on 2009-09-15 05:35 EDT by Glen Johnson

------- Comment From  2009-09-15 05:24 EDT-------
Hello RedHat,
Could you please provide the patch to mkdumprd based on
kexec-tools-2.0.0-17.el6.x86_64 version. While applying the patch one of
the hunks fails as ...

#patch -p0 < mkdumprd.patch
patching file mkdumprd
Hunk #8 FAILED at 373.
Hunk #9 succeeded at 952 (offset 12 lines).
1 out of 9 hunks FAILED -- saving rejects to file mkdumprd.rej

I edited mkdumprd by hand and made changes to it based on mkdumprd.rej
file. After restarting kdump service , /sbin/mkdumprd enters into an
endless loop in depsolve_modlist() routine. Hence please provide a patch
which could be applied on mkdumprd available in
kexec-tools-2.0.0-17.el6.x86_64. thanks

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech
Ticket type changed from 'Problem' to ''

This event sent from IssueTracker by balkov 
 issue 336128

Comment 43 Neil Horman 2009-09-15 17:32:51 UTC
That never happened for me in my testing.  Can we just provide you with the latest rpm from cvs?  It would be much easier:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1987376
There's a brew build with the latest kexec-tools and my patch on top of it.  If the same infinite recursion occurs, please add a set -x to the top of the mkdumprd file and provide the output.
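
For capturing that kind of trace from a script that may never finish, something along these lines could help. It is only a sketch: trace_tail is a hypothetical name, bash -x is used as the external equivalent of putting set -x at the top of the script, and it assumes the coreutils timeout command is available.

```shell
# Hypothetical helper: run a script under bash -x for a bounded time and keep
# only the end of the combined output, where a runaway loop shows up.
# $1: seconds before the run is stopped   $2: trace lines to keep
# remaining arguments: the script and its own arguments
trace_tail() {
    tt_secs=$1
    tt_lines=$2
    shift 2
    # timeout ends the run so that tail sees end-of-file and prints its buffer.
    timeout "$tt_secs" bash -x "$@" 2>&1 | tail -n "$tt_lines"
}
```

The time limit is there because interrupting the pipeline by hand with ctrl-c would also stop tail before it has printed anything.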

Comment 45 Dave Maley 2009-09-16 14:49:34 UTC
Neil/All - I will attach a modified version of the patch that addresses the problem applying the patch mentioned in comment #42.  It will apply cleanly to the rhel6-alpha1 kexec-tools pkg (2.0.0-17.el6).  Note that the latest brew builds will not work w/ the alpha1 pkg set, significant changes have been made to the rhel6 tree since alpha1 was cut (specifically updated glibc and rpmlib).

Also, as noted in comment #44, Fujitsu has confirmed that the test pkg I provided (which uses this modified patch) does work properly.  So we now have validation of the fix by the initial reporter of this bug.

Comment 46 Dave Maley 2009-09-16 14:50:45 UTC
Created attachment 361292 [details]
modified patch

patch that will apply cleanly to the rhel6-alpha1 kexec-tools pkg

Comment 47 Neil Horman 2009-09-16 17:00:36 UTC
great, thanks.  Do we have any word back on IBM's testing with the patch, since they claimed to see an infinite loop?

Comment 48 Dave Maley 2009-09-16 17:21:08 UTC
Neil - They've just been provided the test pkg w/ the modified version of the patch.  Will update here as soon as we hear from them ....

Comment 49 Dave Maley 2009-09-17 14:08:03 UTC
Created attachment 361504 [details]
log from failed kdump

Comment 50 Dave Maley 2009-09-17 14:09:40 UTC
From IBM via their I-T:	

I downloaded the modified.patch from RH-517584 and it cleanly applied on the el6-alpha1 mkdumprd script. The script however again ends up in a loop in depsolve_modlist() routine for nfsd.ko module. I have attached the last 1000 lines of the log when kdump service was interrupted through 'ctrl-c'.  Same is the case with mkdumprd extracted from kexec-tools-2.0.0-26.test1.el6.x86_64.rpm.

Comment 51 Dave Maley 2009-09-17 14:10:34 UTC
Comment on attachment 361504 [details]
log from failed kdump

log from failed kdump resulting in endless depsolving loop

Comment 52 Neil Horman 2009-09-17 14:42:35 UTC
Grr, it's a problem with module-init-tools.  I had jcm fix this problem in RHEL5, but it looks like he never bothered to take care of it in Fedora or RHEL6.  It was initially bz 497923.  I've cloned it as bz 523995 and bz 523997 for RHEL6 and rawhide respectively.  In the interim I'll try to come up with a workaround.

Comment 53 Neil Horman 2009-09-17 15:32:04 UTC
Created attachment 361516 [details]
new patch to avoid recursive loop from broken modprobe command

Here you go, Dave.  This variant of the patch should avoid the loop that IBM found, which results from modprobe not properly printing the output of --show-depends.  If you/IBM could test it and confirm that it works, I would appreciate it.  Thanks!

Comment 54 Neil Horman 2009-09-17 18:43:04 UTC
Created attachment 361539 [details]
new updated patch to work around infinite recursion from broken modprobe

sorry, I'd tried to use a less specific grep pattern there, and it got aliased with some directory names, this should fix it.
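
A sketch of parsing dependency output with an anchored match instead of a loose grep follows. It is not the workaround from the attached patch. The function name deps_from_show_depends is hypothetical, and the input is assumed to be lines of the form "insmod <path-to-module.ko> [options]", which is how modprobe --show-depends is expected to print them.

```shell
# Hypothetical helper: turn `modprobe --show-depends` output (on stdin) into
# a list of dependency names, leaving out the module being resolved.
# $1: the module whose dependencies are being listed
deps_from_show_depends() {
    awk '$1 == "insmod" { print $2 }' |      # keep only the module path field
        sed -e 's|.*/||' -e 's|\.ko$||' |    # strip directories and .ko suffix
        tr '-' '_' |                         # use the lsmod spelling
        grep -v -x -F -- "$(printf '%s' "$1" | tr '-' '_')"
}
```

Matching the whole name with -x matters because a module's own name often appears as a directory in the path of its file (fs/nfsd/nfsd.ko, for instance), so a substring match can hit the wrong lines. Dropping the module itself from its own dependency list also removes one way for a recursive resolver to call itself again.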

Comment 55 Dave Maley 2009-09-17 19:37:53 UTC
Created attachment 361554 [details]
modified patch v2

Neil - This latest patch works properly on my test box.  I'm attaching the modified version which applies cleanly to the RHEL6-alpha1 kexec-tools pkg (2.0.0-17.el6).  I'll get this to IBM right now for their verification.

Comment 56 Dave Maley 2009-09-17 19:54:32 UTC
- updated patch has been provided to IBM for testing/validation
- updated test pkg has been provided to FJ for testing/validation

Comment 57 Issue Tracker 2009-09-18 14:14:28 UTC
Event posted on 09-18-2009 12:32am EDT by Glen Johnson

------- Comment From risrajak@in.ibm.com 2009-09-18 00:26 EDT-------
Hurrayyyyyyyyyy it worked!! Tested on 64 bit machine with the patch
attached and it resolves the issue.

Will update soon with 32 bit machine result.

[root@mx3755a ~]# cd /var/crash/
[root@mx3755a crash]# ls
2009-09-18-09:36
[root@mx3755a crash]# cd 2009-09-18-09\:36/
[root@mx3755a 2009-09-18-09:36]# ls
vmcore
[root@mx3755a 2009-09-18-09:36]# ls -lh
total 8.1G
-r-------- 1 root root 8.8G 2009-09-18 09:40 vmcore
[root@mx3755a 2009-09-18-09:36]# uname -a
Linux mx3755a.in.ibm.com 2.6.29.4-1.el6.x86_64 #1 SMP Fri Jun 5 10:28:37
EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@mx3755a 2009-09-18-09:36]# date
Fri Sep 18 09:50:31 IST 2009
[root@mx3755a 2009-09-18-09:36]#

Thanks to all for working on this.

-Rishi

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by jkachuck 
 issue 336128

Comment 59 Dave Maley 2009-09-18 15:36:37 UTC
both IBM and Fujitsu have verified that w/ the latest patch mkdumprd executes properly and that a vmcore can successfully be captured.

Comment 60 Neil Horman 2009-09-18 18:18:01 UTC
Grr, I hate doing stuff so fast.  dgregor, when will the RHEL6 cvs tree be unlocked?  The patch that we built with yesterday has a bug, which I fixed in the most recent patch that IBM/FJ verified above.  I need to revert that one and pull in the new one.

Comment 61 Neil Horman 2009-09-18 18:20:43 UTC
ok, committed to -27.el6

Comment 62 Dave Maley 2009-09-18 18:51:24 UTC
bah, and of course I completely forgot to point out the small typo I corrected in my latest modified version of the patch, and thus -27 mkdumprd will fail to run w/ the error:

  /sbin/mkdumprd: line 200: [: missing `]'


The needed change is:

200c200
<                  if [ $? -eq 0]
---
>                  if [ $? -eq 0 ]

Sorry Neil for not pointing this out previously .....
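
For anyone puzzled by the error: the [ command requires ] as its own final argument, so "0]" is read as a single word and the closing bracket is reported missing. A small sketch of the corrected form, plus an equivalent that avoids $? altogether (both function names are hypothetical):

```shell
# Branch on the exit status of a command using the test bracket.
report_status() {
    "$@"
    if [ $? -eq 0 ]; then      # ']' must be a separate word, hence the space
        echo "ok"
    else
        echo "failed"
    fi
}

# Same decision, letting 'if' inspect the command's status directly.
report_status_direct() {
    if "$@"; then
        echo "ok"
    else
        echo "failed"
    fi
}
```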

Comment 63 Issue Tracker 2009-10-16 15:13:21 UTC
Event posted on 10-16-2009 04:04am EDT by Glen Johnson

File uploaded: rhel6-20091006-drop

This event sent from IssueTracker by jkachuck 
 issue 336128
it_file 265363

Comment 64 Issue Tracker 2009-10-16 15:13:28 UTC
Event posted on 10-16-2009 04:04am EDT by Glen Johnson

<cde:attachment>
Comment on attachment: rhel6-20091006-drop

------- Comment on attachment From risrajak@in.ibm.com 2009-10-16 04:00
EDT-------


I got a chance to test again on this machine. It looks like a regression
has been introduced with the RHEL6 10/06 drop.

I am not able to generate vmcore on this machine again. Attaching serial
console log for debugging purpose.

kexec-tools-2.0.0-32.el6.x86_64
</cde:attachment>


This event sent from IssueTracker by jkachuck 
 issue 336128

Comment 65 Qian Cai 2009-10-16 15:24:17 UTC
Please open a new bug for this kdump kernel panic. The original bug tracks kexec-tools hangs during critical disk detection. Neil, please correct me if I am wrong.

Comment 66 Neil Horman 2009-10-16 15:31:09 UTC
Cai, that is correct, although, I think we already have a bz open for the kdump hangs on x86 (bz 520581)

Comment 67 Qian Cai 2009-10-16 15:37:18 UTC
BZ 520581 is for i686, but the one mentioned here is for x86_64 according to the serial console output from comment #63.

Comment 68 Dave Maley 2009-10-16 15:44:52 UTC
Cai/Neil - Yes this is different, I've requested a new IT/BZ be opened for
tracking this new problem.

Comment 74 releng-rhel@redhat.com 2010-07-02 19:21:02 UTC
Red Hat Enterprise Linux Beta 2 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.