Bug 517584
Summary: | kdump not working in RHEL6 alpha | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Dave Maley <dmaley> | |
Component: | kexec-tools | Assignee: | Neil Horman <nhorman> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Kernel QE team <kernel-qe> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 6.0 | CC: | ahecox, bernhard.furtmueller, phan, qcai, syeghiay, tao | |
Target Milestone: | alpha | Keywords: | Regression, Reopened, TestBlocker | |
Target Release: | 6.0 | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | kexec-tools-2.0.0-35.el6 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 524820 (view as bug list) | Environment: | ||
Last Closed: | 2010-07-02 19:21:02 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 391521, 524820 | |||
Attachments: |
Description
Dave Maley
2009-08-14 19:50:32 UTC
Sounds like something I fixed in RHEL5 and forgot to forward port to rawhide. Dave, do me a favor, can you, on the test system, replace the /sbin/mkdumprd utility with the mkdumprd utility from an up to date RHEL5 system (the latest 5.4 release). And see if the same problem occurs? I'd appreciate it. Thanks!

Looks like the same problem here,

Red Hat Enterprise Linux release 6.0 Beta (Six)
Kernel 2.6.29.4-1.el6.i686.PAE on an i686
hp-xw4550-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
tick-broadcast: ignoring broadcast for offline CPU #1
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading tg3.ko module
Waiting for required block device discovery
Waiting for sda...

Using mkdumprd from RHEL5.4 1.102pre-77.el5 kexec-tools results in the same problem.

# service kdump restart
Stopping kdump:[ OK ]
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.29.4-1.el6.i686.PAE change
Rebuilding /boot/initrd-2.6.29.4-1.el6.i686.PAEkdump.img
/sbin/mkdumprd: line 723: /etc/modprobe.conf: No such file or directory
cp: cannot stat `/sbin/lvm.static': No such file or directory
cp: cannot stat `/sbin/dmsetup.static': No such file or directory
cp: cannot stat `/sbin/kpartx.static': No such file or directory
Starting kdump:[ OK ]

hp-xw4550-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
tick-broadcast: ignoring broadcast for offline CPU #1
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading tg3.ko module
Waiting for required block device discovery
Waiting for sda...
Sometimes it looks like it found some USB devices:

Waiting for required block device discovery
Waiting for sda...usb 3-1: New USB device found, idVendor=0624, idProduct=0200
usb 3-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 3-1: Product: USB DSRIQ
usb 3-1: Manufacturer: Avocent
usb 3-1: configuration #1 chosen from 1 choice
input: Avocent USB DSRIQ as /devices/pci0000:00/0000:00:13.1/usb3/3-1/3-1:1.0/input/input3
generic-usb 0003:0624:0200.0001: input,hidraw0: USB HID v1.10 Keyboard [Avocent USB DSRIQ] on usb-0000:00:13.1-1/input0
input: Avocent USB DSRIQ as /devices/pci0000:00/0000:00:13.1/usb3/3-1/3-1:1.1/input/input4
generic-usb 0003:0624:0200.0002: input,hidraw1: USB HID v1.10 Mouse [Avocent USB DSRIQ] on usb-0000:00:13.1-1/input1

Ok, this is something new. And from above, it sounds like what you're saying is that it doesn't always hang in the same place, which is bad. What system are you reproducing this on? I'll need to reserve it in rhts and tinker with it for a bit.

I guess the question is for me. The machine is hp-xw4550-01.rhts.bos.redhat.com. It has another problem recorded in Bug 519094 - Kdump kernel BUG at arch/x86/kernel/irq_32.c:219!

hmm, seems we're not including any modules in the initramfs, that's odd, to say the least. investigating...

It would appear that some things have changed in sysfs, and the nash-find routine in nash no longer works properly, so drivers weren't getting found (specifically, some directories changed to symlinks, so nash wasn't following them anymore). I'm attaching a patch to mkdumprd that worked in my test system. It's not perfect, but it should get the job done. Cai, if you can try it out please and confirm, I'd appreciate it. Thanks!

Created attachment 358611 [details]
patch to make find follow symlinks so we can pull in modules correctly.
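The comment above attributes the missing modules to sysfs directories that became symlinks, which a plain directory walk does not descend into. A minimal sketch of that behaviour using `find` on a scratch tree follows. The function name `demo_symlink_walk` and the directory layout are hypothetical and are not taken from mkdumprd, nash, or the attached patch.

```shell
# Illustrative only: builds a scratch tree in which "link" is a symlink to a
# directory, mimicking sysfs entries that changed from directories to symlinks.
demo_symlink_walk() {
    demo_root=$(mktemp -d) || return 1
    mkdir -p "$demo_root/real/driver" "$demo_root/sys"
    touch "$demo_root/real/driver/module"
    ln -s "$demo_root/real/driver" "$demo_root/sys/link"

    # Default walk: the symlink itself is visited but not descended into.
    echo "plain: $(find "$demo_root/sys" -name module | wc -l | tr -d ' ')"
    # -L makes find follow symlinks, so the file behind the link is reached.
    echo "follow: $(find -L "$demo_root/sys" -name module | wc -l | tr -d ' ')"

    rm -rf "$demo_root"
}

demo_symlink_walk
```

The first count is taken without following links and the second with `-L`, so the difference between them is exactly the set of files that sit behind a symlinked directory.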
sorry, when I said Cai above, I meant you dave. if either of you guys actually could test/confirm, I'd appreciate it.

Neil - apologies for the delayed response here. I finally got a lab system setup and was able to seemingly reproduce the initial report myself. It's hanging waiting for the storage dev:

-----
Creating /dev
Creating initial device nodes
Waiting for required block device discovery
Waiting for sda...
-----

I then applied the changes from patch in comment 10 however the test system still hung in the same place/way. For completeness I tested using a RHEL5.4 mkdumprd (-77) which also failed. I'm going to continue to look into this myself, let me know if you have anything further you'd like me to test.

please attach the kdump.conf and resultant initramfs from the system you tested after applying the patch. it worked fine on my test system

Created attachment 358760 [details]
kdump config used for testing
I tested using a completely stock kdump.conf (ie. everything still commented out) as well as w/ the attached config. I'll attach the initrd momentarily.
Created attachment 358761 [details]
initrd from testing
Created attachment 358762 [details]
patched mkdumprd
Created attachment 358765 [details]
additional patch to force load all present modules
great, the dependencies for scsi modules are all messed up. Lets just keep this thing breathing long enough for us to move to the new config system. This patch should apply on top of the previous one and will force all modules to be included. Please apply it on top of the previous patch and let me know how it goes. Thanks!
Neil - With the additional patch applied I can't start kdump:

[root@sun-x4600-1 ~]# service kdump restart
Stopping kdump: [ OK ]
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.29.4-1.el6.i686.PAE change
Rebuilding /boot/initrd-2.6.29.4-1.el6.i686.PAEkdump.img
No module dm_multipath found for kernel 2.6.29.4-1.el6.i686.PAE, aborting.
Failed to run mkdumprd

So findmodule() is failing to locate dm-multipath.ko. Does dm_multipath need to be special-cased? Doesn't seem like something that should be needed .....

well, I can see why it's happening, but what I don't understand is how we can let that happen. The mkdumprd script is looking for dm_multipath.ko, which is what's listed in the output of lsmod, but the module name itself appears to be dm-multipath.ko (note the use of the dash rather than the underscore), so we don't find the module when we go looking for it. It appears to be that way with all of the dm modules, in RHEL5 and RHEL6 (we never required the dm modules previously so we didn't hit this). Normally, the name in /proc/modules & lsmod is taken from the module structure that's linked into the module at build time, so there's no strict requirement for the module name to match the file name of the module, but for the sake of sanity (for cases exactly like this), they really need to be the same. I would suggest that you open a new bug against the kernel, pointing out the discrepancy between the name in /proc/modules and the file name, and set this bug to block on it, as that's going to cause all sorts of problems.

Neil - there's a number of precedents wrt _ vs. - differences between lsmod and module names, such as all the alsa (snd_) modules, usb_storage, kvm_* just to name a few. While I agree that this doesn't seem like a good practice to follow I don't think we'll be able to make an argument to get the dm_* modules changed. Especially considering, based on comment 18, this is just to get things limping along until the new config system is in place.
If you'd still like me to open the new bug and set blocker I'll do so.

no, I see your point, although I'd be very interested to know what and why we're converting '-' characters to '_' characters. If that's the only name shift though, that's something we can likely handle. I'll update the patch to handle those cases.

Created attachment 359386 [details]
omnibus patch
Ok, heres an omnibus patch that rolls all the previous ones in and should fix the -/_ conversion. Let me know. Thanks!
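To make the -/_ mismatch discussed above concrete, here is a minimal sketch of a lookup that treats the two characters as equivalent. It is not the findmodule() from mkdumprd and not the code in the omnibus patch; `normalize_modname` and `find_module_file` are hypothetical helpers, and the search directory is whatever tree the caller passes in.

```shell
# Hypothetical helper: map every '-' in a module name to '_',
# e.g. dm-multipath -> dm_multipath.
normalize_modname() {
    printf '%s\n' "$1" | sed 's/-/_/g'
}

# Hypothetical helper: print the path of each .ko file under directory $2
# whose normalized base name equals the normalized name in $1.
find_module_file() {
    fm_want=$(normalize_modname "$1")
    find "$2" -name '*.ko' | while IFS= read -r fm_path; do
        fm_base=$(basename "$fm_path" .ko)
        if [ "$(normalize_modname "$fm_base")" = "$fm_want" ]; then
            printf '%s\n' "$fm_path"
        fi
    done
}
```

Called as `find_module_file dm_multipath /lib/modules/$(uname -r)`, this would report a file named dm-multipath.ko if one exists in that tree. Normalizing both sides avoids having to guess which spelling the file on disk uses.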
Neil - service kdump start still fails w/ the new patch. For any modules w/ a - or _ in their lsmod listing 1 of the 2 calls to findmodule will fail. I was able to get around this by adding --skiperrors to both findmodule calls. I have no idea if this is the right thing to do here ....

That being said it's still not working. It hangs in a similar place as originally reported, this time sda is found but it hangs trying to find sdb. However there's numerous errors from insmod as seen below:

---
Creating /dev
Creating initial device nodes
Loading ata_generic.ko module
Loading ata_piix.ko module
Loading nfs.ko module
insmod: cannot insert '/lib/nfs.ko': unknown symbol in module or invalid parameter
Loading lockd.ko module
insmod: cannot insert '/lib/lockd.ko': unknown symbol in module or invalid parameter
Loading nfs_acl.ko module
insmod: cannot insert '/lib/nfs_acl.ko': unknown symbol in module or invalid parameter
Loading auth_rpcgss.ko module
insmod: cannot insert '/lib/auth_rpcgss.ko': unknown symbol in module or invalid parameter
Loading dm-crypt.ko module
Loading dm-multipath.ko module
Loading sco.ko module
insmod: cannot insert '/lib/sco.ko': unknown symbol in module or invalid parameter
Loading bridge.ko module
insmod: cannot insert '/lib/bridge.ko': unknown symbol in module or invalid parameter
Loading stp.ko module
insmod: cannot insert '/lib/stp.ko': unknown symbol in module or invalid parameter
Loading llc.ko module
Loading bnep.ko module
insmod: cannot insert '/lib/bnep.ko': unknown symbol in module or invalid parameter
Loading l2cap.ko module
insmod: cannot insert '/lib/l2cap.ko': unknown symbol in module or invalid parameter
Loading bluetooth.ko module
Loading sunrpc.ko module
Loading ipv6.ko module
Loading uinput.ko module
Loading pata_acpi.ko module
Loading pcspkr.ko module
Loading serio_raw.ko module
Loading k8temp.ko module
insmod: cannot insert '/lib/k8temp.ko': unknown symbol in module or invalid parameter
Loading hwmon.ko module
Loading e1000.ko module
Loading usb-storage.ko module
Waiting 8 seconds for driver initialization.
Loading i2c-nforata1: ACPI get timing mode failed (AE 0x300b)
ce2.ko module
iata2: ACPI get timing mode failed (AE 0x300b)
nsmod: cannot insert '/lib/i2c-nforce2.ko': unknown symbol in module or invalid parameter
Loading joydev.ko module
Loading pata_amd.ko module
Loading i2c-core.ko module
Loading mptsas.ko module
insmod: cannot insert '/lib/mptsas.ko': unknown symbol in module or invalid parameter
Loading mptscsih.ko module
insmod: cannot insert '/lib/mptscsih.ko': unknown symbol in module or invalid parameter
Loading mptbase.ko module
Loading scsi_transport_sas.ko module
Waiting for required block device discovery
Waiting for sda...Found
Waiting for sdb...
---

Created attachment 359430 [details]
initrd from testing
here's the initrd from my latest test using the (modified) omnibus.patch
gaarhh! Dang it! Yeah, I forgot the skiperrors, I'll update the patch for that. As for the errors above, I'm confused. It appears that the module load list is being addressed backwards. i.e. every module is getting loaded before its dependencies, which is odd to say the least, since findmodule has been to date, working. I'll have to look into that further.

Think I figured it out. If you add modules individually in the reverse order that they are dependent, the dependency won't get corrected in the module list. I'll have a patch to fix that shortly.

Created attachment 359592 [details]
new mkdumprd patch
Ok, this is working for me on the sun system. I had to rewrite the depsolver for modules to handle several heretofore unseen bugs (grr). Anywho, it's still got a wart or two, but I was able to boot and run the kdump initramfs with this change. Please test and confirm. Thanks!
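The ordering problem described above (modules added in reverse dependency order end up loaded before what they need) can be illustrated with a small dependency-first walk. This is only a sketch under stated assumptions, not the depsolver from the attached patch: `deps_of` is a hypothetical stand-in for real module metadata, and the dependency table in it is illustrative rather than read from any kernel.

```shell
# Hypothetical dependency table: prints the modules that must be loaded
# before "$1". The entries are illustrative assumptions.
deps_of() {
    case "$1" in
        mptsas)   echo "mptscsih mptbase scsi_transport_sas" ;;
        mptscsih) echo "mptbase" ;;
        *)        echo "" ;;
    esac
}

SEEN=""        # modules already visited
LOAD_ORDER=""  # resulting load order, dependencies first

# Append "$1" to LOAD_ORDER only after all of its dependencies.
# Marking a module as seen on entry means it is never walked twice.
add_with_deps() {
    local mod dep
    mod=$1
    case " $SEEN " in
        *" $mod "*) return 0 ;;
    esac
    SEEN="$SEEN $mod"
    for dep in $(deps_of "$mod"); do
        add_with_deps "$dep"
    done
    LOAD_ORDER="${LOAD_ORDER:+$LOAD_ORDER }$mod"
}
```

Feeding the modules in the "wrong" order, for example `for m in mptsas mptscsih mptbase scsi_transport_sas; do add_with_deps "$m"; done`, still leaves `LOAD_ORDER` with each module after the ones it depends on, because the order comes from the walk rather than from the order of the requests.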
in my testing I was seeing the following errors:

FATAL: Module ide_disk not found.
FATAL: Module ext4 not found.
FATAL: Module dm_mod not found.
FATAL: Module dm_mirror not found.
FATAL: Module dm_zero not found.
FATAL: Module dm_snapshot not found.

chatted w/ nhorman. He looked into it quickly and determined they are caused by an awk misspecification. He will be posting an updated patch shortly from which I'll continue my testing.

Created attachment 360322 [details]
new patch
new patch fixing the echos and removing the FATAL warnings
My testing of this latest patch has been positive. I'm no longer seeing the errors reported in comment 31, and I was able to successfully capture a vmcore. I'll get a test pkg built for the partner to also test/verify.

I've provided a test pkg to the TAM and requested that the partner validate the fix.

ok. Please let me know the results.

This patch fails on my rawhide system.

# service kdump restart
Stopping kdump: [ OK ]
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.31-2.fc12.x86_64 change
Rebuilding /boot/initrd-2.6.31-2.fc12.x86_64kdump.img
No module ata_piix found for kernel 2.6.31-2.fc12.x86_64, aborting.
Failed to run mkdumprd
Starting kdump: [FAILED]

There is indeed no such module ata_piix. Even after temporarily removing this line from the patch against kexec-tools-2.0.0-25.fc12,

findmodule ata_piix

the kdump service has not finished starting after 7 minutes.

# service kdump stop
Stopping kdump: [ OK ]
[root@localhost sbin]# time service kdump start
Detected /etc/kdump.conf or /boot/vmlinuz-2.6.31-2.fc12.x86_64 change
Rebuilding /boot/initrd-2.6.31-2.fc12.x86_64kdump.img
^C
real 7m34.225s
user 1m4.337s
sys 6m25.909s

Looks like it was running into an infinite loop when resolving module dependencies.
# lsmod
Module Size Used by
fuse 70408 2
rfcomm 76072 4
sco 21624 2
bridge 62792 0
stp 3108 1 bridge
llc 7056 2 bridge,stp
bnep 20352 2
l2cap 41424 16 rfcomm,bnep
sunrpc 216968 1
ip6t_REJECT 6016 2
nf_conntrack_ipv6 23288 2
ip6table_filter 4256 1
ip6_tables 20528 1 ip6table_filter
ipv6 330216 20 ip6t_REJECT,nf_conntrack_ipv6
cpufreq_ondemand 8992 1
powernow_k8 18420 0
freq_table 5312 2 cpufreq_ondemand,powernow_k8
dm_multipath 19600 0
uinput 10520 0
arc4 2320 2
ecb 3632 2
b43 163176 0
mac80211 204628 1 b43
ppdev 10568 0
snd_atiixp_modem 15548 0
snd_atiixp 20180 2
snd_ac97_codec 128872 2 snd_atiixp_modem,snd_atiixp
ac97_bus 2224 1 snd_ac97_codec
btusb 20460 2
bluetooth 103028 9 rfcomm,sco,bnep,l2cap,btusb
parport_pc 29304 0
amd64_edac_mod 30864 0
snd_pcm 90568 3 snd_atiixp_modem,snd_atiixp,snd_ac97_codec
cfg80211 96272 2 b43,mac80211
snd_timer 24544 1 snd_pcm
firewire_ohci 25612 0
rfkill 23736 2 bluetooth,cfg80211
parport 37964 2 ppdev,parport_pc
tifm_7xx1 8832 0
joydev 13152 0
snd 74936 9 snd_atiixp_modem,snd_atiixp,snd_ac97_codec,snd_pcm,snd_timer
firewire_core 55192 1 firewire_ohci
edac_core 50096 1 amd64_edac_mod
sdhci_pci 9664 0
sdhci 23176 1 sdhci_pci
yenta_socket 30804 1
tifm_core 10120 1 tifm_7xx1
serio_raw 7228 0
k8temp 5784 0
hwmon 4000 1 k8temp
rsrc_nonstatic 11648 1 yenta_socket
crc_itu_t 2176 1 firewire_core
soundcore 7952 1 snd
wmi 7792 0
shpchp 38200 0
mmc_core 66080 1 sdhci
snd_page_alloc 10464 3 snd_atiixp_modem,snd_atiixp,snd_pcm
i2c_piix4 14976 0
pata_acpi 5632 0
ata_generic 6116 0
tg3 114744 0
ssb 49640 1 b43
pata_atiixp 6256 2
video 25324 0
output 3624 1 video
radeon 516480 0
ttm 47720 1 radeon
drm_kms_helper 23552 1 radeon
drm 190768 3 radeon,ttm,drm_kms_helper
i2c_algo_bit 6676 1 radeon
i2c_core 30760 4 i2c_piix4,radeon,drm,i2c_algo_bit

Cai, why are you testing on rawhide? You should be testing on RHEL6 alpha. That's what this is written for. rawhide is getting a completely new configuration system (as RHEL6 will be soon too I hope).
Dave, any test results? I'd like to get this committed asap, or schedule some time to keep working on it if need be.

Neil - I haven't heard anything back yet from Fujitsu on their testing. IBM has also reported this same issue (it 336128) and they have been provided the patch as well, however we have yet to hear back from them either.

Event posted on 2009-09-15 05:35 EDT by Glen Johnson

------- Comment From 2009-09-15 05:24 EDT-------
Hello RedHat,
Could you please provide the patch to mkdumprd based on kexec-tools-2.0.0-17.el6.x86_64 version. While applying the patch one of the hunks fails as ...

#patch -p0 < mkdumprd.patch
patching file mkdumprd
Hunk #8 FAILED at 373.
Hunk #9 succeeded at 952 (offset 12 lines).
1 out of 9 hunks FAILED -- saving rejects to file mkdumprd.rej

I edited mkdumprd by hand and made changes to it based on the mkdumprd.rej file. After restarting the kdump service, /sbin/mkdumprd enters into an endless loop in the depsolve_modlist() routine. Hence please provide a patch which could be applied on mkdumprd available in kexec-tools-2.0.0-17.el6.x86_64.
thanks

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech
Ticket type changed from 'Problem' to ''

This event sent from IssueTracker by balkov
issue 336128

never did that for me in my testing. Can we just provide you with the latest rpm from cvs? It would be much easier:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1987376
There's a brew build with the latest kexec tools and my patch on top of it. If the same infinite recursion occurs, please add a set -x to the top of makedump file and provide the output please.

Neil/All - I will attach a modified version of the patch that addresses the problem applying the patch mentioned in comment #42. It will apply cleanly to the rhel6-alpha1 kexec-tools pkg (2.0.0-17.el6).
Note that the latest brew builds will not work w/ the alpha1 pkg set, significant changes have been made to the rhel6 tree since alpha1 was cut (specifically updated glibc and rpmlib). Also as noted in comment #44 Fujitsu has confirmed that the test pkg I provided (which uses this modified patch) does work properly. So we now have validation of the fix by the initial reporter of this bug.

Created attachment 361292 [details]
modified patch
patch that will apply cleanly to the rhel6-alpha1 kexec-tools pkg
great, thanks. Do we have any word back on IBM's testing with the patch, since they claimed to see an infinite loop? Neil - They've just been provided the test pkg w/ the modified version of the patch. Will update here as soon as we hear from them .... Created attachment 361504 [details]
log from failed kdump
From IBM via their I-T: I downloaded the modified.patch from RH-517584 and it cleanly applied on the el6-alpha1 mkdumprd script. The script however again ends up in a loop in depsolve_modlist() routine for nfsd.ko module. I have attached the last 1000 lines of the log when kdump service was interrupted through 'ctrl-c'. Same is the case with mkdumprd extracted from kexec-tools-2.0.0-26.test1.el6.x86_64.rpm. Comment on attachment 361504 [details]
log from failed kdump
log from failed kdump resulting in endless depsolving loop
grr, it's a problem with module-init-tools. I had jcm fix this problem in RHEL5 but it looks like he never bothered to take care of it in fedora or RHEL6. It was initially bz 497923. I've cloned it as bz 523995 and bz 523997 for RHEL6 and rawhide respectively. In the interim I'll try to come up with a workaround.

Created attachment 361516 [details]
new patch to avoid recursive loop from broken modprobe command
here you go dave, this variant of the patch should avoid the loop that IBM found which results from modprobe not properly printing the output of --show-depends. If you/IBM could test it and confirm that it works, I would appreciate it. Thanks!
Created attachment 361539 [details]
new updated patch to work around infinite recursion from broken modprobe
sorry, I'd tried to use a less specific grep pattern there, and it got aliased with some directory names, this should fix it.
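The aliasing mentioned above, where a loose grep pattern also matches directory names, is easy to reproduce on paths of the shape `modprobe --show-depends` prints. The sketch below does not reproduce the pattern used in the patch, which is not quoted in this report; the sample lines, the `KVER` placeholder, and the function names are all illustrative assumptions.

```shell
# Illustrative sample in the "insmod <path>" shape; these lines are made up
# for the example and were not captured from the systems in this bug.
sample_depends() {
    cat <<'EOF'
insmod /lib/modules/KVER/kernel/net/sunrpc/sunrpc.ko
insmod /lib/modules/KVER/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko
insmod /lib/modules/KVER/kernel/fs/lockd/lockd.ko
EOF
}

# Loose match: counts every line containing the name anywhere, including
# lines where it is only a directory component.
count_loose() {
    sample_depends | grep -c "$1"
}

# Tighter match: counts only lines whose path ends in "/<name>.ko".
count_exact() {
    sample_depends | grep -c "/$1\.ko\$"
}
```

With this sample, `count_loose sunrpc` also counts the auth_rpcgss line because its path passes through a directory named sunrpc, while `count_exact sunrpc` counts only the sunrpc.ko line.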
Created attachment 361554 [details]
modified patch v2
Neil - This latest patch works properly on my test box. I'm attaching the modified version which applies cleanly to the RHEL6-alpha1 kexec-tools pkg (2.0.0-17.el6). I'll get this to IBM right now for their verification.
- updated patch has been provided to IBM for testing/validation
- updated test pkg has been provided to FJ for testing/validation

Event posted on 09-18-2009 12:32am EDT by Glen Johnson

------- Comment From risrajak.com 2009-09-18 00:26 EDT-------
Hurrayyyyyyyyyy it worked!!
Tested on 64 bit machine with the patch attached and it resolves the issue. Will update soon with 32 bit machine result.

[root@mx3755a ~]# cd /var/crash/
[root@mx3755a crash]# ls
2009-09-18-09:36
[root@mx3755a crash]# cd 2009-09-18-09\:36/
[root@mx3755a 2009-09-18-09:36]# ls
vmcore
[root@mx3755a 2009-09-18-09:36]# ls -lh
total 8.1G
-r-------- 1 root root 8.8G 2009-09-18 09:40 vmcore
[root@mx3755a 2009-09-18-09:36]# uname -a
Linux mx3755a.in.ibm.com 2.6.29.4-1.el6.x86_64 #1 SMP Fri Jun 5 10:28:37 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@mx3755a 2009-09-18-09:36]# date
Fri Sep 18 09:50:31 IST 2009
[root@mx3755a 2009-09-18-09:36]#

Thanks to all for working on this.
-Rishi

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by jkachuck
issue 336128

both IBM and Fujitsu have verified that w/ the latest patch mkdumprd executes properly and that a vmcore can successfully be captured.

grr, I hate doing stuff so fast. dgregor, when will the RHEL6 cvs tree be unlocked. The patch that we built with yesterday has a bug which I fixed in the most recent patch that IBM/FJ verified above. I need to revert that one and pull in the new one.

ok, committed to -27.el6

bah, and of course I completely forgot to point out the small typo I corrected in my latest modified version of the patch, and thus -27 mkdumprd will fail to run w/ the error:
/sbin/mkdumprd: line 200: [: missing `]'
The needed change is:
200c200
< if [ $? -eq 0]
---
> if [ $? -eq 0 ]
Sorry Neil for not pointing this out previously .....
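The one-character fix above matters because `[` is an ordinary command that requires `]` as its own final argument; written as `0]`, the bracket is glued to the operand and the shell reports the missing `]`. A minimal sketch of the corrected test follows, plus an equivalent form that avoids the bracket test. The wrapper names are hypothetical, and the command passed in stands in for whatever mkdumprd runs before line 200.

```shell
# Hypothetical wrapper: run the given command, then test its exit status
# with the correctly spaced bracket form.
check_status() {
    "$@"
    if [ $? -eq 0 ]; then
        echo "ok"
    else
        echo "failed"
    fi
}

# Equivalent form: let "if" test the command directly, so there is no
# bracket expression to mistype.
check_status_direct() {
    if "$@"; then
        echo "ok"
    else
        echo "failed"
    fi
}
```

For example, `check_status true` and `check_status_direct true` both print ok, and both print failed when given `false`.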
Event posted on 10-16-2009 04:04am EDT by Glen Johnson

File uploaded: rhel6-20091006-drop
This event sent from IssueTracker by jkachuck
issue 336128
it_file 265363

Event posted on 10-16-2009 04:04am EDT by Glen Johnson

<cde:attachment>
Comment on attachment: rhel6-20091006-drop
------- Comment on attachment From risrajak.com 2009-10-16 04:00 EDT-------
I got a chance to test again on this machine, looks like some regression has been introduced we are facing with RHEL6 10/06 drop. I am not able to generate vmcore on this machine again. Attaching serial console log for debugging purpose.
kexec-tools-2.0.0-32.el6.x86_64
</cde:attachment>

This event sent from IssueTracker by jkachuck
issue 336128

Please open a new bug for this kdump kernel panic. The original problem is to track kexec-tools hangs for critical disks detection. Neil, please correct me if I am wrong.

Cai, that is correct, although, I think we already have a bz open for the kdump hangs on x86 (bz 520581)

BZ 520581 is for i686, but the one mentioned here is for x86_64 according to the serial console output from the comment #63.

Cai/Neil - Yes this is different, I've requested a new IT/BZ be opened for tracking this new problem.

Red Hat Enterprise Linux Beta 2 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.