Bug 597268 - lvm devices are not initialized in kdump kernel
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kexec-tools
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Neil Horman
QA Contact: Boris Ranto
Keywords: Reopened
Depends On:
Blocks:
Reported: 2010-05-28 10:30 EDT by Boris Ranto
Modified: 2010-11-10 16:00 EST (History)
CC List: 4 users

See Also:
Fixed In Version: kexec-tools-2.0.0-121
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-11-10 16:00:10 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Log from dell-pe2850-04.rhts.eng.bos.redhat.com (45.97 KB, application/octet-stream)
2010-05-28 10:30 EDT, Boris Ranto
sosreport for the system (409.07 KB, application/x-xz)
2010-06-03 08:48 EDT, Boris Ranto
Console output when blacklisting sd_mod (22.53 KB, application/octet-stream)
2010-06-23 10:11 EDT, Boris Ranto
Blacklisted ata_piix and ata_generic (30.02 KB, application/octet-stream)
2010-06-29 04:02 EDT, Boris Ranto
test package (238.04 KB, application/x-rpm)
2010-06-29 10:07 EDT, Neil Horman
Log with new kexec tools (31.26 KB, application/octet-stream)
2010-06-30 06:32 EDT, Boris Ranto
patch to detect UUIDs for compatible devices (2.49 KB, patch)
2010-07-02 13:51 EDT, Neil Horman
Log with patch (17.86 KB, application/octet-stream)
2010-07-06 08:41 EDT, Boris Ranto
new version of patch (2.43 KB, application/octet-stream)
2010-07-06 11:00 EDT, Neil Horman
Patch that partially work (2.14 KB, patch)
2010-07-08 06:06 EDT, Boris Ranto
new approach to fix this (2.13 KB, patch)
2010-07-08 13:33 EDT, Neil Horman
Patch that handles same devices (3.03 KB, patch)
2010-07-09 08:53 EDT, Boris Ranto

Description Boris Ranto 2010-05-28 10:30:14 EDT
Created attachment 417634 [details]
Log from dell-pe2850-04.rhts.eng.bos.redhat.com

Description of problem:
The kdump kernel fails to mount the dump target in /dev/mapper/ because the device node is never created on machine dell-pe2850-04.rhts.eng.bos.redhat.com running an i386 kernel.

Version-Release number of selected component (if applicable):
kernel: 2.6.32-30
kexec-tools: 2.0.0-72.el6
lvm2: 2.02.66-2.el6

How reproducible:
100 %

Steps to Reproduce:
1. Set up kdump to dump to a local LVM device (e.g. /dev/mapper/vg_dellpe285004-lv_root)
2. Crash the kernel, e.g. echo c >/proc/sysrq-trigger
3. Watch the console output
  
Actual results:
The LVM device is not initialized (no LVM device appears in /dev/mapper/) and the dump can't be taken.

Expected results:
LVM device is initialized.

Additional info:
This appears to be machine/device-specific (any dell-pe2850-0x machine should be OK for reproducing it). Other machines usually work fine.
I'm attaching the console output.
Comment 2 Neil Horman 2010-06-02 06:43:55 EDT
could you please provide a sosreport of this system when it's running normally?  Thanks!
Comment 3 Boris Ranto 2010-06-03 08:48:41 EDT
Created attachment 419356 [details]
sosreport for the system

Ok, this is from latest system build. Hope it helps.
Comment 6 Neil Horman 2010-06-22 07:07:54 EDT
grumble, this looks like a combination of bad latency in the megaraid scan and another device getting scanned early that satisfies the critical disk requirement.  We can see in the console log where we start the scsi bus scan right after the module is loaded, but the drives on that device don't get detected until after we drop to a shell.  Since a scsi cd-rom drive from another bus takes the sda name, we pass the critical disks check.  Is this problem corrected if you add ata_piix to the kdump.conf module blacklist?
Comment 9 Boris Ranto 2010-06-23 10:11:59 EDT
Created attachment 426281 [details]
Console output when blacklisting sd_mod

I finally managed to find out what module the drive uses (it was sd_mod), but the result is not very promising. When I blacklist sd_mod, the system gets stuck for a very long time (I waited for about half an hour with no progress). From the log it looks like it waits for device sda.
Comment 10 Neil Horman 2010-06-24 10:38:49 EDT
you don't want to blacklist sd_mod (that module enables all your scsi exported drives); you want to blacklist, as I noted above, the ata_piix module.
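
As a rough sketch of the blacklist step discussed here, a small helper could append the directive to the configuration file. The function name and the duplicate check are illustrative assumptions and not part of kexec-tools; only the `blacklist` directive and the module name come from this thread.

```shell
#!/bin/bash
# Hypothetical helper: add a module to the kdump.conf blacklist unless an
# identical "blacklist <module>" line is already present.
blacklist_kdump_module() {
    local conf="$1"     # path to the config file, e.g. /etc/kdump.conf
    local module="$2"   # kernel module to keep out of the kdump initrd
    # -x matches whole lines only, so "blacklist ata_piix_foo" does not count.
    if grep -qx "blacklist ${module}" "$conf" 2>/dev/null; then
        return 0
    fi
    printf 'blacklist %s\n' "$module" >> "$conf"
}
```

It would be called as `blacklist_kdump_module /etc/kdump.conf ata_piix`; the kdump initrd then has to be rebuilt before the change takes effect, for example by restarting the kdump service.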
Comment 11 Boris Ranto 2010-06-29 04:02:59 EDT
Created attachment 427602 [details]
Blacklisted ata_piix and ata_generic

When I blacklisted just ata_piix, nothing changed. I tried blacklisting ata_generic too, but again, no change.
Comment 12 Neil Horman 2010-06-29 10:06:24 EDT
ok, thanks.  It just occurred to me that this might be a different case of a known problem that we've recently fixed in RHEL6.  Can you try the attached package and see if it clears the issue for you please?
Comment 13 Neil Horman 2010-06-29 10:07:21 EDT
Created attachment 427693 [details]
test package
Comment 14 Boris Ranto 2010-06-30 06:32:46 EDT
Created attachment 427937 [details]
Log with new kexec tools

I've installed the test package and checked with and without blacklisted ata_piix module but no significant improvement was found. The result is still the same.
Comment 15 Neil Horman 2010-07-01 13:48:44 EDT
grr, ok, this is something new then.  Lemme see if we can do scsi device mapping here by hand.  Until then you can manually update mkdumprd to pause for a minute or so.  That will give sdb an opportunity to get detected so that lvm will assemble all your devices.  Just add this:
emit "sleep 120"
after the line in /sbin/mkdumprd that contains the string:
Making device-mapper control node
Comment 16 Boris Ranto 2010-07-02 07:59:14 EDT
I added the sleep but it didn't help. I guess the problem might be that in the normal kernel, the device is detected as sda, not sdb. sdb is 'Attached SCSI removable disk':

sd 2:2:0:0: [sda] 71024640 512-byte logical blocks: (36.3 GB/33.8 GiB)
sd 2:2:0:0: [sda] Write Protect is off
sd 2:2:0:0: [sda] Mode Sense: 00 00 00 00
sd 2:2:0:0: [sda] Asking for cache data failed
sd 2:2:0:0: [sda] Assuming drive cache: write through
sd 2:2:0:0: [sda] Asking for cache data failed
sd 2:2:0:0: [sda] Assuming drive cache: write through
 sda: sda1 sda2
sd 2:2:0:0: [sda] Asking for cache data failed
sd 2:2:0:0: [sda] Assuming drive cache: write through
sd 3:0:0:0: [sdb] Attached SCSI removable disk
sd 2:2:0:0: [sda] Attached SCSI disk

According to the next line, the logical volumes are on sda2:
dracut: Scanning devices sda2  for LVM logical volumes vg_dellpe285004/lv_root vg_dellpe285004/lv_swap

Another thing I don't like is that even though sdb is finally initialized, the /dev/sdb* devices don't exist:

/ # ls /dev/sd*
/dev/sda    /dev/sda11  /dev/sda14  /dev/sda2   /dev/sda5   /dev/sda8
/dev/sda1   /dev/sda12  /dev/sda15  /dev/sda3   /dev/sda6   /dev/sda9
/dev/sda10  /dev/sda13  /dev/sda16  /dev/sda4   /dev/sda7
/ #
Comment 17 Neil Horman 2010-07-02 11:23:45 EDT
That's part of the problem, but lvm should handle that, as long as sdb gets detected eventually prior to the creation of block devices...

Which is the problem.  Sorry, I told you the wrong place to insert the sleep.  Instead of being right before the "Making device mapper control node" line it should be right after the "Creating Block Devices" line.  That will allow the driver to detect sdb and register it in sysfs, which in turn will allow the init script to build the device node in /dev.
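
The workaround above can be sketched as a small shell helper. The marker string and the `emit "sleep 120"` line are taken from this thread; the function name, the backup suffix and the use of GNU sed's one-line `a` command are assumptions made for illustration, and the helper is meant to be tried on a copy of the script first.

```shell
#!/bin/bash
# Hypothetical helper: insert an emit "sleep N" line into a mkdumprd-style
# script, directly after each line that mentions "Creating Block Devices".
add_block_device_delay() {
    local script="$1"          # script to patch, e.g. a copy of /sbin/mkdumprd
    local seconds="${2:-120}"  # delay in seconds; 120 is the value from the thread
    # Do nothing if the same delay line is already present.
    if grep -q "emit \"sleep ${seconds}\"" "$script"; then
        return 0
    fi
    # Keep a .bak copy and append the new line after the marker (GNU sed syntax).
    sed -i.bak "/Creating Block Devices/a emit \"sleep ${seconds}\"" "$script"
}
```

For example, `add_block_device_delay ./mkdumprd.copy` would leave the original in `./mkdumprd.copy.bak`. A fixed delay only hides the detection race, which is why the later comments move towards waiting for specific disks instead.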
Comment 18 Boris Ranto 2010-07-02 12:48:54 EDT
Ok, with the sleep on the other place, kdump works well.
Comment 19 Neil Horman 2010-07-02 13:01:08 EDT
ok, good, that can be your workaround for now.  I'll work on putting together a smarter disk mapping.
Comment 20 Neil Horman 2010-07-02 13:51:10 EDT
Created attachment 429128 [details]
patch to detect UUIDs for compatible devices

ok, it's not perfect, and I've not tested it yet, but this should allow mkdumprd to search for devices based on uuid for those devices which support uuid assignment (most/all disk drives).  If you could give this a spin and see if it fixes your problem, that would be a big help to me.  Thanks!
Comment 21 Boris Ranto 2010-07-06 08:41:46 EDT
Created attachment 429764 [details]
Log with patch

I've patched /sbin/mkdumprd but it didn't quite help.
I guess the reason is this line from the log:
Usage: msh LABEL=<label>|UUID=<uuid>
Comment 22 Neil Horman 2010-07-06 11:00:57 EDT
Created attachment 429803 [details]
new version of patch

sorry, I missed escaping a few $ symbols.
Comment 23 Boris Ranto 2010-07-08 06:06:39 EDT
Created attachment 430289 [details]
Patch that partially work

I had to update the patch in order to get the kdump kernel to start, but it still can't kdump.
I had to change /sbin/findfs to /sbin/findfs_sys (and copy it there) because otherwise the two versions of findfs got mixed up (actually only /sbin/findfs was used).
Either way, findfs_sys couldn't find the device (I tried blkid and it couldn't find it either). The device gets created in /sys/block/sdb but is not initialized in /dev/.
With this patch at least UUID generation for the device works (I had to add /dev/ before $device because the input of findstoragedriver() was only sda2).
I had to change the blkid -sUUID line to blkid -o export -sUUID because otherwise there were " characters around the UUID that caused problems in the running system.

findfs only does this:
findfs_sys: unable to resolve 'UUID=iV6wzv-mAaL-OW18-Z35b-t7tf-NIE1-AxbjPb'

So I guess the only problem now is with findfs not being able to recognize devices.
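
The quoting issue mentioned above can be illustrated with a small filter that accepts either blkid output style and prints the bare UUID. The function name is hypothetical and this is not the code from the attached patch; it only shows why the export format, which prints `UUID=value` without quotes, is easier to consume in a script than the default `UUID="value"` form.

```shell
#!/bin/bash
# Hypothetical helper: read blkid-style output on stdin and print the first
# UUID value without surrounding quotes. Returns 1 if no UUID= field is seen.
uuid_from_blkid_output() {
    local line value
    while IFS= read -r line; do
        case "$line" in
            *UUID=*)
                value="${line#*UUID=}"   # drop everything up to and including UUID=
                value="${value#\"}"      # drop a leading double quote, if any
                value="${value%%\"*}"    # drop a closing quote and whatever follows
                value="${value%% *}"     # drop trailing fields in unquoted output
                printf '%s\n' "$value"
                return 0
                ;;
        esac
    done
    return 1
}
```

A typical call would be `blkid -o export -s UUID /dev/sda2 | uuid_from_blkid_output`, with the device path taken from the dracut line quoted earlier in this report.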
Comment 24 Neil Horman 2010-07-08 13:33:10 EDT
Created attachment 430431 [details]
new approach to fix this

doh! I just realized something.  While it's going to be workable in your case, there are several cases in which detecting uuid is going to fail, as uuids apply to filesystems, not devices.  What we need is an immutable, unique value to identify devices regardless of the named order in which they are detected.  I'm afraid that's very difficult to put together, but this patch should get us close(er) to that goal.  I've not tested it yet, but if you'd like to give it a try, you're welcome to.
Comment 25 Boris Ranto 2010-07-09 08:53:08 EDT
Created attachment 430667 [details]
Patch that handles same devices

After updating the patch (escaping, a local used outside of a function, and similar) it started to work on my machine, but I think the DSKSTRING is not as unique as it should be. The way I see it, if the computer had two identical disks it would wait for only one of them. So I propose this patch that I've created: it looks in /sys/block for devices that have the same DSKSTRING and writes their count as the 3rd value in /etc/critical_disks, and when it waits for devices it waits for the necessary number of them.

I've tested the patch on the machine and it works fine (kdump works, vmcore is created).
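
The counting idea can be sketched roughly as follows. This is not the attached patch: the contents of DSKSTRING are not shown in this report, so the sketch assumes an identity string built from the sysfs `device/vendor` and `device/model` attributes, and all function names are hypothetical. The sysfs root is passed as a parameter so the logic can be exercised against a test directory.

```shell
#!/bin/bash
# Read one sysfs attribute, trimming the padding the kernel adds to it.
read_trimmed() {
    local value=""
    if [ -r "$1" ]; then
        read -r value < "$1" || true
    fi
    printf '%s' "$value"
}

# Assumed identity string for a disk: "<vendor> <model>".
disk_string() {
    local dev_dir="$1"
    printf '%s %s\n' "$(read_trimmed "$dev_dir/device/vendor")" \
                     "$(read_trimmed "$dev_dir/device/model")"
}

# Count block devices under a sysfs-style directory with the given identity.
count_matching_disks() {
    local sys_block="$1" wanted="$2" count=0 d
    for d in "$sys_block"/*; do
        [ -d "$d" ] || continue
        if [ "$(disk_string "$d")" = "$wanted" ]; then
            count=$((count + 1))
        fi
    done
    printf '%s\n' "$count"
}

# Poll once per second until enough matching disks exist or the timeout expires.
wait_for_disks() {
    local sys_block="$1" wanted="$2" needed="$3" timeout="${4:-120}" waited=0
    while [ "$(count_matching_disks "$sys_block" "$wanted")" -lt "$needed" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With two identical disks, the required count would be 2, so the wait no longer succeeds as soon as the first of them shows up.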
Comment 26 Neil Horman 2010-07-09 12:01:37 EDT
yeah, I like that modification, thank you.  looks like this is slated for 6.1, so as soon as it's approved I'll commit it, thanks!
Comment 30 releng-rhel@redhat.com 2010-11-10 16:00:10 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.