Bug 671013

Summary: Kdump fails after updating drive firmware.
Product: Red Hat Enterprise Linux 6
Reporter: Stephen Cameron <steve.cameron>
Component: kexec-tools
Assignee: Cong Wang <amwang>
Status: CLOSED ERRATA
QA Contact: Boris Ranto <branto>
Severity: medium
Priority: low
Version: 6.0
CC: branto, coughlan, nhorman, phan, rkhan
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kexec-tools-2_0_0-163_el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 14:15:57 UTC
Attachments:
Proposed patch (flags: none)
Patch to /sbin/mkdumprd script from RHEL6 to make it ignore drive firmware revisions (flags: none)

Description Stephen Cameron 2011-01-19 22:09:50 UTC
Description of problem:

Updating the firmware of a disk drive which is required for kdump to work, and then rebooting, will cause subsequent kdump attempts to fail unless the kdump initrd is rebuilt manually.  This is because the firmware revisions of the necessary drives are stored in the kdump initrd, obtained from /sys/block/sd*/device/rev.
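
A quick way to see which firmware revision each disk currently reports is sketched below. This is only an illustration and not part of mkdumprd: the function name print_disk_revs and its optional directory argument are made up for this example.

```shell
# Print "<disk> <firmware revision>" for every block device that exposes one.
# The optional argument (default /sys/block) is an assumption of this sketch,
# added only so the function can be pointed at a test directory.
print_disk_revs() {
    sysblock="${1:-/sys/block}"
    for dev in "$sysblock"/*; do
        # Skip devices without a SCSI-style "rev" attribute (loop, ram, ...).
        [ -r "$dev/device/rev" ] || continue
        printf '%s %s\n' "${dev##*/}" "$(cat "$dev/device/rev")"
    done
}
```

Running print_disk_revs before and after a firmware update should show the value that ends up baked into the kdump initrd changing.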

Version-Release number of selected component (if applicable):

kexec-tools-2.0.0-145.el6.x86_64


How reproducible:

Verify kdump works.  Upgrade or downgrade firmware on the disk to which the dump is captured.  Reboot.  Re-try the kdump.  It won't recognize the disks with different firmware.

Steps to Reproduce:

1.  Set up kdump on a system with SCSI, SAS, SATA, or SmartArray, and verify that kdump works by:

a. echo 1 > /proc/sys/kernel/sysrq
b. echo s > /proc/sysrq-trigger ; echo c > /proc/sysrq-trigger
c. Watch the console to see that the dump occurs.

2.  Upgrade or downgrade the firmware of the drive to which the dump is captured.  Or, on an HP Smart Array controller, upgrade or downgrade the controller firmware (the controller firmware revision is reported as the logical drive firmware revision on Smart Arrays).

3. Reboot.

4. Attempt kdump.  You will notice that kdump does not recognize the disks anymore.
  
Actual results:

Kdump fails to recognize disks with different firmware.

Expected results:

Kdump should not care about the firmware revision.


Additional info:

It looks like /sbin/mkdumprd stores the critical disk information in /etc/critical_disks in the kdump initrd image.  What is stored there, and compared later to recognize the disks, is the Vendor, Model, Revision, and Type, obtained from the files "vendor", "model", "rev" and "type" in the /sys/block/sd*/device directory.  "rev" should probably be ignored.

See this section of code from /sbin/mkdumprd:

for i in \`cat /etc/critical_disks | awk '{print \$1}'\`
do
    IDSTRING=\`grep \$i /etc/critical_disks | awk '{print \$2}'\`
    COUNT=\`grep \$i /etc/critical_disks | awk '{print \$3}'\`
    found=0

    echo -n "Waiting for \$COUNT \$i-like device(s)..."
    while true
    do
        for j in \`ls /sys/block\`
        do
            DSKSTRING=""
            TMPNAME=""
            if [ ! -d /sys/block/\$j/device ]
            then
                continue
            fi
            for a in "vendor" "model" "rev" "type"
            do
                TMPNAME=\`cat /sys/block/\$j/device/\$a\`
                DSKSTRING="\$DSKSTRING \$TMPNAME"
            done
            DSKSTRING=\`echo \$DSKSTRING | sed -e's/ //g'\`
            if [ "\$DSKSTRING" == "\$IDSTRING" ]
            then
                found=\$((\$found + 1))
            fi
            if [ \$found -ge \$COUNT ]
            then
                break 2
            fi
        done
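
As a rough sketch of the idea (not the actual patch), the identification string could be built from "vendor", "model" and "type" only. The function name disk_id_string is hypothetical, and the code is written as a plain standalone script, without the backslash escaping that appears in the mkdumprd excerpt above.

```shell
# Build an identification string from vendor, model and type only,
# deliberately skipping "rev" so a firmware update does not change it.
# $1 is a device directory such as /sys/block/sda/device.
# disk_id_string is a hypothetical helper, not a function in mkdumprd.
disk_id_string() {
    id=""
    for attr in vendor model type; do
        [ -r "$1/$attr" ] || return 1
        id="$id$(cat "$1/$attr")"
    done
    # Strip whitespace, as the original loop does with sed.
    printf '%s\n' "$id" | tr -d ' \t'
}
```

With this, two readings of the same drive taken before and after a firmware flash should produce the same string, as long as vendor, model and type are unchanged.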


However, to really identify the disks, the tuple "vendor, model, rev, type" seems a little weak.  For example, all of the logical drives on an HP Smart Array will have identical vendor/model/rev/type values, so this code will not be able to distinguish one drive from another.  Luckily, the hpsa and cciss drivers (and most SCSI or SAS HBAs, but not most fibre SANs) will present disks in a predetermined order most of the time (barring messing around with /proc/scsi/scsi to re-order drives with Linux hotplug functionality).  It is also likely that servers from any vendor will be shipped with disks which have identical vendor/model/rev/type.  (For disks, the type will always be 0 anyway.)
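
To illustrate the ambiguity, the following sketch lists every vendor/model/rev/type string that is shared by more than one disk. It is an illustration only: the function name ambiguous_disk_ids and its optional directory argument are assumptions made for this example.

```shell
# List identification strings shared by more than one disk.
# Each output line is "<count> <idstring>".
# The optional argument (default /sys/block) is an assumption of this sketch.
ambiguous_disk_ids() {
    sysblock="${1:-/sys/block}"
    for dev in "$sysblock"/*; do
        [ -d "$dev/device" ] || continue
        ids=""
        for attr in vendor model rev type; do
            ids="$ids$(cat "$dev/device/$attr" 2>/dev/null)"
        done
        # Same whitespace stripping as the loop in mkdumprd.
        printf '%s\n' "$ids" | tr -d ' \t'
    done | sort | uniq -c | awk '$1 > 1 && NF > 1 { print $1, $2 }'
}
```

Any line printed by this function names a string that cannot single out one drive on that system.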

There probably needs to be a better way to identify the drives.  Perhaps using the device identifier from SCSI Inquiry page 0x83, which should be obtainable via SG_IO (e.g. see the sg_inq program from the sg3_utils package).  Ideally (I think), some unique ID should be exported via /sys (e.g. an ASCII representation of the device identifier from SCSI Inquiry page 0x83, a la the unique_id attribute which the hpsa driver exports for each logical drive), although it would probably be best if a similar attribute were exported by the SCSI mid layer rather than by the LLDs.  But these are implementation details.  The gist of the complaint is that there needs to be a better way to identify drives than by the vendor/model/rev/type tuple.
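
A minimal sketch of such a lookup, under the assumptions stated here: it first tries a driver-provided unique_id attribute (as hpsa exports) and otherwise falls back to asking sg_inq for page 0x83. The function name disk_unique_id and its second argument are hypothetical, and the sg_inq output is the tool's human-readable page dump, so a real implementation would still have to extract one designator from it.

```shell
# Print a unique identifier for a disk, or fail if none is available.
# $1 is a disk name such as sda; $2 (default /sys/block) is an assumption
# of this sketch so it can be pointed at a test directory.
disk_unique_id() {
    devdir="${2:-/sys/block}/$1/device"
    # Driver-provided attribute, e.g. the hpsa unique_id mentioned above.
    if [ -r "$devdir/unique_id" ]; then
        cat "$devdir/unique_id"
        return
    fi
    # Otherwise ask the device for its identification VPD page (0x83).
    if command -v sg_inq >/dev/null 2>&1; then
        sg_inq --page=0x83 "/dev/$1"
        return
    fi
    return 1
}
```

Unlike the vendor/model/rev/type string, an identifier of this kind should stay the same across firmware updates and differ between logical drives on the same controller.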

-- steve

Comment 2 Cong Wang 2011-01-26 09:42:31 UTC
Created attachment 475355
Proposed patch

Neil suggested removing "rev" from the tuple, so could you try this patch?
Thanks!

Comment 3 Stephen Cameron 2011-01-26 14:29:56 UTC
Ok, I will give it a try.

-- steve

Comment 4 Stephen Cameron 2011-01-26 14:58:23 UTC
Have not yet tried the patch, but wanted to report what I've found so far.  When I attempted to apply the patch, it applied with some offsets (-3 and -14 lines), and one instance of the "vendor" "model" "rev" "type" tuple remained in the mkdumprd script.  I was expecting the patch to go in clean if my mkdumprd script was the same as Neil's before applying the patch, so I began to suspect that maybe I was on a beta release of RHEL6, but I double-checked on another RHEL6 system which I installed just yesterday and found the same thing.

So, I'm thinking maybe Neil's patch is vs. a newer variant of the mkdumprd script than what RHEL6 shipped with?

Should I still try it?  I suspect that third instance of "rev" needs to be removed too.

-- steve

Comment 5 Stephen Cameron 2011-01-26 15:45:42 UTC
Created attachment 475422
Patch to /sbin/mkdumprd script from RHEL6 to make it ignore drive firmware revisions

Comment 6 Stephen Cameron 2011-01-26 15:46:53 UTC
So I took the liberty of making my own patch against the mkdumprd script which, to the best of my knowledge, is the one that actually ships with RHEL6.  I tested it by flashing the P410i to 3.50 firmware, rebuilding the kdump initrd with the patched mkdumprd script, flashing the firmware to 3.66, rebooting, and trying kdump without rebuilding the kdump initrd.  It seems to work.

attachment 475422

-- steve

Comment 7 Cong Wang 2011-01-27 10:43:13 UTC
Sorry, my patch was not generated correctly; your patch is exactly what I want.
Thanks, Stephen!

Comment 14 errata-xmlrpc 2011-05-19 14:15:57 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0736.html