+++ This bug was initially created as a clone of Bug #569654 +++

Same bug and same fix for RHEL4. Customer has tested and verified the patch on RHEL4 too.

Description of problem:
The system failed to boot due to a hung partition scan on a faulty non-root SCSI drive. An upstream patch has been tested and verified to fix the issue:

commit 69bdd88ca2670c321fef774e77059516f836c6f2
Author: Hannes Reinecke <hare>
Date:   Fri Sep 1 15:50:23 2006 +0200

    [SCSI] Wrong size information for devices with disabled read access

    When accessing a device with disabled read access the capacity is set
    randomly to 1GB. This makes it impossible to userspace tools to detect
    invalid device capacities.

    Signed-off-by: Mike Anderson <andmike.com>
    Acked-by: Chris Mason <mason>
    Signed-off-by: Hannes Reinecke <hare>
    Signed-off-by: James Bottomley <James.Bottomley>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 98bd3aa..638cff4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1215,7 +1215,7 @@ repeat:
 		/* Either no media are present but the drive didn't tell us,
 		   or they are present but the read capacity command fails */
 		/* sdkp->media_present = 0; -- not always correct */
-		sdkp->capacity = 0x200000; /* 1 GB - random */
+		sdkp->capacity = 0; /* unknown mapped to zero - as usual */
 
 		return;
 	} else if (the_result && longrc) {

Version-Release number of selected component (if applicable):
RHEL5.5-beta. Also observed on RHEL5.4 and RHEL4. A separate BZ will be cloned for RHEL4.

How reproducible:
Always, with an appropriately faulty SCSI drive. For h/w details, see the associated IT.

Steps to Reproduce:
1. With the faulty drive installed, boot.

Actual results:
Hung boot.

Expected results:
No hang. The SCSI read capacity for the faulty drive should default to zero rather than 1GB, so the partition scan will not be done and the boot won't hang.
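The effect of the one-line change can be sketched outside the kernel. The following is a minimal user-space model, not the sd.c code: `struct disk_model`, `set_capacity_on_failure()` and `would_scan_partitions()` are hypothetical names introduced for illustration, and the rule that a zero capacity suppresses the partition scan is taken from the expected results above rather than from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct scsi_disk.
 * Only the field touched by the patch is modelled. */
struct disk_model {
    uint64_t capacity; /* in 512-byte sectors */
};

/* Fallback applied when the read capacity command fails.
 * patched == false models the old behaviour, true the new one. */
static void set_capacity_on_failure(struct disk_model *d, bool patched)
{
    assert(d != NULL);
    if (patched)
        d->capacity = 0;        /* unknown mapped to zero */
    else
        d->capacity = 0x200000; /* arbitrary 1 GB: 0x200000 * 512 bytes */
}

/* Assumption for illustration: a partition scan is only attempted
 * on a disk that reports a non-zero capacity. */
static bool would_scan_partitions(const struct disk_model *d)
{
    assert(d != NULL);
    return d->capacity != 0;
}
```

In this model the unpatched fallback reports a plausible-looking 1 GB disk, so a scan of the faulty drive would still be attempted; the patched fallback reports zero sectors, so the scan is skipped.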
--- Additional comment from mgoodwin on 2010-03-01 18:32:43 EST --- Created an attachment (id=397217) upstream patch to set default capacity to zero on faulty scsi drive
Created attachment 397224
upstream patch to set default capacity to zero on faulty scsi drive
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.25.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Is there any way for us to simulate this kind of faulty disk?
(In reply to comment #13)
> Is there any way for us to simulate this kind of faulty disk?

You might be able to use a scsi_debug module in the initrd with the "every_nth" parameter set to 1 to force/inject I/O errors during boot. There are pre-built scsi_debug modules for RHEL at http://people.redhat.com/mgoodwin/scsi_debug/

See http://sg.danny.cz/sg/sdebug26.html for the scsi_debug documentation.

But really, I seem to remember the customer (Fujitsu) reported it was tested and fixed in the GSS support issue tracking tool, as per comment #8. All the patch does is prevent a partition scan on faulty drives from causing the boot to hang. It's pretty simple.

Regards
-- Mark Goodwin
GSS/SEG
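The idea behind an "every_nth" style injector can be sketched as a simple command counter. This is a simplified model and not scsi_debug code: `struct injector` and `inject_on_this_command()` are hypothetical names, values of zero or below are treated here as "injection disabled" purely as a simplification, and what kind of error gets injected (which the real module selects separately) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an every_nth style error injector. */
struct injector {
    int every_nth; /* inject on every nth command; <= 0 disables (simplification) */
    int cmd_count; /* commands seen since the last injection */
};

/* Returns true when the current command should be affected. */
static bool inject_on_this_command(struct injector *inj)
{
    assert(inj != NULL);
    if (inj->every_nth <= 0)
        return false;
    if (++inj->cmd_count >= inj->every_nth) {
        inj->cmd_count = 0; /* restart the count after each injection */
        return true;
    }
    return false;
}
```

With `every_nth` set to 1 the counter reaches its threshold on every call, so in this model every command issued to the simulated disk during boot is affected.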
Mark,

With kernel -89, I was not able to reproduce the problem. I built a new initrd.img with the module you provided and added these lines to init:

    echo "Loading scsi-debug.ko module"
    insmod /lib/scsi_mod.ko
    insmod /lib/sd_mod.ko
    insmod /lib/scsi-debug.ko every_nth=1 opts=4

===========================================
I have tested these opts:

opts=4: the system hangs for about 2 minutes and prints:

    scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0

Then the system boots up normally, and no /dev file is created for the scsi_debug module.
====================
opts=8: the system gets the correct disk size, but the system doesn't hang.

Is there anything I missed?
Code reviewed. Patch linux-2.6.9-scsi-fixup-size-on-read-capacity-failure.patch was applied in kernel-2.6.9-95.EL. The customer (Fujitsu) reported that the fix works. No hardware available, so sanity testing only.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html