
Bug 569668

Summary: [RHEL4] boot hangs if scsi read capacity fails on faulty non system drive
Product: Red Hat Enterprise Linux 4
Reporter: Mark Goodwin <mgoodwin>
Component: kernel
Assignee: David Milburn <dmilburn>
Status: CLOSED ERRATA
QA Contact: Gris Ge <fge>
Severity: high
Docs Contact:
Priority: high
Version: 4.8
CC: emcnabb, fge, jwest, moshiro, tao, vgoyal
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 569654
Environment:
Last Closed: 2011-02-16 15:27:58 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 569654
Bug Blocks: 485811, 583726, 589295
Attachments:
upstream patch to set default capacity to zero on faulty scsi drive (flags: none)

Description Mark Goodwin 2010-03-02 00:18:50 UTC
+++ This bug was initially created as a clone of Bug #569654 +++

Same bug and same fix for RHEL4. Customer has tested and verified the
patch on RHEL4 too.

Description of problem:

The system failed to boot because the partition scan hung on a faulty
non-root SCSI drive. An upstream patch has been tested and verified to
fix the issue:

commit 69bdd88ca2670c321fef774e77059516f836c6f2
Author: Hannes Reinecke <hare>
Date:   Fri Sep 1 15:50:23 2006 +0200

    [SCSI] Wrong size information for devices with disabled read access
    
    When accessing a device with disabled read access the capacity is set
    randomly to 1GB. This makes it impossible to userspace tools to detect
    invalid device capacities.
    
    Signed-off-by: Mike Anderson <andmike.com>
    Acked-by: Chris Mason <mason>
    Signed-off-by: Hannes Reinecke <hare>
    Signed-off-by: James Bottomley <James.Bottomley>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 98bd3aa..638cff4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1215,7 +1215,7 @@ repeat:
                /* Either no media are present but the drive didn't tell us,
                   or they are present but the read capacity command fails */
                /* sdkp->media_present = 0; -- not always correct */
-               sdkp->capacity = 0x200000; /* 1 GB - random */
+               sdkp->capacity = 0; /* unknown mapped to zero - as usual */
 
                return;
        } else if (the_result && longrc) {
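
For reference, the placeholder removed by this patch is a sector count, not a byte count. A minimal sketch of the arithmetic behind the "1 GB" comment, assuming 512-byte logical sectors; the helper name sectors_to_bytes is made up for illustration and is not kernel code:

```c
#include <stdint.h>

/* Assumption: 512-byte logical sectors. */
#define SECTOR_SIZE_BYTES 512ULL

/* Hypothetical helper (not from sd.c): convert a sector count to bytes. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE_BYTES;
}
```

Under that assumption, sectors_to_bytes(0x200000) is 2^21 * 2^9 = 2^30 bytes, i.e. 1 GiB, while the patched value of 0 sectors maps to 0 bytes.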


Version-Release number of selected component (if applicable):
RHEL5.5-beta. Also observed on RHEL5.4 and RHEL4. A separate BZ
will be cloned for RHEL4.

How reproducible:
Always, with an appropriately faulty scsi drive. For h/w details,
see the associated IT.

Steps to Reproduce:
1. Boot the system with the faulty drive installed.
  
Actual results:
The boot hangs.

Expected results:
No hang. The SCSI read capacity for the faulty drive should default to
zero rather than 1GB, so that the partition scan is skipped and the
boot won't hang.
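
The expected behaviour can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: struct disk_model, record_capacity and should_scan_partitions are hypothetical names, and the rule "skip the partition scan when the capacity is zero" is taken from the description above rather than copied from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the driver's per-disk state (not struct scsi_disk). */
struct disk_model {
    uint64_t capacity; /* in sectors; 0 means "unknown" */
};

/* Model of the patched behaviour: if READ CAPACITY fails, record 0
 * where the old code recorded the arbitrary value 0x200000. */
void record_capacity(struct disk_model *d, bool read_capacity_ok,
                     uint64_t reported_sectors)
{
    d->capacity = read_capacity_ok ? reported_sectors : 0;
}

/* Assumed policy: only scan for partitions when the capacity is non-zero. */
bool should_scan_partitions(const struct disk_model *d)
{
    return d->capacity != 0;
}
```

With the old placeholder, a failed READ CAPACITY still left a non-zero capacity, so in this model the scan would be attempted against a drive that cannot be read.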

--- Additional comment from mgoodwin on 2010-03-01 18:32:43 EST ---

Created an attachment (id=397217)
upstream patch to set default capacity to zero on faulty scsi drive

Comment 1 Mark Goodwin 2010-03-02 00:23:54 UTC
Created attachment 397224
upstream patch to set default capacity to zero on faulty scsi drive

Comment 4 RHEL Program Management 2010-03-09 18:18:03 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Vivek Goyal 2010-05-07 17:01:20 UTC
Committed in 89.25.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 13 Gris Ge 2011-01-13 02:46:49 UTC
Is there any way for us to simulate this kind of faulty disk?

Comment 14 Mark Goodwin 2011-01-13 03:04:07 UTC
(In reply to comment #13)
> Is there any way for us to simulate this kind of faulty disk?

You might be able to use a scsi_debug module in the initrd with the
"every_nth" parameter set to 1 to force/inject I/O errors during boot.
There are pre-built scsi_debug modules for RHEL at
http://people.redhat.com/mgoodwin/scsi_debug/
See http://sg.danny.cz/sg/sdebug26.html for the scsi_debug documentation.

But really, I seem to remember the customer (Fujitsu) reported it was
tested and fixed in the GSS support issue tracking tool, as per comment #8.
All the patch does is prevent a partition scan on faulty drives from
causing the boot to hang. It's pretty simple.

Regards
-- Mark Goodwin GSS/SEG

Comment 15 Gris Ge 2011-01-19 06:23:14 UTC
Mark,

With kernel -89, I was not able to reproduce the problem.
I built a new initrd.img with the module you provided and added these lines to init:
echo "Loading scsi-debug.ko module"
insmod /lib/scsi_mod.ko
insmod /lib/sd_mod.ko
insmod /lib/scsi-debug.ko every_nth=1 opts=4
===========================================

I have tested these opts:
opts=4 causes the system to hang for about 2 minutes, followed by this message:
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0

Then the system boots up normally and no /dev file is created for the scsi_debug module.
====================
opts=8: the system gets the correct disk size and does not hang.

Is there anything I missed?
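
For readers following the opts values tested above: the scsi_debug documentation linked in comment 14 describes "opts" as a bit mask. The sketch below decodes the individual bits; the values are given from memory of that documentation and should be checked against the module version in use, and sdebug_opt_name is a hypothetical helper, not part of scsi_debug.

```c
/* Bit values as listed in the scsi_debug documentation (assumption:
 * unchanged in the module build being tested). */
enum sdebug_opt {
    SDEBUG_OPT_NOISE         = 1,
    SDEBUG_OPT_MEDIUM_ERR    = 2,
    SDEBUG_OPT_TIMEOUT       = 4,
    SDEBUG_OPT_RECOVERED_ERR = 8
};

/* Hypothetical helper: name a single opts bit. */
const char *sdebug_opt_name(unsigned int bit)
{
    switch (bit) {
    case SDEBUG_OPT_NOISE:         return "noise";
    case SDEBUG_OPT_MEDIUM_ERR:    return "medium_error";
    case SDEBUG_OPT_TIMEOUT:       return "timeout";
    case SDEBUG_OPT_RECOVERED_ERR: return "recovered_error";
    default:                       return "unknown";
    }
}
```

Read this way, opts=4 injects command timeouts, which is consistent with the device being offlined after error recovery, and opts=8 injects recovered errors, where the command still returns data, which is consistent with the correct disk size being reported.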

Comment 16 Gris Ge 2011-01-25 06:25:39 UTC
Code reviewed. The patch linux-2.6.9-scsi-fixup-size-on-read-capacity-failure.patch was applied in kernel-2.6.9-95.EL.

The customer (Fujitsu) reported that the fix works. No hardware is available here, so sanity testing only.

Comment 17 errata-xmlrpc 2011-02-16 15:27:58 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html