Bug 569668 - [RHEL4] boot hangs if scsi read capacity fails on faulty non system drive
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: David Milburn
QA Contact: Gris Ge
Depends On: 569654
Blocks: 485811 583726 589295
Reported: 2010-03-01 19:18 EST by Mark Goodwin
Modified: 2011-02-16 10:27 EST
Doc Type: Bug Fix
Clone Of: 569654
Last Closed: 2011-02-16 10:27:58 EST

Attachments
upstream patch to set default capacity to zero on faulty scsi drive (1.08 KB, patch)
2010-03-01 19:23 EST, Mark Goodwin


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:0263
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update
Last Updated: 2011-02-16 10:14:55 EST

Description Mark Goodwin 2010-03-01 19:18:50 EST
+++ This bug was initially created as a clone of Bug #569654 +++

Same bug and same fix for RHEL4. Customer has tested and verified the
patch on RHEL4 too.

Description of problem:

The system failed to boot because of a hung partition scan on a faulty
non-root SCSI drive. An upstream patch has been tested and verified to
fix the issue:

commit 69bdd88ca2670c321fef774e77059516f836c6f2
Author: Hannes Reinecke <hare@suse.de>
Date:   Fri Sep 1 15:50:23 2006 +0200

    [SCSI] Wrong size information for devices with disabled read access
    
    When accessing a device with disabled read access the capacity is set
    randomly to 1GB. This makes it impossible to userspace tools to detect
    invalid device capacities.
    
    Signed-off-by: Mike Anderson <andmike@us.ibm.com>
    Acked-by: Chris Mason <mason@suse.com>
    Signed-off-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 98bd3aa..638cff4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1215,7 +1215,7 @@ repeat:
                /* Either no media are present but the drive didn't tell us,
                   or they are present but the read capacity command fails */
                /* sdkp->media_present = 0; -- not always correct */
-               sdkp->capacity = 0x200000; /* 1 GB - random */
+               sdkp->capacity = 0; /* unknown mapped to zero - as usual */
 
                return;
        } else if (the_result && longrc) {


Version-Release number of selected component (if applicable):
RHEL5.5-beta. Also observed on RHEL5.4 and RHEL4. A separate BZ
will be cloned for RHEL4.

How reproducible:
Always, with an appropriately faulty scsi drive. For h/w details,
see the associated IT.

Steps to Reproduce:
1. with faulty drive installed, boot
  
Actual results:
hung boot

Expected results:
No hang. The SCSI read capacity for the faulty drive should default to
zero rather than 1 GB, so that the partition scan is not done and the
boot won't hang.
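
The arithmetic behind the one-line change can be sketched as follows.
This is a minimal illustration, not kernel code: capacity_bytes() and
should_scan_partitions() are hypothetical helpers, the 512-byte sector
size is an assumption, and the "skip the scan when the capacity is zero"
rule is a simplification of the behavior described in this report.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE   512u       /* assumed logical block size in bytes */
#define OLD_FALLBACK  0x200000u  /* sector count removed by the patch */
#define NEW_FALLBACK  0u         /* sector count introduced by the patch */

/* Hypothetical helper: convert a capacity in sectors to bytes. */
static uint64_t capacity_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE;
}

/* Hypothetical stand-in for the caller's decision: a disk that reports
 * zero sectors has nothing to read, so no partition scan is attempted. */
static int should_scan_partitions(uint64_t sectors)
{
    return sectors != 0;
}

/* Self-check of the two fallback values discussed in this report. */
static void check_fallbacks(void)
{
    /* 0x200000 sectors * 512 bytes = 1 GiB, the old "random" default. */
    assert(capacity_bytes(OLD_FALLBACK) == (uint64_t)1 << 30);
    assert(should_scan_partitions(OLD_FALLBACK));   /* scan is attempted */

    assert(capacity_bytes(NEW_FALLBACK) == 0);
    assert(!should_scan_partitions(NEW_FALLBACK));  /* scan is skipped */
}
```

Calling check_fallbacks() from a small main() exercises both cases: with
the old fallback the drive looks like a 1 GiB disk and is scanned, while
with the new fallback it looks empty and is left alone.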

--- Additional comment from mgoodwin@redhat.com on 2010-03-01 18:32:43 EST ---

Created an attachment (id=397217)
upstream patch to set default capacity to zero on faulty scsi drive
Comment 1 Mark Goodwin 2010-03-01 19:23:54 EST
Created attachment 397224
upstream patch to set default capacity to zero on faulty scsi drive
Comment 4 RHEL Product and Program Management 2010-03-09 13:18:03 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 6 Vivek Goyal 2010-05-07 13:01:20 EDT
Committed in 89.25.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Comment 13 Gris Ge 2011-01-12 21:46:49 EST
Is there any way for us to simulate this kind of faulty disk?
Comment 14 Mark Goodwin 2011-01-12 22:04:07 EST
(In reply to comment #13)
> Is there any way for us to simulate this kind of faulty disk?

You might be able to use a scsi_debug module in the initrd with the
"every_nth" parameter set to 1 to force/inject I/O errors during boot.
There are pre-built scsi_debug modules for RHEL at
http://people.redhat.com/mgoodwin/scsi_debug/
See http://sg.danny.cz/sg/sdebug26.html for the scsi_debug documentation.

But really, I seem to remember that the customer (Fujitsu) reported in
the GSS support issue tracking tool that the fix was tested and works,
as per comment #8. All the patch does is prevent a partition scan on
faulty drives from causing the boot to hang. It's pretty simple.

Regards
-- Mark Goodwin GSS/SEG
Comment 15 Gris Ge 2011-01-19 01:23:14 EST
Mark,

With kernel -89, I was not able to reproduce the problem.
I built a new initrd.img with the module you provided and added these lines to init:
echo "Loading scsi-debug.ko module"
insmod /lib/scsi_mod.ko
insmod /lib/sd_mod.ko
insmod /lib/scsi-debug.ko every_nth=1 opts=4
===========================================

I have tested these opts:
opts=4 causes the system to hang for about 2 minutes and then prints:
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0

Then the system boots normally and no /dev file is created for the scsi_debug module.
====================
opts=8: the system gets the correct disk size and does not hang.

Is there anything I missed?
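
The opts values tried above are bit flags. A small decoder makes it
easier to see which fault each run injected. This is only a sketch:
describe_opts() is a hypothetical helper, not part of scsi_debug, and
the bit meanings (1 = noise, 2 = medium error, 4 = timeout,
8 = recovered error) are an assumption taken from the scsi_debug
documentation linked in comment #14. Check them against the module
source for the kernel in use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed scsi_debug "opts" bits; verify against the module source. */
struct opt_name {
    unsigned bit;
    const char *name;
};

static const struct opt_name opt_names[] = {
    { 1, "noise" },
    { 2, "medium_error" },
    { 4, "timeout" },
    { 8, "recovered_error" },
};

/* Hypothetical helper: write a comma-separated list of the known bits
 * set in opts into buf (truncating if buf is too small) and return buf. */
static char *describe_opts(unsigned opts, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(opt_names) / sizeof(opt_names[0]); i++) {
        if (!(opts & opt_names[i].bit))
            continue;
        size_t used = strlen(buf);
        /* snprintf always NUL-terminates within the remaining space. */
        snprintf(buf + used, len - used, "%s%s",
                 used ? "," : "", opt_names[i].name);
    }
    return buf;
}
```

Under these assumptions, opts=4 decodes to "timeout" and opts=8 to
"recovered_error", which is consistent with the behavior reported above:
a device taken offline after error recovery in the first case, and a
correct disk size in the second.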
Comment 16 Gris Ge 2011-01-25 01:25:39 EST
Code reviewed. Patch linux-2.6.9-scsi-fixup-size-on-read-capacity-failure.patch was applied in kernel-2.6.9-95.EL.

The customer (Fujitsu) reported that the fix works. No hardware is available here, so sanity testing only.
Comment 17 errata-xmlrpc 2011-02-16 10:27:58 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html
