569668 – [RHEL4] boot hangs if scsi read capacity fails on faulty non system drive

Summary: [RHEL4] boot hangs if scsi read capacity fails on faulty non system drive

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.8
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	David Milburn
QA Contact:	Gris Ge
Docs Contact:
URL:
Whiteboard:
Depends On:	569654
Blocks:	485811 583726 589295

Reported:	2010-03-02 00:18 UTC by Mark Goodwin
Modified:	2018-11-14 17:57 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	569654
Environment:
Last Closed:	2011-02-16 15:27:58 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
upstream patch to set default capacity to zero on faulty scsi drive (1.08 KB, patch) 2010-03-02 00:23 UTC, Mark Goodwin	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0263	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update	2011-02-16 15:14:55 UTC

Description Mark Goodwin 2010-03-02 00:18:50 UTC

+++ This bug was initially created as a clone of Bug #569654 +++

Same bug and same fix for RHEL4. Customer has tested and verified the
patch on RHEL4 too.

Description of problem:

The system failed to boot because the partition scan hung on a faulty
non-root SCSI drive. An upstream patch has been tested and verified to
fix the issue:

commit 69bdd88ca2670c321fef774e77059516f836c6f2
Author: Hannes Reinecke <hare>
Date:   Fri Sep 1 15:50:23 2006 +0200

    [SCSI] Wrong size information for devices with disabled read access
    
    When accessing a device with disabled read access the capacity is set
    randomly to 1GB. This makes it impossible to userspace tools to detect
    invalid device capacities.
    
    Signed-off-by: Mike Anderson <andmike.com>
    Acked-by: Chris Mason <mason>
    Signed-off-by: Hannes Reinecke <hare>
    Signed-off-by: James Bottomley <James.Bottomley>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 98bd3aa..638cff4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1215,7 +1215,7 @@ repeat:
                /* Either no media are present but the drive didn't tell us,
                   or they are present but the read capacity command fails */
                /* sdkp->media_present = 0; -- not always correct */
-               sdkp->capacity = 0x200000; /* 1 GB - random */
+               sdkp->capacity = 0; /* unknown mapped to zero - as usual */
 
                return;
        } else if (the_result && longrc) {


Version-Release number of selected component (if applicable):
RHEL5.5-beta. Also observed on RHEL5.4 and RHEL4. A separate BZ
will be cloned for RHEL4.

How reproducible:
Always, with an appropriately faulty scsi drive. For h/w details,
see the associated IT.

Steps to Reproduce:
1. Boot with the faulty drive installed.
  
Actual results:
hung boot

Expected results:
No hang. The SCSI read capacity for the faulty drive should default to
zero rather than 1GB, so that the partition scan is not done and the
boot won't hang.
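
The reasoning above can be sketched as a small user-space model in C. This is not the kernel code: `model_disk`, `model_capacity`, and `model_should_scan_partitions` are hypothetical names, and the rule that a zero-capacity disk is not scanned for partitions is taken from the expected results stated here, not quoted from the RHEL4 source. The two fallback values come from the diff (0x200000 sectors of 512 bytes is 1 GiB).

```c
#include <stdbool.h>
#include <stdint.h>

/* Old fallback from the diff: 0x200000 sectors * 512 bytes = 1 GiB. */
#define OLD_FALLBACK_SECTORS 0x200000ULL
/* Patched fallback: an unknown capacity is reported as zero. */
#define NEW_FALLBACK_SECTORS 0ULL
#define SECTOR_SIZE 512ULL

/* Hypothetical stand-in for the outcome of READ CAPACITY on one disk. */
struct model_disk {
    bool read_capacity_ok;     /* did READ CAPACITY succeed? */
    uint64_t reported_sectors; /* meaningful only if read_capacity_ok */
};

/* Capacity (in sectors) the disk ends up with, before or after the patch. */
uint64_t model_capacity(const struct model_disk *d, bool patched)
{
    if (d->read_capacity_ok)
        return d->reported_sectors;
    return patched ? NEW_FALLBACK_SECTORS : OLD_FALLBACK_SECTORS;
}

/* Assumption: the partition scan is skipped for a zero-capacity disk. */
bool model_should_scan_partitions(uint64_t capacity_sectors)
{
    return capacity_sectors != 0;
}
```

In this model, a faulty disk on an unpatched kernel gets a 1 GiB capacity and is scanned, which is where the hang was seen. With the patch its capacity is zero and the scan is skipped.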

--- Additional comment from mgoodwin on 2010-03-01 18:32:43 EST ---

Created an attachment (id=397217)
upstream patch to set default capacity to zero on faulty scsi drive

Comment 1 Mark Goodwin 2010-03-02 00:23:54 UTC

Created attachment 397224 [details]
upstream patch to set default capacity to zero on faulty scsi drive

Comment 4 RHEL Program Management 2010-03-09 18:18:03 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Vivek Goyal 2010-05-07 17:01:20 UTC

Committed in 89.25.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 13 Gris Ge 2011-01-13 02:46:49 UTC

Is there any way for us to simulate this kind of faulty disk?

Comment 14 Mark Goodwin 2011-01-13 03:04:07 UTC

(In reply to comment #13)
> Is there any way for us to simulate this kind of faulty disk?

You might be able to use a scsi_debug module in the initrd with the
"every_nth" parameter set to 1 to force/inject I/O errors during boot.
There are pre-built scsi_debug modules for RHEL at
http://people.redhat.com/mgoodwin/scsi_debug/
See http://sg.danny.cz/sg/sdebug26.html for the scsi_debug documentation.

But really, I seem to remember the customer (Fujitsu) reported it was
tested and fixed in the GSS support issue tracking tool, as per comment #8.
All the patch does is prevent a partition scan on faulty drives from
causing the boot to hang. It's pretty simple.

Regards
-- Mark Goodwin GSS/SEG

Comment 15 Gris Ge 2011-01-19 06:23:14 UTC

Mark,

With kernel -89, I was not able to reproduce the problem.
I built a new initrd.img with the module you provided and added these lines to init:
echo "Loading scsi-debug.ko module"
insmod /lib/scsi_mod.ko
insmod /lib/sd_mod.ko
insmod /lib/scsi-debug.ko every_nth=1 opts=4
===========================================

I have tested these opts values:
opts=4 causes the system to hang for about 2 minutes and then prints this:
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0

Then the system boots up normally and no /dev file is created for the scsi_debug module.
====================
opts=8: the system gets the correct disk size and does not hang.

Is there anything I missed?
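
One hedged way to check which capacity the kernel assigned to a test disk after boot is to read its sysfs size file, `/sys/block/<dev>/size`, which reports the size in 512-byte sectors. The sketch below is an assumption-laden helper, not part of the original test: `read_size_sectors` is a hypothetical name, and which device name the scsi_debug disk receives (if any) depends on the system.

```c
#include <stdio.h>

/*
 * Read a sysfs-style size file containing one decimal count of
 * 512-byte sectors. Returns 0 on success and stores the count in
 * *sectors; returns -1 if the file cannot be opened or parsed.
 */
int read_size_sectors(const char *path, unsigned long long *sectors)
{
    FILE *f = fopen(path, "r");
    int parsed;

    if (f == NULL)
        return -1;
    parsed = (fscanf(f, "%llu", sectors) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}
```

Called on, say, `/sys/block/sdb/size` (device name assumed), a result of 2097152 sectors would correspond to the old 1 GB fallback from the diff, and 0 to the patched fallback. A return of -1 would match the case reported above where no device node was created at all.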

Comment 16 Gris Ge 2011-01-25 06:25:39 UTC

Code reviewed. Patch linux-2.6.9-scsi-fixup-size-on-read-capacity-failure.patch was applied to kernel-2.6.9-95.EL.

The customer (Fujitsu) reported that the fix works. No hardware available, so sanity testing only.

Comment 17 errata-xmlrpc 2011-02-16 15:27:58 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html
