Bug 241215

Summary: SATA update breaks reference counting
Product: Red Hat Enterprise Linux 4
Reporter: Bryn M. Reeves <bmr>
Component: kernel
Assignee: Bryn M. Reeves <bmr>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: high
Priority: high
Version: 4.5
CC: i-kitayama, jbaron, jfeeney, poelstra, sandeep_k_shandilya
Target Milestone: ---
Keywords: OtherQA, Regression
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Last Closed: 2007-11-15 16:27:34 UTC
Bug Depends On: 245197
Bug Blocks: 245198
Attachments:
remove bogus scsi_device_put in ata_scsi_scan_host (flags: none)

Description Bryn M. Reeves 2007-05-24 13:31:23 UTC
Description of problem:
The updated SATA support in 2.6.9-44 has a reference counting bug in
ata_scsi_scan_host.

ata_scsi_scan_host iterates over all entries in ap->device[], calling
__scsi_add_device for each device found.

In 2.6.18 this looks like this:

ata_scsi_scan_host()
        -> __scsi_add_device()
                -> scsi_probe_and_add_lun()
                        -> scsi_device_lookup_by_target()
                                -> scsi_device_get()

So, on return from __scsi_add_device we are holding a reference on the returned
scsi_device and need a scsi_device_put(sdev) following the return from
__scsi_add_device:

                sdev = __scsi_add_device(ap->scsi_host, 0, i, 0, NULL);
                if (!IS_ERR(sdev)) {
                        dev->sdev = sdev;
                        scsi_device_put(sdev);
                }


In the RHEL4 2.6.9 kernels the sequence looks instead like this:

ata_scsi_scan_host()
        -> __scsi_add_device()
                -> scsi_probe_and_add_lun()
                        -> scsi_device_lookup()

And we are NOT holding a reference on the scsi_device when we return into
ata_scsi_scan_host.
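
The difference between the two call chains can be sketched with a small
userspace model. This is only an illustration, not kernel code: struct
model_sdev, add_device_with_get, add_device_no_get, model_put and net_change
are hypothetical names standing in for the scsi_device, the two
__scsi_add_device behaviours and scsi_device_put described above.

```c
#include <assert.h>

/* Hypothetical stand-in for struct scsi_device: only the reference count. */
struct model_sdev {
    int refcount;
};

/* Models the 2.6.18 path: the lookup takes a reference for the caller. */
struct model_sdev *add_device_with_get(struct model_sdev *sdev)
{
    sdev->refcount++;
    return sdev;
}

/* Models the RHEL4 2.6.9 path: the lookup returns without taking a reference. */
struct model_sdev *add_device_no_get(struct model_sdev *sdev)
{
    return sdev;
}

/* Models scsi_device_put(): drop one reference. */
void model_put(struct model_sdev *sdev)
{
    sdev->refcount--;
}

/* Run "add device, then put" and return the net change in the count. */
int net_change(struct model_sdev *(*add)(struct model_sdev *))
{
    struct model_sdev sdev = { .refcount = 1 };
    int before = sdev.refcount;

    model_put(add(&sdev));
    return sdev.refcount - before;
}
```

In this model net_change(add_device_with_get) leaves the count balanced, while
net_change(add_device_no_get) leaves it one lower than it started, which is the
imbalance a put has when no matching get was taken.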

The backport from 2.6.18 included the upstream scsi_device_put():

 void ata_scsi_scan_host(struct ata_port *ap)
 {
-       struct ata_device *dev;
        unsigned int i;

-       if (ap->flags & ATA_FLAG_PORT_DISABLED)
+       if (ap->flags & ATA_FLAG_DISABLED)
                return;

        for (i = 0; i < ATA_MAX_DEVICES; i++) {
+               struct ata_device *dev = &ap->device[i];
+               struct scsi_device *sdev;
+
+               if (!ata_dev_enabled(dev) || dev->sdev)
+                       continue;
+
+               sdev = __scsi_add_device(ap->scsi_host, 0, i, 0, NULL);
+               if (!IS_ERR(sdev)) {
+                       dev->sdev = sdev;
+                       scsi_device_put(sdev);
+               }
+       }
+}

But since we aren't holding a reference, this breaks the reference counting for
the SATA module, e.g. ata_piix. If the reference count is incorrectly
decremented to 0 the module may be unloaded while still in use, triggering a
panic on the next access to the SATA device.
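
A rough userspace model of the consequence, assuming (as the behaviour
described here implies) that each unbalanced put also drops one reference on
the driver module. struct model_module, model_unload and unload_while_mounted
are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver module such as ata_piix. */
struct model_module {
    int refcount;   /* references held by users of the module */
    bool loaded;
};

/* Models rmmod: the unload is refused while the count is non-zero. */
bool model_unload(struct model_module *mod)
{
    if (mod->refcount != 0)
        return false;
    mod->loaded = false;
    return true;
}

/*
 * One mounted file system is using the disk; returns whether rmmod succeeds.
 * spurious_puts is the number of unbalanced put calls made during the scan.
 */
bool unload_while_mounted(int spurious_puts)
{
    struct model_module mod = { .refcount = 0, .loaded = true };

    mod.refcount -= spurious_puts;  /* bogus puts during the host scan */
    mod.refcount += 1;              /* the mount takes a reference */
    return model_unload(&mod);
}
```

With no spurious put the mount keeps the count at 1 and the unload is refused.
With one spurious put the count is back at 0 while the disk is still mounted,
so the unload is allowed.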

Version-Release number of selected component (if applicable):
2.6.9-44.EL onward (reproduced on 55.EL)

How reproducible:
100% but see notes below.

Steps to Reproduce:
How easy it is to see the problem and trigger a panic depends on the
configuration of the machine. For example, if device-mapper is used, the
reference count is incremented above zero when dm claims the devices (it's
still wrong, but it's harder to see).

Examples here use ata_piix as that's what I had for testing, but any of the
libata-based drivers should be similarly affected.

Reproducing with rescue mode
1. Boot a machine with a single ata_piix device using rhel4.5 install media in
rescue mode
2. Select "skip" when asked about fs detection
3. Examine reference count on the ata_piix module, it is 4294967295 (-1). 
4. Mount a partition from the SATA disk (I used /boot)
5. Examine reference count on the ata_piix module, it is 0.
6. rmmod ata_piix
7. Poke the device (e.g. ls -R /path/to/mount)
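
The values seen in steps 3 and 5 are consistent with a count of -1 being
displayed as an unsigned number. A minimal sketch of that arithmetic, assuming
the count is shown as a 32-bit unsigned value (displayed_count is a
hypothetical helper, not a kernel function):

```c
#include <assert.h>
#include <stdint.h>

/* How a signed count looks when displayed as a 32-bit unsigned value. */
uint32_t displayed_count(int32_t count)
{
    /* Conversion to uint32_t is defined as reduction modulo 2^32. */
    return (uint32_t)count;
}
```

Here displayed_count(-1) gives 4294967295, and once the mount adds one
reference, displayed_count(-1 + 1) gives 0, matching the two readings above.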

Reproducing on an installed system
1. Install the machine onto a SATA disk with only a root file system (no LVM, no
/boot)
2. Check the reference count on the ata_piix module (will be 0)
3. rmmod ata_piix
4. Poke the device (e.g. ls -R /)

Actual results:
rmmod succeeds, machine panics at step 7 (rescue mode) or step 4 (installed
system).

Expected results:
rmmod fails, machine does not panic

Additional info:

Comment 1 Bryn M. Reeves 2007-05-24 13:31:25 UTC
Created attachment 155343
remove bogus scsi_device_put in ata_scsi_scan_host

Comment 2 Bryn M. Reeves 2007-05-24 14:20:26 UTC
*** Bug 240016 has been marked as a duplicate of this bug. ***

Comment 3 RHEL Program Management 2007-05-24 14:24:45 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 RHEL Program Management 2007-05-24 14:26:50 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 5 Bryn M. Reeves 2007-05-24 15:08:46 UTC
Posted to rhkl

Comment 6 RHEL Program Management 2007-05-24 15:21:45 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 7 Jason Baron 2007-06-28 19:49:10 UTC
committed in stream U6 build 55.14. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 9 John Poelstra 2007-08-29 00:00:09 UTC
A fix for this issue should have been included in the packages contained in the
RHEL4.6 Beta released on RHN (also available at partners.redhat.com).  

Requested action: Please verify that your issue is fixed to ensure that it is
included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 10 John Poelstra 2007-09-05 22:21:08 UTC
A fix for this issue should have been included in the packages contained in 
the RHEL4.6-Snapshot1 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed to ensure that it is 
included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed, 
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent 
symptoms of the problem you are having and change the status of the bug to 
FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test 
results to Issue Tracker.  If you need assistance accessing 
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 11 John Poelstra 2007-09-12 00:40:40 UTC
A fix for this issue should be included in RHEL4.6-Snapshot2--available soon on
partners.redhat.com.  

Please verify that your issue is fixed to ensure that it is included in this
update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test
results to Issue Tracker.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 12 John Poelstra 2007-09-20 04:29:38 UTC
A fix for this issue should have been included in the packages contained in the
RHEL4.6-Snapshot3 on partners.redhat.com.  

Please verify that your issue is fixed to ensure that it is included in this
update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test
results to Issue Tracker.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.


Comment 15 errata-xmlrpc 2007-11-15 16:27:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html