Bug 239725 - Using hdparm -W to change write cache hangs libata disk
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All Linux
Priority: medium, Severity: medium
Assigned To: Kimball Murray
QA Contact: Martin Jenner
 
Reported: 2007-05-10 15:31 EDT by nate.dailey
Modified: 2008-01-29 14:31 EST
Doc Type: Bug Fix
Last Closed: 2008-01-29 14:31:44 EST

Description nate.dailey 2007-05-10 15:31:52 EDT
Description of problem:

Using hdparm -W to change the write cache setting on a SATA disk driven by
libata (sata_vsc in my case) leaves the disk hung, waiting for error handling
that never runs.


Version-Release number of selected component (if applicable):

This happens with RHEL4 U5.


How reproducible:

Every time.


Steps to Reproduce:
1. hdparm -W 0 /dev/sda

  
Actual results:

IO to the disk hangs.


Expected results:

IO does not hang.


Additional info:

libata-scsi's ata_scsi_qc_complete:

    	/* We snoop the SET_FEATURES - Write Cache ON/OFF command, and
    	 * schedule EH_REVALIDATE operation to update the IDENTIFY DEVICE
    	 * cache
    	 */
    	if (ap->ops->error_handler &&
    	    !need_sense && (qc->tf.command == ATA_CMD_SET_FEATURES) &&
    	    ((qc->tf.feature == SETFEATURES_WC_ON) ||
    	     (qc->tf.feature == SETFEATURES_WC_OFF))) {
    		ap->eh_info.action |= ATA_EH_REVALIDATE;
    		ata_port_schedule_eh(ap);
    	}

Then, ata_port_schedule_eh calls scsi_schedule_eh:

    void scsi_schedule_eh(struct Scsi_Host *shost)
    {
    	unsigned long flags;

    	spin_lock_irqsave(shost->host_lock, flags);

    	if (test_and_set_bit(SHOST_RECOVERY, &shost->shost_state) == 0 ||
    	    test_and_set_bit(SHOST_CANCEL, &shost->shost_state) == 0) {
    		scsi_eh_wakeup(shost);
    	}

    	spin_unlock_irqrestore(shost->host_lock, flags);
    }
     76 

scsi_eh_wakeup doesn't trigger error handling, because at this point
host_busy != host_failed: the completing command is still counted in
host_busy, while host_failed is 0.

It turns out that scsi_device_unbusy should wake up error handling:

    void scsi_device_unbusy(struct scsi_device *sdev)
    {
    	struct Scsi_Host *shost = sdev->host;
    	unsigned long flags;

    	spin_lock_irqsave(shost->host_lock, flags);
    	shost->host_busy--;
    	if (unlikely(test_bit(SHOST_RECOVERY, &shost->shost_state) &&
    		     shost->host_failed))
    		scsi_eh_wakeup(shost);

However, it doesn't, because host_failed is 0. At this point, no more IO can
happen because the host is waiting for error handling, but we'll never get into
error handling.
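
The sequence above can be replayed with a small user-space model. This is only a sketch: the struct and the model_* functions are hypothetical stand-ins for the kernel code quoted above, locking is omitted, the SHOST_CANCEL branch is left out, and the wakeup rule (host_busy == host_failed) is taken from the observation above. It assumes one command is outstanding when EH is scheduled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the struct Scsi_Host fields involved. */
struct host_model {
	int host_busy;    /* commands currently outstanding */
	int host_failed;  /* commands that failed and are waiting for EH */
	bool recovery;    /* models the SHOST_RECOVERY bit */
	bool eh_running;  /* set when the EH thread would have been woken */
};

/* Models scsi_eh_wakeup: EH is woken only when every busy command has failed. */
static void model_eh_wakeup(struct host_model *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_running = true;
}

/* Models scsi_schedule_eh (SHOST_CANCEL branch omitted for brevity). */
static void model_schedule_eh(struct host_model *h)
{
	if (!h->recovery) {
		h->recovery = true;
		model_eh_wakeup(h);
	}
}

/* Models the quoted part of scsi_device_unbusy. */
static void model_device_unbusy(struct host_model *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->recovery && h->host_failed)
		model_eh_wakeup(h);
}

/* Replays the hdparm -W completion path; returns true if EH would run. */
static bool replay_write_cache_toggle(void)
{
	struct host_model h = { .host_busy = 1 };  /* the SET FEATURES command */

	model_schedule_eh(&h);    /* host_busy (1) != host_failed (0): no wakeup */
	model_device_unbusy(&h);  /* host_failed == 0: wakeup is skipped */
	return h.eh_running;
}
```

In this model replay_write_cache_toggle() returns false while the recovery flag stays set, which matches the hang: the host is marked as recovering, but nothing ever wakes the error handler.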

Later kernels add a "host_eh_scheduled" counter to struct Scsi_Host, which
fixes this problem.
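
As a rough illustration of why such a counter helps, here is a user-space sketch of the same completion path with a host_eh_scheduled field added. The struct and the fixed_* functions are hypothetical stand-ins, not kernel code; the placement of the increment and the extra check in the unbusy path reflect my reading of later kernels and should be treated as an assumption. Locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct Scsi_Host in later kernels. */
struct fixed_host {
	int host_busy;          /* commands currently outstanding */
	int host_failed;        /* commands that failed and are waiting for EH */
	int host_eh_scheduled;  /* EH requested without any failed command */
	bool recovery;          /* models the host being in recovery state */
	bool eh_running;        /* set when the EH thread would have been woken */
};

/* EH is woken only when every busy command has failed. */
static void fixed_eh_wakeup(struct fixed_host *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_running = true;
}

/* Scheduling EH now also records that an EH run was explicitly requested. */
static void fixed_schedule_eh(struct fixed_host *h)
{
	if (!h->recovery) {
		h->recovery = true;
		h->host_eh_scheduled++;
		fixed_eh_wakeup(h);
	}
}

/* The unbusy path retries the wakeup if EH was scheduled, even with no failures. */
static void fixed_device_unbusy(struct fixed_host *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->recovery && (h->host_failed || h->host_eh_scheduled))
		fixed_eh_wakeup(h);
}

/* Replays the hdparm -W completion path; returns true if EH would run. */
static bool replay_with_eh_scheduled(void)
{
	struct fixed_host h = { .host_busy = 1 };  /* the SET FEATURES command */

	fixed_schedule_eh(&h);    /* still no wakeup: host_busy (1) != host_failed (0) */
	fixed_device_unbusy(&h);  /* host_busy drops to 0, counter forces the retry */
	return h.eh_running;
}
```

Here replay_with_eh_scheduled() returns true: once the last outstanding command completes, the nonzero counter lets the unbusy path wake the error handler even though host_failed is 0.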
Comment 1 Andrius Benokraitis 2008-01-29 14:31:44 EST
Kimball is no longer the onsite engineer; please reopen (Simon) if you'd like to
pick this up. Neither of these bugs, by the way, has been prioritized correctly.
