Bug 239725 - Using hdparm -W to change write cache hangs libata disk
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kimball Murray
QA Contact: Martin Jenner
Reported: 2007-05-10 19:31 UTC by nate.dailey
Modified: 2008-01-29 19:31 UTC (History)

Doc Type: Bug Fix
Last Closed: 2008-01-29 19:31:44 UTC

Description nate.dailey 2007-05-10 19:31:52 UTC
Description of problem:

Using hdparm -W to change the write cache setting on a SATA disk driven by
libata (sata_vsc in my case) leaves the disk hung, waiting for error handling
that never runs.


Version-Release number of selected component (if applicable):

This happens with RHEL4 U5.


How reproducible:

Every time.


Steps to Reproduce:
1. hdparm -W 0 /dev/sda

  
Actual results:

IO to the disk hangs.


Expected results:

IO does not hang.


Additional info:

libata-scsi's ata_scsi_qc_complete:

	/* We snoop the SET_FEATURES - Write Cache ON/OFF command, and
	 * schedule EH_REVALIDATE operation to update the IDENTIFY DEVICE
	 * cache
	 */
	if (ap->ops->error_handler &&
	    !need_sense && (qc->tf.command == ATA_CMD_SET_FEATURES) &&
	    ((qc->tf.feature == SETFEATURES_WC_ON) ||
	     (qc->tf.feature == SETFEATURES_WC_OFF))) {
		ap->eh_info.action |= ATA_EH_REVALIDATE;
		ata_port_schedule_eh(ap);
	}

Then, ata_port_schedule_eh calls scsi_schedule_eh:

void scsi_schedule_eh(struct Scsi_Host *shost)
{
	unsigned long flags;

	spin_lock_irqsave(shost->host_lock, flags);

	if (test_and_set_bit(SHOST_RECOVERY, &shost->shost_state) == 0 ||
	    test_and_set_bit(SHOST_CANCEL, &shost->shost_state) == 0) {
		scsi_eh_wakeup(shost);
	}

	spin_unlock_irqrestore(shost->host_lock, flags);

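scsi_eh_wakeup does not wake the error handler here, because it only does so
when host_busy == host_failed, and at this point host_busy != host_failed (the
SET_FEATURES command is still counted as busy and nothing has failed).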

It turns out that scsi_device_unbusy should wake up error handling:

void scsi_device_unbusy(struct scsi_device *sdev)
{
	struct Scsi_Host *shost = sdev->host;
	unsigned long flags;

	spin_lock_irqsave(shost->host_lock, flags);
	shost->host_busy--;
	if (unlikely(test_bit(SHOST_RECOVERY, &shost->shost_state) &&
		     shost->host_failed))
		scsi_eh_wakeup(shost);

However, it doesn't, because host_failed is 0: the command succeeded, so
nothing was ever counted as failed. From this point on no more IO can be
issued, because the host is marked as being in recovery, but the error handler
is never woken, so recovery never runs.
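
To make the sequence above easier to follow, here is a rough user-space model of it. This is only a sketch: struct host_model and the model_* functions are hypothetical stand-ins invented for illustration, not kernel code. The SHOST_CANCEL branch is left out, and the "new IO is refused while in recovery" rule is simplified to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the Scsi_Host fields involved (not kernel code). */
struct host_model {
	int  host_busy;    /* commands currently in flight */
	int  host_failed;  /* commands that actually failed */
	bool in_recovery;  /* stands in for the SHOST_RECOVERY bit */
	bool eh_woken;     /* set once the error handler would be woken */
};

/* Models scsi_eh_wakeup: wake only when every busy command has failed. */
static void model_eh_wakeup(struct host_model *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_woken = true;
}

/* Models scsi_schedule_eh (SHOST_CANCEL branch omitted for brevity). */
static void model_schedule_eh(struct host_model *h)
{
	if (!h->in_recovery) {
		h->in_recovery = true;
		model_eh_wakeup(h);
	}
}

/* Models scsi_device_unbusy as quoted: wake only if something failed. */
static void model_device_unbusy(struct host_model *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->in_recovery && h->host_failed)
		model_eh_wakeup(h);
}

/* Simplified assumption: new IO is refused while the host is in recovery. */
static bool model_can_queue(const struct host_model *h)
{
	return !h->in_recovery;
}

/* Replays the SET_FEATURES completion; returns true if the host is stuck. */
static bool model_replay_hang(void)
{
	struct host_model h = { .host_busy = 1 };  /* the hdparm command */

	model_schedule_eh(&h);    /* from ata_scsi_qc_complete */
	model_device_unbusy(&h);  /* the command then completes */

	/* Stuck: no new IO is accepted, and nobody woke the error handler. */
	return !model_can_queue(&h) && !h.eh_woken;
}
```

Calling model_replay_hang() from a small main() reports the stuck state: the recovery flag is set by the first call, but neither wakeup check passes, which matches the hang described above.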

Later kernels have "host_eh_scheduled" which fixes this problem.
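
As a hedged illustration of why such a counter helps, the sketch below models one way host_eh_scheduled could be used: scsi_schedule_eh bumps it, and scsi_device_unbusy treats a non-zero value like a failed command. This usage is an assumption based on the field name and my reading of later kernels, not a backport. struct host_model and the model_* functions are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the Scsi_Host fields involved (not kernel code). */
struct host_model {
	int  host_busy;          /* commands currently in flight */
	int  host_failed;        /* commands that actually failed */
	int  host_eh_scheduled;  /* explicit EH requests (assumed semantics) */
	bool in_recovery;        /* stands in for the SHOST_RECOVERY bit */
	bool eh_woken;           /* set once the error handler would be woken */
};

/* Same wakeup rule as before: wake when busy and failed counts match. */
static void model_eh_wakeup(struct host_model *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_woken = true;
}

/* Assumed change: record that EH was requested without a failed command. */
static void model_schedule_eh(struct host_model *h)
{
	if (!h->in_recovery) {
		h->in_recovery = true;
		h->host_eh_scheduled++;
		model_eh_wakeup(h);
	}
}

/* Assumed change: a pending EH request also triggers the wakeup check. */
static void model_device_unbusy(struct host_model *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->in_recovery && (h->host_failed || h->host_eh_scheduled))
		model_eh_wakeup(h);
}

/* Replays the same sequence; returns true if the handler gets woken. */
static bool model_replay_fixed(void)
{
	struct host_model h = { .host_busy = 1 };  /* the hdparm command */

	model_schedule_eh(&h);    /* busy (1) != failed (0): no wakeup yet */
	model_device_unbusy(&h);  /* busy drops to 0 == failed: wakeup */

	return h.eh_woken;
}
```

In this model the wakeup happens once the last in-flight command completes, because host_busy and host_failed are then both 0 and the pending request makes scsi_device_unbusy perform the check.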

Comment 1 Andrius Benokraitis 2008-01-29 19:31:44 UTC
Kimball is no longer the onsite engineer; please reopen (Simon) if you'd like to
pick this up. Neither of these bugs, btw, has been prioritized correctly.

