Bug 239725

Summary: Using hdparm -W to change write cache hangs libata disk
Product: Red Hat Enterprise Linux 4 Reporter: nate.dailey
Component: kernel Assignee: Kimball Murray <kmurray>
Status: CLOSED CANTFIX QA Contact: Martin Jenner <mjenner>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.5 CC: chas.horvath, jbaron, mpaesold, smcgrath
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-01-29 19:31:44 UTC Type: ---

Description nate.dailey 2007-05-10 19:31:52 UTC
Description of problem:

Using hdparm -W to change the write cache setting on a SATA disk driven by
libata (sata_vsc in my case) hangs the disk: the host waits for error handling
that never runs.


Version-Release number of selected component (if applicable):

This happens with RHEL4 U5.


How reproducible:

Every time.


Steps to Reproduce:
1. hdparm -W 0 /dev/sda

  
Actual results:

IO to the disk hangs


Expected results:

IO does not hang.


Additional info:

libata-scsi's ata_scsi_qc_complete:

	/* We snoop the SET_FEATURES - Write Cache ON/OFF command, and
	 * schedule EH_REVALIDATE operation to update the IDENTIFY DEVICE
	 * cache
	 */
	if (ap->ops->error_handler &&
	    !need_sense && (qc->tf.command == ATA_CMD_SET_FEATURES) &&
	    ((qc->tf.feature == SETFEATURES_WC_ON) ||
	     (qc->tf.feature == SETFEATURES_WC_OFF))) {
		ap->eh_info.action |= ATA_EH_REVALIDATE;
		ata_port_schedule_eh(ap);
	}

Then, ata_port_schedule_eh calls scsi_schedule_eh:

void scsi_schedule_eh(struct Scsi_Host *shost)
{
	unsigned long flags;

	spin_lock_irqsave(shost->host_lock, flags);

	if (test_and_set_bit(SHOST_RECOVERY, &shost->shost_state) == 0 ||
	    test_and_set_bit(SHOST_CANCEL, &shost->shost_state) == 0) {
		scsi_eh_wakeup(shost);
	}

	spin_unlock_irqrestore(shost->host_lock, flags);
}

scsi_eh_wakeup does not wake the error handler here, because at this point
host_busy != host_failed: the SET_FEATURES command is still counted as busy,
and no command has failed.

It turns out that scsi_device_unbusy should wake up error handling:

void scsi_device_unbusy(struct scsi_device *sdev)
{
	struct Scsi_Host *shost = sdev->host;
	unsigned long flags;

	spin_lock_irqsave(shost->host_lock, flags);
	shost->host_busy--;
	if (unlikely(test_bit(SHOST_RECOVERY, &shost->shost_state) &&
		     shost->host_failed))
		scsi_eh_wakeup(shost);

However, it doesn't, because host_failed is 0. At this point, no more IO can
happen because the host is waiting for error handling, but we'll never get into
error handling.
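
The sequence can be modelled in user space. This is only a minimal sketch of
the two checks described above, not kernel code: struct model_host and the
model_* functions are hypothetical stand-ins, the SHOST_RECOVERY bit is reduced
to a boolean, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the few Scsi_Host fields involved in the hang. */
struct model_host {
	int host_busy;     /* commands currently outstanding */
	int host_failed;   /* commands marked as failed */
	bool in_recovery;  /* stands in for the SHOST_RECOVERY bit */
	bool eh_woken;     /* set when the EH thread would be woken */
};

/* Wake EH only when every outstanding command has failed. */
static void model_eh_wakeup(struct model_host *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_woken = true;
}

/* Simplified scsi_schedule_eh: enter recovery, then try to wake EH. */
static void model_schedule_eh(struct model_host *h)
{
	if (!h->in_recovery) {
		h->in_recovery = true;
		model_eh_wakeup(h);
	}
}

/* Simplified scsi_device_unbusy: only retries the wakeup if a command failed. */
static void model_device_unbusy(struct model_host *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->in_recovery && h->host_failed)
		model_eh_wakeup(h);
}

/* Returns whether EH was woken after a successful SET_FEATURES completes. */
static bool model_set_features_completion(void)
{
	struct model_host h = { .host_busy = 1, .host_failed = 0,
				.in_recovery = false, .eh_woken = false };

	model_schedule_eh(&h);   /* host_busy (1) != host_failed (0): no wakeup */
	model_device_unbusy(&h); /* host_failed == 0: wakeup is skipped */
	return h.eh_woken;
}
```

In this model, model_set_features_completion() returns false: the host is left
in recovery with nothing outstanding and nobody to wake the error handler.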

Later kernels add a "host_eh_scheduled" field to struct Scsi_Host, which fixes
this problem.
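
As a rough user-space sketch of how such a counter can close the gap: the
assumption here is that scheduling EH bumps the counter and that the unbusy
path treats a non-zero counter like a failed command. The fixed_* names are
hypothetical, locking is omitted, and this is a model, not the upstream patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the host state with an EH-scheduled counter added. */
struct fixed_host {
	int host_busy;          /* commands currently outstanding */
	int host_failed;        /* commands marked as failed */
	int host_eh_scheduled;  /* EH requests not tied to a failed command */
	bool in_recovery;       /* stands in for the SHOST_RECOVERY bit */
	bool eh_woken;          /* set when the EH thread would be woken */
};

/* Wake EH once all outstanding commands are accounted for. */
static void fixed_eh_wakeup(struct fixed_host *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_woken = true;
}

/* Scheduling EH now records the request (assumed behaviour). */
static void fixed_schedule_eh(struct fixed_host *h)
{
	if (!h->in_recovery) {
		h->in_recovery = true;
		h->host_eh_scheduled++;
		fixed_eh_wakeup(h);
	}
}

/* The unbusy path also retries the wakeup when EH was explicitly scheduled. */
static void fixed_device_unbusy(struct fixed_host *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->in_recovery && (h->host_failed || h->host_eh_scheduled))
		fixed_eh_wakeup(h);
}

/* Returns whether EH was woken after a successful SET_FEATURES completes. */
static bool fixed_set_features_completion(void)
{
	struct fixed_host h = { .host_busy = 1, .host_failed = 0,
				.host_eh_scheduled = 0,
				.in_recovery = false, .eh_woken = false };

	fixed_schedule_eh(&h);   /* still busy, so no wakeup yet */
	fixed_device_unbusy(&h); /* counter is set and busy == failed == 0 */
	return h.eh_woken;
}
```

With the counter in place, fixed_set_features_completion() returns true: the
wakeup that was skipped at schedule time is retried when the command is
unbusied.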

Comment 1 Andrius Benokraitis 2008-01-29 19:31:44 UTC
Kimball is no longer the onsite engineer; please reopen (Simon) if you'd like to
pick this up. Neither of these bugs, by the way, has been prioritized correctly.