Bug 239725

Summary:	Using hdparm -W to change write cache hangs libata disk
Product:	Red Hat Enterprise Linux 4
Component:	kernel
Version:	4.5
Hardware:	All
OS:	Linux
Status:	CLOSED CANTFIX
Severity:	medium
Priority:	medium
Reporter:	nate.dailey
Assignee:	Kimball Murray <kmurray>
QA Contact:	Martin Jenner <mjenner>
CC:	chas.horvath, jbaron, mpaesold, smcgrath
Doc Type:	Bug Fix
Last Closed:	2008-01-29 19:31:44 UTC

Description nate.dailey 2007-05-10 19:31:52 UTC

Description of problem:

Using hdparm -W to modify write cache setting on a SATA disk using libata
(sata_vsc in my case) results in the disk hanging waiting for error handling.


Version-Release number of selected component (if applicable):

This happens with RHEL4 U5.


How reproducible:

Every time.


Steps to Reproduce:
1. hdparm -W 0 /dev/sda

  
Actual results:

IO to the disk hangs


Expected results:

IO does not hang.


Additional info:

libata-scsi's ata_scsi_qc_complete:

    	/* We snoop the SET_FEATURES - Write Cache ON/OFF command, and
    	 * schedule EH_REVALIDATE operation to update the IDENTIFY DEVICE
    	 * cache
    	 */
    	if (ap->ops->error_handler &&
    	    !need_sense && (qc->tf.command == ATA_CMD_SET_FEATURES) &&
    	    ((qc->tf.feature == SETFEATURES_WC_ON) ||
    	     (qc->tf.feature == SETFEATURES_WC_OFF))) {
    		ap->eh_info.action |= ATA_EH_REVALIDATE;
    		ata_port_schedule_eh(ap);
    	}

Then, ata_port_schedule_eh calls scsi_schedule_eh:

    void scsi_schedule_eh(struct Scsi_Host *shost)
    {
    	unsigned long flags;

    	spin_lock_irqsave(shost->host_lock, flags);

    	if (test_and_set_bit(SHOST_RECOVERY, &shost->shost_state) == 0 ||
    	    test_and_set_bit(SHOST_CANCEL, &shost->shost_state) == 0) {
    		scsi_eh_wakeup(shost);
    	}

    	spin_unlock_irqrestore(shost->host_lock, flags);

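The call to scsi_eh_wakeup doesn't trigger error handling, because it only
wakes the error handler when host_busy == host_failed. At this point the
SET FEATURES command is still counted in host_busy and nothing has failed, so
host_busy != host_failed.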

It turns out that scsi_device_unbusy should wake up error handling:

    void scsi_device_unbusy(struct scsi_device *sdev)
    {
    	struct Scsi_Host *shost = sdev->host;
    	unsigned long flags;

    	spin_lock_irqsave(shost->host_lock, flags);
    	shost->host_busy--;
    	if (unlikely(test_bit(SHOST_RECOVERY, &shost->shost_state) &&
    		     shost->host_failed))
    		scsi_eh_wakeup(shost);

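However, it doesn't, because host_failed is 0: the command completed
successfully, so no command was ever marked as failed. At this point no more
IO can happen, because SHOST_RECOVERY is set and the host is waiting for error
handling, but nothing will ever wake the error handler.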
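The sequence above can be replayed in a small user-space model. This is only a sketch of the logic in the excerpts, not kernel code: struct host_model and the model_* functions are hypothetical stand-ins, the SHOST_CANCEL branch of scsi_schedule_eh is left out, and it assumes the SET FEATURES command is the only command outstanding.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the struct Scsi_Host fields used above. */
struct host_model {
	int  host_busy;    /* commands currently outstanding */
	int  host_failed;  /* commands marked as failed */
	bool in_recovery;  /* models the SHOST_RECOVERY bit */
	bool eh_woken;     /* set when the error handler would be woken */
};

/* Models scsi_eh_wakeup: wake only if every busy command has failed. */
static void model_eh_wakeup(struct host_model *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_woken = true;
}

/* Models scsi_schedule_eh (SHOST_CANCEL branch omitted for brevity). */
static void model_schedule_eh(struct host_model *h)
{
	if (!h->in_recovery) {
		h->in_recovery = true;
		model_eh_wakeup(h);
	}
}

/* Models the excerpt of scsi_device_unbusy shown above. */
static void model_device_unbusy(struct host_model *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->in_recovery && h->host_failed)
		model_eh_wakeup(h);
}

/* Replays the hdparm -W sequence; returns whether the handler was woken. */
static bool replay_write_cache_toggle(void)
{
	struct host_model h = { .host_busy = 1 };  /* the SET FEATURES command */

	model_schedule_eh(&h);    /* from ata_scsi_qc_complete: 1 != 0, no wakeup */
	model_device_unbusy(&h);  /* host_busy drops to 0, but host_failed is 0 */
	return h.eh_woken;
}
```

In this model replay_write_cache_toggle() returns false: the host is left in recovery with no commands outstanding and no wakeup delivered, which matches the hang described above.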

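Later kernels add a "host_eh_scheduled" counter to struct Scsi_Host, which
fixes this problem: scsi_schedule_eh increments it, and scsi_device_unbusy
wakes the error handler when either host_failed or host_eh_scheduled is
nonzero.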
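A rough user-space sketch of how a host_eh_scheduled counter closes the gap follows. It is not the upstream patch: struct fixed_host_model and the fixed_* functions are hypothetical names, and it assumes the counter is incremented when error handling is scheduled and tested alongside host_failed when a command is released.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct Scsi_Host with the extra counter. */
struct fixed_host_model {
	int  host_busy;          /* commands currently outstanding */
	int  host_failed;        /* commands marked as failed */
	int  host_eh_scheduled;  /* EH requested without a failed command */
	bool in_recovery;        /* models the SHOST_RECOVERY state */
	bool eh_woken;           /* set when the error handler would be woken */
};

/* Same wakeup condition as before: all busy commands have failed. */
static void fixed_eh_wakeup(struct fixed_host_model *h)
{
	if (h->host_busy == h->host_failed)
		h->eh_woken = true;
}

/* Scheduling EH now records the request in host_eh_scheduled. */
static void fixed_schedule_eh(struct fixed_host_model *h)
{
	if (!h->in_recovery) {
		h->in_recovery = true;
		h->host_eh_scheduled++;
		fixed_eh_wakeup(h);
	}
}

/* Releasing a command also retries the wakeup when EH was scheduled. */
static void fixed_device_unbusy(struct fixed_host_model *h)
{
	assert(h->host_busy > 0);
	h->host_busy--;
	if (h->in_recovery && (h->host_failed || h->host_eh_scheduled))
		fixed_eh_wakeup(h);
}

/* Replays the hdparm -W sequence; returns whether the handler was woken. */
static bool replay_with_eh_scheduled(void)
{
	struct fixed_host_model h = { .host_busy = 1 };  /* SET FEATURES */

	fixed_schedule_eh(&h);    /* 1 != 0, so no wakeup yet */
	fixed_device_unbusy(&h);  /* 0 == 0 and EH is scheduled: wakeup */
	return h.eh_woken;
}
```

With the counter in place the second wakeup attempt is no longer skipped, so replay_with_eh_scheduled() returns true in this model.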

Comment 1 Andrius Benokraitis 2008-01-29 19:31:44 UTC

Kimball is no longer the onsite engineer. Please reopen (Simon) if you'd like to
pick this up. Neither of these bugs, by the way, has been prioritized correctly.