Bug 86312
Summary: kernel may destroy data being written to disk when it is too busy
Product: Red Hat Enterprise Linux 2.1
Component: kernel
Version: 2.1
Status: CLOSED ERRATA
Severity: high
Priority: medium
Hardware: i686
OS: Linux
Reporter: Shinya Narahara <naraha_s>
Assignee: Doug Ledford <dledford>
QA Contact: Brian Brock <bbrock>
CC: bennet, coughlan, djenkins, dledford, edwin.mcelearney, ggallagh, halligan, hashimoh, jkulesa, jneedle, kmori, minoru.yoshida, miurahid, nobody+wcheng, petrides, rperkins, sct, si-yama, smorin, tao, tbarr, terry.magill, walter.crasto, yu-maeda, yushio
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2004-09-02 04:30:33 UTC
Bug Blocks: 107562, 107565, 116727
Description
Shinya Narahara
2003-03-19 13:34:27 UTC
We tested the patch above with a "trapped disk", which has a "programmable broken sector" (any access to that sector becomes a SCSI error: sense key 0x0b, sense code 0xc000). The patch above did not work at all: the kernel retried forever, and the data was destroyed when the limitation (returning the SCSI error) was stopped. We were testing with qla2200.o (5.31.RH1), which does not set the flag "use_new_eh_code" (it is zero). Another driver, qla2200_new.o (5.31.RH3), sets the flag to 1, so it never destroys data on disk whether or not the disk returns a SCSI error. But that driver retries the write forever if the disk keeps returning SCSI errors, and sometimes freezes the kernel. We have not yet tested the latest driver (6.04.00) from the QLogic web page.

The desirable kernel behavior, when the disk returns SCSI errors many times, should be:
1) first, retry the assigned number of times (e.g. for sd.c, 5 times)
2) then, return an error and interrupt the write/read

The issue in the title is caused by the flag "use_new_eh_code" being unset. But even when it is set (new driver), we have another issue now...

Created attachment 90832 [details]
Syslog with qla2200(5.31.RH1).
The issue may occur at Mar 11 13:07:36 or later.
To work around this (incompletely), we tested the silly patch below.
It changes the kernel logic from "retry 5 times for SCSI disk"
to "retry infinitely for SCSI write commands".
--- linux/drivers/scsi/scsi_obsolete.c.org	2003-04-28 17:00:35.000000000 +0900
+++ linux/drivers/scsi/scsi_obsolete.c	2003-04-28 19:43:33.000000000 +0900
@@ -623,6 +623,11 @@
 	printk("In MAYREDO, allowing %d retries, have %d\n",
 	       SCpnt->allowed, SCpnt->retries);
 #endif
+
+	/* naraha_s.co.jp added this 2003/04/28 */
+	/* to avoid data corruption when the disk is very busy. */
+	SCpnt->retries = retries_check( SCpnt->cmnd[0], SCpnt->allowed, SCpnt->retries );
+
 	if ((++SCpnt->retries) < SCpnt->allowed) {
 		if ((SCpnt->retries >= (SCpnt->allowed >> 1))
 		    && !(SCpnt->host->resetting && time_before(jiffies, SCpnt->host->last_reset + MIN_RESET_PERIOD))
--- linux/drivers/scsi/scsi_error.c.org	2003-04-28 19:25:31.000000000 +0900
+++ linux/drivers/scsi/scsi_error.c	2003-04-28 19:42:21.000000000 +0900
@@ -1072,6 +1072,10 @@
 maybe_retry:
+	/* naraha_s.co.jp added this 2003/04/28 */
+	/* to avoid data corruption when the disk is very busy. */
+	SCpnt->retries = retries_check( SCpnt->cmnd[0], SCpnt->allowed, SCpnt->retries );
+
 	if ((++SCpnt->retries) < SCpnt->allowed) {
 		return NEEDS_RETRY;
 	} else {
--- linux/drivers/scsi/scsi.c.org	2003-04-28 19:33:41.000000000 +0900
+++ linux/drivers/scsi/scsi.c	2003-04-28 19:41:24.000000000 +0900
@@ -2798,3 +2798,54 @@
  * tab-width: 8
  * End:
  */
+
+/* Check for retries exceeding the allowed count when writing. */
+/* naraha_s.co.jp added this 2003/04/28 */
+/* to avoid data corruption when the disk is very busy. */
+int
+retries_check( unsigned char cmnd, int allowed, int retries )
+{
+	unsigned char retry[] = {
+		/* scsi commands that should be trapped */
+		WRITE_6,
+		WRITE_FILEMARKS,
+		RECOVER_BUFFERED_DATA,
+		COPY,
+		ERASE,
+		WRITE_10,
+		WRITE_VERIFY,
+		SYNCHRONIZE_CACHE,
+		COPY_VERIFY,
+		WRITE_BUFFER,
+		WRITE_LONG,
+		WRITE_SAME,
+		WRITE_12,
+		WRITE_VERIFY_12,
+		WRITE_LONG_2,
+	};
+	int i, arysiz = sizeof(retry)/sizeof(unsigned char);
+
+	/* disk   MAX_RETRIES = 5, defined in "sd.c" */
+	/* cd-rom MAX_RETRIES = 3, defined in "sr.c" */
+	/* tape   MAX_RETRIES = 0, defined in "st.c" */
+	/* tape devices are not considered here. */
+	if (allowed > 2 && retries+1 >= allowed) {
+		/* Check whether the scsi command should be trapped. */
+		for (i = 0; i < arysiz; i++) {
+			if (cmnd == retry[i]) {
+				break;
+			}
+		}
+		if (i < arysiz) {
+			/* found: this is one of the special commands. */
+			printk("In MAYREDO, %d retries scsi_cmnd = %d, "
+			       "forced the number into %d\n",
+			       retries+1, cmnd, allowed - 1 /* not 2 */ );
+			/* retries must be allowed-2 for the infinite loop, */
+			/* because it is incremented after this function. */
+			return( allowed - 2 );
+		}
+	}
+	return( retries );
+}
+
This is just a workaround. The fundamental solution may be to change the kernel logic to:

if (scsi_disk_error) {
	if (retry_counter_is_lower) {
		retry;
	} else { /* retry counter exceeded */
		stop_all_access_to_the_disk;
		return_error_to_all_applications_accessing_the_disk;
	}
}
Infinite retries don't work, and can cause very great data loss, because the retry will prevent all further writes from completing. The more important question is why the device is timing out repeatedly in this case. Five retries is more than reasonable. Given that these are aborts, perhaps the problem is that we need to adopt a longer backoff on timeouts?

For comparison's sake, have you considered trying a more recent QLogic driver? The 6.0500b9 driver has fixes for many error handling issues I've seen in testing. Personally I wouldn't trust my data to QLogic's older drivers, but I don't know whether Red Hat is backporting fixes into their 5.31.RHx drivers.

From a related issue (IT#24399), the ABORTED_COMMAND sense key is being returned while a disk is offlined for maintenance. The kernel is entitled to fail an IO if the storage subsystem is returning ABORTED_COMMAND in that case. If that's the underlying cause of the ABORTED_COMMAND here, then it's a disk subsystem problem: it should be returning BUSY status on the command, not an abort. It's entirely reasonable for the kernel to fail the command if repeated retries keep returning ABORTED_COMMAND.

rhn wrote:
> For comparison's sake have you considered trying a more recent Qlogic driver?

Actually, this is not a SCSI device driver problem but a SCSI protocol driver problem. The issue does not depend on the SCSI cards or their device drivers.

Alan Cox wrote:
> The more important question is why is the device timing out repeatedly
> in this case. Five retries is more than reasonable.

Stephen Tweedie wrote:
> the ABORTED_COMMAND sense key is being returned while a disk is
> offlined for maintenance.

Yes, the disk returns ABORTED_COMMAND while it is in maintenance mode, e.g. rebuilding a RAID array after a broken disk has been replaced. Unfortunately, many disk subsystems cannot return the status byte BUSY when they return sense key ABORTED_COMMAND, so the Linux kernel must treat this carefully.
Sometimes the status byte code may be vendor specific. The current kernel doesn't handle a SCSI device properly when the sense key is ABORTED_COMMAND and the status byte is BUSY (5 retries only). I think this is not a bug but a Linux policy; however, it should be changed. At least the retry count specified in sd.c should be customizable via a module parameter, like:

# insmod sd_mod.o max_retries=10

Or, add a patch to handle ABORTED_COMMAND and BUSY, and a further patch to treat vendor-specific status bytes via a module parameter:

# insmod scsi_mod.o sameas_busy=0xc0

General Unixes like AIX and HP-UX have an interesting algorithm when the SCSI subsystem returns ABORTED_COMMAND or similar errors:
1) retry a specified number of times (customizable)
2) search for a substitute sector
3) repeat 1) and 2) a specified number of times (also customizable)
4) close the filesystem, and return an error to all applications accessing it

We tested many situations with qla2x00 (6.05.00b9, 6.06.00b13), where the disk returns either CHECK_CONDITION (ABORTED_COMMAND) or BUSY on error, selectable by changing the disk firmware:

qla2x00 (5.3x) + CHECK_CONDITION (ABORTED_COMMAND): some data is lost, and the test program cannot detect it.
qla2x00 (5.3x) + BUSY: no data is lost; retries forever.

According to scsi_old_done() in scsi_obsolete.c, CHECK_CONDITION (ABORTED_COMMAND) becomes MAYREDO, and BUSY becomes REDO.

qla2x00 (6.06.00b13) + CHECK_CONDITION (ABORTED_COMMAND): some data is lost, and the test program cannot detect it.
qla2x00 (6.06.00b13) + BUSY: some data is lost, and the test program cannot detect it.

Unfortunately, the last case shows that even the new driver with BUSY cannot suppress the data loss. According to scsi_decide_disposition() in scsi_error.c, the SCSI statuses CHECK_CONDITION (ABORTED_COMMAND) and BUSY lead to the same condition (goto maybe_retry:).
We know the new QLogic driver (6.06.00b13) doesn't use this function, but drivers with use_new_eh_code=1 may use it. To avoid this issue completely for all SCSI drivers, our silly patch is not so bad (we confirmed that it prevents the data loss). We believe the problems are both: 1) data loss, and 2) the application cannot detect it when 1) occurs. Whether a disk returns CHECK_CONDITION (ABORTED_COMMAND) or BUSY, these are not permanent errors but temporary ones, so we recommend that the Linux kernel retry "many" times... FYI: the QLogic driver (6.06.00b13) has its retry counter set to 20 or 30.

> Unfortunately, the last case shows that even the new driver with BUSY
> cannot suppress the data loss.

Agreed: the new-style error handling should probably be retrying
indefinitely on BUSY. Old-style already does, which is what most drivers
currently use. I'll defer to our internal scsi experts on that, though.
On the "application doesn't detect errors" question: what test code are you
using for that? Normal Unix "write()" syscall traffic cannot return such errors
to the app since we do write-behind by default.
> Normal Unix "write()" syscall traffic cannot return such
> errors to the app since we do write-behind by default.
I agree that write() can't return the errors.
However, normal UNIXes like AIX or HP-UX, which we usually use,
behave synchronously on close(), so the app can detect
a disk error at close() at least.
Linux doesn't sync at close(), for performance reasons.
Our test program is very simple: it just writes, reads back, and
compares the data (essence below; the actual program is more
complex and has many error checks):
int fd;
long rsiz, wsiz;
char data[...] = FIXED_DATA;
char buf[...];

fd = open( "file", O_RDWR|O_CREAT );
wsiz = write( fd, data, sizeof( data ) );
close( fd );

fd = open( "file", O_RDONLY );
rsiz = read( fd, buf, sizeof( data ) );
close( fd );

if (rsiz != wsiz || memcmp( data, buf, sizeof( data ) ) != 0)
    exit( 1 );
exit( 0 );
You know, Linux can't write data to disk fully synchronously
even with the O_SYNC flag at open(), because of its poor implementation.
Synchronous behaviour at close is just a quirk of some old systems and bad for performance. If you want portable synchronous closure, use fsync on the file handle and check its return before the close. Please provide specific details for the case where you say O_SYNC doesn't work, probably in another bug. Thanks.

We confirmed that the same test program above didn't work even when
it had the O_SYNC flag; write() never returned any errors.
The Linux kernel doesn't have perfect synchronous I/O;
our experiment also says so.
> Please provide specific details for the case you say O_SYNC doesn't work,
> probably in another bug.
I agree that O_SYNC not working is a separate bug.
Sometimes a program behaves asynchronously even when
it opens with O_SYNC on current Linux.
Failure of O_SYNC to write data is a serious bug. Have you opened a bug report for that? If not, do you have a proper test case, including a description of how it fails?

> Failure of O_SYNC to write data is a serious bug. Have you opened a bug
> report for that?

No, I haven't. We'll open one when we have sufficient data.
We've made a new kernel patch that adds a kernel parameter to specify the retry count when writing. The patch is being tested now and will be attached to bugzilla when the testing is over. The kernel parameter is:

SYNTAX:
  write_retries=HOSTNO:CHANNEL:SCSIID:RETRIES[,HOSTNO:CHANNEL:SCSIID:RETRIES]*

DESCRIPTION:
  The new kernel parameter "write_retries" assigns a new retry count for a specific SCSI device. It is a parameter for the module scsi_mod.o.

ARGUMENTS:
  HOSTNO  = SCpnt->host->host_no
  CHANNEL = SCpnt->channel
  SCSIID  = SCpnt->target
  RETRIES = user-specified value (overrides SCpnt->allowed). If RETRIES is 0, it means infinite retries.

EXAMPLE:
  To set infinite retries for the SCSI device host_no=0, channel=1, scsiid=2, add

    options scsi_mod write_retries=0:1:2:0

  to /etc/modules.conf, remake the initrd, and reboot.

If you don't specify this parameter, the Linux kernel behaves as before. Even with the parameter, the kernel keeps its default behavior for all devices other than the one(s) specified. What do you think of this patch?

Created attachment 94224 [details]
Scsi patch for changeable scsi_write_retries
This patch has been confirmed with both qla2x00 (4.28) and qla2300 (6.04).
The kernel version is 2.4.7-10 (RH72 default). We'll test this on
RHAS (RHES) kernels, but it should be fine.
By specifying infinite retries with this patch, the issue (data loss)
never happens.
We do not believe that infinite retries are the correct solution to this problem, because during the retry period the SCSI subsystem is essentially locked, waiting for the ability to write to the device. We have been told that this maintenance period can be an hour or more. We expect that most customers will not want to create a situation where their entire system may hang for an hour. We also cannot create a situation where a real hardware failure may cause the system to hang indefinitely.

As an example, consider that some customers configure software RAID devices made up of logical units exported by separate hardware RAID subsystems, so that the data remains available even if one of the RAID subsystems fails. For this to work well, the hardware RAID subsystem must fail reasonably quickly, so that the software RAID can remove the failed hardware element and continue to provide access to the data.

It may be acceptable to include a patch like the one suggested above, so that a customer can select the amount of time that the SCSI subsystem will wait for a failed I/O, based on their availability requirements and the characteristics of their storage hardware. We would not recommend using this method to select an infinite wait time. Since an infinite wait time is not desirable, it is always possible that an I/O will fail because the controller was in maintenance mode a bit longer than the wait time. For this reason, we consider it most important to resolve the failure of O_SYNC that you described earlier. Please provide more information about that failure, as requested previously (preferably in another bugzilla, referencing this one). Ultimately, this is the _only_ solution to the data corruption problem, and it should be our highest priority.

RHEL 3 is frozen. If any changes are to be made, they will have to be in the first RHEL 3 update.
We typically do an update every three months, but the schedule for this is not yet determined, so this is _not_ guaranteed. We will consider including a patch to allow the customer to adjust the retry delay in an RHEL 3 update. This will depend on whether the approach is acceptable to the upstream kernel developers. We request that you provide more information on the O_SYNC problem as soon as possible.

(Reassigning this bug to me.)

First of all, it does make some sense for us to be more uniform about retries on BUSY: 2.6 and scsi_obsolete.c both keep retrying indefinitely in those situations, and new-eh scsi_error.c in 2.4 could be brought in line with that behaviour. Ideally (but not essentially), this would be done with a separate, timer-driven queue for the retries, to avoid spinning if the error condition comes back from the target immediately.

Secondly, there's the issue of what to do if the IOs fail. We propagate that back to the application if the app has requested notification via O_SYNC/fsync, but the entire Linux VFS/VM layer is really not set up to deal with drives dropping writes on the floor when we're just doing writeback IO. The basic assumption is that write requests to the driver layer are not optional, which seems reasonable enough. The filesystem tries to cope, but basically does so by eventually rereading the old version of the data off disk, losing the new version that it tried to write. The only sane alternative to this behaviour is to take the filesystem completely offline if the drive drops writes; it might be possible to make such a change, although it would require significant VFS help (we don't want to offline a filesystem just because a data block got dropped, but if it's metadata, that's a different story).

But the single biggest problem is VM stability. The VM is simply not expecting to have to deal with IOs that can last an hour or more without completion.
Memory allocation strategies assume that we can reclaim memory by writing used information to disk on demand, and the rate of memory reclaim is to some extent throttled by the rate at which those IOs complete. If IOs stop completing, the VM can stall waiting for them, and the kernel essentially becomes unable to allocate more memory. There are some attempts in the VM to mitigate slow devices, but none to bypass a completely stopped device. There are similar problems in the VFS, where a totally stalled disk device may hold up the regular background "sync" updating process, or may block the cache reclaim routines by stalling in the inode flushing code. I simply don't think we can sensibly support normal kernel operation on a system where a busy filesystem is on such a long-term-stalled device. There might be more hope if the storage array is only being used for raw access, where no VM structures depend on writeback to that device, but that doesn't help us in the general case.

That's not valid testing. It is perfectly legal for the kernel to fail a write due to transient errors but for a subsequent read to succeed, either because the previously-read data is still in cache (for O_SYNC; obviously not for raw IO) or because the error condition has recovered. In other words, even if the kernel does correctly report the error back to user space, the above test code will still fail. It's important to know what you are trying to test. There have been two separate issues mixed together in this report so far: (1) the SCSI stack returning an error after limited ABORT retries, and (2) an alleged failure to return the error to user space. (2) is undiagnosed so far and needs more information. The trouble is that simply testing for short writes or read/write miscompares does not allow you to distinguish between the two.
If you are doing raw IO and get a write failure from the kernel, in this case it is likely to be because the hardware returned ABORT for too long: that is ultimately _not_ a kernel problem, even though we are looking at kernel workarounds for the device behaviour. But if the kernel acknowledges the write and you still get a data miscompare, that is a different issue. The test code really needs to distinguish between these cases. When analysing test results, it is important for us in engineering to be able to tell which case is occurring.

Red Hat engineers have agreed to provide a fix for this problem in Red Hat Enterprise Linux 3 Update 2, our next update. This issue is on the "must fix" list.

The Red Hat engineers have, in the end, agreed to provide the fix that Hitachi requests. This fix will only apply to Hitachi hardware. The next actions that need to happen: Hitachi needs to provide unique identifiers for all affected HDS hardware, including a list of vendor and model strings for each device affected by the Hitachi implementation. The information should be posted to this ticket (Bugzilla #86312). Red Hat will then provide a fix for Hitachi to test. The fix does include infinite retrying on Hitachi systems. This is the fix that is available to meet the RHEL3 U2 schedule. If Hitachi has decided that this solution is not acceptable, then the fix will not be in RHEL3 U2 at this time, and I and Robert Perkins, Partner Product Manager, will coordinate with Hitachi to set up face-to-face meetings and conference calls between Hitachi and Red Hat engineers. Hitachi should indicate their response in this ticket.

The essence of this contains two problems:
1) kernel behavior in case of disk error or disk busy
2) Red Hat policy for Enterprise Linux

We agree that the disk should return BUSY, instead of CHECK_CONDITION, when the disk is busy.
But when a disk or sector has an error, we absolutely disagree that the RHEL kernel should have logic that causes data loss. If *a* sector on a disk is broken, the Linux kernel may skip the sector and lose the data that should have been written to it, without any notification to the user/application. Does Red Hat think this is OK or not?

> But the only sane alternative to this behavior is to take the
> filesystem completely offline if the drive drops writes;

Yes, we agree, but there is no such implementation in Linux now, even though Solaris and other Unixes have an online rescue mode. Therefore, we strongly recommend fixing your kernel so that the user can choose its behavior. Infinite retrying isn't good for devices under software RAID, raw devices, or an active-standby SCSI path selector, because it may stall the kernel (as Tom says). But for a disk without these features, it isn't so bad to retry infinitely to avoid data loss. So we believe we need two patches that add selectable parameters:

1) for the SCSI protocol driver layer:
   1-1) infinite retrying for a specific device
   1-2) default kernel behavior (5 retries)
2) for the block device (or higher) layer:
   2-1) infinite retrying for a specific device
   2-2) default kernel behavior (may lose data)
   2-3) kernel panic when the data loss occurs

The first one is just like our patch. RHEL is for enterprise use. Consumers may accept the current implementation, but our enterprise customers never allow us to lose any data. They believe a kernel panic is better than having their data lost silently (the kernel log is not noisy enough). Japanese customers especially feel this way.

> The Red Hat engineers have, in the end, agreed to provide the
> fix that Hitachi requests.

We really appreciate that you decided this. In addition:

> The fix does include infinite retrying on Hitachi systems.

Please include some logic by which the user can select infinite or limited retrying.
>is broken, Linux kernel may skip the sector and lose the data which
>should be written on it without any notification to
>user/application. Does RedHat think this is O.K. or not?
Normal application writes are entirely asynchronous. The application
writes not to disk but memory, and the kernel is responsible for doing
efficient batched writes of the memory caches to disk some time later.
The application can easily have written the file to cache entirely and
closed it without a single byte having been written to disk. There's
simply no way to tell the application that the write failed in such
cases: the write hasn't even *happened*. The application needs to
wait for the write to complete if it wants failure notification.
> Normal application writes are entirely asynchronous. The application
> writes not to disk but memory, and the kernel is responsible for doing
> efficient batched writes of the memory caches to disk some time later.

We know. It would be better if the application could sense the failure, but we agree that it is impossible in some cases. The issue is that the user (not only the app) can't notice it. How should we send the error message to the user? (I said syslog isn't good enough.) Many Unixes have a special mechanism, like a rescue mode, but Linux doesn't have anything. Therefore, we must decide to adopt either infinite retrying or a kernel panic to avoid data loss. We really don't want to do so, but there is no other way on Linux. Or do you recommend that all disk access must be raw or synchronous to prevent any data loss? That is very slow and not attractive to any customer.

In current Linux, applications should use raw or synchronous IO if they need to know about write failures, and SCSI hardware devices should observe the SCSI specification. It may be a better solution for Linux to gain a stronger feature to block data corruption in the future. However, the implementation should be considered carefully, not only by Red Hat and Hitachi but by the whole Linux community. I hope Hitachi, who well knows the necessity of it, will play a central role in implementing it for the whole Linux community (including the users).

Please note that we have no objection in principle to doing infinite retry here. The main difficulty is knowing exactly when to do so. On most SCSI targets, ABORTED_COMMAND is not the right way to detect that infinite retry is needed: it would prevent the kernel from recovering when a single device on a SCSI bus is wedged and cannot complete the commands sent to it. So we need to be careful how we detect the retry conditions, and avoid looping forever in cases where it is not appropriate.
Regarding sending information about arbitrary low-level IO failures to the user: syslog is quite simply the most effective mechanism we have for this in current kernels. IO failures *do* get reported via the syslogs. The functionality is not missing, though it could certainly be improved, and there have been proposals in the upstream Linux kernel community for ways to do that. It is entirely possible that future kernels will have better ways of returning structured error logs from the kernel to user space, but right now klogd/syslog is the recommended mechanism.

OK, I want to reply to several comments at one time. First, naraha_s.co.jp, you wrote:

> Essence of this contains 2 problems:
> 1) kernel behavior in case of disk error or disk busy.
> 2) Red Hat policy for Enterprise Linux
>
> We agree that the disk should return "BUSY", instead of
> CHECK_CONDITION in case that disk is busy. But when disk
> or sector has an error, we absolutely disagree the RHEL
> kernel has a logic that occurs data loss. If *a* sector
> on a disk is broken, Linux kernel may skip the sector and
> lose the data which should be written on it without any
> notification to user/application. Does RedHat think this
> is O.K. or not?

I think it will actually help this discussion if we define a few things regarding the Linux kernel. First, there is the concept of a catastrophic error: an error from which recovery is not reasonably possible. If someone walks into the server room and spills an entire cup of water into a server, shorting it out and burning out the CPUs, mainboard, and RAM all at once, that is obviously a catastrophic error. We cannot recover from it, and data loss is sure to happen. The company must then reinstall from backup and try to recreate the missing data. Second, there are recoverable errors. Lots of little things can go wrong in a machine from which the machine is able to recover; ECC correction of single-bit memory errors is an example.
Generally speaking, the Linux kernel relies solely upon the hardware to correct these minor errors. If an error makes it past the hardware correction mechanisms, Linux considers it a catastrophic failure and makes no attempt to recover from it. The only exception I know of is the software RAID support in Linux. Software RAID is, by its very nature, an error recovery mechanism. Because disks fail so often, and because purchasing an external disk array that does the RAID for you is so expensive, software RAID was written into the Linux kernel so that even people just using Linux as a workstation could be protected from disk failures.

Now, I think our primary area of disagreement is in what qualifies as a catastrophic error. In Linux, any unrecoverable disk error is catastrophic. If you consider the origins of the Linux operating system, this is no surprise. Linux was originally written to run on i386-based personal computers that didn't have the advanced hardware recovery mechanisms that mainframe-class computers have. When a disk failed in a Linux computer, there simply wasn't anything to be done about it. So the Linux kernel was written with this in mind: hard errors on a disk are catastrophic errors. Changing this now would be a monumental engineering task.

To provide some security against this type of catastrophic error, especially since disks are unreliable, we implemented software RAID in the Linux kernel. This way a single disk failure on a RAID1, 4, or 5 device is not catastrophic to the system. A hard error on a disk in a software RAID array is still catastrophic as far as that single disk is concerned, but the redundancy of the software RAID device saves the system from ever seeing it, and data loss does not occur as a result.
However, the requirement for our software RAID stack is that it must *never* return a hard error for the RAID device unless it has already reached the state where there is no way of safely interacting with the disks. This is exactly how our software RAID stack works: the core operating system will never see an error from it unless the RAID device has lost too many disks to be able to continue. When that happens, it is considered a catastrophic failure and no recovery is possible.

An option for enterprise customers is to use external hardware RAID devices, such as the SanRise devices, instead of Linux software RAID. This offloads from the core CPU the work of maintaining the RAID array and talking to all the individual hard disks. However, the Linux kernel has the same expectation of external hardware RAID devices that it has of its software RAID devices: it expects that it will *never* see a hard error from the external RAID device until the array is already in an unusable state and no recovery is possible.

The problem we've had, and that this bug report is all about, is that the Linux kernel's SCSI stack thinks that if a command fails with ABORTED_CMD 5 times in a row, then it is a hard error, and it returns that hard error to the core of the operating system. The SanRise equipment uses ABORTED_CMD for soft errors, i.e. errors that will go away. So I am writing a patch to make the SCSI stack treat ABORTED_CMD as a soft error on SanRise equipment. By treating it as a soft error, the command will be retried infinitely and the data loss will not occur.
However, Shinya-san (is that the correct way to address you? I'm not familiar enough with Japanese culture to know; please forgive my ignorance), you are correct that this does not change the fact that the Linux kernel will still consider other hard errors from SanRise equipment as catastrophic failures and will allow data loss to occur. Implementing a full recovery mode is probably not going to be possible. Let me explain why I think this.

We try to write all of our kernel changes in a way that will be acceptable to the upstream kernel maintainers (Linus Torvalds for the 2.6/2.7 kernel, Marcelo Tosatti for the 2.4 kernel). The upstream maintainers are concerned with making Linux work well on the largest number of machines possible. Obviously, the number of workstations that run Linux is far, far greater than the number of mainframe or PC-class server computers. The overhead required in the core portions of the Linux kernel to support recovery operations like those you describe is significant. Since most workstation users would be unhappy to have their Linux kernel run slower because it is maintaining the information necessary to support online recovery operations, it is doubtful that the upstream maintainers would accept patches implementing such a feature. Instead, they would argue that a company concerned about the possibility of data loss should use hardware RAID devices, software RAID arrays, or both.

So, instead of trying to write a recovery mode into the Linux kernel, I think it is preferable that enterprise customers design their servers so that a recovery mode should never be needed. For example, if they think it might ever be possible for a SanRise disk array to return a hard error and be taken offline, they could buy two SanRise disk arrays and use Linux software RAID to treat them as RAID1 mirrors of each other.
That way, if a SanRise array ever goes offline for hard errors, there is still the other one to operate from. I actually think this is a very fair way to handle the problem of trying to find a solution that is acceptable to both regular users of linux on a workstation and enterprise customers. Instead of putting code into the kernel to enable a rescue mode, which would slow down everyone's machines, it allows the enterprise customer to buy whatever level of fault tolerance they want in their hardware and then rely upon that hardware to protect their data. That way no one suffers the performance penalty of rescue mode capability overhead, not even enterprise customers, but by buying more fault resilient hardware they are able to protect their data. But that's just my opinion. Now, in a different post, Shinya-san also wrote: > How should we send error message to user? (I said syslog > isn't so good). I think that syslog is good enough. But not because the user notices the entries in syslog. We have a nightly script that runs on RHEL machines that scans that day's syslog entries looking for anything unusual. In the event of a disk error, it sends an email to root@localhost with a report of the error. As long as the machine is configured to send email for root@localhost to some valid email address, that user will get notified that a disk error took place via email. This keeps the error from happening in a way that users can overlook. I've submitted a patch for inclusion in RHEL3 U2 to address this problem. It's actually based upon a previous patch I submitted. I've attached three patches to this bugzilla report. The first patch that things are based upon (the scsi queue fix patch), the actual retry aborted_cmd patch, and then the whitelist update patch that makes this effective on Hitachi SanRISE equipment. Please review these patches and let me know about any problems you see. Created attachment 98068 [details]
Patch that fixes the scsi mid layer queue handling for commands that need infinite retries
Created attachment 98069 [details]
Patch to use the new mid layer queue handling for aborted_cmd commands
Created attachment 98070 [details]
Update to the device whitelist entries to catch all SanRISE equipment
The fixes for this problem were committed to the RHEL 3 U2 patch pool today. The changes in comment #86 were already in U1. An updated version of the patch in comment #87 (adding the new retry_aborted_cmd field) and the patch in comment #88 were both committed to the internal Engineering build of kernel version 2.4.21-9.16.EL.

Hitachi has posted the test result to IT#27794. Unfortunately, the data loss has still occurred for ABORTED_COMMAND.

Test Environment:
    Kernel Ver   : 2.4.21-11.ELsmp
    Test Program : seqtp17 (Hitachi Original)
    HBA          : QLA2340
    Driver       : 6.07.02-RH1
    Firmware     : 3.02.16

Test Pattern:
(1) SANRISE has begun to return ABORTED_COMMAND BEFORE starting the test program.
(2) SANRISE has begun to return ABORTED_COMMAND AFTER starting the test program.

Test Pattern | Write method | Result             |
-------------+--------------+--------------------|
(1)          | synchronous  | Write Error        | => OK
             | asynchronous | Data loss occurred | => NG
             | raw I/O      | Input/Output Error | => OK
-------------+--------------+--------------------|
(2)          | synchronous  | Data loss occurred | => NG
             | asynchronous | Data loss occurred | => NG
             | raw I/O      | Input/Output Error | => OK
--------------------------------------------------

If you need more information to analyze this, please let me know.

Following are the new test results. The differences from the previous test are:
- Vendor code (before: "HP ", new: "HITACHI")
- Firmware version (before: 3.02.16, new: 3.02.24)

Test Environment:
    Kernel Ver   : 2.4.21-11.ELsmp
    Test Program : seqtp17 (Hitachi Original)
    HBA          : QLA2340
    Driver       : 6.07.02-RH1
    Firmware     : 3.02.24
    SANRISE Vendor Code: "HITACHI"

Test Pattern:
(1) SANRISE has begun to return ABORTED_COMMAND BEFORE starting the test program.
(2) SANRISE has begun to return ABORTED_COMMAND AFTER starting the test program.
Test Pattern | Write method | Result                |
-------------+--------------+-----------------------|
(1)          | synchronous  | Normal End (2/5)      |
             |              | System unstable (3/5) |
             |--------------+-----------------------|
             | asynchronous | Normal End (1/4)      |
             |              | System unstable (3/4) |
             |--------------+-----------------------|
             | raw I/O      | Normal End (5/5)      |
-------------+--------------+-----------------------|
(2)          | synchronous  | Normal End (2/5)      |
             |              | System unstable (3/5) |
             |--------------+-----------------------|
             | asynchronous | System unstable (4/4) |
----------------------------------------------------|

"System unstable" means:
- It cannot kill the test program process.
- It cannot umount devices on SANRISE which the test program is using.
- It cannot build a filesystem on another SCSI disk connected to the same HBA card.
- It cannot reboot or shutdown. The reboot/shutdown process has stopped at:
    Broadcast message from root (pts/0) Tue Mar 23 XX:XX:XX 2004...
    The system is going down for reboot/system halt NOW !!
- It can execute the following commands: ps -ef, ps -ef | grep, cd /somewhere/notOnSanriseDevices

SANRISE was returning ABORTED_COMMAND for 2 minutes at most during the test. After it stopped returning ABORTED_COMMAND, the system was still unstable. Therefore, Hitachi thinks this unstable state is caused by the system failing to handle its queued tasks properly after the array stops returning ABORTED_COMMAND. Hitachi hopes this is fixed.

Created attachment 98868 [details]
Proposed fix patch
OK, I think the problem is that the requeue code in scsi_queue.c works fine for
commands that failed with a bad status code, but fails for commands that are
being retried because of sense data. It basically fails to clear the sense
data and reset the command for another attempt. This patch adds this operation
to scsi_mlqueue_insert so that the command is cleared out between retries.
Hitachi posted the result of the test with Doug's proposed patch into IT#27794. However, the result was not good. The system has still been unstable sometimes.

Test Environment:
    Kernel Ver : 2.4.21-11.EL + above patch (id=98868)
    HBA        : QLA2340/LP9802
    Firmware   : 3.02.16/1.01A2

Write method  | Result                |
--------------+-----------------------|
synchronous   | Normal End            |
              | System unstable       |
--------------+-----------------------|
asynchronous  | Normal End            |
              | System hang up        |
--------------+-----------------------|
raw I/O       | Normal End            |
              | System unstable       |
--------------+-----------------------|

"System unstable" means:
- It cannot kill the test program process.
- It cannot umount devices on SANRISE which the test program is using.
- It cannot build a filesystem on another SCSI disk connected to the same HBA card.
- It cannot reboot or shutdown. The reboot/shutdown process has stopped at:
    Broadcast message from root (pts/0) Tue Apr 7 XX:XX:XX 2004...
    The system is going down for reboot/system halt NOW !!
- It can execute the following commands: ps -ef, ps -ef | grep, cd

"System hang up" means: it is impossible to enter any command from the keyboard.

In each condition there was no choice but to turn the power switch off. This result is not what Hitachi expects, so Hitachi wants the patches for this issue *NOT* to be applied in RHEL3 U2.

Hitachi evaluated Doug's proposed patches (ver1.14 in his bk trees). However, the system has not worked well after their storage recovered from the ABORTED_COMMAND condition. (It is no different from the previous test.)

The test result for raw I/O and synchronous I/O: the system has been in an unstable condition.
- could not kill the test program process.
- could not umount devices which were connected to the same HBA card.
- could not build a filesystem on another SCSI disk connected to the same HBA card.
- could not reboot or shutdown. The reboot/shutdown process has stopped at:
    Broadcast message from root (pts/0) Tue Mar 23 XX:XX:XX 2004...
    The system is going down for reboot/system halt NOW !!
- could execute the following commands: ps -ef, ps -ef | grep, cd

The result for asynchronous I/O (write-back I/O): could not input any keys and could not execute any commands.

I have not gotten any logs of 'echo "scsi dump 2" > /proc/scsi/scsi' from Hitachi yet (though they could not collect them in the asynchronous I/O test).

The fact that the 1.14 kernel failed is not unexpected. However, the information that I requested, in the form of the system logs after the kernel had failed and 'echo "scsi dump 2" > /proc/scsi/scsi' had been run, was going to be used to confirm what the problem really was. In the meantime, since I suspect what the problem is, a possible solution for that problem was committed to my bk tree on Jun-1-2004. This is the changeset log for that change:

ChangeSet, 2004-06-01 09:33:23-04:00, dledford.com
  drivers/scsi/scsi.c
    Init sdev_retry_q during build_commandblocks
  drivers/scsi/scsi.h
    Add sdev_retry_q so retried commands aren't shoved into the block
    request queue (means no merging of additional requests into an already
    initialized command and means we can easily give first priority to
    commands on the retry queue instead of the regular block layer queue in
    order to avoid lockups on devices if they manage to get *all* of their
    command blocks allocated to commands that need retried).
  drivers/scsi/scsi_lib.c
    Change scsi_request_fn so that it processes commands on the retry queue
    first and so that it skips all the command init and command allocation
    code for any commands on the retry queue.
  drivers/scsi/scsi_queue.c
    Make the scsi_mlqueue_insert routine put the delayed commands on the
    new sdev_retry_q list instead of back in the block layer request queue.

Hitachi posted the test result to IT#27794 today. Doug's proposed patch makes the system work well.
Test Environment:
    Kernel       : 2.4.21-15.EL.hitachi2 (2.4.21-15.EL + Doug's bk tree rev1.37)
    Test Program : seqtp19i_32
    HBA          : QLA2340
    Driver       : 6.07.02-RH2
    Firmware     : 3.02.24

Infinite retries for ABORTED_COMMAND worked well, no data corruption occurred, and the system has been in a stable condition.

Write method  | Result     |
--------------+------------|
synchronous   | Normal End |
--------------+------------|
asynchronous  | Normal End |
--------------+------------|
raw I/O       | Normal End |
--------------+------------|

Additional Information: Data corruption occurs in spite of using O_SYNC if the storage device returns HardwareErr or MediumErr. This problem should be discussed in Bugzilla#116900 and/or related IssueTracker tickets.

A fix to this problem was committed to the RHEL3 U3 patch pool on 21-June-2004 (in kernel version 2.4.21-15.15.EL).

Closing this issue out, as the original reporter has tested and confirmed resolution.

I've reverted this bug to MODIFIED state. It will be closed automatically by the Errata System with more detailed information when RHEL3 U3 is actually released.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html

Created attachment 111960 [details]
paranoiac patch 1
Here is our last patch to avoid this issue.
Our patch will eliminate this issue, but as you know,
this is a paranoid patch only for very cautious customers.
We know this patch will not be included in the mainstream kernel,
but we would like you to know that there are paranoid customers
in the enterprise world.
Please don't respond to this; let us finish this issue...
Created attachment 111961 [details]
paranoiac patch 2
additional patch
Guys - did some rendition of this "fix" make it into the upstream kernels? Thanks, ccb