Bug 86312
Summary: kernel may destroy data being written to disk when it is too busy
Product: Red Hat Enterprise Linux 2.1
Component: kernel
Version: 2.1
Status: CLOSED ERRATA
Severity: high
Priority: medium
Hardware: i686
OS: Linux
Reporter: Shinya Narahara <naraha_s>
Assignee: Doug Ledford <dledford>
QA Contact: Brian Brock <bbrock>
CC: bennet, coughlan, djenkins, dledford, edwin.mcelearney, ggallagh, halligan, hashimoh, jkulesa, jneedle, kmori, minoru.yoshida, miurahid, nobody+wcheng, petrides, rperkins, sct, si-yama, smorin, tao, tbarr, terry.magill, walter.crasto, yu-maeda, yushio
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2004-09-02 04:30:33 UTC
Bug Blocks: 107562, 107565, 116727
Description
Shinya Narahara
2003-03-19 13:34:27 UTC
We tested the patch above with a "trapped disk", which has a "programmable broken sector" (any access to that sector becomes a SCSI error: sense key 0x0b, sense code 0xc000). The patch above did not work at all: the kernel retried forever, and the data was destroyed when the limitation (returning the SCSI error) was stopped. We were testing with qla2200.o (5.31.RH1), which does not set the flag "use_new_eh_code" (it is zero). Another driver, qla2200_new.o (5.31.RH3), sets the flag to 1, so it never destroys data on disk whether or not the disk returns a SCSI error. But that driver retries the write forever if the disk keeps returning SCSI errors, and sometimes freezes the kernel. We have not yet tested the latest driver (6.04.00) from the QLogic web page.

The desirable kernel behavior, when the disk returns SCSI errors many times, should be:
1) first, retry the assigned number of times (e.g. for sd.c, 5 times)
2) then, return an error and interrupt the write/read

The issue in the title is caused by the flag "use_new_eh_code" being unset. But even when it is set (new driver), we have another issue now...

Created attachment 90832 [details]
Syslog with qla2200(5.31.RH1).
The issue may occur at Mar 11 13:07:36 or later.
To work around this (incompletely), we tested the silly patch below.
It changes the kernel logic from "retry 5 times for SCSI disk"
to "retry infinitely for SCSI write commands".
--- linux/drivers/scsi/scsi_obsolete.c.org	2003-04-28 17:00:35.000000000 +0900
+++ linux/drivers/scsi/scsi_obsolete.c	2003-04-28 19:43:33.000000000 +0900
@@ -623,6 +623,11 @@
 	printk("In MAYREDO, allowing %d retries, have %d\n",
 	       SCpnt->allowed, SCpnt->retries);
 #endif
+
+	/* naraha_s.co.jp added this 2003/04/28 */
+	/* to avoid data corruption when the disk is very busy. */
+	SCpnt->retries = retries_check( SCpnt->cmnd[0], SCpnt->allowed, SCpnt->retries );
+
 	if ((++SCpnt->retries) < SCpnt->allowed) {
 		if ((SCpnt->retries >= (SCpnt->allowed >> 1))
 		    && !(SCpnt->host->resetting && time_before(jiffies, SCpnt->host->last_reset + MIN_RESET_PERIOD))
--- linux/drivers/scsi/scsi_error.c.org	2003-04-28 19:25:31.000000000 +0900
+++ linux/drivers/scsi/scsi_error.c	2003-04-28 19:42:21.000000000 +0900
@@ -1072,6 +1072,10 @@
 maybe_retry:
+	/* naraha_s.co.jp added this 2003/04/28 */
+	/* to avoid data corruption when the disk is very busy. */
+	SCpnt->retries = retries_check( SCpnt->cmnd[0], SCpnt->allowed, SCpnt->retries );
+
 	if ((++SCpnt->retries) < SCpnt->allowed) {
 		return NEEDS_RETRY;
 	} else {
--- linux/drivers/scsi/scsi.c.org	2003-04-28 19:33:41.000000000 +0900
+++ linux/drivers/scsi/scsi.c	2003-04-28 19:41:24.000000000 +0900
@@ -2798,3 +2798,54 @@
  * tab-width: 8
  * End:
  */
+
+/* Check for retries exceeding the allowed count when writing. */
+/* naraha_s.co.jp added this 2003/04/28 */
+/* to avoid data corruption when the disk is very busy. */
+int
+retries_check( unsigned char cmnd, int allowed, int retries )
+{
+	unsigned char retry[] = {
+		/* scsi commands that should be trapped */
+		WRITE_6,
+		WRITE_FILEMARKS,
+		RECOVER_BUFFERED_DATA,
+		COPY,
+		ERASE,
+		WRITE_10,
+		WRITE_VERIFY,
+		SYNCHRONIZE_CACHE,
+		COPY_VERIFY,
+		WRITE_BUFFER,
+		WRITE_LONG,
+		WRITE_SAME,
+		WRITE_12,
+		WRITE_VERIFY_12,
+		WRITE_LONG_2,
+	};
+	int i, arysiz = sizeof(retry)/sizeof(unsigned char);
+
+	/* disk   MAX_RETRIES = 5, defined in "sd.c" */
+	/* cd-rom MAX_RETRIES = 3, defined in "sr.c" */
+	/* tape   MAX_RETRIES = 0, defined in "st.c" */
+	/* tape devices are not considered here. */
+	if (allowed > 2 && retries+1 >= allowed) {
+		/* Check whether the scsi command should be trapped. */
+		for (i = 0; i < arysiz; i++) {
+			if (cmnd == retry[i]) {
+				break;
+			}
+		}
+		if (i < arysiz) {
+			/* found: this is one of the special commands. */
+			printk("In MAYREDO, %d retries scsi_cmnd = %d, "
+			       "forced the number into %d\n",
+			       retries+1, cmnd, allowed - 1 /* not 2 */ );
+			/* retries must be allowed-2 for the infinite loop, */
+			/* because it is incremented after this function. */
+			return( allowed - 2 );
+		}
+	}
+	return( retries );
+}
+
This is just a workaround. The fundamental solution may be to change the kernel logic to:

if (scsi_disk_error) {
	if (retry_counter_is_lower) {
		retry;
	} else { /* retry counter exceeded */
		stop_all_access_to_the_disk;
		return_error_to_all_applications_accessing_the_disk;
	}
}
Infinite retries don't work, and can cause very great data loss, because the retry will prevent all further writes from completing. The more important question is why the device is timing out repeatedly in this case. Five retries is more than reasonable. Given that these are aborts, perhaps the problem is that we need to adopt a longer backoff on timeouts?

For comparison's sake, have you considered trying a more recent QLogic driver? The 6.0500b9 driver has fixes for many error handling issues I've seen in testing. Personally I wouldn't trust my data to QLogic's older drivers, but I don't know whether Red Hat is backporting fixes into their 5.31.RHx drivers.

From a related issue (IT#24399), the ABORTED_COMMAND sense key is being returned while a disk is offlined for maintenance. The kernel is entitled to fail an IO if the storage subsystem is returning ABORTED_COMMAND in that case. If that's the underlying cause of the ABORTED_COMMAND here, then it's a disk subsystem problem: it should be returning BUSY status on the command, not an abort. It's entirely reasonable for the kernel to fail the command if repeated retries keep returning ABORTED_COMMAND.

rhn wrote:
> For comparison's sake have you considered trying a more recent Qlogic driver?

Actually, this is not a SCSI device driver problem but a SCSI protocol driver problem. The issue does not depend on the SCSI cards or their device drivers.

Alan Cox wrote:
> The more important question is why is the device timing out repeatedly
> in this case. Five retries is more than reasonable.

Stephen Tweedie wrote:
> the ABORTED_COMMAND sense key is being returned while a disk is
> offlined for maintenance.

Yes, the disk returns ABORTED_COMMAND while it is in maintenance mode, e.g. rebuilding a RAID array after a broken disk has been replaced. Unfortunately, many disk subsystems cannot return the status byte BUSY when they return sense key ABORTED_COMMAND, so the Linux kernel must treat this carefully.
Sometimes the status byte code may be vendor specific. The current kernel doesn't handle a SCSI device properly when the sense key is ABORTED_COMMAND and the status byte is BUSY (5 retries only). I think this is not a bug but a Linux policy; however, it should be changed. At least the retry count specified in sd.c should be customizable via a module parameter, like:

# insmod sd_mod.o max_retries=10

Or, add a patch to handle ABORTED_COMMAND and BUSY, and a further patch to treat vendor-specific status bytes via a module parameter:

# insmod scsi_mod.o sameas_busy=0xc0

General Unixes like AIX and HP-UX have an interesting algorithm when the SCSI subsystem returns ABORTED_COMMAND or similar errors:
1) retry a specified number of times (customizable)
2) search for a substitute sector
3) repeat 1) and 2) a specified number of times (also customizable)
4) close the filesystem, and return an error to all applications accessing it

We tested many situations with qla2x00 (6.05.00b9, 6.06.00b13), where the disk returns either CHECK_CONDITION (ABORTED_COMMAND) or BUSY on error, selectable by changing the disk firmware:

qla2x00 (5.3x) + CHECK_CONDITION (ABORTED_COMMAND): some data is lost, and the test program cannot detect it.
qla2x00 (5.3x) + BUSY: no data is lost; retries forever.

According to scsi_old_done() in scsi_obsolete.c, CHECK_CONDITION (ABORTED_COMMAND) becomes MAYREDO, and BUSY becomes REDO.

qla2x00 (6.06.00b13) + CHECK_CONDITION (ABORTED_COMMAND): some data is lost, and the test program cannot detect it.
qla2x00 (6.06.00b13) + BUSY: some data is lost, and the test program cannot detect it.

Unfortunately, the last case shows that even the new driver with BUSY cannot suppress the data loss. According to scsi_decide_disposition() in scsi_error.c, the SCSI statuses CHECK_CONDITION (ABORTED_COMMAND) and BUSY lead to the same condition (goto maybe_retry:).
We know the new QLogic driver (6.06.00b13) doesn't use this function, but drivers with use_new_eh_code=1 may use it. To avoid this issue completely for all SCSI drivers, our silly patch is not so bad (we confirmed that it prevents the data loss). We believe the problems are both: 1) data loss, and 2) the application cannot detect it when 1) occurs. Whether a disk returns CHECK_CONDITION (ABORTED_COMMAND) or BUSY, these are not permanent errors but temporary ones, so we recommend that the Linux kernel retry "many" times... FYI: the QLogic driver (6.06.00b13) has its retry counter set to 20 or 30.

> Unfortunately, the last case shows that even the new driver with BUSY
> cannot suppress the data loss.

Agreed: the new-style error handling should probably be retrying
indefinitely on BUSY. Old-style already does, which is what most drivers
currently use. I'll defer to our internal scsi experts on that, though.
On the "application doesn't detect errors" question: what test code are you
using for that? Normal Unix "write()" syscall traffic cannot return such errors
to the app since we do write-behind by default.
> Normal Unix "write()" syscall traffic cannot return such
> errors to the app since we do write-behind by default.
I agree that write() can't return the errors.
However, normal UNIXes like AIX or HP-UX, which we usually use,
behave synchronously on close(), so the app can detect
a disk error at close() at least.
Linux doesn't sync at close(), for performance reasons.
Our test program is very simple: it just writes, reads back, and
compares the data (essence below; the actual program is more
complex and has many error checks):
int fd;
long rsiz, wsiz;
char data[...] = FIXED_DATA;
char buf[...];

fd = open( "file", O_RDWR|O_CREAT );
wsiz = write( fd, data, sizeof( data ) );
close( fd );

fd = open( "file", O_RDONLY );
rsiz = read( fd, buf, sizeof( data ) );
close( fd );

if (rsiz != wsiz || memcmp( data, buf, sizeof( data ) ) != 0)
    exit( 1 );
exit( 0 );
You know, Linux can't write data to disk fully synchronously
even with the O_SYNC flag at open(), because of its poor implementation.
Synchronous behaviour at close is just a quirk of some old systems and bad for performance. If you want portable synchronous closure, use fsync on the file handle and check its return before the close. Please provide specific details for the case where you say O_SYNC doesn't work, probably in another bug. Thanks.

We confirmed that the same test program above didn't work even when
it had the O_SYNC flag; write() never returned any errors.
The Linux kernel doesn't have perfect synchronous I/O;
our experiment also says so.
> Please provide specific details for the case you say O_SYNC doesn't work,
> probably in another bug.
I agree that O_SYNC not working is a separate bug.
Sometimes a program behaves asynchronously even when
it opens with O_SYNC on current Linux.
Failure of O_SYNC to write data is a serious bug. Have you opened a bug report for that? If not, do you have a proper test case, including a description of how it fails?

> Failure of O_SYNC to write data is a serious bug. Have you opened a bug
> report for that?

No, I haven't. We'll open one when we have sufficient data.
We've made a new kernel patch that adds a kernel parameter to specify the retry count when writing. The patch is being tested now and will be attached to bugzilla when the testing is over. The kernel parameter is:

SYNTAX:
  write_retries=HOSTNO:CHANNEL:SCSIID:RETRIES[,HOSTNO:CHANNEL:SCSIID:RETRIES]*

DESCRIPTION:
  The new kernel parameter "write_retries" assigns a new retry count for a specific SCSI device. It is a parameter for the module scsi_mod.o.

ARGUMENTS:
  HOSTNO  = SCpnt->host->host_no
  CHANNEL = SCpnt->channel
  SCSIID  = SCpnt->target
  RETRIES = user-specified value (overrides SCpnt->allowed). If RETRIES is 0, it means infinite retries.

EXAMPLE:
  To set infinite retries for the SCSI device host_no=0, channel=1, scsiid=2, add

    options scsi_mod write_retries=0:1:2:0

  to /etc/modules.conf, remake the initrd, and reboot.

If you don't specify this parameter, the Linux kernel behaves as before. Even with the parameter, the kernel keeps its default behavior for all devices other than the one(s) specified. What do you think of this patch?

Created attachment 94224 [details]
Scsi patch for changeable scsi_write_retries
This patch has been confirmed with both qla2x00 (4.28) and qla2300 (6.04).
The kernel version is 2.4.7-10 (RH72 default). We'll test this on
RHAS (RHES) kernels, but it should be fine.
By specifying infinite retries with this patch, the issue (data loss)
never happens.
We do not believe that infinite retries are the correct solution to this problem, because during the retry period the SCSI subsystem is essentially locked, waiting for the ability to write to the device. We have been told that this maintenance period can be an hour or more. We expect that most customers will not want to create a situation where their entire system may hang for an hour. We also cannot create a situation where a real hardware failure may cause the system to hang indefinitely.

As an example, consider that some customers configure software RAID devices made up of logical units exported by separate hardware RAID subsystems, so that the data remains available even if one of the RAID subsystems fails. For this to work well, the hardware RAID subsystem must fail reasonably quickly, so that the software RAID can remove the failed hardware element and continue to provide access to the data.

It may be acceptable to include a patch like the one suggested above, so that a customer can select the amount of time that the SCSI subsystem will wait for a failed I/O, based on their availability requirements and the characteristics of their storage hardware. We would not recommend using this method to select an infinite wait time. Since an infinite wait time is not desirable, it is always possible that an I/O will fail because the controller was in maintenance mode a bit longer than the wait time. For this reason, we consider it most important to resolve the failure of O_SYNC that you described earlier. Please provide more information about that failure, as requested previously (preferably in another bugzilla, referencing this one). Ultimately, this is the _only_ solution to the data corruption problem, and it should be our highest priority.

RHEL 3 is frozen. If any changes are to be made, they will have to be in the first RHEL 3 update.
We typically do an update every three months, but the schedule for this is not yet determined, so this is _not_ guaranteed. We will consider including a patch to allow the customer to adjust the retry delay in an RHEL 3 update. This will depend on whether the approach is acceptable to the upstream kernel developers. We request that you provide more information on the O_SYNC problem as soon as possible.

(Reassigning this bug to me.)

First of all, it does make some sense for us to be more uniform about retries on BUSY: 2.6 and scsi_obsolete.c both keep retrying indefinitely in those situations, and new-eh scsi_error.c in 2.4 could be brought in line with that behaviour. Ideally (but not essentially), this would be done with a separate, timer-driven queue for the retries, to avoid spinning if the error condition comes back from the target immediately.

Secondly, there's the issue of what to do if the IOs fail. We propagate that back to the application if the app has requested notification via O_SYNC/fsync, but the entire Linux VFS/VM layer is really not set up to deal with drives dropping writes on the floor when we're just doing writeback IO. The basic assumption is that write requests to the driver layer are not optional, which seems reasonable enough. The filesystem tries to cope, but basically does so by eventually rereading the old version of the data off disk, losing the new version that it tried to write. The only sane alternative to this behaviour is to take the filesystem completely offline if the drive drops writes; it might be possible to make such a change, although it would require significant VFS help (we don't want to offline a filesystem just because a data block got dropped, but if it's metadata, that's a different story).

But the single biggest problem is VM stability. The VM is simply not expecting to have to deal with IOs that can last an hour or more without completion.
Memory allocation strategies assume that we can reclaim memory by writing used information to disk on demand, and the rate of memory reclaim is to some extent throttled by the rate at which those IOs complete. If IOs stop completing, the VM can stall waiting for them, and the kernel essentially becomes unable to allocate more memory. There are some attempts in the VM to mitigate slow devices, but none to bypass a completely stopped device. There are similar problems in the VFS, where a totally stalled disk device may hold up the regular background "sync" updating process, or may block the cache reclaim routines by stalling in the inode flushing code. I simply don't think we can sensibly support normal kernel operation on a system where a busy filesystem is on such a long-term-stalled device. There might be more hope if the storage array is only being used for raw access, where no VM structures depend on writeback to that device, but that doesn't help us in the general case.

That's not valid testing. It is perfectly legal for the kernel to fail a write due to transient errors but for a subsequent read to succeed, either because the previously-read data is still in cache (for O_SYNC; obviously not for raw IO) or because the error condition has recovered. In other words, even if the kernel does correctly report the error back to user space, the above test code will still fail. It's important to know what you are trying to test. There have been two separate issues mixed together in this report so far: (1) the SCSI stack returning an error after limited ABORT retries, and (2) an alleged failure to return the error to user space. (2) is undiagnosed so far and needs more information. The trouble is that simply testing for short writes or read/write miscompares does not allow you to distinguish between the two.
If you are doing raw IO and get a write failure from the kernel, in this case it is likely to be because the hardware returned ABORT for too long: that is ultimately _not_ a kernel problem, even though we are looking at kernel workarounds for the device behaviour. But if the kernel acknowledges the write and you still get a data miscompare, that is a different issue. The test code really needs to distinguish between these cases. When analysing test results, it is important for us in engineering to be able to tell which case is occurring.

Red Hat engineers have agreed to provide a fix for this problem in Red Hat Enterprise Linux 3 Update 2, our next update. This issue is on the "must fix" list.

The Red Hat engineers have, in the end, agreed to provide the fix that Hitachi requests. This fix will only apply to Hitachi hardware. The next actions that need to happen: Hitachi needs to provide unique identifiers for all affected HDS hardware, including a list of vendor and model strings for each device affected by the Hitachi implementation. The information should be posted to this ticket (Bugzilla #86312). Red Hat will then provide a fix for Hitachi to test. The fix does include infinite retrying on Hitachi systems. This is the fix that is available to meet the RHEL3 U2 schedule. If Hitachi has decided that this solution is not acceptable, then the fix will not be in RHEL3 U2 at this time, and I and Robert Perkins, Partner Product Manager, will coordinate with Hitachi to set up face-to-face meetings and conference calls between Hitachi and Red Hat engineers. Hitachi should indicate their response in this ticket.

The essence of this contains two problems:
1) kernel behavior in case of disk error or disk busy
2) Red Hat policy for Enterprise Linux

We agree that the disk should return BUSY, instead of CHECK_CONDITION, when the disk is busy.
But when a disk or sector has an error, we absolutely disagree that the RHEL kernel should have logic that causes data loss. If *a* sector on a disk is broken, the Linux kernel may skip the sector and lose the data that should have been written to it, without any notification to the user/application. Does Red Hat think this is OK or not?

> But the only sane alternative to this behavior is to take the
> filesystem completely offline if the drive drops writes;

Yes, we agree, but there is no such implementation in Linux now, even though Solaris and other Unixes have an online rescue mode. Therefore, we strongly recommend fixing your kernel so that the user can choose its behavior. Infinite retrying isn't good for devices under software RAID, raw devices, or an active-standby SCSI path selector, because it may stall the kernel (as Tom says). But for a disk without these features, it isn't so bad to retry infinitely to avoid data loss. So we believe we need two patches that add selectable parameters:

1) for the SCSI protocol driver layer:
   1-1) infinite retrying for a specific device
   1-2) default kernel behavior (5 retries)
2) for the block device (or higher) layer:
   2-1) infinite retrying for a specific device
   2-2) default kernel behavior (may lose data)
   2-3) kernel panic when the data loss occurs

The first one is just like our patch. RHEL is for enterprise use. Consumers may accept the current implementation, but our enterprise customers never allow us to lose any data. They believe a kernel panic is better than having their data lost silently (the kernel log is not noisy enough). Japanese customers especially feel this way.

> The Red Hat engineers have, in the end, agreed to provide the
> fix that Hitachi requests.

We really appreciate that you decided this. In addition:

> The fix does include infinite retrying on Hitachi systems.

Please include some logic by which the user can select infinite or limited retrying.
>is broken, Linux kernel may skip the sector and lose the data which
>should be written on it without any notification to
>user/application. Does RedHat think this is O.K. or not?
Normal application writes are entirely asynchronous. The application
writes not to disk but memory, and the kernel is responsible for doing
efficient batched writes of the memory caches to disk some time later.
The application can easily have written the file to cache entirely and
closed it without a single byte having been written to disk. There's
simply no way to tell the application that the write failed in such
cases: the write hasn't even *happened*. The application needs to
wait for the write to complete if it wants failure notification.
> Normal application writes are entirely asynchronous. The application
> writes not to disk but memory, and the kernel is responsible for doing
> efficient batched writes of the memory caches to disk some time later.

We know. It would be better if the application could sense the failure, but we agree that it is impossible in some cases. The issue is that the user (not only the app) can't notice it. How should we send the error message to the user? (I said syslog isn't good enough.) Many Unixes have a special mechanism, like a rescue mode, but Linux doesn't have anything. Therefore, we must decide to adopt either infinite retrying or a kernel panic to avoid data loss. We really don't want to do so, but there is no other way on Linux. Or do you recommend that all disk access must be raw or synchronous to prevent any data loss? That is very slow and not attractive to any customer.

In current Linux, applications should use raw or synchronous IO if they need to know about write failures, and SCSI hardware devices should observe the SCSI specification. It may be a better solution for Linux to gain a stronger feature to block data corruption in the future. However, the implementation should be considered carefully, not only by Red Hat and Hitachi but by the whole Linux community. I hope Hitachi, who well knows the necessity of it, will play a central role in implementing it for the whole Linux community (including the users).

Please note that we have no objection in principle to doing infinite retry here. The main difficulty is knowing exactly when to do so. On most SCSI targets, ABORTED_COMMAND is not the right way to detect that infinite retry is needed: it would prevent the kernel from recovering when a single device on a SCSI bus is wedged and cannot complete the commands sent to it. So we need to be careful how we detect the retry conditions, and avoid looping forever in cases where it is not appropriate.
Regarding sending information about arbitrary low-level IO failures to the user: syslog is quite simply the most effective mechanism we have for this in current kernels. IO failures *do* get reported via the syslogs. The functionality is not missing, though it could certainly be improved, and there have been proposals in the upstream Linux kernel community for ways to do that. It is entirely possible that future kernels will have better ways of returning structured error logs from the kernel to user space, but right now klogd/syslog is the recommended mechanism.

OK, I want to reply to several comments at one time. First, naraha_s.co.jp, you wrote:

> Essence of this contains 2 problems:
> 1) kernel behavior in case of disk error or disk busy.
> 2) Red Hat policy for Enterprise Linux
>
> We agree that the disk should return "BUSY", instead of
> CHECK_CONDITION in case that disk is busy. But when disk
> or sector has an error, we absolutely disagree the RHEL
> kernel has a logic that occurs data loss. If *a* sector
> on a disk is broken, Linux kernel may skip the sector and
> lose the data which should be written on it without any
> notification to user/application. Does RedHat think this
> is O.K. or not?

I think it will actually help this discussion if we define a few things regarding the Linux kernel. First, there is the concept of a catastrophic error: an error from which recovery is not reasonably possible. If someone walks into the server room and spills an entire cup of water into a server, shorting it out and burning out the CPUs, mainboard, and RAM all at once, that is obviously a catastrophic error. We cannot recover from it, and data loss is sure to happen. The company must then reinstall from backup and try to recreate the missing data. Second, there are recoverable errors. Lots of little things can go wrong in a machine from which the machine is able to recover; ECC correction of single-bit memory errors is an example.
Generally speaking, the Linux kernel relies solely upon the hardware to correct these minor errors. If an error makes it past the hardware correction mechanisms, Linux considers it a catastrophic failure and makes no attempt to recover from it. The only exception I know of is the software RAID support in Linux. Software RAID is, by its very nature, an error recovery mechanism. Because disks fail so often, and because purchasing an external disk array that does the RAID for you is so expensive, software RAID was written into the Linux kernel so that even people just using Linux as a workstation could be protected from disk failures.

Now, I think our primary area of disagreement is in what qualifies as a catastrophic error. In Linux, any unrecoverable disk error is catastrophic. If you consider the origins of the Linux operating system, this is no surprise. Linux was originally written to run on i386-based personal computers that didn't have the advanced hardware recovery mechanisms that mainframe-class computers have. When a disk failed in a Linux computer, there simply wasn't anything to be done about it. So the Linux kernel was written with this in mind: hard errors on a disk are catastrophic errors. Changing this now would be a monumental engineering task.

To provide some security against this type of catastrophic error, especially since disks are unreliable, we implemented software RAID in the Linux kernel. This way a single disk failure on a RAID1, 4, or 5 device is not catastrophic to the system. A hard error on a disk in a software RAID array is still catastrophic as far as that single disk is concerned, but the redundancy of the software RAID device saves the system from ever seeing it, and data loss does not occur as a result.
However, the requirement for our software RAID stack is that it must *never* return a hard error for the RAID device unless it has already reached the state where there is no way of safely interacting with the disks. This is exactly how our software RAID stack works: the core operating system will never see an error from it unless the RAID device has lost too many disks to be able to continue. When that happens, it is considered a catastrophic failure and no recovery is possible.

An option for enterprise customers is to use external hardware RAID devices, such as the SanRise devices, instead of Linux software RAID. This offloads from the core CPU the work of maintaining the RAID array and talking to all the individual hard disks. However, the Linux kernel has the same expectation of external hardware RAID devices that it has of its software RAID devices: it expects that it will *never* see a hard error from the external RAID device until the array is already in an unusable state and no recovery is possible.

The problem we've had, and that this bug report is all about, is that the Linux kernel's SCSI stack thinks that if a command fails with ABORTED_CMD 5 times in a row, then it is a hard error, and it returns that hard error to the core of the operating system. The SanRise equipment uses ABORTED_CMD for soft errors, i.e. errors that will go away. So I am writing a patch to make the SCSI stack treat ABORTED_CMD as a soft error on SanRise equipment. By treating it as a soft error, the command will be retried infinitely and the data loss will not occur.
However, Shinya-san (is that the correct way to address you? I'm not familiar enough with Japanese culture to know; please forgive my ignorance), you are correct that this does not change the fact that the Linux kernel will still consider other hard errors from SanRise equipment as catastrophic failures and will allow data loss to occur. Implementing a full recovery mode is probably not going to be possible. Let me explain why I think this.

We try to write all of our kernel changes in a way that will be acceptable to the upstream kernel maintainers (Linus Torvalds for the 2.6/2.7 kernel, Marcelo Tosatti for the 2.4 kernel). The upstream maintainers are concerned with making Linux work well on the largest number of machines possible. Obviously, the number of workstations that run Linux is far, far greater than the number of mainframe or PC-class server computers. The overhead required in the core portions of the Linux kernel to support recovery operations like those you describe is significant. Since most workstation users would be unhappy to have their Linux kernel run slower because it is maintaining the information necessary to support online recovery operations, it is doubtful that the upstream maintainers would accept patches implementing such a feature. Instead, they would argue that a company concerned about the possibility of data loss should use hardware RAID devices, software RAID arrays, or both.

So, instead of trying to write a recovery mode into the Linux kernel, I think it is preferable that enterprise customers design their servers so that a recovery mode should never be needed. For example, if they think it might ever be possible for a SanRise disk array to return a hard error and be taken offline, they could buy two SanRise disk arrays and use Linux software RAID to treat them as RAID1 mirrors of each other.
That way, if a SanRise array ever goes offline for hard errors, there is still the other one to operate from. I actually think this is a very fair way to handle the problem of trying to find a solution that is acceptable to both regular users of linux on a workstation and enterprise customers. Instead of putting code into the kernel to enable a rescue mode, which would slow down everyone's machines, it allows the enterprise customer to buy whatever level of fault tolerance they want in their hardware and then rely upon that hardware to protect their data. That way no one suffers the performance penalty of rescue mode capability overhead, not even enterprise customers, but by buying more fault resilient hardware they are able to protect their data. But that's just my opinion. Now, in a different post, Shinya-san also wrote: > How should we send error message to user? (I said syslog > isn't so good). I think that syslog is good enough. But not because the user notices the entries in syslog. We have a nightly script that runs on RHEL machines that scans that day's syslog entries looking for anything unusual. In the event of a disk error, it sends an email to root@localhost with a report of the error. As long as the machine is configured to send email for root@localhost to some valid email address, that user will get notified that a disk error took place via email. This keeps the error from happening in a way that users can overlook. I've submitted a patch for inclusion in RHEL3 U2 to address this problem. It's actually based upon a previous patch I submitted. I've attached three patches to this bugzilla report. The first patch that things are based upon (the scsi queue fix patch), the actual retry aborted_cmd patch, and then the whitelist update patch that makes this effective on Hitachi SanRISE equipment. Please review these patches and let me know about any problems you see. Created attachment 98068 [details]
Patch that fixes the scsi mid layer queue handling for commands that need infinite retries
Created attachment 98069 [details]
Patch to use the new mid layer queue handling for aborted_cmd commands
Created attachment 98070 [details]
Update to the device whitelist entries to catch all SanRISE equipment
The fixes for this problem were committed to the RHEL 3 U2 patch pool today. The changes in comment #86 were already in U1. An updated version of the patch in comment #87 (adding the new retry_aborted_cmd field) and the patch in comment #88 were both committed to the internal Engineering build of kernel version 2.4.21-9.16.EL.

Hitachi has posted the test result to IT#27794. Unfortunately, the data loss has still occurred for ABORTED_COMMAND.

Test Environment:
    Kernel Ver   : 2.4.21-11.ELsmp
    Test Program : seqtp17 (Hitachi Original)
    HBA          : QLA2340
    Driver       : 6.07.02-RH1
    Firmware     : 3.02.16

Test Pattern:
(1) SANRISE has begun to return ABORTED_COMMAND BEFORE starting the test program.
(2) SANRISE has begun to return ABORTED_COMMAND AFTER starting the test program.

Test Pattern | Write method | Result             |
-------------+--------------+--------------------|
(1)          | synchronous  | Write Error        | => OK
             | asynchronous | Data loss occurred | => NG
             | raw I/O      | Input/Output Error | => OK
-------------+--------------+--------------------|
(2)          | synchronous  | Data loss occurred | => NG
             | asynchronous | Data loss occurred | => NG
             | raw I/O      | Input/Output Error | => OK
--------------------------------------------------

If you need more information to analyze this, please let me know.

Following are the new test results. The differences from the previous test are:
- Vendor code (before: "HP ", new: "HITACHI")
- Firmware version (before: 3.02.16, new: 3.02.24)

Test Environment:
    Kernel Ver   : 2.4.21-11.ELsmp
    Test Program : seqtp17 (Hitachi Original)
    HBA          : QLA2340
    Driver       : 6.07.02-RH1
    Firmware     : 3.02.24
    SANRISE Vendor Code: "HITACHI"

Test Pattern:
(1) SANRISE has begun to return ABORTED_COMMAND BEFORE starting the test program.
(2) SANRISE has begun to return ABORTED_COMMAND AFTER starting the test program.
Test Pattern | Write method | Result                |
-------------+--------------+-----------------------|
(1)          | synchronous  | Normal End (2/5)      |
             |              | System unstable (3/5) |
             |--------------+-----------------------|
             | asynchronous | Normal End (1/4)      |
             |              | System unstable (3/4) |
             |--------------+-----------------------|
             | raw I/O      | Normal End (5/5)      |
-------------+--------------+-----------------------|
(2)          | synchronous  | Normal End (2/5)      |
             |              | System unstable (3/5) |
             |--------------+-----------------------|
             | asynchronous | System unstable (4/4) |
----------------------------------------------------|

"System unstable" means:
- It cannot kill the test program process.
- It cannot umount devices on SANRISE which the test program is using.
- It cannot build a filesystem on another SCSI disk connected to the same HBA card.
- It cannot reboot or shutdown. The reboot/shutdown process has stopped at:
    Broadcast message from root (pts/0) Tue Mar 23 XX:XX:XX 2004...
    The system is going down for reboot/system halt NOW !!
- It can execute the following commands: ps -ef, ps -ef | grep, cd /somewhere/notOnSanriseDevices

SANRISE was returning ABORTED_COMMAND for 2 minutes at most during the test. After it stopped returning ABORTED_COMMAND, the system was still unstable. Therefore, Hitachi thinks this unstable state is caused by the system failing to handle its queued tasks properly after the array stops returning ABORTED_COMMAND. Hitachi hopes this is fixed.

Created attachment 98868 [details]
Proposed fix patch
OK, I think the problem is that the requeue code in scsi_queue.c works fine for
commands that failed with a bad status code, but fails for commands that are
being retried because of sense data. It basically fails to clear the sense
data and reset the command for another attempt. This patch adds this operation
to scsi_mlqueue_insert so that the command is cleared out between retries.
Hitachi posted the result of the test with Doug's proposed patch into IT#27794. However, the result was not good. The system has still been unstable sometimes.

Test Environment:
    Kernel Ver : 2.4.21-11.EL + above patch (id=98868)
    HBA        : QLA2340/LP9802
    Firmware   : 3.02.16/1.01A2

Write method  | Result                |
--------------+-----------------------|
synchronous   | Normal End            |
              | System unstable       |
--------------+-----------------------|
asynchronous  | Normal End            |
              | System hang up        |
--------------+-----------------------|
raw I/O       | Normal End            |
              | System unstable       |
--------------+-----------------------|

"System unstable" means:
- It cannot kill the test program process.
- It cannot umount devices on SANRISE which the test program is using.
- It cannot build a filesystem on another SCSI disk connected to the same HBA card.
- It cannot reboot or shutdown. The reboot/shutdown process has stopped at:
    Broadcast message from root (pts/0) Tue Apr 7 XX:XX:XX 2004...
    The system is going down for reboot/system halt NOW !!
- It can execute the following commands: ps -ef, ps -ef | grep, cd

"System hang up" means: it is impossible to enter any command from the keyboard.

In each condition there was no choice but to turn the power switch off. This result is not what Hitachi expects, so Hitachi wants the patches for this issue *NOT* to be applied in RHEL3 U2.

Hitachi evaluated Doug's proposed patches (ver1.14 in his bk trees). However, the system has not worked well after their storage recovered from the ABORTED_COMMAND condition. (It is no different from the previous test.)

The test result for raw I/O and synchronous I/O: the system has been in an unstable condition.
- could not kill the test program process.
- could not umount devices which were connected to the same HBA card.
- could not build a filesystem on another SCSI disk connected to the same HBA card.
- could not reboot or shutdown. The reboot/shutdown process has stopped at:
    Broadcast message from root (pts/0) Tue Mar 23 XX:XX:XX 2004...
    The system is going down for reboot/system halt NOW !!
- could execute the following commands: ps -ef, ps -ef | grep, cd

The result for asynchronous I/O (write-back I/O): could not input any keys and could not execute any commands.

I have not gotten any logs of 'echo "scsi dump 2" > /proc/scsi/scsi' from Hitachi yet (though they could not collect them in the asynchronous I/O test).

The fact that the 1.14 kernel failed is not unexpected. However, the information that I requested, in the form of the system logs after the kernel had failed and 'echo "scsi dump 2" > /proc/scsi/scsi' had been run, was going to be used to confirm what the problem really was. In the meantime, since I suspect what the problem is, a possible solution for that problem was committed to my bk tree on Jun-1-2004. This is the changeset log for that change:

ChangeSet, 2004-06-01 09:33:23-04:00, dledford.com
  drivers/scsi/scsi.c
    Init sdev_retry_q during build_commandblocks
  drivers/scsi/scsi.h
    Add sdev_retry_q so retried commands aren't shoved into the block
    request queue (means no merging of additional requests into an already
    initialized command and means we can easily give first priority to
    commands on the retry queue instead of the regular block layer queue in
    order to avoid lockups on devices if they manage to get *all* of their
    command blocks allocated to commands that need retried).
  drivers/scsi/scsi_lib.c
    Change scsi_request_fn so that it processes commands on the retry queue
    first and so that it skips all the command init and command allocation
    code for any commands on the retry queue.
  drivers/scsi/scsi_queue.c
    Make the scsi_mlqueue_insert routine put the delayed commands on the
    new sdev_retry_q list instead of back in the block layer request queue.

Hitachi posted the test result to IT#27794 today. Doug's proposed patch makes the system work well.
Test Environment:
    Kernel       : 2.4.21-15.EL.hitachi2 (2.4.21-15.EL + Doug's bk tree rev1.37)
    Test Program : seqtp19i_32
    HBA          : QLA2340
    Driver       : 6.07.02-RH2
    Firmware     : 3.02.24

Infinite retries for ABORTED_COMMAND worked well, no data corruption occurred, and the system has been in a stable condition.

Write method  | Result     |
--------------+------------|
synchronous   | Normal End |
--------------+------------|
asynchronous  | Normal End |
--------------+------------|
raw I/O       | Normal End |
--------------+------------|

Additional Information: Data corruption occurs in spite of using O_SYNC if the storage device returns HardwareErr or MediumErr. This problem should be discussed in Bugzilla#116900 and/or related IssueTracker tickets.

A fix to this problem was committed to the RHEL3 U3 patch pool on 21-June-2004 (in kernel version 2.4.21-15.15.EL).

Closing this issue out, as the original reporter has tested and confirmed resolution.

I've reverted this bug to MODIFIED state. It will be closed automatically by the Errata System with more detailed information when RHEL3 U3 is actually released.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html

Created attachment 111960 [details]
paranoiac patch 1
Here is our last patch to avoid this issue.
Our patch will eliminate this issue, but as you know,
this is a paranoid patch only for very cautious customers.
We know this patch will not be included in the mainstream kernel,
but we would like you to know that there are paranoid customers
in the enterprise world.
Please don't respond to this; let us finish this issue...
Created attachment 111961 [details]
paranoiac patch 2
additional patch
Guys - did some rendition of this "fix" make it into the upstream kernels? Thanks, ccb