79027 – Tape Issues in 2.4.18-18.7 (and other) kernel

Summary: Tape Issues in 2.4.18-18.7 (and other) kernel

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	8.0
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-12-04 17:48 UTC by Need Real Name
Modified:	2005-10-31 22:00 UTC
CC List:	8 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:40:16 UTC
Embargoed:

Attachments
Kai's Patch for Generic 2.4.20 kernel (17.83 KB, patch) 2003-08-28 15:26 UTC, Need Real Name

Description Need Real Name 2002-12-04 17:48:21 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826

Description of problem:
This actually affects 7.2, 7.3, and 8.0.

There are many issues with tape I/O in the 2.4 kernels.  The latest 2.4.18-18.7
from Red Hat has an issue with end-of-tape conditions.  The procedure SHOULD be
that the drive sends a check condition with early-warning end of media during the
write phase (this is not an error).  st.o should then translate that into
ENOSPC and pass it to the user application (tar, cpio, et al).

This is not occurring, so the application continues to try writing data beyond
the end of media.  Once the I/O timeout is met, the application then issues the
close on the device, but the device fails when it tries to write the end of data
filemarks.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Execute a backup that will exceed the capacity of the current media.
2. Watch the data rate drop to 0k/sec once the drive reaches the end of media
(the drive will start reseeking, or shoe-shining).
3. Wait for the timeout (could be hours depending on the application).
	

Actual Results:  As described in the Description above, the application will
eventually get a write error and the drive will further report the failure to
write the filemarks in the syslog.

Expected Results:  ENOSPC should have been sent up to the user space application
by the st driver.
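
For illustration only, here is a minimal sketch of what an application is expected to do once ENOSPC actually arrives from the driver.  The names write_records and next_volume are hypothetical, not taken from tar or cpio, and the write function is injectable so the logic can be exercised without a tape drive.  The sketch assumes the record that hit ENOSPC was not written and is retried on the next volume.

```python
import errno
import os


def write_records(fd, records, next_volume, write=os.write):
    """Write each record to fd, switching volumes when ENOSPC is reported.

    next_volume is a hypothetical callback: it receives the current
    descriptor and returns the descriptor to use for the next tape.
    """
    count = 0
    for record in records:
        while True:
            try:
                write(fd, record)
                count += 1
                break
            except OSError as exc:
                if exc.errno != errno.ENOSPC:
                    # EIO and other codes are real errors: give up.
                    raise
                # ENOSPC at early warning is not an error: change tape
                # and retry the same record on the new volume.
                fd = next_volume(fd)
    return fd, count
```

With the behaviour described in this report, the driver hands back EIO instead, so a loop like this never reaches the volume change and simply fails.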

Additional info:

This occurs in tests with DDS2, 3, & 4 DAT, LTO from Seagate and HP, DLT and
VS80DLT, OnStream ADR2, Tandberg SLR, and Travan under IDE-SCSI, so I suspect
that it is limited to the st.c code or some parent function in scsi.c lowlevel code.

Comment 1 Simon Matter 2003-01-08 23:00:24 UTC

I'm using 2.4.18-19.7.x and have problems with backup since upgrading from 
2.4.9-34.

I do:

mt setblk 10240
tar clpv --exclude boot/lost+found --exclude lost+found --exclude 
var/lost+found --exclude home/lost+found -b 20 -f /dev/nst0 
var/log/lback.ls-laR.gz boot . var home

tar reports this:
January 08 08:57:02 lback: tar: var/spool/postfix/public/showq: socket 
ignored
January 08 09:00:55 lback: tar: home/backup: file is on a different 
filesystem; not dumped
January 08 10:44:40 lback: tar: /dev/nst0: Wrote only 0 of 10240 bytes
January 08 10:44:40 lback: tar: Error is not recoverable: exiting now
January 08 10:44:40 lback: Command exited with non-zero status 2

syslog shows this:
st0: Error with sense data: Info fld=0x3, Current st09:00: sense key Not 
Ready
Additional sense indicates Logical unit not ready,cause not reportable
st0: Error with sense data: Info fld=0x1, Current st09:00: sense key Not 
Ready
Additional sense indicates Logical unit not ready,cause not reportable
st0: Error on write filemark.

Is it the same bug? The backup has worked perfectly for months now and I 
don't understand what happened.
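
For reference, the 10240-byte figure in the tar message above follows from the blocking factor: tar's -b option counts 512-byte blocks, so -b 20 gives one 10240-byte record per write, matching the block size set with mt setblk.  A small sketch of that arithmetic (the function name is only for illustration):

```python
TAR_BLOCK_BYTES = 512  # tar counts its blocking factor in 512-byte blocks


def tar_record_size(blocking_factor):
    """Return the size in bytes of one tar record for a given -b value."""
    return blocking_factor * TAR_BLOCK_BYTES


# -b 20 as used in the command above
print(tar_record_size(20))
```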

Comment 2 Kevin Fenzi 2003-02-19 19:36:07 UTC

I can also reliably reproduce this problem on a DDS2 drive. End of tape always
results in the application getting an EIO instead of an ENOSPC. 
The controller doesn't seem to matter; I have seen this on an aic and a sym
controller, with LTO, DDS2 and DDS4 drives from different vendors. 
It seems that the 2.4.9-34 kernel works fine, so it has to be a kernel change
since then. Perhaps in the st module? 
I have reported the problem to the scsi-tape maintainer and the linux-scsi list. 
I am happy to run any test cases or provide access to the tape drive.

Comment 3 Need Real Name 2003-02-19 23:07:28 UTC

Additional news.

This is actually related to the check sense bit not being propagated up to the
ST driver.  A simpler test (beats writing 40GB to a tape ...):

use a 2.2.19/20/21 or 22 kernel, or a 2.4.9-34 kernel
Remove the tape from the tape device
execute:

  tar -cvvf /dev/nst0 /etc

You will receive a "No medium found" message

Replace the kernel with 2.4.11+ and repeat the tar write test.  This time, you
will receive a write failure.

This is caused by the check sense not being set and the ST driver sending up an
EIO instead of ENOMEDIUM.
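
A rough way to see which error code comes back without reading tar's messages is to open the device directly and look at errno.  This is only a sketch: classify_tape_errno and probe are made-up names, the labels are my own, and it assumes the error surfaces at open time, which may not hold for every drive or kernel.

```python
import errno
import os

# ENOMEDIUM is Linux-specific, so look it up defensively.
ENOMEDIUM = getattr(errno, "ENOMEDIUM", None)


def classify_tape_errno(code):
    """Map an errno value to the cases discussed in this report."""
    if code == errno.ENOSPC:
        return "end-of-media"
    if ENOMEDIUM is not None and code == ENOMEDIUM:
        return "no-medium"
    if code == errno.EIO:
        return "io-error"
    return "other"


def probe(path="/dev/nst0"):
    """Try to open the tape device for writing and report what happened."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        return classify_tape_errno(exc.errno)
    os.close(fd)
    return "opened"
```

With the tape removed, a working kernel would be expected to give "no-medium", while an affected one would give "io-error".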

Tim

Comment 4 acount closed by user 2003-03-22 05:34:51 UTC

Makisara has put the solution for the EOT bug at:

http://www.kolumbus.fi/kai.makisara/st-eot.html

Comment 5 Francois Levasseur 2003-04-25 14:11:00 UTC

This is great, but how can I start writing to the device after that? Do I need to
reboot? 

[root@svrlinux log]# mt -f /dev/st0 status
/dev/st0: No such device or address

[root@svrlinux log]# cat messages* | grep st0
Apr 23 23:43:15 svrlinux kernel: st0: Error 6000000 (sugg. bt 0x0, driver bt
0x6, host bt 0x0).
Apr 23 23:44:58 svrlinux kernel: st0: Error 30000 (sugg. bt 0x0, driver bt 0x0,
host bt 0x3).
Apr 23 23:44:58 svrlinux kernel: st0: Error on write filemark.
Apr  8 17:49:52 svrlinux kernel: Attached scsi tape st0 at scsi2, channel 0, id
0, lun 0
Apr  8 17:49:52 svrlinux kernel: st0: Block limits 1 - 16777215 bytes.


April 8th is the date I rebooted and April 23rd is the date the bug
occurred.

Now I can't use my tape drive. Is there a solution to correct this without
having to reboot? 

Francois,

Comment 6 Need Real Name 2003-04-25 15:08:55 UTC

You will need to reboot ONLY if st is not a module.  If you lsmod, is st in the
modules list?  If so, try rmmod'ing it and then insmod it back in.  This works
in most situations, but there are some situations where you can't rmmod the
module.  In those cases, you will need to reboot.
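
The lsmod check can also be scripted.  lsmod reads /proc/modules, where the first field of each line is the module name; a built-in st driver does not appear there at all.  A minimal sketch (the function name and the sample text in the usage note are made up for illustration):

```python
def is_loaded_module(name, proc_modules_text):
    """Return True if `name` appears as a module in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The module name is always the first whitespace-separated field.
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    with open("/proc/modules") as handle:
        print(is_loaded_module("st", handle.read()))
```

If this prints False while the tape device still exists, st is most likely compiled into the kernel and the rmmod/insmod route is not available.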

BTW - Kai has produced a patch for 2.4.20 which will (hopefully) get into
2.4.21, but we're not sure.

Tim

Comment 7 acount closed by user 2003-04-25 22:43:44 UTC

NO, only the .21-ac kernels have this patch; it is not in .21-rc1!!
But RH kernels are based on -ac.

Comment 8 acount closed by user 2003-05-23 01:19:24 UTC

new kernel 2.4.20-13.9 has a recent st version '20030406', in theory this bug
should be closed
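
To check which st version a running kernel carries, the driver's banner line in the kernel log ("st: Version <date>, bufsize ...") can be parsed.  A small sketch, assuming the banner keeps that form; st_version is a made-up name, and having the version present does not by itself prove the EOT handling works on a given drive.

```python
import re

# Matches the date-style version in the st driver's load banner.
ST_BANNER = re.compile(r"st: Version (\d+)")


def st_version(log_text):
    """Return the st driver version found in kernel log text, or None."""
    for line in log_text.splitlines():
        match = ST_BANNER.search(line)
        if match:
            return int(match.group(1))
    return None
```

Feeding it the output of dmesg and comparing the result against 20030406 shows whether the driver is at least as new as the one mentioned here.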

Comment 9 Dieter Thiel 2003-06-30 08:35:47 UTC

On our system this bug still lives on. I tried kernel
versions 2.4.20-13.7smp and 2.4.20-18.9smp.
After rmmod st, insmod st and running my backup script
I get the following messages:

st: Unloaded.
st: Version 20030406, bufsize 32768, max init. bufs 4, s/g segs 16
Attached scsi tape st0 at scsi0, channel 0, id 0, lun 0
st0: Block limits 1 - 16777215 bytes.
st0: Error 10000 (sugg. bt 0x0, driver bt 0x0, host bt 0x1).
st0: Error 8 (sugg. bt 0x0, driver bt 0x0, host bt 0x0).
st0: Error on write filemark.

The Tape drive is a COMPAQ   Model: SDT-10000,
cat /proc/scsi/scsi says:

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: COMPAQ   Model: SDT-10000        Rev: 1.09
  Type:   Sequential-Access                ANSI SCSI revision: 02

I cleaned the drive twice and tried a new cartridge. No changes.
Any ideas on that?
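
The hex number in the "st0: Error ..." lines is the raw SCSI result word, and the bytes printed in parentheses are fields of it.  The sketch below splits the word the way those log lines suggest: status in the low byte, then message, host and driver bytes, with the high nibble of the driver byte as the "suggestion".  The layout and the few host-byte names are quoted from memory of the 2.4 SCSI headers and should be treated as assumptions to verify against the kernel source.

```python
# Host byte names as I recall them from the 2.4 SCSI headers (assumption).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x03: "DID_TIME_OUT",
}


def decode_scsi_result(result):
    """Split a SCSI result word into the fields shown in the st log lines."""
    driver_byte = (result >> 24) & 0xFF
    return {
        "status": result & 0xFF,          # SCSI status byte from the device
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": (result >> 16) & 0xFF,    # "host bt" in the log
        "driver": driver_byte & 0x0F,     # "driver bt" in the log
        "suggest": driver_byte & 0xF0,    # "sugg. bt" in the log
    }


def host_name(result):
    """Return a readable host byte name, or the hex value if unknown."""
    host = decode_scsi_result(result)["host"]
    return HOST_BYTE_NAMES.get(host, hex(host))
```

Under those assumptions, "Error 10000" decodes to host byte 0x1 (the adapter could not reach the device) and "Error 8" to a bare SCSI status of 0x08, neither of which looks like an end-of-tape indication.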

Comment 10 Need Real Name 2003-06-30 15:13:05 UTC

This instance is definitely related to the problem.  What software are you using to write to tape?

Comment 11 Paul Felts 2003-07-21 20:27:51 UTC

I see this bug is closed, but I don't see how it was resolved. Can someone 
point me to the solution? Thanks!

Comment 12 Need Real Name 2003-07-22 14:19:34 UTC

2.4.20-18.9 has the patch applied (RH's latest production kernel).

Comment 13 Paul Felts 2003-07-23 02:13:29 UTC

Maybe I'm missing the boat, here. k-2.4.20-18.9 doesn't seem to be offered as a 
download for RH 7.3 through RHN, so I installed k-2.4.20-19.7 from rpmfind:

http://www.rpmfind.net/linux/RPM/redhat/updates/7.3/i386/kernel-2.4.20-
19.7.i386.html

figuring this version would have the fix. It didn't help. Was I wrong to 
assume? 

Thanks to everyone for their help.

Comment 14 Paul Felts 2003-08-04 20:10:09 UTC

k-2.4.20-18.9 has not fixed this issue. What else can I do?

Comment 15 Paul Felts 2003-08-28 00:40:55 UTC

This bug is not fixed in the recent kernel 2.4.20-20.7 on RH 7.3 :(
The kernel still panics and the machine hangs.

Comment 16 Need Real Name 2003-08-28 15:18:01 UTC

Update to all:

First, please remember that this bug is ONLY related to the lack of end of tape
(EOT) recognition.   This is not a generic SCSI issue nor does it apply to
non-tape problems.

Next, while RH's 2.4.20-18.9 was originally determined to have the patch applied,
it appears that it doesn't have it applied and the successful runs were simply a
lucky situation.

At this point, I can only recommend that you all contact RH directly if you are
under RH support and request that this be fixed immediately.

Otherwise, if you don't mind moving away from the RH-supplied kernels, download
the generic 2.4.20 kernel from ftp.kernel.org and apply Kai's patch from
http://www.kolumbus.fi/kai.makisara/st-eot.html to it.  With the patched
standard kernel, the problem with SCSI EOT recognition is resolved.

Comment 17 Need Real Name 2003-08-28 15:26:58 UTC

Created attachment 94036
Kai's Patch for Generic 2.4.20 kernel

This is Kai Makisara's patch for the generic (non-RH) 2.4.20 kernel.

Without this patch, the 2.4 kernels since 2.4.9 are broken in a manner that
prevents recognition and proper processing of the early warning end of tape
message from a tape drive.  This results in failures in backups when more than
one tape is required.  This affects ANY backup application that uses the
standard /dev/st?? device drivers under the 2.4 kernels.

Comment 18 Norbert JUNGHAUSZ 2003-09-09 07:43:48 UTC

I patched a kernel.org 2.4.20 kernel with this patch and it didn't fix the
problem on my server.
I still see the same messages during backup:

scsi: device set offline - not ready or command retry failed after host reset:
host 0 channel 2 id 3 lun 0
st0: Error 30000 (sugg. bt 0x0, driver bt 0x0, host bt 0x3).
I have a ServeRAID 5i controller with a 40/80gb DLT1 in an IBM server.

Is there anything else to try ?

Comment 19 Need Real Name 2003-09-09 17:20:02 UTC

Yes - place your tape drive onto a separate SCSI adapter.  You should never
place a tape drive onto the same SCSI HBA as your primary or RAID'ed SCSI disks.

First, this is a different error than what we're discussing here and I suspect
the problem is with the Dell/PercRAID adapter.  Additionally, I've seen many
issues with tape attached to the Dell RAID adapter that have absolutely nothing
to do with the EOT issue that this report concerns.

I would recommend adding a low cost 40MHz Ultra Wide controller such as the SiiG
AP40 (US$69 through CompUSA.com) or an LSI Logic HBA (usually less than US$100)
specifically for the tape drive.

Comment 20 Need Real Name 2003-09-09 17:28:57 UTC

Sorry, that wording should have said "a problem with the Dell/PercRAID Linux
device driver."  The Dell RAID controller works fine under Linux with the RAID
functions and disks.

Tim

Comment 21 James Ralston 2003-10-08 03:40:47 UTC

I downloaded kernel-2.4.20-20.7.src.rpm (the latest errata kernel) and went
through drivers/scsi/st.c and Kai Makisara's patch line-by-line.

RED HAT'S 2.4.20-20 KERNEL SERIES HAS KAI'S PATCH APPLIED.

Specifically, Kai's patch was rolled into this patch:

    Patch5: linux-2.4.20-later-ac-updates.patch

If you check the bug history, you'll see that this bug was closed on 2003-06-08
with CURRENTRELEASE as the solution.  That was almost certainly when Red Hat
added Kai's patch.

The problem is *not* that Red Hat's latest kernels don't have the patch applied.
 Rather, the problem is that Kai's patch doesn't fix the EOT recognition
problem--at least not in all circumstances.  :(

Those of us who are affected by this problem need to let Kai know that his patch
doesn't completely solve the problem, and be prepared to give him the debugging
output he'll need to further diagnose the problem.

Comment 22 CAR SYSTEMS 2003-10-20 15:48:36 UTC

RH 7.3

I updated the kernel to 2.4.20-20.7 and it didn't fix the problem on my server.

Backup (cpio) works fine with DDS3 DAT but fails with DDS4.

When I try to make a backup with DDS4 DAT, I get the following message on the 
console:

Found end of volume. To continue type device/file name when ready.

and in the log files :

st0: Block limits 1 - 16777215 bytes.
(ips0) Reset Request - Flushed Cache
scsi: device set offline - not ready or command retry failed after host reset: 
host 0 channel 2 id 6 lun 0
st0: Error 30000 (sugg. bt 0x0, driver bt 0x0, host bt 0x3).
st0: Error with sense data: Current st09:00: sense key Unit Attention
Additional sense indicates Power on,reset,or bus device reset occurred
st0: Error on write filemark.

And I can't access the tape without a cold reboot.

Thanks to everyone for their help.

Comment 23 Paul Felts 2003-10-20 17:34:20 UTC

I was plagued with this problem, myself. Attempting backup would result in 
complete kernel panic. I tried many different versions of RH 2.4 kernel, 
patched and unpatched. None of them worked. The solution, for me, was to use a 
generic, non RH, 2.4 kernel. I still get tape error messages, but the system 
seems able to backup and restore properly now. No crashes or panics.

Comment 24 James Ralston 2003-10-24 21:16:36 UTC

We upgraded to Red Hat Enterprise Linux 2.1ES to try to avoid this bug, but the
bug appears to be in the RHEL 2.4.9-e.27 kernel as well.

Since we paid for a Standard subscription to RHEL instead of just Basic, we've
opened a support ticket with Red Hat about this problem.  Hopefully they'll get
to the bottom of it soon.

Comment 25 Ed Novak 2003-11-18 20:26:57 UTC

This is now causing system hangs on a 2.4.18-27.8.0smp kernel with an 
IBM ServeRAID 5i card and a VXA-2 tape drive.  The disk array and the 
tape drive are on separate channels.  I have not tried going to a 
more recent kernel as the other comments indicate this does not make 
the problem go away.  Has anyone else found a patch for this problem?

Comment 26 Need Real Name 2003-11-18 23:46:51 UTC

Please note - as stated before, this bug is related to the lack of end
of tape recognition, not other SCSI issues.  I strongly urge each of
you to open NEW bug reports that relate to your particular errors and
panics.  Tracking the EOT bug will most likely NOT fix any of the
panics that you are all witnessing.

Comment 27 Diego Soares 2003-12-09 14:47:37 UTC

I have the same problem on my Red Hat server. I searched for it on the
internet and found this bug.

My confs:
*uname -a
Linux linux 2.4.20-24.7 #1 Mon Dec 1 13:35:11 EST 2003 i686 unknown

*more /etc/redhat-release
Red Hat Linux release 7.2 (Enigma)

*The last patch is applied: 
vi /usr/src/linux-2.4.20-24.7/drivers/scsi/st.c
Last modified: Sun Apr  6 22:44:42 2003 by makisara
   Some small formal changes - aeb, 950809

[root@linux backup]# mt -f /dev/st0 status
SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 0 bytes. Density code 0x0 (default).
Soft error count since last status=0
General status bits on (50000):
 DR_OPEN IM_REP_EN


Is this Bug 79027 SOLVED? For me it isn't fixed.
Thanks to all.

Diego S. Soares
diegosoares.br

Comment 28 Need Real Name 2004-01-14 17:00:00 UTC

Currently, we've found that the issue is resolved in 2.4.23, however,
it is NOT fixed in ANY Red Hat kernel.

What will it take to get the RH supplied kernels up to snuff?  It's
obvious that this problem is affecting MANY RH customers.

Comment 29 Arjan van de Ven 2004-01-14 17:02:41 UTC

Red Hat Linux 8.0 (and 7.x) are End Of Life so it will take a LOT to
get RH to supply a newer kernel for those.

Comment 30 Paul Felts 2004-01-14 21:16:05 UTC

Red Hat is dropping all versions of RH Linux. You will now have two 
choices: 1) RH Enterprise Linux, where you will pay (big time) for a 
support contract or 2) RH Fedora Linux, for which RH offers no 
support at all, but happily harvests the efforts of the development 
community. Lovely.

Support for ide tape drives (all versions of Linux) is equally 
lacking. The ide tape driver is completely broken and unusable and 
likely to stay that way. No one cares enough to fix it, apparently. 
The ide-scsi / st driver combination, which allows an ide tape drive 
to emulate a scsi tape drive, is broken but usable. Linus apparently 
considers ide-scsi to be "an abortion" and there isn't much interest 
in fixing ide tape problems. 

I wish things were different. :(

Comment 31 Simon Matter 2004-01-15 07:16:52 UTC

I want to respond to Arjan's comment #29: RedHat 9 is still supported
and I don't see a real difference between the 7.x and the 9 Kernel. In
fact it's one line in the .spec file. So can we expect the RedHat 9
Kernel gets fixed? I think no. I think this bug, like others, will
stay open until RedHat 9 reaches EOL and then we're told that RH9
isn't supported anymore. Being a customer who has paid real money for
many RedHat releases over the years, I'm not really happy with this.

Comment 32 Paul Felts 2004-01-16 00:47:40 UTC

Well, RH 9 reaches EOL on April 30, 2004. You don't have long to 
wait. :( Maybe the problem will be fixed in Fedora. We can hope.

Comment 33 Bugzilla owner 2004-09-30 15:40:16 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
