From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826
Description of problem:
This actually affects 7.2, 7.3, and 8.0. There are many issues with tape I/O in the 2.4 kernels. The latest 2.4.18-18.7 from Red Hat has an issue with end-of-tape conditions. The procedure SHOULD be that the drive sends a check condition, early warning end of media, during the write phase (this is not an error). st.o should then translate that into an ENOSPC and pass it to the user application (tar, cpio, et al.). This is not occurring, so the application continues to try writing data beyond the end of media. Once the I/O timeout is met, the application issues the close on the device, but the device fails when it tries to write the end-of-data filemarks.
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce:
1. Execute a backup that will exceed the capacity of the current media.
2. Watch the data rate drop to 0k/sec once the drive reaches the end of media (the drive will start reseeking, or shoe-shining).
3. Wait for the timeout (could be hours, depending on the application).
Actual Results: As described in the Description above, the application will eventually get a write error and the drive will further report the failure to write the filemarks in the syslog.
Expected Results: ENOSPC should have been sent up to the user space application by the st driver.
Additional info: This occurs in tests with DDS2, 3, & 4 DAT, LTO from Seagate and HP, DLT and VS80DLT, OnStream ADR2, Tandberg SLR, and Travan under IDE-SCSI, so I suspect that it is limited to the st.c code or some parent function in the scsi.c low-level code.
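To make the expected behavior concrete, here is a minimal Python sketch of how an application could react once the driver reports ENOSPC correctly. It is only an illustration, not how tar or cpio are implemented: `write_records`, `next_volume`, and the injectable `write` parameter are hypothetical names, and short writes are deliberately ignored to keep the sketch small.

```python
import errno
import os


def write_records(fd, records, next_volume, write=os.write):
    """Write each record to fd; on ENOSPC, switch volumes and retry the record.

    next_volume() is a hypothetical callback that returns a descriptor for
    the next tape. Any other error (for example EIO) is re-raised as a real
    failure. Returns the number of volumes used.
    """
    volumes = 1
    for record in records:
        while True:
            try:
                write(fd, record)
                break
            except OSError as exc:
                if exc.errno != errno.ENOSPC:
                    raise  # not an end-of-media signal
                fd = next_volume()  # end of media: change tape, retry record
                volumes += 1
    return volumes
```

With the bug described in this report, the ENOSPC branch is never reached: the application sees a timeout or EIO instead and treats it as a fatal write error.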
I'm using 2.4.18-19.7.x and have had problems with backups since upgrading from 2.4.9-34. I do:
    mt setblk 10240
    tar clpv --exclude boot/lost+found --exclude lost+found --exclude var/lost+found --exclude home/lost+found -b 20 -f /dev/nst0 var/log/lback.ls-laR.gz boot . var home
tar reports this:
    January 08 08:57:02 lback: tar: var/spool/postfix/public/showq: socket ignored
    January 08 09:00:55 lback: tar: home/backup: file is on a different filesystem; not dumped
    January 08 10:44:40 lback: tar: /dev/nst0: Wrote only 0 of 10240 bytes
    January 08 10:44:40 lback: tar: Error is not recoverable: exiting now
    January 08 10:44:40 lback: Command exited with non-zero status 2
syslog shows this:
    st0: Error with sense data: Info fld=0x3, Current st09:00: sense key Not Ready
    Additional sense indicates Logical unit not ready,cause not reportable
    st0: Error with sense data: Info fld=0x1, Current st09:00: sense key Not Ready
    Additional sense indicates Logical unit not ready,cause not reportable
    st0: Error on write filemark.
Is it the same bug? The backup has worked perfectly for months now and I don't understand what happened.
I can also reliably reproduce this problem on a DDS2 drive. End of tape always results in the application getting an EIO instead of an ENOSPC. The controller doesn't seem to matter: I have seen this on an aic and a sym controller, with LTO, DDS2, and DDS4 drives from different vendors. It seems that the 2.4.9-34 kernel works fine, so it has to be a kernel change since then. Perhaps in the st module? I have reported the problem to the scsi-tape maintainer and the linux-scsi list. I am happy to run any test cases or provide access to the tape drive.
Additional news. This is actually related to the check sense bit not being propagated up to the ST driver. A simpler test (beats writing 40GB to a tape ...):
1. Use a 2.2.19/20/21 or 22 kernel, or a 2.4.9-34 kernel.
2. Remove the tape from the tape device.
3. Execute: tar -cvvf /dev/nst0 /etc
   You will receive a "No medium found" message.
4. Replace the kernel with 2.4.11+ and repeat the tar write test. This time, you will receive a write failure.
This is caused by the check sense not being set and the ST driver sending up an EIO instead of the ENOMEDIUM.
Tim
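The same no-tape check can be done without tar. The Python sketch below tries a single write and reports the symbolic errno, so the difference between ENOMEDIUM and EIO shows up directly. `probe_write` is a hypothetical helper name and the 10240-byte payload is an arbitrary choice. Be aware that running it against a drive with a tape loaded will overwrite data at the current position.

```python
import errno
import os


def probe_write(path, payload=b"\0" * 10240):
    """Attempt one write to path.

    Returns the symbolic errno name (for example "ENOMEDIUM" or "EIO")
    if open, write, or close fails, and None if everything succeeds.
    """
    try:
        fd = os.open(path, os.O_WRONLY)
        try:
            os.write(fd, payload)
        finally:
            os.close(fd)  # on a tape device, close may itself report an error
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Going by the comment above, `probe_write("/dev/nst0")` with the tape removed would be expected to return "ENOMEDIUM" on a working kernel and "EIO" on an affected one.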
Kai Makisara has posted a fix for the EOT bug at: http://www.kolumbus.fi/kai.makisara/st-eot.html
This is great, but how can I start writing to the device again after that? Do I need to reboot?
    [root@svrlinux log]# mt -f /dev/st0 status
    /dev/st0: No such device or address
    [root@svrlinux log]# cat messages* | grep st0
    Apr 23 23:43:15 svrlinux kernel: st0: Error 6000000 (sugg. bt 0x0, driver bt 0x6, host bt 0x0).
    Apr 23 23:44:58 svrlinux kernel: st0: Error 30000 (sugg. bt 0x0, driver bt 0x0, host bt 0x3).
    Apr 23 23:44:58 svrlinux kernel: st0: Error on write filemark.
    Apr 8 17:49:52 svrlinux kernel: Attached scsi tape st0 at scsi2, channel 0, id 0, lun 0
    Apr 8 17:49:52 svrlinux kernel: st0: Block limits 1 - 16777215 bytes.
April 8th is the date I rebooted and April 23rd is the date the bug appeared. Now I can't use my tape drive. Is there a way to correct this without having to reboot?
Francois,
You will need to reboot ONLY if st is not a module. If you lsmod, is st in the modules list? If so, try rmmod'ing it and then insmod it back in. This works in most situations, but there are some situations where you can't rmmod the module. In those cases, you will need to reboot. BTW - Kai has produced a patch for 2.4.20 which will (hopefully) get into 2.4.21, but we're not sure. Tim
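As a rough aid for the check described above, this Python sketch looks for st in /proc/modules, which is the file lsmod reads. It assumes the first column is the module name and the third column is the use count. `module_use_count` and `read_proc_modules` are hypothetical helper names.

```python
def module_use_count(name, proc_modules_text):
    """Return the use count of module name, or None if it is not listed.

    None suggests the driver is compiled into the kernel (or not loaded),
    in which case rmmod cannot help.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Assumed layout: name, size, use count, then optional extras.
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's loaded-module list as text."""
    with open(path) as handle:
        return handle.read()
```

Calling `module_use_count("st", read_proc_modules())` returns a number when st is a loaded module. A non-zero count usually means something still holds the device open, which is one of the situations where rmmod will refuse to unload it.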
No, only the 2.4.21-ac kernels have this patch; it is not in 2.4.21-rc1. However, the RH kernels are based on -ac.
The new kernel 2.4.20-13.9 has a recent st version, '20030406', so in theory this bug should be closed.
On our system this bug still lives on. I tried kernel versions 2.4.20-13.7smp and 2.4.20-18.9smp. After rmmod st, insmod st, and running my backup script, I have the following messages:
    st: Unloaded.
    st: Version 20030406, bufsize 32768, max init. bufs 4, s/g segs 16
    Attached scsi tape st0 at scsi0, channel 0, id 0, lun 0
    st0: Block limits 1 - 16777215 bytes.
    st0: Error 10000 (sugg. bt 0x0, driver bt 0x0, host bt 0x1).
    st0: Error 8 (sugg. bt 0x0, driver bt 0x0, host bt 0x0).
    st0: Error on write filemark.
The tape drive is a COMPAQ Model: SDT-10000. cat /proc/scsi/scsi says:
    Attached devices:
    Host: scsi0 Channel: 00 Id: 00 Lun: 00
      Vendor: COMPAQ Model: SDT-10000 Rev: 1.09
      Type: Sequential-Access ANSI SCSI revision: 02
I cleaned the drive twice and tried a new cartridge. No changes. Any ideas on that?
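For anyone trying to read the "st0: Error ..." lines in this thread, the hexadecimal value packs several bytes into one word, and the "driver bt" and "host bt" fields printed next to it are slices of the same number. The Python sketch below splits the word apart. The name tables are a partial list written from memory of the kernel's SCSI headers and should be checked against the scsi.h of the kernel in use. `decode_result` is a hypothetical helper name.

```python
# Partial name tables; verify against the kernel's scsi.h before relying on them.
HOST_BYTE = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
}
DRIVER_BYTE = {
    0x00: "DRIVER_OK", 0x01: "DRIVER_BUSY", 0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA", 0x04: "DRIVER_ERROR", 0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT", 0x07: "DRIVER_HARD", 0x08: "DRIVER_SENSE",
}
STATUS_BYTE = {0x00: "GOOD", 0x02: "CHECK_CONDITION", 0x08: "BUSY"}


def decode_result(result):
    """Split a SCSI result word into named status, host, and driver parts."""
    status = result & 0xFF           # bits 0-7: SCSI status byte
    host = (result >> 16) & 0xFF     # bits 16-23: host adapter code
    driver = (result >> 24) & 0x0F   # bits 24-27: mid-level driver code
    return {
        "status": STATUS_BYTE.get(status, hex(status)),
        "host": HOST_BYTE.get(host, hex(host)),
        "driver": DRIVER_BYTE.get(driver, hex(driver)),
    }


if __name__ == "__main__":
    # Values copied from the log lines quoted in this thread.
    for text in ("6000000", "30000", "10000", "8"):
        print(text, decode_result(int(text, 16)))
```

Read this way, the values in this report point at timeouts, a device that stopped responding, or a busy device rather than at an end-of-media sense condition, which fits the later advice to track such failures separately from the EOT bug.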
This instance is definitely related to the problem. What software are you using to write to tape?
I see this bug is closed, but I don't see how it was resolved. Can someone point me to the solution? Thanks!
2.4.20-18.9 has the patch applied (RH's latest production kernel).
Maybe I'm missing the boat here. k-2.4.20-18.9 doesn't seem to be offered as a download for RH 7.3 through RHN, so I installed k-2.4.20-19.7 from rpmfind (http://www.rpmfind.net/linux/RPM/redhat/updates/7.3/i386/kernel-2.4.20-19.7.i386.html), figuring this version would have the fix. It didn't help. Was I wrong to assume? Thanks to everyone for their help.
k-2.4.20-18.9 has not fixed this issue. What else can I do?
This bug is not fixed in the recent kernel 2.4.20-20.7 on RH 7.3 :( It still causes a kernel panic and hangs the machine.
Update to all: First, please remember that this bug is ONLY related to the lack of end of tape (EOT) recognition. This is not a generic SCSI issue nor does it apply to non-tape problems. Next, while RH's 2.4.20-18.9 was originally determined to have the patch applied, it appears that it does not, and the successful runs were simply luck. At this point, I can only recommend that you all contact RH directly if you are under RH support and request that this be fixed immediately. Otherwise, if you don't mind moving away from the RH-supplied kernels, you can resolve the problem with SCSI EOT recognition by downloading the generic 2.4.20 kernel from ftp.kernel.org, downloading Kai's patch from http://www.kolumbus.fi/kai.makisara/st-eot.html, and patching the standard kernel.
Created attachment 94036
Kai's Patch for Generic 2.4.20 kernel
This is Kai Makisara's patch for the generic (non-RH) 2.4.20 kernel. Without this patch, the 2.4 kernels since 2.4.9 are broken in a manner that prevents recognition and proper processing of the early warning end of tape message from a tape drive. This results in failures in backups when more than one tape is required. This affects ANY backup application that uses the standard /dev/st?? device drivers under the 2.4 kernels.
I patched a kernel.org 2.4.20 kernel with this patch and it didn't fix the problem on my server. I still see the same messages during backup:
    scsi: device set offline - not ready or command retry failed after host reset: host 0 channel 2 id 3 lun 0
    st0: Error 30000 (sugg. bt 0x0, driver bt 0x0, host bt 0x3).
I have a ServeRAID 5i controller with a 40/80 GB DLT1 in an IBM server. Is there anything else to try?
Yes - place your tape drive onto a separate SCSI adapter. You should never place a tape drive onto the same SCSI HBA as your primary or RAID'ed SCSI disks. First, this is a different error than what we're discussing here and I suspect the problem is with the Dell/PercRAID adapter. Additionally, I've seen many issues with tape attached to the Dell RAID adapter that have absolutely nothing to do with the EOT issue that this report concerns. I would recommend adding a low cost 40MHz Ultra Wide controller such as the SiiG AP40 (US$69 through CompUSA.com) or an LSI Logic HBA (usually less than US$100) specifically for the tape drive.
Sorry, that wording should have said "a problem with the Dell/PercRAID Linux device driver." The Dell RAID controller works fine under Linux with the RAID functions and disks. Tim
I downloaded kernel-2.4.20-20.7.src.rpm (the latest errata kernel) and went through drivers/scsi/st.c and Kai Makisara's patch line-by-line. RED HAT'S 2.4.20-20 KERNEL SERIES HAS KAI'S PATCH APPLIED. Specifically, Kai's patch was rolled into this patch: Patch5: linux-2.4.20-later-ac-updates.patch If you check the bug history, you'll see that this bug was closed on 2003-06-08 with CURRENTRELEASE as the solution. That was almost certainly when Red Hat added Kai's patch. The problem is *not* that Red Hat's latest kernels don't have the patch applied. Rather, the problem is that Kai's patch doesn't fix the EOT recognition problem--at least not in all circumstances. :( Those of us who are affected by this problem need to let Kai know that his patch doesn't completely solve the problem, and be prepared to give him the debugging output he'll need to further diagnose the problem.
RH 7.3: I patched the kernel with 2.4.20-20.7 and it didn't fix the problem on my server. Backup (cpio) works fine with a DDS3 DAT but fails with DDS4. When I try to make a backup with a DDS4 DAT, I get the following message on the console:
    Found end of volume. To continue type device/file name when ready.
and in the log files:
    st0: Block limits 1 - 16777215 bytes.
    (ips0) Reset Request - Flushed Cache
    scsi: device set offline - not ready or command retry failed after host reset: host 0 channel 2 id 6 lun 0
    st0: Error 30000 (sugg. bt 0x0, driver bt 0x0, host bt 0x3).
    st0: Error with sense data: Current st09:00: sense key Unit Attention
    Additional sense indicates Power on,reset,or bus device reset occurred
    st0: Error on write filemark.
And I can't access the tape without a cold reboot. Thanks to everyone for their help.
I was plagued with this problem, myself. Attempting backup would result in complete kernel panic. I tried many different versions of RH 2.4 kernel, patched and unpatched. None of them worked. The solution, for me, was to use a generic, non RH, 2.4 kernel. I still get tape error messages, but the system seems able to backup and restore properly now. No crashes or panics.
We upgraded to Red Hat Enterprise Linux 2.1ES to try to avoid this bug, but the bug appears to be in the RHEL 2.4.9-e.27 kernel as well. Since we paid for a Standard subscription to RHEL instead of just Basic, we've opened a support ticket with Red Hat about this problem. Hopefully they'll get to the bottom of it soon.
This is now causing system hangs on a 2.4.18-27.8.0smp kernel with an IBM ServeRAID 5i card and a VXA-2 tape drive. The disk array and the tape drive are on separate channels. I have not tried going to a more recent kernel as the other comments indicate this does not make the problem go away. Has anyone else found a patch for this problem?
Please note - as stated before, this bug is related to the lack of end of tape recognition, not other SCSI issues. I strongly urge each of you to open NEW bug reports that relate to your particular errors and panics. Tracking the EOT bug will most likely NOT fix any of the panics that you are all witnessing.
I have the same problem on my Red Hat server. I searched the internet for it and found this bug. My configuration:
* uname -a
    Linux linux 2.4.20-24.7 #1 Mon Dec 1 13:35:11 EST 2003 i686 unknown
* more /etc/redhat-release
    Red Hat Linux release 7.2 (Enigma)
* The last patch is applied: vi /usr/src/linux-2.4.20-24.7/drivers/scsi/st.c
    Last modified: Sun Apr 6 22:44:42 2003 by makisara
    Some small formal changes - aeb, 950809
[root@linux backup]# mt -f /dev/st0 status
    SCSI 2 tape drive:
    File number=-1, block number=-1, partition=0.
    Tape block size 0 bytes. Density code 0x0 (default).
    Soft error count since last status=0
    General status bits on (50000):
    DR_OPEN IM_REP_EN
Is this bug 79027 really SOLVED? It is not fixed for me. Thanks to all.
Diego S. Soares
diegosoares.br
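The "General status bits on (50000)" value in the mt output above is a bit mask. The Python sketch below decodes it using the GMT_* bit values as I recall them from the Linux mtio.h header; treat the table as an assumption and verify it against your own header. `decode_gmt` is a hypothetical helper name.

```python
# Bit values recalled from linux/mtio.h (GMT_* macros); verify before relying on them.
GMT_BITS = [
    (0x80000000, "EOF"), (0x40000000, "BOT"), (0x20000000, "EOT"),
    (0x10000000, "SM"), (0x08000000, "EOD"), (0x04000000, "WR_PROT"),
    (0x01000000, "ONLINE"), (0x00800000, "D_6250"), (0x00400000, "D_1600"),
    (0x00200000, "D_800"), (0x00040000, "DR_OPEN"), (0x00010000, "IM_REP_EN"),
]


def decode_gmt(status):
    """Return the names of the general status bits set in status."""
    return [name for mask, name in GMT_BITS if status & mask]


if __name__ == "__main__":
    print(decode_gmt(0x50000))
```

For 0x50000 this gives the same DR_OPEN and IM_REP_EN flags that mt printed. DR_OPEN together with the absence of ONLINE usually indicates that the drive door is open or no tape is loaded, so this particular status output may not show the EOT problem at all.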
Currently, we've found that the issue is resolved in 2.4.23. However, it is NOT fixed in ANY Red Hat kernel. What will it take to get the RH-supplied kernels up to snuff? It's obvious that this problem is affecting MANY RH customers.
Red Hat Linux 8.0 (and 7.x) are End Of Life, so it will take a LOT to get RH to supply a newer kernel for those.
Red Hat is dropping all versions of RH Linux. You will now have two choices: 1) RH Enterprise Linux, where you will pay (big time) for a support contract or 2) RH Fedora Linux, for which RH offers no support at all, but happily harvests the efforts of the development community. Lovely. Support for ide tape drives (all versions of Linux) is equally lacking. The ide tape driver is completely broken and unusable and likely to stay that way. No one cares enough to fix it, apparently. The ide-scsi / st driver combination, which allows an ide tape drive to emulate a scsi tape drive, is broken but usable. Linus apparently considers ide-scsi to be "an abortion" and there isn't much interest in fixing ide tape problems. I wish things were different. :(
I want to respond to Arjan's comment #29: RedHat 9 is still supported and I don't see a real difference between the 7.x and the 9 kernel. In fact it's one line in the .spec file. So can we expect the RedHat 9 kernel to get fixed? I think not. I think this bug, like others, will stay open until RedHat 9 reaches EOL, and then we'll be told that RH9 isn't supported anymore. As a customer who has paid real money for many RedHat releases over the years, I'm not really happy with this.
Well, RH 9 reaches EOL on April 30, 2004. You don't have long to wait. :( Maybe the problem will be fixed in Fedora. We can hope.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/