Bug 427576 - Reading/writing a tape drive sometimes fails with scsi errors
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: i386
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-01-04 20:52 UTC by Kern Sibbald
Modified: 2013-04-30 12:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-04-30 12:47:03 UTC
Target Upstream Version:
Embargoed:


Attachments
scsi errors found in /var/log/messages corresponding to failures (8.85 KB, text/plain)
2008-01-04 20:52 UTC, Kern Sibbald

Description Kern Sibbald 2008-01-04 20:52:58 UTC
Description of problem:
When Bacula (a network backup program) attempts to write to a tape, the write 
fails and the kernel reports a number of SCSI errors.


Version-Release number of selected component (if applicable):
kernel-2.6.18-53.1.4.el5

The problem does not exist in:
kernel-2.6.18-8.1.15.el5



How reproducible:
Run the Bacula regression tests with a tape test. Activating the autochanger 
makes the situation worse, to the point that Bacula can do no I/O to the 
drive at all, even when the autochanger is not in use.


Steps to Reproduce:
1. Run the Bacula tape regression scripts, preferably with an autochanger.
  
Actual results:
Unable to use Bacula to write to a tape drive.


Expected results:
With a different kernel, everything works fine.


Additional info:
- I am actually running CentOS 5.1
- The problem does not exist on the CentOS 5.0 kernel (running on a system 
that is otherwise 5.1).
- This may have something to do with autochangers, but I don't think so.  In 
any case, activating an autochanger using mtx increases the problem.
- This problem looks much like one I saw on SuSE a year ago when they were 
running a similar kernel version.  It was a *very* serious problem that led 
to system lockups and, on one of my machines, the total loss of a hard disk. 

The kernel (SCSI) error messages are attached.

Comment 1 Kern Sibbald 2008-01-04 20:52:58 UTC
Created attachment 290859
scsi errors found in /var/log/messages corresponding to failures

Comment 2 Kern Sibbald 2008-01-05 20:11:21 UTC
Checking back on the previous SuSE kernel problem, I found that it was in 
kernel 2.6.16, so it is not related to this bug.

This is a bit complicated and I am still analyzing it, but it looks like one 
thread is doing write() operations to the drive while another thread calls 
mtx, which removes the tape from the drive.  That confuses the kernel tape 
driver, and from then on, even after the drive is close()d and open()ed 
again, it fails to work, always returning either a Resource is Busy error or 
an I/O Error. The only way I have found to make the drive usable again is to 
reboot.  If you have any less drastic way of clearing the kernel driver, I 
would like to know.
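
The suspected sequence above (one thread writing while another unloads the tape through mtx) could be sketched roughly as follows. This is an illustrative reconstruction, not Bacula's code or the regression script: the device paths, the 64512-byte block size, and all function names are assumptions, and the only external command used is `mtx -f <changer> unload`.

```python
import errno
import os
import subprocess
import threading

# Assumed device nodes; adjust for the actual system.
TAPE_DEVICE = "/dev/nst0"     # hypothetical non-rewinding tape device
CHANGER_DEVICE = "/dev/sg1"   # hypothetical autochanger SCSI generic node


def classify_os_error(exc):
    """Map an OSError to the two failure modes described in this comment."""
    if exc.errno == errno.EBUSY:
        return "busy"
    if exc.errno == errno.EIO:
        return "io-error"
    return "other"


def write_blocks(path, count, block_size=64512, opener=os.open):
    """Write `count` zero-filled blocks; return (blocks_written, failure)."""
    block = bytes(block_size)
    written = 0
    try:
        fd = opener(path, os.O_WRONLY)
    except OSError as exc:
        return written, classify_os_error(exc)
    try:
        for _ in range(count):
            os.write(fd, block)
            written += 1
    except OSError as exc:
        return written, classify_os_error(exc)
    finally:
        try:
            os.close(fd)  # closing a tape device can itself report an error
        except OSError:
            pass
    return written, None


def unload_tape(changer=CHANGER_DEVICE, runner=subprocess.call):
    """Ask mtx to unload the drive while the writer may still be running."""
    return runner(["mtx", "-f", changer, "unload"])


def run_race(path=TAPE_DEVICE, count=1000, unload=unload_tape):
    """Start a writer thread, then unload the tape from the calling thread."""
    result = {}
    writer = threading.Thread(
        target=lambda: result.update(outcome=write_blocks(path, count)))
    writer.start()
    unload()  # races with the writes in progress
    writer.join()
    return result["outcome"]
```

Calling `run_race()` on a scratch tape and then calling `write_blocks(TAPE_DEVICE, 1)` again would show whether the drive stays stuck in the "busy" or "io-error" state after a fresh open(). This should only be tried on a test machine, given that a reboot was the only recovery found.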

In addition, despite what I previously said, it seems to occur on both kernels 
mentioned, though I cannot reproduce it on later kernels (SuSE 10.2).

Comment 3 Tom Coughlan 2009-02-11 14:12:47 UTC
(In reply to comment #2)
 
> If you have any less drastic way of clearing the kernel driver, I would like 
> to know.

I assume you tried rmmod st.ko?

From the "Dumping Card State" messages, it appears you have an aic* HBA. You should rmmod/insmod that driver as well to try to clear the error.
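
As a rough sketch of that suggestion, the following lists the commands needed to unload and reload the st driver and any loaded module whose name starts with "aic". It uses `modprobe -r` / `modprobe` in place of rmmod/insmod so that no module file path is needed. The function names and the "aic" prefix match are assumptions, and the exact HBA module name on the reporter's system is not known from this report.

```python
import subprocess


def loaded_modules(proc_modules_text):
    """Return module names from /proc/modules-style text (first field)."""
    return [line.split()[0]
            for line in proc_modules_text.splitlines() if line.strip()]


def reload_commands(proc_modules_text, hba_prefix="aic"):
    """Build modprobe commands to unload, then reload, st and aic* drivers."""
    names = loaded_modules(proc_modules_text)
    hba = [name for name in names if name.startswith(hba_prefix)]
    targets = (["st"] if "st" in names else []) + hba
    # Unload the tape driver before the HBA driver, then reload in reverse.
    unload = [["modprobe", "-r", name] for name in targets]
    load = [["modprobe", name] for name in reversed(targets)]
    return unload + load


def reload_drivers(dry_run=True, runner=subprocess.check_call):
    """Print the commands (dry run) or execute them; root is needed to run."""
    with open("/proc/modules") as fh:
        commands = reload_commands(fh.read())
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            runner(cmd)
    return commands
```

Unloading the HBA driver will fail while other devices on that adapter (for example mounted disks) are in use, so the dry-run default is deliberate.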


> In addition, despite what I previously said, it seems to occur on both kernels 
> mentioned, though I cannot reproduce it on later kernels (SuSE 10.2).

So, kernel-2.6.18-53.1.4.el5 fails. Please try a newer kernel-2.6.18...el5 version.

Comment 4 Jes Sorensen 2013-04-30 12:47:03 UTC
This bugzilla is really old and there has been no reply to Tom's questions
for years, so I am taking the liberty of closing it.

If you experience similar problems with a newer version of RHEL, please open
a new bugzilla.

Jes

