Bug 162212 - st causes system hang and kernel panic when writing to tape on x86_64
Summary: st causes system hang and kernel panic when writing to tape on x86_64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168424
Reported: 2005-06-30 20:53 UTC by Josef Pfeiffer
Modified: 2008-01-10 18:44 UTC

Fixed In Version: RHSA-2006-0144
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-23 19:53:26 UTC
Target Upstream Version:
Embargoed:


Attachments
Attaching an oops captured via Netdump at the time the failure occurred. (3.44 KB, text/plain)
2005-07-30 00:52 UTC, Johnray Fuller


Links
System: Red Hat Product Errata
ID: RHSA-2006:0144
Private: 0
Priority: qe-ready
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 7
Last Updated: 2006-03-15 05:00:00 UTC

Description Josef Pfeiffer 2005-06-30 20:53:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
While testing Veritas NetBackup 5.1 MP3 on Red Hat Enterprise Linux 3 x86_64, we ran across this issue.  The system locks up when it begins writing to a second tape path.  This forces a reboot, and the system then throws a kernel panic:

System Trace
------------
{pci_map_sg+227}
{:qla2300: qla2x00_64bit_start_scsi+888}
{:qla2300: qla2x00_queuecommand+1297}
{:qla2300: qla2x00_next+532}
{:qla2300: qla2x00_queuecommand+1297}
{scsi_mod: scsi_times_out+0}
{:scsi_mod: scsi_dispatch_cmd+640}
{:scsi_mod: scsi_request_fn+1041}
{:scsi_mod: __scsi_insert_special_req+127}
{:scsi_mod: scsi_insert_special_req+31}
{:scsi_mod: scsi_do_req_Rsmp_ff69bbf9+350}
{:st: st_sleep_done+0}
{:st: st_do_scsi+310}
{:st: read_tape+260}
{__get_free_pages+16}
{:st: st_read+829}
{:sys_read+178}
{ia32_syscall+103}

Code:
0f 0b b6 7c 2d 80 ff ff ff ff 2b 00 eb 5d 48 8b 4b 08 48 85

Kernel panic: Fatal exception

The system shows this panic every time it is booted until the device is detached from the host.
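
As a reading aid for the Code: line above: the first two bytes (0f 0b) are the x86 ud2 instruction, which kernels of this era use to implement BUG(). The sketch below assumes the x86_64 BUG() frame layout of that period (ud2, then an 8-byte pointer to the source file name, then a 2-byte line number). That layout is an assumption about this kernel build, not something stated in the report, and decode_bug_frame is a hypothetical helper name.

```python
import struct

# Bytes copied from the "Code:" line of the oops in this report.
CODE_LINE = "0f 0b b6 7c 2d 80 ff ff ff ff 2b 00 eb 5d 48 8b 4b 08 48 85"


def decode_bug_frame(code_line):
    """Decode an ASSUMED x86_64 BUG() frame.

    Assumed layout: ud2 (2 bytes), filename pointer (8 bytes,
    little-endian), line number (2 bytes, little-endian).
    """
    raw = bytes.fromhex(code_line.replace(" ", ""))
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("code does not start with ud2 (0f 0b)")
    # Skip the two ud2 bytes, then read the pointer and the line number.
    filename_ptr, line = struct.unpack_from("<QH", raw, 2)
    return filename_ptr, line


ptr, line = decode_bug_frame(CODE_LINE)
print(f"filename pointer: {ptr:#x}, line: {line}")
```

Under that assumption the frame decodes to line 43 of a source file whose name is stored at 0xffffffff802d7cb6, which would be consistent with an explicit BUG() check firing in pci_map_sg rather than a stray memory fault. The file name itself cannot be recovered from the report.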

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Install Red Hat Enterprise Linux 3 (x86_64)
2. Install NetBackup 5.1 MP3
3. Create a policy that will write to two devices at the same time
4. Start the policy; the system locks up
5. Reboot to see the kernel panic
6. Unplug the device, or else the system will panic every time it boots
  

Actual Results:  System locked.
Kernel panic.

Expected Results:  Should keep writing to tapes.

Additional info:

This issue was seen on two different machine types: a Sun Fire V20z (dual Opterons) and a Supermicro (dual EM64T processors).  The exact same machines were loaded with the 32-bit build of RHEL 3 and worked without a problem.

Tried recreating this using tar and mt, but the system does not lock.  The 32-bit Linux build of NetBackup was used on both 32-bit RHEL 3, which didn't show this issue, and RHEL 3 x86_64, which did.  This and the system trace lead us to believe that it is not a NetBackup issue.  Qlogic and Emulex have investigated the problem, since it was their driver that originally called st, but they have not found anything in their debug logs that indicates an issue with their driver.
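
A minimal sketch of a NetBackup-independent attempt to drive two tape paths at once, in the spirit of the tar and mt attempt above. The device paths /dev/nst0 and /dev/nst1, the block size and count, and the helper names are all hypothetical. The trace passes through ia32_syscall, so a reproducer may need to run as a 32-bit process (for example under a 32-bit Python build) to take the same kernel path; that is an inference from the trace, not something confirmed in this report.

```python
import threading


def write_blocks(path, block_size=64 * 1024, count=1024):
    """Write `count` zero-filled blocks of `block_size` bytes to `path`."""
    block = bytes(block_size)
    written = 0
    # buffering=0 so that each write() call becomes one write syscall.
    with open(path, "wb", buffering=0) as dev:
        for _ in range(count):
            written += dev.write(block)
    return written


def write_concurrently(paths, **kwargs):
    """Run write_blocks() on every path in parallel; return bytes written per path."""
    results = {}

    def worker(path):
        results[path] = write_blocks(path, **kwargs)

    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

On a test system this could be called as write_concurrently(["/dev/nst0", "/dev/nst1"]), with the paths replaced by the two tape devices under test. It does not mimic NetBackup's actual I/O pattern, so a clean run would not rule out the bug.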

Found a similar (but different) bug here:
https://bugzilla.redhat.com/bugzilla/process_bug.cgi
Sebastien BLAISOT (sblaisot)
Comment #14 has the same panic trace that I had.

Comment 10 Peter Martuccelli 2005-09-07 18:02:40 UTC
Engineering is waiting on the output from the latest IT post requesting the
dmesg output.

Comment 24 Doug Ledford 2005-09-14 21:37:46 UTC
The kernel rpms can be downloaded from
http://people.redhat.com/dledford/st_tape_test/

Comment 35 Doug Ledford 2005-09-21 02:14:16 UTC
A new set of test kernel RPMs has been posted to the same place as before. 
These include the fix for the sg+st write bug and one other tweak that might
help with this problem.  Please test these out and let me know the results.

Comment 49 Jeff Needle 2005-10-10 19:30:42 UTC
NEEDINFO_REPORTER does not seem to be the correct state for this, moving back to
ASSIGNED.

Comment 55 Ernie Petrides 2005-10-20 05:44:26 UTC
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.6.EL).


Comment 60 Ernie Petrides 2005-11-03 20:34:36 UTC
*** Bug 156396 has been marked as a duplicate of this bug. ***

Comment 69 Red Hat Bugzilla 2006-03-15 16:10:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html


Comment 70 Josef Pfeiffer 2006-03-15 16:13:17 UTC
I tested RC1 of Update 7 and it is causing the same panic.  The hotfix provided
a few months back did resolve the issue and several customers are using it
without problems.

Comment 71 Gary Case 2006-03-15 17:48:07 UTC
Josef,

Can you tell me the kernel version on the U7 RC1 system you're using? I want to
make sure it should have had the fix incorporated into it.


Comment 73 Josef Pfeiffer 2006-03-15 22:44:35 UTC
2.4.21-40.EL
I obtained the Update from the FTP location here:
ftp://partners.redhat.com/45cf7905562e922e7817d4a01ca8be26/RHEL3-U7/

(In reply to comment #71)
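
For the version question raised in comment 71, a rough sketch of comparing kernel release strings. It treats each release as a tuple of its numeric fields, which is a simplification and not RPM's actual version comparison; kernel_key and at_least are hypothetical helper names.

```python
import re


def kernel_key(version):
    """Turn a release string such as '2.4.21-37.6.EL' into (2, 4, 21, 37, 6)."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def at_least(version, baseline):
    """True if `version` sorts at or after `baseline` by numeric fields."""
    return kernel_key(version) >= kernel_key(baseline)
```

With the values from this thread, at_least("2.4.21-40.EL", "2.4.21-37.6.EL") is true, so the U7 RC1 kernel is newer than the build named in comment 55. A newer release number alone does not prove that the patch is present or effective in that build.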

Comment 77 Andrius Benokraitis 2006-03-23 19:53:26 UTC
Josef, let's move this thread to bug 182996 because this current bug was for
RHEL 3 U7 and bug 182996 is for RHEL 3 U8. Let's consider this CLOSED and bug
182996 as a regression to the "fix".

Comment 78 Ernie Petrides 2006-03-24 02:06:32 UTC
Fixing bug's disposition (reverting to ERRATA).


