Bug 162212

Summary: st causes system hang and kernel panic when writing to tape on x86_64
Product: Red Hat Enterprise Linux 3 Reporter: Josef Pfeiffer <josef_pfeiffer>
Component: kernel Assignee: Doug Ledford <dledford>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 3.0 CC: cjones, coldwell, coughlan, gcase, jeff.johnson, lakamine, peterm, petrides, rkenna, sblaisot, vkanakas
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0144 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-03-23 19:53:26 UTC Type: ---
Bug Depends On:    
Bug Blocks: 168424    
Attachments:
Description: Attaching an oops captured via Netdump at the time the failure occurred. Flags: none

Description Josef Pfeiffer 2005-06-30 20:53:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
While testing Veritas NetBackup 5.1 MP3 on Red Hat Enterprise Linux 3 x86_64, we ran across this issue.  The system locks up when it begins writing to a second tape path.  This forces a reboot, after which the system throws a kernel panic:

System Trace
------------
{pci_map_sg+227}
{:qla2300: qla2x00_64bit_start_scsi+888}
{:qla2300: qla2x00_queuecommand+1297}
{:qla2300: qla2x00_next+532}
{:qla2300: qla2x00_queuecommand+1297}
{:scsi_mod: scsi_times_out+0}
{:scsi_mod: scsi_dispatch_cmd+640}
{:scsi_mod: scsi_request_fn+1041}
{:scsi_mod: __scsi_insert_special_req+127}
{:scsi_mod: scsi_insert_special_req+31}
{:scsi_mod: scsi_do_req_Rsmp_ff69bbf9+350}
{:st: st_sleep_done+0}
{:st: st_do_scsi+310}
{:st: read_tape+260}
{__get_free_pages+16}
{:st: st_read+829}
{sys_read+178}
{ia32_syscall+103}

Code:
0f 0b b6 7c 2d 80 ff ff ff ff 2b 00 eb 5d 48 8b 4b 08 48 85

Kernel panic: Fatal exception

The system shows this panic every time it is booted until the device is detached from the host.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Install Red Hat Enterprise Linux 3 for x86_64 (64-bit)
2. Install NetBackup 5.1 MP3
3. Create a policy that will write to two devices at the same time
4. Start the policy; the system locks up
5. Reboot to see the kernel panic
6. Unplug the device, or else the system will panic every time it boots
  

Actual Results:  System locked.
Kernel panic.

Expected Results:  Should keep writing to tapes.

Additional info:

This issue was seen on two different machine types: a Sun Fire V20z (dual Opterons) and a Supermicro (dual EM64T CPUs).  The exact same machines were loaded with the 32-bit build of RHEL 3 and worked without a problem.

We tried recreating this using tar and mt, but the system does not lock up.  The 32-bit Linux build of NetBackup was used on both RHEL 3 32-bit, which did not show this issue, and RHEL 3 x86_64, which did.  This, together with the system trace, leads us to believe that it is not a NetBackup issue.  Qlogic and Emulex have investigated the problem, since it was their driver that originally called st, but they have not found anything in their debug logs that indicates an issue with their driver.

Found a similar (but different) bug here:
https://bugzilla.redhat.com/bugzilla/process_bug.cgi
Sebastien BLAISOT (sblaisot)
Comment #14 has the same panic trace that I had.

Comment 10 Peter Martuccelli 2005-09-07 18:02:40 UTC
Engineering is waiting on the output from the latest IT post requesting the
dmesg output.

Comment 24 Doug Ledford 2005-09-14 21:37:46 UTC
The kernel rpms can be downloaded from
http://people.redhat.com/dledford/st_tape_test/

Comment 35 Doug Ledford 2005-09-21 02:14:16 UTC
A new set of test kernel RPMs has been posted to the same place as before.
These include the fix for the sg+st write bug and one other tweak that might
help with this problem.  Please test these out and let me know the results.

Comment 49 Jeff Needle 2005-10-10 19:30:42 UTC
NEEDINFO_REPORTER does not seem to be the correct state for this, moving back to
ASSIGNED.

Comment 55 Ernie Petrides 2005-10-20 05:44:26 UTC
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.6.EL).


Comment 60 Ernie Petrides 2005-11-03 20:34:36 UTC
*** Bug 156396 has been marked as a duplicate of this bug. ***

Comment 69 Red Hat Bugzilla 2006-03-15 16:10:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html


Comment 70 Josef Pfeiffer 2006-03-15 16:13:17 UTC
I tested RC1 of Update 7 and it is causing the same panic.  The hotfix provided
a few months back did resolve the issue and several customers are using it
without problems.

Comment 71 Gary Case 2006-03-15 17:48:07 UTC
Josef,

Can you tell me the kernel version on the U7 RC1 system you're using? I want to
make sure it should have had the fix incorporated into it.


Comment 73 Josef Pfeiffer 2006-03-15 22:44:35 UTC
(In reply to comment #71)

2.4.21-40.EL
I obtained the Update from the FTP location here:
ftp://partners.redhat.com/45cf7905562e922e7817d4a01ca8be26/RHEL3-U7/

Comment 77 Andrius Benokraitis 2006-03-23 19:53:26 UTC
Josef, let's move this thread to bug 182996 because this current bug was for
RHEL 3 U7 and bug 182996 is for RHEL 3 U8. Let's consider this CLOSED and bug
182996 as a regression to the "fix".

Comment 78 Ernie Petrides 2006-03-24 02:06:32 UTC
Fixing bug's disposition (reverting to ERRATA).