From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
While testing Veritas NetBackup 5.1 MP3 on RedHat 3 x86_64, we ran across this issue. The system locks up when it begins writing to a second tape path. This forces a reboot, and the system then throws a kernel panic:

System Trace
------------
{pci_map_sg+227}
{:qla2300: qla2x00_64bit_start_scsi+888}
{:qla2300: qla2x00_queuecommand+1297}
{:qla2300: qla2x00_next+532}
{:qla2300: qla2x00_queuecommand+1297}
{scsi_mod: scsi_times_out+0}
{:scsi_mod: scsi_dispatch_cmd+640}
{:scsi_mod: scsi_request_fn+1041}
{:scsi_mod: __scsi_insert_special_req+127}
{:scsi_mod: scsi_insert_special_req+31}
{:scsi_mod: scsi_do_req_Rsmp_ff69bbf9+350}
{:st: st_sleep_done+0}
{:st: st_do_scsi+310}
{:st: read_tape+260}
{__get_free_pages+16}
{:st: st_read+829}
{:sys_read+178}
{ia32_syscall+103}
Code: 0f 0b b6 7c 2d 80 ff ff ff ff 2b 00 eb 5d 48 8b 4b 08 48 85
Kernel panic: Fatal exception

The system panics like this every time it boots until the device is detached from the host.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Install RedHat 3 x86_64.
2. Install NetBackup 5.1 MP3.
3. Create a policy that writes to two devices at the same time.
4. Run the policy; the system locks up.
5. Reboot to see the kernel panic.
6. Unplug the device, or the system will panic every time it boots.

Actual Results:
System locked. Kernel panic.

Expected Results:
Should keep writing to tapes.

Additional info:
This issue was seen on two different machine types: a Sun Fire V20z (dual Opterons) and a Supermicro (dual EM64Ts). The exact same machines were loaded with the 32-bit build of RH3 and worked without a problem. We tried to reproduce this using tar and mt, but the system did not lock up.
The 32-bit Linux build of NetBackup was used on both RH3 32-bit, which did not show this issue, and RH3 x86_64, which did. This, together with the system trace, leads us to believe that it is not a NetBackup issue. QLogic and Emulex have investigated the problem, since their driver appears in the call path below st, but they have not found anything in their debug logs that indicates an issue with their driver. We found a similar (but different) bug here: https://bugzilla.redhat.com/bugzilla/process_bug.cgi. Comment #14 by Sebastien BLAISOT (sblaisot) shows the same panic trace that I had.
Engineering is waiting on the output from the latest IT post requesting the dmesg output.
The kernel rpms can be downloaded from http://people.redhat.com/dledford/st_tape_test/
A new set of test kernel RPMs has been posted to the same place as before. These include the fix for the sg+st write bug and one other tweak that might help with this problem. Please test them and let me know the results.
NEEDINFO_REPORTER does not seem to be the correct state for this, moving back to ASSIGNED.
A fix for this problem has just been committed to the RHEL3 U7 patch pool this evening (in kernel version 2.4.21-37.6.EL).
*** Bug 156396 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0144.html
I tested RC1 of Update 7 and it is causing the same panic. The hotfix provided a few months back did resolve the issue and several customers are using it without problems.
Josef, can you tell me the kernel version on the U7 RC1 system you're using? I want to confirm that it should have the fix incorporated into it.
(In reply to comment #71) 2.4.21-40.EL. I obtained the update from this FTP location: ftp://partners.redhat.com/45cf7905562e922e7817d4a01ca8be26/RHEL3-U7/
Josef, let's move this thread to bug 182996: this bug was for RHEL 3 U7, and bug 182996 is for RHEL 3 U8. Let's consider this bug CLOSED and treat bug 182996 as a regression of the "fix".
Fixing bug's disposition (reverting to ERRATA).