Bug 180212

Summary: panic occurred after reboot
Product: Red Hat Enterprise Linux 4
Reporter: Pan Haifeng <pan_haifeng>
Component: kernel
Assignee: Doug Ledford <dledford>
Status: CLOSED DUPLICATE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4.0
CC: andriusb, berthiaume_wayne, coughlan, jbaron
Target Milestone: ---
Target Release: ---
Hardware: ia64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-08-10 19:00:07 UTC

Description Pan Haifeng 2006-02-06 19:49:00 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; EMC IS 55; .NET CLR 1.0.3705; .NET CLR 1.1.4322; InfoPath.1)

Description of problem:
In a 4-node cluster environment, reboots are randomly issued to 3 of the 4 servers, one server at a time. The panic occurred after a reboot was issued and the node started to come back up.


Version-Release number of selected component (if applicable):
kernel-2.6.9-22.EL

How reproducible:
Always

Steps to Reproduce:
1. Configuration
Server model: rx5670
HBA model: QLA2312
Driver version: 8.01.00
OS kernel: RHEL 4 Update 2, 2.6.9-22.EL
FC switch: Brocade 3800 16 port
Array model: CLARiiON CX300

2. Run in a cluster of 4 nodes, randomly issuing reboots to 3 out of the 4 nodes during a 24-hour cycle. Reboots are not issued to all the nodes at the same time; only one node is rebooted at a time. After the node comes back up, the test loads it with heavy I/O, continues the test, and then issues another reboot after an hour or so. The panic occurred after a reboot was issued and the node started to come back up. (A hypothetical driver sketch is included under Additional info below.)


Additional info:
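A minimal sketch of the reboot-cycling driver described in step 2 above -- hypothetical, not part of the original report. It assumes password-less ssh and illustrative host names node1..node3; the fourth node stays up:

/*
 * Hypothetical repro driver (sketch only). Each cycle reboots exactly
 * one of three designated nodes, which generates an RSCN on the FC
 * fabric, then waits roughly an hour while the I/O load runs before
 * the next reboot. Host names are illustrative.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *nodes[] = { "node1", "node2", "node3" };
    char cmd[128];

    srand((unsigned)time(NULL));
    for (;;) {
        /* reboot one node at a time, chosen at random */
        snprintf(cmd, sizeof(cmd), "ssh %s reboot", nodes[rand() % 3]);
        system(cmd);
        /* let the node come back up and run under heavy I/O for
         * about an hour before the next reboot */
        sleep(3600);
    }
    return 0;
}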

Comment 1 Jason Baron 2006-02-08 03:09:57 UTC
Do you have a trace of the panic?

Comment 2 Pan Haifeng 2006-02-08 13:56:07 UTC
The problem does not always reproduce. It is not a true clustering environment: it is 4 servers on the same switch, and 3 of them are randomly rebooted to generate RSCNs.

Console trace follows:
end_request: I/O error, dev sdfz, sector 2799341
kernel BUG at drivers/scsi/scsi.c:292!
scsi_eh_3[729]: bugcheck! 0 [1]
Modules linked in: md5 ipv6 parport_pc lp parport pidentd(U) autofs4
sunrpc ds yenta_socket pcmcia_core deadman(U) vfat fat sg dm_multipath
emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U)
emcplib(U) button ohci_hcd ehci_hcd e1000 tg3 bonding(U) dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod qla2300(U) qla2xxx(U) qla2xxx_conf(U)
mptscsih mptbase sd_mod scsi_mod

Pid: 729, CPU 0, comm:            scsi_eh_3
psr : 0000101008122010 ifs : 800000000000058e ip  : [<a000000200069f00>]
Tainted: P
ip is at scsi_put_command+0x1e0/0x200 [scsi_mod]
unat: 0000000000000000 pfs : 000000000000058e rsc : 0000000000000003
rnat: 0000000043de6db3 bsps: 000000000002cc51 pr  : 0000000000269941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000200069f00 b6  : a0000001003418a0 b7  : a000000100256b40
f6  : 0fffbccccccccc8c00000 f7  : 0ffdaa200000000000000
f8  : 100008000000000000000 f9  : 10002a000000000000000
f10 : 0fffcccccccccc8c00000 f11 : 1003e0000000000000000
r1  : a00000010099d0e0 r2  : 000000000032b272 r3  : a00000010079d740
r8  : 0000000000000027 r9  : a000000100732ac0 r10 : a000000100732ab8
r11 : a00000010079d328 r12 : e0000001004c7df0 r13 : e0000001004c0000
r14 : 0000000000004000 r15 : a00000010074e540 r16 : 0000000000000001
r17 : 0000000000000538 r18 : a000000100650198 r19 : a000000100256b40
r20 : c0000000f4050000 r21 : 0000000000000005 r22 : a0000001007b3b30
r23 : a0000001007b3a40 r24 : a0000001007b3a40 r25 : a000000100a3d7c8
r26 : 00000ba5d86cafa5 r27 : 0000001008122010 r28 : 0000000000000000
r29 : 00000000110000c0 r30 : 0000000000000000 r31 : a0000001007b00c0

Call Trace:
 [<a000000100016a60>] show_stack+0x80/0xa0
                                sp=e0000001004c7960 bsp=e0000001004c1220
 [<a000000100017370>] show_regs+0x890/0x8c0
                                sp=e0000001004c7b30 bsp=e0000001004c11d0
 [<a00000010003d7f0>] die+0x150/0x240
                                sp=e0000001004c7b50 bsp=e0000001004c1190
 [<a00000010003d920>] die_if_kernel+0x40/0x60
                                sp=e0000001004c7b50 bsp=e0000001004c1160
 [<a00000010003dac0>] ia64_bad_break+0x180/0x600
                                sp=e0000001004c7b50 bsp=e0000001004c1138
 [<a00000010000f480>] ia64_leave_kernel+0x0/0x260
                                sp=e0000001004c7c20 bsp=e0000001004c1138
 [<a000000200069f00>] scsi_put_command+0x1e0/0x200 [scsi_mod]
                                sp=e0000001004c7df0 bsp=e0000001004c10c8
 [<a000000200078d00>] scsi_next_command+0x40/0x80 [scsi_mod]
                                sp=e0000001004c7df0 bsp=e0000001004c10a0
 [<a000000200078ff0>] scsi_end_request+0x1d0/0x2e0 [scsi_mod]
                                sp=e0000001004c7df0 bsp=e0000001004c1058
 [<a000000200079570>] scsi_io_completion+0x2b0/0xa00 [scsi_mod]
                                sp=e0000001004c7df0 bsp=e0000001004c0fd0
 [<a000000200026710>] sd_rw_intr+0x110/0x700 [sd_mod]
                                sp=e0000001004c7df0 bsp=e0000001004c0f80
 [<a00000020006c190>] scsi_finish_command+0x2d0/0x300 [scsi_mod]
                                sp=e0000001004c7df0 bsp=e0000001004c0f50
 [<a0000002000769d0>] scsi_error_handler+0x16b0/0x2560 [scsi_mod]
                                sp=e0000001004c7df0 bsp=e0000001004c0e38
 [<a000000100018930>] kernel_thread_helper+0x30/0x60
                                sp=e0000001004c7e30 bsp=e0000001004c0e10
 [<a000000100008c60>] start_kernel_thread+0x20/0x40
                                sp=e0000001004c7e30 bsp=e0000001004c0e10
Kernel panic - not syncing: Fatal exception
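
For context: the BUG at drivers/scsi/scsi.c:292 fires inside scsi_put_command(), which the trace shows being reached from the normal I/O completion path (scsi_io_completion -> scsi_end_request -> scsi_next_command) while running on the error-handler thread scsi_eh_3. A plausible reading of this kind of BUG is a command being released twice, once by each path. A userspace sketch of that failure mode, with illustrative names only (this is not the RHEL4 kernel source):

/*
 * Userspace sketch of the suspected failure mode -- illustrative only.
 * Releasing the same command twice trips the sanity check, just as a
 * double scsi_put_command() would hit the kernel's BUG() and panic.
 */
#include <assert.h>
#include <stdio.h>

struct cmd {
    int allocated;          /* stand-in for the kernel's bookkeeping */
};

static void put_command(struct cmd *c)
{
    /* kernel analogue: BUG() if the command is not in the state the
     * release path expects (e.g. it was already released) */
    assert(c->allocated && "double put_command()");
    c->allocated = 0;
}

int main(void)
{
    struct cmd c = { .allocated = 1 };

    put_command(&c);        /* normal I/O completion releases the command */
    put_command(&c);        /* a racing second release trips the check */
    printf("not reached\n");
    return 0;
}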


Comment 3 Jason Baron 2006-02-10 18:39:08 UTC
OK, thanks for the trace. Re-assigning to our SCSI expert.

Comment 4 Luming Yu 2007-07-27 15:59:25 UTC
Is this still reproducible with the latest RHEL4, RHEL5, or upstream kernel?

Comment 5 Pan Haifeng 2007-08-10 19:00:07 UTC
It looks like the same issue as Bug #231319.

*** This bug has been marked as a duplicate of 231319 ***