Description of problem:
I have a two-node cluster with shared storage (GFS). When simulating a split cluster, I use iptables on each node to drop all traffic from the other node, so heartbeats are lost. Unfortunately, a kernel panic occurred on one node and its OS hung.

Version-Release number of selected component (if applicable):
# uname -a
Linux archprx26k13319.storage.tucson.ibm.com 2.6.9-67.EL #1 SMP Wed Nov 7 13:43:35 EST 2007 ia64 ia64 ia64 GNU/Linux
# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 6)

How reproducible:
Has happened a number of times.

Steps to Reproduce:
1. On each node, use iptables to drop traffic from the other node, for example:
   iptables -A INPUT -s X.X.X.X -j DROP
   where X.X.X.X is the other node's IP address. Do this for both the public and the private IP (the exact commands are spelled out under Additional info below).

Actual results:
One node is fenced by the other and shut down, but a kernel panic occurs on that node. The log and console messages on that node are as follows.

NODE1, the log messages:

Apr 29 06:12:57 archprx26k13319 kernel: GFS: fsid=new_cluster:JB10106_16.1: Scanning for log elements...
Apr 29 06:12:58 archprx26k13319 kernel: GFS: fsid=new_cluster:JB10106_16.1: Found 0 unlinked inodes
Apr 29 06:12:58 archprx26k13319 kernel: GFS: fsid=new_cluster:JB10106_16.1: Found quota changes for 2 IDs
Apr 29 06:12:58 archprx26k13319 kernel: GFS: fsid=new_cluster:JB10106_16.1: Done
Apr 29 06:12:58 archprx26k13319 clurgmgrd: [5707]: <info> Adding export: *:/JB10106_16 (fsid=4730,no_root_squash,rw,sync)
Apr 29 06:12:58 archprx26k13319 clurgmgrd: [5707]: <info> Adding IPv4 address 9.11.113.133 to eth1
Apr 29 06:12:59 archprx26k13319 multipathd: diapered_dm-1: failed to store path info
Apr 29 06:12:59 archprx26k13319 clurgmgrd[5707]: <notice> Service new_service started
Apr 29 06:14:42 archprx26k13319 kernel: clurgmgrd(13113): unaligned access to 0x200000000107a864, ip=0x4000000000012240
Apr 29 06:14:42 archprx26k13319 kernel: clurgmgrd(13113): unaligned access to 0x200000000107a86c, ip=0x4000000000012281
Apr 29 06:14:42 archprx26k13319 kernel: clurgmgrd(13113): unaligned access to 0x200000000107a86c, ip=0x40000000000122a0
Apr 29 06:14:43 archprx26k13319 kernel: CMAN: Being told to leave the cluster by node 2
Apr 29 06:14:43 archprx26k13319 kernel: CMAN: we are leaving the cluster.
Apr 29 06:14:43 archprx26k13319 kernel: WARNING: dlm_emergency_shutdown
Apr 29 06:14:43 archprx26k13319 kernel: WARNING: dlm_emergency_shutdown
Apr 29 06:14:43 archprx26k13319 kernel: SM: 00000003 sm_stop: SG still joined
Apr 29 06:14:43 archprx26k13319 kernel: SM: 01000005 sm_stop: SG still joined
Apr 29 06:14:43 archprx26k13319 kernel: SM: 0200000d sm_stop: SG still joined
Apr 29 06:14:43 archprx26k13319 kernel: SM: 03000007 sm_stop: SG still joined
Apr 29 06:14:43 archprx26k13319 ccsd[4805]: Cluster manager shutdown. Attemping to reconnect...
Apr 29 06:14:45 archprx26k13319 qdiskd[4980]: <err> cman_dispatch: Host is down
Apr 29 06:14:45 archprx26k13319 qdiskd[4980]: <err> Halting qdisk operations
Apr 29 06:15:08 archprx26k13319 ccsd[4805]: Unable to connect to cluster infrastructure after 30 seconds.
Apr 29 06:15:38 archprx26k13319 ccsd[4805]: Unable to connect to cluster infrastructure after 60 seconds.
The console messages on this node are:

10840 recovery_done jid 1 msg 309 a
10840 recovery_done nodeid 1 flg 18
lock_dlm: Assertion failed on line 361 of file /builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c
lock_dlm: assertion: "!error || (plock && error == -EINPROGRESS)"
lock_dlm: time = 4295432297
JB10106_1: error=-22 num=2,15 lkf=44 flags=84
kernel BUG at /builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c:361!
gfs_glockd[8856]: bugcheck! 0 [1]
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) nfsd exportfs lockd nfs_acl parport_pc lp parport autofs4 dm_cmirror(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core vfat fat dm_mirror dm_round_robin dm_multipath dm_mod button ohci_hcd ehci_hcd tg3 tulip e100 mii ext3 jbd qla2400 mptscsih mptsas mptspi mptscsi mptbase qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 8856, CPU 0, comm: gfs_glockd
psr : 0000101008126010 ifs : 800000000000040f ip : [<a000000200abd530>] Not tainted
ip is at do_dlm_unlock+0x290/0x2a0 [lock_dlm]
unat: 0000000000000000 pfs : 000000000000040f rsc : 0000000000000003
rnat: a0000001007eff44 bsps: a00000010067ea58 pr : 0000000000009941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a000000200abd530 b6 : a000000100264e40 b7 : a000000100264e40
f6 : 1003e00000000000000a0 f7 : 1003e0000000000000001
f8 : 1003e00000000000000a0 f9 : 10002a000000000000000
f10 : 0fffd9999999996900000 f11 : 1003e0000000000000000
r1 : a0000001009dc9a0 r2 : 000000000001248f r3 : a0000001007dd010
r8 : 000000000000004f r9 : a00000010076fbc0 r10 : a00000010076fbb8
r11 : a0000001007dcbf0 r12 : e0000000309f7dd0 r13 : e0000000309f0000
r14 : 0000000000004000 r15 : a000000100788690 r16 : 0000000000000001
r17 : a0000001007f0af8 r18 : 00000000000000ff r19 : e0000000010072b0
r20 : a0000001007f0b00 r21 : 0000000000000000 r22 : e000000002650504
r23 : 0000000000000000 r24 : a0000001007f0b18 r25 : a0000001007ef704
r26 : a0000001007f0b48 r27 : 0000001008126010 r28 : 0000000100071878
r29 : 0000000000096000 r30 : 0000000000000000 r31 : a0000001007eff20

Call Trace:
 [<a000000100016e40>] show_stack+0x80/0xa0
                                sp=e0000000309f7940 bsp=e0000000309f11a0
 [<a000000100017750>] show_regs+0x890/0x8c0
                                sp=e0000000309f7b10 bsp=e0000000309f1158
 [<a00000010003e9b0>] die+0x150/0x240
                                sp=e0000000309f7b30 bsp=e0000000309f1118
 [<a00000010003eae0>] die_if_kernel+0x40/0x60
                                sp=e0000000309f7b30 bsp=e0000000309f10e8
 [<a00000010003ec80>] ia64_bad_break+0x180/0x600
                                sp=e0000000309f7b30 bsp=e0000000309f10c0
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e0000000309f7c00 bsp=e0000000309f10c0
 [<a000000200abd530>] do_dlm_unlock+0x290/0x2a0 [lock_dlm]
                                sp=e0000000309f7dd0 bsp=e0000000309f1048
 [<a000000200abdf40>] lm_dlm_unlock+0xa0/0xc0 [lock_dlm]
                                sp=e0000000309f7dd0 bsp=e0000000309f1028
 [<a000000200cbd7d0>] gfs_lm_unlock+0xb0/0x120 [gfs]
                                sp=e0000000309f7dd0 bsp=e0000000309f0ff0
 [<a000000200ca4870>] gfs_glock_drop_th+0x2d0/0x500 [gfs]
                                sp=e0000000309f7dd0 bsp=e0000000309f0fa8
 [<a000000200ca9fc0>] inode_go_drop_th+0x40/0x60 [gfs]
                                sp=e0000000309f7dd0 bsp=e0000000309f0f88
 [<a000000200c9fbf0>] run_queue+0x9d0/0xbc0 [gfs]
                                sp=e0000000309f7dd0 bsp=e0000000309f0f00
 [<a000000200c9fe60>] unlock_on_glock+0x80/0xc0 [gfs]
                                sp=e0000000309f7dd0 bsp=e0000000309f0ed8
 [<a000000200ca0340>] gfs_reclaim_glock+0x4a0/0x540 [gfs]
                                sp=e0000000309f7dd0 bsp=e0000000309f0e98
 [<a000000200c7e8d0>] gfs_glockd+0x110/0x360 [gfs]
                                sp=e0000000309f7dd0 bsp=e0000000309f0e38
 [<a000000100018d10>] kernel_thread_helper+0x30/0x60
                                sp=e0000000309f7e30 bsp=e0000000309f0e10
 [<a000000100008c60>] start_kernel_thread+0x20/0x40
                                sp=e0000000309f7e30 bsp=e0000000309f0e10
Kernel panic - not syncing: Fatal exception

NODE2, the other node's log:

Apr 28 23:18:40 archprx26k46328 kernel: CMAN: removing node archprx26k13319 from the cluster : Missed too many heartbeats
Apr 28 23:18:42 archprx26k46328 qdiskd[4955]: <info> Node 1 shutdown
Apr 28 23:18:51 archprx26k46328 kernel: clurgmgrd(15723): unaligned access to 0x200000000107a86c, ip=0x40000000000122a0
Apr 28 23:18:55 archprx26k46328 clurgmgrd[14422]: <emerg> #1: Quorum Dissolved
Apr 28 23:18:55 archprx26k46328 kernel: CMAN: Quorum device /dev/sdj timed out
Apr 28 23:18:55 archprx26k46328 kernel: CMAN: quorum lost, blocking activity
Apr 28 23:18:55 archprx26k46328 ccsd[4814]: Cluster is not quorate. Refusing connection.
Apr 28 23:19:04 archprx26k46328 ccsd[4814]: Someone may be attempting something evil.
Apr 28 23:19:04 archprx26k46328 ccsd[4814]: Error while processing disconnect: Invalid request descriptor
Apr 28 23:19:09 archprx26k46328 ccsd[4814]: Cluster is not quorate. Refusing connection.
Apr 28 23:19:09 archprx26k46328 ccsd[4814]: Error while processing connect: Connection refused
Apr 28 23:19:10 archprx26k46328 ccsd[4814]: Cluster is not quorate. Refusing connection.
Apr 28 23:19:10 archprx26k46328 ccsd[4814]: Error while processing connect: Connection refused
Apr 28 23:19:12 archprx26k46328 clurgmgrd[14422]: <notice> Quorum Achieved
Apr 28 23:19:12 archprx26k46328 clurgmgrd[14422]: <info> Magma Event: Membership Change
Apr 28 23:19:12 archprx26k46328 clurgmgrd[14422]: <info> State change: Local UP
Apr 28 23:19:12 archprx26k46328 kernel: CMAN: quorum regained, resuming activity
Apr 28 23:19:12 archprx26k46328 ccsd[4814]: Cluster is quorate. Allowing connections.
Apr 28 23:19:12 archprx26k46328 clurgmgrd[14422]: <info> Loading Service Data
Apr 28 23:19:12 archprx26k46328 clurgmgrd[14422]: <info> Initializing Services
Apr 28 23:19:15 archprx26k46328 clurgmgrd[14422]: <info> Services Initialized
Apr 28 23:20:00 archprx26k46328 fenced[4968]: archprx26k13319 not a cluster member after 80 sec post_fail_delay
Apr 28 23:20:00 archprx26k46328 fenced[4968]: fencing node "archprx26k13319"
Apr 28 23:20:03 archprx26k46328 kernel: clurgmgrd(17693): unaligned access to 0x200000000107a864, ip=0x4000000000012221
Apr 28 23:20:07 archprx26k46328 fenced[4968]: fence "archprx26k13319" success
Apr 28 23:20:21 archprx26k46328 clurgmgrd: [14422]: <info> Adding export: *:/JB10106_16 (fsid=4730,no_root_squash,rw,sync)
Apr 28 23:20:21 archprx26k46328 clurgmgrd: [14422]: <info> Adding IPv4 address 9.11.113.133 to eth0
Apr 28 23:20:22 archprx26k46328 clurgmgrd[14422]: <notice> Service new_service started

Expected results:
Node1 should reboot cleanly and rejoin the cluster with no kernel panic.

Additional info:
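For reference, the iptables rules from step 1 look roughly like the following. This is only a sketch with placeholder addresses; substitute each node's real public and private IPs.

On node1, block all traffic from node2:
  iptables -A INPUT -s <node2 public IP> -j DROP
  iptables -A INPUT -s <node2 private IP> -j DROP

On node2, block all traffic from node1 in the same way:
  iptables -A INPUT -s <node1 public IP> -j DROP
  iptables -A INPUT -s <node1 private IP> -j DROP

To remove the partition afterwards, delete the rules again:
  iptables -D INPUT -s <peer IP> -j DROP
or flush the chain with "iptables -F INPUT" if no other INPUT rules are needed.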
The kernel panic on node1 is intended in this case. What fencing method are you using? Many configurations use power fencing which will reboot the failed node. You might also consider setting up a watchdog to reboot a node when it panics.
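If a hardware watchdog is not available, a simpler mechanism than a watchdog daemon is the kernel panic timeout, which reboots the machine a fixed number of seconds after a panic. A minimal sketch (this is the stock kernel.panic sysctl, not anything cluster-specific):

  # reboot 30 seconds after a kernel panic (takes effect immediately)
  echo 30 > /proc/sys/kernel/panic

  # make the setting persistent across reboots
  echo "kernel.panic = 30" >> /etc/sysctl.conf

The same effect can be had by booting with the panic=30 kernel parameter.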
(In reply to comment #1)
> The kernel panic on node1 is intended in this case.
> What fencing method are you using? Many configurations use power fencing which
> will reboot the failed node.
> You might also consider setting up a watchdog to reboot a node when it panics.

The fencing method I am using is fence_brocade.
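Note that fence_brocade is fabric fencing: it cuts the victim off from shared storage by disabling its Fibre Channel switch port, but it does not power-cycle the node, so a panicked node will sit at the panic until it is rebooted by hand (or by a panic timeout/watchdog as suggested in comment #1). As a rough sketch of how the agent is typically exercised by hand, with placeholder values, and flag names that should be double-checked against the fence_brocade man page for this release:

  # disable the victim's switch port, then re-enable it
  fence_brocade -a <switch address> -l <login> -p <password> -n <port number> -o disable
  fence_brocade -a <switch address> -l <login> -p <password> -n <port number> -o enable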