Description of problem:
A cluster node panics after heartbeat connectivity is lost, i.e. the heartbeat cable is pulled from the master node. The data link is via another interface and physical connection. This does not happen consistently, but was observed on at least four occasions. It is a 2-node cluster using qdisk technology.

Version-Release number of selected component (if applicable):
RHEL 4 U4, Cluster 4 U5 beta, GFS 4 U4. All 32-bit software on x86-64 HP blades.

How reproducible:
Intermittent; observed on at least four occasions.

Steps to Reproduce:
1. Pull the heartbeat cable from the master node.

Actual results:
Incomplete; the kernel issues a "kernel BUG" message for dlm/lock.c associated with an invalid operation. See attached message. I expect that a new version of dlm-kernel will be required to fix the problem.

Expected results:
The secondary node is fenced via the fencing mechanism, i.e. iLO fencing.

Additional info:

KERNEL panic info:

[root@hx422 ~]#
Apr 19 14:44:11 hx422 kernel: CMAN: Being told to leave the cluster by node 2
Apr 19 14:44:11 hx422 kernel: CMAN: we are leaving the cluster.
Apr 19 14:44:11 hx422 ccsd[4425]: Cluster manager shutdown. Attempting to reconnect...
Apr 19 14:44:11 hx422 kernel: WARNING: dlm_emergency_shutdown
Apr 19 14:44:11 hx422 kernel: WARNING: dlm_emergency_shutdown
Apr 19 14:44:11 hx422 kernel: SM: 00000002 sm_stop: SG still joined
Apr 19 14:44:11 hx422 kernel: SM: 01000003 sm_stop: SG still joined
Apr 19 14:44:11 hx422 kernel: SM: 02000005 sm_stop: SG still joined
Apr 19 14:44:21 hx422 kernel: dlm: dlm_unlock: lkid 2001c lockspace not found
Apr 19 14:44:21 hx422 kernel: clvmd move flags 0,1,0 ids 0,2,0
Apr 19 14:44:21 hx422 kernel: clvmd move use event 2
Apr 19 14:44:21 hx422 kernel: clvmd recover event 2 (first)
Apr 19 14:44:21 hx422 kernel: clvmd add nodes
Apr 19 14:44:21 hx422 kernel: clvmd total nodes 2
Apr 19 14:44:21 hx422 kernel: clvmd rebuild resource directory
Apr 19 14:44:21 hx422 kernel: clvmd rebuilt 0 resources
Apr 19 14:44:21 hx422 kernel: clvmd recover event 2 done
Apr 19 14:44:21 hx422 kernel: clvmd move flags 0,0,1 ids 0,2,2
Apr 19 14:44:21 hx422 kernel: clvmd process held requests
Apr 19 14:44:21 hx422 kernel: clvmd processed 0 requests
Apr 19 14:44:21 hx422 kernel: clvmd recover event 2 finished
Apr 19 14:44:21 hx422 kernel: emsshare move flags 0,1,0 ids 0,3,0
Apr 19 14:44:21 hx422 kernel: emsshare move use event 3
Apr 19 14:44:21 hx422 kernel: emsshare recover event 3 (first)
Apr 19 14:44:21 hx422 kernel: emsshare add nodes
Apr 19 14:44:21 hx422 kernel: emsshare total nodes 2
Apr 19 14:44:21 hx422 kernel: emsshare rebuild resource directory
Apr 19 14:44:21 hx422 kernel: emsshare rebuilt 60 resources
Apr 19 14:44:21 hx422 kernel: emsshare recover event 3 done
Apr 19 14:44:21 hx422 kernel: emsshare move flags 0,0,1 ids 0,3,3
Apr 19 14:44:21 hx422 kernel: emsshare process held requests
Apr 19 14:44:21 hx422 kernel: emsshare processed 0 requests
Apr 19 14:44:21 hx422 kernel: emsshare recover event 3 finished
Apr 19 14:44:21 hx422 kernel: 4638 pr_start last_stop 0 last_start 4 last_finish 0
Apr 19 14:44:21 hx422 kernel: 4638 pr_start count 2 type 2 event 4 flags 250
Apr 19 14:44:21 hx422 kernel: 4638 claim_jid 0
Apr 19 14:44:21 hx422 kernel: 4638 pr_start 4 done 1
Apr 19 14:44:21 hx422 kernel: 4637 pr_finish flags 5a
Apr 19 14:44:21 hx422 kernel: 4617 recovery_done jid 0 msg 309 a
Apr 19 14:44:21 hx422 kernel: 4617 recovery_done nodeid 1 flg 18
Apr 19 14:44:21 hx422 kernel:
Apr 19 14:44:21 hx422 kernel: lock_dlm: Assertion failed on line 361 of file /builddir/build/BUILD/gfs-kernel-2.6.9-72/smp/src/dlm/lock.c
Apr 19 14:44:21 hx422 kernel: lock_dlm: assertion: "!error || (plock && error == -EINPROGRESS)"
Apr 19 14:44:21 hx422 kernel: lock_dlm: time = 102095
Apr 19 14:44:21 hx422 kernel: emsshare: error=-22 num=2,12e0ec lkf=10000 flags=84
Apr 19 14:44:21 hx422 kernel:
Apr 19 14:44:21 hx422 kernel: ------------[ cut here ]------------
Apr 19 14:44:21 hx422 kernel: kernel BUG at /builddir/build/BUILD/gfs-kernel-2.6.9-72/smp/src/dlm/lock.c:361!
Apr 19 14:44:21 hx422 kernel: invalid operand: 0000 [#1]
Apr 19 14:44:21 hx422 kernel: SMP
Apr 19 14:44:21 hx422 kernel: Modules linked in: mptctl mptbase sg cpqci(U) parport_pc lp parport i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc dm_mirror dm_round_robin dm_multipath button battery ac uhci_hcd ehci_hcd hw_random tg3 bonding(U) ata_piix libata ext3 jbd dm_mod qla2400 qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Apr 19 14:44:21 hx422 kernel: CPU: 2
Apr 19 14:44:21 hx422 kernel: EIP: 0060:[<f8ee2611>] Tainted: P VLI
Apr 19 14:44:21 hx422 kernel: EFLAGS: 00010246 (2.6.9-54.ELsmp)
Apr 19 14:44:21 hx422 kernel: EIP is at do_dlm_unlock+0xa9/0xbf [lock_dlm]
Apr 19 14:44:21 hx422 kernel: eax: 00000001 ebx: f714d880 ecx: f73b0f28 edx: f8ee72af
Apr 19 14:44:21 hx422 kernel: esi: ffffffea edi: 00000000 ebp: f8f99000 esp: f73b0f24
Apr 19 14:44:21 hx422 kernel: ds: 007b es: 007b ss: 0068
Apr 19 14:44:21 hx422 kernel: Process gfs_glockd (pid: 4640, threadinfo=f73b0000 task=f6c51230)
Apr 19 14:44:21 hx422 kernel: Stack: f8ee72af f8f99000 00000003 f714d880 f8ee28b2 f9009e32 00000001 f65575c4
Apr 19 14:44:21 hx422 kernel:        f65575a8 f8fffd63 f9036b00 f6175400 f65575a8 f9036b00 f65575a8 f8fff257
Apr 19 14:44:21 hx422 kernel:        f65575a8 00000001 f655763c f8fff30e f65575a8 f65575c4 f8fff431 00000001
Apr 19 14:44:21 hx422 kernel: Call Trace:
Apr 19 14:44:21 hx422 kernel:  [<f8ee28b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
Apr 19 14:44:21 hx422 kernel:  [<f9009e32>] gfs_lm_unlock+0x2c/0x42 [gfs]
Apr 19 14:44:21 hx422 kernel:  [<f8fffd63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
Apr 19 14:44:21 hx422 kernel:  [<f8fff257>] rq_demote+0x7f/0x98 [gfs]
Apr 19 14:44:21 hx422 kernel:  [<f8fff30e>] run_queue+0x5a/0xc1 [gfs]
Apr 19 14:44:21 hx422 kernel:  [<f8fff431>] unlock_on_glock+0x1f/0x28 [gfs]
Apr 19 14:44:21 hx422 kernel:  [<f90013e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
Apr 19 14:44:21 hx422 kernel:  [<f8ff3e05>] gfs_glockd+0x39/0xde [gfs]
Apr 19 14:44:21 hx422 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 19 14:44:21 hx422 kernel:  [<c02d5db2>] ret_from_fork+0x6/0x14
Apr 19 14:44:21 hx422 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 19 14:44:21 hx422 kernel:  [<f8ff3dcc>] gfs_glockd+0x0/0xde [gfs]
Apr 19 14:44:21 hx422 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Apr 19 14:44:21 hx422 kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 cb 73 ee f8 e8 ee 02 24 c7 83 c4 34 68 af 72 ee f8 e8 e1 02 24 c7 <0f> 0b 69 01 f7 71 ee f8 68 b1 72 ee f8 e8 9c fa 23 c7 5b 5e 5f
Apr 19 14:44:21 hx422 kernel: <0>Fatal exception: panic in 5 seconds

CLUSTER.CONF FILE
==================

<?xml version="1.0"?>
<cluster alias="EMS_clu0" config_version="37" name="EMS_cluster0">
  <cman expected_votes="3" two_node="0"/>
  <quorumd interval="4" tko="10" votes="1" log_facility="local4" log_level="7"
           label="myquorum" status_file="/tmp/qdisk_status.log" reboot="1">
    <heuristic program="/etc/cluster/scripts/gw_check.sh" score="1" interval="8"/>
  </quorumd>
  <fence_daemon clean_start="1" post_join_delay="3" post_fail_delay="4">
  </fence_daemon>
  <clusternodes>
    <clusternode name="hx422hb" votes="1" nodeid="1">
      <fence>
        <method name="1">
          <device name="hx422ilo"/>
        </method>
        <method name="2">
          <device name="human"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hx622hb" votes="1" nodeid="2">
      <fence>
        <method name="1">
          <device name="hx622ilo"/>
        </method>
        <method name="2">
          <device name="human"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="hx422ilo" agent="fence_ilo" hostname="hx422ilo" login="Administrator" passwd="7M22YG48"/>
    <fencedevice name="hx622ilo" agent="fence_ilo" hostname="hx622ilo" login="Administrator" passwd="HMWUFDE5"/>
    <fencedevice name="human" agent="/etc/cluster/scripts/fence_auspost"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="EMS_domain" ordered="0" restricted="0">
        <failoverdomainnode name="hx422hb" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources/>
  </rm>
</cluster>
I think this is a problem related to qdisk. If you took qdisk out of the setup, leaving a normal 2-node cluster, then I'm quite confident you'd see the standard race to fence. Qdisk, of course, is supposed to "bias" this race to fence so that the node running services is the one that "wins".
I don't think this is qdisk; it looks like ACPI is causing a soft-off.
FWIW, the qdisk configuration looks quite good.
Any ideas as to how the node is getting a kill message ('being told to leave the cluster') if the heartbeat cable is unplugged?
Can we get detailed qdisk logs for the event, just for completeness?

* add 'log_facility="local5" log_level="7"' to the quorumd tag in cluster.conf
* add 'local5.* /var/log/qdiskd_debug_log' to /etc/syslog.conf
* restart syslogd (service syslog restart)
* restart qdiskd (service qdiskd restart)

An example of the resulting snippets follows this list. This will greatly help figure out what's going on if it's related to qdisk.
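Roughly, the requested changes would look like this (the log path is the one named above; note the posted cluster.conf currently has log_facility="local4", which this replaces):

In cluster.conf:

  <quorumd interval="4" tko="10" votes="1" log_facility="local5" log_level="7"
           label="myquorum" status_file="/tmp/qdisk_status.log" reboot="1">

In /etc/syslog.conf:

  local5.*        /var/log/qdiskd_debug_log

Then:

  service syslog restart
  service qdiskd restart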
Josef noticed that qdiskd's interval*tko is almost 4x what it should be for the default deadnode_timeout of 21 seconds. qdiskd's interval*tko should be < 1/2 * deadnode_timeout, which defaults to 21 seconds on RHEL4. The posted configuration has interval="4" and tko="10", giving interval*tko = 40 seconds, whereas qdiskd's default interval*tko is 10, which is < 1/2*(21). With a 40-second interval*tko, you need an 81-second deadnode_timeout.
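To make the arithmetic and the two possible fixes explicit (a sketch; which tunable to change is a judgment call, and the /proc path below is the usual RHEL4 location for the CMAN tunable, so verify it on the affected kernel):

  configured:  interval="4" * tko="10"  =  40 s  (qdiskd failure window)
  required:    interval * tko  <  deadnode_timeout / 2  =  21 / 2  =  10.5 s

Either shrink the qdisk window back to the default size in cluster.conf:

  <quorumd interval="1" tko="10" ... >

or raise CMAN's deadnode_timeout above 2 * 40 = 80 seconds, e.g.:

  echo 81 > /proc/cluster/config/cman/deadnode_timeout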
AFAICT this was only a configuration problem and there are no code problems or changes relevant to this issue.