Description of problem:
* 2 bladecentres, each providing 3 blades, in separate locations (stretched cluster)
* 1 quorum disk shared between the 2 datacentres
* 6 nodes + 1 quorum disk to determine which datacentre is quorate (i.e. can reach the SAN and the network)

Test:
* 1 bladecentre (3 blades) goes down -> the remaining bladecentre (3 blades) should take over the services (3 votes + 1 quorum vote > 3)
* if the quorum master is down, quorum is regained on another node (another member assumes the master role); the cluster becomes quorate, but rgmanager does not start the services and fenced claims that it could not fence the 3 defunct nodes

Version-Release number of selected component (if applicable):
RHEL 5.2 + bugfixes
rgmanager-2.0.38-2.el5
cman-2.0.84-2.el5
openais-0.80.3-15.el5
luci-0.12.0-7.el5
ricci-0.12.0-7.el5

How reproducible:
* 6-node cluster + qdisk
* e.g. n1,n2,n3 in one bladecentre and n4,n5,n6 in another, connected by ethernet
* fencing via iLO
* qdisk with heuristics (pinging core switches, checking that partitions are mounted)
* power off 1 bladecentre

Steps to Reproduce, case 1:
1. quorum master role on n1, services running on n2
2. power off bladecentre 2 (n4,n5,n6)

Actual results, case 1:
* 3 members n4,n5,n6 are down
* quorum remains online (because the master is member n1, which is online)
* services remain running
* fenced claims that fencing of nodes n4,n5,n6 fails (because the whole bladecentre is unreachable, so fence_ilo cannot connect)

Expected results, case 1:
* same as the actual results -> OK

Steps to Reproduce, case 2:
1. quorum master role on n1, services running on n2
2. power off bladecentre 1 (n1,n2,n3)

Actual results, case 2:
* 3 members n1,n2,n3 are down
* quorum dissolves (the master is down)
* services are down (n2 is down)
* quorum is regained under another master (n4)
* services do not start on members n4, n5 or n6
* fenced claims that fencing of nodes n1,n2,n3 fails (because the whole bladecentre is unreachable, so fence_ilo cannot connect)

Expected results, case 2:
* after a few tries fenced gives up fencing the unreachable nodes
* after quorum is regained, rgmanager starts the services on another node

Additional info:
* it seems that fenced is stuck in a fencing loop, trying to fence the unreachable nodes over and over
* rgmanager is running but not logging anything
* rgmanager is not visible in clustat

clustat:
Cluster Status for AIM_Cluster_01 @ Mon Oct 13 12:36:39 2008
Member Status: Quorate

 Member Name                                   ID   Status
 ------ ----                                   ---- ------
 iconode04-sr1.xxxxx.ibk                       1    Offline
 iconode03-sr1.xxxxx.ibk                       2    Offline
 iconode06-sr2.xxxxx.ibk                       3    Online, Local
 iconode01-sr1.xxxxx.ibk                       4    Offline
 iconode05-sr2.xxxxx.ibk                       5    Online
 iconode02-sr2.xxxxx.ibk                       6    Online
 /dev/mapper/VD_AIM_CLUSTER_EVA8100-SR1-1p1    0    Online, Quorum Disk

logs of fenced:
Oct 13 12:15:26 iconode06-sr2 fenced[19986]: agent "fence_ilo" reports: Connect failed: connect: Connection timed out; Connection timed out at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/Net/SSL.pm line 104, <> line 4.
Oct 13 12:15:26 iconode06-sr2 ccsd[19907]: Attempt to close an unopened CCS descriptor (42780).
Oct 13 12:15:26 iconode06-sr2 ccsd[19907]: Error while processing disconnect: Invalid request descriptor
Oct 13 12:15:26 iconode06-sr2 fenced[19986]: fence "iconode03-sr1.xxxxx.ibk" failed
Oct 13 12:15:31 iconode06-sr2 fenced[19986]: fencing node "iconode03-sr1.xxxxx.ibk"
Oct 13 12:18:40 iconode06-sr2 fenced[19986]: agent "fence_ilo" reports: Connect failed: connect: Connection timed out; Connection timed out at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/Net/SSL.pm line 104, <> line 4.
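For reference, "qdisk with heuristics (pinging core switches, checking that partitions are mounted)" would typically be expressed in cluster.conf roughly as below. This is an illustrative sketch only, not the reporter's actual configuration: the label, ping target, scores, and intervals are assumptions.

```xml
<!-- Hypothetical cluster.conf fragment (RHEL 5 cman/qdisk syntax). -->
<!-- Label, ping target, scores and intervals are illustrative.     -->
<quorumd interval="1" tko="10" votes="1" label="AIM_qdisk">
  <!-- heuristic: ping a core switch; a node failing its heuristics
       drops below min_score and loses its claim on the qdisk vote -->
  <heuristic program="ping -c1 -w1 192.168.0.1" score="1" interval="2"/>
  <!-- heuristic: check that the shared partition is mounted -->
  <heuristic program="grep -q ' /shared ' /proc/mounts" score="1" interval="2"/>
</quorumd>
```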
Ok - so, there are two problems here: (1) fencing is failing and (2) rgmanager isn't starting services. I think (2) is caused by (1), but I need to test more fully to be sure.
This works for me if fencing succeeds (which is a requirement).
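For completeness, the vote arithmetic the report relies on can be sketched as follows. This is a plain restatement of the cluster's numbers (cman requires a simple majority of expected votes, i.e. expected_votes // 2 + 1), not cluster code:

```python
# Vote arithmetic for the reported 6-node + qdisk cluster (sketch).
# cman's quorum threshold is a simple majority: expected_votes // 2 + 1.

def quorate(node_votes: int, qdisk_votes: int, expected_votes: int) -> bool:
    """Return True if the surviving votes meet the quorum threshold."""
    threshold = expected_votes // 2 + 1
    return node_votes + qdisk_votes >= threshold

EXPECTED = 7  # 6 nodes x 1 vote + 1 qdisk vote

# Case 2: bladecentre 1 (n1,n2,n3) is powered off; n4,n5,n6 + qdisk survive.
print(quorate(node_votes=3, qdisk_votes=1, expected_votes=EXPECTED))  # True: 4 >= 4

# Without the qdisk vote the surviving half would not be quorate.
print(quorate(node_votes=3, qdisk_votes=0, expected_votes=EXPECTED))  # False: 3 < 4
```

This is why the cluster does become quorate again after the master moves to n4; the remaining problem is only that rgmanager never recovers the services while fencing keeps failing.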