Bug 466733 - rgmanager does not start services after regaining quorum
Summary: rgmanager does not start services after regaining quorum
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.2
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-10-13 10:41 UTC by Herbert L. Plankl
Modified: 2009-04-16 22:38 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-12-03 18:15:58 UTC
Target Upstream Version:
Embargoed:



Description Herbert L. Plankl 2008-10-13 10:41:01 UTC
Description of problem:
* 2 bladecentres, each providing 3 blades, in separate locations (widespread cluster)
* 1 quorum-disk (2 datacentres)
* 6 nodes + 1 quorum-disk vote to determine which datacentre is quorate (i.e. can reach the SAN and the network)

Test
* 1 bladecentre (3 blades) is down

-> the remaining bladecentre (3 blades) should take over the services (3 node votes + 1 quorum-disk vote = 4 > 3)
-> if the quorum master is down, quorum is regained on another node (another member assumes the master role) and the cluster becomes quorate, but rgmanager does not start the services and fenced reports that it could not fence the 3 defunct nodes
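
The vote arithmetic above can be checked with a short sketch. It assumes one vote per node and one vote for the quorum disk (as the "3 votes + 1 vote quorum" line implies) and the usual majority rule of expected_votes / 2 + 1. The function names are illustrative and are not part of cman.

```python
# Sketch of the quorum arithmetic for the 6-node + qdisk layout described above.
# Assumptions: 1 vote per node, 1 vote for the quorum disk, majority quorum.

NODE_VOTES = 1    # assumption: each node carries one vote
QDISK_VOTES = 1   # assumption: the quorum disk carries one vote
TOTAL_NODES = 6

EXPECTED_VOTES = TOTAL_NODES * NODE_VOTES + QDISK_VOTES  # 7 in this layout


def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum: a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(nodes_up: int, qdisk_up: bool) -> bool:
    """True if the surviving partition holds at least the quorum threshold."""
    votes = nodes_up * NODE_VOTES + (QDISK_VOTES if qdisk_up else 0)
    return votes >= quorum_threshold(EXPECTED_VOTES)


if __name__ == "__main__":
    print("threshold:", quorum_threshold(EXPECTED_VOTES))
    print("3 nodes + qdisk quorate:", is_quorate(3, True))
    print("3 nodes, no qdisk quorate:", is_quorate(3, False))
```

Under these assumptions a single bladecentre is quorate only while it also holds the quorum-disk vote, which is why losing the qdisk master temporarily costs quorum in case 2.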

Version-Release number of selected component (if applicable):
RH5.2 + bugfixes
rgmanager-2.0.38-2.el5
cman-2.0.84-2.el5
openais-0.80.3-15.el5
luci-0.12.0-7.el5
ricci-0.12.0-7.el5

How reproducible:
* 6-node cluster + qdisk
* e.g. n1,n2,n3 in one bladecentre and n4,n5,n6 in another, connected by Ethernet
* fencing by ILO
* qdisk with heuristics (pinging core-switches, checking mount of partitions)
* poweroff 1 bladecentre


Steps to Reproduce case 1:
1. master-role quorum on n1, services running on n2
2. poweroff of bladecentre 2 (n4,n5,n6)
  
Actual results case 1:
* 3 members n4,n5,n6 are down
* quorum remains online (because master is on member n1 and online)
* services remain running
* fenced claims that fencing of nodes n4,n5,n6 fails (because the whole bladecentre is not reachable and so fence_ilo fails to connect)

Expected results:
* like actual results -> OK

Steps to Reproduce case 2:
1. master-role quorum on n1, services running on n2
2. poweroff of bladecentre 1 (n1,n2,n3)
  
Actual results case 2:
* 3 members n1,n2,n3 are down
* quorum dissolves (master is down)
* services are down (n2 is down)
* quorum regains on another master (n4)
* services are not started on members n4, n5 or n6
* fenced claims that fencing of nodes n1,n2,n3 fails (because the whole bladecentre is not reachable and so fence_ilo fails to connect)

Expected results:
* after a few tries fenced gives up fencing the unreachable nodes
* after regaining quorum rgmanager starts the services on another node

Additional info:
* it seems that fenced is stuck in a fencing loop, trying to fence the unreachable nodes
* rgmanager is running but not logging anything
* rgmanager is not visible in clustat


clustat:
Cluster Status for AIM_Cluster_01 @ Mon Oct 13 12:36:39 2008
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 iconode04-sr1.xxxxx.ibk                                             1 Offline
 iconode03-sr1.xxxxx.ibk                                             2 Offline
 iconode06-sr2.xxxxx.ibk                                             3 Online, Local
 iconode01-sr1.xxxxx.ibk                                             4 Offline
 iconode05-sr2.xxxxx.ibk                                             5 Online
 iconode02-sr2.xxxxx.ibk                                             6 Online
 /dev/mapper/VD_AIM_CLUSTER_EVA8100-SR1-1p1                          0 Online, Quorum Disk


logs of fenced:
Oct 13 12:15:26 iconode06-sr2 fenced[19986]: agent "fence_ilo" reports: Connect failed: connect: Connection timed out; Connection timed out at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/Net/SSL.pm line 104, <> line 4. 
Oct 13 12:15:26 iconode06-sr2 ccsd[19907]: Attempt to close an unopened CCS descriptor (42780). 
Oct 13 12:15:26 iconode06-sr2 ccsd[19907]: Error while processing disconnect: Invalid request descriptor 
Oct 13 12:15:26 iconode06-sr2 fenced[19986]: fence "iconode03-sr1.xxxxx.ibk" failed
Oct 13 12:15:31 iconode06-sr2 fenced[19986]: fencing node "iconode03-sr1.xxxxx.ibk"
Oct 13 12:18:40 iconode06-sr2 fenced[19986]: agent "fence_ilo" reports: Connect failed: connect: Connection timed out; Connection timed out at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/Net/SSL.pm line 104, <> line 4.
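
To see how often fenced retries each node, the syslog excerpt above can be summarised with a small script. This is a sketch only: the regular expression is derived from the `fence "<node>" failed` lines shown here, and the function name is hypothetical.

```python
# Sketch: count failed fence attempts per node from syslog lines.
# The pattern matches the 'fence "<node>" failed' messages quoted in this report.
import re
from collections import Counter

FENCE_FAILED = re.compile(r'fenced\[\d+\]: fence "([^"]+)" failed')


def count_fence_failures(lines):
    """Return a Counter mapping node name -> number of failed fence attempts."""
    failures = Counter()
    for line in lines:
        match = FENCE_FAILED.search(line)
        if match:
            failures[match.group(1)] += 1
    return failures


# Two lines copied from the log excerpt above.
SAMPLE = [
    'Oct 13 12:15:26 iconode06-sr2 fenced[19986]: fence "iconode03-sr1.xxxxx.ibk" failed',
    'Oct 13 12:15:31 iconode06-sr2 fenced[19986]: fencing node "iconode03-sr1.xxxxx.ibk"',
]

if __name__ == "__main__":
    for node, count in count_fence_failures(SAMPLE).items():
        print(node, count)
```

A steadily growing count for the same node would be consistent with the fencing loop suspected under "Additional info".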

Comment 1 Lon Hohberger 2008-10-13 15:27:56 UTC
Ok - so, there are two problems here:
(1) fencing is failing and
(2) rgmanager isn't starting services.

I think (2) is caused by (1), but I need to test more fully to be sure of that fact.

Comment 2 Lon Hohberger 2008-12-03 18:15:58 UTC
This works for me if fencing succeeds (which is a requirement)
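
The dependency described in the two comments can be illustrated with a toy model. This is not rgmanager code. It only encodes the stated requirement that services are recovered once the cluster is quorate and every failed node has been fenced successfully. All names are hypothetical.

```python
# Toy model of the stated requirement: service recovery waits for fencing.
# This is an illustration only and does not reflect rgmanager internals.

def can_recover_services(quorate, dead_nodes, fenced_nodes):
    """Allow recovery only if quorate and every dead node is confirmed fenced."""
    return bool(quorate) and set(dead_nodes) <= set(fenced_nodes)


if __name__ == "__main__":
    dead = ["n1", "n2", "n3"]
    # Case 2 from the report: quorum regained, but fence_ilo cannot reach the nodes.
    print(can_recover_services(True, dead, fenced_nodes=[]))
    # Same situation once fencing has succeeded for all three nodes.
    print(can_recover_services(True, dead, fenced_nodes=["n1", "n2", "n3"]))
```

In this model case 2 stays blocked for as long as fence_ilo cannot reach the powered-off bladecentre, which matches the observed behaviour.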

