Description of problem:
If cluster timeouts (quorum_dev_poll, consensus etc.) are set high, restarting a node leads to inconsistent cluster view. Services like rgmanager don't work anymore, GFS may freeze. The timeouts are the recommended timeouts for SAN.

Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5_6.1
openais-0.80.6-28.el5

How reproducible:
Set relatively high timeouts, restart a node, let it join the cluster.

Steps to Reproduce:
1. set a 2 node cluster, with qdisk and following timeouts:
   <cman expected_votes="3" quorum_dev_poll="81000"/>
   <totem token="82000" consensus="99000"/>
   <quorumd label="qdisk_trovi" votes="1" tko="10" interval="8"/>
2. start up stable cluster
3. restart a node
4. let it rejoin the cluster (within the 80s after restart)

Actual results:
The node rejoins the cluster, then gets evicted, needs several minutes to complete "service cman start", locktables on both nodes differ:
node1: group_tool: locktable: [1 2 2] JOIN_STOP_WAIT
node2: group_tool: locktable: [1 2] JOIN_STOP_WAIT
rgmanager does not start cleanly

Expected results:
clean rejoin or fencing

Additional info:
The problem has been reproduced with the current RHEL5.6 release, is independent from GFS and rgmanager.
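For readers comparing the configured timeouts, the values from the steps to reproduce can be converted to seconds with a small script. This is only a sketch: the wrapping <cluster> element and the function name summarize_timeouts are assumptions made for illustration, and the script only does unit arithmetic (quorum_dev_poll, token and consensus are in milliseconds; qdisk interval is in seconds and tko is a cycle count). It does not diagnose the bug.

```python
import xml.etree.ElementTree as ET

# Fragment taken from the steps to reproduce. The enclosing <cluster>
# element is an assumption so that the fragment parses as one document.
CLUSTER_CONF_FRAGMENT = """
<cluster>
  <cman expected_votes="3" quorum_dev_poll="81000"/>
  <totem token="82000" consensus="99000"/>
  <quorumd label="qdisk_trovi" votes="1" tko="10" interval="8"/>
</cluster>
"""


def summarize_timeouts(xml_text):
    """Return the configured timeouts, all expressed in seconds."""
    root = ET.fromstring(xml_text)
    cman = root.find("cman")
    totem = root.find("totem")
    quorumd = root.find("quorumd")
    return {
        # qdisk declares a node dead after tko missed cycles of interval seconds
        "qdisk_eviction_s": int(quorumd.get("tko")) * int(quorumd.get("interval")),
        # the remaining values are configured in milliseconds
        "quorum_dev_poll_s": int(cman.get("quorum_dev_poll")) / 1000,
        "token_s": int(totem.get("token")) / 1000,
        "consensus_s": int(totem.get("consensus")) / 1000,
    }


timeouts = summarize_timeouts(CLUSTER_CONF_FRAGMENT)
for name, seconds in timeouts.items():
    print(f"{name}: {seconds:g} s")
```

With the values above, tko * interval gives the 80 s window mentioned in step 4, just below quorum_dev_poll (81 s) and the totem token timeout (82 s).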
Clarification: "restart a node" means a hard reset of the machine, as would happen after a kernel panic or a short power outage. We encountered the problem on a real-life production system. We then reproduced the same behavior on a small cluster, using a hard reset to simulate a typical failure case. The timeout settings are based on SAN timeouts and Red Hat recommendations, such as:
https://access.redhat.com/kb/docs/DOC-37204
https://access.redhat.com/kb/docs/DOC-35071
The full process of reproducing the issue takes about 450 seconds.
Created attachment 484459 [details] Log from the "stable" node
Created attachment 484461 [details] Log from the node that has been reset
Any luck in reproducing / analyzing the problem? If you need more info, just ask; the issue is priority 1 for me.
Hello Lon, The bug is flagged as [NEEDINFO]. Is there any info I can provide? [I don't see any answers on this bug. I see my description and my 4 comments. Is this right?] Best regards, Michal Markowski ATIX AG
Is this related to 645299 ?
Not related to 545299, more likely Bug #533369. Honza, I noticed you did all the work on this bug - can you take a look at what it would take to backport. Thanks
(In reply to comment #9)
> Not related to 545299, more likely Bug #533369.
>
> Honza, I noticed you did all the work on this bug - can you take a look at what
> it would take to backport.
>
> Thanks

Steve,
Bug #533369 is the kernel bug "irq 9: nobody cared" after suspend to RAM, so this is a typo. Can you please send me the correct number? After a brief look at that bug, it doesn't seem familiar to me.
The correct bug ID is Bug 553369.
Steve, a backport doesn't seem to be totally impossible, but it would mean changing or adding simply too much code. I would rather not take that risk, especially this late in the product lifecycle (5.8).