Bug 685069 - RHCS cannot handle node restart when cluster timeouts are high
Summary: RHCS cannot handle node restart when cluster timeouts are high
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-15 08:14 UTC by Michal Markowski
Modified: 2018-11-14 13:54 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-05 14:39:55 UTC
Target Upstream Version:


Attachments
Log from the "stable" node (61.94 KB, text/plain)
2011-03-15 13:01 UTC, Michal Markowski
Log from the node that has been reset (103.66 KB, text/plain)
2011-03-15 13:02 UTC, Michal Markowski

Description Michal Markowski 2011-03-15 08:14:18 UTC
Description of problem:
If cluster timeouts (quorum_dev_poll, consensus, etc.) are set high, restarting a node leads to an inconsistent cluster view. Services such as rgmanager stop working, and GFS may freeze.
The timeouts used are those recommended for SAN environments.

Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5_6.1
openais-0.80.6-28.el5

How reproducible:
Set relatively high timeouts, restart a node, and let it rejoin the cluster.

Steps to Reproduce:
1. Set up a two-node cluster with qdisk and the following timeouts (a fuller cluster.conf sketch follows after these steps):
   <cman expected_votes="3" quorum_dev_poll="81000"/>
   <totem token="82000" consensus="99000"/>
   <quorumd label="qdisk_trovi" votes="1" tko="10" interval="8"/>
2. Start up a stable cluster.
3. Restart a node.
4. Let it rejoin the cluster (within the 80 s after restart).
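For reference, here is a minimal sketch of how a complete /etc/cluster/cluster.conf carrying these settings could be written from a shell heredoc. Only the timeout attributes and the qdisk label come from this report; the cluster name and node names are hypothetical placeholders, and fencing is omitted for brevity:

# Hypothetical two-node cluster.conf with the timeouts from this report.
cat > /etc/cluster/cluster.conf <<'EOF'
<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <!-- quorum_dev_poll (81 s) sits just above the qdisk eviction window -->
  <cman expected_votes="3" quorum_dev_poll="81000"/>
  <!-- totem token and consensus timeouts, in milliseconds -->
  <totem token="82000" consensus="99000"/>
  <!-- tko * interval = 10 * 8 = 80 s: the rejoin window named in step 4 -->
  <quorumd label="qdisk_trovi" votes="1" tko="10" interval="8"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1"/>
    <clusternode name="node2" nodeid="2" votes="1"/>
  </clusternodes>
</cluster>
EOF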
  
Actual results:
The node rejoins the cluster, is then evicted, and needs several minutes to complete "service cman start". The lock tables on the two nodes differ:
node1: group_tool: locktable: [1 2 2] JOIN_STOP_WAIT
node2: group_tool: locktable: [1 2] JOIN_STOP_WAIT

rgmanager does not start cleanly.
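As a hedged aside, the divergent state above can be inspected with the standard RHEL 5 cluster CLI tools on each node; this is a generic diagnostic sketch, not a sequence taken from the attached logs:

# Run on node1 and node2 and compare the output side by side.
group_tool ls        # groupd view; shows the JOIN_STOP_WAIT groups above
cman_tool status     # quorum state, expected votes, quorum device status
cman_tool nodes      # cluster membership as cman sees it
clustat              # rgmanager's view of nodes and services, if it is up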

Expected results:
A clean rejoin, or fencing of the node.

Additional info:
The problem has been reproduced with the current RHEL 5.6 release and is independent of GFS and rgmanager.

Comment 1 Michal Markowski 2011-03-15 08:50:44 UTC
Clarification:
"restart a node" means hard reset of the machine, such as kernel panic or short power outage. We encountered the problem on a real-life, productive system. We reproduced the same behavior on a small cluster and use hard reset to simulate a typical failure case. 

Timeout settings are based on SAN timeouts and Red Hat recommendations, such as:
https://access.redhat.com/kb/docs/DOC-37204
https://access.redhat.com/kb/docs/DOC-35071
The full process of reproducing the issue takes approximately 450 seconds.

Comment 2 Michal Markowski 2011-03-15 13:01:23 UTC
Created attachment 484459 [details]
Log from the "stable" node

Comment 3 Michal Markowski 2011-03-15 13:02:02 UTC
Created attachment 484461 [details]
Log from the node that has been reset

Comment 4 Michal Markowski 2011-03-23 16:20:45 UTC
Any luck in reproducing or analyzing the problem?
If you need more information, just ask; the issue is priority 1 for me.

Comment 6 Michal Markowski 2011-04-18 12:43:50 UTC
Hello Lon,

The bug is flagged as [NEEDINFO]. Is there any info I can provide?

[I don't see any answers on this bug. I see my description and my 4 comments. Is this right?]


Best regards,
Michal Markowski
ATIX AG

Comment 8 Lon Hohberger 2011-04-25 19:33:04 UTC
Is this related to 645299?

Comment 9 Steven Dake 2011-04-25 19:42:33 UTC
Not related to 545299, more likely Bug #533369.

Honza, I noticed you did all the work on this bug - can you take a look at what it would take to backport.

Thanks

Comment 10 Jan Friesse 2011-04-26 06:15:08 UTC
(In reply to comment #9)
> Not related to 545299, more likely Bug #533369.
> 
> Honza, I noticed you did all the work on this bug - can you take a look at what
> it would take to backport.
> 
> Thanks

Steve,
Bug #533369 is the kernel "irq 9: nobody cared" after suspend-to-RAM bug, so this must be a typo. Can you please send me the correct number? After a brief look at the bug, it doesn't seem familiar to me.

Comment 12 Steven Dake 2011-05-19 18:36:12 UTC
The correct Bug ID is Bug 553369.

Comment 13 Jan Friesse 2011-07-15 15:22:10 UTC
Steve,
The backport doesn't seem to be totally impossible, but it would mean changing or adding simply too much code. I would rather not take that risk, especially so late in the product lifecycle (5.8).

