Bug 685069 - RHCS cannot handle node restart when cluster timeouts are high
Summary: RHCS cannot handle node restart when cluster timeouts are high
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-15 08:14 UTC by Michal Markowski
Modified: 2018-11-14 13:54 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-05 14:39:55 UTC
Target Upstream Version:


Attachments
Log from the "stable" node (61.94 KB, text/plain)
2011-03-15 13:01 UTC, Michal Markowski
Log from the node that has been reset (103.66 KB, text/plain)
2011-03-15 13:02 UTC, Michal Markowski

Description Michal Markowski 2011-03-15 08:14:18 UTC
Description of problem:
If cluster timeouts (quorum_dev_poll, consensus, etc.) are set high, restarting a node leads to an inconsistent cluster view. Services such as rgmanager stop working, and GFS may freeze.
The timeouts used are those recommended for SAN environments.

Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5_6.1
openais-0.80.6-28.el5

How reproducible:
Set relatively high timeouts, restart a node, and let it rejoin the cluster.

Steps to Reproduce:
1. Set up a two-node cluster with qdisk and the following timeouts (a fuller cluster.conf sketch follows after these steps):
   <cman expected_votes="3" quorum_dev_poll="81000"/>
   <totem token="82000" consensus="99000"/>
   <quorumd label="qdisk_trovi" votes="1" tko="10" interval="8"/>
2. Start up a stable cluster.
3. Restart a node.
4. Let it rejoin the cluster (within the 80 s after restart).
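For reference, here is a minimal sketch of how a complete /etc/cluster/cluster.conf carrying these settings could be written from a shell heredoc. Only the timeout attributes and the qdisk label come from this report; the cluster name and node names are hypothetical placeholders, and fencing is omitted for brevity:

# Hypothetical two-node cluster.conf with the timeouts from this report.
cat > /etc/cluster/cluster.conf <<'EOF'
<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <!-- quorum_dev_poll (81 s) sits just above the qdisk eviction window -->
  <cman expected_votes="3" quorum_dev_poll="81000"/>
  <!-- totem token and consensus timeouts, in milliseconds -->
  <totem token="82000" consensus="99000"/>
  <!-- tko * interval = 10 * 8 = 80 s: the rejoin window named in step 4 -->
  <quorumd label="qdisk_trovi" votes="1" tko="10" interval="8"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1"/>
    <clusternode name="node2" nodeid="2" votes="1"/>
  </clusternodes>
</cluster>
EOF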
  
Actual results:
The node rejoins the cluster, is then evicted, and needs several minutes to complete "service cman start". The lock tables on the two nodes differ:
node1: group_tool: locktable: [1 2 2] JOIN_STOP_WAIT
node2: group_tool: locktable: [1 2] JOIN_STOP_WAIT

rgmanager does not start cleanly.
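As a hedged aside, the divergent state above can be inspected with the standard RHEL 5 cluster CLI tools on each node; this is a generic diagnostic sketch, not a sequence taken from the attached logs:

# Run on node1 and node2 and compare the output side by side.
group_tool ls        # groupd view; shows the JOIN_STOP_WAIT groups above
cman_tool status     # quorum state, expected votes, quorum device status
cman_tool nodes      # cluster membership as cman sees it
clustat              # rgmanager's view of nodes and services, if it is up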

Expected results:
A clean rejoin, or fencing of the node.

Additional info:
The problem has been reproduced with the current RHEL 5.6 release and is independent of GFS and rgmanager.

Comment 1 Michal Markowski 2011-03-15 08:50:44 UTC
Clarification:
"restart a node" means hard reset of the machine, such as kernel panic or short power outage. We encountered the problem on a real-life, productive system. We reproduced the same behavior on a small cluster and use hard reset to simulate a typical failure case. 

Timeout settings are based on SAN timeouts and Red Hat recommendations, such as:
https://access.redhat.com/kb/docs/DOC-37204
https://access.redhat.com/kb/docs/DOC-35071
The full process of reproducing the issue takes approximately 450 seconds.

Comment 2 Michal Markowski 2011-03-15 13:01:23 UTC
Created attachment 484459 [details]
Log from the "stable" node

Comment 3 Michal Markowski 2011-03-15 13:02:02 UTC
Created attachment 484461 [details]
Log from the node that has been reset

Comment 4 Michal Markowski 2011-03-23 16:20:45 UTC
Any luck in reproducing or analyzing the problem?
If you need more information, just ask; the issue is priority 1 for me.

Comment 6 Michal Markowski 2011-04-18 12:43:50 UTC
Hello Lon,

The bug is flagged as [NEEDINFO]. Is there any info I can provide?

[I don't see any answers on this bug. I see my description and my 4 comments. Is this right?]


Best regards,
Michal Markowski
ATIX AG

Comment 8 Lon Hohberger 2011-04-25 19:33:04 UTC
Is this related to 645299?

Comment 9 Steven Dake 2011-04-25 19:42:33 UTC
Not related to 545299, more likely Bug #533369.

Honza, I noticed you did all the work on this bug - can you take a look at what it would take to backport.

Thanks

Comment 10 Jan Friesse 2011-04-26 06:15:08 UTC
(In reply to comment #9)
> Not related to 545299, more likely Bug #533369.
> 
> Honza, I noticed you did all the work on this bug - can you take a look at what
> it would take to backport.
> 
> Thanks

Steve,
Bug #533369 is the kernel "irq 9: nobody cared" after suspend-to-RAM bug, so this must be a typo. Can you please send me the correct number? After a brief look at the bug, it doesn't seem familiar to me.

Comment 12 Steven Dake 2011-05-19 18:36:12 UTC
The correct Bug ID is Bug 553369.

Comment 13 Jan Friesse 2011-07-15 15:22:10 UTC
Steve,
The backport doesn't seem to be totally impossible, but it would mean changing or adding simply too much code. I would rather not take that risk, especially so late in the product lifecycle (5.8).

