Bug 513260 - Cman kills wrong nodes
Summary: Cman kills wrong nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 518060 518061
 
Reported: 2009-07-22 18:02 UTC by Carl Trieloff
Modified: 2016-04-26 14:25 UTC
CC: 8 users

Fixed In Version: cman-2.0.115-2.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 08:37:31 UTC
Target Upstream Version:
Embargoed:


Attachments
Test output including log captures during test case. (19.51 KB, application/x-gzip)
2010-03-08 22:54 UTC, Nate Straz


Links
Red Hat Product Errata RHBA-2010:0266 (normal, SHIPPED_LIVE): cman bug fix and enhancement update, last updated 2010-03-29 12:54:44 UTC

Description Carl Trieloff 2009-07-22 18:02:58 UTC
Use case:

Create a 4-node cluster with cman & AIS, using a redundant ring configuration.

- Break the network to isolate one node.

(issue -- cman does not exit the node that lost quorum)

- Re-establish the network.
(issue -- the 3 nodes exit, and not the 1 node that rejoined)

packages used:

cman-2.0.98-1.el5_3.1.hotfix.2
openais-0.80.3-22.el5_3.7
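
A minimal sketch of the reproduction steps (the interface name is a placeholder, not taken from this report):

# On the node to be isolated (assuming the cluster interconnect is eth1):
ip link set eth1 down

# On any surviving node, the isolated node should drop out of the membership:
cman_tool nodes

# On the isolated node, cman should report loss of quorum:
cman_tool status

# Restore the link; per this report, the three quorate nodes then exit
# instead of the single rejoining node:
ip link set eth1 up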

Comment 1 Perry Myers 2009-07-22 18:05:34 UTC
Need logs from all nodes in the cluster

Comment 2 Perry Myers 2009-07-22 18:12:00 UTC
Also, for reproducing this issue... when you say 'break the network', are you breaking all of the rings for a given node? I.e., with redundant ring there would be multiple network connections for each node. Or is only a single link/ring getting pulled?

Comment 3 Christine Caulfield 2009-07-23 07:54:48 UTC
As redundant ring is totally unsupported and untested software, is it possible to test this without RRP enabled?

Not only will it eliminate a potentially huge variable but, if the problem persists, I suspect it will simplify the logs hugely.

Comment 4 Nick Hall 2009-07-23 08:01:18 UTC
This isn't a redundant ring configuration - just a single ring using a dedicated network interface on each node.

I'll provide the logs shortly.

Comment 5 Carl Trieloff 2009-07-23 17:52:11 UTC
{From issue}

I’m still having the issues with cman, and I think it’s related to a multicast issue we’re seeing on the switch. 
 

Essentially one host in the cluster keeps dropping in and out of the IGMP snooping configuration on the switch, which causes it to drop in and out of the cluster. When it drops out, it is correctly shown as being down in cman_tool; when it comes back, the rest of the cluster commits suicide.
 

The logs from the rest of the cluster are essentially identical to the one I sent before.
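
A workaround often suggested for flapping IGMP snooping membership, assuming a Cisco IOS switch (the VLAN ID here is a placeholder, not from this report), is to disable snooping on the cluster VLAN so multicast is simply flooded:

! Cisco IOS example; verify the equivalent for your switch platform
no ip igmp snooping vlan 100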

Comment 6 Carl Trieloff 2009-07-23 17:53:04 UTC
The cluster config:

<?xml version="1.0"?>
<cluster config_version="11" name="testcluster">
   <clusternodes>
     <clusternode name="lnaiqlv21-cl2" votes="1" nodeid="1">
     </clusternode>
     <clusternode name="lnaiqlv22-cl2" votes="1" nodeid="2">
     </clusternode>
     <clusternode name="lnaiqlv23-cl2" votes="1" nodeid="3">
     </clusternode>
     <clusternode name="lnaiqlv24-cl2" votes="1" nodeid="4">
     </clusternode>
   </clusternodes>
   <cman port="5405">
     <multicast addr="239.255.255.1"/>
   </cman>
   <fencedevices/>
   <rm/>
   <totem version="2" secauth="off" threads="0"/> <!-- rrp_mode="active"/> -->
   <logging/>
   <amf mode="disabled"/>
   <event/>
   <aisexec/>
   <group/>
</cluster>
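
For reference, each of the four nodes in this configuration contributes one vote, so cman's default majority quorum works out as:

expected_votes = 4
quorum = expected_votes / 2 + 1 = 3

The three-node partition therefore stays quorate and the single isolated node is the one that should exit; the reported behaviour is the reverse.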

The log:

Jul 21 12:57:19 lnaiqlv21 openais[20947]: [TOTEM] Sending initial ORF token
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] New Configuration:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.244)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.245)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.246)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] Members Left:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] Members Joined:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] New Configuration:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.244)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.245)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.246)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.247)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] Members Left:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ] Members Joined:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.247)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [SYNC ] This node is within the primary component and will provide service.
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [TOTEM] entering OPERATIONAL state.
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [MAIN ] Killing node lnaiqlv24-cl2 because it has rejoined the cluster without cman_tool join
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] The token was lost in the OPERATIONAL state.
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] entering GATHER state from 2.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering GATHER state from 11.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Creating commit token because I am the rep.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Saving state aru 6 high seq received 6
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Storing new sequence id for ring 188d4
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering COMMIT state.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering RECOVERY state.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] position [0] member 10.229.21.244:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] previous ring seq 100560 rep 10.229.21.244
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] aru 6 high delivered 6 received flag 1
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] position [1] member 10.229.21.245:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] previous ring seq 100560 rep 10.229.21.244
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] aru 6 high delivered 6 received flag 1
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] position [2] member 10.229.21.246:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] previous ring seq 100560 rep 10.229.21.244
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] aru 6 high delivered 6 received flag 1
Jul 21 12:57:34 lnaiqlv21 openais[20947]: CMAN: Joined a cluster with disallowed nodes. must die
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Did not need to originate any messages in recovery.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Sending initial ORF token
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] New Configuration:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.244)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.245)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.246)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] Members Left:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.247)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] Members Joined:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] New Configuration:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.244)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.245)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ]       r(0) ip(10.229.21.246)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] Members Left:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] Members Joined:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [SYNC ] This node is within the primary component and will provide service.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering OPERATIONAL state.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM  ] got nodejoin message 10.229.21.245
Jul 21 12:57:34 lnaiqlv21 dlm_controld[20972]: cluster is down, exiting
Jul 21 12:57:34 lnaiqlv21 gfs_controld[20978]: groupd_dispatch error -1 errno 11
Jul 21 12:57:34 lnaiqlv21 fenced[20966]: groupd is down, exiting
Jul 21 12:57:34 lnaiqlv21 kernel: dlm: closing connection to node 3
Jul 21 12:57:34 lnaiqlv21 gfs_controld[20978]: groupd connection died
Jul 21 12:57:34 lnaiqlv21 kernel: dlm: closing connection to node 2
Jul 21 12:57:34 lnaiqlv21 gfs_controld[20978]: cluster is down, exiting
Jul 21 12:57:34 lnaiqlv21 kernel: dlm: closing connection to node 1
Jul 21 12:58:01 lnaiqlv21 ccsd[20939]: Unable to connect to cluster infrastructure after 30 seconds.
Jul 21 12:58:32 lnaiqlv21 ccsd[20939]: Unable to connect to cluster infrastructure after 60 seconds.

Comment 7 Carl Trieloff 2009-07-23 17:55:49 UTC
packages used:

cman-2.0.98-1.el5_3.4
openais-0.80.3-22.el5_3.8

Comment 8 Christine Caulfield 2009-07-24 15:36:23 UTC
I managed to make this happen using the STABLE3 code on Fedora 11. I'll go through the logs in detail on Monday.

Comment 14 Christine Caulfield 2009-08-14 11:31:34 UTC
Committed to the RHEL55 branch of git.

commit 34bccfffdb35f368a72e2fa6859f15f6e8f9ebb8
Author: Christine Caulfield <ccaulfie>
Date:   Wed Jul 29 11:17:47 2009 +0100

    cman: Fix a situation where cman could kill the wrong nodes

Comment 17 Nate Straz 2009-08-24 19:45:20 UTC
Chrissie,

I've written up a new test to cover this bug. Should we be covering both the INPUT and the OUTPUT cases (i.e., putting the DROP iptables rule in either chain)?
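
For clarity, the two cases would look roughly like this (the subnet is taken from the logs above; the exact rules used by the test are not shown in this report):

# INPUT case: drop cluster traffic arriving at the node under test
iptables -A INPUT -s 10.229.21.0/24 -j DROP

# OUTPUT case: drop cluster traffic leaving the node under test
iptables -A OUTPUT -d 10.229.21.0/24 -j DROP

# Either rule is removed the same way it was added, e.g.:
iptables -D INPUT -s 10.229.21.0/24 -j DROP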

Comment 21 Nate Straz 2010-03-08 22:54:23 UTC
Created attachment 398643 [details]
Test output including log captures during test case.

I'm still hitting some problems when running this on higher node counts. At times I get multiple partitions in cman, with the rest of the nodes shown as disallowed in the openais membership:

============================================================
Iteration 1: west-01 OUTPUT
============================================================
Setting up log capture: west-01 west-02 west-03 west-04 west-05 west-06 west-07 west-08
Stopping traffic from west-01
Waiting for other nodes to notice.
Restarting traffic from west-01
Waiting up to 60 seconds for things to blow up
        west-01 killed by node 2 because it joined without a full restart
        west-03 killing west-01 because it has rejoined the cluster with exisiting state
        west-02 killing west-01 because it has rejoined the cluster with exisiting state
        west-05 killing west-01 because it has rejoined the cluster with exisiting state
        west-06 killing west-01 because it has rejoined the cluster with exisiting state
        west-04 killing west-01 because it has rejoined the cluster with exisiting state
Error while checking for missing node
Cluster state - rows are 'cman_tool nodes' output from that node
         west-01 west-02 west-03 west-04 west-05 west-06 west-07 west-08
========================================================================
west-01
west-02        X       M      *d      *d      *d      *d      *d      *d
west-03        X      *d       M       M       M       M      *d      *d
west-04        X      *d       M       M       M       M      *d      *d
west-05        X      *d       M       M       M       M      *d      *d
west-06        X      *d       M       M       M       M      *d      *d
west-07        X      *d      *d      *d      *d      *d       M       M
west-08        X      *d      *d      *d      *d      *d       M       M
unexpected states marked with *
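
(Legend, per cman_tool(8): M = member, X = dead, d = node has existing state but is disallowed.)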

Comment 22 Christine Caulfield 2010-03-09 07:49:42 UTC
Disallowed state generally is not part of this bug. If we need to tune openais for higher node counts, that should really be a separate BZ. I managed to get 32 nodes working, but there will very likely be loads that break at lower node counts.

Comment 23 Christine Caulfield 2010-03-09 14:27:13 UTC
It might also be related to https://bugzilla.redhat.com/show_bug.cgi?id=556804

Comment 25 errata-xmlrpc 2010-03-30 08:37:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html

