Bug 485026 - Cman kills first node in initial cluster setup
Summary: Cman kills first node in initial cluster setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.4
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 487397
TreeView+ depends on / blocked
 
Reported: 2009-02-11 08:44 UTC by Marc Grimme
Modified: 2018-10-20 02:32 UTC (History)
32 users (show)

Fixed In Version: cman-2.0.100-1.el5
Doc Type: Bug Fix
Doc Text:
Cause: This bug was caused by a bug in the 'openais-only' detection code inside cman. If a node because quorate on its own, say because two_node=1 was set or a quorum disk was present then other services are allowed to start. This set the "Dirty" flag in CMAN to indicate that it has state that cannot be reproduced with a full restart. The check for rejecting node joins only checked for the remote node not having state. Consequence: If another node then joined the cluster, even if it didn't have any state it would be rejected. This means that the normal configuration for a 2 node cluster didn't work because one node is supposed to be quorate at all times, it would not let the second node join the cluster. Fix: The 'dirty' flag check now checks both the local and remote states of the nodes. So that the nodes can join the cluster if one of them does not have state. Result: Two node clusters now work correctly.
Clone Of:
Environment:
Last Closed: 2009-09-02 11:06:14 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
cluster.conf and syslog of both nodes (104.99 KB, application/x-compressed-tar)
2009-02-11 09:02 UTC, tubesenf
no flags Details
cluster.conf cman status with multicast (494.40 KB, application/x-compressed-tar)
2009-02-11 16:25 UTC, tubesenf
no flags Details
syslog, cluster.conf and aislog for node1 (axqa03-1) (50.00 KB, application/x-tar)
2009-02-12 12:16 UTC, Marc Grimme
no flags Details
syslog, cluster.conf and aislog for node2 (axqa03-2) (30.00 KB, application/x-tar)
2009-02-12 12:17 UTC, Marc Grimme
no flags Details
syslog, aislog of RHEL5.3 cluster (9.99 KB, application/x-zip)
2009-02-12 14:42 UTC, Marc Grimme
no flags Details
ais.log of node which started later (7.78 KB, application/octet-stream)
2009-02-12 16:40 UTC, tubesenf
no flags Details
both nodes powered off. start of oracs1, then oracs2. logs (30.40 KB, text/plain)
2009-02-16 10:30 UTC, Gianluca Cecchi
no flags Details
Patch to fix cman bug (799 bytes, patch)
2009-02-17 15:42 UTC, Christine Caulfield
no flags Details | Diff
first node start logs (18.72 KB, text/plain)
2009-02-18 11:55 UTC, Gianluca Cecchi
no flags Details
second node start logs (16.95 KB, text/plain)
2009-02-18 11:56 UTC, Gianluca Cecchi
no flags Details
first node cman killed by second node (3.74 KB, text/plain)
2009-02-18 11:57 UTC, Gianluca Cecchi
no flags Details
cluster.conf of oracs1 and oracs2 version 39 (2.11 KB, text/plain)
2009-02-18 13:32 UTC, Gianluca Cecchi
no flags Details
full messages for oracs1 (625.33 KB, text/plain)
2009-02-18 14:34 UTC, Gianluca Cecchi
no flags Details
full messages for oracs2 (1.37 MB, text/plain)
2009-02-18 14:35 UTC, Gianluca Cecchi
no flags Details
cluster.conf as of comment#109 (1.39 KB, application/octet-stream)
2009-04-09 08:46 UTC, Gianluca Cecchi
no flags Details
messages file of orastud1, first node to start the cluster the first time (39.37 KB, text/plain)
2009-04-09 08:47 UTC, Gianluca Cecchi
no flags Details
messages file of orastud2, second node that should join the cluster the first time (38.45 KB, text/plain)
2009-04-09 08:48 UTC, Gianluca Cecchi
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1341 0 normal SHIPPED_LIVE Low: cman security, bug fix, and enhancement update 2009-09-01 10:43:16 UTC

Description Marc Grimme 2009-02-11 08:44:13 UTC
Description of problem:
When a cluster starts up initially the first node's cman/openais gets killed when the second node joins.

Version-Release number of selected component (if applicable):
This behaviour can be seen with 5.2 and 5.3.

How reproducible:

Steps to Reproduce:
1. Shutdown cman on both nodes
2. Startup cman initscript on node1; wait for cluster to be fully running (i.e. fencedomain must be joined; validate with cman_tool services)
3. Startup cman initscript on node2; node1 will be killed by cman. No fencing will take place but all cluster dependent services will crash.
  
Actual results:
Cman/openais, .. crash on the first node of the cluster

Expected results:
Both nodes should stay in cluster.

Additional info:
This seems to be independent from the node count and qdisk and token timeouts. We've seen it with 2 node clusters with or without qdisk 3 node cluster with or without qdisk. This is reproducable.

Comment 1 Christine Caulfield 2009-02-11 08:55:18 UTC
Is this a crash (ie a segfault, and if so can we have a traceback please)? or are there messages in syslog saying why the node left the cluster ?

Comment 2 tubesenf 2009-02-11 09:02:38 UTC
Created attachment 331540 [details]
cluster.conf and syslog of both nodes

Comment 3 tubesenf 2009-02-11 09:05:19 UTC
we are experiencing the same issue on our two node rhel 5.3 cluster with quorum disk.

messages and cluster conf are attached.

this happened at 18:12 10-02-2009.

greetings 
daniel

Comment 4 Marc Grimme 2009-02-11 09:08:33 UTC
There is no traceback jet. But syslogs look the as described by daniel.

Should we set ulimit -c and then start up the cluster again?

Comment 5 Christine Caulfield 2009-02-11 09:20:24 UTC
Feb 10 18:11:32 dmse1-vm openais[1880]: [CMAN ] cman killed by node 2 because we
 rejoined the cluster without a full restart

That's the important message. It indicates a cluster split and rejoin. OR it can be caused by cisco switches being very slow to adjust to the multicast addresses.

See bugzillas passim.

Comment 6 Marc Grimme 2009-02-11 09:59:35 UTC
Note: 
1. we already setup the token timeout to 54000 as described in the cluster faq
2. This also happens on virtual machines being connected to a locally terminated bridge.
3. It also happens on non cisco switches.

Comment 7 Christine Caulfield 2009-02-11 10:17:28 UTC
The FAQ only recommends that token timeout for huge clusters. You shouldn't need it for anything under 30 nodes.

Comment 8 tubesenf 2009-02-11 10:35:04 UTC
hallo,
we are using foundry core switches type RX16

Comment 9 Marc Grimme 2009-02-11 10:35:47 UTC
Yes we only set it because of timeouts related to qdisk and multipathing.

I don't think it's related to the switches as we have it on nearly all clusters with different switches and even on locally terminated virtual guest clusters without any switch being involved.

Marc.

Comment 10 Christine Caulfield 2009-02-11 10:41:37 UTC
Looking closely at the syslogs, my guess is that the two nodes have come up as independent single-node clusters and hen joined together AFTER services have started on each.

A couple of suggestion would be to get cman_tool to wait for quorum before services get started using 'cman_tool join -wq', or (horrible hack!) put a sleep before the service start to allow the cluster to stabilise before they get going.

Comment 11 Marc Grimme 2009-02-11 11:19:52 UTC
We already set the CMAN_QUORUM_TIMEOUT in the initscript to 60 (totemtimeout 54000 < 60000) seconds. And the cluster came up quorate before joining the fencedomain. But still the problem persists.

Side note:
When we set the timeout for waiting for fencejoin (FENCE_START_TIMEOUT=30) very low. Then boot the first node. The join of the fencedomain will fail. Then we boot up the second one and it will work this time. Then we can join the first one to the fence domain (fence_tool join). But we have to wait for the second node. That means if we do it the other way round (implies a BIG sleep):
1. Boot first node (fails to join fencedomain because of low timeout)
2. fence_tool join on first node
3. Boot second node => first node gets killed
I would understand it in that way that the manipulation of the timeouts does not really help out.

Marc.

Comment 12 Lon Hohberger 2009-02-11 15:18:07 UTC
(In reply to comment #9)
> Yes we only set it because of timeouts related to qdisk and multipathing.
> 
> I don't think it's related to the switches as we have it on nearly all clusters
> with different switches and even on locally terminated virtual guest clusters
> without any switch being involved.
> 
> Marc.

Comment 13 Christine Caulfield 2009-02-11 15:23:06 UTC
Can you enable full CMAN logging and post the output from all nodes please ?

Comment 14 Lon Hohberger 2009-02-11 15:30:30 UTC
Oops... :)

Sorry about that quote -- checked a bad box.

Here's a silly question -- can you try switching the multicast address to 225.0.0.11 or something?

We used to use this address years ago, and still have people using it on RHEL3 clusters.  Some soft-switch vendors don't correctly forward site-local mcast addresses, despite IANA specs.  Now, it's in the 'reserved' address range, but it seems to work in years past, and has worked at least once where the default multicast address was being ignored by an "intelligent" switch.

Comment 15 Lon Hohberger 2009-02-11 15:33:35 UTC
http://sources.redhat.com/cluster/wiki/FAQ/CMAN#cman_cisco_switches

Solution #2 in the above link -- how to make cman give a different multicast address to openais.

Comment 16 Marc Grimme 2009-02-11 15:43:45 UTC
(In reply to comment #13)
> Can you enable full CMAN logging and post the output from all nodes please ?

How can we do this?

Comment 17 Lon Hohberger 2009-02-11 15:54:07 UTC
I'm not sure what I noted in comment #15 will have any effect on a Xen cluster.  Hrm.

Comment 18 tubesenf 2009-02-11 16:25:09 UTC
Created attachment 331581 [details]
cluster.conf cman status with multicast

Comment 19 tubesenf 2009-02-11 16:26:39 UTC
hi,
tried the multicast tag in a realserver cluster and still the same error.

Comment 20 Christine Caulfield 2009-02-11 16:28:35 UTC
        <logging to_file="yes" logfile="/tmp/ais.log" fileline="on">
          <logger ident="CMAN" debug="on"/>
        </logging>

Comment 21 Lon Hohberger 2009-02-11 20:47:22 UTC
That's strange; your logs were claiming a different multicast address:

Feb 11 16:56:35 xenlx08 openais[8957]: [MAIN ] Using default multicast address of 239.192.111.201 

...

netstat -an | grep 5405 on my machine says the following:

udp        0      0 192.168.122.4:5405 0.0.0.0:*                               
udp        0      0 225.0.0.12:5405    0.0.0.0:*                               

(5405 is what openais uses for multicast).

It should say the same on your machine... (well, except the local IP will be different).

Comment 22 Lon Hohberger 2009-02-11 20:48:49 UTC
Wait, openais isn't chkconfig'd on is it?

Comment 23 Marc Grimme 2009-02-11 21:27:22 UTC
It's saying the same on this machine. We checked it with cman_tool status and it showed 225.0.0.12. You need to look at the logs at about 17:04 and later our time.

BTW: we don't have no openais chkconfiged cause everything is started via the cman initscript.

Comment 24 Lon Hohberger 2009-02-11 22:54:44 UTC
Thanks, Marc:

So, we start:

Feb 11 17:01:08 xenlx07 openais[5043]: [CMAN ] quorum device registered 
Feb 11 17:01:08 xenlx07 qdiskd[5094]: <notice> Score sufficient for master operation (20/20; required=10); upgrading 
Feb 11 17:01:23 xenlx07 qdiskd[5094]: <info> Assuming master role 
Feb 11 17:01:26 xenlx07 qdiskd[5094]: <notice> Writing eviction notice for node 2 
Feb 11 17:01:26 xenlx07 openais[5043]: [CMAN ] quorum regained, resuming activity 
Feb 11 17:01:29 xenlx07 qdiskd[5094]: <notice> Node 2 evicted 

We're running fine, and the other node boots:

Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] New Configuration: 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] 	r(0) ip(192.168.0.1)  
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] Members Left: 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] Members Joined: 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] New Configuration: 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] 	r(0) ip(192.168.0.1)  
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] 	r(0) ip(192.168.0.2)  
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] Members Left: 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] Members Joined: 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] 	r(0) ip(192.168.0.2)  
Feb 11 17:04:44 xenlx07 openais[5043]: [SYNC ] This node is within the primary component and will provide service. 
Feb 11 17:04:44 xenlx07 openais[5043]: [TOTEM] entering OPERATIONAL state. 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] got nodejoin message 192.168.0.1 
Feb 11 17:04:44 xenlx07 openais[5043]: [CLM  ] got nodejoin message 192.168.0.2 
Feb 11 17:04:44 xenlx07 openais[5043]: [CPG  ] got joinlist message from node 1 
Feb 11 17:04:44 xenlx07 openais[5043]: [CPG  ] got joinlist message from node 2 
Feb 11 17:04:46 xenlx07 ntpd[5512]: synchronized to LOCAL(0), stratum 10
Feb 11 17:04:46 xenlx07 ntpd[5512]: kernel time sync enabled 0001

So, quorate, we are running fine, another node joins our quorate cluster and CMAN gets a MDK message:

Feb 11 17:05:00 xenlx07 openais[5043]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart 



Consequently, when the other node boots, everything looks good:

Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] New Configuration: 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] 	r(0) ip(192.168.0.2)  
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] Members Left: 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] Members Joined: 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] New Configuration: 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] 	r(0) ip(192.168.0.1)  
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] 	r(0) ip(192.168.0.2)  
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] Members Left: 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] Members Joined: 
Feb 11 17:04:50 xenlx08 openais[4914]: [CLM  ] 	r(0) ip(192.168.0.1)  
Feb 11 17:04:50 xenlx08 openais[4914]: [SYNC ] This node is within the primary component and will provide service. 
Feb 11 17:04:50 xenlx08 openais[4914]: [TOTEM] entering OPERATIONAL state. 

Except this little thing:

Feb 11 17:04:50 xenlx08 openais[4914]: [MAIN ] Node xenlx07-ic not joined to cman because it has existing state

Comment 25 Marc Grimme 2009-02-12 09:11:48 UTC
Ok, but what do you read from "Node xenlx07-ic not joined to
cman because it has existing state"?
And note if we:
0. Set the FENCED_TIMEOUT=30 (for example)
1. Boot first node (fails to join fencedomain because of low timeout)
2. Boot second node (it successfully joins fencedomain)
3. fence_tool join on first node
It just works.
What do you think about that?
Thanks
Marc.

Comment 26 Christine Caulfield 2009-02-12 09:36:21 UTC
I can't say why it doesn't join the fence domain. But the rest makes perfect sense:

If only one node has state (ie there are no fence or other daemons running) then the node without state can join it quite happily. If both nodes have state before they see each other then you get that "cannot join" message.

I wonder if qdisk is causing single nodes to have quorum even though their aisexec processes are not talking to each other ?

Comment 27 Marc Grimme 2009-02-12 10:14:36 UTC
I don't think it's just related to the qdisk, cause it also happens without qdisk.

Comment 28 Marc Grimme 2009-02-12 12:14:51 UTC
I'm attaching new logs (even for openais). The logs only show the clusterjoin as I cleared them before.

Note: This is a cluster with older openais, kernel, .. (RHEL5.2) but the problem still looks like the same. Perhaps I can provide new logs for 5.3 today.

Comment 29 Marc Grimme 2009-02-12 12:16:33 UTC
Created attachment 331680 [details]
syslog, cluster.conf and aislog for node1 (axqa03-1)

Comment 30 Marc Grimme 2009-02-12 12:17:02 UTC
Created attachment 331681 [details]
syslog, cluster.conf and aislog for node2 (axqa03-2)

Comment 31 Marc Grimme 2009-02-12 14:39:58 UTC
Here are new logs with a RHEL5.3 cluster.

This time again these problems. But afterwards the fencedomain is in JOIN_WAIT status. Which leads to a unusable cluster.

That's different to the other clusters. But still the first node gets killed.

Marc.

Comment 32 Marc Grimme 2009-02-12 14:42:28 UTC
Created attachment 331691 [details]
syslog, aislog of RHEL5.3 cluster

Comment 33 tubesenf 2009-02-12 16:38:50 UTC
syslog output of first node started:
 Feb 12 17:22:50 dmse1-vm openais[21072]: [logging.c:0042] cman killed by node 2 because we rejoined the cluster without a full restart 
Feb 12 17:22:50 dmse1-vm gfs_controld[21131]: cluster is down, exiting
Feb 12 17:22:50 dmse1-vm dlm_controld[21125]: cluster is down, exiting
Feb 12 17:22:50 dmse1-vm fenced[21119]: cluster is down, exiting
Feb 12 17:22:50 dmse1-vm kernel: dlm: closing connection to node 2
Feb 12 17:22:50 dmse1-vm kernel: dlm: closing connection to node 1
Feb 12 17:22:50 dmse1-vm qdiskd[21102]: <err> cman_dispatch: Host is down 
Feb 12 17:22:50 dmse1-vm qdiskd[21102]: <err> Halting qdisk operations 
Feb 12 17:23:16 dmse1-vm ccsd[21066]: Unable to connect to cluster infrastructure after 30 seconds. 
Feb 12 17:23:46 dmse1-vm ccsd[21066]: Unable to connect to cluster infrastructure after 60 seconds. 

ais.log
[totemsrp.c:1842] aru a high delivered a received flag 1
[totemsrp.c:1950] Did not need to originate any messages in recovery.
[totemsrp.c:4080] Sending initial ORF token
[clm.c:0513] CLM CONFIGURATION CHANGE
[clm.c:0514] New Configuration:
[clm.c:0516]    r(0) ip(192.168.0.21) 
[clm.c:0518] Members Left:
[clm.c:0523] Members Joined:
[clm.c:0513] CLM CONFIGURATION CHANGE
[clm.c:0514] New Configuration:
[clm.c:0516]    r(0) ip(192.168.0.21) 
[clm.c:0516]    r(0) ip(192.168.0.22) 
[clm.c:0518] Members Left:
[clm.c:0523] Members Joined:
[clm.c:0525]    r(0) ip(192.168.0.22) 
[sync.c:0321] This node is within the primary component and will provide service.
[totemsrp.c:1678] entering OPERATIONAL state.
[clm.c:0595] got nodejoin message 192.168.0.21
[clm.c:0595] got nodejoin message 192.168.0.22
[cpg.c:0884] got joinlist message from node 1
[logging.c:0042] cman killed by node 2 because we rejoined the cluster without a full restart

i will attach ais.log of the second node

Comment 34 tubesenf 2009-02-12 16:40:11 UTC
Created attachment 331710 [details]
ais.log of node which started later

Comment 36 Mark Hlawatschek 2009-02-13 13:44:04 UTC
Chrissi, Lon,
have you already been able to reproduce the issue ?
-Mark

Comment 37 Michael Waite 2009-02-13 16:54:44 UTC
Christine,
Do we have everything we need to move forward on this issue?
There are many different business projects on hold as a result of this un-documented feature.

Comment 38 Lon Hohberger 2009-02-13 20:49:06 UTC
(In reply to comment #36)
> Chrissi, Lon,
> have you already been able to reproduce the issue ?
> -Mark

I haven't been able to reproduce this at all on my 2-node KVM or my 5-node baremetal cluster.

Comment 39 Marc Grimme 2009-02-14 10:03:45 UTC
Lon,
are you using RHEL5.[23] with all rpms provided from this channel?

And did you 
1. bootup node1. 
2. Wait until it has joined the fence_domain successfully (cman_tool services shows everything ok)
3. bootup node2.
4. Let it join the cluster
5. node1 will be thrown out of the cluster

I'm having two kvm nodes installed with RHEL5.2/3 and if I boot them this way this problem happens.

What do you think?

Regards Marc.

Comment 40 Christine Caulfield 2009-02-16 08:54:47 UTC
I have tried all sorts of things in attempts to reproduce the problem and the only thing that works for me is to start the two nodes up disconnected and with a quorum forced at 1. Which is a highly unusual (and pathological) setup !

Comment 41 Gianluca Cecchi 2009-02-16 10:29:30 UTC
hello, I have the same situation with a 2-nodes physical cluster.
It was previously in 5U2 with qdisk and fence_ilo as fenceing method.
I had to do a workaround to have it working in 5U2: start of qdisk in level 20 before cman (21) instead of defualt 22. Otherwise there were problems with start of anode alone.
Now this had to be solved in U3 because of qdisk start inclusioin in cman script itself.
(Note: I didn't apply November patches related to cluster)
After update to U3 I encounter the same problems as yours.
I'm going to attach my situation with logs from both nodes.
file is "boot_order_first_oracs1_then_oracs2.txt"
The problem is symmetrical, so that if I change the start sequence of the nodes, I get the same: first node cman killed by the other.
What is very bad in my opinion is that on first node the services (in my case some fs, oracle DB and ip) remains active but for example clvmd has problems and it is impossible to manage services....
HIH debugging and I'm available to help in testing if necessary.
We would like to go to production before 28th of February with U3 packages if possible.
Gianluca

Comment 42 Gianluca Cecchi 2009-02-16 10:30:45 UTC
Created attachment 332018 [details]
both nodes powered off. start of oracs1, then oracs2. logs

Comment 43 Gianluca Cecchi 2009-02-16 10:33:53 UTC
Note on myself posts # 41 and 42.
As I have a case open, I was requested in the mean time to try to downgrade openais, without success.
So in my logs the version of openais is openais-0.80.3-15.el5.x86_64 instead of openais-0.80.3-22.el5.x86_64 as provided by default U3.
Gianluca

Comment 44 Christine Caulfield 2009-02-16 10:47:19 UTC
Aha! I managed to reproduce it at last

The problem seems to be related to qdisk, though it might need to be solved
elsewhere. 

What seems to happen is that the first node starts up on its own and
becomes quorate thanks to qdisk. So it starts fencing and thus the
"Dirty" bit gets set.

When that node is stable, the second node is started. This becomes
quorate on its own (thanks to qdisk) and starts fencing, THEN it sees
the other node. They try to merge and BANG, a fencing race happens.

I think this happens partly because of the way aisexec always joins a
single node cluster on its own before seeing any other nodes. As far as
ais is concerned all joins are really merges. But for cman's sake we can
only allow that where there is no state (in this case fencing).

So the questions are:
- is fencing 'state' ? I appreciate that it's more than just an order to
kill a node but does it really need to preclude merges ?
- can we 'fix' qdiskd startup so that it doesn't let fenced get its
state before the second node joins?
- is it possible to set the dirty bit later? is just starting a daemon
really enough to preclude a join or does it have to have done something
first ?
- what help do we need from cman to accomplish any of this. Bearing in
mind that cman in RHEL5/6 already holds some fencing state of its own
that might be useful.

Comment 45 Christine Caulfield 2009-02-16 10:48:36 UTC
Gianluca: Downgrading openais will make no difference. Downgrading cman might - it depends if what you are seeing is what I described in comment #44

Comment 46 Christine Caulfield 2009-02-16 10:56:12 UTC
Michael Waite: Just to be clear, this isn't an undocumented feature. It's documented in the man pages and in my white paper on the subject. That doesn't change the fact that it's almost universally misunderstood though!

What has changed in RHEL5.3 is that a bug has been fixed in cman that tightens the rules up regarding nodes merging with state which is, I suspect, why this seems to happen more frequently than it did.

Comment 47 Gianluca Cecchi 2009-02-16 11:12:14 UTC
Thanks Christine for your comment #45. 
I already knew that: I specified my version of openais to avoid anyone thinking that I had a not aligned 5U3, due eventually to an update mistake or so...
Coming back to the problem: as from #44 can we try to start at the same instant the two nodes and supposed they arrives together at the starting cman phase we should get a correct join ot both the twos? 
Can I go with this?
I ask because when I am in situation where node has cman killed and services still active, I'm not able to stop rgmanager cleanly and I have to force a power off..... 

Thanks,
Gianluca

Comment 48 Mark Hlawatschek 2009-02-16 11:17:57 UTC
(In reply to comment #46)
> Michael Waite: Just to be clear, this isn't an undocumented feature. It's
> documented in the man pages and in my white paper on the subject. That doesn't
> change the fact that it's almost universally misunderstood though!
> 
> What has changed in RHEL5.3 is that a bug has been fixed in cman that tightens
> the rules up regarding nodes merging with state which is, I suspect, why this
> seems to happen more frequently than it did.

Can we agree, that we found a (critical) bug here ?

Comment 49 Michael Waite 2009-02-16 14:38:38 UTC
Good info posted above, looks like it is understood what the root cause is now.

What is now needed to resolve this issue in short order?

Comment 50 Christine Caulfield 2009-02-16 14:55:26 UTC
Yes, qdisk operation is totally broken in 5.3. It's a cman bug.

Comment 51 Marc Grimme 2009-02-16 15:22:40 UTC
Chrissi, are you sure it is only broken with qdisk. I have the same problems without qdisk in a two node cluster.

Marc.

Comment 52 Christine Caulfield 2009-02-16 15:38:04 UTC
What happens is that if a single node has quorum then another node joining that cluster will cause it to be killed.

Now it is possible to do that without qdisk, maybe using two_node=1, or forcing expected_votes to 1 when it really shouldn't be.

So, yes, it can happen without qdisk; it's not a bug in qdisk itself but that's the component most affected by it.

Comment 53 Gianluca Cecchi 2009-02-16 16:32:41 UTC
Can I also ask for a quality control in rgmanager stop script?
With this problem, when you have the two nodes disconnected, you are unable to cleanly shutdown the server.
In my case first start oracs1, then oracs2, then BOOM.
Now I want to restart.
A shutdown of oracs2 hangs at rgmanager stop phase.
inside the code there is this:
stop_cluster()
{
        kill -TERM `pidof $RGMGRD`

        while [ 0 ]; do

                if [ -n "`pidof $RGMGRD`" ]; then
                        echo -n $"Waiting for services to stop: " 
                        while [ -n "`pidof $RGMGRD`" ]; do
                                sleep 1
                        done
                        echo_success
                        echo
                else
                        echo $"Services are stopped."
                fi

So if a soft kill of the process clurgmgrd is not succesfull, it stays indefinitely at this point inside the while loop....
Can this be a stop process for a cluster resource manager????

Also putting a -d flag to rgmanager, you only get this in messages
Feb 16 17:22:53 oracs2 shutdown[24147]: shutting down for system halt
Feb 16 17:22:53 oracs2 init: Switching to runlevel: 0
Feb 16 17:22:55 oracs2 modclusterd: shutdown succeeded
Feb 16 17:22:55 oracs2 rgmanager: [24238]: <notice> Shutting down Cluster Servic
e Manager... 

I would assume a more robust thing...
Sorry, but I'm a little demotivated....

Comment 54 Michael Waite 2009-02-16 17:11:37 UTC
I am sending this to Subhendu Ghosh  outside of this thread. I really want to make sure that this issue has enough visibility.

Comment 55 Gianluca Cecchi 2009-02-16 17:27:27 UTC
I confirm  that starting both nodes at the same instant, they reach quorum and all is ok in this particular scenario.

Comment 56 Lon Hohberger 2009-02-16 17:56:37 UTC
I reproduced this in testing.  My logs are here:

Comment 57 Christine Caulfield 2009-02-17 15:42:31 UTC
Created attachment 332231 [details]
Patch to fix cman bug

This patch has been tested by me and lon and should fix the cman problem. If you can test it on your system please do and let me know the results.

Comment 58 Gianluca Cecchi 2009-02-17 16:14:40 UTC
so I have to rebuild the cman.src.rpm applying this patch, to be able to test?
Or do you have an rpm to try?

Comment 59 Christine Caulfield 2009-02-17 16:41:47 UTC
I don't have an rpm, sorry

Comment 60 Christine Caulfield 2009-02-17 17:19:58 UTC
This fix has been committed to the STABLE2 and STABLE3 trees. I'll wait until we get more feedback before adding it to RHEL5.

Comment 61 Christine Caulfield 2009-02-17 17:23:05 UTC
If you can wait until tomorrow-ish and no-one else steps in, I'll have a go at building a test RPM. I've never done it before though so don't expect too much...

Comment 62 Christine Caulfield 2009-02-18 09:42:23 UTC
Packages for x86_64 are here.

http://people.redhat.com/ccaulfie/packages/

This is the first time I've built packages for other people's use so install them at your own risk. I won't be doing i686 packages because I don't have a RHEL5 system to do that on, sorry.

If you try them please let me know how you get on.

Thanks.

Comment 65 Gianluca Cecchi 2009-02-18 11:53:47 UTC
It seems the same behaviour.
I'm going to attach three files:
first_node_start.log
second_node_start.log
first_node_killcman.log

Steps I've done.
Both nodes up (because started together)
shutdown of node without service
shutdown of node with service
start up of the nodes in single user mode and update
cman-2.0.98-1chrissie (with --force)
openais-0.80.3-22.el5 (I had 0.80.3-15 as initially it seemed to be the cause of the problem in other thread)
startup of both nodes in the same instant to be sure it is again ok, as I updated cman and openais. And infact it is ok.
shutdown of node without service (call it first_node)
shutdown of node with service (call it second_node)
startup first_node (it should be identical which one I power on, shouldn't it?)
startup second_node
boom
in both nodes you can see inside logs:
Feb 18 12:27:55 oracs2 openais[5750]: [CMAN ] CMAN 2.0.98 (built Feb 18 2009 08:49:44) started 
Feb 18 12:35:35 oracs1 openais[5743]: [CMAN ] CMAN 2.0.98 (built Feb 18 2009 08:49:44) started 

so I think it should the correct cman from chrissie...

HIH,
Gianluca

Comment 66 Gianluca Cecchi 2009-02-18 11:55:45 UTC
Created attachment 332362 [details]
first node start logs

Comment 67 Gianluca Cecchi 2009-02-18 11:56:51 UTC
Created attachment 332363 [details]
second node start logs

Comment 68 Gianluca Cecchi 2009-02-18 11:57:41 UTC
Created attachment 332364 [details]
first node cman killed by second node

Comment 69 Christine Caulfield 2009-02-18 13:12:08 UTC
Can you post the full logs please, there seems to be quite a large time gap missing from those. In particular there is no message on the first node saying that the second one join with state, just on the the second node that it was killed.

Also, if the order you give is correct, that's a different symptom than the original one I was investigating - where the first node was killed by the second. Can you post the cluster.conf and also enable qdiskd debugging messages (log_level="7") too please.

Comment 70 Gianluca Cecchi 2009-02-18 13:28:33 UTC
I can attach full logs. I sent them truncated for easier reading.
But it IS the same problem:
first node to start is oracs2 in this test. At 12:29 it finishes  to start the cluster infra and oracle service on cluster logical volumes.
And the file is first_node_start.log

Then I wait for some minutes and at 12:35 I power on second node that is, in this test case, oracs1.

And up to 12:35 no further messages in logs of oracs2 (aka the first node of the test).
At 12:35 you get the logs of second node starting (oracs1 see file second_node_start.log) and the messages on first node (oracs2 see file first_node_killcman.log).

In this latest log file above, that is related to the first node (oracs2) you can see in particular an initial join and then suddenly a kill; exactly as before:

Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] New Configuration: 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ]   r(0) ip(192.168.16.8)  
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] Members Left: 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] Members Joined: 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] New Configuration: 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ]   r(0) ip(192.168.16.1)  
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ]   r(0) ip(192.168.16.8)  
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] Members Left: 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] Members Joined: 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ]   r(0) ip(192.168.16.1)  
Feb 18 12:35:34 oracs2 openais[5750]: [SYNC ] This node is within the primary component and will provide service. 
Feb 18 12:35:34 oracs2 openais[5750]: [TOTEM] entering OPERATIONAL state. 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] got nodejoin message 192.168.16.1 
Feb 18 12:35:34 oracs2 openais[5750]: [CLM  ] got nodejoin message 192.168.16.8 
Feb 18 12:35:34 oracs2 openais[5750]: [CPG  ] got joinlist message from node 2 
Feb 18 12:35:50 oracs2 openais[5750]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart 

let me know if I clarified or if I still have to send full logs.

Gianluca

BTW: have I a chance to cleanly shutdown the nodes when I am in this situation? For example with manual commands options of cman_tool or others?
I'm powering off and off the nodes... and I don't like it very much. 
Thanks,
Gianluca

Comment 71 Gianluca Cecchi 2009-02-18 13:32:38 UTC
Created attachment 332370 [details]
cluster.conf of oracs1 and oracs2 version 39

Comment 72 Christine Caulfield 2009-02-18 13:37:24 UTC
Please, please *always* attach full logs. I'd much rather have 2GB of log files to wade through than 1K of truncated logs that don't show what I'm looking for. 

I'm very good at filtering log files, it's my job and I've been doing it for a very long time now! And it's quite possible that I might spot something important that looks insignificant to you.

Comment 73 Gianluca Cecchi 2009-02-18 14:01:07 UTC
OK. I'm going to provide full logs. Sorry, I didn't want to offend...
But do you agree that it IS the same situation, do you?

ITMT I changed configuration:
[root@oracs1 ~]# ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 39 to 40

Update complete.
[root@oracs1 ~]# cman_tool version -r 40
[root@oracs1 ~]# echo $?
0
diff between cluster.conf 39 and 40
< <cluster alias="oracs" config_version="39" name="oracs">
---
> <cluster alias="oracs" config_version="40" name="oracs">
< 	<quorumd device="/dev/mapper/mpath3" interval="3" label="acsquorum" log_level="4" tko="5" votes="1">
---
> 	<quorumd device="/dev/mapper/mpath3" interval="3" label="acsquorum" log_level="7" tko="5" votes="1">

Do I have to restart the cluster again to have qdiskd logging changes applied?

Comment 74 Gianluca Cecchi 2009-02-18 14:34:19 UTC
Created attachment 332383 [details]
full messages for oracs1

Comment 75 Gianluca Cecchi 2009-02-18 14:35:11 UTC
Created attachment 332384 [details]
full messages for oracs2

Comment 76 Christine Caulfield 2009-02-18 14:53:50 UTC
Actually, no it isn't the same problem. For the reasons I posted in comment #69.

It looks similar on the face of it, but it's subtly different.

Comment 78 Lon Hohberger 2009-02-18 15:29:08 UTC
Sorry, typo in last comment:

You also need to do this in cluster.conf:

  <quorumd ... log_facility="local4" ... >

Then add this to syslog.conf:

  local4.* /var/log/qdiskd.log

That will trap qdiskd's log messages to /var/log/qdiskd.log

After doing this, restart qdiskd + syslogd

Comment 79 Christine Caulfield 2009-02-18 16:09:02 UTC
Gianluca - there's a problem with my RPM, can you try this one:

http://people.redhat.com/ccaulfie/packages/cman-2.0.98-1subodh.i386.rpm

[disclaimer in comments 61 & 62 are relevant here!]

Sorry.

Comment 81 Gianluca Cecchi 2009-02-18 16:19:27 UTC
No problem. I will test the new cman rpm.
Ok. set up logging and file for qdiskd.
I have at the moment this on the nodes in qdiskd.log:
Feb 18 16:50:40 oracs1 openais[5754]: [CMAN ] lost contact with quorum device 
Feb 18 17:12:28 oracs1 qdiskd[8502]: <debug> Heuristic: 'ping -c1 -w1
10.4.5.250' score=1 interval=2 tko=3 
Feb 18 17:12:28 oracs1 qdiskd[8502]: <debug> 1 heuristics loaded 
Feb 18 17:12:28 oracs1 qdiskd[8502]: <debug> Quorum Daemon: 1 heuristics, 3
interval, 5 tko, 1 votes 
Feb 18 17:12:28 oracs1 qdiskd[8502]: <debug> Run Flags: 00000031 
Feb 18 17:12:28 oracs1 qdiskd[8502]: <info> Quorum Partition: /dev/dm-5 Label:
acsquorum 
Feb 18 17:12:28 oracs1 qdiskd[8503]: <info> Quorum Daemon Initializing 
Feb 18 17:12:28 oracs1 qdiskd[8503]: <debug> I/O Size: 512  Page Size: 4096 
Feb 18 17:12:29 oracs1 qdiskd[8503]: <info> Heuristic: 'ping -c1 -w1
10.4.5.250' UP 
Feb 18 17:12:43 oracs1 qdiskd[8503]: <info> Initial score 1/1 
Feb 18 17:12:43 oracs1 qdiskd[8503]: <info> Initialization complete 
Feb 18 17:12:43 oracs1 qdiskd[8503]: <notice> Score sufficient for master
operation (1/1; required=1); upgrading 
Feb 18 17:12:49 oracs1 qdiskd[8503]: <debug> Making bid for master 
Feb 18 17:12:52 oracs1 qdiskd[8503]: <debug> Node 2 is UP 
Feb 18 17:13:01 oracs1 qdiskd[8503]: <info> Assuming master role 

Feb 18 16:50:41 oracs2 openais[5777]: [CMAN ] lost contact with quorum device 
Feb 18 17:12:44 oracs2 qdiskd[538]: <debug> Heuristic: 'ping -c1 -w1
10.4.5.250' score=1 interval=2 tko=3 
Feb 18 17:12:44 oracs2 qdiskd[538]: <debug> 1 heuristics loaded 
Feb 18 17:12:44 oracs2 qdiskd[538]: <debug> Quorum Daemon: 1 heuristics, 3
interval, 5 tko, 1 votes 
Feb 18 17:12:44 oracs2 qdiskd[538]: <debug> Run Flags: 00000031 
Feb 18 17:12:44 oracs2 qdiskd[538]: <info> Quorum Partition: /dev/dm-5 Label:
acsquorum 
Feb 18 17:12:44 oracs2 qdiskd[539]: <info> Quorum Daemon Initializing 
Feb 18 17:12:44 oracs2 qdiskd[539]: <debug> I/O Size: 512  Page Size: 4096 
Feb 18 17:12:45 oracs2 qdiskd[539]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250'
UP 
Feb 18 17:12:50 oracs2 qdiskd[539]: <debug> Node 1 is UP 
Feb 18 17:12:59 oracs2 qdiskd[539]: <info> Initial score 1/1 
Feb 18 17:12:59 oracs2 qdiskd[539]: <info> Initialization complete 
Feb 18 17:12:59 oracs2 qdiskd[539]: <notice> Score sufficient for master
operation (1/1; required=1); upgrading 
Feb 18 17:13:02 oracs2 qdiskd[539]: <info> Node 1 is the master 

The gap in time (16:50 and 17:12) depends on the fact that the restart command
output for qdiskd was successfull but actually it wasn't able to restart....

[root@oracs2 ~]# service qdiskd restart
Stopping the Quorum Disk Daemon:                           [  OK  ]
Starting the Quorum Disk Daemon:                           [  OK  ]
[root@oracs2 ~]# service qdiskd status
qdiskd is stopped
[root@oracs2 ~]# service qdiskd start
Starting the Quorum Disk Daemon:                           [  OK  ]


But I don't understand you saying it is different situation.
In two days ago logs I had this scenario.
Start oracs1, then after 2 minutes oracs2 and boom and part of logs of second
node (oracs2) were:

Feb 16 10:17:10 oracs2 openais[5896]: [SYNC ] This node is within the primary
component and will provide service. 
Feb 16 10:17:10 oracs2 ccsd[5885]: Error while processing connect: Connection
refused 
Feb 16 10:17:10 oracs2 openais[5896]: [TOTEM] entering OPERATIONAL state. 
Feb 16 10:17:10 oracs2 openais[5896]: [MAIN ] Node node01 not joined to cman
because it has existing state 
Feb 16 10:17:10 oracs2 openais[5896]: [CLM  ] got nodejoin message 192.168.16.1 
Feb 16 10:17:10 oracs2 openais[5896]: [CLM  ] got nodejoin message 192.168.16.8 
Feb 16 10:17:10 oracs2 openais[5896]: [CPG  ] got joinlist message from node 1 
Feb 16 10:17:11 oracs2 ccsd[5885]: Cluster is not quorate.  Refusing
connection. 

Today the start order scenario was inverted: Start of oracs2, then after 6
minutes start of oracs1 and boom and part of logs of second node (now oracs1)
are:
Feb 18 12:35:37 oracs1 openais[5743]: [SYNC ] This node is within the primary
component and will provide service. 
Feb 18 12:35:37 oracs1 openais[5743]: [TOTEM] entering OPERATIONAL state. 
Feb 18 12:35:37 oracs1 openais[5743]: [MAIN ] Node node02 not joined to cman
because it has existing state 
Feb 18 12:35:37 oracs1 openais[5743]: [CLM  ] got nodejoin message 192.168.16.1 
Feb 18 12:35:37 oracs1 openais[5743]: [CLM  ] got nodejoin message 192.168.16.8 
Feb 18 12:35:37 oracs1 openais[5743]: [CPG  ] got joinlist message from node 2 
Feb 18 12:35:37 oracs1 ccsd[5732]: Cluster is not quorate.  Refusing
connection. 

So the hosts are inverted but the messages are the same in my opinion.
Or I have not understood you comment #69 ....

Comment 82 Michael Waite 2009-02-18 16:22:58 UTC
Adding Subhendu Ghosh

Comment 84 RHEL Program Management 2009-02-18 16:43:22 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 85 Lon Hohberger 2009-02-18 18:18:05 UTC
The qdiskd initscript 'restart' doesn't work; there's a bug open about this.

  "service qdiskd stop; sleep 3; service qdiskd start"

...should work fine.

Comment 86 Lon Hohberger 2009-02-18 22:38:58 UTC
I put Christine's fixed source RPM through our build system and tested it on my cluster; here are the resulting packages:

http://people.redhat.com/lhh/cman-2.0.98-2chrissie.src.rpm

http://people.redhat.com/lhh/cman-2.0.98-2chrissie.i386.rpm
http://people.redhat.com/lhh/cman-devel-2.0.98-2chrissie.i386.rpm

http://people.redhat.com/lhh/cman-2.0.98-2chrissie.x86_64.rpm
http://people.redhat.com/lhh/cman-devel-2.0.98-2chrissie.x86_64.rpm

The first pass was a bad build as previously noted.

Comment 88 Gianluca Cecchi 2009-02-19 12:13:25 UTC
I confirm that with cman-2.0.98-2 rpm the nodes are able to join correctly the cluster also when one starts after some minutes in respect of the other.
Christine, in comment #46 you referred to a white paper of yours: do you have a link for it?
As also RHEL Product and Program Management is seeing this, hopefully, can we hope in better rgmanager shutdown logic (see comment #53) and overall quality (see qdiskd restart problem comment #85).
And also a sort of document that gives a clear workflow of the several pieces involved (cman, ccsd, fence, qdiskd, rgmanager, openais, ecc) so that in case of failur of one submodule one can have a way to know dependencies and restart correct path?
Thanks for the help.
Gianluca

Comment 89 Christine Caulfield 2009-02-20 08:05:25 UTC
Sorry for the confusion over the RPMs. I have had my wrist gently slapped and I have renounced the building of them on my systems ;-)

My white papers are all here:

http://people.redhat.com/ccaulfie/docs/

The one I referred to is aiscman.pdf but CSNetworking.pdf and ClusterPic.pdf might also be useful to you.

Comment 90 Christine Caulfield 2009-02-20 09:22:34 UTC
About the other other bugs, can you please report them separately? If you leave them as a single comment in an unrelated bugzilla they will get lost unfortunately.

Comment 91 Christine Caulfield 2009-02-20 09:29:52 UTC
I've committed the fix for 5.4. Still waiting for ACKs for 5.3.z

commit 0bc0fbcf186c76eae15c17aef1e09ce90d34a771
Author: Christine Caulfield <ccaulfie>
Date:   Tue Feb 17 17:14:00 2009 +0000

    cman: Don't eject nodes that were quorate on their own if we have no state

Comment 93 Marc Grimme 2009-02-24 15:16:34 UTC
I can also confirm that this bug is fixed.

Thanks that's really great news!

I also backported the cman rpms to also fix this bug in RHEL5.2 (cman 2.0.84).
Those can be downloaded here:

http://download.atix.de/yum/comoonics/testrpms/cman-2.0.84-2.4.i386.rpm
http://download.atix.de/yum/comoonics/testrpms/cman-2.0.84-2.4.x86_64.rpm
http://download.atix.de/yum/comoonics/testrpms/cman-devel-2.0.84-2.4.i386.rpm
http://download.atix.de/yum/comoonics/testrpms/cman-devel-2.0.84-2.4.x86_64.rpm
http://download.atix.de/yum/comoonics/testrpms/cman-2.0.84-2.4.src.rpm

They are neither signed nor verified and I don't want to responsible for any
harm they cause. But for my tests they worked perfectly well.

Marc.

Comment 94 Marc Grimme 2009-02-24 20:51:37 UTC
Could you please tell us if there will be an appropriate errata for this bug?

Marc.

Comment 95 Chris Feist 2009-02-25 16:08:33 UTC
Marking as modified, fix has been committed for 5.4.

Comment 98 luis torrontegui 2009-02-26 17:29:56 UTC
hi. i´m have the same problem.
I apply the rpm http://people.redhat.com/lhh/cman-2.0.98-2chrissie.x86_64.rpm
and solved, but i have other problem. When i shutdown one of the nodes, the services do not relocated to the node who remain.


Help me please ?

Comment 99 Gianluca Cecchi 2009-02-26 18:31:19 UTC
I have just done this and for me it is ok.
My cluster.conf is between attachments and if I have the ACSSRV service active on a node (ip + 4fs + oracle script) and I run shutdown -r on it the service is started on the other node.
messages on other node contains:
Feb 26 19:24:53 oracs1 clurgmgrd[6784]: <notice> Member 2 shutting down 
Feb 26 19:24:59 oracs1 clurgmgrd[6784]: <notice> Starting stopped service service:ACSSRV 
Feb 26 19:24:59 oracs1 qdiskd[5932]: <info> Node 2 shutdown 
Feb 26 19:25:05 oracs1 qdiskd[5932]: <info> Assuming master role 
Feb 26 19:25:08 oracs1 openais[5889]: [TOTEM] The token was lost in the OPERATIONAL state. 
Feb 26 19:25:08 oracs1 openais[5889]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). 
Feb 26 19:25:08 oracs1 openais[5889]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes). 
Feb 26 19:25:08 oracs1 openais[5889]: [TOTEM] entering GATHER state from 2. 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] entering GATHER state from 0. 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] Creating commit token because I am the rep. 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] Saving state aru 50 high seq received 50 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] Storing new sequence id for ring 4d9ac 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] entering COMMIT state. 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] entering RECOVERY state. 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] position [0] member 192.168.16.1: 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] previous ring seq 317864 rep 192.168.16.1 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] aru 50 high delivered 50 received flag 1 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] Did not need to originate any messages in recovery. 
Feb 26 19:25:13 oracs1 openais[5889]: [TOTEM] Sending initial ORF token 
Feb 26 19:25:13 oracs1 openais[5889]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 26 19:25:13 oracs1 openais[5889]: [CLM  ] New Configuration: 
Feb 26 19:25:13 oracs1 kernel: dlm: closing connection to node 2
Feb 26 19:25:13 oracs1 openais[5889]: [CLM  ]   r(0) ip(192.168.16.1)  
Feb 26 19:25:13 oracs1 kernel: kjournald starting.  Commit interval 5 seconds
Feb 26 19:25:13 oracs1 openais[5889]: [CLM  ] Members Left: 
Feb 26 19:25:13 oracs1 kernel: EXT3 FS on dm-18, internal journal
Feb 26 19:25:13 oracs1 openais[5889]: [CLM  ]   r(0) ip(192.168.16.8)  
Feb 26 19:25:14 oracs1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 26 19:25:14 oracs1 openais[5889]: [CLM  ] Members Joined: 

you can see the munt of first file system and then there are the other resources of the service that start ok..
It takes in total about 30 seconds.

HIH, Gianluca

Comment 100 luis torrontegui 2009-02-26 19:07:34 UTC
Sorry, i not explain correctly my test.

I have 2 node cluster. When i power off (and take off energy) the services don´t relocated.
It´s a common behaviour? i saw who the remaining node stay trying to fence de power off node.


Thank´s,

Comment 101 Gianluca Cecchi 2009-02-27 10:49:50 UTC
Which fencing devices are you using? 
Actually the surviving node should do a query about the status of the other and getting an answer that is already powered off and ok without starting fencing at all.
The problem is that if you detach all the power suppplies from the server, the iLO too is unpowered (if for example you have it) and so the killing node would not be able to connect and get the status of the other.
I simulated this behaviour putting an iptables rule on a node and then doing a panic of the other.
What you get is:
Feb 26 18:52:18 oracs2 fenced[5968]: fencing node "node01"
Feb 26 18:52:28 oracs2 fenced[5968]: agent "fence_ilo" reports: Unable to connect/login to fencing device 
Feb 26 18:52:28 oracs2 fenced[5968]: fence "node01" failed
and again each 5 seconds

but I don't think there is an easy solution to this.
The important thing is to avoid corruption of data.
So my scenario to simulate a problem between production lan and iLO lan was this:
node1 with active service
block connection to iLO of node2
no message about this in logs (it seems fence lan is not monitored, till it is needed, is true?)
panic of node2
node1 tries to fence it indefinitely
then
manual restart of node2 in single user mode
iptables rule so that it is not able to see node1
(I assume a problem between Lans and so that each node doesn't see the other node iLO lan... it could not be so in real situation)
init 3 on node2
node2 starts the cluster infra correctly and get joined
node1 doesn't try to fence node2 anymore

You still are in critical situation, because in case of problem on node1, node2 will not be able to fence it and start the service....
but I think probably this should be more a matter of monitoring.. perhaps

HIH, and for the others, if I'm wrong in my considerations, please let me know
Gianluca

Comment 102 luis torrontegui 2009-02-27 22:21:44 UTC
hi Gianluca.


My test was dettach power suplies of node1 simulating a crash in hardware (which is owner of services).
Node2 stay trying connect to fence device of node1.
My question is, this is a common behaviour of the software? or the an error in configuration?

My fence devices is IPMI and server´s is DELL poweredge 1950.


thank´s

Comment 108 Arwin Tugade 2009-04-07 20:32:30 UTC
Just giving my .02 cents here.

Applied Christine's patch and all seems well.  I have a 2 node (RHEL5U3 x86_64) with qdisk on Dell 2950's using DRAC5 fencing (I will be trying this on a 3 node without qdisk shortly).  Bringing down the cluster and bringing up 1 node (eventually fences node 2) and the cluster forms and stays.

Also the issue stated in <a href="show_bug.cgi?id=485026#c53">comment #53</a> is no longer an issue for me with this patch(rgmanger stop doesn't hang indefinately anymore).

Thanks,
Arwin

Comment 109 Gianluca Cecchi 2009-04-09 08:43:08 UTC
So it seems I have the chance to install another cluster from scratch.
Systems are HP blades with updated rhel 5 U3 x86_64 and cman provided by Chrissie.
Names are orastud1 and orastud2. Packages installed on both:
rgmanager-2.0.46-1.el5
ricci-0.12.1-7.3.el5_3
device-mapper-1.02.28-2.el5
openais-0.80.3-22.el5
luci-0.12.1-7.3.el5_3
cman-2.0.98-2chrissie
device-mapper-multipath-0.4.7-23.el5_3.1
Running kernel on both is:
2.6.18-128.1.6.el5

setup of multipath. Both nodes see the quorum disk lun as mpath6 
No cluster services started at all

Steps done for the very initial startup of the cluster:
1) format of the quorum disk
# mkqdisk -c /dev/mapper/mpath6 -l orastudquorum 
mkqdisk v0.6.0
Writing new quorum disk label 'orastudquorum' to /dev/mapper/mpath6.
WARNING: About to destroy all data on /dev/mapper/mpath6; proceed [N/y] ? y
Initializing status block for node 1...
..
2) copy of the same cluster.conf file on both nodes (see attachment)
It contains at this moment only two ip services (STUDSRV and DWHSRV) and is based on quorum
3) On first node:
[root@orastud1 cluster]# service cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing...

It remains here because I forgot to start qdisk... ;-(
in messages (I'm going to attach full logs too) I have:
Apr  8 18:06:37 orastud1 ccsd[8311]: Cluster is not quorate.  Refusing connection. 
Apr  8 18:06:37 orastud1 ccsd[8311]: Error while processing connect: Connection refused 
and
[root@orastud1 ~]# cman_tool status
Version: 6.1.0
Config Version: 1
Cluster Name: orastud
Cluster Id: 14022
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Quorum: 2 Activity blocked
Active subsystems: 7
Flags: 
Ports Bound: 0  
Node name: intraclstud1
Node ID: 1
Multicast addresses: 239.192.54.252 
Node addresses: 10.10.16.201 

Even if actually I expected qdiskd startup incorporated in cman should come in place.... clarification?
Anyway, in another console:
[root@orastud1 ~]# service qdiskd start
Starting the Quorum Disk Daemon:                           [  OK  ]

After some seconds:
Apr  8 18:06:50 orastud1 qdiskd[8419]: <info> Initial score 1/1 
Apr  8 18:06:50 orastud1 qdiskd[8419]: <info> Initialization complete 
Apr  8 18:06:50 orastud1 openais[8322]: [CMAN ] quorum device registered 
Apr  8 18:06:50 orastud1 qdiskd[8419]: <notice> Score sufficient for master operation (1/1; required=1); upgrading 
...
Apr  8 18:07:05 orastud1 ccsd[8311]: Error while processing connect: Connection refused 
Apr  8 18:07:05 orastud1 openais[8322]: [CMAN ] quorum regained, resuming activity 
and now
[root@orastud1 ~]# cman_tool status
Version: 6.1.0
Config Version: 1
Cluster Name: orastud
Cluster Id: 14022
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Quorum device votes: 1
Total votes: 2
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0  
Node name: intraclstud1
Node ID: 1
Multicast addresses: 239.192.54.252 
Node addresses: 10.10.16.201 

and cman start on the first console completes with:
   Starting fencing... done
                                                           [  OK  ]
4) [root@orastud1 cluster]# service modclusterd start
Starting Cluster Module - cluster monitor: 
In messages:
Apr  8 18:12:06 orastud1 modclusterd: startup succeeded
Apr  8 18:12:38 orastud1 kernel: dlm: Using TCP for communications
Apr  8 18:12:39 orastud1 clurgmgrd[8733]: <notice> Resource Group Manager Starting 
Apr  8 18:12:45 orastud1 clurgmgrd[8733]: <notice> Starting stopped service service:STUDSRV 
Apr  8 18:12:45 orastud1 clurgmgrd[8733]: <notice> Starting stopped service service:DWHSRV 
Apr  8 18:12:46 orastud1 in.rdiscd[9304]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Apr  8 18:12:46 orastud1 in.rdiscd[9304]: Failed joining addresses 
Apr  8 18:12:46 orastud1 in.rdiscd[9307]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Apr  8 18:12:46 orastud1 in.rdiscd[9307]: Failed joining addresses 
Apr  8 18:12:47 orastud1 clurgmgrd[8733]: <notice> Service service:STUDSRV started 
Apr  8 18:12:47 orastud1 clurgmgrd[8733]: <notice> Service service:DWHSRV started 

I am now able to ping the two virtual IP of the cluster, from an external pc

5) [root@orastud1 cluster]# service ricci start
Starting ricci: 

in messages:
Apr  8 18:13:32 orastud1 ricci: startup succeeded

and
[root@orastud1 ~]# clustat
Cluster Status for orastud @ Wed Apr  8 18:14:17 2009
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 intraclstud1                                                        1 Online, Local, rgmanager
 intraclstud2                                                        2 Offline
 /dev/dm-6                                                           0 Online, Quorum Disk

 Service Name                                                  Owner (Last)       State         
 ------- ----                                                  ----- ------                                                  -----         
 service:DWHSRV                                                intraclstud1       started       
 service:STUDSRV                                               intraclstud1       started       

[root@orastud1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2009-04-08 18:06:50  /dev/dm-6
   1   M      4   2009-04-08 18:05:50  intraclstud1
   2   X      0                        intraclstud2

[root@orastud1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 none        
[1]
dlm              1     rgmanager  00020001 none        
[1]

Now the tricky part: I want node2 to join this first started up cluster.....

on node2:
6) [root@orastud2 cluster]# service qdiskd start
Starting the Quorum Disk Daemon:                           [  OK  ]

In messages:
Apr  8 18:17:04 orastud2 qdiskd[14574]: <info> Waiting for CMAN to start 
why?

7) [root@orastud2 cluster]# service cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... 

BUM... it remains here... in messages:
4 times:
Apr  8 18:17:31 orastud2 ccsd[14660]: cluster.conf (cluster name = orastud, version = 1) found. 
Apr  8 18:17:31 orastud2 ccsd[14660]: Remote copy of cluster.conf is from quorate node. 
Apr  8 18:17:31 orastud2 ccsd[14660]:  Local version # : 1 
Apr  8 18:17:31 orastud2 ccsd[14660]:  Remote version #: 1 

Apr  8 18:17:32 orastud2 openais[14671]: [CLM  ] CLM CONFIGURATION CHANGE 
Apr  8 18:17:32 orastud2 ccsd[14660]: Cluster is not quorate.  Refusing connection. 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] New Configuration: 
Apr  8 18:17:33 orastud2 ccsd[14660]: Error while processing connect: Connection refused 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] Members Left: 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] Members Joined: 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] CLM CONFIGURATION CHANGE 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] New Configuration: 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ]        r(0) ip(10.10.16.202)  
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] Members Left: 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] Members Joined: 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ]        r(0) ip(10.10.16.202)  
Apr  8 18:17:33 orastud2 openais[14671]: [SYNC ] This node is within the primary component and will provide service. 
Apr  8 18:17:33 orastud2 openais[14671]: [TOTEM] entering OPERATIONAL state. 
Apr  8 18:17:33 orastud2 openais[14671]: [CLM  ] got nodejoin message 10.10.16.202 
Apr  8 18:17:33 orastud2 ccsd[14660]: Cluster is not quorate.  Refusing connection. 
Apr  8 18:17:33 orastud2 ccsd[14660]: Error while processing connect: Connection refused 
...
Apr  8 18:17:34 orastud2 qdiskd[14574]: <info> Quorum Partition: /dev/dm-12 Label: orastudquorum 
Apr  8 18:17:34 orastud2 qdiskd[14574]: <info> Quorum Daemon Initializing 
Apr  8 18:17:34 orastud2 ccsd[14660]: Cluster is not quorate.  Refusing connection. 
Apr  8 18:17:34 orastud2 ccsd[14660]: Error while processing connect: Connection refused 
Apr  8 18:17:34 orastud2 ccsd[14660]: Cluster is not quorate.  Refusing connection. 
Apr  8 18:17:34 orastud2 ccsd[14660]: Error while processing connect: Connection refused 
Apr  8 18:17:35 orastud2 qdiskd[14574]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' UP 
...
Apr  8 18:17:43 orastud2 qdiskd[14574]: <info> Node 1 is the master 
Apr  8 18:17:43 orastud2 dlm_controld[14700]: connect to ccs error -111, check ccsd or cluster status
Apr  8 18:17:43 orastud2 ccsd[14660]: Cluster is not quorate.  Refusing connection. 
Apr  8 18:17:43 orastud2 ccsd[14660]: Error while processing connect: Connection refused 
...

Apr  8 18:17:49 orastud2 qdiskd[14574]: <info> Initial score 1/1 
Apr  8 18:17:49 orastud2 qdiskd[14574]: <info> Initialization complete 
Apr  8 18:17:49 orastud2 openais[14671]: [CMAN ] quorum device registered 
Apr  8 18:17:49 orastud2 qdiskd[14574]: <notice> Score sufficient for master operation (1/1; required=1); upgrading 
no messages at all on node1.....

8) Ctrl+C on node2 on the cman start command
9) service cman stop on node 2
Apr  8 18:22:05 orastud2 ccsd[14660]: Stopping ccsd, SIGTERM received. 
Apr  8 18:22:05 orastud2 fenced[14694]: cluster is down, exiting
Apr  8 18:22:13 orastud2 dlm_controld[14700]: connect to ccs error -111, check ccsd or cluster status
Apr  8 18:22:13 orastud2 gfs_controld[14706]: connect to ccs error -111, check ccsd or cluster status

what do I do wrong? Do I have to use a particular cman command the first time on node2?
Or is this again the bug?
Thanks,
Gianluca

Comment 110 Gianluca Cecchi 2009-04-09 08:46:37 UTC
Created attachment 338880 [details]
cluster.conf as of comment#109

Comment 111 Gianluca Cecchi 2009-04-09 08:47:45 UTC
Created attachment 338881 [details]
messages file of orastud1, first node to start the cluster the first time

Comment 112 Gianluca Cecchi 2009-04-09 08:48:31 UTC
Created attachment 338882 [details]
messages file of orastud2, second node that should join the cluster the first time

Comment 113 Gianluca Cecchi 2009-04-09 09:09:45 UTC
As of my note in comment #109 :

"Even if actually I expected qdiskd startup incorporated in cman should come in
place.... clarification?"

I was wrong, because it depends on the fact that at that moment qdiskd service was not chkconfiged on for that node.
So, in cman script the function start_qdisk() contains:
/sbin/chkconfig --levels "$current_runlevel" qdiskd 2>/dev/null
    if [ $? -ne 0 ]; then
        # qdiskd doesn't start at this runlevel.
        return 0
    fi

and it is correct that I didn't see qdiskd starting.
Anyway I started it manually from another console session.

Comment 114 Gianluca Cecchi 2009-04-09 14:25:38 UTC
Just in case my issues are due to multicast propagation problems, how to check for this?
I found an utility named iperf on epel but I'm not sure if and how to use it to test....
To which network is multicast bound in redhat cluster? Intracluster one or both?
Thanks,
Gianluca

Comment 116 Gianluca Cecchi 2009-04-27 08:12:37 UTC
The problems described starting from comment#109 were indeed due to multicast issues.... and not with cluster software
It would be good to have some tools to test and verify multicast along with the cluster itself.

Comment 119 Christine Caulfield 2009-05-19 07:09:21 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:   This bug was caused by a bug in the 'openais-only' detection code inside cman. If a node because quorate on its own, say because two_node=1 was set or a quorum disk was present then other services are allowed to start. This set the "Dirty" flag in CMAN to indicate that it has state that cannot be reproduced with a full restart. The check for rejecting node joins only checked for the remote node not having state.

Consequence:  If another node then joined the cluster, even if it didn't have any state it would be rejected. This means that the normal configuration for a 2 node cluster didn't work because one node is supposed to be quorate at all times, it would not let the second node join the cluster.


Fix:   The 'dirty' flag check now checks both the local and remote states of the nodes. So that the nodes can join the cluster if one of them does not have state.

Result:  Two node clusters now work correctly.

Comment 120 Chris Ward 2009-07-03 18:24:22 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 122 Arwin Tugade 2009-08-05 23:00:41 UTC
Upgrading to the cman package in the Beta channel did not work for me.  The version went from:

cman-2.0.98-1.el5_3.1.x86_64

to:

cman-2.0.108-1.el5.x86_64

I updated only on 1 or 3 nodes.  No qdisk is present.

    Mounting configfs... done
    Starting ccsd... done
    Starting cman... failed
/usr/sbin/cman_tool: aisexec daemon didn't start

Version of openais that I'm on is:

openais-0.80.3-22.el5_3.9

I'm forced for rollback to cman-2.0.98-1.el5_3.1.x86_64, then the node comes up fine.  I have my config posted in Bug 515615.

Thanks,
Arwin

Comment 123 Perry Myers 2009-08-06 00:11:35 UTC
Arwin, upgrade to openais in RHEL5 beta as well.  I think latest openais available in beta channel is openais-0.80.6-8.el5

Comment 124 errata-xmlrpc 2009-09-02 11:06:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html


Note You need to log in before you can comment on or make changes to this bug.