663397 – Testing to determine reasonable bounds for token timeout in Clusters

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 663397 - Testing to determine reasonable bounds for token timeout in Clusters

Summary: Testing to determine reasonable bounds for token timeout in Clusters

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	cluster
Sub Component:
Version:	6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:	739930 752866
Blocks:	693781
TreeView+	depends on / blocked

Reported:	2010-12-15 16:57 UTC by Perry Myers
Modified:	2011-12-06 14:50 UTC (History)
CC List:	9 users (show)
Fixed In Version:	cluster-3.0.12.1-23.el6
Doc Type:	Bug Fix
Doc Text:	Cause: corosync token timeouts larger than the time it takes a failed node to rejoin the cluster, so corosync does not detect the node failure until it rejoins. Consequence: by the time dlm_controld asks corosync for the new member list, the failed node has already been added again, so dlm_controld does not notice a change. This eventually leads to the most observable consequence of dlm lockspace operations being stuck. Fix: dlm_controld can notice a node was removed and added between checks by looking for a changed incarnation number. Result: dlm_controld can properly handle nodes that are quickly removed and re-added due to large token timeouts.
Clone Of:
Clones:	752866 (view as bug list)
Environment:
Last Closed:	2011-12-06 14:50:26 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
cluster logs, token timeout 220 sec, service not correctly relocated (87.77 KB, application/x-gzip) 2011-10-05 15:57 UTC, Martin Juricek	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1516	0	normal	SHIPPED_LIVE	cluster and gfs2-utils bug fix update	2011-12-06 00:51:09 UTC

Description Perry Myers 2010-12-15 16:57:58 UTC

We allow users of RHEL HA Add On to adjust the token timeout arbitrarily via cluster.conf, however, we don't test the boundary conditions for the smallest timeout we should allow and the largest.

Primarily we haven't tested this due to the fact that we've never documented what reasonable timeout values are.

This bug is to track two things:

1. Engineering consensus on what the min and max fully supported token timeouts should be
2. QE testing of these extremes

dteigland has suggested that 5 seconds to 5 minutes are an appropriate range.  Input needed from other folks here.

Comment 1 Fabio Massimo Di Nitto 2010-12-15 18:04:12 UTC

(In reply to comment #0)
> We allow users of RHEL HA Add On to adjust the token timeout arbitrarily via
> cluster.conf, however, we don't test the boundary conditions for the smallest
> timeout we should allow and the largest.
> 
> Primarily we haven't tested this due to the fact that we've never documented
> what reasonable timeout values are.
> 
> This bug is to track two things:
> 
> 1. Engineering consensus on what the min and max fully supported token timeouts
> should be
> 2. QE testing of these extremes
>
> 
> dteigland has suggested that 5 seconds to 5 minutes are an appropriate range. 
> Input needed from other folks here.

Just a reminder that consensus (unless specified in cluster.conf) is calculated based on token timeout with the current formula:

if (node_count > 2) {
  consensus = (float)token+2000;
} else {
  consensus = (float)token*0.2;
  if (consensus < 200)
    consensus = 200;
  if (consensus > 2000)
    consensus = 2000;
}

Comment 2 Fabio Massimo Di Nitto 2011-01-06 10:31:23 UTC

I agree with David´s suggestion.

Steven, do you have any opinion? otherwise we should reassign the bug to QE.

Comment 11 Sayan Saha 2011-05-26 15:27:18 UTC

Dean - 5 seconds to 5 minutes is fine.

Comment 12 Fabio Massimo Di Nitto 2011-05-27 03:07:43 UTC

(In reply to comment #11)
> Dean - 5 seconds to 5 minutes is fine.

Sayan, we need to know the granularity.

We all agreed that the range has to be 5 secs to 5 mins.

Do you want a test run every 5 seconds delta? every 0.5 seconds delta?

first test is 5 seconds
second test is X seconds?
.....
last test is 5 minutues

Comment 13 Sayan Saha 2011-06-02 19:00:33 UTC

5 seconds to 5 minutes every 5 seconds should be OK. Give this a try and let us know.

Comment 14 Fabio Massimo Di Nitto 2011-06-03 06:29:33 UTC

(In reply to comment #13)
> 5 seconds to 5 minutes every 5 seconds should be OK. Give this a try and let us
> know.

Just to be clear, this means running approx 60 extra tests.

Comment 16 Fabio Massimo Di Nitto 2011-07-11 14:21:20 UTC

Unit test results

how to test:

<cluster name="fabbione" config_version="1" >
  <logging debug="on"/>
  <totem token="XXXXXXX"/>
  <cman two_nodes="1" expected_votes="1"/>
  <clusternodes>
....

XXXXX is replaced in loops by the test values.

serivice cman start on both nodes
tail -f /var/log/cluster/fenced.log on nodeA
date && pkill -9 corosync on nodeB

nodeA should fence nodeB no later than XXXXXXX value.

The exact fence time can vary a bit but it should always be lower than XXXXXXX.

Due to the fact that this test is very time consuming and not automatic yet, I performed a sample of the values in the range:

5000: ok
10000: ok
15000: ok
20000: ok
25000: ok
30000: ok
35000: ok
40000: ok
45000: ok
50000: ok
55000: ok
60000: ok
90000: ok
120000: ok
180000: ok
240000: ok
300000: ok

From 5 sec to 1m every 5 seconds, every 30secs between 1 and 2 minutes, every minute after.

Comment 20 Perry Myers 2011-07-12 12:50:41 UTC

I don't think we need to run tests for every token timeout _between_ 5 seconds and 5 minutes.  I think we can just test the boundaries.  So I suggest that we only run tests with token=5000 and token=300000.  It stands to reason that if those boundary times work, then token timeouts in the middle of that range should work just fine.

The other thing is that I'm not sure that the entire regression suite needs to be run for these boundary tests.  For example, not every resource agent needs to be tested with different token timeouts.  So we can reduce the scope of the tests here probably.  Dean, since you're most familiar with what tests are done as part of the regression test plan, and QE would have good knowledge of how token timeouts might relate to the different areas of the test plan, I'll leave it in your capable hands to determine how to trim the regression tests for these two boundary token timeout cases.

Comment 21 Fabio Massimo Di Nitto 2011-07-12 12:58:50 UTC

I agree with Perry that we don´t need the full regression test for each value of token.

This is my suggestion:

token 5 sec -> high load
token 10 sec -> high load
token 10 sec + qdiskd -> medium/high load
token 300 sec -> low/medium load (i doubt a cluster that has this failover timeout generates lots of load... or it would need to failover fast)
token 300 sec + qdiskd -> low/medium

IME a cluster with high load needs fast failover and viceversa.

so 5 run in total, and only 2 (5/10 secs) as heavy as possible. All the others just a small subset of the test suite should be just fine.

Comment 22 Fabio Massimo Di Nitto 2011-07-12 13:06:44 UTC

(In reply to comment #21)
> I agree with Perry that we don´t need the full regression test for each value
> of token.
> 
> This is my suggestion:
> 
> token 5 sec -> high load
> token 10 sec -> high load
> token 10 sec + qdiskd -> medium/high load
> token 300 sec -> low/medium load (i doubt a cluster that has this failover
> timeout generates lots of load... or it would need to failover fast)
> token 300 sec + qdiskd -> low/medium
> 
> IME a cluster with high load needs fast failover and viceversa.
> 
> so 5 run in total, and only 2 (5/10 secs) as heavy as possible. All the others
> just a small subset of the test suite should be just fine.

Sorry, I need to correct this.. 10 sec is default, you run that all the time.

To recap:

token 5 sec -> high load
token 300 sec -> low/medium load
token 300 sec + qdiskd -> low/medium

Comment 31 David Teigland 2011-09-26 18:23:55 UTC

If you run:
ps ax -o pid,stat,cmd,wchan | grep dlm_recoverd

do you see the following?

 3123 D    [dlm_recoverd]              dlm_wait_function
on the node that wasn't rebooted

 3089 D    [dlm_recoverd]              msleep
on the node that was rebooted

This is what I'm currently seeing with an upstream version of dlm when I do
node1: service cman start
node2: service cman start
node1: service clvmd start
node2: service clvmd start
node2: reboot -fin
node2: service cman start
node2: service clvmd start -- comms get stuck in dlm recovery

I don't know what's causing that yet.

Comment 32 David Teigland 2011-09-26 20:46:58 UTC

Data in comment 31 is from my devel machines which are an ad hoc combination of various newer upstream versions.

Now using test machines with RHEL6.2-20110923.3

two nodes with <totem token="220000"/>

node1: service cman start
node2: service cman start
node1: service clvmd start
node2: service clvmd start
node2: reboot -fin
node2: service cman start
node2: service clvmd start -- stuck due to problems with dlm connections

last dlm messages from node1:
dlm: got connection from 2
Extra connection from node 2 attempted

last dlm messages from node2:
dlm: connecting to 1

I need to reproduce this again with timestamped dlm messages, but I suspect that
the problem is because when node2 is added after the reboot, a failure and addition are processed back to back in quick succession.  That is what we want, but I think the closing/re-establishing of dlm connections gets confused by this, and the dlm connections end up being broken/unusable.

If that is in fact the problem, then I will probably look at adding an artificial delay to work around the problem instead of trying to change the dlm connection handling which is very tricky and outside my area of expertise.

Comment 33 David Teigland 2011-09-26 21:00:00 UTC

token timeouts greater than the node reboot time are clearly much different and more difficult than timeouts less than the node reboot time.  Perhaps they should be handled as separate features.  It doesn't appear that timeouts greater than reboot time are ready for 6.2.

Comment 34 David Teigland 2011-09-26 21:57:08 UTC

The simple test in comment 32 now works for me when I repeat it (as well as another test where I replace service clvmd start/stop with dlm_tool join/leave foo).  So the inconsistent results are making this more difficult to sort out.

Comment 35 David Teigland 2011-09-26 22:02:29 UTC

Note that sometimes "service clvmd start" completes and says "OK" when in fact it is stuck.  We need to check ps ax -o pid,stat,cmd,wchan | grep dlm

Comment 36 Martin Juricek 2011-09-27 12:36:26 UTC

I was using version RHEL-6.2-20110919.

Here is the ps output from the node, that wasn't rebooted:

[root@qe-x3650-01 ~]# ps ax -o pid,stat,cmd,wchan | grep dlm
 8727 Ssl  dlm_controld                poll_schedule_timeout
 8940 S    [dlm_astd]                  dlm_astd
 8941 S    [dlm_scand]                 dlm_scand
 8942 S    [dlm_recv/0]                worker_thread
 8943 S    [dlm_recv/1]                worker_thread
 8944 S    [dlm_recv/2]                worker_thread
 8945 S    [dlm_recv/3]                worker_thread
 8946 S    [dlm_recv/4]                worker_thread
 8947 S    [dlm_recv/5]                worker_thread
 8948 S    [dlm_recv/6]                worker_thread
 8949 S    [dlm_recv/7]                worker_thread
 8950 S    [dlm_send]                  worker_thread
 8951 D    [dlm_recoverd]              dlm_wait_function
 9032 S    [dlm_recoverd]              dlm_recoverd
 9061 S    [dlm_recoverd]              dlm_recoverd
11211 S    [dlm_recoverd]              dlm_recoverd
11705 S+   grep dlm                    pipe_wait


I don't have ps output from the rebooted node, because I wasn't able to get it booted correctly.

Comment 37 Martin Juricek 2011-09-27 15:02:19 UTC

I upgraded to version RHEL-6.2-20110926 (cman-3.0.12.1-22.el6.x86_64) and it seems to behave a little bit better. I was able to get ps output of the rebooted node too.

not rebooted node:
[root@qe-x3650-01 yum.repos.d]# ps ax -o pid,stat,cmd,wchan | grep dlm
 2958 Ssl  dlm_controld                poll_schedule_timeout
 3127 S    [dlm_astd]                  dlm_astd
 3128 S    [dlm_scand]                 dlm_scand
 3129 S    [dlm_recv/0]                worker_thread
 3130 S    [dlm_recv/1]                worker_thread
 3131 S    [dlm_recv/2]                worker_thread
 3132 S    [dlm_recv/3]                worker_thread
 3133 S    [dlm_recv/4]                worker_thread
 3134 S    [dlm_recv/5]                worker_thread
 3135 S    [dlm_recv/6]                worker_thread
 3136 S    [dlm_recv/7]                worker_thread
 3137 S    [dlm_send]                  worker_thread
 3138 D    [dlm_recoverd]              dlm_wait_function
 3213 S    [dlm_recoverd]              dlm_recoverd
 3242 S    [dlm_recoverd]              dlm_recoverd
14689 S    [dlm_recoverd]              dlm_recoverd
18181 S+   grep dlm                    pipe_wait


rebooted node:
[root@qe-x3650-02 ~]# ps ax -o pid,stat,cmd,wchan | grep dlm
 2022 Ssl  dlm_controld                poll_schedule_timeout
 2156 Ds   clvmd -T30                  dlm_new_lockspace
 2158 S    [dlm_astd]                  dlm_astd
 2159 S    [dlm_scand]                 dlm_scand
 2160 S    [dlm_recv/0]                worker_thread
 2161 S    [dlm_recv/1]                worker_thread
 2162 S    [dlm_recv/2]                worker_thread
 2163 S    [dlm_recv/3]                worker_thread
 2164 S    [dlm_recv/4]                worker_thread
 2165 S    [dlm_recv/5]                worker_thread
 2166 S    [dlm_recv/6]                worker_thread
 2167 S    [dlm_recv/7]                worker_thread
 2168 S    [dlm_send]                  worker_thread
 2169 D    [dlm_recoverd]              dlm_wait_function
 2473 D<l  rgmanager                   dlm_new_lockspace
 2613 S+   grep dlm                    pipe_wait

Comment 38 David Teigland 2011-09-27 20:42:48 UTC

I've discovered the root of the problem.  In the case when a node rejoins before the token has expired for it, cman delivers node down and node up events for it back to back in quick succession.  The daemons (fenced, dlm_controld, gfs_controld) get a cman callback indicating the member state has changed for each event, and then query the membership to see what's changed.  Sometimes, the first member query already includes the node addition, so they don't detect that the node left and rejoined via cman/quorum (they do via cpg).  The cman/quorum events drive a part of the recovery process, which never gets done.

The result of this in dlm_controld is that the /sys/kernel/config/dlm/cluster/comms/<nodeid> dir is not removed, so the old dlm connection is not closed, a new one is not created, so the recovery messages go nowhere, and dlm recovery gets stuck, which means userland daemons get stuck joining their lockspace, which is what we see above.

The solution in the daemons is fairly simple: compare the cluster incarnation numbers that cman is already returning, and if they are different then the node left and rejoined between membership queries.

This same fix should be done in fenced, dlm_controld and gfs_controld.  The cman events are not as important in fenced and gfs_controld as they are in dlm_controld, so this problem may not be as critical for them.

Comment 40 David Teigland 2011-09-29 15:09:53 UTC

pushed to RHEL6 branch
http://git.fedorahosted.org/git?p=cluster.git;a=commitdiff;h=867534832d5cc642a3c0692c2599cd01d38026e5
http://git.fedorahosted.org/git?p=cluster.git;a=commitdiff;h=d5ba0b947a8995b83df94c08497c451e4ad867ee
http://git.fedorahosted.org/git?p=cluster.git;a=commitdiff;h=0fac6d9e86424411289bf7926bc6f8c6805082e9

Comment 41 Fabio Massimo Di Nitto 2011-09-29 15:27:19 UTC

Moving back to POST as we need to build patches to fix this BZ.

Comment 43 David Teigland 2011-09-30 14:41:00 UTC

This bz started out as a general testonly bz to verify a range of different timeouts.  One timeout failed to work, and this bz was hijacked for that specific dlm_controld bug, with timeouts specifically larger than node reboot time.

(Perhaps another bug should be recreated for the original purpose of testing/verifying the original range of bzs?)

The test described in comment 30 is probably the best way to hit the specific problem this bz now represents.  A two node cluster, with token timeout larger than the time it takes a node to reboot (which is the key to hitting this specific problem.)

The specific scenario that requires the fix does not occur every time.  You could run the test a number of times and just expect that the scenario is likely to have happened once in that number.  Or, you can check the dlm_controld debug logs to see if the scenario happened or not.

To do that:

node1: service cman start
node2: service cman start
node1: service clvmd start
node2: service clvmd start
node2: reboot -fin
(wait for node2 reboot)
node2: service cman start
node2: service clvmd start  (the bug was causing this to get stuck)
node1: dlm_tool dump | grep 'cluster node'

If the problematic scenario did not occur, you'll see something like the following, where the same node (2) is removed and added in the same seq (260),
but on different lines:

1317160192 cluster node 2 removed seq 260
1317160192 cluster node 2 added seq 260

If the problematic scenario did occur, you'd see one line like the following,
which indicates the specific bug fix was used and worked:

1317160192 cluster node 2 removed and added seq 260 old 252 new 260

Comment 44 Martin Juricek 2011-10-05 15:57:19 UTC

Created attachment 526535 [details]
cluster logs, token timeout 220 sec, service not correctly relocated

I tried to run the test from Comment #30 with new version of packages. 

rpm -q cman corosync
cman-3.0.12.1-23.el6.x86_64
corosync-1.4.1-4.el6.x86_64
[root@qe-x3650-01 ~]# uname -r
2.6.32-203.el6.x86_64



The behaviour is little bit different now:

1. Two-node cluster, clustered service with mounted gfs2 filesystem is running
on node2.
2. run "reboot -fin" on node2
3. node2 reboots, token timeout is big enough so node1 thinks node2 is online

(so far the behaviour is the same as with previous version...)

4. node2 after reboot starts cman service; node2 is then fenced by node1, but  clustat on node1 shows node2 as "online" and cluster service is not relocated to node1.

5. node2 after fencing boots and starts cman, then  cluster service is relocated to node1; node2 successfully joins the cluster.


I think there is a problem in step #4, node1 after successful fencing node2 should set status of node2 to "offline" and relocate the cluster service to itself.

Comment 45 Lon Hohberger 2011-10-06 15:27:31 UTC

Martin,

This second issue may be an rgmanager bug.  I will need to take a look and get back to you.

Setting needinfo->me for now.

Comment 46 Lon Hohberger 2011-10-20 15:20:24 UTC

This appears to work:

Oct 20 11:16:16 rhel6-5 rgmanager[2239]: I am node #5
Oct 20 11:16:17 rhel6-5 rgmanager[2239]: Resource Group Manager Starting
Oct 20 11:16:17 rhel6-5 rgmanager[2239]: Loading Service Data
Oct 20 11:16:34 rhel6-5 rgmanager[2239]: Initializing Services
Oct 20 11:16:34 rhel6-5 rgmanager[2239]: Services Initialized
Oct 20 11:16:34 rhel6-5 rgmanager[2239]: State change: Local UP
Oct 20 11:16:34 rhel6-5 rgmanager[2239]: State change: rhel6-1 UP
Oct 20 11:16:34 rhel6-5 rgmanager[2239]: State change: rhel6-2 UP
Oct 20 11:16:34 rhel6-5 rgmanager[2239]: State change: rhel6-3 UP
Oct 20 11:16:34 rhel6-5 rgmanager[2239]: State change: rhel6-4 UP
Oct 20 11:17:31 rhel6-5 corosync[2021]:   [QUORUM] Members[4]: 1 3 4 5
Oct 20 11:17:31 rhel6-5 corosync[2021]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 20 11:17:31 rhel6-5 corosync[2021]:   [QUORUM] Members[5]: 1 2 3 4 5
Oct 20 11:17:31 rhel6-5 corosync[2021]:   [QUORUM] Members[5]: 1 2 3 4 5
Oct 20 11:17:31 rhel6-5 kernel: dlm: closing connection to node 2
Oct 20 11:17:31 rhel6-5 rgmanager[2239]: State change: rhel6-2 DOWN
Oct 20 11:17:31 rhel6-5 corosync[2021]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.122.138) ; members(old:5 left:1)
Oct 20 11:17:32 rhel6-5 corosync[2021]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 20 11:17:32 rhel6-5 fenced[2068]: fencing deferred to rhel6-1
Oct 20 11:18:19 rhel6-5 corosync[2021]:   [QUORUM] Members[4]: 1 3 4 5
Oct 20 11:18:19 rhel6-5 corosync[2021]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 20 11:18:19 rhel6-5 corosync[2021]:   [QUORUM] Members[5]: 1 2 3 4 5
Oct 20 11:18:19 rhel6-5 corosync[2021]:   [QUORUM] Members[5]: 1 2 3 4 5
Oct 20 11:18:19 rhel6-5 kernel: dlm: closing connection to node 2
Oct 20 11:18:19 rhel6-5 corosync[2021]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.122.138) ; members(old:5 left:1)
Oct 20 11:18:19 rhel6-5 corosync[2021]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 20 11:18:20 rhel6-5 rgmanager[2239]: Taking over service service:a from down member rhel6-2
Oct 20 11:18:20 rhel6-5 rgmanager[2239]: Service service:a started

Comment 47 Lon Hohberger 2011-10-20 15:22:55 UTC

On rhel6-1:

...
Oct 20 11:17:32 rhel6-1 fenced[2079]: fencing node rhel6-2
Oct 20 11:17:33 rhel6-1 fenced[2079]: fence rhel6-2 success
Oct 20 11:18:19 rhel6-1 corosync[2032]:   [QUORUM] Members[4]: 1 3 4 5
Oct 20 11:18:19 rhel6-1 corosync[2032]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 20 11:18:19 rhel6-1 kernel: dlm: closing connection to node 2
Oct 20 11:18:19 rhel6-1 corosync[2032]:   [QUORUM] Members[5]: 1 2 3 4 5

It's at this point node2 has rejoined the cluster.

Comment 48 Lon Hohberger 2011-10-20 15:32:17 UTC

I tried killing the service owner a few more times -- in each case, failover completed after fencing and the service was restored to operation.

Comment 49 Lon Hohberger 2011-10-20 15:55:08 UTC

I did reproduce the timing issue once.  What appears to be happening is rgmanager blocks transitions while waiting for the DLM and corosync to stabilize on the new membership.


Oct 20 11:29:59 rhel6-2 rgmanager[2844]: I am node #2
Oct 20 11:30:00 rhel6-2 rgmanager[2844]: Resource Group Manager Starting
Oct 20 11:30:00 rhel6-2 rgmanager[2844]: Loading Service Data
Oct 20 11:30:06 rhel6-2 rgmanager[2844]: Initializing Services
Oct 20 11:30:06 rhel6-2 rgmanager[2844]: Services Initialized
Oct 20 11:30:06 rhel6-2 rgmanager[2844]: State change: Local UP
Oct 20 11:30:06 rhel6-2 rgmanager[2844]: State change: rhel6-1 UP
Oct 20 11:30:06 rhel6-2 rgmanager[2844]: State change: rhel6-3 UP
Oct 20 11:30:06 rhel6-2 rgmanager[2844]: State change: rhel6-4 UP
Oct 20 11:30:06 rhel6-2 rgmanager[2844]: State change: rhel6-5 UP

[reboot -fin rhel6-1]

Oct 20 11:33:14 rhel6-2 corosync[2388]:   [QUORUM] Members[4]: 2 3 4 5
Oct 20 11:33:14 rhel6-2 corosync[2388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 20 11:33:14 rhel6-2 kernel: dlm: closing connection to node 1
Oct 20 11:33:14 rhel6-2 corosync[2388]:   [QUORUM] Members[5]: 1 2 3 4 5
Oct 20 11:33:14 rhel6-2 corosync[2388]:   [QUORUM] Members[5]: 1 2 3 4 5

^^ rhel6-1 (1) has rebooted very fast and rejoined the cluster, so we must process the 'death' of it

Oct 20 11:33:14 rhel6-2 rgmanager[2844]: State change: rhel6-1 DOWN

^^ rgmanager begins to enter recovery, but must wait for DLM to recover now, and for corosync to allow messages again (messages are blocked while forming a new ring)

Oct 20 11:33:14 rhel6-2 corosync[2388]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.122.146) ; members(old:5 left:1)
Oct 20 11:33:14 rhel6-2 corosync[2388]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 20 11:33:14 rhel6-2 fenced[2438]: fencing node rhel6-1
Oct 20 11:33:16 rhel6-2 fenced[2438]: fence rhel6-1 success
Oct 20 11:33:50 rhel6-2 corosync[2388]:   [QUORUM] Members[4]: 2 3 4 5
Oct 20 11:33:50 rhel6-2 corosync[2388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.

^^ We have now actually finished cluster recovery of rhel6-1 after the first reboot

Oct 20 11:33:50 rhel6-2 kernel: dlm: closing connection to node 1

^^ dlm recovers

Oct 20 11:33:50 rhel6-2 corosync[2388]:   [QUORUM] Members[5]: 1 2 3 4 5
Oct 20 11:33:50 rhel6-2 corosync[2388]:   [QUORUM] Members[5]: 1 2 3 4 5

^^ Node rejoined nearly exactly as we finished forming a new corosync ring; messaging now has to wait for the 

Oct 20 11:33:50 rhel6-2 corosync[2388]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.122.146) ; members(old:5 left:1)
Oct 20 11:33:50 rhel6-2 dlm_controld[2454]: cpg_mcast_joined retry 100 start
Oct 20 11:33:50 rhel6-2 corosync[2388]:   [MAIN  ] Completed service synchronization, ready to provide service.

^^ rgmanager is now able to send/receive messages about states so:

Oct 20 11:33:50 rhel6-2 rgmanager[2844]: Taking over service service:a from down member rhel6-1
Oct 20 11:33:51 rhel6-2 rgmanager[2844]: Service service:a started

^^ Failover completes

Oct 20 11:34:01 rhel6-2 kernel: dlm: got connection from 1
Oct 20 11:34:09 rhel6-2 rgmanager[2844]: State change: rhel6-1 UP

^^ second reboot completed and rgmanager has started on rhel6-1 again



So, between 11:33:14 and 11:33:50, clustat reports the "dead" node as online simply because corosync still thinks it is online.  Clustat reports the corosync/cman view of the cluster, not rgmanager's internal view.

It hasn't actually completed moving the service yet because it can't send or receive messages while the cluster membership is in flux.

Comment 50 Lon Hohberger 2011-10-20 15:59:05 UTC

After everything stabilizes, failover completes and the states are correct.  It's during some of the transition period that clustat reports unexpected (note: not actually "wrong" per se).

So, I don't actually think this is a bug in rgmanager and I think we're good here.

Comment 51 David Teigland 2011-10-27 14:08:58 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: corosync token timeouts larger than the time it takes a failed node to rejoin the cluster, so corosync does not detect the node failure until it rejoins.
Consequence: by the time dlm_controld asks corosync for the new member list, the failed node has already been added again, so dlm_controld does not notice a change.  This eventually leads to the most observable consequence of dlm lockspace operations being stuck.
Fix: dlm_controld can notice a node was removed and added between checks by looking for a changed incarnation number.
Result: dlm_controld can properly handle nodes that are quickly removed and re-added due to large token timeouts.

Comment 54 Lon Hohberger 2011-11-10 16:18:23 UTC

*** Bug 752866 has been marked as a duplicate of this bug. ***

Comment 57 errata-xmlrpc 2011-12-06 14:50:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1516.html

Note You need to log in before you can comment on or make changes to this bug.