Bug 258121

Summary: groupd should sort out a quick fail-join-fail of one node
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.0
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: low
Priority: medium
Reporter: David Teigland <teigland>
Assignee: Chris Feist <cfeist>
QA Contact:
Docs Contact:
CC: cluster-maint, nstraz
Target Milestone: ---
Target Release: ---
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-01-04 20:45:41 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:
Bug Depends On:
Bug Blocks: 469212

Description David Teigland 2007-08-27 21:08:49 UTC
Description of problem:

On nate's tank cluster during revolver it appears that:

- node 1 was selected by revolver to die, so it did reboot -fin
- node 2 sees node 1 fail due to its reboot
- node 2 starts recovery on node 1
- node 1 came back and rejoined the fence domain
- node 2 finally fenced node 1, power cycling it
- node 2 sees node 1 fail a second time while it's still
  processing the first failure of node 1
- groupd can't handle this and gets confused

The logs are in bug 251966 comments 20 and 22.
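
To make the sequence above concrete, here is a minimal sketch (hypothetical
types and names, not groupd's actual code) of a recovery set that receives a
second failure of the same node while the recovery for its first failure is
still in progress:

#include <stdio.h>
#include <stdbool.h>

struct recovery_set {
    int nodeid;
    bool in_progress;   /* recovery for the first failure still running */
};

static struct recovery_set rs;

static void node_failed(int nodeid)
{
    if (rs.in_progress && rs.nodeid == nodeid) {
        /* The case groupd has to sort out: the node fails again while
         * the recovery for its previous failure has not finished. */
        printf("node %d failed again; old recovery still in progress\n", nodeid);
        return;
    }
    rs.nodeid = nodeid;
    rs.in_progress = true;
    printf("start recovery for node %d\n", nodeid);
}

static void node_joined(int nodeid)
{
    /* Nothing stops the node from rejoining the fence domain before the
     * recovery for its earlier failure is done. */
    printf("node %d rejoined (recovery for it %s)\n", nodeid,
           (rs.in_progress && rs.nodeid == nodeid) ? "still in progress" : "done");
}

static void recovery_done(int nodeid)
{
    if (rs.nodeid == nodeid)
        rs.in_progress = false;
    printf("recovery for node %d done\n", nodeid);
}

int main(void)
{
    node_failed(1);     /* node 2 sees node 1 fail after its reboot       */
    node_joined(1);     /* node 1 comes back and rejoins the domain       */
    node_failed(1);     /* node 2 fences node 1: a second failure arrives */
    recovery_done(1);   /* only now does the first recovery complete      */
    return 0;
}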


Comment 1 David Teigland 2007-08-27 21:19:11 UTC
Gathered from groupd logs:

01:

m: +1 = 1,2,3,4
n: -2 = 1,3,4
o: -3 = 1,4
p: -4 = 1

02:

h: +2 = 1,2,4
i: +3 = 1,2,3,4
j: -1 = 2,3,4
k: +1 = 1,2,3,4
l: -1 = 2,3,4
m: +1 = 1,2,3,4
n: -2 = 1,3,4

03:

i: +3 = 1,2,3,4
j: -1 = 2,3,4
k: +1 = 1,2,3,4
l: -1 = 2,3,4
m: +1 = 1,2,3,4
n: -2 = 1,3,4
o: -3 = 2,4

04:

+4 = 1,2,3,4
-2 = 1,3,4
+2 = 1,2,3,4
-1 = 2,3,4
-2 = 3,4
-3 = 4
+1 = 1,4
h: +2 = 1,2,4
i: +3 = 1,2,3,4
j: -1 = 2,3,4
k: +1 = 1,2,3,4
l: -1 = 2,3,4
m: +1 = 1,2,3,4
n: -2 = 1,3,4
o: -3 = 1,4
p: -4 = 1


j: -1 = 2,3,4

04
0:default set current event to recovery for 1
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 57 1 3 3 2 4
got client 3 start_done
0:default send started
0:default mark node 4 started
0:default mark node 3 started
(need a started message from 2)

03 
0:default set current event to recovery for 1
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 11 1 3 3 4 2
0:default mark node 4 started
got client 3 start_done
0:default send started
0:default mark node 3 started
(need a started message from 2)

02 
0:default set current event to recovery for 1
0:default mark_node_started: event not starting 12 from 4
0:default mark node 4 started
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 23 1 3 3 2 4
0:default mark node 4 started
0:default mark node 3 started

0:default confchg left 0 joined 1 total 4
0:default process_node_join 1
0:default cpg add node 1 total 4
0:default queue_app_join: current event 1 100030003 FAIL_START_WAIT
0:default make_event_id 100040001 nodeid 1 memb_count 4 type 1
0:default queue join event for nodeid 1

got client 3 start_done
0:default send started

0:default confchg left 1 joined 0 total 3
0:default confchg removed node 1 reason 3
0:default process_node_down 1
0:default cpg del node 1 total 3 - down
0:default purge event JOIN_BEGIN from 1
0:default mark node 2 started
0:default process_current_event 100030003 1 FAIL_ALL_STARTED
0:default action for app: finish default 23
0:default rev 100030003 done, remove group from rs 1
0:default done in recovery set 1



Comment 2 David Teigland 2007-08-30 18:08:32 UTC
This should be the final, precise description of what went wrong.  It's not
that the fail-join-fail sequence isn't handled properly; it is.  The real
problem is that the nodes can process messages and confchgs in different
orders, since they're coming from two different cpgs.

03
[j: -1 = 2,3,4]
0:default set current event to recovery for 1
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 11 1 3 3 4 2
0:default mark node 4 started
got client 3 start_done
0:default send started
0:default mark node 3 started

[k: +1 = 1,2,3,4] -- 1 rejoins while we're still doing recovery for it
0:default confchg left 0 joined 1 total 4
0:default process_node_join 1
0:default cpg add node 1 total 4
0:default queue_app_join: current event 1 100030003 FAIL_START_WAIT
0:default make_event_id 100040001 nodeid 1 memb_count 4 type 1
0:default queue join event for nodeid 1

(corresponds with l below)
cman: node 1 removed
add_recovery_set_cman nodeid 1
old recovery for 1 still in progress

(this is the final started message for the initial recovery j)
0:default mark node 2 started
0:default process_current_event 100030003 1 FAIL_ALL_STARTED
0:default action for app: finish default 11
0:default rev 100030003 done, remove group from rs 1
0:default done in recovery set 1

(this join was queued above, is now made current prior to second fail l)
0:default process_current_event 100040001 1 JOIN_BEGIN
0:default action for app: stop default
0:default app node join: add 1 total 4

groupd confchg total 3 left 1 joined 0
add_recovery_set_cpg nodeid 1 procdown 0
old recovery for 1 still in progress

[l: -1 = 2,3,4] -- 1 fails again
0:default confchg left 1 joined 0 total 3
0:default confchg removed node 1 reason 3
0:default process_node_down 1
0:default cpg del node 1 total 3 - down
0:default make_event_id 100030003 nodeid 1 memb_count 3 type 3
0:default queue recover event for nodeid 1
0:default waiting for 3 more stopped messages before JOIN_ALL_STOPPED 1
0:default rev 1 will abort current ev 1 JOIN_STOP_WAIT
0:default mark node 1 stopped
0:default waiting for 3 more stopped messages before JOIN_ALL_STOPPED 1

The second failure of node 1 occurs after the started message from 2.
The difference here from 02 is that the join has been made current when
the second failure occurs; for 02 the join was still queued when the
second failure occurred, so it was just purged from the queue and was
never current.
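
A rough sketch of that difference (hypothetical names, not groupd's actual
event machinery): the second nodedown either purges a join that is still
queued, or has to abort a join that has already been made current.

#include <stdio.h>

enum ev_type { EV_NONE, EV_JOIN, EV_FAIL };

struct group {
    enum ev_type current;   /* event being processed right now        */
    enum ev_type queued;    /* at most one pending event, for brevity */
    int nodeid;             /* node the events refer to               */
};

static void node_down(struct group *g, int nodeid)
{
    if (g->queued == EV_JOIN && g->nodeid == nodeid) {
        /* 02's path: the join never became current, so it is simply purged. */
        printf("purge queued JOIN for node %d\n", nodeid);
        g->queued = EV_NONE;
    } else if (g->current == EV_JOIN && g->nodeid == nodeid) {
        /* 03's path: the join is already current, so the new recovery
         * event has to abort it. */
        printf("abort current JOIN for node %d, queue recovery\n", nodeid);
        g->current = EV_FAIL;
    } else {
        printf("queue recovery event for node %d\n", nodeid);
        g->queued = EV_FAIL;
        g->nodeid = nodeid;
    }
}

int main(void)
{
    /* 02: still finishing the first recovery, join for node 1 still queued. */
    struct group n02 = { .current = EV_FAIL, .queued = EV_JOIN, .nodeid = 1 };
    /* 03: first recovery finished, the join for node 1 has become current.  */
    struct group n03 = { .current = EV_JOIN, .queued = EV_NONE, .nodeid = 1 };

    node_down(&n02, 1);   /* join still queued    -> purged  */
    node_down(&n03, 1);   /* join already current -> aborted */
    return 0;
}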


02
[j: -1 = 2,3,4]
0:default set current event to recovery for 1
0:default mark_node_started: event not starting 12 from 4
0:default mark node 4 started
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 23 1 3 3 2 4
0:default mark node 4 started
0:default mark node 3 started

[k: +1 = 1,2,3,4] -- 1 rejoins while we're still doing recovery for it
0:default confchg left 0 joined 1 total 4
0:default process_node_join 1
0:default cpg add node 1 total 4
0:default queue_app_join: current event 1 100030003 FAIL_START_WAIT
0:default make_event_id 100040001 nodeid 1 memb_count 4 type 1
0:default queue join event for nodeid 1

(corresponds with l below)
cman: node 1 removed
add_recovery_set_cman nodeid 1
old recovery for 1 still in progress

[l: -1 = 2,3,4] -- 1 fails again
0:default confchg left 1 joined 0 total 3
0:default confchg removed node 1 reason 3
0:default process_node_down 1
0:default cpg del node 1 total 3 - down
0:default purge event JOIN_BEGIN from 1

(this is the final started message for the initial recovery j)
0:default mark node 2 started
0:default process_current_event 100030003 1 FAIL_ALL_STARTED
0:default action for app: finish default 23
0:default rev 100030003 done, remove group from rs 1
0:default done in recovery set 1

The second failure of node 1 occurs *before* the started message from 2.
The problem here seems to be that 03 sees the message and then the confchg,
while 02 sees the confchg and then the message.  Messages go through the
groupd cpg and confchgs go through the per-group cpgs, so there is no agreed
ordering between those messages and confchgs.  We could probably take the
nodedown confchgs from the groupd cpg, so that nodedown confchgs and messages
would have an agreed ordering.  Join and leave confchgs would still not have
an agreed ordering with respect to messages, though -- is that a problem too?
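
A toy illustration of the ordering problem (not groupd code): each cpg
delivers its own events in total order, but events arriving from two
different cpgs can interleave differently on different nodes, which is
exactly the 02/03 difference above.

#include <stdio.h>

/* Two events that really came from two different cpgs: the started message
 * travels on the groupd cpg, the nodedown confchg on the per-group cpg.
 * Each cpg orders its own deliveries, but nothing orders one against the
 * other, so different nodes may see them interleaved differently. */
int main(void)
{
    const char *started  = "started message from 2 (groupd cpg)";
    const char *nodedown = "confchg: node 1 down (per-group cpg)";

    printf("node 03 sees: %s, then %s\n", started, nodedown);  /* what 03 saw */
    printf("node 02 sees: %s, then %s\n", nodedown, started);  /* what 02 saw */

    /* If both events originated from the same cpg, virtual synchrony would
     * force every node to deliver them in the same order -- the idea behind
     * taking the nodedown confchgs from the groupd cpg. */
    return 0;
}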


Comment 3 David Teigland 2007-08-31 18:34:16 UTC
I expect it'll be very difficult to reproduce this bug.  To attempt to
create this, you should delay fencing so that it's likely to complete
at just the time the node is rejoining the fence domain after rebooting.
And then you need a fair bit of luck.

What needs to happen to hit this bug is that a node needs to rejoin the fence
domain at exactly the right time, so that the confchg for it joining is
received at the same time as the final started message associated with its
recovery.  That is, in the logs above, these two things need to happen at
once:

a) 0:default confchg left 0 joined 1 total 4
b) 0:default mark node 2 started

and not only that, you have to be lucky enough for one node to see a,b
and another node to see b,a.

With the fix I'm testing, both (a) and (b) originate from the same cpg,
which guarantees that all nodes see them in the same order.  Given the
obvious timing subtleties in this area, I'd be hesitant to put the fix into
a release without a fair amount of time in testing -- unless someone finds a
much easier way of reproducing the problem, of course.


Comment 4 David Teigland 2007-09-05 20:48:51 UTC
Comment 3 is wrong.  The two things that have to happen in reverse order
are:

a) 0:default confchg left 1 joined 0 total 3
   The failure of node 2.
b) 0:default mark node 2 started
   The last start for the recovery of node 2, for its previous failure.

So, to reproduce, you'd want:
1. nodedown X callback (from killing node X)
2. wait for X to rejoin the fence domain
3. complete the fencing operation for step 1
   (if using manual fencing, this will happen automatically when it rejoins)
4. nodedown X callback (from killing node X again)

In order to have a chance for two nodes to see steps 3 and 4 in reverse
order, you'd need to kill node X (for the second time) some unknown N
seconds before step 4.


Comment 5 David Teigland 2008-09-17 17:10:55 UTC
The fix for this bug has created another complicated kind of failure.

[root@tank-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    496   2008-09-11 14:08:28  tank-01
   2   M    508   2008-09-11 14:08:28  tank-03
   3   M    508   2008-09-11 14:08:28  tank-04
   4   M    512   2008-09-11 14:08:35  morph-01
   5   M    508   2008-09-11 14:08:28  morph-02
   6   M    516   2008-09-11 14:09:56  morph-03
   7   M    508   2008-09-11 14:08:28  morph-04

[root@tank-01 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_STOP_WAIT 1 100060001 1
[1 2 3 5 6 7]
dlm              1     clvmd    00020001 none
[1 2 3 4 5 6 7]

[root@tank-03 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_STOP_WAIT 1 100050001 1
[1 2 3 5 7]
dlm              1     clvmd    00020001 none
[1 2 3 4 5 6 7]

[root@tank-04 ~]# group_tool -v
type             level name        id       state node id local_done
fence            0     default     00010001 JOIN_STOP_WAIT 1 100060001 1
[1 2 3 5 7]
dlm              1     clvmd       00020001 none
[1 2 3 4 5 6 7]
dlm              1     tankmorph0  00040001 none
[3 5 7]
dlm              1     tankmorph1  00060001 none
[3 5 7]
dlm              1     tankmorph2  00080001 none
[3 5 7]
gfs              2     tankmorph0  00030001 none
[3 5 7]
gfs              2     tankmorph1  00050001 none
[3 5 7]
gfs              2     tankmorph2  00070001 none
[3 5 7]

[root@morph-01 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 JOIN_STOP_WAIT 4 400060001 1
[1 2 3 4 5 7]
dlm              1     clvmd    00020001 none
[1 2 3 4 5 6 7]

[root@morph-02 ~]# group_tool -v
type             level name        id       state node id local_done
fence            0     default     00010001 JOIN_STOP_WAIT 1 100060001 1
[1 2 3 5 7]
dlm              1     clvmd       00020001 none
[1 2 3 4 5 6 7]
dlm              1     tankmorph0  00040001 none
[3 5 7]
dlm              1     tankmorph1  00060001 none
[3 5 7]
dlm              1     tankmorph2  00080001 none
[3 5 7]
gfs              2     tankmorph0  00030001 none
[3 5 7]
gfs              2     tankmorph1  00050001 none
[3 5 7]
gfs              2     tankmorph2  00070001 none
[3 5 7]

[root@morph-03 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 JOIN_STOP_WAIT 6 600070001 1
[1 2 3 4 5 6 7]  
dlm              1     clvmd    00020001 none
[1 2 3 4 5 6 7]

[root@morph-04 ~]# group_tool -v
type             level name        id       state node id local_done
fence            0     default     00010001 JOIN_STOP_WAIT 1 100060001 1
[1 2 3 5 7]
dlm              1     clvmd       00020001 none
[1 2 3 4 5 6 7]
dlm              1     tankmorph0  00040001 none
[3 5 7]
dlm              1     tankmorph1  00060001 none
[3 5 7]
dlm              1     tankmorph2  00080001 none
[3 5 7]
gfs              2     tankmorph0  00030001 none
[3 5 7]
gfs              2     tankmorph1  00050001 none
[3 5 7]
gfs              2     tankmorph2  00070001 none
[3 5 7]


tank-01 (1)
1221160115 0:default confchg left 0 joined 1 total 6 +1
1221160115 0:default confchg left 1 joined 0 total 5 -6
1221160116 0:default confchg left 0 joined 1 total 6 +4
1221160197 0:default confchg left 0 joined 1 total 7 +6

tank-03 (2)
1221160104 0:default confchg left 0 joined 1 total 4 +2
1221160108 0:default confchg left 0 joined 1 total 5 +6
1221160115 0:default confchg left 0 joined 1 total 6 +1
1221160115 0:default confchg left 1 joined 0 total 5 -6
1221160116 0:default confchg left 0 joined 1 total 6 +4
1221160197 0:default confchg left 0 joined 1 total 7 +6

tank-04 (3)
1221160043 0:default confchg removed node 4 reason 3
1221160043 0:default confchg removed node 6 reason 3
1221160043 0:default confchg removed node 1 reason 3
1221160043 0:default confchg removed node 2 reason 3
1221160103 0:default confchg left 0 joined 1 total 4 +2
1221160108 0:default confchg left 0 joined 1 total 5 +6
1221160114 0:default confchg left 0 joined 1 total 6 +1
1221160114 0:default confchg left 1 joined 0 total 5 -6
1221160115 0:default confchg left 0 joined 1 total 6 +4
1221160196 0:default confchg left 0 joined 1 total 7 +6

morph-01 (4)
1221160116 0:default confchg left 0 joined 1 total 6 +4
1221160197 0:default confchg left 0 joined 1 total 7 +6

morph-02 (5)
1221160044 0:default confchg removed node 4 reason 3
1221160044 0:default confchg removed node 6 reason 3
1221160044 0:default confchg removed node 1 reason 3
1221160044 0:default confchg removed node 2 reason 3
1221160104 0:default confchg left 0 joined 1 total 4 +2
1221160109 0:default confchg left 0 joined 1 total 5 +6
1221160115 0:default confchg left 0 joined 1 total 6 +1
1221160115 0:default confchg left 1 joined 0 total 5 -6
1221160116 0:default confchg left 0 joined 1 total 6 +4
1221160197 0:default confchg left 0 joined 1 total 7 +6

morph-03 (6)
1221160197 0:default confchg left 0 joined 1 total 7 +6

morph-04 (7)
1221160044 0:default confchg removed node 4 reason 3
1221160044 0:default confchg removed node 6 reason 3
1221160044 0:default confchg removed node 1 reason 3
1221160044 0:default confchg removed node 2 reason 3
1221160104 0:default confchg left 0 joined 1 total 4 +2
1221160109 0:default confchg left 0 joined 1 total 5 +6
1221160115 0:default confchg left 0 joined 1 total 6 +1
1221160115 0:default confchg left 1 joined 0 total 5 -6
1221160116 0:default confchg left 0 joined 1 total 6 +4
1221160197 0:default confchg left 0 joined 1 total 7 +6


nodeid 1 (tank-01) is joining the domain when nodeid 6 (morph-03) fails.
(Note that the failure of 6 shortly after it has joined was not intended by
the revolver test.  It's a perfectly valid sequence of events that we should
be able to handle; it's just not what revolver was trying to do.)


tank-01 (1)
1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1

1221160115 0:default process_current_event 100060001 1 JOIN_BEGIN  (need 5)

1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3

1221160123 0:default mark node 5 stopped (need 4)
1221160123 0:default mark node 7 stopped (need 3)
1221160123 0:default ignore msg from 2 id 100050001 stopped
1221160123 0:default mark node 3 stopped (need 2)

tank-03 (2)
1221160115 0:default process_node_down 6
1221160115 0:default cpg del node 6 total 4 - down

1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100050001 nodeid 1 memb_count 5 type 1

1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3

1221160123 0:default process_current_event 100050001 1 JOIN_BEGIN

tank-04 (3)
1221160114 0:default confchg left 0 joined 1 total 6
1221160114 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1

1221160114 0:default process_node_down 6
1221160114 0:default cpg del node 6 total 5 - down

1221160114 0:default confchg left 1 joined 0 total 5
1221160114 0:default confchg removed node 6 reason 3

1221160122 0:default process_current_event 100060001 1 JOIN_BEGIN
1221160122 0:default app node join: add 1 total 5 (need 4)
1221160122 0:default mark node 5 stopped (need 3)
1221160122 0:default mark node 7 stopped (need 2)
1221160122 0:default ignore msg from 2 id 100050001 stopped
1221160122 0:default mark node 3 stopped (need 1)

morph-02 (5)
1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1

1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3

1221160115 0:default process_node_down 6
1221160115 0:default cpg del node 6 total 5 - down

1221160123 0:default process_current_event 100060001 1 JOIN_BEGIN
1221160123 0:default app node join: add 1 total 5 (need 4)
1221160123 0:default mark node 5 stopped (need 3)
1221160123 0:default mark node 7 stopped (need 2)
1221160123 0:default ignore msg from 2 id 100050001 stopped
1221160123 0:default mark node 3 stopped (need 1)

morph-04 (7)
1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1

1221160115 0:default process_node_down 6
1221160115 0:default cpg del node 6 total 5 - down

1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3

1221160123 0:default process_current_event 100060001 1 JOIN_BEGIN
1221160123 0:default app node join: add 1 total 5 (need 4)
1221160123 0:default mark node 5 stopped (need 3)
1221160123 0:default mark node 7 stopped (need 2)
1221160123 0:default ignore msg from 2 id 100050001 stopped
1221160123 0:default mark node 3 stopped (need 1)


It appears that the problems all go back to the fact that the failure of
node 6 comes at different places in the processing sequence on different
nodes.  If all nodes were to process the failure of node 6 in response
to the actual per-group cpg confchg instead of the groupd confchg (which is
what the fix for this bz changed), they would all follow the same sequence.
I don't know which kind of failure is more likely: the one fixed by this bz,
or the one created by the fix.  I also don't know whether we could fix the
original problem in another way that avoids creating this new problem.
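
For comparison, a sketch of the mirror-image problem created by the fix
(hypothetical, not groupd code): the nodedown for node 6 now comes via the
groupd cpg, so it has no agreed ordering with the per-group confchg for
node 1's join.

#include <stdio.h>

int main(void)
{
    const char *join1 = "per-group confchg: node 1 joins the domain";
    const char *down6 = "groupd confchg: node 6 down";

    /* tank-03 processed the nodedown first and tank-01 the join first, so
     * they built different event ids for the same join (100050001 with
     * memb_count 5 vs 100060001 with memb_count 6) and the domain sat in
     * JOIN_STOP_WAIT ignoring each other's stopped messages. */
    printf("tank-01 sees: %s, then %s\n", join1, down6);
    printf("tank-03 sees: %s, then %s\n", down6, join1);

    /* Processing the failure from the per-group cpg confchg instead would
     * order it consistently against that group's own join confchgs, at the
     * cost of reintroducing the ordering problem this bz fixed. */
    return 0;
}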

Comment 7 Lon Hohberger 2010-01-04 20:45:41 UTC
This was fixed in cman 2.0.76 and later.

cman-2.0.84 was included in RHEL 5.2.