Bug 258121
| Summary: | groupd should sort out a quick fail-join-fail of one node | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | David Teigland <teigland> |
| Component: | cman | Assignee: | Chris Feist <cfeist> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.0 | CC: | cluster-maint, nstraz |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-01-04 20:45:41 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 469212 | | |
Description
David Teigland
2007-08-27 21:08:49 UTC
Gathered from groupd logs:

01:                  m: +1 = 1,2,3,4  n: -2 = 1,3,4  o: -3 = 1,4  p: -4 = 1
02: h: +2 = 1,2,4  i: +3 = 1,2,3,4  j: -1 = 2,3,4  k: +1 = 1,2,3,4  l: -1 = 2,3,4  m: +1 = 1,2,3,4  n: -2 = 1,3,4
03: i: +3 = 1,2,3,4  j: -1 = 2,3,4  k: +1 = 1,2,3,4  l: -1 = 2,3,4  m: +1 = 1,2,3,4  n: -2 = 1,3,4  o: -3 = 2,4
04: +4 = 1,2,3,4  -2 = 1,3,4  +2 = 1,2,3,4  -1 = 2,3,4  -2 = 3,4  -3 = 4  +1 = 1,4  h: +2 = 1,2,4  i: +3 = 1,2,3,4  j: -1 = 2,3,4  k: +1 = 1,2,3,4  l: -1 = 2,3,4  m: +1 = 1,2,3,4  n: -2 = 1,3,4  o: -3 = 1,4  p: -4 = 1

j: -1 = 2,3,4

04
0:default set current event to recovery for 1
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 57 1 3 3 2 4
got client 3 start_done
0:default send started
0:default mark node 4 started
0:default mark node 3 started
(need a started message from 2)

03
0:default set current event to recovery for 1
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 11 1 3 3 4 2
0:default mark node 4 started
got client 3 start_done
0:default send started
0:default mark node 3 started
(need a started message from 2)

02
0:default set current event to recovery for 1
0:default mark_node_started: event not starting 12 from 4
0:default mark node 4 started
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 23 1 3 3 2 4
0:default mark node 4 started
0:default mark node 3 started
0:default confchg left 0 joined 1 total 4
0:default process_node_join 1
0:default cpg add node 1 total 4
0:default queue_app_join: current event 1 100030003 FAIL_START_WAIT
0:default make_event_id 100040001 nodeid 1 memb_count 4 type 1
0:default queue join event for nodeid 1
got client 3 start_done
0:default send started
0:default confchg left 1 joined 0 total 3
0:default confchg removed node 1 reason 3
0:default process_node_down 1
0:default cpg del node 1 total 3 - down
0:default purge event JOIN_BEGIN from 1
0:default mark node 2 started
0:default process_current_event 100030003 1 FAIL_ALL_STARTED
0:default action for app: finish default 23
0:default rev 100030003 done, remove group from rs 1
0:default done in recovery set 1

This should be the final, precise description of what went wrong. It's not that the fail-join-fail sequence isn't handled properly; it is. The real problem is that the nodes can process messages and confchgs in different orders, since they arrive from two different cpgs.
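The root cause can be illustrated with a small simulation. This is a hypothetical sketch in plain Python, not groupd code: each cpg delivers its own events in order, but nothing constrains how two independent cpg streams interleave at a given node, so two nodes can legally observe the same pair of events in opposite orders.

```python
def interleavings(a, b):
    """All merges of lists a and b that preserve each list's internal order."""
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    first_a = [[a[0]] + rest for rest in interleavings(a[1:], b)]
    first_b = [[b[0]] + rest for rest in interleavings(a, b[1:])]
    return first_a + first_b

# Event labels taken from the logs above; the stream split is the point:
messages = ["mark node 2 started"]              # delivered via the groupd cpg
confchgs = ["confchg left 1 joined 0 total 3"]  # delivered via a per-group cpg

orders = interleavings(messages, confchgs)
# Both interleavings are legal, so one node can see message-then-confchg
# while another sees confchg-then-message.
print(orders)
```

With one event per stream there are exactly two legal observation orders, which is precisely the divergence seen between nodes 02 and 03 below.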
03 [j: -1 = 2,3,4]
0:default set current event to recovery for 1
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 11 1 3 3 4 2
0:default mark node 4 started
got client 3 start_done
0:default send started
0:default mark node 3 started

[k: +1 = 1,2,3,4] -- 1 rejoins while we're still doing recovery for it
0:default confchg left 0 joined 1 total 4
0:default process_node_join 1
0:default cpg add node 1 total 4
0:default queue_app_join: current event 1 100030003 FAIL_START_WAIT
0:default make_event_id 100040001 nodeid 1 memb_count 4 type 1
0:default queue join event for nodeid 1

(corresponds with l below)
cman: node 1 removed
add_recovery_set_cman nodeid 1
old recovery for 1 still in progress

(this is the final started message for the initial recovery j)
0:default mark node 2 started
0:default process_current_event 100030003 1 FAIL_ALL_STARTED
0:default action for app: finish default 11
0:default rev 100030003 done, remove group from rs 1
0:default done in recovery set 1

(this join was queued above, is now made current prior to second fail l)
0:default process_current_event 100040001 1 JOIN_BEGIN
0:default action for app: stop default
0:default app node join: add 1 total 4
groupd confchg total 3 left 1 joined 0
add_recovery_set_cpg nodeid 1 procdown 0
old recovery for 1 still in progress

[l: -1 = 2,3,4] -- 1 fails again
0:default confchg left 1 joined 0 total 3
0:default confchg removed node 1 reason 3
0:default process_node_down 1
0:default cpg del node 1 total 3 - down
0:default make_event_id 100030003 nodeid 1 memb_count 3 type 3
0:default queue recover event for nodeid 1
0:default waiting for 3 more stopped messages before JOIN_ALL_STOPPED 1
0:default rev 1 will abort current ev 1 JOIN_STOP_WAIT
0:default mark node 1 stopped
0:default waiting for 3 more stopped messages before JOIN_ALL_STOPPED 1

The second failure of node 1 occurs after the started message from 2. The difference here from 02 is that the join had been made current when the second failure occurred; on 02 the join was still queued when the second failure occurred, so it was simply purged from the queue and never became current.

02 [j: -1 = 2,3,4]
0:default set current event to recovery for 1
0:default mark_node_started: event not starting 12 from 4
0:default mark node 4 started
0:default waiting for 3 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 4 stopped
0:default waiting for 2 more stopped messages before FAIL_ALL_STOPPED 1
0:default mark node 3 stopped
0:default waiting for 1 more stopped messages before FAIL_ALL_STOPPED 1
0:default send stopped
0:default mark node 2 stopped
0:default process_current_event 100030003 1 FAIL_ALL_STOPPED
0:default action for app: start default 23 1 3 3 2 4
0:default mark node 4 started
0:default mark node 3 started

[k: +1 = 1,2,3,4] -- 1 rejoins while we're still doing recovery for it
0:default confchg left 0 joined 1 total 4
0:default process_node_join 1
0:default cpg add node 1 total 4
0:default queue_app_join: current event 1 100030003 FAIL_START_WAIT
0:default make_event_id 100040001 nodeid 1 memb_count 4 type 1
0:default queue join event for nodeid 1

(corresponds with l below)
cman: node 1 removed
add_recovery_set_cman nodeid 1
old recovery for 1 still in progress

[l: -1 = 2,3,4] -- 1 fails again
0:default confchg left 1 joined 0 total 3
0:default confchg removed node 1 reason 3
0:default process_node_down 1
0:default cpg del node 1 total 3 - down
0:default purge event JOIN_BEGIN from 1

(this is the final started message for the initial recovery j)
0:default mark node 2 started
0:default process_current_event 100030003 1 FAIL_ALL_STARTED
0:default action for app: finish default 23
0:default rev 100030003 done, remove group from rs 1
0:default done in recovery set 1

The second failure of node 1 occurs *before* the started message from 2. The problem seems to be that 03 sees the message and then the confchg, while 02 sees the confchg and then the message. Messages go through the groupd cpg and confchgs go through the per-group cpgs, so there is no agreed ordering between messages and confchgs. We could probably take the nodedown confchgs from the groupd cpg, so nodedown confchgs and messages would have an agreed ordering. Join and leave confchgs would still not have an agreed ordering with respect to messages, though -- is that a problem too?

I expect it'll be very difficult to reproduce this bug. To attempt to create it, you should delay fencing so that it's likely to complete at just the time the node is rejoining the fence domain after rebooting. And then you need a fair bit of luck. What needs to happen to hit this bug is that a node rejoins the fence domain at exactly the right time, so that the confchg for its join is received at the same time as the final started message associated with its recovery. I.e. in the logs above, these two things need to happen at once:

a) 0:default confchg left 0 joined 1 total 4
b) 0:default mark node 2 started

and not only that, you have to be lucky enough for one node to see a,b and another node to see b,a. With the fix I'm testing, both (a) and (b) originate from the same cpg, which guarantees all nodes see them in the same order. Given the obvious timing subtleties in this area, I'd be hesitant to put the fix into a release without a fair amount of time in testing. Unless someone finds a much easier way of reproducing the problem, of course.

Comment 3 is wrong. The two things that have to happen in reverse order are:

a) 0:default confchg left 1 joined 0 total 3
   The second failure of node 1.

b) 0:default mark node 2 started
   The last started message, from node 2, for the recovery of node 1's previous failure.
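The property the fix relies on can be sketched with a hypothetical model (plain Python, not groupd code): once both the started message and the nodedown confchg travel through the same cpg, that cpg's single agreed delivery order means every node observes the pair identically, so no node can see (a) before (b) while another sees (b) before (a).

```python
def deliver_from_single_cpg(agreed_order, nodes):
    """Model of one cpg's total order: every node receives the single
    agreed sequence unchanged."""
    return {node: list(agreed_order) for node in nodes}

# Event labels taken from the logs above:
agreed = ["mark node 2 started",              # (b) last start of the recovery
          "confchg left 1 joined 0 total 3"]  # (a) second failure of node 1
views = deliver_from_single_cpg(agreed, [2, 3, 4])

# All surviving nodes observe the identical sequence, so the fail-join-fail
# is processed consistently everywhere.
assert all(view == agreed for view in views.values())
```

This is only the ordering guarantee itself; as the later comments show, moving which cpg an event is taken from changes where it lands relative to other processing, which is how the fix introduced a different failure.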
So, to reproduce, you'd want:

1. nodedown X callback (from killing node X)
2. wait for X to rejoin the fence domain
3. complete the fencing operation for step 1 (if using manual fencing, this will happen automatically when it rejoins)
4. nodedown X callback (from killing node X again)

In order to have a chance for two nodes to see steps 3 and 4 in reverse order, you'd need to kill node X (for the second time) some unknown N seconds before step 4.

The fix for this bug has created another complicated kind of failure.

[root@tank-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined                Name
   1   M    496   2008-09-11 14:08:28   tank-01
   2   M    508   2008-09-11 14:08:28   tank-03
   3   M    508   2008-09-11 14:08:28   tank-04
   4   M    512   2008-09-11 14:08:35   morph-01
   5   M    508   2008-09-11 14:08:28   morph-02
   6   M    516   2008-09-11 14:09:56   morph-03
   7   M    508   2008-09-11 14:08:28   morph-04

[root@tank-01 ~]# group_tool -v
type   level  name        id        state           node id local_done
fence  0      default     00010001  JOIN_STOP_WAIT  1 100060001 1
[1 2 3 5 6 7]
dlm    1      clvmd       00020001  none
[1 2 3 4 5 6 7]

[root@tank-03 ~]# group_tool -v
type   level  name        id        state           node id local_done
fence  0      default     00010001  JOIN_STOP_WAIT  1 100050001 1
[1 2 3 5 7]
dlm    1      clvmd       00020001  none
[1 2 3 4 5 6 7]

[root@tank-04 ~]# group_tool -v
type   level  name        id        state           node id local_done
fence  0      default     00010001  JOIN_STOP_WAIT  1 100060001 1
[1 2 3 5 7]
dlm    1      clvmd       00020001  none
[1 2 3 4 5 6 7]
dlm    1      tankmorph0  00040001  none
[3 5 7]
dlm    1      tankmorph1  00060001  none
[3 5 7]
dlm    1      tankmorph2  00080001  none
[3 5 7]
gfs    2      tankmorph0  00030001  none
[3 5 7]
gfs    2      tankmorph1  00050001  none
[3 5 7]
gfs    2      tankmorph2  00070001  none
[3 5 7]

[root@morph-01 ~]# group_tool -v
type   level  name        id        state           node id local_done
fence  0      default     00000000  JOIN_STOP_WAIT  4 400060001 1
[1 2 3 4 5 7]
dlm    1      clvmd       00020001  none
[1 2 3 4 5 6 7]

[root@morph-02 ~]# group_tool -v
type   level  name        id        state           node id local_done
fence  0      default     00010001  JOIN_STOP_WAIT  1 100060001 1
[1 2 3 5 7]
dlm    1      clvmd       00020001  none
[1 2 3 4 5 6 7]
dlm    1      tankmorph0  00040001  none
[3 5 7]
dlm    1      tankmorph1  00060001  none
[3 5 7]
dlm    1      tankmorph2  00080001  none
[3 5 7]
gfs    2      tankmorph0  00030001  none
[3 5 7]
gfs    2      tankmorph1  00050001  none
[3 5 7]
gfs    2      tankmorph2  00070001  none
[3 5 7]

[root@morph-03 ~]# group_tool -v
type   level  name        id        state           node id local_done
fence  0      default     00000000  JOIN_STOP_WAIT  6 600070001 1
[1 2 3 4 5 6 7]
dlm    1      clvmd       00020001  none
[1 2 3 4 5 6 7]

[root@morph-04 ~]# group_tool -v
type   level  name        id        state           node id local_done
fence  0      default     00010001  JOIN_STOP_WAIT  1 100060001 1
[1 2 3 5 7]
dlm    1      clvmd       00020001  none
[1 2 3 4 5 6 7]
dlm    1      tankmorph0  00040001  none
[3 5 7]
dlm    1      tankmorph1  00060001  none
[3 5 7]
dlm    1      tankmorph2  00080001  none
[3 5 7]
gfs    2      tankmorph0  00030001  none
[3 5 7]
gfs    2      tankmorph1  00050001  none
[3 5 7]
gfs    2      tankmorph2  00070001  none
[3 5 7]

tank-01 (1)
1221160115 0:default confchg left 0 joined 1 total 6   +1
1221160115 0:default confchg left 1 joined 0 total 5   -6
1221160116 0:default confchg left 0 joined 1 total 6   +4
1221160197 0:default confchg left 0 joined 1 total 7   +6

tank-03 (2)
1221160104 0:default confchg left 0 joined 1 total 4   +2
1221160108 0:default confchg left 0 joined 1 total 5   +6
1221160115 0:default confchg left 0 joined 1 total 6   +1
1221160115 0:default confchg left 1 joined 0 total 5   -6
1221160116 0:default confchg left 0 joined 1 total 6   +4
1221160197 0:default confchg left 0 joined 1 total 7   +6

tank-04 (3)
1221160043 0:default confchg removed node 4 reason 3
1221160043 0:default confchg removed node 6 reason 3
1221160043 0:default confchg removed node 1 reason 3
1221160043 0:default confchg removed node 2 reason 3
1221160103 0:default confchg left 0 joined 1 total 4   +2
1221160108 0:default confchg left 0 joined 1 total 5   +6
1221160114 0:default confchg left 0 joined 1 total 6   +1
1221160114 0:default confchg left 1 joined 0 total 5   -6
1221160115 0:default confchg left 0 joined 1 total 6   +4
1221160196 0:default confchg left 0 joined 1 total 7   +6

morph-01 (4)
1221160116 0:default confchg left 0 joined 1 total 6   +4
1221160197 0:default confchg left 0 joined 1 total 7   +6

morph-02 (5)
1221160044 0:default confchg removed node 4 reason 3
1221160044 0:default confchg removed node 6 reason 3
1221160044 0:default confchg removed node 1 reason 3
1221160044 0:default confchg removed node 2 reason 3
1221160104 0:default confchg left 0 joined 1 total 4   +2
1221160109 0:default confchg left 0 joined 1 total 5   +6
1221160115 0:default confchg left 0 joined 1 total 6   +1
1221160115 0:default confchg left 1 joined 0 total 5   -6
1221160116 0:default confchg left 0 joined 1 total 6   +4
1221160197 0:default confchg left 0 joined 1 total 7   +6

morph-03 (6)
1221160197 0:default confchg left 0 joined 1 total 7   +6

morph-04 (7)
1221160044 0:default confchg removed node 4 reason 3
1221160044 0:default confchg removed node 6 reason 3
1221160044 0:default confchg removed node 1 reason 3
1221160044 0:default confchg removed node 2 reason 3
1221160104 0:default confchg left 0 joined 1 total 4   +2
1221160109 0:default confchg left 0 joined 1 total 5   +6
1221160115 0:default confchg left 0 joined 1 total 6   +1
1221160115 0:default confchg left 1 joined 0 total 5   -6
1221160116 0:default confchg left 0 joined 1 total 6   +4
1221160197 0:default confchg left 0 joined 1 total 7   +6

nodeid 1 (tank-01) is joining the domain when nodeid 6 (morph-03) fails. (Note that the failure of 6 shortly after it has joined was not intended by the revolver test. It's a perfectly valid sequence of events that we should be able to handle; it's just not what revolver was trying to do.)
tank-01 (1)
1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1
1221160115 0:default process_current_event 100060001 1 JOIN_BEGIN (need 5)
1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3
1221160123 0:default mark node 5 stopped (need 4)
1221160123 0:default mark node 7 stopped (need 3)
1221160123 0:default ignore msg from 2 id 100050001 stopped
1221160123 0:default mark node 3 stopped (need 2)

tank-03 (2)
1221160115 0:default process_node_down 6
1221160115 0:default cpg del node 6 total 4 - down
1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100050001 nodeid 1 memb_count 5 type 1
1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3
1221160123 0:default process_current_event 100050001 1 JOIN_BEGIN

tank-04 (3)
1221160114 0:default confchg left 0 joined 1 total 6
1221160114 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1
1221160114 0:default process_node_down 6
1221160114 0:default cpg del node 6 total 5 - down
1221160114 0:default confchg left 1 joined 0 total 5
1221160114 0:default confchg removed node 6 reason 3
1221160122 0:default process_current_event 100060001 1 JOIN_BEGIN
1221160122 0:default app node join: add 1 total 5 (need 4)
1221160122 0:default mark node 5 stopped (need 3)
1221160122 0:default mark node 7 stopped (need 2)
1221160122 0:default ignore msg from 2 id 100050001 stopped
1221160122 0:default mark node 3 stopped (need 1)

morph-02 (5)
1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1
1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3
1221160115 0:default process_node_down 6
1221160115 0:default cpg del node 6 total 5 - down
1221160123 0:default process_current_event 100060001 1 JOIN_BEGIN
1221160123 0:default app node join: add 1 total 5 (need 4)
1221160123 0:default mark node 5 stopped (need 3)
1221160123 0:default mark node 7 stopped (need 2)
1221160123 0:default ignore msg from 2 id 100050001 stopped
1221160123 0:default mark node 3 stopped (need 1)

morph-04 (7)
1221160115 0:default confchg left 0 joined 1 total 6
1221160115 0:default make_event_id 100060001 nodeid 1 memb_count 6 type 1
1221160115 0:default process_node_down 6
1221160115 0:default cpg del node 6 total 5 - down
1221160115 0:default confchg left 1 joined 0 total 5
1221160115 0:default confchg removed node 6 reason 3
1221160123 0:default process_current_event 100060001 1 JOIN_BEGIN
1221160123 0:default app node join: add 1 total 5 (need 4)
1221160123 0:default mark node 5 stopped (need 3)
1221160123 0:default mark node 7 stopped (need 2)
1221160123 0:default ignore msg from 2 id 100050001 stopped
1221160123 0:default mark node 3 stopped (need 1)

It appears that the problems all go back to the fact that the failure of node 6 comes at different places in the sequence of processing on each node. If all nodes were to process the failure of node 6 in response to the actual cpg confchg instead of the groupd confchg (which was changed by the fix to this bz), they would all follow the same sequence. I don't know which kind of failure is more likely: the one fixed by this bz, or the one created by the fix. I also don't know whether we could fix the original problem in another way to avoid creating this new problem.

This was fixed in cman 2.0.76 and later. cman-2.0.84 was included in RHEL 5.2.