Bug 692620
| Summary: | cpg messages fail after overloaded node fails |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | corosync |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | high |
| Version: | 6.0 |
| Target Milestone: | rc |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | corosync-1.4.0-1.el6 |
| Doc Type: | Rebase: Bug Fixes and Enhancements |
| Reporter: | David Teigland <teigland> |
| Assignee: | Angus Salkeld <asalkeld> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | cluster-maint, djansa, jfriesse, jkortus, sdake |
| Keywords: | Rebase |
| Last Closed: | 2011-12-06 11:50:19 UTC |

Doc Text:
Cause: The "left_nodes" field in the downlist message was not being used.
Consequence: If one node is paused it can miss a config change and thus report a larger old_members than expected.
Fix: Use the left_nodes field.
Result: Correct membership events.
Created attachment 489196 [details]
blackbox from living nodes
Dave, if you could add more details of the tests you ran (or pointers to them), it would probably help Honza come up with a reproducer. Thanks.

Dave gave more details on IRC. He ran a kernel patch which blocked the kernel for ~60 seconds. The node was fenced by node 1; when it came back, all nodes were blocked. Dave indicated the kernel blocking may have nothing to do with the problem, and that the problem is instead related to corosync stopping and starting within 60 seconds. He suggested trying SIGSTOP on the affected node.

The failed node was actually unmounting, trying to unlock hundreds of thousands of locks, and spewing excessive printk output to the console. I don't care about the failed node, though; the problem is that the other nodes don't recover after it's finally gone. The only way to approach this, I think, is to try to simulate what happened to corosync on the rogue node, i.e. "pause detected for 57666 ms".

Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

I strongly believe that this is fixed by the following upstream commits:

feff0e8542463773207a3b2c1f6004afba1f58d5
6bbbfcb6b4af72cf35ab9fdb4412fa6c6bdacc12
906d322d2aa0cee0c9f5b7bea49f4ee0550e9e07

Steve, any objections to including them in 6.2? Also, because all three patches are by Angus, reassigning to him; he should be able to say much more easily whether the commits can fix the problem.
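The SIGSTOP suggestion above can be sketched as follows. This is a minimal illustration, not an official reproducer: the `simulate_pause` helper is hypothetical, and on a real cluster the PID passed in would be the victim node's corosync PID.

```python
import os
import signal
import subprocess
import time

def simulate_pause(pid: int, seconds: float) -> None:
    """Stop a process so the scheduler no longer runs it, then resume it.
    This approximates the 'pause detected for 57666 ms' condition the
    overloaded node hit."""
    os.kill(pid, signal.SIGSTOP)      # process is frozen, as if never scheduled
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)  # process resumes and sees the time jump

if __name__ == "__main__":
    # Demonstrate on a dummy child; on a real node this would be
    # something like simulate_pause(corosync_pid, 60).
    child = subprocess.Popen(["sleep", "30"])
    simulate_pause(child.pid, 0.5)
    child.terminate()
    child.wait()
```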
These commits are already in the corosync build corosync-1.2.3-10.el6:

feff0e8542463773207a3b2c1f6004afba1f58d5 (revision 2801)
6bbbfcb6b4af72cf35ab9fdb4412fa6c6bdacc12 (revision 2785)
906d322d2aa0cee0c9f5b7bea49f4ee0550e9e07 (revision 2814)

Reassigning for root cause analysis.

Changing priority based upon feedback from Honza, which he will add shortly.

Created attachment 496480 [details]
Reproducer
Reproducer for the problem. How to use:
1) works on 3 nodes, with nodeids 1, 2 and 3
2) patch applied on node 3
3) all 3 nodes run corosync and testcpg
4) on node 3, touch /tmp/pause
5) wait 20 seconds
6) on nodes 1 and 2, testcpg should quit
7) if not, go to 4)
Node 1 testcpg output:
ConfchgCallback: group 'GROUP'
joined node/pid 3/21617 reason: 1
nodes in group now 3
node/pid 1/29172
node/pid 2/5902
node/pid 3/21617
---- PAUSE ON NODE 3
ConfchgCallback: group 'GROUP'
left node/pid 3/21617 reason: 3
nodes in group now 2
node/pid 1/29172
node/pid 2/5902
--- NODE 3 lives again
ConfchgCallback: group 'GROUP'
joined node/pid 3/21617 reason: 4
nodes in group now 3
node/pid 1/29172
node/pid 2/5902
node/pid 3/21617
ConfchgCallback: group 'GROUP'
left node/pid 1/29172 reason: 3
nodes in group now 2
node/pid 2/5902
node/pid 3/21617
We have left the building
Finalize result is 6 (should be 1)
Because I was able to reproduce the problem (both on corosync 1.2.3 and today's git master), changing priority again. I'm also pretty sure it is a problem in the cpg membership algorithm. The new one was designed/developed by Angus, so reassigning to him; he should be able to solve the bug much faster than me.

One extra thing: on the victim node (3), output from testcpg looks like:

ConfchgCallback: group 'GROUP'
joined node/pid 3/21617 reason: 1
nodes in group now 3
node/pid 1/29172
node/pid 2/5902
node/pid 3/21617
ConfchgCallback: group 'GROUP'
left node/pid 1/29172 reason: 3
nodes in group now 2
node/pid 2/5902
node/pid 3/21617
ConfchgCallback: group 'GROUP'
left node/pid 2/5902 reason: 3
nodes in group now 1
node/pid 3/21617
ConfchgCallback: group 'GROUP'
joined node/pid 2/5902 reason: 4
nodes in group now 2
node/pid 2/5902
node/pid 3/21617

So it can be seen that the exit of node 2's testcpg was never reported.

Some analysis...

May 4 10:25:49 r6-2 corosync[1995]: [TST2 ] syncv2.c:196 Member left: r(0) ip(192.168.100.93) <node leaves>
May 4 10:25:49 r6-2 corosync[1995]: [TOTEM ] totemsrp.c:1826 A processor joined or left the membership and a new membership was formed.
May 4 10:25:49 r6-2 corosync[1995]: [CPG ] cpg.c:1207 sending my downlist
May 4 10:25:49 r6-2 corosync[1995]: [CPG ] cpg.c:1035 received downlist left_list: 1 received in state 1
May 4 10:25:49 r6-2 corosync[1995]: [CPG ] cpg.c:1035 received downlist left_list: 1 received in state 1
May 4 10:25:49 r6-2 corosync[1995]: [CPG ] cpg.c:760 chosen downlist from node r(0) ip(192.168.100.91) <they correctly choose node 1 as master>
May 4 10:25:49 r6-2 corosync[1995]: [TST2 ] syncv2.c:238 tst_sv2_sync_init_v2 r(0) ip(192.168.100.91)
May 4 10:25:49 r6-2 corosync[1995]: [TST2 ] syncv2.c:267 sync: node left r(0) ip(192.168.100.93)
May 4 10:25:49 r6-2 corosync[1995]: [TST2 ] syncv2.c:297 sync: activate correctly 4
May 4 10:25:49 r6-2 corosync[1995]: [MAIN ] main.c:326 Completed service synchronization, ready to provide service.
May 4 10:26:07 r6-2 corosync[1995]: [TST2 ] syncv2.c:200 Member joined: r(0) ip(192.168.100.93) <node 3 makes a comeback>
May 4 10:26:07 r6-2 corosync[1995]: [TOTEM ] totemsrp.c:1826 A processor joined or left the membership and a new membership was formed.
May 4 10:26:07 r6-2 corosync[1995]: [CPG ] cpg.c:1207 sending my downlist
May 4 10:26:07 r6-2 corosync[1995]: [CPG ] cpg.c:1035 received downlist left_list: 0 received in state 1
May 4 10:26:07 r6-2 corosync[1995]: [CPG ] cpg.c:1035 received downlist left_list: 2 received in state 1
May 4 10:26:07 r6-2 corosync[1995]: [CPG ] cpg.c:1035 received downlist left_list: 0 received in state 1
May 4 10:26:07 r6-2 corosync[1995]: [CPG ] cpg.c:760 chosen downlist from node r(0) ip(192.168.100.93) <now they choose node 3>

.. problem: node 1 and node 2's cpg groups will be unusable (as David said)

May 4 10:26:07 r6-2 corosync[1995]: [TST2 ] syncv2.c:238 tst_sv2_sync_init_v2 r(0) ip(192.168.100.91)
May 4 10:26:07 r6-2 corosync[1995]: [TST2 ] syncv2.c:252 sync: node joined r(0) ip(192.168.100.93)
May 4 10:26:07 r6-2 corosync[1995]: [TST2 ] syncv2.c:297 sync: activate correctly 4
May 4 10:26:07 r6-2 corosync[1995]: [MAIN ] main.c:326 Completed service synchronization, ready to provide service.

The way we choose a master at the moment is:
1) greatest node count
2) then lowest nodeid

Rule "1" is for when we have a merge: we choose the segment with the most nodes. I think this logic is OK, but the "broken" node should set its node list to itself (instead of the full node list), if this is possible. Also it would be nice if I could get at the totem info saying that node 3 had a pause timeout (something like totem_fault_since_last_config). Then we could change the above to:

1) greatest _healthy_ node count
2) then lowest _healthy_ nodeid
3) then lowest nodeid (in any state)

Any other clever ideas welcome :)

I noticed that we have "left_nodes" in the downlist message that we are not using! Cool.
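To make the current rule concrete, here is a small sketch of the selection (illustrative Python, not the actual cpg.c code; the dict fields are assumptions based on the `members(old:N left:N)` log output):

```python
def choose_downlist_current(candidates):
    """Current master selection: greatest old-member count wins,
    lowest nodeid breaks ties. The left_nodes field carried in the
    downlist message is ignored, which is the bug."""
    return min(candidates, key=lambda c: (-c["old_members"], c["nodeid"]))

# Downlists as reported after node 3 returns: the paused node 3 missed
# the intermediate config change, so it still claims 3 old members.
downlists = [
    {"nodeid": 3, "old_members": 3, "left_nodes": 2},
    {"nodeid": 1, "old_members": 2, "left_nodes": 0},
    {"nodeid": 2, "old_members": 2, "left_nodes": 0},
]
print(choose_downlist_current(downlists)["nodeid"])  # prints 3: the stale node wins
```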
There are 2 scenarios that can happen.

Case 1) this is the bug

Here node 3 "misses" a config change: nodes 1 & 2 get an intermediate config change (node 3 left).

May 04 12:31:23 corosync [CPG ] cpg.c:721 comparing: sender r(0) ip(192.168.100.93) members(old:3 left:2)
May 04 12:31:23 corosync [CPG ] cpg.c:721 comparing: sender r(0) ip(192.168.100.92) members(old:2 left:0)
May 04 12:31:23 corosync [CPG ] cpg.c:721 comparing: sender r(0) ip(192.168.100.91) members(old:2 left:0)

Node 3 is selected (currently "left_nodes" is ignored).

Case 2) no bug

Here node 3 gets an intermediate config change as well.

May 04 12:33:10 corosync [CPG ] cpg.c:721 comparing: sender r(0) ip(192.168.100.93) members(old:3 left:2)
May 04 12:33:10 corosync [CPG ] cpg.c:721 comparing: sender r(0) ip(192.168.100.91) members(old:2 left:0)
May 04 12:33:10 corosync [CPG ] cpg.c:721 comparing: sender r(0) ip(192.168.100.93) members(old:1 left:0)
May 04 12:33:10 corosync [CPG ] cpg.c:721 comparing: sender r(0) ip(192.168.100.92) members(old:2 left:0)

Here node 1 is correctly selected.

When I change the master calculation to include left_nodes, this fixes the bug. New master calculation:

1) node with largest (old_members - left_nodes)
2) then lowest nodeid

I'll post a patch soon. I also need to hunt down the previous related bugs and make sure I don't break something else. Testing needed... I know I have some of these tests as CTS test cases, so I'll run those first. Whilst Honza's test case was super helpful, it would be good to see if I can get a reproducer without a "test patch", so I can add it to CTS.

Created attachment 496943 [details]
cpg: fix sync master selection when one node paused.
If one node is paused it can miss a config change and
thus report a larger old_members than expected.
The solution is to use the left_nodes field.
Master selection used to be "choose node with":
1) largest previous membership
2) (then as a tie-breaker) node with smallest nodeid
New selection:
1) largest (previous #nodes - #nodes known to have left)
2) (then as a tie-breaker) node with smallest nodeid
Signed-off-by: Angus Salkeld <asalkeld>
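The new selection rule from the patch description can be sketched the same way (illustrative Python under the same assumed fields, not the actual patch):

```python
def choose_downlist_fixed(candidates):
    """Fixed master selection: largest (old_members - left_nodes) wins,
    i.e. the largest previous membership excluding nodes known to have
    left; lowest nodeid breaks ties."""
    return min(candidates,
               key=lambda c: (c["left_nodes"] - c["old_members"], c["nodeid"]))

# Same downlists as Case 1: the paused node 3 reports old:3 left:2,
# so its effective count is 1 and it can no longer win the election.
downlists = [
    {"nodeid": 3, "old_members": 3, "left_nodes": 2},  # 3 - 2 = 1
    {"nodeid": 1, "old_members": 2, "left_nodes": 0},  # 2 - 0 = 2
    {"nodeid": 2, "old_members": 2, "left_nodes": 0},  # 2 - 0 = 2
]
print(choose_downlist_fixed(downlists)["nodeid"])  # prints 1: healthy node, lowest id
```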
I have tested this bug and all the CpgConfigChange test cases in CTS. All look good. I have posted the patch upstream, so I am moving this to POST.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: The "left_nodes" field in the downlist message was not being used.
Consequence: If one node is paused it can miss a config change and thus report a larger old_members than expected.
Fix: Use the left_nodes field.
Result: Correct membership events.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html
Created attachment 489188 [details]
parts of /var/log/messages from each node

Description of problem:
I was running some high-load gfs tests that caused corosync on one node to not be scheduled. The other nodes saw it leave, then come back a short while later when it finally got time to run. I then forcibly killed that node, so the others saw it leave again. That all seems to be fine. The problem is that the cpg's on the remaining nodes never recovered properly, and daemons were not able to send messages.

Version-Release number of selected component (if applicable):
corosync-1.2.3-21.el6.x86_64

How reproducible:
Once; this happened incidentally as part of some one-off stress testing.