Bug 692620

Summary: cpg messages fail after overloaded node fails
Product: Red Hat Enterprise Linux 6
Reporter: David Teigland <teigland>
Component: corosync
Assignee: Angus Salkeld <asalkeld>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Docs Contact:
Priority: high
Version: 6.0
CC: cluster-maint, djansa, jfriesse, jkortus, sdake
Target Milestone: rc
Keywords: Rebase
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: corosync-1.4.0-1.el6
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
Cause: The "left_nodes" field in the downlist message was not being used. Consequence: If one node is paused, it can miss a config change and thus report a larger old_members than expected. Fix: Use the left_nodes field. Result: Correct membership events.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 06:50:19 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Attachments:
parts of /var/log/messages from each node
blackbox from living nodes
Reproducer
cpg: fix sync master selection when one node paused.

Description David Teigland 2011-03-31 14:04:50 EDT
Created attachment 489188 [details]
parts of /var/log/messages from each node

Description of problem:

I was running some high-load gfs tests that caused corosync on one node to not be scheduled.  The other nodes saw it leave, then come back a short while later when it finally got time to run.  I then forcibly killed that node, so the others saw it leave again.  That all seems to be fine.  The problem is that the cpg groups on the remaining nodes never recovered properly, and daemons were not able to send messages.

Version-Release number of selected component (if applicable):

corosync-1.2.3-21.el6.x86_64

How reproducible:

Once; this happened incidentally as part of some one-off stress testing.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 2 Steven Dake 2011-03-31 14:43:00 EDT
Created attachment 489196 [details]
blackbox from living nodes
Comment 3 Steven Dake 2011-03-31 14:46:13 EDT
Dave,

If you could add more details about the tests you ran (or pointers to them), it would probably help Honza come up with a reproducer.

Thanks
Comment 4 Steven Dake 2011-03-31 15:00:45 EDT
Dave gave more details on irc.

Dave ran a kernel patch which blocked the kernel for ~60 seconds.  The node was fenced by node 1.  When it came back, all nodes were blocked.  Dave indicated the kernel blocking may have nothing to do with the problem, and that instead the problem is related to corosync stop/start within 60 seconds.  He suggested trying SIGSTOP on the affected node.
Comment 5 David Teigland 2011-03-31 15:08:50 EDT
It was actually unmounting, trying to unlock hundreds of thousands of locks, and spewing excessive printk to the console.  I don't care about the failed node, though; the problem is that the other nodes don't recover after it's finally gone.

The only way to approach this, I think, is to try to simulate what happened to corosync on the rogue node, i.e. "pause detected for 57666 ms".
Comment 6 RHEL Product and Program Management 2011-04-03 21:59:43 EDT
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 7 Jan Friesse 2011-04-04 12:27:25 EDT
I strongly believe that this is fixed by the following upstream commits:
feff0e8542463773207a3b2c1f6004afba1f58d5
6bbbfcb6b4af72cf35ab9fdb4412fa6c6bdacc12
906d322d2aa0cee0c9f5b7bea49f4ee0550e9e07

Steve, any objections to including them in 6.2?
Comment 8 Jan Friesse 2011-04-05 05:22:12 EDT
Also, since all three patches were written by Angus, I'm reassigning to him; he should be able to say much more easily whether these commits should/can fix the problem.
Comment 9 Steven Dake 2011-04-05 15:51:30 EDT
These commits are already in the corosync build corosync-1.2.3-10.el6

feff0e8542463773207a3b2c1f6004afba1f58d5 is already in the build (revision 2801)
6bbbfcb6b4af72cf35ab9fdb4412fa6c6bdacc12 is already in the build (revision 2785)
906d322d2aa0cee0c9f5b7bea49f4ee0550e9e07 is already in the build (revision 2814)

reassigning for root cause analysis
Comment 10 Steven Dake 2011-04-28 11:24:05 EDT
Changing priority based upon feedback from Honza, which he will add shortly.
Comment 11 Jan Friesse 2011-05-03 06:26:10 EDT
Created attachment 496480 [details]
Reproducer

Reproducer for the problem. How to use (a hypothetical sketch of the pause hook follows the steps):
1) works on 3 nodes, with nodeids 1, 2, 3
2) patch applied on node 3
3) all 3 nodes run corosync and testcpg
4) on node 3, touch /tmp/pause
5) wait 20 seconds
6) on node 1 and 2, testcpg should quit
7) if not, goto 4)
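
The attached patch is the authoritative reproducer; purely as an illustration, steps 4 and 5 assume a pause hook of roughly this shape (all names here are made up):

#include <unistd.h>

/*
 * Purely hypothetical sketch of what a "pause" test patch could look
 * like: if /tmp/pause exists, remove it and block the process long
 * enough for the other nodes to see this node leave and come back.
 * The hook point and names are invented; see the attachment for the
 * real patch.
 */
static void maybe_pause_for_test(void)
{
    if (access("/tmp/pause", F_OK) == 0) {
        unlink("/tmp/pause");
        sleep(20);  /* long enough to be declared failed */
    }
}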

Node 1 testcpg output:
ConfchgCallback: group 'GROUP'
joined node/pid 3/21617 reason: 1
nodes in group now 3
node/pid 1/29172
node/pid 2/5902
node/pid 3/21617

---- PAUSE ON NODE 3

ConfchgCallback: group 'GROUP'
left node/pid 3/21617 reason: 3
nodes in group now 2
node/pid 1/29172
node/pid 2/5902

--- NODE 3 lives again

ConfchgCallback: group 'GROUP'
joined node/pid 3/21617 reason: 4
nodes in group now 3
node/pid 1/29172
node/pid 2/5902
node/pid 3/21617

ConfchgCallback: group 'GROUP'
left node/pid 1/29172 reason: 3
nodes in group now 2
node/pid 2/5902
node/pid 3/21617
We have left the building
Finalize  result is 6 (should be 1)
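
For context on the last two lines: in corosync's cs_error_t enumeration, CS_OK is 1 and CS_ERR_TRY_AGAIN is 6, so the finalize call is failing with "try again". A minimal sketch of the kind of check a CPG client like testcpg performs on shutdown (illustrative, not the actual testcpg source):

#include <stdio.h>
#include <corosync/cpg.h>

/*
 * Illustrative only: report the result of cpg_finalize() the way the
 * output above does.  Uses the public CPG API from corosync/cpg.h;
 * CS_OK (1) and CS_ERR_TRY_AGAIN (6) come from cs_error_t.
 */
static void finalize_and_report(cpg_handle_t handle)
{
    cs_error_t result = cpg_finalize(handle);

    printf("Finalize  result is %d (should be %d)\n",
           (int)result, (int)CS_OK);
    if (result == CS_ERR_TRY_AGAIN)
        printf("cpg is not ready; the group never recovered\n");
}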
Comment 12 Jan Friesse 2011-05-03 06:28:33 EDT
Because I was able to reproduce the problem (both on Corosync 1.2.3 and on today's git master), changing priority again. I'm also pretty sure it is a problem in the cpg membership algorithm. The new one was designed/developed by Angus, so reassigning to him, because he should be able to solve the bug much faster than me.
Comment 13 Jan Friesse 2011-05-03 06:33:46 EDT
One extra thing,
on victim node (3), output from testcpg looks like:
ConfchgCallback: group 'GROUP'
joined node/pid 3/21617 reason: 1
nodes in group now 3
node/pid 1/29172
node/pid 2/5902
node/pid 3/21617

ConfchgCallback: group 'GROUP'
left node/pid 1/29172 reason: 3
nodes in group now 2
node/pid 2/5902
node/pid 3/21617

ConfchgCallback: group 'GROUP'
left node/pid 2/5902 reason: 3
nodes in group now 1
node/pid 3/21617

ConfchgCallback: group 'GROUP'
joined node/pid 2/5902 reason: 4
nodes in group now 2
node/pid 2/5902
node/pid 3/21617

so it can be seen that node 2's testcpg quitting was never reported.
Comment 14 Angus Salkeld 2011-05-03 20:51:42 EDT
Some analysis ...

May  4 10:25:49 r6-2 corosync[1995]:   [TST2  ] syncv2.c:196 Member left: r(0) ip(192.168.100.93) 

<node leaves>

May  4 10:25:49 r6-2 corosync[1995]:   [TOTEM ] totemsrp.c:1826 A processor joined or left the membership and a new membership was formed.
May  4 10:25:49 r6-2 corosync[1995]:   [CPG   ] cpg.c:1207 sending my downlist
May  4 10:25:49 r6-2 corosync[1995]:   [CPG   ] cpg.c:1035 received downlist left_list: 1 received in state 1
May  4 10:25:49 r6-2 corosync[1995]:   [CPG   ] cpg.c:1035 received downlist left_list: 1 received in state 1
May  4 10:25:49 r6-2 corosync[1995]:   [CPG   ] cpg.c:760 chosen downlist from node r(0) ip(192.168.100.91) 

<they correctly choose node 1 as master>

May  4 10:25:49 r6-2 corosync[1995]:   [TST2  ] syncv2.c:238 tst_sv2_sync_init_v2 r(0) ip(192.168.100.91) 
May  4 10:25:49 r6-2 corosync[1995]:   [TST2  ] syncv2.c:267 sync: node left r(0) ip(192.168.100.93) 
May  4 10:25:49 r6-2 corosync[1995]:   [TST2  ] syncv2.c:297 sync: activate correctly 4
May  4 10:25:49 r6-2 corosync[1995]:   [MAIN  ] main.c:326 Completed service synchronization, ready to provide service.
May  4 10:26:07 r6-2 corosync[1995]:   [TST2  ] syncv2.c:200 Member joined: r(0) ip(192.168.100.93) 

<node 3 makes a come back>

May  4 10:26:07 r6-2 corosync[1995]:   [TOTEM ] totemsrp.c:1826 A processor joined or left the membership and a new membership was formed.
May  4 10:26:07 r6-2 corosync[1995]:   [CPG   ] cpg.c:1207 sending my downlist
May  4 10:26:07 r6-2 corosync[1995]:   [CPG   ] cpg.c:1035 received downlist left_list: 0 received in state 1
May  4 10:26:07 r6-2 corosync[1995]:   [CPG   ] cpg.c:1035 received downlist left_list: 2 received in state 1
May  4 10:26:07 r6-2 corosync[1995]:   [CPG   ] cpg.c:1035 received downlist left_list: 0 received in state 1
May  4 10:26:07 r6-2 corosync[1995]:   [CPG   ] cpg.c:760 chosen downlist from node r(0) ip(192.168.100.93) 

<now they choose node 3> .. problem
node 1 & 2 's cpg groups will be unusable (as David said)

May  4 10:26:07 r6-2 corosync[1995]:   [TST2  ] syncv2.c:238 tst_sv2_sync_init_v2 r(0) ip(192.168.100.91) 
May  4 10:26:07 r6-2 corosync[1995]:   [TST2  ] syncv2.c:252 sync: node joined r(0) ip(192.168.100.93) 
May  4 10:26:07 r6-2 corosync[1995]:   [TST2  ] syncv2.c:297 sync: activate correctly 4
May  4 10:26:07 r6-2 corosync[1995]:   [MAIN  ] main.c:326 Completed service synchronization, ready to provide service.


The way we choose a master at the moment is:
1) greatest node count
2) then lowest nodeid

"1" is for when we have a merge, we choose the segment with the most nodes
I think this logic is OK, but the "broken" node should set it's node list 
to it's self (instead of the full node list) - if this is possible. Also
it would be nice if I could get at the totem info saying that node 3 had
a pause timeout (something like totem_fault_since_last_config).

So we can change the above to
1) greatest _healthy_ node count
2) then lowest _health_ nodeid
3) then lowest nodeid (in any state)

Any other clever ideas welcome :)
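
To make the current rule concrete, here is a minimal sketch of the comparison described above (illustrative only; the struct and function names are hypothetical, not the actual cpg.c code):

/*
 * Hypothetical sketch of the current downlist selection: pick the
 * sender that reported the largest old membership, breaking ties with
 * the lowest nodeid.  A paused node that missed the intermediate
 * config change still reports the old 3-node membership, so it beats
 * the healthy nodes that report only 2.
 */
struct downlist {
    unsigned int sender_nodeid;
    unsigned int old_members;  /* membership size before the change */
};

static const struct downlist *choose_master(const struct downlist *a,
                                            const struct downlist *b)
{
    if (a->old_members != b->old_members)
        return (a->old_members > b->old_members) ? a : b;
    return (a->sender_nodeid < b->sender_nodeid) ? a : b;
}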
Comment 15 Angus Salkeld 2011-05-03 23:03:23 EDT
I noticed that we have "left_nodes" in the downlist message that we are not using! Cool.


There are 2 scenarios that can happen

Case 1) this is the bug
Here node 3 "misses" a config change - node 1 & 2 get an intermediate config change (node 3 left)

May 04 12:31:23 corosync [CPG   ] cpg.c:721 comparing: sender r(0) ip(192.168.100.93)  members(old:3 left:2)
May 04 12:31:23 corosync [CPG   ] cpg.c:721 comparing: sender r(0) ip(192.168.100.92)  members(old:2 left:0)
May 04 12:31:23 corosync [CPG   ] cpg.c:721 comparing: sender r(0) ip(192.168.100.91)  members(old:2 left:0)

Node 3 is selected (currently "left_nodes" is ignored)

Case 2) no bug
Here node 3 gets an intermediate config change as well

May 04 12:33:10 corosync [CPG   ] cpg.c:721 comparing: sender r(0) ip(192.168.100.93)  members(old:3 left:2)

May 04 12:33:10 corosync [CPG   ] cpg.c:721 comparing: sender r(0) ip(192.168.100.91)  members(old:2 left:0)
May 04 12:33:10 corosync [CPG   ] cpg.c:721 comparing: sender r(0) ip(192.168.100.93)  members(old:1 left:0)
May 04 12:33:10 corosync [CPG   ] cpg.c:721 comparing: sender r(0) ip(192.168.100.92)  members(old:2 left:0)

Here node 1 is correctly selected

When I change the master calculation to include the left_nodes, this fixes the bug (see the sketch below):
New master calculation:
1) node with largest (old_members - left_nodes)
2) then lowest nodeid
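
A minimal sketch of that revised comparison (illustrative only, hypothetical names as before; the real change lives in cpg.c's downlist handling):

/*
 * Hypothetical sketch of the fixed selection: score each sender by
 * (old_members - left_nodes), so a paused node that missed a config
 * change and still carries the departed nodes in old_members no
 * longer wins.  Ties still go to the lowest nodeid.
 */
struct downlist {
    unsigned int sender_nodeid;
    unsigned int old_members;
    unsigned int left_nodes;   /* members the sender knows have left */
};

static const struct downlist *choose_master(const struct downlist *a,
                                            const struct downlist *b)
{
    unsigned int score_a = a->old_members - a->left_nodes;
    unsigned int score_b = b->old_members - b->left_nodes;

    if (score_a != score_b)
        return (score_a > score_b) ? a : b;
    return (a->sender_nodeid < b->sender_nodeid) ? a : b;
}

With the Case 1 data above, node 3 scores 3 - 2 = 1 while nodes 1 and 2 each score 2 - 0 = 2, so node 1 wins the lowest-nodeid tie-break and is selected correctly.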

I'll post a patch soon. I also need to hunt down the previous related bugs
and make sure I don't break something else. Testing needed...
I know I have some of these tests as CTS test cases so I'll run those first.
Whilst Honza's test case was super helpful, it would be good to see if I can
get a reproducer without a "test patch" so I can add it to CTS.
Comment 16 Angus Salkeld 2011-05-04 19:46:36 EDT
Created attachment 496943 [details]
cpg: fix sync master selection when one node paused.

If one node is paused it can miss a config change and
thus report a larger old_members than expected.

The solution is to use the left_nodes field.

Master selection used to be "choose node with":
1) largest previous membership
2) (then as a tie-breaker) node with smallest nodeid

New selection:
1) largest (previous #nodes - #nodes known to have left)
2) (then as a tie-breaker) node with smallest nodeid

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Comment 17 Angus Salkeld 2011-05-04 20:27:35 EDT
I have tested this bug and all the CpgConfigChange test cases in CTS.
All look good.
I have posted the patch upstream.
So I am moving this to POST.
Comment 24 Angus Salkeld 2011-10-27 19:03:29 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: The "left_nodes" field in the downlist message was not being used.
Consequence: If one node is paused, it can miss a config change and thus report a larger old_members than expected.
Fix: Use the left_nodes field.
Result: Correct membership events.
Comment 25 errata-xmlrpc 2011-12-06 06:50:19 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html