When a node is isolated via iptables for between 11 and 29 seconds and the isolation is then removed, cpg membership is invalid. This simulates a network interface going offline for between 11 and 29 seconds and then becoming active again.

token = 10 sec
consensus = 20 sec

An isolation lasting between the token and consensus timeout periods, followed by removal of the isolation, demonstrates this issue.
I have used a (modified) laryngitis with a sleep between the isolate and the unisolate to test both the current rhel6 corosync package and one with my cpg membership patches (Bug 583844). Both always fail on whichever node it chooses.

Jul 2 21:00:47 r3 qarshd[3448]: Running cmdline: /tmp/coro-netctl drop
...
Jul 2 21:01:02 r3 qarshd[3469]: Running cmdline: /tmp/coro-netctl accept

(Note the time gap == 15 sec.)

This uses a script for the isolation that calls iptables-restore to make the 2 iptables commands more atomic (this improves the normal case).

The failure seems to occur because cpg doesn't handle the particular membership info that gets passed to it very well.
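The contents of /tmp/coro-netctl are not shown in this bug, so the following is only a rough sketch of the idea of applying the isolation rules in one iptables-restore call instead of two separate iptables commands. The function names, the peer address list, and the choice to drop all traffic to and from each peer are assumptions made for illustration, not the actual script.

```python
import subprocess


def build_ruleset(peers, isolate):
    """Return iptables-restore input for the filter table.

    peers   -- hypothetical list of cluster peer addresses
    isolate -- True to drop traffic to/from the peers, False to accept it
    """
    lines = [
        "*filter",
        ":INPUT ACCEPT [0:0]",
        ":FORWARD ACCEPT [0:0]",
        ":OUTPUT ACCEPT [0:0]",
    ]
    if isolate:
        for peer in peers:
            # Both directions go into the same restore input, so they are
            # committed together rather than by two separate commands.
            lines.append(f"-A INPUT -s {peer} -j DROP")
            lines.append(f"-A OUTPUT -d {peer} -j DROP")
    lines.append("COMMIT")
    return "\n".join(lines) + "\n"


def apply_ruleset(ruleset):
    """Feed the ruleset to iptables-restore (needs root).

    Without --noflush, iptables-restore replaces the existing rules of
    the filter table, so only use this on a dedicated test node.
    """
    subprocess.run(["iptables-restore"], input=ruleset, text=True, check=True)
```

Under these assumptions, "drop" would correspond to apply_ruleset(build_ruleset(peers, True)) and "accept" to apply_ruleset(build_ruleset(peers, False)), with the sleep in between.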
After putting in some debug (below), it looks like the cpg membership is ok. In this situation there is no change in the membership (no joins/leaves). I think the bug is a non-issue. What do you think, Steve?

Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_memb) 192.168.100.91
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_memb) 192.168.100.92
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_memb) 192.168.100.93
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_new_memb) 192.168.100.91
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_new_memb) 192.168.100.92
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_new_memb) 192.168.100.93
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_trans_memb) 192.168.100.91
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_trans_memb) 192.168.100.92
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] my_trans_memb) 192.168.100.93
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] left_list (empty)
Jul 6 08:21:10 r2 corosync[2882]: [TOTEM ] join_list (empty)
Jul 6 08:21:10 r2 corosync[2882]: [CPG ] cpg_confchg:members [13888] 1 2 3
Jul 6 08:21:10 r2 corosync[2882]: [CPG ] cpg_confchg:left [13888] <empty>
Jul 6 08:21:10 r2 corosync[2882]: [CPG ] cpg_confchg:joined [13888] <empty>
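For anyone repeating this check over many runs, the comparison done by eye on the debug output above could be automated roughly as follows. This is a minimal sketch that assumes the TOTEM debug lines keep exactly the format shown in this comment; the function names are made up for illustration, and the left_list/join_list lines are not parsed because only their empty form appears here.

```python
import re
from collections import defaultdict

# Matches debug lines such as "... [TOTEM ] my_memb) 192.168.100.91".
MEMB_RE = re.compile(
    r"\[TOTEM\s*\]\s+(my_memb|my_new_memb|my_trans_memb)\)\s+(\S+)"
)


def parse_membership(lines):
    """Collect the set of addresses reported for each membership list."""
    lists = defaultdict(set)
    for line in lines:
        match = MEMB_RE.search(line)
        if match:
            lists[match.group(1)].add(match.group(2))
    return dict(lists)


def membership_unchanged(lines):
    """True if my_memb, my_new_memb and my_trans_memb list the same nodes."""
    lists = parse_membership(lines)
    old = lists.get("my_memb")
    new = lists.get("my_new_memb")
    trans = lists.get("my_trans_memb")
    # A missing list means the debug output is incomplete, so report False.
    return old is not None and old == new == trans
```

Passing the syslog lines from one configuration change to membership_unchanged() gives a quick yes/no on whether the three lists agree, which is the "no joins/leaves" condition described above.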
There was initially a bug in my diagnostic code that caused Steve to open this bug. Not a bug.