Bug 610334 - isolating a node for 11-29 seconds results in bad cpg membership 15% of time
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Angus Salkeld
QA Contact: Cluster QE
Depends On: 583844
Blocks: 599016
Reported: 2010-07-02 01:57 EDT by Steven Dake
Modified: 2016-04-26 12:43 EDT (History)

Doc Type: Bug Fix
Clone Of: 605313
Last Closed: 2010-07-06 04:04:34 EDT


Comment 1 Steven Dake 2010-07-02 02:00:50 EDT
When a node is isolated via iptables for between 11 and 29 seconds and the isolation is then removed, the cpg membership is invalid.

This simulates a network interface going offline for between 11-29 seconds, and then becoming active again.

token = 10 sec
consensus = 20 sec
An isolation followed by removal of the isolation, timed between the token and consensus timeout periods, demonstrates this issue.
Comment 2 Angus Salkeld 2010-07-02 07:17:38 EDT
I have used a modified laryngitis, with a sleep between the isolate and the unisolate, to test both the current rhel6 corosync package and one with my cpg membership patches (Bug 583844). Both always fail on whichever node it chooses.

Jul  2 21:00:47 r3 qarshd[3448]: Running cmdline: /tmp/coro-netctl drop
...
Jul  2 21:01:02 r3 qarshd[3469]: Running cmdline: /tmp/coro-netctl accept
(note the time gap == 15sec)
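
The 15-second gap can be checked directly from the two qarshd timestamps. A minimal Python sketch, assuming the syslog timestamp format shown in the log lines above. The year is not in the log, so 2010 is taken from the comment date, and the helper name syslog_time is hypothetical:

```python
from datetime import datetime

def syslog_time(line, year=2010):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a log line.

    The year is an assumption; syslog lines in this format do not carry one.
    """
    # split() collapses the double space syslog uses for single-digit days
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

drop = "Jul  2 21:00:47 r3 qarshd[3448]: Running cmdline: /tmp/coro-netctl drop"
accept = "Jul  2 21:01:02 r3 qarshd[3469]: Running cmdline: /tmp/coro-netctl accept"

# Isolation duration in seconds between the drop and accept commands
gap = (syslog_time(accept) - syslog_time(drop)).total_seconds()
print(gap)  # 15.0
```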

This uses a script to do the isolating that calls iptables-restore
to make the 2 iptables commands more atomic (this improves the normal
case).
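
The idea of loading both rules in a single iptables-restore call can be sketched as follows. This is not the coro-netctl script used in the test, which is not shown here. The function names, the choice to match UDP port 5405, and the rule layout are all assumptions for illustration:

```python
import subprocess

def build_ruleset(action, port=5405):
    """Return iptables-restore input applying `action` to UDP traffic on `port`
    in both directions. The port number is an assumption, not taken from the bug."""
    if action not in ("DROP", "ACCEPT"):
        raise ValueError(f"unsupported action: {action}")
    lines = [
        "*filter",
        ":INPUT ACCEPT [0:0]",
        ":OUTPUT ACCEPT [0:0]",
        f"-A INPUT -p udp --dport {port} -j {action}",
        f"-A OUTPUT -p udp --dport {port} -j {action}",
        "COMMIT",
    ]
    return "\n".join(lines) + "\n"

def apply_ruleset(ruleset):
    """Load both rules in one iptables-restore call (requires root).

    Without --noflush, the previous contents of the filter table are flushed,
    so the inbound and outbound rules change together at COMMIT.
    """
    subprocess.run(["iptables-restore"], input=ruleset, text=True, check=True)

print(build_ruleset("DROP"))
```

Calling apply_ruleset(build_ruleset("DROP")), sleeping, and then calling apply_ruleset(build_ruleset("ACCEPT")) would correspond to the isolate and unisolate steps, under the same assumptions.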

This seems to be because cpg doesn't handle the particular membership info
that gets passed to it very well.
Comment 3 Angus Salkeld 2010-07-05 19:05:31 EDT
After putting in some debug output (below), it looks like the cpg membership is
ok. In this situation there is no change in the membership (no joins or leaves).

I think the bug is a non-issue.

What do you think, Steve?

Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_memb) 192.168.100.91
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_memb) 192.168.100.92
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_memb) 192.168.100.93
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_new_memb) 192.168.100.91
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_new_memb) 192.168.100.92
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_new_memb) 192.168.100.93
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_trans_memb) 192.168.100.91
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_trans_memb) 192.168.100.92
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_trans_memb) 192.168.100.93
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] left_list (empty)
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] join_list (empty)
Jul  6 08:21:10 r2 corosync[2882]:   [CPG   ] cpg_confchg:members [13888] 1 2 3 
Jul  6 08:21:10 r2 corosync[2882]:   [CPG   ] cpg_confchg:left    [13888] <empty>
Jul  6 08:21:10 r2 corosync[2882]:   [CPG   ] cpg_confchg:joined  [13888] <empty>
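
The observation that the three totem member lists are identical can also be checked mechanically. A minimal Python sketch, assuming the debug line format shown in this comment (the debug statements were local additions, so this format is specific to this log). The function name totem_lists is hypothetical:

```python
import re
from collections import defaultdict

LOG = """\
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_memb) 192.168.100.91
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_memb) 192.168.100.92
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_memb) 192.168.100.93
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_new_memb) 192.168.100.91
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_new_memb) 192.168.100.92
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_new_memb) 192.168.100.93
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_trans_memb) 192.168.100.91
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_trans_memb) 192.168.100.92
Jul  6 08:21:10 r2 corosync[2882]:   [TOTEM ] my_trans_memb) 192.168.100.93
"""

# Matches e.g. "[TOTEM ] my_new_memb) 192.168.100.92"
LINE = re.compile(r"\[TOTEM \] (my_\w+)\) (\S+)")

def totem_lists(log):
    """Group the addresses in the debug output by list name."""
    lists = defaultdict(list)
    for match in LINE.finditer(log):
        lists[match.group(1)].append(match.group(2))
    return dict(lists)

lists = totem_lists(LOG)
# True when the old, new, and transitional memberships all agree
unchanged = lists["my_memb"] == lists["my_new_memb"] == lists["my_trans_memb"]
print(unchanged)  # True
```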
Comment 4 Angus Salkeld 2010-07-06 04:04:34 EDT
There was initially a bug in my diagnostic code that caused Steve to
file this bug. Not a bug.
