Description of problem:

Nate was running revolver and things hung in recovery. groupd logs show that it's expecting a cpg confchg that it doesn't get.

1 M 4748 2008-12-02 14:52:45 tank-01
2 M 4792 2008-12-02 15:15:32 tank-03
3 M 4752 2008-12-02 14:52:45 tank-04
4 M 4796 2008-12-02 15:15:54 morph-01
7 M 4788 2008-12-02 15:15:31 morph-04

revolver killed: 2 (tank-03) 7 (morph-04) 4 (morph-01)
remaining: 1 (tank-01) 3 (tank-04)

tank-01
-------
1228252444 cman: node 2 removed
1228252444 add_recovery_set_cman nodeid 2
1228252444 groupd confchg total 4 left 1 joined 0
1228252444 add_recovery_set_cpg nodeid 2 procdown 0
1228252465 cman: node 7 removed
1228252465 add_recovery_set_cman nodeid 7
1228252465 groupd confchg total 3 left 1 joined 0
1228252465 add_recovery_set_cpg nodeid 7 procdown 0
(7532 cpg_mcast retries from 1228252465 to 1228252482)
1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
(expecting groupd confchg total 2 left 1 for nodeid 4, don't get it)
1228252531 cman: have quorum
1228252531 cman: node 7 added
1228252531 groupd confchg total 4 left 0 joined 1
(don't know which nodeid)
1228252532 cman: node 2 added
1228252554 cman: node 4 added
1228252534 groupd confchg total 5 left 0 joined 1
(don't know which nodeid)

tank-04
-------
1228252443 cman: node 2 removed
1228252443 add_recovery_set_cman nodeid 2
1228252443 groupd confchg total 4 left 1 joined 0
1228252443 add_recovery_set_cpg nodeid 2 procdown 0
1228252464 cman: node 7 removed
1228252464 add_recovery_set_cman nodeid 7
1228252464 groupd confchg total 3 left 1 joined 0
1228252464 add_recovery_set_cpg nodeid 7 procdown 0
(7175 cpg_mcast retries from 1228252464 to 1228252482)
1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
(expecting groupd confchg total 2 left 1 for nodeid 4, don't get it)
1228252530 cman: node 7 added
1228252530 cman: have quorum
1228252530 groupd confchg total 4 left 0 joined 1
(don't know which nodeid)
1228252531 cman: node 2 added
1228252533 groupd confchg total 5 left 0 joined 1
(don't know which nodeid)
1228252554 cman: node 4 added

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
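The pattern in the logs above (a cman removal that is never followed by the matching cpg confchg) can be checked mechanically. The following is only a rough sketch, not part of groupd or openais: the function name missing_cpg_confchg and the regular expressions are hypothetical, and it assumes the log lines have exactly the shape shown in this report. It lists node IDs that cman reported as removed but for which no add_recovery_set_cpg line was logged.

```python
import re

# Assumed line shapes, taken from the excerpts in this report.
CMAN_REMOVED = re.compile(r"^\d+ cman: node (\d+) removed$")
CPG_LEFT = re.compile(r"^\d+ add_recovery_set_cpg nodeid (\d+) procdown \d+$")


def missing_cpg_confchg(log_text):
    """Return node IDs removed by cman that never got a cpg recovery entry."""
    removed = []      # node IDs in the order cman reported their removal
    seen_cpg = set()  # node IDs for which a cpg confchg was processed
    for raw in log_text.splitlines():
        line = raw.strip()
        match = CMAN_REMOVED.match(line)
        if match:
            removed.append(int(match.group(1)))
            continue
        match = CPG_LEFT.match(line)
        if match:
            seen_cpg.add(int(match.group(1)))
    return [nodeid for nodeid in removed if nodeid not in seen_cpg]


# First part of the tank-01 excerpt from this report.
TANK_01 = """\
1228252444 cman: node 2 removed
1228252444 add_recovery_set_cman nodeid 2
1228252444 groupd confchg total 4 left 1 joined 0
1228252444 add_recovery_set_cpg nodeid 2 procdown 0
1228252465 cman: node 7 removed
1228252465 add_recovery_set_cman nodeid 7
1228252465 groupd confchg total 3 left 1 joined 0
1228252465 add_recovery_set_cpg nodeid 7 procdown 0
1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
"""

print(missing_cpg_confchg(TANK_01))  # prints [4]
```

For the tank-01 excerpt this reports node 4, which is the node whose cpg confchg never arrived. The annotation lines in parentheses are ignored because they do not start with a timestamp.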
Created attachment 325566
tarball of data

Data that Nate collected.
I failed to reproduce this, but Steven has a new IPC system almost ready to go into openais that might fix the problem.
My suggestion is that we close this bug as fixed in openais-0.80.5-2 and reopen it if the problem is reproduced. Several bugs that could cause this problem were fixed in this version.
Created attachment 362905
revision 2044 - patch that may fix this problem

Patch that may fix this problem. The changelog is:

r2044 | sdake | 2009-07-20 07:22:46 -0700 (Mon, 20 Jul 2009) | 12 lines

Cpg synchronization patch for conf change messages.

The root of the theoretical problem is that cpg_join or cpg_leave messages are being sent via the C APIs between synchronization. With the current cpg, synchronization happens in confchg_fn, and then later in cpg_sync_process. cpg_sync_process is called much later after confchg_fn, which leaves a small window of time in which cpg_join and cpg_leave operations that are queued in totem (but not yet ordered by totem) can interact with the synchronization process. Synchronization should happen in one atomic operation but currently is two distinct operations.
There is no cpg_join or cpg_leave happening in this bug. It sounds like the patch in comment #4 involves cpg_join or cpg_leave.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0180.html