Bug 474400 - groupd missing cpg confchg
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.3
Hardware: All Linux
Priority: low, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
Depends On:
Blocks:
 
Reported: 2008-12-03 12:29 EST by David Teigland
Modified: 2016-04-26 10:11 EDT (History)

See Also:
Fixed In Version: openais-0.80.6-11.el5
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-03-30 03:48:32 EDT


Attachments
tarball of data (797.35 KB, application/x-gzip)
2008-12-03 12:34 EST, David Teigland
revision 2044 - patch that may fix this problem (4.92 KB, text/plain)
2009-09-28 11:03 EDT, Steven Dake

Description David Teigland 2008-12-03 12:29:20 EST
Description of problem:

Nate was running revolver and things hung in recovery.
groupd logs show that it's expecting a cpg confchg that
it doesn't get.

   1   M   4748   2008-12-02 14:52:45  tank-01
   2   M   4792   2008-12-02 15:15:32  tank-03
   3   M   4752   2008-12-02 14:52:45  tank-04
   4   M   4796   2008-12-02 15:15:54  morph-01
   7   M   4788   2008-12-02 15:15:31  morph-04

revolver killed:
2 (tank-03)
7 (morph-04)
4 (morph-01)

remaining:
1 (tank-01)
3 (tank-04)

tank-01
-------

1228252444 cman: node 2 removed
1228252444 add_recovery_set_cman nodeid 2
1228252444 groupd confchg total 4 left 1 joined 0
1228252444 add_recovery_set_cpg nodeid 2 procdown 0

1228252465 cman: node 7 removed
1228252465 add_recovery_set_cman nodeid 7
1228252465 groupd confchg total 3 left 1 joined 0
1228252465 add_recovery_set_cpg nodeid 7 procdown 0

(7532 cpg_mcast retries from 1228252465 to 1228252482)

1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
(expecting groupd confchg total 2 left 1 for nodeid 4, don't get it)

1228252531 cman: have quorum
1228252531 cman: node 7 added
1228252531 groupd confchg total 4 left 0 joined 1 (don't know which nodeid)
1228252532 cman: node 2 added
1228252554 cman: node 4 added
1228252534 groupd confchg total 5 left 0 joined 1 (don't know which nodeid)

tank-04
-------

1228252443 cman: node 2 removed
1228252443 add_recovery_set_cman nodeid 2
1228252443 groupd confchg total 4 left 1 joined 0
1228252443 add_recovery_set_cpg nodeid 2 procdown 0

1228252464 cman: node 7 removed
1228252464 add_recovery_set_cman nodeid 7
1228252464 groupd confchg total 3 left 1 joined 0
1228252464 add_recovery_set_cpg nodeid 7 procdown 0

(7175 cpg_mcast retries from 1228252464 to 1228252482)

1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
(expecting groupd confchg total 2 left 1 for nodeid 4, don't get it)

1228252530 cman: node 7 added
1228252530 cman: have quorum
1228252530 groupd confchg total 4 left 0 joined 1 (don't know which nodeid)
1228252531 cman: node 2 added
1228252533 groupd confchg total 5 left 0 joined 1 (don't know which nodeid)
1228252554 cman: node 4 added
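
The check behind the annotations above (every node that cman removes should also appear in a cpg confchg) can be sketched as a small log scan. This is only a sketch, not part of groupd: the function name `missing_cpg_confchg` and the regular expressions are hypothetical, and it assumes the line formats shown in the excerpts above (`add_recovery_set_cman nodeid N` and `add_recovery_set_cpg nodeid N`). It also assumes each node is removed at most once in the scanned window.

```python
import re

# Patterns assume the groupd debug log format quoted in this report.
CMAN_RE = re.compile(r"^(\d+) add_recovery_set_cman nodeid (\d+)")
CPG_RE = re.compile(r"^(\d+) add_recovery_set_cpg nodeid (\d+)")


def missing_cpg_confchg(lines):
    """Return {nodeid: epoch} for nodes that cman reported as removed
    but for which no cpg confchg (add_recovery_set_cpg) was logged."""
    cman_removed = {}   # nodeid -> epoch of the cman removal
    cpg_seen = set()    # nodeids that showed up in a cpg confchg
    for line in lines:
        line = line.strip()
        m = CMAN_RE.match(line)
        if m:
            cman_removed[int(m.group(2))] = int(m.group(1))
            continue
        m = CPG_RE.match(line)
        if m:
            cpg_seen.add(int(m.group(2)))
    return {n: t for n, t in cman_removed.items() if n not in cpg_seen}


# Excerpt of the tank-01 log from this report.
TANK_01 = """\
1228252444 cman: node 2 removed
1228252444 add_recovery_set_cman nodeid 2
1228252444 groupd confchg total 4 left 1 joined 0
1228252444 add_recovery_set_cpg nodeid 2 procdown 0
1228252465 cman: node 7 removed
1228252465 add_recovery_set_cman nodeid 7
1228252465 groupd confchg total 3 left 1 joined 0
1228252465 add_recovery_set_cpg nodeid 7 procdown 0
1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
1228252531 cman: have quorum
""".splitlines()

print(missing_cpg_confchg(TANK_01))  # → {4: 1228252482}
```

Run against the tank-01 excerpt, the scan flags only nodeid 4, which matches the annotation that the confchg for nodeid 4 never arrives.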



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 David Teigland 2008-12-03 12:34:18 EST
Created attachment 325566
tarball of data

data that nate collected
Comment 2 Christine Caulfield 2009-01-21 05:23:03 EST
I failed to reproduce this, but Steven has a new IPC system almost ready to go into openais that might fix the problem.
Comment 3 Steven Dake 2009-02-18 00:33:36 EST
My suggestion is that we close this bug as fixed in openais-0.80.5-2 and reopen it if the problem is reproduced.

Several bugs that could cause this problem were fixed in this version.
Comment 4 Steven Dake 2009-09-28 11:03:29 EDT
Created attachment 362905
revision 2044 - patch that may fix this problem

Patch that may fix this problem. The changelog is:

r2044 | sdake | 2009-07-20 07:22:46 -0700 (Mon, 20 Jul 2009) | 12 lines

Cpg synchronization patch for conf change messages.

The root of the theoretical problem is that cpg_join or cpg_leave
messages are being sent via the C APIs between synchronization steps.
With the current cpg, synchronization happens in confchg_fn, and then
later in cpg_sync_process.  cpg_sync_process is called much later than
confchg_fn, which opens a small window of time in which cpg_join and
cpg_leave operations queued in totem (but not yet ordered by totem) can
interact with the synchronization process.  Synchronization should
happen in one atomic operation but currently is two distinct
operations.
Comment 6 David Teigland 2009-09-28 11:49:43 EDT
There is no cpg_join or cpg_leave happening in this bug.  It sounds like the patch in comment #4 involves cpg_join or cpg_leave.
Comment 9 errata-xmlrpc 2010-03-30 03:48:32 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0180.html
