Bug 474400 - groupd missing cpg confchg
Summary: groupd missing cpg confchg
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-12-03 17:29 UTC by David Teigland
Modified: 2016-04-26 14:11 UTC (History)

Fixed In Version: openais-0.80.6-11.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:48:32 UTC
Target Upstream Version:
Embargoed:


Attachments
tarball of data (797.35 KB, application/x-gzip)
2008-12-03 17:34 UTC, David Teigland
revision 2044 - patch that may fix this problem (4.92 KB, text/plain)
2009-09-28 15:03 UTC, Steven Dake


Links
System ID: Red Hat Product Errata RHBA-2010:0180
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: openais bug fix update
Last Updated: 2010-03-29 12:18:57 UTC

Description David Teigland 2008-12-03 17:29:20 UTC
Description of problem:

Nate was running revolver and the cluster hung in recovery.
The groupd logs show that groupd is waiting for a cpg confchg
that it never receives.

   1   M   4748   2008-12-02 14:52:45  tank-01
   2   M   4792   2008-12-02 15:15:32  tank-03
   3   M   4752   2008-12-02 14:52:45  tank-04
   4   M   4796   2008-12-02 15:15:54  morph-01
   7   M   4788   2008-12-02 15:15:31  morph-04

revolver killed:
2 (tank-03)
7 (morph-04)
4 (morph-01)

remaining:
1 (tank-01)
3 (tank-04)
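
The membership and kill order above imply a sequence of confchg events that each surviving node should see. The following is only a sketch of that arithmetic, assuming each killed node is reported in its own confchg with left 1 and joined 0; the helper name expected_confchgs is hypothetical and is not part of groupd or openais.

```python
def expected_confchgs(members, killed_in_order):
    """Return the (total, left, joined) tuples a survivor should observe.

    Assumption: every node failure is delivered as a separate confchg
    that removes exactly one member and adds none.
    """
    alive = list(members)
    events = []
    for nodeid in killed_in_order:
        alive.remove(nodeid)              # the node leaves the cpg
        events.append((len(alive), 1, 0))
    return events


# Node ids taken from the member list and the revolver kill order above.
members = [1, 2, 3, 4, 7]
killed = [2, 7, 4]

for total, left, joined in expected_confchgs(members, killed):
    print(f"groupd confchg total {total} left {left} joined {joined}")
```

Under that assumption the third event is "total 2 left 1 joined 0", which is the confchg for nodeid 4 that the logs below never show.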

tank-01
-------

1228252444 cman: node 2 removed
1228252444 add_recovery_set_cman nodeid 2
1228252444 groupd confchg total 4 left 1 joined 0
1228252444 add_recovery_set_cpg nodeid 2 procdown 0

1228252465 cman: node 7 removed
1228252465 add_recovery_set_cman nodeid 7
1228252465 groupd confchg total 3 left 1 joined 0
1228252465 add_recovery_set_cpg nodeid 7 procdown 0

(7532 cpg_mcast retries from 1228252465 to 1228252482)

1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
(expecting groupd confchg total 2 left 1 for nodeid 4, don't get it)

1228252531 cman: have quorum
1228252531 cman: node 7 added
1228252531 groupd confchg total 4 left 0 joined 1 (don't know which nodeid)
1228252532 cman: node 2 added
1228252554 cman: node 4 added
1228252534 groupd confchg total 5 left 0 joined 1 (don't know which nodeid)

tank-04
-------

1228252443 cman: node 2 removed
1228252443 add_recovery_set_cman nodeid 2
1228252443 groupd confchg total 4 left 1 joined 0
1228252443 add_recovery_set_cpg nodeid 2 procdown 0

1228252464 cman: node 7 removed
1228252464 add_recovery_set_cman nodeid 7
1228252464 groupd confchg total 3 left 1 joined 0
1228252464 add_recovery_set_cpg nodeid 7 procdown 0

(7175 cpg_mcast retries from 1228252464 to 1228252482)

1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
(expecting groupd confchg total 2 left 1 for nodeid 4, don't get it)

1228252530 cman: node 7 added
1228252530 cman: have quorum
1228252530 groupd confchg total 4 left 0 joined 1 (don't know which nodeid)
1228252531 cman: node 2 added
1228252533 groupd confchg total 5 left 0 joined 1 (don't know which nodeid)
1228252554 cman: node 4 added
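
The same check can be run mechanically over a groupd debug log. The sketch below is hypothetical tooling, not something shipped with groupd: it assumes the log line formats shown above and reports every nodeid that cman removed without a later add_recovery_set_cpg line for it. The sample lines are copied from the tank-01 excerpt.

```python
import re

# Patterns assume the groupd debug log format shown in this report.
CMAN_REMOVED = re.compile(r"^\d+ cman: node (\d+) removed$")
CPG_RECOVERY = re.compile(r"^\d+ add_recovery_set_cpg nodeid (\d+) procdown \d+$")


def missing_cpg_confchg(log_lines):
    """Return nodeids removed by cman with no matching cpg recovery entry."""
    pending = []  # removed by cman, cpg confchg not yet seen
    for raw in log_lines:
        line = raw.strip()
        removed = CMAN_REMOVED.match(line)
        if removed:
            pending.append(int(removed.group(1)))
            continue
        recovered = CPG_RECOVERY.match(line)
        if recovered:
            nodeid = int(recovered.group(1))
            if nodeid in pending:
                pending.remove(nodeid)
    return pending


TANK_01 = """\
1228252444 cman: node 2 removed
1228252444 add_recovery_set_cman nodeid 2
1228252444 groupd confchg total 4 left 1 joined 0
1228252444 add_recovery_set_cpg nodeid 2 procdown 0
1228252465 cman: node 7 removed
1228252465 add_recovery_set_cman nodeid 7
1228252465 groupd confchg total 3 left 1 joined 0
1228252465 add_recovery_set_cpg nodeid 7 procdown 0
1228252482 cman: lost quorum
1228252482 cman: node 4 removed
1228252482 add_recovery_set_cman nodeid 4
1228252531 cman: have quorum
1228252531 cman: node 7 added
"""

print(missing_cpg_confchg(TANK_01.splitlines()))  # [4]
```

For the tank-01 excerpt this flags nodeid 4 only, matching the missing confchg noted in the logs.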




Comment 1 David Teigland 2008-12-03 17:34:18 UTC
Created attachment 325566
tarball of data

Data that Nate collected.

Comment 2 Christine Caulfield 2009-01-21 10:23:03 UTC
I failed to reproduce this, but Steven has a new IPC system almost ready to go into openais that might fix the problem.

Comment 3 Steven Dake 2009-02-18 05:33:36 UTC
My suggestion is that we close this bug as fixed in openais-0.80.5-2 and reopen it if the problem is reproduced.

Several bugs that could cause this problem were fixed in this version.

Comment 4 Steven Dake 2009-09-28 15:03:29 UTC
Created attachment 362905
revision 2044 - patch that may fix this problem

Patch that may fix this problem.  The changelog is:

r2044 | sdake | 2009-07-20 07:22:46 -0700 (Mon, 20 Jul 2009) | 12 lines

Cpg synchronization patch for conf change messages.

The root of the theoretical problem is that cpg_join or cpg_leave
messages are being sent via the C apis between synchronization.  With
the current cpg, synchronization happens in confchg_fn, and then later
in cpg_sync_process.  cpg_sync_process is called much later after
confchg_fn, which opens a small window of time in which cpg_join and
cpg_leave operations that are queued in totem (but not yet ordered by
totem) can interact with the synchronization process.  Synchronization
should happen in one atomic operation but currently is two distinct
operations.

Comment 6 David Teigland 2009-09-28 15:49:43 UTC
There is no cpg_join or cpg_leave happening in this bug.  It sounds like the patch in comment #4 involves cpg_join or cpg_leave.

Comment 9 errata-xmlrpc 2010-03-30 07:48:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0180.html

