Red Hat Bugzilla – Bug 504036
cpg dispatch activity stops after node failure
Last modified: 2009-07-08 11:57:44 EDT
Description of problem:
Running "cpgx -d1" on four nodes, where -d1 causes the test to periodically
kill and restart corosync. When this kill/restart happens on one node, others
are typically exiting/joining the cpg at the same time. The result is
that cpgx stops receiving any cpg callbacks, and it just sits there forever.
More specifically, it appears that any cpg join gets stuck if the join occurs
during the failure/recovery period of another node that was killed.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Please replicate this issue and try to understand what the root cause is.
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.
More information and reason for this action is here:
Please retest with at least revision 2288 or corosync 0.99 when available.
r2288 | sdake | 2009-06-23 18:01:57 -0700 (Tue, 23 Jun 2009) | 10 lines
Add assembly to free list when it is removed from a configuration change as
indicated by being in the left list.
This has side effect of clearing the assembly buffer the next time it is
referenced from the free list. This fixes a defect that stops forward
processing of the message streams because sync fails to finish when receiving
a sync message from a restarted processor because it throws away the message.
I still see the problem. Using
[svn/corosync/trunk]% svn info
Repository Root: svn+ssh://svn.fedorahosted.org/svn/corosync
Repository UUID: fd59a12c-fef9-0310-b244-a6a79926bd2f
Node Kind: directory
Last Changed Author: sdake
Last Changed Rev: 2289
Last Changed Date: 2009-06-24 00:21:13 -0500 (Wed, 24 Jun 2009)
1. nodes 1,2,3: cpgx running
2. node 1: dies
3. nodes 2,3: cpgx stops receiving callbacks, no activity
(approx 7 second gap of no activity)
4. nodes 2,3: cpgx starts receiving callbacks again, this includes the confchg removing node 1.
If cpgx on node 4 starts and joins the cpg in the window between step 2 and step 4, it never receives any callbacks.
This sounds like a different bug than the one originally reported.
Reassigning to Honzaf to investigate.
Created attachment 350943 [details]
Proposed cpgx patch fixing this problem
I think I finally found the problem, and it is in cpgx. It was hidden in node->sync_from. This value is not updated if a node dies. If the node named in sync_from dies, no other node will send the sync, so it looks as if the node is not receiving messages (dispatch), which is not true.
Let me know whether this solves your problem (maybe I found a different problem).
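The idea behind the fix can be sketched roughly as follows. This is not the attached patch. The struct layout, the NO_SYNC_SOURCE sentinel, and the function name are hypothetical stand-ins; only the sync_from field name comes from the comment above. The sketch assumes that a node which is still waiting for a sync from a departed node should have its sync source cleared so that a new one can be chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_SYNC_SOURCE 0u  /* hypothetical sentinel: no sync source chosen */

/* Hypothetical, simplified per-node record; not the real cpgx struct. */
struct node {
    uint32_t nodeid;     /* id of this node */
    int      synced;     /* nonzero once this node has received its sync */
    uint32_t sync_from;  /* nodeid expected to send the sync message */
};

/*
 * Called when a confchg reports that left_id has gone away.
 * Any unsynced node still waiting on left_id loses its sync source,
 * so the caller can pick a new one instead of waiting forever.
 * Returns the number of nodes whose sync source was cleared.
 */
size_t clear_dead_sync_source(struct node *nodes, size_t count,
                              uint32_t left_id)
{
    size_t cleared = 0;

    assert(nodes != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (!nodes[i].synced && nodes[i].sync_from == left_id) {
            nodes[i].sync_from = NO_SYNC_SOURCE;
            cleared++;
        }
    }
    return cleared;
}
```

In the scenario from the earlier comment, node 4 would be the unsynced entry with sync_from pointing at node 1; once node 1 is reported as gone, the entry is cleared and one of the remaining nodes can take over as sync source.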
(In reply to comment #7)
> Created an attachment (id=350943) [details]
> Proposed cpgx patch fixing this problem
> I think I finally found problem. In cpgx. Problem was hidden in
> node->sync_from. This value is not updated, if node dies. If node in sync_from
> dies, no one other will send sync and it looks like node will not receiving
> messages (dispatch) what is not true.
> Let me know, if this solves your problem or not (maybe I found some different
> problem).
Please remove the line
+ synced_nodes[i++] = node->nodeid;
from the patch. It introduces an ugly overflow problem, and it's not needed.
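For illustration only, here is the general shape of that kind of overflow and a guard against it. The thread does not show how synced_nodes is declared, so the fixed-capacity array, the used counter, and the append_nodeid helper are all assumptions. The recommendation in this comment is still to drop the line, not to guard it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: append nodeid to a fixed-capacity array.
 * An unguarded "ids[i++] = nodeid" writes past the end once i
 * reaches the capacity; this version refuses the write instead.
 * Returns 1 if the id was stored, 0 if the array was already full.
 */
int append_nodeid(uint32_t *ids, size_t capacity, size_t *used,
                  uint32_t nodeid)
{
    assert(ids != NULL && used != NULL);

    if (*used >= capacity)
        return 0;  /* full: nothing is written */

    ids[(*used)++] = nodeid;
    return 1;
}
```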
Not a corosync bug. I've verified that Honza's cpgx patch fixes this. The fact that cpgx prints nothing until it's synced misled me into thinking it was getting no callbacks, when in fact it was getting them but just wasn't printing them. I'll add some debug output in this space between join and sync to avoid this confusion again.
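A minimal sketch of what such debug output might look like, assuming a simple two-state view of the test program. The enum, the function name, and the message format are hypothetical and not taken from cpgx.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical states: joined but not yet synced, or fully synced. */
enum sync_state { STATE_JOINED_UNSYNCED, STATE_SYNCED };

/*
 * Format one debug line for a received callback, tagged with the
 * current sync state, so callbacks that arrive between join and sync
 * are visible instead of silently dropped from the output.
 * Returns the snprintf result (length the full line would need).
 */
int format_callback_line(char *buf, size_t len, enum sync_state state,
                         const char *event)
{
    assert(buf != NULL && event != NULL);

    const char *tag = (state == STATE_SYNCED) ? "synced"
                                              : "waiting for sync";

    /* Print a placeholder when the caller has no event name. */
    if (strlen(event) == 0)
        event = "(unnamed)";

    return snprintf(buf, len, "cpg callback [%s]: %s", tag, event);
}
```

With output like this, a node stuck between join and sync would still log each confchg and deliver callback, which makes it clear that dispatch is working and only the sync is missing.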