Bug 504036

Summary: cpg dispatch activity stops after node failure
Product: [Fedora] Fedora
Reporter: David Teigland <teigland>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 11
CC: agk, fdinitto, sdake
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-07-08 15:57:44 UTC
Attachments:
Proposed cpgx patch fixing this problem (flags: none)

Description David Teigland 2009-06-03 21:48:13 UTC
Description of problem:

I am running "cpgx -d1" on four nodes; -d1 causes the test to periodically
kill and restart corosync.  When this kill/restart happens on one node, others
are typically exiting/joining the cpg at the same time.  The result is
that cpgx stops receiving any cpg callbacks, and it just sits there forever.

More specifically, it appears that any cpg join gets stuck if the join occurs
during the failure/recovery period of another node that was killed.

Comment 1 Steven Dake 2009-06-03 21:50:04 UTC
Honza,

please replicate this issue and try to understand what the root cause is.

Comment 2 Bug Zapper 2009-06-09 17:03:34 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 Steven Dake 2009-06-24 01:08:33 UTC
Dave,

Please retest with at least revision 2288 or corosync 0.99 when available.

Regards

-----------------------------------------------------------------------
r2288 | sdake | 2009-06-23 18:01:57 -0700 (Tue, 23 Jun 2009) | 10 lines

Add assembly to free list when it is removed from a configuration change as
indicated by being in the left list.

This has side effect of clearing the assembly buffer the next time it is
referenced from the free list.  This fixes a defect that stops forward
processing of the message streams because sync fails to finish when receiving
a sync message from a restarted processor because it throws away the message.

Comment 4 David Teigland 2009-06-24 17:21:46 UTC
I still see the problem.  Using

[svn/corosync/trunk]% svn info
Path: .
URL: svn+ssh://svn.fedorahosted.org/svn/corosync/trunk
Repository Root: svn+ssh://svn.fedorahosted.org/svn/corosync
Repository UUID: fd59a12c-fef9-0310-b244-a6a79926bd2f
Revision: 2289
Node Kind: directory
Schedule: normal
Last Changed Author: sdake
Last Changed Rev: 2289
Last Changed Date: 2009-06-24 00:21:13 -0500 (Wed, 24 Jun 2009)

Comment 5 David Teigland 2009-06-24 18:42:36 UTC
1. nodes 1,2,3: cpgx running
2. node 1: dies
3. nodes 2,3: cpgx stops receiving callbacks, no activity
(approx 7 second gap of no activity)
4. nodes 2,3: cpgx starts receiving callbacks again, this includes the confchg removing node 1.

If cpgx on node 4 starts and joins the cpg in the window between step 2 and step 4, then it will never receive any callbacks.

Comment 6 Steven Dake 2009-06-24 18:54:11 UTC
This sounds like a different bugzilla than originally reported.

Reassigning to Honza to investigate.

Comment 7 Jan Friesse 2009-07-08 14:25:19 UTC
Created attachment 350943
Proposed cpgx patch fixing this problem

David,
I think I finally found the problem, and it is in cpgx. The problem was hidden in node->sync_from: this value is not updated when a node dies. If the node in sync_from dies, no other node will send a sync, so it looks as if the node is not receiving messages (dispatch), which is not true.

Let me know whether this solves your problem (maybe I found a different problem).
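
The stale sync_from situation described above can be sketched roughly as follows. This is not the attached patch and not cpgx's actual code; the struct, the function names, and the "lowest surviving nodeid" policy are all assumptions made for illustration, and only the field name sync_from comes from this comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node record; only the field name sync_from comes from
 * the bug report, the rest is invented for this sketch. */
struct tracked_node {
    uint32_t nodeid;     /* the node that joined and still needs state */
    uint32_t sync_from;  /* the member expected to send it a sync message */
    int needs_sync;      /* nonzero until a sync has been sent */
};

/* A member sends the sync only if it is the designated source. */
static int should_send_sync(uint32_t our_nodeid, const struct tracked_node *n)
{
    assert(n != NULL);
    return n->needs_sync && n->sync_from == our_nodeid;
}

/* Assumed policy: lowest surviving nodeid other than the joiner itself.
 * Returns 0 (treated here as "no node") if there is no candidate. */
static uint32_t pick_sync_source(const struct tracked_node *n,
                                 const uint32_t *members, size_t count)
{
    uint32_t best = 0;
    for (size_t i = 0; i < count; i++) {
        if (members[i] == n->nodeid)
            continue;
        if (best == 0 || members[i] < best)
            best = members[i];
    }
    return best;
}

/* Call for every node reported as left in a configuration change.
 * Without this step sync_from keeps pointing at the dead node. */
static void node_left(struct tracked_node *n, uint32_t dead_nodeid,
                      const uint32_t *members, size_t count)
{
    assert(n != NULL);
    if (n->needs_sync && n->sync_from == dead_nodeid)
        n->sync_from = pick_sync_source(n, members, count);
}
```

With a joiner whose sync_from is node 1, should_send_sync() is false on nodes 2 and 3 until node_left() is called for node 1; after that, exactly one survivor becomes the sender.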

Comment 8 Jan Friesse 2009-07-08 14:32:04 UTC
(In reply to comment #7)
David,
please remove the line
+		synced_nodes[i++] = node->nodeid;
from the patch. It introduces an ugly overflow problem, and it's not needed.
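
The patch itself is not reproduced in this report, so the following is only a generic illustration of why an unchecked `array[i++] = value` into a fixed-size array can overflow, and what a guarded append looks like. The type, the capacity, and the function name are hypothetical and are not taken from cpgx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SYNCED 4  /* hypothetical capacity, chosen only for the example */

struct synced_list {
    uint32_t ids[MAX_SYNCED];
    size_t count;
};

/* Append a nodeid; returns 0 on success, -1 if the list is already full.
 * The bounds check is what an unguarded ids[count++] = nodeid would lack. */
static int synced_list_add(struct synced_list *l, uint32_t nodeid)
{
    assert(l != NULL);
    if (l->count >= MAX_SYNCED)
        return -1;
    l->ids[l->count++] = nodeid;
    return 0;
}
```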

Comment 9 David Teigland 2009-07-08 15:57:44 UTC
Not a corosync bug.  I've verified that Honza's cpgx patch fixes this.  The fact that cpgx prints nothing until it's synced misled me into thinking it was getting no callbacks, when in fact it was getting them but just wasn't printing them.  I'll add some debug output in the window between join and sync to avoid this confusion again.
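
One possible shape for that debug output, as a minimal sketch: count and print every callback that arrives while the node is still unsynced. The struct, the function name, and the message format are assumptions made here; this is not cpgx's code and it does not call the corosync API.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical state for the window between join and sync. */
struct dispatch_state {
    int synced;                      /* set once the sync has been received */
    unsigned long presync_callbacks; /* callbacks seen while still unsynced */
};

/* Intended to be called from the confchg and deliver callbacks. While
 * unsynced it logs one line per callback, so a silent wait for sync can be
 * told apart from a real lack of dispatch activity. */
static void note_callback(FILE *out, struct dispatch_state *st,
                          const char *kind, uint32_t nodeid)
{
    assert(out != NULL && st != NULL && kind != NULL);
    if (st->synced)
        return;  /* the normal output path takes over once synced */
    st->presync_callbacks++;
    fprintf(out,
            "presync: %s from node %" PRIu32 " (callback %lu, waiting for sync)\n",
            kind, nodeid, st->presync_callbacks);
}
```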