Bug 504036
| Field | Value |
|---|---|
| Summary | cpg dispatch activity stops after node failure |
| Product | [Fedora] Fedora |
| Component | corosync |
| Version | 11 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | low |
| Reporter | David Teigland <teigland> |
| Assignee | Jan Friesse <jfriesse> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | agk, fdinitto, sdake |
| Doc Type | Bug Fix |
| Last Closed | 2009-07-08 15:57:44 UTC |

Description
David Teigland
2009-06-03 21:48:13 UTC
Honza, please replicate this issue and try to understand what the root cause is.

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle. Changing version to '11'. More information and the reason for this action are here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Dave,
Please retest with at least revision 2288 or corosync 0.99 when available.
Regards

-----------------------------------------------------------------------
r2288 | sdake | 2009-06-23 18:01:57 -0700 (Tue, 23 Jun 2009) | 10 lines

Add assembly to free list when it is removed from a configuration change as indicated by being in the left list. This has side effect of clearing the assembly buffer the next time it is referenced from the free list.

This fixes a defect that stops forward processing of the message streams because sync fails to finish when receiving a sync message from a restarted processor because it throws away the message.

I still see the problem. Using

[svn/corosync/trunk]% svn info
Path: .
URL: svn+ssh://svn.fedorahosted.org/svn/corosync/trunk
Repository Root: svn+ssh://svn.fedorahosted.org/svn/corosync
Repository UUID: fd59a12c-fef9-0310-b244-a6a79926bd2f
Revision: 2289
Node Kind: directory
Schedule: normal
Last Changed Author: sdake
Last Changed Rev: 2289
Last Changed Date: 2009-06-24 00:21:13 -0500 (Wed, 24 Jun 2009)

1. nodes 1,2,3: cpgx running
2. node 1: dies
3. nodes 2,3: cpgx stops receiving callbacks, no activity (approx 7 second gap of no activity)
4. nodes 2,3: cpgx starts receiving callbacks again, this includes the confchg removing node 1.

If node 4 cpgx starts and joins the cpg in the window between step 2 and step 4, then it will never receive any callbacks.

This sounds like a different bugzilla than originally reported. Reassigning to Honzaf to investigate.

Created attachment 350943 [details]
Proposed cpgx patch fixing this problem
David,
I think I finally found the problem, and it is in cpgx. It was hidden in node->sync_from: this value is not updated if a node dies. If the node in sync_from dies, no other node will send sync, so it looks as if the node is not receiving messages (dispatch), which is not true.
Let me know whether this solves your problem (maybe I found a different problem).
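
For readers without the attachment, the failure mode described above can be sketched roughly as follows. This is not the attached patch. The struct layout, the function names, and the "lowest surviving nodeid" policy are all assumptions made for illustration; only the field name sync_from comes from the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-node state; the real cpgx struct differs. */
struct node {
    uint32_t nodeid;
    uint32_t sync_from; /* nodeid we expect sync state from, 0 = none */
    int synced;         /* nonzero once sync state has been received */
};

/* Return 1 if nodeid appears in the confchg left list, else 0. */
static int in_left_list(uint32_t nodeid, const uint32_t *left, size_t left_n)
{
    for (size_t i = 0; i < left_n; i++)
        if (left[i] == nodeid)
            return 1;
    return 0;
}

/* Assumed policy: pick the lowest surviving member other than ourselves.
 * Returns 0 when no other member is available. */
static uint32_t pick_sync_source(uint32_t self, const uint32_t *members,
                                 size_t members_n)
{
    uint32_t best = 0;
    for (size_t i = 0; i < members_n; i++)
        if (members[i] != self && (best == 0 || members[i] < best))
            best = members[i];
    return best;
}

/* Called from a confchg handler. If the node we were waiting on has left,
 * choose a new sync source instead of waiting forever. */
static void update_sync_from(struct node *self,
                             const uint32_t *members, size_t members_n,
                             const uint32_t *left, size_t left_n)
{
    assert(self != NULL);
    if (self->synced || self->sync_from == 0)
        return; /* nothing pending */
    if (in_left_list(self->sync_from, left, left_n))
        self->sync_from = pick_sync_source(self->nodeid, members, members_n);
}
```

With node 4 waiting on node 1, a confchg whose left list contains node 1 moves sync_from to a surviving member, so the joiner is no longer stuck waiting on a dead node.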
(In reply to comment #7)
> Created an attachment (id=350943) [details]
> Proposed cpgx patch fixing this problem

David, please remove the line
+ synced_nodes[i++] = node->nodeid;
from the patch. It introduces an ugly overflow problem, and it's not needed.

Not a corosync bug. I've verified that Honza's cpgx patch fixes this. The fact that cpgx prints nothing until it's synced misled me into thinking it was getting no callbacks, when in fact it was, but just wasn't printing them. I'll add some debug output in this space between join and sync to avoid this confusion again.
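
The kind of debug output mentioned above, covering the gap between join and sync, might look something like this sketch. The struct, the function name, and the message format are hypothetical and not taken from cpgx. The only point is that a callback arriving before sync leaves a visible trace instead of being silently ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical state tracked by a test client between join and sync. */
struct sync_state {
    int synced;                    /* nonzero once sync has completed */
    unsigned long presync_deliver; /* deliver callbacks seen before sync */
};

/* Call from the deliver callback. While unsynced, count the message and
 * log it so a stalled sync is not mistaken for a stalled dispatch. */
static void note_deliver(struct sync_state *st, uint32_t from_nodeid,
                         size_t len, FILE *out)
{
    assert(st != NULL && out != NULL);
    if (st->synced)
        return; /* normal processing prints its own output */
    st->presync_deliver++;
    fprintf(out, "presync: deliver #%lu from node %u len %zu (not yet synced)\n",
            st->presync_deliver, (unsigned)from_nodeid, len);
}
```

A matching counter for confchg callbacks could be added the same way.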