Bug 859170

Summary: Non-ready HA broker can be incorrectly promoted to primary
Product: Red Hat Enterprise MRG
Reporter: Jason Dillaman <jdillama>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED CURRENTRELEASE
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: unspecified
Priority: high
Version: Development
CC: aconway, esammons, jross, lzhaldyb, mcressma
Target Milestone: 2.3
Keywords: OtherQA
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qpid-cpp-0.18-2
Doc Type: Bug Fix
Last Closed: 2013-03-19 12:38:28 EDT
Type: Bug
Bug Blocks: 698367    

Description Jason Dillaman 2012-09-20 13:35:30 EDT
Description of problem:
rgmanager can promote a non-ready backup HA broker to primary when other backup brokers are available in the ready state.  This can result in loss of messages and broker configuration.  Additionally, this can cause the previously ready backups to throw exceptions when connecting to the new primary:

Sep 20 10:17:18 itcm12 qpidd[10871]: 2012-09-20 10:17:18 [HA] critical Backup queue Queue1: Replication failed: Invalid position move, preceeds messages
Sep 20 10:17:18 itcm12 qpidd[10871]: 2012-09-20 10:17:18 [Protocol] error Unexpected exception: Invalid position move, preceeds messages
Sep 20 10:17:18 itcm12 qpidd[10871]: 2012-09-20 10:17:18 [Broker] error Connection closed by error: Invalid position move, preceeds messages(501)

Version-Release number of selected component (if applicable):
Qpid 0.18

How reproducible:

Steps to Reproduce:
1. Start a primary and backup broker
2. Inject messages into the primary and ensure messages replicate to backup
3. Restart the primary broker and manually re-promote it to primary

Actual results:
Restarted broker becomes primary

Expected results:
Restarted broker refuses to become primary since at least one ready backup was discovered within some timeout
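
The expected behaviour above can be sketched roughly as follows. This is only an illustration of the decision being asked for, not the actual qpidd or qpid-ha implementation: the function name `may_promote`, the `poll_peer_statuses` callback, and the timeout and interval values are all hypothetical, and the "ready" status string is assumed to match what the HA module reports for a caught-up backup.

```python
import time

# Assumed status string for a backup that has fully caught up.
READY = "ready"


def may_promote(own_status, poll_peer_statuses, timeout=5.0, interval=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Decide whether this broker should accept promotion to primary.

    own_status         -- this broker's HA status (hypothetical input)
    poll_peer_statuses -- hypothetical callback returning the statuses of
                          the other brokers currently visible
    timeout, interval  -- illustrative values, in seconds
    """
    # A ready backup already holds the replicated state, so it may promote.
    if own_status == READY:
        return True

    # A non-ready broker waits up to `timeout` for a ready backup to appear.
    deadline = clock() + timeout
    while True:
        if READY in poll_peer_statuses():
            # A better candidate exists: refuse the promotion.
            return False
        if clock() >= deadline:
            # No ready backup was discovered in time: accept the promotion
            # so the cluster is not left without a primary.
            return True
        sleep(interval)
```

In the reproduction above, the restarted broker would see the other backup in the ready state and refuse, leaving rgmanager to promote that backup instead.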

Comment 1 Alan Conway 2012-09-21 08:38:23 EDT
Have you also seen this problem when using rgmanager?
If so, was the failing node rebooted, or was only qpidd restarted?

Comment 2 Jason Dillaman 2012-09-21 11:08:59 EDT
Yes, I have run into this problem without running the manual steps above.  We encounter a lot of flapping during our test startup due to the sheer number of connections and queues being created.  This causes 'qpid-ha' to time out on its QMF query, which in turn causes rgmanager to stop the primary promotion service and relocate it.