Bug 1640804
Summary: | [OSP14] DuplicateMessageError after rolling restart of RabbitMQ | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea>
Component: | puppet-tripleo | Assignee: | RHOS Maint <rhos-maint>
Status: | CLOSED CURRENTRELEASE | QA Contact: | nlevinki <nlevinki>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 14.0 (Rocky) | CC: | abeekhof, aherr, apevec, bdobreli, chjones, dbecker, dciabrin, jeckersb, jjoyce, joflynn, jschluet, lhh, mbooth, mburns, mcornea, michele, morazi, pkomarov, plemenko, rbartal, slinaber, tvignaud
Target Milestone: | zstream | Keywords: | TestOnly, Triaged, ZStream
Target Release: | 14.0 (Rocky) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | puppet-tripleo-9.4.1-0.20190508182406.el7ost | Doc Type: | Known Issue
Doc Text: | When you restart all three controller nodes, it might not be possible to launch tenant instances in the overcloud. A "DuplicateMessageError" message is logged in the overcloud logs. As a workaround, on one of the overcloud controllers, run: pcs resource restart rabbitmq-bundle | |
Story Points: | --- | |
Clone Of: | | |
: | 1721054 (view as bug list) | Environment: |
Last Closed: | 2019-08-13 10:44:31 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1592528 | |
Bug Blocks: | 1721054 | |
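The Doc Text workaround amounts to a single pcs command on one controller. Below is a minimal sketch of how it might be applied and checked, assuming a standard TripleO HA deployment where RabbitMQ runs as the pacemaker bundle resource rabbitmq-bundle and where overcloud service logs live under /var/log/containers (both assumptions; the bug only names the pcs command itself):

```bash
# Run as root on any one overcloud controller.

# Confirm the symptom first: DuplicateMessageError in the overcloud logs
# (the log path is an assumption for a containerized OSP 14 deployment).
grep -r "DuplicateMessageError" /var/log/containers/ | tail

# Documented workaround: restart the whole RabbitMQ bundle across the cluster.
pcs resource restart rabbitmq-bundle

# Verify the bundle came back on all three controllers.
pcs status | grep -A 4 rabbitmq-bundle
```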
Description
Marius Cornea
2018-10-18 19:25:45 UTC
The nodes reboot is done via this playbook: https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml

After running 'pcs resource restart rabbitmq-bundle' I was able to successfully launch the instance.

Our best guess is that this is a rabbit issue. If not, possibly oslo.messaging.

(In reply to Matthew Booth from comment #4)
> Our best guess is that this is a rabbit issue. If not, possibly
> oslo.messaging.

Probably more oslo.messaging. I would be very surprised if rabbitmq delivered the same message more than once. However, oslo.messaging would not surprise me if something went wrong with either (1) retrying on error in publish, or (2) mishandling of message acknowledgement.

Actually, this looks like the same thing as bug 1592528. Doing a rolling restart of rabbitmq causes publishes to not be confirmed, which means the sender errors out on publish and republishes the message indefinitely, causing duplicate messages in the log. I am going to leave this one open since it is filed against OSP14 and the other is OSP13, but I'll continue the investigation on bug 1592528 since there has already been substantial work detailed there.

(In reply to John Eckersberg from comment #7)
> (In reply to Matthew Booth from comment #4)
> > Our best guess is that this is a rabbit issue. If not, possibly
> > oslo.messaging.
>
> Probably more oslo.messaging. I would be very surprised if rabbitmq
> delivered the same message more than once. However oslo.messaging would not
> surprise me if something went wrong with either (1) retrying on error in
> publish, or (2) mishandling of message acknowledgement.

Duplication of messages is totally possible in corner cases, like network partitions, and rebooting the cluster one node at a time is a corner case as well. See some testing results [0] I did back in the good old days. The versions used are a bit dated by now, but I think the results still apply for rabbit HA queues. The comparative table shows a lot of duplicated messages popping up after network partitions for mirrored HA queues and different configurations (split as 1*-2*-3* layouts).

[0] https://docs.google.com/document/d/1sNIWvjIj3xn6O9Oq9kf0h2UKGXGKcY7OkUV-svBRsrw/edit#heading=h.4thk2iean508

I was able to reproduce the errors here using the same playbook. After applying the patch linked below, the cluster came back healthy after an overcloud reboot using the restart playbook.

Patch: https://review.gerrithub.io/439266
Results and cluster health checks after applying the patch: http://pastebin.test.redhat.com/691818

Marius, can you re-test the automation reproducer of this bug with:

IR_GERRIT_CHANGE: 439266

or, the manual version:
- return the pacemaker cluster to full health with 'pcs cluster stop --all && pcs cluster start --all' (on one controller)
- test the ir-reboot playbook with the suggested patch (gerrit change 439266)?

Thanks

*** Bug 1661806 has been marked as a duplicate of this bug. ***

According to our records, this should be resolved by puppet-tripleo-9.4.1-0.20190508182407.el7ost. This build is available now.

Verified via: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/view/rolling_oc_reboot/job/DFG-pidone-rolling_oc_reboot-14_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-workload_before_after/39/

Check that the fix is present:

journal.log:Aug 11 19:57:19 undercloud-0.redhat.local yum[4469]: Installed: puppet-tripleo-9.4.1-0.20190508182407.el7ost.noarch
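For reference, the manual re-test and the final verification described in the comments can be sketched as shell steps. This is a hedged outline only: it assumes root access on one overcloud controller for the pcs commands and access to the updated node for the package check, and the package NVR is simply the one quoted in the comments above:

```bash
# On one overcloud controller: return the pacemaker cluster to full health
# before re-running the reboot playbook.
pcs cluster stop --all && pcs cluster start --all

# Give resources time to settle, then make sure nothing is failed.
pcs status

# After updating, confirm the build carrying the fix is installed
# (NVR taken from the comments above).
rpm -q puppet-tripleo
# expected: puppet-tripleo-9.4.1-0.20190508182407.el7ost (or newer)
```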