Bug 1640804 - [OSP14] DuplicateMessageError after rolling restart of RabbitMQ
Summary: [OSP14] DuplicateMessageError after rolling restart of RabbitMQ
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: zstream
Target Release: 14.0 (Rocky)
Assignee: RHOS Maint
QA Contact: nlevinki
URL:
Whiteboard:
Duplicates: 1661806
Depends On: 1592528
Blocks: 1721054
 
Reported: 2018-10-18 19:25 UTC by Marius Cornea
Modified: 2019-12-29 17:41 UTC
CC List: 22 users

Fixed In Version: puppet-tripleo-9.4.1-0.20190508182406.el7ost
Doc Type: Known Issue
Doc Text:
When you restart all three controller nodes, it might not be possible to launch tenant instances in the overcloud. A "DuplicateMessageError" message is logged in the overcloud logs. As a workaround, on one of the overcloud controllers, run this command: pcs resource restart rabbitmq-bundle
Clone Of:
Clones: 1721054
Environment:
Last Closed: 2019-08-13 10:44:31 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Gerrithub.io 439266 0 None None None 2019-06-10 10:56:13 UTC
OpenStack gerrit 666153 0 None MERGED RabbitMQ: always allow promotion on HA queue during failover 2021-02-03 07:11:19 UTC

Description Marius Cornea 2018-10-18 19:25:45 UTC
Description of problem:
Unable to boot instances after rebooting the overcloud nodes; nova-scheduler.log shows: ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message
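
A quick way to confirm the symptom on a controller (the log path assumes a containerized OSP 14 deployment):

  grep DuplicateMessageError /var/log/containers/nova/nova-scheduler.log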


Version-Release number of selected component (if applicable):
2018-10-10.3 puddle

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with 3 controllers + 1 compute + 3 ceph nodes
2. Scale out with 1 compute node
3. Reboot the overcloud nodes
4. Remove the compute node added in step 2
5. Reboot the overcloud nodes
6. Launch an instance
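
A minimal sketch of step 6; the image, flavor, and network names are placeholders for whatever exists in the environment:

  openstack server create --image cirros --flavor m1.small --network private test-vm
  openstack server show test-vm -c status -f value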

Actual results:
Instance ends up in ERROR state.

Expected results:
Instance successfully launched.

Additional info:
Attaching sosreports.

Comment 1 Marius Cornea 2018-10-18 19:28:40 UTC
The nodes are rebooted via this playbook:
https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml

Comment 3 Marius Cornea 2018-10-18 20:48:47 UTC
After running 'pcs resource restart rabbitmq-bundle' I was able to successfully launch the instance.
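
For reference, the full workaround sequence on any one controller (the pcs status call is just a sanity check afterwards):

  pcs resource restart rabbitmq-bundle
  pcs status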

Comment 4 Matthew Booth 2018-10-24 12:08:25 UTC
Our best guess is that this is a rabbit issue. If not, possibly oslo.messaging.

Comment 7 John Eckersberg 2018-10-31 14:48:08 UTC
(In reply to Matthew Booth from comment #4)
> Our best guess is that this is a rabbit issue. If not, possibly
> oslo.messaging.

Probably more oslo.messaging.  I would be very surprised if rabbitmq delivered the same message more than once.  However oslo.messaging would not surprise me if something went wrong with either (1) retrying on error in publish, or (2) mishandling of message acknowledgement.
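
One hedged way to look for the second failure mode from a controller is to check whether unacknowledged messages pile up on the queues (assuming rabbitmqctl is reachable, e.g. from inside the rabbitmq container):

  rabbitmqctl list_queues name messages messages_unacknowledged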

Comment 8 John Eckersberg 2018-11-12 20:41:59 UTC
Actually, this looks like the same thing as bug 1592528.  Doing a rolling restart of rabbitmq causes publishes to not be confirmed, which means the sender errors out on publish and republishes the message indefinitely, causing duplicate messages in the log.

I am going to leave this one open since it is filed against OSP14 and the other is OSP13, but I'll continue the investigation on bug 1592528 since there has already been substantial work detailed there.

Comment 12 Bogdan Dobrelya 2018-11-23 10:26:47 UTC
(In reply to John Eckersberg from comment #7)
> (In reply to Matthew Booth from comment #4)
> > Our best guess is that this is a rabbit issue. If not, possibly
> > oslo.messaging.
> 
> Probably more oslo.messaging.  I would be very surprised if rabbitmq
> delivered the same message more than once.  However oslo.messaging would not

Duplication of messages is entirely possible in corner cases such as network partitions, and rebooting the cluster node by node is a corner case as well. See some testing results [0] I did back in the day. The versions used are a bit dated by now, but I think the findings still apply to RabbitMQ HA queues.

The comparative table shows a lot of duplicated messages popping up after network partitions for mirrored HA queues under different configurations (split into the 1*, 2*, and 3* layouts).

[0] https://docs.google.com/document/d/1sNIWvjIj3xn6O9Oq9kf0h2UKGXGKcY7OkUV-svBRsrw/edit#heading=h.4thk2iean508
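
For completeness, the mirror policy and cluster membership that these results depend on can be inspected from one controller; a minimal sketch, again assuming rabbitmqctl is reachable:

  rabbitmqctl cluster_status
  rabbitmqctl list_policies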

> surprise me if something went wrong with either (1) retrying on error in
> publish, or (2) mishandling of message acknowledgement.

Comment 13 pkomarov 2019-01-08 09:09:54 UTC

I was able to reproduce the errors here using the same playbook. 

After applying the patch below, the cluster came back healthy after the overcloud reboot using the restart playbook.
Patch: https://review.gerrithub.io/439266

Results and cluster health checks after applying the patch:
http://pastebin.test.redhat.com/691818

Marius, can you:
Re-test the automation reproducer of this bug with:
IR_GERRIT_CHANGE: 439266

or, the manual version:
- return the pacemaker cluster to full health with: pcs cluster stop --all && pcs cluster start --all (on one controller)
- test the ir-reboot playbook with the suggested patch (gerrit change 439266)?


Thanks

Comment 17 Peter Lemenkov 2019-06-10 10:54:09 UTC
*** Bug 1661806 has been marked as a duplicate of this bug. ***

Comment 18 Lon Hohberger 2019-08-09 10:42:18 UTC
According to our records, this should be resolved by puppet-tripleo-9.4.1-0.20190508182407.el7ost.  This build is available now.

Comment 19 pkomarov 2019-08-12 11:56:35 UTC
Verified via:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/view/rolling_oc_reboot/job/DFG-pidone-rolling_oc_reboot-14_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-workload_before_after/39/

Check that the fix is present:
journal.log:Aug 11 19:57:19 undercloud-0.redhat.local yum[4469]: Installed: puppet-tripleo-9.4.1-0.20190508182407.el7ost.noarch
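
A more direct check than the journal entry (run on the node being verified) would be:

  rpm -q puppet-tripleo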

