Bug 1640804
Summary: | [OSP14] DuplicateMessageError after rolling restart of RabbitMQ | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea>
Component: | puppet-tripleo | Assignee: | RHOS Maint <rhos-maint>
Status: | CLOSED CURRENTRELEASE | QA Contact: | nlevinki <nlevinki>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 14.0 (Rocky) | CC: | abeekhof, aherr, apevec, bdobreli, chjones, dbecker, dciabrin, jeckersb, jjoyce, joflynn, jschluet, lhh, mbooth, mburns, mcornea, michele, morazi, pkomarov, plemenko, rbartal, slinaber, tvignaud
Target Milestone: | zstream | Keywords: | TestOnly, Triaged, ZStream
Target Release: | 14.0 (Rocky) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | puppet-tripleo-9.4.1-0.20190508182406.el7ost | Doc Type: | Known Issue
Doc Text: | When you restart all three controller nodes, it might not be possible to launch tenant instances in the overcloud. A "DuplicateMessageError" message is logged in the overcloud logs. As a workaround, on one of the overcloud controllers, run: pcs resource restart rabbitmq-bundle | |
Story Points: | --- | |
Clone Of: | | |
: | 1721054 (view as bug list) | Environment: |
Last Closed: | 2019-08-13 10:44:31 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1592528 | |
Bug Blocks: | 1721054 | |
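The Doc Text workaround amounts to a single pcs command on one controller. Below is a minimal sketch of how it might be applied and checked, assuming a standard TripleO HA deployment where RabbitMQ runs as the pacemaker bundle resource rabbitmq-bundle and where overcloud service logs live under /var/log/containers (both assumptions; the bug only names the pcs command itself):

```bash
# Run as root on any one overcloud controller.

# Confirm the symptom first: DuplicateMessageError in the overcloud logs
# (the log path is an assumption for a containerized OSP 14 deployment).
grep -r "DuplicateMessageError" /var/log/containers/ | tail

# Documented workaround: restart the whole RabbitMQ bundle across the cluster.
pcs resource restart rabbitmq-bundle

# Verify the bundle came back on all three controllers.
pcs status | grep -A 4 rabbitmq-bundle
```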
Description
Marius Cornea
2018-10-18 19:25:45 UTC
The nodes reboot is done via this playbook: https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml

After running 'pcs resource restart rabbitmq-bundle' I was able to successfully launch the instance.

Our best guess is that this is a rabbit issue. If not, possibly oslo.messaging.

(In reply to Matthew Booth from comment #4)
> Our best guess is that this is a rabbit issue. If not, possibly
> oslo.messaging.

Probably more oslo.messaging. I would be very surprised if rabbitmq delivered the same message more than once. However, oslo.messaging would not surprise me if something went wrong with either (1) retrying on error in publish, or (2) mishandling of message acknowledgement.

Actually, this looks like the same thing as bug 1592528. Doing a rolling restart of rabbitmq causes publishes to not be confirmed, which means the sender errors out on publish and republishes the message indefinitely, causing duplicate messages in the log. I am going to leave this one open since it is filed against OSP14 and the other is OSP13, but I'll continue the investigation on bug 1592528 since there has already been substantial work detailed there.

(In reply to John Eckersberg from comment #7)
> (In reply to Matthew Booth from comment #4)
> > Our best guess is that this is a rabbit issue. If not, possibly
> > oslo.messaging.
>
> Probably more oslo.messaging. I would be very surprised if rabbitmq
> delivered the same message more than once. However oslo.messaging would not
> surprise me if something went wrong with either (1) retrying on error in
> publish, or (2) mishandling of message acknowledgement.

Duplication of messages is totally possible in corner cases, like network partitions, and rebooting the cluster one node at a time is a corner case as well. See some testing results [0] I did back in the good old days. The versions used are a bit dated by now, but I think the results still apply for rabbit HA queues. The comparative table shows a lot of duplicated messages popping up after network partitions for mirrored HA queues and different configurations (split as 1*-2*-3* layouts).

[0] https://docs.google.com/document/d/1sNIWvjIj3xn6O9Oq9kf0h2UKGXGKcY7OkUV-svBRsrw/edit#heading=h.4thk2iean508

I was able to reproduce the errors here using the same playbook. After applying the patch linked below, the cluster came back healthy after an overcloud reboot using the restart playbook.

Patch: https://review.gerrithub.io/439266
Results and cluster health checks after applying the patch: http://pastebin.test.redhat.com/691818

Marius, can you re-test the automation reproducer of this bug with:

IR_GERRIT_CHANGE: 439266

or, the manual version:
- return the pacemaker cluster to full health with 'pcs cluster stop --all && pcs cluster start --all' (on one controller)
- test the ir-reboot playbook with the suggested patch (gerrit change 439266)?

Thanks

*** Bug 1661806 has been marked as a duplicate of this bug. ***

According to our records, this should be resolved by puppet-tripleo-9.4.1-0.20190508182407.el7ost. This build is available now.

Verified via: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/view/rolling_oc_reboot/job/DFG-pidone-rolling_oc_reboot-14_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-workload_before_after/39/

Check that the fix is present:

journal.log:Aug 11 19:57:19 undercloud-0.redhat.local yum[4469]: Installed: puppet-tripleo-9.4.1-0.20190508182407.el7ost.noarch
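For reference, the manual re-test and the final verification described in the comments can be sketched as shell steps. This is a hedged outline only: it assumes root access on one overcloud controller for the pcs commands and access to the updated node for the package check, and the package NVR is simply the one quoted in the comments above:

```bash
# On one overcloud controller: return the pacemaker cluster to full health
# before re-running the reboot playbook.
pcs cluster stop --all && pcs cluster start --all

# Give resources time to settle, then make sure nothing is failed.
pcs status

# After updating, confirm the build carrying the fix is installed
# (NVR taken from the comments above).
rpm -q puppet-tripleo
# expected: puppet-tripleo-9.4.1-0.20190508182407.el7ost (or newer)
```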