Description of problem:
Unable to boot instances after rebooting the overcloud nodes:
"ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message" in nova-scheduler.log

Version-Release number of selected component (if applicable):
2018-10-10.3 puddle

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 controllers + 1 compute + 3 ceph nodes
2. Scale out with 1 compute node
3. Reboot the overcloud nodes
4. Remove the compute node added in step 2
5. Reboot the overcloud nodes
6. Launch an instance

Actual results:
Instance ends up in ERROR state.

Expected results:
Instance is successfully launched.

Additional info:
Attaching sosreports.
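For reference, step 6 amounts to something like the following (image, flavor and network names are placeholders, not the ones used in the actual run):

  openstack server create --image cirros --flavor m1.small --network private test-instance
  openstack server show test-instance -f value -c status
  (reports ERROR when the bug is hit)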
The nodes are rebooted via this playbook: https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml
After running 'pcs resource restart rabbitmq-bundle' I was able to successfully launch the instance.
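For reference, the workaround is roughly the following, run on any controller (the pcs status check afterwards just confirms the bundle came back healthy):

  pcs resource restart rabbitmq-bundle
  pcs status | grep -A 4 rabbitmq-bundle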
Our best guess is that this is a rabbit issue. If not, possibly oslo.messaging.
(In reply to Matthew Booth from comment #4)
> Our best guess is that this is a rabbit issue. If not, possibly
> oslo.messaging.

More likely oslo.messaging. I would be very surprised if rabbitmq delivered the same message more than once. However, it would not surprise me if oslo.messaging went wrong with either (1) retrying on error in publish, or (2) mishandling of message acknowledgement.
Actually, this looks like the same thing as bug 1592528. Doing a rolling restart of rabbitmq causes publishes to go unconfirmed, which means the sender errors out on publish and republishes the message indefinitely, producing the duplicate messages seen in the log. I am going to leave this one open since it is filed against OSP14 and the other is OSP13, but I'll continue the investigation on bug 1592528 since there has already been substantial work detailed there.
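The symptom is easy to confirm on the controllers; assuming the usual containerized log location for this release (adjust the path if your layout differs):

  grep DuplicateMessageError /var/log/containers/nova/nova-scheduler.log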
(In reply to John Eckersberg from comment #7)
> (In reply to Matthew Booth from comment #4)
> > Our best guess is that this is a rabbit issue. If not, possibly
> > oslo.messaging.
>
> Probably more oslo.messaging. I would be very surprised if rabbitmq
> delivered the same message more than once. However oslo.messaging would not

Duplication of messages is entirely possible in corner cases such as network partitions, and rebooting the cluster one node at a time is a corner case as well. See some testing results [0] I did back in the day. They are a little rusty by now in terms of the versions used, but I think they still apply to rabbit HA queues. The comparative table shows a lot of duplicated messages popping up after network partitions for mirrored HA queues under different configurations (split into the 1*-2*-3* layouts).

[0] https://docs.google.com/document/d/1sNIWvjIj3xn6O9Oq9kf0h2UKGXGKcY7OkUV-svBRsrw/edit#heading=h.4thk2iean508

> surprise me if something went wrong with either (1) retrying on error in
> publish, or (2) mishandling of message acknowledgement.
I was able to reproduce the errors here using the same playbook. After applying the patch linked below, the cluster came back healthy following an overcloud reboot with the restart playbook.

Patch: https://review.gerrithub.io/439266
Results and cluster health checks after applying the patch: http://pastebin.test.redhat.com/691818

Marius, can you re-test the automation reproducer of this bug with:

  IR_GERRIT_CHANGE: 439266

or the manual version (see the sketch below):
- return the pacemaker cluster to full health with: pcs cluster stop --all && pcs cluster start --all (on one controller)
- test the ir-reboot playbook with the suggested patch (gerrit change 439266)?

Thanks
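The manual version comes down to roughly this, run from a single controller, with pcs status as the health check afterwards:

  pcs cluster stop --all
  pcs cluster start --all
  pcs status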
*** Bug 1661806 has been marked as a duplicate of this bug. ***
According to our records, this should be resolved by puppet-tripleo-9.4.1-0.20190508182407.el7ost. This build is available now.
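A quick way to check whether a node already carries that build (same package name as above):

  rpm -q puppet-tripleo
  (expect puppet-tripleo-9.4.1-0.20190508182407.el7ost or newer)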
Verified via:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/view/rolling_oc_reboot/job/DFG-pidone-rolling_oc_reboot-14_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-workload_before_after/39/

Check that the fix is present:
journal.log:Aug 11 19:57:19 undercloud-0.redhat.local yum[4469]: Installed: puppet-tripleo-9.4.1-0.20190508182407.el7ost.noarch