Bug 1592528 - [OSP13] DuplicateMessageError after rolling restart of RabbitMQ
Summary: [OSP13] DuplicateMessageError after rolling restart of RabbitMQ
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z8
Target Release: 13.0 (Queens)
Assignee: RHOS Maint
QA Contact: nlevinki
URL:
Whiteboard:
Duplicates: 1600373 1685030 1688649 1690955 1715620 1750692
Depends On: 1661806
Blocks: 1640804 1721054
 
Reported: 2018-06-18 18:56 UTC by Artem Hrechanychenko
Modified: 2023-03-24 14:11 UTC
CC List: 44 users

Fixed In Version: puppet-tripleo-8.4.1-21.el7ost
Doc Type: Release Note
Doc Text:
In rare circumstances, when you reboot a controller node several times, RabbitMQ can end up in an inconsistent state that blocks API operations on the overcloud. The symptoms of this issue are:
- Entries in any of the OpenStack service logs of the form: DuplicateMessageError: Found duplicate message(629ff0024219488499b0fac0cacaa3a5). Skipping it.
- "openstack network agent list" reports some agents as DOWN
To restore normal operation, run the following command on any one of the controller nodes: pcs resource restart rabbitmq-bundle
Clone Of:
Environment:
Last Closed: 2019-09-03 16:55:32 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Gerrithub.io 439266 0 None None None 2019-01-08 09:19:08 UTC
Launchpad 1823305 0 None None None 2019-06-13 09:27:27 UTC
OpenStack gerrit 666401 0 None MERGED RabbitMQ: always allow promotion on HA queue during failover 2021-02-15 05:22:43 UTC
Red Hat Issue Tracker OSP-23646 0 None None None 2023-03-24 14:11:24 UTC
Red Hat Knowledge Base (Solution) 3366271 0 None None Instance creation not working in OSP12 [rabbitmq errors] 2019-06-02 02:33:48 UTC
Red Hat Product Errata RHBA-2019:2624 0 None None None 2019-09-03 16:55:51 UTC

Description Artem Hrechanychenko 2018-06-18 18:56:04 UTC
Description of problem:

(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router r1
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| ca6318f7-2355-4209-a7db-a5a8bbcc9758 | controller-0.localdomain | True           | :-)   | active   |
| d29d7a1f-f6a5-4cd3-b934-f12eab99ddc7 | controller-2.localdomain | True           | :-)   | active   |
| bca76377-1c4a-436c-a1a0-b78ee3221482 | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+--------------------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks                             |
+--------------------------------------+--------------+--------+------------+-------------+--------------------------------------+
| fde448f5-5b41-4890-a2ab-03e3de46600d | after_deploy | ACTIVE | -          | Running     | tenantvxlan=192.168.32.9, 10.0.0.184 |
| 5621811d-270d-4287-bb7d-55ace71beede | after_reboot | ERROR  | -          | NOSTATE     |                                      |
+--------------------------------------+--------------+--------+------------+-------------+--------------------------------------+


| fault                                | {"message": "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 5621811d-270d-4287-bb7d-55ace71beede.", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 580, in build_instances |
|                                      |     raise exception.MaxRetriesExceeded(reason=msg)                                                                                                                                                                                                                              

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Artem Hrechanychenko 2018-06-18 19:23:40 UTC
Steps to Reproduce:

1) deploy osp13 using GA puddle with 3contr+3 ceph and 1 comp + vxlan
2) launch instance in OC
3) scale up compute node
4) reboot OC 
5) start after_deploy instance and deploy new one. check external connectivity


Actual results:
Cannot ping the after_deploy instance; the new after_reboot instance fails to deploy:

                                                                                                                   |
| fault                                | {"message": "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 8f4a7773-c500-427d-980d-81ecf47ffa1a.", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 580, in build_instances |
|                                      |     raise exception.MaxRetriesExceeded(reason=msg)     

Version-Release number of selected component (if applicable):
OSP13 2018-06-15.2 puddle

Comment 3 Artem Hrechanychenko 2018-06-18 19:25:41 UTC
Sosreport - http://rhos-release.virt.bos.redhat.com/log/bz1592528/

Comment 4 Brent Eagles 2018-06-18 20:43:08 UTC
The system looks to be in a very bad state. The oslo.messaging layer seems to be broken for several processes:

from nova_scheduler.log on controller-0: 

2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 368, in _callback
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit     self.callback(RabbitMessage(message))
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in __call__
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit     unique_id = self.msg_id_cache.check_duplicate_message(message)
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py", line 121, in check_duplicate_message
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit     raise rpc_common.DuplicateMessageError(msg_id=msg_id)
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(629ff0024219488499b0fac0cacaa3a5). Skipping it.
2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit 
2018-06-18 20:15:54.728 1 DEBUG oslo_service.periodic_task [req-888db3cb-9c73-40bb-8d28-266d29360029 6c6d9f8e3ebc4d64bc37d5ea53f30c18 feddecd5125c4c22ba4a11e802cc49f0 - default default] Running periodic task SchedulerManager._run_periodic_tasks run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-06-18 20:15:59.474 1 ERROR oslo.messaging._drivers.impl_rabbit [req-888db3cb-9c73-40bb-8d28-266d29360029 6c6d9f8e3ebc4d64bc37d5ea53f30c18 feddecd5125c4c22ba4a11e802cc49f0 - default default] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(5897c0d1d18c4141a321e4826d6b93ab). Skipping it.

DuplicateMessageError messages can also be found in cinder-scheduler.log, neutron/server.log, and neutron/openvswitch-agent.log on all of the controllers.

Comment 5 Eran Kuris 2018-06-19 07:07:58 UTC
I think it's related to BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1589684

Comment 6 Damien Ciabrini 2018-06-19 10:36:00 UTC
(In reply to Eran Kuris from comment #5)
> I think its related to BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1589684

If I understand correctly https://bugzilla.redhat.com/show_bug.cgi?id=1589684#c8, one side effect would be to see two keepalived instances in MASTER state among the 3 controllers, which seems to be what I see on this env:

(undercloud) [stack@undercloud-0 ~]$ for i in 192.168.24.19 192.168.24.10 192.168.24.13; do echo $i; ssh heat-admin@$i 'sudo docker logs $(sudo docker ps -q --filter name=neutron-keepalived) 2>&1 | grep -e Transition -e Enter';done
192.168.24.19
VRRP_Instance(VR_1) Entering BACKUP STATE
VRRP_Instance(VR_1) Entering BACKUP STATE
VRRP_Instance(VR_1) Transition to MASTER STATE
VRRP_Instance(VR_1) Entering MASTER STATE
VRRP_Instance(VR_1) Transition to MASTER STATE
192.168.24.10
VRRP_Instance(VR_1) Entering BACKUP STATE
192.168.24.13
VRRP_Instance(VR_1) Entering BACKUP STATE
VRRP_Instance(VR_1) Transition to MASTER STATE
VRRP_Instance(VR_1) Entering MASTER STATE
VRRP_Instance(VR_1) Transition to MASTER STATE


We're still investigating the oslo.messaging error due to duplicate message, as it didn't seem to show up on https://bugzilla.redhat.com/show_bug.cgi?id=1589684

Comment 8 Peter Lemenkov 2018-06-19 14:01:09 UTC
(In reply to Brent Eagles from comment #4)
> The system looks to be in very bad state. The oslo messaging layer seems to
> be broken for several processes:
> 
> from nova_scheduler.log on controller-0: 
> 
> 2018-06-18 20:15:51.699 1 ERROR oslo.messaging._drivers.impl_rabbit
...
> DuplicateMessageError: Found duplicate
> message(629ff0024219488499b0fac0cacaa3a5). Skipping it.

It's not necessarily broken just because it receives duplicate messages from RabbitMQ, and neither is RabbitMQ itself.

In the case of a network failure, RabbitMQ will retransmit all the messages for which it has not yet received an acknowledgement. This is documented behaviour, and the client software (oslo.messaging in our case) is expected to handle deduplication on its own level (which it actually did here). This is certainly an issue, but a recoverable one.

The underlying problem is a network glitch, or something else that disrupted communication between RabbitMQ and oslo.messaging.
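
For reference, the deduplication described above happens on the consumer side in oslo.messaging: every message carries a _unique_id, and a small cache of recently seen IDs is used to detect broker redeliveries. A rough sketch of that logic (an illustrative paraphrase, not the exact upstream code):

import collections

class DuplicateMessageError(Exception):
    def __init__(self, msg_id):
        super(DuplicateMessageError, self).__init__(
            "Found duplicate message(%s). Skipping it." % msg_id)

class MsgIdCache(object):
    """Remember the last few unique message IDs to detect redeliveries."""
    def __init__(self, size=16):
        self.prev_msgids = collections.deque([], maxlen=size)

    def check_duplicate_message(self, message):
        unique_id = message.get('_unique_id')
        if unique_id is not None:
            if unique_id in self.prev_msgids:
                # this is what surfaces as DuplicateMessageError in the logs
                raise DuplicateMessageError(msg_id=unique_id)
            self.prev_msgids.append(unique_id)
        return unique_id

A redelivered message is therefore detected and skipped rather than dispatched twice, which is why this on its own is recoverable.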

Comment 9 Peter Lemenkov 2018-06-19 14:03:09 UTC
(In reply to Damien Ciabrini from comment #6)
> We're still investigating the oslo.messaging error due to duplicate message,
> as it didn't seem to show up on
> https://bugzilla.redhat.com/show_bug.cgi?id=1589684

Just a note: as I said in my previous comment #8, duplication of messages is expected in some cases. This shouldn't affect oslo.messaging though; oslo.messaging must handle it and recover (skip the duplicated messages).

Comment 13 John Eckersberg 2018-06-19 20:46:02 UTC
Some observations looking at the controller logs...

Only nova-scheduler is having this issue across the three controllers:

[root@controller-0 nova]# grep -c DuplicateMessage *
nova-api.log:0
nova-api.log.1:0
nova-api.log.2:0
nova-api.log.3:0
nova-api.log.4:0
nova-api.log.5:0
nova-api.log.6:0
nova-api-metadata.log:0
nova-api-metadata.log.1:0
nova-api-metadata.log.2:0
nova-api-metadata.log.3:0
nova-conductor.log:0
nova-consoleauth.log:0
nova-manage.log:0
nova-novncproxy.log:0
nova-placement-api.log:0
nova-scheduler.log:2595

[root@controller-1 nova]# grep -c DuplicateMessage *
nova-api.log:0
nova-api.log.1:0
nova-api.log.2:0
nova-api.log.3:0
nova-api.log.4:0
nova-api.log.5:0
nova-api.log.6:0
nova-api-metadata.log:0
nova-api-metadata.log.1:0
nova-api-metadata.log.2:0
nova-api-metadata.log.3:0
nova-conductor.log:0
nova-consoleauth.log:0
nova-novncproxy.log:0
nova-placement-api.log:0
nova-scheduler.log:2601

[root@controller-2 nova]# grep -c DuplicateMessage *
nova-api.log:0
nova-api.log.1:0
nova-api.log.2:0
nova-api-metadata.log:0
nova-api-metadata.log.1:0
nova-conductor.log:0
nova-consoleauth.log:0
nova-novncproxy.log:0
nova-placement-api.log:0
nova-scheduler.log:2547

Pulling out the duplicate message IDs, we see that the same ones are repeating across all three controllers:

[root@controller-0 nova]# grep 'impl_rabbit DuplicateMessageError' nova-scheduler.log | sed -r 's/[^(]+\(([^)]+)\).*/\1/' | sort | uniq -c
    435 2a46bc14ced24125aa1975d87f734d1f
    430 d84511cd00ec4f27902b1b7a0f4d0dcc

[root@controller-1 nova]# grep 'impl_rabbit DuplicateMessageError' nova-scheduler.log | sed -r 's/[^(]+\(([^)]+)\).*/\1/' | sort | uniq -c
      1 0256618d724f4f6f9ebc1291c24f1891
    436 2a46bc14ced24125aa1975d87f734d1f
    430 d84511cd00ec4f27902b1b7a0f4d0dcc

[root@controller-2 nova]# grep 'impl_rabbit DuplicateMessageError' nova-scheduler.log | sed -r 's/[^(]+\(([^)]+)\).*/\1/' | sort | uniq -c
      1 0256618d724f4f6f9ebc1291c24f1891
    427 2a46bc14ced24125aa1975d87f734d1f
    421 d84511cd00ec4f27902b1b7a0f4d0dcc

This makes me think something is broken in the publisher, such that it keeps sending the same message repeatedly. Also, I don't see anything in the logs matching 'acknowledge' or 'basic_ack', which I would expect if the messages were being requeued.

Comment 14 Brent Eagles 2018-06-20 12:18:39 UTC
I provoked a failover by rebooting one of the systems and things didn't resolve; the duplicate active router just moved from one node to another. I noticed that pinging the HA ports (ha-<id>) between qrouter namespaces only worked between two of the nodes. We need to find out why this connectivity broke.

Comment 17 Brian Haley 2018-06-21 19:47:39 UTC
So looking at a setup I was pointed at on sealusa8.mobius.lab.eng.rdu2.redhat.com I can see there are some differences in the OVS flows that are basically leaving one of the controllers isolated from the others.

For below, this is the IP map:
  172.17.2.10 ceph-0
  172.17.2.12 controller-2
  172.17.2.14 controller-0
  172.17.2.16 controller-1

On controller-2, there are four ports on br-tun:

        Port "vxlan-ac11020e"
            Interface "vxlan-ac11020e"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.12", out_key=flow, remote_ip="172.17.2.14"}
        Port "vxlan-ac11020a"
            Interface "vxlan-ac11020a"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.12", out_key=flow, remote_ip="172.17.2.10"}
        Port "vxlan-ac11020d"
            Interface "vxlan-ac11020d"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.12", out_key=flow, remote_ip="172.17.2.13"}
        Port "vxlan-ac110210"
            Interface "vxlan-ac110210"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.12", out_key=flow, remote_ip="172.17.2.16"}


And there are four flows for these in the flow table:

 cookie=0x45e7b7601c22f520, duration=7364.503s, table=0, n_packets=0, n_bytes=0, priority=1,in_port="vxlan-ac11020d" actions=resubmit(,4)
 cookie=0x45e7b7601c22f520, duration=26.322s, table=0, n_packets=171, n_bytes=9434, priority=1,in_port="vxlan-ac11020e" actions=resubmit(,4)
 cookie=0x45e7b7601c22f520, duration=19.139s, table=0, n_packets=0, n_bytes=0, priority=1,in_port="vxlan-ac11020a" actions=resubmit(,4)
 cookie=0x45e7b7601c22f520, duration=9.625s, table=0, n_packets=98, n_bytes=5292, priority=1,in_port="vxlan-ac110210" actions=resubmit(,4)


However, on controller-0 (and controller-1), there are also four ports on br-tun:

        Port "vxlan-ac11020d"
            Interface "vxlan-ac11020d"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.16", out_key=flow, remote_ip="172.17.2.13"}
        Port "vxlan-ac11020c"
            Interface "vxlan-ac11020c"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.16", out_key=flow, remote_ip="172.17.2.12"}
        Port "vxlan-ac11020e"
            Interface "vxlan-ac11020e"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.16", out_key=flow, remote_ip="172.17.2.14"}
        Port "vxlan-ac11020a"
            Interface "vxlan-ac11020a"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.16", out_key=flow, remote_ip="172.17.2.10"}

But there are only two flows instead of four:

 cookie=0xb0781d0a6ee5c5a2, duration=6.886s, table=0, n_packets=0, n_bytes=0, priority=1,in_port="vxlan-ac11020a" actions=resubmit(,4)
 cookie=0xb0781d0a6ee5c5a2, duration=4.990s, table=0, n_packets=40, n_bytes=2799, priority=1,in_port="vxlan-ac11020e" actions=resubmit(,4)

So it's no surprise that one controller is isolated, as its packets are being dropped on input.

At this point I don't know what's broken; I'll have to look at the logs further. I just wanted to get the info out.

Comment 18 Brent Eagles 2018-06-22 01:52:34 UTC
The general state of things seems to be pretty messed up. Neutron's agents are showing as not alive in 'neutron agent-list', beam.smp is running at over 100% on controllers 1 and 2, and the volume of neutron's server.log is quite high. The load might be expected if neutron is trying to recover, but we need to sort out how it got into this state.

Comment 19 Brent Eagles 2018-06-22 01:58:49 UTC
It was also discovered that the neutron packages in the container images were a bit out of date on this system. It's not clear whether this is contributing to the problem we are seeing but it may be worthwhile retesting with the latest packages.

Comment 20 John Eckersberg 2018-06-22 02:12:27 UTC
There's definitely something wrong with rabbitmq: I can't get any of the nodes to give me a list of consumers. It starts to display a few and then hangs. Something is probably wedged somewhere, but the stuck-processes diagnostic is not showing anything out of the ordinary, so it's not obvious what is wrong.

Comment 21 Bogdan Dobrelya 2018-06-22 09:16:00 UTC
In the attached sosreports, after a 2nd reboot series (events logged after Jun 21 17:21:31), AMQP connectivity via the internal API network seems to have never fully recovered, at least for neutron server. Multiple "AMQP server on controller-[012].internalapi.localdomain:5672 is unreachable: timed out" messages were logged for quite a while, more than 2h, for all controllers. More details below.

At first, right after the 2nd reboot, the heat, nova, aodh, ceilometer and neutron services were experiencing AMQP connection issues, see http://pastebin.test.redhat.com/607749 .

Ten minutes later, the recovery of mysql and rabbitmq had still not completed, see http://pastebin.test.redhat.com/607753 . 

This ended up with the neutron servers on the controllers never (I assume 2h == never) managing to recover their AMQP connections, see http://pastebin.test.redhat.com/607756

To me, this looks like an internal API / control plane disruption has been confirmed. I can see no rabbitmq clustering-related errors, though. Why could the neutron servers never connect to AMQP? Was this perhaps caused by the lost OVS flows issue?

Comment 22 Damien Ciabrini 2018-06-22 09:36:58 UTC
(In reply to Bogdan Dobrelya from comment #21)
> In the attached sosreports, after a 2nd reboot serie (logged events after
> Jun 21 17:21:31), AMQP connectivity via internal API network seem never get
> recovered fully, at least for neutron server. There is multiple "AMQP server
> on controller-[012].internalapi.localdomain:5672 is unreachable: timed out"
> had been logged for quite a while, more than 2h, for all controllers. More
> details below.
> 
> At first, right after a 2nd reboot, heat, nova, aodh ceilometer and neutron
> services were experiencing AMQP connection issues, see
> http://pastebin.test.redhat.com/607749 .
> 
> A 10 minutes later, the recovery for mysql and rabbitmq still has yet been
> completed, see http://pastebin.test.redhat.com/607753 . 
> 
> This though ended up with the neutron servers on controllers had never (I
> assume 2h == never) managed to recover its AMQP connections, see
> http://pastebin.test.redhat.com/607756
> 
> To me, this looks like internal api / control plane disruption had been
> confirmed. I can see no rabbitmq clustering related errors though. Why
> neutron servers could never connect AMQP? Was this caused by OVS lost flows
> issues perhaps?

Hmm, I'm not convinced it's a disruption of the entire internal API, because I don't see any galera disconnection during that timeframe in /var/log/containers/mysql/mysqld.log. It would show 'Turning message relay off' for a long period of time.

Likewise, I don't see any warning from corosync, which is very sensitive to temporary network disruption.

Comment 23 Bogdan Dobrelya 2018-06-22 09:38:57 UTC
I forgot to post mysql connectivity failures I mentioned were still happening 10 min later http://pastebin.test.redhat.com/607766

Comment 24 Damien Ciabrini 2018-06-22 10:08:43 UTC
(In reply to Bogdan Dobrelya from comment #23)
> I forgot to post mysql connectivity failures I mentioned were still
> happening 10 min later http://pastebin.test.redhat.com/607766

Those are rare (12 occurrences total, around the second reboot) and most probably indicate that a connection from the DB connection pool was targeting a mysql server on a node that had restarted in the meantime.

Comment 25 Damien Ciabrini 2018-06-22 11:04:18 UTC
(In reply to Bogdan Dobrelya from comment #21)
> 
> This though ended up with the neutron servers on controllers had never (I
> assume 2h == never) managed to recover its AMQP connections, see
> http://pastebin.test.redhat.com/607756
> 

Also, thanks for the hint. Something indeed seems strange with the neutron_api container: I see that the neutron_server processes on e.g. controller-0 are having issues connecting to any rabbit node:


()[neutron@controller-0 /]$ tail -f /var/log/neutron/server.log | grep 'unreachable'
2018-06-22 11:03:03.083 28 ERROR oslo.messaging._drivers.impl_rabbit [req-a5f9b9fe-9575-4318-8d10-d4d1362a0c3b - - - - -] [e4dc830a-9b2a-48e4-baed-514d5d7f4c9d] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-06-22 11:03:07.297 29 ERROR oslo.messaging._drivers.impl_rabbit [req-35f8969a-38cb-4865-bbaf-7ec892c7ff04 - - - - -] [571ebd19-1879-4003-aaf7-db34515ecf80] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-06-22 11:03:07.466 29 ERROR oslo.messaging._drivers.impl_rabbit [req-0d77d0e3-5d9e-4e12-9db4-baa031a8ebf4 - - - - -] [e44bbe69-97c6-4038-947d-34de1029ca5d] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-06-22 11:03:07.749 27 ERROR oslo.messaging._drivers.impl_rabbit [req-a5f9b9fe-9575-4318-8d10-d4d1362a0c3b - - - - -] [aa6d3b50-7f7c-40ea-b3a1-b9f5d606016d] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-06-22 11:03:09.571 28 ERROR oslo.messaging._drivers.impl_rabbit [req-3efa69b7-0eb1-4468-934b-11102342600d - - - - -] [85864e61-d4b1-4471-8362-0b62b558b92d] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-06-22 11:03:10.338 27 ERROR oslo.messaging._drivers.impl_rabbit [req-0d77d0e3-5d9e-4e12-9db4-baa031a8ebf4 - - - - -] [b7039951-0c6a-4097-a078-35cc52e2239a] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out


Whereas if I log on controller-2 in the rabbitmq container, I can clearly see that rabbit is still accepting connections:

()[root@controller-2 rabbitmq]# tail -f /var/log/rabbitmq/rabbit\@controller-2.log.1 | grep 'accepting AMQP connection'                                                                                     
accepting AMQP connection <0.28610.3858> (172.17.1.13:52798 -> 172.17.1.23:5672)
accepting AMQP connection <0.42.3859> (172.17.1.13:52812 -> 172.17.1.23:5672)
accepting AMQP connection <0.13115.3849> (172.17.1.13:52842 -> 172.17.1.23:5672)
accepting AMQP connection <0.25473.3822> (172.17.1.23:36344 -> 172.17.1.23:5672)
accepting AMQP connection <0.11194.3859> (172.17.1.13:52852 -> 172.17.1.23:5672)
accepting AMQP connection <0.30344.3859> (172.17.1.23:36424 -> 172.17.1.23:5672)
accepting AMQP connection <0.11820.3860> (172.17.1.13:53014 -> 172.17.1.23:5672)
accepting AMQP connection <0.9562.3861> (172.17.1.16:55472 -> 172.17.1.23:5672)
accepting AMQP connection <0.21688.3861> (172.17.1.16:55542 -> 172.17.1.23:5672)
accepting AMQP connection <0.29325.3861> (172.17.1.16:55578 -> 172.17.1.23:5672)
accepting AMQP connection <0.31864.3838> (172.17.1.23:36736 -> 172.17.1.23:5672)
accepting AMQP connection <0.18881.3790> (172.17.1.23:36748 -> 172.17.1.23:5672)
accepting AMQP connection <0.15995.3825> (172.17.1.23:36746 -> 172.17.1.23:5672)
accepting AMQP connection <0.531.3860> (172.17.1.23:36860 -> 172.17.1.23:5672)

Comment 26 Damien Ciabrini 2018-06-22 11:20:29 UTC
Also, this neutron_api container is behaving strangely:
[root@controller-0 ~]# docker exec -it neutron_api bash                                                                                                                                                     
()[neutron@controller-0 /]$ touch /tmp/foo
touch: cannot touch '/tmp/foo': No such file or directory
()[neutron@controller-0 /]$ exit
[root@controller-0 ~]# docker exec -it neutron_dhcp bash
()[neutron@controller-0 /]$ touch /tmp/foo
()[neutron@controller-0 /]$ exit
[root@controller-0 ~]# getenforce
Permissive


Seems like I can't even touch a file in that container, while other containers are just fine.

I've just checked on a working OSP13 deployment and creating such a file works without issue :/

Comment 27 Damien Ciabrini 2018-06-22 11:47:32 UTC
But anyway I can still create a file under /foo, so this is a red herring.

/foo contains a small script [1] to initiate an AMQP connection to rabbitmq, and when run manually from within the neutron_api container, the connection seems to succeed:

[root@controller-0 /]# /foo --server controller-2.internalapi.localdomain --username guest --password UF6FF3BVZNNFxBv3BEZaWmPYP
OK

So this definitely indicates that rabbit is OK, but something in the neutron_server processes running in that container is broken and prevents the service from connecting to the message queue properly.


[1] https://blog.sleeplessbeastie.eu/2017/07/10/how-to-check-connection-to-the-rabbitmq-message-broker/
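
For reference, a minimal probe along the same lines can be sketched with kombu (the AMQP client library oslo.messaging builds on). This is an illustrative sketch only, not the contents of the linked script; the broker URL and password below are placeholders:

#!/usr/bin/env python
import sys
from kombu import Connection

# placeholder broker URL; substitute the real host and password
url = 'amqp://guest:PASSWORD@controller-2.internalapi.localdomain:5672//'
try:
    with Connection(url, connect_timeout=5) as conn:
        conn.connect()  # TCP connect plus AMQP handshake
    print('OK')
except Exception as exc:
    print('FAIL: %s' % exc)
    sys.exit(1)

A probe like this only proves that the broker accepts new connections; it says nothing about the long-lived connections the service already holds, which is consistent with the observation above that rabbit itself is fine while the neutron_server processes still log "unreachable".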

Comment 28 Damien Ciabrini 2018-06-22 21:48:15 UTC
Quick updates from last tests done with John and Artem

We performed many successive controller reboots, always one after the other, and waited for services to come back up between each failover.

We used a 3-controller, 1-compute overcloud, with 4 different VMs running on the compute. 

After five or six consecutive failovers, we started failing over the controller nodes which hold neutron's HA router, to see if this would trigger a specific failure.

We could reproduce the bad error logs seen in comment #25 and started to see DuplicateMessage errors for many different services:
[root@controller-0 containers]# grep -rl DuplicateMessage /var/log/containers/*                                                                                                          
/var/log/containers/cinder/cinder-scheduler.log
/var/log/containers/neutron/server.log
/var/log/containers/neutron/openvswitch-agent.log
/var/log/containers/nova/nova-scheduler.log

At this time, rabbitmq was stuck when we listed consumers:
[root@controller-0 containers]# docker exec -it rabbitmq-bundle-docker-0 /bin/bash                                                                                                       
# rabbitmqctl list_consumers
[...]
q-plugin        <rabbit.30391.0> 1       true    0       []
q-plugin        <rabbit.30257.0> 1       true    0       []
q-plugin        <rabbit.30070.0> 1       true    0       []
q-reports-plugin_fanout_a8831563659f4b1988ffefb1976a6c08        <rabbit.30252.0> 3       true    0       []
engine_worker_fanout_88f19d633b6545979406c55ea7fbd87d   <rabbit.23618.0> 3       true    0       []
neutron-vo-QosPolicy-1.7.controller-2.localdomain       <rabbit.32494.0> 2       true    0       []
[stuck-here]

Interestingly, it's still processing messages, and consequently the neutron agents show as UP in openstack monitoring:

(undercloud) [stack@undercloud-0 ~]$ openstack network agent list
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
|  ID                                   | Agent Type         |  Host                                 | Availability Zone | Alive | State  | Binary                    |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
|  019c8636-0254-4a4a-bac9-7994500fd1e1 | Baremetal Node     |  663d6a51-b68a-492d-b7d1-ace7f40fc620 | None              | :-)   | UP     | ironic-neutron-agent      |
|  17021043-cc02-46ad-a011-463d144de7fd | Baremetal Node     |  dbe7e541-4f17-481a-883b-c9c9c6b8d0ff | None              | :-)   | UP     | ironic-neutron-agent      |
|  2879746e-c23a-456e-8759-b0424dbb92ab | Baremetal Node     |  dba4f3d1-3938-4ef8-ba55-9eb4dbd578bf | None              | :-)   | UP     | ironic-neutron-agent      |
|  682caeea-4ffe-4155-a312-23da449da8ac | DHCP agent         |  undercloud-0.redhat.local            | nova              | :-)   | UP     | neutron-dhcp-agent        |
|  85068bf5-0583-4ae7-9a1d-a78cb35bd43a | Baremetal Node     |  cf957300-d333-4212-b1a1-4881d82a6ae9 | None              | :-)   | UP     | ironic-neutron-agent      |
|  946f12e6-3b95-4ec6-a9e4-1e36e5621a8e | Baremetal Node     |  e2f6517b-011b-42d3-9ce2-b65915c15584 | None              | :-)   | UP     | ironic-neutron-agent      |
|  a5e2e7da-6512-468f-bd9b-d246ac71a040 | Baremetal Node     |  c05a1c42-1e26-46fc-8fc5-b3503b73ccb6 | None              | :-)   | UP     | ironic-neutron-agent      |
|  aceb7e51-b950-4b35-85fe-594a72cb3e81 | Baremetal Node     |  a0c16bbd-1dab-4e15-8bb3-fcaee4206c1c | None              | :-)   | UP     | ironic-neutron-agent      |
|  bb064dcf-657a-4fd1-8e4a-ceff89b140ad | Open vSwitch agent |  undercloud-0.redhat.local            | None              | :-)   | UP     | neutron-openvswitch-agent |
|  ca2f504c-853f-42aa-807c-08e7c0f97219 | Baremetal Node     |  3566f403-d710-4430-a60e-33035dbdf991 | None              | :-)   | UP     | ironic-neutron-agent      |
|  e9a6422e-5a3d-41a6-9fa0-d3e8f3705348 | Baremetal Node     |  98100fd4-d146-4b28-9ad3-3ee655f4208a | None              | :-)   | UP     | ironic-neutron-agent      |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+


Victor Stinner had a look earlier today with strace and it seems that the neutron-server process can still contact rabbitmq and exchange data correctly. It can start an RPC.
What this "timed out" message seems to indicate is that the caller in neutron-server did not get any response after 60s, and oslo.messaging marks that failure as a timeout.
It's not clear yet why this specific error log shows up as "server .* unreachable", nor exactly which message did not receive an answer.


Also, we did not reproduce the "split-brain" where multiple HA routers are active, probably because rabbitmq is still processing messages and OVS has all its ports configured on all controllers.



It's not clear yet what exactly causes rabbit to be overwhelmed with DuplicateMessageError logs and the neutron services to stop passing their healthchecks.

Next we'll try to narrow down the issue more by e.g.:
  . restarting only the servers which don't host the ha-router
  . running the same test on BM
  . gathering more info about the system's health between each step, to try to capture when rabbitmq starts misbehaving and correlate that with a specific event.

Comment 29 Brent Eagles 2018-06-22 22:03:14 UTC
I was working on sealusa8 along with Brian and Slaweq. Brian and Slaweq identified issues with some flow rules, but there were RPC issues with the agents at the time which probably caused that. John killed an errant rabbitmq process after which the agents were able to sync properly and the system eventually healed.

With the system in a working state, I ran through a few different restart scenarios:

- serial shutdown and restart each controller starting with controller 0
- restart whichever controller where the ha router was active
- shutdown two controllers leaving only one active, then restarting them
- random restarts

So far I've been unable to reproduce having two active routers. I started checking for DuplicateMessage errors for the last two scenarios but the DuplicateMessage errors did not appear. On sealusa8 at least, this does not appear to be 100% reproducible.

Comment 30 Brent Eagles 2018-06-25 02:23:11 UTC
I've reproduced this a few times on sealusa8. While it does not happen 100% of the time, it is frequent enough.

Do we have an SME who can evaluate whether we are hitting some variation of https://bugs.launchpad.net/oslo.messaging/+bug/1695746 ? Also, since DuplicateMessageErrors occur for the same message repeatedly, it seems like whatever process is required to clear it from the queue isn't happening. Perhaps there is a bug in the DuplicateMessageError handling code?

Comment 34 Victor Stinner 2018-06-25 14:23:13 UTC
According to Mehdi Abaakouk, Oslo Messaging doesn't send "ACK" messages on duplicate messages. He wrote a fix for that:
https://review.openstack.org/#/c/577824/

But he is not sure if Oslo Messaging sends ACK or not, maybe it's fine.

Anyway, if there is a bug there, fixing it would only address the side effect of a previous bug: a service should not send duplicated messages :-)

Comment 35 Victor Stinner 2018-06-25 14:38:23 UTC
Mehdi Abaakouk abandoned https://review.openstack.org/#/c/577824/

"I fact we reject them here: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L372"

Aha? In /var/log/containers/cinder/cinder-scheduler.log, I see the same message sent again and again every 2 minutes since 2018-06-24 21:49:49:

2018-06-24 21:49:49.883 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-24 21:51:50.898 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a
). Skipping it.
(...)
2018-06-25 14:21:16.082 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:23:17.097 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:23:17.097 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:25:18.113 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:25:18.113 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:27:19.127 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:27:19.127 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:29:20.141 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:29:20.141 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:31:21.158 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:31:21.158 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:33:22.174 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:33:22.174 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:35:23.188 1 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.
2018-06-25 14:35:23.188 1 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(3d9b5d0ad53044e98d986e2a434f4c7a). Skipping it.

Comment 36 Victor Stinner 2018-06-25 15:54:45 UTC
Something happened at 2018-06-25 11:01: since that time, the neutron-server services on the 3 controllers have started to log 11 "unreachable" errors per controller per minute. Example of logs:

    2018-06-25 14:58:38 Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 47128.
    2018-06-25 14:58:42 AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out

Example of timings, errors at:

15:43:13
15:43:18
15:43:35
15:43:38
15:43:41
15:43:42
15:43:46
15:43:47
15:43:49
15:43:51
15:43:53

Comment 37 Victor Stinner 2018-06-25 16:01:33 UTC
> Something happened at 2018-06-25 11:01: since this date, neutron-server services of the 3 controllers started to 11 "unreachable" errors per controller per minute.

Oh sorry, I only parsed recent logs with my tool. In fact, the first error is more likely near 2018-06-24 17:32, an event correlated with a restart of RabbitMQ on controller-0, for example.

Comment 38 Victor Stinner 2018-06-25 16:56:53 UTC
I ran strace on a neutron-server process of controller-0.

Below are the syscalls of an Oslo Messaging socket to controller-2.internalapi.localdomain (172.17.1.23). There is an RPC call to the method "tunnel_update" on the queue "q-agent-notifier-tunnel-update_fanout". It seems like the RPC call didn't get any reply within 60 seconds. After 60 seconds, I see recvfrom() getting a few bytes, just before Oslo Messaging closes the socket.

shutdown() is called on the socket 60 seconds after the last successful recvfrom(). close() is called on the socket 2 minutes after the shutdown.

---
16:28:02.295418 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 61
16:28:02.295494 fcntl(61, F_GETFL)      = 0x2 (flags O_RDWR)
16:28:02.295526 fcntl(61, F_SETFL, O_RDWR|O_NONBLOCK) = 0
16:28:02.295576 fcntl(61, F_GETFD)      = 0
16:28:02.295616 fcntl(61, F_SETFD, FD_CLOEXEC) = 0

16:28:02.295664 connect(61, {sa_family=AF_INET, sin_port=htons(5672), sin_addr=inet_addr("172.17.1.23")}, 16) = -1 EINPROGRESS (Operation now in progress)
16:28:02.295866 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0
16:28:02.296029 epoll_wait(11, [{EPOLLOUT, {u32=61, u64=6375011298798731325}}], 1023, 123) = 1
16:28:02.296120 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6eb00) = 0

16:28:02.296175 getsockopt(61, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
16:28:02.296236 connect(61, {sa_family=AF_INET, sin_port=htons(5672), sin_addr=inet_addr("172.17.1.23")}, 16) = 0

16:28:02.296308 setsockopt(61, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
16:28:02.296358 getsockopt(61, SOL_TCP, TCP_NODELAY, [0], [4]) = 0
16:28:02.296404 getsockopt(61, SOL_TCP, TCP_MAXSEG, [1448], [4]) = 0
16:28:02.296462 getsockopt(61, SOL_TCP, TCP_CORK, [0], [4]) = 0
16:28:02.296501 getsockopt(61, SOL_TCP, TCP_KEEPIDLE, [5], [4]) = 0
16:28:02.296556 getsockopt(61, SOL_TCP, TCP_KEEPINTVL, [1], [4]) = 0
16:28:02.296615 getsockopt(61, SOL_TCP, TCP_KEEPCNT, [5], [4]) = 0
16:28:02.296655 getsockopt(61, SOL_TCP, TCP_SYNCNT, [6], [4]) = 0
16:28:02.296720 getsockopt(61, SOL_TCP, TCP_LINGER2, [60], [4]) = 0
16:28:02.296763 getsockopt(61, SOL_TCP, TCP_DEFER_ACCEPT, [0], [4]) = 0
16:28:02.296825 getsockopt(61, SOL_TCP, TCP_WINDOW_CLAMP, [182232], [4]) = 0
16:28:02.296867 getsockopt(61, SOL_TCP, TCP_QUICKACK, [1], [4]) = 0
16:28:02.296908 getsockopt(61, SOL_TCP, TCP_USER_TIMEOUT, [0], [4]) = 0
16:28:02.296957 setsockopt(61, SOL_TCP, TCP_NODELAY, [1], 4) = 0
16:28:02.297018 setsockopt(61, SOL_TCP, TCP_MAXSEG, [1448], 4) = 0
16:28:02.297071 setsockopt(61, SOL_TCP, TCP_CORK, [0], 4) = 0
16:28:02.297107 setsockopt(61, SOL_TCP, TCP_KEEPIDLE, [60], 4) = 0
16:28:02.297142 setsockopt(61, SOL_TCP, TCP_KEEPINTVL, [10], 4) = 0
16:28:02.297195 setsockopt(61, SOL_TCP, TCP_KEEPCNT, [9], 4) = 0
16:28:02.297231 setsockopt(61, SOL_TCP, TCP_SYNCNT, [6], 4) = 0
16:28:02.297268 setsockopt(61, SOL_TCP, TCP_LINGER2, [60], 4) = 0
16:28:02.297304 setsockopt(61, SOL_TCP, TCP_DEFER_ACCEPT, [0], 4) = 0
16:28:02.297341 setsockopt(61, SOL_TCP, TCP_WINDOW_CLAMP, [182232], 4) = 0
16:28:02.297378 setsockopt(61, SOL_TCP, TCP_QUICKACK, [1], 4) = 0
16:28:02.297414 setsockopt(61, SOL_TCP, TCP_USER_TIMEOUT, [1000], 4) = 0

16:28:02.297463 sendto(61, "AMQP\1\1\0\t", 8, 0, NULL, 0) = 8
16:28:02.297624 recvfrom(61, 0x7f46ac01af84, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:28:02.297781 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0
16:28:02.297880 epoll_wait(11, [{EPOLLIN, {u32=61, u64=6375011298798731325}}], 1023, 121) = 1
16:28:02.298727 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6e2c0) = 0
16:28:02.298803 recvfrom(61, "\1\0\0\0\0\1\373", 7, 0, NULL, NULL) = 7
16:28:02.298870 recvfrom(61, "\0\n\0\n\0\t\0\0\1\326\fcapabilitiesF\0\0\0\307\22publisher_confirmst\1\32exchange_exchange_bindingst\1\nbasic.nackt\1\26consumer_cancel_notifyt\1\22connection.blockedt\1\23consumer_prioritiest\1\34authentication_failure_closet\1\20per_consumer_qost\1\17direct_reply_tot\1\fcluster_nameS\0\0\0\37rabbit\tcopyrightS\0\0\0.Copyright (C) 2007-2018 Pivotal Software, Inc.\vinformationS\0\0\0005Licensed under the MPL.  See http://www.rabbitmq.com/\10platformS\0\0\0\23Erlang/OTP 18.3.4.8\7productS\0\0\0\10RabbitMQ\7versionS\0\0\0\0063.6.15\0\0\0\16PLAIN AMQPLAIN\0\0\0\5en_US", 507, 0, NULL, NULL) = 507
16:28:02.298922 recvfrom(61, "\316", 1, 0, NULL, NULL) = 1

16:28:02.299266 sendto(61, "\1\0\0\0\0\1*\0\n\0\v\0\0\0\330\7productS\0\0\0\7py-amqp\17product_versionS\0\0\0\0052.1.4\17connection_nameS\0\0\0006neutron-server:28:c896f463-9114-4d8f-ae78-06fc79940b1b\fcapabilitiesF\0\0\0M\22connection.blockedt\1\34authentication_failure_closet\1\26consumer_cancel_notifyt\1\10AMQPLAIN\0\0\0007\5LOGINS\0\0\0\5guest\10PASSWORDS\0\0\0\31UF6FF3BVZNNFxBv3BEZaWmPYP\5en_US\316", 306, 0, NULL, 0) = 306
16:28:02.299377 recvfrom(61, 0x7f46ac01a744, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:28:02.299473 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0
16:28:02.299562 epoll_wait(11, [{EPOLLIN, {u32=61, u64=6375011298798731325}}], 1023, 120) = 1
16:28:02.299951 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6e2c0) = 0
16:28:02.300018 recvfrom(61, "\1\0\0\0\0\0\f", 7, 0, NULL, NULL) = 7
16:28:02.300095 recvfrom(61, "\0\n\0\36\0\0\0\2\0\0\0<", 12, 0, NULL, NULL) = 12
16:28:02.300138 recvfrom(61, "\316", 1, 0, NULL, NULL) = 1

16:28:02.300253 sendto(61, "\1\0\0\0\0\0\f\0\n\0\37\377\377\0\2\0\0\0<\316", 20, 0, NULL, 0) = 20
16:28:02.300393 sendto(61, "\1\0\0\0\0\0\10\0\n\0(\1/\0\0\316", 16, 0, NULL, 0) = 16
16:28:02.300491 recvfrom(61, 0x7f46ac01a744, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:28:02.300580 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0
16:28:02.300745 epoll_wait(11, [{EPOLLIN, {u32=61, u64=6375011298798731325}}], 1023, 118) = 1
16:28:02.300842 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6e2c0) = 0
16:28:02.300895 recvfrom(61, "\1\0\0\0\0\0\5", 7, 0, NULL, NULL) = 7
16:28:02.300952 recvfrom(61, "\0\n\0)\0", 5, 0, NULL, NULL) = 5
16:28:02.300997 recvfrom(61, "\316", 1, 0, NULL, NULL) = 1

16:28:02.301254 sendto(61, "\1\0\1\0\0\0\5\0\24\0\n\0\316", 13, 0, NULL, 0) = 13
16:28:02.301366 recvfrom(61, 0x7f46ac01a744, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:28:02.301440 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0
16:28:02.301521 epoll_wait(11, [{EPOLLIN, {u32=61, u64=6375011298798731325}}], 1023, 118) = 1
16:28:02.302096 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6e380) = 0
16:28:02.302140 recvfrom(61, "\1\0\1\0\0\0\10", 7, 0, NULL, NULL) = 7
16:28:02.302195 recvfrom(61, "\0\24\0\v\0\0\0\0", 8, 0, NULL, NULL) = 8
16:28:02.302254 recvfrom(61, "\316", 1, 0, NULL, NULL) = 1
16:28:02.302575 setsockopt(61, SOL_TCP, TCP_USER_TIMEOUT, [60000], 4) = 0
16:28:02.303929 getsockname(61, {sa_family=AF_INET, sin_port=htons(44034), sin_addr=inet_addr("172.17.1.16")}, [16]) = 0

16:28:02.305406 sendto(61, "\1\0\1\0\0\0008\0(\0\n\0\0%q-agent-notifier-tunnel-update_fanout\6fanout\4\0\0\0\0\316", 64, 0, NULL, 0) = 64
16:28:02.305521 recvfrom(61, 0x7f46ac01a744, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:28:02.305626 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0
16:28:02.305775 epoll_wait(11, [{EPOLLIN, {u32=61, u64=6375011298798731325}}], 1023, 113) = 1
16:28:02.305860 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6dd70) = 0
16:28:02.305919 recvfrom(61, "\1\0\1\0\0\0\4", 7, 0, NULL, NULL) = 7
16:28:02.306000 recvfrom(61, "\0(\0\v", 4, 0, NULL, NULL) = 4
16:28:02.306079 recvfrom(61, "\316", 1, 0, NULL, NULL) = 1
16:28:02.306202 setsockopt(61, SOL_TCP, TCP_USER_TIMEOUT, [60000], 4) = 0

16:28:02.306408 sendto(61, "\1\0\1\0\0\0\5\0U\0\n\0\316", 13, 0, NULL, 0) = 13
16:28:02.306560 recvfrom(61, 0x7f46ac01a744, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:28:02.306662 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0
16:28:02.306810 epoll_wait(11, [{EPOLLIN, {u32=61, u64=6375011298798731325}}], 1023, 112) = 1
16:28:02.306891 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6d950) = 0
16:28:02.306965 recvfrom(61, "\1\0\1\0\0\0\4", 7, 0, NULL, NULL) = 7
16:28:02.307056 recvfrom(61, "\0U\0\v", 4, 0, NULL, NULL) = 4
16:28:02.307105 recvfrom(61, "\316", 1, 0, NULL, NULL) = 1

16:28:02.307336 sendto(61, "\1\0\1\0\0\0.\0<\0(\0\0%q-agent-notifier-tunnel-update_fanout\0\0\316\2\0\1\0\0\0+\0<\0\0\0\0\0\0\0\0\3\360\370\0\20application/json\5utf-8\0\0\0\0\2\0\316\3\0\1\0\0\3\360{\"oslo.message\": \"{\\\"_context_domain\\\": null, \\\"_context_request_id\\\": \\\"req-ad27bed5-92f5-4b46-940d-328ff78abddc\\\", \\\"_context_global_request_id\\\": null, \\\"_context_auth_token\\\": null, \\\"_context_resource_uuid\\\": null, \\\"_context_tenant_name\\\": null, \\\"_context_user\\\": null, \\\"_context_user_id\\\": null, \\\"_context_show_deleted\\\": false, \\\"_context_is_admin\\\": true, \\\"version\\\": \\\"1.0\\\", \\\"_context_project_domain\\\": null, \\\"_context_timestamp\\\": \\\"2018-06-25 01:31:07.189543\\\", \\\"method\\\": \\\"tunnel_update\\\", \\\"_context_project\\\": null, \\\"_context_roles\\\": [], \\\"args\\\": {\\\"tunnel_ip\\\": \\\"172.17.2.16\\\", \\\"tunnel_type\\\": \\\"vxlan\\\"}, \\\"_unique_id\\\": \\\"ab3a2864cff54bf08a8a2f8f57389466\\\", \\\"_context_tenant_id\\\": null, \\\"_context_is_admin_project\\\": true, \\\"_context_project_name\\\": null, \\\"_context_user_identity\\\": \\\"- - - - -\\\", \\\"_context_tenant\\\": null, \\\"_context_project_id\\\": null, \\\"_context_read_only\\\": false, \\\"_context_user_domain\\\": null, \\\"_context_user_name\\\": null}\", \"oslo.version\": \"2.0\"}\316", 1121, 0, NULL, 0) = 1121
16:28:02.307480 recvfrom(61, 0x7f46ac01aad4, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:28:02.307581 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0

16:29:02.216530 epoll_wait(11, [{EPOLLIN, {u32=61, u64=6375011298798731325}}], 1023, 89) = 1
16:29:02.302751 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6dd70) = 0
16:29:02.302887 recvfrom(61, "\10\0\0\0\0\0\0", 7, 0, NULL, NULL) = 7
16:29:02.302991 recvfrom(61, "\316", 1, 0, NULL, NULL) = 1
16:29:02.303085 recvfrom(61, 0x7f46ac55ebc4, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16:29:02.303243 epoll_ctl(11, EPOLL_CTL_ADD, 61, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=61, u64=6375011298798731325}}) = 0

16:30:02.303266 epoll_ctl(11, EPOLL_CTL_DEL, 61, 0x7ffd6ee6dd70) = 0
16:30:02.303487 shutdown(61, SHUT_RDWR) = 0

16:30:27.559738 epoll_wait(11, [{EPOLLIN, {u32=62, u64=6375011298798731326}}], 1023, 61) = 1
16:32:03.324628 close(61)               = 0
---

Comment 39 Mehdi ABAAKOUK 2018-06-25 19:35:55 UTC
We have one case where the duplicate message is not removed from rabbitmq: when the connection is closed during the message.reject() call that should remove it.
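
To illustrate the window being described, here is a rough sketch of the consumer-side handling (an illustrative paraphrase, not the upstream oslo.messaging code; msg_id_cache and DuplicateMessageError are as in the sketch under comment 8, and dispatch stands in for the real handler):

import logging

LOG = logging.getLogger(__name__)

def on_message(message, msg_id_cache, dispatch):
    # 'message' stands for a kombu message object
    try:
        msg_id_cache.check_duplicate_message(message.payload)
        dispatch(message)
        message.ack()
    except DuplicateMessageError:
        LOG.exception("Failed to process message ... skipping it.")
        # basic.reject with requeue=False asks the broker to drop the
        # message.  If the connection is closed before this reject reaches
        # the broker, the broker keeps the message and redelivers it on the
        # next connection, matching the repeating DuplicateMessageError
        # entries seen in the logs above.
        message.reject(requeue=False)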

Comment 40 Brent Eagles 2018-06-25 19:58:36 UTC
In reference to comments made earlier about containers possibly not being shut down gracefully: systemd-shutdown will send a SIGTERM to remaining processes, and those that do not shut down quickly enough will be sent a SIGKILL. For example, the following was captured from an attached console:

[ 1556.163360] type=1700 audit(1529955121.410:884): dev=qr-90058585-3f prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
[ 1558.874838] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 1558.886653] systemd-shutdown[1]: Sending SIGKILL to PID 2044 (docker-containe).
[ 1558.892701] systemd-shutdown[1]: Sending SIGKILL to PID 2079 (kolla_start).
[ 1558.897848] systemd-shutdown[1]: Sending SIGKILL to PID 2189 (ceilometer-agen).
[ 1558.903574] systemd-shutdown[1]: Sending SIGKILL to PID 2190 (docker-containe).
[ 1558.909220] systemd-shutdown[1]: Sending SIGKILL to PID 2217 (docker-containe).
[ 1558.914799] systemd-shutdown[1]: Sending SIGKILL to PID 2277 (docker-containe).
[ 1558.920219] systemd-shutdown[1]: Sending SIGKILL to PID 2317 (kolla_start).
[ 1558.924792] systemd-shutdown[1]: Sending SIGKILL to PID 2329 (kolla_start).
[ 1558.932073] systemd-shutdown[1]: Sending SIGKILL to PID 2376 (kolla_start).
[ 1558.943927] systemd-shutdown[1]: Sending SIGKILL to PID 2584 (aodh-notifier: ).
[ 1558.952055] systemd-shutdown[1]: Sending SIGKILL to PID 2588 (docker-containe).
[ 1558.956779] systemd-shutdown[1]: Sending SIGKILL to PID 2612 (cinder-schedule).
[ 1558.961131] systemd-shutdown[1]: Sending SIGKILL to PID 2690 (neutron-l3-agen).
[ 1558.965247] systemd-shutdown[1]: Sending SIGKILL to PID 2742 (kolla_start).
[ 1558.969272] systemd-shutdown[1]: Sending SIGKILL to PID 2904 (docker-containe).
[ 1558.978826] systemd-shutdown[1]: Sending SIGKILL to PID 2906 (docker-containe).
[ 1558.985754] systemd-shutdown[1]: Sending SIGKILL to PID 3044 (nova-consoleaut).
[ 1558.994902] systemd-shutdown[1]: Sending SIGKILL to PID 3068 (kolla_start).
[ 1558.998755] systemd-shutdown[1]: Sending SIGKILL to PID 3143 (kolla_start).
[ 1559.006546] systemd-shutdown[1]: Sending SIGKILL to PID 3381 (ceilometer-agen).
[ 1559.010528] systemd-shutdown[1]: Sending SIGKILL to PID 3485 (aodh-evaluator:).
[ 1559.014241] systemd-shutdown[1]: Sending SIGKILL to PID 3520 (docker-containe).
[ 1559.020100] systemd-shutdown[1]: Sending SIGKILL to PID 3592 (neutron-dhcp-ag).
[ 1559.024805] systemd-shutdown[1]: Sending SIGKILL to PID 3617 (kolla_start).
[ 1559.029351] systemd-shutdown[1]: Sending SIGKILL to PID 4003 (aodh-listener: ).
[ 1559.033909] systemd-shutdown[1]: Sending SIGKILL to PID 4061 (docker-containe).
[ 1559.038420] systemd-shutdown[1]: Sending SIGKILL to PID 4162 (aodh-notifier: ).
[ 1559.043028] systemd-shutdown[1]: Sending SIGKILL to PID 4376 (kolla_start).
[ 1559.058361] systemd-shutdown[1]: Sending SIGKILL to PID 4943 (gnocchi-metricd).
[ 1559.062582] systemd-shutdown[1]: Sending SIGKILL to PID 5391 (aodh-listener: ).
[ 1559.066004] systemd-shutdown[1]: Sending SIGKILL to PID 5574 (aodh-evaluator:).
[ 1559.069310] systemd-shutdown[1]: Sending SIGKILL to PID 6388 (gnocchi-metricd).
[ 1559.074764] systemd-shutdown[1]: Sending SIGKILL to PID 6390 (gnocchi-metricd).
[ 1559.077960] systemd-shutdown[1]: Sending SIGKILL to PID 6398 (gnocchi-metricd).
[ 1559.082096] systemd-shutdown[1]: Sending SIGKILL to PID 6409 (gnocchi-metricd).

Comment 41 John Eckersberg 2018-06-26 01:37:38 UTC
OK, I think I have a pretty good handle on the problem here.  Not quite the "root" cause yet... but let's call it the stem of the plant where it goes into the ground.

I'm going to focus on nova here, only because it's the one I chose to debug, as it's less noisy in the logs and over AMQP.  This should apply equally to the other services, and I made extra sure that the neutron services were behaving the same way.

nova-scheduler is logging a DuplicateMessage error every two minutes.  I verified that oslo.messaging is correctly handling this by issuing a basic.reject back to the broker, with requeue=false.  So the message is not being redelivered by rabbitmq.

Instead, the producer is repeatedly publishing the same message.  I verified this by capturing all AMQP traffic on each controller and then matching the message payload against the message id from the error log.  With this I was able to trace the origin of the message back to the compute service as the publisher.

Capturing the AMQP traffic from the compute node revealed this repeated behavior:

- Establish new connection to RabbitMQ
- Do some not-relevant connection setup
- Call confirm.select with nowait=false, which is important later
- Call basic.publish, to the scheduler_fanout exchange

At this point, what *should* happen is the server replies with basic.ack.  However that does not actually happen.  Instead, the connection waits exactly 120 seconds before the client initiates a teardown.  During the 120 seconds, everything lower-level functions properly -- TCP keepalives are exchanged back and forth, and the AMQP heartbeat thread is heartbeating ok.

Here the client re-establishes the connection and the whole thing repeats indefinitely.

Going back to the "confirm.select, nowait=false" bit: this enables publisher confirms, which can be explored in more depth in the rabbitmq docs[1].  In short, the broker will notify the publisher once it has fully queued the message.  For our purposes in OSP, this means the message has been accepted by all queues (and mirrors) which are bound to the exchange.  More on bindings later.
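
For reference, the publish-with-confirms step can be reproduced in isolation with py-amqp (the client library underneath kombu/oslo.messaging). This is an illustrative sketch with placeholder host, credentials and payload, not the oslo.messaging code itself; it assumes the scheduler_fanout exchange already exists on the broker:

import amqp

conn = amqp.Connection(
    host='controller-2.internalapi.localdomain:5672',  # placeholder
    userid='guest', password='PASSWORD')                # placeholders
conn.connect()
channel = conn.channel()

# basic_publish_confirm first issues confirm.select (nowait=false) on the
# channel, then publishes, then blocks in wait(Basic.Ack) until the broker
# confirms -- i.e. until every queue (and mirror) bound to the exchange has
# accepted the message.  If that basic.ack never arrives, the publisher
# hangs here, which is the behaviour described above.
channel.basic_publish_confirm(
    amqp.Message(body='{"oslo.message": "..."}'),       # placeholder payload
    exchange='scheduler_fanout')

conn.close()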

You can see from the thread dump on the compute node that this same behavior is reflected.  The sync_instance_info periodic task (hence the nice 120-second interval) is running, which is what tries to publish to scheduler_fanout.  The publish succeeds, and then the thread blocks while waiting for the publisher confirm:

------                        Green Thread                        ------
/usr/lib/python2.7/site-packages/eventlet/greenthread.py:214 in main
    `result = function(*args, **kwargs)`
/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py:137 in _run_loop
    `result = func(*self.args, **self.kw)`
/usr/lib/python2.7/site-packages/nova/service.py:295 in periodic_tasks
    `return self.manager.periodic_tasks(ctxt, raise_on_error=raise_on_error)`
/usr/lib/python2.7/site-packages/nova/manager.py:106 in periodic_tasks
    `return self.run_periodic_tasks(context, raise_on_error=raise_on_error)`
/usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:220 in run_periodic_tasks
    `task(self, context)`
/usr/lib/python2.7/site-packages/nova/compute/manager.py:1640 in _sync_scheduler_instance_info
    `self.scheduler_client.sync_instance_info(context, self.host, uuids)`
/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py:83 in sync_instance_info
    `self.queryclient.sync_instance_info(context, host_name, instance_uuids)`
/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py:37 in __run_method
    `return getattr(self.instance, __name)(*args, **kwargs)`
/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py:95 in sync_instance_info
    `instance_uuids)`
/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py:183 in sync_instance_info
    `instance_uuids=instance_uuids)`
/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py:152 in cast
    `self.transport._send(self.target, msg_ctxt, msg, retry=self.retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/transport.py:131 in _send
    `timeout=timeout, retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:559 in send
    `retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:533 in _send
    `conn.fanout_send(target.topic, msg, retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:1281 in fanout_send
    `self._ensure_publishing(self._publish, exchange, msg, retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:1155 in _ensure_publishing
    `self.ensure(method, retry=retry, error_callback=_error_callback)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:807 in ensure
    `ret, channel = autoretry_method()`
/usr/lib/python2.7/site-packages/kombu/connection.py:494 in _ensured
    `return fun(*args, **kwargs)`
/usr/lib/python2.7/site-packages/kombu/connection.py:570 in __call__
    `return fun(*args, channel=channels[0], **kwargs), channels[0]`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:796 in execute_method
    `method()`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:1187 in _publish
    `compression=self.kombu_compression)`
/usr/lib/python2.7/site-packages/kombu/messaging.py:181 in publish
    `exchange_name, declare,`
/usr/lib/python2.7/site-packages/kombu/messaging.py:203 in _publish
    `mandatory=mandatory, immediate=immediate,`
/usr/lib/python2.7/site-packages/amqp/channel.py:1759 in basic_publish_confirm
    `self.wait(spec.Basic.Ack)`
/usr/lib/python2.7/site-packages/amqp/abstract_channel.py:93 in wait
    `self.connection.drain_events(timeout=timeout)`
/usr/lib/python2.7/site-packages/amqp/connection.py:465 in drain_events
    `return self.blocking_read(timeout)`
/usr/lib/python2.7/site-packages/amqp/connection.py:469 in blocking_read
    `frame = self.transport.read_frame()`
/usr/lib/python2.7/site-packages/amqp/transport.py:237 in read_frame
    `frame_header = read(7, True)`
/usr/lib/python2.7/site-packages/amqp/transport.py:377 in _read
    `s = recv(n - len(rbuf))`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:354 in recv
    `return self._recv_loop(self.fd.recv, b'', bufsize, flags)`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:348 in _recv_loop
    `self._read_trampoline()`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:319 in _read_trampoline
    `timeout_exc=socket.timeout("timed out"))`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:203 in _trampoline
    `mark_as_closed=self._mark_as_closed)`
/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py:162 in trampoline
    `return hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`

When the request times out, the stack unwinds back to the _ensure_publishing/ensure/_ensured part, which reconnects and publishes the same message object again.  This is why we get the DuplicateMessageError entries.

If you're following along closely, you're now asking the same question that baffled me:  If the compute service never received the publisher confirm... how does the message flow all the way through to the scheduler nodes in order for them to see the duplicate in the first place?

The answer is bindings, and in particular the way they interact with publisher confirms.

Messages are published to exchanges.  Consumers read messages from queues.  Queues are bound to exchanges with bindings.  For exchanges of type fanout, each bound queue will receive a copy of each message published to the exchange.

Now, recall that in order for rabbitmq to send a publisher confirm, all applicable queues must have received the message.

Here's the problem:  we have bindings to non-existent queues!  This should never be the case, and is indicative of a bug in rabbitmq.

But here it is:

()[root@controller-1 /]# rabbitmqctl list_bindings source_name destination_name routing_key | grep ^scheduler_fanout; rabbitmqctl list_queues name pid slave_pids | grep ^scheduler_fanout_                                                                                                
scheduler_fanout        scheduler_fanout_8d22dc1db1134e23800b29d2adf36ec9       scheduler
scheduler_fanout        scheduler_fanout_90f17b0f3bdc466f876a398ff2567b2a       scheduler
scheduler_fanout        scheduler_fanout_b8ee661524e94a9a97a7280162d15905       scheduler
scheduler_fanout        scheduler_fanout_cb3525d4ecae46ae8242df1f8a7e2408       scheduler
scheduler_fanout        scheduler_fanout_d3c5f126eed5475ca6f2aad8dbdcc2a5       scheduler
scheduler_fanout        scheduler_fanout_ea23bbca01cd4da2a6ea2a672f7cf5de       scheduler
scheduler_fanout_b8ee661524e94a9a97a7280162d15905       <rabbit.17881.0> [<rabbit.30886.0>, <rabbit.2538.0>]                                                                                                                                  
scheduler_fanout_cb3525d4ecae46ae8242df1f8a7e2408       <rabbit.14973.0> [<rabbit.2851.0>, <rabbit.3053.0>]                                                                                                                                   
scheduler_fanout_90f17b0f3bdc466f876a398ff2567b2a       <rabbit.2932.0>  [<rabbit.3059.0>, <rabbit.3283.0>]                                                                                                                                   

We have six queues bound to scheduler_fanout, but three of them don't exist at all.  This means that publisher confirms for messages published to the scheduler_fanout exchange will never succeed.  So the compute service will keep trying to publish that same message forever, waiting for a basic.ack that never comes, because rabbitmq keeps trying to route the message to queues that don't exist.
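
(A quick way to list every binding destination with no matching queue, generalising the manual check above; exchange-to-exchange bindings, if any exist, would show up as false positives:)

  comm -23 <(rabbitmqctl -q list_bindings destination_name | sort -u) \
           <(rabbitmqctl -q list_queues name | sort -u)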

Finally, as I mentioned before, it's not just the nova scheduler.  I see the same thing happening across all services, so this certainly explains the weirdness going on with neutron as well.

[1] https://www.rabbitmq.com/confirms.html#publisher-confirms

Comment 46 John Eckersberg 2018-07-05 21:03:55 UTC
This reproduces 100% of the time as follows:

For each controller, in turn (roughly as sketched below):
  - pcs cluster stop
  - reboot the machine
  - wait for boot and pcs cluster startup
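
A rough shell sketch of that loop (controller names, the SSH user, and the crude sleep are assumptions; in practice the reproducer was driven by hand or via the ir-reboot playbook):

  for node in controller-0 controller-1 controller-2; do
      ssh heat-admin@$node 'sudo pcs cluster stop && sudo reboot' || true
      sleep 600   # crude wait; really: poll "pcs status" until the node shows Online again
      ssh heat-admin@$node 'sudo pcs status | grep -A1 Online'
  done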

After the third controller reboots, there are orphaned bindings as seen here:

(undercloud) [stack@undercloud-0 ~]$ sqlite3 report.db 'select distinct destination_name from bindings where not exists (select name from queues where name = bindings.destination_name);'
q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de
q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43
q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8
scheduler_fanout_9461e9c5e62647628b8f18200f90eed3

These four queues have in common:

- The queue master was originally located on controller-0
- The queue consumer was also originally located on controller-0

This can be seen here in the initial state:

(undercloud) [stack@undercloud-0 ~]$ sqlite3 report-2018-07-05_14:16:12-c19.db "select pid, slave_pids, synchronised_slave_pids from queues where name in ('q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de','q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43','q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8','scheduler_fanout_9461e9c5e62647628b8f18200f90eed3');"
<rabbit.1241.0>|[<rabbit.2435.0>, <rabbit.2226.0>]|[<rabbit.2226.0>, <rabbit.2435.0>]
<rabbit.2099.0>|[<rabbit.2499.0>, <rabbit.2290.0>]|[<rabbit.2290.0>, <rabbit.2499.0>]
<rabbit.7518.0>|[<rabbit.2531.0>, <rabbit.2310.0>]|[<rabbit.2310.0>, <rabbit.2531.0>]
<rabbit.7673.0>|[<rabbit.2311.0>, <rabbit.2101.0>]|[<rabbit.2101.0>, <rabbit.2311.0>]
(undercloud) [stack@undercloud-0 ~]$ sqlite3 report-2018-07-05_14:16:12-c19.db "select channel_pid from consumers where queue_name in ('q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de','q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43','q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8','scheduler_fanout_9461e9c5e62647628b8f18200f90eed3');"
<rabbit.2092.0>
<rabbit.7505.0>
<rabbit.7668.0>
<rabbit.1227.0>

Now let's look at how things progress as we iterate through the controllers.

After rebooting controller-0:

(undercloud) [stack@undercloud-0 ~]$ sqlite3 report-2018-07-05_14:20:49-c19.db "select pid, slave_pids, synchronised_slave_pids from queues where name in ('q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de','q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43','q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8','scheduler_fanout_9461e9c5e62647628b8f18200f90eed3');"
<rabbit.2311.0>|[<rabbit.2101.0>, <rabbit.2269.0>]|[<rabbit.2101.0>]
<rabbit.2435.0>|[<rabbit.2226.0>, <rabbit.2413.0>]|[<rabbit.2226.0>]
<rabbit.2499.0>|[<rabbit.2290.0>, <rabbit.2485.0>]|[<rabbit.2290.0>]
<rabbit.2531.0>|[<rabbit.2310.0>, <rabbit.2513.0>]|[<rabbit.2310.0>]
(undercloud) [stack@undercloud-0 ~]$ sqlite3 report-2018-07-05_14:20:49-c19.db "select channel_pid from consumers where queue_name in ('q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de','q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43','q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8','scheduler_fanout_9461e9c5e62647628b8f18200f90eed3');"
(undercloud) [stack@undercloud-0 ~]$ 

So what happened?

- Each queue failed over to controller-1 since it was the eldest remaining slave
- A new slave was added on controller-0 when it came back up
- Only the controller-2 slave is synchronised
- There are no consumers, which is expected since we rebooted the node which contained all consumers

After rebooting controller-1:

(undercloud) [stack@undercloud-0 ~]$ sqlite3 report-2018-07-05_14:25:39-c19.db "select pid, slave_pids, synchronised_slave_pids from queues where name in ('q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de','q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43','q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8','scheduler_fanout_9461e9c5e62647628b8f18200f90eed3');"
<rabbit.2101.0>|[<rabbit.2269.0>, <rabbit.2450.0>]|[]
<rabbit.2226.0>|[<rabbit.2413.0>, <rabbit.2608.0>]|[]
<rabbit.2290.0>|[<rabbit.2485.0>, <rabbit.2696.0>]|[]
<rabbit.2310.0>|[<rabbit.2513.0>, <rabbit.2720.0>]|[]

Now what?

- Each queue failed over to controller-2 since it was the eldest remaining slave
- A new slave was added on controller-1 when it came back up
- There are now *no* synchronised slaves

The "no synchronised slaves" behavior is expected.  When a slave is added, the slave only receives messages which were queued after the slave joined the group.  The design is that the slave synchronises with the master due to the consumer(s) of the queue consuming the messages that were present before the slave joined.  However because we rebooted the node with the consumer, the old messages aren't consumed.  So the slave never syncs up.  Here's a rough timeline of how we get into this state:

- controller-0 is rebooted, consumers go away, queue failover to controller-1 as above

While controller-0 is rebooting:

- At least one message M is published to the fanout exchange and queued on the master (now controller-1)
- M is mirrored to the slave on controller-2, controller-2 stays synchronised

When controller-0 starts:

- rmq on controller-0 starts and adds a slave, but the slave is *not* synchronised because it does not know about M

And then basically the same thing again, except with controller-1.
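
(On a live cluster the same state can be spotted without a report database; HA queues whose synchronised_slave_pids column comes back empty are the ones at risk of exactly this:)

  rabbitmqctl -q list_queues name pid slave_pids synchronised_slave_pids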

OK, with that out of the way... what happens when we reboot controller-2?

We know from the very beginning of this discussion that there are orphaned bindings.  We verify that the queues now no longer exist:

(undercloud) [stack@undercloud-0 ~]$ sqlite3 report.db "select pid, slave_pids, synchronised_slave_pids from queues where name in ('q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de','q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43','q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8','scheduler_fanout_9461e9c5e62647628b8f18200f90eed3');"
(undercloud) [stack@undercloud-0 ~]$ 

From the rmq log on controller-2, we can see the evolution of the above events:

[root@controller-2 ~]# grep -B1 -e 'q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de' -e 'q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43' -e 'q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8' -e 'scheduler_fanout_9461e9c5e62647628b8f18200f90eed3' /var/log/containers/rabbitmq/rabbit\@controller-2.log                                                                                                                                                                                                               
=INFO REPORT==== 5-Jul-2018::18:15:42 ===
Mirrored queue 'q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8' in vhost '/': Adding mirror on node 'rabbit@controller-2': <0.2101.0>
--
=INFO REPORT==== 5-Jul-2018::18:15:42 ===
Mirrored queue 'scheduler_fanout_9461e9c5e62647628b8f18200f90eed3' in vhost '/': Adding mirror on node 'rabbit@controller-2': <0.2226.0>
--
=INFO REPORT==== 5-Jul-2018::18:15:42 ===
Mirrored queue 'q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de' in vhost '/': Adding mirror on node 'rabbit@controller-2': <0.2290.0>
--
=INFO REPORT==== 5-Jul-2018::18:15:42 ===
Mirrored queue 'q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43' in vhost '/': Adding mirror on node 'rabbit@controller-2': <0.2310.0>
--
=INFO REPORT==== 5-Jul-2018::18:21:00 ===
Mirrored queue 'q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8' in vhost '/': Slave <rabbit.2101.0> saw deaths of mirrors <rabbit.7673.0>
--
=INFO REPORT==== 5-Jul-2018::18:21:00 ===
Mirrored queue 'q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de' in vhost '/': Slave <rabbit.2290.0> saw deaths of mirrors <rabbit.2099.0>
--
=INFO REPORT==== 5-Jul-2018::18:25:46 ===
Mirrored queue 'q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8' in vhost '/': Slave <rabbit.2101.0> saw deaths of mirrors <rabbit.2311.0>
--
=INFO REPORT==== 5-Jul-2018::18:25:46 ===
Mirrored queue 'q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8' in vhost '/': Promoting slave <rabbit.2101.0> to master
--
=INFO REPORT==== 5-Jul-2018::18:25:46 ===
Mirrored queue 'q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de' in vhost '/': Slave <rabbit.2290.0> saw deaths of mirrors <rabbit.2499.0>
--
=INFO REPORT==== 5-Jul-2018::18:25:46 ===
Mirrored queue 'q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de' in vhost '/': Promoting slave <rabbit.2290.0> to master
--
=INFO REPORT==== 5-Jul-2018::18:25:46 ===
Mirrored queue 'scheduler_fanout_9461e9c5e62647628b8f18200f90eed3' in vhost '/': Slave <rabbit.2226.0> saw deaths of mirrors <rabbit.2435.0>
--
=INFO REPORT==== 5-Jul-2018::18:25:46 ===
Mirrored queue 'scheduler_fanout_9461e9c5e62647628b8f18200f90eed3' in vhost '/': Promoting slave <rabbit.2226.0> to master
--
=INFO REPORT==== 5-Jul-2018::18:25:48 ===
Mirrored queue 'q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43' in vhost '/': Promoting slave <rabbit.2310.0> to master
--
=WARNING REPORT==== 5-Jul-2018::18:30:34 ===
Mirrored queue 'scheduler_fanout_9461e9c5e62647628b8f18200f90eed3' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
--
=WARNING REPORT==== 5-Jul-2018::18:30:34 ===
Mirrored queue 'q-agent-notifier-tunnel-update_fanout_697445ca8d4a47dcb667e2567eb501de' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
--
=WARNING REPORT==== 5-Jul-2018::18:30:34 ===
Mirrored queue 'q-server-resource-versions_fanout_eed82e68749248b3a0b987c809c773a8' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
--
=WARNING REPORT==== 5-Jul-2018::18:30:34 ===
Mirrored queue 'q-server-resource-versions_fanout_70cdbf036566402ea416fff2a1f07f43' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

The last four lines are the key -- once we shut down the master and we have no synchronised slaves, the queue just stops.

Around this point, something goes horribly wrong.  What exactly, I don't know yet.

Comment 47 Eran Kuris 2018-07-12 13:08:58 UTC
*** Bug 1600373 has been marked as a duplicate of this bug. ***

Comment 49 pkomarov 2019-01-08 09:06:49 UTC
I was able to reproduce the errors here using the same playbook. 

After applying the patch below, the cluster came back healthy after an overcloud reboot using the restart playbook.
patch: https://review.gerrithub.io/439266

Results and cluster health checks after applying the patch : 
http://pastebin.test.redhat.com/691818

Artem, can you:
Re-test the automation reproducer of this bug with:
IR_GERRIT_CHANGE: 439266

or the manual version:
- return the pacemaker cluster to full health with: pcs cluster stop --all && pcs cluster start --all (on one controller)
- test the ir-reboot playbook with the suggested patch? (gerrit_change - 439266)


Thanks

Comment 50 Takashi Kajinami 2019-03-15 00:05:14 UTC
Hi,

I see the same issue in deployments with z3, and believe this problem is still there.

I've found some activity going on in the related cases, but cannot see the whole picture.
Can anybody tell me the current status of this issue?

Thank you,
Takashi

Comment 51 Damien Ciabrini 2019-04-05 11:20:55 UTC
Let me do a quick summary of where we are with that bug.

TL;DR: We think we now fully understand what the root issue is and we have some ongoing plans to fix it.

In OpenStack, rabbitmq message recipients are currently configured as HA queues mirrored across controllers (for failover), but not durable (not persisted to disk for recovery on server restart). One logical recipient typically has one master queue and at least one mirror queue.

When a master queue becomes unavailable, a mirror automatically takes over the master role, as long as it's synchronized (i.e it received all the messages that the previous master queue had).

Publishing a message to a recipient consists of pushing the message to a rabbitmq "exchange", which has "bindings" (i.e. routes) to the master and mirror queues. The publish is acknowledged once all connected queues acknowledge receipt of the message.

If one queue replica disappears (e.g. rabbitmq stops or a controller node reboots) and reconnects _after_ some messages have been queued but not yet consumed in the remaining replicas, the reconnecting replica becomes an "unsynchronized" mirror. As such, it won't be able to take over the master role automatically if a master failover happens.

So during a rolling restart of all rabbitmq servers, it might happen that all master queues disconnect sequentially and reconnect as unsynchronized mirrors. In that case, when the last master disconnects, no mirror can take over the master role, and RabbitMQ deletes all the queues for the logical recipient. The important detail is that RabbitMQ _does not_ delete the "bindings" to those queues [1].

From then on, when a message is published to the original logical recipient, rabbitmq still receives it in the exchange, tries to push it to the now nonexistent queues via the leftover "bindings", and never acknowledges the publish, because there is no queue left to deliver to.

Going back to our OpenStack context: an OpenStack client/service can send a "notification" to many consumers at once (i.e. Pub/Sub idiom) via a "fanout" exchange. Each registered consumer has its own HA queue, which means its own master and mirror queues.

As described above, a rolling restart of the controller nodes can delete all the queues for a consumer. If that consumer never comes back online, the bindings to its queues linger in the fanout exchange, delivery to this particular consumer is never acknowledged, and consequently the exchange can never acknowledge the publish back to the OpenStack client.

The OpenStack client is unaware of that condition, so it retries publishing the same message to the fanout exchange. By then, some consumers have already received and acknowledged the original message, which ultimately results in the DuplicateMessageError that we experience.
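
(Operationally, the symptom is easy to spot on the controllers; the containerized log location below is the usual one for this release and may differ per deployment:)

  grep -r 'DuplicateMessageError' /var/log/containers/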


To prevent that situation, we are currently evaluating a different HA policy configuration in RabbitMQ that would allow unsynchronized mirrors to take over the master role if no synchronized mirror is available. In the rare circumstances where such a master failover is needed, the RPC retry mechanism in OpenStack is expected to cope with the few lost messages, so this should have no adverse effect on the cloud.
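
(For illustration only, such a policy could be applied by hand roughly as below; the policy name, queue pattern and ha-mode must match what the deployment already uses, ha-promote-on-failure is only honoured by RabbitMQ versions that implement it, and the actual fix delivers this through puppet-tripleo rather than manually:)

  rabbitmqctl set_policy ha-all '^(?!amq\.).*' \
    '{"ha-mode":"all","ha-promote-on-failure":"always"}' --apply-to queues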

We'll update the bugzilla as soon as we get confirmation that the approach is working as expected.



[1] These semantics are out of scope for this discussion; see the RabbitMQ docs for more details.

Comment 61 Michele Baldessari 2019-05-31 09:26:11 UTC
*** Bug 1715620 has been marked as a duplicate of this bug. ***

Comment 64 Chris Jones 2019-06-06 12:04:24 UTC
*** Bug 1688649 has been marked as a duplicate of this bug. ***

Comment 65 Chris Jones 2019-06-06 12:19:38 UTC
*** Bug 1690955 has been marked as a duplicate of this bug. ***

Comment 66 Chris Jones 2019-06-06 12:24:42 UTC
*** Bug 1685030 has been marked as a duplicate of this bug. ***

Comment 83 pkomarov 2019-08-20 06:27:03 UTC
The fix resolves the issue described in this BZ.

Comment 84 Chris Jones 2019-08-20 09:28:13 UTC
Hotfix approved.

Comment 87 errata-xmlrpc 2019-09-03 16:55:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2624

Comment 88 Luca Miccini 2019-10-22 06:51:23 UTC
*** Bug 1750692 has been marked as a duplicate of this bug. ***

