Bug 1669499 - queue "ironic-neutron-agent-heartbeat.info" keeps on populating messages with no consumer
Summary: queue "ironic-neutron-agent-heartbeat.info" keeps on populating messages with...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-baremetal
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z7
Target Release: 13.0 (Queens)
Assignee: Harald Jensås
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks: 1684564
 
Reported: 2019-01-25 13:56 UTC by Ketan Mehta
Modified: 2019-07-10 13:02 UTC (History)
11 users

Fixed In Version: python-networking-baremetal-1.0.0-2.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1684564 (view as bug list)
Environment:
Last Closed: 2019-07-10 13:01:59 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1814544 0 None None None 2019-02-04 14:43:42 UTC
OpenStack Storyboard 2004933 0 None None None 2019-02-04 14:52:58 UTC
OpenStack Storyboard 2004938 0 None None None 2019-02-05 04:10:45 UTC
OpenStack gerrit 634850 0 None master: MERGED networking-baremetal: Ensure notifications are consumed from non-pool queue (I7b3e0db64b7b7d3372e5ca24cea88b4b897651be) 2019-04-08 15:48:45 UTC
OpenStack gerrit 634932 0 None master: MERGED networking-baremetal: Set amqp_auto_delete=true for notifications transport. (Ie51d0a0b02ed5ea336f3280d84d77cf8fec90ccb) 2019-04-08 15:48:35 UTC
OpenStack gerrit 635158 0 None stable/rocky: MERGED networking-baremetal: Set amqp_auto_delete=true for notifications transport. (Ie51d0a0b02ed5ea336f3280d84d77cf8fec90ccb) 2019-04-08 15:48:26 UTC
OpenStack gerrit 635509 0 None stable/queens: MERGED networking-baremetal: Set amqp_auto_delete=true for notifications transport. (Ie51d0a0b02ed5ea336f3280d84d77cf8fec90ccb) 2019-04-08 15:48:16 UTC
OpenStack gerrit 635721 0 None stable/rocky: MERGED networking-baremetal: Ensure notifications are consumed from non-pool queue (I7b3e0db64b7b7d3372e5ca24cea88b4b897651be) 2019-04-08 15:48:06 UTC
OpenStack gerrit 635724 0 None stable/queens: MERGED networking-baremetal: Ensure notifications are consumed from non-pool queue (I7b3e0db64b7b7d3372e5ca24cea88b4b897651be) 2019-04-08 15:47:56 UTC
OpenStack gerrit 637579 0 None master: MERGED networking-baremetal: Rename agent queue - fixes broken minor update (Iebb92beff9c4e16581a07b16ccdfbbeba3176543) 2019-04-08 15:47:46 UTC
OpenStack gerrit 638104 0 None stable/rocky: MERGED networking-baremetal: Rename agent queue - fixes broken minor update (Iebb92beff9c4e16581a07b16ccdfbbeba3176543) 2019-04-08 15:47:36 UTC
OpenStack gerrit 638114 0 None stable/queens: MERGED networking-baremetal: Rename agent queue - fixes broken minor update (Iebb92beff9c4e16581a07b16ccdfbbeba3176543) 2019-04-08 15:47:26 UTC
Red Hat Product Errata RHBA-2019:1744 0 None None None 2019-07-10 13:02:16 UTC

Description Ketan Mehta 2019-01-25 13:56:20 UTC
Description of problem:

After a recent upgrade from RHOSP12 to RHOSP13, there are a couple of RabbitMQ queues on the undercloud node with a large and ever-growing number of messages and no consumer available.

These are the queues: 

[1]. ironic-neutron-agent-heartbeat.info
[2]. queues with some <uuid>

The messages keep on growing; the present count in queue [1] is around 75k, and it is still rising.

The same behaviour was noticed for this queue in a RHOSP13 test environment, where the message count is around 20k. The queue has not been marked for auto_delete, so the messages aren't cleaned up after a set interval of time.

There is no queue like [1] in RHOSP12 or earlier, so it is unclear what the expected behaviour is or what purpose it serves.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. sudo rabbitmqctl list_queues | awk 'int($2)>=1000'
  d6edc11d-cebc-41fd-ae84-8262c659b920	9128
  ironic-neutron-agent-heartbeat.info	27323
  55dca8c1-be63-43b7-a190-ff7550071412	9238
  c1292189-881c-4bf5-a06d-ed34b576306f	9162

2. sudo rabbitmqctl list_queues name messages consumers | grep ironic-neutron-agent-heartbeat.info
ironic-neutron-agent-heartbeat.info	27325	0

3. sudo rabbitmqctl list_queues name auto_delete consumers | grep ironic-neutron-agent-heartbeat.info
ironic-neutron-agent-heartbeat.info	false	0

Actual results:


Expected results:
Queues should be cleaned up or messages should be consumed by the desired consumers.

Additional info:

Comment 3 Harald Jensås 2019-02-03 13:37:32 UTC
I've been looking at this bug, I think there are two problems:

a) Whenever the ironic-neutron-agent is started, an agent UUID is generated. Each time the agent is restarted, a new UUID is generated and new queues are created.

[stack@ironic-devstack devstack]$  sudo rabbitmqctl list_queues name messages consumers auto_delete |egrep 'ironic-neutron-agent-heartbeat.info|975d3852-9b40-4416-89a8-1681e0f94638|8f177e6b-342e-42fd-b571-38fafd05371c|da565ed0-3d2f-4629-a8ee-59bd247b36b6'
8f177e6b-342e-42fd-b571-38fafd05371c	338	0	false
ironic-neutron-agent-heartbeat.info	4030	0	false
975d3852-9b40-4416-89a8-1681e0f94638	4004	0	false
da565ed0-3d2f-4629-a8ee-59bd247b36b6	0	1	false


  ^^ In this example the agent was restarted 3 times. We have 1 consumer on da565ed0-3d2f-4629-a8ee-59bd247b36b6 which is the UUID generated on the last agent start/restart.

I've tried to add some cleanup on stop and reset of the agent, i.e.:

        self.listener.stop()
        self.transport.cleanup()

However, doing this does not delete the UUID queues.

b) No consumer on ironic-neutron-agent-heartbeat.info queue

Need to investigate this more, but I think this may be an issue in oslo.messaging when using pool.

""" This delivery pattern can be altered somewhat by specifying a pool name for the listener. Listeners with the same pool name behave like a subgroup within the group of listeners subscribed to the same topic/exchange. Each subgroup of listeners will receive a copy of the notification to be consumed by one member of the subgroup. Therefore, multiple copies of the notification will be delivered - one to the group of listeners that have no pool name (if they exist), and one to each subgroup of listeners that share the same pool name. """

My theory is that with the rabbit driver the above creates a queue for each subgroup (agent uuid in this case). 
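A small, self-contained way to see why the non-pool queue fills up is to simulate the quoted delivery pattern. This is only a toy model of the semantics; the class and names below are hypothetical and not oslo.messaging code:

```python
# Toy model of the delivery pattern quoted above -- a sketch of the
# semantics only, not the oslo.messaging API. The broker keeps one
# "non-pool" queue plus one queue per pool name, and every published
# notification is copied to all of them. When every listener names a
# pool, nothing drains the non-pool queue and it grows without bound,
# which matches the behaviour reported in this bug.
from collections import defaultdict


class ToyNotificationBroker:
    def __init__(self):
        self.non_pool = []              # copy for listeners with no pool name
        self.pools = defaultdict(list)  # one queue per pool name
        self.subscribed_pools = set()   # pools that have a consumer

    def subscribe(self, pool):
        self.subscribed_pools.add(pool)
        self.pools[pool]  # ensure the pool queue exists

    def publish(self, msg):
        self.non_pool.append(msg)
        for queue in self.pools.values():
            queue.append(msg)

    def drain(self):
        # Each subscribed pool consumes its own queue; since no listener
        # without a pool exists, non_pool is never consumed.
        for pool in self.subscribed_pools:
            self.pools[pool].clear()


broker = ToyNotificationBroker()
broker.subscribe("agent-uuid-1")  # every agent subscribes with a pool name
broker.subscribe("agent-uuid-2")
for _ in range(3):
    broker.publish("heartbeat")
broker.drain()
print(len(broker.non_pool))  # 3 -- unconsumed copies keep accumulating
```

In this model the fix corresponds to either adding a consumer for the non-pool queue, or marking it auto_delete, which is what the linked patches do.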


---
To clean up the queues:

 systemctl stop ironic-neutron-agent.service
 systemctl restart rabbitmq-server.service
 systemctl start ironic-neutron-agent.service

This will clean up the UUID queues left over from previous ironic-neutron-agent restarts.

 Note: this does not solve the problem of the ironic-neutron-agent-heartbeat.info queue not having a listener.



Another option may be to set [oslo_messaging_rabbit]/amqp_auto_delete = true in /etc/neutron.conf. This will make the UUID queues be deleted automatically after an ironic-neutron-agent restart. (However, I'm not sure how that might affect other Neutron services.)
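As a sketch, that setting would look like this in the config file's [oslo_messaging_rabbit] section (the comment text is mine, and the caveat about other services sharing the file still applies):

```ini
[oslo_messaging_rabbit]
# Delete queues automatically when their last consumer disconnects,
# so the per-restart UUID queues do not pile up. This applies to every
# service reading this config file, not just ironic-neutron-agent.
amqp_auto_delete = true
```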

Comment 5 Harald Jensås 2019-02-04 14:42:55 UTC
I've changed component to python-oslo-messaging.

Issue b) is a scenario that, according to the oslo docs[1], can happen: when no listener without a pool exists, the notification queue isn't consumed and will fill up. I've opened a bug for this: https://bugs.launchpad.net/oslo.messaging/+bug/1814544. Either oslo.messaging should ensure that the queue is drained when all listeners use a pool, or the docs should be updated to inform users that they need to consume the non-pool queue as well.


[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/notify/listener.py#L31-L33

Comment 7 Harald Jensås 2019-02-05 13:55:42 UTC
I believe the patches https://review.openstack.org/634850 and https://review.openstack.org/634932 fix both problems.

Here is an example:

Starting two instances of the agent:
------------------------------------

Feb 05 14:37:56 ironic-devstack.lab.example.com ironic-neutron-agent[11716]: INFO networking_baremetal.agent.ironic_neutron_agent [-] Starting agent networking-baremetal.
Feb 05 14:37:56 ironic-devstack.lab.example.com ironic-neutron-agent[11716]: INFO networking_baremetal.agent.ironic_neutron_agent [-] Adding member id 6cd60d72-3960-4b2d-9f2f-e2eb38eacb9d on host ironic-devstack.lab.example.com to hashring.
Feb 05 14:38:59 ironic-devstack.lab.example.com ironic-neutron-agent[11716]: INFO networking_baremetal.agent.ironic_neutron_agent [-] Adding member id 7eb4f7d3-b458-4d44-8a98-6af3d27ae658 on host ironic-devstack.lab.example.com to hashring.


  We get two pool queues, each with one consumer, and both agents are consuming the non-pool queue:

[stack@ironic-devstack ~]$  sudo rabbitmqctl list_queues name messages consumers auto_delete |egrep 'ironic-neutron-agent'
+--------------------------------------------------------------------------+----------+-----------+------------+
| queue name                                                               | messages | consumers | auto_delete|
+--------------------------------------------------------------------------+----------+-----------+------------+
| ironic-neutron-agent-heartbeat-pool-6cd60d72-3960-4b2d-9f2f-e2eb38eacb9d |        0 |         1 |       true |
| ironic-neutron-agent-heartbeat.info                                      |        0 |         2 |       true |
| ironic-neutron-agent-heartbeat-pool-7eb4f7d3-b458-4d44-8a98-6af3d27ae658 |        0 |         1 |       true |
+--------------------------------------------------------------------------+----------+-----------+------------+

Kill one of the agent instances:
--------------------------------

Feb 05 14:41:56 ironic-devstack.lab.example.com ironic-neutron-agent[11716]: INFO networking_baremetal.agent.ironic_neutron_agent [-] Removing member 7eb4f7d3-b458-4d44-8a98-6af3d27ae658 on host ironic-devstack.lab.example.com from hashring.

  The pool queue for the agent we killed is automatically deleted, as its consumer is no longer present, and we still have 1 consumer on the non-pool queue:

[stack@ironic-devstack ~]$  sudo rabbitmqctl list_queues name messages consumers auto_delete |egrep 'ironic-neutron-agent'
+--------------------------------------------------------------------------+----------+-----------+------------+
| queue name                                                               | messages | consumers | auto_delete|
+--------------------------------------------------------------------------+----------+-----------+------------+
| ironic-neutron-agent-heartbeat-pool-6cd60d72-3960-4b2d-9f2f-e2eb38eacb9d |        0 |         1 |       true |
| ironic-neutron-agent-heartbeat.info                                      |        0 |         1 |       true |
+--------------------------------------------------------------------------+----------+-----------+------------+

Comment 10 Harald Jensås 2019-02-08 12:49:15 UTC
As a workaround set up a cron job that purges the ironic-neutron-agent-heartbeat.info queue at regular intervals. 

Command to purge the queue is: 

  rabbitmqctl purge_queue ironic-neutron-agent-heartbeat.info
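For example, a crontab entry for that workaround could look like this (the file name, hourly interval, and root user are assumptions, not part of the original workaround):

```
# /etc/cron.d/purge-ironic-agent-queue -- hypothetical file name.
# Purge the unconsumed heartbeat queue hourly; tune the interval to
# how quickly the queue grows in your environment.
0 * * * * root rabbitmqctl purge_queue ironic-neutron-agent-heartbeat.info
```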

Comment 11 Harald Jensås 2019-02-26 09:49:57 UTC
The changes to fix this have been merged upstream and backported to the upstream stable branches.

Comment 20 errata-xmlrpc 2019-07-10 13:01:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1744

