Bug 1397110 - Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications.audit' in vhost '/' due to timeout
Summary: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifi...
Keywords:
Status: CLOSED DUPLICATE of bug 1397393
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ga
: 10.0 (Newton)
Assignee: Peter Lemenkov
QA Contact: Asaf Hirshberg
URL:
Whiteboard:
Depends On:
Blocks: 1418668
TreeView+ depends on / blocked
 
Reported: 2016-11-21 16:13 UTC by Yurii Prokulevych
Modified: 2022-08-16 13:51 UTC (History)
15 users (show)

Fixed In Version: rabbitmq-server-3.6.3-6.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1418668 (view as bug list)
Environment:
Last Closed: 2016-12-07 14:05:47 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github rabbitmq rabbitmq-server pull 918 0 None closed Handle unexpected gm group alterations prior to dead pid removals from queue 2020-12-07 14:06:49 UTC
Github rabbitmq rabbitmq-server pull 946 0 None closed [OCF HA] Delete Mnesia schema on mnesia reset 2020-12-07 14:06:47 UTC
Github rabbitmq rabbitmq-server pull 951 0 None closed Handle not found outside of verify 2020-12-07 14:06:47 UTC
Github rabbitmq rabbitmq-server pull 970 0 None closed Stomp GM with reason `shutdown` in more cases 2020-12-07 14:07:16 UTC
Launchpad 1620649 0 None None None 2016-11-22 14:06:25 UTC
Red Hat Issue Tracker OSP-4584 0 None None None 2022-08-16 13:51:56 UTC

Description Yurii Prokulevych 2016-11-21 16:13:38 UTC
Description of problem:
-----------------------
After upgrade from RHOS-9 to RHOS-10 a bunch of errors in ceilometer/agent-notification.log

    2016-11-21 15:23:26.482 290254 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to declare consumer for topic 'notifications.audit': Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'not
    ifications.audit' in vhost '/' due to timeout
    2016-11-21 15:23:26.483 290254 ERROR cotyledon [-] Unhandled exception
    2016-11-21 15:23:26.483 290254 ERROR cotyledon Traceback (most recent call last):
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 52, in _exit_on_exception
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     yield
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 130, in _run
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.run()
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/ceilometer/notification.py", line 184, in run
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.event_pipe_manager)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/ceilometer/notification.py", line 260, in _configure_main_queue_listeners
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     listener.start()
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 268, in wrapper
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     log_after, timeout_timer)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 188, in run_once
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     post_fn = fn()
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 267, in <lambda>
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     states[state].run_once(lambda: fn(self, *args, **kwargs),
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 420, in start
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.listener = self._create_listener()
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/notify/listener.py", line 136, in _create_listener
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     self._batch_timeout
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 122, in _listen_for_notifications
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     targets_and_priorities, pool, batch_size, batch_timeout
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 496, in listen_for_notifications
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     callback=listener, queue_name=pool)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1132, in declare_topic_consumer
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.declare_consumer(consumer)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1040, in declare_consumer
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     error_callback=_connect_error)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 814, in ensure
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     ret, channel = autoretry_method()
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 436, in _ensured
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     return fun(*args, **kwargs)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 508, in __call__
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     return fun(*args, channel=channels[0], **kwargs), channels[0]
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 790, in execute_method
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     method()
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1028, in _declare_consumer
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     consumer.declare(self)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 310, in declare
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.queue.declare()
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 522, in declare
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.queue_declare(nowait, passive=False)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 548, in queue_declare
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     nowait=nowait)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1258, in queue_declare
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     (50, 11),  # Channel.queue_declare_ok
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 69, in wait
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     return self.dispatch_method(method_sig, args, content)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 87, in dispatch_method
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     return amqp_method(self, args)
    2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
    2016-11-21 15:23:26.483 290254 ERROR cotyledon     reply_code, reply_text, (class_id, method_id), ChannelError,
    2016-11-21 15:23:26.483 290254 ERROR cotyledon NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications.audit' in vhost '/' due to timeout
    2016-11-21 15:23:26.483 290254 ERROR cotyledon
    2016-11-21 15:23:26.494 236155 INFO cotyledon [-] Child 290254 exited with status 2


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
rabbitmq-server-3.6.3-5.el7ost.noarch
puppet-rabbitmq-5.6.0-1.057a013git.el7ost.noarch

python-oslo-messaging-5.10.0-5.el7ost.noarch
python-cotyledon-1.2.7-2.el7ost.noarch

Steps to Reproduce:
1. After running 'convergence' step check ceilometer/agent-notification.log


Actual results:
---------------
Traceback in log. No new measures are processed by ceilometer.

Additional info:
----------------
Virtual setup: 3controllers + 2computes +3ceph.
Queue notifications.sample was growing.

Attempt to list consumers was falling with next message:

    rabbitmqctl list_consumers
     
    Listing consumers ...
    Error: {badrpc,
               {'EXIT',
                   {noproc,
                       {gen_server2,call,[<10457.15686.0>,consumers,infinity]}}}}

Comment 2 Yurii Prokulevych 2016-11-21 16:21:38 UTC
Forgot to mention overcloud is with SSL and IPv6.

Comment 5 Peter Lemenkov 2016-11-22 13:41:09 UTC
First issue I've just found is this one:

https://github.com/rabbitmq/rabbitmq-server/issues/914

It was fixed upstream, and available since 3.6.6 version (see PR 918). This is a part of the log which shows that the cluster is affected by this one:

** Reason for termination ==
** {{bad_return_value,
        {error,
            {function_clause,
                [{rabbit_mirror_queue_misc,promote_slave,
                     [[]],
                     [{file,"src/rabbit_mirror_queue_misc.erl"},{line,282}]},
                 {rabbit_mirror_queue_misc,'-remove_from_queue/3-fun-1-',3,
                     [{file,"src/rabbit_mirror_queue_misc.erl"},{line,93}]},
                 {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
                 {mnesia_tm,execute_transaction,5,
                     [{file,"mnesia_tm.erl"},{line,808}]},
                 {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
                     [{file,"src/rabbit_misc.erl"},{line,532}]},
                 {worker_pool_worker,'-run/2-fun-0-',3,
                     [{file,"src/worker_pool_worker.erl"},{line,77}]}]}}},
    {gen_server2,call,
        [<24684.21919.0>,{add_on_right,{14,<0.28037.0>}},infinity]}}

There might be more issues, and I'll update this ticket if I find anything.

Comment 6 Peter Lemenkov 2016-11-22 13:42:26 UTC
We need only this commit:

https://github.com/rabbitmq/rabbitmq-server/pull/918/commits/a416060

Comment 7 Peter Lemenkov 2016-11-22 13:48:02 UTC
The same issue:

** Reason for termination ==
** {{{case_clause,{ok,<24684.21670.0>,[<24676.26042.0>],[]}},
     [{rabbit_mirror_queue_coordinator,handle_cast,2,
          [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,351}]},
      {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]},
    {gen_server2,call,
        [<24676.28451.0>,{add_on_right,{19,<0.28396.0>}},infinity]}}

Comment 8 Peter Lemenkov 2016-11-22 14:08:03 UTC
Another one issue is related to RabbitMQ OCF script. Similar issue for the different OCF scrip was described here:

https://bugs.launchpad.net/fuel/+bug/1620649

It was addressed upstream in PR 946.

Comment 9 Peter Lemenkov 2016-11-22 14:11:39 UTC
Btw it looks like networking wasn't reliable:

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
node 'rabbit@controller-0' down: etimedout

=WARNING REPORT==== 21-Nov-2016::14:37:29 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
node 'rabbit@controller-0' up

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
rabbit on node 'rabbit@controller-2' down

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
Stopping RabbitMQ

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
stopped TCP Listener on [FD00:FD00:FD00:2000::15]:5672

Comment 10 Peter Lemenkov 2016-11-22 14:34:59 UTC
Another one issue:

** Reason for termination ==
** {bad_return_value,
       {error,
           {{badmatch,{error,not_found}},
            [{gm,'-record_dead_member_in_group/5-fun-1-',4,
                 [{file,"src/gm.erl"},{line,1129}]},
             {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
             {mnesia_tm,execute_transaction,5,
                 [{file,"mnesia_tm.erl"},{line,808}]},
             {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
                 [{file,"src/rabbit_misc.erl"},{line,532}]},
             {worker_pool_worker,'-run/2-fun-0-',3,
                 [{file,"src/worker_pool_worker.erl"},{line,77}]}]}}}

It was fixed upstream in this PR:

https://github.com/rabbitmq/rabbitmq-server/pull/951

Comment 11 Peter Lemenkov 2016-11-22 14:38:21 UTC
Yet another one issue:

** Reason for termination ==
** {{badmatch,{error,{{duplicate_live_master,'rabbit@controller-1'},
                      {gen_server2,call,
                                   [<0.24797.0>,
                                    {add_on_right,{1,<0.25919.0>}},
                                    infinity]}}}},
    [{rabbit_mirror_queue_master,init_with_existing_bq,3,
                                 [{file,"src/rabbit_mirror_queue_master.erl"},
                                  {line,104}]},
     {rabbit_mirror_queue_master,init,3,
                                 [{file,"src/rabbit_mirror_queue_master.erl"},
                                  {line,99}]},
     {rabbit_amqqueue_process,init_it2,3,
                              [{file,"src/rabbit_amqqueue_process.erl"},
                               {line,194}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

This was addressed upstream in PR 970, and available since ver. 3.6.6

We interested in this commit partcularly:

https://github.com/rabbitmq/rabbitmq-server/commit/4005a5a

Comment 12 Peter Lemenkov 2016-11-22 15:06:11 UTC
This issue was addressed in PR 918:

** Reason for termination == 
** {{bad_return_value,
        {error,
            {function_clause,
                [{gm,check_membership,
                     [{5,<24678.21141.0>},{error,not_found}],
                     [{file,"src/gm.erl"},{line,1590}]},
                 {gm,'-record_new_member_in_group/4-fun-1-',3,
                     [{file,"src/gm.erl"},{line,1167}]},
                 {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
                 {mnesia_tm,execute_transaction,5,
                     [{file,"mnesia_tm.erl"},{line,808}]},
                 {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
                     [{file,"src/rabbit_misc.erl"},{line,532}]},
                 {worker_pool_worker,'-run/2-fun-0-',3,
                     [{file,"src/worker_pool_worker.erl"},{line,77}]}]}}},
    {gen_server2,call,
        [<24678.21141.0>,{add_on_right,{5,<0.20098.0>}},infinity]}}

This commit addresses it:

https://github.com/rabbitmq/rabbitmq-server/pull/918/commits/53f10c9

Comment 13 Peter Lemenkov 2016-11-22 15:15:05 UTC
Ok, that's all for now. There are other issues, just grep for duplicate_live_master and gm_deaths, but they are likely related to these ones. I guess if we address these listed above, then all  other will go away.

My proposal is to backport all six patches from PRs 918, 970 to RabbitMQ 3.6.3 we're using. Then I'll adapt fix proposed in PR 946 to our resource-agent' script.

Comment 14 Peter Lemenkov 2016-11-22 15:48:37 UTC
Almost done. All RabbitMQ-related patches applied in this build - rabbitmq-server-3.6.3-6.el7ost.

Comment 15 Peter Lemenkov 2016-11-22 15:51:41 UTC
(In reply to Peter Lemenkov from comment #8)
> Another one issue is related to RabbitMQ OCF script. Similar issue for the
> different OCF scrip was described here:
> 
> https://bugs.launchpad.net/fuel/+bug/1620649
> 
> It was addressed upstream in PR 946.

For the fix for this issue please go to bug 1397393. Just for the record it can be easily spotted by the following log messages:

=ERROR REPORT==== 21-Nov-2016::12:32:38 ===
Mnesia('rabbit@controller-1'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@controller-1_1479_731558_864930")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-2','rabbit@controller-1','rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479490843523035021,-576460752303423006,1},'rabbit@controller-0'},{{4,0},{'rabbit@controller-2',{1479,490892,472236}}}}, 'rabbit@controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479731540535703648,-576460752303422975,1},'rabbit@controller-0'},{{2,0},[]}}

Comment 16 Peter Lemenkov 2016-11-22 16:06:49 UTC
Regarding this (see comment 1):

Attempt to list consumers was falling with next message:

    rabbitmqctl list_consumers
     
    Listing consumers ...
    Error: {badrpc,
               {'EXIT',
                   {noproc,
                       {gen_server2,call,[<10457.15686.0>,consumers,infinity]}}}}

This looks like RabbitMQ is in a failed state. I believe that if we address all the issues mentioned above, then this one will magically gone.

Comment 29 Udi Shkalim 2016-12-07 14:05:47 UTC

*** This bug has been marked as a duplicate of bug 1397393 ***

Comment 30 Peter Lemenkov 2016-12-07 14:13:13 UTC
Just an explanation of the comment 29. We believe we've fixed all the issues related to RabbitMQ itself. Still one issue related to resource-agents remains. We'll address it in bug 1397393 (thus we closed this ticket as a duplicate of that one).


Note You need to log in before you can comment on or make changes to this bug.