Description of problem:
-----------------------
After an upgrade from RHOS-9 to RHOS-10 there is a bunch of errors in ceilometer/agent-notification.log:

2016-11-21 15:23:26.482 290254 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to declare consumer for topic 'notifications.audit': Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications.audit' in vhost '/' due to timeout
2016-11-21 15:23:26.483 290254 ERROR cotyledon [-] Unhandled exception
2016-11-21 15:23:26.483 290254 ERROR cotyledon Traceback (most recent call last):
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 52, in _exit_on_exception
2016-11-21 15:23:26.483 290254 ERROR cotyledon     yield
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 130, in _run
2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.run()
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/ceilometer/notification.py", line 184, in run
2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.event_pipe_manager)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/ceilometer/notification.py", line 260, in _configure_main_queue_listeners
2016-11-21 15:23:26.483 290254 ERROR cotyledon     listener.start()
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 268, in wrapper
2016-11-21 15:23:26.483 290254 ERROR cotyledon     log_after, timeout_timer)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 188, in run_once
2016-11-21 15:23:26.483 290254 ERROR cotyledon     post_fn = fn()
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 267, in <lambda>
2016-11-21 15:23:26.483 290254 ERROR cotyledon     states[state].run_once(lambda: fn(self, *args, **kwargs),
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 420, in start
2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.listener = self._create_listener()
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/notify/listener.py", line 136, in _create_listener
2016-11-21 15:23:26.483 290254 ERROR cotyledon     self._batch_timeout
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 122, in _listen_for_notifications
2016-11-21 15:23:26.483 290254 ERROR cotyledon     targets_and_priorities, pool, batch_size, batch_timeout
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 496, in listen_for_notifications
2016-11-21 15:23:26.483 290254 ERROR cotyledon     callback=listener, queue_name=pool)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1132, in declare_topic_consumer
2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.declare_consumer(consumer)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1040, in declare_consumer
2016-11-21 15:23:26.483 290254 ERROR cotyledon     error_callback=_connect_error)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 814, in ensure
2016-11-21 15:23:26.483 290254 ERROR cotyledon     ret, channel = autoretry_method()
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 436, in _ensured
2016-11-21 15:23:26.483 290254 ERROR cotyledon     return fun(*args, **kwargs)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 508, in __call__
2016-11-21 15:23:26.483 290254 ERROR cotyledon     return fun(*args, channel=channels[0], **kwargs), channels[0]
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 790, in execute_method
2016-11-21 15:23:26.483 290254 ERROR cotyledon     method()
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1028, in _declare_consumer
2016-11-21 15:23:26.483 290254 ERROR cotyledon     consumer.declare(self)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 310, in declare
2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.queue.declare()
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 522, in declare
2016-11-21 15:23:26.483 290254 ERROR cotyledon     self.queue_declare(nowait, passive=False)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 548, in queue_declare
2016-11-21 15:23:26.483 290254 ERROR cotyledon     nowait=nowait)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1258, in queue_declare
2016-11-21 15:23:26.483 290254 ERROR cotyledon     (50, 11), # Channel.queue_declare_ok
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 69, in wait
2016-11-21 15:23:26.483 290254 ERROR cotyledon     return self.dispatch_method(method_sig, args, content)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 87, in dispatch_method
2016-11-21 15:23:26.483 290254 ERROR cotyledon     return amqp_method(self, args)
2016-11-21 15:23:26.483 290254 ERROR cotyledon   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
2016-11-21 15:23:26.483 290254 ERROR cotyledon     reply_code, reply_text, (class_id, method_id), ChannelError,
2016-11-21 15:23:26.483 290254 ERROR cotyledon NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications.audit' in vhost '/' due to timeout
2016-11-21 15:23:26.483 290254 ERROR cotyledon
2016-11-21 15:23:26.494 236155 INFO cotyledon [-] Child 290254 exited with status 2

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
rabbitmq-server-3.6.3-5.el7ost.noarch
puppet-rabbitmq-5.6.0-1.057a013git.el7ost.noarch
python-oslo-messaging-5.10.0-5.el7ost.noarch
python-cotyledon-1.2.7-2.el7ost.noarch

Steps to Reproduce:
1. After running the 'convergence' step, check ceilometer/agent-notification.log.

Actual results:
---------------
Traceback in the log. No new measures are processed by ceilometer.

Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph.
The queue notifications.sample was growing. An attempt to list consumers was failing with the following message:

rabbitmqctl list_consumers
Listing consumers ...
Error: {badrpc,
        {'EXIT',
            {noproc,
                {gen_server2,call,[<10457.15686.0>,consumers,infinity]}}}}
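For triage, the affected queues can be pulled out of the log mechanically. This is only a sketch: the regex below matches the oslo.messaging error text quoted above, and the function name is mine, not part of any OpenStack tooling.

```python
import re

# Matches oslo.messaging's "(404) NOT_FOUND" queue-declare errors and
# captures the name of the queue the consumer failed to declare.
NOT_FOUND_RE = re.compile(
    r"NOT_FOUND - failed to perform operation on queue '([^']+)'")

def find_missing_queues(log_lines):
    """Return the set of queue names that hit a 404 on Queue.declare."""
    return {m.group(1)
            for m in map(NOT_FOUND_RE.search, log_lines) if m}

# Demo on the error line from this report:
sample = ("2016-11-21 15:23:26.482 290254 ERROR "
          "oslo.messaging._drivers.impl_rabbit [-] Failed to declare "
          "consumer for topic 'notifications.audit': Queue.declare: (404) "
          "NOT_FOUND - failed to perform operation on queue "
          "'notifications.audit' in vhost '/' due to timeout")
print(find_missing_queues([sample]))  # {'notifications.audit'}
```

Running it over the whole agent-notification.log shows at a glance which notification queues the broker lost.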
Forgot to mention: the overcloud is deployed with SSL and IPv6.
The first issue I've found is this one:

https://github.com/rabbitmq/rabbitmq-server/issues/914

It was fixed upstream and is available since version 3.6.6 (see PR 918). This part of the log shows that the cluster is affected by it:

** Reason for termination ==
** {{bad_return_value,
       {error,
           {function_clause,
               [{rabbit_mirror_queue_misc,promote_slave,
                    [[]],
                    [{file,"src/rabbit_mirror_queue_misc.erl"},{line,282}]},
                {rabbit_mirror_queue_misc,'-remove_from_queue/3-fun-1-',3,
                    [{file,"src/rabbit_mirror_queue_misc.erl"},{line,93}]},
                {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
                {mnesia_tm,execute_transaction,5,
                    [{file,"mnesia_tm.erl"},{line,808}]},
                {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
                    [{file,"src/rabbit_misc.erl"},{line,532}]},
                {worker_pool_worker,'-run/2-fun-0-',3,
                    [{file,"src/worker_pool_worker.erl"},{line,77}]}]}}},
   {gen_server2,call,
       [<24684.21919.0>,{add_on_right,{14,<0.28037.0>}},infinity]}}

There might be more issues; I'll update this ticket if I find anything else.
We need only this commit: https://github.com/rabbitmq/rabbitmq-server/pull/918/commits/a416060
The same issue:

** Reason for termination ==
** {{{case_clause,{ok,<24684.21670.0>,[<24676.26042.0>],[]}},
     [{rabbit_mirror_queue_coordinator,handle_cast,2,
          [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,351}]},
      {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]},
   {gen_server2,call,
       [<24676.28451.0>,{add_on_right,{19,<0.28396.0>}},infinity]}}
Another issue is related to the RabbitMQ OCF script. A similar issue for a different OCF script was described here:

https://bugs.launchpad.net/fuel/+bug/1620649

It was addressed upstream in PR 946.
Btw it looks like the networking wasn't reliable:

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
node 'rabbit@controller-0' down: etimedout

=WARNING REPORT==== 21-Nov-2016::14:37:29 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
node 'rabbit@controller-0' up

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
rabbit on node 'rabbit@controller-2' down

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
Stopping RabbitMQ

=INFO REPORT==== 21-Nov-2016::14:37:29 ===
stopped TCP Listener on [FD00:FD00:FD00:2000::15]:5672
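To confirm the flapping timeline, the node up/down events can be extracted from the report log. A minimal sketch; the "=INFO REPORT==== <date> ===" header format is what RabbitMQ's log shows above, but the helper itself is hypothetical, not an existing tool:

```python
import re

# Pair each report header's timestamp with the "node ... up/down" line
# that follows it, skipping unrelated reports.
EVENT_RE = re.compile(
    r"=(?:INFO|WARNING) REPORT==== (\S+) ===\s*\n"
    r"(node '[^']+' (?:up|down)[^\n]*)")

def node_events(log_text):
    """Return (timestamp, event) tuples for node membership changes."""
    return EVENT_RE.findall(log_text)

sample = ("=INFO REPORT==== 21-Nov-2016::14:37:29 ===\n"
          "node 'rabbit@controller-0' down: etimedout\n"
          "=INFO REPORT==== 21-Nov-2016::14:37:29 ===\n"
          "node 'rabbit@controller-0' up\n")
for when, event in node_events(sample):
    print(when, event)
```

A down/up pair within the same second, as in the log above, is a strong hint of a transient partition rather than a real node failure.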
Another issue:

** Reason for termination ==
** {bad_return_value,
      {error,
          {{badmatch,{error,not_found}},
           [{gm,'-record_dead_member_in_group/5-fun-1-',4,
                [{file,"src/gm.erl"},{line,1129}]},
            {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
            {mnesia_tm,execute_transaction,5,
                [{file,"mnesia_tm.erl"},{line,808}]},
            {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
                [{file,"src/rabbit_misc.erl"},{line,532}]},
            {worker_pool_worker,'-run/2-fun-0-',3,
                [{file,"src/worker_pool_worker.erl"},{line,77}]}]}}}

It was fixed upstream in this PR:

https://github.com/rabbitmq/rabbitmq-server/pull/951
Yet another issue:

** Reason for termination ==
** {{badmatch,{error,{{duplicate_live_master,'rabbit@controller-1'},
                      {gen_server2,call,
                          [<0.24797.0>,
                           {add_on_right,{1,<0.25919.0>}},
                           infinity]}}}},
    [{rabbit_mirror_queue_master,init_with_existing_bq,3,
         [{file,"src/rabbit_mirror_queue_master.erl"},{line,104}]},
     {rabbit_mirror_queue_master,init,3,
         [{file,"src/rabbit_mirror_queue_master.erl"},{line,99}]},
     {rabbit_amqqueue_process,init_it2,3,
         [{file,"src/rabbit_amqqueue_process.erl"},{line,194}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

This was addressed upstream in PR 970 and is available since version 3.6.6. We are particularly interested in this commit:

https://github.com/rabbitmq/rabbitmq-server/commit/4005a5a
This issue was addressed in PR 918:

** Reason for termination ==
** {{bad_return_value,
       {error,
           {function_clause,
               [{gm,check_membership,
                    [{5,<24678.21141.0>},{error,not_found}],
                    [{file,"src/gm.erl"},{line,1590}]},
                {gm,'-record_new_member_in_group/4-fun-1-',3,
                    [{file,"src/gm.erl"},{line,1167}]},
                {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
                {mnesia_tm,execute_transaction,5,
                    [{file,"mnesia_tm.erl"},{line,808}]},
                {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
                    [{file,"src/rabbit_misc.erl"},{line,532}]},
                {worker_pool_worker,'-run/2-fun-0-',3,
                    [{file,"src/worker_pool_worker.erl"},{line,77}]}]}}},
   {gen_server2,call,
       [<24678.21141.0>,{add_on_right,{5,<0.20098.0>}},infinity]}}

This commit addresses it:

https://github.com/rabbitmq/rabbitmq-server/pull/918/commits/53f10c9
Ok, that's all for now. There are other issues (just grep for duplicate_live_master and gm_deaths), but they are most likely related to the ones listed above. I expect that if we address those, all the others will go away. My proposal is to backport all six patches from PRs 918 and 970 to the RabbitMQ 3.6.3 we're using. Then I'll adapt the fix proposed in PR 946 to our resource-agents script.
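The grep suggested above can be scripted for a quick triage pass. A sketch only: the signature strings are taken from the tracebacks in this ticket, while the function and the mapping to PRs are my own shorthand.

```python
# Count occurrences of the crash signatures discussed in this ticket.
# Which PR fixes which signature is annotated per the comments above.
SIGNATURES = (
    "promote_slave",                  # issue 914, fixed in PR 918
    "check_membership",               # also PR 918
    "duplicate_live_master",          # PR 970
    "record_dead_member_in_group",    # PR 951
    "gm_deaths",
)

def count_signatures(log_text, signatures=SIGNATURES):
    """Map each known crash signature to its hit count in the log."""
    return {sig: log_text.count(sig) for sig in signatures}

sample = ("** {{bad_return_value, {error, {function_clause, "
          "[{rabbit_mirror_queue_misc,promote_slave, [[]], ...")
counts = count_signatures(sample)
print(counts["promote_slave"])  # 1
```

Non-zero counts after applying the backports would mean a signature we have not yet accounted for.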
Almost done. All RabbitMQ-related patches are applied in this build: rabbitmq-server-3.6.3-6.el7ost.
(In reply to Peter Lemenkov from comment #8)
> Another one issue is related to RabbitMQ OCF script. Similar issue for the
> different OCF scrip was described here:
>
> https://bugs.launchpad.net/fuel/+bug/1620649
>
> It was addressed upstream in PR 946.

For the fix for this issue please go to bug 1397393. Just for the record, it can be easily spotted by the following log messages:

=ERROR REPORT==== 21-Nov-2016::12:32:38 ===
Mnesia('rabbit@controller-1'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@controller-1_1479_731558_864930")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec:
 'rabbit@controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-2','rabbit@controller-1','rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479490843523035021,-576460752303423006,1},'rabbit@controller-0'},{{4,0},{'rabbit@controller-2',{1479,490892,472236}}}},
 'rabbit@controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479731540535703648,-576460752303422975,1},'rabbit@controller-0'},{{2,0},[]}}
Regarding this (see comment 1):

> Attempt to list consumers was failing with the following message:
>
> rabbitmqctl list_consumers
> Listing consumers ...
> Error: {badrpc,
>         {'EXIT',
>             {noproc,
>                 {gen_server2,call,[<10457.15686.0>,consumers,infinity]}}}}

This looks like RabbitMQ being in a failed state. I believe that if we address all the issues mentioned above, this one will magically go away.
*** This bug has been marked as a duplicate of bug 1397393 ***
Just an explanation of comment 29: we believe we've fixed all the issues related to RabbitMQ itself. One issue related to resource-agents still remains; we'll address it in bug 1397393 (hence we closed this ticket as a duplicate of that one).