Bug 1585032
| Summary: | HA test failed because it is unable to create an instance due to volume in error status "Unknown" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Raoul Scarazzini <rscarazz> |
| Component: | rabbitmq-server | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED EOL | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 12.0 (Pike) | CC: | abishop, apevec, berrange, cschwede, dasmith, eglynn, fpercoco, geguileo, jeckersb, jhakimra, kchamart, lhh, michele, mkrcmari, sbauza, sgordon, srevivo, vromanso |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | 12.0 (Pike) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-11 16:26:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Raoul Scarazzini
2018-06-01 07:55:53 UTC
sosreports: http://file.rdu.redhat.com/~rscarazz/BZ1585032/

The full error from the log is:

2018-05-31 17:34:14.337 1 INFO heat.engine.resource [req-323afbf1-f845-4160-9295-e47279165618 - admin - default default] CREATE: CinderVolume "instance_volume" [8c12f87c-6c15-4263-a4ae-c955ec66e34f] Stack "stack_test_pacemaker-light-a" [90029a02-0373-4283-bdef-5ac19e794b5e]
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource Traceback (most recent call last):
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 831, in _action_recorder
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource yield
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 939, in _do_action
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource step = next(subtask)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 890, in action_handler_task
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource done = check(handler_data)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/cinder/volume.py", line 318, in check_create_complete
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource complete = super(CinderVolume, self).check_create_complete(vol_id)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/volume_base.py", line 56, in check_create_complete
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource resource_status=vol.status)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource ResourceInError: Went to status error due to "Unknown"

Looking at the related log in cinder-volume, we see that after the test described in the first comment the status of the resource seems fine:

2018-05-31 17:32:30.609 118015 INFO cinder.volume.manager [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Driver initialization completed successfully.
2018-05-31 17:32:30.615 118015 INFO cinder.manager [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Initiating service 5 cleanup
2018-05-31 17:32:30.618 118015 INFO cinder.manager [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Service 5 cleanup completed.

But then something goes wrong with the messaging layer:

2018-05-31 17:32:39.224 118015 ERROR oslo.messaging._drivers.impl_rabbit [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Failed to declare consumer for topic 'cinder-volume.hostgroup': (0, 0): (541) INTERNAL_ERROR: InternalError: (0, 0): (541) INTERNAL_ERROR
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Error starting thread.: InternalError: (0, 0): (541) INTERNAL_ERROR
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service Traceback (most recent call last):
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 721, in run_service
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service service.start()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/cinder/service.py", line 244, in start
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.backend_rpcserver.start()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 270, in wrapper
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service log_after, timeout_timer)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 190, in run_once
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service post_fn = fn()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 269, in <lambda>
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service states[state].run_once(lambda: fn(self, *args, **kwargs),
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 416, in start
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.listener = self._create_listener()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 148, in _create_listener
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return self.transport._listen(self._target, 1, None)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 138, in _listen
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service batch_timeout)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 583, in listen
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service callback=listener)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1126, in declare_topic_consumer
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.declare_consumer(consumer)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1028, in declare_consumer
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service error_callback=_connect_error)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 807, in ensure
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service ret, channel = autoretry_method()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 494, in _ensured
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return fun(*args, **kwargs)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 570, in __call__
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return fun(*args, channel=channels[0], **kwargs), channels[0]
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 796, in execute_method
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service method()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1016, in _declare_consumer
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service consumer.declare(self)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 304, in declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.queue.declare()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 605, in declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self._create_queue(nowait=nowait, channel=channel)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 614, in _create_queue
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.queue_declare(nowait=nowait, passive=False, channel=channel)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 649, in queue_declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service nowait=nowait,
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1165, in queue_declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service spec.Queue.DeclareOk, returns_tuple=True,
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 93, in wait
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.connection.drain_events(timeout=timeout)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 465, in drain_events
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return self.blocking_read(timeout)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 470, in blocking_read
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return self.on_inbound_frame(frame)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service callback(channel, method_sig, buf, None)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 474, in on_inbound_method
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service method_sig, payload, content,
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service listener(*args)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 596, in _on_close
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service (class_id, method_id), ConnectionError)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service InternalError: (0, 0): (541) INTERNAL_ERROR
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service
2018-05-31 17:33:09.314 118015 WARNING oslo_messaging.server [req-de6c2c04-706e-49ac-b019-895dd74a4797 - - - - -] Possible hang: wait is waiting for stop to complete

One very important note: from the cluster's point of view, Rabbit and all the resources are fine, so I'd guess this problem is limited to Cinder volume, mostly because the other components of the openstack-heat stack get created without any problem. Moreover, the fact that we're able to start the openstack-heat deployment at all proves the core resources are working fine: we don't move on with the openstack-heat deployment unless the overall status is up and running without failed actions.

I'm changing the component of this bug, because it feels more like a cinder-volume issue. This other bug might be related: https://bugzilla.redhat.com/show_bug.cgi?id=1568512

I'm guessing this might be due to a sequencing issue when cinder-volume is running under pacemaker. The c-vol logs indicate the core pacemaker resources (galera and rabbitmq) were stopped while c-vol was running, which makes me think all pacemaker resources were stopped at the same time. Likewise, when c-vol starts up, the "Failed to declare consumer for topic 'cinder-volume.hostgroup'" error suggests rabbitmq was not up. I'm thinking the correct test sequence should be something like this:

1. Stop systemd services
2. Stop non-core pacemaker resources (c-vol, manila-share, etc.)
3. Stop core pacemaker resources (galera, rabbitmq)
4. Start core pacemaker resources
5. Start non-core pacemaker resources
6. Start systemd services

I'm not so sure this is a Cinder problem. Error 541 INTERNAL_ERROR is coming from RabbitMQ itself, so it has nothing to do with Cinder. And we saw how it took over a minute of constant retries to actually connect to the RabbitMQ server.
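The six-step stop/start sequence proposed earlier can be sketched as a dry-run shell script. The resource and unit names (openstack-nova-compute, openstack-cinder-volume, rabbitmq-clone, galera-clone) are assumptions based on typical TripleO naming, not taken from this deployment; the commands are only echoed so the ordering can be inspected safely.

```shell
# Dry-run sketch of the proposed ordering; names are assumptions and may
# differ per deployment. Each echo prints the command that would run.
restart_sequence() {
    echo "systemctl stop openstack-nova-compute"          # 1. stop plain systemd services
    echo "pcs resource disable openstack-cinder-volume"   # 2. stop non-core pacemaker resources
    echo "pcs resource disable rabbitmq-clone"            # 3. stop core pacemaker resources last
    echo "pcs resource disable galera-clone"
    echo "pcs resource enable galera-clone"               # 4. start core resources first
    echo "pcs resource enable rabbitmq-clone"
    echo "pcs resource enable openstack-cinder-volume"    # 5. then non-core pacemaker resources
    echo "systemctl start openstack-nova-compute"         # 6. finally plain systemd services
}
restart_sequence
```

Piping the output through `sh` (or dropping the echoes) on a controller would execute the sequence for real.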
Looking into the RabbitMQ logs we can see this error:

=INFO REPORT==== 31-May-2018::17:29:48 ===
Mirrored queue 'cinder-volume.hostgroup' in vhost '/': Master <rabbit.5655.0> saw deaths of mirrors <rabbit.2265.0>

This could be related to https://github.com/rabbitmq/rabbitmq-server/issues/714, and it could also be related to:

=WARNING REPORT==== 31-May-2018::17:28:36 ===
closing AMQP connection <0.7711.0> (172.17.0.19:44990 -> 172.17.0.19:5672 - cinder-volume:152042:0222cf36-cb31-4c88-bfe2-cd27d9eaf0b9):
client unexpectedly closed TCP connection

That shouldn't happen, since Cinder is stopping properly:

2018-05-31 17:27:59.364 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Caught SIGTERM, stopping children
2018-05-31 17:27:59.365 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Waiting on 1 children to exit
2018-05-31 17:28:36.438 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Child 152042 exited with status 0

In any case, the main source of the problem is the internal RabbitMQ error. Also worth noting: Cinder volume seems to be the only service that has been restarted. All the other services (the ones that seem to work) haven't been restarted, including the Cinder scheduler and API.

Hi folks, thanks for your contribution. Let me clarify why we run these tests: we *want* to stop *just* the core resources, Rabbit and Galera, because we expect all the other resources (the systemd services and also the other ones managed by Pacemaker, like cinder-volume) to survive this kind of failure autonomously. This behavior is a requirement for all HA setups. Pacemaker no longer controls all the resources at that point, so they are "on their own". One last note: when Rabbit is stopped and then started, all the other resources, from Neutron to Heat, keep working correctly, respecting the expected behavior.
Raoul, you mention that only Rabbit and Galera are being stopped, but the logs suggest otherwise. The Cinder volume service is being stopped:

2018-05-31 17:27:59.364 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Caught SIGTERM, stopping children
2018-05-31 17:27:59.365 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Waiting on 1 children to exit
2018-05-31 17:28:36.438 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Child 152042 exited with status 0

Gorka, yes, you're right; sorry for the confusion. In the failing test [1] we intentionally stop the $OVERCLOUD_SYSTEMD_RESOURCES [2] and $OVERCLOUD_CORE_RESOURCES [3] resources. The openstack-cinder-volume resource is part of the systemd list (and is its only member), which means it gets stopped too. That said, it doesn't change what we expect from the test: the resource should be able to recover properly without any external intervention. Otherwise, we'll need to create a constraint tying it to the core resources, so that it is stopped together with them and started after them. Until today that was not necessary, and I still think the behavior we're expecting is the correct one.

[1] https://github.com/openstack/tripleo-ha-utils/blob/master/tools/ha-test-suite/test/test_pacemaker-light-a
[2] https://github.com/openstack/tripleo-ha-utils/blob/master/tools/ha-test-suite/ha-test-suite.sh#L58
[3] https://github.com/openstack/tripleo-ha-utils/blob/master/tools/ha-test-suite/ha-test-suite.sh#L56

I agree, it should be expected to run fine. But the error we are seeing is a RabbitMQ internal error on the queue used by Cinder. It shouldn't matter whether Cinder volume stops cleanly or dies abruptly, nor whether RabbitMQ is stopped or dies: we should never end up with a queue that cannot be used, which is what's happening here.
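The constraint Raoul mentions could take the form of a pcs ordering constraint. This is a hedged sketch: the resource names (rabbitmq-clone, openstack-cinder-volume) are assumptions from typical TripleO naming, and the command is echoed rather than executed.

```shell
# Hypothetical ordering constraint: start cinder-volume only after rabbitmq
# is up (and, symmetrically, stop it before rabbitmq stops). Echoed only;
# run the printed command on a controller to actually apply it.
constraint_cmd() {
    echo "pcs constraint order start rabbitmq-clone then openstack-cinder-volume"
}
constraint_cmd
```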
Reassigning to messaging for further debugging. The relevant cinder-volume connection is connected to controller-0, and we see the crash for the queue.declare in the log here:
=INFO REPORT==== 31-May-2018::17:32:39 ===
accepting AMQP connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672)
=INFO REPORT==== 31-May-2018::17:32:39 ===
Connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672) has a client-provided name: cinder-volume:118015:a9c3e789-694a-42b8-9104-6f11d93e2d0e
=ERROR REPORT==== 31-May-2018::17:32:39 ===
** Generic server <0.6483.0> terminating
** Last message in was {'$gen_cast',
{method,
{'queue.declare',0,
<<"cinder-volume.hostgroup">>,
false,false,false,false,false,[]},
none,noflow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.6475.0>,
<0.6481.0>,<0.6475.0>,
<<"172.17.0.19:46030 -> 172.17.0.19:5672">>,
{lstate,<0.6482.0>,false},
none,1,
{[],[]},
{user,<<"guest">>,
[administrator],
[{rabbit_auth_backend_internal,none}]},
<<"/">>,
<<"cinder-volume.hostgroup@tripleo_iscsi">>,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]}}},
erlang},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{set,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
<0.6476.0>,
{state,fine,5000,#Ref<0.0.3.2526>},
false,1,
{{0,nil},{0,nil}},
[],
{{0,nil},{0,nil}},
[{<<"connection.blocked">>,bool,true},
{<<"authentication_failure_close">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true}],
none,0,none,flow,[]}
** Reason for termination ==
** {{noproc,
{gen_server,call,
[{rabbit_amqqueue_sup_sup,'rabbit@overcloud-controller-1'},
{start_child,
[{amqqueue,
{resource,<<"/">>,queue,
<<"cinder-volume.hostgroup">>},
false,false,none,[],none,[],[],[],
[{vhost,<<"/">>},
{name,<<"ha-all">>},
{pattern,<<"^(?!amq\\.).*">>},
{'apply-to',<<"all">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[],[],live},
declare]},
infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,212}]},
{rabbit_amqqueue_sup_sup,start_queue_process,3,
[{file,"src/rabbit_amqqueue_sup_sup.erl"},{line,42}]},
{rabbit_amqqueue,declare,6,[{file,"src/rabbit_amqqueue.erl"},{line,317}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1325}]},
{rabbit_channel,handle_cast,2,
[{file,"src/rabbit_channel.erl"},{line,457}]},
{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
=ERROR REPORT==== 31-May-2018::17:32:39 ===
Error on AMQP connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672 - cinder-volume:118015:a9c3e789-694a-42b8-9104-6f11d93e2d0e, vhost: '/', user: 'guest', state: running), channel 1:
{{noproc,
{gen_server,call,
[{rabbit_amqqueue_sup_sup,'rabbit@overcloud-controller-1'},
{start_child,
[{amqqueue,
{resource,<<"/">>,queue,
<<"cinder-volume.hostgroup">>},
false,false,none,[],none,[],[],[],
[{vhost,<<"/">>},
{name,<<"ha-all">>},
{pattern,<<"^(?!amq\\.).*">>},
{'apply-to',<<"all">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[],[],live},
declare]},
infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,212}]},
{rabbit_amqqueue_sup_sup,start_queue_process,3,
[{file,"src/rabbit_amqqueue_sup_sup.erl"},{line,42}]},
{rabbit_amqqueue,declare,6,[{file,"src/rabbit_amqqueue.erl"},{line,317}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1325}]},
{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,457}]},
{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
=WARNING REPORT==== 31-May-2018::17:32:39 ===
Non-AMQP exit reason '{{noproc,
{gen_server,call,
[{rabbit_amqqueue_sup_sup,
'rabbit@overcloud-controller-1'},
{start_child,
[{amqqueue,
{resource,<<"/">>,queue,
<<"cinder-volume.hostgroup">>},
false,false,none,[],none,[],[],[],
[{vhost,<<"/">>},
{name,<<"ha-all">>},
{pattern,<<"^(?!amq\\.).*">>},
{'apply-to',<<"all">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[],[],live},
declare]},
infinity]}},
[{gen_server,call,3,
[{file,"gen_server.erl"},{line,212}]},
{rabbit_amqqueue_sup_sup,start_queue_process,3,
[{file,"src/rabbit_amqqueue_sup_sup.erl"},{line,42}]},
{rabbit_amqqueue,declare,6,
[{file,"src/rabbit_amqqueue.erl"},{line,317}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1325}]},
{rabbit_channel,handle_cast,2,
[{file,"src/rabbit_channel.erl"},{line,457}]},
{gen_server2,handle_msg,2,
[{file,"src/gen_server2.erl"},{line,1032}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,240}]}]}'
=INFO REPORT==== 31-May-2018::17:32:39 ===
closing AMQP connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672 - cinder-volume:118015:a9c3e789-694a-42b8-9104-6f11d93e2d0e)
This is attempting to declare the queue master for the cinder-volume.hostgroup queue on overcloud-controller-1. The reason is that we have set the queue_master_locator option in rabbitmq.config:
{queue_master_locator, <<"min-masters">>},
So controller-0 decides that controller-1 has the fewest number of master queues and tries to declare it there.
However, at the time this is happening, controller-1 is restarting. Note the error is at 17:32:39, and then compare to the rabbit log on controller-1:
=INFO REPORT==== 31-May-2018::17:32:36 ===
Stopped RabbitMQ application
=INFO REPORT==== 31-May-2018::17:32:38 ===
Clustering with ['rabbit@overcloud-controller-0'] as disc node
=INFO REPORT==== 31-May-2018::17:32:41 ===
Starting RabbitMQ 3.6.5 on Erlang 18.3.4.7
So at 39 seconds, controller-1 has rejoined the cluster but has not yet started the rabbit app.
This is probably a bug somewhere in the master locator code. I would expect it to verify the target node is actually up, not just clustered.
As a workaround for testing purposes, just remove the queue_master_locator setting from rabbitmq.config. This will cause all queue declarations to assign the master to the local node where the declaration happens. In the example above, cinder-volume is connected to controller-0, so the master will also be located local to controller-0. That avoids this problem entirely.

As discussed, the workaround is not applicable here, since we rely on these tests inside the CI, which means we can't (or at least should not) touch the environments. So we'll need to find a permanent solution.

(In reply to John Eckersberg from comment #10)
> This is probably a bug somewhere in the master locator code. I would expect
> it to verify the target node is actually up, not just clustered.

So, the master locator code *does* verify that the target node is actually up, to the best of its ability. The problem is that it does so by reading the list of running nodes from mnesia. At this point, where the node is in the process of re-joining the cluster after having previously been a member, the mnesia state on the other cluster members hasn't caught up to the fact that the node is missing, because the underlying runtime hasn't flagged the node as down yet.

There's not really a way to guarantee that this approach will be 100% successful. There's always going to be a window where a node can go down and the other cluster members are not aware yet. If a queue declare occurs during that window and the down node is chosen as the master, it will always fail like this. I think the best strategy is to just catch the failure and try another valid node if one is available. I'll put together a PR for upstream to change the master_locator strategies to be able to cope with a prioritized list instead of just selecting a single node. That way it can iterate through the preferences during a failure.
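The prioritized-list strategy described in that comment can be illustrated with a small sketch. This is not RabbitMQ's actual Erlang code; it is a shell illustration in which `node_is_up` is a stub standing in for a real liveness check (here it simulates controller-1 being mid-restart).

```shell
# Stub liveness check: pretend controller-1 is the node that is restarting.
# A real implementation would ask the runtime, not the stale mnesia view.
node_is_up() {
    case "$1" in
        *controller-1*) return 1 ;;
        *) return 0 ;;
    esac
}

# Walk a preference-ordered candidate list and pick the first node that is
# actually up, instead of failing hard when the single chosen node is down.
pick_queue_master() {
    for node in "$@"; do
        if node_is_up "$node"; then
            echo "$node"
            return 0
        fi
    done
    return 1  # no live candidate at all
}
```

With the stub above, `pick_queue_master rabbit@overcloud-controller-1 rabbit@overcloud-controller-0` skips the restarting node and falls back to controller-0.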
This is handled a bit better in 3.7.x here:

https://github.com/rabbitmq/rabbitmq-server/commit/b753442da578029d49478f2511e647fe29af9b63

I'll try first to backport this and get a new build. It probably won't help the behavior but we'll at least get better error messages.

(In reply to John Eckersberg from comment #15)
> This is handled a bit better in 3.7.x here:
>
> https://github.com/rabbitmq/rabbitmq-server/commit/
> b753442da578029d49478f2511e647fe29af9b63
>
> I'll try first to backport this and get a new build. It probably won't help
> the behavior but we'll at least get better error messages.

This didn't work so well on first attempt. The internal records have changed quite a bit between 3.6.x and 3.7.x, and a straightforward backport is not really possible. I'm going to just turn the queue_master_locator strategy back to client-local. It's not a great choice for failover performance, but it should fix this particular issue for CI.

Upstream review - https://review.openstack.org/#/c/587064/
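In config terms, the fix described above would amount to a rabbitmq.config fragment along these lines. This is an illustrative sketch, not the actual change from the upstream review; only the queue_master_locator value differs from the snippet quoted earlier.

```erlang
%% /etc/rabbitmq/rabbitmq.config (fragment, illustrative)
{rabbit, [
   %% was <<"min-masters">>; client-local pins each new queue master to the
   %% node the declaring client is connected to, avoiding the race above
   {queue_master_locator, <<"client-local">>}
   %% ... other rabbit settings unchanged ...
]}
```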