Bug 1585032
| Summary: | HA test failed because it is unable to create an instance due to volume in error status "Unknown" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Raoul Scarazzini <rscarazz> |
| Component: | rabbitmq-server | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED EOL | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 12.0 (Pike) | CC: | abishop, apevec, berrange, cschwede, dasmith, eglynn, fpercoco, geguileo, jeckersb, jhakimra, kchamart, lhh, michele, mkrcmari, sbauza, sgordon, srevivo, vromanso |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | 12.0 (Pike) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-11 16:26:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Raoul Scarazzini
2018-06-01 07:55:53 UTC
sosreports: http://file.rdu.redhat.com/~rscarazz/BZ1585032/

The full error from the log is:

2018-05-31 17:34:14.337 1 INFO heat.engine.resource [req-323afbf1-f845-4160-9295-e47279165618 - admin - default default] CREATE: CinderVolume "instance_volume" [8c12f87c-6c15-4263-a4ae-c955ec66e34f] Stack "stack_test_pacemaker-light-a" [90029a02-0373-4283-bdef-5ac19e794b5e]
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource Traceback (most recent call last):
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 831, in _action_recorder
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource yield
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 939, in _do_action
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource step = next(subtask)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 890, in action_handler_task
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource done = check(handler_data)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/cinder/volume.py", line 318, in check_create_complete
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource complete = super(CinderVolume, self).check_create_complete(vol_id)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/volume_base.py", line 56, in check_create_complete
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource resource_status=vol.status)
2018-05-31 17:34:14.337 1 ERROR heat.engine.resource ResourceInError: Went to status error due to "Unknown"

Looking at the related log in cinder-volume, we see that after the test described in the first comment the status of the resource seems fine:

2018-05-31 17:32:30.609 118015 INFO cinder.volume.manager [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Driver initialization completed successfully.
2018-05-31 17:32:30.615 118015 INFO cinder.manager [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Initiating service 5 cleanup
2018-05-31 17:32:30.618 118015 INFO cinder.manager [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Service 5 cleanup completed.

But then something goes wrong with the messaging layer:

2018-05-31 17:32:39.224 118015 ERROR oslo.messaging._drivers.impl_rabbit [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Failed to declare consumer for topic 'cinder-volume.hostgroup': (0, 0): (541) INTERNAL_ERROR: InternalError: (0, 0): (541) INTERNAL_ERROR
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service [req-e514286f-7a7c-443b-ac02-c44ccfa889b6 - - - - -] Error starting thread.: InternalError: (0, 0): (541) INTERNAL_ERROR
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service Traceback (most recent call last):
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 721, in run_service
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service service.start()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/cinder/service.py", line 244, in start
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.backend_rpcserver.start()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 270, in wrapper
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service log_after, timeout_timer)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 190, in run_once
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service post_fn = fn()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 269, in <lambda>
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service states[state].run_once(lambda: fn(self, *args, **kwargs),
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/server.py", line 416, in start
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.listener = self._create_listener()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 148, in _create_listener
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return self.transport._listen(self._target, 1, None)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 138, in _listen
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service batch_timeout)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 583, in listen
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service callback=listener)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1126, in declare_topic_consumer
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.declare_consumer(consumer)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1028, in declare_consumer
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service error_callback=_connect_error)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 807, in ensure
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service ret, channel = autoretry_method()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 494, in _ensured
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return fun(*args, **kwargs)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 570, in __call__
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return fun(*args, channel=channels[0], **kwargs), channels[0]
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 796, in execute_method
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service method()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1016, in _declare_consumer
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service consumer.declare(self)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 304, in declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.queue.declare()
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 605, in declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self._create_queue(nowait=nowait, channel=channel)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 614, in _create_queue
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.queue_declare(nowait=nowait, passive=False, channel=channel)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 649, in queue_declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service nowait=nowait,
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1165, in queue_declare
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service spec.Queue.DeclareOk, returns_tuple=True,
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 93, in wait
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service self.connection.drain_events(timeout=timeout)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 465, in drain_events
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return self.blocking_read(timeout)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 470, in blocking_read
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service return self.on_inbound_frame(frame)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service callback(channel, method_sig, buf, None)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 474, in on_inbound_method
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service method_sig, payload, content,
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service listener(*args)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 596, in _on_close
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service (class_id, method_id), ConnectionError)
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service InternalError: (0, 0): (541) INTERNAL_ERROR
2018-05-31 17:32:39.225 118015 ERROR oslo_service.service
2018-05-31 17:33:09.314 118015 WARNING oslo_messaging.server [req-de6c2c04-706e-49ac-b019-895dd74a4797 - - - - -] Possible hang: wait is waiting for stop to complete

One very important note: from the cluster's point of view, Rabbit and all the resources are fine, so I'd guess this problem is limited to Cinder volume, mostly because the other components of the openstack-heat stack get created without any problem. Moreover, the fact that we're able to start the openstack-heat deployment at all proves the core resources are working fine: we don't move on with the openstack-heat deployment unless the overall status is up and running without failed actions.

I'm changing the component of this bug, because it feels more like a cinder-volume issue. This other bug might be related: https://bugzilla.redhat.com/show_bug.cgi?id=1568512

I'm guessing this might be due to a sequencing issue when cinder-volume is running under pacemaker. The c-vol logs indicate the core pacemaker resources (galera and rabbitmq) were stopped while c-vol was running, which makes me think all pacemaker resources were stopped at the same time. Likewise, when c-vol starts up, the "Failed to declare consumer for topic 'cinder-volume.hostgroup'" error suggests rabbitmq was not up. I'm thinking the correct test sequence should be something like this:

1. Stop systemd services
2. Stop non-core pacemaker resources (c-vol, manila-share, etc.)
3. Stop core pacemaker resources (galera, rabbitmq)
4. Start core pacemaker resources
5. Start non-core pacemaker resources
6. Start systemd services

I'm not so sure this is a Cinder problem. Error 541 INTERNAL_ERROR is coming from RabbitMQ itself, so it has nothing to do with Cinder. And we saw how it took over a minute of constant retries to actually connect to the RabbitMQ server.
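The six-step stop/start sequence proposed earlier can be sketched as a dry-run shell script. The resource and unit names (openstack-nova-compute, openstack-cinder-volume, rabbitmq-clone, galera-clone) are assumptions based on typical TripleO naming, not taken from this deployment; the commands are only echoed so the ordering can be inspected safely.

```shell
# Dry-run sketch of the proposed ordering; names are assumptions and may
# differ per deployment. Each echo prints the command that would run.
restart_sequence() {
    echo "systemctl stop openstack-nova-compute"          # 1. stop plain systemd services
    echo "pcs resource disable openstack-cinder-volume"   # 2. stop non-core pacemaker resources
    echo "pcs resource disable rabbitmq-clone"            # 3. stop core pacemaker resources last
    echo "pcs resource disable galera-clone"
    echo "pcs resource enable galera-clone"               # 4. start core resources first
    echo "pcs resource enable rabbitmq-clone"
    echo "pcs resource enable openstack-cinder-volume"    # 5. then non-core pacemaker resources
    echo "systemctl start openstack-nova-compute"         # 6. finally plain systemd services
}
restart_sequence
```

Piping the output through `sh` (or dropping the echoes) on a controller would execute the sequence for real.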
Looking into the RabbitMQ logs we can see this error:

=INFO REPORT==== 31-May-2018::17:29:48 ===
Mirrored queue 'cinder-volume.hostgroup' in vhost '/': Master <rabbit.5655.0> saw deaths of mirrors <rabbit.2265.0>

This could be related to https://github.com/rabbitmq/rabbitmq-server/issues/714, and it could also be related to:

=WARNING REPORT==== 31-May-2018::17:28:36 ===
closing AMQP connection <0.7711.0> (172.17.0.19:44990 -> 172.17.0.19:5672 - cinder-volume:152042:0222cf36-cb31-4c88-bfe2-cd27d9eaf0b9):
client unexpectedly closed TCP connection

That shouldn't happen, since Cinder is stopping properly:

2018-05-31 17:27:59.364 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Caught SIGTERM, stopping children
2018-05-31 17:27:59.365 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Waiting on 1 children to exit
2018-05-31 17:28:36.438 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Child 152042 exited with status 0

In any case, the main source of the problem is the internal RabbitMQ error. Also worth noting: Cinder volume seems to be the only service that has been restarted. All the other services (the ones that seem to work) haven't been restarted, including the Cinder scheduler and API.

Hi folks, thanks for your contribution. Let me clarify why we run these tests: we *want* to stop *just* the core resources, Rabbit and Galera, because we expect all the other resources (the systemd services and also the other ones managed by Pacemaker, like cinder-volume) to survive this kind of failure autonomously. This behavior is a requirement for all HA setups. Pacemaker no longer controls all the resources at that point, so they are "on their own". One last note: when Rabbit is stopped and then started, all the other resources, from Neutron to Heat, keep working correctly, respecting the expected behavior.
Raoul, you mention that only Rabbit and Galera are being stopped, but the logs suggest otherwise. The Cinder volume service is being stopped:

2018-05-31 17:27:59.364 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Caught SIGTERM, stopping children
2018-05-31 17:27:59.365 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Waiting on 1 children to exit
2018-05-31 17:28:36.438 151948 INFO oslo_service.service [req-67084d20-9d8e-4e0f-9e83-889e1146f350 - - - - -] Child 152042 exited with status 0

Gorka, yes, you're right; sorry for the confusion. In the failing test [1] we intentionally stop the $OVERCLOUD_SYSTEMD_RESOURCES [2] and $OVERCLOUD_CORE_RESOURCES [3] resources. The openstack-cinder-volume resource is part of the systemd list (and is its only member), which means it gets stopped too. That said, it doesn't change what we expect from the test: the resource should be able to recover properly without any external intervention. Otherwise, we'll need to create a constraint tying it to the core resources, so that it is stopped together with them and started after them. Until today that was not necessary, and I still think the behavior we're expecting is the correct one.

[1] https://github.com/openstack/tripleo-ha-utils/blob/master/tools/ha-test-suite/test/test_pacemaker-light-a
[2] https://github.com/openstack/tripleo-ha-utils/blob/master/tools/ha-test-suite/ha-test-suite.sh#L58
[3] https://github.com/openstack/tripleo-ha-utils/blob/master/tools/ha-test-suite/ha-test-suite.sh#L56

I agree, it should be expected to run fine. But the error we are seeing is a RabbitMQ internal error on the queue used by Cinder. It shouldn't matter whether Cinder volume stops cleanly or dies abruptly, nor whether RabbitMQ is stopped or dies: we should never end up with a queue that cannot be used, which is what's happening here.
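The constraint Raoul mentions could take the form of a pcs ordering constraint. This is a hedged sketch: the resource names (rabbitmq-clone, openstack-cinder-volume) are assumptions from typical TripleO naming, and the command is echoed rather than executed.

```shell
# Hypothetical ordering constraint: start cinder-volume only after rabbitmq
# is up (and, symmetrically, stop it before rabbitmq stops). Echoed only;
# run the printed command on a controller to actually apply it.
constraint_cmd() {
    echo "pcs constraint order start rabbitmq-clone then openstack-cinder-volume"
}
constraint_cmd
```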
Reassigning to messaging for further debugging. The relevant cinder-volume connection is connected to controller-0, and we see the crash for the queue.declare in the log here:
=INFO REPORT==== 31-May-2018::17:32:39 ===
accepting AMQP connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672)
=INFO REPORT==== 31-May-2018::17:32:39 ===
Connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672) has a client-provided name: cinder-volume:118015:a9c3e789-694a-42b8-9104-6f11d93e2d0e
=ERROR REPORT==== 31-May-2018::17:32:39 ===
** Generic server <0.6483.0> terminating
** Last message in was {'$gen_cast',
{method,
{'queue.declare',0,
<<"cinder-volume.hostgroup">>,
false,false,false,false,false,[]},
none,noflow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.6475.0>,
<0.6481.0>,<0.6475.0>,
<<"172.17.0.19:46030 -> 172.17.0.19:5672">>,
{lstate,<0.6482.0>,false},
none,1,
{[],[]},
{user,<<"guest">>,
[administrator],
[{rabbit_auth_backend_internal,none}]},
<<"/">>,
<<"cinder-volume.hostgroup@tripleo_iscsi">>,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]}}},
erlang},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{set,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
<0.6476.0>,
{state,fine,5000,#Ref<0.0.3.2526>},
false,1,
{{0,nil},{0,nil}},
[],
{{0,nil},{0,nil}},
[{<<"connection.blocked">>,bool,true},
{<<"authentication_failure_close">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true}],
none,0,none,flow,[]}
** Reason for termination ==
** {{noproc,
{gen_server,call,
[{rabbit_amqqueue_sup_sup,'rabbit@overcloud-controller-1'},
{start_child,
[{amqqueue,
{resource,<<"/">>,queue,
<<"cinder-volume.hostgroup">>},
false,false,none,[],none,[],[],[],
[{vhost,<<"/">>},
{name,<<"ha-all">>},
{pattern,<<"^(?!amq\\.).*">>},
{'apply-to',<<"all">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[],[],live},
declare]},
infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,212}]},
{rabbit_amqqueue_sup_sup,start_queue_process,3,
[{file,"src/rabbit_amqqueue_sup_sup.erl"},{line,42}]},
{rabbit_amqqueue,declare,6,[{file,"src/rabbit_amqqueue.erl"},{line,317}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1325}]},
{rabbit_channel,handle_cast,2,
[{file,"src/rabbit_channel.erl"},{line,457}]},
{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
=ERROR REPORT==== 31-May-2018::17:32:39 ===
Error on AMQP connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672 - cinder-volume:118015:a9c3e789-694a-42b8-9104-6f11d93e2d0e, vhost: '/', user: 'guest', state: running), channel 1:
{{noproc,
{gen_server,call,
[{rabbit_amqqueue_sup_sup,'rabbit@overcloud-controller-1'},
{start_child,
[{amqqueue,
{resource,<<"/">>,queue,
<<"cinder-volume.hostgroup">>},
false,false,none,[],none,[],[],[],
[{vhost,<<"/">>},
{name,<<"ha-all">>},
{pattern,<<"^(?!amq\\.).*">>},
{'apply-to',<<"all">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[],[],live},
declare]},
infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,212}]},
{rabbit_amqqueue_sup_sup,start_queue_process,3,
[{file,"src/rabbit_amqqueue_sup_sup.erl"},{line,42}]},
{rabbit_amqqueue,declare,6,[{file,"src/rabbit_amqqueue.erl"},{line,317}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1325}]},
{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,457}]},
{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1032}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
=WARNING REPORT==== 31-May-2018::17:32:39 ===
Non-AMQP exit reason '{{noproc,
{gen_server,call,
[{rabbit_amqqueue_sup_sup,
'rabbit@overcloud-controller-1'},
{start_child,
[{amqqueue,
{resource,<<"/">>,queue,
<<"cinder-volume.hostgroup">>},
false,false,none,[],none,[],[],[],
[{vhost,<<"/">>},
{name,<<"ha-all">>},
{pattern,<<"^(?!amq\\.).*">>},
{'apply-to',<<"all">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[],[],live},
declare]},
infinity]}},
[{gen_server,call,3,
[{file,"gen_server.erl"},{line,212}]},
{rabbit_amqqueue_sup_sup,start_queue_process,3,
[{file,"src/rabbit_amqqueue_sup_sup.erl"},{line,42}]},
{rabbit_amqqueue,declare,6,
[{file,"src/rabbit_amqqueue.erl"},{line,317}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1325}]},
{rabbit_channel,handle_cast,2,
[{file,"src/rabbit_channel.erl"},{line,457}]},
{gen_server2,handle_msg,2,
[{file,"src/gen_server2.erl"},{line,1032}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,240}]}]}'
=INFO REPORT==== 31-May-2018::17:32:39 ===
closing AMQP connection <0.6475.0> (172.17.0.19:46030 -> 172.17.0.19:5672 - cinder-volume:118015:a9c3e789-694a-42b8-9104-6f11d93e2d0e)
This is attempting to declare the queue master for the cinder-volume.hostgroup queue on overcloud-controller-1. The reason is that we have set the queue_master_locator option in rabbitmq.config:
{queue_master_locator, <<"min-masters">>},
So controller-0 decides that controller-1 has the fewest number of master queues and tries to declare it there.
However, at the time this is happening, controller-1 is restarting. Note the error is at 17:32:39, and then compare to the rabbit log on controller-1:
=INFO REPORT==== 31-May-2018::17:32:36 ===
Stopped RabbitMQ application
=INFO REPORT==== 31-May-2018::17:32:38 ===
Clustering with ['rabbit@overcloud-controller-0'] as disc node
=INFO REPORT==== 31-May-2018::17:32:41 ===
Starting RabbitMQ 3.6.5 on Erlang 18.3.4.7
So at 39 seconds, controller-1 has rejoined the cluster but has not yet started the rabbit app.
This is probably a bug somewhere in the master locator code. I would expect it to verify the target node is actually up, not just clustered.
As a workaround for testing purposes, just remove the queue_master_locator setting from rabbitmq.config. This will cause all queue declarations to assign the master to the local node where the declaration happens. In the example above, cinder-volume is connected to controller-0, so the master will also be located local to controller-0. That avoids this problem entirely.

As discussed, the workaround is not applicable here, since we rely on these tests inside the CI, which means we can't (or at least should not) touch the environments. So we'll need to find a permanent solution.

(In reply to John Eckersberg from comment #10)
> This is probably a bug somewhere in the master locator code. I would expect
> it to verify the target node is actually up, not just clustered.

So, the master locator code *does* verify that the target node is actually up, to the best of its ability. The problem is that it does so by reading the list of running nodes from mnesia. At this point, where the node is in the process of re-joining the cluster after having previously been a member, the mnesia state on the other cluster members hasn't caught up to the fact that the node is missing, because the underlying runtime hasn't flagged the node as down yet.

There's not really a way to guarantee that this approach will be 100% successful. There's always going to be a window where a node can go down and the other cluster members are not aware yet. If a queue declare occurs during that window and the down node is chosen as the master, it will always fail like this. I think the best strategy is to just catch the failure and try another valid node if one is available. I'll put together a PR for upstream to change the master_locator strategies to be able to cope with a prioritized list instead of just selecting a single node. That way it can iterate through the preferences during a failure.
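The prioritized-list strategy described in that comment can be illustrated with a small sketch. This is not RabbitMQ's actual Erlang code; it is a shell illustration in which `node_is_up` is a stub standing in for a real liveness check (here it simulates controller-1 being mid-restart).

```shell
# Stub liveness check: pretend controller-1 is the node that is restarting.
# A real implementation would ask the runtime, not the stale mnesia view.
node_is_up() {
    case "$1" in
        *controller-1*) return 1 ;;
        *) return 0 ;;
    esac
}

# Walk a preference-ordered candidate list and pick the first node that is
# actually up, instead of failing hard when the single chosen node is down.
pick_queue_master() {
    for node in "$@"; do
        if node_is_up "$node"; then
            echo "$node"
            return 0
        fi
    done
    return 1  # no live candidate at all
}
```

With the stub above, `pick_queue_master rabbit@overcloud-controller-1 rabbit@overcloud-controller-0` skips the restarting node and falls back to controller-0.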
This is handled a bit better in 3.7.x here:

https://github.com/rabbitmq/rabbitmq-server/commit/b753442da578029d49478f2511e647fe29af9b63

I'll try first to backport this and get a new build. It probably won't help the behavior but we'll at least get better error messages.

(In reply to John Eckersberg from comment #15)
> This is handled a bit better in 3.7.x here:
>
> https://github.com/rabbitmq/rabbitmq-server/commit/
> b753442da578029d49478f2511e647fe29af9b63
>
> I'll try first to backport this and get a new build. It probably won't help
> the behavior but we'll at least get better error messages.

This didn't work so well on first attempt. The internal records have changed quite a bit between 3.6.x and 3.7.x, and a straightforward backport is not really possible. I'm going to just turn the queue_master_locator strategy back to client-local. It's not a great choice for failover performance, but it should fix this particular issue for CI.

Upstream review - https://review.openstack.org/#/c/587064/
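In config terms, the fix described above would amount to a rabbitmq.config fragment along these lines. This is an illustrative sketch, not the actual change from the upstream review; only the queue_master_locator value differs from the snippet quoted earlier.

```erlang
%% /etc/rabbitmq/rabbitmq.config (fragment, illustrative)
{rabbit, [
   %% was <<"min-masters">>; client-local pins each new queue master to the
   %% node the declaring client is connected to, avoiding the race above
   {queue_master_locator, <<"client-local">>}
   %% ... other rabbit settings unchanged ...
]}
```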