Description of problem:

Network connectivity was lost among the nodes of the RabbitMQ cluster, and the cluster did not recover successfully. I found the servers in really bad shape; for example, beam.smp had accumulated 956 threads (pstree excerpt):

|-sh---rabbitmq-server---su---rabbitmq-server---beam.smp-+-inet_gethost---inet_gethost
|                                                        `-956*[{beam.smp}]  <=========================
|-snmpd
|-radosgw---280*[{radosgw}]
|-httpd-+-256*[httpd]
|       |-3*[httpd---12*[{httpd}]]
|       |-112*[httpd---4*[{httpd}]]
|       |-2*[httpd---58*[{httpd}]]
|       `-14*[httpd---3*[{httpd}]]
|-heat-api---56*[heat-api]
|-heat-api-cfn---56*[heat-api-cfn]
|-heat-api-cloudw---56*[heat-api-cloudw]
|-heat-engine---56*[heat-engine]
|-httpd-+-256*[httpd]
|-glance-api---56*[glance-api]
|-glance-registry---56*[glance-registry]

systemd-+-/usr/bin/python---ceilometer-agen---96*[{ceilometer-agen}]
        |-/usr/bin/python---ceilometer-coll---72*[{ceilometer-coll}]
        |-/usr/bin/python---ceilometer-poll---6*[{ceilometer-poll}]
        |-/usr/bin/python---22*[{/usr/bin/python}]
        |-/usr/bin/python-+-/usr/bin/python---8*[{/usr/bin/python}]
        |                 |-gnocchi-metricd---37*[{gnocchi-metricd}]
        |                 |-9*[gnocchi-metricd---15*[{gnocchi-metricd}]]
        |                 |-gnocchi-metricd---21*[{gnocchi-metricd}]
        |                 `-8*[{/usr/bin/python}]

Deleting the mnesia database recovered the environment; until then, we were not even able to request a token.
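For reference, the "delete the mnesia database" recovery step can be sketched roughly as below. This is an assumption-laden sketch, not the exact procedure used: the service name and mnesia path are the RHEL packaging defaults, OSP10 normally runs RabbitMQ under Pacemaker (so pcs, not systemctl, would manage the service there), and wiping mnesia discards all queue, user, and cluster state. The script defaults to a dry run so it cannot destroy state accidentally.

```shell
# Hypothetical sketch of wiping the RabbitMQ mnesia database on one node.
# Assumptions: RHEL default paths/service names; set APPLY=1 to actually run it.
MNESIA_DIR="${MNESIA_DIR:-/var/lib/rabbitmq/mnesia}"

if [ "${APPLY:-0}" = "1" ]; then
    systemctl stop rabbitmq-server
    # Move the database aside (a backup) rather than deleting it outright.
    mv "${MNESIA_DIR}" "${MNESIA_DIR}.bak.$(date +%s)"
    systemctl start rabbitmq-server
    # Verify the node came back and rejoined the cluster.
    rabbitmqctl cluster_status
else
    MSG="Dry run: would stop rabbitmq-server, move ${MNESIA_DIR} aside, and restart."
    echo "$MSG"
fi
```

On a Pacemaker-managed deployment the equivalent would be wrapped in disabling and re-enabling the rabbitmq resource, and the node would then resync from the surviving cluster members.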
Some details from the rabbitmq logs:

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2898.287> (192.168.4.28:44618 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2806.395> (192.168.4.15:54327 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2929.287> (192.168.4.28:44622 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2952.287> (192.168.4.28:44624 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2917.287> (192.168.4.28:44620 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.3012.287> (192.168.4.28:44628 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.1405.455> (192.168.4.15:43866 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.3084.287> (192.168.4.28:44634 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.3080.287> (192.168.4.28:44636 -> 192.168.4.38:5672): {inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.8830.414> (192.168.4.15:36736 -> 192.168.4.38:5672): {inet_error,etimedout}

- The cluster stopped because there is no quorum:

Mirrored queue 'vnc_config.dcd01-contrail-controller-0-8082' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.f4118d72-6385-4f8a-b2a6-55f343e6f02f' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'heat-engine-listener.4bbc48d9-96f0-44d5-9fc2-f9ae189a0776' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.a3f45e55-1b78-43d9-8408-522384da2817' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'conductor_fanout_8d3fe7ebb5f74bca8d9702edf0eb3438' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'conductor_fanout_056afd73365e4618a79531b14d35578d' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'heat-engine-listener.1d5a26df-3001-4c76-9c7c-1216dc26f4fa' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'heat-engine-listener.0cc0ce65-ad40-4b89-9133-6e639d06ee59' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'reply_1b4524de6ea2422c90d75d7d1d18b7de' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.e0006b2a-4fd3-4694-a960-bdfd8fcb847b' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker_fanout_4e8b12b459f044f4980058aa666fc894' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.35d6a43c-02bd-42d0-9e82-2d5dfbe6d2b2' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_fanout_bf5bc0c04980423993c2c5a3271ee0a5' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'reply_17e23423abbd4fa4a0f700a0c5467308' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_fanout_22bdf60c68f644a3a21652bceefc74df' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===

It seems to recover, but some minutes later it starts throwing heartbeat timeouts:

=INFO REPORT==== 5-May-2018::20:40:43 ===
accepting AMQP connection <0.1937.458> (192.168.4.15:60362 -> 192.168.4.38:5672)

=ERROR REPORT==== 5-May-2018::20:40:46 ===
closing AMQP connection <0.17545.457> (192.168.4.38:39838 -> 192.168.4.38:5672): missed heartbeats from client, timeout: 60s

=INFO REPORT==== 5-May-2018::20:40:46 ===
accepting AMQP connection <0.1970.458> (192.168.4.38:60968 -> 192.168.4.38:5672)

=INFO REPORT==== 5-May-2018::20:40:47 ===
accepting AMQP connection <0.2067.458> (192.168.4.38:32804 -> 192.168.4.38:5672)

=INFO REPORT==== 5-May-2018::20:40:54 ===
accepting AMQP connection <0.2085.458> (192.168.4.28:42488 -> 192.168.4.38:5672)

=INFO REPORT==== 5-May-2018::20:40:55 ===
accepting AMQP connection <0.2178.458> (192.168.4.28:42510 -> 192.168.4.38:5672)

=ERROR REPORT==== 5-May-2018::20:40:57 ===
closing AMQP connection <0.7377.229> (192.168.4.15:58464 -> 192.168.4.38:5672): missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 5-May-2018::20:40:58 ===
closing AMQP connection <0.7418.229> (192.168.4.15:58470 -> 192.168.4.38:5672): missed heartbeats from client, timeout: 60s

Version-Release number of selected component (if applicable):
OSP10

How reproducible:
unsure

Steps to Reproduce:
1. Bring the connectivity down for some minutes
2.
3.

Actual results:
rabbit does not seem to recover

Expected results:
rabbit recovers

Additional info:
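Since the logs show the cluster stopping for lack of quorum after the partition, how RabbitMQ is configured to handle partitions is relevant to reproduction and recovery. Below is a minimal sketch of the relevant rabbitmq.config settings, assuming the classic Erlang-term config format; the values shown are illustrative, not this deployment's actual settings, and on a Pacemaker-managed OSP cluster the resource agent, not RabbitMQ itself, is typically expected to drive recovery.

```erlang
%% Sketch of /etc/rabbitmq/rabbitmq.config (illustrative values only).
[
  {rabbit, [
    %% How the cluster reacts to a network partition:
    %%   ignore         - do nothing (default); operator intervention needed
    %%   pause_minority - minority-side nodes pause until connectivity returns
    %%   autoheal       - the losing partition restarts automatically
    {cluster_partition_handling, autoheal},

    %% Server-suggested AMQP heartbeat interval, in seconds; the
    %% "missed heartbeats from client, timeout: 60s" errors above
    %% correspond to a 60-second negotiated heartbeat.
    {heartbeat, 60}
  ]}
].
```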
Yes, Pablo is right - this is a duplicate of bug 1441685. Please upgrade RabbitMQ to the latest release, and it will recover much better. Feel free to reopen it if the issue still persists after the upgrade.

*** This bug has been marked as a duplicate of bug 1441685 ***