Bug 1575885
| Summary: | rabbitmq seems did not recover from lost connectivity to other members | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eduard Barrera <ebarrera> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED DUPLICATE | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | medium | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 10.0 (Newton) | CC: | apevec, jeckersb, lhh, pcaruana, plemenko, rscarazz, srevivo |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-28 14:00:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Yes, Pablo is right - this is a duplicate of bug 1441685. Please upgrade RabbitMQ to the latest release, and it will recover much better. Feel free to reopen this bug if the issue persists after the upgrade.
*** This bug has been marked as a duplicate of bug 1441685 ***
Description of problem:
Network connectivity was lost among the nodes of the rabbitmq cluster, and it seems the cluster did not recover successfully. I found the servers in really bad shape, for example:
|-sh---rabbitmq-server---su---rabbitmq-server---beam.smp-+-inet_gethost---inet_gethost
| `-956*[{beam.smp}] <=========================
|-snmpd
|-radosgw---280*[{radosgw}]
|-httpd-+-256*[httpd]
| |-3*[httpd---12*[{httpd}]]
| |-112*[httpd---4*[{httpd}]]
| |-2*[httpd---58*[{httpd}]]
| `-14*[httpd---3*[{httpd}]]
|-heat-api---56*[heat-api]
|-heat-api-cfn---56*[heat-api-cfn]
|-heat-api-cloudw---56*[heat-api-cloudw]
|-heat-engine---56*[heat-engine]
|-httpd-+-256*[httpd]
|-glance-api---56*[glance-api]
|-glance-registry---56*[glance-registry]
systemd-+-/usr/bin/python---ceilometer-agen---96*[{ceilometer-agen}]
|-/usr/bin/python---ceilometer-coll---72*[{ceilometer-coll}]
|-/usr/bin/python---ceilometer-poll---6*[{ceilometer-poll}]
|-/usr/bin/python---22*[{/usr/bin/python}]
|-/usr/bin/python-+-/usr/bin/python---8*[{/usr/bin/python}]
| |-gnocchi-metricd---37*[{gnocchi-metricd}]
| |-9*[gnocchi-metricd---15*[{gnocchi-metricd}]]
| |-gnocchi-metricd---21*[{gnocchi-metricd}]
| `-8*[{/usr/bin/python}]
Deleting the mnesia database recovered the environment. Before that, we were not even able to request a token.
Some details of rabbitmq logs:
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.2898.287> (192.168.4.28:44618 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.2806.395> (192.168.4.15:54327 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.2929.287> (192.168.4.28:44622 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.2952.287> (192.168.4.28:44624 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.2917.287> (192.168.4.28:44620 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.3012.287> (192.168.4.28:44628 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.1405.455> (192.168.4.15:43866 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.3084.287> (192.168.4.28:44634 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.3080.287> (192.168.4.28:44636 -> 192.168.4.38:5672): {inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 === closing AMQP connection <0.8830.414> (192.168.4.15:36736 -> 192.168.4.38:5672): {inet_error,etimedout}

- Cluster stopped because there is no quorum
Mirrored queue 'vnc_config.dcd01-contrail-controller-0-8082' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'engine_worker.f4118d72-6385-4f8a-b2a6-55f343e6f02f' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'heat-engine-listener.4bbc48d9-96f0-44d5-9fc2-f9ae189a0776' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'engine_worker.a3f45e55-1b78-43d9-8408-522384da2817' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'conductor_fanout_8d3fe7ebb5f74bca8d9702edf0eb3438' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'conductor_fanout_056afd73365e4618a79531b14d35578d' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'heat-engine-listener.1d5a26df-3001-4c76-9c7c-1216dc26f4fa' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'heat-engine-listener.0cc0ce65-ad40-4b89-9133-6e639d06ee59' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'reply_1b4524de6ea2422c90d75d7d1d18b7de' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'engine_worker.e0006b2a-4fd3-4694-a960-bdfd8fcb847b' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'engine_worker_fanout_4e8b12b459f044f4980058aa666fc894' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'engine_worker.35d6a43c-02bd-42d0-9e82-2d5dfbe6d2b2' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'engine_fanout_bf5bc0c04980423993c2c5a3271ee0a5' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'reply_17e23423abbd4fa4a0f700a0c5467308' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 === Mirrored queue 'engine_fanout_22bdf60c68f644a3a21652bceefc74df' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 5-May-2018::20:36:31 ===

It seems to recover, but a few minutes later it starts throwing heartbeat timeouts:
=INFO REPORT==== 5-May-2018::20:40:43 === accepting AMQP connection <0.1937.458> (192.168.4.15:60362 -> 192.168.4.38:5672)
=ERROR REPORT==== 5-May-2018::20:40:46 === closing AMQP connection <0.17545.457> (192.168.4.38:39838 -> 192.168.4.38:5672): missed heartbeats from client, timeout: 60s
=INFO REPORT==== 5-May-2018::20:40:46 === accepting AMQP connection <0.1970.458> (192.168.4.38:60968 -> 192.168.4.38:5672)
=INFO REPORT==== 5-May-2018::20:40:47 === accepting AMQP connection <0.2067.458> (192.168.4.38:32804 -> 192.168.4.38:5672)
=INFO REPORT==== 5-May-2018::20:40:54 === accepting AMQP connection <0.2085.458> (192.168.4.28:42488 -> 192.168.4.38:5672)
=INFO REPORT==== 5-May-2018::20:40:55 === accepting AMQP connection <0.2178.458> (192.168.4.28:42510 -> 192.168.4.38:5672)
=ERROR REPORT==== 5-May-2018::20:40:57 === closing AMQP connection <0.7377.229> (192.168.4.15:58464 -> 192.168.4.38:5672): missed heartbeats from client, timeout: 60s
=ERROR REPORT==== 5-May-2018::20:40:58 === closing AMQP connection <0.7418.229> (192.168.4.15:58470 -> 192.168.4.38:5672): missed heartbeats from client, timeout: 60s

Version-Release number of selected component (if applicable): OSP10
How reproducible: unsure
Steps to Reproduce:
1. Bring the connectivity down for a few minutes
2.
3.
Actual results: RabbitMQ does not seem to recover
Expected results: RabbitMQ recovers
Additional info:
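For triage, the log excerpts in this report can be tallied by report level and by the reason each AMQP connection was closed, which makes the shift from `{inet_error,etimedout}` to missed heartbeats easier to see. The following is a minimal Python sketch, not part of any RabbitMQ tooling. It assumes the log layout visible in this report (`=LEVEL REPORT==== timestamp ===` followed by the message). The function name `summarize_reports` is hypothetical, and the sample entries are copied from the excerpts in this report, laid out across several lines as an assumption about the original file.

```python
import re
from collections import Counter

# Entry header as seen in this report, e.g.
# "=ERROR REPORT==== 5-May-2018::20:36:23 ==="
HEADER = re.compile(r"=(INFO|WARNING|ERROR) REPORT==== (\S+) ===")
# Message logged when the broker closes an AMQP connection:
# captures the client IP and the stated reason.
CLOSE = re.compile(
    r"closing AMQP connection <[^>]+> \(([\d.]+):\d+ -> [\d.]+:\d+\):\s*(.+)"
)


def summarize_reports(text):
    """Hypothetical helper: count entries per level and closes per (client IP, reason)."""
    levels = Counter()
    closes = Counter()
    headers = list(HEADER.finditer(text))
    for i, head in enumerate(headers):
        # The message body runs from this header to the next one (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = " ".join(text[head.end():end].split())  # normalize whitespace
        levels[head.group(1)] += 1
        closed = CLOSE.search(body)
        if closed:
            closes[(closed.group(1), closed.group(2))] += 1
    return levels, closes


# Sample entries copied from the report; the line layout is an assumption.
SAMPLE = """
=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2898.287> (192.168.4.28:44618 -> 192.168.4.38:5672):
{inet_error,etimedout}
=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2806.395> (192.168.4.15:54327 -> 192.168.4.38:5672):
{inet_error,etimedout}
=INFO REPORT==== 5-May-2018::20:40:43 ===
accepting AMQP connection <0.1937.458> (192.168.4.15:60362 -> 192.168.4.38:5672)
=ERROR REPORT==== 5-May-2018::20:40:57 ===
closing AMQP connection <0.7377.229> (192.168.4.15:58464 -> 192.168.4.38:5672):
missed heartbeats from client, timeout: 60s
"""

if __name__ == "__main__":
    levels, closes = summarize_reports(SAMPLE)
    print(dict(levels))
    for (client, reason), count in closes.most_common():
        print(client, reason, count)
```

Because the parser splits on headers rather than on line breaks, it should also handle a log that was pasted as one long line, as happened in this report.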