Bug 1575885 - RabbitMQ does not seem to recover from lost connectivity to other members
Summary: RabbitMQ does not seem to recover from lost connectivity to other members
Keywords:
Status: CLOSED DUPLICATE of bug 1441685
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
urgent
medium
Target Milestone: ---
Assignee: Peter Lemenkov
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-08 07:51 UTC by Eduard Barrera
Modified: 2021-09-09 13:58 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-28 14:00:07 UTC
Target Upstream Version:
Embargoed:



Description Eduard Barrera 2018-05-08 07:51:01 UTC
Description of problem:

Network connectivity was lost among the nodes of the RabbitMQ cluster, and the cluster does not appear to have recovered successfully.

I found the servers in really bad shape. For example, pstree showed the RabbitMQ beam.smp process with roughly 956 threads (output below; a quick check is sketched after it):



        |-sh---rabbitmq-server---su---rabbitmq-server---beam.smp-+-inet_gethost---inet_gethost
        |                                                        `-956*[{beam.smp}] <=========================
        |-snmpd


        |-radosgw---280*[{radosgw}]


        |-httpd-+-256*[httpd]
        |       |-3*[httpd---12*[{httpd}]]
        |       |-112*[httpd---4*[{httpd}]]
        |       |-2*[httpd---58*[{httpd}]]
        |       `-14*[httpd---3*[{httpd}]]

        |-heat-api---56*[heat-api]
        |-heat-api-cfn---56*[heat-api-cfn]
        |-heat-api-cloudw---56*[heat-api-cloudw]
        |-heat-engine---56*[heat-engine]
        |-httpd-+-256*[httpd]


        |-glance-api---56*[glance-api]
        |-glance-registry---56*[glance-registry]

systemd-+-/usr/bin/python---ceilometer-agen---96*[{ceilometer-agen}]
        |-/usr/bin/python---ceilometer-coll---72*[{ceilometer-coll}]
        |-/usr/bin/python---ceilometer-poll---6*[{ceilometer-poll}]
        |-/usr/bin/python---22*[{/usr/bin/python}]
        |-/usr/bin/python-+-/usr/bin/python---8*[{/usr/bin/python}]
        |                 |-gnocchi-metricd---37*[{gnocchi-metricd}]
        |                 |-9*[gnocchi-metricd---15*[{gnocchi-metricd}]]
        |                 |-gnocchi-metricd---21*[{gnocchi-metricd}]
        |                 `-8*[{/usr/bin/python}]


Deleting the Mnesia database recovered the environment; before that, we were not even able to request a token.
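
For reference, "deleting the Mnesia database" here amounts to rebuilding the cluster state from scratch. A rough sketch of the usual procedure on an OSP10 (pacemaker-managed) controller follows; the resource name and path are assumptions, not taken from this environment:

  pcs resource disable rabbitmq-clone       # stop RabbitMQ on all controllers
  rm -rf /var/lib/rabbitmq/mnesia/*         # wipe the Mnesia database on each controller
  pcs resource enable rabbitmq-clone        # restart; the cluster re-forms with empty state

This drops all queues and messages, so it is a last-resort recovery rather than a fix.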

Some details from the RabbitMQ logs:

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2898.287> (192.168.4.28:44618 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2806.395> (192.168.4.15:54327 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2929.287> (192.168.4.28:44622 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2952.287> (192.168.4.28:44624 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.2917.287> (192.168.4.28:44620 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.3012.287> (192.168.4.28:44628 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.1405.455> (192.168.4.15:43866 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.3084.287> (192.168.4.28:44634 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.3080.287> (192.168.4.28:44636 -> 192.168.4.38:5672):
{inet_error,etimedout}

=ERROR REPORT==== 5-May-2018::20:36:23 ===
closing AMQP connection <0.8830.414> (192.168.4.15:36736 -> 192.168.4.38:5672):
{inet_error,etimedout}


- The cluster stopped because there is no quorum.
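
This stop-on-loss-of-quorum behaviour is normally driven by RabbitMQ's partition-handling setting. A minimal excerpt of the relevant stanza in /etc/rabbitmq/rabbitmq.config is sketched below; the value shown is the usual director default and is an assumption, not taken from this environment:

  %% /etc/rabbitmq/rabbitmq.config (excerpt)
  [
   {rabbit, [
     {cluster_partition_handling, pause_minority}  %% minority side pauses until quorum returns
   ]}
  ].

With pause_minority, nodes on the minority side stop themselves and are expected to rejoin automatically once connectivity returns; that automatic rejoin is what did not happen here. The warnings below show each mirrored queue's master shutting down because no synchronised slave was available: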

Mirrored queue 'vnc_config.dcd01-contrail-controller-0-8082' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.f4118d72-6385-4f8a-b2a6-55f343e6f02f' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'heat-engine-listener.4bbc48d9-96f0-44d5-9fc2-f9ae189a0776' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.a3f45e55-1b78-43d9-8408-522384da2817' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'conductor_fanout_8d3fe7ebb5f74bca8d9702edf0eb3438' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'conductor_fanout_056afd73365e4618a79531b14d35578d' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'heat-engine-listener.1d5a26df-3001-4c76-9c7c-1216dc26f4fa' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'heat-engine-listener.0cc0ce65-ad40-4b89-9133-6e639d06ee59' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'reply_1b4524de6ea2422c90d75d7d1d18b7de' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.e0006b2a-4fd3-4694-a960-bdfd8fcb847b' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker_fanout_4e8b12b459f044f4980058aa666fc894' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_worker.35d6a43c-02bd-42d0-9e82-2d5dfbe6d2b2' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_fanout_bf5bc0c04980423993c2c5a3271ee0a5' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'reply_17e23423abbd4fa4a0f700a0c5467308' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===
Mirrored queue 'engine_fanout_22bdf60c68f644a3a21652bceefc74df' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 5-May-2018::20:36:31 ===


It seems to recover, but a few minutes later it starts throwing heartbeat timeouts:

=INFO REPORT==== 5-May-2018::20:40:43 ===
accepting AMQP connection <0.1937.458> (192.168.4.15:60362 -> 192.168.4.38:5672)

=ERROR REPORT==== 5-May-2018::20:40:46 ===
closing AMQP connection <0.17545.457> (192.168.4.38:39838 -> 192.168.4.38:5672):
missed heartbeats from client, timeout: 60s

=INFO REPORT==== 5-May-2018::20:40:46 ===
accepting AMQP connection <0.1970.458> (192.168.4.38:60968 -> 192.168.4.38:5672)

=INFO REPORT==== 5-May-2018::20:40:47 ===
accepting AMQP connection <0.2067.458> (192.168.4.38:32804 -> 192.168.4.38:5672)

=INFO REPORT==== 5-May-2018::20:40:54 ===
accepting AMQP connection <0.2085.458> (192.168.4.28:42488 -> 192.168.4.38:5672)

=INFO REPORT==== 5-May-2018::20:40:55 ===
accepting AMQP connection <0.2178.458> (192.168.4.28:42510 -> 192.168.4.38:5672)

=ERROR REPORT==== 5-May-2018::20:40:57 ===
closing AMQP connection <0.7377.229> (192.168.4.15:58464 -> 192.168.4.38:5672):
missed heartbeats from client, timeout: 60s

=ERROR REPORT==== 5-May-2018::20:40:58 ===
closing AMQP connection <0.7418.229> (192.168.4.15:58470 -> 192.168.4.38:5672):
missed heartbeats from client, timeout: 60s
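
The 60-second client heartbeat timeout in these messages normally comes from the OpenStack services' oslo.messaging configuration rather than from RabbitMQ itself. A minimal excerpt with the usual defaults (assumed values, shown for illustration only):

  # e.g. /etc/nova/nova.conf, or any other service using oslo.messaging
  [oslo_messaging_rabbit]
  heartbeat_timeout_threshold = 60   # seconds before the connection is considered dead
  heartbeat_rate = 2                 # heartbeat checks per timeout interval

So the "missed heartbeats from client" errors indicate the OpenStack agents stopped heartbeating on their old connections, consistent with them reconnecting after the outage.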




Version-Release number of selected component (if applicable):
OSP10

How reproducible:
unsure

Steps to Reproduce:
1. Bring the connectivity down for a few minutes (e.g. as sketched below)
2.
3.
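
One way to simulate the outage for reproduction (illustrative only; the ports are the standard epmd/AMQP/inter-node ports) is to block cluster traffic on one controller for a few minutes:

  iptables -A INPUT -p tcp -m multiport --dports 4369,5672,25672 -j DROP
  sleep 300
  iptables -D INPUT -p tcp -m multiport --dports 4369,5672,25672 -j DROP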

Actual results:
RabbitMQ does not seem to recover after connectivity is restored.

Expected results:
RabbitMQ recovers once connectivity is restored.

Additional info:

Comment 4 Peter Lemenkov 2018-05-28 14:00:07 UTC
Yes, Pablo is right - this is a duplicate of bug 1441685. Please upgrade RabbitMQ to the latest release, and it will recover much better.

Feel free to reopen it if the issue still persists (after upgrade).
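
A quick way to confirm the installed version and cluster health after the upgrade (commands for illustration; run on any controller):

  rpm -q rabbitmq-server        # installed package version
  rabbitmqctl cluster_status    # running nodes vs. configured cluster nodes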

*** This bug has been marked as a duplicate of bug 1441685 ***

