Bug 1419177 - RabbitMQ cluster not forming with BigSwitch virtual switch deployment on OpenStack controller nodes.
Summary: RabbitMQ cluster not forming with BigSwitch virtual switch deployment on OpenStack controller nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Peter Lemenkov
QA Contact: bigswitch
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-03 19:50 UTC by bigswitch
Modified: 2023-09-14 03:53 UTC
CC List: 9 users

Fixed In Version: rabbitmq-server-3.3.5-33.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-20 12:48:04 UTC
Target Upstream Version:
Embargoed:


Attachments
* sos report from all the controllers (10.24 MB, application/x-xz), 2017-02-03 19:50 UTC, bigswitch
* controller0 (10.39 MB, application/x-xz), 2017-02-16 16:42 UTC, bigswitch
* controller1 (10.80 MB, application/x-xz), 2017-02-16 16:43 UTC, bigswitch
* sosreport from controller 2 (13.68 MB, application/x-xz), 2017-02-24 17:55 UTC, bigswitch
* sosreport from controller 1 (13.85 MB, application/x-xz), 2017-02-24 17:56 UTC, bigswitch
* sosreport from controller 0 (13.92 MB, application/x-xz), 2017-02-24 17:59 UTC, bigswitch


Links
* Github: fedora-erlang/rabbitmq-server/commit/18210ba (last updated 2017-02-20 18:32:22 UTC)
* Github: fedora-erlang/rabbitmq-server/commit/724f531 (last updated 2017-02-20 18:30:45 UTC)
* Github: rabbitmq/rabbitmq-server issue 368 (last updated 2017-02-20 17:52:21 UTC)
* Github: rabbitmq/rabbitmq-server issue 714 (last updated 2017-02-20 17:46:38 UTC)
* Red Hat Product Errata: RHBA-2017:1542 (normal, SHIPPED_LIVE), Red Hat OpenStack Platform 8 Bug Fix and Enhancement Advisory (last updated 2017-06-20 16:45:36 UTC)

Description bigswitch 2017-02-03 19:50:28 UTC
Created attachment 1247567 [details]
sos report from all the controllers

Description of problem:

When installing the BigSwitch virtual switch onto the controller nodes, the deployment fails and the RabbitMQ server does not form a cluster.

Version-Release number of selected component (if applicable):
RHOSP8

How reproducible:

RHOSP8 with the BigSwitch virtual switch on the OpenStack controller nodes.
Steps to Reproduce:
1. Deploy RHOSP 8 with the BigSwitch virtual switch on the controller nodes.

Actual results:


Expected results:


Additional info:


sos report attached from all 3 controllers.

Comment 1 Peter Lemenkov 2017-02-07 12:09:49 UTC
(In reply to bigswitch from comment #0)
> sos report attached from all 3 controllers.

The attached tarball contains data from only one node. Could you please re-upload the other tarballs?

Comment 2 bigswitch 2017-02-16 16:42:29 UTC
Created attachment 1250907 [details]
controller0

Comment 3 bigswitch 2017-02-16 16:43:08 UTC
Created attachment 1250908 [details]
controller1

Comment 4 bigswitch 2017-02-16 16:43:31 UTC
Added sos reports from all controllers.

Comment 5 Peter Lemenkov 2017-02-20 17:45:50 UTC
OK, a few observations.

Your cluster got hit by netsplit issues, so please consider upgrading to resource-agents-3.9.5-86.el7, where netsplits are handled better (see bug #1397393 for further details).
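
To check whether the nodes currently disagree about membership, rabbitmqctl reports both the running nodes and any partitions it has detected; for example, on any controller:

# the "partitions" entry in the output stays non-empty while a netsplit is in effect
rabbitmqctl cluster_status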

Comment 6 Peter Lemenkov 2017-02-20 17:46:38 UTC
=ERROR REPORT==== 3-Feb-2017::17:53:23 ===
** Generic server <0.2070.0> terminating
** Last message in was {'$gen_cast',{gm_deaths,[<20755.2101.0>]}}
** When Server state == {state,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"metering.sample">>},
                                false,false,none,
                                [{<<"x-ha-policy">>,longstr,<<"all">>}],
                                <0.2067.0>,[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"ha-all">>},
                                 {pattern,<<"^(?!amq\\.).*">>},
                                 {'apply-to',<<"all">>},
                                 {definition,[{<<"ha-mode">>,<<"all">>}]},
                                 {priority,0}],
                                [],[]},
                            <0.2071.0>,
                            {state,
                                {dict,0,16,16,8,80,48,
                                    {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                     [],[],[]},
                                    {{[],[],[],[],[],[],[],[],[],[],[],[],[],
                                      [],[],[]}}},
                                erlang},
                            #Fun<rabbit_mirror_queue_master.5.69179775>,
                            #Fun<rabbit_mirror_queue_master.6.69179775>}
** Reason for termination ==
** {{case_clause,{ok,<20750.2036.0>,[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2,
         [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,354}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}


This is GH#714.
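
For context, a {case_clause,Value} exit means a case expression received a value that none of its clauses matched; here rabbit_mirror_queue_coordinator got {ok,Pid,[]} in a shape it did not expect. A minimal Erlang shell illustration (not RabbitMQ code; the pid will vary):

1> case {ok, self(), []} of {ok, _Pid} -> matched end.
** exception error: no case clause matching {ok,<0.84.0>,[]}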

Comment 7 Peter Lemenkov 2017-02-20 17:52:22 UTC
=ERROR REPORT==== 3-Feb-2017::17:53:36 ===
** Generic server <0.1960.0> terminating
** Last message in was {'DOWN',#Ref<0.0.0.10102>,process,<20755.1906.0>,
                               noconnection}
** When Server state == {state,
                            {1,<0.1960.0>},
                            {{0,<20750.1899.0>},#Ref<0.0.0.10101>},
                            {{0,<20755.1906.0>},#Ref<0.0.0.10102>},
                            {resource,<<"/">>,queue,
                                <<"heat-engine-listener_fanout_87c95e6a2a2242529d00fbef9880d75e">>},
                            rabbit_mirror_queue_slave,
                            {2,
                             [{{0,<20750.1899.0>},
                               {view_member,
                                   {0,<20750.1899.0>},
                                   [],
                                   {0,<20755.1906.0>},
                                   {1,<0.1960.0>}}},
                              {{0,<20755.1906.0>},
                               {view_member,
                                   {0,<20755.1906.0>},
                                   [],
                                   {1,<0.1960.0>},
                                   {0,<20750.1899.0>}}},
                              {{1,<0.1960.0>},
                               {view_member,
                                   {1,<0.1960.0>},
                                   [],
                                   {0,<20750.1899.0>},
                                   {0,<20755.1906.0>}}}]},
                            0,
                            [{{0,<20750.1899.0>},{member,{[],[]},0,0}},
                             {{0,<20755.1906.0>},
                              {member,
                                  {[{3,{delete_and_terminate,normal}}],[]},
                                  3,2}},
                             {{1,<0.1960.0>},{member,{[],[]},0,0}}],
                            [<0.1956.0>],
                            {[],[]},
                            [],0,undefined,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                            {true,{shutdown,ring_shutdown}}}
** Reason for termination ==
** {function_clause,[{orddict,fetch,
                              [{1,<0.1960.0>},[]],
                              [{file,"orddict.erl"},{line,72}]},
                     {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1223}]},
                     {gm,change_view,2,[{file,"src/gm.erl"},{line,1395}]},
                     {gm,handle_info,2,[{file,"src/gm.erl"},{line,729}]},
                     {gen_server2,handle_msg,2,
                                  [{file,"src/gen_server2.erl"},{line,1022}]},
                     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}


This is GH#368.
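
The first frame of the stacktrace is orddict:fetch/2, which assumes its key is present and raises function_clause when it is not. A minimal Erlang shell illustration (not RabbitMQ code; the pid will vary):

1> orddict:fetch({1, self()}, orddict:new()).
** exception error: no function clause matching orddict:fetch({1,<0.84.0>},[]) (orddict.erl, line 72)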

Comment 8 Peter Lemenkov 2017-02-20 18:00:03 UTC
OK, I can't find any other issues so far. I'll update this ticket shortly.

Comment 9 Peter Lemenkov 2017-02-20 18:30:29 UTC
GH#368 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/724f531

GH#714 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/18210ba

Comment 10 Peter Lemenkov 2017-02-21 15:31:10 UTC
Both issues are addressed in rabbitmq-server-3.3.5-33.el7ost.

Comment 11 bigswitch 2017-02-22 18:20:40 UTC
Is there a procedure to upgrade rabbitmq-server during deployment, or is it part of RHOSP10?

Comment 12 Peter Lemenkov 2017-02-23 14:26:42 UTC
(In reply to bigswitch from comment #11)
> Is there a procedure to upgrade rabbitmq-server during deployment or is it
> part of RHOSP10?

We're shipping rabbitmq-server-3.6.3 within RHOS9 and more recent RHOS versions, which already contain all the necessary patches.

As for RHOS8, the package (rabbitmq-server-3.3.5-33.el7ost) will be available in the z-stream relatively soon.
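
Once the z-stream build is out, a rough per-controller sketch of the check and update (not an official procedure; the pacemaker resource name is an assumption, check "pcs status" for the actual one):

rpm -q rabbitmq-server                # confirm the currently installed build
yum update rabbitmq-server            # should pull in rabbitmq-server-3.3.5-33.el7ost
pcs resource restart rabbitmq-clone   # assumed resource name; restarts RabbitMQ under pacemaker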

Comment 13 bigswitch 2017-02-24 17:53:23 UTC
We are still seeing that the cluster is up, but the deployment is failing with RHOSP9. Attached are the sosreports.
The rabbitmq version is:
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-5.el7ost.noarch

Comment 14 bigswitch 2017-02-24 17:55:26 UTC
Created attachment 1257393 [details]
sosreport from controller 2

Comment 15 bigswitch 2017-02-24 17:56:27 UTC
Created attachment 1257394 [details]
sosreport from controller 1

Comment 16 bigswitch 2017-02-24 17:59:08 UTC
Created attachment 1257395 [details]
sosreport from controller 0

Comment 17 Peter Lemenkov 2017-02-27 16:45:30 UTC
(In reply to bigswitch from comment #13)
> We are still seeing that the cluster is up, but the deployment is failing
> with RHOSP9. Attached are the sosreports.
> The rabbitmq version is:
> [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
> rabbitmq-server-3.6.3-5.el7ost.noarch

I see that nodes 0 and 1 could not join the rest of the cluster initially. Still, they finally joined a little later.

Node 0 failed to join at 24-Feb-2017::00:21:22, and rejoined at 24-Feb-2017::00:21:41 (20 seconds later).

Node 1 stopped operation due to the "pause minority" option at 24-Feb-2017::00:16:55 and rejoined at 24-Feb-2017::00:21:42. Here is what happened from node 1's point of view:

===================================

=INFO REPORT==== 24-Feb-2017::00:16:55 ===
rabbit on node 'rabbit@overcloud-controller-0' down
...
=ERROR REPORT==== 24-Feb-2017::00:16:55 ===
Partial partition detected:
 * We saw DOWN from rabbit@overcloud-controller-0
 * We can still see rabbit@overcloud-controller-2 which can see rabbit@overcloud-controller-0
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-0' up
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:20:44 ===
rabbit on node 'rabbit@overcloud-controller-2' down
...
=INFO REPORT==== 24-Feb-2017::00:21:00 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@overcloud-controller-1' thinks it's clustered with node 'rabbit@overcloud-controller-2', but 'rabbit@overcloud-controller-2' disagrees"}}
...
=INFO REPORT==== 24-Feb-2017::00:21:19 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:21:42 ===
rabbit on node 'rabbit@overcloud-controller-0' up

===================================

This

Comment 18 Peter Lemenkov 2017-02-27 16:48:39 UTC
Sorry, accidentally pressed send.

This shows that something happened at "24-Feb-2017::00:16:55".
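
For reference, the pause_minority behaviour above is driven by the partition-handling key in /etc/rabbitmq/rabbitmq.config; a representative excerpt (assuming the usual RHOSP layout) looks like:

[
  {rabbit, [
    %% a node that finds itself on the minority side of a partition
    %% pauses until it can see the whole cluster again
    {cluster_partition_handling, pause_minority}
  ]}
].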

I looked at node-0 logs:

Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30073]: ERROR: [findif] failed
...
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5498] device (p1p1): enslaved to non-master-type device ivs; ignoring
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5531] manager: (ext4000): new Generic device (/org/freedesktop/NetworkManager/Devices/15)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5546] manager: (storagemgmt3997): new Generic device (/org/freedesktop/NetworkManager/Devices/16)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5561] manager: (api3999): new Generic device (/org/freedesktop/NetworkManager/Devices/17)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5578] manager: (storage3998): new Generic device (/org/freedesktop/NetworkManager/Devices/18)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5699] device (ext4000): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5707] device (storagemgmt3997): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5708] device (api3999): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5709] device (storage3998): link connected
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
...
Feb 24 00:16:49 overcloud-controller-0.localdomain lrmd[27617]:   notice: ip-10.8.84.70_monitor_10000:30043:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
Feb 24 00:16:49 overcloud-controller-0.localdomain crmd[27620]:   notice: overcloud-controller-0-ip-10.8.84.70_monitor_10000:90 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] A processor failed, forming new configuration.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] The network interface is down.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.15}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.14}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.16}
Feb 24 00:16:50 overcloud-controller-0.localdomain bash[29987]: ifup ivs port ext4000
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn>  [1487924210.6024] dhcp4 (em1): request timed out
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6024] dhcp4 (em1): state changed unknown -> timeout
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6130] dhcp4 (em1): canceled DHCP transaction, DHCP client pid 6779
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6130] dhcp4 (em1): state changed timeout -> done
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6133] device (em1): state change: ip-config -> failed (reason 'ip-config-unavailable') [70 120 5]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6135] manager: NetworkManager state is now DISCONNECTED
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn>  [1487924210.6138] device (em1): Activation: failed for connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6145] device (em1): state change: failed -> disconnected (reason 'none') [120 30 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6175] policy: auto-activating connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6186] device (em1): Activation: starting connection 'System em1' (1dad842d-1912-ef5a-a43a-bc238fb267e7)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6188] device (em1): state change: disconnected -> prepare (reason 'none') [30 40 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6189] manager: NetworkManager state is now CONNECTING
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6194] device (em1): state change: prepare -> config (reason 'none') [40 50 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6201] device (em1): state change: config -> ip-config (reason 'none') [50 70 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6205] dhcp4 (em1): activation: beginning transaction (timeout in 45 seconds)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6267] dhcp4 (em1): dhclient started with pid 30228
Feb 24 00:16:50 overcloud-controller-0.localdomain dhclient[30228]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 3 (xid=0x763a0705)

Comment 19 Peter Lemenkov 2017-02-27 16:51:18 UTC
I bet something went wrong with networking.

I would advise you to upgrade resource-agents from your current resource-agents-3.9.5-82.el7.x86_64 to resource-agents-3.9.5-86.el7.x86_64, where we improved the recovery time. In the case mentioned above, RabbitMQ will recover 1 minute faster, down from 5 minutes to 4.

But this won't fix the networking instability I see in the logs. Still, 4 minutes is better than 5.
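
A quick way to confirm what each controller runs and pull the newer build, assuming the update is already available in your enabled repositories:

rpm -q resource-agents        # expect resource-agents-3.9.5-82.el7 before the update
yum update resource-agents    # should bring in resource-agents-3.9.5-86.el7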

Comment 22 Fabio Massimo Di Nitto 2017-03-21 14:19:16 UTC
Can somebody from bigswitch please verify the build mentioned in comment #19 and let us know ASAP if that helps?

Comment 25 errata-xmlrpc 2017-06-20 12:48:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1542

Comment 26 Red Hat Bugzilla 2023-09-14 03:53:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

