Bug 1419177

Summary: rabbitmq cluster not forming with bigswitch virtual switch deployment on openstack controller nodes.
Product: Red Hat OpenStack
Reporter: bigswitch <rhosp-bugs-internal>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: bigswitch <rhosp-bugs-internal>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.0 (Liberty)
CC: apevec, fdinitto, jeckersb, jjoyce, lemenkov, lhh, rhosp-bugs-internal, srevivo, ushkalim
Target Milestone: async
Keywords: OtherQA, Triaged, ZStream
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: rabbitmq-server-3.3.5-33.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-20 12:48:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
* sos report from all the controllers (flags: none)
* controller0 (flags: none)
* controller1 (flags: none)
* sosreport from controller 2 (flags: none)
* sosreport from controller 1 (flags: none)
* sosreport from controller 0 (flags: none)

Description bigswitch 2017-02-03 19:50:28 UTC
Created attachment 1247567 [details]
sos report from all the controllers

Description of problem:

When installing the bigswitch virtual switch onto the controller nodes, the deployment fails and the rabbitmq server does not form a cluster.
Version-Release number of selected component (if applicable):
RHOSP8
How reproducible:

RHOSP8 with the bigswitch virtual switch on the openstack controller nodes.
Steps to Reproduce:
1. RHOSP 8 deployment with bigswitch virtual switch onto controller nodes.
2.
3.

Actual results:


Expected results:


Additional info:


sos report attached from all 3 controllers.

Comment 1 Peter Lemenkov 2017-02-07 12:09:49 UTC
(In reply to bigswitch from comment #0)
> Created attachment 1247567 [details]
> sos report from all the controllers
> 
> Description of problem:
> 
> when installing bigswitch virtual switch onto controller nodes , deployment
> is failing and also rabbitmq server is not forming a cluster.
> Version-Release number of selected component (if applicable):
> RHOSP8
> How reproducible:
> 
> RHOSP8 with bigswitch virutal switch onto openstack controller nodes.
> Steps to Reproduce:
> 1. RHOSP 8 deployment with bigswitch virtual switch onto controller nodes.
> 2.
> 3.
> 
> Actual results:
> 
> 
> Expected results:
> 
> 
> Additional info:
> 
> 
> sos report attached from all 3 controllers.

The attached tarball contains data from only one node. Could you please re-upload the other tarballs?

Comment 2 bigswitch 2017-02-16 16:42:29 UTC
Created attachment 1250907 [details]
controller0

Comment 3 bigswitch 2017-02-16 16:43:08 UTC
Created attachment 1250908 [details]
controller1

Comment 4 bigswitch 2017-02-16 16:43:31 UTC
Added sos reports from all controllers.

Comment 5 Peter Lemenkov 2017-02-20 17:45:50 UTC
OK, a few observations.

Your cluster got hit by netsplit issues, so please consider upgrading to resource-agents-3.9.5-86.el7, where netsplits are handled better (see bug #1397393 for further details).
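
For reference, a minimal sketch (not part of the original report) of how the installed resource-agents build and the RabbitMQ partition-handling mode can be checked on each controller; the paths and commands below are the usual RHOSP defaults and may differ in a given deployment:

# confirm the installed resource-agents build (3.9.5-86.el7 or later carries the improved netsplit handling)
rpm -q resource-agents

# check which partition-handling mode RabbitMQ is configured with (the logs in this bug show pause_minority)
grep cluster_partition_handling /etc/rabbitmq/rabbitmq.config

# list the current cluster members and any recorded partitions
rabbitmqctl cluster_status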

Comment 6 Peter Lemenkov 2017-02-20 17:46:38 UTC
=ERROR REPORT==== 3-Feb-2017::17:53:23 ===
** Generic server <0.2070.0> terminating
** Last message in was {'$gen_cast',{gm_deaths,[<20755.2101.0>]}}
** When Server state == {state,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"metering.sample">>},
                                false,false,none,
                                [{<<"x-ha-policy">>,longstr,<<"all">>}],
                                <0.2067.0>,[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"ha-all">>},
                                 {pattern,<<"^(?!amq\\.).*">>},
                                 {'apply-to',<<"all">>},
                                 {definition,[{<<"ha-mode">>,<<"all">>}]},
                                 {priority,0}],
                                [],[]},
                            <0.2071.0>,
                            {state,
                                {dict,0,16,16,8,80,48,
                                    {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                     [],[],[]},
                                    {{[],[],[],[],[],[],[],[],[],[],[],[],[],
                                      [],[],[]}}},
                                erlang},
                            #Fun<rabbit_mirror_queue_master.5.69179775>,
                            #Fun<rabbit_mirror_queue_master.6.69179775>}
** Reason for termination ==
** {{case_clause,{ok,<20750.2036.0>,[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2,
         [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,354}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}


This is GH#714.
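
For context, the crash above is in the coordinator of a mirrored queue governed by the ha-all policy visible in the server state. A small, illustrative example (not from the report) of how that policy and the mirrored queues it applies to can be inspected:

# show the HA policy applied to the default vhost
rabbitmqctl list_policies -p /

# list queues with the policy they match and their master/mirror processes
rabbitmqctl list_queues -p / name policy pid slave_pids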

Comment 7 Peter Lemenkov 2017-02-20 17:52:22 UTC
=ERROR REPORT==== 3-Feb-2017::17:53:36 ===
** Generic server <0.1960.0> terminating
** Last message in was {'DOWN',#Ref<0.0.0.10102>,process,<20755.1906.0>,
                               noconnection}
** When Server state == {state,
                            {1,<0.1960.0>},
                            {{0,<20750.1899.0>},#Ref<0.0.0.10101>},
                            {{0,<20755.1906.0>},#Ref<0.0.0.10102>},
                            {resource,<<"/">>,queue,
                                <<"heat-engine-listener_fanout_87c95e6a2a2242529d00fbef9880d75e">>},
                            rabbit_mirror_queue_slave,
                            {2,
                             [{{0,<20750.1899.0>},
                               {view_member,
                                   {0,<20750.1899.0>},
                                   [],
                                   {0,<20755.1906.0>},
                                   {1,<0.1960.0>}}},
                              {{0,<20755.1906.0>},
                               {view_member,
                                   {0,<20755.1906.0>},
                                   [],
                                   {1,<0.1960.0>},
                                   {0,<20750.1899.0>}}},
                              {{1,<0.1960.0>},
                               {view_member,
                                   {1,<0.1960.0>},
                                   [],
                                   {0,<20750.1899.0>},
                                   {0,<20755.1906.0>}}}]},
                            0,
                            [{{0,<20750.1899.0>},{member,{[],[]},0,0}},
                             {{0,<20755.1906.0>},
                              {member,
                                  {[{3,{delete_and_terminate,normal}}],[]},
                                  3,2}},
                             {{1,<0.1960.0>},{member,{[],[]},0,0}}],
                            [<0.1956.0>],
                            {[],[]},
                            [],0,undefined,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                            {true,{shutdown,ring_shutdown}}}
** Reason for termination ==
** {function_clause,[{orddict,fetch,
                              [{1,<0.1960.0>},[]],
                              [{file,"orddict.erl"},{line,72}]},
                     {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1223}]},
                     {gm,change_view,2,[{file,"src/gm.erl"},{line,1395}]},
                     {gm,handle_info,2,[{file,"src/gm.erl"},{line,729}]},
                     {gen_server2,handle_msg,2,
                                  [{file,"src/gen_server2.erl"},{line,1022}]},
                     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}


This is GH#368.
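
Similarly, once the nodes rejoin after a netsplit it is worth confirming that the queue mirrors have resynchronised; a hedged example (not from the report):

# compare the mirrors of each queue with the synchronised ones; the two columns should match once recovery is complete
rabbitmqctl list_queues -p / name slave_pids synchronised_slave_pids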

Comment 8 Peter Lemenkov 2017-02-20 18:00:03 UTC
Ok, I can't find any other issues so far. I'll update this ticket shortly.

Comment 9 Peter Lemenkov 2017-02-20 18:30:29 UTC
GH#368 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/724f531

GH#714 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/18210ba

Comment 10 Peter Lemenkov 2017-02-21 15:31:10 UTC
Both issues are addressed in rabbitmq-server-3.3.5-33.el7ost.

Comment 11 bigswitch 2017-02-22 18:20:40 UTC
Is there a procedure to upgrade rabbitmq-server during deployment or is it part of RHOSP10?

Comment 12 Peter Lemenkov 2017-02-23 14:26:42 UTC
(In reply to bigswitch from comment #11)
> Is there a procedure to upgrade rabbitmq-server during deployment or is it
> part of RHOSP10?

We're shipping rabbitmq-server-3.6.3 within RHOS9 and more recent RHOS versions, which already contain all the necessary patches.

As for RHOS8, the package (rabbitmq-server-3.3.5-33.el7ost) will be available in the z-stream relatively soon.
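
Once the z-stream build is published, a minimal sketch of picking it up on a controller, assuming the corresponding repository is already enabled and that the pacemaker resource carries the usual RHOSP name (confirm it with "pcs status"):

# check the currently installed build
rpm -q rabbitmq-server

# update to the fixed build once it is available in the enabled repositories
yum update -y rabbitmq-server

# restart the pacemaker-managed RabbitMQ resource so the new binaries are picked up
pcs resource restart rabbitmq-clone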

Comment 13 bigswitch 2017-02-24 17:53:23 UTC
We are still seeing that the cluster is up, but the deployment is failing with RHOSP9. The sosreports are attached.
The rabbitmq version is:
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-5.el7ost.noarch

Comment 14 bigswitch 2017-02-24 17:55:26 UTC
Created attachment 1257393 [details]
sosreport from controller 2

Comment 15 bigswitch 2017-02-24 17:56:27 UTC
Created attachment 1257394 [details]
sosreport from controller 1

Comment 16 bigswitch 2017-02-24 17:59:08 UTC
Created attachment 1257395 [details]
sosreport from controller 0

Comment 17 Peter Lemenkov 2017-02-27 16:45:30 UTC
(In reply to bigswitch from comment #13)
> We are still seeing cluster is up, but deployment is failing with RHOSP9.
> Attached is the sosreport
> rabbitmq version is
> [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
> rabbitmq-server-3.6.3-5.el7ost.noarch

I see that nodes 0 and 1 could not join the rest of the cluster initially; still, they finally joined it a little later.

Node 0 failed to join at 24-Feb-2017::00:21:22, and rejoined at 24-Feb-2017::00:21:41 (about 20 seconds later).

Node 1 stopped operating due to the "pause_minority" option at 24-Feb-2017::00:16:55, and rejoined at 24-Feb-2017::00:21:42. Here is what happened from node 1's point of view:

===================================

=INFO REPORT==== 24-Feb-2017::00:16:55 ===
rabbit on node 'rabbit@overcloud-controller-0' down
...
=ERROR REPORT==== 24-Feb-2017::00:16:55 ===
Partial partition detected:
 * We saw DOWN from rabbit@overcloud-controller-0
 * We can still see rabbit@overcloud-controller-2 which can see rabbit@overcloud-controller-0
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-0' up
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:20:44 ===
rabbit on node 'rabbit@overcloud-controller-2' down
...
=INFO REPORT==== 24-Feb-2017::00:21:00 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@overcloud-controller-1' thinks it's clustered with node 'rabbit@overcloud-controller-2', but 'rabbit@overcloud-controller-2' disagrees"}}
...
=INFO REPORT==== 24-Feb-2017::00:21:19 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:21:42 ===
rabbit on node 'rabbit@overcloud-controller-0' up

===================================

This

Comment 18 Peter Lemenkov 2017-02-27 16:48:39 UTC
Sorry, accidentally pressed send.

This shows that something happened at "24-Feb-2017::00:16:55".

I looked at node-0 logs:

Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30073]: ERROR: [findif] failed
...
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5498] device (p1p1): enslaved to non-master-type device ivs; ignoring
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5531] manager: (ext4000): new Generic device (/org/freedesktop/NetworkManager/Devices/15)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5546] manager: (storagemgmt3997): new Generic device (/org/freedesktop/NetworkManager/Devices/16)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5561] manager: (api3999): new Generic device (/org/freedesktop/NetworkManager/Devices/17)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5578] manager: (storage3998): new Generic device (/org/freedesktop/NetworkManager/Devices/18)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5699] device (ext4000): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5707] device (storagemgmt3997): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5708] device (api3999): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5709] device (storage3998): link connected
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
...
Feb 24 00:16:49 overcloud-controller-0.localdomain lrmd[27617]:   notice: ip-10.8.84.70_monitor_10000:30043:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
Feb 24 00:16:49 overcloud-controller-0.localdomain crmd[27620]:   notice: overcloud-controller-0-ip-10.8.84.70_monitor_10000:90 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] A processor failed, forming new configuration.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] The network interface is down.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.15}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.14}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.16}
Feb 24 00:16:50 overcloud-controller-0.localdomain bash[29987]: ifup ivs port ext4000
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn>  [1487924210.6024] dhcp4 (em1): request timed out
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6024] dhcp4 (em1): state changed unknown -> timeout
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6130] dhcp4 (em1): canceled DHCP transaction, DHCP client pid 6779
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6130] dhcp4 (em1): state changed timeout -> done
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6133] device (em1): state change: ip-config -> failed (reason 'ip-config-unavailable') [70 120 5]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6135] manager: NetworkManager state is now DISCONNECTED
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn>  [1487924210.6138] device (em1): Activation: failed for connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6145] device (em1): state change: failed -> disconnected (reason 'none') [120 30 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6175] policy: auto-activating connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6186] device (em1): Activation: starting connection 'System em1' (1dad842d-1912-ef5a-a43a-bc238fb267e7)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6188] device (em1): state change: disconnected -> prepare (reason 'none') [30 40 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6189] manager: NetworkManager state is now CONNECTING
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6194] device (em1): state change: prepare -> config (reason 'none') [40 50 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6201] device (em1): state change: config -> ip-config (reason 'none') [50 70 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6205] dhcp4 (em1): activation: beginning transaction (timeout in 45 seconds)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6267] dhcp4 (em1): dhclient started with pid 30228
Feb 24 00:16:50 overcloud-controller-0.localdomain dhclient[30228]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 3 (xid=0x763a0705)

Comment 19 Peter Lemenkov 2017-02-27 16:51:18 UTC
I bet something went wrong with networking.

I would advise you to upgrade resource-agents from your current resource-agents-3.9.5-82.el7.x86_64 to resource-agents-3.9.5-86.el7.x86_64, where we improved the recovery time. In the case mentioned above RabbitMQ will recover 1 minute faster, down from 5 to 4 minutes.

But this won't fix the networking instability I see in the logs. Still, 4 minutes is better than 5.
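
A hedged sketch of that upgrade, plus the checks suggested by the IPaddr2/corosync errors in the node-0 log above; the interface and resource names are taken from that log excerpt and may differ per environment:

# upgrade the resource agents on each controller and verify the build
yum update -y resource-agents
rpm -q resource-agents

# the "Unable to find nic or netmask" errors mean the interface carrying the VIP went away; check it directly
ip addr show em1
pcs status
pcs resource show ip-10.8.84.70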

Comment 22 Fabio Massimo Di Nitto 2017-03-21 14:19:16 UTC
Can somebody from bigswitch please verify the build mentioned in comment #19 and let us know ASAP if that helps?

Comment 25 errata-xmlrpc 2017-06-20 12:48:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1542

Comment 26 Red Hat Bugzilla 2023-09-14 03:53:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.