Bug 1419177

Summary: rabbitmq cluster not forming with bigswitch virtual switch deployment on openstack controller nodes.
Product: Red Hat OpenStack
Reporter: bigswitch <rhosp-bugs-internal>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: bigswitch <rhosp-bugs-internal>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.0 (Liberty)
CC: apevec, fdinitto, jeckersb, jjoyce, lemenkov, lhh, rhosp-bugs-internal, srevivo, ushkalim
Target Milestone: async
Keywords: OtherQA, Triaged, ZStream
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: rabbitmq-server-3.3.5-33.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-20 12:48:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
* sos report from all the controllers (flags: none)
* controller0 (flags: none)
* controller1 (flags: none)
* sosreport from controller 2 (flags: none)
* sosreport from controller 1 (flags: none)
* sosreport from controller 0 (flags: none)

Description bigswitch 2017-02-03 19:50:28 UTC
Created attachment 1247567 [details]
sos report from all the controllers

Description of problem:

When installing the bigswitch virtual switch onto the controller nodes, the deployment fails and the rabbitmq server does not form a cluster.
Version-Release number of selected component (if applicable):
RHOSP8
How reproducible:

RHOSP8 with the bigswitch virtual switch on the openstack controller nodes.
Steps to Reproduce:
1. RHOSP 8 deployment with bigswitch virtual switch onto controller nodes.
2.
3.

Actual results:


Expected results:


Additional info:


sos report attached from all 3 controllers.

Comment 1 Peter Lemenkov 2017-02-07 12:09:49 UTC
(In reply to bigswitch from comment #0)
> Created attachment 1247567 [details]
> sos report from all the controllers
> 
> Description of problem:
> 
> when installing bigswitch virtual switch onto controller nodes , deployment
> is failing and also rabbitmq server is not forming a cluster.
> Version-Release number of selected component (if applicable):
> RHOSP8
> How reproducible:
> 
> RHOSP8 with bigswitch virutal switch onto openstack controller nodes.
> Steps to Reproduce:
> 1. RHOSP 8 deployment with bigswitch virtual switch onto controller nodes.
> 2.
> 3.
> 
> Actual results:
> 
> 
> Expected results:
> 
> 
> Additional info:
> 
> 
> sos report attached from all 3 controllers.

The attached tarball contains data from only one node. Could you please re-upload the other tarballs?

Comment 2 bigswitch 2017-02-16 16:42:29 UTC
Created attachment 1250907 [details]
controller0

Comment 3 bigswitch 2017-02-16 16:43:08 UTC
Created attachment 1250908 [details]
controller1

Comment 4 bigswitch 2017-02-16 16:43:31 UTC
Added sos reports from all controllers.

Comment 5 Peter Lemenkov 2017-02-20 17:45:50 UTC
OK, a few observations.

Your cluster got hit by netsplit issues, so please consider upgrading to resource-agents-3.9.5-86.el7, where netsplits are handled better (see bug #1397393 for further details).
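
For reference, a minimal sketch (not part of the original report) of how the installed resource-agents build and the RabbitMQ partition-handling mode can be checked on each controller; the paths and commands below are the usual RHOSP defaults and may differ in a given deployment:

# confirm the installed resource-agents build (3.9.5-86.el7 or later carries the improved netsplit handling)
rpm -q resource-agents

# check which partition-handling mode RabbitMQ is configured with (the logs in this bug show pause_minority)
grep cluster_partition_handling /etc/rabbitmq/rabbitmq.config

# list the current cluster members and any recorded partitions
rabbitmqctl cluster_status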

Comment 6 Peter Lemenkov 2017-02-20 17:46:38 UTC
=ERROR REPORT==== 3-Feb-2017::17:53:23 ===
** Generic server <0.2070.0> terminating
** Last message in was {'$gen_cast',{gm_deaths,[<20755.2101.0>]}}
** When Server state == {state,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"metering.sample">>},
                                false,false,none,
                                [{<<"x-ha-policy">>,longstr,<<"all">>}],
                                <0.2067.0>,[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"ha-all">>},
                                 {pattern,<<"^(?!amq\\.).*">>},
                                 {'apply-to',<<"all">>},
                                 {definition,[{<<"ha-mode">>,<<"all">>}]},
                                 {priority,0}],
                                [],[]},
                            <0.2071.0>,
                            {state,
                                {dict,0,16,16,8,80,48,
                                    {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                     [],[],[]},
                                    {{[],[],[],[],[],[],[],[],[],[],[],[],[],
                                      [],[],[]}}},
                                erlang},
                            #Fun<rabbit_mirror_queue_master.5.69179775>,
                            #Fun<rabbit_mirror_queue_master.6.69179775>}
** Reason for termination ==
** {{case_clause,{ok,<20750.2036.0>,[]}},
    [{rabbit_mirror_queue_coordinator,handle_cast,2,
         [{file,"src/rabbit_mirror_queue_coordinator.erl"},{line,354}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}


This is GH#714.
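
For context, the crash above is in the coordinator of a mirrored queue governed by the ha-all policy visible in the server state. A small, illustrative example (not from the report) of how that policy and the mirrored queues it applies to can be inspected:

# show the HA policy applied to the default vhost
rabbitmqctl list_policies -p /

# list queues with the policy they match and their master/mirror processes
rabbitmqctl list_queues -p / name policy pid slave_pids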

Comment 7 Peter Lemenkov 2017-02-20 17:52:22 UTC
=ERROR REPORT==== 3-Feb-2017::17:53:36 ===
** Generic server <0.1960.0> terminating
** Last message in was {'DOWN',#Ref<0.0.0.10102>,process,<20755.1906.0>,
                               noconnection}
** When Server state == {state,
                            {1,<0.1960.0>},
                            {{0,<20750.1899.0>},#Ref<0.0.0.10101>},
                            {{0,<20755.1906.0>},#Ref<0.0.0.10102>},
                            {resource,<<"/">>,queue,
                                <<"heat-engine-listener_fanout_87c95e6a2a2242529d00fbef9880d75e">>},
                            rabbit_mirror_queue_slave,
                            {2,
                             [{{0,<20750.1899.0>},
                               {view_member,
                                   {0,<20750.1899.0>},
                                   [],
                                   {0,<20755.1906.0>},
                                   {1,<0.1960.0>}}},
                              {{0,<20755.1906.0>},
                               {view_member,
                                   {0,<20755.1906.0>},
                                   [],
                                   {1,<0.1960.0>},
                                   {0,<20750.1899.0>}}},
                              {{1,<0.1960.0>},
                               {view_member,
                                   {1,<0.1960.0>},
                                   [],
                                   {0,<20750.1899.0>},
                                   {0,<20755.1906.0>}}}]},
                            0,
                            [{{0,<20750.1899.0>},{member,{[],[]},0,0}},
                             {{0,<20755.1906.0>},
                              {member,
                                  {[{3,{delete_and_terminate,normal}}],[]},
                                  3,2}},
                             {{1,<0.1960.0>},{member,{[],[]},0,0}}],
                            [<0.1956.0>],
                            {[],[]},
                            [],0,undefined,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                            {true,{shutdown,ring_shutdown}}}
** Reason for termination ==
** {function_clause,[{orddict,fetch,
                              [{1,<0.1960.0>},[]],
                              [{file,"orddict.erl"},{line,72}]},
                     {gm,check_neighbours,1,[{file,"src/gm.erl"},{line,1223}]},
                     {gm,change_view,2,[{file,"src/gm.erl"},{line,1395}]},
                     {gm,handle_info,2,[{file,"src/gm.erl"},{line,729}]},
                     {gen_server2,handle_msg,2,
                                  [{file,"src/gen_server2.erl"},{line,1022}]},
                     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}


This is GH#368.
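
Similarly, once the nodes rejoin after a netsplit it is worth confirming that the queue mirrors have resynchronised; a hedged example (not from the report):

# compare the mirrors of each queue with the synchronised ones; the two columns should match once recovery is complete
rabbitmqctl list_queues -p / name slave_pids synchronised_slave_pids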

Comment 8 Peter Lemenkov 2017-02-20 18:00:03 UTC
Ok, I can't find any other issues so far. I'll update this ticket shortly.

Comment 9 Peter Lemenkov 2017-02-20 18:30:29 UTC
GH#368 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/724f531

GH#714 is addressed by this commit:

* https://github.com/fedora-erlang/rabbitmq-server/commit/18210ba

Comment 10 Peter Lemenkov 2017-02-21 15:31:10 UTC
Both issues are addressed in rabbitmq-server-3.3.5-33.el7ost.

Comment 11 bigswitch 2017-02-22 18:20:40 UTC
Is there a procedure to upgrade rabbitmq-server during deployment or is it part of RHOSP10?

Comment 12 Peter Lemenkov 2017-02-23 14:26:42 UTC
(In reply to bigswitch from comment #11)
> Is there a procedure to upgrade rabbitmq-server during deployment or is it
> part of RHOSP10?

We're shipping rabbitmq-server-3.6.3 within RHOS9 and more recent RHOS versions, which already contain all the necessary patches.

As for RHOS8, the package (rabbitmq-server-3.3.5-33.el7ost) will be available in the z-stream relatively soon.
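
Once the z-stream build is published, a minimal sketch of picking it up on a controller, assuming the corresponding repository is already enabled and that the pacemaker resource carries the usual RHOSP name (confirm it with "pcs status"):

# check the currently installed build
rpm -q rabbitmq-server

# update to the fixed build once it is available in the enabled repositories
yum update -y rabbitmq-server

# restart the pacemaker-managed RabbitMQ resource so the new binaries are picked up
pcs resource restart rabbitmq-clone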

Comment 13 bigswitch 2017-02-24 17:53:23 UTC
We are still seeing that the cluster is up, but the deployment is failing with RHOSP9. The sosreports are attached.
The rabbitmq version is:
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
rabbitmq-server-3.6.3-5.el7ost.noarch

Comment 14 bigswitch 2017-02-24 17:55:26 UTC
Created attachment 1257393 [details]
sosreport from controller 2

Comment 15 bigswitch 2017-02-24 17:56:27 UTC
Created attachment 1257394 [details]
sosreport from controller 1

Comment 16 bigswitch 2017-02-24 17:59:08 UTC
Created attachment 1257395 [details]
sosreport from controller 0

Comment 17 Peter Lemenkov 2017-02-27 16:45:30 UTC
(In reply to bigswitch from comment #13)
> We are still seeing cluster is up, but deployment is failing with RHOSP9.
> Attached is the sosreport
> rabbitmq version is
> [root@overcloud-controller-0 heat-admin]# rpm -qa | grep rabbitmq
> rabbitmq-server-3.6.3-5.el7ost.noarch

I see that nodes 0 and 1 could not join the rest of the cluster initially; still, they finally joined it a little later.

Node 0 failed to join at 24-Feb-2017::00:21:22, and rejoined at 24-Feb-2017::00:21:41 (about 20 seconds later).

Node 1 stopped operating due to the "pause_minority" option at 24-Feb-2017::00:16:55, and rejoined at 24-Feb-2017::00:21:42. Here is what happened from node 1's point of view:

===================================

=INFO REPORT==== 24-Feb-2017::00:16:55 ===
rabbit on node 'rabbit@overcloud-controller-0' down
...
=ERROR REPORT==== 24-Feb-2017::00:16:55 ===
Partial partition detected:
 * We saw DOWN from rabbit@overcloud-controller-0
 * We can still see rabbit@overcloud-controller-2 which can see rabbit@overcloud-controller-0
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-0' up
...
=INFO REPORT==== 24-Feb-2017::00:16:57 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:20:44 ===
rabbit on node 'rabbit@overcloud-controller-2' down
...
=INFO REPORT==== 24-Feb-2017::00:21:00 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@overcloud-controller-1' thinks it's clustered with node 'rabbit@overcloud-controller-2', but 'rabbit@overcloud-controller-2' disagrees"}}
...
=INFO REPORT==== 24-Feb-2017::00:21:19 ===
rabbit on node 'rabbit@overcloud-controller-2' up
...
=INFO REPORT==== 24-Feb-2017::00:21:42 ===
rabbit on node 'rabbit@overcloud-controller-0' up

===================================

This

Comment 18 Peter Lemenkov 2017-02-27 16:48:39 UTC
Sorry, accidentally pressed send.

This shows that something happened at "24-Feb-2017::00:16:55".

I looked at node-0 logs:

Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30073]: ERROR: [findif] failed
...
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5498] device (p1p1): enslaved to non-master-type device ivs; ignoring
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5531] manager: (ext4000): new Generic device (/org/freedesktop/NetworkManager/Devices/15)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5546] manager: (storagemgmt3997): new Generic device (/org/freedesktop/NetworkManager/Devices/16)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5561] manager: (api3999): new Generic device (/org/freedesktop/NetworkManager/Devices/17)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5578] manager: (storage3998): new Generic device (/org/freedesktop/NetworkManager/Devices/18)
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5699] device (ext4000): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5707] device (storagemgmt3997): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5708] device (api3999): link connected
Feb 24 00:16:48 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924208.5709] device (storage3998): link connected
Feb 24 00:16:49 overcloud-controller-0.localdomain IPaddr2(ip-10.8.84.70)[30067]: ERROR: Unable to find nic or netmask.
...
Feb 24 00:16:49 overcloud-controller-0.localdomain lrmd[27617]:   notice: ip-10.8.84.70_monitor_10000:30043:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
Feb 24 00:16:49 overcloud-controller-0.localdomain crmd[27620]:   notice: overcloud-controller-0-ip-10.8.84.70_monitor_10000:90 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] A processor failed, forming new configuration.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] The network interface is down.
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.15}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.14}
Feb 24 00:16:49 overcloud-controller-0.localdomain corosync[27576]:  [TOTEM ] adding new UDPU member {172.17.0.16}
Feb 24 00:16:50 overcloud-controller-0.localdomain bash[29987]: ifup ivs port ext4000
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn>  [1487924210.6024] dhcp4 (em1): request timed out
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6024] dhcp4 (em1): state changed unknown -> timeout
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6130] dhcp4 (em1): canceled DHCP transaction, DHCP client pid 6779
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6130] dhcp4 (em1): state changed timeout -> done
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6133] device (em1): state change: ip-config -> failed (reason 'ip-config-unavailable') [70 120 5]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6135] manager: NetworkManager state is now DISCONNECTED
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <warn>  [1487924210.6138] device (em1): Activation: failed for connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6145] device (em1): state change: failed -> disconnected (reason 'none') [120 30 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6175] policy: auto-activating connection 'System em1'
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6186] device (em1): Activation: starting connection 'System em1' (1dad842d-1912-ef5a-a43a-bc238fb267e7)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6188] device (em1): state change: disconnected -> prepare (reason 'none') [30 40 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6189] manager: NetworkManager state is now CONNECTING
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6194] device (em1): state change: prepare -> config (reason 'none') [40 50 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6201] device (em1): state change: config -> ip-config (reason 'none') [50 70 0]
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6205] dhcp4 (em1): activation: beginning transaction (timeout in 45 seconds)
Feb 24 00:16:50 overcloud-controller-0.localdomain NetworkManager[628]: <info>  [1487924210.6267] dhcp4 (em1): dhclient started with pid 30228
Feb 24 00:16:50 overcloud-controller-0.localdomain dhclient[30228]: DHCPDISCOVER on em1 to 255.255.255.255 port 67 interval 3 (xid=0x763a0705)

Comment 19 Peter Lemenkov 2017-02-27 16:51:18 UTC
I bet something went wrong with networking.

I would advise you to upgrade resource-agents from your current resource-agents-3.9.5-82.el7.x86_64 to resource-agents-3.9.5-86.el7.x86_64, where we improved the recovery time. In the case mentioned above RabbitMQ will recover 1 minute faster, down from 5 to 4 minutes.

But this won't fix the networking instability I see in the logs. Still, 4 minutes is better than 5.
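
A hedged sketch of that upgrade, plus the checks suggested by the IPaddr2/corosync errors in the node-0 log above; the interface and resource names are taken from that log excerpt and may differ per environment:

# upgrade the resource agents on each controller and verify the build
yum update -y resource-agents
rpm -q resource-agents

# the "Unable to find nic or netmask" errors mean the interface carrying the VIP went away; check it directly
ip addr show em1
pcs status
pcs resource show ip-10.8.84.70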

Comment 22 Fabio Massimo Di Nitto 2017-03-21 14:19:16 UTC
Can somebody from bigswitch please verify the build mentioned in comment #19 and let us know ASAP if that helps?

Comment 25 errata-xmlrpc 2017-06-20 12:48:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1542

Comment 26 Red Hat Bugzilla 2023-09-14 03:53:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.